feat: resume interrupted dataset generation runs (sync + async engine)#526
przemekboruta wants to merge 14 commits into NVIDIA-NeMo:main
Conversation
- `ArtifactStorage` gains a `resume: bool = False` field
- `resolved_dataset_name` skips timestamp logic when `resume=True`, returning the existing dataset folder name as-is
- Raises `ArtifactStorageError` on `resume=True` when the target folder is absent or empty (no data to resume from)
- New `clear_partial_results()` removes in-flight partial results left over from an interrupted run

Fixes NVIDIA-NeMo#525
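A rough sketch of the storage behavior described above (the `tmp-partial-parquet-files` folder name comes from the sequence diagram below; the timestamp format, internal attribute names, and `resolved_dataset_name` being a method rather than a property are assumptions):

```python
import shutil
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path


class ArtifactStorageError(Exception):
    """Raised when resume=True but there is no existing data to resume from."""


@dataclass
class ArtifactStorage:
    # Minimal sketch; the real class carries more configuration.
    artifact_path: Path
    dataset_name: str
    resume: bool = False

    def resolved_dataset_name(self) -> str:
        target = self.artifact_path / self.dataset_name
        if self.resume:
            # Resume: reuse the existing folder name as-is, but only if it
            # actually contains data from the interrupted run.
            if not target.is_dir() or not any(target.iterdir()):
                raise ArtifactStorageError(
                    f"resume=True but '{target}' is missing or empty"
                )
            return self.dataset_name
        # Fresh run: disambiguate with a timestamp suffix (format assumed).
        return f"{self.dataset_name}-{datetime.now():%Y%m%d-%H%M%S}"

    def clear_partial_results(self) -> None:
        # Remove in-flight partial results left by an interrupted run;
        # a no-op when the folder does not exist.
        partial = self.artifact_path / self.dataset_name / "tmp-partial-parquet-files"
        if partial.is_dir():
            shutil.rmtree(partial)
```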
`DatasetBatchManager.start()` now accepts:

- `start_batch: int = 0` — first batch index to process
- `initial_actual_num_records: int = 0` — records already on disk

Both default to 0 so all existing call sites are unaffected.

Fixes NVIDIA-NeMo#525
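The seeding order matters here; a sketch of how `start()` can apply the resume values after `reset()` (internal attribute names are assumed, and the real signature also takes `num_records` and `buffer_size`, omitted for brevity):

```python
class DatasetBatchManager:
    """Sketch of the resume-aware start(); attribute names are assumed."""

    def __init__(self) -> None:
        self._current_batch_number = 0
        self._actual_num_records = 0

    def reset(self) -> None:
        self._current_batch_number = 0
        self._actual_num_records = 0

    def start(self, start_batch: int = 0, initial_actual_num_records: int = 0) -> None:
        # reset() runs first, then the resume values are applied on top, so a
        # plain start() with the defaults behaves exactly as before the change.
        self.reset()
        self._current_batch_number = start_batch
        self._actual_num_records = initial_actual_num_records
```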
- `build()` gains a `resume: bool = False` parameter
- `_load_resume_state()` reads `metadata.json` and validates that `num_records` and `buffer_size` match the original run
- `_build_with_resume()` skips completed batches, clears in-flight partial results, and continues from the first incomplete batch
- Raises `DatasetGenerationError` with clear messages for:
  - missing `metadata.json` (interrupted before first batch completes)
  - `num_records` mismatch
  - `buffer_size` mismatch
  - `DATA_DESIGNER_ASYNC_ENGINE=1` (not yet supported)
- Logs a warning and returns early when the dataset is already complete

Fixes NVIDIA-NeMo#525
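The validation step might look like this standalone sketch (written as a free function for illustration; the metadata keys and error messages are assumptions):

```python
import json
from pathlib import Path


class DatasetGenerationError(Exception):
    pass


def load_resume_state(dataset_dir: Path, num_records: int, buffer_size: int) -> dict:
    # Sketch of _load_resume_state(): read metadata.json and verify the
    # resumed run uses the same parameters as the original run.
    metadata_path = dataset_dir / "metadata.json"
    if not metadata_path.is_file():
        raise DatasetGenerationError(
            "Cannot resume: metadata.json not found (run was interrupted "
            "before the first batch completed)"
        )
    state = json.loads(metadata_path.read_text())
    if state["num_records"] != num_records:
        raise DatasetGenerationError(
            f"num_records mismatch: original run used {state['num_records']}, "
            f"resume requested {num_records}"
        )
    if state["buffer_size"] != buffer_size:
        raise DatasetGenerationError(
            f"buffer_size mismatch: original run used {state['buffer_size']}, "
            f"resume requested {buffer_size}"
        )
    return state
```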
- `create()` gains `resume: bool = False`
- `_create_resource_provider()` passes `resume` to `ArtifactStorage`
- `builder.build()` receives the `resume` flag

Fixes NVIDIA-NeMo#525
Covers:

- `ArtifactStorage.resolved_dataset_name` with `resume=True`
- `ArtifactStorage.clear_partial_results()`
- `DatasetBatchManager.start()` with `start_batch` and `initial_actual_num_records`
- `DatasetBuilder.build(resume=True)`: missing metadata, `num_records` mismatch, `buffer_size` mismatch, already-complete detection

Fixes NVIDIA-NeMo#525
Greptile Summary

This PR adds resume support for interrupted dataset generation runs (sync and async engines). The two previously-flagged P1 issues (processors re-running on an already-complete dataset, and stale metadata counts in the async resume path) have been addressed.
| Filename | Overview |
|---|---|
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py | Core resume logic: _build_with_resume (sync) and updated _build_async (async) both correctly gate run_after_generation behind the generated flag; async path correctly sources both counters from filesystem via _find_completed_row_group_ids; _load_resume_state correctly validates run-parameter compatibility. |
| packages/data-designer-engine/src/data_designer/engine/storage/artifact_storage.py | Adds resume field and clear_partial_results(); resolved_dataset_name correctly short-circuits timestamp logic on resume and raises ArtifactStorageError when no existing folder is found. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/dataset_batch_manager.py | Adds start_batch and initial_actual_num_records to start(); correctly applies them after reset() so subsequent state is seeded with resume values. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/row_group_buffer.py | Adds initial_actual_num_records and initial_total_num_batches constructor params to seed counters for resumed async runs; straightforward and correct. |
| packages/data-designer/src/data_designer/interface/data_designer.py | Threads resume through to ArtifactStorage and builder.build(); public API changes are minimal and non-breaking. |
| packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py | Comprehensive coverage of resume paths including crash-window scenario, num_records/buffer_size mismatch, already-complete detection, and processor-skip guard; module-level imports appear mid-file (E402) but logic is sound. |
| packages/data-designer-engine/tests/engine/dataset_builders/utils/test_dataset_batch_manager.py | New tests cover all combinations of start_batch and initial_actual_num_records; default-unchanged test guards regressions. |
| packages/data-designer-engine/tests/engine/storage/test_artifact_storage.py | Tests cover resume flag's three states (existing folder, missing folder, empty folder) and clear_partial_results happy/noop paths. |
Sequence Diagram
```mermaid
sequenceDiagram
    participant U as User
    participant DD as DataDesigner.create()
    participant AS as ArtifactStorage
    participant DB as DatasetBuilder.build()
    participant BM as DatasetBatchManager
    participant FS as Filesystem
    U->>DD: create(resume=True)
    DD->>AS: ArtifactStorage(resume=True)
    AS->>FS: check artifact_path/dataset_name exists?
    FS-->>AS: exists (resume) or raise ArtifactStorageError
    AS-->>DD: storage with resolved_dataset_name
    DD->>DB: build(resume=True)
    DB->>AS: clear_partial_results()
    AS->>FS: rmtree(tmp-partial-parquet-files/) if exists
    alt Sync path
        DB->>DB: _load_resume_state() → read metadata.json
        DB->>BM: start(num_records, buffer_size, start_batch=N, initial_actual_num_records=M)
        BM->>BM: reset() then set _current_batch_number=N, _actual_num_records=M
        loop batches N..total
            DB->>DB: _run_batch(batch_idx)
            DB->>BM: finish_batch()
            BM->>AS: write_metadata(num_completed_batches=batch_idx+1)
        end
        DB->>BM: finish()
        DB-->>DD: generated=True → run_after_generation()
    else Async path
        DB->>FS: glob(parquet-files/batch_*.parquet)
        FS-->>DB: completed_row_group_ids (filesystem truth)
        DB->>DB: compute initial_actual_num_records from completed ids
        DB->>BM: start(num_records, buffer_size, start_batch=0)
        loop row groups (skipping completed)
            DB->>BM: process row group
            DB->>AS: finalize_row_group → write_metadata (incremental)
        end
        DB-->>DD: generated=True → run_after_generation()
    end
    alt Already complete
        DB-->>DD: generated=False → skip run_after_generation()
    end
```
Reviews (8): Last reviewed commit: "fix(builder): derive initial_actual_num_..."
(Resolved review thread on packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py, marked outdated)
…INE=1)

- Add `_find_completed_row_group_ids()` to scan `parquet-files/` for already-written row groups by parsing `batch_*.parquet` filenames
- `_build_async()` now accepts `resume=True`: loads metadata, finds completed row groups, clears partial results, and logs progress; returns early if all row groups are done
- `_prepare_async_run()` accepts `skip_row_groups`, `initial_actual_num_records`, and `initial_total_num_batches` so the scheduler only processes remaining row groups and `RowGroupBufferManager` starts from the correct counts
- `RowGroupBufferManager.__init__` gains `initial_actual_num_records` and `initial_total_num_batches` params to seed the counters on resume
- `finalize_row_group` closure now writes incremental metadata after each checkpoint so any run (resume or not) can be resumed if interrupted mid-way
- Remove the guard that rejected `resume=True` with `DATA_DESIGNER_ASYNC_ENGINE=1`
- Add tests for all new paths
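The filename-scanning step can be sketched as follows; the filesystem is treated as the source of truth, and the exact `batch_<id>.parquet` pattern is inferred from the commit message:

```python
import re
from pathlib import Path


def find_completed_row_group_ids(parquet_dir: Path) -> set[int]:
    # Sketch of _find_completed_row_group_ids(): recover the ids of
    # already-written row groups from batch_*.parquet filenames, ignoring
    # anything that does not match the pattern exactly.
    pattern = re.compile(r"batch_(\d+)\.parquet")
    ids: set[int] = set()
    for path in parquet_dir.glob("batch_*.parquet"):
        match = pattern.fullmatch(path.name)
        if match:
            ids.add(int(match.group(1)))
    return ids
```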
…set already complete

`_build_with_resume` and `_build_async` now return `False` when the dataset is already complete (early-return path), `True` otherwise. `build()` skips `_processor_runner.run_after_generation()` on `False`, preventing processors from calling `shutil.rmtree` and rewriting an already-finalized dataset. Fixes the issue raised in review: Greptile P1 comment on PR NVIDIA-NeMo#526.
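A minimal sketch of the guard; the function and parameter names are invented for illustration, and in the PR the flag is the return value of `_build_with_resume`/`_build_async`:

```python
from typing import Callable


def run_build_finalization(generated: bool, run_after_generation: Callable[[], None]) -> bool:
    # Only invoke the processor runner when something was actually generated;
    # the already-complete early-return path reports generated=False, so
    # post-generation processors never touch a finalized dataset.
    if generated:
        run_after_generation()
    return generated
```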
…sync resume

Metadata can lag by one row group if a crash occurs between `move_partial_result_to_final_file_path` and `write_metadata`. Using `len(completed_ids)` from the filesystem scan instead of `state.num_completed_batches` ensures the final metadata reflects the actual number of parquet files present, not the potentially stale metadata count.
Issue #525 has been triaged. The linked issue check is being re-evaluated.
…efore first batch)

When a run is interrupted before any row group or batch completes, `metadata.json` is never written. Previously `resume=True` would raise `DatasetGenerationError` in this case. Now `build()` detects the missing file, logs an info message, clears any leftover partial results, and falls back to a clean fresh run. This is the common scenario for small datasets (fewer records than `buffer_size`), where all records fit in a single row group.
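The detection-and-fallback decision can be sketched as below; the helper name is invented, and in the PR this logic lives inside `build()`:

```python
import logging
from pathlib import Path

logger = logging.getLogger("resume")


def can_resume_from_metadata(dataset_dir: Path) -> bool:
    # Sketch of the fallback: when metadata.json was never written (the run
    # died before any batch completed), resume degrades to a clean fresh run
    # instead of raising DatasetGenerationError.
    if (dataset_dir / "metadata.json").is_file():
        return True
    logger.info("metadata.json not found; clearing partial results and starting fresh")
    return False
```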
(Resolved review thread on packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py, marked outdated)
…ync resume

In the crash window (row group written to disk but `write_metadata` crashed before updating the file), both `initial_total_num_batches` and `initial_actual_num_records` now use the filesystem-discovered `completed_ids` as source of truth. Previously `initial_actual_num_records` was read from potentially stale metadata, causing `actual_num_records` in the final metadata to be undercounted by one row group. Also adds a test covering the partial-resume crash-window scenario.
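A sketch of the counter derivation, assuming fixed-size row groups for simplicity (real row groups may vary in size, and the helper name is hypothetical):

```python
def derive_resume_counters(
    completed_ids: set[int], records_per_row_group: int
) -> tuple[int, int]:
    # Sketch of the crash-window fix: both counters come from the filesystem
    # scan, not from metadata.json, which can lag behind by one row group if
    # the crash landed after the parquet move but before write_metadata.
    initial_total_num_batches = len(completed_ids)
    initial_actual_num_records = len(completed_ids) * records_per_row_group
    return initial_total_num_batches, initial_actual_num_records
```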
Summary
Closes #525
Adds `resume: bool = False` to `DataDesigner.create()` and `DatasetBuilder.build()`. When `resume=True`, generation picks up from where the interrupted run left off — for both the sync and async engines.

Changes
- `ArtifactStorage`: `resume: bool = False` field; `resolved_dataset_name` skips timestamp logic on resume; new `clear_partial_results()`
- `DatasetBatchManager.start()`: `start_batch` and `initial_actual_num_records` params (default 0, no breakage)
- `DatasetBuilder.build()`: `resume` param; `_load_resume_state()` reads and validates `metadata.json`; `_build_with_resume()` skips completed batches (sync); `_build_async()` skips completed row groups (async)
- `RowGroupBufferManager.__init__()`: `initial_actual_num_records` and `initial_total_num_batches` params to seed counters on resume
- `DatasetBuilder._find_completed_row_group_ids()`: scans `parquet-files/` for `batch_*.parquet` to determine which async row groups are already done
- `finalize_row_group` closure: writes `metadata.json` after every row-group checkpoint (not just at the end), making all async runs resumable if interrupted
- `DataDesigner.create()`: accepts `resume`, passes it through to `ArtifactStorage` and `builder.build()`

Validation and error cases
- Missing `metadata.json` → `DatasetGenerationError` (interrupted before any batch completed)
- `num_records` mismatch → `DatasetGenerationError`
- `buffer_size` mismatch → `DatasetGenerationError`

Test plan
- `test_resolved_dataset_name_resume_uses_existing_folder`
- `test_resolved_dataset_name_resume_raises_when_no_existing_folder`
- `test_resolved_dataset_name_resume_raises_when_folder_is_empty`
- `test_clear_partial_results_removes_partial_folder`
- `test_clear_partial_results_is_noop_when_no_partial_folder`
- `test_start_with_start_batch`
- `test_start_with_initial_actual_num_records`
- `test_start_with_start_batch_and_initial_actual_num_records`
- `test_start_default_values_unchanged`
- `test_build_resume_raises_without_metadata`
- `test_build_resume_raises_on_num_records_mismatch`
- `test_build_resume_raises_on_buffer_size_mismatch`
- `test_build_resume_logs_warning_when_already_complete`
- `test_find_completed_row_group_ids_empty_dir`
- `test_find_completed_row_group_ids_with_files`
- `test_find_completed_row_group_ids_ignores_non_batch_files`
- `test_build_async_resume_logs_warning_when_already_complete`
- `test_build_async_resume_raises_without_metadata`
- `test_initial_actual_num_records`
- `test_initial_total_num_batches_reflected_in_metadata`