feat: resume interrupted dataset generation runs (sync + async engine)#526
przemekboruta wants to merge 14 commits into NVIDIA-NeMo:main
Conversation
- `ArtifactStorage` gains a `resume: bool = False` field
- `resolved_dataset_name` skips timestamp logic when `resume=True`, returning the existing dataset folder name as-is
- Raises `ArtifactStorageError` on `resume=True` when the target folder is absent or empty (no data to resume from)
- New `clear_partial_results()` removes in-flight partial results left over from an interrupted run

Fixes NVIDIA-NeMo#525
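A rough sketch of the storage behavior described above (the `tmp-partial-parquet-files` folder name comes from the sequence diagram below; the timestamp format, internal attribute names, and `resolved_dataset_name` being a method rather than a property are assumptions):

```python
import shutil
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path


class ArtifactStorageError(Exception):
    """Raised when resume=True but there is no existing data to resume from."""


@dataclass
class ArtifactStorage:
    # Minimal sketch; the real class carries more configuration.
    artifact_path: Path
    dataset_name: str
    resume: bool = False

    def resolved_dataset_name(self) -> str:
        target = self.artifact_path / self.dataset_name
        if self.resume:
            # Resume: reuse the existing folder name as-is, but only if it
            # actually contains data from the interrupted run.
            if not target.is_dir() or not any(target.iterdir()):
                raise ArtifactStorageError(
                    f"resume=True but '{target}' is missing or empty"
                )
            return self.dataset_name
        # Fresh run: disambiguate with a timestamp suffix (format assumed).
        return f"{self.dataset_name}-{datetime.now():%Y%m%d-%H%M%S}"

    def clear_partial_results(self) -> None:
        # Remove in-flight partial results left by an interrupted run;
        # a no-op when the folder does not exist.
        partial = self.artifact_path / self.dataset_name / "tmp-partial-parquet-files"
        if partial.is_dir():
            shutil.rmtree(partial)
```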
`DatasetBatchManager.start()` now accepts:

- `start_batch: int = 0` — first batch index to process
- `initial_actual_num_records: int = 0` — records already on disk

Both default to 0 so all existing call sites are unaffected.

Fixes NVIDIA-NeMo#525
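The seeding order matters here; a sketch of how `start()` can apply the resume values after `reset()` (internal attribute names are assumed, and the real signature also takes `num_records` and `buffer_size`, omitted for brevity):

```python
class DatasetBatchManager:
    """Sketch of the resume-aware start(); attribute names are assumed."""

    def __init__(self) -> None:
        self._current_batch_number = 0
        self._actual_num_records = 0

    def reset(self) -> None:
        self._current_batch_number = 0
        self._actual_num_records = 0

    def start(self, start_batch: int = 0, initial_actual_num_records: int = 0) -> None:
        # reset() runs first, then the resume values are applied on top, so a
        # plain start() with the defaults behaves exactly as before the change.
        self.reset()
        self._current_batch_number = start_batch
        self._actual_num_records = initial_actual_num_records
```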
- `build()` gains a `resume: bool = False` parameter
- `_load_resume_state()` reads `metadata.json` and validates that `num_records` and `buffer_size` match the original run
- `_build_with_resume()` skips completed batches, clears in-flight partial results, and continues from the first incomplete batch
- Raises `DatasetGenerationError` with clear messages for:
  - missing `metadata.json` (interrupted before first batch completes)
  - `num_records` mismatch
  - `buffer_size` mismatch
  - `DATA_DESIGNER_ASYNC_ENGINE=1` (not yet supported)
- Logs a warning and returns early when the dataset is already complete

Fixes NVIDIA-NeMo#525
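The validation step might look like this standalone sketch (written as a free function for illustration; the metadata keys and error messages are assumptions):

```python
import json
from pathlib import Path


class DatasetGenerationError(Exception):
    pass


def load_resume_state(dataset_dir: Path, num_records: int, buffer_size: int) -> dict:
    # Sketch of _load_resume_state(): read metadata.json and verify the
    # resumed run uses the same parameters as the original run.
    metadata_path = dataset_dir / "metadata.json"
    if not metadata_path.is_file():
        raise DatasetGenerationError(
            "Cannot resume: metadata.json not found (run was interrupted "
            "before the first batch completed)"
        )
    state = json.loads(metadata_path.read_text())
    if state["num_records"] != num_records:
        raise DatasetGenerationError(
            f"num_records mismatch: original run used {state['num_records']}, "
            f"resume requested {num_records}"
        )
    if state["buffer_size"] != buffer_size:
        raise DatasetGenerationError(
            f"buffer_size mismatch: original run used {state['buffer_size']}, "
            f"resume requested {buffer_size}"
        )
    return state
```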
- `create()` gains `resume: bool = False`
- `_create_resource_provider()` passes `resume` to `ArtifactStorage`
- `builder.build()` receives the `resume` flag

Fixes NVIDIA-NeMo#525
Covers:

- `ArtifactStorage.resolved_dataset_name` with `resume=True`
- `ArtifactStorage.clear_partial_results()`
- `DatasetBatchManager.start()` with `start_batch` and `initial_actual_num_records`
- `DatasetBuilder.build(resume=True)`: missing metadata, `num_records` mismatch, `buffer_size` mismatch, already-complete detection

Fixes NVIDIA-NeMo#525
Greptile Summary

This PR adds resume support for interrupted dataset generation runs (sync and async engines). The two previously-flagged P1 issues (processors re-running on an already-complete dataset, and stale metadata counts in the async resume path) have been addressed.
| Filename | Overview |
|---|---|
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py | Core resume logic: _build_with_resume (sync) and updated _build_async (async) both correctly gate run_after_generation behind the generated flag; async path correctly sources both counters from filesystem via _find_completed_row_group_ids; _load_resume_state correctly validates run-parameter compatibility. |
| packages/data-designer-engine/src/data_designer/engine/storage/artifact_storage.py | Adds resume field and clear_partial_results(); resolved_dataset_name correctly short-circuits timestamp logic on resume and raises ArtifactStorageError when no existing folder is found. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/dataset_batch_manager.py | Adds start_batch and initial_actual_num_records to start(); correctly applies them after reset() so subsequent state is seeded with resume values. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/row_group_buffer.py | Adds initial_actual_num_records and initial_total_num_batches constructor params to seed counters for resumed async runs; straightforward and correct. |
| packages/data-designer/src/data_designer/interface/data_designer.py | Threads resume through to ArtifactStorage and builder.build(); public API changes are minimal and non-breaking. |
| packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py | Comprehensive coverage of resume paths including crash-window scenario, num_records/buffer_size mismatch, already-complete detection, and processor-skip guard; module-level imports appear mid-file (E402) but logic is sound. |
| packages/data-designer-engine/tests/engine/dataset_builders/utils/test_dataset_batch_manager.py | New tests cover all combinations of start_batch and initial_actual_num_records; default-unchanged test guards regressions. |
| packages/data-designer-engine/tests/engine/storage/test_artifact_storage.py | Tests cover resume flag's three states (existing folder, missing folder, empty folder) and clear_partial_results happy/noop paths. |
Sequence Diagram
```mermaid
sequenceDiagram
    participant U as User
    participant DD as DataDesigner.create()
    participant AS as ArtifactStorage
    participant DB as DatasetBuilder.build()
    participant BM as DatasetBatchManager
    participant FS as Filesystem
    U->>DD: create(resume=True)
    DD->>AS: ArtifactStorage(resume=True)
    AS->>FS: check artifact_path/dataset_name exists?
    FS-->>AS: exists (resume) or raise ArtifactStorageError
    AS-->>DD: storage with resolved_dataset_name
    DD->>DB: build(resume=True)
    DB->>AS: clear_partial_results()
    AS->>FS: rmtree(tmp-partial-parquet-files/) if exists
    alt Sync path
        DB->>DB: _load_resume_state() → read metadata.json
        DB->>BM: start(num_records, buffer_size, start_batch=N, initial_actual_num_records=M)
        BM->>BM: reset() then set _current_batch_number=N, _actual_num_records=M
        loop batches N..total
            DB->>DB: _run_batch(batch_idx)
            DB->>BM: finish_batch()
            BM->>AS: write_metadata(num_completed_batches=batch_idx+1)
        end
        DB->>BM: finish()
        DB-->>DD: generated=True → run_after_generation()
    else Async path
        DB->>FS: glob(parquet-files/batch_*.parquet)
        FS-->>DB: completed_row_group_ids (filesystem truth)
        DB->>DB: compute initial_actual_num_records from completed ids
        DB->>BM: start(num_records, buffer_size, start_batch=0)
        loop row groups (skipping completed)
            DB->>BM: process row group
            DB->>AS: finalize_row_group → write_metadata (incremental)
        end
        DB-->>DD: generated=True → run_after_generation()
    end
    alt Already complete
        DB-->>DD: generated=False → skip run_after_generation()
    end
```
Reviews (8): Last reviewed commit: "fix(builder): derive initial_actual_num_..."
(Resolved review thread on packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py, marked outdated)
…INE=1)

- Add `_find_completed_row_group_ids()` to scan `parquet-files/` for already-written row groups by parsing `batch_*.parquet` filenames
- `_build_async()` now accepts `resume=True`: loads metadata, finds completed row groups, clears partial results, and logs progress; returns early if all row groups are done
- `_prepare_async_run()` accepts `skip_row_groups`, `initial_actual_num_records`, and `initial_total_num_batches` so the scheduler only processes remaining row groups and `RowGroupBufferManager` starts from the correct counts
- `RowGroupBufferManager.__init__` gains `initial_actual_num_records` and `initial_total_num_batches` params to seed the counters on resume
- `finalize_row_group` closure now writes incremental metadata after each checkpoint so any run (resume or not) can be resumed if interrupted mid-way
- Remove the guard that rejected `resume=True` with `DATA_DESIGNER_ASYNC_ENGINE=1`
- Add tests for all new paths
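The filename-scanning step can be sketched as follows; the filesystem is treated as the source of truth, and the exact `batch_<id>.parquet` pattern is inferred from the commit message:

```python
import re
from pathlib import Path


def find_completed_row_group_ids(parquet_dir: Path) -> set[int]:
    # Sketch of _find_completed_row_group_ids(): recover the ids of
    # already-written row groups from batch_*.parquet filenames, ignoring
    # anything that does not match the pattern exactly.
    pattern = re.compile(r"batch_(\d+)\.parquet")
    ids: set[int] = set()
    for path in parquet_dir.glob("batch_*.parquet"):
        match = pattern.fullmatch(path.name)
        if match:
            ids.add(int(match.group(1)))
    return ids
```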
…set already complete

`_build_with_resume` and `_build_async` now return `False` when the dataset is already complete (early-return path), `True` otherwise. `build()` skips `_processor_runner.run_after_generation()` on `False`, preventing processors from calling `shutil.rmtree` and rewriting an already-finalized dataset. Fixes the issue raised in review: Greptile P1 comment on PR NVIDIA-NeMo#526.
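A minimal sketch of the guard; the function and parameter names are invented for illustration, and in the PR the flag is the return value of `_build_with_resume`/`_build_async`:

```python
from typing import Callable


def run_build_finalization(generated: bool, run_after_generation: Callable[[], None]) -> bool:
    # Only invoke the processor runner when something was actually generated;
    # the already-complete early-return path reports generated=False, so
    # post-generation processors never touch a finalized dataset.
    if generated:
        run_after_generation()
    return generated
```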
…sync resume

Metadata can lag by one row group if a crash occurs between `move_partial_result_to_final_file_path` and `write_metadata`. Using `len(completed_ids)` from the filesystem scan instead of `state.num_completed_batches` ensures the final metadata reflects the actual number of parquet files present, not the potentially stale metadata count.
Issue #525 has been triaged. The linked issue check is being re-evaluated.
…efore first batch)

When a run is interrupted before any row group or batch completes, `metadata.json` is never written. Previously `resume=True` would raise `DatasetGenerationError` in this case. Now `build()` detects the missing file, logs an info message, clears any leftover partial results, and falls back to a clean fresh run. This is the common scenario for small datasets (fewer records than `buffer_size`), where all records fit in a single row group.
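The detection-and-fallback decision can be sketched as below; the helper name is invented, and in the PR this logic lives inside `build()`:

```python
import logging
from pathlib import Path

logger = logging.getLogger("resume")


def can_resume_from_metadata(dataset_dir: Path) -> bool:
    # Sketch of the fallback: when metadata.json was never written (the run
    # died before any batch completed), resume degrades to a clean fresh run
    # instead of raising DatasetGenerationError.
    if (dataset_dir / "metadata.json").is_file():
        return True
    logger.info("metadata.json not found; clearing partial results and starting fresh")
    return False
```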
(Resolved review thread on packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py, marked outdated)
…ync resume

In the crash window (row group written to disk but `write_metadata` crashed before updating the file), both `initial_total_num_batches` and `initial_actual_num_records` now use the filesystem-discovered `completed_ids` as source of truth. Previously `initial_actual_num_records` was read from potentially stale metadata, causing `actual_num_records` in the final metadata to be undercounted by one row group. Also adds a test covering the partial-resume crash-window scenario.
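A sketch of the counter derivation, assuming fixed-size row groups for simplicity (real row groups may vary in size, and the helper name is hypothetical):

```python
def derive_resume_counters(
    completed_ids: set[int], records_per_row_group: int
) -> tuple[int, int]:
    # Sketch of the crash-window fix: both counters come from the filesystem
    # scan, not from metadata.json, which can lag behind by one row group if
    # the crash landed after the parquet move but before write_metadata.
    initial_total_num_batches = len(completed_ids)
    initial_actual_num_records = len(completed_ids) * records_per_row_group
    return initial_total_num_batches, initial_actual_num_records
```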
Summary
Closes #525
Adds `resume: bool = False` to `DataDesigner.create()` and `DatasetBuilder.build()`. When `resume=True`, generation picks up from where the interrupted run left off — for both the sync and async engines.

Changes
- `ArtifactStorage`: `resume: bool = False` field; `resolved_dataset_name` skips timestamp logic on resume; new `clear_partial_results()`
- `DatasetBatchManager.start()`: `start_batch` and `initial_actual_num_records` params (default 0, no breakage)
- `DatasetBuilder.build()`: `resume` param; `_load_resume_state()` reads and validates `metadata.json`; `_build_with_resume()` skips completed batches (sync); `_build_async()` skips completed row groups (async)
- `RowGroupBufferManager.__init__()`: `initial_actual_num_records` and `initial_total_num_batches` params to seed counters on resume
- `DatasetBuilder._find_completed_row_group_ids()`: scans `parquet-files/` for `batch_*.parquet` to determine which async row groups are already done
- `finalize_row_group` closure: writes `metadata.json` after every row-group checkpoint (not just at the end), making all async runs resumable if interrupted
- `DataDesigner.create()`: accepts `resume`, passes it through to `ArtifactStorage` and `builder.build()`

Validation and error cases
- Missing `metadata.json` → `DatasetGenerationError` (interrupted before any batch completed)
- `num_records` mismatch → `DatasetGenerationError`
- `buffer_size` mismatch → `DatasetGenerationError`

Test plan
- `test_resolved_dataset_name_resume_uses_existing_folder`
- `test_resolved_dataset_name_resume_raises_when_no_existing_folder`
- `test_resolved_dataset_name_resume_raises_when_folder_is_empty`
- `test_clear_partial_results_removes_partial_folder`
- `test_clear_partial_results_is_noop_when_no_partial_folder`
- `test_start_with_start_batch`
- `test_start_with_initial_actual_num_records`
- `test_start_with_start_batch_and_initial_actual_num_records`
- `test_start_default_values_unchanged`
- `test_build_resume_raises_without_metadata`
- `test_build_resume_raises_on_num_records_mismatch`
- `test_build_resume_raises_on_buffer_size_mismatch`
- `test_build_resume_logs_warning_when_already_complete`
- `test_find_completed_row_group_ids_empty_dir`
- `test_find_completed_row_group_ids_with_files`
- `test_find_completed_row_group_ids_ignores_non_batch_files`
- `test_build_async_resume_logs_warning_when_already_complete`
- `test_build_async_resume_raises_without_metadata`
- `test_initial_actual_num_records`
- `test_initial_total_num_batches_reflected_in_metadata`