refactor: unify duplicate DAG construction (dag.py + ExecutionGraph) #511

Open
przemekboruta wants to merge 3 commits into NVIDIA-NeMo:main from przemekboruta:refactor/unify-dag-construction

Conversation

@przemekboruta
Contributor

Summary

Closes #510

  • Deletes dag.py and moves topologically_sort_column_configs into execution_graph.py as a module-level function
  • Replaces networkx.topological_sort with an inline Kahn's algorithm, consistent with ExecutionGraph.get_topological_order
  • Side-effect resolution is now an O(1) lookup via a side_effect_map dict; the previous implementation did a linear scan over sum(side_effect_dict.values(), []), which was O(n²)
  • Updates imports in config_compiler.py and test_dag.py
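
The Kahn's algorithm step mentioned in the summary can be sketched roughly as follows. This is a minimal illustration, not the code in execution_graph.py: `kahn_topological_sort`, the `upstream` mapping, and `CircularDependencyError` are hypothetical names chosen for the example (the last one stands in for the project's DAGCircularDependencyError).

```python
from collections import deque


class CircularDependencyError(ValueError):
    """Hypothetical stand-in for the project's DAGCircularDependencyError."""


def kahn_topological_sort(upstream: dict[str, set[str]]) -> list[str]:
    """Order nodes so that each node comes after everything it depends on.

    `upstream` maps a node to the set of nodes it depends on; every
    dependency is assumed to be a key of `upstream` as well.
    """
    in_degree = {node: len(deps) for node, deps in upstream.items()}

    # Invert the mapping so we can walk from a node to its dependents.
    downstream: dict[str, list[str]] = {node: [] for node in upstream}
    for node, deps in upstream.items():
        for dep in deps:
            downstream[dep].append(node)

    # Start from nodes with no unmet dependencies.
    queue = deque(node for node, degree in in_degree.items() if degree == 0)
    ordered: list[str] = []
    while queue:
        node = queue.popleft()
        ordered.append(node)
        for child in downstream[node]:
            in_degree[child] -= 1
            if in_degree[child] == 0:
                queue.append(child)

    # Nodes left unvisited can only be part of (or behind) a cycle.
    if len(ordered) != len(upstream):
        stuck = sorted(set(upstream) - set(ordered))
        raise CircularDependencyError("cycle detected among: " + ", ".join(stuck))
    return ordered


order = kahn_topological_sort({"a": set(), "b": {"a"}, "c": {"a", "b"}})
```

When several nodes reach in-degree 0 at the same time, their relative order depends on how the inputs are iterated, which is why callers should rely only on the "dependencies first" guarantee.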

Design note

The function is intentionally a module-level function, not a @classmethod on ExecutionGraph. ExecutionGraph is an execution abstraction (requires strategies, manages task scheduling); this function is a compilation step that works on raw ColumnConfigT without strategies. Mixing the two responsibilities would require either dummy strategies or a significant signature change to add_column.

Test plan

  • Existing test_dag.py tests (test_dag_construction, test_circular_dependencies) cover the migrated function with updated import path
  • ruff check passes on all modified files

@przemekboruta przemekboruta requested a review from a team as a code owner April 8, 2026 16:47
@greptile-apps
Contributor

greptile-apps bot commented Apr 8, 2026

Greptile Summary

This PR eliminates the duplicate dag.py module by merging topologically_sort_column_configs into execution_graph.py as a module-level function, replacing the networkx-based implementation with the same inline Kahn's algorithm already used by ExecutionGraph.get_topological_order. The side-effect resolution is improved from an O(n²) linear scan to O(1) via a prebuilt side_effect_map, and the networkx import is dropped entirely for this path.
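
To illustrate the lookup change described above, here is a small sketch of a prebuilt reverse index. The column names and the `side_effects_by_producer` / `resolve_producer` helpers are assumptions made up for this example; only the idea of a `side_effect_map` dict comes from the PR.

```python
# Hypothetical data: producer column name -> side-effect columns it emits.
side_effects_by_producer = {
    "answer": ["answer__reasoning_trace"],
    "code": ["code__stdout", "code__stderr"],
}

# Build the reverse index once: side-effect column -> producer column.
side_effect_map = {
    side_effect: producer
    for producer, side_effects in side_effects_by_producer.items()
    for side_effect in side_effects
}


def resolve_producer(required: str) -> str:
    """Return the column that must run so that `required` is available.

    A side-effect column resolves to its producer with one dict lookup;
    any other name resolves to itself.
    """
    return side_effect_map.get(required, required)
```

Building the map costs one pass over all side-effect columns; each later resolution is a single average-case O(1) dict access instead of a scan over a flattened list.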

Confidence Score: 5/5

Safe to merge — clean refactoring with no functional regressions.

All findings are P2 or lower. The Kahn's algorithm implementation is correct, side-effect resolution is equivalent and faster, the intentional omission of skip.columns edges is properly documented and harmless (ExecutionGraph.create is the authoritative graph), and the test relaxation accurately reflects the non-deterministic tie-breaking of the queue-based algorithm.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/execution_graph.py | Adds module-level topologically_sort_column_configs using Kahn's algorithm; side-effect resolution upgraded to O(1) via side_effect_map; skip.columns edges intentionally omitted with clear justification. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/dag.py | Deleted; functionality fully migrated to execution_graph.py. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/config_compiler.py | Single import path update from dag to execution_graph; no logic changes. |
| packages/data-designer-engine/tests/engine/dataset_builders/utils/test_dag.py | Import updated; ordering assertion correctly relaxed to an unordered set for the two topologically-equivalent positions, matching Kahn's non-deterministic tie-breaking. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["compile_dataset_builder_column_configs"] -->|calls| B["topologically_sort_column_configs\nexecution_graph.py"]
    B --> C{dag_col_dict empty?}
    C -->|yes| D["return non_dag_cols"]
    C -->|no| E["Build side_effect_map O(1)"]
    E --> F["Build upstream/downstream sets\nrequired_columns only"]
    F --> G["Kahn BFS algorithm"]
    G --> H{cycle detected?}
    H -->|yes| I["raise DAGCircularDependencyError"]
    H -->|no| J["return non_dag_cols + ordered"]
    J --> K["ExecutionGraph.create\nauthoritative graph + skip.columns"]

Reviews (5): Last reviewed commit: "docs(engine): document intentional skip...."

@przemekboruta przemekboruta force-pushed the refactor/unify-dag-construction branch 2 times, most recently from b25ebf6 to 775f147 Compare April 8, 2026 16:52
@nabinchha
Contributor

@przemekboruta there's a larger PR in flight that touches these DAG abstractions. Let's wait until that one merges, then rebase and merge this one!

przemekboruta and others added 2 commits April 15, 2026 21:37
…ution_graph.py

Eliminates dag.py and its networkx dependency by moving
topologically_sort_column_configs into execution_graph.py as a
module-level function. Side-effect resolution is now O(1) via a
side_effect_map dict (previously O(n²) linear scan). Kahn's algorithm
is reused in-place rather than leaning on networkx.topological_sort.

Closes NVIDIA-NeMo#510

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

test_judge and test_code_and_depends_on_validation_reasoning_traces have
no mutual dependency and reach in-degree 0 simultaneously in Kahn's
algorithm. Set iteration order varies with PYTHONHASHSEED, making the
strict list assertion flaky. Assert only the topological invariants.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
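
The "assert only the topological invariants" approach from the commit message above might look something like this. It is a sketch under assumptions: `assert_topological_order` and the column names in the usage note are hypothetical and are not taken from test_dag.py.

```python
def assert_topological_order(order: list[str], upstream: dict[str, set[str]]) -> None:
    """Check that `order` is a valid topological order for `upstream`.

    `upstream` maps each node to the set of nodes it depends on. The check
    accepts every valid order, so it does not depend on tie-breaking.
    """
    position = {name: index for index, name in enumerate(order)}
    assert len(position) == len(order), "order contains duplicates"
    assert set(order) == set(upstream), "order must contain exactly the graph's nodes"
    for node, deps in upstream.items():
        for dep in deps:
            assert position[dep] < position[node], f"{dep} must precede {node}"


# Hypothetical graph: "judge" and "validation" both depend only on "code",
# so either may come first.
example_upstream = {"code": set(), "judge": {"code"}, "validation": {"code"}}
assert_topological_order(["code", "judge", "validation"], example_upstream)
assert_topological_order(["code", "validation", "judge"], example_upstream)
```

A strict list comparison would accept only one of the two orders above, which is what makes such a test sensitive to set iteration order.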
@przemekboruta przemekboruta force-pushed the refactor/unify-dag-construction branch from 520fbf2 to eddf3a3 Compare April 15, 2026 19:37
@github-actions
Contributor

Linked Issue Check

Issue #510 has not been triaged yet. A maintainer needs to review
the issue and add the triaged label before this PR can be merged.

You can continue working on the PR in the meantime. The check will
re-run automatically once the issue is triaged.

…ally_sort_column_configs

ExecutionGraph.create handles skip.when ordering edges in its own
two-pass build; the pre-sort function only needs required_columns
to produce a valid ColumnConfigT ordering for config compilation.
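
As a rough illustration of the commit message above, a pre-sort that builds edges from required_columns alone could be sketched like this. `ColumnSketch`, its `skip_columns` field, and `build_upstream` are hypothetical stand-ins invented for the example; they are not the real ColumnConfigT or the code in execution_graph.py.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnSketch:
    """Hypothetical stand-in for a column config; not the real ColumnConfigT."""

    name: str
    required_columns: list[str] = field(default_factory=list)
    # Assumed skip-related references; deliberately ignored by the pre-sort.
    skip_columns: list[str] = field(default_factory=list)


def build_upstream(columns: list[ColumnSketch]) -> dict[str, set[str]]:
    """Map each column to the columns in this group that it requires.

    Only required_columns contribute edges. References to names outside
    the group (for example, columns generated elsewhere) are dropped.
    """
    known = {column.name for column in columns}
    return {
        column.name: {req for req in column.required_columns if req in known}
        for column in columns
    }


columns = [
    ColumnSketch("topic"),
    ColumnSketch("question", required_columns=["topic", "seed_col"]),
    ColumnSketch("answer", required_columns=["question"], skip_columns=["topic"]),
]
upstream = build_upstream(columns)
```

In this sketch the `answer` column gets no edge from `topic`, mirroring the idea that skip-related ordering is left to the later graph build.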
