refactor: unify duplicate DAG construction (dag.py + ExecutionGraph)#511
przemekboruta wants to merge 3 commits into NVIDIA-NeMo:main from
Greptile Summary
This PR eliminates the duplicate
| Filename | Overview |
|---|---|
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/execution_graph.py | Adds module-level topologically_sort_column_configs using Kahn's algorithm; side-effect resolution upgraded to O(1) via side_effect_map; skip.columns edges intentionally omitted with clear justification. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/dag.py | Deleted — functionality fully migrated to execution_graph.py. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/config_compiler.py | Single import path update from dag to execution_graph; no logic changes. |
| packages/data-designer-engine/tests/engine/dataset_builders/utils/test_dag.py | Import updated; ordering assertion correctly relaxed to an unordered set for the two topologically-equivalent positions, matching Kahn's non-deterministic tie-breaking. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["compile_dataset_builder_column_configs"] -->|calls| B["topologically_sort_column_configs\nexecution_graph.py"]
B --> C{dag_col_dict empty?}
C -->|yes| D["return non_dag_cols"]
C -->|no| E["Build side_effect_map O(1)"]
E --> F["Build upstream/downstream sets\nrequired_columns only"]
F --> G["Kahn BFS algorithm"]
G --> H{cycle detected?}
H -->|yes| I["raise DAGCircularDependencyError"]
H -->|no| J["return non_dag_cols + ordered"]
J --> K["ExecutionGraph.create\nauthoritative graph + skip.columns"]
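The flow above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code from the PR: the `Column` dataclass, its `required_columns` and `side_effect_columns` fields, and `CircularDependencyError` are hypothetical stand-ins for ColumnConfigT and DAGCircularDependencyError, and the split into DAG and non-DAG columns is left out.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field


class CircularDependencyError(Exception):
    """Hypothetical stand-in for DAGCircularDependencyError."""


@dataclass
class Column:
    """Hypothetical stand-in for a column config (not the real ColumnConfigT)."""

    name: str
    required_columns: list[str] = field(default_factory=list)
    side_effect_columns: list[str] = field(default_factory=list)


def topo_sort(columns: list[Column]) -> list[str]:
    # Map each side-effect column to the column that produces it (dict lookup).
    side_effect_map = {s: c.name for c in columns for s in c.side_effect_columns}
    names = {c.name for c in columns}

    # Build upstream/downstream sets from required_columns only.
    upstream: dict[str, set[str]] = {c.name: set() for c in columns}
    downstream: dict[str, set[str]] = {c.name: set() for c in columns}
    for c in columns:
        for req in c.required_columns:
            producer = side_effect_map.get(req, req)
            if producer in names:
                upstream[c.name].add(producer)
                downstream[producer].add(c.name)

    # Kahn's algorithm: repeatedly emit nodes whose in-degree has dropped to 0.
    in_degree = {name: len(ups) for name, ups in upstream.items()}
    # Seed the queue in input order so this sketch is deterministic.
    queue = deque(c.name for c in columns if in_degree[c.name] == 0)
    ordered: list[str] = []
    while queue:
        node = queue.popleft()
        ordered.append(node)
        for child in sorted(downstream[node]):  # sorted() avoids set-order ties
            in_degree[child] -= 1
            if in_degree[child] == 0:
                queue.append(child)

    # Any node never emitted is part of (or downstream of) a cycle.
    if len(ordered) != len(columns):
        stuck = ", ".join(sorted(names - set(ordered)))
        raise CircularDependencyError(f"cycle involving: {stuck}")
    return ordered
```

The sketch sorts each downstream set before visiting it, which makes the output reproducible across runs. An implementation that iterates the sets directly still yields a valid order, but ties between independent columns can then resolve differently from run to run.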
Reviews (5): Last reviewed commit: "docs(engine): document intentional skip...."
b25ebf6 to 775f147
@przemekboruta there's a larger PR in flight that touches these DAG abstractions. Let's wait until that merges before this one can re-base and merge!
…ution_graph.py

Eliminates dag.py and its networkx dependency by moving topologically_sort_column_configs into execution_graph.py as a module-level function. Side-effect resolution is now O(1) via a side_effect_map dict (previously an O(n²) linear scan). Kahn's algorithm is reused in place rather than leaning on networkx.topological_sort.

Closes NVIDIA-NeMo#510

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
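The side-effect lookup change in this commit can be illustrated roughly as follows. The shape of `side_effect_dict` (producer column name mapped to a list of side-effect column names), the column names, and both helper names are assumptions made for illustration; they are not taken from the PR diff.

```python
from __future__ import annotations


def find_producer_scan(side_effect_dict: dict[str, list[str]], column: str) -> str | None:
    # Scan-style lookup: walk every producer's list until the column is found.
    # Repeating this for every dependency of every column adds up quickly.
    for producer, side_effects in side_effect_dict.items():
        if column in side_effects:
            return producer
    return None


def build_side_effect_map(side_effect_dict: dict[str, list[str]]) -> dict[str, str]:
    # Invert the mapping once: side-effect column -> producer column.
    # Every later lookup is then a single dict access.
    return {
        side_effect: producer
        for producer, side_effects in side_effect_dict.items()
        for side_effect in side_effects
    }


# Hypothetical column names, for illustration only.
side_effect_dict = {"judge": ["judge_reasoning"], "code": ["code_trace"]}
side_effect_map = build_side_effect_map(side_effect_dict)
producer = side_effect_map.get("code_trace")  # same answer as the scan, one lookup
```

Building the inverted map costs one pass over all side-effect columns, so it pays off as soon as more than a handful of lookups are made.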
test_judge and test_code_and_depends_on_validation_reasoning_traces have no mutual dependency and reach in-degree 0 simultaneously in Kahn's algorithm. Set iteration order varies with PYTHONHASHSEED, making the strict list assertion flaky. Assert only the topological invariants.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
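A sketch of what asserting only the topological invariants can look like. The helper name, the column names, and the dependency pairs below are illustrative assumptions; the actual assertions in test_dag.py may be written differently.

```python
from __future__ import annotations


def assert_topological_invariants(ordered: list[str], edges: list[tuple[str, str]]) -> None:
    # Each column must appear exactly once.
    assert len(ordered) == len(set(ordered)), "duplicate column in ordering"
    position = {name: index for index, name in enumerate(ordered)}
    # Every upstream column must come before the column that depends on it.
    for upstream, downstream in edges:
        assert position[upstream] < position[downstream], (
            f"{upstream} must precede {downstream}"
        )


# Hypothetical graph: "judge" and "validation" both depend on "code"
# but not on each other, so either relative order is acceptable.
edges = [("code", "judge"), ("code", "validation")]
assert_topological_invariants(["code", "judge", "validation"], edges)
assert_topological_invariants(["code", "validation", "judge"], edges)
```

Checking edge positions instead of comparing against one fixed list keeps the test independent of how ties between unrelated columns are broken.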
520fbf2 to eddf3a3
Linked Issue Check
Issue #510 has not been triaged yet. A maintainer needs to review
You can continue working on the PR in the meantime. The check will
…ally_sort_column_configs

ExecutionGraph.create handles skip.when ordering edges in its own two-pass build; the pre-sort function only needs required_columns to produce a valid ColumnConfigT ordering for config compilation.
Summary
Closes #510
- Deletes dag.py and moves topologically_sort_column_configs into execution_graph.py as a module-level function
- Replaces networkx.topological_sort with an inline Kahn's algorithm, consistent with ExecutionGraph.get_topological_order
- Resolves side effects in O(1) via a side_effect_map dict; the previous implementation did a linear scan over sum(side_effect_dict.values(), []), which was O(n²)
- Updates imports in config_compiler.py and test_dag.py

Design note
The function is intentionally a module-level function, not a @classmethod on ExecutionGraph. ExecutionGraph is an execution abstraction (requires strategies, manages task scheduling); this function is a compilation step that works on raw ColumnConfigT without strategies. Mixing the two responsibilities would require either dummy strategies or a significant signature change to add_column.

Test plan
- test_dag.py tests (test_dag_construction, test_circular_dependencies) cover the migrated function with the updated import path
- ruff check passes on all modified files