
docs: add Nemotron Nano recipes (structured data, prompt sensitivity, infinibyte)#544

Open
dhruvnathawani wants to merge 3 commits into main from dhruv/recipes/nano

Conversation


@dhruvnathawani dhruvnathawani commented Apr 14, 2026

📋 Summary

Adds three new recipes implementing SDG pipelines used for Nemotron Nano training: structured data generation (multi-format schemas), prompt sensitivity (regex-verified preamble diversity), and InfiniByte (cross-source problem augmentation). Introduces a new "Model Usability" recipe category.

🔄 Changes

Added the following:

  • docs/assets/recipes/model_usability/structured_data.py — Five-stage pipeline: samplers → schema generation → user prompt → conversation pairs → best-of-3 structured output across JSON, YAML, XML, Markdown. Demonstrates
    SubcategorySamplerParams for conditional topic sampling.
  • docs/assets/recipes/model_usability/prompt_sensitivity.py — Seed-driven pipeline with 10 regex answer formats × 30 preambles, 7 diversity samplers, 3 LLM paraphrasing stages, and 4 LLM judges (format compliance, regex alignment, order
    coherence, preamble quality).
  • docs/assets/recipes/code_generation/infinibyte.py — Cross-source problem generation using HF streaming, random cross-join, LLMStructuredColumnConfig with Pydantic models for candidate generation/selection/evaluation, and solution generation.
  • docs/recipes/model_usability/structured_data.md — recipe doc page
  • docs/recipes/model_usability/prompt_sensitivity.md — recipe doc page
  • docs/recipes/code_generation/infinibyte.md — recipe doc page

🔧 Changed

  • docs/recipes/cards.md — three new recipe cards added
  • mkdocs.yml — nav entries for new Model Usability category and InfiniByte under Code Generation

🧪 Testing

  • structured_data.py --num-records 2 — 2/2 records, all columns generated
  • prompt_sensitivity.py --num-records 2 — 2/2 records, all 4 judges scored
  • infinibyte.py --num-records 2 --limit 100 — pipeline stages execute correctly (streaming + cross-join + structured columns all work; default nvidia-text model times out on long coding problems, documented in prerequisites)
  • uv run mkdocs build — no errors for new recipe files
  • make check-all-fix — all checks passed

✅ Checklist

  • Follows commit message conventions
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)


github-actions bot commented Apr 14, 2026

Docs preview: https://6ef95b66.dd-docs-preview.pages.dev

Notebook tutorials are placeholder-only in previews.

@dhruvnathawani dhruvnathawani changed the title [DRAFT] docs: add Nemotron Nano recipes (structured data, prompt sensitivity,…) docs: add Nemotron Nano recipes (structured data, prompt sensitivity, infinibyte) Apr 15, 2026
@dhruvnathawani dhruvnathawani marked this pull request as ready for review April 15, 2026 05:13
@dhruvnathawani dhruvnathawani requested a review from a team as a code owner April 15, 2026 05:13

greptile-apps bot commented Apr 15, 2026

Greptile Summary

This PR adds three new Nemotron Nano recipe files (structured data, prompt sensitivity, InfiniByte), corresponding documentation pages, recipe cards, and mkdocs.yml nav entries. The pipeline logic in infinibyte.py and structured_data.py looks correct and aligns with the described architectures. Several regex pattern bugs in prompt_sensitivity.py that were flagged in prior review threads remain unaddressed.

Confidence Score: 4/5

Safe to merge after resolving the broken regex patterns in prompt_sensitivity.py noted in prior review threads.

The infinibyte and structured_data recipes are logically correct. However, prompt_sensitivity.py carries several regex bugs (fmt_00/fmt_09 word-boundary anchor + character-class issues, fmt_05 mismatched bracket/paren, fmt_08 character-class issue) that were flagged in prior review threads and appear unaddressed; these make the output_regex fields non-functional for their intended formats, breaking the LLM judge's regex alignment scoring.

docs/assets/recipes/model_usability/prompt_sensitivity.py — regex patterns in FORMAT_TEMPLATES need correction.

Important Files Changed

| Filename | Overview |
| --- | --- |
| docs/assets/recipes/model_usability/prompt_sensitivity.py | Contains multiple broken regex patterns in FORMAT_TEMPLATES (fmt_00, fmt_05, fmt_08, fmt_09) flagged in prior review threads; these make the output_regex fields non-functional for their intended formats. |
| docs/assets/recipes/code_generation/infinibyte.py | New 5-stage pipeline recipe; cross-join logic, Pydantic models, and column dependency ordering are all correct. |
| docs/assets/recipes/model_usability/structured_data.py | New 5-stage structured-data recipe; sampler configs, prompt templates, and column dependency ordering look correct. |
| docs/recipes/cards.md | Three new recipe cards added with correct asset and doc page links. |
| mkdocs.yml | Nav entries for new Model Usability category and InfiniByte under Code Generation added correctly. |

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    subgraph infinibyte["infinibyte.py — InfiniByte Pipeline"]
        A1["HF Streaming Download\n(OpenCodeReasoning + OpenMathReasoning)"] --> A2["Cross-Join with Random Sampling"]
        A2 --> A3["Seed CSV + combination_type sampler"]
        A3 --> A4["Stage 2: Candidate Generation\nLLMStructuredColumn → NewProblemList"]
        A4 --> A5["Stage 3: Best Problem Selection\nLLMStructuredColumn → NewProblemWithReasoning"]
        A5 --> A6["ExpressionColumn: new_problem"]
        A6 --> A7["Stage 4: Evaluation\nLLMStructuredColumn → NewProblemEvals"]
        A7 --> A8["Stage 5: Solution Generation\nLLMTextColumn"]
    end
    subgraph ps["prompt_sensitivity.py — Prompt Sensitivity Pipeline"]
        B1["Seed CSV\n10 regex formats × 30 preambles"] --> B2["Stage 1: 7 Diversity Samplers"]
        B2 --> B3["Stage 2: Preamble Generation\nLLMTextColumn"]
        B3 --> B4["Stage 3: Format Instruction Generation\nLLMTextColumn"]
        B4 --> B5["Stage 4: User Prompt Composition\nLLMTextColumn"]
        B5 --> B6["Stage 5: 4 LLM Judges\nformat_compliance, regex_alignment\norder_coherence, preamble_quality"]
    end
    subgraph sd["structured_data.py — Structured Data Pipeline"]
        C1["Stage 1: Samplers\nformat, topic/subtopic, schema controls, conversation controls"] --> C2["Stage 2: Schema Generation\nLLMTextColumn"]
        C2 --> C3["Stage 3: User Prompt Generation\nLLMTextColumn"]
        C3 --> C4["Stage 4: Conversation Pairs\nLLMTextColumn"]
        C4 --> C5["Stage 5: Best-of-3 Structured Output\n3× LLMTextColumn"]
    end
```

Reviews (2): Last reviewed commit: "Merge branch 'main' into dhruv/recipes/n..."

    },
    {
        "format_key": "fmt_05",
        "output_regex": r"\[Answer:\s*([A-Za-z])\)",

P1 Mismatched closing delimiter in fmt_05 regex

The regex opens with \[ (escaped left square bracket) but closes with \) (escaped right parenthesis), so it matches [Answer: X) instead of [Answer: X]. The seed instruction explicitly says "end with [Answer: X]", so the regex and instruction are inconsistent — the LLM judge that checks Regex Alignment will evaluate against a pattern that doesn't match what it was designed to produce.

Suggested change
"output_regex": r"\[Answer:\s*([A-Za-z])\)",
"output_regex": r"\[Answer:\s*([A-Za-z])\]",
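The delimiter mismatch is easy to confirm with the standard `re` module; the sample strings below are illustrative, not taken from the recipe:

```python
import re

# Corrected fmt_05 pattern: the closing delimiter is an escaped ']' so it
# matches the literal "[Answer: X]" form the seed instruction asks for.
FMT_05 = r"\[Answer:\s*([A-Za-z])\]"

match = re.search(FMT_05, "Final choice below.\n[Answer: C]")
print(match.group(1))  # -> C

# The original pattern (closing with \)) never matches "[Answer: C]".
assert re.search(r"\[Answer:\s*([A-Za-z])\)", "[Answer: C]") is None
```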

FORMAT_TEMPLATES = [
    {
        "format_key": "fmt_00",
        "output_regex": r"\boxed{([.*?])}",

P2 Incorrect regex for LaTeX \boxed{} format

r"\boxed{([.*?])}" has two problems: \b in a raw string is the regex word-boundary anchor (not a literal backslash + b), and [.*?] is a character class matching only the three characters ., *, or ?. The regex never matches the intended \boxed{<answer>} LaTeX output. The same issue appears in fmt_09 (line 123). The correct pattern to match a literal \boxed{…} is:

Suggested change
"output_regex": r"\boxed{([.*?])}",
"output_regex": r"\\boxed\{(.*?)\}",
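Both failure modes can be demonstrated directly (sample strings are illustrative):

```python
import re

# Buggy original: \b is the word-boundary anchor, not a literal backslash,
# and [.*?] is a character class matching only '.', '*', or '?'.
buggy = re.compile(r"\boxed{([.*?])}")
assert buggy.search(r"\boxed{42}") is None  # never matches a real answer

# Fixed: double backslash for the literal '\', (.*?) as a lazy capture group.
fixed = re.compile(r"\\boxed\{(.*?)\}")
m = fixed.search(r"The answer is \boxed{42}.")
print(m.group(1))  # -> 42
```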

    },
    {
        "format_key": "fmt_08",
        "output_regex": r"<final_answer>\s*([.*?])\s*</final_answer>",

P2 [.*?] character class captures only ., *, or ?

([.*?]) is a capture group containing a character class that matches exactly one of the three literal characters ., *, ?. It won't capture any real answer content inside <final_answer>…</final_answer>. The intended lazy-match wildcard should be outside the brackets:

Suggested change
"output_regex": r"<final_answer>\s*([.*?])\s*</final_answer>",
"output_regex": r"<final_answer>\s*(.*?)\s*</final_answer>",
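A quick check with an illustrative sample shows the corrected pattern capturing multi-character content, which the character-class version cannot:

```python
import re

# Fixed fmt_08: the lazy wildcard (.*?) sits outside brackets; [.*?] would
# match exactly one '.', '*', or '?' character and nothing else.
FMT_08 = r"<final_answer>\s*(.*?)\s*</final_answer>"

m = re.search(FMT_08, "<final_answer> 3.14 </final_answer>")
print(m.group(1))  # -> 3.14
```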

@github-actions

Code Review: PR #544 — docs: add Nemotron Nano recipes (structured data, prompt sensitivity, infinibyte)

Summary

This PR adds three new recipe scripts and accompanying documentation for Nemotron Nano training pipelines: Structured Data (multi-format schema generation), Prompt Sensitivity (regex-verified preamble diversity), and InfiniByte (cross-source problem augmentation). It also introduces a new "Model Usability" recipe category in the docs navigation. The changes are entirely in docs/ — no library code is modified.

Files changed: 8 (3 Python recipe scripts, 3 Markdown doc pages, cards.md, mkdocs.yml)
Lines: +1568, -0

Findings

High — Regex Patterns in prompt_sensitivity.py Have Multiple Issues

File: docs/assets/recipes/model_usability/prompt_sensitivity.py (lines 63–111 in the diff, the FORMAT_TEMPLATES list)

Several output_regex patterns appear incorrect. Since these are passed to LLM judges to evaluate "regex alignment," incorrect patterns will produce unreliable evaluation scores:

  1. \boxed is not escaped correctly (fmt_00, fmt_09): In Python regex, \b is the word-boundary anchor, so r"\boxed{...}" matches word-boundary + oxed{...}, not the literal string \boxed{...}. To match the LaTeX \boxed{} literally, use r"\\boxed\{..." or a double backslash.

  2. [.*?] is a character class, not a wildcard (fmt_00, fmt_08, fmt_09): [.*?] matches a single character that is ., *, or ?. The likely intent is (.*?) (non-greedy capture group) or .+?.

  3. fmt_05 has mismatched brackets: The regex r"\[Answer:\s*([A-Za-z])\)" opens with \[ (literal [) but closes with \) (literal )). The seed_format_instruction says "end with [Answer: X]" — so the closing delimiter should be \], not \).

  4. fmt_00 and fmt_09 are duplicates: Both use the identical regex r"\boxed{([.*?])}". Their seed_format_instruction values differ, but the regex and format_key should likely differ too, or one should be removed.

Impact: These patterns are seed data for an LLM pipeline. The regex_alignment LLM judge evaluates whether generated format instructions match the output_regex. If the regex itself is wrong, the judge's scoring reference is unreliable, degrading the quality signal in the generated dataset.

Recommendation: Verify these patterns against the original Nemotron Nano pipeline. If they were copied verbatim from the training codebase, document that in a comment. If they are new to this recipe, fix them.
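One way to verify is a small smoke test over the corrected patterns; the sample strings below are hypothetical, not drawn from the Nemotron Nano pipeline:

```python
import re

# Each (pattern, sample, expected) triple pairs a corrected seed regex with
# an illustrative output it should capture from.
CASES = [
    (r"\\boxed\{(.*?)\}", r"so \boxed{7}", "7"),                # fmt_00/fmt_09
    (r"\[Answer:\s*([A-Za-z])\]", "[Answer: B]", "B"),          # fmt_05
    (r"<final_answer>\s*(.*?)\s*</final_answer>",
     "<final_answer>yes</final_answer>", "yes"),                # fmt_08
]

for pattern, sample, expected in CASES:
    m = re.search(pattern, sample)
    assert m and m.group(1) == expected, (pattern, sample)
print("all corrected seed regexes capture as intended")
```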

Low — Doc Page Format Inconsistency

Files: docs/recipes/code_generation/infinibyte.md, docs/recipes/model_usability/structured_data.md, docs/recipes/model_usability/prompt_sensitivity.md

The new recipe doc pages include a # heading and a description paragraph before the download button:

# Nemotron Nano InfiniByte

Generate more diverse and complex training problems...

[Download Code ...]

Most existing recipe doc pages (e.g., text_to_python.md, product_info_qa.md) have only the download button and code include — no heading or description. Some newer ones (e.g., enterprise_text_to_sql.md) do include a heading and notes.

Impact: Minor visual inconsistency in docs. The added heading/description is arguably an improvement, providing better context to readers. Not blocking.

Informational — hashlib.md5 in infinibyte.py

File: docs/assets/recipes/code_generation/infinibyte.py, line ~102 (within fetch_hf_dataset_to_df)

rec_id = rec.get("id") or hashlib.md5(text.encode("utf-8")).hexdigest()

MD5 is used as a fallback ID when a HuggingFace record has no id field. This is fine for deduplication/identification (not security), but some linters and security scanners flag hashlib.md5 usage. Consider hashlib.sha256 for forward-compatibility if the pipeline is adapted to stricter environments.
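A minimal sketch of the suggested swap, with `rec` and `text` mirroring the names quoted above (the helper name `fallback_id` is invented for illustration):

```python
import hashlib

def fallback_id(rec: dict, text: str) -> str:
    # sha256 in place of md5: same deterministic-fallback behavior,
    # but not flagged by security linters in stricter environments.
    return rec.get("id") or hashlib.sha256(text.encode("utf-8")).hexdigest()

print(fallback_id({}, "def solve(): ..."))  # deterministic 64-hex-char ID
```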

Informational — Single Strategy Defined in InfiniByte

File: docs/assets/recipes/code_generation/infinibyte.py, lines ~55–57

STRATEGIES = {
    "ocr_omr": ("ocr", "omr"),
}

The --strategy CLI arg accepts choices=list(STRATEGIES.keys()) but only one strategy (ocr_omr) is defined. This is fine for a recipe — it demonstrates extensibility — but the CLI help could note that additional strategies can be added by extending the STRATEGIES dict.
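A sketch of how that extensibility plays out, assuming a standard argparse setup like the one described (the commented-out entry is hypothetical):

```python
import argparse

# New cross-source pairings become valid CLI choices automatically,
# because choices is derived from the dict's keys.
STRATEGIES = {
    "ocr_omr": ("ocr", "omr"),
    # "omr_omr": ("omr", "omr"),  # add further pairings here
}

parser = argparse.ArgumentParser()
parser.add_argument(
    "--strategy",
    choices=list(STRATEGIES.keys()),
    default="ocr_omr",
    help="Cross-source pairing; extend the STRATEGIES dict to add more.",
)
args = parser.parse_args(["--strategy", "ocr_omr"])
print(STRATEGIES[args.strategy])  # -> ('ocr', 'omr')
```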

Positive Observations

  • Well-structured pipeline designs with clear ASCII architecture diagrams in each recipe's docstring.
  • Proper use of DataDesigner APIs: LLMStructuredColumnConfig with Pydantic models (infinibyte), LLMJudgeColumnConfig with Score rubrics (prompt_sensitivity), SubcategorySamplerParams for conditional sampling (structured_data), ExpressionColumnConfig for extracting structured fields.
  • SPDX license headers present on all new files.
  • PEP 723 inline script metadata (# /// script) correctly specified for uv run compatibility.
  • from __future__ import annotations included in all three recipe files (consistent with project style guide).
  • Consistent CLI interface across all three recipes (--model-alias, --num-records, --artifact-path).
  • Recipe cards in cards.md follow the established grid pattern with icons, descriptions, "Demonstrates" sections, and action buttons.
  • mkdocs.yml nav entries are properly structured, creating a new "Model Usability" category cleanly.

Verdict

Approve with suggestions. The recipes are well-crafted, demonstrate advanced DataDesigner features, and follow established patterns. The regex pattern issues in prompt_sensitivity.py are the primary concern — they should be verified against the original Nemotron Nano pipeline and corrected or annotated. The other findings are minor and non-blocking.

