
docs: add Nemotron Nano recipes (structured data, prompt sensitivity, infinibyte)#544

Open
dhruvnathawani wants to merge 3 commits into main from dhruv/recipes/nano

Conversation


@dhruvnathawani dhruvnathawani commented Apr 14, 2026

📋 Summary

Adds three new recipes implementing SDG pipelines used for Nemotron Nano training: structured data generation (multi-format schemas), prompt sensitivity (regex-verified preamble diversity), and InfiniByte (cross-source problem augmentation). Introduces a new "Model Usability" recipe category.

🔄 Changes

Added the following:

  • docs/assets/recipes/model_usability/structured_data.py — Five-stage pipeline: samplers → schema generation → user prompt → conversation pairs → best-of-3 structured output across JSON, YAML, XML, Markdown. Demonstrates
    SubcategorySamplerParams for conditional topic sampling.
  • docs/assets/recipes/model_usability/prompt_sensitivity.py — Seed-driven pipeline with 10 regex answer formats × 30 preambles, 7 diversity samplers, 3 LLM paraphrasing stages, and 4 LLM judges (format compliance, regex alignment, order
    coherence, preamble quality).
  • docs/assets/recipes/code_generation/infinibyte.py — Cross-source problem generation using HF streaming, random cross-join, LLMStructuredColumnConfig with Pydantic models for candidate generation/selection/evaluation, and solution generation.
  • docs/recipes/model_usability/structured_data.md — recipe doc page
  • docs/recipes/model_usability/prompt_sensitivity.md — recipe doc page
  • docs/recipes/code_generation/infinibyte.md — recipe doc page

🔧 Changed

  • docs/recipes/cards.md — three new recipe cards added
  • mkdocs.yml — nav entries for new Model Usability category and InfiniByte under Code Generation

🧪 Testing

  • structured_data.py --num-records 2 — 2/2 records, all columns generated
  • prompt_sensitivity.py --num-records 2 — 2/2 records, all 4 judges scored
  • infinibyte.py --num-records 2 --limit 100 — pipeline stages execute correctly (streaming + cross-join + structured columns all work; default nvidia-text model times out on long coding problems, documented in prerequisites)
  • uv run mkdocs build — no errors for new recipe files
  • make check-all-fix — all checks passed

✅ Checklist

  • Follows commit message conventions
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)


github-actions bot commented Apr 14, 2026

Docs preview: https://6ef95b66.dd-docs-preview.pages.dev

Notebook tutorials are placeholder-only in previews.

@dhruvnathawani dhruvnathawani changed the title [DRAFT] docs: add Nemotron Nano recipes (structured data, prompt sensitivity,…) docs: add Nemotron Nano recipes (structured data, prompt sensitivity, infinibyte) Apr 15, 2026
@dhruvnathawani dhruvnathawani marked this pull request as ready for review April 15, 2026 05:13
@dhruvnathawani dhruvnathawani requested a review from a team as a code owner April 15, 2026 05:13

greptile-apps bot commented Apr 15, 2026

Greptile Summary

This PR adds three new Nemotron Nano recipe files (structured data, prompt sensitivity, InfiniByte), corresponding documentation pages, recipe cards, and mkdocs.yml nav entries. The pipeline logic in infinibyte.py and structured_data.py looks correct and aligns with the described architectures. Several regex pattern bugs in prompt_sensitivity.py that were flagged in prior review threads remain unaddressed.

Confidence Score: 4/5

Safe to merge after resolving the broken regex patterns in prompt_sensitivity.py noted in prior review threads.

The infinibyte and structured_data recipes are logically correct. However, prompt_sensitivity.py carries several regex bugs (fmt_00/fmt_09 word-boundary anchor + character-class issues, fmt_05 mismatched bracket/paren, fmt_08 character-class issue) that were flagged in prior review threads and appear unaddressed; these make the output_regex fields non-functional for their intended formats, breaking the LLM judge's regex alignment scoring.

docs/assets/recipes/model_usability/prompt_sensitivity.py — regex patterns in FORMAT_TEMPLATES need correction.

Important Files Changed

| Filename | Overview |
| --- | --- |
| docs/assets/recipes/model_usability/prompt_sensitivity.py | Contains multiple broken regex patterns in FORMAT_TEMPLATES (fmt_00, fmt_05, fmt_08, fmt_09) flagged in prior review threads; these make the output_regex fields non-functional for their intended formats. |
| docs/assets/recipes/code_generation/infinibyte.py | New 5-stage pipeline recipe; cross-join logic, Pydantic models, and column dependency ordering are all correct. |
| docs/assets/recipes/model_usability/structured_data.py | New 5-stage structured-data recipe; sampler configs, prompt templates, and column dependency ordering look correct. |
| docs/recipes/cards.md | Three new recipe cards added with correct asset and doc page links. |
| mkdocs.yml | Nav entries for new Model Usability category and InfiniByte under Code Generation added correctly. |

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    subgraph infinibyte["infinibyte.py — InfiniByte Pipeline"]
        A1["HF Streaming Download\n(OpenCodeReasoning + OpenMathReasoning)"] --> A2["Cross-Join with Random Sampling"]
        A2 --> A3["Seed CSV + combination_type sampler"]
        A3 --> A4["Stage 2: Candidate Generation\nLLMStructuredColumn → NewProblemList"]
        A4 --> A5["Stage 3: Best Problem Selection\nLLMStructuredColumn → NewProblemWithReasoning"]
        A5 --> A6["ExpressionColumn: new_problem"]
        A6 --> A7["Stage 4: Evaluation\nLLMStructuredColumn → NewProblemEvals"]
        A7 --> A8["Stage 5: Solution Generation\nLLMTextColumn"]
    end
    subgraph ps["prompt_sensitivity.py — Prompt Sensitivity Pipeline"]
        B1["Seed CSV\n10 regex formats × 30 preambles"] --> B2["Stage 1: 7 Diversity Samplers"]
        B2 --> B3["Stage 2: Preamble Generation\nLLMTextColumn"]
        B3 --> B4["Stage 3: Format Instruction Generation\nLLMTextColumn"]
        B4 --> B5["Stage 4: User Prompt Composition\nLLMTextColumn"]
        B5 --> B6["Stage 5: 4 LLM Judges\nformat_compliance, regex_alignment\norder_coherence, preamble_quality"]
    end
    subgraph sd["structured_data.py — Structured Data Pipeline"]
        C1["Stage 1: Samplers\nformat, topic/subtopic, schema controls, conversation controls"] --> C2["Stage 2: Schema Generation\nLLMTextColumn"]
        C2 --> C3["Stage 3: User Prompt Generation\nLLMTextColumn"]
        C3 --> C4["Stage 4: Conversation Pairs\nLLMTextColumn"]
        C4 --> C5["Stage 5: Best-of-3 Structured Output\n3× LLMTextColumn"]
    end
```

Reviews (2): Last reviewed commit: "Merge branch 'main' into dhruv/recipes/n..."

    },
    {
        "format_key": "fmt_05",
        "output_regex": r"\[Answer:\s*([A-Za-z])\)",

P1 Mismatched closing delimiter in fmt_05 regex

The regex opens with \[ (escaped left square bracket) but closes with \) (escaped right parenthesis), so it matches [Answer: X) instead of [Answer: X]. The seed instruction explicitly says "end with [Answer: X]", so the regex and instruction are inconsistent — the LLM judge that checks Regex Alignment will evaluate against a pattern that doesn't match what it was designed to produce.

Suggested change
"output_regex": r"\[Answer:\s*([A-Za-z])\)",
"output_regex": r"\[Answer:\s*([A-Za-z])\]",
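The delimiter mismatch is easy to confirm with the standard `re` module; the sample strings below are illustrative, not taken from the recipe:

```python
import re

# Corrected fmt_05 pattern: the closing delimiter is an escaped ']' so it
# matches the literal "[Answer: X]" form the seed instruction asks for.
FMT_05 = r"\[Answer:\s*([A-Za-z])\]"

match = re.search(FMT_05, "Final choice below.\n[Answer: C]")
print(match.group(1))  # -> C

# The original pattern (closing with \)) never matches "[Answer: C]".
assert re.search(r"\[Answer:\s*([A-Za-z])\)", "[Answer: C]") is None
```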

FORMAT_TEMPLATES = [
    {
        "format_key": "fmt_00",
        "output_regex": r"\boxed{([.*?])}",

P2 Incorrect regex for LaTeX \boxed{} format

r"\boxed{([.*?])}" has two problems: \b in a raw string is the regex word-boundary anchor (not a literal backslash + b), and [.*?] is a character class matching only the three characters ., *, or ?. The regex never matches the intended \boxed{<answer>} LaTeX output. The same issue appears in fmt_09 (line 123). The correct pattern to match a literal \boxed{…} is:

Suggested change
"output_regex": r"\boxed{([.*?])}",
"output_regex": r"\\boxed\{(.*?)\}",
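Both failure modes can be demonstrated directly (sample strings are illustrative):

```python
import re

# Buggy original: \b is the word-boundary anchor, not a literal backslash,
# and [.*?] is a character class matching only '.', '*', or '?'.
buggy = re.compile(r"\boxed{([.*?])}")
assert buggy.search(r"\boxed{42}") is None  # never matches a real answer

# Fixed: double backslash for the literal '\', (.*?) as a lazy capture group.
fixed = re.compile(r"\\boxed\{(.*?)\}")
m = fixed.search(r"The answer is \boxed{42}.")
print(m.group(1))  # -> 42
```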

    },
    {
        "format_key": "fmt_08",
        "output_regex": r"<final_answer>\s*([.*?])\s*</final_answer>",

P2 [.*?] character class captures only ., *, or ?

([.*?]) is a capture group containing a character class that matches exactly one of the three literal characters ., *, ?. It won't capture any real answer content inside <final_answer>…</final_answer>. The intended lazy-match wildcard should be outside the brackets:

Suggested change
"output_regex": r"<final_answer>\s*([.*?])\s*</final_answer>",
"output_regex": r"<final_answer>\s*(.*?)\s*</final_answer>",
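A quick check with an illustrative sample shows the corrected pattern capturing multi-character content, which the character-class version cannot:

```python
import re

# Fixed fmt_08: the lazy wildcard (.*?) sits outside brackets; [.*?] would
# match exactly one '.', '*', or '?' character and nothing else.
FMT_08 = r"<final_answer>\s*(.*?)\s*</final_answer>"

m = re.search(FMT_08, "<final_answer> 3.14 </final_answer>")
print(m.group(1))  # -> 3.14
```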

@github-actions

Code Review: PR #544 — docs: add Nemotron Nano recipes (structured data, prompt sensitivity, infinibyte)

Summary

This PR adds three new recipe scripts and accompanying documentation for Nemotron Nano training pipelines: Structured Data (multi-format schema generation), Prompt Sensitivity (regex-verified preamble diversity), and InfiniByte (cross-source problem augmentation). It also introduces a new "Model Usability" recipe category in the docs navigation. The changes are entirely in docs/ — no library code is modified.

Files changed: 8 (3 Python recipe scripts, 3 Markdown doc pages, cards.md, mkdocs.yml)
Lines: +1568, -0

Findings

High — Regex Patterns in prompt_sensitivity.py Have Multiple Issues

File: docs/assets/recipes/model_usability/prompt_sensitivity.py (lines 63–111 in the diff, the FORMAT_TEMPLATES list)

Several output_regex patterns appear incorrect. Since these are passed to LLM judges to evaluate "regex alignment," incorrect patterns will produce unreliable evaluation scores:

  1. \boxed is not escaped correctly (fmt_00, fmt_09): In Python regex, \b is the word-boundary anchor, so r"\boxed{...}" matches word-boundary + oxed{...}, not the literal string \boxed{...}. To match the LaTeX \boxed{} literally, use r"\\boxed\{..." or a double backslash.

  2. [.*?] is a character class, not a wildcard (fmt_00, fmt_08, fmt_09): [.*?] matches a single character that is ., *, or ?. The likely intent is (.*?) (non-greedy capture group) or .+?.

  3. fmt_05 has mismatched brackets: The regex r"\[Answer:\s*([A-Za-z])\)" opens with \[ (literal [) but closes with \) (literal )). The seed_format_instruction says "end with [Answer: X]" — so the closing delimiter should be \], not \).

  4. fmt_00 and fmt_09 are duplicates: Both use the identical regex r"\boxed{([.*?])}". Their seed_format_instruction values differ, but the regex and format_key should likely differ too, or one should be removed.

Impact: These patterns are seed data for an LLM pipeline. The regex_alignment LLM judge evaluates whether generated format instructions match the output_regex. If the regex itself is wrong, the judge's scoring reference is unreliable, degrading the quality signal in the generated dataset.

Recommendation: Verify these patterns against the original Nemotron Nano pipeline. If they were copied verbatim from the training codebase, document that in a comment. If they are new to this recipe, fix them.
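One way to verify is a small smoke test over the corrected patterns; the sample strings below are hypothetical, not drawn from the Nemotron Nano pipeline:

```python
import re

# Each (pattern, sample, expected) triple pairs a corrected seed regex with
# an illustrative output it should capture from.
CASES = [
    (r"\\boxed\{(.*?)\}", r"so \boxed{7}", "7"),                # fmt_00/fmt_09
    (r"\[Answer:\s*([A-Za-z])\]", "[Answer: B]", "B"),          # fmt_05
    (r"<final_answer>\s*(.*?)\s*</final_answer>",
     "<final_answer>yes</final_answer>", "yes"),                # fmt_08
]

for pattern, sample, expected in CASES:
    m = re.search(pattern, sample)
    assert m and m.group(1) == expected, (pattern, sample)
print("all corrected seed regexes capture as intended")
```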

Low — Doc Page Format Inconsistency

Files: docs/recipes/code_generation/infinibyte.md, docs/recipes/model_usability/structured_data.md, docs/recipes/model_usability/prompt_sensitivity.md

The new recipe doc pages include a # heading and a description paragraph before the download button:

# Nemotron Nano InfiniByte

Generate more diverse and complex training problems...

[Download Code ...]

Most existing recipe doc pages (e.g., text_to_python.md, product_info_qa.md) have only the download button and code include — no heading or description. Some newer ones (e.g., enterprise_text_to_sql.md) do include a heading and notes.

Impact: Minor visual inconsistency in docs. The added heading/description is arguably an improvement, providing better context to readers. Not blocking.

Informational — hashlib.md5 in infinibyte.py

File: docs/assets/recipes/code_generation/infinibyte.py, line ~102 (within fetch_hf_dataset_to_df)

rec_id = rec.get("id") or hashlib.md5(text.encode("utf-8")).hexdigest()

MD5 is used as a fallback ID when a HuggingFace record has no id field. This is fine for deduplication/identification (not security), but some linters and security scanners flag hashlib.md5 usage. Consider hashlib.sha256 for forward-compatibility if the pipeline is adapted to stricter environments.
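A minimal sketch of the suggested swap, with `rec` and `text` mirroring the names quoted above (the helper name `fallback_id` is invented for illustration):

```python
import hashlib

def fallback_id(rec: dict, text: str) -> str:
    # sha256 in place of md5: same deterministic-fallback behavior,
    # but not flagged by security linters in stricter environments.
    return rec.get("id") or hashlib.sha256(text.encode("utf-8")).hexdigest()

print(fallback_id({}, "def solve(): ..."))  # deterministic 64-hex-char ID
```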

Informational — Single Strategy Defined in InfiniByte

File: docs/assets/recipes/code_generation/infinibyte.py, lines ~55–57

STRATEGIES = {
    "ocr_omr": ("ocr", "omr"),
}

The --strategy CLI arg accepts choices=list(STRATEGIES.keys()) but only one strategy (ocr_omr) is defined. This is fine for a recipe — it demonstrates extensibility — but the CLI help could note that additional strategies can be added by extending the STRATEGIES dict.
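A sketch of how that extensibility plays out, assuming a standard argparse setup like the one described (the commented-out entry is hypothetical):

```python
import argparse

# New cross-source pairings become valid CLI choices automatically,
# because choices is derived from the dict's keys.
STRATEGIES = {
    "ocr_omr": ("ocr", "omr"),
    # "omr_omr": ("omr", "omr"),  # add further pairings here
}

parser = argparse.ArgumentParser()
parser.add_argument(
    "--strategy",
    choices=list(STRATEGIES.keys()),
    default="ocr_omr",
    help="Cross-source pairing; extend the STRATEGIES dict to add more.",
)
args = parser.parse_args(["--strategy", "ocr_omr"])
print(STRATEGIES[args.strategy])  # -> ('ocr', 'omr')
```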

Positive Observations

  • Well-structured pipeline designs with clear ASCII architecture diagrams in each recipe's docstring.
  • Proper use of DataDesigner APIs: LLMStructuredColumnConfig with Pydantic models (infinibyte), LLMJudgeColumnConfig with Score rubrics (prompt_sensitivity), SubcategorySamplerParams for conditional sampling (structured_data), ExpressionColumnConfig for extracting structured fields.
  • SPDX license headers present on all new files.
  • PEP 723 inline script metadata (# /// script) correctly specified for uv run compatibility.
  • from __future__ import annotations included in all three recipe files (consistent with project style guide).
  • Consistent CLI interface across all three recipes (--model-alias, --num-records, --artifact-path).
  • Recipe cards in cards.md follow the established grid pattern with icons, descriptions, "Demonstrates" sections, and action buttons.
  • mkdocs.yml nav entries are properly structured, creating a new "Model Usability" category cleanly.

Verdict

Approve with suggestions. The recipes are well-crafted, demonstrate advanced DataDesigner features, and follow established patterns. The regex pattern issues in prompt_sensitivity.py are the primary concern — they should be verified against the original Nemotron Nano pipeline and corrected or annotated. The other findings are minor and non-blocking.

