docs: add Nemotron Nano recipes (structured data, prompt sensitivity, infinibyte) #544

dhruvnathawani wants to merge 3 commits into `main`.
Conversation
Docs preview: https://6ef95b66.dd-docs-preview.pages.dev
Greptile Summary

This PR adds three new Nemotron Nano recipe files (structured data, prompt sensitivity, InfiniByte), corresponding documentation pages, recipe cards, and `mkdocs.yml` navigation entries.
| Filename | Overview |
|---|---|
| docs/assets/recipes/model_usability/prompt_sensitivity.py | Contains multiple broken regex patterns in FORMAT_TEMPLATES (fmt_00, fmt_05, fmt_08, fmt_09) flagged in prior review threads; these make the output_regex fields non-functional for their intended formats. |
| docs/assets/recipes/code_generation/infinibyte.py | New 5-stage pipeline recipe; cross-join logic, Pydantic models, and column dependency ordering are all correct. |
| docs/assets/recipes/model_usability/structured_data.py | New 5-stage structured-data recipe; sampler configs, prompt templates, and column dependency ordering look correct. |
| docs/recipes/cards.md | Three new recipe cards added with correct asset and doc page links. |
| mkdocs.yml | Nav entries for new Model Usability category and InfiniByte under Code Generation added correctly. |
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    subgraph infinibyte["infinibyte.py — InfiniByte Pipeline"]
        A1["HF Streaming Download\n(OpenCodeReasoning + OpenMathReasoning)"] --> A2["Cross-Join with Random Sampling"]
        A2 --> A3["Seed CSV + combination_type sampler"]
        A3 --> A4["Stage 2: Candidate Generation\nLLMStructuredColumn → NewProblemList"]
        A4 --> A5["Stage 3: Best Problem Selection\nLLMStructuredColumn → NewProblemWithReasoning"]
        A5 --> A6["ExpressionColumn: new_problem"]
        A6 --> A7["Stage 4: Evaluation\nLLMStructuredColumn → NewProblemEvals"]
        A7 --> A8["Stage 5: Solution Generation\nLLMTextColumn"]
    end
    subgraph ps["prompt_sensitivity.py — Prompt Sensitivity Pipeline"]
        B1["Seed CSV\n10 regex formats × 30 preambles"] --> B2["Stage 1: 7 Diversity Samplers"]
        B2 --> B3["Stage 2: Preamble Generation\nLLMTextColumn"]
        B3 --> B4["Stage 3: Format Instruction Generation\nLLMTextColumn"]
        B4 --> B5["Stage 4: User Prompt Composition\nLLMTextColumn"]
        B5 --> B6["Stage 5: 4 LLM Judges\nformat_compliance, regex_alignment\norder_coherence, preamble_quality"]
    end
    subgraph sd["structured_data.py — Structured Data Pipeline"]
        C1["Stage 1: Samplers\nformat, topic/subtopic, schema controls, conversation controls"] --> C2["Stage 2: Schema Generation\nLLMTextColumn"]
        C2 --> C3["Stage 3: User Prompt Generation\nLLMTextColumn"]
        C3 --> C4["Stage 4: Conversation Pairs\nLLMTextColumn"]
        C4 --> C5["Stage 5: Best-of-3 Structured Output\n3× LLMTextColumn"]
    end
```
Reviews (2): Last reviewed commit: "Merge branch 'main' into dhruv/recipes/n..."
Review comment — `docs/assets/recipes/model_usability/prompt_sensitivity.py`, line 103:

```python
{
    "format_key": "fmt_05",
    "output_regex": r"\[Answer:\s*([A-Za-z])\)",
```

**Mismatched closing delimiter in `fmt_05` regex**

The regex opens with `\[` (escaped left square bracket) but closes with `\)` (escaped right parenthesis), so it matches `[Answer: X)` instead of `[Answer: X]`. The seed instruction explicitly says "end with [Answer: X]", so the regex and instruction are inconsistent — the LLM judge that checks Regex Alignment will evaluate against a pattern that doesn't match what it was designed to produce.

```suggestion
"output_regex": r"\[Answer:\s*([A-Za-z])\]",
```
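The mismatch is easy to verify in isolation with Python's `re` module. A minimal sketch — the sample string is invented for illustration, not taken from the recipe:

```python
import re

# Buggy pattern from fmt_05: opens with \[ but closes with \)
buggy = re.compile(r"\[Answer:\s*([A-Za-z])\)")
# Suggested fix: close with \] to match the seed instruction "[Answer: X]"
fixed = re.compile(r"\[Answer:\s*([A-Za-z])\]")

sample = "Reasoning... [Answer: C]"

print(buggy.search(sample))            # no match against the documented format
print(fixed.search(sample).group(1))   # captures the answer letter
```
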
Review comment — `docs/assets/recipes/model_usability/prompt_sensitivity.py`, line 78:

```python
FORMAT_TEMPLATES = [
    {
        "format_key": "fmt_00",
        "output_regex": r"\boxed{([.*?])}",
```

**Incorrect regex for LaTeX `\boxed{}` format**

`r"\boxed{([.*?])}"` has two problems: `\b` in a raw string is the regex word-boundary anchor (not a literal backslash + `b`), and `[.*?]` is a character class matching only the three characters `.`, `*`, or `?`. The regex never matches the intended `\boxed{<answer>}` LaTeX output. The same issue appears in `fmt_09` (line 123). The correct pattern to match a literal `\boxed{…}` is:

```suggestion
"output_regex": r"\\boxed\{(.*?)\}",
```
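Both defects can be demonstrated directly in the interpreter. A minimal sketch (the sample string is a made-up illustration, not recipe output):

```python
import re

buggy = r"\boxed{([.*?])}"   # \b is a word boundary; [.*?] matches one of '.', '*', '?'
fixed = r"\\boxed\{(.*?)\}"  # literal backslash, escaped braces, lazy wildcard

sample = r"The answer is \boxed{42}."

print(re.search(buggy, sample))            # never matches real \boxed{...} output
print(re.search(fixed, sample).group(1))   # captures the boxed answer
```
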
Review comment — `docs/assets/recipes/model_usability/prompt_sensitivity.py`, line 118:

```python
{
    "format_key": "fmt_08",
    "output_regex": r"<final_answer>\s*([.*?])\s*</final_answer>",
```

**`[.*?]` character class captures only `.`, `*`, or `?`**

`([.*?])` is a capture group containing a character class that matches exactly one of the three literal characters `.`, `*`, `?`. It won't capture any real answer content inside `<final_answer>…</final_answer>`. The intended lazy-match wildcard should be outside the brackets:

```suggestion
"output_regex": r"<final_answer>\s*(.*?)\s*</final_answer>",
```
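The difference between the character class and the lazy wildcard is easy to check directly. A minimal sketch with an invented sample answer:

```python
import re

buggy = r"<final_answer>\s*([.*?])\s*</final_answer>"  # single char from {., *, ?} only
fixed = r"<final_answer>\s*(.*?)\s*</final_answer>"    # lazy wildcard, any content

sample = "<final_answer> 7/2 </final_answer>"

print(re.search(buggy, sample))            # '7/2' is not '.', '*', or '?'
print(re.search(fixed, sample).group(1))   # captures the answer text
```
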
Code Review: PR #544 — docs: add Nemotron Nano recipes (structured data, prompt sensitivity, infinibyte)

Summary

This PR adds three new recipe scripts and accompanying documentation for Nemotron Nano training pipelines: Structured Data (multi-format schema generation), Prompt Sensitivity (regex-verified preamble diversity), and InfiniByte (cross-source problem augmentation). It also introduces a new "Model Usability" recipe category in the docs navigation. The changes are entirely in the documentation tree. Files changed: 8 (3 Python recipe scripts, 3 Markdown doc pages, `docs/recipes/cards.md`, and `mkdocs.yml`).

Findings

High — Regex patterns in `docs/assets/recipes/model_usability/prompt_sensitivity.py` (see the review comments above).
📋 Summary
Adds three new recipes implementing SDG pipelines used for Nemotron Nano training: structured data generation (multi-format schemas), prompt sensitivity (regex-verified preamble diversity), and InfiniByte (cross-source problem augmentation). Introduces a new "Model Usability" recipe category.
🔄 Changes
Added the following:
- `SubcategorySamplerParams` for conditional topic sampling.
- Four LLM judges (format compliance, regex alignment, order coherence, preamble quality).
🔧 Changed
🧪 Testing
✅ Checklist