Skip to content

Enforce single-line records in sweep state CSV writers#2684

Merged
brendancol merged 1 commit into
mainfrom
issue-2680
May 30, 2026
Merged

Enforce single-line records in sweep state CSV writers#2684
brendancol merged 1 commit into
mainfrom
issue-2680

Conversation

@brendancol

Copy link
Copy Markdown
Contributor

Closes #2680

What

Adds an _oneline() sanitizer to every sweep state-CSV writer so each record is always one physical line, even when a note contains newlines.

Why

The .claude/sweep-*-state.csv files use merge=union so parallel sweep agents can append rows. Union merge is line-based, but the writers used csv.DictWriter with QUOTE_MINIMAL, which is valid CSV yet writes a newline-containing note as a quoted field spanning several physical lines. Concurrent appends then split those records on their internal newlines, producing phantom rows and duplicate headers. PR #2679 repaired one such corruption by hand; this stops it recurring at the source.

The previous "single-line notes or empty" placeholder was a hint, not enforcement, so nothing prevented a multi-line note.

How

Each writer (six sweep-*.md commands plus the deep-sweep reset path) now routes every field through _oneline(), which replaces CR/LF with " | " right before writerow.

Verification

Simulated the patched writer and a union merge:

  • A note containing \n, \r\n, and \r writes to exactly one physical line and round-trips through csv.DictReader.
  • Control: the old writer produced 4 physical lines for the same note (the bug).
  • A union merge of two concurrent single-line writes yields clean records with no phantom rows.

No pytest test is added: these are instruction templates in .claude/commands/, not importable code, so there is no unit to import in xrspatial/tests/. Verification is the simulation above.

Scope note

.claude/commands/sweep-style.md is untracked in the working tree and not part of this branch, so it is not patched here. It needs the same _oneline() change when it is committed.

The sweep state files use merge=union so parallel agents can append rows.
Union merge is line-based, but the writers used csv.DictWriter with
QUOTE_MINIMAL, which writes a newline-containing note as a quoted field
spanning multiple physical lines. Concurrent appends then split those
records on their internal newlines, producing phantom rows and duplicate
headers (repaired once already in #2679).

Add an _oneline() sanitizer to every writer (six sweep-*.md commands plus
the deep-sweep reset path) that collapses embedded CR/LF to " | " right
before writerow, so each record is always one physical line regardless of
note content. This enforces the invariant in code instead of relying on
the unenforced "single-line notes" placeholder hint.
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 29, 2026

@brendancol brendancol left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review: Enforce single-line records in sweep state CSV writers

This is a tooling change to instruction templates in .claude/commands/, so the spatial-domain checks (algorithms, backend parity, dask/cupy, NaN, benchmarks, README matrix) do not apply. The review covers the sanitizer logic and the writer change.

Blockers

None.

Suggestions

None blocking.

Nits

  • The "notes": "<single-line notes ...>" placeholder hints are now redundant, since _oneline() enforces single-line output regardless of what the agent writes. Not worth a churny edit across files, but a future pass could simplify those hints to just describe the content and drop the "replace newlines" instruction.
  • .claude/commands/sweep-style.md is untracked and not on this branch, so it keeps the old unguarded writer. The PR body already flags this. It needs the same _oneline() change when it is committed.

What looks good

  • The replace order (\r\n before \r/\n) avoids double-expanding a CRLF into " | | ". Verified.
  • Applying the sanitizer to every field via the dict comprehension, rather than just notes, is cheap defense in depth.
  • Nice side effect: because existing rows are read back through DictReader and rewritten through _oneline(), the next write self-heals any rows that were already corrupted with embedded newlines.
  • deep-sweep's reset path gets the same guard, so a reset also normalizes.

Verification performed

  • A note containing \n, \r\n, and \r writes to one physical line and round-trips through DictReader.
  • Control: the pre-fix writer produced 4 physical lines for the same note.
  • A simulated merge=union of two concurrent single-line writes produced clean records with no phantom rows.

@brendancol brendancol merged commit 66eda19 into main May 30, 2026
9 checks passed
brendancol added a commit that referenced this pull request Jun 8, 2026
* Drop merge=union for sweep state CSVs (#2754)

.gitattributes registered .claude/sweep-*-state.csv and
.codex/sweep-*-state.csv with merge=union. These files are keyed by
module (one row per module) with a single header, not an append-only
log. The union driver concatenates both sides of a conflicting hunk
instead of raising a conflict. When a branch was based on a stale copy,
the merge silently produced duplicate header lines and duplicate module
rows. It happened on main with the #2712 squash merge: the file went
from 9 lines to 43.

The earlier single-line-record fix (#2680/#2684) kept embedded newlines
from splitting a record, but union still duplicated whole rows and still
mixed formats across stale branches.

Switch both paths to merge=text (git's built-in 3-way text driver) so a
real divergence surfaces as a normal conflict to resolve by hand rather
than corrupting the file silently. I considered -merge (always-conflict,
take-one-side) but rejected it: in a git sandbox it silently dropped the
other branch's rows, trading one silent-corruption mode for a
silent-data-loss mode.

Also updates the merge-driver references across the sweep command docs
to match, and adds a git-sandbox regression test that reproduces the
stale-branch merge and asserts it conflicts instead of unioning. The
single-line-record writer stays (clean diffs, predictable conflicts);
only its rationale comment changed to drop the removed driver name.

* Address review: tighten merge regression test (#2754)

The slope-row count assertion did not distinguish a conflict from a
union (both leave two slope, lines), so it implied protection it did not
provide. Replace it with a header-count assertion: union duplicates the
header line, a conflict keeps one. Drop the unused tmp_path parameter
from the gitattributes check, and note in _make_repo that merge=text is
git's built-in driver rather than a custom one.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Sweep state CSV writers emit multi-line records that merge=union corrupts

1 participant