Skip to content

feat: add pgsql-parse package with comment and whitespace preservation#290

Open
pyramation wants to merge 6 commits intomainfrom
devin/1775205353-pgsql-parse-package
Open

feat: add pgsql-parse package with comment and whitespace preservation#290
pyramation wants to merge 6 commits intomainfrom
devin/1775205353-pgsql-parse-package

Conversation

@pyramation
Copy link
Copy Markdown
Collaborator

@pyramation pyramation commented Apr 3, 2026

Summary

Adds a new self-contained packages/parse/ package that preserves SQL -- line comments and vertical whitespace (blank lines) through parse→deparse round trips. No existing packages are modified.

How it works:

  1. The scanner (scanner.ts) uses PostgreSQL's real lexer via @libpg-query/parser's scanSync() to extract SQL_COMMENT (275) tokens with exact byte positions. Whitespace detection uses token gaps to find blank lines between statements/comments. All string literal types (single-quoted, dollar-quoted, escape strings, etc.) are handled correctly by the actual PostgreSQL scanner — no custom TypeScript reimplementation.
  2. Enhanced parse/parseSync call the standard parser, then interleave synthetic RawComment and RawWhitespace nodes into the stmts array based on byte position.
  3. deparseEnhanced() dispatches on node type — real RawStmt entries go through the standard deparser, while synthetic nodes emit their comment text or blank lines directly.

Key design decisions:

  • interleave() uses a unified sort with priority levels (comment < whitespace < statement) to handle ties when stmt_location overlaps with preceding comments
  • findActualSqlStart() iteratively skips whitespace and scanned elements within a statement's stmt_location range to find the actual SQL keyword position — needed because PostgreSQL's parser includes preceding stripped content in stmt_location
  • Only -- line comments are supported (not /* */ block comments). This was a deliberate decision — block comments are not used in our PostgreSQL workflow.
  • 28 tests across scanner (15) and integration (13) suites, including an idempotency test (parse→deparse→parse→deparse = identical output)

Updates since last revision

  • Removed pnpm patch entirely: The previous approach used a pnpm patch on @libpg-query/parser to fix an upstream JSON serialization bug, but this caused CI failures due to pnpm v9/v10 lockfile incompatibility. The patch has been removed — no patches/ directory, no patchedDependencies config.
  • Inline JSON fix via safeScanSync(): The upstream _wasm_scan bug (unescaped \n, \r, \t in token text fields breaking JSON.parse) is now handled inline in scanner.ts. safeScanSync() tries the normal scanSync() first; if it throws, it retries with a temporarily monkey-patched JSON.parse that escapes control characters via fixScanJson(). The patch is synchronous and restored in a finally block, so there are no concurrency concerns.
  • No lockfile/workspace config changes: The only substantive lockfile change is the new @libpg-query/parser dependency for packages/parse. The rest of the lockfile diff is quoting style changes from pnpm v10.

Previous updates (still apply):

  • Removed all block comment (/* */) support: RawComment.type is now 'line' only.
  • Simplified loadModule(): Now a simple re-export from @libpg-query/parser.
  • Simplified deparseComment(): Always emits --{text}.

Review & Testing Checklist for Human

  • fixScanJson regex correctness (scanner.ts:23-32): The regex /"(?:[^"\\]|\\.)*"/g matches JSON string values and replaces literal \n, \r, \t with their escape sequences. This only runs on the retry path (when scanSync throws). Verify it handles edge cases: token text containing backslashes, already-escaped sequences (\\n should not become \\\\n), and empty strings. If the regex is wrong, scan results on the retry path could be silently corrupted.
  • JSON.parse monkey-patch safety (scanner.ts:40-54): safeScanSync() temporarily replaces the global JSON.parse during the retry call. This is restored in a finally block and is synchronous (single-threaded), so it should be safe. However, verify that no other code path could be affected if scanSync internally does something unexpected.
  • README.md is stale: The README still references /* */ block comments in multiple places (e.g. "line and /* */ block", "A pure TypeScript scanner"). This should be updated to reflect that only -- line comments are supported and the scanner uses the WASM lexer.
  • CI does not run this package's tests: The CI test matrix (run-tests.yaml) does not include pgsql-parse. The parser-tests jobs passed because the package builds successfully, but the 28 Jest tests are not executed in CI. Consider adding pgsql-parse to the matrix before merging.
  • findActualSqlStart() correctness (parse.ts:28-59): This function walks forward from stmt_location skipping whitespace and scanned elements. Verify it handles: multiple adjacent comments before a statement, a comment immediately followed by a statement with no whitespace, and the first statement at position 0.

Suggested test plan: Clone the branch, run cd packages/parse && npx jest to verify 28/28 pass. Then try parsing your own SQL files with -- comments through parseSyncdeparseEnhanced and inspect the output, especially files with PGPM headers and PL/pgSQL function bodies containing comments inside dollar-quoted strings (these should NOT be extracted as comments — the WASM scanner handles this correctly).

Notes

  • Inline comments (comments on the same line after a statement, e.g. SELECT 1; -- note) are extracted by the scanner but will be repositioned to their own line after deparsing, since the deparser emits statements without trailing content.
  • The package depends on workspace packages (pgsql-parser, pgsql-deparser, @pgsql/types) via workspace:* protocol. tsconfig.test.json has path mappings so tests resolve TypeScript source directly without requiring a build step.
  • The upstream JSON escaping bug should eventually be fixed in libpg-query-node's build_scan_json() C function. Once fixed, the safeScanSync fallback path becomes dead code and can be simplified to a direct scanSync call.
  • The pnpm-lock.yaml diff is large but mostly quoting style changes ("@scope/pkg"'@scope/pkg'). The only substantive change is the new @libpg-query/parser dependency for packages/parse.

Link to Devin session: https://app.devin.ai/sessions/67facbcfe0ae424bad3eafb4e6ca9059
Requested by: @pyramation

New package that preserves SQL comments and vertical whitespace through
parse→deparse round trips by scanning source text for comment tokens
and interleaving synthetic RawComment and RawWhitespace AST nodes into
the stmts array by byte position.

Features:
- Pure TypeScript scanner for -- line and /* block */ comments
- Handles string literals, dollar-quoted strings, escape strings
- RawWhitespace nodes for blank lines between statements
- Enhanced deparseEnhanced() that emits comments and whitespace
- Idempotent: parse→deparse→parse→deparse produces identical output
- Drop-in replacement API (re-exports parse, deparse, loadModule)
- 36 tests across scanner and integration test suites

No changes to any existing packages.
@devin-ai-integration
Copy link
Copy Markdown
Contributor

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

…query

Replace the custom TypeScript comment scanner with PostgreSQL's real
lexer via libpg-query's WASM _wasm_scan function. This eliminates the
risk of bugs from reimplementing PostgreSQL's lexer in TypeScript.

Key changes:
- scanner.ts now loads the WASM module directly and calls _wasm_scan
  with proper JSON escaping (works around upstream bug where control
  characters in token text are not escaped)
- Dependency changed from libpg-query to @libpg-query/parser (full
  build with scan support)
- Unified loadModule() initializes both parse/deparse and scanner WASM
- All 36 tests passing including multi-line block comments and
  dollar-quoted string handling
- Remove all block comment (/* */) handling — only -- line comments supported
- Simplify scanner.ts to use @libpg-query/parser scanSync directly
- Add pnpm patch for upstream scanSync JSON serialization bug (control chars in token text)
- Update types.ts: RawComment.type is now just 'line'
- Update deparse.ts: remove block comment case
- Update index.ts: re-export loadModule from @libpg-query/parser directly
- Remove block comment tests from scanner.test.ts and parse.test.ts
- All 28 tests passing
Instead of patching @libpg-query/parser via pnpm patch (which caused
CI issues with pnpm v9/v10 lockfile incompatibility), handle the
upstream JSON serialization bug inline in scanner.ts.

The approach: try scanSync normally, and if it throws due to unescaped
control characters in the JSON output, retry with a temporarily
monkey-patched JSON.parse that escapes control chars before parsing.
This is synchronous so there are no concurrency concerns.

All 28 tests pass. No changes to lockfile format or workspace config.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant