
feat: Parallelize DataAssetsWorkflow with virtual threads (#25808)#25817

Merged
manerow merged 6 commits into main from feat/parallel-data-assets-workflow-25808 on Mar 9, 2026
Conversation

@manerow
Contributor

@manerow manerow commented Feb 11, 2026

Fixes #25808

This PR parallelizes the DataAssetsWorkflow in the Data Insights pipeline using Java 21 virtual threads, reducing wall-clock time by ~2.6x on a dataset of 8,292 entities.

I worked on improving the performance of the Data Insights pipeline because the DataAssetsWorkflow was executing sequentially and spending significant time in blocking database calls during entity enrichment. Since enrichment is heavily I/O-bound (multiple DB round-trips per entity), virtual threads allow efficient concurrency without exhausting platform threads or the DB connection pool.


What changed

  • DataAssetsWorkflow now processes entities concurrently using a virtual-thread-per-task executor with a semaphore-based concurrency budget:

    Math.max(4, Math.min(cores * 2, poolSize / 2))
    
    • Primary signal: cores × 2
    • Hard cap: poolSize / 2
    • Minimum: 4

    This scales with machine capacity while keeping half of the DB pool free for REST/API traffic and other jobs.

  • Added enrichSingle() to DataInsightsEntityEnricherProcessor so individual entities can be enriched independently on virtual threads.

  • Enriched results are collected in a ConcurrentLinkedQueue and bulk-flushed to the search index after each batch.

  • Made updateStats() methods synchronized across:

    • DataInsightsElasticSearchProcessor
    • DataInsightsOpenSearchProcessor
    • DataInsightsEntityEnricherProcessor
    • ElasticSearchIndexSink
    • OpenSearchIndexSink
      to ensure thread-safe stat accumulation.
  • Added graceful stop support: DataInsightsApp.stop() now propagates to the active DataAssetsWorkflow, which shuts down its executor and sets a stopped flag.
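The bullets above combine into one pattern: a virtual-thread-per-task executor gated by a semaphore, with results collected in a ConcurrentLinkedQueue. Here is a minimal sketch of that pattern (class and method names other than the JDK APIs are hypothetical; the real DataAssetsWorkflow adds batching, stats, and stop handling):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CancellationException;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;

public class VirtualThreadBatchSketch {
    // Hypothetical stand-in for one enriched entity result.
    record Enriched(String id) {}

    public static List<Enriched> processBatch(List<String> entityIds, int budget) {
        Semaphore permits = new Semaphore(budget);             // concurrency budget
        ConcurrentLinkedQueue<Enriched> results = new ConcurrentLinkedQueue<>();
        List<Future<?>> futures = new ArrayList<>();
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (String id : entityIds) {
                try {
                    permits.acquire();                         // block until a permit frees up
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
                futures.add(executor.submit(() -> {
                    try {
                        results.add(new Enriched(id));         // stands in for blocking DB enrichment
                    } finally {
                        permits.release();
                    }
                }));
            }
            for (Future<?> f : futures) {
                try {
                    f.get();                                   // surface per-task failures
                } catch (ExecutionException | CancellationException e) {
                    // record the failure; keep draining the batch
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        return new ArrayList<>(results);                       // caller bulk-flushes to the index
    }
}
```

The semaphore, rather than a bounded executor, is what caps concurrency: virtual threads are cheap to create, so the permit count alone decides how many blocking DB calls run at once.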


Why virtual threads instead of reusing SearchIndexApp’s producer-consumer model

SearchIndexExecutor is optimized for:

Read → Index (1 entity → 1 document)

Its bottleneck is Elasticsearch I/O, and it uses platform thread pools, blocking queues, adaptive batching, and async bulk sinks.

DataAssetsWorkflow differs:

  1. I/O-bound enrichment per entity
    Each entity performs 3–5+ blocking DB calls (version history + owner/team resolution).

  2. 1:N data amplification
    One entity can produce 30+ daily snapshot documents, making fixed queue sizing awkward.

  3. 4-stage pipeline

    Read → Enrich → Process → Sink
    The bottleneck is enrichment (middle stage), not read or sink.

  4. Less complexity
    Virtual threads + semaphore add ~100 LOC with no queue tuning, no adaptive batching, and no new configuration surface.


Concurrency Budget Design (Brief Rationale)

The budget is intentionally based on CPU cores, not just DB pool size.

Formula:

Math.max(4, Math.min(cores * 2, poolSize / 2))

Why cores × 2 is the primary driver:

  • On MySQL, virtual threads pin to carrier OS threads during blocking JDBC I/O.
  • Effective parallelism is therefore bounded by available carrier threads (≈ CPU cores), not by the number of DB connections.
  • Increasing permits beyond cores × 2 does not increase real throughput.

Why poolSize / 2 is a cap, not the signal:

  • JDBI onDemand acquires/releases connections per call.

  • Connections are typically held for only 1–5ms.

  • Pool exhaustion is not the limiting factor in practice.

  • poolSize / 2 acts as a safety belt, reserving capacity for:

    • REST API traffic
    • Other background jobs

Example budgets:

  • 4 cores → 8 threads
  • 8 cores → 16 threads
  • 16 cores → 32 threads
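The example budgets can be checked with a one-line helper (a sketch; computeBudget is a hypothetical name, and the pool sizes below are illustrative — the doc's examples assume the poolSize / 2 cap does not bite):

```java
public class ConcurrencyBudget {
    // Math.max(4, Math.min(cores * 2, poolSize / 2)) from the PR description.
    public static int computeBudget(int cores, int poolSize) {
        return Math.max(4, Math.min(cores * 2, poolSize / 2));
    }

    public static void main(String[] args) {
        // With a 100-connection pool, budgets track cores x 2:
        System.out.println(computeBudget(4, 100));   // 8
        System.out.println(computeBudget(8, 100));   // 16
        System.out.println(computeBudget(16, 100));  // 32
        // The cap and floor engage at the extremes:
        System.out.println(computeBudget(16, 50));   // 25 (capped at poolSize / 2)
        System.out.println(computeBudget(1, 100));   // 4 (minimum floor)
    }
}
```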

Benchmark confirmation:

  • 75 virtual threads → ~39s
  • 16 virtual threads (cores × 2) → ~36s

Equivalent performance confirms that carrier thread pinning (CPU-bound parallelism), not pool size, is the true concurrency limit.


Performance Results

Dataset: 8,292 entities (load-test-data.sh --quick)
Environment: Clean Docker, identical dataset and config.

| Metric | main (sequential) | feature (parallel) |
| --- | --- | --- |
| DataAssetsWorkflow duration | ~94 seconds | ~36 seconds |
| DI documents indexed | 8,368 | 8,368 |
| Job status | success (0 failed) | success (0 failed) |
| Concurrency budget | N/A | 16 virtual threads (cores × 2) |
| Speedup | baseline | ~2.6x faster |

Both runs produced identical results with zero failures.


How did you test your changes?

  • Full A/B test in a clean Docker environment.

  • Ran both main and feature branch on the same dataset (8,292 entities).

  • Triggered Data Insights pipeline.

  • Compared:

    • Log timestamps
    • DI document counts
    • Job stats
  • Verified identical indexed document counts and zero failures.

  • Verified that the updated concurrency budget (16 threads, down from 75) produces identical results and equivalent performance, confirming that carrier thread pinning on MySQL was the actual concurrency limit.


Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

Checklist:

  • I have read the [CONTRIBUTING](https://docs.open-metadata.org/developers/contribute) document.
  • My PR title is Fixes #25808: Parallelize DataAssetsWorkflow using Java 21 virtual threads
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

@manerow manerow self-assigned this Feb 11, 2026
@manerow manerow requested a review from a team as a code owner February 11, 2026 12:25
@manerow manerow added safe to test Add this label to run secure Github workflows on PRs To release Will cherry-pick this PR into the release branch backend labels Feb 11, 2026
@manerow manerow force-pushed the feat/parallel-data-assets-workflow-25808 branch from f491a98 to 5778830 Compare February 11, 2026 13:44
@TeddyCr TeddyCr removed the To release Will cherry-pick this PR into the release branch label Feb 11, 2026
@TeddyCr TeddyCr requested a review from Copilot February 11, 2026 15:41
TeddyCr
TeddyCr previously approved these changes Feb 11, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR parallelizes the DataAssetsWorkflow in the Data Insights pipeline using Java 21 virtual threads to improve performance. The workflow processes 8,292 entities with a ~2.6x speedup (from ~94 seconds to ~36 seconds) by converting sequential entity enrichment into concurrent processing with semaphore-based concurrency control.

Changes:

  • Introduced parallel entity processing using Executors.newVirtualThreadPerTaskExecutor() with a concurrency budget calculated as Math.max(4, Math.min(cores * 2, poolSize / 2)) to balance CPU parallelism with database connection pool capacity
  • Added enrichSingle() method to DataInsightsEntityEnricherProcessor for independent single-entity enrichment in parallel contexts
  • Made updateStats() methods synchronized across sink and processor classes to ensure thread-safe statistics accumulation during concurrent processing
  • Implemented graceful shutdown support with stop() methods that propagate stop signals to active workflows and shut down executors

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 11 comments.

Show a summary per file

| File | Description |
| --- | --- |
| DataAssetsWorkflow.java | Core parallelization logic with virtual thread executor, semaphore-based concurrency control, ConcurrentLinkedQueue for bulk operations, and graceful shutdown support |
| DataInsightsEntityEnricherProcessor.java | New enrichSingle() method for per-entity enrichment without batch error wrapping, and synchronized updateStats() for thread safety |
| DataInsightsElasticSearchProcessor.java | Synchronized updateStats() method to prevent race conditions in concurrent stats updates |
| DataInsightsOpenSearchProcessor.java | Synchronized updateStats() method to prevent race conditions in concurrent stats updates |
| ElasticSearchIndexSink.java | Synchronized updateStats() method to prevent race conditions in concurrent stats updates |
| OpenSearchIndexSink.java | Synchronized updateStats() method to prevent race conditions in concurrent stats updates |
| DataInsightsApp.java | Override stop() method to propagate shutdown signals to active DataAssetsWorkflow instance |

@github-actions
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

@harshach
Collaborator

@manerow are you looking at the recent changes to search indexing? It uses Quartz to distribute across OM servers too, which would give you more leverage in making indexing truly distributed.
Secondly, not every indexing run should fully delete and re-index; we should be able to specify the past few days and only index the data from there.

@manerow
Contributor Author

manerow commented Feb 18, 2026

@harshach Thanks for the pointers.

Distributed indexing with Quartz: I've looked at the DistributedSearchIndexExecutor and the partition-based coordination model. The reason I didn't reuse it here is that the two pipelines work differently. In search reindexing, each entity produces one document and the bottleneck is ES/OS bulk I/O; partitioning by offset ranges maps cleanly, and distributing across servers helps because the sink is the constraint. In the Data Assets workflow, the bottleneck is entity enrichment (3-5 DB round-trips per entity, fanning out into 30+ daily snapshots each), not the read or the sink, so parallelizing that I/O-bound enrichment with virtual threads on a single server is what gives us the speedup here.

Virtual threads with a semaphore parallelize that enrichment within a single server for ~100 lines of code and no new config. This isn't a replacement for distributed processing; it's the first step. Distribution would decide which entities each server handles, while virtual threads speed up the work within each node. The two layers are complementary, and adapting the Quartz coordination to split entity types across OM instances would be a natural follow-up in a separate PR.

Incremental indexing: Agreed, already tracked in #25809. The plan is to filter entities by updatedAt > lastSuccessfulRun and switch report data to upsert instead of delete-reinsert. For a 100K-entity deployment with ~1% daily change, that drops processed entities from 100K to ~1K per run. Also complementary to this PR.
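The incremental plan reduces to a simple predicate over the entity set. A minimal sketch, assuming a hypothetical Entity shape with an epoch-millis updatedAt field (the real #25809 work would push this filter into the DB query rather than filtering in memory):

```java
import java.util.List;
import java.util.function.Predicate;

public class IncrementalFilterSketch {
    // Hypothetical minimal entity shape: only updatedAt (epoch millis) matters here.
    record Entity(String id, long updatedAt) {}

    /** Keeps only entities modified after the last successful run. */
    public static List<Entity> changedSince(List<Entity> entities, long lastSuccessfulRun) {
        Predicate<Entity> changed = e -> e.updatedAt() > lastSuccessfulRun;
        return entities.stream().filter(changed).toList();
    }
}
```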

@harshach
Collaborator

@manerow if you are planning on doing the distributed job in another PR, that works for me. Even with virtual threads, if it loops back over enough days we will lock those tables for a while. It would be better to distribute based on the number of days the user wants to reindex from.

@manerow
Contributor Author

manerow commented Feb 18, 2026

@harshach Sounds good. I'll create a task for the distributed approach with date-range partitioning for backfills and tackle it in a separate PR.

TeddyCr
TeddyCr previously approved these changes Mar 4, 2026
@gitar-bot

gitar-bot bot commented Mar 6, 2026

🔍 CI failure analysis for 224bdd3: The `playwright-ci-postgresql (3, 6)` job has 1 hard failure (Bulk Import/Export › Database) and 7 flaky tests (DataQuality, Permissions) caused by timeouts and browser crashes unrelated to this PR's changes. A previously reported SonarCloud authentication failure also appears unrelated to these changes.

Issue

Two CI failures have been observed across different jobs for this PR.


1. playwright-ci-postgresql (3, 6) — Playwright E2E Test Failures

Result: 1 failed, 7 flaky, 650 passed

Failing test (hard failure):

  • [chromium] › playwright/e2e/Features/BulkImport.spec.ts:394:7 › Bulk Import Export › Database

Flaky tests (passed on retry):

  • DataQuality/ColumnLevelTests.spec.ts — Column Values Sum To Be Between
  • DataQuality/DataQualityPermissions.spec.ts — Admin can see Data Quality UI controls
  • DataQuality/IncidentManagerDateFilter.spec.ts — Select preset date range
  • DataQuality/TableLevelTests.spec.ts — Custom SQL Query
  • Permissions/GlossaryPermissions.spec.ts — Team-based permissions
  • Permissions/ServiceEntityPermissions.spec.ts — SearchIndex Service allow common operations permissions

Root Cause:

The failures are characterized by:

  • page.waitForResponse: Target page, context or browser has been closed — browser crash/context loss
  • Test timeout of 60000ms exceeded while running "beforeEach" hook — infrastructure timeout
  • locator.click: Target page, context or browser has been closed — browser crash mid-test
  • expect(locator).toBeVisible() failed: edit-description element not found (likely a timing/race condition)

All failing tests are in DataQuality, Permissions, and BulkImport feature areas. None of these tests are related to the DataAssetsWorkflow parallelization changes in this PR, which only touch backend Java files (DataAssetsWorkflow, DataInsightsEntityEnricherProcessor, DataInsightsElasticSearchProcessor, DataInsightsOpenSearchProcessor, ElasticSearchIndexSink, OpenSearchIndexSink, DataInsightsApp). These failures are consistent with flaky infrastructure-level issues (browser crashes, connection drops) in the shared CI environment.


2. maven-sonarcloud-ci — SonarCloud Authentication Failure (previously reported)

Root Cause:

The SonarCloud Maven plugin failed with:

[ERROR] Project not found. Please check the 'sonar.projectKey' and 'sonar.organization' properties,
the 'SONAR_TOKEN' environment variable, or contact the project administrator

And:

[WARNING] Failed to check if project 'open-metadata_OpenMetadata' is bound
[INFO] Detected project binding: ERROR

This is a CI infrastructure/authentication issue with the SONAR_TOKEN secret or sonar.projectKey/sonar.organization configuration for the openmetadata-java-client module. It is not related to this PR's changes.


Summary

Both failures are unrelated to this PR. The Playwright failures are browser-crash/timeout flakiness in the shared CI environment, and the SonarCloud failure is a token/configuration issue in the CI infrastructure. No code changes are needed.

Code Review ✅ Approved 1 resolved / 1 findings

DataAssetsWorkflow now leverages virtual threads for parallelization with improved shutdown handling that ensures updateWorkflowStats executes even if drainAndFlush fails. Uncaught CancellationException on shutdown has been addressed.

✅ 1 resolved
Bug: Uncaught CancellationException on shutdown skips batch drain

📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/insights/workflows/dataAssets/DataAssetsWorkflow.java:292
When stop() calls executor.shutdownNow(), tasks that were submitted but haven't started are cancelled. Calling future.get() on a cancelled Future throws java.util.concurrent.CancellationException (extends IllegalStateException), which is not a subclass of ExecutionException.

The catch block at line 296 only catches ExecutionException, so CancellationException escapes the future-iteration loop. This means:

  1. Remaining futures in the batch are never awaited — virtual threads may still be running and adding to operationsQueue while the main thread moves on.
  2. drainAndFlush() at line 306 is skipped for that batch, potentially losing already-completed work from successful tasks in the same batch.
  3. source.updateStats() at line 309 is skipped, causing inaccurate workflow statistics.

The exception propagates as an unchecked exception, exits the while loop, hits the finally block (setting executor to null), then reaches the final drainAndFlush at line 325 — but by this time, virtual threads from the batch may still be running concurrently with the drain.

Fix: catch CancellationException alongside ExecutionException in the future-iteration loop, or use a broader catch (Exception e) pattern.
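The suggested catch shape would look roughly like this (a sketch with hypothetical names; returning a count is just for illustration):

```java
import java.util.List;
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

public class DrainSketch {
    /** Awaits every future, tolerating cancelled tasks; returns how many completed. */
    public static int awaitAll(List<Future<?>> futures) {
        int completed = 0;
        for (Future<?> future : futures) {
            try {
                future.get();
                completed++;
            } catch (ExecutionException e) {
                // Task ran and failed: record the failure, keep draining the batch.
            } catch (CancellationException e) {
                // Cancelled by shutdownNow(): skip it, but keep awaiting the rest
                // so drainAndFlush() and updateStats() still run for this batch.
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break; // stop requested: fall through and drain what completed
            }
        }
        return completed;
    }
}
```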

Bug: Executor null-ed before try-with-resources close

📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/insights/workflows/dataAssets/DataAssetsWorkflow.java:322
At line 322, this.executor = null is set inside the try-with-resources block but before the implicit close() call on sourceExecutor. This creates a small window where stop() cannot reach the executor via shutdownNow() because the field is already null, but the executor hasn't actually been closed yet.

While the stopped flag provides a secondary check, the canonical pattern would be to let the try-with-resources handle cleanup and null the field in a finally block or after the try-with-resources block:

try (ExecutorService sourceExecutor = Executors.newVirtualThreadPerTaskExecutor()) {
    this.executor = sourceExecutor;
    // ... processing loop ...
} finally {
    this.executor = null;
}

This also ensures the field is nulled even if close() throws (though unlikely for virtual thread executors).


@sonarqubecloud

sonarqubecloud bot commented Mar 6, 2026



Development

Successfully merging this pull request may close these issues.

Data Insight: Parallelize DataAssetsWorkflow entity enrichment with virtual threads

5 participants