
feat: Parallelize DataAssetsWorkflow with virtual threads (#25808)#25817

Merged
manerow merged 6 commits into main from feat/parallel-data-assets-workflow-25808 on Mar 9, 2026
Conversation

@manerow
Contributor

@manerow manerow commented Feb 11, 2026

Fixes #25808

This PR parallelizes the DataAssetsWorkflow in the Data Insights pipeline using Java 21 virtual threads, reducing wall-clock time by ~2.6x on a dataset of 8,292 entities.

I worked on improving the performance of the Data Insights pipeline because the DataAssetsWorkflow was executing sequentially and spending significant time in blocking database calls during entity enrichment. Since enrichment is heavily I/O-bound (multiple DB round-trips per entity), virtual threads allow efficient concurrency without exhausting platform threads or the DB connection pool.


What changed

  • DataAssetsWorkflow now processes entities concurrently using a virtual-thread-per-task executor with a semaphore-based concurrency budget:

    Math.max(4, Math.min(cores * 2, poolSize / 2))
    
    • Primary signal: cores × 2
    • Hard cap: poolSize / 2
    • Minimum: 4

    This scales with machine capacity while keeping half of the DB pool free for REST/API traffic and other jobs.

  • Added enrichSingle() to DataInsightsEntityEnricherProcessor so individual entities can be enriched independently on virtual threads.

  • Enriched results are collected in a ConcurrentLinkedQueue and bulk-flushed to the search index after each batch.

  • Made updateStats() methods synchronized across:

    • DataInsightsElasticSearchProcessor
    • DataInsightsOpenSearchProcessor
    • DataInsightsEntityEnricherProcessor
    • ElasticSearchIndexSink
    • OpenSearchIndexSink
      to ensure thread-safe stat accumulation.
  • Added graceful stop support: DataInsightsApp.stop() now propagates to the active DataAssetsWorkflow, which shuts down its executor and sets a stopped flag.
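The bullets above combine into one pattern: a virtual-thread-per-task executor gated by a semaphore, with results collected in a ConcurrentLinkedQueue. Here is a minimal sketch of that pattern (class and method names other than the JDK APIs are hypothetical; the real DataAssetsWorkflow adds batching, stats, and stop handling):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CancellationException;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;

public class VirtualThreadBatchSketch {
    // Hypothetical stand-in for one enriched entity result.
    record Enriched(String id) {}

    public static List<Enriched> processBatch(List<String> entityIds, int budget) {
        Semaphore permits = new Semaphore(budget);             // concurrency budget
        ConcurrentLinkedQueue<Enriched> results = new ConcurrentLinkedQueue<>();
        List<Future<?>> futures = new ArrayList<>();
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (String id : entityIds) {
                try {
                    permits.acquire();                         // block until a permit frees up
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
                futures.add(executor.submit(() -> {
                    try {
                        results.add(new Enriched(id));         // stands in for blocking DB enrichment
                    } finally {
                        permits.release();
                    }
                }));
            }
            for (Future<?> f : futures) {
                try {
                    f.get();                                   // surface per-task failures
                } catch (ExecutionException | CancellationException e) {
                    // record the failure; keep draining the batch
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        return new ArrayList<>(results);                       // caller bulk-flushes to the index
    }
}
```

The semaphore, rather than a bounded executor, is what caps concurrency: virtual threads are cheap to create, so the permit count alone decides how many blocking DB calls run at once.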


Why virtual threads instead of reusing SearchIndexApp’s producer-consumer model

SearchIndexExecutor is optimized for:

Read → Index (1 entity → 1 document)

Its bottleneck is Elasticsearch I/O, and it uses platform thread pools, blocking queues, adaptive batching, and async bulk sinks.

DataAssetsWorkflow differs:

  1. I/O-bound enrichment per entity
    Each entity performs 3–5+ blocking DB calls (version history + owner/team resolution).

  2. 1:N data amplification
    One entity can produce 30+ daily snapshot documents, making fixed queue sizing awkward.

  3. 4-stage pipeline

    Read → Enrich → Process → Sink
    The bottleneck is enrichment (middle stage), not read or sink.

  4. Less complexity
    Virtual threads + semaphore add ~100 LOC with no queue tuning, no adaptive batching, and no new configuration surface.


Concurrency Budget Design (Brief Rationale)

The budget is intentionally based on CPU cores, not just DB pool size.

Formula:

Math.max(4, Math.min(cores * 2, poolSize / 2))

Why cores × 2 is the primary driver:

  • On MySQL, virtual threads pin to carrier OS threads during blocking JDBC I/O.
  • Effective parallelism is therefore bounded by available carrier threads (≈ CPU cores), not by the number of DB connections.
  • Increasing permits beyond cores × 2 does not increase real throughput.

Why poolSize / 2 is a cap, not the signal:

  • JDBI onDemand acquires/releases connections per call.

  • Connections are typically held for only 1–5ms.

  • Pool exhaustion is not the limiting factor in practice.

  • poolSize / 2 acts as a safety belt, reserving capacity for:

    • REST API traffic
    • Other background jobs

Example budgets:

  • 4 cores → 8 threads
  • 8 cores → 16 threads
  • 16 cores → 32 threads
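The example budgets can be checked with a one-line helper (a sketch; computeBudget is a hypothetical name, and the pool sizes below are illustrative — the doc's examples assume the poolSize / 2 cap does not bite):

```java
public class ConcurrencyBudget {
    // Math.max(4, Math.min(cores * 2, poolSize / 2)) from the PR description.
    public static int computeBudget(int cores, int poolSize) {
        return Math.max(4, Math.min(cores * 2, poolSize / 2));
    }

    public static void main(String[] args) {
        // With a 100-connection pool, budgets track cores x 2:
        System.out.println(computeBudget(4, 100));   // 8
        System.out.println(computeBudget(8, 100));   // 16
        System.out.println(computeBudget(16, 100));  // 32
        // The cap and floor engage at the extremes:
        System.out.println(computeBudget(16, 50));   // 25 (capped at poolSize / 2)
        System.out.println(computeBudget(1, 100));   // 4 (minimum floor)
    }
}
```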

Benchmark confirmation:

  • 75 virtual threads → ~39s
  • 16 virtual threads (cores × 2) → ~36s

Equivalent performance confirms that carrier thread pinning (CPU-bound parallelism), not pool size, is the true concurrency limit.


Performance Results

Dataset: 8,292 entities (load-test-data.sh --quick)
Environment: Clean Docker, identical dataset and config.

| Metric | main (sequential) | feature (parallel) |
| --- | --- | --- |
| DataAssetsWorkflow duration | ~94 seconds | ~36 seconds |
| DI documents indexed | 8,368 | 8,368 |
| Job status | success (0 failed) | success (0 failed) |
| Concurrency budget | N/A | 16 virtual threads (cores × 2) |
| Speedup | baseline | ~2.6x faster |

Both runs produced identical results with zero failures.


How did you test your changes?

  • Full A/B test in a clean Docker environment.

  • Ran both main and feature branch on the same dataset (8,292 entities).

  • Triggered Data Insights pipeline.

  • Compared:

    • Log timestamps
    • DI document counts
    • Job stats
  • Verified identical indexed document counts and zero failures.

  • Verified that the updated concurrency budget (16 threads, down from 75) produces identical results and equivalent performance, confirming that carrier thread pinning on MySQL was the actual concurrency limit.


Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

Checklist:

  • I have read the [CONTRIBUTING](https://docs.open-metadata.org/developers/contribute) document.
  • My PR title is Fixes #25808: Parallelize DataAssetsWorkflow using Java 21 virtual threads
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

@manerow manerow self-assigned this Feb 11, 2026
@manerow manerow requested a review from a team as a code owner February 11, 2026 12:25
@manerow manerow added safe to test Add this label to run secure Github workflows on PRs To release Will cherry-pick this PR into the release branch backend labels Feb 11, 2026
@manerow manerow force-pushed the feat/parallel-data-assets-workflow-25808 branch from f491a98 to 5778830 Compare February 11, 2026 13:44
@TeddyCr TeddyCr removed the To release Will cherry-pick this PR into the release branch label Feb 11, 2026
@TeddyCr TeddyCr requested a review from Copilot February 11, 2026 15:41
TeddyCr
TeddyCr previously approved these changes Feb 11, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR parallelizes the DataAssetsWorkflow in the Data Insights pipeline using Java 21 virtual threads to improve performance. The workflow processes 8,292 entities with a ~2.6x speedup (from ~94 seconds to ~36 seconds) by converting sequential entity enrichment into concurrent processing with semaphore-based concurrency control.

Changes:

  • Introduced parallel entity processing using Executors.newVirtualThreadPerTaskExecutor() with a concurrency budget calculated as Math.max(4, Math.min(cores * 2, poolSize / 2)) to balance CPU parallelism with database connection pool capacity
  • Added enrichSingle() method to DataInsightsEntityEnricherProcessor for independent single-entity enrichment in parallel contexts
  • Made updateStats() methods synchronized across sink and processor classes to ensure thread-safe statistics accumulation during concurrent processing
  • Implemented graceful shutdown support with stop() methods that propagate stop signals to active workflows and shut down executors

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 11 comments.

Show a summary per file

| File | Description |
| --- | --- |
| DataAssetsWorkflow.java | Core parallelization logic with virtual thread executor, semaphore-based concurrency control, ConcurrentLinkedQueue for bulk operations, and graceful shutdown support |
| DataInsightsEntityEnricherProcessor.java | New enrichSingle() method for per-entity enrichment without batch error wrapping, and synchronized updateStats() for thread safety |
| DataInsightsElasticSearchProcessor.java | Synchronized updateStats() method to prevent race conditions in concurrent stats updates |
| DataInsightsOpenSearchProcessor.java | Synchronized updateStats() method to prevent race conditions in concurrent stats updates |
| ElasticSearchIndexSink.java | Synchronized updateStats() method to prevent race conditions in concurrent stats updates |
| OpenSearchIndexSink.java | Synchronized updateStats() method to prevent race conditions in concurrent stats updates |
| DataInsightsApp.java | Override stop() method to propagate shutdown signals to active DataAssetsWorkflow instance |

@github-actions
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

@harshach
Collaborator

@manerow are you looking at the recent changes to search indexing? It uses Quartz to distribute across OM servers too, which would give you more leverage in making indexing truly distributed.
Secondly, not every indexing run should fully delete and re-index; we should be able to specify the past few days and only index the data from there.

@manerow
Contributor Author

manerow commented Feb 18, 2026

@harshach Thanks for the pointers.

Distributed indexing with Quartz: I've looked at the DistributedSearchIndexExecutor and the partition-based coordination model. The reason I didn't reuse it here is that the two pipelines work differently. In search reindexing, each entity produces one document and the bottleneck is ES/OS bulk I/O; partitioning by offset ranges maps cleanly, and distributing across servers helps because the sink is the constraint. In the Data Assets workflow, the bottleneck is entity enrichment (3-5 DB round-trips per entity, fanning out into 30+ daily snapshots each), not the read or the sink, so parallelizing that I/O-bound enrichment with virtual threads on a single server is what gives us the speedup here.

Virtual threads with a semaphore parallelize that enrichment within a single server for ~100 lines of code and no new config. This isn't a replacement for distributed processing; it's the first step. Distribution would decide which entities each server handles, while virtual threads speed up the work within each node. The two layers are complementary, and adapting the Quartz coordination to split entity types across OM instances would be a natural follow-up in a separate PR.

Incremental indexing: Agreed, already tracked in #25809. The plan is to filter entities by updatedAt > lastSuccessfulRun and switch report data to upsert instead of delete-reinsert. For a 100K-entity deployment with ~1% daily change, that drops processed entities from 100K to ~1K per run. Also complementary to this PR.
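The incremental plan reduces to a simple predicate over the entity set. A minimal sketch, assuming a hypothetical Entity shape with an epoch-millis updatedAt field (the real #25809 work would push this filter into the DB query rather than filtering in memory):

```java
import java.util.List;
import java.util.function.Predicate;

public class IncrementalFilterSketch {
    // Hypothetical minimal entity shape: only updatedAt (epoch millis) matters here.
    record Entity(String id, long updatedAt) {}

    /** Keeps only entities modified after the last successful run. */
    public static List<Entity> changedSince(List<Entity> entities, long lastSuccessfulRun) {
        Predicate<Entity> changed = e -> e.updatedAt() > lastSuccessfulRun;
        return entities.stream().filter(changed).toList();
    }
}
```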

@harshach
Collaborator

@manerow if you are planning on doing the distributed job in another PR, that works for me. Even with virtual threads, if it loops back over enough days we will lock those tables for a while. It would be better to distribute based on the number of days the user wants to reindex from.

@manerow
Contributor Author

manerow commented Feb 18, 2026

@harshach Sounds good. I'll create a task for the distributed approach with date-range partitioning for backfills and tackle it in a separate PR.

TeddyCr
TeddyCr previously approved these changes Mar 4, 2026
@gitar-bot

gitar-bot bot commented Mar 6, 2026

🔍 CI failure analysis for 224bdd3: The `playwright-ci-postgresql (3, 6)` job has 1 hard failure (Bulk Import/Export › Database) and 7 flaky tests (DataQuality, Permissions) caused by timeouts and browser crashes unrelated to this PR's changes. A previously reported SonarCloud authentication failure also appears unrelated to these changes.

Issue

Two CI failures have been observed across different jobs for this PR.


1. playwright-ci-postgresql (3, 6) — Playwright E2E Test Failures

Result: 1 failed, 7 flaky, 650 passed

Failing test (hard failure):

  • [chromium] › playwright/e2e/Features/BulkImport.spec.ts:394:7 › Bulk Import Export › Database

Flaky tests (passed on retry):

  • DataQuality/ColumnLevelTests.spec.ts — Column Values Sum To Be Between
  • DataQuality/DataQualityPermissions.spec.ts — Admin can see Data Quality UI controls
  • DataQuality/IncidentManagerDateFilter.spec.ts — Select preset date range
  • DataQuality/TableLevelTests.spec.ts — Custom SQL Query
  • Permissions/GlossaryPermissions.spec.ts — Team-based permissions
  • Permissions/ServiceEntityPermissions.spec.ts — SearchIndex Service allow common operations permissions

Root Cause:

The failures are characterized by:

  • page.waitForResponse: Target page, context or browser has been closed — browser crash/context loss
  • Test timeout of 60000ms exceeded while running "beforeEach" hook — infrastructure timeout
  • locator.click: Target page, context or browser has been closed — browser crash mid-test
  • expect(locator).toBeVisible() failed: edit-description element not found (likely a timing/race condition)

All failing tests are in DataQuality, Permissions, and BulkImport feature areas. None of these tests are related to the DataAssetsWorkflow parallelization changes in this PR, which only touch backend Java files (DataAssetsWorkflow, DataInsightsEntityEnricherProcessor, DataInsightsElasticSearchProcessor, DataInsightsOpenSearchProcessor, ElasticSearchIndexSink, OpenSearchIndexSink, DataInsightsApp). These failures are consistent with flaky infrastructure-level issues (browser crashes, connection drops) in the shared CI environment.


2. maven-sonarcloud-ci — SonarCloud Authentication Failure (previously reported)

Root Cause:

The SonarCloud Maven plugin failed with:

[ERROR] Project not found. Please check the 'sonar.projectKey' and 'sonar.organization' properties,
the 'SONAR_TOKEN' environment variable, or contact the project administrator

And:

[WARNING] Failed to check if project 'open-metadata_OpenMetadata' is bound
[INFO] Detected project binding: ERROR

This is a CI infrastructure/authentication issue with the SONAR_TOKEN secret or sonar.projectKey/sonar.organization configuration for the openmetadata-java-client module. It is not related to this PR's changes.


Summary

Both failures are unrelated to this PR. The Playwright failures are browser-crash/timeout flakiness in the shared CI environment, and the SonarCloud failure is a token/configuration issue in the CI infrastructure. No code changes are needed.

Code Review ✅ Approved 1 resolved / 1 findings

DataAssetsWorkflow now leverages virtual threads for parallelization with improved shutdown handling that ensures updateWorkflowStats executes even if drainAndFlush fails. Uncaught CancellationException on shutdown has been addressed.

✅ 1 resolved
Bug: Uncaught CancellationException on shutdown skips batch drain

📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/insights/workflows/dataAssets/DataAssetsWorkflow.java:292
When stop() calls executor.shutdownNow(), tasks that were submitted but haven't started are cancelled. Calling future.get() on a cancelled Future throws java.util.concurrent.CancellationException (extends IllegalStateException), which is not a subclass of ExecutionException.

The catch block at line 296 only catches ExecutionException, so CancellationException escapes the future-iteration loop. This means:

  1. Remaining futures in the batch are never awaited — virtual threads may still be running and adding to operationsQueue while the main thread moves on.
  2. drainAndFlush() at line 306 is skipped for that batch, potentially losing already-completed work from successful tasks in the same batch.
  3. source.updateStats() at line 309 is skipped, causing inaccurate workflow statistics.

The exception propagates as an unchecked exception, exits the while loop, hits the finally block (setting executor to null), then reaches the final drainAndFlush at line 325 — but by this time, virtual threads from the batch may still be running concurrently with the drain.

Fix: catch CancellationException alongside ExecutionException in the future-iteration loop, or use a broader catch (Exception e) pattern.
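The suggested catch shape would look roughly like this (a sketch with hypothetical names; returning a count is just for illustration):

```java
import java.util.List;
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

public class DrainSketch {
    /** Awaits every future, tolerating cancelled tasks; returns how many completed. */
    public static int awaitAll(List<Future<?>> futures) {
        int completed = 0;
        for (Future<?> future : futures) {
            try {
                future.get();
                completed++;
            } catch (ExecutionException e) {
                // Task ran and failed: record the failure, keep draining the batch.
            } catch (CancellationException e) {
                // Cancelled by shutdownNow(): skip it, but keep awaiting the rest
                // so drainAndFlush() and updateStats() still run for this batch.
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break; // stop requested: fall through and drain what completed
            }
        }
        return completed;
    }
}
```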

Bug: Executor null-ed before try-with-resources close

📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/insights/workflows/dataAssets/DataAssetsWorkflow.java:322
At line 322, this.executor = null is set inside the try-with-resources block but before the implicit close() call on sourceExecutor. This creates a small window where stop() cannot reach the executor via shutdownNow() because the field is already null, but the executor hasn't actually been closed yet.

While the stopped flag provides a secondary check, the canonical pattern would be to let the try-with-resources handle cleanup and null the field in a finally block or after the try-with-resources block:

try (ExecutorService sourceExecutor = Executors.newVirtualThreadPerTaskExecutor()) {
    this.executor = sourceExecutor;
    // ... processing loop ...
} finally {
    this.executor = null;
}

This also ensures the field is nulled even if close() throws (though unlikely for virtual thread executors).


@sonarqubecloud

sonarqubecloud bot commented Mar 6, 2026



Development

Successfully merging this pull request may close these issues.

Data Insight: Parallelize DataAssetsWorkflow entity enrichment with virtual threads

5 participants