watson: perform async index updates #13152
This pull request introduces middleware/tasks that dynamically load models by name (built from obj._meta) with no whitelist or validation, so an attacker who can trigger modifications to sensitive models (e.g., auth.User) could cause those records to be indexed and potentially expose sensitive data. It also accumulates all modified object primary keys in memory before batching, meaning a single request that modifies a very large number of objects could exhaust worker memory and cause a denial-of-service.
Arbitrary Model Loading in
| Vulnerability | Arbitrary Model Loading |
|---|---|
| Description | The update_watson_search_index_for_model task dynamically loads a Django model using apps.get_model() based on the model_name parameter. This model_name is constructed in AsyncSearchContextMiddleware from obj._meta.app_label and obj._meta.model_name of any modified object. There is no whitelist or validation to restrict which models can be indexed. An attacker who can trigger a modification on a sensitive model (e.g., their own auth.User object) could cause that model's data to be indexed by the search engine. If django-watson indexes sensitive fields by default and these are exposed via search results, this could lead to information disclosure. |
django-DefectDojo/dojo/tasks.py
Lines 224 to 276 in 1a303c5
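One possible way to address this finding, not part of this PR, would be to check the model name against an allowlist before the task resolves it. The sketch below is a hypothetical illustration only: `INDEXABLE_MODELS` and `validate_model_name` are assumed names, and the listed models are placeholders rather than the set DefectDojo actually registers with Watson.

```python
# Hypothetical allowlist of "app_label.model_name" strings that may be indexed.
# The entries are placeholders for illustration, not DefectDojo's real list.
INDEXABLE_MODELS = frozenset({"dojo.finding", "dojo.product"})


def validate_model_name(model_name: str) -> str:
    """Return the normalized model name, or raise ValueError if it is not allowed."""
    normalized = model_name.strip().lower()
    if normalized not in INDEXABLE_MODELS:
        raise ValueError(f"model {model_name!r} is not allowed to be indexed")
    return normalized
```

A task could call such a check before passing the name to `apps.get_model()`, so that a name such as `auth.user` is rejected early.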
Uncontrolled Resource Consumption in dojo/middleware.py
| Vulnerability | Uncontrolled Resource Consumption |
|---|---|
| Description | The _extract_tasks_for_async method in AsyncSearchContextMiddleware aggregates all primary keys (PKs) of modified objects into an in-memory dictionary (model_groups) before any batching or asynchronous processing occurs. If a single web request were to modify a very large number of database objects (e.g., millions), this could lead to excessive memory consumption within the web worker process, potentially causing a Denial of Service (DoS) due to memory exhaustion. |
django-DefectDojo/dojo/middleware.py
Lines 212 to 287 in 1a303c5
All finding details can be found in the DryRun Security Dashboard.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Conflicts have been resolved. A maintainer will review the pull request shortly.
* watson: perform async index updates
* watson: perform async index updates
* watson: perform async index updates
* ruff
While working on some other performance PRs, I noticed that Watson was spending a lot of time updating the search index.
Summary
This PR introduces asynchronous Watson search index updates to significantly improve API response times during large data imports while maintaining search functionality. Instead of blocking API responses while updating search indexes synchronously, updates are now processed in the background via Celery tasks.
🏗️ Implementation
Core Components
- `AsyncSearchContextMiddleware` - Inherits from Watson's `SearchContextMiddleware` to intercept and defer index updates.
- `update_watson_search_index_for_model` - Celery task that processes batched index updates using Watson's own bulk processing logic.
The coupling with `django-watson` internals is limited to keep the code maintainable and this PR justifiable. Please note we're considering alternative search engines, as `django-watson` is pretty basic and doesn't utilize advanced Postgres features.
The upstream Watson middleware keeps track of all model instances that are changed during a request. It does this by storing each model instance in a `SearchContextMiddleware` instance via a `post_save` signal. At the end of the request, these instances are used to update the search index by calling the `end()` method on the `SearchContextMiddleware`.
Our implementation still tracks all the model instances in a `SearchContextMiddleware` instance. But at the end of the request we extract each model's name and pk. These are sent to the Celery task. The task retrieves the model instances via the model name and pk, populates a `SearchContextMiddleware` instance again, and then calls the `end()` method on the `SearchContextMiddleware` to update the search index. So we reuse most of the Watson logic; we just pass some model names and pks around. It does mean the Celery task performs an extra query to retrieve the model instances from the database. This is done in 1 query per batch. The performance impact of this "read" query is negligible compared to the writes to the search index that happen right afterwards.
⚙️ Configuration
Environment Variables
| Variable | Default |
|---|---|
| DD_WATSON_ASYNC_INDEX_UPDATE_THRESHOLD | 100 |
| DD_WATSON_ASYNC_INDEX_UPDATE_BATCH_SIZE | 1000 |
Configuration Examples
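As a rough illustration, the two variables could be read with their defaults (100 and 1000) as in the sketch below. This is an assumption about usage, not the PR's actual settings code: `read_int_setting` is a hypothetical helper, and plain `os.environ` stands in for whatever settings loader the project uses.

```python
import os


def read_int_setting(name: str, default: int) -> int:
    """Read an integer from the environment, falling back to the default if unset or empty."""
    raw = os.environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    return int(raw)


# Defaults as listed above; export the variables to override them.
# A negative threshold (e.g. -1) keeps all index updates synchronous.
threshold = read_int_setting("DD_WATSON_ASYNC_INDEX_UPDATE_THRESHOLD", 100)
batch_size = read_int_setting("DD_WATSON_ASYNC_INDEX_UPDATE_BATCH_SIZE", 1000)
```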
🎛️ Behavior
Smart Threshold Logic
- `threshold < 0` → All updates synchronous (async disabled)
- `instances <= threshold` → Synchronous updates (fast, immediate)
- `instances > threshold` → Asynchronous updates (prevents blocking)
📊 Performance Impact
When using the JFrog Unified Very Many sample file, the importer would spend ~50-60s updating the search index. This is 10-15% of the total import time. With this PR, all of that happens in a Celery task, making the import 10-15% faster.
I didn't add a test to `test_importers_performance.py` as that is not using the API and won't go through the middleware. I did add a test case to make sure the async index update doesn't break in the future.
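For readers who want the threshold rules and the name/pk batching from the description in one place, here is a minimal standalone sketch. It does not use Django, Watson, or Celery, and the function names (`should_update_async`, `group_pks_by_model`, `batched`) are hypothetical, not the names used in the PR.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def should_update_async(instance_count: int, threshold: int) -> bool:
    """Negative threshold disables async; otherwise go async only above the threshold."""
    if threshold < 0:
        return False
    return instance_count > threshold


def group_pks_by_model(changed: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group (model_name, pk) pairs into {model_name: [pk, ...]}."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for model_name, pk in changed:
        groups[model_name].append(pk)
    return dict(groups)


def batched(pks: List[int], batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of at most batch_size primary keys."""
    for start in range(0, len(pks), batch_size):
        yield pks[start:start + batch_size]
```

With a threshold of 100 and a batch size of 1000, a request that changes 2500 objects of one model would go async and be split into batches of 1000, 1000, and 500 pks, each of which would presumably become one task invocation.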