Skip to content

lineage parser's handling issue of CTEs in INSERT ... WITH statements. #24416

@SarinJickson02

Description

@SarinJickson02

Affected module
Ingestion Framework — Lineage Parser (SQL lineage extraction)

Describe the bug
When parsing lineage for a BigQuery SQL statement containing an INSERT … WITH clause, the OpenMetadata lineage parser incorrectly identifies a Common Table Expression (CTE) name (tgt_m_bu) as a source table.

The expected behavior is for the parser to extract the real underlying tables referenced inside the CTEs, not the CTEs themselves.

To Reproduce
Parse lineage for the SQL statement below using the OpenMetadata lineage parser:
INSERT INTO [my_project.my](http://my_project.my/)_[dataset.my](http://dataset.my/)_temp_table(
ab_cd
,sls_ord_num
,sns_ord_dtl_num
,ec_sls_ord_type
,sku_cd
,cl_id
,sz_id
,lgth_id
,itm_cd
,sls_ord_date
,sls_ord_sts
,sls_ord_qty
,crnt_prc
,sell_prc
,bk_stk_sts_name
,dlk_delete_date
,dlk_trace_id
,dlk_create_date
,dlk_create_task_qn
,dlk_create_task_run_id
,dlk_update_date
,dlk_update_task_qn
,dlk_update_task_run_id
)
WITH tgt_m_bu AS (
SELECT ab_cd
FROM
[my_staging_project.my](http://my_staging_project.my/)_dataset_maint.bu_table
WHERE
LOWER(bu_name) = 'uqcd'
),
ltst_m_plu_dy_bu AS (
SELECT
m_plu.ab_cd
, m_plu.sku_cd
, max(m_plu.itm_cd) itm_cd
, m_plu.cl_id
, m_plu.sz_id
, m_plu.lgth_id
FROM (
SELECT * FROM [my_staging_project.my](http://my_staging_project.my/)_dataset.plu_daily_bu_table
WHERE proc_dy = (SELECT MAX(proc_dy) FROM [my_staging_project.my](http://my_staging_project.my/)_dataset.plu_daily_bu_table)
) m_plu
GROUP BY m_plu.ab_cd, m_plu.sku_cd, m_plu.cl_id, m_plu.sz_id, m_plu.lgth_id
),
ocr_ec_store_sts AS (
SELECT
sns_ord_n,
CASE
WHEN usg_lctn_type = "STORE_STOCK" THEN "STORE"
WHEN usg_lctn_type = "EC_STOCK" THEN "EC"
WHEN usg_lctn_type = "" THEN "EC"
ELSE "UNCONFIRM"
END AS bk_stk_sts_name
FROM [my_staging_project.my](http://my_staging_project.my/)_dataset_events.sales_order_reservations
),
forced_returned_tbl AS (
SELECT a.org_sns_ord_n AS sns_ord_n
FROM [my_staging_project.my](http://my_staging_project.my/)_dataset_events.return_item_header a
INNER JOIN [my_staging_project.my](http://my_staging_project.my/)_dataset_events.return_item_control b
ON a.rtrn_num = b.rtrn_num
WHERE cntl_type = "FORCE_REFUND_FLAG"
AND cntl_val = "Y"
)
SELECT
cnv.ab_cd
, ord.sns_ord_n
, odt.sns_ord_dtl_num
, odt.ec_sls_ord_type
, odt.lvl_2_itm_cd AS sku_cd
, cplu.cl_id
, cplu.sz_id
, cplu.lgth_id
, cplu.itm_cd AS itm_cd
, DATE(DATE_ADD(DATETIME(sls_ord_create_date_time,'EST'), INTERVAL 0 MINUTE))
, ord.sls_ord_sts
, odt.itm_qty_amt - odt.rtrn_qty
, odt.dtl_appld_rtl_prc_tx_ex AS crnt_prc
, odt.itm_sls_time_unit_prc_tx_ex AS sell_prc
, CASE WHEN obk.bk_stk_sts_name IS NULL THEN "UNCONFIRM" ELSE obk.bk_stk_sts_name END bk_stk_sts_name
, CAST(NULL AS TIMESTAMP)
, ord.dlk_trace_id
, CURRENT_TIMESTAMP() AS dlk_create_date
, 'sample_task_name' AS dlk_create_task_qn
, 'sample_run_id_uuid' AS dk_create_task_run_id
, CURRENT_TIMESTAMP() AS dlk_update_date
, 'sample_task_name' AS dlk_update_task_qn
, 'sample_run_id_uuid' AS dlk_update_task_run_id
FROM
[my_staging_project.my](http://my_staging_project.my/)_dataset_events.sales_order_header ord
,tgt_m_bu cnv
INNER JOIN [my_staging_project.my](http://my_staging_project.my/)_dataset.sales_order_details odt
ON ord.sns_ord_n = odt.sns_ord_n
LEFT JOIN ocr_ec_store_sts obk
ON ord.sns_ord_n = obk.sns_ord_n
LEFT JOIN ltst_m_plu_dy_bu cplu
ON cplu.sku_cd = odt.lvl_2_itm_cd
AND cplu.ab_cd = cnv.ab_cd
WHERE
LOWER(ord.rgn_name) = 'ca'
AND ord.sns_ord_n not in (SELECT sns_ord_n FROM forced_returned_tbl)
AND ord.sls_ord_sts IS NOT NULL
AND ord.sls_ord_create_date_time IS NOT NULL
AND lower(odt.brnd_name) = 'uq'
AND odt.itm_qty_amt IS NOT NULL
AND ord.str_id IS NOT NULL
AND odt.ec_sls_ord_type != "REM"
AND DATE(DATE_ADD(DATETIME(sls_ord_create_date_time,'EST'), INTERVAL 0 MINUTE)) IN (NULL);

Expected behavior
CTE names should not appear as physical sources.

Only the actual referenced tables should appear, such as:

my_staging_project.my_dataset_maint.bu_table

my_staging_project.my_dataset.plu_daily_bu_table

my_staging_project.my_dataset_events.sales_order_reservations

my_staging_project.my_dataset_events.return_item_header

my_staging_project.my_dataset_events.return_item_control

my_staging_project.my_dataset_events.sales_order_header

my_staging_project.my_dataset.sales_order_details

The only target should be:

my_project.my_dataset.my_temp_table

The lineage parser should resolve CTEs and map lineage to the underlying tables referenced in the SQL, not to CTE aliases.
INSERT INTO

should identify a single target and all fully-qualified physical tables used anywhere in the query (including subqueries inside CTEs) as sources.

Version:

  • OS: ubuntu 22.04
  • Python version: 3.10
  • OpenMetadata version: 1.9.14
  • OpenMetadata Ingestion package version: [e.g. openmetadata-ingestion[docker]==XYZ]

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

Projects

Status

Done ✅

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions