-
Notifications
You must be signed in to change notification settings - Fork 1.8k
lineage parser's handling issue of CTEs in INSERT ... WITH statements. #24416
Description
Affected module
Ingestion Framework — Lineage Parser (SQL lineage extraction)
Describe the bug
When parsing lineage for a BigQuery SQL statement containing an INSERT … WITH clause, the OpenMetadata lineage parser incorrectly identifies a Common Table Expression (CTE) name (tgt_m_bu) as a source table.
The expected behavior is for the parser to extract the real underlying tables referenced inside the CTEs, not the CTEs themselves.
To Reproduce
Parse lineage for the SQL statement below using the OpenMetadata lineage parser:
INSERT INTO [my_project.my](http://my_project.my/)_[dataset.my](http://dataset.my/)_temp_table(
ab_cd
,sls_ord_num
,sns_ord_dtl_num
,ec_sls_ord_type
,sku_cd
,cl_id
,sz_id
,lgth_id
,itm_cd
,sls_ord_date
,sls_ord_sts
,sls_ord_qty
,crnt_prc
,sell_prc
,bk_stk_sts_name
,dlk_delete_date
,dlk_trace_id
,dlk_create_date
,dlk_create_task_qn
,dlk_create_task_run_id
,dlk_update_date
,dlk_update_task_qn
,dlk_update_task_run_id
)
WITH tgt_m_bu AS (
SELECT ab_cd
FROM
[my_staging_project.my](http://my_staging_project.my/)_dataset_maint.bu_table
WHERE
LOWER(bu_name) = 'uqcd'
),
ltst_m_plu_dy_bu AS (
SELECT
m_plu.ab_cd
, m_plu.sku_cd
, max(m_plu.itm_cd) itm_cd
, m_plu.cl_id
, m_plu.sz_id
, m_plu.lgth_id
FROM (
SELECT * FROM [my_staging_project.my](http://my_staging_project.my/)_dataset.plu_daily_bu_table
WHERE proc_dy = (SELECT MAX(proc_dy) FROM [my_staging_project.my](http://my_staging_project.my/)_dataset.plu_daily_bu_table)
) m_plu
GROUP BY m_plu.ab_cd, m_plu.sku_cd, m_plu.cl_id, m_plu.sz_id, m_plu.lgth_id
),
ocr_ec_store_sts AS (
SELECT
sns_ord_n,
CASE
WHEN usg_lctn_type = "STORE_STOCK" THEN "STORE"
WHEN usg_lctn_type = "EC_STOCK" THEN "EC"
WHEN usg_lctn_type = "" THEN "EC"
ELSE "UNCONFIRM"
END AS bk_stk_sts_name
FROM [my_staging_project.my](http://my_staging_project.my/)_dataset_events.sales_order_reservations
),
forced_returned_tbl AS (
SELECT a.org_sns_ord_n AS sns_ord_n
FROM [my_staging_project.my](http://my_staging_project.my/)_dataset_events.return_item_header a
INNER JOIN [my_staging_project.my](http://my_staging_project.my/)_dataset_events.return_item_control b
ON a.rtrn_num = b.rtrn_num
WHERE cntl_type = "FORCE_REFUND_FLAG"
AND cntl_val = "Y"
)
SELECT
cnv.ab_cd
, ord.sns_ord_n
, odt.sns_ord_dtl_num
, odt.ec_sls_ord_type
, odt.lvl_2_itm_cd AS sku_cd
, cplu.cl_id
, cplu.sz_id
, cplu.lgth_id
, cplu.itm_cd AS itm_cd
, DATE(DATE_ADD(DATETIME(sls_ord_create_date_time,'EST'), INTERVAL 0 MINUTE))
, ord.sls_ord_sts
, odt.itm_qty_amt - odt.rtrn_qty
, odt.dtl_appld_rtl_prc_tx_ex AS crnt_prc
, odt.itm_sls_time_unit_prc_tx_ex AS sell_prc
, CASE WHEN obk.bk_stk_sts_name IS NULL THEN "UNCONFIRM" ELSE obk.bk_stk_sts_name END bk_stk_sts_name
, CAST(NULL AS TIMESTAMP)
, ord.dlk_trace_id
, CURRENT_TIMESTAMP() AS dlk_create_date
, 'sample_task_name' AS dlk_create_task_qn
, 'sample_run_id_uuid' AS dk_create_task_run_id
, CURRENT_TIMESTAMP() AS dlk_update_date
, 'sample_task_name' AS dlk_update_task_qn
, 'sample_run_id_uuid' AS dlk_update_task_run_id
FROM
[my_staging_project.my](http://my_staging_project.my/)_dataset_events.sales_order_header ord
,tgt_m_bu cnv
INNER JOIN [my_staging_project.my](http://my_staging_project.my/)_dataset.sales_order_details odt
ON ord.sns_ord_n = odt.sns_ord_n
LEFT JOIN ocr_ec_store_sts obk
ON ord.sns_ord_n = obk.sns_ord_n
LEFT JOIN ltst_m_plu_dy_bu cplu
ON cplu.sku_cd = odt.lvl_2_itm_cd
AND cplu.ab_cd = cnv.ab_cd
WHERE
LOWER(ord.rgn_name) = 'ca'
AND ord.sns_ord_n not in (SELECT sns_ord_n FROM forced_returned_tbl)
AND ord.sls_ord_sts IS NOT NULL
AND ord.sls_ord_create_date_time IS NOT NULL
AND lower(odt.brnd_name) = 'uq'
AND odt.itm_qty_amt IS NOT NULL
AND ord.str_id IS NOT NULL
AND odt.ec_sls_ord_type != "REM"
AND DATE(DATE_ADD(DATETIME(sls_ord_create_date_time,'EST'), INTERVAL 0 MINUTE)) IN (NULL);
Expected behavior
CTE names should not appear as physical sources.
Only the actual referenced tables should appear, such as:
my_staging_project.my_dataset_maint.bu_table
my_staging_project.my_dataset.plu_daily_bu_table
my_staging_project.my_dataset_events.sales_order_reservations
my_staging_project.my_dataset_events.return_item_header
my_staging_project.my_dataset_events.return_item_control
my_staging_project.my_dataset_events.sales_order_header
my_staging_project.my_dataset.sales_order_details
The only target should be:
my_project.my_dataset.my_temp_table
The lineage parser should resolve CTEs and map lineage to the underlying tables referenced in the SQL, not to CTE aliases.
INSERT INTO
Version:
- OS: ubuntu 22.04
- Python version: 3.10
- OpenMetadata version: 1.9.14
- OpenMetadata Ingestion package version: [e.g.
openmetadata-ingestion[docker]==XYZ]
Metadata
Metadata
Assignees
Labels
Type
Projects
Status