feat: expose arrow_field, arrow_try_cast, cast_to_type, with_metadata #1568
Adds Python bindings for five scalar functions from datafusion::functions::expr_fn that were not previously surfaced:
- arrow_field: returns a struct describing an expression's Arrow field (name, data_type, nullable, metadata).
- arrow_try_cast: like arrow_cast but yields NULL on cast failure.
- cast_to_type / try_cast_to_type: casts a value to the type of a reference expression. These are exposed as a single Python entry point cast_to_type(value, type_ref, *, try_cast=False); the kwarg switches between the strict and try variants.
- with_metadata: attach Arrow field metadata; the inverse of arrow_metadata. Accepts a dict[str, str] for ergonomics.
Updates skills/datafusion_python/SKILL.md to list the new functions and documents the cast_to_type kwarg behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous commit exposed cast_to_type and try_cast_to_type as two separate pyo3 bindings and unified them in the Python wrapper via a try_cast kwarg. That left try_cast_to_type in datafusion._internal without a matching public Python name, breaking test_datafusion_missing_exports. Move the dispatch into the Rust binding: cast_to_type now takes a try_cast kwarg and selects between functions::expr_fn::cast_to_type and try_cast_to_type internally. Only one pyo3 binding is registered, so the wrapper-coverage check passes and the Python entry point is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
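The failure described in this commit can be illustrated with a small sketch of a wrapper-coverage check. This is not the project's actual test_datafusion_missing_exports; the helper name `missing_exports` and the stand-in namespaces are hypothetical and only model the idea that every name in the compiled module needs a matching public name.

```python
from types import SimpleNamespace


def missing_exports(internal, public):
    """Return public-looking names found on `internal` but not on `public`."""
    internal_names = {n for n in vars(internal) if not n.startswith("_")}
    public_names = {n for n in vars(public) if not n.startswith("_")}
    return sorted(internal_names - public_names)


def _stub(*args, **kwargs):
    """Placeholder for a bound function; it does nothing."""
    return None


# Hypothetical stand-ins: two bindings registered, one public wrapper.
internal_before = SimpleNamespace(cast_to_type=_stub, try_cast_to_type=_stub)
# After moving the dispatch into the binding, only one name is registered.
internal_after = SimpleNamespace(cast_to_type=_stub)
public = SimpleNamespace(cast_to_type=_stub)

print(missing_exports(internal_before, public))  # ['try_cast_to_type']
print(missing_exports(internal_after, public))   # []
```

Under this model, the check passes either by registering a single binding (as this commit does) or by adding a public wrapper for each registered binding.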
Mirrors arrow_cast: arrow_try_cast now accepts `pa.DataType` in addition to `str` and `Expr`. Adds `Expr.try_cast(pa.DataType)` PyO3 binding for the pyarrow-type routing path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Empty `metadata` dict now returns the input expression unchanged (previously bubbled an opaque DataFusion error about minimum arg count). Empty keys raise `ValueError` to match the docstring contract. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
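The validation order described above might look roughly like the following. This is a sketch under stated assumptions, not the wrapper's real code: `with_metadata_sketch` and the `attach` callback are hypothetical names, with `attach` standing in for whatever call forwards to the underlying binding.

```python
def with_metadata_sketch(expr, metadata, attach):
    """Model of the described behaviour for empty dicts and empty keys.

    `attach` is a hypothetical stand-in for the underlying binding call.
    """
    # An empty dict is a no-op: return the input expression unchanged.
    if not metadata:
        return expr
    # Empty keys are rejected up front instead of surfacing a backend error.
    if any(key == "" for key in metadata):
        raise ValueError("metadata keys must be non-empty strings")
    return attach(expr, dict(metadata))


def fake_attach(expr, metadata):
    """Records the arguments so the sketch can be run without DataFusion."""
    return (expr, metadata)


print(with_metadata_sketch("col_a", {}, fake_attach))            # col_a
print(with_metadata_sketch("col_a", {"k": "v"}, fake_attach))    # ('col_a', {'k': 'v'})
```

Checking for the empty dict first means the no-op path never reaches the backend, which is where the opaque argument-count error previously came from.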
Previous doctest set metadata on the input field but only checked the name — the metadata setup was dead. Now the example asserts the full returned struct (name, data_type, nullable, metadata) so the demo shows what the function actually produces. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ith_metadata
Mirrors the existing test_arrow_cast pattern. Covers:
- arrow_try_cast: string-syntax, pa.DataType, and null-on-failure paths
- arrow_field: full returned struct shape (name, data_type, nullable, metadata)
- cast_to_type: type-from-expr happy path and try_cast=True null behavior
- with_metadata: round-trip through arrow_metadata, empty-dict no-op, and empty-key ValueError
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Folds the previous four cast tests (arrow_cast + arrow_try_cast × str + pyarrow target type) into a single parameterized test that runs both functions across all five target-type variants. Collapses the two cast_to_type tests (happy path + try_cast=True) into one parameterized test, and parameterizes arrow_try_cast null-on-failure over both target-type syntaxes. 7 test functions, 19 cases — net less code, same coverage. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a one-line cross-reference so users with a known target type reach for arrow_cast / arrow_try_cast instead of building a sentinel expression to feed cast_to_type. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
nuno-faria left a comment
Thanks @timsaucer, I've left a suggestion below about splitting cast_to_type.
    #[pyfunction]
    #[pyo3(signature = (arg_1, reference, *, try_cast = false))]
    fn cast_to_type(arg_1: PyExpr, reference: PyExpr, try_cast: bool) -> PyExpr {
        if try_cast {
            functions::expr_fn::try_cast_to_type(arg_1.into(), reference.into()).into()
        } else {
            functions::expr_fn::cast_to_type(arg_1.into(), reference.into()).into()
        }
    }
Wouldn't it be better to have separate cast_to_type and try_cast_to_type functions like in upstream? This way it would also be consistent with, e.g., arrow_cast and arrow_try_cast.
Good call. Split into separate cast_to_type and try_cast_to_type functions in a8d3a5e, matching upstream and the arrow_cast / arrow_try_cast pair.
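As a plain-Python analogy for the strict/try split discussed in this thread, the two behaviours can be modelled as two small functions. This does not call DataFusion; `cast_strict` and `cast_try` are hypothetical names, and Python's `float` stands in for a cast to Float64.

```python
def cast_strict(values, convert):
    """Strict variant: the first value that cannot be converted raises."""
    return [convert(v) for v in values]


def cast_try(values, convert):
    """Try variant: values that cannot be converted become None (NULL)."""
    out = []
    for v in values:
        try:
            out.append(convert(v))
        except (TypeError, ValueError):
            out.append(None)
    return out


values = ["1.5", "oops", "3"]
print(cast_try(values, float))  # [1.5, None, 3.0]

try:
    cast_strict(values, float)
except ValueError as err:
    print(f"strict cast failed: {err}")
```

Keeping the two behaviours as separately named functions, rather than one function with a boolean flag, mirrors the arrow_cast / arrow_try_cast pairing mentioned in the review.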
    @pytest.mark.parametrize("data_type", ["Float64", pa.float64()])
    def test_arrow_try_cast_null_on_failure(data_type):
        ctx = SessionContext()
        batch = pa.RecordBatch.from_arrays([pa.array(["1.5", "oops", "3"])], names=["s"])
        df = ctx.create_dataframe([[batch]])

        result = df.select(f.arrow_try_cast(column("s"), data_type).alias("c")).collect()[0]

        assert result.column(0).to_pylist() == [1.5, None, 3.0]
Since the assert is static, is the parameter necessary?
Agreed, the data_type parametrization was redundant here since the assert does not depend on it and the str-vs-pyarrow distinction is already covered by test_arrow_cast_variants. Dropped it in a8d3a5e.
Replace the try_cast bool flag with separate cast_to_type and try_cast_to_type functions, matching upstream DataFusion and the arrow_cast / arrow_try_cast pair. Also drop the redundant data_type parametrization on test_arrow_try_cast_null_on_failure, since the str-vs-pyarrow distinction is already covered by test_arrow_cast_variants. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Which issue does this PR close?
No tracking issue; gap surfaced during the v54 upstream coverage audit.
Rationale for this change
Five scalar functions from `datafusion::functions::expr_fn` (DataFusion 54) were not exposed through the Python bindings. They round out the Arrow type-introspection and casting surface alongside the existing `arrow_typeof`, `arrow_cast`, and `arrow_metadata` wrappers.
What changes are included in this PR?
- skills/datafusion_python/SKILL.md: list the new functions and document the `cast_to_type` kwarg behavior so users understand the single-entry-point design.
Are there any user-facing changes?
Yes. Five new public functions in `datafusion.functions`:
- `arrow_field(expr)`
- `arrow_try_cast(expr, data_type)`
- `cast_to_type(value, type_ref, *, try_cast=False)`
- `with_metadata(expr, metadata)`
No breaking changes.