Support transformers v5#481

Draft
jlamypoirier wants to merge 13 commits into main from jlp_transformers_v5

Conversation

@jlamypoirier
Collaborator

✨ Description

jlamypoirier and others added 12 commits April 22, 2026 20:34
- Widen transformers version constraint to >=4.57.3,<6.0.0
- Version-gate PretrainedConfig init (__init__ vs __post_init__) and dtype attribute (torch_dtype vs dtype) using dataclasses.is_dataclass detection
- Fall back to transformers.modeling_utils.no_init_weights for 4.x
- Support both rope_parameters (5.x) and rope_theta/rope_scaling (4.x) in Llama import/export config
- Handle both attribute paths for vision_tower in multimodal HF model test
- Fix mtp_llama LlamaRotaryEmbedding to handle both rope config formats
- Add _gdn_fla_available and _kda_fla_available flags to apriel2; use them to properly skip backup SSM tests when fla kernels are absent
- Update CLAUDE.md with redirect-to-file and external model test guidance

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
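A minimal sketch of the version gating described in the bullets above. The helper name is hypothetical; the PR itself detects 5.x via `dataclasses.is_dataclass(PretrainedConfig)`, and a plain version string stands in for that check here:

```python
# Hypothetical helper illustrating the dtype-attribute gate: transformers 4.x
# configs expose `torch_dtype`, while 5.x renames the attribute to `dtype`.
def dtype_attr_name(transformers_version: str) -> str:
    major = int(transformers_version.split(".", 1)[0])
    return "dtype" if major >= 5 else "torch_dtype"

print(dtype_attr_name("4.57.3"))  # torch_dtype
print(dtype_attr_name("5.0.0"))   # dtype
```

Resolving the attribute name once and using `getattr(config, name)` keeps the rest of the code version-agnostic.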
…compatibility

- apriel2/modeling_apriel2.py: add _TRANSFORMERS_V5 flag; fix _tied_weights_keys
  to dict format for 5.x (list for 4.x); add rope_parameters to PixtralRotaryEmbedding
  SimpleNamespace config
- mtp_llama/modeling_mtp_llama.py: add _TRANSFORMERS_V5 flag; fix _tied_weights_keys
- apriel2/conversion/llava/config.py: handle 5.x rope_parameters dict in text and
  vision configs alongside 4.x rope_theta
- apriel2/conversion/llava/plan.py: version-conditional source weight key prefixes
  (5.x LlavaForConditionalGeneration adds model. prefix to submodules)
- test_cache_contracts.py: update DynamicLayer.get_mask_sizes calls to pass int in 5.x
  (query_length) vs tensor in 4.x; update sdpa_mask signature for 5.x (q_length/q_offset)
- test_convert_from_llava.py: use version-conditional embed_tokens source key
- test_equivalence.py: fix get_image_features handling — 5.x returns BaseModelOutput
  with projected features in pooler_output (not last_hidden_state)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
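The rope-config handling in the commit above can be sketched as a small normalizer (the function name and the `"rope_theta"` key inside the 5.x dict are assumptions for illustration, not taken from the repo):

```python
from types import SimpleNamespace

# Hedged sketch: 5.x nests rope settings under a rope_parameters dict,
# 4.x exposes a flat rope_theta attribute on the config.
def get_rope_theta(config) -> float:
    rope_params = getattr(config, "rope_parameters", None)
    if rope_params is not None:          # transformers 5.x layout
        return rope_params["rope_theta"]
    return config.rope_theta             # transformers 4.x layout

v4_cfg = SimpleNamespace(rope_theta=10000.0)
v5_cfg = SimpleNamespace(rope_parameters={"rope_theta": 10000.0})
```

Normalizing first, then dispatching on rope type, is what lets the converter share one code path for both checkpoint formats.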
- Fix num_blocks off-by-one in import_config (was subtracting 1)
- Fix num_hidden_layers off-by-one in export_config (was adding 1)
- Fix mtp_heads index off-by-one in get_converters (was prediction_distance - 1)
- Fix hidden state collection order in MTPLlamaModel: add embedding before
  trunk loop and add trunk layer outputs inside the loop, consistent with
  standard transformers @capture_outputs behavior

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
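A toy version of the corrected collection order from the last bullet: the embedding output is recorded before the trunk loop, and each trunk layer's output inside it (all names here are illustrative, not the repo's):

```python
# Sketch of the hidden-state collection order, mirroring the standard
# transformers capture-outputs convention described above.
def collect_hidden_states(embedding_output, trunk_layers):
    hidden = embedding_output
    all_hidden_states = [hidden]          # embedding first, before the loop
    for layer in trunk_layers:
        hidden = layer(hidden)
        all_hidden_states.append(hidden)  # each trunk layer output in turn
    return all_hidden_states

states = collect_hidden_states(1, [lambda h: h + 1, lambda h: h * 10])
print(states)  # [1, 2, 20]
```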
Update TOKENIZER_NAME from "bigcode/santacoder" to "gpt2" and update all
hardcoded token values in data tests to match the gpt2 vocabulary.
Also fix deprecated huggingface_hub.HfFolder.get_token() → get_token().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ERS_V4

- Deduplicate rope-type dispatch in LlamaAttentionConverter.import_config by
  normalizing rope_params/rope_theta from either checkpoint format first
- Rename _TRANSFORMERS_V5 → _TRANSFORMERS_V4 (inverted flag) so v4 compat
  code is in `if _TRANSFORMERS_V4:` blocks — grep-and-delete to drop v4
- Flip all if/else so v5 code is the default path and v4 is the guarded branch
- Import _TRANSFORMERS_V4 from config.py in huggingface.py; replace try/except
  with explicit if/else
- Add comments for v5 changes that can't use the flag (TYPE_CHECKING guard,
  checkpoint format detection, model.model structure)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
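The inverted-flag pattern above can be illustrated roughly as follows; the version parsing is a stand-in for the repo's detection, and the identity dict for the 5.x `_tied_weights_keys` format is purely illustrative (the real mapping is per-model):

```python
# v5 code is the unguarded default path; every 4.x shim sits in an
# `if _TRANSFORMERS_V4:` block, so dropping 4.x support later is a
# grep-and-delete of those blocks.
def is_transformers_v4(version: str) -> bool:
    return int(version.split(".", 1)[0]) < 5

_TRANSFORMERS_V4 = is_transformers_v4("5.0.0")  # e.g. False on a 5.x install

def tied_weights_keys(keys):
    if _TRANSFORMERS_V4:
        return list(keys)           # 4.x format: a list (deletable shim)
    return {k: k for k in keys}     # 5.x format: a dict (default path)
```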
Use tuple prefixes unpacked into W(...) instead of the / operator,
keeping the _TRANSFORMERS_V4 branching for the path prefix.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Keep llava_layer/apriel_layer intermediate variables (with / operator)
in loops; only the layer root W() calls use *prefix unpacking.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- apriel2: override tie_weights() in Apriel2PreTrainedModel to recompute
  MistralRotaryEmbedding.inv_freq after v5 meta-device loading zeroes it
  (non-persistent buffers not in checkpoint are materialized as zeros)
- apriel2: hardcode _attn_implementation="eager" in preprocess mask config
  so an explicit float mask is always built (v5 sdpa returns None otherwise)
- apriel2: use version-conditional kwarg name for create_causal_mask
  (inputs_embeds in v5, input_embeds in v4)
- apriel2 test: compare only non-padding positions in test_logits_match
- mtp_llama: update LlamaRotaryEmbedding and config for v5 compatibility
- gpt/huggingface: remove dead code line

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
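The version-conditional kwarg from the list above can be sketched like this (the helper name is hypothetical; only the two kwarg spellings come from the PR description):

```python
# create_causal_mask takes its embeddings argument as `input_embeds` in
# transformers 4.x and `inputs_embeds` in 5.x.
def causal_mask_kwargs(embeds, transformers_v4: bool) -> dict:
    key = "input_embeds" if transformers_v4 else "inputs_embeds"
    return {key: embeds}
```

The returned dict can then be splatted into the call (`create_causal_mask(**kwargs, ...)`) so only the spelling differs between versions.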
This reverts commit 39196c6.
- Replace removed Qwen3NextDynamicCache with DynamicCache; remove dropped
  cache_position kwarg from GDN forward; fix recurrent_states access path
- Fix rope_theta KeyError (moved to rope_parameters dict in v5)
- Fix attention_mask device mismatch in integration test
- Expand+contiguous attn mask before SDPA to satisfy CUDA kernel contiguity
- Use model._attn_implementation for causal mask creation so pure-causal
  inputs get None mask and SDPA uses is_causal=True (matching Qwen2 numerics)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>