Hymba
Hybrid-head architecture running attention heads and SSM heads in parallel within the same layer, plus learnable meta tokens and cross-layer KV sharing. Hymba-1.5B outperforms Llama-3.2-3B by 1.32% on commonsense reasoning with an 11x smaller cache and 3.49x faster inference. Architectural precursor to the Nemotron 3 hybrid approach.
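A minimal sketch of the hybrid-head idea, assuming PyTorch: each layer feeds the same input to an attention branch and an SSM-style branch in parallel, normalizes both outputs, and fuses them. The module names, the GRU stand-in for the SSM branch, and the mean-based fusion are illustrative assumptions, not the paper's actual code.

```python
import torch
import torch.nn as nn


class HybridHeadBlock(nn.Module):
    """Parallel attention + SSM branches within one layer (illustrative sketch)."""

    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Stand-in for an SSM head (e.g. Mamba); a GRU keeps the sketch
        # dependency-free while playing the same "recurrent summarizer" role.
        self.ssm = nn.GRU(dim, dim, batch_first=True)
        self.norm_attn = nn.LayerNorm(dim)
        self.norm_ssm = nn.LayerNorm(dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention branch: high-resolution recall over the whole sequence.
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        # SSM-style branch: constant-size recurrent state summarizing context.
        ssm_out, _ = self.ssm(x)
        # Normalize each branch and average before projecting; the paper uses
        # learnable per-branch scaling, a plain mean keeps this sketch minimal.
        fused = 0.5 * (self.norm_attn(attn_out) + self.norm_ssm(ssm_out))
        return x + self.out_proj(fused)


if __name__ == "__main__":
    dim, n_meta = 64, 8
    x = torch.randn(2, 16, dim)                      # (batch, seq, dim)
    # Meta tokens are learnable vectors prepended to every sequence; a zero
    # tensor is used here only to show where they enter the computation.
    meta = torch.zeros(2, n_meta, dim)
    block = HybridHeadBlock(dim)
    print(block(torch.cat([meta, x], dim=1)).shape)  # torch.Size([2, 24, 64])
```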
Paper: arXiv:2411.13676