Efficient scientific multimodal foundation model from InternLM, built by continued pretraining from Qwen3.5. Per the HuggingFace model card: 36B parameters (35B + utilities), positioned as comparable to the trillion-scale Intern-S1-Pro on core scientific tasks while using a fraction of the parameter budget.

Architecture (from the model's config.json): the text backbone is a Qwen3.5-MoE variant with 256 experts and a hybrid attention stack of 40 layers: 30 linear-attention and 10 full-attention, a 3:1 ratio in which every fourth layer is full-attention. Hidden dim is 2048, with GQA using 16 query heads and 2 KV heads. The vision module is a 27-layer ViT-style encoder (1152 → 2048 hidden, patch 16). Native max position embedding is 262,144 tokens; the model card caps recommended inference at 128K tokens for text reasoning and 64K for multimodal. Thinking mode is on by default.
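The layer schedule, GQA grouping, and context caps described above can be sketched in a few lines. This is a minimal illustration, not the model's actual code: every constant and function name is hypothetical (none are keys from the real config.json), the placement of the full-attention layer as the last of each group of four is an assumption, and "128K" / "64K" are assumed to mean 128 × 1024 and 64 × 1024 tokens.

```python
# Illustrative sketch only. All names are hypothetical, not config.json keys.
NUM_LAYERS = 40               # 30 linear-attention + 10 full-attention layers
FULL_ATTENTION_INTERVAL = 4   # every fourth layer is full-attention
NUM_QUERY_HEADS = 16
NUM_KV_HEADS = 2
NATIVE_MAX_POSITIONS = 262_144


def layer_types(num_layers=NUM_LAYERS, interval=FULL_ATTENTION_INTERVAL):
    """Return the attention kind of each layer.

    Assumption: the full-attention layer is the last one in each group
    of `interval` layers (layers 4, 8, 12, ... when counting from 1).
    """
    return [
        "full" if (index + 1) % interval == 0 else "linear"
        for index in range(num_layers)
    ]


def queries_per_kv_head(num_query=NUM_QUERY_HEADS, num_kv=NUM_KV_HEADS):
    """Number of query heads that share one KV head under GQA."""
    if num_query % num_kv != 0:
        raise ValueError("query heads must divide evenly across KV heads")
    return num_query // num_kv


def recommended_max_tokens(multimodal):
    """Recommended inference cap: 64K for multimodal, 128K for text."""
    # Assumption: K means 1024 tokens.
    return 64 * 1024 if multimodal else 128 * 1024


schedule = layer_types()
print(schedule[:8])
print(schedule.count("linear"), schedule.count("full"))   # 30 10
print(queries_per_kv_head())                              # 8
print(recommended_max_tokens(multimodal=False))           # 131072
```

Under these assumptions the text cap of 131,072 tokens is exactly half of the native 262,144 position limit, and each KV head serves eight query heads.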

Reported benchmarks (HF model card): SWE-bench Resolved 64, MMLU-Pro 88, MathArena HMMT Feb 2026 87.31, HLE 21.94, MMMU-Pro Vision 76.88, WildClawBench 39.2. Reported as the first open-source model to perform material crystal-structure generation. Licensed under Apache 2.0.

Released as a "Preview" weights drop on May 22, 2026; no companion technical report on arXiv as of June 2026 — the Intern-S2 paper is presumably forthcoming alongside the non-preview release.

Model Details

Architecture: MoE
Parameters: 36B
Experts: 256
Context window: 131,072 tokens
License: Apache 2.0
Base model: Qwen3.5

Benchmark Scores

Benchmark: Score
SWE-bench Resolved: 64
MMLU-Pro: 88
MathArena HMMT Feb 2026: 87.31
HLE: 21.94
MMMU-Pro Vision: 76.88
WildClawBench: 39.2

Tags: frontier, science, multimodal, moe, open-weight, reasoning
