Intern-S2-Preview | Lab Index

Efficient scientific multimodal foundation model from InternLM, built by continued pretraining from Qwen3.5. Per the HuggingFace model card: 36B parameters (35B + utilities), positioned as comparable to the trillion-scale Intern-S1-Pro on core scientific tasks while using a fraction of the parameter budget.

Architecture (from the model's config.json): the text backbone is a Qwen3.5-MoE variant with 256 experts and a hybrid attention stack of 30 linear-attention layers and 10 full-attention layers, a 3:1 ratio in which every fourth layer is full-attention. Hidden dimension is 2048, with GQA using 16 query heads and 2 KV heads. The vision module is a 27-layer ViT-style encoder (1152 → 2048 hidden, patch size 16). Native max position embedding is 262,144 tokens; the model card caps recommended inference at 128K tokens for text reasoning and 64K for multimodal inputs. Thinking mode is on by default.
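The layer counts and head counts above can be checked with a few lines of Python. This is a minimal sketch, not the model's actual code: the constant names are hypothetical, and placing the full-attention layer last in each group of four is an assumption (the text only says every fourth layer is full-attention).

```python
# Illustrative only: counts come from the architecture notes above.
# Names are hypothetical, and the position of the full-attention layer
# inside each group of four is an assumption.
NUM_LAYERS = 40               # 30 linear-attention + 10 full-attention
FULL_ATTENTION_INTERVAL = 4   # every fourth layer is full-attention
NUM_QUERY_HEADS = 16
NUM_KV_HEADS = 2


def layer_types(num_layers: int, interval: int) -> list[str]:
    """Return 'full' for every `interval`-th layer and 'linear' otherwise."""
    return [
        "full" if (i + 1) % interval == 0 else "linear"
        for i in range(num_layers)
    ]


layout = layer_types(NUM_LAYERS, FULL_ATTENTION_INTERVAL)
num_full = layout.count("full")
num_linear = layout.count("linear")

# With grouped-query attention, each KV head is shared by several query heads.
queries_per_kv_head = NUM_QUERY_HEADS // NUM_KV_HEADS

print(num_linear, num_full, queries_per_kv_head)  # prints: 30 10 8
```

Under these assumptions the stack has 30 linear and 10 full-attention layers, and each of the 2 KV heads serves 8 query heads.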

Reported benchmarks (HF model card): SWE-bench Resolved 64, MMLU-Pro 88, MathArena HMMT Feb 2026 87.31, HLE 21.94, MMMU-Pro Vision 76.88, WildClawBench 39.2. First open-source model reported to do material crystal-structure generation. Apache 2.0.

Released as a "Preview" weights drop on May 22, 2026. No companion technical report had appeared on arXiv as of June 2026; the Intern-S2 paper is presumably forthcoming alongside the non-preview release.

Model Details

Architecture: MoE
Parameters: 36B
Experts: 256
Context window: 131,072 tokens
License: Apache 2.0
Base model: Qwen3.5
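The context figures on this page relate by simple arithmetic: 131,072 is 128 × 1024, and the native 262,144 position limit is twice that. The sketch below shows one way a caller might clamp a requested context length to the model card's recommended caps. It assumes "K" means 1024 tokens (consistent with the 131,072 figure); the function and dictionary names are hypothetical and not part of any released API.

```python
# Illustrative only: names are hypothetical; "K" is assumed to mean 1024.
NATIVE_MAX_POSITIONS = 262_144

RECOMMENDED_MAX_TOKENS = {
    "text": 128 * 1024,        # 131,072 tokens for text reasoning
    "multimodal": 64 * 1024,   # 65,536 tokens for multimodal inputs
}


def clamp_context(requested_tokens: int, mode: str) -> int:
    """Clamp a requested context length to the recommended cap for `mode`."""
    if requested_tokens < 0:
        raise ValueError("requested_tokens must be non-negative")
    if mode not in RECOMMENDED_MAX_TOKENS:
        raise ValueError(f"unknown mode: {mode!r}")
    return min(requested_tokens, RECOMMENDED_MAX_TOKENS[mode])


print(clamp_context(200_000, "text"))        # prints: 131072
print(clamp_context(200_000, "multimodal"))  # prints: 65536
print(clamp_context(8_000, "text"))          # prints: 8000
```

Both recommended caps sit below the native position limit, so a request that respects them never exceeds what the position embeddings support.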

Benchmark Scores

Benchmark	Score	Mode
SWE-bench Resolved	64	—
MMLU-Pro	88	—
MathArena HMMT Feb 2026	87.31	—
HLE	21.94	—
MMMU-Pro Vision	76.88	—
WildClawBench	39.2	—

Tags: frontier, science, multimodal, moe, open-weight, reasoning
