Intern-S2-Preview
Efficient scientific multimodal foundation model from InternLM, built by continued pretraining from Qwen3.5. Per the Hugging Face model card: 36B parameters (35B + utilities), positioned as comparable to the trillion-scale Intern-S1-Pro on core scientific tasks while using a fraction of the parameter budget.
Architecture (from the model's config.json): the text backbone is a Qwen3.5-MoE variant with 256 experts and a hybrid attention stack of 30 linear-attention layers and 10 full-attention layers, a 3:1 ratio in which every fourth layer is full-attention. The hidden size is 2048, and attention uses GQA with 16 query heads and 2 KV heads. The vision module is a 27-layer ViT-style encoder (1152 → 2048 hidden, patch size 16). The native maximum position embedding is 262,144 tokens; the model card caps recommended inference at 128K tokens for text reasoning and 64K for multimodal input. Thinking mode is on by default.
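The layer arithmetic above can be checked with a short sketch. It assumes the full-attention layer is the last one in each group of four, which is one reading of "every fourth layer". The constant and function names are hypothetical and are not taken from the model's config.json.

```python
# Illustrative sketch only: names are hypothetical, not fields from config.json.
NUM_LAYERS = 40               # 30 linear-attention + 10 full-attention layers
FULL_ATTENTION_INTERVAL = 4   # every fourth layer is full-attention
NUM_QUERY_HEADS = 16
NUM_KV_HEADS = 2


def layer_types(num_layers: int, interval: int) -> list[str]:
    """Return the attention type of each layer.

    Assumption: the full-attention layer closes each group of `interval`
    layers (0-based indices 3, 7, 11, ...).
    """
    return [
        "full_attention" if (i + 1) % interval == 0 else "linear_attention"
        for i in range(num_layers)
    ]


layout = layer_types(NUM_LAYERS, FULL_ATTENTION_INTERVAL)
num_full = layout.count("full_attention")
num_linear = layout.count("linear_attention")

# With GQA, each KV head is shared by a fixed number of query heads.
queries_per_kv_head = NUM_QUERY_HEADS // NUM_KV_HEADS

print(num_linear, num_full, queries_per_kv_head)  # prints: 30 10 8
```

Under these assumptions the stack has 30 linear-attention and 10 full-attention layers, matching the stated 3:1 ratio, and each KV head serves 8 query heads.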
Reported benchmarks (HF model card): SWE-bench Resolved 64, MMLU-Pro 88, MathArena HMMT Feb 2026 87.31, HLE 21.94, MMMU-Pro Vision 76.88, WildClawBench 39.2. Reported to be the first open-source model to do material crystal-structure generation. Licensed under Apache 2.0.
Released as a "Preview" weights drop on May 22, 2026. No companion technical report had appeared on arXiv as of June 2026; the Intern-S2 paper is presumably forthcoming alongside the non-preview release.
Model Details
Benchmark Scores
| Benchmark | Score | Mode |
|---|---|---|
| SWE-bench Resolved | 64 | — |
| MMLU-Pro | 88 | — |
| MathArena HMMT Feb 2026 | 87.31 | — |
| HLE | 21.94 | — |
| MMMU-Pro Vision | 76.88 | — |
| WildClawBench | 39.2 | — |