CoKE: Context as the Key to Biomolecular Understanding
Identifies a fundamental tokenization dilemma in Scientific LLMs processing biomolecular sequences: sequence-as-language approaches (e.g., NatureLM, Intern-S1) atomize sequences into disconnected tokens that destroy functional motifs, while sequence-as-modality approaches (e.g., Evolla) lose fine-grained information through Q-Former alignment bottlenecks, even failing to preserve sensitivity to single point mutations.
Proposes a context-driven paradigm (CoKE): instead of feeding raw sequences to LLMs, run established bioinformatics tools (InterProScan, BLASTp, ProTrek) to produce structured textual descriptions of protein domains, functions, and annotations, and feed those descriptions to the LLM instead. Context-only input consistently outperforms raw sequences across every model tested, specialized Sci-LLMs and general-purpose models alike. Gemini 2.5 Pro with context-only input (87.2%) outperforms every specialized Sci-LLM, and adding raw sequences alongside the context actually degrades performance. This reframes the field: LLMs should reason over structured biological knowledge extracted by traditional tools, not learn biology from raw sequence syntax.
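The context-driven pipeline can be sketched as follows. This is a minimal illustration, not the paper's implementation: the tool outputs are mocked (real runs would invoke InterProScan, BLASTp, and ProTrek), and all function names, field names, and annotation values are hypothetical.

```python
# Sketch of CoKE-style context construction: instead of passing the raw
# amino-acid sequence to the LLM, pass structured textual annotations
# produced by bioinformatics tools. All tool outputs below are mocked.

def build_context(annotations: dict) -> str:
    """Render mocked tool annotations as a structured textual LLM context."""
    lines = []
    for dom in annotations.get("interproscan", []):
        lines.append(f"Domain: {dom['name']} ({dom['accession']}), "
                     f"residues {dom['start']}-{dom['end']}")
    for hit in annotations.get("blastp", []):
        lines.append(f"Homolog: {hit['description']} "
                     f"(identity {hit['identity']:.0%})")
    for desc in annotations.get("protrek", []):
        lines.append(f"Predicted function: {desc}")
    return "\n".join(lines)

# Hypothetical annotations for a kinase-like protein (illustrative values).
annotations = {
    "interproscan": [{"name": "Protein kinase domain",
                      "accession": "IPR000719", "start": 15, "end": 270}],
    "blastp": [{"description": "Serine/threonine-protein kinase",
                "identity": 0.92}],
    "protrek": ["ATP binding; phosphotransferase activity"],
}

context = build_context(annotations)
# Context-only prompt: the raw sequence is deliberately omitted.
prompt = (f"Protein annotations:\n{context}\n\n"
          f"Question: What is this protein's likely function?")
print(prompt)
```

The design point is that the LLM sees only tool-derived text, which preserves motif- and domain-level semantics that per-residue tokenization would fragment.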
arXiv: 2510.23127