PLaMo-100B
A 100B-parameter dense Transformer that uses QK Normalization and Z-Loss for training stability. It was trained on 2T tokens (1.3T English, 0.7T Japanese) in two phases on NVIDIA H100 GPUs with FP8, funded under Japan's GENIAC/NEDO program.
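To make the two stability techniques concrete, here is a minimal PyTorch sketch of QK Normalization (RMS-normalizing queries and keys before the attention dot product) and the auxiliary z-loss (penalizing the log of the softmax normalizer, as in PaLM). The epsilon, norm placement, and z-loss coefficient are illustrative assumptions, not confirmed PLaMo-100B values.

```python
import torch
import torch.nn.functional as F

def qk_norm_attention(q, k, v, eps: float = 1e-6):
    """Scaled dot-product attention with QK Normalization: queries and
    keys are RMS-normalized along the head dimension before the dot
    product, which bounds the attention logits and stabilizes training."""
    q = q * torch.rsqrt(q.pow(2).mean(-1, keepdim=True) + eps)  # RMS-norm q
    k = k * torch.rsqrt(k.pow(2).mean(-1, keepdim=True) + eps)  # RMS-norm k
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

def z_loss(logits, coeff: float = 1e-4):
    """Auxiliary z-loss: penalizes (log Z)^2, where Z is the softmax
    normalizer, discouraging output logits from drifting to large values.
    The coefficient is a hypothetical choice for illustration."""
    log_z = torch.logsumexp(logits, dim=-1)  # log of the normalizer per position
    return coeff * (log_z ** 2).mean()
```

In training, the z-loss term would simply be added to the cross-entropy objective; its small coefficient keeps it from dominating the main loss.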
Outperforms GPT-4 on Japanese benchmarks: Jaster 0-shot average 0.738 (vs. GPT-4's 0.722) and 4-shot average 0.775 (vs. 0.772); Japanese MT-Bench score 7.78.
Model Details
Architecture: Dense
Parameters: 100B
Paper: arXiv:2410.07563