First Name Last Name

Researcher

Chemnitz, Germany

Experience

Jan. 2024 - Feb. 2025
1 year 2 months

Researcher

Moonshot AI

Led research on scaling up the Muon optimizer for large language model training. Key contributions:

  • Developed and analyzed techniques for effectively scaling Muon, including weight decay and per-parameter update-scale adjustments (sketched in code below)
  • Created an efficient distributed implementation with ZeRO-1-style partitioning of optimizer state, achieving optimal memory efficiency
  • Conducted scaling-law experiments demonstrating 2× computational efficiency vs. AdamW
  • Built and trained Moonlight, a 3B/16B-parameter MoE model trained on 5.7T tokens with Muon that achieved state-of-the-art performance
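
The two techniques named in the first bullet are simple to state in code. What follows is a minimal, hypothetical PyTorch sketch of a single Muon step, assuming the quintic Newton-Schulz coefficients from the public Muon reference implementation and a 0.2 * sqrt(max(fan_out, fan_in)) update scale as described in the summary below; the names muon_step and newton_schulz5 are illustrative, not the released API.

  import torch

  def newton_schulz5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
      # Approximate orthogonalization of G via a quintic Newton-Schulz
      # iteration (coefficients from the public Muon reference code).
      a, b, c = 3.4445, -4.7750, 2.0315
      X = G.bfloat16()
      X = X / (X.norm() + 1e-7)           # normalize so the iteration converges
      transposed = X.size(0) > X.size(1)
      if transposed:                      # iterate on the wide orientation
          X = X.T
      for _ in range(steps):
          A = X @ X.T
          X = a * X + (b * A + c * A @ A) @ X
      if transposed:
          X = X.T
      return X.to(G.dtype)

  def muon_step(param, grad, momentum, lr=0.02, mu=0.95, weight_decay=0.1):
      # One Muon update on a 2D weight matrix with the two scaling fixes:
      # decoupled (AdamW-style) weight decay and a per-parameter scale of
      # 0.2 * sqrt(max(fan_out, fan_in)) to match AdamW's typical update RMS.
      momentum.mul_(mu).add_(grad)        # SGD-style momentum buffer
      update = newton_schulz5(momentum)   # orthogonalized update direction
      scale = 0.2 * max(param.size(0), param.size(1)) ** 0.5
      param.mul_(1 - lr * weight_decay)   # decoupled weight decay
      param.add_(update, alpha=-lr * scale)
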
Jan. 2024 - Feb. 2025
1 year 2 months

Researcher

UCLA

Contributed to research on large-scale optimization techniques for language models, collaborating with the Moonshot AI team on the Muon scaling project.

June 2020 - Aug. 2020
3 months
Nazareth, Israel
Remote

Jesus

SAP S/4HANA, UX/UI design, purchasing and sales, as well as healing patients

Summary

Recently, the Muon optimizer, based on matrix orthogonalization, has demonstrated strong results in training small-scale language models, but its scalability to larger models had not been proven. We identify two crucial techniques for scaling up Muon: adding weight decay and carefully adjusting the per-parameter update scale. These techniques allow Muon to work out of the box in large-scale training without the need for hyperparameter tuning. Scaling-law experiments indicate that Muon achieves 2× computational efficiency compared to AdamW under compute-optimal training. Based on these improvements, we introduce Moonlight, a 3B/16B-parameter Mixture-of-Experts (MoE) model trained on 5.7T tokens using Muon. Our model improves the current Pareto frontier, achieving better performance with far fewer training FLOPs than prior models. We open-source our distributed Muon implementation, which is memory-optimal and communication-efficient. We also release the pretrained, instruction-tuned, and intermediate checkpoints to support future research.
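
To illustrate the memory-optimal, communication-efficient distributed claim, here is a hypothetical ZeRO-1-flavored sketch that partitions optimizer state across the data-parallel group. It reuses newton_schulz5 from the sketch above, assumes torch.distributed is already initialized, and shards at whole-matrix granularity so the Newton-Schulz iteration never sees a partial matrix; the released implementation may shard differently, and distributed_muon_step is an illustrative name.

  import torch.distributed as dist

  def distributed_muon_step(params, grads, momenta, lr=0.02, mu=0.95, wd=0.1):
      # `momenta` holds buffers only for the matrices this rank owns (keyed
      # by parameter index): the ZeRO-1-style saving is that optimizer state
      # is partitioned across ranks rather than replicated on every rank.
      rank, world = dist.get_rank(), dist.get_world_size()
      for i, (p, g) in enumerate(zip(params, grads)):
          owner = i % world                 # round-robin ownership of matrices
          if rank == owner:
              m = momenta[i]
              m.mul_(mu).add_(g)
              update = newton_schulz5(m)
              scale = 0.2 * max(p.size(0), p.size(1)) ** 0.5
              p.mul_(1 - lr * wd).add_(update, alpha=-lr * scale)
          dist.broadcast(p, src=owner)      # resync updated weights everywhere

Per step, each rank runs Newton-Schulz only for the matrices it owns and issues one broadcast per matrix, which is the memory/communication trade-off the summary alludes to.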

Languages

English
Native or bilingual proficiency
Chinese
Full professional proficiency

Education

UCLA

Los Angeles, United States