Mechanical Dreams
By: Mechanical Dirk
Language: en-us
Categories: Science, Natural
An automatically generated podcast about machine learning and natural language processing. The two fictional hosts talk about papers that I want to learn more about on my way to work. It's not good, but it's useful.
Episodes
From Entropy to Epiplexity - Rethinking Information for Computationally Bounded Intelligence
Jan 09, 2026
In this episode:
• Introduction: Is Shannon Information Theory Broken?: Professor Norris and Linda introduce the episode, with Norris expressing skepticism about challenging the foundations of information theory. Linda introduces the paper 'From Entropy to Epiplexity' and the premise that traditional theory fails to account for computational bounds.
• The Paradox of Deterministic Creation: The hosts discuss the first major paradox: how deterministic processes like AlphaZero or synthetic data generation seem to create new knowledge, despite the Data Processing Inequality (stated below) suggesting otherwise. Linda explains how cryptographic pseudorandomness shows that computational difficulty can look like entropy.
• Defining Epiplexity and Time-Bounded Entropy: Linda...
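A note on the inequality invoked above: the Data Processing Inequality is a standard information-theoretic fact, stated here in textbook form for readers following along (this is the generic statement, not notation from the epiplexity paper):

```latex
% Data Processing Inequality: post-processing cannot create information.
% For a Markov chain X -> Y -> Z (Z depends on X only through Y):
\[
  X \to Y \to Z \quad \Longrightarrow \quad I(X;Z) \le I(X;Y),
\]
% and in particular, for any deterministic function f,
\[
  H\bigl(f(X)\bigr) \le H(X).
\]
```

Under these classical bounds a deterministic generator (self-play, synthetic-data pipelines) cannot increase entropy, which is exactly the paradox the hosts set up before introducing time-bounded notions.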
Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration
Jan 08, 2026
In this episode:
• Introduction: The Alchemy of Training: Professor Norris and Linda introduce the episode, joking about the 'black art' of hyperparameter tuning before unveiling the paper of the week: 'Completed Hyperparameter Transfer' by researchers at Apple.
• Beyond Width: The Limits of muP: Linda explains the background of the Maximal Update Parametrization (muP) and why scaling only across model width isn't enough for modern LLMs (a width-scaling sketch follows this episode's summary), prompting skepticism from Norris about adding more complexity.
• Enter Complete(d)P: A Unified Theory: The hosts dive into the core contribution: the Complete(d)P parameterization, discussing how it fixes issues with Q...
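For context on the baseline being extended, here is a minimal sketch of muP-style width scaling under an Adam-like optimizer. The grouping and the 1/m rule below are the commonly quoted recipe for width-only transfer, not the Complete(d)P parameterization from the Apple paper, and the exact treatment of readout layers varies between implementations.

```python
# Sketch of muP-style hyperparameter transfer across width only.
# Assumption: Adam-like optimizer; "matrix-like" hidden weights get their
# learning rate shrunk by the width multiplier, "vector-like" params do not.

def mup_width_scaled_lrs(base_lr: float, base_width: int, target_width: int) -> dict:
    m = target_width / base_width          # width multiplier vs. the tuned proxy model
    return {
        "embeddings_and_biases": base_lr,  # vector-like parameters: unchanged
        "hidden_matrices": base_lr / m,    # matrix-like hidden weights: scale as 1/m
        # Readout layers are handled differently across implementations,
        # e.g. via a 1/m output multiplier rather than a learning-rate change.
    }

# Tune on a width-256 proxy, then transfer to width 4096 without re-tuning:
print(mup_width_scaled_lrs(base_lr=3e-3, base_width=256, target_width=4096))
# {'embeddings_and_biases': 0.003, 'hidden_matrices': 0.0001875}
```

Complete(d)P's contribution, as discussed in the episode, is extending this kind of transfer beyond width to modules, depth, batch size and training duration.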
NorMuon - Making Muon more efficient and scalable
Jan 07, 2026
In this episode:
• Introduction: The Optimizer Menagerie: Professor Norris and Linda kick off the episode by discussing the explosion of new optimizers in the LLM space. Linda introduces 'NorMuon,' a paper from Georgia Tech and Microsoft that attempts to bridge the gap between the industry standard, AdamW, and the geometric newcomer, Muon.
• The Geometry Problem: Why Adam and Muon Fall Short: Linda explains the fundamental trade-off: Adam handles coordinate-wise scaling well but ignores matrix geometry, while Muon fixes the geometry via orthogonalization (sketched below) but suffers from imbalanced update norms across neurons. Norris challenges the necessity of fixing Muon, prom...
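Since the trade-off above hinges on Muon's orthogonalization step, here is a minimal sketch of the Newton-Schulz iteration that Muon-family optimizers apply to a (momentum-averaged) gradient matrix. The plain cubic iteration and step count below are illustrative; Muon uses a tuned quintic polynomial, and the neuron-wise normalization NorMuon adds on top is not shown.

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 20) -> torch.Tensor:
    """Push a 2-D gradient/momentum matrix toward its polar factor U V^T,
    i.e. an update whose singular values are all ~1 (the Muon 'geometry fix').

    Plain cubic Newton-Schulz for clarity; it needs more steps than the
    tuned quintic used in practice.
    """
    x = g / (g.norm() + 1e-7)                # Frobenius-normalize: singular values in (0, 1]
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x    # each step pulls singular values toward 1
    return x

# The orthogonalized update has roughly uniform singular values:
g = torch.randn(512, 256)
u = newton_schulz_orthogonalize(g)
print(torch.linalg.svdvals(u)[:3], torch.linalg.svdvals(u)[-3:])
```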
Dion - Distributed Orthonormalized Updates
Jan 06, 2026
In this episode:
• The GPU Bill Blues: Professor Norris laments the exorbitant cost of training large models, setting the stage for Linda to introduce the episode's focus: 'Dion: Distributed Orthonormalized Updates' by researchers from Microsoft and Harvard.
• Muon's Heavy Lifting: Linda explains the predecessor, the Muon optimizer, and its orthonormalization benefits. Norris questions why a new method is needed, leading to a discussion on how Newton-Schulz iterations become a communication bottleneck in sharded distributed training.
• Rethinking Linear Algebra: Linda details Dion's core innovation: replacing full matrix reconstruction with amortized power iteration on a momentum buffer (sketched below). Norris is skepti...
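To make the contrast with Newton-Schulz concrete, below is a toy illustration of amortizing a power/subspace iteration over a momentum buffer: a low-rank basis is carried across optimizer steps and refined with a couple of cheap matmuls and thin QR factorizations. This is a simplified sketch of the idea as described in the episode, not Dion's actual update rule or its sharded, communication-efficient implementation.

```python
import torch

def refine_low_rank_basis(momentum: torch.Tensor, q: torch.Tensor, steps: int = 1):
    """Refine a rank-r orthonormal basis for the momentum matrix M with one (or
    a few) subspace/power-iteration sweeps; reusing q across optimizer steps is
    what amortizes the cost."""
    for _ in range(steps):
        p, _ = torch.linalg.qr(momentum @ q)    # (m, r) orthonormal left factor
        q, _ = torch.linalg.qr(momentum.T @ p)  # (n, r) orthonormal right factor
    return p, q

# Each optimizer step reuses the previous basis instead of re-orthogonalizing
# the full matrix from scratch.
m_buf = torch.randn(1024, 512)                  # stand-in momentum buffer
q = torch.linalg.qr(torch.randn(512, 32))[0]    # rank-32 starting basis
p, q = refine_low_rank_basis(m_buf, q)
```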
How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining
Jan 06, 2026
In this episode:
• Dessert Before Vegetables?: Professor Norris and Linda introduce the concept of Curriculum Learning in LLMs and discuss why the intuitive idea of saving the best data for last has historically failed to produce significant results.
• The Invisible Antagonist: Learning Rate Decay: Linda reveals the paper's core insight: standard learning rate schedules decay to near-zero just as the high-quality data arrives (illustrated below), effectively wasting the most valuable training tokens.
• Signal, Noise, and the River Valley: The hosts discuss the theoretical mechanism, using a 'river valley' analogy to explain how high-quality data provides a strong signal...
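A quick back-of-the-envelope check makes the clash between schedules and curricula concrete. With a standard cosine decay (illustrative numbers only; the paper's exact schedules and data splits may differ), the final 10% of training, where a curriculum would place its best data, sees only a few percent of the peak learning rate:

```python
import math

def cosine_lr(step: int, total_steps: int, peak_lr: float = 3e-4, min_lr: float = 0.0) -> float:
    """Standard cosine decay from peak_lr to min_lr (warmup omitted for brevity)."""
    t = step / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * t))

total = 100_000
start_of_last_10pct = int(0.9 * total)  # where a "best data last" curriculum kicks in
ratio = cosine_lr(start_of_last_10pct, total) / cosine_lr(0, total)
print(f"LR at 90% of training: {ratio:.1%} of peak")  # ~2.4% of peak
```

So the highest-quality tokens arrive exactly when the optimizer has almost stopped moving, which is the mechanism the hosts unpack with the 'river valley' analogy.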
Latent State Models of Training Dynamics
Oct 28, 2025
DeepSeek OCR
Oct 24, 2025
The Coverage Principle - How Pre-training Enables Post-Training
Oct 23, 2025
Continual Learning via Sparse Memory Finetuning
Oct 22, 2025
Untitled Episode
Oct 10, 2025
Characterization and Mitigation of Training Instabilities in Microscaling Formats
Oct 08, 2025
Demystifying Synthetic Data in LLM Pre-training - A Systematic Study of Scaling Laws, Benefits, and Pitfalls
Oct 07, 2025
Data Mixing Can Induce Phase Transitions in Knowledge Acquisition
Oct 06, 2025
Apertus Tech Report
Sep 21, 2025
Learning Facts at Scale with Active Reading
Sep 20, 2025
Fantastic Pretraining Optimizers and Where to Find Them
Sep 19, 2025
Benchmarking Optimizers for Large Language Model Pretraining
Sep 19, 2025
Fantastic Pretraining Optimizers and Where to Find Them
Sep 18, 2025
The Pitfalls of Next-Token Prediction
Sep 11, 2025
Large Language Models and Games
Sep 09, 2025
UQ - Assessing Language Models on Unsolved Questions
Sep 05, 2025
Signal and Noise - A Framework for Reducing Uncertainty in Language Model Evaluation
Aug 19, 2025
Using Scaling Laws for Data Source Utility Estimation in Domain-Specific Pre-Training
Aug 18, 2025
Thinking Like Transformers
Jul 29, 2025
Kimi K2
Jul 28, 2025
ERNIE Technical Report
Jul 25, 2025
Dynamic Chunking for End-to-End Hierarchical Sequence Modeling
Jul 24, 2025
Gemini 2.5
Jul 04, 2025
How new data permeates LLM knowledge and how to dilute it
Jun 18, 2025
Harnessing the Universal Geometry of Embeddings
Jun 17, 2025
Model Merging in Pre-training of Large Language Models
Jun 16, 2025
Learning Dynamics in Continual Pre-Training for Large Language Models
Jun 13, 2025
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs
Jun 12, 2025
Scalable-Softmax Is Superior for Attention
Jun 11, 2025
Breast Cancer Recurrence Prediction
Jun 06, 2025
Shareable artificial intelligence to extract cancer outcomes from electronic health records for precision oncology research
Jun 04, 2025
Native Sparse Attention
Jun 04, 2025
Critical Batch Size Revisited
Jun 03, 2025
Rope to Nope and Back Again
May 17, 2025
Base of RoPE Bounds Context Length
May 17, 2025
Training Deep Learning Models with Norm-Constrained LMOs
May 15, 2025
SkyLadder
May 09, 2025
LLMs on the Line
May 07, 2025
The Leaderboard Illusion
Apr 30, 2025
Why Linearly Decaying the Learning Rate to Zero Works Best
Apr 16, 2025
Not All Data Are Unlearned Equally
Apr 15, 2025
A Multi-Power Law for Loss Curve Prediction
Apr 14, 2025
Efficient Training of Ultra-Long Context Large Language Models
Apr 11, 2025
Multi-Token Attention
Apr 03, 2025
From Style to Facts
Apr 02, 2025
Compute Optimal Scaling of Skills
Mar 22, 2025
Predictive Data Selection
Mar 15, 2025
Continual Pre-training of MoEs
Mar 12, 2025
s1 - Simple test-time scaling
Mar 06, 2025
Cognitive Behaviors that Enable Self-Improving Reasoners
Mar 05, 2025
Phi 4 Multimodal Instruct
Mar 04, 2025
Claude 3.7 Sonnet System Card
Feb 25, 2025
Project Sid: Many-agent simulations toward AI civilization
Feb 09, 2025
The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training
Feb 09, 2025
Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models
Feb 05, 2025
NExtLong - Toward Effective Long-Context Training without Long Documents
Jan 30, 2025
Over-Tokenized Transformer
Jan 30, 2025
Optimizing Pretraining Data Mixtures with LLM-Estimated Utility
Jan 30, 2025
HashAttention: Semantic Sparsity for Faster Inference
Jan 16, 2025
From Tokens to Words
Jan 15, 2025
DeepSeek V3
Jan 07, 2025
Optimal Linear Decay Learning Rate Schedules and Further Refinements
Jan 05, 2025
Normalization Layer Per-Example Gradients are Sufficient to Predict Gradient Noise Scale in Transformers
Dec 20, 2024
Efficient and Approximate Per-Example Gradient Norms for Gradient Noise Scale
Dec 20, 2024
Phi-4
Dec 14, 2024
Rephrasing natural text data with different languages and quality levels
Dec 13, 2024
Unveiling and Consulting Core Experts in Retrieval-Augmented MoE-based LLMs
Dec 12, 2024
EXAONE 3.5
Dec 11, 2024
Model soups - averaging weights of multiple fine-tuned models improves accuracy without increasing inference time
Dec 09, 2024
Model-Aware Data Selection for Efficient Pretraining with Data Influence Models
Dec 06, 2024
Nemotron-CC
Dec 05, 2024
Tülu 3
Dec 02, 2024
The Zamba2 Suite
Nov 29, 2024
Small-scale proxies for large-scale Transformer training instabilities
Nov 26, 2024
Hardware Scaling Trends and Diminishing Returns in Large-Scale Distributed Training
Nov 25, 2024
I slightly tweaked the personality of the hosts.
Duration: 00:10:28
Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations
Nov 19, 2024
Understanding WSD Learning Rates
Nov 18, 2024
Toward Understanding Why Adam Converges Faster Than SGD for Transformers
Nov 16, 2024
New generation algorithm! Should make the episodes longer, more detailed, and more coherent.
Duration: 00:07:29
Amuro & Char - Analyzing the Relationship between Pre-Training and Fine-Tuning of Large Language Models
Nov 08, 2024
Evaluation data contamination in LLMs: How do we measure it and (when) does it matter?
Nov 07, 2024
How Does Critical Batch Size Scale in Pre-training?
Nov 04, 2024
The Road Less Scheduled
Nov 01, 2024
Learning-Rate-Free Learning by D-Adaptation
Oct 31, 2024
Scaling FP8 Training to Trillion Token LLMs
Oct 30, 2024
Geometric Dynamics of Signal Propagation Predict Trainability of Transformers
Oct 29, 2024
A Survey on Model MoErging
Oct 28, 2024
Liquid Time-constant Networks
Oct 27, 2024
Scaling Laws for Predicting Downstream Performance in LLMs
Oct 26, 2024
A Spectral Condition for Feature Learning
Oct 25, 2024
Don't decay the learning rate
Oct 24, 2024
A classic paper about learning rates.
Duration: 00:07:07
OLMoE
Oct 23, 2024
professor norris: Welcome back to Mechanical Dreams, the podcast where we delve into the exciting world of machine learning and natural language processing. I'm Professor Norris, and as always, I'm joined by my brilliant student, Linda.
linda: It's great to be back, Professor. And I'm particularly excited about today's paper. It tackles a topic that's been buzzing in the NLP community: Mixture-of-Experts models, or MoEs for short.
professor norris: Ah yes, MoEs. I remember when they were a promising but somewhat fringe concept. It seems they're making a comeback, especially with industry giants like Google...
An Empirical Model of Large Batch Training
Oct 23, 2024
First attempt to automatically generate a podcast from a paper. This one is way too short, but it's a start.
Duration: 00:11:32