Mechanical Dreams

By: Mechanical Dirk

Language: en-us

Categories: Science, Natural

An automatically generated podcast about machine learning and natural language processing. The two fictional hosts talk about papers that I want to learn more about on my way to work. It's not good, but it's useful.

Episodes

From Entropy to Epiplexity: Rethinking Information for Computationally Bounded Intelligence
Jan 09, 2026

In this episode:
• Introduction: Is Shannon Information Theory Broken?: Professor Norris and Linda introduce the episode, with Norris expressing skepticism about challenging the foundations of information theory. Linda introduces the paper 'From Entropy to Epiplexity' and the premise that traditional theory fails to account for computational bounds.
• The Paradox of Deterministic Creation: The hosts discuss the first major paradox: how deterministic processes like AlphaZero or synthetic data generation seem to create new knowledge, despite the Data Processing Inequality suggesting otherwise (the classical statement is given below). Linda explains how cryptographic pseudorandomness shows that 'computational difficulty' can look just like entropy.
• Defining Epiplexity and Time-Bounded Entropy: Linda...
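
For context on what is being challenged: under classical (Shannon) information theory, post-processing can never create information. For a Markov chain $X \to Y \to Z$ the Data Processing Inequality gives the first bound below, and for any deterministic map $f$ the second is a direct corollary:

    I(X; Z) \le I(X; Y), \qquad H(f(X)) \le H(X).

The episode's premise, as summarized above, is that these quantities say nothing about how much computation an observer needs, which is the gap the paper's time-bounded notions are meant to address.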

Duration: 00:19:55
Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration
Jan 08, 2026

In this episode:
• Introduction: The Alchemy of Training: Professor Norris and Linda introduce the episode, joking about the 'black art' of hyperparameter tuning before unveiling the paper of the week: 'Completed Hyperparameter Transfer' by researchers at Apple.
• Beyond Width: The Limits of muP: Linda explains the background of the Maximal Update Parametrization (muP) and why transferring hyperparameters only across model width isn't enough for modern LLMs (a toy sketch of the width-scaling idea follows below), prompting skepticism from Norris about adding more complexity.
• Enter Complete(d)P: A Unified Theory: The hosts dive into the core contribution: the Complete(d)P parameterization, discussing how it fixes issues with Q...
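
As a rough sketch of the muP background (the rules below are the commonly quoted width-scaling heuristics, written as an illustrative helper; they are assumptions for illustration and not the paper's Complete(d)P rules, which extend transfer to modules, depth, batch size and duration):

import math

def mup_scaled_hparams(base_width: int, width: int, base_lr: float) -> dict:
    """Hypothetical helper: rescale hyperparameters tuned at base_width so they
    can be reused at a larger width (muP-style rules, assumed for illustration)."""
    ratio = base_width / width
    return {
        # Hidden weight matrices: init std shrinks like 1/sqrt(fan_in) and the
        # Adam learning rate shrinks like 1/width.
        "hidden_init_std": 1.0 / math.sqrt(width),
        "hidden_lr": base_lr * ratio,
        # Readout layer: scale logits down with width so they stay O(1).
        "output_multiplier": ratio,
        # Embedding-like layers keep the tuned base value in this sketch.
        "embedding_lr": base_lr,
    }

# Tune on a narrow proxy (width 256), then reuse the settings at width 4096.
print(mup_scaled_hparams(base_width=256, width=4096, base_lr=3e-3))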

Duration: 00:19:42
NorMuon: Making Muon more efficient and scalable
Jan 07, 2026

In this episode:
• Introduction: The Optimizer Menagerie: Professor Norris and Linda kick off the episode by discussing the explosion of new optimizers in the LLM space. Linda introduces 'NorMuon,' a paper from Georgia Tech and Microsoft that attempts to bridge the gap between the industry standard, AdamW, and the geometric newcomer, Muon.
• The Geometry Problem: Why Adam and Muon Fall Short: Linda explains the fundamental trade-off: Adam handles coordinate-wise scaling well but ignores matrix geometry, while Muon fixes the geometry via orthogonalization but suffers from imbalanced update norms across neurons (a toy illustration follows below). Norris challenges the necessity of fixing Muon, prom...
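
A toy way to see the imbalance Linda describes (an illustration of the symptom only, not NorMuon's remedy): orthogonalize a tall momentum matrix, as Muon effectively does, and inspect the per-row, i.e. per-neuron, norms of the result.

import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(1024, 256))   # tall "momentum" matrix: one row per output neuron

# Muon-style orthogonalization: replace M with its polar factor U @ Vt, i.e. snap
# every singular value to 1. (Muon approximates this with Newton-Schulz iterations;
# the exact SVD is used here only for clarity.)
U, _, Vt = np.linalg.svd(M, full_matrices=False)
O = U @ Vt

row_norms = np.linalg.norm(O, axis=1)
print(f"per-neuron update norms: min={row_norms.min():.3f} "
      f"max={row_norms.max():.3f} mean={row_norms.mean():.3f}")
# Rows of a tall semi-orthogonal matrix do not share a common norm, which is the
# per-neuron imbalance described above.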

Duration: 00:19:08
Dion: Distributed Orthonormalized Updates
Jan 06, 2026

In this episode:
• The GPU Bill Blues: Professor Norris laments the exorbitant cost of training large models, setting the stage for Linda to introduce the episode's focus: 'Dion: Distributed Orthonormalized Updates' by researchers from Microsoft and Harvard.
• Muon's Heavy Lifting: Linda explains Dion's predecessor, the Muon optimizer, and the benefits of its orthonormalized updates. Norris questions why a new method is needed, leading to a discussion of how Newton-Schulz iterations become a communication bottleneck in sharded distributed training.
• Rethinking Linear Algebra: Linda details Dion's core innovation: replacing full matrix reconstruction with amortized power iteration on a momentum buffer (sketched below). Norris is skepti...
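
For intuition about the power-iteration ingredient, here is the textbook standalone version (Dion's amortized, sharded variant operating on the momentum buffer is not reproduced here); it needs only matrix-vector products, which is part of why low-rank approaches are attractive when the full matrix is sharded.

import numpy as np

def top_singular_direction(M, num_iters=10, seed=0):
    """Textbook power iteration: approximate the top singular triple (u, sigma, v)
    of M using only matrix-vector products."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=M.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        u = M @ v
        u /= np.linalg.norm(u)
        v = M.T @ u
        sigma = np.linalg.norm(v)
        v /= sigma
    return u, sigma, v

# Toy usage: treat a random matrix as the momentum buffer and form a rank-1,
# unit-spectral-norm update from its dominant direction.
momentum = np.random.default_rng(1).normal(size=(512, 128))
u, sigma, v = top_singular_direction(momentum)
update = np.outer(u, v)
print(f"estimated top singular value: {sigma:.2f}, "
      f"spectral norm of update: {np.linalg.norm(update, 2):.2f}")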

Duration: 00:18:40
How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining
Jan 06, 2026

In this episode:
• Dessert Before Vegetables?: Professor Norris and Linda introduce the concept of Curriculum Learning in LLMs and discuss why the intuitive idea of saving the best data for last has historically failed to produce significant results.
• The Invisible Antagonist: Learning Rate Decay: Linda reveals the paper's core insight: standard learning rate schedules decay to near zero just as the high-quality data arrives, effectively wasting the most valuable training tokens (see the schedule sketch below).
• Signal, Noise, and the River Valley: The hosts discuss the theoretical mechanism, using a 'river valley' analogy to explain how high-quality data provides a strong signal...
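
To make the timing problem concrete, here is a minimal sketch of a standard warmup-plus-cosine schedule (the peak learning rate, warmup fraction, and step counts are arbitrary placeholders, not the paper's configuration):

import math

def cosine_lr(step, total_steps, peak_lr=3e-4, warmup_frac=0.01):
    """Linear warmup followed by cosine decay to zero (placeholder values)."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

total = 100_000
for frac in (0.5, 0.9, 0.99):   # e.g. a curriculum that saves its best data for the end
    print(f"{frac:.0%} of training: lr = {cosine_lr(int(frac * total), total):.2e}")
# In the final few percent of training the learning rate sits orders of magnitude
# below its peak, so the highest-quality tokens barely move the weights.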

Duration: 00:20:01
Latent State Models of Training Dynamics
Oct 28, 2025

Duration: 00:12:25
DeepSeek OCR
Oct 24, 2025

Duration: 00:13:59
The Coverage Principle - How Pre-training Enables Post-Training
Oct 23, 2025

Duration: 00:15:18
Continual Learning via Sparse Memory Finetuning
Oct 22, 2025

Duration: 00:13:33
Untitled Episode
Oct 10, 2025

Duration: 00:11:56
Characterization and Mitigation of Training Instabilities in Microscaling Formats
Oct 08, 2025

Duration: 00:13:44
Demystifying Synthetic Data in LLM Pre-training: A Systematic Study of Scaling Laws, Benefits, and Pitfalls
Oct 07, 2025

Duration: 00:14:19
Data Mixing Can Induce Phase Transitions in Knowledge Acquisition
Oct 06, 2025

Duration: 00:13:09
Apertus Tech Report
Sep 21, 2025

Duration: 00:13:10
Learning Facts at Scale with Active Reading
Sep 20, 2025

Duration: 00:14:46
Fantastic Pretraining Optimizers and Where to Find Them
Sep 19, 2025

Duration: 00:13:50
Benchmarking Optimizers for Large Language Model Pretraining
Sep 19, 2025

Duration: 00:16:17
Fantastic Pretraining Optimizers and Where to Find Them
Sep 18, 2025

Duration: 00:14:29
The Pitfalls of Next-Token Prediction
Sep 11, 2025

Duration: 00:10:57
Large Language Models and Games
Sep 09, 2025

Duration: 00:17:01
UQ - Assessing Language Models on Unsolved Questions
Sep 05, 2025

Duration: 00:13:35
Signal and Noise - A Framework for Reducing Uncertainty in Language Model Evaluation
Aug 19, 2025

Duration: 00:14:19
Using Scaling Laws for Data Source Utility Estimation in Domain-Specific Pre-Training
Aug 18, 2025

Duration: 00:14:42
Thinking Like Transformers
Jul 29, 2025

Duration: 00:13:19
Kimi K2
Jul 28, 2025

Duration: 00:13:28
ERNIE Technical Report
Jul 25, 2025

Duration: 00:11:16
Dynamic Chunking for End-to-End Hierarchical Sequence Modeling
Jul 24, 2025

Duration: 00:14:30
Gemini 2.5
Jul 04, 2025

Duration: 00:11:11
How new data permeates LLM knowledge and how to dilute it
Jun 18, 2025

Duration: 00:11:17
Harnessing the Universal Geometry of Embeddings
Jun 17, 2025

Duration: 00:12:23
Model Merging in Pre-training of Large Language Models
Jun 16, 2025

Duration: 00:10:56
Learning Dynamics in Continual Pre-Training for Large Language Models
Jun 13, 2025

Duration: 00:11:10
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs
Jun 12, 2025

Duration: 00:11:49
Scalable-Softmax Is Superior for Attention
Jun 11, 2025

Duration: 00:10:20
Breast Cancer Recurrence Prediction
Jun 06, 2025

Duration: 00:10:20
Shareable artificial intelligence to extract cancer outcomes from electronic health records for precision oncology research
Jun 04, 2025

Duration: 00:10:27
Native Sparse Attention
Jun 04, 2025

Duration: 00:11:40
Critical Batch Size Revisited
Jun 03, 2025

Duration: 00:10:57
Rope to Nope and Back Again
May 17, 2025

Duration: 00:12:10
Base of RoPE Bounds Context Length
May 17, 2025

Duration: 00:11:02
Training Deep Learning Models with Norm-Constrained LMOs
May 15, 2025

Duration: 00:11:09
SkyLadder
May 09, 2025

Duration: 00:12:25
LLMs on the Line
May 07, 2025

Duration: 00:09:30
The Leaderboard Illusion
Apr 30, 2025

Duration: 00:10:15
Why Linearly Decaying the Learning Rate to Zero Works Best
Apr 16, 2025

Duration: 00:09:01
Not All Data Are Unlearned Equally
Apr 15, 2025

Duration: 00:12:38
A Multi-Power Law for Loss Curve Prediction
Apr 14, 2025

Duration: 00:12:31
Efficient Training of Ultra-Long Context Large Language Models
Apr 11, 2025

Duration: 00:10:48
Multi-Token Attention
Apr 03, 2025

Duration: 00:15:04
From Style to Facts
Apr 02, 2025

Duration: 00:10:50
Compute Optimal Scaling of Skills
Mar 22, 2025

Duration: 00:09:09
Predictive Data Selection
Mar 15, 2025

Duration: 00:08:43
Continual Pre-training of MoEs
Mar 12, 2025

Duration: 00:10:41
s1 - Simple test-time scaling
Mar 06, 2025

Duration: 00:10:37
Cognitive Behaviors that Enable Self-Improving Reasoners
Mar 05, 2025

Duration: 00:07:20
Phi 4 Multimodal Instruct
Mar 04, 2025

Duration: 00:11:26
Claude 3.7 Sonnet System Card
Feb 25, 2025

Duration: 00:09:21
Project Sid: Many-agent simulations toward AI civilization
Feb 09, 2025

Duration: 00:10:16
The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training
Feb 09, 2025

Duration: 00:09:07
Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models
Feb 05, 2025

Duration: 00:08:17
NExtLong - Toward Effective Long-Context Training without Long Documents
Jan 30, 2025

Duration: 00:11:38
Over-Tokenized Transformer
Jan 30, 2025

Duration: 00:10:55
Optimizing Pretraining Data Mixtures with LLM-Estimated Utility
Jan 30, 2025

Duration: 00:12:36
HashAttention: Semantic Sparsity for Faster Inference
Jan 16, 2025

Duration: 00:11:04
From Tokens to Words
Jan 15, 2025

Duration: 00:14:04
DeepSeek V3
Jan 07, 2025

Duration: 00:11:11
Optimal Linear Decay Learning Rate Schedules and Further Refinements
Jan 05, 2025

Duration: 00:11:03
Normalization Layer Per-Example Gradients are Sufficient to Predict Gradient Noise Scale in Transformers
Dec 20, 2024

Duration: 00:09:34
Efficient and Approximate Per-Example Gradient Norms for Gradient Noise Scale
Dec 20, 2024

Duration: 00:11:48
Phi-4
Dec 14, 2024

Duration: 00:09:04
Rephrasing natural text data with different languages and quality levels
Dec 13, 2024

Duration: 00:11:02
Unveiling and Consulting Core Experts in Retrieval-Augmented MoE-based LLMs
Dec 12, 2024

Duration: 00:11:58
EXAONE 3.5
Dec 11, 2024

Duration: 00:08:58
Model soups - averaging weights of multiple fine-tuned models improves accuracy without increasing inference time
Dec 09, 2024

Duration: 00:06:14
Model-Aware Data Selection for Efficient Pretraining with Data Influence Models
Dec 06, 2024

Duration: 00:13:13
Nemotron-CC
Dec 05, 2024

Duration: 00:12:39
Tülu 3
Dec 02, 2024

Duration: 00:12:03
The Zamba2 Suite
Nov 29, 2024

Duration: 00:13:03
Small-scale proxies for large-scale Transformer training instabilities
Nov 26, 2024

Duration: 00:10:07
Hardware Scaling Trends and Diminishing Returns in Large-Scale Distributed Training
Nov 25, 2024

I slightly tweaked the personality of the hosts.

Duration: 00:10:28
Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations
Nov 19, 2024

Duration: 00:08:33
Understanding WSD Learning Rates
Nov 18, 2024

Duration: 00:09:10
Toward Understanding Why Adam Converges Faster Than SGD for Transformers
Nov 16, 2024

New generation algorithm! Should make the episodes longer, more detailed, and more coherent.

Duration: 00:07:29
Amuro & Char - Analyzing the Relationship between Pre-Training and Fine-Tuning of Large Language Models
Nov 08, 2024

Duration: 00:13:36
Evaluation data contamination in LLMs: How do we measure it and (when) does it matter?
Nov 07, 2024

Duration: 00:06:27
How Does Critical Batch Size Scale in Pre-training?
Nov 04, 2024

Duration: 00:07:40
The Road Less Scheduled
Nov 01, 2024

Duration: 00:08:53
Learning-Rate-Free Learning by D-Adaptation
Oct 31, 2024

Duration: 00:04:37
Scaling FP8 Training to Trillion Token LLMs
Oct 30, 2024

Duration: 00:09:54
Geometric Dynamics of Signal Propagation Predict Trainability of Transformers
Oct 29, 2024

Duration: 00:15:43
A Survey on Model MoErging
Oct 28, 2024

Duration: 00:08:58
Liquid Time-constant Networks
Oct 27, 2024

Duration: 00:08:11
Scaling Laws for Predicting Downstream Performance in LLMs
Oct 26, 2024

Duration: 00:10:07
A Spectral Condition for Feature Learning
Oct 25, 2024

Duration: 00:16:52
Don't decay the learning rate
Oct 24, 2024

A classic paper about learning rates.

Duration: 00:07:07
OLMoE
Oct 23, 2024

Professor Norris: Welcome back to Mechanical Dreams, the podcast where we delve into the exciting world of machine learning and natural language processing. I'm Professor Norris, and as always, I'm joined by my brilliant student, Linda.

Linda: It's great to be back, Professor. And I'm particularly excited about today's paper. It tackles a topic that's been buzzing in the NLP community: Mixture-of-Experts models, or MoEs for short.

Professor Norris: Ah yes, MoEs. I remember when they were a promising but somewhat fringe concept. It seems they're making a comeback, especially with industry giants like Google...

Duration: 00:06:59
An Empirical Model of Large Batch Training
Oct 23, 2024

First attempt to automatically generate a podcast from a paper. This one is way too short, but it's a start.

Duration: 00:11:32