Daily Paper Cast
By: Jingwen Liang, Gengyu Wang
Language: en
Categories: Science, Technology
We update every weekday to discuss the highest-voted papers from Hugging Face Daily Papers (https://huggingface.co/papers). Both the podcast scripts and audio are generated by AI. Feedback and suggestions are welcome! Email us: dailypapercast.ai@gmail.com
Creators: Jingwen Liang, 3D ML, https://www.linkedin.com/in/jingwen-liang/; Gengyu Wang, LLM ML, http://wanggengyu.com
Listen on:
Spotify: https://open.spotify.com/show/21nrhmdaA8qoBiH8q03NXL
Apple Podcast: https://podcasts.apple.com/us/podcast/daily-paper-cast/id1777620236
Cover image by Kawen Kuang, https://kawen.art
Episodes
GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
Jan 10, 2026
🤗 Upvotes: 98 | cs.CL, cs.AI, cs.LG
Authors:
Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, Jan Kautz, Pavlo Molchanov
Title:
GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
Arxiv:
http://arxiv.org/abs/2601.05242v1
Abstract:
As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, Reinforcement learning (RL) pip...
Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers
Jan 10, 2026
🤗 Upvotes: 29 | cs.LG
Authors:
Maksim Velikanov, Ilyas Chahed, Jingwei Zuo, Dhia Eddine Rhaiem, Younes Belkada, Hakim Hacid
Title:
Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers
Arxiv:
http://arxiv.org/abs/2601.04890v1
Abstract:
Applying weight decay (WD) to matrix layers is standard practice in large-language-model pretraining. Prior work suggests that stochastic gradient noise induces a Brownian-like expansion of the weight matrices W, whose growth is counteracted by WD, leading to a WD-noise equilibrium with a certain weight norm ||W||. In this work, we view the equ...
RL-AWB: Deep Reinforcement Learning for Auto White Balance Correction in Low-Light Night-time Scenes
Jan 10, 2026
🤗 Upvotes: 26 | cs.CV
Authors:
Yuan-Kang Lee, Kuan-Lin Chen, Chia-Che Chang, Yu-Lun Liu
Title:
RL-AWB: Deep Reinforcement Learning for Auto White Balance Correction in Low-Light Night-time Scenes
Arxiv:
http://arxiv.org/abs/2601.05249v1
Abstract:
Nighttime color constancy remains a challenging problem in computational photography due to low-light noise and complex illumination conditions. We present RL-AWB, a novel framework combining statistical methods with deep reinforcement learning for nighttime white balance. Our method begins with a statistical algorithm tailored for nighttime scenes, integrating salient gray pixel detection with novel illumination est...
Token-Level LLM Collaboration via FusionRoute
Jan 10, 2026
🤗 Upvotes: 26 | cs.AI, cs.CL, cs.LG
Authors:
Nuoya Xiong, Yuhang Zhou, Hanqing Zeng, Zhaorun Chen, Furong Huang, Shuchao Bi, Lizhu Zhang, Zhuokai Zhao
Title:
Token-Level LLM Collaboration via FusionRoute
Arxiv:
http://arxiv.org/abs/2601.05106v1
Abstract:
Large language models (LLMs) exhibit strengths across diverse domains. However, achieving strong performance across these domains with a single general-purpose model typically requires scaling to sizes that are prohibitively expensive to train and deploy. On the other hand, while smaller domain-specialized models are much more efficient, they struggle to generalize bey...
Entropy-Adaptive Fine-Tuning: Resolving Confident Conflicts to Mitigate Forgetting
Jan 09, 2026
🤗 Upvotes: 67 | cs.LG, cs.AI, cs.CL
Authors:
Muxi Diao, Lele Yang, Wuxuan Gong, Yutong Zhang, Zhonghao Yan, Yufei Han, Kongming Liang, Weiran Xu, Zhanyu Ma
Title:
Entropy-Adaptive Fine-Tuning: Resolving Confident Conflicts to Mitigate Forgetting
Arxiv:
http://arxiv.org/abs/2601.02151v1
Abstract:
Supervised Fine-Tuning (SFT) is the standard paradigm for domain adaptation, yet it frequently incurs the cost of catastrophic forgetting. In sharp contrast, on-policy Reinforcement Learning (RL) effectively preserves general capabilities. We investigate this discrepancy and identify a fundamental distributional gap: while RL aligns with the mod...
Evolving Programmatic Skill Networks
Jan 09, 2026
🤗 Upvotes: 56 | cs.AI, cs.NE
Authors:
Haochen Shi, Xingdi Yuan, Bang Liu
Title:
Evolving Programmatic Skill Networks
Arxiv:
http://arxiv.org/abs/2601.03509v1
Abstract:
We study continual skill acquisition in open-ended embodied environments where an agent must construct, refine, and reuse an expanding library of executable skills. We introduce the Programmatic Skill Network (PSN), a framework in which skills are executable symbolic programs forming a compositional network that evolves through experience. PSN defines three core mechanisms instantiated via large language models: (1)REFLECT for structured fault localization over ski...
Atlas: Orchestrating Heterogeneous Models and Tools for Multi-Domain Complex Reasoning
Jan 09, 2026
🤗 Upvotes: 31 | cs.CL
Authors:
Jinyang Wu, Guocheng Zhai, Ruihan Jin, Jiahao Yuan, Yuhao Shen, Shuai Zhang, Zhengqi Wen, Jianhua Tao
Title:
Atlas: Orchestrating Heterogeneous Models and Tools for Multi-Domain Complex Reasoning
Arxiv:
http://arxiv.org/abs/2601.03872v1
Abstract:
The integration of large language models (LLMs) with external tools has significantly expanded the capabilities of AI agents. However, as the diversity of both LLMs and tools increases, selecting the optimal model-tool combination becomes a high-dimensional optimization challenge. Existing approaches often rely on a single model or fixed tool-calling log...
Benchmark^2: Systematic Evaluation of LLM Benchmarks
Jan 09, 2026
🤗 Upvotes: 29 | cs.CL
Authors:
Qi Qian, Chengsong Huang, Jingwen Xu, Changze Lv, Muling Wu, Wenhao Liu, Xiaohua Wang, Zhenghua Wang, Zisu Huang, Muzhao Tian, Jianhan Xu, Kun Hu, He-Da Wang, Yao Hu, Xuanjing Huang, Xiaoqing Zheng
Title:
Benchmark^2: Systematic Evaluation of LLM Benchmarks
Arxiv:
http://arxiv.org/abs/2601.03986v1
Abstract:
The rapid proliferation of benchmarks for evaluating large language models (LLMs) has created an urgent need for systematic methods to assess benchmark quality itself. We propose Benchmark^2, a comprehensive framework comprising three complementary metrics: (1) Cross-Benchmark Ranking Consistency, mea...
InfiniDepth: Arbitrary-Resolution and Fine-Grained Depth Estimation with Neural Implicit Fields
Jan 08, 2026
🤗 Upvotes: 73 | cs.CV
Authors:
Hao Yu, Haotong Lin, Jiawei Wang, Jiaxin Li, Yida Wang, Xueyang Zhang, Yue Wang, Xiaowei Zhou, Ruizhen Hu, Sida Peng
Title:
InfiniDepth: Arbitrary-Resolution and Fine-Grained Depth Estimation with Neural Implicit Fields
Arxiv:
http://arxiv.org/abs/2601.03252v1
Abstract:
Existing depth estimation methods are fundamentally limited to predicting depth on discrete image grids. Such representations restrict their scalability to arbitrary output resolutions and hinder the geometric detail recovery. This paper introduces InfiniDepth, which represents depth as neural implicit fields. Through a simple yet effective loc...
LTX-2: Efficient Joint Audio-Visual Foundation Model
Jan 08, 2026
🤗 Upvotes: 47 | cs.CV
Authors:
Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, Eitan Richardson, Guy Shiran, Itay Chachy, Jonathan Chetboun, Michael Finkelson, Michael Kupchick, Nir Zabari, Nitzan Guetta, Noa Kotler, Ofir Bibi, Ori Gordon, Poriya Panet, Roi Benita, Shahar Armon, Victor Kulikov, Yaron Inger, Yonatan Shiftan, Zeev Melumian, Zeev Farbman
Title:
LTX-2: Efficient Joint Audio-Visual Foundation Model
Arxiv:
http://arxiv.org/abs/2601.03233v1
Abstract:
Recent text-to-video diffusion models can generate compelling video sequences, yet they rem...
MOSS Transcribe Diarize: Accurate Transcription with Speaker Diarization
Jan 08, 2026
🤗 Upvotes: 46 | cs.SD, cs.AI, eess.AS
Authors:
MOSI. AI, Donghua Yu, Zhengyuan Lin, Chen Yang, Yiyang Zhang, Hanfu Chen, Jingqi Chen, Ke Chen, Liwei Fan, Yi Jiang, Jie Zhu, Muchen Li, Wenxuan Wang, Yang Wang, Zhe Xu, Yitian Gong, Yuqian Zhang, Wenbo Zhang, Zhaoye Fei, Qinyuan Cheng, Shimin Li, Xipeng Qiu
Title:
MOSS Transcribe Diarize: Accurate Transcription with Speaker Diarization
Arxiv:
http://arxiv.org/abs/2601.01554v2
Abstract:
Speaker-Attributed, Time-Stamped Transcription (SATS) aims to transcribe what is said and to precisely determine the timing of each speaker, whi...
SciEvalKit: An Open-source Evaluation Toolkit for Scientific General Intelligence
Jan 08, 2026
🤗 Upvotes: 28 | cs.AI, cs.CL
Authors:
Yiheng Wang, Yixin Chen, Shuo Li, Yifan Zhou, Bo Liu, Hengjian Gao, Jiakang Yuan, Jia Bu, Wanghan Xu, Yuhao Zhou, Xiangyu Zhao, Zhiwang Zhou, Fengxiang Wang, Haodong Duan, Songyang Zhang, Jun Yao, Han Deng, Yizhou Wang, Jiabei Xiao, Jiaqi Liu, Encheng Su, Yujie Liu, Weida Wang, Junchi Yao, Shenghe Zheng, Haoran Sun, Runmin Ma, Xiangchao Yan, Bo Zhang, Dongzhan Zhou, Shufei Zhang, Peng Ye, Xiaosong Wang, Shixiang Tang, Wenlong Zhang, Lei Bai
Title:
SciEvalKit: An Open-source Evaluation Toolkit for Scientific General Intelligence
Arxiv:
http://arx...
NitroGen: An Open Foundation Model for Generalist Gaming Agents
Jan 08, 2026
🤗 Upvotes: 22 | cs.CV, cs.AI, cs.LG
Authors:
Loïc Magne, Anas Awadalla, Guanzhi Wang, Yinzhen Xu, Joshua Belofsky, Fengyuan Hu, Joohwan Kim, Ludwig Schmidt, Georgia Gkioxari, Jan Kautz, Yisong Yue, Yejin Choi, Yuke Zhu, Linxi "Jim" Fan
Title:
NitroGen: An Open Foundation Model for Generalist Gaming Agents
Arxiv:
http://arxiv.org/abs/2601.02427v1
Abstract:
We introduce NitroGen, a vision-action foundation model for generalist gaming agents that is trained on 40,000 hours of gameplay videos across more than 1,000 games. We incorporate three key ingredients: 1) an internet-scale video-action dataset cons...
Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits
Jan 07, 2026
🤗 Upvotes: 48 | cs.CL
Authors:
Amirhosein Ghasemabadi, Di Niu
Title:
Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits
Arxiv:
http://arxiv.org/abs/2512.20578v2
Abstract:
Large language models (LLMs) generate fluent and complex outputs but often fail to recognize their own mistakes and hallucinations. Existing approaches typically rely on external judges, multi-sample consistency, or text-based self-critique, which incur additional compute or correlate weakly with true correctness. We ask: can LLMs predict their own failures by inspecting internal states during inference? We introduce Gnosis, a lightweight self-awareness mec...
NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation
Jan 07, 2026
🤗 Upvotes: 45 | cs.CV, cs.AI
Authors:
Huichao Zhang, Liao Qu, Yiheng Liu, Hang Chen, Yangyang Song, Yongsheng Dong, Shikun Sun, Xian Li, Xu Wang, Yi Jiang, Hu Ye, Bo Chen, Yiming Gao, Peng Liu, Akide Liu, Zhipeng Yang, Qili Deng, Linjie Xing, Jiyang Liu, Zhao Wang, Yang Zhou, Mingcong Liu, Yi Zhang, Qian He, Xiwei Hu, Zhongqi Qi, Jie Shao, Zhiye Fu, Shuai Wang, Fangmin Chen, Xuezhi Chai, Zhihua Wu, Yitong Wang, Zehuan Yuan, Daniel K. Du, Xinglong Wu
Title:
NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation
Arxiv:
htt...
DreamID-V: Bridging the Image-to-Video Gap for High-Fidelity Face Swapping via Diffusion Transformer
Jan 07, 2026
🤗 Upvotes: 36 | cs.CV
Authors:
Xu Guo, Fulong Ye, Xinghui Li, Pengqi Tu, Pengze Zhang, Qichao Sun, Songtao Zhao, Xiangwang Hou, Qian He
Title:
DreamID-V: Bridging the Image-to-Video Gap for High-Fidelity Face Swapping via Diffusion Transformer
Arxiv:
http://arxiv.org/abs/2601.01425v1
Abstract:
Video Face Swapping (VFS) requires seamlessly injecting a source identity into a target video while meticulously preserving the original pose, expression, lighting, background, and dynamic information. Existing methods struggle to maintain identity similarity and attribute preservation while preserving temporal consistency. To address the challenge, we pro...
VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation
Jan 07, 2026
🤗 Upvotes: 29 | cs.CV, cs.LG
Authors:
Shikun Sun, Liao Qu, Huichao Zhang, Yiheng Liu, Yangyang Song, Xian Li, Xu Wang, Yi Jiang, Daniel K. Du, Xinglong Wu, Jia Jia
Title:
VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation
Arxiv:
http://arxiv.org/abs/2601.02256v1
Abstract:
Visual generation is dominated by three paradigms: AutoRegressive (AR), diffusion, and Visual AutoRegressive (VAR) models. Unlike AR and diffusion, VARs operate on heterogeneous input structures across their generation steps, which creates severe asynchronous policy conflicts. This issue becomes par...
GARDO: Reinforcing Diffusion Models without Reward Hacking
Jan 07, 2026
🤗 Upvotes: 23 | cs.LG, cs.AI, cs.CV
Authors:
Haoran He, Yuxiao Ye, Jie Liu, Jiajun Liang, Zhiyong Wang, Ziyang Yuan, Xintao Wang, Hangyu Mao, Pengfei Wan, Ling Pan
Title:
GARDO: Reinforcing Diffusion Models without Reward Hacking
Arxiv:
http://arxiv.org/abs/2512.24138v1
Abstract:
Fine-tuning diffusion models via online reinforcement learning (RL) has shown great potential for enhancing text-to-image alignment. However, since precisely specifying a ground-truth objective for visual tasks remains challenging, the models are often optimized using a proxy reward that only partially captures the true goal. Thi...
InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams
Jan 07, 2026
🤗 Upvotes: 22 | cs.CV
Authors:
Shuai Yuan, Yantai Yang, Xiaotian Yang, Xupeng Zhang, Zhonghao Zhao, Lingming Zhang, Zhipeng Zhang
Title:
InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams
Arxiv:
http://arxiv.org/abs/2601.02281v1
Abstract:
The grand vision of enabling persistent, large-scale 3D visual geometry understanding is shackled by the irreconcilable demands of scalability and long-term stability. While offline models like VGGT achieve inspiring geometry capability, their batch-based nature renders them irrelevant for live systems. Streaming architectures, though the intended solution for live operation, have proven inadequate. Existing met...
VINO: A Unified Visual Generator with Interleaved OmniModal Context
Jan 07, 2026
🤗 Upvotes: 22 | cs.CV
Authors:
Junyi Chen, Tong He, Zhoujie Fu, Pengfei Wan, Kun Gai, Weicai Ye
Title:
VINO: A Unified Visual Generator with Interleaved OmniModal Context
Arxiv:
http://arxiv.org/abs/2601.02358v1
Abstract:
We present VINO, a unified visual generator that performs image and video generation and editing within a single framework. Instead of relying on task-specific models or independent modules for each modality, VINO uses a shared diffusion backbone that conditions on text, images and videos, enabling a broad range of visual creation and editing tasks und...
Youtu-Agent: Scaling Agent Productivity with Automated Generation and Hybrid Policy Optimization
Jan 06, 2026
🤗 Upvotes: 88 | cs.AI
Authors:
Yuchen Shi, Yuzheng Cai, Siqi Cai, Zihan Xu, Lichao Chen, Yulei Qin, Zhijian Zhou, Xiang Fei, Chaofan Qiu, Xiaoyu Tan, Gang Li, Zongyi Li, Haojia Lin, Guocan Cai, Yong Mao, Yunsheng Wu, Ke Li, Xing Sun
Title:
Youtu-Agent: Scaling Agent Productivity with Automated Generation and Hybrid Policy Optimization
Arxiv:
http://arxiv.org/abs/2512.24615v1
Abstract:
Existing Large Language Model (LLM) agent frameworks face two significant challenges: high configuration costs and static capabilities. Building a high-quality agent often requires extensive manual effort in tool int...
NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos
Jan 06, 2026
🤗 Upvotes: 86 | cs.CV
Authors:
Yuxue Yang, Lue Fan, Ziqi Shi, Junran Peng, Feng Wang, Zhaoxiang Zhang
Title:
NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos
Arxiv:
http://arxiv.org/abs/2601.00393v1
Abstract:
In this paper, we propose NeoVerse, a versatile 4D world model that is capable of 4D reconstruction, novel-trajectory video generation, and rich downstream applications. We first identify a common limitation of scalability in current 4D world modeling methods, caused either by expensive and specialized multi-view 4D data or by cumbersome training pre-processing. In contrast, our...
Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation
Jan 06, 2026
🤗 Upvotes: 44 | cs.LG, cs.AI, cs.CV, cs.HC, cs.MM
Authors:
Taekyung Ki, Sangwon Jang, Jaehyeong Jo, Jaehong Yoon, Sung Ju Hwang
Title:
Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation
Arxiv:
http://arxiv.org/abs/2601.00664v1
Abstract:
Talking head generation creates lifelike avatars from static portraits for virtual communication and content creation. However, current models do not yet convey the feeling of truly interactive communication, often generating one-way responses that lack emotional engagement. We identify two key challenges toward truly interactive avatars: generating mot...
Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation
Jan 06, 2026
🤗 Upvotes: 38 | cs.CV, cs.AI
Authors:
Zhe Huang, Hao Wen, Aiming Hao, Bingze Song, Meiqi Wu, Jiahong Wu, Xiangxiang Chu, Sheng Lu, Haoqian Wang
Title:
Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation
Arxiv:
http://arxiv.org/abs/2512.24271v1
Abstract:
Multimodal Large Language Models (MLLMs) have made remarkable progress in video understanding. However, they suffer from a critical vulnerability: an over-reliance on language priors, which can lead to visual ungrounded hallucinations, especially when processing counterfactual videos that defy common sense. This limitation, stemming from the int...
SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning
Jan 06, 2026
🤗 Upvotes: 30 | cs.CV
Authors:
Yong Xien Chng, Tao Hu, Wenwen Tong, Xueheng Li, Jiandong Chen, Haojia Yu, Jiefan Lu, Hewei Guo, Hanming Deng, Chengjun Xie, Gao Huang, Dahua Lin, Lewei Lu
Title:
SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning
Arxiv:
http://arxiv.org/abs/2512.24330v1
Abstract:
While Vision-Language Models (VLMs) can solve complex tasks through agentic reasoning, their capabilities remain largely constrained to text-oriented chain-of-thought or isolated tool invocation. They fail to exhibit the human-like proficiency required to seamlessly interleave dynamic tool manipulation with con...
Deep Delta Learning
Jan 06, 2026
🤗 Upvotes: 22 | cs.LG, cs.AI, cs.CL, cs.CV
Authors:
Yifan Zhang, Yifeng Liu, Mengdi Wang, Quanquan Gu
Title:
Deep Delta Learning
Arxiv:
http://arxiv.org/abs/2601.00417v1
Abstract:
The efficacy of deep residual networks is fundamentally predicated on the identity shortcut connection. While this mechanism effectively mitigates the vanishing gradient problem, it imposes a strictly additive inductive bias on feature transformations, thereby limiting the network's capacity to model complex state transitions. In this paper, we introduce Deep Delta Learning (DDL), a novel architecture that generalizes the sta...
AdaGaR: Adaptive Gabor Representation for Dynamic Scene Reconstruction
Jan 06, 2026
🤗 Upvotes: 22 | cs.CV
Authors:
Jiewen Chan, Zhenjun Zhao, Yu-Lun Liu
Title:
AdaGaR: Adaptive Gabor Representation for Dynamic Scene Reconstruction
Arxiv:
http://arxiv.org/abs/2601.00796v1
Abstract:
Reconstructing dynamic 3D scenes from monocular videos requires simultaneously capturing high-frequency appearance details and temporally continuous motion. Existing methods using single Gaussian primitives are limited by their low-pass filtering nature, while standard Gabor functions introduce energy instability. Moreover, lack of temporal continuity constraints often leads to motion artifacts during interpolation. We propose AdaGaR, a unified framework addressing both frequency adaptivity and tem...
Nested Learning: The Illusion of Deep Learning Architectures
Jan 06, 2026
🤗 Upvotes: 22 | cs.LG, cs.AI
Authors:
Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, Vahab Mirrokni
Title:
Nested Learning: The Illusion of Deep Learning Architectures
Arxiv:
http://arxiv.org/abs/2512.24695v1
Abstract:
Despite the recent progresses, particularly in developing Language Models, there are fundamental challenges and unanswered questions about how such models can continually learn/memorize, self-improve, and find effective solutions. In this paper, we present a new learning paradigm, called Nested Learning (NL), that coherently represents a machine learning model with a set of nested, multi-level, and/or par...
Improving Multi-step RAG with Hypergraph-based Memory for Long-Context Complex Relational Modeling
Jan 03, 2026
🤗 Upvotes: 44 | cs.CL, cs.AI, cs.LG
Authors:
Chulun Zhou, Chunkang Zhang, Guoxin Yu, Fandong Meng, Jie Zhou, Wai Lam, Mo Yu
Title:
Improving Multi-step RAG with Hypergraph-based Memory for Long-Context Complex Relational Modeling
Arxiv:
http://arxiv.org/abs/2512.23959v1
Abstract:
Multi-step retrieval-augmented generation (RAG) has become a widely adopted strategy for enhancing large language models (LLMs) on tasks that demand global comprehension and intensive reasoning. Many RAG systems incorporate a working memory module to consolidate retrieved information. However, existing memory designs function primarily as passive storage tha...
Dynamic Large Concept Models: Latent Reasoning in an Adaptive Semantic Space
Jan 03, 2026
🤗 Upvotes: 25 | cs.LG, cs.AI
Authors:
Xingwei Qu, Shaowen Wang, Zihao Huang, Kai Hua, Fan Yin, Rui-Jie Zhu, Jundong Zhou, Qiyang Min, Zihao Wang, Yizhi Li, Tianyu Zhang, He Xing, Zheng Zhang, Yuxuan Song, Tianyu Zheng, Zhiyuan Zeng, Chenghua Lin, Ge Zhang, Wenhao Huang
Title:
Dynamic Large Concept Models: Latent Reasoning in an Adaptive Semantic Space
Arxiv:
http://arxiv.org/abs/2512.24617v1
Abstract:
Large Language Models (LLMs) apply uniform computation to all tokens, despite language exhibiting highly non-uniform information density. This token-uniform regime wastes capacity on locally pre...
mHC: Manifold-Constrained Hyper-Connections
Jan 02, 2026
🤗 Upvotes: 73 | cs.CL, cs.AI, cs.LG
Authors:
Zhenda Xie, Yixuan Wei, Huanqi Cao, Chenggang Zhao, Chengqi Deng, Jiashi Li, Damai Dai, Huazuo Gao, Jiang Chang, Liang Zhao, Shangyan Zhou, Zhean Xu, Zhengyan Zhang, Wangding Zeng, Shengding Hu, Yuqing Wang, Jingyang Yuan, Lean Wang, Wenfeng Liang
Title:
mHC: Manifold-Constrained Hyper-Connections
Arxiv:
http://arxiv.org/abs/2512.24880v1
Abstract:
Recently, studies exemplified by Hyper-Connections (HC) have extended the ubiquitous residual connection paradigm established over the past decade by expanding the residual stream width and diversifying connectivity patterns. While yielding sub...
Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models
Jan 02, 2026
🤗 Upvotes: 45 | cs.CL
Authors:
Junru Lu, Jiarui Qin, Lingfeng Qiao, Yinghui Li, Xinyi Dai, Bo Ke, Jianfeng He, Ruizhi Qiao, Di Yin, Xing Sun, Yunsheng Wu, Yinsong Liu, Shuangyin Liu, Mingkong Tang, Haodong Lin, Jiayi Kuang, Fanxu Meng, Xiaojuan Tang, Yunjia Xi, Junjie Huang, Haotong Yang, Zhenyi Shen, Yangning Li, Qianwen Zhang, Yifei Yu, Siyu An, Junnan Dong, Qiufeng Wang, Jie Wang, Keyu Chen, Wei Wen, Taian Guo, Zhifeng Shen, Daohai Yu, Jiahao Li, Ke Li, Zongyi Li, Xiaoyu Tan
Title:
Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models
A...
Duration: 00:28:35
Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem
Jan 02, 2026
🤗 Upvotes: 33 | cs.AI, cs.CL
Authors:
Weixun Wang, XiaoXiao Xu, Wanhe An, Fangwen Dai, Wei Gao, Yancheng He, Ju Huang, Qiang Ji, Hanqi Jin, Xiaoyang Li, Yang Li, Zhongwen Li, Shirong Lin, Jiashun Liu, Zenan Liu, Tao Luo, Dilxat Muhtar, Yuanbin Qu, Jiaqiang Shi, Qinghui Sun, Yingshui Tan, Hao Tang, Runze Wang, Yi Wang, Zhaoguo Wang, Yanan Wu, Shaopan Xiong, Binchen Xu, Xander Xu, Yuchi Xu, Qipeng Zhang, Xixia Zhang, Haizhou Zhao, Jie Zhao, Shuaibing Zhao, Baihui Zheng, Jianhui Zheng, Suhang Zheng, Yanni Zhu, Mengze Cai, Kerui Cao, Xitong Chen, Yue Dai, Lifan Du, Tao Feng, Tao...
GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction
Jan 02, 2026
🤗 Upvotes: 22 | cs.CV
Authors:
Yi-Chuan Huang, Hao-Jen Chien, Chin-Yang Lin, Ying-Huan Chen, Yu-Lun Liu
Title:
GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction
Arxiv:
http://arxiv.org/abs/2512.25073v1
Abstract:
Recent advances in 3D reconstruction have achieved remarkable progress in high-quality scene capture from dense multi-view imagery, yet struggle when input views are limited. Various approaches, including regularization techniques, semantic priors, and geometric constraints, have been implemented to address this challenge. Latest diffusion-based methods have demonstrated substantial improvements by generating novel views from new camera poses to...
Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss
Dec 31, 2025
🤗 Upvotes: 72 | cs.CL, cs.LG
Authors:
Ang Lv, Jin Ma, Yiyuan Ma, Siyuan Qiao
Title:
Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss
Arxiv:
http://arxiv.org/abs/2512.23447v1
Abstract:
Mixture-of-Experts (MoE) models lack explicit constraints to ensure the router's decisions align well with the experts' capabilities, which ultimately limits model performance. To address this, we propose expert-router coupling (ERC) loss, a lightweight auxiliary loss that tightly couples the router's decisions with expert capabilities. Our approach treats each expert's router embedding as a proxy token for...
LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation
Dec 31, 2025
🤗 Upvotes: 51 | cs.CV
Authors:
Ethan Chern, Zhulin Hu, Bohao Tang, Jiadi Su, Steffi Chern, Zhijie Deng, Pengfei Liu
Title:
LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation
Arxiv:
http://arxiv.org/abs/2512.23576v1
Abstract:
Real-time video generation via diffusion is essential for building general-purpose multimodal interactive AI systems. However, the simultaneous denoising of all video frames with bidirectional attention via an iterative process in diffusion models prevents real-time interaction. While existing distillation methods can make the model autoregressive and reduce sampling steps to mitigate this, the...
Yume-1.5: A Text-Controlled Interactive World Generation Model
Dec 31, 2025
🤗 Upvotes: 50 | cs.CV
Authors:
Xiaofeng Mao, Zhen Li, Chuanhao Li, Xiaojie Xu, Kaining Ying, Tong He, Jiangmiao Pang, Yu Qiao, Kaipeng Zhang
Title:
Yume-1.5: A Text-Controlled Interactive World Generation Model
Arxiv:
http://arxiv.org/abs/2512.22096v1
Abstract:
Recent approaches have demonstrated the promise of using diffusion models to generate interactive and explorable worlds. However, most of these methods face critical challenges such as excessively large parameter sizes, reliance on lengthy inference steps, and rapidly growing historical context, which severely limit real-time performance and lack text-controlled generation capabilities. To...
SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents
Dec 31, 2025
🤗 Upvotes: 33 | cs.CL, cs.AI, cs.CV, cs.LG, cs.MA
Authors:
Shaofei Cai, Yulei Qin, Haojia Lin, Zihan Xu, Gang Li, Yuchen Shi, Zongyi Li, Yong Mao, Siqi Cai, Xiaoyu Tan, Yitao Liang, Ke Li, Xing Sun
Title:
SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents
Arxiv:
http://arxiv.org/abs/2512.22322v1
Abstract:
Agentic reinforcement learning (RL) holds great promise for the development of autonomous agents under complex GUI tasks, but its scalability remains severely hampered by the verification of task completion. Existing task verification is treated as...
Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation
Dec 31, 2025
🤗 Upvotes: 32 | cs.CV
Authors:
Shaocong Xu, Songlin Wei, Qizhe Wei, Zheng Geng, Hong Li, Licheng Shen, Qianpu Sun, Shu Han, Bin Ma, Bohan Li, Chongjie Ye, Yuhang Zheng, Nan Wang, Saining Zhang, Hao Zhao
Title:
Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation
Arxiv:
http://arxiv.org/abs/2512.23705v1
Abstract:
Transparent objects remain notoriously hard for perception systems: refraction, reflection and transmission break the assumptions behind stereo, ToF and purely discriminative monocular depth, causing holes and temporally unstable estimates. Our key observation is...
Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion
Dec 31, 2025
🤗 Upvotes: 30 | cs.CV
Authors:
Hau-Shiang Shiu, Chin-Yang Lin, Zhixiang Wang, Chi-Wei Hsiao, Po-Fan Yu, Yu-Chih Chen, Yu-Lun Liu
Title:
Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion
Arxiv:
http://arxiv.org/abs/2512.23709v1
Abstract:
Diffusion-based video super-resolution (VSR) methods achieve strong perceptual quality but remain impractical for latency-sensitive settings due to reliance on future frames and expensive multi-step denoising. We propose Stream-DiffVSR, a causally conditioned diffusion framework for efficient online VSR. Operating strictly on past frames, it combines a four-step distilled denoiser for fast inference, an Auto-regressive Tem...
Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone
Dec 31, 2025
🤗 Upvotes: 28 | cs.CV, cs.CL
Authors:
Jiacheng Ye, Shansan Gong, Jiahui Gao, Junming Fan, Shuang Wu, Wei Bi, Haoli Bai, Lifeng Shang, Lingpeng Kong
Title:
Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone
Arxiv:
http://arxiv.org/abs/2512.22615v1
Abstract:
While autoregressive Large Vision-Language Models (VLMs) have achieved remarkable success, their sequential generation often limits their efficacy in complex visual planning and dynamic robotic control. In this work, we investigate the potential of constructing Vision-Language Models upon diffusion-based large language models (dLLMs) to ove...
SpotEdit: Selective Region Editing in Diffusion Transformers
Dec 31, 2025
🤗 Upvotes: 27 | cs.CV, cs.AI
Authors:
Zhibin Qin, Zhenxiong Tan, Zeqing Wang, Songhua Liu, Xinchao Wang
Title:
SpotEdit: Selective Region Editing in Diffusion Transformers
Arxiv:
http://arxiv.org/abs/2512.22323v1
Abstract:
Diffusion Transformer models have significantly advanced image editing by encoding conditional images and integrating them into transformer layers. However, most edits involve modifying only small regions, while current methods uniformly process and denoise all tokens at every timestep, causing redundant computation and potentially degrading unchanged areas. This raises a fundamental question: Is it truly necessary to reg...
GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models
Dec 31, 2025
🤗 Upvotes: 21 | cs.CV
Authors:
Bozhou Li, Sihan Yang, Yushuo Guan, Ruichuan An, Xinlong Chen, Yang Shi, Pengfei Wan, Wentao Zhang, Yuanxing Zhang
Title:
GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models
Arxiv:
http://arxiv.org/abs/2512.15560v2
Abstract:
The text encoder is a critical component of text-to-image and text-to-video diffusion models, fundamentally determining the semantic fidelity of the generated content. However, its development has been hindered by two major challenges: the lack of an efficient evaluation framework that reliably predicts downstream generation performance, and the...
InsertAnywhere: Bridging 4D Scene Geometry and Diffusion Models for Realistic Video Object Insertion
Dec 30, 2025
🤗 Upvotes: 74 | cs.CV, cs.AI
Authors:
Hoiyeong Jin, Hyojin Jang, Jeongho Kim, Junha Hyung, Kinam Kim, Dongjin Kim, Huijin Choi, Hyeonji Kim, Jaegul Choo
Title:
InsertAnywhere: Bridging 4D Scene Geometry and Diffusion Models for Realistic Video Object Insertion
Arxiv:
http://arxiv.org/abs/2512.17504v1
Abstract:
Recent advances in diffusion-based video generation have opened new possibilities for controllable video editing, yet realistic video object insertion (VOI) remains challenging due to limited 4D scene understanding and inadequate handling of occlusion and lighting effects. We present InsertAnywhere, a new VOI fra...
Mindscape-Aware Retrieval Augmented Generation for Improved Long Context Understanding
Dec 30, 2025
🤗 Upvotes: 70 | cs.CL
Authors:
Yuqing Li, Jiangnan Li, Zheng Lin, Ziyan Zhou, Junjie Wu, Weiping Wang, Jie Zhou, Mo Yu
Title:
Mindscape-Aware Retrieval Augmented Generation for Improved Long Context Understanding
Arxiv:
http://arxiv.org/abs/2512.17220v1
Abstract:
Humans understand long and complex texts by relying on a holistic semantic representation of the content. This global view helps organize prior knowledge, interpret new information, and integrate evidence dispersed across a document, as revealed by the Mindscape-Aware Capability of humans in psychology. Current Retrieval-Augmented Generation (RAG) systems lack such gui...
MAI-UI Technical Report: Real-World Centric Foundation GUI Agents
Dec 30, 2025
🤗 Upvotes: 21 | cs.CV
Authors:
Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, Steven Hoi
Title:
MAI-UI Technical Report: Real-World Centric Foundation GUI Agents
Arxiv:
http://arxiv.org/abs/2512.22047v1
Abstract:
The development of GUI agents could revolutionize the next generation of human-computer interaction. Motivated by this vision, we present MAI-UI, a family of foundation GUI agents spanning the full spectrum of sizes, including 2B, 8B, 32B, and 235B-A22B variants. We identify four key challenges to...
Latent Implicit Visual Reasoning
Dec 27, 2025
🤗 Upvotes: 34 | cs.CV
Authors:
Kelvin Li, Chuyi Shang, Leonid Karlinsky, Rogerio Feris, Trevor Darrell, Roei Herzig
Title:
Latent Implicit Visual Reasoning
Arxiv:
http://arxiv.org/abs/2512.21218v1
Abstract:
While Large Multimodal Models (LMMs) have made significant progress, they remain largely text-centric, relying on language as their core reasoning modality. As a result, they are limited in their ability to handle reasoning tasks that are predominantly visual. Recent approaches have sought to address this by supervising intermediate visual steps with helper images, depth maps, or image crops. However, the...
Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning
Dec 27, 2025
🤗 Upvotes: 26 | cs.LG, cs.AI
Authors:
Seijin Kobayashi, Yanick Schimpf, Maximilian Schlegel, Angelika Steger, Maciej Wolczyk, Johannes von Oswald, Nino Scherrer, Kaitlin Maile, Guillaume Lajoie, Blake A. Richards, Rif A. Saurous, James Manyika, Blaise Agüera y Arcas, Alexander Meulemans, João Sacramento
Title:
Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning
Arxiv:
http://arxiv.org/abs/2512.20605v2
Abstract:
Large-scale autoregressive models pretrained on next-token prediction and finetuned with reinforcement learning (RL) have achieved unprecedented success on many problem domains. During RL, these models explore by ge...
TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times
Dec 26, 2025
🤗 Upvotes: 51 | cs.CV, cs.AI, cs.LG
Authors:
Jintao Zhang, Kaiwen Zheng, Kai Jiang, Haoxu Wang, Ion Stoica, Joseph E. Gonzalez, Jianfei Chen, Jun Zhu
Title:
TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times
Arxiv:
http://arxiv.org/abs/2512.16093v1
Abstract:
We introduce TurboDiffusion, a video generation acceleration framework that can speed up end-to-end diffusion generation by 100-200x while maintaining video quality. TurboDiffusion mainly relies on several components for acceleration: (1) Attention acceleration: TurboDiffusion uses low-bit SageAttention and trainable Sparse-Linear Attention (SLA) to speed up attention computation. (2) Ste...
Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
Dec 26, 2025
🤗 Upvotes: 42 | cs.CV
Authors:
Shengchao Zhou, Yuxin Chen, Yuying Ge, Wei Huang, Jiehong Lin, Ying Shan, Xiaojuan Qi
Title:
Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
Arxiv:
http://arxiv.org/abs/2512.20557v1
Abstract:
Vision-language models (VLMs) excel at general understanding yet remain weak at dynamic spatial reasoning (DSR), i.e., reasoning about how object geometry and relationships evolve in 3D space over time, largely due to the scarcity of scalable 4D-aware training resources. To bridge this gap across aspects of dataset, ben...
DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation
Dec 26, 2025
🤗 Upvotes: 26 | cs.CV
Authors:
Jiawei Liu, Junqiao Li, Jiangfan Deng, Gen Li, Siyu Zhou, Zetao Fang, Shanshan Lao, Zengde Deng, Jianing Zhu, Tingting Ma, Jiayi Li, Yunqiu Wang, Qian He, Xinglong Wu
Title:
DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation
Arxiv:
http://arxiv.org/abs/2512.21252v1
Abstract:
The "one-shot" technique represents a distinct and sophisticated aesthetic in filmmaking. However, its practical realization is often hindered by prohibitive costs and complex real-world constraints. Although emerging video generation models offer a virtual alternative, existing approaches typically rely on naive clip con...
T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation
Dec 26, 2025
🤗 Upvotes: 23 | cs.CV
Authors:
Zhe Cao, Tao Wang, Jiaming Wang, Yanghai Wang, Yuanxing Zhang, Jialu Chen, Miao Deng, Jiahao Wang, Yubin Guo, Chenxi Liao, Yize Zhang, Zhaoxiang Zhang, Jiaheng Liu
Title:
T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation
Arxiv:
http://arxiv.org/abs/2512.21094v1
Abstract:
Text-to-Audio-Video (T2AV) generation aims to synthesize temporally coherent video and semantically synchronized audio from natural language, yet its evaluation remains fragmented, often relying on unimodal metrics or narrowly scoped benchmarks that fail to capture cross-modal alignment, instruction following, and perceptual rea...
SemanticGen: Video Generation in Semantic Space
Dec 25, 2025
🤗 Upvotes: 78 | cs.CV
Authors:
Jianhong Bai, Xiaoshi Wu, Xintao Wang, Xiao Fu, Yuanxing Zhang, Qinghe Wang, Xiaoyu Shi, Menghan Xia, Zuozhu Liu, Haoji Hu, Pengfei Wan, Kun Gai
Title:
SemanticGen: Video Generation in Semantic Space
Arxiv:
http://arxiv.org/abs/2512.20619v2
Abstract:
State-of-the-art video generative models typically learn the distribution of video latents in the VAE space and map them to pixels using a VAE decoder. While this approach can generate high-quality videos, it suffers from slow convergence and is computationally expensive when generating long videos. In thi...
Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies
Dec 25, 2025
🤗 Upvotes: 49 | cs.LG, cs.AI, cs.CL
Authors:
Yuqiao Tan, Minzheng Wang, Shizhu He, Huanxuan Liao, Chengfeng Zhao, Qiunan Lu, Tian Liang, Jun Zhao, Kang Liu
Title:
Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies
Arxiv:
http://arxiv.org/abs/2512.19673v1
Abstract:
Existing reinforcement learning (RL) approaches treat large language models (LLMs) as a single unified policy, overlooking their internal mechanisms. Understanding how the policy evolves across layers and modules is therefore crucial for enabling more targeted optimization and unraveling complex reasoning mechanisms. In thi...
LongVideoAgent: Multi-Agent Reasoning with Long Videos
Dec 25, 2025
🤗 Upvotes: 38 | cs.AI, cs.CV, cs.LG, cs.MA
Authors:
Runtao Liu, Ziyi Liu, Jiaqi Tang, Yue Ma, Renjie Pi, Jipeng Zhang, Qifeng Chen
Title:
LongVideoAgent: Multi-Agent Reasoning with Long Videos
Arxiv:
http://arxiv.org/abs/2512.20618v1
Abstract:
Recent advances in multimodal LLMs and systems that use tools for long-video QA point to the promise of reasoning over hour-long episodes. However, many methods still compress content into lossy summaries or rely on limited toolsets, weakening temporal grounding and missing fine-grained cues. We propose a multi-agent framework in whi...
SpatialTree: How Spatial Abilities Branch Out in MLLMs
Dec 25, 2025
🤗 Upvotes: 35 | cs.CV
Authors:
Yuxi Xiao, Longfei Li, Shen Yan, Xinhang Liu, Sida Peng, Yunchao Wei, Xiaowei Zhou, Bingyi Kang
Title:
SpatialTree: How Spatial Abilities Branch Out in MLLMs
Arxiv:
http://arxiv.org/abs/2512.20617v1
Abstract:
Cognitive science suggests that spatial ability develops progressively, from perception to reasoning and interaction. Yet in multimodal LLMs (MLLMs), this hierarchy remains poorly understood, as most studies focus on a narrow set of tasks. We introduce SpatialTree, a cognitive-science-inspired hierarchy that organizes spatial abilities into four levels: low-level perception (L1), mental mapping (L2...
DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI
Dec 24, 2025
🤗 Upvotes: 159 | cs.LG, cs.CL
Authors:
Hao Liang, Xiaochen Ma, Zhou Liu, Zhen Hao Wong, Zhengyang Zhao, Zimo Meng, Runming He, Chengyu Shen, Qifeng Cai, Zhaoyang Han, Meiyi Qiang, Yalin Feng, Tianyi Bai, Zewei Pan, Ziyi Guo, Yizhen Jiang, Jingwen Deng, Qijie You, Peichao Lai, Tianyu Guo, Chi Hsu Tsai, Hengyi Feng, Rui Hu, Wenkai Yu, Junbo Niu, Bohan Zeng, Ruichuan An, Lu Ma, Jihao Huang, Yaowei Zheng, Conghui He, Linpeng Tang, Bin Cui, Weinan E, Wentao Zhang
Title:
DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of...
The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding
Dec 24, 2025
🤗 Upvotes: 53 | cs.CV
Authors:
Weichen Fan, Haiwen Diao, Quan Wang, Dahua Lin, Ziwei Liu
Title:
The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding
Arxiv:
http://arxiv.org/abs/2512.19693v1
Abstract:
Deep representations across modalities are inherently intertwined. In this paper, we systematically analyze the spectral characteristics of various semantic and pixel encoders. Interestingly, our study uncovers a highly inspiring and rarely explored correspondence between an encoder's feature spectrum and its functional role: semantic encoders primarily capture low-frequency components that encode abstract meaning, whereas pixel enc...
Region-Constraint In-Context Generation for Instructional Video Editing
Dec 24, 2025
🤗 Upvotes: 40 | cs.CV, cs.MM
Authors:
Zhongwei Zhang, Fuchen Long, Wei Li, Zhaofan Qiu, Wu Liu, Ting Yao, Tao Mei
Title:
Region-Constraint In-Context Generation for Instructional Video Editing
Arxiv:
http://arxiv.org/abs/2512.17650v1
Abstract:
The in-context generation paradigm has recently demonstrated strong performance in instructional image editing, with both data efficiency and synthesis quality. Nevertheless, shaping such in-context learning for instruction-based video editing is not trivial. Without specifying editing regions, the results can suffer from inaccurate editing regions and token interference between edi...
QuCo-RAG: Quantifying Uncertainty from the Pre-training Corpus for Dynamic Retrieval-Augmented Generation
Dec 24, 2025
🤗 Upvotes: 26 | cs.CL, cs.IR
Authors:
Dehai Min, Kailin Zhang, Tongtong Wu, Lu Cheng
Title:
QuCo-RAG: Quantifying Uncertainty from the Pre-training Corpus for Dynamic Retrieval-Augmented Generation
Arxiv:
http://arxiv.org/abs/2512.19134v1
Abstract:
Dynamic Retrieval-Augmented Generation adaptively determines when to retrieve during generation to mitigate hallucinations in large language models (LLMs). However, existing methods rely on model-internal signals (e.g., logits, entropy), which are fundamentally unreliable because LLMs are typically ill-calibrated and often exhibit high confidence in erroneous outputs. We propose QuCo-RAG, which shifts from subjective confidence to...
Infinite-Homography as Robust Conditioning for Camera-Controlled Video Generation
Dec 24, 2025
🤗 Upvotes: 26 | cs.CV
Authors:
Min-Jung Kim, Jeongho Kim, Hoiyeong Jin, Junha Hyung, Jaegul Choo
Title:
Infinite-Homography as Robust Conditioning for Camera-Controlled Video Generation
Arxiv:
http://arxiv.org/abs/2512.17040v1
Abstract:
Recent progress in video diffusion models has spurred growing interest in camera-controlled novel-view video generation for dynamic scenes, aiming to provide creators with cinematic camera control capabilities in post-production. A key challenge in camera-controlled video generation is ensuring fidelity to the specified camera pose, while maintaining view consistency and reasoning about occluded geometry from limited observations. To add...
Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction
Dec 24, 2025
🤗 Upvotes: 21 | cs.CL, cs.AI, cs.CY
Authors:
Ming Li, Han Chen, Yunze Xiao, Jian Chen, Hong Jiao, Tianyi Zhou
Title:
Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction
Arxiv:
http://arxiv.org/abs/2512.18880v1
Abstract:
Accurate estimation of item (question or task) difficulty is critical for educational assessment but suffers from the cold start problem. While Large Language Models demonstrate superhuman problem-solving capabilities, it remains an open question whether they can perceive the cognitive struggles of human learners. In thi...
Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows
Dec 23, 2025
🤗 Upvotes: 78 | cs.AI, cs.CL, cs.LG
Authors:
Wanghan Xu, Yuhao Zhou, Yifan Zhou, Qinglong Cao, Shuo Li, Jia Bu, Bo Liu, Yixin Chen, Xuming He, Xiangyu Zhao, Xiang Zhuang, Fengxiang Wang, Zhiwang Zhou, Qiantai Feng, Wenxuan Huang, Jiaqi Wei, Hao Wu, Yuejin Yang, Guangshuai Wang, Sheng Xu, Ziyan Huang, Xinyao Liu, Jiyao Liu, Cheng Tang, Wei Li, Ying Chen, Junzhi Ning, Pengfei Jiang, Chenglong Ma, Ye Du, Changkai Ji, Huihui Xu, Ming Hu, Jiangbin Zheng, Xin Chen, Yucheng Wu, Feifei Jiang, Xi Chen, Xiangru Tang, Yuchen Fu, Yingzhou Lu, Yuanyuan Zhang, Lihao Sun, Chengbo Li, Jin...
PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence
Dec 23, 2025
🤗 Upvotes: 64 | cs.RO
Authors:
Xiaopeng Lin, Shijie Lian, Bin Yu, Ruoqi Yang, Changti Wu, Yuzhuo Miao, Yurun Jin, Yukun Shi, Cong Huang, Bojun Cheng, Kai Chen
Title:
PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence
Arxiv:
http://arxiv.org/abs/2512.16793v1
Abstract:
Robotic generalization relies on physical intelligence: the ability to reason about state changes, contact-rich interactions, and long-horizon planning under egocentric perception and action. However, most VLMs are trained primarily on third-person data, creating a fundamental viewpoint mismatch for humanoid rob...
When Reasoning Meets Its Laws
Dec 23, 2025
🤗 Upvotes: 48 | cs.AI, cs.CL
Authors:
Junyu Zhang, Yifan Sun, Tianang Leng, Jingyan Shen, Liu Ziyin, Paul Pu Liang, Huan Zhang
Title:
When Reasoning Meets Its Laws
Arxiv:
http://arxiv.org/abs/2512.17901v1
Abstract:
Despite the superior performance of Large Reasoning Models (LRMs), their reasoning behaviors are often counterintuitive, leading to suboptimal reasoning capabilities. To theoretically formalize the desired reasoning behaviors, this paper presents the Laws of Reasoning (LoRe), a unified framework that characterizes intrinsic reasoning patterns in LRMs. We first propose compute law with the hypothesis tha...
Seed-Prover 1.5: Mastering Undergraduate-Level Theorem Proving via Learning from Experience
Dec 23, 2025
🤗 Upvotes: 40 | cs.CL
Authors:
Jiangjie Chen, Wenxiang Chen, Jiacheng Du, Jinyi Hu, Zhicheng Jiang, Allan Jie, Xiaoran Jin, Xing Jin, Chenggang Li, Wenlei Shi, Zhihong Wang, Mingxuan Wang, Chenrui Wei, Shufa Wei, Huajian Xin, Fan Yang, Weihao Gao, Zheng Yuan, Tianyang Zhan, Zeyu Zheng, Tianxi Zhou, Thomas Hanwen Zhu
Title:
Seed-Prover 1.5: Mastering Undergraduate-Level Theorem Proving via Learning from Experience
Arxiv:
http://arxiv.org/abs/2512.17260v1
Abstract:
Large language models have recently made significant progress in generating rigorous mathematical proofs. In contrast, utilizing LLMs for theorem proving in for...
4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation
Dec 23, 2025
🤗 Upvotes: 31 | cs.CV
Authors:
Chiao-An Yang, Ryo Hachiuma, Sifei Liu, Subhashree Radhakrishnan, Raymond A. Yeh, Yu-Chiang Frank Wang, Min-Hung Chen
Title:
4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation
Arxiv:
http://arxiv.org/abs/2512.17012v1
Abstract:
Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D Video Question Answering (VQA) benchmarks also emphasize static scenes and lack region-level prompting. We tackle these issues by introducing: (a) 4D-RGPT, a s...
Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing
Dec 23, 2025
🤗 Upvotes: 30 | cs.CV
Authors:
Shilong Zhang, He Zhang, Zhifei Zhang, Chongjian Ge, Shuchen Xue, Shaoteng Liu, Mengwei Ren, Soo Ye Kim, Yuqian Zhou, Qing Liu, Daniil Pakhomov, Kai Zhang, Zhe Lin, Ping Luo
Title:
Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing
Arxiv:
http://arxiv.org/abs/2512.17909v1
Abstract:
Modern Latent Diffusion Models (LDMs) typically operate in low-level Variational Autoencoder (VAE) latent spaces that are primarily optimized for pixel-level reconstruction. To unify vision generation and understanding, a burgeoning trend is to ado...
Are We on the Right Way to Assessing LLM-as-a-Judge?
Dec 23, 2025
🤗 Upvotes: 24 | cs.CL, cs.AI
Authors:
Yuanning Feng, Sinan Wang, Zhengxiang Cheng, Yao Wan, Dongping Chen
Title:
Are We on the Right Way to Assessing LLM-as-a-Judge?
Arxiv:
http://arxiv.org/abs/2512.16041v1
Abstract:
LLM-as-a-Judge has been widely adopted as an evaluation method and as a source of supervised rewards in model training. However, existing benchmarks for LLM-as-a-Judge mainly rely on human-annotated ground truth, which introduces human bias that undermines the assessment of reliability and imposes scalability constraints. To overcome these limitations, we introduce Sage, a novel evaluation suite tha...
Kling-Omni Technical Report
Dec 20, 2025
🤗 Upvotes: 112 | cs.CV
Authors:
Kling Team, Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng, Kun Gai, Sainan Guo, Feng Han, Jingbin He, Kang He, Xiao Hu, Xiaohua Hu, Boyuan Jiang, Fangyuan Kong, Hang Li, Jie Li, Qingyu Li, Shen Li, Xiaohan Li, Yan Li, Jiajun Liang, Borui Liao, Yiqiao Liao, Weihong Lin, Quande Liu, Xiaokun Liu, Yilun Liu, Yuliang Liu, Shun Lu, Hangyu Mao, Yunyao Mao, Haodong Ouyang, Wenyu Qin, Wanqi Shi, Xiaoyu Shi, Lianghao Su, Haozhi Sun, Peiqin Sun, Pengfei Wan, Chao Wang, Chenyu Wang, Meng Wang, Qiulin Wang, Runqi Wang, Xintao Wang, Xuebo Wang, Zek...
Adaptation of Agentic AI
Dec 20, 2025
🤗 Upvotes: 59 | cs.AI, cs.CL
Authors:
Pengcheng Jiang, Jiacheng Lin, Zhiyi Shi, Zifeng Wang, Luxi He, Yichen Wu, Ming Zhong, Peiyang Song, Qizheng Zhang, Heng Wang, Xueqiang Xu, Hanwen Xu, Pengrui Han, Dylan Zhang, Jiashuo Sun, Chaoqi Yang, Kun Qian, Tian Wang, Changran Hu, Manling Li, Quanzheng Li, Hao Peng, Sheng Wang, Jingbo Shang, Chao Zhang, Jiaxuan You, Liyuan Liu, Pan Lu, Yu Zhang, Heng Ji, Yejin Choi, Dawn Song, Jimeng Sun, Jiawei Han
Title:
Adaptation of Agentic AI
Arxiv:
http://arxiv.org/abs/2512.16301v1
Abstract:
Cutting-edge age...
LLaDA2.0: Scaling Up Diffusion Language Models to 100B
Dec 20, 2025
🤗 Upvotes: 54 | cs.LG, cs.AI, cs.CL
Authors:
Tiwei Bie, Maosong Cao, Kun Chen, Lun Du, Mingliang Gong, Zhuochen Gong, Yanmei Gu, Jiaqi Hu, Zenan Huang, Zhenzhong Lan, Chengxi Li, Chongxuan Li, Jianguo Li, Zehuan Li, Huabin Liu, Ling Liu, Guoshan Lu, Xiaocheng Lu, Yuxin Ma, Jianfeng Tan, Lanning Wei, Ji-Rong Wen, Yipeng Xing, Xiaolu Zhang, Junbo Zhao, Da Zheng, Jun Zhou, Junlin Zhou, Zhanchao Zhou, Liwang Zhu, Yihong Zhuang
Title:
LLaDA2.0: Scaling Up Diffusion Language Models to 100B
Arxiv:
http://arxiv.org/abs/2512.15745v1
Abstract:
This pap...
Next-Embedding Prediction Makes Strong Vision Learners
Dec 20, 2025
🤗 Upvotes: 49 | cs.CV
Authors:
Sihan Xu, Ziqiao Ma, Wenhao Chai, Xuweiyi Chen, Weiyang Jin, Joyce Chai, Saining Xie, Stella X. Yu
Title:
Next-Embedding Prediction Makes Strong Vision Learners
Arxiv:
http://arxiv.org/abs/2512.16922v1
Abstract:
Inspired by the success of generative pretraining in natural language, we ask whether the same principles can yield strong self-supervised visual learners. Instead of training models to output features for downstream use, we train them to generate embeddings to perform predictive tasks directly. This work explores such a shift from learning representations to...
StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors
Dec 20, 2025
🤗 Upvotes: 33 | cs.CV
Authors:
Guibao Shen, Yihua Du, Wenhang Ge, Jing He, Chirui Chang, Donghao Zhou, Zhen Yang, Luozhou Wang, Xin Tao, Ying-Cong Chen
Title:
StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors
Arxiv:
http://arxiv.org/abs/2512.16915v1
Abstract:
The rapid growth of stereoscopic displays, including VR headsets and 3D cinemas, has led to increasing demand for high-quality stereo video content. However, producing 3D videos remains costly and complex, while automatic Monocular-to-Stereo conversion is hindered by the limitations of the multi-stage "Depth-Warp-Inpaint" (DWI) pipeline. Thi...
Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model
Dec 20, 2025
🤗 Upvotes: 30 | cs.CV
Authors:
Heyi Chen, Siyan Chen, Xin Chen, Yanfei Chen, Ying Chen, Zhuo Chen, Feng Cheng, Tianheng Cheng, Xinqi Cheng, Xuyan Chi, Jian Cong, Jing Cui, Qinpeng Cui, Qide Dong, Junliang Fan, Jing Fang, Zetao Fang, Chengjian Feng, Han Feng, Mingyuan Gao, Yu Gao, Dong Guo, Qiushan Guo, Boyang Hao, Qingkai Hao, Bibo He, Qian He, Tuyen Hoang, Ruoqing Hu, Xi Hu, Weilin Huang, Zhaoyang Huang, Zhongyi Huang, Donglei Ji, Siqi Jiang, Wei Jiang, Yunpu Jiang, Zhuo Jiang, Ashley Kim, Jianan Kong, Zhichao Lai, Shanshan Lao, Yichong Leng, Ai Li, Feiya Li, Gen Li, Hui...
Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation
Dec 20, 2025
🤗 Upvotes: 28 | cs.CV
Authors:
Xin Lin, Meixi Song, Dizhe Zhang, Wenxuan Lu, Haodong Li, Bo Du, Ming-Hsuan Yang, Truong Nguyen, Lu Qi
Title:
Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation
Arxiv:
http://arxiv.org/abs/2512.16913v1
Abstract:
In this work, we present a panoramic metric depth foundation model that generalizes across diverse scene distances. We explore a data-in-the-loop paradigm from the view of both data construction and framework design. We collect a large-scale dataset by combining public datasets, high-quality synthetic data from our UE5 sim...
Generative Refocusing: Flexible Defocus Control from a Single Image
Dec 20, 2025
🤗 Upvotes: 26 | cs.CV
Authors:
Chun-Wei Tuan Mu, Jia-Bin Huang, Yu-Lun Liu
Title:
Generative Refocusing: Flexible Defocus Control from a Single Image
Arxiv:
http://arxiv.org/abs/2512.16923v1
Abstract:
Depth-of-field control is essential in photography, but getting the perfect focus often takes several tries or special equipment. Single-image refocusing is still difficult. It involves recovering sharp content and creating realistic bokeh. Current methods have significant drawbacks. They need all-in-focus inputs, depend on synthetic data from simulators, and have limited control over aperture. We introduce Generative Refocusing, a two-step pro...
DeContext as Defense: Safe Image Editing in Diffusion Transformers
Dec 20, 2025
🤗 Upvotes: 22 | cs.CV
Authors:
Linghui Shen, Mingyue Cui, Xingyi Yang
Title:
DeContext as Defense: Safe Image Editing in Diffusion Transformers
Arxiv:
http://arxiv.org/abs/2512.16625v1
Abstract:
In-context diffusion models allow users to modify images with remarkable ease and realism. However, the same power raises serious privacy concerns: personal images can be easily manipulated for identity impersonation, misinformation, or other malicious uses, all without the owner's consent. While prior work has explored input perturbations to protect against misuse in personalized text-to-image generation, the robustness of modern, large-scale in...
Step-GUI Technical Report
Dec 19, 2025
🤗 Upvotes: 87 | cs.CV
Authors:
Haolong Yan, Jia Wang, Xin Huang, Yeqing Shen, Ziyang Meng, Zhimin Fan, Kaijun Tan, Jin Gao, Lieyu Shi, Mi Yang, Shiliang Yang, Zhirui Wang, Brian Li, Kang An, Chenyang Li, Lei Lei, Mengmeng Duan, Danxun Liang, Guodong Liu, Hang Cheng, Hao Wu, Jie Dong, Junhao Huang, Mei Chen, Renjie Yu, Shunshan Li, Xu Zhou, Yiting Dai, Yineng Deng, Yingdan Liang, Zelin Chen, Wen Sun, Chengxu Yan, Chunqin Xu, Dong Li, Fengqiong Xiao, Guanghao Fan, Guopeng Li, Guozhen Peng, Hongbing Li, Hang Li, Hongming Chen, Jingjing Xie, Jianyong Li, Jingyang Zhang, Jiaju Ren, Jia...
DEER: Draft with Diffusion, Verify with Autoregressive Models
Dec 19, 2025
🤗 Upvotes: 39 | cs.LG, cs.AI
Authors:
Zicong Cheng, Guo-Wei Yang, Jia Li, Zhijie Deng, Meng-Hao Guo, Shi-Min Hu
Title:
DEER: Draft with Diffusion, Verify with Autoregressive Models
Arxiv:
http://arxiv.org/abs/2512.15176v1
Abstract:
Efficiency, as a critical practical challenge for LLM-driven agentic and reasoning systems, is increasingly constrained by the inherent latency of autoregressive (AR) decoding. Speculative decoding mitigates this cost through a draft-verify scheme, yet existing approaches rely on AR draft models (a.k.a., drafters), which introduce two fundamental issues: (1) step-wise uncertainty accumulation leads to...
Fast and Accurate Causal Parallel Decoding using Jacobi Forcing
Dec 19, 2025
🤗 Upvotes: 36 | cs.CL
Authors:
Lanxiang Hu, Siqi Kou, Yichao Fu, Samyam Rajbhandari, Tajana Rosing, Yuxiong He, Zhijie Deng, Hao Zhang
Title:
Fast and Accurate Causal Parallel Decoding using Jacobi Forcing
Arxiv:
http://arxiv.org/abs/2512.14681v1
Abstract:
Multi-token generation has emerged as a promising paradigm for accelerating transformer-based large model inference. Recent efforts primarily explore diffusion Large Language Models (dLLMs) for parallel decoding to reduce inference latency. To achieve AR-level generation quality, many techniques adapt AR models into dLLMs to enable parallel decoding. However, they suffer from lim...
HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices
Dec 19, 2025
🤗 Upvotes: 31 | cs.CV, cs.CL
Authors:
HyperAI Team, Yuchen Liu, Kaiyang Han, Zhiqiang Xia, Yuhang Dong, Chen Song, Kangyu Tang, Jiaming Xu, Xiushi Feng, WenXuan Yu, Li Peng, Mingyang Wang, Kai Wang, Changpeng Yang, Yang Li, Haoyu Lu, Hao Wang, Bingna Xu, Guangyao Liu, Long Huang, Kaibin Guo, Jinyang Wu, Dan Wu, Hongzhen Wang, Peng Zhou, Shuai Nie, Shande Wang, Runyu Shi, Ying Huang
Title:
HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices
Arxiv:
http://arxiv.org/abs/2512.14052v1
Abstract:
Current multimodal large lan...
Puzzle Curriculum GRPO for Vision-Centric Reasoning
Dec 19, 2025
🤗 Upvotes: 30 | cs.CV
Authors:
Ahmadreza Jeddi, Hakki Can Karaimer, Hue Nguyen, Zhongling Wang, Ke Zhao, Javad Rajabi, Ran Zhang, Raghav Goyal, Babak Taati, Radek Grzeszczuk
Title:
Puzzle Curriculum GRPO for Vision-Centric Reasoning
Arxiv:
http://arxiv.org/abs/2512.14944v1
Abstract:
Recent reinforcement learning (RL) approaches like outcome-supervised GRPO have advanced chain-of-thought reasoning in Vision Language Models (VLMs), yet key issues linger: (i) reliance on costly and noisy hand-curated annotations or external verifiers; (ii) flat and sparse reward schemes in GRPO; and (iii) logical inconsistency between a chain's reasoning and...
MMGR: Multi-Modal Generative Reasoning
Dec 18, 2025
🤗 Upvotes: 82 | cs.CL, cs.CV
Authors:
Zefan Cai, Haoyi Qiu, Tianyi Ma, Haozhe Zhao, Gengze Zhou, Kung-Hsiang Huang, Parisa Kordjamshidi, Minjia Zhang, Wen Xiao, Jiuxiang Gu, Nanyun Peng, Junjie Hu
Title:
MMGR: Multi-Modal Generative Reasoning
Arxiv:
http://arxiv.org/abs/2512.14691v2
Abstract:
Video foundation models generate visually realistic and temporally coherent content, but their reliability as world simulators depends on whether they capture physical, logical, and spatial constraints. Existing metrics such as Frechet Video Distance (FVD) emphasize perceptual quality and overlook reasoning failures, including violations of causality, phy...
Video Reality Test: Can AI-Generated ASMR Videos fool VLMs and Humans?
Dec 18, 2025
🤗 Upvotes: 53 | cs.CV
Authors:
Jiaqi Wang, Weijia Wu, Yi Zhan, Rui Zhao, Ming Hu, James Cheng, Wei Liu, Philip Torr, Kevin Qinghong Lin
Title:
Video Reality Test: Can AI-Generated ASMR Videos fool VLMs and Humans?
Arxiv:
http://arxiv.org/abs/2512.13281v2
Abstract:
Recent advances in video generation have produced vivid content that is often indistinguishable from real videos, making AI-generated video detection an emerging societal challenge. Prior AIGC detection benchmarks mostly evaluate video without audio, target broad narrative domains, and focus solely on classification. Yet it remains unc...
WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling
Dec 18, 2025
🤗 Upvotes: 49 | cs.CV, cs.GR
Authors:
Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, Chunchao Guo
Title:
WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling
Arxiv:
http://arxiv.org/abs/2512.14614v1
Abstract:
This paper presents WorldPlay, a streaming video diffusion model that enables real-time, interactive world modeling with long-term geometric consistency, resolving the trade-off between speed and memory that limits current methods. WorldPlay draws power from three key innovations. 1) We use a Dual Action Representation to...
Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling
Dec 18, 2025
🤗 Upvotes: 39 | cs.CV, cs.AI
Authors:
Yuran Wang, Bohan Zeng, Chengzhuo Tong, Wenxuan Liu, Yang Shi, Xiaochen Ma, Hao Liang, Yuanxing Zhang, Wentao Zhang
Title:
Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling
Arxiv:
http://arxiv.org/abs/2512.12675v1
Abstract:
Subject-driven image generation has advanced from single- to multi-subject composition, while neglecting distinction, the ability to identify and generate the correct subject when inputs contain multiple candidates. This limitation restricts effectiveness in complex, realistic visual settings. We propose Scone, a unified understanding-generation met...
RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics
Dec 18, 2025
🤗 Upvotes: 31 | cs.RO, cs.CV
Authors:
Enshen Zhou, Cheng Chi, Yibo Li, Jingkun An, Jiayuan Zhang, Shanyu Rong, Yi Han, Yuheng Ji, Mengzhen Liu, Pengwei Wang, Zhongyuan Wang, Lu Sheng, Shanghang Zhang
Title:
RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics
Arxiv:
http://arxiv.org/abs/2512.13660v1
Abstract:
Spatial tracing, as a fundamental embodied interaction ability for robots, is inherently challenging as it requires multi-step metric-grounded reasoning compounded with complex spatial referring and real-world metric measurement. However, existing methods struggle with this compositional task. To...
OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value
Dec 18, 2025
🤗 Upvotes: 27 | cs.AI
Authors:
Mengzhang Cai, Xin Gao, Yu Li, Honglin Lin, Zheng Liu, Zhuoshi Pan, Qizhi Pei, Xiaoran Shang, Mengyuan Sun, Zinan Tang, Xiaoyang Wang, Zhanping Zhong, Yun Zhu, Dahua Lin, Conghui He, Lijun Wu
Title:
OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value
Arxiv:
http://arxiv.org/abs/2512.14051v1
Abstract:
The rapid evolution of Large Language Models (LLMs) is predicated on the quality and diversity of post-training datasets. However, a critical dichotomy persists: while models are rigorously benchmarked, the data fueling them rem...
ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding
Dec 17, 2025
🤗 Upvotes: 75 | cs.CL, cs.AI, cs.LG
Authors:
Jia-Nan Li, Jian Guan, Wei Wu, Chongxuan Li
Title:
ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding
Arxiv:
http://arxiv.org/abs/2512.13586v1
Abstract:
Autoregressive models (ARMs) are hindered by slow sequential inference. While masked diffusion models (MDMs) offer a parallel alternative, they suffer from critical drawbacks: high computational overhead from precluding Key-Value (KV) caching, and incoherent generation arising from learning dependencies over an intractable space of token combinations. To address these limitations, we introduce ReFusion, a nov...
Towards Scalable Pre-training of Visual Tokenizers for Generation
Dec 17, 2025
🤗 Upvotes: 70 | cs.CV
Authors:
Jingfeng Yao, Yuda Song, Yucong Zhou, Xinggang Wang
Title:
Towards Scalable Pre-training of Visual Tokenizers for Generation
Arxiv:
http://arxiv.org/abs/2512.13687v1
Abstract:
The quality of the latent space in visual tokenizers (e.g., VAEs) is crucial for modern generative models. However, the standard reconstruction-based training paradigm produces a latent space that is biased towards low-level information, leading to a foundational flaw: better pixel-level accuracy does not lead to higher-quality generation. This implies that pouring extensive compute into visual tokenizer pre-training translates poo...
Memory in the Age of AI Agents
Dec 17, 2025
🤗 Upvotes: 68 | cs.CL, cs.AI
Authors:
Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, Senjie Jin, Jiejun Tan, Yanbin Yin, Jiongnan Liu, Zeyu Zhang, Zhongxiang Sun, Yutao Zhu, Hao Sun, Boci Peng, Zhenrong Cheng, Xuanbo Fan, Jiaxin Guo, Xinlei Yu, Zhenhong Zhou, Zewen Hu, Jiahao Huo, Junhao Wang, Yuwei Niu, Yu Wang, Zhenfei Yin, Xiaobin Hu, Yue Liao, Qiankun Li, Kun Wang, Wangchunshu Zhou, Yixin Liu, Dawei Cheng, Qi Zhang, Tao Gui, Shirui Pan, Yan Zhang, Philip Torr, Zhicheng Dou, Ji-Rong Wen, Xuanjing Huang, Yu...
QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management
Dec 17, 2025
🤗 Upvotes: 59 | cs.CL
Authors:
Weizhou Shen, Ziyi Yang, Chenliang Li, Zhiyuan Lu, Miao Peng, Huashan Sun, Yingcheng Shi, Shengyi Liao, Shaopeng Lai, Bo Zhang, Dayiheng Liu, Fei Huang, Jingren Zhou, Ming Yan
Title:
QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management
Arxiv:
http://arxiv.org/abs/2512.12967v1
Abstract:
We introduce QwenLong-L1.5, a model that achieves superior long-context reasoning capabilities through systematic post-training innovations. The key technical breakthroughs of QwenLong-L1.5 are as follows: (1) Long-Context Data Synthesis Pipeline: We develop a systematic synthesis framework that generates challenging reasoning tas...
LongVie 2: Multimodal Controllable Ultra-Long Video World Model
Dec 17, 2025
🤗 Upvotes: 53 | cs.CV
Authors:
Jianxiong Gao, Zhaoxi Chen, Xian Liu, Junhao Zhuang, Chengming Xu, Jianfeng Feng, Yu Qiao, Yanwei Fu, Chenyang Si, Ziwei Liu
Title:
LongVie 2: Multimodal Controllable Ultra-Long Video World Model
Arxiv:
http://arxiv.org/abs/2512.13604v1
Abstract:
Building video world models upon pretrained video generation systems represents an important yet challenging step toward general spatiotemporal intelligence. A world model should possess three essential properties: controllability, long-term visual quality, and temporal consistency. To this end, we take a progressive approach: first enhancing controllability and then extending toward lon...
Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows
Dec 17, 2025
🤗 Upvotes: 47 | cs.AI, cs.CE, cs.IR, cs.MA
Authors:
Haoyu Dong, Pengkun Zhang, Yan Gao, Xuanyu Dong, Yilin Cheng, Mingzhe Lu, Adina Yakefu, Shuxin Zheng
Title:
Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows
Arxiv:
http://arxiv.org/abs/2512.13168v1
Abstract:
We introduce a finance & accounting benchmark (Finch) for evaluating AI agents on real-world, enterprise-grade professional workflows -- interleaving data entry, structuring, formatting, web search, cross-file retrieval, calculation, modeling, validation, translation, visualization, and reporting. Finch is sourced from authentic enterprise workspaces at Enron (15,000 spreadsheets and 500,000 emails fro...
NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents
Dec 17, 2025
🤗 Upvotes: 39 | cs.CL
Authors:
Jingzhe Ding, Shengda Long, Changxin Pu, Huan Zhou, Hongwan Gao, Xiang Gao, Chao He, Yue Hou, Fei Hu, Zhaojian Li, Weiran Shi, Zaiyuan Wang, Daoguang Zan, Chenchen Zhang, Xiaoxu Zhang, Qizhi Chen, Xianfu Cheng, Bo Deng, Qingshui Gu, Kai Hua, Juntao Lin, Pai Liu, Mingchen Li, Xuanguang Pan, Zifan Peng, Yujia Qin, Yong Shan, Zhewen Tan, Weihao Xie, Zihan Wang, Yishuo Yuan, Jiayu Zhang, Enduo Zhao, Yunfei Zhao, He Zhu, Chenyang Zou, Ming Ding, Jianpeng Jiao, Jiaheng Liu, Minghao Liu, Qian Liu, Chongyao Tao, Jian Yang, Tong Yang, Zhaoxiang Zhang, Xinjie Chen, Wen...
Error-Free Linear Attention is a Free Lunch: Exact Solution from Continuous-Time Dynamics
Dec 17, 2025
🤗 Upvotes: 34 | cs.LG
Authors:
Jingdi Lei, Di Zhang, Soujanya Poria
Title:
Error-Free Linear Attention is a Free Lunch: Exact Solution from Continuous-Time Dynamics
Arxiv:
http://arxiv.org/abs/2512.12602v1
Abstract:
Linear-time attention and State Space Models (SSMs) promise to solve the quadratic cost bottleneck in long-context language models employing softmax attention. We introduce Error-Free Linear Attention (EFLA), a numerically stable, fully parallel, and generalized formulation of the delta rule. Specifically, we formulate the online learning update as a continuous-time dynamical system and prove that its exact solution is...
KlingAvatar 2.0 Technical Report
Dec 17, 2025
🤗 Upvotes: 31 | cs.CV
Authors:
Kling Team, Jialu Chen, Yikang Ding, Zhixue Fang, Kun Gai, Yuan Gao, Kang He, Jingyun Hua, Boyuan Jiang, Mingming Lao, Xiaohan Li, Hui Liu, Jiwen Liu, Xiaoqiang Liu, Yuan Liu, Shun Lu, Yongsen Mao, Yingchao Shao, Huafeng Shi, Xiaoyu Shi, Peiqin Sun, Songlin Tang, Pengfei Wan, Chao Wang, Xuebo Wang, Haoxian Zhang, Yuanxing Zhang, Yan Zhou
Title:
KlingAvatar 2.0 Technical Report
Arxiv:
http://arxiv.org/abs/2512.13313v1
Abstract:
Avatar video generation models have achieved remarkable progress in recent years. However, prior work exhibits limited eff...
MentraSuite: Post-Training Large Language Models for Mental Health Reasoning and Assessment
Dec 17, 2025
🤗 Upvotes: 22 | cs.CL
Authors:
Mengxi Xiao, Kailai Yang, Pengde Zhao, Enze Zhang, Ziyan Kuang, Zhiwei Liu, Weiguang Han, Shu Liao, Lianting Huang, Jinpeng Hu, Min Peng, Qianqian Xie, Sophia Ananiadou
Title:
MentraSuite: Post-Training Large Language Models for Mental Health Reasoning and Assessment
Arxiv:
http://arxiv.org/abs/2512.09636v2
Abstract:
Mental health disorders affect hundreds of millions globally, and the Web now serves as a primary medium for accessing support, information, and assessment. Large language models (LLMs) offer scalable and accessible assistance, yet their deployment in mental-health settings rem...
EgoX: Egocentric Video Generation from a Single Exocentric Video
Dec 16, 2025
🤗 Upvotes: 48 | cs.CV
Authors:
Taewoong Kang, Kinam Kim, Dohyeon Kim, Minho Park, Junha Hyung, Jaegul Choo
Title:
EgoX: Egocentric Video Generation from a Single Exocentric Video
Arxiv:
http://arxiv.org/abs/2512.08269v1
Abstract:
Egocentric perception enables humans to experience and understand the world directly from their own point of view. Translating exocentric (third-person) videos into egocentric (first-person) videos opens up new possibilities for immersive understanding but remains highly challenging due to extreme camera pose variations and minimal view overlap. This task requires faithfully preserving visible content while syn...