Reinforcement learning
Diffusion Augmented Agents: A Framework for Efficient Exploration and
Transfer Learning
Paper
• 2407.20798
• Published
• 24
Offline Reinforcement Learning for LLM Multi-Step Reasoning
Paper
• 2412.16145
• Published
• 38
REINFORCE++: A Simple and Efficient Approach for Aligning Large Language
Models
Paper
• 2501.03262
• Published
• 104
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open
Software Evolution
Paper
• 2502.18449
• Published
• 75
Learning to Reason under Off-Policy Guidance
Paper
• 2504.14945
• Published
• 88
LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making
Abilities
Paper
• 2504.16078
• Published
• 21
OTC: Optimal Tool Calls via Reinforcement Learning
Paper
• 2504.14870
• Published
• 35
Reinforcement Learning for Reasoning in Large Language Models with One
Training Example
Paper
• 2504.20571
• Published
• 98
Thinkless: LLM Learns When to Think
Paper
• 2505.13379
• Published
• 50
Visual Agentic Reinforcement Fine-Tuning
Paper
• 2505.14246
• Published
• 32
Reinforcement Learning Finetunes Small Subnetworks in Large Language
Models
Paper
• 2505.11711
• Published
• 11
Scaling Reasoning, Losing Control: Evaluating Instruction Following in
Large Reasoning Models
Paper
• 2505.14810
• Published
• 62
TinyV: Reducing False Negatives in Verification Improves RL for LLM
Reasoning
Paper
• 2505.14625
• Published
• 13
ConvSearch-R1: Enhancing Query Reformulation for Conversational Search
with Reasoning via Reinforcement Learning
Paper
• 2505.15776
• Published
• 11
Teaching Large Language Models to Maintain Contextual Faithfulness via
Synthetic Tasks and Reinforcement Learning
Paper
• 2505.16483
• Published
• 10
One RL to See Them All: Visual Triple Unified Reinforcement Learning
Paper
• 2505.18129
• Published
• 62
Synthetic Data RL: Task Definition Is All You Need
Paper
• 2505.17063
• Published
• 11
ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in
Large Language Models
Paper
• 2505.24864
• Published
• 144
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective
Reinforcement Learning for LLM Reasoning
Paper
• 2506.01939
• Published
• 188
Reinforcement Pre-Training
Paper
• 2506.08007
• Published
• 263
Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes
Correct Reasoning in Base LLMs
Paper
• 2506.14245
• Published
• 45
RLVER: Reinforcement Learning with Verifiable Emotion Rewards for
Empathetic Agents
Paper
• 2507.03112
• Published
• 34
Pre-Trained Policy Discriminators are General Reward Models
Paper
• 2507.05197
• Published
• 39
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based
Reinforcement Learning
Paper
• 2507.05920
• Published
• 12
Inverse Reinforcement Learning Meets Large Language Model Post-Training:
Basics, Advances, and Opportunities
Paper
• 2507.13158
• Published
• 24
The Invisible Leash: Why RLVR May Not Escape Its Origin
Paper
• 2507.14843
• Published
• 85
MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via
Context-Aware Multi-Stage Policy Optimization
Paper
• 2507.14683
• Published
• 134
A Simple "Try Again" Can Elicit Multi-Turn LLM Reasoning
Paper
• 2507.14295
• Published
• 14
Geometric-Mean Policy Optimization
Paper
• 2507.20673
• Published
• 32
Beyond the Trade-off: Self-Supervised Reinforcement Learning for
Reasoning Models' Instruction Following
Paper
• 2508.02150
• Published
• 37
Reinforcement Learning in Vision: A Survey
Paper
• 2508.08189
• Published
• 30
Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning
Paper
• 2508.08221
• Published
• 50
Feedback-Driven Tool-Use Improvements in Large Language Models via
Automated Build Environments
Paper
• 2508.08791
• Published
• 16
Pass@k Training for Adaptively Balancing Exploration and Exploitation of
Large Reasoning Models
Paper
• 2508.10751
• Published
• 29
DuPO: Enabling Reliable LLM Self-Verification via Dual Preference
Optimization
Paper
• 2508.14460
• Published
• 85
Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement
Learning for General LLM Reasoning
Paper
• 2508.16949
• Published
• 24
TreePO: Bridging the Gap of Policy Optimization and Efficacy and
Inference Efficiency with Heuristic Tree-based Modeling
Paper
• 2508.17445
• Published
• 80
Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable
Text-to-Image Reinforcement Learning
Paper
• 2508.20751
• Published
• 89
R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs
via Bi-Mode Annealing and Reinforce Learning
Paper
• 2508.21113
• Published
• 110
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
Paper
• 2509.02547
• Published
• 231
Implicit Actor Critic Coupling via a Supervised Learning Framework for
RLVR
Paper
• 2509.02522
• Published
• 26
SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn
Tool-Integrated Reasoning
Paper
• 2509.02479
• Published
• 84
Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM
Step-Provers
Paper
• 2509.06493
• Published
• 12
Language Self-Play For Data-Free Training
Paper
• 2509.07414
• Published
• 31
Bootstrapping Task Spaces for Self-Improvement
Paper
• 2509.04575
• Published
• 6
Revolutionizing Reinforcement Learning Framework for Diffusion Large
Language Models
Paper
• 2509.06949
• Published
• 56
Reinforcement Learning Foundations for Deep Research Systems: A Survey
Paper
• 2509.06733
• Published
• 32
RewardDance: Reward Scaling in Visual Generation
Paper
• 2509.08826
• Published
• 73
A Survey of Reinforcement Learning for Large Reasoning Models
Paper
• 2509.08827
• Published
• 190
Single-stream Policy Optimization
Paper
• 2509.13232
• Published
• 34
Tree Search for LLM Agent Reinforcement Learning
Paper
• 2509.21240
• Published
• 92
EPO: Entropy-regularized Policy Optimization for LLM Agents
Reinforcement Learning
Paper
• 2509.22576
• Published
• 135
Variational Reasoning for Language Models
Paper
• 2509.22637
• Published
• 69
TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning
Paper
• 2509.25760
• Published
• 55
Learn the Ropes, Then Trust the Wins: Self-imitation with Progressive
Exploration for Agentic Reinforcement Learning
Paper
• 2509.22601
• Published
• 30
Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget
Allocation
Paper
• 2509.25849
• Published
• 48
Attention as a Compass: Efficient Exploration for Process-Supervised RL
in Reasoning Models
Paper
• 2509.26628
• Published
• 17
DeepSearch: Overcome the Bottleneck of Reinforcement Learning with
Verifiable Rewards via Monte Carlo Tree Search
Paper
• 2509.25454
• Published
• 146
BroRL: Scaling Reinforcement Learning via Broadened Exploration
Paper
• 2510.01180
• Published
• 20
RLP: Reinforcement as a Pretraining Objective
Paper
• 2510.01265
• Published
• 44
Reinforce-Ada: An Adaptive Sampling Framework for Reinforce-Style LLM
Training
Paper
• 2510.04996
• Published
• 16
Training-Free Group Relative Policy Optimization
Paper
• 2510.08191
• Published
• 45
Low-probability Tokens Sustain Exploration in Reinforcement Learning
with Verifiable Reward
Paper
• 2510.03222
• Published
• 75
Agent Learning via Early Experience
Paper
• 2510.08558
• Published
• 273
Demystifying Reinforcement Learning in Agentic Reasoning
Paper
• 2510.11701
• Published
• 33
Information Gain-based Policy Optimization: A Simple and Effective
Approach for Multi-Turn LLM Agents
Paper
• 2510.14967
• Published
• 34
Agentic Entropy-Balanced Policy Optimization
Paper
• 2510.14545
• Published
• 106
LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts
Paper
• 2510.19363
• Published
• 62
Unified Reinforcement and Imitation Learning for Vision-Language Models
Paper
• 2510.19307
• Published
• 32
Every Question Has Its Own Value: Reinforcement Learning with Explicit
Human Values
Paper
• 2510.20187
• Published
• 19
Supervised Reinforcement Learning: From Expert Trajectories to Step-wise
Reasoning
Paper
• 2510.25992
• Published
• 48
Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised
Reinforcement Learning
Paper
• 2510.27606
• Published
• 31
DRIVE: Data Curation Best Practices for Reinforcement Learning with
Verifiable Reward in Competitive Code Generation
Paper
• 2511.06307
• Published
• 53
The Path Not Taken: RLVR Provably Learns Off the Principals
Paper
• 2511.08567
• Published
• 34
Soft Adaptive Policy Optimization
Paper
• 2511.20347
• Published
• 42
DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research
Paper
• 2511.19399
• Published
• 62
SR-GRPO: Stable Rank as an Intrinsic Geometric Reward for Large Language Model Alignment
Paper
• 2512.02807
• Published
• 9
PretrainZero: Reinforcement Active Pretraining
Paper
• 2512.03442
• Published
• 48
On GRPO Collapse in Search-R1: The Lazy Likelihood-Displacement Death Spiral
Paper
• 2512.04220
• Published
• 16
BEAVER: An Efficient Deterministic LLM Verifier
Paper
• 2512.05439
• Published
• 36
Turn-PPO: Turn-Level Advantage Estimation with PPO for Improved Multi-Turn RL in Agentic LLMs
Paper
• 2512.17008
• Published
• 11
Reinforcement Learning for Self-Improving Agent with Skill Library
Paper
• 2512.17102
• Published
• 36
Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies
Paper
• 2512.19673
• Published
• 64
Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning
Paper
• 2512.20605
• Published
• 62
Rethinking Sample Polarity in Reinforcement Learning with Verifiable Rewards
Paper
• 2512.21625
• Published
• 4
Evaluating Parameter Efficient Methods for RLVR
Paper
• 2512.23165
• Published
• 28
VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation
Paper
• 2601.02256
• Published
• 33
Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes
Paper
• 2601.02356
• Published
• 14
GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
Paper
• 2601.05242
• Published
• 228
AT²PO: Agentic Turn-based Policy Optimization via Tree Search
Paper
• 2601.04767
• Published
• 28
Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards
Paper
• 2601.06021
• Published
• 47
ArenaRL: Scaling RL for Open-Ended Agents via Tournament-based Relative Ranking
Paper
• 2601.06487
• Published
• 53
MAXS: Meta-Adaptive Exploration with LLM Agents
Paper
• 2601.09259
• Published
• 95
Behavior Knowledge Merge in Reinforced Agentic Models
Paper
• 2601.13572
• Published
• 25
LongCat-Flash-Thinking-2601 Technical Report
Paper
• 2601.16725
• Published
• 176
Jet-RL: Enabling On-Policy FP8 Reinforcement Learning with Unified Training and Rollout Precision Flow
Paper
• 2601.14243
• Published
• 23
Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation
Paper
• 2601.20614
• Published
• 119
RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System
Paper
• 2602.02488
• Published
• 32
Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text
Paper
• 2601.22975
• Published
• 107
Unified Personalized Reward Model for Vision Generation
Paper
• 2602.02380
• Published
• 20
Rethinking the Trust Region in LLM Reinforcement Learning
Paper
• 2602.04879
• Published
• 35
Self-Hinting Language Models Enhance Reinforcement Learning
Paper
• 2602.03143
• Published
• 29
CoBA-RL: Capability-Oriented Budget Allocation for Reinforcement Learning in LLMs
Paper
• 2602.03048
• Published
• 32
Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR
Paper
• 2602.05261
• Published
• 49
Reinforcement World Model Learning for LLM-based Agents
Paper
• 2602.05842
• Published
• 27
Reinforced Attention Learning
Paper
• 2602.04884
• Published
• 28
Dr. Kernel: Reinforcement Learning Done Right for Triton Kernel Generations
Paper
• 2602.05885
• Published
• 28
Steering LLMs via Scalable Interactive Oversight
Paper
• 2602.04210
• Published
• 18
WorldCompass: Reinforcement Learning for Long-Horizon World Models
Paper
• 2602.09022
• Published
• 20
Alleviating Sparse Rewards by Modeling Step-Wise and Long-Term Sampling Effects in Flow-Based GRPO
Paper
• 2602.06422
• Published
• 44
AgentCPM-Report: Interleaving Drafting and Deepening for Open-Ended Deep Research
Paper
• 2602.06540
• Published
• 21
Dr. MAS: Stable Reinforcement Learning for Multi-Agent LLM Systems
Paper
• 2602.08847
• Published
• 26
Internalizing Meta-Experience into Memory for Guided Reinforcement Learning in Large Language Models
Paper
• 2602.10224
• Published
• 19
Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models
Paper
• 2602.12036
• Published
• 98
Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments
Paper
• 2602.11964
• Published
• 12
Think Longer to Explore Deeper: Learn to Explore In-Context via Length-Incentivized Reinforcement Learning
Paper
• 2602.11748
• Published
• 30
Experiential Reinforcement Learning
Paper
• 2602.13949
• Published
• 68
ResearchGym: Evaluating Language Model Agents on Real-World AI Research
Paper
• 2602.15112
• Published
• 20
PyVision-RL: Forging Open Agentic Vision Models via RL
Paper
• 2602.20739
• Published
• 28
ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning
Paper
• 2602.21534
• Published
• 22