Reinforcement learning

Testerpce 's Collections

Inference

Embedding

Optimization

OCR

Model Arithmetic

Tokenization methods and language modelling

Tabular ML

Tool

Eval

Foundation Models

LLM judge

Test time

Physics and operators

Materials and structures

Vision Language Action models

Vision

World model

Code

Compression

Data

Process Reward Modelling

Memory

SAE

Applications and Uses

Theory and Representation learning

Graph

Information_retrieval

Speech

Attention

Synthetic data

Agent

MoE

RAG

Partial layer training LLMs

Reasoning

Evaluation

Fine tuning

Math

Dataset and Data processing

Style transfer

Video understanding

Reinforcement learning

Long context

Knowledge

updated 3 days ago

Upvote

Diffusion Augmented Agents: A Framework for Efficient Exploration and Transfer Learning

Paper • 2407.20798 • Published Jul 30, 2024 • 24
Offline Reinforcement Learning for LLM Multi-Step Reasoning

Paper • 2412.16145 • Published Dec 20, 2024 • 38
REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models

Paper • 2501.03262 • Published Jan 4, 2025 • 104
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution

Paper • 2502.18449 • Published Feb 25, 2025 • 75
Learning to Reason under Off-Policy Guidance

Paper • 2504.14945 • Published Apr 21, 2025 • 88
LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities

Paper • 2504.16078 • Published Apr 22, 2025 • 21
OTC: Optimal Tool Calls via Reinforcement Learning

Paper • 2504.14870 • Published Apr 21, 2025 • 35
Reinforcement Learning for Reasoning in Large Language Models with One Training Example

Paper • 2504.20571 • Published Apr 29, 2025 • 98
Thinkless: LLM Learns When to Think

Paper • 2505.13379 • Published May 19, 2025 • 50
Visual Agentic Reinforcement Fine-Tuning

Paper • 2505.14246 • Published May 20, 2025 • 32
Reinforcement Learning Finetunes Small Subnetworks in Large Language Models

Paper • 2505.11711 • Published May 16, 2025 • 11
Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models

Paper • 2505.14810 • Published May 20, 2025 • 62
TinyV: Reducing False Negatives in Verification Improves RL for LLM Reasoning

Paper • 2505.14625 • Published May 20, 2025 • 13
ConvSearch-R1: Enhancing Query Reformulation for Conversational Search with Reasoning via Reinforcement Learning

Paper • 2505.15776 • Published May 21, 2025 • 11
Teaching Large Language Models to Maintain Contextual Faithfulness via Synthetic Tasks and Reinforcement Learning

Paper • 2505.16483 • Published May 22, 2025 • 10
One RL to See Them All: Visual Triple Unified Reinforcement Learning

Paper • 2505.18129 • Published May 23, 2025 • 62
Synthetic Data RL: Task Definition Is All You Need

Paper • 2505.17063 • Published May 18, 2025 • 11
ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

Paper • 2505.24864 • Published May 30, 2025 • 144
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

Paper • 2506.01939 • Published Jun 2, 2025 • 188
Reinforcement Pre-Training

Paper • 2506.08007 • Published Jun 9, 2025 • 263
Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs

Paper • 2506.14245 • Published Jun 17, 2025 • 45
RLVER: Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents

Paper • 2507.03112 • Published Jul 3, 2025 • 34
Pre-Trained Policy Discriminators are General Reward Models

Paper • 2507.05197 • Published Jul 7, 2025 • 39
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning

Paper • 2507.05920 • Published Jul 8, 2025 • 12
Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities

Paper • 2507.13158 • Published Jul 17, 2025 • 24
The Invisible Leash: Why RLVR May Not Escape Its Origin

Paper • 2507.14843 • Published Jul 20, 2025 • 85
MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization

Paper • 2507.14683 • Published Jul 19, 2025 • 134
A Simple "Try Again" Can Elicit Multi-Turn LLM Reasoning

Paper • 2507.14295 • Published Jul 18, 2025 • 14
Geometric-Mean Policy Optimization

Paper • 2507.20673 • Published Jul 28, 2025 • 32
Beyond the Trade-off: Self-Supervised Reinforcement Learning for Reasoning Models' Instruction Following

Paper • 2508.02150 • Published Aug 4, 2025 • 37
Reinforcement Learning in Vision: A Survey

Paper • 2508.08189 • Published Aug 11, 2025 • 30
Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning

Paper • 2508.08221 • Published Aug 11, 2025 • 50
Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments

Paper • 2508.08791 • Published Aug 12, 2025 • 16
Pass@k Training for Adaptively Balancing Exploration and Exploitation of Large Reasoning Models

Paper • 2508.10751 • Published Aug 14, 2025 • 29
DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization

Paper • 2508.14460 • Published Aug 20, 2025 • 85
Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning

Paper • 2508.16949 • Published Aug 23, 2025 • 24
TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling

Paper • 2508.17445 • Published Aug 24, 2025 • 80
Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning

Paper • 2508.20751 • Published Aug 28, 2025 • 89
R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning

Paper • 2508.21113 • Published Aug 28, 2025 • 110
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

Paper • 2509.02547 • Published Sep 2, 2025 • 231
Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR

Paper • 2509.02522 • Published Sep 2, 2025 • 26
SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning

Paper • 2509.02479 • Published Sep 2, 2025 • 84
Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM Step-Provers

Paper • 2509.06493 • Published Sep 8, 2025 • 12
Language Self-Play For Data-Free Training

Paper • 2509.07414 • Published Sep 9, 2025 • 31
Bootstrapping Task Spaces for Self-Improvement

Paper • 2509.04575 • Published Sep 4, 2025 • 6
Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models

Paper • 2509.06949 • Published Sep 8, 2025 • 56
Reinforcement Learning Foundations for Deep Research Systems: A Survey

Paper • 2509.06733 • Published Sep 8, 2025 • 32
RewardDance: Reward Scaling in Visual Generation

Paper • 2509.08826 • Published Sep 10, 2025 • 73
A Survey of Reinforcement Learning for Large Reasoning Models

Paper • 2509.08827 • Published Sep 10, 2025 • 190
Single-stream Policy Optimization

Paper • 2509.13232 • Published Sep 16, 2025 • 34
Tree Search for LLM Agent Reinforcement Learning

Paper • 2509.21240 • Published Sep 25, 2025 • 92
EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning

Paper • 2509.22576 • Published Sep 26, 2025 • 135
Variational Reasoning for Language Models

Paper • 2509.22637 • Published Sep 26, 2025 • 69
TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning

Paper • 2509.25760 • Published Sep 30, 2025 • 55
Learn the Ropes, Then Trust the Wins: Self-imitation with Progressive Exploration for Agentic Reinforcement Learning

Paper • 2509.22601 • Published Sep 26, 2025 • 30
Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation

Paper • 2509.25849 • Published Sep 30, 2025 • 48
Attention as a Compass: Efficient Exploration for Process-Supervised RL in Reasoning Models

Paper • 2509.26628 • Published Sep 30, 2025 • 17
DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search

Paper • 2509.25454 • Published Sep 29, 2025 • 146
BroRL: Scaling Reinforcement Learning via Broadened Exploration

Paper • 2510.01180 • Published Oct 1, 2025 • 20
RLP: Reinforcement as a Pretraining Objective

Paper • 2510.01265 • Published Sep 26, 2025 • 44
Reinforce-Ada: An Adaptive Sampling Framework for Reinforce-Style LLM Training

Paper • 2510.04996 • Published Oct 6, 2025 • 16
Training-Free Group Relative Policy Optimization

Paper • 2510.08191 • Published Oct 9, 2025 • 45
Low-probability Tokens Sustain Exploration in Reinforcement Learning with Verifiable Reward

Paper • 2510.03222 • Published Oct 3, 2025 • 75
Agent Learning via Early Experience

Paper • 2510.08558 • Published Oct 9, 2025 • 273
Demystifying Reinforcement Learning in Agentic Reasoning

Paper • 2510.11701 • Published Oct 13, 2025 • 33
Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents

Paper • 2510.14967 • Published Oct 16, 2025 • 34
Agentic Entropy-Balanced Policy Optimization

Paper • 2510.14545 • Published Oct 16, 2025 • 106
LoongRL:Reinforcement Learning for Advanced Reasoning over Long Contexts

Paper • 2510.19363 • Published Oct 22, 2025 • 62
Unified Reinforcement and Imitation Learning for Vision-Language Models

Paper • 2510.19307 • Published Oct 22, 2025 • 32
Every Question Has Its Own Value: Reinforcement Learning with Explicit Human Values

Paper • 2510.20187 • Published Oct 23, 2025 • 19
Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning

Paper • 2510.25992 • Published Oct 29, 2025 • 48
Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning

Paper • 2510.27606 • Published Oct 31, 2025 • 31
DRIVE: Data Curation Best Practices for Reinforcement Learning with Verifiable Reward in Competitive Code Generation

Paper • 2511.06307 • Published Nov 9, 2025 • 53
The Path Not Taken: RLVR Provably Learns Off the Principals

Paper • 2511.08567 • Published Nov 11, 2025 • 34
Soft Adaptive Policy Optimization

Paper • 2511.20347 • Published Nov 25, 2025 • 42
DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research

Paper • 2511.19399 • Published Nov 24, 2025 • 62
SR-GRPO: Stable Rank as an Intrinsic Geometric Reward for Large Language Model Alignment

Paper • 2512.02807 • Published Dec 2, 2025 • 9
PretrainZero: Reinforcement Active Pretraining

Paper • 2512.03442 • Published Dec 3, 2025 • 48
On GRPO Collapse in Search-R1: The Lazy Likelihood-Displacement Death Spiral

Paper • 2512.04220 • Published Dec 3, 2025 • 16
BEAVER: An Efficient Deterministic LLM Verifier

Paper • 2512.05439 • Published Dec 5, 2025 • 36
Turn-PPO: Turn-Level Advantage Estimation with PPO for Improved Multi-Turn RL in Agentic LLMs

Paper • 2512.17008 • Published Dec 18, 2025 • 11
Reinforcement Learning for Self-Improving Agent with Skill Library

Paper • 2512.17102 • Published Dec 18, 2025 • 36
Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies

Paper • 2512.19673 • Published Dec 22, 2025 • 64
Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning

Paper • 2512.20605 • Published Dec 23, 2025 • 62
Rethinking Sample Polarity in Reinforcement Learning with Verifiable Rewards

Paper • 2512.21625 • Published Dec 25, 2025 • 4
Evaluating Parameter Efficient Methods for RLVR

Paper • 2512.23165 • Published Dec 29, 2025 • 28
VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation

Paper • 2601.02256 • Published Jan 5 • 33
Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes

Paper • 2601.02356 • Published Jan 5 • 14
GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

Paper • 2601.05242 • Published Jan 8 • 228
AT^2PO: Agentic Turn-based Policy Optimization via Tree Search

Paper • 2601.04767 • Published Jan 8 • 28
Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards

Paper • 2601.06021 • Published Jan 9 • 47
ArenaRL: Scaling RL for Open-Ended Agents via Tournament-based Relative Ranking

Paper • 2601.06487 • Published Jan 10 • 53
MAXS: Meta-Adaptive Exploration with LLM Agents

Paper • 2601.09259 • Published Jan 14 • 95
Behavior Knowledge Merge in Reinforced Agentic Models

Paper • 2601.13572 • Published Jan 20 • 25
LongCat-Flash-Thinking-2601 Technical Report

Paper • 2601.16725 • Published Jan 23 • 176
Jet-RL: Enabling On-Policy FP8 Reinforcement Learning with Unified Training and Rollout Precision Flow

Paper • 2601.14243 • Published Jan 20 • 23
Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation

Paper • 2601.20614 • Published Jan 28 • 119
RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System

Paper • 2602.02488 • Published 26 days ago • 32
Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text

Paper • 2601.22975 • Published 30 days ago • 107
Unified Personalized Reward Model for Vision Generation

Paper • 2602.02380 • Published 26 days ago • 20
Rethinking the Trust Region in LLM Reinforcement Learning

Paper • 2602.04879 • Published 24 days ago • 35
Self-Hinting Language Models Enhance Reinforcement Learning

Paper • 2602.03143 • Published 26 days ago • 29
CoBA-RL: Capability-Oriented Budget Allocation for Reinforcement Learning in LLMs

Paper • 2602.03048 • Published 26 days ago • 32
Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR

Paper • 2602.05261 • Published 24 days ago • 49
Reinforcement World Model Learning for LLM-based Agents

Paper • 2602.05842 • Published 23 days ago • 27
Reinforced Attention Learning

Paper • 2602.04884 • Published 24 days ago • 28
Dr. Kernel: Reinforcement Learning Done Right for Triton Kernel Generations

Paper • 2602.05885 • Published 23 days ago • 28
Steering LLMs via Scalable Interactive Oversight

Paper • 2602.04210 • Published 25 days ago • 18
WorldCompass: Reinforcement Learning for Long-Horizon World Models

Paper • 2602.09022 • Published 19 days ago • 20
Alleviating Sparse Rewards by Modeling Step-Wise and Long-Term Sampling Effects in Flow-Based GRPO

Paper • 2602.06422 • Published 23 days ago • 44
AgentCPM-Report: Interleaving Drafting and Deepening for Open-Ended Deep Research

Paper • 2602.06540 • Published 23 days ago • 21
Dr. MAS: Stable Reinforcement Learning for Multi-Agent LLM Systems

Paper • 2602.08847 • Published 19 days ago • 26
Internalizing Meta-Experience into Memory for Guided Reinforcement Learning in Large Language Models

Paper • 2602.10224 • Published 18 days ago • 19
Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models

Paper • 2602.12036 • Published 17 days ago • 98
Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments

Paper • 2602.11964 • Published 17 days ago • 12
Think Longer to Explore Deeper: Learn to Explore In-Context via Length-Incentivized Reinforcement Learning

Paper • 2602.11748 • Published 17 days ago • 30
Experiential Reinforcement Learning

Paper • 2602.13949 • Published 14 days ago • 68
ResearchGym: Evaluating Language Model Agents on Real-World AI Research

Paper • 2602.15112 • Published 12 days ago • 20
PyVision-RL: Forging Open Agentic Vision Models via RL

Paper • 2602.20739 • Published 5 days ago • 28
ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning

Paper • 2602.21534 • Published 4 days ago • 22

Upvote

Collection guide
Browse collections