venturebeat
How Google’s 'internal RL' could unlock long-horizon AI agents

Researchers at Google have developed a technique that makes it easier for AI models to learn complex reasoning tasks that usually cause LLMs to hallucinate or fall apart. Instead of training LLMs through next-token prediction, their technique, called internal reinforcement learning (internal RL), steers the model's internal activations toward developing a high-level, step-by-step solution for the input problem. Ultimately, this could provide a scalable path for creating autonomous agents that can handle complex reasoning and real-world robotics without constant manual guidance.

The limits of next-token prediction

Reinforcement learning plays a key role in post-training LLMs, particularly for complex reasoning tasks that require long-horizon planning. The problem lies in the architecture of these models: LLMs are autoregressive, meaning they generate sequences one token at a time. When these models explore new strategies during training, they do so by making small, random changes to the next single token or action. This exposes a deeper limitation: next-token prediction forces models to search for solutions at the wrong level of abstraction, making long-horizon reasoning inefficient even when the model "knows" what to do.

This token-by-token approach works well for basic language modeling but breaks down in long-horizon tasks where rewards are sparse. If the model relies solely on random token-level sampling, the probability of stumbling upon the correct multi-step solution is infinitesimally small, "on the order of one in a million," according to the researchers.
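To see where a figure like that can come from, consider a rough back-of-the-envelope calculation. The sketch below is illustrative only; the per-step probabilities and step counts are assumptions, not numbers from the paper.

    # Back-of-the-envelope sketch: why random token-level exploration fails
    # under sparse rewards. All numbers here are illustrative assumptions.
    per_step_success = 0.5   # assumed chance a random tweak to one step is useful
    num_steps = 20           # length of a hypothetical multi-step task

    p_token = per_step_success ** num_steps
    print(f"token-level exploration: ~1 in {1 / p_token:,.0f}")  # ~1 in 1,048,576

    # Exploring over a few abstract subgoals instead collapses the search space.
    num_subgoals = 4         # assumed number of high-level stages in the task
    p_goal = per_step_success ** num_subgoals
    print(f"goal-level exploration:  ~1 in {1 / p_goal:,.0f}")   # ~1 in 16

Under these toy assumptions, token-level exploration needs on the order of a million rollouts to stumble on a single success, while exploration over a handful of abstract subgoals needs only a few. That gap is the intuition behind the hierarchical approaches discussed below.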
The issue isn't just that the models get confused; it's that they get confused at the wrong level. In comments provided to VentureBeat, Yanick Schimpf, a co-author of the paper, notes that in a 20-step task, an agent can get lost in the minute details of a single step, or it can lose track of the overall goal.

"We argue that when facing a problem with some abstract structure... [goal-oriented exploration] is what you want," Schimpf said. By solving the problem at the abstract level first, the agent commits to a path, ensuring it doesn't "get lost in one of the reasoning steps" and fail to complete the broader workflow.

To address this, the field has long looked toward hierarchical reinforcement learning (HRL), which attempts to solve complex problems by decomposing them into a hierarchy of temporally abstract actions (high-level subroutines that represent different stages of the solution) rather than managing a task as a string of tokens. However, discovering the appropriate subroutines remains a longstanding challenge: current HRL methods often fail to discover proper policies, frequently "converging to degenerate options" that do not represent meaningful behaviors. Even sophisticated modern methods like GRPO (a popular RL algorithm used for sparse-reward tasks) fail in complex environments because they cannot effectively bridge the gap between low-level execution and high-level planning.

Steering the LLM's internal thoughts

To overcome these limitations, the Google team proposed internal RL. Advanced autoregressive models already "know" how to perform complex, multi-step tasks internally, even if they aren't explicitly trained to do so. Because these complex behaviors are hidden inside the model's residual stream (i.e., the numerical values that carry information through the network's layers), the researchers introduced an "internal neural network controller," or metacontroller.

Instead of monitoring and changing the output tokens, the metacontroller controls the model's behavior by applying changes to the model's internal activations in the middle layers. This nudge steers the model into a specific, useful state. The base model then automatically generates the sequence of individual steps needed to achieve that goal, because it has already seen those patterns during its initial pretraining.

The metacontroller operates through unsupervised learning and does not require human-labeled training examples. Instead, the researchers use a self-supervised framework in which the model analyzes a full sequence of behavior and works backward to infer the hidden, high-level intent that best explains the actions.
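The article does not include the paper's architecture, so the following is only a minimal sketch of the general mechanism as described: a small metacontroller network maps a latent high-level action to a steering vector, which is added to a middle layer's activations through a PyTorch forward hook. The module names, shapes, and choice of layer are all assumptions.

    import torch
    import torch.nn as nn

    class Metacontroller(nn.Module):
        """Maps a latent high-level action z to a steering vector for the
        residual stream. A hypothetical architecture, not the paper's."""
        def __init__(self, latent_dim: int, hidden_dim: int):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(latent_dim, hidden_dim),
                nn.Tanh(),
                nn.Linear(hidden_dim, hidden_dim),
            )

        def forward(self, z: torch.Tensor) -> torch.Tensor:
            return self.proj(z)  # steering vector, shape (batch, hidden_dim)

    def attach_steering(middle_layer: nn.Module, controller: Metacontroller, z: torch.Tensor):
        """Register a forward hook that nudges the layer's output activations.
        Assumes the layer outputs a (batch, seq, hidden) tensor, or a tuple
        whose first element is one, as in typical transformer blocks."""
        steer = controller(z).unsqueeze(1)  # broadcast across sequence positions

        def hook(module, inputs, output):
            if isinstance(output, tuple):
                return (output[0] + steer,) + output[1:]
            return output + steer

        return middle_layer.register_forward_hook(hook)

With such a hook attached, generation proceeds token by token as usual, but from a nudged internal state: the low-level execution stays on the base model's pretrained distribution, while the high-level intent is set by z.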
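The self-supervised "work backward from behavior" step could plausibly be framed as an autoencoding objective: encode a full trajectory into a latent intent z, steer the frozen base model with z, and score how well the steered model explains the observed actions. This is an assumed formulation for illustration, not the paper's published loss; the encoder design and the base model's calling convention are hypothetical.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class IntentEncoder(nn.Module):
        """Reads a whole trajectory of hidden states and infers the latent
        high-level intent z that best explains it (hypothetical design)."""
        def __init__(self, hidden_dim: int, latent_dim: int):
            super().__init__()
            self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
            self.to_z = nn.Linear(hidden_dim, latent_dim)

        def forward(self, traj_states: torch.Tensor) -> torch.Tensor:
            _, h_last = self.rnn(traj_states)    # summarize the full sequence
            return self.to_z(h_last.squeeze(0))  # latent intent z

    def reconstruction_loss(base_model, encoder, controller, traj_states, traj_tokens):
        """Infer z from the trajectory, steer the frozen base model with it,
        and measure how well the steered model predicts the observed tokens."""
        z = encoder(traj_states)
        handle = attach_steering(base_model.middle_layer, controller, z)  # from the sketch above
        logits = base_model(traj_tokens[:, :-1])  # assumed next-token API
        handle.remove()
        return F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            traj_tokens[:, 1:].reshape(-1),
        )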
During the internal RL phase, the training updates are applied to the metacontroller, which shifts training from next-token prediction to learning high-level actions that can lead to the solution.

To understand the practical value of this, consider an enterprise agent tasked with code generation. Today, there is a difficult trade-off: you need "low temperature" (predictability) to get the syntax right, but "high temperature" (creativity) to solve the logic puzzle.

"Internal RL might facilitate this by allowing the model to explore the space of abstract actions, i.e. structuring logic and method calls, while delegating the token-level realization of those actions to the robust, lower-temperature distribution of the base model," Schimpf said. The agent explores the solution space without breaking the syntax.

The researchers investigated two methods for applying this controller. In the first, the base autoregressive model is pretrained on a behavioral dataset and then frozen, while the metacontroller is trained to steer the frozen model's residual stream. In the second, the metacontroller and the base model are jointly optimized, with the parameters of both networks updated simultaneously.

Internal RL in action

To evaluate the effectiveness of internal RL, the researchers ran experiments across hierarchical environments designed to stump traditional learners. These included a discrete grid world and a continuous control task in which a quadrupedal "ant" robot must coordinate joint movements. Both environments used sparse rewards with very long action sequences.

While baselines like GRPO and CompILE failed to learn the tasks within a million episodes due to the difficulty of credit assignment over long horizons, internal RL achieved high success rates with a small number of training episodes. By choosing high-level goals rather than tiny steps, the metacontroller drastically reduced the search space. This allowed the model to identify which high-level decisions led to success, making credit assignment efficient enough to solve the sparse-reward problem.

Notably, the researchers found that the "frozen" approach was superior. When the base model and metacontroller were co-trained from scratch, the system failed to develop meaningful abstractions. Applied to a frozen model, however, the metacontroller successfully discovered key checkpoints without any human labels, perfectly aligning its internal switching mechanism with the ground-truth moments when an agent finished one subgoal and started the next.

As the industry fixates on reasoning models that output verbose "chains of thought" to solve problems, Google's research points toward a different, perhaps more efficient future.

"Our study joins a growing body of work suggesting that 'internal reasoning' is not only feasible but potentially more efficient than token-based approaches," Schimpf said. "Moreover, these silent 'thoughts' can be decoupled from specific input modalities — a property that could be particularly relevant for the future of multi-modal AI."

If internal reasoning can be guided without being externalized, the future of AI agents may hinge less on prompting strategies and more on how well we can access and steer what models already represent internally. For enterprises betting on autonomous systems that must plan, adapt, and act over long horizons, that shift could matter more than any new reasoning benchmark.
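Putting the pieces together, the frozen variant described above amounts to running RL over a handful of latent high-level actions while the base model's weights stay fixed. The sketch below shows one way that loop could look; the Gaussian policy, the REINFORCE update, and the environment API (rollout_subgoal, max_subgoals, policy_head) are all hypothetical, and the optimizer is assumed to hold only the controller's parameters.

    import torch

    def internal_rl_step(base_model, controller, optimizer, env):
        """One training episode of the frozen variant: RL updates flow only
        into the metacontroller, never into the base model."""
        base_model.requires_grad_(False)  # freeze the base model

        obs = env.reset()
        log_probs, total_reward = [], 0.0

        for _ in range(env.max_subgoals):  # one latent action per high-level stage
            # Sample a high-level action z from a simple Gaussian policy
            # (policy_head is a hypothetical module on the controller).
            mean = controller.policy_head(obs)
            dist = torch.distributions.Normal(mean, 1.0)
            z = dist.sample()
            log_probs.append(dist.log_prob(z).sum())

            # The frozen base model executes the low-level steps for this z,
            # e.g., via the steering hook sketched earlier.
            obs, reward, done = env.rollout_subgoal(base_model, controller, z)
            total_reward += reward
            if done:
                break

        # REINFORCE over a handful of high-level choices: credit assignment
        # spans a few decisions instead of thousands of individual tokens.
        loss = -torch.stack(log_probs).sum() * total_reward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return total_reward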

We have found tools and articles similar to this one. Check out the suggestions below.

Sony’s latest Horizon spin-off is an MMORPG for PC and mobile, but not PS5

[Embedded video: https://www.youtube.com/embed/bN-NNgWIyIQ]

Match Score: 93.24

venturebeat
Upwork study shows AI agents excel with human partners but fail independently

Artificial intelligence agents powered by the world's most advanced language models routinely fail to complete even straightforward professional tasks on their own, according to [...]

Match Score: 84.98

Sony sues Tencent over its Horizon Zero Dawn clone

Sony is suing Tencent for copying nearly every aspect of its Horizon games for the upcoming Light of Motiram, an open-world hunting game with some obvious similarities to [...]

Match Score: 75.09

venturebeat
Amazon's new AI can code for days without human help. What does that m

Amazon Web Services (https://aws.amazon.com/) on Tuesday announced a new class of artificial intelligence systems called [...]

Match Score: 74.45

venturebeat
The Google Search of AI agents? Fetch launches ASI:One and Business tier fo

Fetch AI (https://fetch.ai/), a startup founded and led by former DeepMind founding investor, Humayun Sheikh, [...]

Match Score: 74.18

Forza Horizon 5 is on the PS5, so I no longer need an Xbox

Forza Horizon 5 is the entire reason I have an Xbox Series S. I’m not really a car guy in real life — if money, practicality and burning through fossil fuels were les [...]

Match Score: 70.27

venturebeat
EAGLET boosts AI agent performance on longer-horizon tasks by generating cu

2025 was supposed to be [...]

Match Score: 67.28

Forza Horizon 6 will hit Xbox Series X/S and PC on May 19

[Embedded video: https://www.youtube.com/embed/pWw-UENvdTw]

Match Score: 64.36

venturebeat
Microsoft remakes Windows for an era of autonomous AI agents

Microsoft (https://www.microsoft.com/en-us/) is fundamentally restructuring its Windows operating system to become what executives call th [...]

Match Score: 63.26