venturebeat
Google's 'Watch & Learn' framework cracks the data bottleneck for training computer-use agents

A new framework developed by researchers at Google Cloud and DeepMind aims to address one of the key challenges of developing computer-use agents (CUAs): gathering high-quality training examples at scale.

The framework, dubbed Watch & Learn (W&L), tackles the problem of training-data generation in a way that doesn’t require human annotation and can automatically extract demonstrations from raw videos.

Their experiments show that data generated with W&L can be used to train or fine-tune existing computer-use and foundation models and improve their performance on computer-use tasks. Equally important, the same approach can be used to create in-context learning (ICL) examples for computer-use agents, enabling companies to build CUAs for bespoke internal tasks without the costly training of specialized models.

The data bottleneck of CUAs

The web is rich with video tutorials and screencasts that describe complex workflows for using applications. These videos are a gold mine that can provide computer-use agents with domain knowledge and instructions for accomplishing different tasks through user-interface interactions.

However, before they can be used to train CUAs, these videos need to be transformed into annotated trajectories (that is, sets of task descriptions, screenshots and actions), a process that is prohibitively expensive and time-consuming when done manually.

Existing approaches to this data bottleneck rely on annotating the videos with multimodal language models, which usually results in low-precision and faulty examples. A different approach uses self-play agents that autonomously explore user interfaces to collect trajectories, but these techniques usually create simple examples that are not useful in unpredictable real-world situations.

As the researchers note in their paper, “Overall, these approaches either rely on brittle heuristics, are costly as they rely on explorations in real environments or generate low-complexity demonstrations misaligned with human intent.”

Watch & Learn

The Watch & Learn framework tries to address the challenges of creating CUA demonstrations by rethinking the problem formulation. Instead of directly generating trajectories or depending on complex multi-stage pipelines, the researchers frame the problem as an “inverse dynamics objective”: given two consecutive observations, predict the intermediate action that produced the transition.

According to the researchers, this formulation is “easier to learn, avoids hand-crafted heuristics and generalizes robustly across applications.”

The W&L framework can be broken down into three key stages: training an inverse dynamics model (IDM), retrieving raw videos, and training CUA agents.

In the first phase, the researchers used agents that interact with live web pages to create a large corpus of 500,000 state transitions (two consecutive observations and the action that produced the transition). They then used this data, along with 132,000 human-annotated transitions from existing open datasets, to train an inverse dynamics model that takes in two consecutive observations and predicts the transition action. The trained IDM, a small transformer model, outperformed off-the-shelf foundation models at predicting transition actions.

The researchers then designed a pipeline that retrieves videos from platforms such as YouTube and runs them through the IDM to generate high-quality trajectories.
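To make the inverse-dynamics formulation concrete, here is a minimal sketch of how a trained IDM could turn the frames of a tutorial video into an annotated trajectory. The paper does not publish code or an API; the Frame alias, the Action and Step structures, and the InverseDynamicsModel.predict_action interface below are hypothetical stand-ins for illustration only.

```python
from dataclasses import dataclass
from typing import Any, List

Frame = Any  # a screenshot / decoded video frame; the exact representation is an assumption

@dataclass
class Action:
    kind: str        # e.g. "click", "scroll", "type"
    target: str      # UI element or coordinates the action was applied to
    value: str = ""  # typed text or scroll delta, if any

@dataclass
class Step:
    observation: Frame  # screenshot before the action
    action: Action      # action inferred by the IDM

class InverseDynamicsModel:
    """Inverse dynamics objective: given two consecutive observations,
    predict the intermediate action that produced the transition."""

    def predict_action(self, frame_before: Frame, frame_after: Frame) -> Action:
        ...  # stub for the forward pass of the small transformer described in the paper

def label_video(frames: List[Frame], idm: InverseDynamicsModel) -> List[Step]:
    """Turn raw tutorial-video frames into an IDM-annotated trajectory."""
    trajectory = []
    for before, after in zip(frames, frames[1:]):
        action = idm.predict_action(before, after)  # label each consecutive frame pair
        trajectory.append(Step(observation=before, action=action))
    return trajectory
```

Training the IDM itself would then reduce to a supervised objective over the 500,000 agent-collected and 132,000 human-annotated transitions: predict the action from each (frame_before, frame_after) pair.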
The IDM takes in consecutive video frames and determines the actions (scroll, click, and so on) that caused the changes in the environment, which are then packaged into annotated trajectories. Using this method, the researchers generated 53,125 trajectories with high-accuracy action labels.

These examples can be used to train effective computer-use models for specific tasks. But the researchers also found that trajectories extracted through the IDM can serve as in-context learning examples that improve the performance of CUAs on bespoke tasks at inference time. For ICL, they use Gemini 2.5 Flash to add reasoning annotations to the observation/action examples in the trajectories, which are then inserted into the CUA agent’s prompt (usually 3-5 examples) during inference.

“This dual role (training and in-context guidance) enables flexible integration with both open-source models and general-purpose agents,” the researchers write.

W&L in action

To test the usefulness of W&L, the researchers ran a series of experiments with closed and open-source models on the OSWorld benchmark, which evaluates agents in real desktop and operating-system environments across different tasks, including productivity, programming and design.

For fine-tuning, they used their corpus of 53,000 trajectories to train two open-source models: UI-TARS-1.5, a strong open-source vision-language-action model designed specifically for computer use, and Qwen 2.5-VL, an open-weight multimodal LLM. For in-context learning tests, they applied W&L examples to general-purpose multimodal models such as Gemini 2.5 Flash, OpenAI o3 and Claude Sonnet 4. W&L delivered improvements on OSWorld across all model categories: up to 3 points for ICL on general-purpose models and up to 11 points for fine-tuned open-source models.

More importantly, these benefits were achieved without any manual annotation, “demonstrating that web-scale human workflows can serve as a practical and scalable foundation for advancing CUAs towards real-world deployment,” the researchers write.

This could have important implications for real-world applications, enabling enterprises to turn their existing corpora of videos and conference recordings into training data for CUAs. It also makes it easier to generate new training trajectories: all you need to do is record videos of different tasks being performed and have them annotated by an IDM. And with frontier models constantly improving and becoming cheaper, you can expect to get more from your existing data as the field continues to progress.
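As a rough sketch of the in-context route described above, the IDM-labeled trajectories could be turned into few-shot demonstrations in the agent’s prompt. The prompt layout, the dict-based step representation, the annotate_reasoning helper standing in for the Gemini 2.5 Flash annotation step, and the 3-example default are all assumptions for illustration, not the paper’s published format.

```python
from typing import Dict, List

# Each demonstration step is reduced to a small dict for illustration; in practice it
# would carry the screenshot plus the IDM-predicted action, target and reasoning note.
Step = Dict[str, str]

def annotate_reasoning(step: Step) -> str:
    """Stand-in for the Gemini 2.5 Flash call that adds a short reasoning
    note to each observation/action pair (the real note is model-generated)."""
    return f"Reasoning: the {step['action']} on {step['target']} advances the task."

def build_icl_prompt(task: str, demos: List[List[Step]], max_demos: int = 3) -> str:
    """Assemble a CUA prompt with a handful of IDM-extracted demonstrations."""
    parts = [f"Task: {task}", "Here are examples of similar workflows:"]
    for i, trajectory in enumerate(demos[:max_demos], start=1):
        parts.append(f"Example {i}:")
        for step in trajectory:
            parts.append("Observation: <screenshot omitted>")
            parts.append(f"Action: {step['action']} -> {step['target']}")
            parts.append(annotate_reasoning(step))
    parts.append("Now complete the task step by step.")
    return "\n".join(parts)
```

For example, build_icl_prompt("Export a report to PDF", demos=[[{"action": "click", "target": "File menu"}]]) would produce a prompt with one worked demonstration ahead of the new task. The 3-5 example budget mirrors what the researchers report using at inference time; how demonstrations are selected for a given task is left out of the sketch.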
