venturebeat
Alibaba's model never trained as an agent — and improved agent performance across seven benchmarks

Alibaba's Qwen team released Qwen-AgentWorld on Tuesday, a pair of models trained not to act inside agent environments, but to predict what those environments return. The release covers seven domains under a single architecture: MCP, Search, Terminal, Software Engineering, Android, Web, and OS. The release extends Alibaba's recent push into autonomous agents. Qwen3.7-Max, released in May, was built around a 35-hour autonomous execution capability.

The shift to predicting environments targets a ceiling that teams training agents at scale run into directly. Real search engines surface whatever results exist, with no mechanism to inject controlled conditions. Live terminals do not allow injecting a low-disk-space condition on demand. Agent training is bounded by what production environments will surface, with no systematic way to expose the edge cases agents will need to handle but rarely encounter in training.

The research team trained agents inside the resulting simulator and found performance gains that exceeded what training against real environments alone produced. In a separate test, using world model training as a warm-up before agentic fine-tuning improved performance across seven benchmarks, including three the model had never seen during training.

The paper accompanying the release identified a gap in prior agent research. "We argue that world modeling is a crucial missing piece in the path to general agents."

Qwen-AgentWorld trains on what environments return, not what agents should do

Most agent models are trained to answer one question: given what the environment just showed me, what should I do next? Qwen-AgentWorld is trained to answer the inverse: given what the agent just did, what will the environment show next?

That reversal is the core of what the paper calls a language world model: instead of optimizing for action selection, the model learns to predict the next environment state across all seven domains under a single training objective.
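
The reversal can be made concrete with a small interface sketch. The Python below is illustrative only and is not Qwen-AgentWorld's API: `Step`, `AgentPolicy`, `WorldModel`, `ToyTerminalWorld`, and `rollout` are hypothetical names, and the hand-written terminal rules stand in for what a trained model would predict. It shows the two questions as two function signatures, with the world model mapping a history plus an action to the next observation.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Step:
    """One turn of an interaction (hypothetical schema)."""
    action: str        # what the agent did, e.g. a shell command
    observation: str   # what the environment returned


class AgentPolicy(Protocol):
    """Usual agent objective: given the latest observation, pick an action."""
    def next_action(self, history: list[Step], observation: str) -> str: ...


class WorldModel(Protocol):
    """Inverse objective: given the agent's action, predict the observation."""
    def next_observation(self, history: list[Step], action: str) -> str: ...


class ToyTerminalWorld:
    """Hand-written stand-in for a learned world model of a terminal."""

    def next_observation(self, history: list[Step], action: str) -> str:
        # Rebuild the file set from earlier "touch <name>" actions.
        files = set()
        for step in history:
            parts = step.action.split()
            if len(parts) == 2 and parts[0] == "touch":
                files.add(parts[1])
        parts = action.split()
        if parts == ["ls"]:
            return "\n".join(sorted(files))
        if len(parts) == 2 and parts[0] == "touch":
            return ""  # touch prints nothing on success
        return f"bash: {parts[0]}: command not found" if parts else ""


def rollout(world: WorldModel, actions: list[str]) -> list[Step]:
    """Feed a fixed action sequence to the world model and record each turn."""
    history: list[Step] = []
    for action in actions:
        observation = world.next_observation(history, action)
        history.append(Step(action, observation))
    return history


if __name__ == "__main__":
    for step in rollout(ToyTerminalWorld(), ["touch b.txt", "touch a.txt", "ls"]):
        print(repr(step.action), "->", repr(step.observation))
```

In this sketch the prediction depends only on the recorded history, so the same trajectory always yields the same next observation. A learned model would replace the hard-coded rules with a prediction conditioned on the same inputs.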
Prior work was narrower: WebWorld, an earlier Qwen project from February, covered web environments only; Snowflake's Agent World Model, published the same month, generates code-driven SQL-backed environments rather than training a model to predict states. Qwen-AgentWorld is the first to span seven domains in a single model, with environment modeling baked in from the earliest pretraining stage.

Alibaba trained both models in three stages on more than 10 million environment interaction trajectories from real agent runs. Stage one teaches the model how environments behave: file systems, terminal states, browser DOM changes, API responses. Stage two trains the model to reason through what comes next before predicting it. Stage three, reinforcement learning, tightens predictions using rule-based checks and open-ended quality scoring.

Both models are Mixture-of-Experts designs, meaning only a fraction of parameters are active per token. The 35B model activates 3B; the 397B activates 17B. Both support 256K context windows. For GUI domains (Android, Web, and OS), the models work from textual accessibility trees and UI view hierarchies rather than screenshots.

The 35B model weights and AgentWorldBench are available under Apache 2.0; the 397B weights are not publicly released.

The training results matter more than the benchmarks

The benchmark scores show how accurately the models predict what environments return. The training results show what that prediction capability is actually worth for teams building agents, and those are the numbers that matter more.

According to the researchers, agents trained inside controlled simulation outperformed agents trained in real environments. Injecting targeted perturbations (partial responses that force extra agent steps, and edge cases real environments rarely surface) pushed MCPMark from 24.6 to 33.8.
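
The article does not describe how the perturbations are implemented, so the following is only a toy sketch of one idea it names: partial responses that force extra agent steps. Everything here is an assumption made for illustration, including the `full_search` tool, the `PartialResponses` wrapper, its `page_size` parameter, and the cursor format. The wrapper truncates each tool response and returns a cursor, so an agent must issue follow-up calls to recover the full result.

```python
from typing import Callable, Optional

Tool = Callable[[str], list[str]]


def full_search(query: str) -> list[str]:
    """Stand-in for an unperturbed tool: returns every result at once."""
    return [f"{query}-result-{i}" for i in range(5)]


class PartialResponses:
    """Wrap a tool so each call returns at most page_size items and a cursor."""

    def __init__(self, tool: Tool, page_size: int) -> None:
        if page_size < 1:
            raise ValueError("page_size must be at least 1")
        self.tool = tool
        self.page_size = page_size

    def call(self, query: str, cursor: int = 0) -> dict:
        items = self.tool(query)
        end = cursor + self.page_size
        # next_cursor is None once the final page has been served.
        next_cursor: Optional[int] = end if end < len(items) else None
        return {"items": items[cursor:end], "next_cursor": next_cursor}


def collect_all(env: PartialResponses, query: str) -> tuple[list[str], int]:
    """Toy agent loop: keep calling until no cursor remains; count the calls."""
    results: list[str] = []
    calls = 0
    cursor: Optional[int] = 0
    while cursor is not None:
        response = env.call(query, cursor)
        results.extend(response["items"])
        calls += 1
        cursor = response["next_cursor"]
    return results, calls


if __name__ == "__main__":
    for size in (5, 2):
        results, calls = collect_all(PartialResponses(full_search, size), "q")
        print(f"page_size={size}: {len(results)} results in {calls} call(s)")
```

The control knob is the point of the sketch: the same task becomes a one-step or multi-step episode depending on a parameter set by whoever builds the training environment, which is something a live production service does not offer.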
On Search, agents trained in entirely fictional worlds transferred to real search tasks, pushing WideSearch F1 Item from 34.02 to 50.31 on the open 35B model. A separate warm-up test showed that world model pretraining improved BFCL v4 from 62.29 to 71.25 and Claw-Eval from 53.60 to 64.88 with no agent-specific fine-tuning.

Researchers flag the benchmark and the overfitting risk

The paper drew immediate reaction from AI researchers on X. The concerns they raised map to what practitioners need to verify before acting on the findings.

On the training objective and transfer result, the assessment from one AI/ML researcher was direct. "Every other 'agent' model has been trained to act in environments," wrote @drawais_ai, who has a PhD background and regularly breaks down AI papers. "Qwen flipped the question. They trained the model to predict the environment itself... That predictive knowledge then transfers to agent tasks even without any agent-specific fine-tuning." He identified the Controllable Sim RL result as "the receipt" for the claim that synthetic training can substitute for real-environment RL at scale, and flagged that three of the seven transfer benchmarks were entirely out of domain.

The benchmark margin drew immediate scrutiny. "AgentWorldBench is a benchmark Alibaba built and published in the same paper," wrote @TheSignal_Desk, who focuses on honest takes and key numbers in AI research. "They wrote the test, then topped it by 0.46."

The sim-RL methodology is the result @limalemonnn, who builds production AI agents, identified as most in need of scrutiny before the headline claim gets quoted. "Sim-trained agents traditionally overfit to the simulator's quirks," they wrote. "If the world model is too clean, the agent learns the model, not the task." They pointed to the paper's holdout split as the section practitioners should read before acting on the numbers.

The overfitting concern has a partial answer in the data.
The gap between uncontrolled Sim RL (MCPMark 24.6) and controlled Sim RL (MCPMark 33.8) suggests the gains depend substantially on the controllability mechanism, not simulation accuracy alone. The fictional-world Search result, where agents trained on invented environments transfer to real search tasks, is the paper's strongest evidence against the overfitting concern.

What this means for teams building agentic pipelines

For AI engineering teams building and scaling agentic pipelines, this work signals a meaningful shift in how agent capability gets built. Teams training agents at scale now have a third option between real-environment RL and static benchmarks: controlled simulation that injects the edge cases production won't surface.

Synthetic environments are a legitimate training layer. Controlled simulation that injects conditions real environments won't produce is a complement to real-environment RL, not a shortcut around it.

What a model learns before agent training starts matters more than most pipelines account for. The warm-up finding, performance gains across unseen benchmarks with no agent-specific training, suggests environment grounding belongs earlier in development than current practice.
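
For readers who want the reported score changes on a common footing, the short Python sketch below computes the absolute and relative gain for each before/after pair quoted in this article. The figures are copied from the text; the `gains` helper is a hypothetical name. The comparisons come from different experimental setups, so the relative gains are not directly comparable with each other.

```python
# Scores as reported in the article: (benchmark, before, after).
REPORTED = [
    ("MCPMark", 24.6, 33.8),
    ("WideSearch F1 Item", 34.02, 50.31),
    ("BFCL v4", 62.29, 71.25),
    ("Claw-Eval", 53.60, 64.88),
]


def gains(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = after - before
    return absolute, 100.0 * absolute / before


if __name__ == "__main__":
    for name, before, after in REPORTED:
        absolute, relative = gains(before, after)
        print(f"{name}: +{absolute:.2f} points ({relative:.1f}% relative)")
```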
