19 hours ago
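SWE-AGI benchmark: success rate for batch generation of medium- and large-scale software in a brand-new language reaches 80%
In this demanding "system building" scenario, model performance was sharply polarized. GPT-5.3-codex took first place with an 86.4% pass rate (19/22), followed by Claude Opus 4.6 at 68.2% (15/22). By contrast, the other models evaluated (including open-source models and some closed-source models) performed acceptably on easy tasks, but once tasks reached medium or high difficulty their success rates fell to single digits or even zero.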