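When a single model hits a bottleneck, could getting multiple models to work together be the next breakthrough? Japanese AI company Sakana AI has offered its answer.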
| Benchmark | Fugu Ultra | GPT-5.5 | Gemini 3.1 Pro | Opus 4.8 |
| --- | --- | --- | --- | --- |
| SWE Bench Pro | 73.7 | 58.6 | 54.2 | 69.2 |
| LiveCodeBench | 93.2 | 85.3 | 88.5 | 87.8 |
| Humanity's Last Exam | 50.0 | 41.4 | 44.4 | 49.8 |
| GPQA-D | 95.5 | 93.6 | 94.3 | 92.0 |
| MRCRv2 | 93.6 | 94.8 | 84.9 | 87.9 |