On the OSWorld benchmark test, which evaluates a model's ability to use a computer, humans typically score around 70-75%, and Claude scored just 14.9%. But that's nearly double the score of the ...
Since Anthropic released the “Computer Use” feature for Claude in October, there has been a lot of excitement about what AI agents can do when given the power to imitate human interactions. A new ...