Companies conduct “evaluations” of AI models using teams of staff and outside researchers. These are standardised tests, known as benchmarks, that assess models’ abilities and the performance of ...
FrontierMath, a new benchmark from Epoch AI, challenges advanced AI systems with complex math problems, revealing how far AI still has to go before achieving true human-level reasoning.
An abstract is a summary of a piece of academic writing. The abstract appears in multiple locations, including at the start of a publication, in conference proceedings, and in electronic databases.
A team of Apple researchers has released a paper scrutinising the mathematical reasoning capabilities of large language models (LLMs), suggesting that while these models can exhibit abstract ...
Can artificial intelligence (AI) pass cognitive puzzles designed for human IQ tests? The results were mixed. Researchers from the USC Viterbi School of Engineering’s Information Sciences Institute ...