Evaluation datasets

A total of 20 mathematical evaluation datasets were collected, all of which have been widely used since 2010 at dozens of top artificial intelligence conferences such as ACL, AAAI, and ICLR. Together, these datasets cover, to a certain extent, different grade levels, question types, text forms, and difficulty levels of mathematical problems, which allows a more comprehensive and fine-grained evaluation of the mathematical ability of the LLMs under evaluation.