Open Source & Benchmarks

Open-source frameworks, tools, and benchmark datasets released by Mingyue Cheng's research group at USTC, covering agentic AI, time series analysis, table mining, and scientific literature mining.

Open Source Projects

CastClaw (观星阁)
2026.04 GitHub Website Docs
CastClaw is an open-source framework for agentic time series forecasting. It enables LLM agents to autonomously perform data analysis, model selection, feature engineering, and iterative forecasting refinement through structured tool-augmented workflows. CastClaw bridges classical time series methods with agentic reasoning, supporting forecasting scenarios from short-term point prediction to long-horizon multi-step forecasting, with built-in evaluation and interpretability mechanisms.
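As a toy illustration of the autonomous model-selection-and-refinement idea (all function names and candidate models here are illustrative sketches, not CastClaw's actual API), an agent can back-test candidate forecasters on a holdout window and forecast with the winner:

```python
def naive(history, horizon):
    """Repeat the last observed value."""
    return [history[-1]] * horizon

def mean_forecast(history, horizon):
    """Forecast the historical mean."""
    m = sum(history) / len(history)
    return [m] * horizon

def seasonal_naive(history, horizon, period=4):
    """Repeat the last full seasonal cycle."""
    cycle = history[-period:]
    return [cycle[i % period] for i in range(horizon)]

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def select_and_forecast(series, horizon, holdout=4):
    """Minimal selection loop: back-test each candidate on a holdout
    window, pick the lowest-error model, then forecast the future."""
    train, valid = series[:-holdout], series[-holdout:]
    candidates = {
        "naive": naive,
        "mean": mean_forecast,
        "seasonal_naive": seasonal_naive,
    }
    best = min(candidates,
               key=lambda name: mae(valid, candidates[name](train, holdout)))
    return best, candidates[best](series, horizon)

# A strongly seasonal toy series with period 4.
series = [10, 20, 30, 40] * 3
best, forecast = select_and_forecast(series, horizon=4)
```

In an agentic setting, an LLM would drive this loop through tools (inspect data, propose candidates, read evaluation feedback) rather than iterating over a fixed candidate list.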
CastFactory (炼星坊)
2026.05 GitHub
CastFactory is an open-source framework for LLM-driven time series forecasting (TSF) model training, designed to let users build and adapt TSF models with modern large-model pipelines. It provides a unified, practical workflow for continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL), helping researchers and practitioners develop, optimize, and evaluate LLM-based forecasting systems with far lower engineering overhead.
FutureCast (天星台)
2026.05 Website
FutureCast is a unified evaluation suite for time series models and agentic forecasting systems. It organizes benchmark management, metric computation, error attribution, and report generation into a consistent evaluation workflow, helping researchers compare model capabilities and system-level gains across datasets, scenarios, and task settings, including forecasting, classification, and anomaly detection.
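The core of a consistent evaluation workflow is computing the same metrics for every (dataset, model) run and emitting one comparable report. A minimal sketch of that pattern (the function names and run schema are assumptions for illustration, not FutureCast's interface):

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error over a forecast horizon."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error over a forecast horizon."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def evaluate(runs, metrics):
    """Compute every metric for every (dataset, model) run in one pass,
    yielding a flat report that can be grouped by dataset or by model."""
    report = {}
    for run in runs:
        key = (run["dataset"], run["model"])
        report[key] = {name: fn(run["y_true"], run["y_pred"])
                       for name, fn in metrics.items()}
    return report

runs = [
    {"dataset": "ETTh1", "model": "baseline",
     "y_true": [1.0, 2.0, 3.0], "y_pred": [1.0, 2.5, 2.5]},
]
report = evaluate(runs, {"MAE": mae, "RMSE": rmse})
```

Because every run passes through the same metric table, adding a dataset or a model never changes how the numbers are computed, which is what makes cross-model comparisons meaningful.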
NeoResearch (智多星)
2026.05 Website
NeoResearch is an autonomous research agent system for time series forecasting model development. It connects research hypothesis generation, candidate recipe search, controlled experiments, evaluation diagnosis, and research memory into a reproducible loop, supporting systematic iteration from problem definition to validated forecasting model candidates.
Academic Search
2026.04 GitHub
Academic Search is an open-source academic literature research skill for Claude Code. It unifies paper discovery across arXiv, Semantic Scholar, Google Scholar, CNKI, and other scholarly sources, while supporting query expansion, citation tracing, BibTeX export, PDF-first retrieval, and multi-source deduplication. With recency-aware ranking and browser-assisted fallback for hard-to-access platforms, it helps researchers surface recent, high-value papers and their code resources more efficiently.
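Multi-source deduplication typically hinges on matching records by a normalized title. A minimal sketch of that step (the record schema and helper names here are illustrative assumptions, not Academic Search's implementation):

```python
import re

def normalize_title(title):
    """Canonicalize a title for cross-source matching: lowercase,
    strip punctuation, collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^a-z0-9 ]", "", title.lower())).strip()

def deduplicate(records):
    """Merge records from multiple sources, keeping the first hit per
    normalized title and accumulating the list of sources that found it."""
    seen = {}
    for rec in records:
        key = normalize_title(rec["title"])
        if key in seen:
            seen[key]["sources"].extend(rec["sources"])
        else:
            seen[key] = {"title": rec["title"], "sources": list(rec["sources"])}
    return list(seen.values())

records = [
    {"title": "Attention Is All You Need", "sources": ["arxiv"]},
    {"title": "attention is all you need.", "sources": ["semantic_scholar"]},
    {"title": "A Different Paper", "sources": ["arxiv"]},
]
merged = deduplicate(records)
```

Real pipelines usually add fuzzier signals (DOI, arXiv ID, author overlap) on top of title normalization, since titles alone can collide or drift across sources.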
PaperScout
2026.01 GitHub Website Paper
PaperScout is an autonomous academic paper search agent trained with Process-Aware Sequence-Level Policy Optimization (PSPO). It formulates literature search as a multi-turn sequential decision-making problem, allowing agents to dynamically decide when to Search for new directions and when to Expand along citation and related-paper paths. PaperScout turns scholarly retrieval into an agentic RL environment for learning efficient tool-use policies and producing stronger recall under realistic research queries.
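The multi-turn formulation can be sketched as a tiny episode loop over a toy citation graph, where a policy chooses between the two action types each turn (graph, policy, and function names are hypothetical illustrations, not PaperScout's environment API):

```python
# Toy citation graph: paper -> papers reachable via citations / related work.
GRAPH = {
    "seed": ["p1", "p2"],
    "p1": ["p3"],
    "p2": ["p3", "p4"],
    "p3": [],
    "p4": ["p5"],
    "p5": [],
}

def search(query_results):
    """SEARCH action: pull fresh candidates for a new query direction."""
    return query_results.pop(0) if query_results else []

def expand(frontier):
    """EXPAND action: follow citation / related-paper edges from found papers."""
    return [n for p in frontier for n in GRAPH.get(p, [])]

def episode(policy, query_results, max_turns=4):
    """Multi-turn retrieval episode: at each turn the policy chooses
    SEARCH (new direction) or EXPAND (follow links), then the state updates."""
    found, frontier = set(), []
    for turn in range(max_turns):
        action = policy(turn, found)
        batch = search(query_results) if action == "SEARCH" else expand(frontier)
        frontier = [p for p in batch if p not in found]
        found.update(frontier)
    return found

# A fixed policy: search once, then keep expanding along citations.
policy = lambda turn, found: "SEARCH" if turn == 0 else "EXPAND"
papers = episode(policy, [["seed"], ["p4"]])
```

In PSPO training the fixed lambda above would be replaced by the learned LLM policy, with recall-style rewards shaping when it is worth opening a new search direction versus mining the citation neighborhood of what was already found.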
TabClaw
2026.03 GitHub Website Platform
TabClaw is an open-source agentic framework that empowers LLMs to reason over complex, real-world tabular data. It decomposes table-centric tasks into structured sub-goals, equips agents with code execution, schema-aware lookup, and formula tools, and coordinates them through multi-step decision workflows. TabClaw is designed to tackle challenges beyond flat QA — including multi-hop joins, conditional aggregation, and cross-table inference — making it suitable for enterprise data analysis and scientific table understanding.
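To make the task types concrete, here is a plain-Python sketch of a multi-hop join followed by a conditional aggregation, the kind of operation an agent's code-execution tool would generate (table contents and helper names are invented for illustration, not TabClaw's API):

```python
# Two toy tables: orders reference customers by id (the join hop),
# then we aggregate conditionally: total spend per region for orders > 100.
customers = [
    {"id": 1, "name": "Ada", "region": "EU"},
    {"id": 2, "name": "Bo", "region": "US"},
]
orders = [
    {"customer_id": 1, "amount": 150},
    {"customer_id": 1, "amount": 80},
    {"customer_id": 2, "amount": 200},
]

def join(left, right, left_key, right_key):
    """Inner join two row lists on the given keys."""
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[left_key], [])]

def conditional_sum(rows, group_key, value_key, predicate):
    """Group rows by group_key, summing value_key over rows matching predicate."""
    totals = {}
    for row in rows:
        if predicate(row):
            totals[row[group_key]] = totals.get(row[group_key], 0) + row[value_key]
    return totals

joined = join(orders, customers, "customer_id", "id")
result = conditional_sum(joined, "region", "amount", lambda r: r["amount"] > 100)
```

Flat QA would answer from a single table lookup; the point of agentic decomposition is planning chains like this (join, filter, aggregate) and verifying each intermediate result before moving on.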
Claw-R1
2026.03 GitHub Website
Claw-R1 is an open-source framework for training reasoning-intensive, tool-using LLM agents via reinforcement learning. It extends standard RL environments with agentic action spaces — including tool invocation, multi-turn interaction, and environment feedback — and supports GRPO-based policy optimization with customizable reward signals. Designed for reproducibility and extensibility, Claw-R1 enables researchers to study how RL shapes emergent reasoning behaviors, tool-use strategies, and decision robustness in large language models.
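The group-relative step at the heart of GRPO replaces a learned value baseline with statistics of a rollout group: rewards for several completions of the same prompt are normalized by the group's mean and standard deviation. A minimal sketch of that computation (a generic illustration of GRPO's advantage formula, not Claw-R1's training code):

```python
import statistics

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages as used in GRPO: for a group of rollouts
    sampled from the same prompt, normalize each scalar reward by the
    group mean and standard deviation instead of a learned critic."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]

# Four rollouts of one prompt with binary task-success rewards.
advantages = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

The resulting advantages weight the policy-gradient update, so successful rollouts in a group are reinforced relative to their failed siblings without training a separate value model.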
Agent-R1
2025.04 GitHub Website Docs
Agent-R1 is a large-model agent training framework for end-to-end reinforcement learning fine-tuning. It supports GRPO-based policy optimization with reward signals derived from environment feedback, and equips agents with multi-tool orchestration, persistent long-term memory, and reflective self-correction across interaction turns. Grounded in the DeepSeek-R1 paradigm, Agent-R1 enables researchers to train LLM agents that not only reason under uncertainty but also learn to use tools strategically and recover from failures autonomously.

Benchmarks & Datasets

E-Commerce Search · Recall to Relevance
KuaiSearch Search-based Recommendation Paper
2026.02 Website GitHub
KuaiSearch: A Large-Scale E-Commerce Search Dataset for Recall, Ranking, and Relevance is built from real user search interactions on Kuaishou. It preserves authentic user queries and natural-language product texts, covers cold-start users and long-tail products, and spans the three key stages of modern search systems: recall, ranking, and relevance judgment.
Scientific Literature · Agentic Evaluation
PaperArena Scientific Literature Mining Paper
2025.10 Website GitHub
PaperArena: An Evaluation Benchmark for Tool-Augmented Agentic Reasoning on Scientific Literature evaluates agentic systems on scientific reading and reasoning with tool use. It targets literature understanding, tool-augmented reasoning, and evidence-grounded evaluation over scientific papers and related scholarly workflows.
AI for Science · Chemical Tables
ChemTable Scientific Literature Mining Paper
2025.06 GitHub
Benchmarking Multimodal LLMs on Recognition and Understanding over Chemical Tables introduces ChemTable, a real-world benchmark curated from chemical literature. It supports two core tasks: table recognition and table understanding, with expert annotations over cell polygons, logical layouts, and chemistry-specific semantic labels such as reagents, catalysts, yields, and graphical components.
RAG Evaluation · Dynamic Benchmark
HoH RAG Paper
2025.06 GitHub
HoH: A Dynamic Benchmark for Evaluating the Impact of Outdated Information on RAG studies how retrieval-augmented generation systems fail when knowledge becomes stale. It provides a dynamic evaluation setting for measuring temporal robustness, outdated-information sensitivity, and the impact of knowledge freshness in modern RAG pipelines.