Welcome to the LLMSQL Project

LLMSQL is a Python package for SQL reasoning with LLMs and vLLM inference.

💡 Description

LLMSQL Benchmark is an open-source framework providing a modernized, cleaned, and extended version of the original WikiSQL dataset, specifically designed for evaluating and fine-tuning Large Language Models (LLMs) on Text-to-SQL tasks.


📚 Documentation

Note: Documentation pages (installation guide, API reference) are under construction. See Quick Start below.

⚡ Quick Start

⚠️ WARNING — Reproducibility

vLLM and HuggingFace Transformers may produce different results even with identical settings (e.g., temperature=0), due to differences in implementation, numerical precision, and batching behavior.

Recommendation: when comparing model quality, use the same backend (either only vLLM or only Transformers).

Sources:
• vLLM FAQ
• vLLM Supported Models policy
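For illustration, here is a minimal, hypothetical sketch of greedy decoding with both backends; the model id and prompt are placeholders, and even with temperature fixed at 0 the two printed completions are not guaranteed to match:

from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder model id, not prescribed by llmsql
prompt = "Question: How many singers do we have? SQL:"

# Greedy decoding with HuggingFace Transformers
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
inputs = tokenizer(prompt, return_tensors="pt")
hf_out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(hf_out[0], skip_special_tokens=True))

# Greedy decoding with vLLM (temperature=0 selects the argmax token)
llm = LLM(model=MODEL)
vllm_out = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=64))
print(vllm_out[0].outputs[0].text)

If the two strings differ, that is the backend effect described above, which is why scores obtained with different backends should not be compared directly.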

Installation

pip3 install llmsql

Recommended Workflow (vLLM)

pip install "llmsql[vllm]"
llmsql evaluate --model Qwen/Qwen2.5-7B-Instruct --dataset llmsql_dev

(The model id is illustrative; pass any model your chosen backend can load. Proprietary API models such as gpt-4 cannot be served through vLLM.)

Evaluation API (Python)

from llmsql import LLMSQLEvaluator

# Point the evaluator at a working directory for benchmark files and caches.
evaluator = LLMSQLEvaluator(workdir_path="llmsql_workdir")

# Score a JSONL file of model outputs and print the resulting report.
report = evaluator.evaluate(outputs_path="path_to_your_outputs.jsonl")
print(report)
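The outputs file uses JSON Lines, one prediction per line. As a purely hypothetical sketch (the field names below are assumptions, not the package's documented schema), producing such a file could look like this:

import json

# Hypothetical records: the "id" and "sql" keys are assumptions,
# not the documented llmsql output schema.
predictions = [
    {"id": 0, "sql": "SELECT COUNT(name) FROM singers"},
    {"id": 1, "sql": "SELECT name FROM singers WHERE age > 30"},
]

with open("path_to_your_outputs.jsonl", "w") as f:
    for row in predictions:
        f.write(json.dumps(row) + "\n")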

🔗 Resources

• 📦 PyPI Project: llmsql on PyPI
• 💾 Dataset on Hugging Face: llmsql-bench dataset
• 💻 Source Code: GitHub repo

📊 Leaderboard [in progress]

The official leaderboard is currently empty. Submit your model results to be the first on the ranking!

📄 Citation

@inproceedings{llmsql_bench,
  title={LLMSQL: Upgrading WikiSQL for the LLM Era of Text-to-SQL},
  author={Pihulski, Dzmitry and Charchut, Karol and Novogrodskaia, Viktoria and Koco{\'n}, Jan},
  booktitle={2025 IEEE International Conference on Data Mining Workshops (ICDMW)},
  year={2025},
  organization={IEEE}
}
💬 Made with ❤️ by the LLMSQL Team