Evaluation API Reference¶
The evaluate() function allows you to benchmark Text-to-SQL model outputs against the LLMSQL gold queries and SQLite database. It prints metrics, logs mismatches, and saves detailed reports automatically.
Features¶
Evaluates model predictions from JSONL files or Python dicts.
Automatically downloads the benchmark questions and SQLite DB if they are missing.
Prints mismatch summaries and supports configurable reporting.
Saves a detailed JSON report with metrics, mismatches, timestamp, and input mode.
Usage Examples¶
Evaluate from a JSONL file:
from llmsql.evaluation.evaluate import evaluate
report = evaluate("path_to_outputs.jsonl")
print(report)
Evaluate from a list of Python dicts:
predictions = [
{"question_id": "1", "predicted_sql": "SELECT name FROM Table WHERE age > 30"},
{"question_id": "2", "predicted_sql": "SELECT COUNT(*) FROM Table"},
]
report = evaluate(predictions)
print(report)
Providing your own DB and questions (skips the automatic workdir download):
report = evaluate(
    "path_to_outputs.jsonl",
    questions_path="bench/questions.jsonl",
    db_path="bench/sqlite_tables.db",
    workdir_path=None,
)
Function Arguments¶
| Argument | Description |
|---|---|
| outputs | Path to a JSONL file or a list of prediction dicts (required). |
| workdir_path | Directory for automatic benchmark downloads. Ignored if both questions_path and db_path are provided. Default: "llmsql_workdir". |
| questions_path | Optional path to the benchmark questions JSONL file. |
| db_path | Optional path to the SQLite DB with the evaluation tables. |
| save_report | Path to save the detailed JSON report. Defaults to "evaluation_results_{uuid}.json". |
| show_mismatches | Print mismatches while evaluating. Default: True. |
| max_mismatches | Maximum number of mismatches to display. Default: 5. |
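As a sketch, the call below combines several of these arguments; the file paths are placeholders for your own outputs and report location:
from llmsql.evaluation.evaluate import evaluate

# Placeholder paths; replace with your own files.
report = evaluate(
    "path_to_outputs.jsonl",
    save_report="reports/eval_run.json",  # write the JSON report here instead of the auto-generated name
    show_mismatches=True,                 # print mismatches as they are found
    max_mismatches=10,                    # show up to 10 mismatches instead of the default 5
)
print(report["accuracy"])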
Input Format¶
When supplied as a file, the predictions must be in JSONL format, one JSON object per line with question_id and predicted_sql fields:
{"question_id": "1", "predicted_sql": "SELECT name FROM Table WHERE age > 30"}
{"question_id": "2", "predicted_sql": "SELECT COUNT(*) FROM Table"}
{"question_id": "3", "predicted_sql": "SELECT * FROM Table WHERE active=1"}
Output Metrics¶
The function returns a dictionary with the following keys:
total – Total queries evaluated
matches – Queries where predicted SQL results match gold results
pred_none – Queries where the model returned NULL or no result
gold_none – Queries where the reference result was NULL or no result
sql_errors – Invalid SQL or execution errors
accuracy – Overall exact match accuracy
mismatches – List of mismatched queries with details
timestamp – Evaluation timestamp
input_mode – How the predictions were provided ("jsonl_path" or "dict_list")
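A short sketch of reading these keys from the returned dictionary; the exact structure of each mismatch entry is not specified here, so the entries are simply printed as-is:
from llmsql.evaluation.evaluate import evaluate

report = evaluate("path_to_outputs.jsonl", show_mismatches=False)

print("Accuracy:", report["accuracy"])
print("Matches:", report["matches"], "of", report["total"])
print("SQL errors:", report["sql_errors"])

# Inspect the first few mismatches recorded in the report.
for entry in report["mismatches"][:3]:
    print(entry)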
Report Saving¶
By default, a report is saved automatically as evaluation_results_{uuid}.json in the current directory. It contains metrics, mismatches, timestamp, and input mode. You can override this path using save_report.
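Because the report is plain JSON, it can be reloaded later for further analysis; substitute the actual file name produced by your run:
import json

# Replace with the report path from your evaluation run.
with open("evaluation_results_1234.json", encoding="utf-8") as f:
    saved = json.load(f)

print(saved["timestamp"], saved["accuracy"])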
—
LLMSQL Evaluation Module¶
Provides the evaluate() function to benchmark Text-to-SQL model outputs on the LLMSQL benchmark.
See the documentation for full usage details.
- llmsql.evaluation.evaluate.evaluate(outputs: str | list[dict[int, str | int]], *, workdir_path: str | None = 'llmsql_workdir', questions_path: str | None = None, db_path: str | None = None, save_report: str | None = None, show_mismatches: bool = True, max_mismatches: int = 5) → dict¶
Evaluate predicted SQL queries against the LLMSQL benchmark.
- Parameters:
outputs – Either a JSONL file path or a list of dicts.
workdir_path – Directory for automatic benchmark downloads (ignored if both questions_path and db_path are provided).
questions_path – Manual path to benchmark questions JSONL.
db_path – Manual path to SQLite benchmark DB.
save_report – Optional manual save path. If None → auto-generated.
show_mismatches – Print mismatches while evaluating.
max_mismatches – Max mismatches to print.
- Returns:
Metrics and mismatches.
- Return type:
dict