Inference API Reference

LLMSQL Transformers Inference Function

This module provides a single function inference_transformers() that performs text-to-SQL generation using large language models via the Transformers backend.

Example

from llmsql.inference import inference_transformers

results = inference_transformers(
    model_or_model_name_or_path="Qwen/Qwen2.5-1.5B-Instruct",
    output_file="outputs/preds_transformers.jsonl",
    questions_path="data/questions.jsonl",
    tables_path="data/tables.jsonl",
    num_fewshots=5,
    batch_size=8,
    max_new_tokens=256,
    temperature=0.7,
    do_sample=True,
    model_kwargs={
        "torch_dtype": "bfloat16",
    },
    generation_kwargs={
        "repetition_penalty": 1.05,
    },
)

Notes

This function uses the HuggingFace Transformers backend and may produce slightly different outputs than the vLLM backend even with the same inputs due to differences in implementation and numerical precision.
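
To keep sampling variance out of such comparisons, greedy decoding can be pinned on both backends through the shared temperature, do_sample, and seed arguments. A minimal sketch, assuming both backends are installed; argument names follow the signatures documented on this page.

from llmsql.inference import inference_transformers, inference_vllm

# Shared, deterministic settings (temperature 0.0 with do_sample False = greedy).
common = dict(
    questions_path="data/questions.jsonl",
    tables_path="data/tables.jsonl",
    num_fewshots=5,
    max_new_tokens=256,
    temperature=0.0,
    do_sample=False,
    seed=42,
)

hf_results = inference_transformers(
    model_or_model_name_or_path="Qwen/Qwen2.5-1.5B-Instruct",
    output_file="outputs/preds_transformers_greedy.jsonl",
    **common,
)

vllm_results = inference_vllm(
    model_name="Qwen/Qwen2.5-1.5B-Instruct",
    output_file="outputs/preds_vllm_greedy.jsonl",
    **common,
)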

llmsql.inference.inference_transformers.inference_transformers(model_or_model_name_or_path: str | AutoModelForCausalLM, tokenizer_or_name: str | Any | None = None, *, trust_remote_code: bool = True, dtype: dtype = torch.float16, device_map: str | dict[str, int] | None = 'auto', hf_token: str | None = None, model_kwargs: dict[str, Any] | None = None, tokenizer_kwargs: dict[str, Any] | None = None, chat_template: str | None = None, max_new_tokens: int = 256, temperature: float = 0.0, do_sample: bool = False, top_p: float = 1.0, top_k: int = 50, generation_kwargs: dict[str, Any] | None = None, output_file: str = 'llm_sql_predictions.jsonl', questions_path: str | None = None, tables_path: str | None = None, workdir_path: str = 'llmsql_workdir', num_fewshots: int = 5, batch_size: int = 8, seed: int = 42) → list[dict[str, str]]

Run inference with a causal language model on the LLMSQL benchmark using the Transformers backend.

Parameters:
  • model_or_model_name_or_path – Model object or HF model name/path.

  • tokenizer_or_name – Tokenizer object or HF tokenizer name/path.

  Model loading:

  • trust_remote_code – Whether to trust remote code (default: True).

  • dtype – Torch dtype for model (default: float16).

  • device_map – Device placement strategy (default: “auto”).

  • hf_token – Hugging Face authentication token.

  • model_kwargs – Additional arguments for AutoModelForCausalLM.from_pretrained(). Note: ‘dtype’, ‘device_map’, ‘trust_remote_code’, ‘token’ are handled separately and will override values here.

  Tokenizer loading:

  • tokenizer_kwargs – Additional arguments for AutoTokenizer.from_pretrained(). ‘padding_side’ defaults to “left”. Note: ‘trust_remote_code’, ‘token’ are handled separately and will override values here.

  Prompt & chat:

  • chat_template – Optional chat template to apply before tokenization.

  Generation:

  • max_new_tokens – Maximum tokens to generate per sequence.

  • temperature – Sampling temperature (0.0 = greedy).

  • do_sample – Whether to use sampling vs greedy decoding.

  • top_p – Nucleus sampling parameter.

  • top_k – Top-k sampling parameter.

  • generation_kwargs – Additional arguments for model.generate(). Note: ‘max_new_tokens’, ‘temperature’, ‘do_sample’, ‘top_p’, ‘top_k’ are handled separately.

  Benchmark:

  • output_file – Output JSONL file path for completions.

  • questions_path – Path to benchmark questions JSONL.

  • tables_path – Path to benchmark tables JSONL.

  • workdir_path – Working directory path.

  • num_fewshots – Number of few-shot examples (0, 1, or 5).

  • batch_size – Batch size for inference.

  • seed – Random seed for reproducibility.

Returns:

List of generated SQL results with metadata.
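
Because model_or_model_name_or_path also accepts an already loaded AutoModelForCausalLM (and tokenizer_or_name a tokenizer object), a model can be loaded once and reused across calls. A minimal sketch under that assumption; torch_dtype and device_map here are ordinary Transformers loading options, not LLMSQL-specific ones.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmsql.inference import inference_transformers

# Load the model and tokenizer once with standard Transformers calls.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

# Pass the objects instead of a name/path.
results = inference_transformers(
    model_or_model_name_or_path=model,
    tokenizer_or_name=tokenizer,
    output_file="outputs/preds_transformers.jsonl",
    questions_path="data/questions.jsonl",
    tables_path="data/tables.jsonl",
    num_fewshots=5,
    batch_size=8,
    max_new_tokens=256,
)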

LLMSQL vLLM Inference Function

This module provides a single function inference_vllm() that performs text-to-SQL generation using large language models via the vLLM backend.

Example

from llmsql.inference import inference_vllm

results = inference_vllm(
    model_name="Qwen/Qwen2.5-1.5B-Instruct",
    output_file="outputs/predictions.jsonl",
    questions_path="data/questions.jsonl",
    tables_path="data/tables.jsonl",
    num_fewshots=5,
    batch_size=8,
    max_new_tokens=256,
    temperature=0.7,
    tensor_parallel_size=1,
)

Notes

This function uses the vLLM backend. Outputs may differ from the Transformers backend due to differences in implementation, batching, and numerical precision.
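
Beyond the basic example above, tensor_parallel_size, llm_kwargs, and sampling_kwargs (documented in the parameter list below) expose vLLM-specific options. A minimal multi-GPU sketch; gpu_memory_utilization, max_model_len, and top_p are standard vllm.LLM() / vllm.SamplingParams() options, and the values shown are illustrative assumptions.

from llmsql.inference import inference_vllm

results = inference_vllm(
    model_name="Qwen/Qwen2.5-1.5B-Instruct",
    output_file="outputs/predictions_tp2.jsonl",
    tensor_parallel_size=2,               # shard the model across 2 GPUs
    llm_kwargs={
        "gpu_memory_utilization": 0.90,   # forwarded to vllm.LLM()
        "max_model_len": 4096,
    },
    sampling_kwargs={
        "top_p": 0.9,                     # forwarded to vllm.SamplingParams()
    },
    temperature=0.7,
    max_new_tokens=256,
    num_fewshots=5,
)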

llmsql.inference.inference_vllm.inference_vllm(model_name: str, *, trust_remote_code: bool = True, tensor_parallel_size: int = 1, hf_token: str | None = None, llm_kwargs: dict[str, Any] | None = None, use_chat_template: bool = True, max_new_tokens: int = 256, temperature: float = 1.0, do_sample: bool = True, sampling_kwargs: dict[str, Any] | None = None, output_file: str = 'llm_sql_predictions.jsonl', questions_path: str | None = None, tables_path: str | None = None, workdir_path: str = 'llmsql_workdir', num_fewshots: int = 5, batch_size: int = 8, seed: int = 42) → list[dict[str, str]]

Run SQL generation using vLLM.

Parameters:
  • model_name – Hugging Face model name or path.

  Model loading:

  • trust_remote_code – Whether to trust remote code (default: True).

  • tensor_parallel_size – Number of GPUs for tensor parallelism (default: 1).

  • hf_token – Hugging Face authentication token.

  • llm_kwargs – Additional arguments for vllm.LLM(). Note: ‘model’, ‘tokenizer’, ‘tensor_parallel_size’, ‘trust_remote_code’ are handled separately and will override values here.

  Generation:

  • max_new_tokens – Maximum tokens to generate per sequence.

  • temperature – Sampling temperature (0.0 = greedy).

  • do_sample – Whether to use sampling vs greedy decoding.

  • sampling_kwargs – Additional arguments for vllm.SamplingParams(). Note: ‘temperature’, ‘max_tokens’ are handled separately and will override values here.

  Benchmark:

  • output_file – Path to write outputs (will be overwritten).

  • questions_path – Path to questions.jsonl (auto-downloads if missing).

  • tables_path – Path to tables.jsonl (auto-downloads if missing).

  • workdir_path – Directory to store downloaded data.

  • num_fewshots – Number of few-shot examples (0, 1, or 5).

  • batch_size – Number of questions per generation batch.

  • seed – Random seed for reproducibility.

Returns:

List of dicts containing question_id and generated completion.
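
The returned list can be consumed directly, and the same records are written to output_file as JSON Lines, so they can be reloaded without rerunning generation. A short sketch; question_id is documented above, but the exact key holding the generated SQL text is not, so the loop below prints whole records.

import json

# `results` is the list returned by the inference_vllm() example above.
for record in results[:3]:
    print(record["question_id"], record)

# Reload the persisted predictions from the output JSONL file.
with open("outputs/predictions.jsonl", encoding="utf-8") as f:
    reloaded = [json.loads(line) for line in f]
print(f"{len(reloaded)} predictions reloaded from disk")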

💬 Made with ❤️ by the LLMSQL Team