API Reference

The LangStruct class is the primary entry point for building extractors, parsing natural language queries, and exporting results. All examples below assume:

from langstruct import LangStruct

ls = LangStruct(example={
    "company": "Apple",
    "revenue": 125.3,
    "quarter": "Q3 2024",
})
LangStruct(
    schema: Optional[Type[Schema]] = None,
    model: Optional[Union[str, dspy.LM]] = None,
    optimizer: str = "miprov2",
    chunking_config: Optional[ChunkingConfig] = None,
    use_sources: bool = True,
    example: Optional[Dict[str, Any]] = None,
    examples: Optional[List[Dict[str, Any]]] = None,
    schema_name: str = "GeneratedSchema",
    descriptions: Optional[Dict[str, str]] = None,
    refine: Union[bool, Refine, Dict[str, Any]] = False,
    **llm_kwargs,
)
  • Provide a Pydantic schema, or pass one or more example dicts (example / examples) for automatic schema generation.
  • Pass model="gpt-4o-mini", model=dspy.LM(...), or omit to auto-detect from configured API keys.
  • Set refine=True or pass a Refine config to boost accuracy with additional model calls.
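
A minimal construction sketch combining these options; the field names and refine settings reuse values shown elsewhere on this page:

from langstruct import LangStruct

# Auto-generate a schema from one example dict, pin the model,
# and enable best-of-n refinement.
ls = LangStruct(
    example={"company": "Apple", "revenue": 125.3, "quarter": "Q3 2024"},
    model="gpt-4o-mini",
    refine={"strategy": "bon", "n_candidates": 5},
)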
result = ls.extract(
    text_or_texts,
    confidence_threshold: float = 0.0,
    validate: bool = True,
    debug: bool = False,
    return_sources: Optional[bool] = None,
    max_workers: Optional[int] = None,
    show_progress: bool = False,
    rate_limit: Optional[int] = None,
    retry_failed: bool = True,
    refine: Union[bool, Refine, Dict[str, Any], None] = None,
    **kwargs,
)
  • Accepts either a single string or a list of strings. Lists automatically parallelize.
  • validate=True runs LangStruct’s validator; combine with debug=True to surface suggestions.
  • Override return_sources to force-enable/disable character-level grounding per call.
  • Use refine=True or a custom dict (e.g., { "strategy": "bon", "n_candidates": 5 }).
  • Returns an ExtractionResult for single inputs or List[ExtractionResult] for batches.
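
A usage sketch; the documents and figures are illustrative (Globex is fictional), and ExtractionResult fields are not accessed here because this reference does not enumerate them:

docs = [
    "Apple reported revenue of $125.3B in Q3 2024.",
    "Globex posted revenue of $41.7B in Q2 2024.",
]

result = ls.extract(docs[0])    # single ExtractionResult
results = ls.extract(           # List[ExtractionResult], parallelized
    docs,
    confidence_threshold=0.5,
    max_workers=4,
    show_progress=True,
)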

Explicit parallel API when you want access to failure details:

results = ls.extract_batch(
    texts,
    max_workers: int = 10,
    show_progress: bool = True,
    rate_limit: Optional[int] = None,
    return_failures: bool = False,
)
  • Set return_failures=True to receive a ProcessingResult with successful/failed collections.
  • Honors the same validation/refinement arguments as extract.
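
A failure-aware sketch; the ProcessingResult attributes used here are the ones documented at the bottom of this page, and the rate_limit unit is an assumption:

outcome = ls.extract_batch(
    docs,                     # from the extract sketch above
    max_workers=10,
    rate_limit=60,            # assumed: requests per minute
    return_failures=True,
)

print(f"{outcome.success_rate:.0%} succeeded")
for failure in outcome.failed:
    print(failure)            # inspect each failed input
outcome.raise_if_failed()     # or fail fast instead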
parsed = ls.query(
    query_or_queries,
    explain: bool = True,
    max_workers: Optional[int] = None,
    show_progress: bool = False,
    rate_limit: Optional[int] = None,
    retry_failed: bool = True,
)
  • Converts natural language RAG queries into ParsedQuery objects containing semantic_terms, structured_filters, confidence, and optional explanations.
  • Accepts strings or lists; lists parallelize just like extract.
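
A sketch using the ParsedQuery fields named above; the exact shape of the filter values is an assumption:

parsed = ls.query("Tech companies with revenue over $100B in 2024")

print(parsed.semantic_terms)        # e.g. ["tech companies"]
print(parsed.structured_filters)    # e.g. a revenue > 100 constraint
print(parsed.confidence)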

Same spirit as extract_batch, returning either parsed queries or a ProcessingResult when return_failures=True.

results = ls.query_batch(
    queries,
    max_workers: int = 10,
    show_progress: bool = True,
    rate_limit: Optional[int] = None,
    return_failures: bool = False,
)
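
A batch sketch mirroring extract_batch:

outcome = ls.query_batch(
    ["revenue over $100B", "Q3 2024 filings in California"],
    return_failures=True,
)
outcome.raise_if_failed()
parsed_queries = outcome.successful    # List[ParsedQuery]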
ls.optimize(
    texts: List[str],
    expected_results: Optional[List[Dict]] = None,
    validation_split: float = 0.2,
)
  • Initializes a DSPy optimizer (MIPROv2 by default, or GEPA if optimizer="gepa").
  • Provide expected_results for supervised optimization; otherwise LangStruct uses metric-free improvements.
  • Returns the same LangStruct instance for chaining.
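
Because optimize returns the same instance, it chains straight into extraction. A supervised sketch with illustrative training pairs:

train_texts = [
    "Apple reported revenue of $125.3B in Q3 2024.",
    "Globex posted revenue of $41.7B in Q2 2024.",
]
train_labels = [
    {"company": "Apple", "revenue": 125.3, "quarter": "Q3 2024"},
    {"company": "Globex", "revenue": 41.7, "quarter": "Q2 2024"},
]

# MIPROv2 (the default) tunes the prompts, then the tuned pipeline extracts.
results = ls.optimize(
    train_texts,
    expected_results=train_labels,
    validation_split=0.2,
).extract(train_texts)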
scores = ls.evaluate(
    texts: List[str],
    expected_results: List[Dict],
    metrics: Optional[List[str]] = None,
)
  • Computes accuracy/F1 by default, with optional precision and recall.
  • Uses the current extractor pipeline, so run optimize beforehand if you want tuned prompts.
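
A sketch reusing the training pairs from the optimize example; the metric identifiers passed here, and the shape of the returned mapping, are assumptions based on the defaults described above:

scores = ls.evaluate(
    train_texts,
    expected_results=train_labels,
    metrics=["precision", "recall"],    # assumed metric identifiers
)
print(scores)                           # assumed: a dict keyed by metric name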
ls.export_batch(
    results: List[ExtractionResult],
    file_path: str,
    format: str = "csv",
    include_metadata: bool = False,
    include_sources: bool = False,
    **kwargs,
)
  • Supports csv, json, excel, and parquet outputs.
  • Set include_sources=True to embed grounding spans beside values.
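
An export sketch reusing results from an earlier extraction:

ls.export_batch(
    results,                    # List[ExtractionResult]
    "extractions.parquet",
    format="parquet",
    include_metadata=True,
    include_sources=True,       # embed grounding spans beside values
)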

save_annotated_documents / load_annotated_documents

  • Persist extractions to JSONL with save_annotated_documents(results, "extractions.jsonl").
  • Rehydrate them via load_annotated_documents(path) to continue processing or visualize later.
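
A round-trip sketch:

ls.save_annotated_documents(results, "extractions.jsonl")
restored = ls.load_annotated_documents("extractions.jsonl")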
html = ls.visualize(results_or_jsonl, file_path: Optional[str] = None, **kwargs)
  • Generates the interactive HTML viewer used across LangStruct demos.
  • Pass a file path to save the visualization, or omit to receive the HTML string.
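
Both call patterns, per the signature above:

html = ls.visualize(results)                      # returns the HTML string
ls.visualize(results, file_path="report.html")    # writes the viewer to disk
ls.visualize("extractions.jsonl",
             file_path="report.html")             # also accepts saved JSONL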
  • ls.save("./my_extractor") writes schema, pipeline state, optimizer config, and refinement options to disk (see the sketch after this list).
  • LangStruct.load(path) reconstructs the extractor (API keys must still be set in the environment).
  • Access ls.schema_info to retrieve field descriptions, JSON Schema, and an example payload structure. Useful for generating docs or debugging auto-generated schemas.
  • ls.save_annotated_documents, ls.load_annotated_documents, and ls.visualize share formats with LangExtract, which makes migrating annotated corpora straightforward.
  • ProcessingResult objects (returned when return_failures=True) expose successful, failed, success_rate, and helper methods like raise_if_failed().
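
A persistence sketch for the save/load and schema_info bullets above, using only the attributes and methods named on this page:

ls.save("./my_extractor")

# Later, in a fresh process (API keys must still be set in the environment):
from langstruct import LangStruct

restored = LangStruct.load("./my_extractor")
print(restored.schema_info)    # field descriptions, JSON Schema, example payload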