Custom LLM Scrapers
Scraipe supports LLM-powered feature extraction for advanced content parsing with LLM scrapers. This page explains how to integrate a custom LLM into Scraipe by creating your own LLM scraper.
Overview
LlmAnalyzerBase is an abstract base class that provides a common interface for LLM analyzers. The class:
- Orchestrates asynchronous requests to the LLM.
- Abstracts API-specific details by delegating the
query_llm()implementation to subclasses. - Ensures consistent LLM JSON responses with a user-defined Pydantic schema.
Steps to Create a Custom Analyzer
-
Extend LlmAnalyzerBase
Create your analyzer class that inherits fromLlmAnalyzerBase. -
Set up the analyzer in
__init__()
This could involve authenticating with a cloud-based LLM provider or initializing a model from a local LLM package.Always call
super().__init__()in your child's__init__()function to initialize shared properties such asinstruction,pydantic_schema,max_content_size, andmax_workers. This ensures that your analyzer is configured to perform the commonLlmAnalyzerBaselogic. -
Implement query_llm()
Define the asynchronous methodquery_llm(content: str, instruction: str)that:- Prepares the request for your LLM.
- Sends the request using your chosen client library.
- Returns the JSON response as a string.
Examples
GeminiAnalyzer Example
The GeminiAnalyzer and OpenAiAnalyzer implementations showcase:
- Initializing a client from a cloud-based LLM provider.
- Configuring a request from the provided instruction, content, and Pydantic schema.
- Returning the LLM provider's response.