multi_scraper

IngressRule

IngressRule(pattern: str | Pattern, scraper: IScraper, exclusive: bool = False)

A rule that defines how to handle a specific type of URL.

Attributes:

Name Type Description
match Pattern

A compiled regular expression used to match URLs.

scraper IScraper

An instance of a scraper to be used when the URL matches.

Parameters:

Name Type Description Default
pattern str | Pattern

The regex pattern to match against URLs.

required
scraper IScraper

The scraper to use for this match.

required
exclusive bool

If True, this rule is exclusive: when it matches, no other rules are processed.

False
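A minimal sketch of how a rule of this shape might hold a compiled pattern and test a URL against it. `RuleSketch` and its `matches` helper are hypothetical stand-ins written for illustration, not the library's `IngressRule`. The use of `re.search` is also an assumption; the real rule may use `match` or `fullmatch` semantics.

```python
import re
from typing import Any


class RuleSketch:
    """Hypothetical stand-in for IngressRule (not the library's class)."""

    def __init__(self, pattern: "str | re.Pattern[str]", scraper: Any,
                 exclusive: bool = False) -> None:
        # Assumption: a str pattern is compiled once, when the rule is created.
        self.match = re.compile(pattern) if isinstance(pattern, str) else pattern
        self.scraper = scraper
        self.exclusive = exclusive

    def matches(self, url: str) -> bool:
        # Assumption: search semantics, so the pattern may match anywhere in the URL.
        return self.match.search(url) is not None


# A rule built from a plain string pattern; the scraper is a placeholder value.
rule = RuleSketch(r"^https?://example\.com/", scraper="example-scraper", exclusive=True)

# A rule built from an already compiled pattern.
pdf_rule = RuleSketch(re.compile(r"\.pdf$"), scraper="pdf-scraper")
```

Accepting both `str` and a compiled pattern lets callers pass regex flags through `re.compile` when they need them, while keeping the common case short.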

from_scraper staticmethod

from_scraper(scraper: IScraper, exclusive: bool = False) -> IngressRule

Create an IngressRule from a scraper instance and its expected link format.

Parameters:

Name Type Description Default
scraper IScraper

The scraper to use for this rule.

required
exclusive bool

If True, this rule is exclusive and no other rules will be processed if it matches.

False

Returns:

Name Type Description
IngressRule IngressRule

An IngressRule instance with a match that always returns True.

MultiScraper

MultiScraper(ingress_rules: List[IngressRule], debug: bool = False, debug_delimiter: str = '; ')

Bases: IAsyncScraper

A scraper that uses multiple ingress rules to determine how to scrape a link.

Attributes:

Name Type Description
DEFAULT_USER_AGENT str

Default User-Agent used for HTTP requests.

ingress_rules List[IngressRule]

A list of ingress rule instances.

debug bool

Indicates whether debug mode is enabled.

debug_delimiter str

The delimiter used to join debug log messages.

Methods:

Name Description
async_scrape

async_scrape(url: str) -> ScrapeResult: Asynchronously scrapes the given URL using the first matching ingress rule. Returns a ScrapeResult indicating success or failure.

Parameters:

Name Type Description Default
ingress_rules list[IngressRule]

A list of IngressRule instances. None items are omitted.

required
debug bool

Enable debug mode. Defaults to False.

False
debug_delimiter str

Delimiter for joining debug log messages. Defaults to "; ".

'; '
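Omitting None items makes it convenient to build the rule list with conditional entries. The sketch below illustrates that idea and the role of the delimiter. `drop_none`, the `use_pdf_rule` flag, and the message strings are hypothetical, and plain strings stand in for IngressRule instances. How the library collects its debug messages is not documented here, so the list-and-join step is an assumption.

```python
from typing import List, Optional, Sequence


def drop_none(rules: Sequence[Optional[str]]) -> List[str]:
    """Keep the non-None entries, preserving their order."""
    return [rule for rule in rules if rule is not None]


use_pdf_rule = False  # hypothetical feature flag

# Strings stand in for IngressRule instances here.
candidate_rules = [
    "wiki-rule",
    "pdf-rule" if use_pdf_rule else None,  # optional rule, currently disabled
    "fallback-rule",
]
active_rules = drop_none(candidate_rules)

# Assumption: debug messages are gathered in a list and joined with the delimiter.
debug_messages = ["wiki-rule: no match", "fallback-rule: matched"]
debug_line = "; ".join(debug_messages)
```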

async_scrape async

async_scrape(url: str) -> ScrapeResult

Scrape the given URL using the appropriate scraper based on ingress rules.

Parameters:

Name Type Description Default
url str

The URL to scrape.

required

Returns:

Name Type Description
ScrapeResult ScrapeResult

The result of the scrape.
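The rule-based dispatch described above can be sketched as follows. Every name here (`ResultSketch`, `RuleSketch`, `dispatch`, and the two toy scrapers) is a hypothetical stand-in, not the library's code. The fall-through policy is also an assumption: a matching rule that fails passes control to the next rule unless it is exclusive.

```python
import asyncio
import re
from dataclasses import dataclass
from typing import Awaitable, Callable, List


@dataclass
class ResultSketch:
    """Hypothetical stand-in for ScrapeResult."""
    ok: bool
    content: str = ""
    error: str = ""


@dataclass
class RuleSketch:
    """Hypothetical stand-in for IngressRule."""
    match: "re.Pattern[str]"
    scrape: Callable[[str], Awaitable[ResultSketch]]
    exclusive: bool = False


async def dispatch(url: str, rules: List[RuleSketch]) -> ResultSketch:
    """Try rules in order and return the first usable result."""
    for rule in rules:
        if rule.match.search(url) is None:
            continue  # rule does not apply to this URL
        result = await rule.scrape(url)
        # Assumption: success ends the search; an exclusive rule ends it
        # even on failure, so no later rule is processed.
        if result.ok or rule.exclusive:
            return result
    return ResultSketch(ok=False, error="no rule produced a result")


async def failing(url: str) -> ResultSketch:
    # Toy scraper that always fails.
    return ResultSketch(ok=False, error="blocked")


async def generic(url: str) -> ResultSketch:
    # Toy scraper that always succeeds.
    return ResultSketch(ok=True, content=f"generic:{url}")


rules = [
    RuleSketch(re.compile(r"example\.com"), failing),
    RuleSketch(re.compile(r".*"), generic),
]
result = asyncio.run(dispatch("https://example.com/a", rules))
```

With the first rule marked `exclusive=True`, the same call would return the failed result from `failing` instead of falling through to `generic`.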