MultiScraper Guide
MultiScraper handles links by delegating scrape requests to appropriate scrapers based on customizable ingress rules. It's designed for flexible and fault-tolerant scraping of links that could be different types of documents.
This page explains how to tailor MultiScraper for specific news and non-news links.
How It Works
- Ingress Rules: Each `IngressRule` consists of a regex pattern and an associated scraper. If a link matches the pattern, the corresponding scraper is used.
- Rule Processing: The rules are processed in order. If a rule succeeds, its result is immediately returned. If the rule skips or fails execution, subsequent rules have the opportunity to match and scrape the links.
- Error Preservation: When enabled, errors from failed scraping attempts are preserved for debugging.
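The ordered, fall-through behavior described in this list can be modeled in a few lines of plain Python. This is a minimal sketch, not Scraipe's implementation: `Rule`, `dispatch`, `preserve_errors`, and the two toy scraper functions are hypothetical names invented for illustration, and the use of `re.search` (substring matching) is an assumption.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Rule:
    """Hypothetical stand-in for an ingress rule: a regex plus a scrape callable."""
    pattern: str
    scrape: Callable[[str], str]  # returns content, or raises on failure


def dispatch(
    link: str, rules: List[Rule], preserve_errors: bool = False
) -> Tuple[Optional[str], List[str]]:
    """Try rules in order and return (content of first success, preserved errors)."""
    errors: List[str] = []
    for rule in rules:
        if not re.search(rule.pattern, link):
            continue  # rule skipped: the pattern does not match this link
        try:
            return rule.scrape(link), errors  # first success wins
        except Exception as exc:
            if preserve_errors:
                errors.append(f"{rule.pattern}: {exc}")
            # rule failed: fall through to the next rule
    return None, errors


def broken_news_scraper(link: str) -> str:
    # Toy scraper that always fails, to exercise the fall-through path
    raise RuntimeError("paywall")


def plain_text_scraper(link: str) -> str:
    # Toy scraper that always succeeds
    return f"text from {link}"


rules = [
    Rule(r"(news|article|story)", broken_news_scraper),
    Rule(r".*", plain_text_scraper),
]

content, errors = dispatch("https://example.com/news/1", rules, preserve_errors=True)
print(content)
print(errors)
```

In this model the first rule matches but fails, so the catch-all rule produces the content, and the failure is kept in `errors` only because `preserve_errors` is enabled.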
Usage
1. Define Ingress Rules

Create rules by specifying a regex pattern and an instance of a class that implements `IScraper`. These rules will be evaluated in order until one returns a successful `ScrapeResult`.

```python
from scraipe.defaults import IngressRule, MultiScraper, TextScraper
from scraipe.extended import NewsScraper

# Define ingress rule for news links and TextScraper fallback
ingress_rules = [
    # Use NewsScraper for links containing "news", "article", or "story"
    IngressRule(
        r"(news|article|story)",
        scraper=NewsScraper()
    ),
    # Fallback to TextScraper
    IngressRule(
        r".*",
        scraper=TextScraper()
    ),
]
```

2. Create the MultiScraper

Pass the list of ingress rules, which here ends with the catch-all fallback rule, when creating an instance of `MultiScraper`. Note that setting `debug=True` will save information about processed ingress rules in the result's error field.

```python
# Instantiate a new MultiScraper with our custom ingress rules
multi_scraper = MultiScraper(ingress_rules=ingress_rules, debug=True)
```

3. Scrape

Use the `scrape` method to test your `MultiScraper` on a news link and a non-news link.

```python
# Scrape a news link
link = "https://apnews.com/article/studio-ghibli-chatgpt-images-hayao-miyazaki-openai-0f4cb487ec3042dd5b43ad47879b91f4"
result = multi_scraper.scrape(link)
print("=== News Link ===")
print("Content:", result.content[0:400])
print("Debug Info:", result.error)  # Ingress debug chain stored in error

# Scrape a non-news link
link = "https://www.example.com/"
result = multi_scraper.scrape(link)
print("\n=== Non-News Link ===")
print("Content:", result.content[0:400])
print("Debug Info:", result.error)  # Ingress debug chain stored in error
```
Running this script will output the content and debug info from processing the different links. Notice how the news link used NewsScraper while the non-news link used TextScraper, as expected based on our ingress rules.
=== News Link ===
Content: ChatGPT’s viral Studio Ghibli-style images highlight AI copyright concerns
LOS ANGELES (AP) — Fans of Studio Ghibli, the famed Japanese animation studio behind “Spirited Away” and other beloved movies, were delighted this week when a new version of ChatGPT let them transform popular internet memes or personal photos into the distinct style of Ghibli founder Hayao Miyazaki.
But the trend also highl
Debug Info: <class 'scraipe.extended.news_scraper.NewsScraper'>[SUCCESS]
=== Non-News Link ===
Content: Example Domain
Example Domain
This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.
More information...
Debug Info: <class 'scraipe.defaults.text_scraper.TextScraper'>[SUCCESS]
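Before running any scrapers, it can be useful to check which pattern a link would hit first. The sketch below uses only Python's standard `re` module with the two patterns from the ingress rules in this guide. `first_matching_pattern` is a hypothetical helper, and it assumes the patterns are applied as substring searches (`re.search`), which the "links containing" wording suggests but this guide does not confirm.

```python
import re
from typing import List, Optional

# Patterns copied from the ingress rules in this guide, in the same order
patterns: List[str] = [r"(news|article|story)", r".*"]


def first_matching_pattern(link: str, patterns: List[str]) -> Optional[str]:
    """Return the first pattern that matches anywhere in the link, or None."""
    for pattern in patterns:
        if re.search(pattern, link):
            return pattern
    return None


links = [
    "https://apnews.com/article/studio-ghibli-chatgpt-images-hayao-miyazaki-openai-0f4cb487ec3042dd5b43ad47879b91f4",
    "https://www.example.com/",
    "https://www.example.com/history",  # contains "story" inside "history"
]

for link in links:
    print(first_matching_pattern(link, patterns), "<-", link)
```

Under that assumption, the third link would be routed to the news rule because "history" contains "story". A tighter pattern such as one anchored on path segments may be worth considering if such false positives matter.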
Tips
- Test each custom scraper in isolation before integrating it with `MultiScraper`.
- Enable error preservation during development to capture and debug failure chains.
- Leverage built-in scrapers provided by Scraipe.
- Extend `IScraper` or `IAsyncScraper` to implement your custom scraping logic.
- Plug your `MultiScraper` into a workflow for bulk scraping and analysis.