MultiScraper Guide

MultiScraper handles links by delegating scrape requests to the appropriate scraper based on customizable ingress rules. It is designed for flexible, fault-tolerant scraping of links that may point to different types of documents.

This page explains how to tailor MultiScraper for specific news and non-news links.

How It Works

  • Ingress Rules: Each IngressRule consists of a regex pattern and an associated scraper. If a link matches the pattern, the corresponding scraper is used.
  • Rule Processing: The rules are processed in order. If a rule succeeds, its result is returned immediately. If a rule is skipped (its pattern does not match) or its scraper fails, subsequent rules get the opportunity to match and scrape the link.
  • Error Preservation: When enabled, errors from failed scraping attempts are preserved for debugging.
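
The ordering and fallback behavior described above can be sketched in plain Python. This is a minimal illustration, not Scraipe's implementation: SketchRule, SketchResult, and dispatch are hypothetical names, and matching with re.search is an assumption about how patterns are applied.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SketchResult:
    """Hypothetical stand-in for a scrape result (not Scraipe's ScrapeResult)."""
    link: str
    content: Optional[str] = None
    error: Optional[str] = None

    @property
    def success(self) -> bool:
        return self.content is not None


@dataclass
class SketchRule:
    """Hypothetical stand-in for an ingress rule: a regex plus a scrape callable."""
    pattern: str
    scrape: Callable[[str], SketchResult]
    name: str


def dispatch(link: str, rules: List[SketchRule], debug: bool = False) -> SketchResult:
    """Try rules in order; return the first success, otherwise a failure."""
    chain = []  # one entry per rule that matched and was attempted
    for rule in rules:
        if not re.search(rule.pattern, link):
            continue  # rule skipped: pattern does not match this link
        result = rule.scrape(link)
        if result.success:
            chain.append(f"{rule.name}[SUCCESS]")
            if debug:
                # Preserve the attempt chain alongside the successful content
                result.error = "; ".join(chain)
            return result
        chain.append(f"{rule.name}[FAIL]: {result.error}")
    return SketchResult(link, error="; ".join(chain) or "no rule matched")


def failing_news(link: str) -> SketchResult:
    # Simulates a specialized scraper that cannot handle this link
    return SketchResult(link, error="paywall")


def plain_text(link: str) -> SketchResult:
    # Simulates a generic scraper that always succeeds
    return SketchResult(link, content=f"text of {link}")


rules = [
    SketchRule(r"(news|article|story)", failing_news, "news"),
    SketchRule(r".*", plain_text, "text"),
]

news_result = dispatch("https://example.com/news/1", rules, debug=True)
print(news_result.content)
print(news_result.error)
```

In this sketch the first rule matches but its scraper fails, so the catch-all rule gets its turn and the recorded chain shows both attempts.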

Usage

  1. Define Ingress Rules
    Create rules by specifying a regex pattern and an instance of a class that implements IScraper. These rules will be evaluated in order until one returns a successful ScrapeResult.

    from scraipe.defaults import IngressRule, MultiScraper, TextScraper
    from scraipe.extended import NewsScraper
    
    # Define ingress rule for news links and TextScraper fallback
    ingress_rules = [
        # Use NewsScraper for links containing "news", "article", or "story"
        IngressRule(
            r"(news|article|story)",
            scraper=NewsScraper()
        ),
        # Fallback to TextScraper
        IngressRule(
            r".*",
            scraper=TextScraper()
        ),
    ]
    
  2. Create the MultiScraper

    Pass the list of ingress rules when creating an instance of MultiScraper. In this example the final r".*" rule acts as the fallback, so no separate fallback scraper is passed. Note that setting debug=True saves information about the processed ingress rules in the result's error field.

    # Instantiate a new MultiScraper with our custom ingress rules
    multi_scraper = MultiScraper(ingress_rules=ingress_rules, debug=True)
    
  3. Scrape
    Use the scrape method to test your MultiScraper on a news link and a non-news link.

    # Scrape a news link
    link = "https://apnews.com/article/studio-ghibli-chatgpt-images-hayao-miyazaki-openai-0f4cb487ec3042dd5b43ad47879b91f4"
    result = multi_scraper.scrape(link)
    
    print("=== News Link ===")
    print("Content:", result.content[0:400])
    print("Debug Info:", result.error) # Ingress debug chain stored in error
    
    # Scrape a non-news link
    link = "https://www.example.com/"
    result = multi_scraper.scrape(link)
    
    print("\n=== Non-News Link ===")
    print("Content:", result.content[0:400])
    print("Debug Info:", result.error) # Ingress debug chain stored in error
    

Running this script will output the content and debug info from processing the different links. Notice how the news link used NewsScraper while the non-news link used TextScraper, as expected based on our ingress rules.

=== News Link ===
Content: ChatGPT’s viral Studio Ghibli-style images highlight AI copyright concerns
LOS ANGELES (AP)  Fans of Studio Ghibli, the famed Japanese animation studio behind “Spirited Away” and other beloved movies, were delighted this week when a new version of ChatGPT let them transform popular internet memes or personal photos into the distinct style of Ghibli founder Hayao Miyazaki.
But the trend also highl
Debug Info: <class 'scraipe.extended.news_scraper.NewsScraper'>[SUCCESS]

=== Non-News Link ===
Content: Example Domain
Example Domain
This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.
More information...
Debug Info: <class 'scraipe.defaults.text_scraper.TextScraper'>[SUCCESS]
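
Before running any scraper, it can help to check which pattern a link would hit first. The snippet below is a rough sketch using only the standard re module. It assumes patterns are applied with search semantics (a match anywhere in the link), which is consistent with the "links containing" wording above but is not confirmed here; first_matching_pattern is a hypothetical helper, not part of Scraipe.

```python
import re
from typing import List, Optional

# Same patterns as the ingress rules in this guide, in the same order
patterns = [r"(news|article|story)", r".*"]


def first_matching_pattern(link: str, patterns: List[str]) -> Optional[str]:
    """Return the first pattern found anywhere in the link, or None."""
    for pattern in patterns:
        if re.search(pattern, link):
            return pattern
    return None


news_link = (
    "https://apnews.com/article/studio-ghibli-chatgpt-images-"
    "hayao-miyazaki-openai-0f4cb487ec3042dd5b43ad47879b91f4"
)
plain_link = "https://www.example.com/"
tricky_link = "https://example.com/history"  # "history" contains "story"

for link in (news_link, plain_link, tricky_link):
    print(link, "->", first_matching_pattern(link, patterns))

# A stricter, path-segment pattern avoids the accidental substring match
strict_news = r"/(news|article|story)/"
print(bool(re.search(strict_news, tricky_link)))
print(bool(re.search(strict_news, news_link)))
```

Under that assumption, an unanchored pattern such as r"(news|article|story)" also matches unrelated links like one ending in /history, so a tighter pattern may be worth considering when rule order matters.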

Tips

  • Test each custom scraper in isolation before integrating it with MultiScraper.
  • Enable error preservation during development to capture and debug failure chains.
  • Leverage built-in scrapers provided by Scraipe.
  • Extend IScraper or IAsyncScraper to implement your custom scraping logic.
  • Plug your MultiScraper into a workflow for bulk scraping and analysis.