Skip to content

Async Architecture

Many scraping and analysis tasks are IO-bound meaning we have to wait a lot for network responses. To achieve high performance, most scrapers and analyzers need to execute their logic asynchronously. Scraipe's synchronous API cannot directly execute asynchronous code, so Scraipe manages this interaction under the hood.

This page provides a deep dive into how Scraipe seamlessly orchestrates synchronous and asynchronous code. If you just want to create async components, check out the guide on custom components.

Async Interfaces

Scraipe provides two primary interfaces for asynchronous operations: IAsyncScraper and IAsyncAnalyzer. These interfaces extend their synchronous counterparts (IScraper and IAnalyzer).

  • IAsyncScraper: Designed for asynchronous scraping. It allows you to implement async_scrape() for non-blocking operations and provides synchronous wrappers (scrape(), scrape_multiple) for compatibility with synchronous workflows.
  • IAsyncAnalyzer: Designed for asynchronous analysis tasks. Similar to IAsyncScraper, it provides async_analyze() for non-blocking analysis and provides synchronous wrappers (analyze(), scrape_multiple).

The synchronous wrappers use Async Orchestration to run the async functions from a synchronous context. As soon as you implement IAsyncScraper.async_scrape(), the class will behave like a normal IScraper without any additional configuration.

Async Orchestration

Scraipe's async orchestration is powered by the AsyncManager and implementations of IAsyncExecutor. These components ensure seamless integration of asynchronous operations within a synchronous API.

IAsyncExecutor

Executors manage the execution of asynchronous tasks. Scraipe provides two implementations of IAsyncExecutor: - DefaultBackgroundExecutor: Runs a single asyncio event loop in a dedicated background thread. This is the default executor used by AsyncManager. - EventLoopPoolExecutor: Manages a pool of asyncio event loops, each running in its own thread. It balances tasks across the pool for improved concurrency.

AsyncManager

The AsyncManager is a static provider for an IAsyncExecutor instance.

  • get_executor(): Get the singleton executor instance. This is a DefaultBackgroundExecutor instance by default.
  • set_executor(): Allows switching between the singleton executor instances.
  • enable_multithreading(pool_size: int = 3): Enables multithreading by switching the executor to an instance of EventLoopPoolExecutor. It creates a pool of the given size.
  • disable_multithreading(): Disables multithreading by switching the executor to an instance of DefaultBackgroundExecutor.