
news_scraper

NewsScraper

NewsScraper(headers=None)

Bases: IAsyncScraper

A scraper that uses aiohttp and trafilatura to extract article content.

Retrieves HTML content from a given URL and extracts the main article text using the trafilatura library. HTTP errors surface either as an exception (from get_site_html) or as a failed ScrapeResult (from async_scrape).

Parameters:

Name Type Description Default
headers dict

A dictionary of HTTP headers to send with each request. If None, a default User-Agent header is used.

None
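The None-means-default behaviour of headers can be sketched as follows. This is only an illustration of the pattern, not the class's actual code: resolve_headers and DEFAULT_HEADERS are hypothetical names, and the User-Agent string is a placeholder because the real default value is not documented here.

```python
from typing import Dict, Optional

# Hypothetical default; the actual User-Agent string is not documented here.
DEFAULT_HEADERS: Dict[str, str] = {
    "User-Agent": "Mozilla/5.0 (compatible; example-scraper)"
}


def resolve_headers(headers: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return the caller's headers, or a copy of the default when None is given."""
    if headers is None:
        return dict(DEFAULT_HEADERS)
    # Copy so that later mutation by the caller does not affect the scraper.
    return dict(headers)
```

Using None as the sentinel, rather than a dict literal in the signature, avoids sharing one mutable default between instances.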

get_site_html async

get_site_html(url: str) -> str

Retrieve HTML content from the specified URL using aiohttp.

Parameters:

Name Type Description Default
url str

The URL of the webpage to scrape.

required

Returns:

Name Type Description
str str

The HTML content of the webpage.

Raises:

Type Description
ClientResponseError

If the HTTP response status is not 200.

async_scrape async

async_scrape(url: str) -> ScrapeResult

Asynchronously scrape the specified URL and extract its content.

Parameters:

Name Type Description Default
url str

The URL to scrape and extract content from.

required

Returns:

Name Type Description
ScrapeResult ScrapeResult

A ScrapeResult containing the extracted text on success, or an error message if scraping fails.
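A minimal sketch of how a caller might consume async_scrape for several URLs at once. Because neither NewsScraper's import path nor ScrapeResult's fields are shown on this page, the block uses stand-ins: ScrapeResultSketch (with assumed fields url, content, and error) and StubScraper, which returns canned text instead of making HTTP requests. The helper scrape_all is likewise hypothetical.

```python
import asyncio
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ScrapeResultSketch:
    """Stand-in for ScrapeResult; the field names are assumptions."""
    url: str
    content: Optional[str] = None
    error: Optional[str] = None

    @property
    def ok(self) -> bool:
        return self.error is None


class StubScraper:
    """Stand-in for NewsScraper that performs no network I/O."""

    def __init__(self, pages: Dict[str, str]):
        self._pages = pages  # maps URL -> canned article text

    async def async_scrape(self, url: str) -> ScrapeResultSketch:
        # Mirror the documented contract: a failure comes back as a result
        # object carrying an error message rather than as an exception.
        await asyncio.sleep(0)  # yield to the event loop, as real I/O would
        if url not in self._pages:
            return ScrapeResultSketch(url=url, error="HTTP 404")
        return ScrapeResultSketch(url=url, content=self._pages[url])


async def scrape_all(scraper, urls: List[str]) -> List[ScrapeResultSketch]:
    # gather runs the scrapes concurrently and preserves the input order.
    return await asyncio.gather(*(scraper.async_scrape(u) for u in urls))


def main() -> List[ScrapeResultSketch]:
    scraper = StubScraper({"https://example.com/a": "Article A"})
    urls = ["https://example.com/a", "https://example.com/missing"]
    results = asyncio.run(scrape_all(scraper, urls))
    for r in results:
        print(r.url, "->", r.content if r.ok else f"failed: {r.error}")
    return results


if __name__ == "__main__":
    main()
```

Since failures are reported in the returned object, the caller can check each result in turn without wrapping every call in try/except.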