news_scraper

NewsScraper

NewsScraper(headers=None)

Bases: IAsyncScraper

A scraper that uses aiohttp and trafilatura to extract article content.

Retrieves the HTML for a given URL and extracts the main article text with the trafilatura library. HTTP errors surface in one of two ways: `get_site_html` raises an exception, while `async_scrape` returns a failed ScrapeResult.

Parameters:

Name	Type	Description	Default
`headers`	`dict`	HTTP headers to send with each request. If `None`, a default User-Agent header is used.	`None`
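
The `None` default is the usual Python sentinel for "use the built-in headers". The following is a minimal sketch of how such a default might be resolved. `DEFAULT_HEADERS` and `resolve_headers` are hypothetical names, not part of the library, and the User-Agent string is a placeholder because the real value is not documented here.

```python
from typing import Optional

# Hypothetical default; the real User-Agent value used by NewsScraper
# is not shown in these docs.
DEFAULT_HEADERS = {"User-Agent": "example-news-scraper/0.1"}


def resolve_headers(headers: Optional[dict] = None) -> dict:
    """Return a fresh header dict, falling back to the default for None."""
    # Copy so that callers mutating the result never touch the shared default
    # or the dict they passed in.
    return dict(DEFAULT_HEADERS if headers is None else headers)


print(resolve_headers())
print(resolve_headers({"User-Agent": "my-bot/1.0", "Accept-Language": "en"}))
```

Using `None` rather than a dict literal as the default avoids sharing one mutable dict across every instance.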

get_site_html `async`

get_site_html(url: str) -> str

Retrieve HTML content from the specified URL using aiohttp.

Parameters:

Name	Type	Description	Default
`url`	`str`	The URL of the webpage to scrape.	required

Returns:

Name	Type	Description
`str`	`str`	The HTML content of the webpage.

Raises:

Type	Description
`ClientResponseError`	If the HTTP response status is not 200.

async_scrape `async`

async_scrape(url: str) -> ScrapeResult

Asynchronously scrape the specified URL and extract its content.

Parameters:

Name	Type	Description	Default
`url`	`str`	The URL to scrape and extract content from.	required

Returns:

Name	Type	Description
`ScrapeResult`	`ScrapeResult`	A ScrapeResult object containing the extracted text if successful, or an error message if the scraping fails.
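
The two error paths described on this page (an exception from `get_site_html`, a failed result from `async_scrape`) can be illustrated with a network-free sketch. Everything below is a stand-in written under assumptions: `SketchScraper` is not `NewsScraper`, `FetchError` substitutes for `ClientResponseError`, the `ScrapeResult` fields (`url`, `success`, `text`, `error`) are hypothetical, and the trafilatura extraction step is skipped.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScrapeResult:
    # Hypothetical fields; the real ScrapeResult may be shaped differently.
    url: str
    success: bool
    text: Optional[str] = None
    error: Optional[str] = None


class FetchError(Exception):
    """Stand-in for aiohttp's ClientResponseError in this sketch."""


class SketchScraper:
    """Stand-in for NewsScraper that serves canned pages, not real HTTP."""

    def __init__(self, pages: dict):
        self._pages = pages

    async def get_site_html(self, url: str) -> str:
        # Raises on failure, mirroring the documented get_site_html contract.
        if url not in self._pages:
            raise FetchError(f"404 for {url}")
        return self._pages[url]

    async def async_scrape(self, url: str) -> ScrapeResult:
        # Converts the exception into a failed result instead of propagating.
        try:
            html = await self.get_site_html(url)
        except FetchError as exc:
            return ScrapeResult(url=url, success=False, error=str(exc))
        return ScrapeResult(url=url, success=True, text=html)


async def scrape_all(scraper: SketchScraper, urls: list) -> list:
    # async_scrape does not raise in this sketch, so a plain gather suffices.
    return await asyncio.gather(*(scraper.async_scrape(u) for u in urls))


scraper = SketchScraper({"https://example.com/a": "<p>Article A</p>"})
results = asyncio.run(
    scrape_all(scraper, ["https://example.com/a", "https://example.com/missing"])
)
for r in results:
    print(r.url, "ok" if r.success else f"failed: {r.error}")
```

Returning a result object rather than raising lets a batch of URLs be scraped concurrently without one failure cancelling the rest. Whether the real `async_scrape` catches every exception type is not stated on this page, so callers may still want a `try`/`except` around it.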
