news_scraper
NewsScraper
Bases: IAsyncScraper
A scraper that uses aiohttp and trafilatura to extract article content.
Retrieves HTML content from a given URL and extracts the main article text using the trafilatura library. HTTP errors are handled in one of two ways: get_site_html raises an exception, while async_scrape returns a failed ScrapeResult.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| headers | dict | A dictionary of HTTP headers to use for requests. Defaults to a User-Agent header. | None |
get_site_html
async
Retrieve HTML content from the specified URL using aiohttp.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| url | str | The URL of the webpage to scrape. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | The HTML content of the webpage. |
Raises:
| Type | Description |
|---|---|
| ClientResponseError | If the HTTP response status is not 200. |
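
As an illustration of the behaviour described above, here is a minimal sketch of a fetch helper built on aiohttp. It is not the library's actual implementation: the standalone function form, the `DEFAULT_HEADERS` value, and the use of `raise_for_status()` are all assumptions. Note that `raise_for_status()` raises `aiohttp.ClientResponseError` for status codes of 400 and above, which is close to, but not identical to, the "not 200" rule documented here.

```python
import asyncio
from typing import Optional

import aiohttp

# Assumed value: the source only says the default is "a User-Agent header".
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-scraper/0.1)"}


async def get_site_html(url: str, headers: Optional[dict] = None) -> str:
    """Fetch a page and return its HTML, raising on HTTP error statuses."""
    async with aiohttp.ClientSession(headers=headers or DEFAULT_HEADERS) as session:
        async with session.get(url) as response:
            # Raises aiohttp.ClientResponseError for 4xx and 5xx responses.
            response.raise_for_status()
            return await response.text()
```

A caller would typically run it with `asyncio.run(get_site_html("https://example.com"))` and catch `aiohttp.ClientResponseError` to inspect the failing status code.
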
async_scrape
async
Asynchronously scrape the specified URL and extract its content.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| url | str | The URL to scrape and extract content from. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| ScrapeResult | ScrapeResult | A ScrapeResult object containing the extracted text if successful, or an error message if the scraping fails. |
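
The "result instead of exception" contract can be sketched as follows. This is a hedged illustration, not the actual NewsScraper code: the `ScrapeResult` fields (`url`, `success`, `text`, `error`) are hypothetical, and the fetch and extract steps are passed in as the parameters `fetch_html` and `extract_text` so the sketch runs without network access. In the real class these roles would presumably be filled by get_site_html and `trafilatura.extract`, which returns `None` when no main text is found.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


@dataclass
class ScrapeResult:
    """Hypothetical stand-in; the real ScrapeResult fields are not shown here."""
    url: str
    success: bool
    text: Optional[str] = None
    error: Optional[str] = None


async def async_scrape(
    url: str,
    fetch_html: Callable[[str], Awaitable[str]],
    extract_text: Callable[[str], Optional[str]],
) -> ScrapeResult:
    """Fetch and extract, converting any failure into a failed ScrapeResult."""
    try:
        html = await fetch_html(url)
    except Exception as exc:  # e.g. aiohttp.ClientResponseError in a real scraper
        return ScrapeResult(url=url, success=False, error=f"{type(exc).__name__}: {exc}")

    text = extract_text(html)  # an extractor may return None or an empty string
    if not text:
        return ScrapeResult(url=url, success=False, error="no article text extracted")
    return ScrapeResult(url=url, success=True, text=text)
```

With this shape, callers check `result.success` rather than wrapping every call in `try`/`except`, which is convenient when many URLs are scraped concurrently with `asyncio.gather`.
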