workflow
Workflow
Workflow(scraper: IScraper, analyzer: IAnalyzer, link_collector: ILinkCollector = None, logger: Logger = None)
Orchestrates scraping and analysis processes.
Attributes:
| Name | Type | Description |
|---|---|---|
| scraper | IScraper | The scraper instance. |
| analyzer | IAnalyzer | The analyzer instance. |
| store | Dict[str, StoreRecord] | Storage for scrape and analysis results keyed by link. |
scrape__generator
Scrape content from the provided links.
Removes duplicate links and filters out those already scraped (unless 'overwrite' is True).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| links | Sequence[str] | List of URLs to scrape. | required |
| overwrite | bool | If True, re-scrape links already present in the store. | required |
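The de-duplication and filtering step described above could look roughly like the following. This is a minimal sketch, not the library's implementation: `pending_links` is a hypothetical helper, and the store is reduced to a plain set of already-scraped links.

```python
from typing import Iterable, List, Set


def pending_links(links: Iterable[str], scraped: Set[str], overwrite: bool = False) -> List[str]:
    """Return unique links in first-seen order, skipping scraped ones unless overwrite is set."""
    unique = list(dict.fromkeys(links))  # dict keeps insertion order and drops repeats
    if overwrite:
        return unique
    return [link for link in unique if link not in scraped]


links = ["https://a.example", "https://b.example", "https://a.example"]
already_scraped = {"https://b.example"}

todo = pending_links(links, already_scraped)
redo = pending_links(links, already_scraped, overwrite=True)
```

Here `todo` holds only the link that has not been scraped yet, while `redo` holds both unique links because `overwrite=True` bypasses the filter.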
scrape
Scrape content from the provided links and return a list of ScrapeResult objects.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| links | Iterable[str] | Collection of URLs to scrape. If None, targets links already collected in the store. | None |
| overwrite | bool | If True, re-scrape links that have already been successfully scraped. | False |
Returns:
| Type | Description |
|---|---|
| Sequence[ScrapeResult] | List of ScrapeResult objects for each link. |
update_scrapes
Update scrapes using different input data types.
This generic dispatcher supports updating scrape results using:
- A list of ScrapeResults
- A dictionary mapping links (str) to ScrapeResults
- A pandas DataFrame containing ScrapeResult fields in columns
Raises:
| Type | Description |
|---|---|
| NotImplementedError | If the data type is not supported. |
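One common way to build a type-based dispatcher like this in Python is `functools.singledispatch`. The sketch below is an illustration under assumptions, not the library's code: `update_store` and `FakeScrapeResult` are hypothetical names, the `link` attribute is assumed, and the pandas DataFrame branch is left out to keep the example dependency-free.

```python
from dataclasses import dataclass
from functools import singledispatch
from typing import Dict


@dataclass
class FakeScrapeResult:  # hypothetical stand-in for ScrapeResult
    link: str
    content: str


@singledispatch
def update_store(data, store: Dict[str, FakeScrapeResult]) -> None:
    # Fallback for any type without a registered handler.
    raise NotImplementedError(f"Unsupported data type: {type(data).__name__}")


@update_store.register
def _(data: list, store: Dict[str, FakeScrapeResult]) -> None:
    # A list of results: key each one by its own link (assumed attribute).
    for result in data:
        store[result.link] = result


@update_store.register
def _(data: dict, store: Dict[str, FakeScrapeResult]) -> None:
    # A mapping of link -> result: merge it directly.
    store.update(data)


store: Dict[str, FakeScrapeResult] = {}
update_store([FakeScrapeResult("https://a.example", "A")], store)
update_store({"https://b.example": FakeScrapeResult("https://b.example", "B")}, store)
```

Calling `update_store(42, store)` falls through to the base function and raises `NotImplementedError`, which matches the behaviour documented in the table above.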
update_analyses
Update analyses using different input data types.
This generic dispatcher supports updating analysis results using:
- A dictionary mapping links (str) to AnalysisResult objects
- A pandas DataFrame containing a 'links' column and AnalysisResult information
Raises:
| Type | Description |
|---|---|
| NotImplementedError | If the data type is not supported. |
analyze__generator
Analyze content for links that have been successfully scraped but not yet analyzed.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| overwrite | bool | (Currently unused) Reserved for future re-analysis capabilities. | required |
analyze
Analyze content for links that have been successfully scraped but not yet analyzed.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| overwrite | bool | (Currently unused) Reserved for future re-analysis capabilities. | False |
Returns:
| Type | Description |
|---|---|
| Sequence[AnalysisResult] | List of AnalysisResult objects for each link. |
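The selection rule "successfully scraped but not yet analyzed" can be pictured with a small filter over the store. This is only a sketch: `Record` is a hypothetical stand-in for StoreRecord, and its `scrape_ok` and `analysis` fields are assumed names rather than the library's actual schema.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Optional


@dataclass
class Record:  # hypothetical stand-in for StoreRecord
    link: str
    scrape_ok: bool = False
    analysis: Optional[dict] = None


def links_to_analyze(store: Dict[str, Record]) -> Iterator[str]:
    """Yield links whose scrape succeeded and that have no analysis yet."""
    for link, record in store.items():
        if record.scrape_ok and record.analysis is None:
            yield link


store = {
    "a": Record("a", scrape_ok=False),                          # scrape failed: skipped
    "b": Record("b", scrape_ok=True),                           # ready for analysis
    "c": Record("c", scrape_ok=True, analysis={"topic": "x"}),  # already analyzed: skipped
}
selected = list(links_to_analyze(store))
```

Only `"b"` is selected, since it is the one record that has been scraped successfully and carries no analysis.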
update_records
Update records in the store.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| records | Sequence[StoreRecord] | List of StoreRecord objects to replace the current store. | required |
load_store
Load the store from a DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | DataFrame containing scrape and analysis results. | required |
| flush | bool | If True, clear the current store before loading. | True |
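The effect of `flush` can be illustrated with plain dictionaries standing in for the DataFrame rows. This is a sketch under assumptions: `load_rows` is a hypothetical helper, and `"link"` is an assumed column name used to key the store.

```python
from typing import Any, Dict, Iterable


def load_rows(store: Dict[str, Dict[str, Any]], rows: Iterable[Dict[str, Any]], flush: bool = True) -> None:
    """Load rows into the store, optionally clearing it first."""
    if flush:
        store.clear()  # discard everything currently held in the store
    for row in rows:
        store[row["link"]] = dict(row)  # "link" is an assumed column name


store = {"https://old.example": {"link": "https://old.example"}}
new_rows = [{"link": "https://new.example", "content": "hello"}]

load_rows(store, new_rows, flush=False)
kept = sorted(store)      # old and new entries coexist

load_rows(store, new_rows, flush=True)
flushed = sorted(store)   # only the freshly loaded entry remains
```

With a real DataFrame, rows of this shape can be obtained via `df.to_dict("records")`.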
export
Export stored records as a DataFrame with unnested analysis outputs.
If verbose is True, include extra metadata such as success flags and error messages.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| verbose | bool | If True, include additional columns for detailed status. | False |
Returns:
| Type | Description |
|---|---|
| DataFrame | A DataFrame containing exported records. |
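"Unnested analysis outputs" presumably means that each key of a record's analysis result becomes its own column. The sketch below shows that idea with plain dictionaries; `export_rows` and the field names `analysis`, `scrape_success`, and `scrape_error` are hypothetical and chosen only for illustration.

```python
from typing import Any, Dict, List


def export_rows(store: Dict[str, Dict[str, Any]], verbose: bool = False) -> List[Dict[str, Any]]:
    """Flatten each stored record into one row, one column per analysis key."""
    rows = []
    for link, record in store.items():
        row = {"link": link}
        # Unnest: promote every key of the analysis output to a top-level column.
        row.update(record.get("analysis") or {})
        if verbose:
            # Extra status columns (assumed names) are added only on request.
            row["scrape_success"] = record.get("scrape_success")
            row["scrape_error"] = record.get("scrape_error")
        rows.append(row)
    return rows


store = {
    "https://a.example": {
        "analysis": {"title": "A", "score": 0.9},
        "scrape_success": True,
        "scrape_error": None,
    }
}
rows = export_rows(store)
verbose_rows = export_rows(store, verbose=True)
```

Passing `rows` to `pandas.DataFrame(rows)` would then yield a table with one row per link.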