workflow

Workflow

Workflow(scraper: IScraper, analyzer: IAnalyzer, link_collector: ILinkCollector = None, logger: Logger = None)

Orchestrates scraping and analysis processes.

Attributes:

Name      Type                    Description
scraper   IScraper                The scraper instance.
analyzer  IAnalyzer               The analyzer instance.
store     Dict[str, StoreRecord]  Storage for scrape and analysis results, keyed by link.

StoreRecord

StoreRecord(link: str)

Stores the scrape and analysis results for a specific link.
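
To make the store layout concrete, here is a minimal sketch of a link-keyed store. The `scrape` and `analysis` fields are assumptions made for illustration; the actual StoreRecord may name or structure them differently.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class StoreRecord:
    """Hypothetical stand-in: one record per link."""
    link: str
    scrape: Optional[Any] = None    # assumed slot for a ScrapeResult
    analysis: Optional[Any] = None  # assumed slot for an AnalysisResult


# The store maps each link to its record.
store: Dict[str, StoreRecord] = {}
for link in ["https://example.com/a", "https://example.com/b"]:
    store.setdefault(link, StoreRecord(link))

# Attach a scrape result to one link; the other record stays empty.
store["https://example.com/a"].scrape = "<html>...</html>"
```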

scrape__generator

scrape__generator(links: Sequence[str]) -> Generator[ScrapeResult, None, None]

Scrape content from the provided links.

Removes duplicate links and filters out those already scraped (unless 'overwrite' is True).

Parameters:

Name       Type           Description                                             Default
links      Sequence[str]  List of URLs to scrape.                                 required
overwrite  bool           If True, re-scrape links already present in the store.  required
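
The de-duplication and filtering step described above could look roughly like the sketch below. `select_links` and the `scraped` mapping are hypothetical names, not part of the documented API; the sketch only illustrates the selection rule.

```python
from typing import Dict, Iterable, List


def select_links(
    links: Iterable[str],
    scraped: Dict[str, bool],
    overwrite: bool = False,
) -> List[str]:
    """Return the links that still need scraping (hypothetical helper)."""
    # dict.fromkeys drops duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(links))
    if overwrite:
        return unique
    # Skip links that are already marked as scraped.
    return [link for link in unique if not scraped.get(link, False)]


already_scraped = {"b": True}
fresh_only = select_links(["a", "b", "a", "c"], already_scraped)
everything = select_links(["a", "b", "a", "c"], already_scraped, overwrite=True)
```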

collect_links

collect_links()

Collect links using the link collector and save them in the store.

get_links

get_links() -> pd.DataFrame

Return a DataFrame copy of the stored links.

scrape

scrape(links: Iterable[str] = None, overwrite: bool = False) -> Sequence[ScrapeResult]

Scrape content from the provided links and return a list of ScrapeResult objects.

Parameters:

Name       Type           Description                                                                           Default
links      Iterable[str]  Collection of URLs to scrape. If None, targets links already collected in the store.  None
overwrite  bool           If True, re-scrape links that have already been successfully scraped.                 False

Returns:

Type                    Description
Sequence[ScrapeResult]  List of ScrapeResult objects, one per link.

get_scrapes

get_scrapes() -> pd.DataFrame

Return a DataFrame copy of the stored scrape results.

update_scrapes

update_scrapes(data)

Update scrapes using different input data types.

This generic dispatcher supports updating scrape results from any of the following:
  • A list of ScrapeResult objects
  • A dictionary mapping links (str) to ScrapeResult objects
  • A pandas DataFrame whose columns contain ScrapeResult fields

Raises:

Type                 Description
NotImplementedError  If the data type is not supported.
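
One common way to build such a type-based dispatcher is `functools.singledispatch`; whether Workflow uses it internally is an assumption. The sketch below uses plain dicts with a `"link"` key as hypothetical stand-ins for ScrapeResult objects and omits the DataFrame case to stay dependency-free.

```python
from functools import singledispatch
from typing import Any, Dict

# Hypothetical link-keyed store of scrape results.
scrape_store: Dict[str, Any] = {}


@singledispatch
def update_scrapes(data) -> None:
    # Fallback for any type without a registered handler.
    raise NotImplementedError(f"Unsupported data type: {type(data).__name__}")


@update_scrapes.register
def _(data: list) -> None:
    # Each result carries its own link (assumed field name).
    for result in data:
        scrape_store[result["link"]] = result


@update_scrapes.register
def _(data: dict) -> None:
    # Keys are links, values are results.
    for link, result in data.items():
        scrape_store[link] = result
```

Calling `update_scrapes` with a list or a dict routes to the matching handler; any other type falls through to the base function and raises NotImplementedError.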

update_analyses

update_analyses(data, output_cols: Sequence[str] = None)

Update analyses using different input data types.

This generic dispatcher supports updating analysis results from either of the following:
  • A dictionary mapping links (str) to AnalysisResult objects
  • A pandas DataFrame containing a 'links' column and AnalysisResult information
Raises:

Type                 Description
NotImplementedError  If the data type is not supported.

clear_scrapes

clear_scrapes()

Clear all ScrapeResults from the store.

clear_analyses

clear_analyses()

Clear all AnalysisResults from the store.

clear_store

clear_store()

Erase all records of scraped and analyzed content.

analyze__generator

analyze__generator(links: Sequence[str]) -> Generator[AnalysisResult, None, None]

Analyze content for links that have been successfully scraped but not yet analyzed.

Parameters:

Name       Type  Description                                                       Default
overwrite  bool  (Currently unused) Reserved for future re-analysis capabilities.  required

analyze

analyze(overwrite: bool = False) -> Sequence[AnalysisResult]

Analyze content for links that have been successfully scraped but not yet analyzed.

Parameters:

Name       Type  Description                                                       Default
overwrite  bool  (Currently unused) Reserved for future re-analysis capabilities.  False

Returns:

Type                      Description
Sequence[AnalysisResult]  List of AnalysisResult objects, one per link.
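
The selection rule "successfully scraped but not yet analyzed" can be sketched as a simple filter. The record layout (`scrape_success`, `analysis`) and the helper name `pending_analysis` are assumptions made for illustration, not the library's actual fields.

```python
from typing import Dict, List


def pending_analysis(store: Dict[str, dict]) -> List[str]:
    """Return links whose scrape succeeded and that have no analysis yet."""
    return [
        link
        for link, record in store.items()
        if record["scrape_success"] and record["analysis"] is None
    ]


store = {
    "a": {"scrape_success": True, "analysis": None},              # ready to analyze
    "b": {"scrape_success": False, "analysis": None},             # scrape failed
    "c": {"scrape_success": True, "analysis": {"topic": "news"}}, # already analyzed
}
todo = pending_analysis(store)
```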

get_analyses

get_analyses() -> pd.DataFrame

Return a DataFrame copy of the stored analysis results.

update_records

update_records(records: Sequence[StoreRecord])

Update records in the store.

Parameters:

Name     Type                   Description                                                Default
records  Sequence[StoreRecord]  List of StoreRecord objects to replace the current store.  required

dump_store

dump_store() -> pd.DataFrame

Dump the store to a DataFrame that can be loaded later.

load_store

load_store(df: DataFrame, flush: bool = True)

Load the store from a DataFrame.

Parameters:

Name   Type       Description                                        Default
df     DataFrame  DataFrame containing scrape and analysis results.  required
flush  bool       If True, clear the current store before loading.   True
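
The effect of `flush` on a dump-and-reload round trip can be sketched with plain rows in place of a DataFrame. `dump_rows` and `load_rows` are hypothetical stand-ins, and the merge behavior shown for `flush=False` (existing records kept, matching links overwritten) is an assumption.

```python
from typing import Dict, List


def dump_rows(store: Dict[str, dict]) -> List[dict]:
    """Return one copied row per stored link."""
    return [dict(record) for record in store.values()]


def load_rows(store: Dict[str, dict], rows: List[dict], flush: bool = True) -> None:
    """Load rows into the store, optionally clearing it first."""
    if flush:
        store.clear()
    for row in rows:
        store[row["link"]] = dict(row)


store = {"a": {"link": "a", "content": "old"}}
saved = dump_rows(store)

# flush=False keeps existing records and adds or overwrites by link.
load_rows(store, [{"link": "b", "content": "new"}], flush=False)
merged_links = sorted(store)

# flush=True discards whatever was in the store before loading.
load_rows(store, saved, flush=True)
restored_links = sorted(store)
```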

export

export(verbose=False) -> pd.DataFrame

Export stored records as a DataFrame with unnested analysis outputs.

If verbose is True, include extra metadata such as success flags and error messages.

Parameters:

Name     Type  Description                                               Default
verbose  bool  If True, include additional columns for detailed status.  False

Returns:

Type       Description
DataFrame  A DataFrame containing the exported records.
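
"Unnested analysis outputs" presumably means that each output field becomes its own column. A minimal sketch of that idea follows, using a list of dicts in place of a DataFrame; the record fields (`outputs`, `scrape_success`, `error`) and the helper name `export_rows` are assumptions.

```python
from typing import Dict, List


def export_rows(store: Dict[str, dict], verbose: bool = False) -> List[dict]:
    """Flatten each record into one row, lifting analysis outputs to top-level keys."""
    rows = []
    for link, record in store.items():
        row = {"link": link}
        # Unnest: every analysis output field becomes its own column.
        row.update(record.get("outputs") or {})
        if verbose:
            # Extra status metadata is included only on request.
            row["scrape_success"] = record.get("scrape_success")
            row["error"] = record.get("error")
        rows.append(row)
    return rows


store = {
    "a": {"scrape_success": True, "error": None,
          "outputs": {"topic": "news", "score": 0.9}},
    "b": {"scrape_success": False, "error": "timeout", "outputs": None},
}
compact = export_rows(store)
detailed = export_rows(store, verbose=True)
```

A list of dicts like `compact` can be passed directly to `pd.DataFrame(...)` to obtain a tabular view.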