Creating a Basic Workflow
For background, Scraipe uses interfaces to define scraping and analysis logic:
- IScraper: Fetches and extracts content from a link.
- IAnalyzer: Extracts structured information from content (e.g., with an LLM).
The Workflow class orchestrates scrapers and analyzers in one persistent process.
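To make the division of labor concrete, here is a minimal, library-free sketch of the pattern. It is not Scraipe's actual API: `Scraper`, `Analyzer`, `MiniWorkflow`, `FakeScraper`, and `LengthAnalyzer` are hypothetical stand-ins, and the method names are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Protocol


class Scraper(Protocol):  # hypothetical stand-in for IScraper
    def scrape(self, link: str) -> str: ...


class Analyzer(Protocol):  # hypothetical stand-in for IAnalyzer
    def analyze(self, content: str) -> dict: ...


@dataclass
class MiniWorkflow:
    """Toy orchestrator that keeps scraped content and results in memory."""
    scraper: Scraper
    analyzer: Analyzer
    scraped: dict[str, str] = field(default_factory=dict)
    analyzed: dict[str, dict] = field(default_factory=dict)

    def scrape(self, links: list[str]) -> None:
        # Fetch content for every link and remember it for later steps.
        for link in links:
            self.scraped[link] = self.scraper.scrape(link)

    def analyze(self) -> None:
        # Run the analyzer over everything scraped so far.
        for link, content in self.scraped.items():
            self.analyzed[link] = self.analyzer.analyze(content)

    def export(self) -> list[dict]:
        # One flat record per link, ready to load into a table.
        return [{"link": link, **result} for link, result in self.analyzed.items()]


class FakeScraper:
    """Returns canned text instead of making a network request."""
    def scrape(self, link: str) -> str:
        return f"Content fetched from {link}"


class LengthAnalyzer:
    """Trivial analyzer that reports the content length."""
    def analyze(self, content: str) -> dict:
        return {"length": len(content)}


workflow = MiniWorkflow(FakeScraper(), LengthAnalyzer())
workflow.scrape(["https://example.com"])
workflow.analyze()
rows = workflow.export()
print(rows)
```

Because the orchestrator depends only on the two small interfaces, either component can be swapped without touching the other, which is the property the real Workflow class relies on.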
Check out celebrities_example.ipynb for an advanced workflow using NewsScraper and OpenAiAnalyzer. Continue reading for a basic example.
Setup
Make sure the scraipe package is installed. Scraipe requires Python 3.10 or greater.
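If you are unsure whether your environment meets both requirements, a quick check along these lines may help. This is a sketch that uses only the standard library; the helper names `python_is_supported` and `package_is_installed` are made up for this example.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in this section


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if the interpreter meets the stated minimum version."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def package_is_installed(name: str = "scraipe") -> bool:
    """Return True if `name` can be found on the import path (without importing it)."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    print("Python version OK:", python_is_supported())
    print("scraipe installed:", package_is_installed())
```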
Basic Example
Our basic workflow will use two bundled components:
- TextScraper: gets the content of a website and strips out HTML tags.
- TextStatsAnalyzer: computes word and sentence statistics.
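For intuition about what these two components do, the sketch below strips tags and computes comparable statistics using only the standard library. It is an approximation under stated assumptions, not the bundled implementation: `strip_tags` and `text_stats` are hypothetical helpers, the sentence splitter is deliberately naive, and the real components may tokenize differently and produce different numbers.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores the tags around them."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        # Note: this also collects text inside <script>/<style> elements.
        self.parts.append(data)


def strip_tags(html: str) -> str:
    """Return the text content of `html` with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def text_stats(text: str) -> dict:
    """Compute simple word and sentence statistics for plain text."""
    words = text.split()
    # Naive split on ., ! and ?; abbreviations and decimals are not handled.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    average = sum(len(w) for w in words) / len(words) if words else 0.0
    return {
        "word_count": len(words),
        "character_count": len(text),
        "sentence_count": len(sentences),
        "average_word_length": average,
    }


html = "<html><body><h1>Hello world.</h1><p>This is a test!</p></body></html>"
text = strip_tags(html)
print(text)
print(text_stats(text))
```

The keys in the returned dictionary mirror the column names that the workflow output reports for each link.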
- Import dependencies.
- Configure the workflow.
- Run the workflow on links.
Running the workflow script prints a pandas DataFrame containing the text stats for each link.
$ python basic_workflow.py
Scraping: 100%|█████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 12.59link/s]
Analyzing: 100%|██████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1065.08item/s]
                                          link  word_count  character_count  sentence_count  average_word_length
0  https://rickandmortyapi.com/api/character/1         366             2719              58             5.669399
1             https://ckaestne.github.io/seai/        2426            15878              96             5.298434
2                          https://example.com          30              206               3             5.600000
Conclusion
We created a basic workflow that orchestrates web scraping and text stats extraction. You can plug other bundled components into your workflow or write custom components for your project's specific needs.