src.ingestion package
Submodules
src.ingestion.fetch_articles module
This module asynchronously fetches and stores the content of news articles. It scrapes the provided URLs and skips any article whose content is already present in the database.
- async src.ingestion.fetch_articles.fetch_article_content(article_ids, session)
Asynchronously fetches the content of a list of articles. For each article, the function first checks whether content already exists in the database; if it does not, the content is extracted from the article's URL.
- Parameters:
article_ids (List[str]) – List of IDs of the articles to fetch content for.
session (aiohttp.ClientSession) – The aiohttp session to use for the request.
- Returns:
List of dictionaries, each containing the ID and content of a fetched article.
- Return type:
List[Dict[str, str]]
- async src.ingestion.fetch_articles.test_fetch_article_content(article_ids: List[str]) → List[Dict[str, str]]
Tests the fetch_article_content function by fetching content for a list of article IDs.
- Parameters:
article_ids (List[str]) – A list of article IDs to fetch content for.
- Returns:
A list of dictionaries where each dictionary contains the ID and content of a fetched article.
- Return type:
List[Dict[str, str]]
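Since this helper takes only `article_ids`, it presumably opens a session itself before delegating to `fetch_article_content`; the documentation does not confirm this. The sketch below illustrates that assumed pattern with stand-ins only: `open_session` mimics an `aiohttp.ClientSession` context manager, and `fetch_article_content_stub` is a placeholder with the documented signature. Both names are hypothetical.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import Dict, List


@asynccontextmanager
async def open_session():
    """Hypothetical stand-in for an aiohttp.ClientSession context manager."""
    session = {"open": True}
    try:
        yield session
    finally:
        session["open"] = False  # mimic closing the session on exit


async def fetch_article_content_stub(
    article_ids: List[str], session
) -> List[Dict[str, str]]:
    """Placeholder with the documented signature; returns dummy content."""
    assert session["open"], "session must be open while fetching"
    return [{"id": i, "content": f"content for {i}"} for i in article_ids]


async def run_fetch_sketch(article_ids: List[str]) -> List[Dict[str, str]]:
    # Open one session, pass it to the fetcher, and close it afterwards.
    async with open_session() as session:
        return await fetch_article_content_stub(article_ids, session)


fetched = asyncio.run(run_fetch_sketch(["a1", "a2"]))
```

Sharing one session across all requests in a call is the usual way to reuse connections with aiohttp, which is why the session is opened once around the whole fetch.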
src.ingestion.newsapi module
- src.ingestion.newsapi.fetch_news(query, from_date: datetime, sort_by, limit, to_json)
Fetches news articles from NewsAPI for the given query, start date, and sort order.
- Parameters:
query (str) – The query to search for in the NewsAPI.
from_date (datetime.datetime) – The date from which to fetch the articles.
sort_by (str) – The field to sort the results by.
limit (int) – The number of articles to fetch.
to_json (bool) – Whether to store the results in a JSON file.
- Returns:
The IDs of the articles that were fetched and stored in MongoDB.
- Return type:
List[str]
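To show how the documented arguments might translate into a NewsAPI request, here is a minimal sketch. It assumes the function targets NewsAPI's `/v2/everything` endpoint and its `q`, `from`, `sortBy`, and `pageSize` query parameters; the module may instead use a client library. `build_news_params` is a hypothetical helper, and authentication, the MongoDB write, and the `to_json` option are omitted.

```python
from datetime import datetime
from typing import Dict
from urllib.parse import urlencode

# Assumed endpoint; the module may use a client library instead.
NEWSAPI_EVERYTHING_URL = "https://newsapi.org/v2/everything"


def build_news_params(
    query: str, from_date: datetime, sort_by: str, limit: int
) -> Dict[str, str]:
    """Map the documented arguments onto NewsAPI query parameters."""
    return {
        "q": query,
        "from": from_date.strftime("%Y-%m-%d"),
        "sortBy": sort_by,  # NewsAPI accepts relevancy, popularity, publishedAt
        "pageSize": str(limit),
    }


params = build_news_params("climate", datetime(2024, 1, 15), "publishedAt", 20)
request_url = f"{NEWSAPI_EVERYTHING_URL}?{urlencode(params)}"
```

A real request would also need an API key, and the returned article IDs would come from the database insert rather than from NewsAPI itself.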