src.ingestion package

Submodules

src.ingestion.fetch_articles module

This script asynchronously fetches and stores the content of news articles. It first checks whether an article's content is already present in the database and scrapes the article's URL only when it is not.

async src.ingestion.fetch_articles.fetch_article_content(article_ids, session)

Asynchronously fetches the content for a list of articles. For each article, the function checks whether the content already exists in the database and, if it does not, extracts the content from the article's URL.

Parameters:
  • article_ids (List[str]) – List of IDs of the articles to fetch content for.

  • session (aiohttp.ClientSession) – The aiohttp session to use for the request.

Returns:

List of dictionaries, each containing the ID and content of a fetched article.

Return type:

List[Dict[str, str]]
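
The check-then-scrape flow described above can be sketched as follows. This is a minimal illustration, not the module's actual implementation: the in-memory dictionaries STORED_CONTENT and ARTICLE_URLS, the helpers scrape, fetch_one and fetch_all, and the "id" and "content" keys are all hypothetical stand-ins for the real database, HTTP request, and return format.

```python
import asyncio
from typing import Dict, List

# Hypothetical stand-ins for the real database and the article URL lookup.
STORED_CONTENT: Dict[str, str] = {"a1": "cached text for a1"}
ARTICLE_URLS: Dict[str, str] = {
    "a1": "https://example.com/a1",
    "a2": "https://example.com/a2",
}


async def scrape(url: str) -> str:
    """Placeholder for a real HTTP request followed by text extraction."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"scraped text from {url}"


async def fetch_one(article_id: str) -> Dict[str, str]:
    """Return stored content if present; otherwise scrape and store it."""
    content = STORED_CONTENT.get(article_id)
    if content is None:
        content = await scrape(ARTICLE_URLS[article_id])
        STORED_CONTENT[article_id] = content
    return {"id": article_id, "content": content}


async def fetch_all(article_ids: List[str]) -> List[Dict[str, str]]:
    """Process all articles concurrently, preserving the input order."""
    return list(await asyncio.gather(*(fetch_one(a) for a in article_ids)))


results = asyncio.run(fetch_all(["a1", "a2"]))
```

Here "a1" is served from the stand-in store without scraping, while "a2" is scraped and then stored, so a second call for "a2" would no longer need the scrape step.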

async src.ingestion.fetch_articles.test_fetch_article_content(article_ids: List[str]) → List[Dict[str, str]]

Tests the fetch_article_content function by fetching content for a list of article IDs.

Parameters:

article_ids (List[str]) – A list of article IDs to fetch content for.

Returns:

A list of dictionaries where each dictionary contains the ID and content of a fetched article.

Return type:

List[Dict[str, str]]
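
Because this helper takes only article IDs while fetch_article_content also takes a session, it presumably creates the session itself. The sketch below shows that pattern under this assumption. FakeSession, fetch_stub and run_fetch are hypothetical names: FakeSession stands in for aiohttp.ClientSession so the example runs without network access, and fetch_stub stands in for fetch_article_content.

```python
import asyncio
from typing import Dict, List


class FakeSession:
    """Hypothetical stand-in for aiohttp.ClientSession (no network access)."""

    def __init__(self) -> None:
        self.closed = False

    async def __aenter__(self) -> "FakeSession":
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        self.closed = True  # a real session would release its connections here


async def fetch_stub(article_ids: List[str], session: FakeSession) -> List[Dict[str, str]]:
    """Stand-in for fetch_article_content; returns canned content."""
    return [{"id": a, "content": f"content of {a}"} for a in article_ids]


async def run_fetch(article_ids: List[str]) -> List[Dict[str, str]]:
    """Open a session, delegate to the fetcher, and close the session on exit."""
    async with FakeSession() as session:
        return await fetch_stub(article_ids, session)


articles = asyncio.run(run_fetch(["a1", "a2"]))
```

Keeping the session inside an async with block ensures it is closed even if the fetch raises.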

src.ingestion.newsapi module

src.ingestion.newsapi.fetch_news(query, from_date: datetime, sort_by, limit, to_json)

Fetches news articles from NewsAPI that match the given query, published on or after from_date and sorted by sort_by.

Parameters:
  • query (str) – The query to search for in the NewsAPI.

  • from_date (datetime.datetime) – The date from which to fetch the articles.

  • sort_by (str) – The field to sort the results by.

  • limit (int) – The number of articles to fetch.

  • to_json (bool) – Whether to store the results in a JSON file.

Returns:

The IDs of the articles that were fetched and stored in MongoDB.

Return type:

List[str]
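
As an illustration of how these arguments could map onto a NewsAPI request, the sketch below builds the query string for the public /v2/everything endpoint. Whether fetch_news uses this endpoint or builds its request this way is an assumption; build_params is a hypothetical helper, and the apiKey parameter that a real request requires is deliberately left out.

```python
from datetime import datetime
from typing import Dict
from urllib.parse import urlencode

# Public NewsAPI search endpoint; its use by fetch_news is an assumption.
NEWSAPI_ENDPOINT = "https://newsapi.org/v2/everything"


def build_params(query: str, from_date: datetime, sort_by: str, limit: int) -> Dict[str, str]:
    """Translate fetch_news-style arguments into NewsAPI query parameters."""
    return {
        "q": query,
        "from": from_date.strftime("%Y-%m-%d"),
        "sortBy": sort_by,       # e.g. "relevancy", "popularity", "publishedAt"
        "pageSize": str(limit),  # number of articles per page
    }


params = build_params("climate", datetime(2023, 5, 1), "publishedAt", 20)
url = f"{NEWSAPI_ENDPOINT}?{urlencode(params)}"
```

The to_json flag and the MongoDB storage step are not covered here, since the documentation above does not describe how they work.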

src.ingestion.prawapi module

Module contents