News AI documentation

src.ingestion package

Submodules

src.ingestion.fetch_articles module

This script asynchronously fetches and stores the content of news articles. It checks whether the content is already present in the database and scrapes the provided URLs when it is not.

async src.ingestion.fetch_articles.fetch_article_content(article_ids, session)

Asynchronously fetches the content of a list of articles. For each article, the function checks whether content already exists in the database and, if not, extracts the content from the given URLs.

Parameters:
  • article_ids (List[str]) – List of IDs of the articles to fetch content for.

  • session (aiohttp.ClientSession) – The aiohttp session to use for the request.

Returns:

List of dictionaries, each containing the ID and content of a fetched article.

Return type:

List[Dict[str, str]]
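
The check-then-fetch flow described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the in-memory `FAKE_DB` dictionary stands in for MongoDB, `scrape` is a hypothetical placeholder for the real aiohttp request, and the `session` argument is left out for that reason.

```python
import asyncio
from typing import Dict, List

# Hypothetical in-memory stand-in for the MongoDB collection.
FAKE_DB: Dict[str, Dict[str, str]] = {
    "a1": {"url": "https://example.com/1", "content": "cached text"},
    "a2": {"url": "https://example.com/2"},
}


async def scrape(url: str) -> str:
    """Placeholder for the real HTTP request and HTML extraction."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return f"scraped from {url}"


async def fetch_one(article_id: str) -> Dict[str, str]:
    doc = FAKE_DB[article_id]
    if "content" not in doc:  # scrape only when nothing is stored yet
        doc["content"] = await scrape(doc["url"])
    return {"id": article_id, "content": doc["content"]}


async def fetch_article_content_sketch(article_ids: List[str]) -> List[Dict[str, str]]:
    # gather() runs the per-article coroutines concurrently and keeps input order.
    return await asyncio.gather(*(fetch_one(a) for a in article_ids))


results = asyncio.run(fetch_article_content_sketch(["a1", "a2"]))
```

Here `a1` is served from the stored copy, while `a2` is scraped and written back so that a later call would find it already present.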

async src.ingestion.fetch_articles.test_fetch_article_content(article_ids: List[str]) → List[Dict[str, str]]

Tests the fetch_article_content function by fetching content for a list of article IDs.

Parameters:

article_ids (List[str]) – A list of article IDs to fetch content for.

Returns:

A list of dictionaries where each dictionary contains the ID and content of a fetched article.

Return type:

List[Dict[str, str]]

src.ingestion.newsapi module

src.ingestion.newsapi.fetch_news(query, from_date: datetime, sort_by, limit, to_json)

Fetches news articles from NewsAPI for the given query, start date, and sort order.

Parameters:
  • query (str) – The query to search for in the NewsAPI.

  • from_date (datetime.datetime) – The date from which to fetch the articles.

  • sort_by (str) – The field to sort the results by.

  • limit (int) – The number of articles to fetch.

  • to_json (bool) – Whether to store the results in a JSON file.

Returns:

The IDs of the articles that were fetched and stored in MongoDB.

Return type:

List[str]
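
As a rough illustration of how these parameters could map onto a request, the sketch below builds a URL for NewsAPI's Everything endpoint, which accepts `q`, `from`, `sortBy`, `pageSize`, and `apiKey`. That `fetch_news` targets this endpoint is an assumption, and `build_newsapi_url` and its `api_key` argument are hypothetical names, not part of the module.

```python
from datetime import datetime
from urllib.parse import urlencode

NEWSAPI_URL = "https://newsapi.org/v2/everything"


def build_newsapi_url(query: str, from_date: datetime, sort_by: str,
                      limit: int, api_key: str) -> str:
    """Hypothetical helper: turn fetch_news-style arguments into a request URL."""
    params = {
        "q": query,
        "from": from_date.strftime("%Y-%m-%d"),
        "sortBy": sort_by,   # e.g. relevancy, popularity, publishedAt
        "pageSize": limit,   # number of articles per page
        "apiKey": api_key,
    }
    return f"{NEWSAPI_URL}?{urlencode(params)}"


url = build_newsapi_url("climate change", datetime(2024, 10, 1),
                        "publishedAt", 10, "YOUR_API_KEY")
```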

src.ingestion.prawapi module

Module contents

src.preprocessing package

Submodules

src.preprocessing.keyword_extraction module

src.preprocessing.keyword_extraction.bert_keyword_extraction(texts: List[str], top_n: int = 10) → List[str]

Extracts keywords from a list of texts using KeyBERT.

Parameters:
  • texts (List[str]) – List of texts to extract keywords from.

  • top_n (int) – Number of top keywords to extract per text.

Returns:

List of unique extracted keywords.

Return type:

List[str]
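
The step from per-text results to a single list of unique keywords might look like the sketch below. KeyBERT returns keywords paired with scores; here hard-coded pairs stand in for model output so the example runs without the model, and `merge_unique` is a hypothetical helper rather than the module's own code.

```python
from typing import Dict, List, Tuple

# Hard-coded (keyword, score) pairs standing in for per-text model output.
per_text_keywords: List[List[Tuple[str, float]]] = [
    [("election", 0.71), ("senate", 0.64)],
    [("senate", 0.69), ("budget", 0.58)],
]


def merge_unique(results: List[List[Tuple[str, float]]]) -> List[str]:
    """Flatten per-text results into unique keywords, keeping first-seen order."""
    seen: Dict[str, None] = {}
    for pairs in results:
        for keyword, _score in pairs:
            seen.setdefault(keyword, None)  # dicts preserve insertion order
    return list(seen)


unique_keywords = merge_unique(per_text_keywords)
```

Using a dictionary instead of a set keeps the output order stable across runs, which tends to make downstream results easier to compare.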

src.preprocessing.keyword_extraction.extract_keywords(article_ids, top_n: int = 10)

Extracts keywords for a list of articles using KeyBERT.

Parameters:
  • article_ids (List[str]) – List of IDs of the articles to extract keywords from.

  • top_n (int) – Number of top keywords to extract per text.

Returns:

Documented as List[List[str]] (a list of keyword lists, one per text), although the actual return value is reported to differ from this.

src.preprocessing.keyword_extraction.preprocess_text(text)

Preprocesses a given text by tokenizing it and removing stopwords.

Parameters:

text (str) – The text to preprocess.

Returns:

A list of words without stopwords.

Return type:

List[str]
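
A minimal sketch of tokenizing and stopword removal, using only the standard library. The regular-expression tokenizer and the small `STOPWORDS` set are stand-ins chosen for illustration; the tokenizer and stopword list the module actually uses are not stated here.

```python
import re
from typing import List

# Tiny illustrative stopword set; a real pipeline would use a fuller list.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "on"}


def preprocess_text_sketch(text: str) -> List[str]:
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [token for token in tokens if token not in STOPWORDS]


words = preprocess_text_sketch("The markets are rallying on news of a rate cut.")
```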

src.preprocessing.summarization module

Module contents

src.sentiment_analysis package

Submodules

src.sentiment_analysis.classify module

src.sentiment_analysis.classify.classify_sentiments(texts: List[str]) → Dict[str, List[Tuple[str, float]]]

Classify the sentiment of multiple texts.

Parameters:

texts (List[str]) – List of texts to classify.

Returns:

Dictionary with three keys: ‘positive’, ‘negative’, ‘neutral’.

Each key maps to a list of tuples, where the first element of the tuple is the text and the second element is the sentiment score.

Return type:

Dict[str, List[Tuple[str, float]]]
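
The return shape described above can be illustrated with a short sketch. The `scored` triples are made-up classifier output and `group_by_sentiment` is a hypothetical helper; only the resulting dictionary layout is taken from the description.

```python
from typing import Dict, Iterable, List, Tuple

# Made-up (text, label, score) triples standing in for classifier output.
scored = [
    ("Stocks surge after earnings", "positive", 0.93),
    ("Flooding displaces thousands", "negative", 0.88),
    ("Council meets on Tuesday", "neutral", 0.75),
]


def group_by_sentiment(
    rows: Iterable[Tuple[str, str, float]]
) -> Dict[str, List[Tuple[str, float]]]:
    """Group (text, score) pairs under their sentiment label."""
    grouped: Dict[str, List[Tuple[str, float]]] = {
        "positive": [],
        "negative": [],
        "neutral": [],
    }
    for text, label, score in rows:
        grouped[label].append((text, score))
    return grouped


grouped = group_by_sentiment(scored)
```

All three keys are created up front, so callers can index any label without checking whether texts with that sentiment were found.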

src.sentiment_analysis.sentiment_model module

src.sentiment_analysis.sentiment_model.analyze_sentiments(article_ids: List[str]) → List[Dict[str, float]]

Analyze the sentiment of a list of article IDs.

Parameters:

article_ids (List[str]) – List of article IDs to analyze.

Returns:

List of sentiment analysis results, one per article.

Return type:

List[Dict[str, float]]

src.sentiment_analysis.wordcloud module

src.sentiment_analysis.wordcloud.generate_wordcloud(keywords: List[str], sentiment_label: str) → WordCloud

Generates a word cloud for the given list of keywords and sentiment label.

Parameters:
  • keywords (List[str]) – List of keywords to include in the word cloud.

  • sentiment_label (str) – Sentiment label to generate the word cloud for.

Returns:

The generated word cloud.

Return type:

WordCloud

Module contents

src package

Subpackages

Submodules

src.pipeline module

Module contents

src

src.dashboard package

Submodules

src.dashboard.app module

Module contents

src.utils package

Submodules

src.utils.dbconnector module

src.utils.dbconnector.append_to_document(collection_name, query, update_data)

Appends new data to an existing document in the MongoDB collection.

Parameters:
  • collection_name (str) – The name of the MongoDB collection.

  • query (dict) – The query to select the document to update.

  • update_data (dict) – The new data to be appended to the document.

Returns:

The number of documents updated.

Return type:

int
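
One plausible shape for this update, sketched with PyMongo's `update_one` and the `modified_count` attribute of its result. Using the `$set` operator is an assumption, since the operator the module applies is not stated. The function takes a collection object rather than a name so that the `FakeCollection` stand-in can make the example runnable without a database.

```python
from types import SimpleNamespace


def append_to_document_sketch(collection, query: dict, update_data: dict) -> int:
    """Merge update_data into the first matching document; return the update count."""
    result = collection.update_one(query, {"$set": update_data})
    return result.modified_count


class FakeCollection:
    """Illustrative stand-in that mimics update_one for exact-match queries."""

    def __init__(self, docs):
        self.docs = docs

    def update_one(self, query, update):
        for doc in self.docs:
            if all(doc.get(key) == value for key, value in query.items()):
                doc.update(update["$set"])
                # Simplification: always reports 1, even if no value changed.
                return SimpleNamespace(modified_count=1)
        return SimpleNamespace(modified_count=0)


docs = [{"id": "a1", "title": "Hello"}]
updated = append_to_document_sketch(FakeCollection(docs), {"id": "a1"}, {"summary": "Short"})
```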

src.utils.dbconnector.content_manager(article_id, required_fields)

Checks if the specified fields are present in the database for the given article_id.

Parameters:
  • article_id (str) – The ID of the article to check.

  • required_fields (list) – A list of fields to check for presence (e.g., ["content", "summary", "keywords", "sentiment"]).

Returns:

A dictionary with the status of each field (True if present, False if not).

Return type:

dict
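
The presence check can be sketched on a plain dictionary. `field_status` is a hypothetical helper that works on an already fetched document, and treating empty values as missing is an assumption; the real function may only test whether the key exists.

```python
from typing import Dict, List


def field_status(document: dict, required_fields: List[str]) -> Dict[str, bool]:
    """Map each required field to True when it holds a non-empty value."""
    return {field: bool(document.get(field)) for field in required_fields}


doc = {"id": "a1", "content": "full text", "summary": "", "keywords": ["ai"]}
status = field_status(doc, ["content", "summary", "keywords", "sentiment"])
```

A caller could use such a result to decide which pipeline stages still need to run for an article.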

src.utils.dbconnector.fetch_and_combine_articles(collection_name, article_ids)

Fetches documents from the given MongoDB collection using the given IDs and combines them into a Pandas DataFrame.

Parameters:
  • collection_name (str) – The name of the MongoDB collection.

  • article_ids (List[str]) – List of IDs of the articles to fetch and combine.

Returns:

A Pandas DataFrame containing the combined documents.

Return type:

pd.DataFrame

Raises:

Exception – If there is an error fetching and combining the documents.

src.utils.dbconnector.find_documents(collection_name, query)

Finds documents in the given MongoDB collection using the given query.

Parameters:
  • collection_name (str) – The name of the MongoDB collection.

  • query (dict) – The query to select documents.

Returns:

A list of documents found by the query.

Return type:

list

Raises:

Exception – If there is an error finding documents.

src.utils.dbconnector.find_one_document(collection_name, query)

Finds a single document in the given MongoDB collection using the given query.

Parameters:
  • collection_name (str) – The name of the collection.

  • query (dict) – The query to select documents.

Returns:

The selected document.

Return type:

dict

Raises:

Exception – If there is an error finding the document.

src.utils.dbconnector.get_mongo_client()

Connects to MongoDB and returns the database object.

Uses environment variables for connection:

  • MONGO_USERNAME: username for MongoDB authentication

  • MONGO_PASSWORD: password for MongoDB authentication

  • MONGO_DB_NAME: name of the database to connect to

Returns:

the connected database object

Return type:

pymongo.database.Database

Raises:

Exception – if connection fails
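
A sketch of how the three environment variables could be combined into a connection string. The `mongodb+srv` scheme and the `host` argument are assumptions, because the cluster address is not documented here, and `build_mongo_uri` is a hypothetical helper. Credentials are percent-encoded with `quote_plus` so that characters such as `@` or `/` do not break the URI.

```python
import os
from urllib.parse import quote_plus


def build_mongo_uri(host: str) -> str:
    """Assemble a MongoDB URI from the documented environment variables."""
    user = quote_plus(os.environ["MONGO_USERNAME"])
    password = quote_plus(os.environ["MONGO_PASSWORD"])
    db_name = os.environ["MONGO_DB_NAME"]
    return f"mongodb+srv://{user}:{password}@{host}/{db_name}"


# Example values only; real credentials belong in the .env file.
os.environ.update(
    {
        "MONGO_USERNAME": "news_user",
        "MONGO_PASSWORD": "p@ss/word",
        "MONGO_DB_NAME": "newsai",
    }
)
uri = build_mongo_uri("cluster0.example.mongodb.net")
```

With PyMongo installed, passing such a URI to `pymongo.MongoClient` and indexing the client by database name yields a `pymongo.database.Database` object.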

src.utils.dbconnector.insert_document(collection_name, document)

Inserts a document into the given collection.

Parameters:
  • collection_name (str) – The name of the collection.

  • document (dict) – The document to be inserted.

Returns:

The ID of the inserted document.

Return type:

str

Raises:

Exception – If there is an error inserting the document.

src.utils.logger module

src.utils.logger.setup_logger(log_file='app.log')

Sets up a logger with a console handler and a rotating file handler.

The console handler color-codes messages by log level, while the file handler writes plain text. The file handler rotates the log file once it reaches 5 MB and keeps up to 5 backups.

Parameters:

log_file (str) – The name of the log file to write to. Defaults to “app.log”.

Returns:

The configured logger.

Return type:

logging.Logger
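
A logger with this behaviour can be assembled from the standard library alone, as in the sketch below. The ANSI colour palette, the message format, the `name` parameter, and reading 5 MB as 5 × 1024 × 1024 bytes are all assumptions; only the handler types, the rotation size, and the backup count come from the description above.

```python
import logging
import sys
from logging.handlers import RotatingFileHandler

# ANSI colour codes per level; the palette is an arbitrary choice.
COLORS = {
    logging.DEBUG: "\033[36m",
    logging.INFO: "\033[32m",
    logging.WARNING: "\033[33m",
    logging.ERROR: "\033[31m",
    logging.CRITICAL: "\033[35m",
}
RESET = "\033[0m"
FORMAT = "%(asctime)s - %(levelname)s - %(message)s"


class ColorFormatter(logging.Formatter):
    """Wrap each formatted record in the colour for its level."""

    def format(self, record):
        message = super().format(record)
        return f"{COLORS.get(record.levelno, '')}{message}{RESET}"


def setup_logger_sketch(log_file="app.log", name="newsai_sketch"):
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    if logger.handlers:  # avoid duplicate handlers on repeated calls
        return logger

    console = logging.StreamHandler(sys.stdout)
    console.setFormatter(ColorFormatter(FORMAT))

    # Rotate at 5 MB (taken here as 5 * 1024 * 1024 bytes), keep 5 backups.
    file_handler = RotatingFileHandler(
        log_file, maxBytes=5 * 1024 * 1024, backupCount=5
    )
    file_handler.setFormatter(logging.Formatter(FORMAT))  # no colour codes

    logger.addHandler(console)
    logger.addHandler(file_handler)
    return logger
```

Calling `setup_logger_sketch()` and then `logger.info("Pipeline started")` would print a green line to the console and append an uncoloured line to `app.log`.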

Module contents

Project Overview

<div align="center">

<img src="https://github.com/user-attachments/assets/b825468e-515c-45e8-9b81-a4f1b033ab0c" alt="NewsAI Logo" width="200px">
<h1>🚀 NewsAI: Where AI Meets Breaking News! 🌟</h1>
<p><i>Buckle up, news junkies! We’re about to take you on a wild ride through the information superhighway! 🎢</i></p>

![GitHub stars](https://img.shields.io/github/stars/Multiverse-of-Projects/NewsAI?style=social) ![GitHub forks](https://img.shields.io/github/forks/Multiverse-of-Projects/NewsAI?style=social) ![GitHub watchers](https://img.shields.io/github/watchers/Multiverse-of-Projects/NewsAI?style=social) ![GitHub contributors](https://img.shields.io/github/contributors/Multiverse-of-Projects/NewsAI) ![GitHub last commit](https://img.shields.io/github/last-commit/Multiverse-of-Projects/NewsAI) [![Documentation Status](https://readthedocs.org/projects/newsai/badge/?version=latest)](https://newsai.readthedocs.io/en/latest/?badge=latest)

</div>

## 🎭 What’s All the Fuss About?

Imagine if CNN, Reddit, and a fortune-teller had a baby, and that baby was raised by AI. That’s NewsAI! We’re not just aggregating news; we’re revolutionizing how you experience information:

  • 🔮 Gemini-Powered Insights: Google’s Gemini AI is our crystal ball!

  • 🧠 BERT-Based Sentiment Analysis: We don’t just read news; we feel it in our circuits!

  • 🚀 FastAPI Backend: So fast, it breaks the space-time continuum!

  • 🖥️ Streamlit Dashboard: Where data visualization meets modern art!

  • 🍃 MongoDB: Because our data is too cool for tables!

## 🎬 See It or Don’t Believe It!

Deployment link: https://news-ai-dashboard.streamlit.app/

<div align="center">

<a href="https://www.youtube.com/watch?v=stTXgljJVPQ">

<img src="https://img.youtube.com/vi/stTXgljJVPQ/0.jpg" alt="Demo Video" width="500px">

</a> <br> <i>Warning: This video may cause uncontrollable desire to code! 🤓</i>

</div>

## 🚀 Quick Start: 0 to Hero in 3… 2… 1…

```bash
# Clone this bad boy
git clone https://github.com/Multiverse-of-Projects/NewsAI.git

# Enter the matrix
cd NewsAI

# Install magical dependencies
pip install -r requirements.txt

# Add necessary creds in .env file:
# create an .env file with API keys and all

# Add python path and run streamlit from src/dashboard/
streamlit run app.py

# If you want to run only the pipeline.py
python -m src.pipeline

# If you want to unleash your creativity
git checkout -b feature/skynet-integration

# Start coding like you’re trying to prevent Y2K!
# For reference, my Python version == 3.12.7
```

## 🌈 Contribution: Join Our Avengers of Code!

  1. 🍴 Fork (the repo, not your dinner)

  2. 🌿 Branch (create one, don’t climb one)

  3. 💡 Commit (changes, not crimes)

  4. 🚀 Push (to the repo, not your luck)

  5. 🎉 PR (Pull Request, not Public Relations)

## 🏆 Wall of Fame: Our Code Wizards

<div align="center">
<a href="https://github.com/Multiverse-of-Projects/NewsAI/graphs/contributors">

<img src="https://contrib.rocks/image?repo=Multiverse-of-Projects/NewsAI" />

</a>

</div>

<div align="center">

<b>These legends write code that makes Shakespeare look like a casual blogger!</b>

</div>

## 📚 Documentation: The Sacred Texts

Our docs are so good, they’re basically the eighth wonder of the world. Check them out on [Read the Docs](https://newsai.readthedocs.io/)!

## 🎨 Our Tech Palette: Tools of Mass Construction

  • 🧠 Gemini AI: For insights sharper than a samurai’s sword

  • 🤖 BERT: Sentiment analysis that can read between the lines (and emojis)

  • 🚀 FastAPI: Because life’s too short for slow APIs

  • 🖥️ Streamlit: Making dashboards sexier than a sports car

  • 🍃 MongoDB: NoSQL? More like YesQL to all our data needs!

## 📬 Reach Out and Touch Code

<!-- - 🐦 Twitter: [@NewsAIDashboard](https://twitter.com/NewsAIDashboard) (Follow us for dad jokes and tech puns) -->
- 💬 Discord: [Join our server](https://discord.gg/kV4ANf6x) (Where we debate tabs vs. spaces)

## 📜 License to Thrill

This project is licensed under the MIT License - see the [LICENSE.md](LICENSE.md) file for details. It’s basically a license to code with reckless abandon!

<div align="center">

<img src="https://media.giphy.com/media/3o7btXkbsV26U95Uly/giphy.gif" width="200px">
<br>
<b>May your code be bug-free and your coffee be strong! 🚀☕</b>

</div>
