---
title: Reputation Monitor
emoji: 📊
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
app_port: 7860
---

# 📊 End-to-End MLOps Pipeline for Real-Time Reputation Monitoring

![Build Status](https://img.shields.io/badge/build-passing-brightgreen)
![Python](https://img.shields.io/badge/python-3.9%2B-blue)
![Model](https://img.shields.io/badge/model-RoBERTa-yellow)
![Deployment](https://img.shields.io/badge/deployed%20on-HuggingFace-orange)
![License](https://img.shields.io/badge/license-MIT-green)

### 👀 Author

**Fabio Celaschi**

<a href="https://www.linkedin.com/in/fabio-celaschi-4371bb92">
  <img src="https://img.shields.io/badge/LinkedIn-0077B5?style=for-the-badge&logo=linkedin&logoColor=white" alt="LinkedIn" />
</a>

<a href="https://www.instagram.com/fabiocelaschi/">
  <img src="https://img.shields.io/badge/Instagram-E4405F?style=for-the-badge&logo=instagram&logoColor=white" alt="Instagram" />
</a>

## 🚀 Project Overview

This project is a comprehensive **MLOps solution** designed to monitor online company reputation through automated sentiment analysis of real-time news. It was developed to demonstrate **scalable, production-ready machine learning engineering** capabilities.

Unlike standard static notebooks, this repository demonstrates a **full-cycle ML workflow**. The system scrapes live data from **Google News**, analyzes sentiment using a **RoBERTa Transformer** model, and visualizes insights via an interactive dashboard, all orchestrated within a Dockerized environment.

### Key Features
* **Real-Time Data Ingestion:** Automated scraping of Google News for target brand keywords.
* **Transformer-Based NLP:** Uses the pre-trained `twitter-roberta-base-sentiment` model to classify the sentiment of each news item.
* **Full-Stack Architecture:** Integrates a **FastAPI** backend for inference and a **Streamlit** frontend for visualization in a single container.
* **Automated Continuous Training (CT):** Pipeline logic checks for new data and simulates model fine-tuning during CI/CD execution.
* **CI/CD Automation:** GitHub Actions pipeline for automated testing, building, and deployment to Hugging Face Spaces.
* **Embedded Monitoring:** Basic logging system to track model predictions and sentiment distribution over time.
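
As an illustration of the classification step, here is a minimal sketch of how raw model output could be turned into readable sentiment names. It assumes the model is the `cardiffnlp/twitter-roberta-base-sentiment` checkpoint on the Hugging Face Hub, whose raw labels are `LABEL_0`, `LABEL_1`, and `LABEL_2`. The function names `load_classifier` and `analyze_titles` are hypothetical and are not taken from this repository.

```python
from typing import Callable, Dict, List

# Raw label -> readable name, as documented for the assumed checkpoint
# cardiffnlp/twitter-roberta-base-sentiment.
LABEL_NAMES: Dict[str, str] = {
    "LABEL_0": "negative",
    "LABEL_1": "neutral",
    "LABEL_2": "positive",
}

Prediction = Dict[str, object]
Classifier = Callable[[List[str]], List[Prediction]]


def load_classifier() -> Classifier:
    """Build a Hugging Face pipeline (downloads the model on first use)."""
    from transformers import pipeline  # imported lazily: heavy dependency

    return pipeline(
        "sentiment-analysis",
        model="cardiffnlp/twitter-roberta-base-sentiment",  # assumed model id
    )


def analyze_titles(titles: List[str], classifier: Classifier) -> List[Prediction]:
    """Classify each title and translate raw labels into readable names."""
    results: List[Prediction] = []
    for title, raw in zip(titles, classifier(titles)):
        label = str(raw["label"])
        results.append(
            {
                "text": title,
                # Fall back to the raw label if the model reports something else.
                "sentiment": LABEL_NAMES.get(label, label),
                "score": float(raw["score"]),
            }
        )
    return results
```

Passing the classifier as an argument keeps `analyze_titles` testable without downloading the model: any callable that returns `label`/`score` dictionaries can stand in for the real pipeline.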

---

## 🛠️ Tech Stack & Tools

* **Core:** Python 3.9+
* **Machine Learning:** Hugging Face Transformers, PyTorch, Scikit-learn.
* **Backend:** FastAPI, Uvicorn (REST API).
* **Frontend:** Streamlit (Interactive Dashboard).
* **Data Ingestion:** `GoogleNews` library (Real-time scraping).
* **DevOps:** Docker, GitHub Actions (CI/CD).
* **Deployment:** Hugging Face Spaces (Docker SDK).

---

## ⚙️ Architecture & MLOps Workflow

The project follows a rigorous MLOps pipeline to ensure reliability and speed of delivery:

1.  **Data & Modeling:**
    * **Input:** Real-time news titles and descriptions fetched dynamically.
    * **Model:** Pre-trained **RoBERTa** model optimized for social media and short-text sentiment.

2.  **Containerization (Docker):**
    * The application is containerized using a custom `Dockerfile`.
    * Implements a custom `entrypoint.sh` script to run both the **FastAPI backend** (port 8000) and **Streamlit frontend** (port 7860) simultaneously.

3.  **CI/CD Pipeline (GitHub Actions):**
    * **Trigger:** Pushes to the `main` branch.
    * **Continuous Training:** Checks the `data/` directory for new labeled datasets. If found, initiates a training simulation to demonstrate the retraining lifecycle.
    * **Test:** Executes `pytest` suite to verify API endpoints (`/health`, `/analyze`) and model loading.
    * **Build:** Verifies Docker image creation.
    * **Deploy:** Automatically pushes the validated code to Hugging Face Spaces.

4.  **Monitoring:**
    * The system logs every prediction to a local CSV file, which is visualized in the "Monitoring" tab of the dashboard.
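
The monitoring step above can be sketched with the standard library alone. This is an illustrative version, not the repository's actual logger: the column names in `FIELDS` and the function names `log_prediction` and `sentiment_distribution` are assumptions.

```python
import csv
from collections import Counter
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict

# Assumed column layout for the monitoring log.
FIELDS = ["timestamp", "text", "sentiment", "score"]


def log_prediction(path: Path, text: str, sentiment: str, score: float) -> None:
    """Append one prediction to the CSV log, writing the header on first use."""
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(
            {
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "text": text,
                "sentiment": sentiment,
                "score": f"{score:.4f}",
            }
        )


def sentiment_distribution(path: Path) -> Dict[str, int]:
    """Count logged predictions per sentiment label."""
    if not path.exists():
        return {}
    with path.open(newline="", encoding="utf-8") as handle:
        return dict(Counter(row["sentiment"] for row in csv.DictReader(handle)))
```

A dashboard tab could call `sentiment_distribution` on each refresh and plot the resulting counts.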

---

## 📂 Repository Structure

```bash
├── .github/workflows/   # CI/CD configurations (GitHub Actions)
├── app/                 # Backend Application Code
│   ├── api/             # FastAPI endpoints (main.py)
│   ├── model/           # Model loader logic (RoBERTa)
│   └── services/        # Google News scraping logic
├── data/                # Dataset storage for retraining
├── streamlit_app/       # Frontend Application Code (app.py)
├── src/                 # Training scripts (Simulation)
├── tests/               # Unit and integration tests (Pytest)
├── Dockerfile           # Container configuration
├── entrypoint.sh        # Startup script for dual-process execution
├── requirements.txt     # Project dependencies
├── Appunti_Progetto.doc # Notes and explanation of the project
└── README.md            # Project documentation
```


## 💻 Installation & Usage

To run this project locally using Docker (recommended):

### 1. Clone the repository
```bash
git clone https://github.com/YOUR_USERNAME/SentimentAnalysis.git
cd SentimentAnalysis
```

### 2. Build the Docker Image
```bash
docker build -t reputation-monitor .
```

### 3. Run the Container
```bash
docker run -p 7860:7860 reputation-monitor
```

Access the application at http://localhost:7860

### Manual Installation (No Docker)

If you prefer running it directly with Python:

1.  Install dependencies:

    ```bash
    pip install -r requirements.txt
    ```

2.  Start the Backend (FastAPI):

    ```bash
    uvicorn app.api.main:app --host 0.0.0.0 --port 8000 --reload
    ```

3.  Start the Frontend (Streamlit) in a new terminal:

    ```bash
    streamlit run streamlit_app/app.py
    ```

## ⚠️ Limitations & Future Roadmap

**Data Persistence:** Currently, monitoring logs are stored in an ephemeral CSV file. In a production environment, this would be replaced by a persistent database (e.g., PostgreSQL) to ensure data retention across container restarts.

**Scalability:** The current Google News scraper is synchronous. Future versions will implement asynchronous scraping (`aiohttp`) or a task queue backed by a message broker (Celery with RabbitMQ) for high-volume processing.
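
As an intermediate step toward that goal, a blocking scraper can already be run for several keywords concurrently by offloading each call to a worker thread. The sketch below uses only `asyncio` from the standard library, not `aiohttp` or a task queue. `fetch_all` and its `fetch_news` parameter are hypothetical names; `fetch_news` stands in for whatever synchronous function the scraping service exposes.

```python
import asyncio
from typing import Callable, Dict, List


async def fetch_all(
    keywords: List[str],
    fetch_news: Callable[[str], List[str]],
    max_concurrency: int = 5,
) -> Dict[str, List[str]]:
    """Run a blocking scraper for several keywords concurrently."""
    # Cap the number of simultaneous requests to stay polite to the source.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(keyword: str) -> List[str]:
        async with semaphore:
            # Offload the blocking call to the default thread pool.
            loop = asyncio.get_running_loop()
            return await loop.run_in_executor(None, fetch_news, keyword)

    # gather() preserves input order, so results line up with keywords.
    results = await asyncio.gather(*(fetch_one(k) for k in keywords))
    return dict(zip(keywords, results))
```

A caller would run it with `asyncio.run(fetch_all(["brand a", "brand b"], scraper_function))`. Threads suit this case because the work is I/O-bound; a fully asynchronous HTTP client would remove the thread pool but requires rewriting the scraper itself.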

**Model Retraining:** A placeholder pipeline (`src/train.py`) is included. Full implementation would require GPU resources and a labeled dataset for fine-tuning.
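
The "check for new data" step that gates retraining could look like the following sketch. It is not the logic in `src/train.py`; it assumes that datasets are CSV files in `data/` and that a hypothetical marker file, `.last_trained`, records when training last ran.

```python
from pathlib import Path
from typing import List, Optional

MARKER_NAME = ".last_trained"  # hypothetical marker file name


def find_new_datasets(data_dir: Path, marker: Optional[Path] = None) -> List[Path]:
    """Return CSV files in data_dir modified after the marker file.

    If the marker is missing (no previous run), every CSV counts as new.
    """
    if not data_dir.is_dir():
        return []
    marker = marker or data_dir / MARKER_NAME
    last_run = marker.stat().st_mtime if marker.exists() else float("-inf")
    return sorted(p for p in data_dir.glob("*.csv") if p.stat().st_mtime > last_run)


def mark_trained(data_dir: Path, marker: Optional[Path] = None) -> None:
    """Record that the current datasets were consumed by a training run."""
    marker = marker or data_dir / MARKER_NAME
    marker.touch()  # creates the file or refreshes its modification time
```

A CI job could call `find_new_datasets(Path("data"))`, start the training step only when the list is non-empty, and call `mark_trained` afterwards. Modification times are a simple heuristic; content hashes would be more robust when files are copied or checked out fresh, as happens in most CI runners.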

🀝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.

πŸ“ License
Distributed under the MIT License. See LICENSE for more information.