MACHINE LEARNING CANVAS
Designed for: Giuseppe Colavito
Designed by: Capibara
Date: 15/10/2025
Iteration: 1.1
VALUE PROPOSITION
The ML system defines a specialized pipeline for synthetic data generation aimed at improving issue report classification (e.g., into bugs, feature requests, and questions).
This pipeline is designed for Machine Learning and Software Engineers facing the problem of scarce labeled data. By providing a rich and representative synthetic dataset, the system enables the training of more robust and effective classifiers for automating issue triage. The solution integrates easily into existing ML pipelines, acting as a data augmentation step before the final model's training phase.
PREDICTION TASK
The system solves a generative task. The goal is to expand an initial small dataset of real issue reports.
Starting from a set of issues as input, the pipeline generates new synthetic samples that mimic the statistical distribution and semantic characteristics of the original data. The output is a new synthetic dataset composed of the generated samples, ready to be used for training a classification model.
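The generation step can be read as a simple loop around an LLM call. Below is a minimal sketch, assuming the LLM is wrapped as a plain prompt-to-text callable; the Issue record, the prompt wording, and the output parsing are illustrative placeholders, not the pipeline's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Issue:
    title: str
    body: str
    label: str  # one of: "bug", "feature", "question"


def generate_synthetic_dataset(seed: List[Issue],
                               llm: Callable[[str], str],
                               n_samples: int) -> List[Issue]:
    """Expand a small seed set of real issues into n_samples synthetic ones."""
    labels = sorted({issue.label for issue in seed})
    synthetic: List[Issue] = []
    for i in range(n_samples):
        # Rotate over the labels so the generated set stays class-balanced.
        target = labels[i % len(labels)]
        examples = [x for x in seed if x.label == target][:3]
        # Illustrative prompt; the actual prompt design is described under FEATURES.
        prompt = (
            f"Write one new, realistic issue report of type '{target}'.\n"
            "Answer with 'TITLE: ...' on the first line and 'BODY: ...' after it.\n"
            "Examples:\n"
            + "\n\n".join(f"TITLE: {e.title}\nBODY: {e.body}" for e in examples)
        )
        raw = llm(prompt)
        title, _, body = raw.partition("\n")
        synthetic.append(Issue(title=title.removeprefix("TITLE:").strip(),
                               body=body.removeprefix("BODY:").strip(),
                               label=target))
    return synthetic
```

Keeping the LLM behind a callable keeps the sketch independent of any specific provider or model.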
DECISIONS
Our system supports an engineer in deciding how to compose the training dataset for an issue classifier, reducing the dependency on costly manual data collection and labeling.
The specific actions the engineer can take include:
- Choosing the Training Strategy: The engineer can decide whether to:
- Use only synthetic data (ideal when no labeled real data is available).
- Augment a small set of real data with synthetic data to improve the classifier's generalization and performance.
- Defining the Data Mix: When augmenting, the engineer can experiment with the optimal ratio of real to synthetic data (e.g., 30% real, 70% synthetic) to maximize classification metrics (see the sketch after this list).
The final decision is guided by a performance comparison: the engineer evaluates whether the metrics from the model trained with the new dataset are sufficient for the task, or whether they approach those of a model trained on a full real dataset, thereby justifying the savings in time and cost.
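A minimal sketch of the data-mix decision, reusing the Issue record from the Prediction Task sketch; the helper name and the capping behaviour are illustrative choices, not part of the canvas itself.

```python
import random
from typing import List


def compose_training_set(real: List[Issue], synthetic: List[Issue],
                         real_ratio: float, size: int,
                         rng_seed: int = 42) -> List[Issue]:
    """Build a training set of `size` samples with the requested real/synthetic mix.

    For example, real_ratio=0.3 and size=1000 gives roughly 300 real and
    700 synthetic samples (capped by what is actually available).
    """
    rng = random.Random(rng_seed)
    n_real = min(int(size * real_ratio), len(real))
    n_synthetic = min(size - n_real, len(synthetic))
    mix = rng.sample(real, n_real) + rng.sample(synthetic, n_synthetic)
    rng.shuffle(mix)
    return mix
```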
IMPACT SIMULATION
The system's impact will be evaluated through a rigorous comparison:
1. Dataset Splitting: The original issue dataset is split into a training set and a test set.
2. Baseline Creation: Classification models (e.g., SetFit) are trained only on the original training set and their performance is evaluated on the test set.
3. Synthetic Generation: The original training set is used as a "seed" to generate a new synthetic dataset.
4. Training with Augmented Data: The same models are trained on various data combinations (e.g., synthetic only, real + synthetic) and evaluated on the same test set.
5. Comparison of Results: The performance of the "augmented" models (step 4) is compared with the baseline performance (step 2).
The desired result is to achieve performance equal to or better than the baseline. Otherwise, the synthetic generation process would be a waste of computational resources. The experiment will be repeated on issue datasets from different software projects to validate the approach's generality.
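A sketch of this comparison protocol, assuming the chosen classifier (e.g., a SetFit fine-tuning run) is wrapped behind a train_and_evaluate callable that always scores on the same held-out test set; the configuration names are illustrative.

```python
from typing import Callable, Dict, List

Metrics = Dict[str, float]


def compare_configurations(real_train: List[Issue],
                           synthetic: List[Issue],
                           test_set: List[Issue],
                           train_and_evaluate: Callable[[List[Issue], List[Issue]], Metrics]
                           ) -> Dict[str, Metrics]:
    """Train the same classifier on each data configuration and score it on one fixed test set."""
    configurations = {
        "baseline_real_only": real_train,               # step 2
        "synthetic_only": synthetic,                    # step 4
        "real_plus_synthetic": real_train + synthetic,  # step 4
    }
    return {name: train_and_evaluate(train_data, test_set)
            for name, train_data in configurations.items()}
```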
MAKING PREDICTIONS
The system is designed to operate in batch mode. The dataset generation is a one-off process that occurs before the engineer's classifier training phase. Since there are no real-time requirements, there are no strict constraints on generation time.
The generation process can be time-consuming, depending on the size of the dataset to be generated, the LLM used, and the available computational resources (GPUs).
DATA COLLECTION
The synthetic issue data generation pipeline can be improved over time by incorporating feedback from downstream issue classification tasks and by collecting other publicly available issue report datasets from platforms such as GitHub and Zenodo. Datasets from unseen domains will allow us to test the robustness and generalizability of the data generation pipeline across a wider variety of projects and domains, making it more adaptable for generating diverse issue reports.
DATA SOURCES
The primary data source for this project is the official dataset provided by the NLBSE'24 Tool Competition on Issue Report Classification.
This dataset consists of 3,000 labeled issue reports extracted from five real-world open-source projects. Each issue report is structured with the following fields: Repository, Label, Title, and Body. Issues are categorized into one of three distinct, mutually exclusive classes: bug, enhancement (or feature), and question.
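A small loading-and-validation sketch for this dataset; the CSV layout and the column names used here are assumptions about the released format, not a specification of it.

```python
import pandas as pd

EXPECTED_COLUMNS = {"repository", "label", "title", "body"}  # assumed column names
# "enhancement" and "feature" name the same class in the competition data.
VALID_LABELS = {"bug", "enhancement", "feature", "question"}


def load_issues(path: str) -> pd.DataFrame:
    """Load an issues file and check that it matches the expected schema."""
    df = pd.read_csv(path)
    df.columns = [c.strip().lower() for c in df.columns]
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    unknown = set(df["label"].str.lower().unique()) - VALID_LABELS
    if unknown:
        raise ValueError(f"Unexpected labels: {sorted(unknown)}")
    return df
```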
To ensure constant availability, version control, and ease of access for the team, all datasets (original, generated, and updated versions) will be stored and managed on a shared online storage platform, such as Google Drive.
BUILDING MODELS
Currently, no model is deployed to production. The final output of the project is not a "live" classification model, but a configurable pipeline for data generation. Large Language Models (LLMs) are used as a tool within this pipeline.
"Updating" the system means modifying or improving the pipeline itself, a process that requires significant computational resources (powerful CPUs and GPUs) for experimentation and validation. In this case, we will track the system's versions using experiment tracking tools to ensure reproducibility.
FEATURES
At prediction time (for synthetic data generation), a description of the context and of the individual columns (e.g., title, body, label) of the issue dataset to augment is needed. Furthermore, the engineer can provide the system with insights about the issue data to be generated, in the form of few-shot examples taken from the original dataset. These examples are represented textually, with a structure mirroring the previously described columns (e.g., a sample issue title, body, and its corresponding label).
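A sketch of how these inputs can be assembled into a generation prompt, reusing the Issue record from the earlier sketch; the column descriptions and the prompt wording are illustrative.

```python
from typing import List

COLUMN_DESCRIPTIONS = {
    "title": "a short, one-line summary of the issue",
    "body": "the full description of the problem or request",
    "label": "one of: bug, feature, question",
}


def build_prompt(context: str, examples: List[Issue], target_label: str) -> str:
    """Combine the dataset context, the column descriptions, and few-shot examples."""
    columns = "\n".join(f"- {name}: {desc}" for name, desc in COLUMN_DESCRIPTIONS.items())
    shots = "\n\n".join(f"title: {e.title}\nbody: {e.body}\nlabel: {e.label}"
                        for e in examples)
    return (f"Context: {context}\n"
            f"Columns:\n{columns}\n\n"
            f"Examples:\n{shots}\n\n"
            f"Generate one new issue with label '{target_label}' in the same format.")
```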
For data transformation, a data cleaning phase is planned for the input data (to remove templates, logs, and irrelevant information), along with an analysis of the class distribution to guide a balanced generation.
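A lightweight cleaning pass might look as follows; the patterns below (fenced code/log blocks and issue-template comments) are illustrative heuristics rather than the project's actual cleaning rules.

```python
import re


def clean_issue_body(body: str) -> str:
    """Remove common boilerplate from an issue body before it is used as a few-shot example."""
    text = re.sub(r"`{3}[\s\S]*?`{3}", " ", body)  # fenced code blocks and pasted logs
    text = re.sub(r"<!--[\s\S]*?-->", " ", text)   # hidden issue-template comments
    return re.sub(r"\s+", " ", text).strip()       # collapse leftover whitespace
```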
MONITORING
To track the impact of the solution, we adopt a dual-level monitoring approach:
- Downstream Model Performance
- Metrics: Accuracy, F1-Score, Precision, Recall (same as those used for the final classifier)
- Goal: Assess whether synthetic data genuinely improves the end task.
- Synthetic Data Quality
- Statistical Tools:
- Class distribution analysis (balance check)
- Semantic similarity measurement (e.g., embeddings) between real and synthetic data
- Text diversity calculation (to avoid mode collapse)
- LLM as a Judge: a Large Language Model can be used as an automatic judge to evaluate the coherence, realism, and relevance of generated samples compared to real data.
All metrics and evaluations are tracked after each experiment to guide subsequent iterations and optimize the pipeline.
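A sketch of the statistical quality checks, reusing the Issue record from the earlier sketches and assuming sentence-transformers as the embedding backend; the model name and the exact diversity formula are illustrative choices.

```python
from collections import Counter
from typing import Dict, List

import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding backend


def quality_report(real: List[Issue], synthetic: List[Issue]) -> Dict[str, object]:
    """Class balance, real-vs-synthetic semantic similarity, and synthetic text diversity."""
    def to_text(issues: List[Issue]) -> List[str]:
        return [f"{x.title} {x.body}" for x in issues]

    model = SentenceTransformer("all-MiniLM-L6-v2")  # model choice is illustrative
    real_emb = model.encode(to_text(real), normalize_embeddings=True)
    syn_emb = model.encode(to_text(synthetic), normalize_embeddings=True)

    # Semantic similarity: cosine between the centroids of the two corpora.
    real_c, syn_c = real_emb.mean(axis=0), syn_emb.mean(axis=0)
    similarity = float(np.dot(real_c, syn_c) /
                       (np.linalg.norm(real_c) * np.linalg.norm(syn_c)))

    # Diversity: one minus the mean pairwise cosine similarity within the synthetic
    # set; values close to zero hint at mode collapse (requires at least two samples).
    pairwise = syn_emb @ syn_emb.T
    n = len(synthetic)
    diversity = float(1 - (pairwise.sum() - np.trace(pairwise)) / (n * (n - 1)))

    return {
        "class_distribution": dict(Counter(x.label for x in synthetic)),
        "real_synthetic_similarity": similarity,
        "synthetic_diversity": diversity,
    }
```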