# Dataset Card for NLBSE24 Issue Classification Dataset
The NLBSE24 Issue Classification dataset contains GitHub issue reports from various open-source projects. Each report is classified into one of three categories: 'bug', 'feature', or 'question'. This dataset is a "soft-cleaned" version of the original raw data, designed to support machine learning research on automated issue triage and classification tasks within software engineering.
## Dataset Details

### Dataset Description
This dataset is a curated collection of GitHub issue reports, specifically sourced from the NLBSE24 challenge. It provides a benchmark for developing and evaluating models capable of classifying issue descriptions into fundamental categories relevant to software development. The "soft-cleaned" version used in our project has undergone basic preprocessing steps to ensure data integrity and usability.
- Curated by: NLBSE (Natural Language-Based Software Engineering) organizers and Capibara Team (for soft-cleaning)
- Language: English
- License: No single license applies; the issue texts are drawn from public GitHub projects and inherit the terms of their source repositories.
- Size Categories:
  - Training Set: 1500 samples
  - Test Set: 1498 samples
- Task Categories: Multi-class Text Classification (3 labels)
- Supported Languages: English (derived from various programming project contexts on GitHub)
### Data Fields
Each record in the dataset contains the following fields:
| Field | Type | Description |
|---|---|---|
| `repo` | string | The GitHub repository from which the issue was extracted. |
| `created_at` | string | Timestamp of when the issue was created. |
| `label` | string | The ground-truth classification category ('bug', 'feature', 'question'). |
| `issue` | string | The main text content of the issue report (title and body combined, or just the body if no separate title field). |
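
As a quick sanity check of this schema, the fields can be inspected after loading a split; the following is a minimal sketch assuming pandas and the file layout described under Dataset Structure below:

```python
import pandas as pd

# Path assumed from the Dataset Structure section below (after `dvc pull`).
DATA_DIR = "data/soft_cleaned/issue-report-classification/nlbse24"
df = pd.read_csv(f"{DATA_DIR}/issues_train.csv")

# Confirm the expected schema: repo, created_at, label, issue.
print(df.columns.tolist())

# Peek at one record.
row = df.iloc[0]
print(row["repo"], row["created_at"], row["label"])
print(row["issue"][:300])  # first 300 characters of the issue text
```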
### Label Definitions
The dataset contains three predefined labels relevant to software development issues:
- `bug`: Indicates an error, flaw, or fault in the code that causes it to produce an incorrect or unexpected result.
- `feature`: Represents new functionality or an enhancement requested for the software.
- `question`: Denotes a request for information, clarification, or a general query about the software.
## Uses

### Direct Use
This dataset is primarily intended for:
- Training and evaluating machine learning models for automated issue classification (see the baseline sketch after this list).
- Benchmarking different text classification algorithms in a software engineering context.
- Research into synthetic data generation techniques for augmenting limited labeled datasets.
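
To illustrate the first use case, here is a minimal baseline sketch using TF-IDF features and logistic regression; it assumes scikit-learn and the CSV layout from the Dataset Structure section, and is an illustrative starting point rather than the NLBSE24 reference approach:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline

DATA_DIR = "data/soft_cleaned/issue-report-classification/nlbse24"
train = pd.read_csv(f"{DATA_DIR}/issues_train.csv")
test = pd.read_csv(f"{DATA_DIR}/issues_test.csv")

# TF-IDF over the raw issue text, then a linear classifier over the
# three labels ('bug', 'feature', 'question').
model = make_pipeline(
    TfidfVectorizer(max_features=50_000, ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(train["issue"].fillna(""), train["label"])

print(classification_report(test["label"], model.predict(test["issue"].fillna(""))))
```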
### Out-of-Scope Use
- Use for classification tasks outside of 'bug', 'feature', 'question' categories without proper re-annotation.
- Training models for languages other than English.
- Generalizing findings to highly specialized or proprietary software domains without further domain-specific validation.
- As a direct measure of software quality or project health without context-specific metrics.
## Dataset Structure
The dataset is provided in CSV format and is split into training and test sets, available as `issues_train.csv` and `issues_test.csv`, respectively.
- Splits:
  - `nlbse24_train`: Training set. Total samples: 1500.
  - `nlbse24_test`: Testing set. Total samples: 1498.
- Data Location (within project): `data/soft_cleaned/issue-report-classification/nlbse24/` (tracked via DVC)
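
Because the files are tracked via DVC, they can either be materialized locally with `dvc pull` or streamed directly; the following is a minimal sketch using the `dvc.api` Python interface, assuming the project's DVC remote is configured:

```python
import pandas as pd
import dvc.api  # requires the `dvc` package

DATA_DIR = "data/soft_cleaned/issue-report-classification/nlbse24"

# Stream each split straight from DVC-tracked storage (alternatively,
# run `dvc pull` once and read the CSVs from disk).
splits = {}
for name in ("issues_train.csv", "issues_test.csv"):
    with dvc.api.open(f"{DATA_DIR}/{name}") as f:
        splits[name] = pd.read_csv(f)

# Expected sizes: 1500 training samples and 1498 test samples.
print({name: len(df) for name, df in splits.items()})
print(splits["issues_train.csv"]["label"].value_counts())
```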
## Dataset Creation

### Curation Rationale
The NLBSE24 dataset was created to foster research and development in natural language processing for software engineering, specifically focusing on the automatic understanding and categorization of GitHub issues. Its curation provides a standardized benchmark for comparative studies of issue classification models.
### Source Data

#### Data Collection and Processing
The original data consists of GitHub issue reports collected from various open-source repositories.
#### Who are the source data producers?
The original data was produced by developers submitting issues on GitHub. The dataset was curated and annotated by the NLBSE organizers and contributors. The "soft-cleaned" version was produced by the Capibara Team.
### Annotations

#### Annotation process
The original NLBSE24 dataset featured manual annotation of issue reports into predefined categories, performed by expert annotators. The "soft-cleaned" version preserves these original annotations.
### Personal and Sensitive Information
The dataset contains publicly available GitHub issue report text. Efforts have been made by the original curators to manage personally identifiable information (PII) typically found in publicly scraped data, but users should exercise caution.
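
As one such precaution, downstream users may want to redact obvious identifiers before redistributing or displaying issue text; the following is a minimal, non-exhaustive sketch using Python's `re` module (the patterns and placeholder tokens are illustrative, not part of the dataset's own cleaning):

```python
import re

# Hypothetical precaution: redact e-mail addresses and @-mentions before
# redistributing or displaying issue text (not a complete PII scrubber).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
MENTION = re.compile(r"@\w[\w-]*")

def redact(text: str) -> str:
    return MENTION.sub("[USER]", EMAIL.sub("[EMAIL]", text))

print(redact("Reported by jane.doe@example.com, cc @octocat"))
# -> "Reported by [EMAIL], cc [USER]"
```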
## Bias, Risks, and Limitations
- Project Bias: The dataset is derived from a selection of open-source GitHub projects, and its characteristics (e.g., common issue types, language style) may reflect specific biases inherent to these projects.
- Label Set Limitations: The set of three labels ('bug', 'feature', 'question') is a simplification of the diverse range of issue types that can exist. Fine-grained classification might require a different label taxonomy.
- Preprocessing Impact: While "soft-cleaning" improves integrity, it may not remove all noise (e.g., long stack traces or off-topic discussion in issue bodies), which could influence model performance.
### Recommendations
Users should be aware of the risks, biases, and limitations of the dataset. In particular:
- Understand context: Familiarize yourself with the original source projects and annotation guidelines of NLBSE24 to understand potential biases.
- Domain-specific analysis: Supplement with domain-specific EDA if applying the dataset to a new domain.
- Review preprocessing: Be aware of the level of cleaning applied, and consider more aggressive preprocessing if the downstream task requires it (e.g., removing code blocks from issue bodies, as sketched below).
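
For the last point, here is a minimal sketch of one such step, removing Markdown code blocks from issue bodies with the standard `re` module (the helper name and cleaning policy are illustrative; the right level of cleaning depends on the downstream task):

````python
import re

# Hypothetical helper: strip fenced code blocks, inline code, and excess
# whitespace from an issue body before classification.
def strip_code(issue_text: str) -> str:
    text = re.sub(r"```.*?```", " ", issue_text, flags=re.DOTALL)  # fenced blocks
    text = re.sub(r"`[^`]+`", " ", text)                           # inline code
    return re.sub(r"\s+", " ", text).strip()                       # collapse whitespace

example = "Crash on save.\n```python\nraise ValueError\n```\nSteps: click `Save`."
print(strip_code(example))  # -> "Crash on save. Steps: click ."
````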
## Citation
If you use this dataset in your research, please cite the original NLBSE24 dataset paper if available, and acknowledge the Capibara Team for the soft-cleaned version.
BibTeX:

```bibtex
@inproceedings{nlbse24,
  author    = {NLBSE 2024 Organizers},
  title     = {The NLBSE'24 Shared Task on Issue Classification},
  booktitle = {Proceedings of the 4th International Workshop on Natural Language-based Software Engineering (NLBSE'24)},
  year      = {2024}
}
```

Please also acknowledge the Capibara Team if this specific soft-cleaned version of the dataset is used.
APA:
NLBSE 2024 Organizers. (2024). The NLBSE'24 Shared Task on Issue Classification. In Proceedings of The 4th International Workshop on Natural Language-based Software Engineering (NLBSE'24).
## Dataset Card Authors
Capibara Team - SE4AI 2526 Course, University of Bari
## Dataset Card Contact
GitHub: https://github.com/se4ai2526-uniba/Capibara