Dataset Card for NLBSE24 Issue Classification Dataset

The NLBSE24 Issue Classification dataset contains GitHub issue reports from various open-source projects. Each report is classified into one of three categories: 'bug', 'feature', or 'question'. This dataset is a "soft-cleaned" version of the original raw data, designed to support machine learning research on automated issue triage and classification tasks within software engineering.

Dataset Details

Dataset Description

This dataset is a curated collection of GitHub issue reports, specifically sourced from the NLBSE24 challenge. It provides a benchmark for developing and evaluating models capable of classifying issue descriptions into fundamental categories relevant to software development. The "soft-cleaned" version used in our project has undergone basic preprocessing steps to ensure data integrity and usability.

  • Curated by: NLBSE (Natural Language-Based Software Engineering) organizers and Capibara Team (for soft-cleaning)
  • Language: English
  • License: The original NLBSE24 dataset typically inherits licenses from its source GitHub projects.
  • Size Categories:
    • Training Set: 1500 samples
    • Test Set: 1498 samples
  • Task Categories: Multi-class Text Classification (3 labels)
  • Supported Languages: English (derived from various programming project contexts on GitHub)

Data Fields

Each record in the dataset contains the following fields:

| Field | Type | Description |
|-------|------|-------------|
| repo | string | The GitHub repository from which the issue was extracted. |
| created_at | string | Timestamp of when the issue was created. |
| label | string | The ground-truth classification category ('bug', 'feature', or 'question'). |
| issue | string | The main text content of the issue report (title and body combined, or just the body if no separate title field exists). |
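
As a quick sanity check, the snippet below loads the training split with pandas and inspects these fields. The file path follows the "Dataset Structure" section of this card; adjust it to your local layout.

```python
import pandas as pd

# Path as documented under "Dataset Structure"; adjust for your checkout.
DATA_DIR = "data/soft_cleaned/issue-report-classification/nlbse24"

train = pd.read_csv(f"{DATA_DIR}/issues_train.csv")

print(train.columns.tolist())         # expected: ['repo', 'created_at', 'label', 'issue']
print(train["label"].value_counts())  # distribution over 'bug', 'feature', 'question'
print(train.loc[0, "issue"][:200])    # preview of the first issue text
```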

Label Definitions

The dataset contains three predefined labels relevant to software development issues:

  • bug: Indicates an error, flaw, or fault in the code that causes it to produce an incorrect or unexpected result.
  • feature: Represents a new functionality or enhancement requested for the software.
  • question: Denotes a request for information, clarification, or a general query about the software.
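
Most training frameworks expect integer class ids rather than strings, so a fixed mapping such as the one below is typically defined. The ordering here is our own convention, not part of the dataset.

```python
# Our own ordering convention; the dataset itself only provides string labels.
LABEL2ID = {"bug": 0, "feature": 1, "question": 2}
ID2LABEL = {i: label for label, i in LABEL2ID.items()}

assert ID2LABEL[LABEL2ID["feature"]] == "feature"
```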

Uses

Direct Use

This dataset is primarily intended for:

  • Training and evaluating machine learning models for automated issue classification (a minimal baseline sketch follows this list).
  • Benchmarking different text classification algorithms in a software engineering context.
  • Research into synthetic data generation techniques for augmenting limited labeled datasets.
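
As an illustration of the first use case, here is a minimal baseline sketch (our own, not an official NLBSE24 baseline) that pairs TF-IDF features with a logistic regression classifier from scikit-learn, using the fields described above:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline

DATA_DIR = "data/soft_cleaned/issue-report-classification/nlbse24"
train = pd.read_csv(f"{DATA_DIR}/issues_train.csv")
test = pd.read_csv(f"{DATA_DIR}/issues_test.csv")

# TF-IDF over unigrams and bigrams feeding a linear classifier.
clf = make_pipeline(
    TfidfVectorizer(max_features=50_000, ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
clf.fit(train["issue"].fillna(""), train["label"])

print(classification_report(test["label"], clf.predict(test["issue"].fillna(""))))
```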

Out-of-Scope Use

  • Use for classification tasks outside the 'bug', 'feature', and 'question' categories without proper re-annotation.
  • Training models for languages other than English.
  • Generalizing findings to highly specialized or proprietary software domains without further domain-specific validation.
  • As a direct measure of software quality or project health without context-specific metrics.

Dataset Structure

The dataset is provided in CSV format and is split into training and testing sets, available as issues_train.csv and issues_test.csv, respectively.

  • Splits:
    • nlbse24_train: Training set. Total samples: 1500.
    • nlbse24_test: Testing set. Total samples: 1498.
  • Data Location (within project): data/soft_cleaned/issue-report-classification/nlbse24/ (tracked via DVC)
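
The splits can be loaded directly from these CSV files, for example with the Hugging Face datasets CSV builder (paths as above; adjust to your checkout):

```python
from datasets import load_dataset

DATA_DIR = "data/soft_cleaned/issue-report-classification/nlbse24"
ds = load_dataset(
    "csv",
    data_files={
        "train": f"{DATA_DIR}/issues_train.csv",
        "test": f"{DATA_DIR}/issues_test.csv",
    },
)
print(ds["train"].num_rows, ds["test"].num_rows)  # expected: 1500 1498
```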

Dataset Creation

Curation Rationale

The NLBSE24 dataset was created to foster research and development in natural language processing for software engineering, specifically focusing on the automatic understanding and categorization of GitHub issues. Its curation provides a standardized benchmark for comparative studies of issue classification models.

Source Data

Data Collection and Processing

The original data consists of GitHub issue reports collected from various open-source repositories.

Who are the source data producers?

The original data was produced by developers submitting issues on GitHub. The dataset was curated and annotated by the NLBSE organizers and contributors. The "soft-cleaned" version was produced by the Capibara Team.

Annotations

Annotation process

The original NLBSE24 dataset featured manual annotation of issue reports into predefined categories, performed by expert annotators. The "soft-cleaned" version preserves these original annotations.

Personal and Sensitive Information

The dataset contains publicly available GitHub issue report text. Efforts have been made by the original curators to manage personally identifiable information (PII) typically found in publicly scraped data, but users should exercise caution.

Bias, Risks, and Limitations

  1. Project Bias: The dataset is derived from a selection of open-source GitHub projects, and its characteristics (e.g., common issue types, language style) may reflect specific biases inherent to these projects.
  2. Label Set Limitations: The set of three labels ('bug', 'feature', 'question') is a simplification of the diverse range of issue types that can exist. Fine-grained classification might require a different label taxonomy.
  3. Preprocessing Impact: While "soft-cleaning" improves integrity, it may not remove all noise or irrelevant information (e.g., long stack traces, irrelevant discussion in the issue body) which could influence model performance.

Recommendations

Users should be aware of the risks, biases, and limitations of the dataset. We recommend the following:

  • Understand context: Familiarize yourself with the original source projects and annotation guidelines of NLBSE24 to understand potential biases.
  • Domain-specific analysis: Supplement with domain-specific EDA if applying the dataset to a new domain.
  • Review preprocessing: Be aware of the level of cleaning applied and consider more aggressive preprocessing steps if the downstream task requires it (e.g., removing code blocks from issue bodies, as in the sketch below).
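
For the last point, here is a sketch of one possible cleaning step (our own illustration, not part of the dataset's pipeline) that drops Markdown-fenced code blocks from an issue body:

```python
import re

# Matches a fenced code block: three backticks, anything (including newlines),
# up to the next three backticks. Non-greedy so adjacent blocks stay separate.
FENCE_RE = re.compile(r"`{3}.*?`{3}", re.DOTALL)

def strip_code_blocks(issue_text: str) -> str:
    """Replace fenced code blocks with a single space."""
    return FENCE_RE.sub(" ", issue_text)

fence = "`" * 3
body = f"Crash on start\n{fence}\nTraceback (most recent call last) ...\n{fence}\nSteps to reproduce follow."
print(strip_code_blocks(body))  # code block is gone; surrounding prose remains
```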

Citation

If you use this dataset in your research, please cite the original NLBSE24 dataset paper if available, and acknowledge the Capibara Team for the soft-cleaned version.

BibTeX:

```bibtex
@inproceedings{nlbse24,
  author={NLBSE 2024 Organizers},
  title={The NLBSE'24 Shared Task on Issue Classification},
  booktitle={Proceedings of The 4th International Workshop on Natural Language-based Software Engineering (NLBSE'24)},
  year={2024}
}
```

APA:

NLBSE 2024 Organizers. (2024). The NLBSE'24 Shared Task on Issue Classification. In Proceedings of The 4th International Workshop on Natural Language-based Software Engineering (NLBSE'24).

Dataset Card Authors

Capibara Team - SE4AI 2526 Course, University of Bari

Dataset Card Contact

GitHub: https://github.com/se4ai2526-uniba/Capibara