Create DATASET_CREDITS.md
DATASET_CREDITS.md (added, +37 -0)
# Credits

This dataset combines three existing datasets, preprocessed with **deduplication** and a **limit of 1024 tokens per example**.

## Included Datasets

1. **[CyberNative/Code_Vulnerability_Security_DPO](https://huggingface.co/datasets/CyberNative/Code_Vulnerability_Security_DPO)**
   - Creator: CyberNative
   - License: Apache 2.0
   - Description: Code dataset focused on security vulnerabilities.

2. **[Madras1/minimax-m2.5-code-distilled-14k](https://huggingface.co/datasets/Madras1/minimax-m2.5-code-distilled-14k)**
   - Creator: Madras1
   - License: Apache 2.0
   - Description: Distilled code dataset emphasizing coding patterns and representations.

3. **[pedrodev2026/pedro-open-distil-dataset](https://huggingface.co/datasets/pedrodev2026/pedro-open-distil-dataset)**
   - Creator: pedrodev2026
   - License: BSD 3-Clause
   - Description: Custom distilled code dataset created and maintained by pedrodev2026.

## Preprocessing

The combined dataset was prepared by:

- **Deduplicating** all examples to remove redundancy.
- Limiting examples to **1024 tokens each**.

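The two preprocessing steps could be sketched roughly as follows. This is an illustrative sketch only, not the script used to build the dataset: the tokenizer is not named in these credits, so whitespace splitting stands in for it, deduplication is assumed to mean exact matching on stripped text, and over-limit examples are assumed to be dropped rather than truncated. The names `count_tokens` and `dedupe_and_limit` are hypothetical.

```python
import hashlib

MAX_TOKENS = 1024  # limit stated in the credits; the real tokenizer is not specified


def count_tokens(text):
    # Stand-in tokenizer (assumption): whitespace splitting.
    return len(text.split())


def dedupe_and_limit(examples, max_tokens=MAX_TOKENS):
    """Drop exact duplicates, then drop examples longer than max_tokens."""
    seen = set()
    kept = []
    for text in examples:
        # Hash the stripped text so duplicates are detected cheaply.
        key = hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        # Assumption: over-limit examples are filtered out, not truncated.
        if count_tokens(text) <= max_tokens:
            kept.append(text)
    return kept


if __name__ == "__main__":
    sample = ["print('a')", "print('a')", "x " * 2000, "def f(): return 1"]
    print(dedupe_and_limit(sample))
```

If the limit was instead applied by truncation, the final check would be replaced by cutting each example to its first `max_tokens` tokens.
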
## License

The final combined dataset is licensed under **BSD 3-Clause**.
Users must still respect the original licenses of the included datasets when redistributing or using the original unmodified datasets.

- Original licenses:
  - **[CyberNative/Code_Vulnerability_Security_DPO](https://huggingface.co/datasets/CyberNative/Code_Vulnerability_Security_DPO)**: Apache 2.0
  - **[Madras1/minimax-m2.5-code-distilled-14k](https://huggingface.co/datasets/Madras1/minimax-m2.5-code-distilled-14k)**: Apache 2.0
  - **[pedrodev2026/pedro-open-distil-dataset](https://huggingface.co/datasets/pedrodev2026/pedro-open-distil-dataset)**: BSD 3-Clause