| --- |
| library_name: transformers |
| datasets: |
| - NamCyan/tesoro-code |
| base_model: |
| - microsoft/codebert-base |
| --- |
| |
| # Improving the detection of technical debt in Java source code with an enriched dataset |
|
|
|
|
| ## Model Details |
|
|
| ### Model Description |
|
|
This model is part of the Tesoro project and is used for detecting technical debt in source code. More information can be found on the [Tesoro homepage](https://github.com/NamCyan/tesoro.git).
|
|
| - **Developed by:** [Nam Hai Le](https://github.com/NamCyan) |
- **Model type:** Encoder-based PLM
| - **Language(s):** Java |
| - **Finetuned from model:** [CodeBERT](https://huggingface.co/microsoft/codebert-base) |
|
|
| ### Model Sources |
|
|
| - **Repository:** [Tesoro](https://github.com/NamCyan/tesoro.git) |
- **Paper:** [To be updated]
|
|
| ## How to Get Started with the Model |
|
|
| Use the code below to get started with the model. |
|
|
| ```python |
| from transformers import AutoModelForSequenceClassification, AutoTokenizer |
| |
| tokenizer = AutoTokenizer.from_pretrained("NamCyan/codebert-base-technical-debt-code-tesoro") |
| model = AutoModelForSequenceClassification.from_pretrained("NamCyan/codebert-base-technical-debt-code-tesoro") |
| ``` |
|
|
|
|
| ## Training Details |
|
|
- Training Data: The model is fine-tuned on the [tesoro-code](https://huggingface.co/datasets/NamCyan/tesoro-code) dataset (a loading sketch follows below).
|
|
- Infrastructure: Training was conducted on two NVIDIA A100 GPUs, each with 80GB of VRAM.
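
For reference, the dataset can be inspected directly with the `datasets` library; this is a minimal sketch, and the split and column names should be checked on the returned `DatasetDict`:

```python
from datasets import load_dataset

# Download tesoro-code from the Hugging Face Hub
ds = load_dataset("NamCyan/tesoro-code")
print(ds)  # inspect available splits and columns
```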
|
|
| ## Leaderboard |
| | Model | Model size | EM | F1 | |
| |:-------------|:-----------|:------------------|:------------------| |
| **Encoder-based PLMs** | | | |
| | [CodeBERT](https://huggingface.co/microsoft/codebert-base) | 125M | 38.28 | 43.47 | |
| | [UniXCoder](https://huggingface.co/microsoft/unixcoder-base) | 125M | 38.12 | 42.58 | |
| | [GraphCodeBERT](https://huggingface.co/microsoft/graphcodebert-base)| 125M | *39.38* | *44.21* | |
| | [RoBERTa](https://huggingface.co/FacebookAI/roberta-base) | 125M | 35.37 | 38.22 | |
| | [ALBERT](https://huggingface.co/albert/albert-base-v2) | 11.8M | 39.32 | 41.99 | |
| **Encoder-Decoder-based PLMs** | | | |
| [PLBART](https://huggingface.co/uclanlp/plbart-base) | 140M | 36.85 | 39.90 |
| [CodeT5](https://huggingface.co/Salesforce/codet5-base) | 220M | 32.66 | 35.41 |
| [CodeT5+](https://huggingface.co/Salesforce/codet5p-220m) | 220M | 37.91 | 41.96 |
| **Decoder-based PLMs (LLMs)** | | | |
| | [TinyLlama](https://huggingface.co/TinyLlama/TinyLlama_v1.1_math_code) | 1.03B | 37.05 | 40.05 | |
| | [DeepSeek-Coder](https://huggingface.co/deepseek-ai/deepseek-coder-1.3b-base) | 1.28B | **42.52** | **46.19** | |
| | [OpenCodeInterpreter](https://huggingface.co/m-a-p/OpenCodeInterpreter-DS-1.3B) | 1.35B | 38.16 | 41.76 | |
| | [phi-2](https://huggingface.co/microsoft/phi-2) | 2.78B | 37.92 | 41.57 | |
| [StarCoder2](https://huggingface.co/bigcode/starcoder2-3b) | 3.03B | 35.37 | 41.77 |
| | [CodeLlama](https://huggingface.co/codellama/CodeLlama-7b-hf) | 6.74B | 34.14 | 38.16 | |
| | [Magicoder](https://huggingface.co/ise-uiuc/Magicoder-S-DS-6.7B) | 6.74B | 39.14 | 42.49 | |
|
|
|
|
| ## Citing us |
| ```bibtex |
| @article{nam2024tesoro, |
| title={Improving the detection of technical debt in Java source code with an enriched dataset}, |
  author={Hai, Nam Le and Bui, Anh M. T. and Nguyen, Phuong T. and Ruscio, Davide Di and Kazman, Rick},
| journal={}, |
| year={2024} |
| } |
| ``` |