---
base_model: Salesforce/codegen-350M-mono
library_name: peft
license: mit
datasets:
- google/code_x_glue_ct_code_to_text
language:
- en
pipeline_tag: text-generation
---

# CodeGen-ft-python

<!-- Provide a quick summary of what the model is/does. -->

Generates Python code from natural language prompts.

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->

This model is a fine-tuned variant of Salesforce/codegen-350M-mono, specialized for natural language to code generation in Python. It takes natural language instructions (e.g., “check MySQL database connection”) and generates the corresponding Python code snippet. The model was trained on a curated text-to-code dataset containing diverse programming instructions and function-level examples to improve semantic and syntactic accuracy.

- **Developed by:** Akshay Bharadwaj
- **Model type:** Transformer-based Causal Language Model
- **Language(s) (NLP):** English (prompts) and Python (code outputs)
- **License:** MIT License
- **Finetuned from model:** Salesforce/codegen-350M-mono

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

### Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

The model can be used for:

* Translating natural language prompts into functional Python code.
* Assisting in code autocompletion or boilerplate generation.
* Supporting educational and prototyping environments.

### Downstream Use

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

The model can be integrated into (see the sketch after this list):

* Developer tools (IDE plugins or assistants).
* Chatbots for code assistance or educational coding tutors.
* LLM pipelines for multi-step reasoning or coding workflows.
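
As a toy example of such an integration, here is a minimal sketch of a helper that an IDE plugin or coding chatbot could call. It assumes the repository loads directly through the `transformers` text-generation pipeline; `suggest_code` is a hypothetical helper name, not part of this repository.

```python
from transformers import pipeline

# Hedged sketch: wrap generation in a helper an IDE plugin or chatbot could call.
# Assumes the Hub repository can be loaded directly by the text-generation pipeline.
generator = pipeline("text-generation", model="akshayb/nl-code-gen-python")

def suggest_code(instruction: str, max_new_tokens: int = 128) -> str:
    # Greedy decoding keeps suggestions deterministic for tooling use.
    result = generator(instruction, max_new_tokens=max_new_tokens, do_sample=False)
    return result[0]["generated_text"]

print(suggest_code("write a python function to reverse a string"))
```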

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

* Generating production-level code without human review.
* Security-critical or real-time applications (e.g., code execution automation).
* Generation of malicious or unsafe code.

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

* The model may produce incomplete or syntactically incorrect code for ambiguous prompts.
* It can misinterpret vague natural language queries (semantic drift).
* Potential bias toward common Python idioms and limited handling of rare libraries or APIs.

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub.
model_id = "akshayb/nl-code-gen-python"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Describe the desired function in natural language and generate Python code.
prompt = "write a python function to check mysql database connection"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
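
Because this checkpoint was trained with PEFT (see the framework version at the end of this card), the adapter can also be loaded explicitly on top of the base model. A minimal sketch, assuming the standard `peft` adapter-loading workflow:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load the base CodeGen checkpoint, then attach the fine-tuned adapter weights.
base_id = "Salesforce/codegen-350M-mono"
adapter_id = "akshayb/nl-code-gen-python"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(base_id)
model = PeftModel.from_pretrained(base_model, adapter_id)
```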

## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

The dataset contains paired natural language descriptions and Python function implementations, collected and cleaned from public code repositories and text-to-code benchmarks (e.g., CodeXGLUE). Preprocessing involved deduplication, tokenization, and removal of incomplete code samples.
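
To illustrate the filtering step, here is a minimal sketch (not the exact preprocessing script; the field names `"docstring"` and `"code"` follow the CodeXGLUE code-to-text layout and are assumptions here):

```python
import ast

def is_complete_python(code: str) -> bool:
    # Keep only samples that parse into a valid Python AST.
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def clean_samples(samples):
    # Deduplicate (description, code) pairs and drop incomplete snippets.
    seen, cleaned = set(), []
    for sample in samples:
        text, code = sample["docstring"].strip(), sample["code"].strip()
        key = (text, code)
        if key in seen or not is_complete_python(code):
            continue
        seen.add(key)
        cleaned.append({"docstring": text, "code": code})
    return cleaned
```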

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

### Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

To compare the base model with the fine-tuned model, we use the following metrics (a small computation sketch follows the table):

| Metric | Focus | Strength |
| ---------------- | ------------------------------ | ----------------------------------------- |
| **BLEU** | Token-level similarity | Measures fluency and lexical accuracy |
| **CodeBLEU** | Lexical + syntactic + semantic | Captures holistic code quality |
| **Exact Match** | String equality | Strict correctness measure |
| **Syntax Match** | AST structure | Validates syntactic and logical integrity |
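
As an illustration, BLEU and Exact Match can be computed as below (a minimal sketch using the `evaluate` library; CodeBLEU and Syntax Match are assumed to come from their reference implementations, e.g., the CodeXGLUE evaluation scripts):

```python
import evaluate  # pip install evaluate

predictions = ["def add(a, b):\n    return a + b"]
references = [["def add(a, b):\n    return a + b"]]  # one reference list per prediction

# Corpus-level BLEU over the generated snippets.
bleu = evaluate.load("bleu")
print(bleu.compute(predictions=predictions, references=references)["bleu"])

# Exact Match: strict string equality after trimming whitespace.
em = sum(p.strip() == r[0].strip() for p, r in zip(predictions, references)) / len(predictions)
print(f"exact match: {em:.2f}")
```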

## Citation

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

```bibtex
@misc{akshay2025nlcodegen,
  title={Natural Language to Code Generation (Fine-tuned CodeGen-350M)},
  author={Akshay Bharadwaj},
  year={2025},
  howpublished={\url{https://huggingface.co/akshayb/nl-code-gen-python}}
}
```

### Framework versions

- PEFT 0.7.2.dev0