---
base_model: Salesforce/codegen-350M-mono
library_name: peft
license: mit
datasets:
- google/code_x_glue_ct_code_to_text
language:
- en
pipeline_tag: text-generation
---
# CodeGen-ft-python
<!-- Provide a quick summary of what the model is/does. -->
Generates Python code from natural language prompts.
## Model Details
### Model Description
<!-- Provide a longer summary of what this model is. -->
This model is a fine-tuned variant of Salesforce/codegen-350M-mono,
specialized for natural language to code generation in Python.
It takes natural language instructions (e.g., “check MySQL database connection”)
and generates the corresponding Python code snippet.
The model was trained on a curated text-to-code dataset containing diverse
programming instructions and function-level examples to improve semantic and syntactic accuracy.
- **Developed by:** Akshay Bharadwaj
- **Model type:** Transformer-based Causal Language Model
- **Language(s) (NLP):** English (Prompts) and Python (Code Outputs)
- **License:** MIT License
- **Finetuned from model:** Salesforce/codegen-350M-mono
## Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
### Direct Use
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
The model can be used for:
* Translating natural language prompts into functional Python code.
* Assisting in code autocompletion or boilerplate generation.
* Supporting educational and prototyping environments.
### Downstream Use
<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
The model can be integrated into (see the sketch after the list):
* Developer tools (IDE plugins or assistants).
* Chatbots for code assistance or educational coding tutors.
* LLM pipelines for multi-step reasoning or coding workflows.
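As a minimal integration sketch (assuming the repository can be loaded directly through `transformers` with `peft` installed), the model can be wrapped in a text-generation pipeline and exposed as a single helper function:

```python
from transformers import pipeline

# Minimal integration sketch: wrap the model in a text-generation pipeline
# so a downstream tool (IDE plugin, chatbot, agent step) can call one function.
# Assumes the adapter repo loads directly via transformers with peft installed.
generator = pipeline(
    "text-generation",
    model="akshayb/nl-code-gen-python",
)

def suggest_code(instruction: str) -> str:
    """Return a generated Python snippet for a natural language instruction."""
    result = generator(instruction, max_new_tokens=256, do_sample=False)
    return result[0]["generated_text"]

print(suggest_code("write a python function to check mysql database connection"))
```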
### Out-of-Scope Use
<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
* Generating production-level code without human review.
* Security-critical or real-time applications (e.g., code execution automation).
* Generation of malicious or unsafe code.
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
* The model may produce incomplete or syntactically incorrect code for ambiguous prompts.
* It can misinterpret vague natural language queries (semantic drift).
* Potential bias toward common Python idioms and limited handling of rare libraries or APIs.
## How to Get Started with the Model
Use the code below to get started with the model.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = "akshayb/nl-code-gen-python"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
prompt = "write a python function to check mysql database connection"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
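Because this repository is a PEFT adapter (`library_name: peft`), the adapter can also be attached to the base model explicitly. The sketch below assumes the repo hosts only the adapter weights:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

base_id = "Salesforce/codegen-350M-mono"
adapter_id = "akshayb/nl-code-gen-python"

# Load the base model, then apply the fine-tuned PEFT adapter on top of it.
tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(base_id)
model = PeftModel.from_pretrained(base_model, adapter_id)

prompt = "write a python function to check mysql database connection"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```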
## Training Details
### Training Data
<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
The dataset contains paired natural language descriptions
and Python function implementations, collected and cleaned
from public code repositories and text-to-code benchmarks (e.g., CodeXGLUE).
Preprocessing involved deduplication, tokenization, and removal of incomplete code samples.
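As an illustration of this kind of preprocessing (not the exact curation pipeline; field names follow the CodeXGLUE code-to-text schema), the Python split can be loaded and filtered roughly as follows:

```python
from datasets import load_dataset

# Load the Python subset of the CodeXGLUE code-to-text benchmark.
ds = load_dataset("google/code_x_glue_ct_code_to_text", "python", split="train")

def is_complete(example):
    # Keep only pairs with a non-empty description and a non-empty function body.
    return bool(example["docstring"].strip()) and bool(example["code"].strip())

ds = ds.filter(is_complete)

# Simple deduplication on the raw code string.
seen = set()
def first_occurrence(example):
    key = example["code"]
    if key in seen:
        return False
    seen.add(key)
    return True

ds = ds.filter(first_occurrence)
print(ds)
```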
## Evaluation
<!-- This section describes the evaluation protocols and provides the results. -->
### Metrics
<!-- These are the evaluation metrics being used, ideally with a description of why. -->
To compare the base model and the fine-tuned model, we use the following metrics:
| Metric | Focus | Strength |
| ---------------- | ------------------------------ | ----------------------------------------- |
| **BLEU** | Token-level similarity | Measures fluency and lexical accuracy |
| **CodeBLEU** | Lexical + syntactic + semantic | Captures holistic code quality |
| **Exact Match** | String equality | Strict correctness measure |
| **Syntax Match** | AST structure | Validates syntactic and logical integrity |
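BLEU and Exact Match can be computed with the Hugging Face `evaluate` library, as sketched below; CodeBLEU and Syntax Match are typically computed with the CodeXGLUE/CodeBLEU evaluation scripts and are not shown here.

```python
import evaluate

bleu = evaluate.load("bleu")
exact_match = evaluate.load("exact_match")

# Toy example: compare a generated snippet against a reference implementation.
predictions = ["def add(a, b):\n    return a + b"]
references = ["def add(a, b):\n    return a + b"]

print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
print(exact_match.compute(predictions=predictions, references=references))
```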
## Citation
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
**BibTeX:**
```bibtex
@misc{akshay2025nlcodegen,
title={Natural Language to Code Generation (Fine-tuned CodeGen-350M)},
author={Akshay Bharadwaj},
year={2025},
howpublished={\url{https://huggingface.co/akshayb/nl-code-gen-python}}
}
```
### Framework versions
- PEFT 0.7.2.dev0