	---
	license: apache-2.0
	---

	## CodeShell

	CodeShell is a large-scale pretrained code language model base, jointly developed by the [Knowledge Computing Lab of Peking University](http://se.pku.edu.cn/kcl/) and 蚌壳智能科技.

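	Key features of CodeShell:

	* Strong performance: a 7B-parameter code base model that outperforms the strongest base models of comparable size (such as CodeLlama-7B)
	* Efficient training: built on an efficient data governance pipeline and trained from a cold start on 500B of high-quality data
	* Complete stack: the model and the IDE plugins are open-sourced together as a full technology stack
	* Lightweight and fast: supports local C++ deployment, giving a lightweight on-device option
	* Comprehensive evaluation: a multi-task code evaluation suite that supports full project context (to be open-sourced soon)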

	This release open-sources the following models and tools:

	- CodeShell Base
	- CodeShell Chat
	- CodeShell Chat 4bit
	- C/C++ local deployment tool
	- VS Code plugin
	- JetBrains plugin


	## Model Use

	### Code Generation

	CodeShell is released in Hugging Face format, so developers can load and use it with the following code.

	```python
	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer
	tokenizer = AutoTokenizer.from_pretrained("codeshell", trust_remote_code=True)
	model = AutoModelForCausalLM.from_pretrained("codeshell", trust_remote_code=True).cuda()
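	# The tokenizer returns a BatchEncoding, which has no .cuda(); move it with .to()
	inputs = tokenizer('def print_hello_world():', return_tensors='pt').to(model.device)
	# Unpack input_ids and attention_mask as keyword arguments for generate()
	outputs = model.generate(**inputs)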
	print(tokenizer.decode(outputs[0]))
	```

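	### Fill in the Middle

	CodeShell supports Fill-in-the-Middle mode, in which the model generates the code that belongs between a given prefix and suffix. This fits editing in the middle of an existing file better than left-to-right completion does.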

	```python
	# Reuses `tokenizer` and `model` from the Code Generation example above
	input_text = "<fim_prefix>def print_hello_world():\n    <fim_suffix>\n    print('Hello world!')<fim_middle>"
	inputs = tokenizer(input_text, return_tensors='pt').to(model.device)
	outputs = model.generate(**inputs)
	print(tokenizer.decode(outputs[0]))
	```
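
The prompt layout used above can be assembled programmatically. The following is a minimal sketch, not part of the CodeShell codebase: `build_fim_prompt` and `extract_middle` are hypothetical helper names, and the sketch assumes the decoded output still contains the `<fim_middle>` sentinel (which is the case when special tokens are not skipped during decoding).

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Wrap the code before and after the gap in the FIM sentinel tokens."""
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"


def extract_middle(decoded: str) -> str:
    """Return the text that follows the <fim_middle> sentinel.

    If the sentinel is absent, an empty string is returned.
    Any end-of-text token the model emits is left untouched.
    """
    _, _, middle = decoded.partition("<fim_middle>")
    return middle


prompt = build_fim_prompt(
    prefix="def print_hello_world():\n    ",
    suffix="\n    print('Hello world!')",
)
print(prompt)
```

Passing `prompt` to the tokenizer in place of `input_text` gives the same request as the example above; `extract_middle(tokenizer.decode(outputs[0]))` would then isolate the generated infill.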

	## Model Quantization

	CodeShell supports 4-bit and 8-bit quantization. After 4-bit quantization, the model occupies about 6 GB of GPU memory.

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	tokenizer = AutoTokenizer.from_pretrained("codeshell", trust_remote_code=True)
	model = AutoModelForCausalLM.from_pretrained("codeshell", trust_remote_code=True)
	# quantize() comes from the model's remote code; 4 selects 4-bit quantization
	model = model.quantize(4).cuda()

	# The tokenizer returns a BatchEncoding, which has no .cuda(); move it with .to()
	inputs = tokenizer('def print_hello_world():', return_tensors='pt').to(model.device)
	outputs = model.generate(**inputs)
	print(tokenizer.decode(outputs[0]))
	```
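
For a rough sense of where the memory goes, the weight-only footprint at each precision can be estimated from the parameter count. This is a back-of-the-envelope sketch under stated assumptions: it uses the nominal 7B parameter count and ignores activations, the KV cache, quantization metadata, and any layers kept in higher precision. `weight_memory_gb` is a hypothetical helper, not a CodeShell API.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Weight-only memory in GB (1 GB = 10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 7e9  # assumption: the nominal "7B" parameter count

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions the 4-bit weights alone come to about 3.5 GB, so the roughly 6 GB reported above presumably also covers runtime overhead that this estimate leaves out; the README does not give a breakdown.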

	## CodeShell IDE Plugin

	### Web API

	CodeShell ships a Web API deployment tool that serves as the backend for the IDE plugins.

	```
	git clone git@github.com:WisdomShell/codeshell.git
	cd codeshell
	python api.py
	```

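	CodeShell also provides a C/C++ inference implementation that runs efficiently on personal computers without a GPU. Developers can compile it for their local environment; see the [C/C++ local deployment tool]() for details. Once it is compiled, start the Web API service with the following command.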

	```
	./server -m codeshell.gguf
	```

	Once the service is deployed, developers can run model inference through the Web API:

	```
	curl --location 'http://127.0.0.1:8080/completion' --header 'Content-Type: application/json' --data '{"messages": {"content": "Write hello world in Python"}, "temperature": 0.2, "stream": true}'
	```
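
The same request can be issued from Python. The sketch below mirrors the endpoint and the payload fields of the curl command using only the standard library. `build_completion_request` is a hypothetical helper, `stream` defaults to `False` here for simplicity, and the response format is not documented in this README, so the sending step is left commented out.

```python
import json
import urllib.request


def build_completion_request(prompt, base_url="http://127.0.0.1:8080",
                             temperature=0.2, stream=False):
    """Build (but do not send) a POST request mirroring the curl example."""
    payload = {
        "messages": {"content": prompt},
        "temperature": temperature,
        "stream": stream,
    }
    return urllib.request.Request(
        url=f"{base_url}/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_completion_request("Write hello world in Python")
print(request.full_url)
print(request.data.decode("utf-8"))

# To send it, the Web API service must already be running:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```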

	### VS Code Plugin

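	CodeShell provides a [VS Code plugin]() that developers can use for code completion, code Q&A, and similar tasks. The plugin is also open source; questions about it are welcome in the [VS Code plugin repository]().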

	## Model Details

	- Model architecture
	    - Architecture: GPT-2
	    - Attention: Grouped-Query Attention with Flash Attention 2
	    - Position embedding: Rotary Position Embedding ("RoFormer: Enhanced Transformer with Rotary Position Embedding")
	    - Precision: bfloat16
	- Hyperparameters
	    - n_layer: 42
	    - n_embd: 4096
	    - n_inner: 16384
	    - n_head: 32
	    - num_query_groups: 8
	    - seq-length: 8192
	    - vocab_size: 70144

	CodeShell uses GPT-2 as its base architecture and adds techniques such as Grouped-Query Attention and RoPE relative position encoding.
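
As a sanity check on the "7B" label, the hyperparameters listed above can be turned into an approximate parameter count. This is a rough sketch under several assumptions that the README does not confirm: biases and layer-norm weights are ignored, the MLP has two projections as in GPT-2, and the input and output embeddings share one matrix. `estimate_params` is a hypothetical helper.

```python
def estimate_params(n_layer, n_embd, n_inner, n_head, num_query_groups, vocab_size):
    """Approximate weight count of a GPT-2-style decoder with grouped-query attention."""
    head_dim = n_embd // n_head
    # With grouped-query attention, keys and values use only num_query_groups heads
    kv_dim = num_query_groups * head_dim
    attention = (
        n_embd * n_embd        # query projection
        + 2 * n_embd * kv_dim  # key and value projections
        + n_embd * n_embd      # output projection
    )
    mlp = 2 * n_embd * n_inner  # up and down projections
    embedding = vocab_size * n_embd  # assumed shared with the output head
    return n_layer * (attention + mlp) + embedding


HPARAMS = dict(n_layer=42, n_embd=4096, n_inner=16384,
               n_head=32, num_query_groups=8, vocab_size=70144)

total = estimate_params(**HPARAMS)
print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
# prints "7,686,062,080 parameters (~7.69B)"
```

With 8 query groups instead of 32 full key/value heads, the key and value projections are a quarter of the size they would have under standard multi-head attention, which also shrinks the KV cache at the 8192-token sequence length.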

	## Evaluation

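	We evaluated the model on two of the most popular code evaluation datasets. Compared with CodeLlama and StarCoder, two of the most advanced 7B code models available, CodeShell achieves the best results. Detailed results are below.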

	### Pass@1
	| Task | codeshell-7B | codellama-7B | starcoderbase-7B |
	| ------- | --------- | --------- | --------- |
	| humaneval | 33.48 | 29.44 | 27.80 |
	| mbpp | 39.08 | 37.60 | 34.16 |
	| multiple-java | 29.56 | 29.24 | 24.30 |
	| multiple-js | 33.60 | 31.30 | 27.02 |
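
For readers unfamiliar with the metric: Pass@k is the probability that at least one of k sampled completions for a problem passes its unit tests. The README does not state how many samples were drawn per problem or which estimator was used, so the sketch below is only an illustration using the unbiased estimator commonly applied to HumanEval-style benchmarks. The per-problem counts are made-up illustration data, not CodeShell results.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a correct one
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem (samples, correct) counts, for illustration only
example_results = [(10, 3), (10, 0), (10, 10)]
print(f"pass@1 = {100 * mean_pass_at_k(example_results, k=1):.2f}")
```

For k = 1 the estimator reduces to the fraction of correct samples, c / n, averaged over problems.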


	## License

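	The models open-sourced in this repository are released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0) and are fully open for academic research. For commercial use, developers must apply by email and may use the models only after receiving written authorization. Contact: [wye@pku.edu.cn](mailto:wye@pku.edu.cn)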