---
license: apache-2.0
widget:
- text: "<|endoftext|>\ndef load_excel(path):\n    return pd.read_excel(path)\n# docstring\n\"\"\""
---

## Basic info

This model is based on [Salesforce/codegen-350M-mono](https://huggingface.co/Salesforce/codegen-350M-mono).

It was fine-tuned on the [codeparrot/github-code-clean](https://huggingface.co/datasets/codeparrot/github-code-clean) dataset, filtered to Python code.
|
|
## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = 'kdf/python-docstring-generation'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer('''<|endoftext|>
def load_excel(path):
    return pd.read_excel(path)

# docstring
"""''', return_tensors='pt')

doc_max_length = 128

generated_ids = model.generate(
    **inputs,
    max_length=inputs.input_ids.shape[1] + doc_max_length,
    do_sample=False,
    return_dict_in_generate=True,
    num_return_sequences=1,
    output_scores=True,
    pad_token_id=50256,
    eos_token_id=50256  # <|endoftext|>
)

ret = tokenizer.decode(generated_ids.sequences[0], skip_special_tokens=False)
print(ret)
```
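The decoded output contains the prompt followed by the generated docstring. Below is a minimal sketch of a helper to pull just the docstring text out of the decoded string; `extract_docstring` is a hypothetical utility written for this card, not part of the model's API, and it assumes the prompt ends with the opening triple quote after the `# docstring` marker, as in the example above.

```python
def extract_docstring(decoded: str) -> str:
    """Return the docstring text the model generated after the prompt.

    Hypothetical helper: assumes the prompt ends with the opening
    triple quote that follows the '# docstring' marker.
    """
    # Everything after the last opening triple quote is model output.
    tail = decoded.rsplit('# docstring\n"""', 1)[-1]
    # The model usually closes the docstring and/or emits <|endoftext|>;
    # cut at each terminator if present.
    for stop in ('"""', '<|endoftext|>'):
        idx = tail.find(stop)
        if idx != -1:
            tail = tail[:idx]
    return tail.strip()

# Example with a hand-written stand-in for a decoded model output:
sample = ('<|endoftext|>\ndef load_excel(path):\n    return pd.read_excel(path)\n\n'
          '# docstring\n"""\nLoad an Excel file into a DataFrame.\n"""<|endoftext|>')
print(extract_docstring(sample))  # Load an Excel file into a DataFrame.
```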
|
|
## Prompt

You can steer the model toward a particular docstring style or natural language by prepending a worked example to the prompt. The first prompt below elicits a Google-style English docstring; the second elicits a Chinese one:
|
|
```python
inputs = tokenizer('''<|endoftext|>
def add(a, b):
    return a + b

# docstring
"""
Calculate numbers add.

Args:
    a: the first number to add
    b: the second number to add

Return:
    The result of a + b
"""
<|endoftext|>
def load_excel(path):
    return pd.read_excel(path)

# docstring
"""''', return_tensors='pt')

doc_max_length = 128

generated_ids = model.generate(
    **inputs,
    max_length=inputs.input_ids.shape[1] + doc_max_length,
    do_sample=False,
    return_dict_in_generate=True,
    num_return_sequences=1,
    output_scores=True,
    pad_token_id=50256,
    eos_token_id=50256  # <|endoftext|>
)

ret = tokenizer.decode(generated_ids.sequences[0], skip_special_tokens=False)
print(ret)

inputs = tokenizer('''<|endoftext|>
def add(a, b):
    return a + b

# docstring
"""
计算数字相加

Args:
    a: 第一个加数
    b: 第二个加数

Return:
    相加的结果
"""
<|endoftext|>
def load_excel(path):
    return pd.read_excel(path)

# docstring
"""''', return_tensors='pt')

doc_max_length = 128

generated_ids = model.generate(
    **inputs,
    max_length=inputs.input_ids.shape[1] + doc_max_length,
    do_sample=False,
    return_dict_in_generate=True,
    num_return_sequences=1,
    output_scores=True,
    pad_token_id=50256,
    eos_token_id=50256  # <|endoftext|>
)

ret = tokenizer.decode(generated_ids.sequences[0], skip_special_tokens=False)
print(ret)
```
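The few-shot pattern used above can also be assembled programmatically. Here is a small sketch under the assumption that the model expects `<|endoftext|>`-separated blocks each ending in a `# docstring` marker; the `build_prompt` helper is hypothetical and not part of this repository.

```python
def build_prompt(examples, target_code):
    """Build a few-shot docstring prompt (hypothetical helper).

    examples: list of (code, docstring) demonstration pairs.
    target_code: the function to document; the prompt ends with an
    opening triple quote so the model continues with the docstring.
    """
    parts = []
    for code, doc in examples:
        # Each demonstration is a complete code + docstring block.
        parts.append(f'<|endoftext|>\n{code}\n\n# docstring\n"""\n{doc}\n"""')
    # The target block is left open for the model to complete.
    parts.append(f'<|endoftext|>\n{target_code}\n\n# docstring\n"""')
    return '\n'.join(parts)

prompt = build_prompt(
    [('def add(a, b):\n    return a + b',
      'Calculate the sum of two numbers.')],
    'def load_excel(path):\n    return pd.read_excel(path)',
)
# `prompt` can then be passed to tokenizer(..., return_tensors='pt') as above.
```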