---
license: mit
tags:
- code-understanding
- unixcoder
pipeline_tag: feature-extraction
---

# RepoSim4Py

An embedding-based tool for comparing the semantic similarity of Python repositories, using several kinds of information from each repository.

## Model Details

**RepoSim4Py** is a pipeline hosted on the HuggingFace platform that generates embeddings for specified GitHub Python repositories.
For each Python repository, it generates embeddings at different levels based on the source code, code documentation, requirements, and README files within the repository.
Averaging the embeddings at each level gives one mean embedding per level, and these are combined into a repository-level embedding (`mean_repo_embedding`); in the example output in the Uses section, each per-level mean has shape `(1, 768)` and `mean_repo_embedding` has shape `(1, 3072)`.
These embeddings can be used to compute semantic similarities at different levels, for example by comparing two repositories with cosine similarity.

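As a rough illustration of the cosine-similarity comparison mentioned above, here is a minimal NumPy sketch. The function name `cosine_similarity` and the random stand-in arrays are assumptions made for this example and are not part of the RepoSim4Py pipeline. The zero-norm guard is included because an embedding can be all zeros, as the requirement embedding is in the example output in the Uses section.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embeddings of the same size."""
    a = np.asarray(a, dtype=np.float64).ravel()  # accepts (1, d) or (d,)
    b = np.asarray(b, dtype=np.float64).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # An all-zero embedding has no direction, so report 0.0
    # instead of dividing by zero.
    return float(a @ b / denom) if denom > 0 else 0.0


# Hypothetical stand-ins for two repositories' embeddings at one level.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(1, 768)).astype(np.float32)
emb_b = rng.normal(size=(1, 768)).astype(np.float32)

sim_ab = cosine_similarity(emb_a, emb_b)  # similarity of two different vectors
sim_aa = cosine_similarity(emb_a, emb_a)  # a vector compared with itself
```

With real pipeline output, `emb_a` and `emb_b` would be the same key (for example `mean_code_embedding`) taken from two repositories' result dictionaries.
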
### Model Description

The model used by **RepoSim4Py** is **UniXcoder**, fine-tuned on the [code search task](https://github.com/microsoft/CodeBERT/tree/master/UniXcoder/downstream-tasks/code-search) using the [AdvTest](https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/NL-code-search-Adv) dataset.

- **Pipeline developed by:** [Henry65](https://huggingface.co/Henry65)
- **Repository:** [RepoSim4Py](https://github.com/RepoMining/RepoSim4Py)
- **Model type:** **code understanding**
- **Language(s):** **Python**
- **License:** **MIT**

### Model Sources

- **Repository:** [UniXcoder](https://github.com/microsoft/CodeBERT/tree/master/UniXcoder)
- **Paper:** [UniXcoder: Unified Cross-Modal Pre-training for Code Representation](https://arxiv.org/pdf/2203.03850.pdf)

## Uses

Below is an example of how to use the RepoSim4Py pipeline to generate embeddings for GitHub Python repositories.

First, initialise the pipeline:
```python
from transformers import pipeline

# trust_remote_code=True lets transformers run the custom pipeline code
# shipped in the model repository.
model = pipeline(model="Henry65/RepoSim4Py", trust_remote_code=True)
```
Then pass one repository (or several repositories in a tuple) as input; the result is a list of dictionaries:
```python
repo_infos = model("lazyhope/python-hello-world")
print(repo_infos)
```
Output (long NumPy array outputs are omitted):
```python
[{'name': 'lazyhope/python-hello-world',
  'topics': [],
  'license': 'MIT',
  'stars': 0,
  'code_embeddings': array([[-2.07551336e+00, 2.81387949e+00, 2.35216689e+00, ...]], dtype=float32),
  'mean_code_embedding': array([[-2.07551336e+00, 2.81387949e+00, 2.35216689e+00, ...]], dtype=float32),
  'doc_embeddings': array([[-2.37494540e+00, 5.40957630e-01, 2.29580235e+00, ...]], dtype=float32),
  'mean_doc_embedding': array([[-2.37494540e+00, 5.40957630e-01, 2.29580235e+00, ...]], dtype=float32),
  'requirement_embeddings': array([[0., 0., 0., ...]], dtype=float32),
  'mean_requirement_embedding': array([[0., 0., 0., ...]], dtype=float32),
  'readme_embeddings': array([[-2.1671042 , 2.8404987 , 1.4761417 , ...]], dtype=float32),
  'mean_readme_embedding': array([[-1.91171765e+00, 1.65386486e+00, 9.49612021e-01, ...]], dtype=float32),
  'mean_repo_embedding': array([[-2.0755134, 2.8138795, 2.352167 , ...]], dtype=float32),
  'code_embeddings_shape': (1, 768),
  'mean_code_embedding_shape': (1, 768),
  'doc_embeddings_shape': (1, 768),
  'mean_doc_embedding_shape': (1, 768),
  'requirement_embeddings_shape': (1, 768),
  'mean_requirement_embedding_shape': (1, 768),
  'readme_embeddings_shape': (3, 768),
  'mean_readme_embedding_shape': (1, 768),
  'mean_repo_embedding_shape': (1, 3072)
}]
```
For more details, please refer to [Example.py](https://github.com/RepoMining/RepoSim4Py/blob/main/Script/Example.py). Note that `github_token` is not required.

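When several repositories are passed to the pipeline, their embeddings can be compared all at once. The sketch below builds a pairwise similarity matrix from a list of result dictionaries shaped like the output above. The helper `similarity_matrix` and the random `fake_infos` data are hypothetical and exist only for illustration; only the key name `mean_repo_embedding` and its `(1, 3072)` shape come from the example output.

```python
import numpy as np


def similarity_matrix(repo_infos, key="mean_repo_embedding"):
    """Pairwise cosine similarities between repositories for one embedding key."""
    # Stack each (1, d) embedding into one (n_repos, d) matrix.
    X = np.vstack([
        np.asarray(info[key], dtype=np.float64).reshape(1, -1)
        for info in repo_infos
    ])
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    # Normalise each row; all-zero rows stay zero instead of dividing by zero.
    X = np.divide(X, norms, out=np.zeros_like(X), where=norms > 0)
    # Dot products of unit-length rows are cosine similarities.
    return X @ X.T


# Hypothetical stand-in for pipeline output (random vectors, placeholder names).
rng = np.random.default_rng(0)
fake_infos = [
    {
        "name": f"owner/repo-{i}",
        "mean_repo_embedding": rng.normal(size=(1, 3072)).astype(np.float32),
    }
    for i in range(3)
]

sims = similarity_matrix(fake_infos)
names = [info["name"] for info in fake_infos]
```

Entry `sims[i, j]` is the similarity between `names[i]` and `names[j]`. Passing a different `key`, such as `mean_code_embedding`, compares repositories at that level instead.
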
## Training Details

Please follow the original [UniXcoder](https://github.com/microsoft/CodeBERT/tree/master/UniXcoder/downstream-tasks/code-search) page for details of fine-tuning it on the code search task.

## Evaluation

We used the [awesome-python](https://github.com/vinta/awesome-python) list, which contains over 400 Python repositories categorized by topic, to label similar repositories.
The evaluation metrics and results can be found in the RepoSim4Py repository, under the [Embedding](https://github.com/RepoMining/RepoSim4Py/tree/main/Embedding) folder.

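One way to turn a topic-categorized list into similarity labels is to mark two repositories as similar when they share a topic. The sketch below assumes that rule; the actual labelling procedure and metrics are those in the Embedding folder and may differ. The function `label_pairs`, the topic names, and the repository names are hypothetical placeholders and are not taken from awesome-python.

```python
from itertools import combinations


def label_pairs(topic_to_repos):
    """Return {(repo_a, repo_b): label}, where label is 1 if the pair shares a topic."""
    # Invert the mapping: repository -> set of topics it appears under.
    repo_topics = {}
    for topic, repos in topic_to_repos.items():
        for repo in repos:
            repo_topics.setdefault(repo, set()).add(topic)

    labels = {}
    # Visit every unordered pair once, in a stable (sorted) order.
    for a, b in combinations(sorted(repo_topics), 2):
        labels[(a, b)] = int(bool(repo_topics[a] & repo_topics[b]))
    return labels


# Hypothetical placeholder data, for illustration only.
toy_topics = {
    "topic-x": ["owner/repo-a", "owner/repo-b"],
    "topic-y": ["owner/repo-c"],
}
labels = label_pairs(toy_topics)
```

Labels of this form can then be compared against embedding similarities for the same pairs.
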
## Acknowledgements

Many thanks to the authors of the UniXcoder model and the AdvTest dataset, and to the maintainers of the awesome-python list for providing a useful baseline.
- **UniXcoder** (https://github.com/microsoft/CodeBERT/tree/master/UniXcoder)
- **AdvTest** (https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/NL-code-search-Adv)
- **awesome-python** (https://github.com/vinta/awesome-python)

## Authors

- **Honglin Zhang** (https://github.com/liaomu0926)
- **Rosa Filgueira** (https://www.rosafilgueira.com)