---
datasets:
- bigcode/pii-annotated-toloka-donwsample-emails
- bigcode/pseudo-labeled-python-data-pii-detection-filtered
metrics:
- f1
pipeline_tag: token-classification
language:
- code
extra_gated_prompt: >-
  ## Terms of Use for the model


  This is an NER model trained to detect Personally Identifiable Information
  (PII) in code datasets. We ask that you read and agree to the following Terms
  of Use before using the model:


  1. You agree that you will not use the model for any purpose other than PII
  detection for the purpose of removing PII from datasets.


  2. You agree that you will not share the model or any modified versions for
  whatever purpose.


  3. Unless required by applicable law or agreed to in writing, the model is
  provided on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND,
  either express or implied, including, without limitation, any warranties or
  conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
  PARTICULAR PURPOSE. You are solely responsible for determining the
  appropriateness of using the model, and assume any risks associated with your
  exercise of permissions under these Terms of Use.


  4. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
  DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
  OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE MODEL OR THE USE OR
  OTHER DEALINGS IN THE MODEL.
extra_gated_fields:
  Email: text
  I have read the License and agree with its terms: checkbox
---

# StarPII

## Model description

This is an NER model trained to detect Personally Identifiable Information (PII) in code datasets. We fine-tuned [bigcode-encoder](https://huggingface.co/bigcode/bigcode-encoder)
on a PII dataset we annotated, available with gated access at [bigcode-pii-dataset](https://huggingface.co/datasets/bigcode/pii-annotated-toloka-donwsample-emails) (see [bigcode-pii-dataset-training](https://huggingface.co/datasets/bigcode/bigcode-pii-dataset-training) for the exact data splits).
We added a linear layer as a token classification head on top of the encoder model, with 6 target classes: Names, Emails, Keys, Passwords, IP addresses, and Usernames.
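
For reference, inference can be run with the `transformers` token-classification pipeline. This is a minimal sketch: the model ID `bigcode/starpii` and the `aggregation_strategy` setting are assumptions, and since the model is gated you must accept the Terms of Use and authenticate with the Hub first.

```python
# Hedged usage sketch: assumes the model is published as "bigcode/starpii"
# and that you have accepted the gated-access terms (huggingface-cli login).
from transformers import pipeline

pii_detector = pipeline(
    "token-classification",
    model="bigcode/starpii",
    aggregation_strategy="simple",  # merge sub-word tokens into entity spans
)

code_snippet = 'smtp.login("alice@example.com", password)'
for entity in pii_detector(code_snippet):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```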

## Dataset

### Fine-tuning on the annotated dataset

The fine-tuning dataset contains 20,961 secrets across 31 programming languages, while the base encoder model was pre-trained on 88
programming languages from [The Stack](https://huggingface.co/datasets/bigcode/the-stack) dataset.

### Initial training on a pseudo-labelled dataset

To enhance model performance on rare PII entities such as keys, we first trained on a pseudo-labeled dataset before fine-tuning on the annotated dataset.
The method involves training a model on a small set of labeled data and subsequently generating predictions for a larger set of unlabeled data.

Specifically, we annotated 18,000 files, available at [bigcode-pii-pseudo-labeled](https://huggingface.co/datasets/bigcode/pseudo-labeled-python-data-pii-detection-filtered),
using an ensemble of two encoder models, [Deberta-v3-large](https://huggingface.co/microsoft/deberta-v3-large) and [stanford-deidentifier-base](https://huggingface.co/StanfordAIMI/stanford-deidentifier-base),
which were fine-tuned on an internal, previously labeled PII [dataset](https://huggingface.co/datasets/bigcode/pii-for-code) for code with 400 files from this [work](https://arxiv.org/abs/2301.03988).
To select good-quality pseudo-labels, we averaged the probability logits of the two models and filtered out predictions below a minimum score.
After inspection, we observed a high rate of false positives for Keys and Passwords, so we retained only the entities whose surrounding context contained a trigger word such as `key`, `auth`, or `pwd`.
Training on this synthetic dataset prior to fine-tuning on the annotated one yielded superior results for all PII categories,
as demonstrated in the table in the following section.
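
The filtering described above can be sketched as follows. This is a simplified, self-contained illustration, not the exact training code: the function name, the label set, and the `MIN_SCORE` threshold are all assumptions.

```python
# Illustrative sketch of pseudo-label filtering: average two models'
# per-token class probabilities, keep only confident predictions, and drop
# KEY/PASSWORD hits when no trigger word appears in the surrounding context.
TRIGGER_WORDS = ("key", "auth", "pwd")
MIN_SCORE = 0.9  # hypothetical confidence cut-off

def filter_pseudo_labels(tokens, probs_a, probs_b, labels, context):
    """Return (token, label) pairs that survive score and trigger filtering."""
    kept = []
    has_trigger = any(w in context.lower() for w in TRIGGER_WORDS)
    for tok, pa, pb in zip(tokens, probs_a, probs_b):
        avg = [(x + y) / 2 for x, y in zip(pa, pb)]  # ensemble average
        score = max(avg)
        label = labels[avg.index(score)]
        if label == "O" or score < MIN_SCORE:
            continue  # not an entity, or not confident enough
        if label in ("KEY", "PASSWORD") and not has_trigger:
            continue  # likely false positive without a trigger word nearby
        kept.append((tok, label))
    return kept

labels = ["O", "KEY", "PASSWORD"]
tokens = ["token", "=", "a9f3k2x8"]
probs_a = [[0.90, 0.05, 0.05], [0.95, 0.03, 0.02], [0.05, 0.90, 0.05]]
probs_b = [[0.80, 0.10, 0.10], [0.90, 0.05, 0.05], [0.03, 0.95, 0.02]]
print(filter_pseudo_labels(tokens, probs_a, probs_b, labels, "api_key = a9f3k2x8"))
# → [('a9f3k2x8', 'KEY')]
```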

### Performance

This model is represented in the last row of each table (+ pseudo labels).

- Emails, IP addresses and Keys

| Method | Email Prec. | Email Recall | Email F1 | IP Prec. | IP Recall | IP F1 | Key Prec. | Key Recall | Key F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Regex | 69.8% | 98.8% | 81.8% | 65.9% | 78% | 71.7% | 2.8% | 46.9% | 5.3% |
| NER | 94.01% | 98.10% | 96.01% | 88.95% | *94.43%* | 91.61% | 60.37% | 53.38% | 56.66% |
| + pseudo labels | **97.73%** | **98.94%** | **98.15%** | **90.10%** | 93.86% | **91.94%** | **62.38%** | **80.81%** | **70.41%** |

- Names, Usernames and Passwords

| Method | Name Prec. | Name Recall | Name F1 | Username Prec. | Username Recall | Username F1 | Password Prec. | Password Recall | Password F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NER | 83.66% | 95.52% | 89.19% | 48.93% | *75.55%* | 59.39% | 59.16% | *96.62%* | 73.39% |
| + pseudo labels | **86.45%** | **97.38%** | **91.59%** | **52.20%** | 74.81% | **61.49%** | **70.94%** | 95.96% | **81.57%** |

We used this model to mask PII in the BigCode large model training. We dropped Usernames since they resulted in many false positives and negatives.
For the other PII types, we added the following post-processing steps, which we recommend for future uses of the model (the code is also available on GitHub):

- Ignore secrets with less than 4 characters.
- Detect full names only.
- Ignore detected keys with less than 9 characters or that are not gibberish, using a [gibberish-detector](https://github.com/domanchi/gibberish-detector).
- Ignore IP addresses that aren't valid or that are private (non-internet-facing), using the `ipaddress` Python package. We also ignore IP addresses from popular DNS servers,
using the same list as in this [paper](https://huggingface.co/bigcode/santacoder).
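
The rules above can be sketched with the standard library alone. This is a hedged illustration, not the exact post-processing code: the function names are made up, the DNS-server list shown is a small illustrative subset, and the gibberish check is omitted since it depends on the external gibberish-detector package and a trained model.

```python
# Sketch of the post-processing filters: length cut-offs for secrets and keys,
# and validity/privacy checks for IP addresses via the stdlib ipaddress module.
import ipaddress

POPULAR_DNS = {"8.8.8.8", "8.8.4.4", "1.1.1.1"}  # illustrative subset

def keep_secret(text: str) -> bool:
    """Rule: ignore secrets with less than 4 characters."""
    return len(text) >= 4

def keep_key(text: str) -> bool:
    """Rule: ignore keys with less than 9 characters (gibberish check omitted)."""
    return len(text) >= 9

def keep_ip(text: str) -> bool:
    """Rule: keep only valid, public IPs that are not popular DNS servers."""
    try:
        ip = ipaddress.ip_address(text)
    except ValueError:
        return False  # not a valid IP address at all
    return ip.is_global and text not in POPULAR_DNS

print(keep_ip("192.168.0.1"))    # private address -> False
print(keep_ip("93.184.216.34"))  # public address  -> True
```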

# Considerations for Using the Model

While using this model, please be aware that there may be potential risks associated with its application.
There is a possibility of false positives and negatives, which could lead to unintended consequences when processing sensitive data.
Moreover, the model's performance may vary across different data types and programming languages, necessitating validation and fine-tuning for specific use cases.
Researchers and developers are expected to uphold ethical standards and data protection measures when using the model. By making it openly accessible,
our aim is to encourage the development of privacy-preserving AI technologies while remaining vigilant of potential risks associated with PII.