ctrltokyo committed on
Commit
18981cb
·
verified ·
1 Parent(s): b723c73

Add model card

  1. README.md +87 -510
README.md CHANGED
@@ -1,554 +1,131 @@
1
  ---
2
  tags:
3
  - ColBERT
4
  - PyLate
5
  - sentence-transformers
6
- - sentence-similarity
7
- - feature-extraction
8
- - generated_from_trainer
9
- - dataset_size:9959
10
- - loss:CachedContrastive
11
  pipeline_tag: sentence-similarity
12
- library_name: PyLate
13
  ---
14
 
15
- # PyLate
16
 
17
- This is a trained [PyLate](https://github.com/lightonai/pylate) model. It maps sentences and paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity using the MaxSim operator.
18
 
19
- ## Model Details
20
 
21
- ### Model Description
22
- - **Model Type:** PyLate model
23
- <!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
24
- - **Document Length:** 512 tokens
25
- - **Query Length:** 128 tokens
26
- - **Output Dimensionality:** 128 dimensions
27
- - **Similarity Function:** MaxSim
28
- <!-- - **Training Dataset:** Unknown -->
29
- <!-- - **Language:** Unknown -->
30
- <!-- - **License:** Unknown -->
31
 
32
- ### Model Sources
33
 
34
- - **Documentation:** [PyLate Documentation](https://lightonai.github.io/pylate/)
35
- - **Repository:** [PyLate on GitHub](https://github.com/lightonai/pylate)
36
- - **Hugging Face:** [PyLate models on Hugging Face](https://huggingface.co/models?library=PyLate)
 
37
 
38
- ### Full Model Architecture
39
-
40
- ```
41
- ColBERT(
42
- (0): Transformer({'max_seq_length': 127, 'do_lower_case': False}) with Transformer model: ModernBertModel
43
- (1): Dense({'in_features': 768, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
44
- )
45
- ```
46
 
47
- ## Usage
48
- First install the PyLate library:
49
 
50
- ```bash
51
- pip install -U pylate
52
- ```
53
 
54
- ### Retrieval
55
 
56
- PyLate provides a streamlined interface to index and retrieve documents using ColBERT models. The index leverages the Voyager HNSW index to efficiently handle document embeddings and enable fast retrieval.
 
 
 
57
 
58
- #### Indexing documents
 
 
 
 
59
 
60
- First, load the ColBERT model and initialize the Voyager index, then encode and index your documents:
61
 
62
  ```python
63
- from pylate import indexes, models, retrieve
64
-
65
- # Step 1: Load the ColBERT model
66
- model = models.ColBERT(
67
- model_name_or_path=pylate_model_id,
68
- )
69
-
70
- # Step 2: Initialize the Voyager index
71
- index = indexes.Voyager(
72
- index_folder="pylate-index",
73
- index_name="index",
74
- override=True, # This overwrites the existing index if any
75
- )
76
-
77
- # Step 3: Encode the documents
78
- documents_ids = ["1", "2", "3"]
79
- documents = ["document 1 text", "document 2 text", "document 3 text"]
80
-
81
- documents_embeddings = model.encode(
82
- documents,
83
- batch_size=32,
84
- is_query=False, # Ensure that it is set to False to indicate that these are documents, not queries
85
- show_progress_bar=True,
86
- )
87
-
88
- # Step 4: Add document embeddings to the index by providing embeddings and corresponding ids
89
- index.add_documents(
90
- documents_ids=documents_ids,
91
- documents_embeddings=documents_embeddings,
92
- )
93
  ```
94
 
95
- Note that you do not have to recreate the index and encode the documents every time. Once you have created an index and added the documents, you can re-use the index later by loading it:
96
-
97
- ```python
98
- # To load an index, simply instantiate it with the correct folder/name and without overriding it
99
- index = indexes.Voyager(
100
- index_folder="pylate-index",
101
- index_name="index",
102
- )
103
- ```
104
 
105
- #### Retrieving top-k documents for queries
 
 
106
 
107
- Once the documents are indexed, you can retrieve the top-k most relevant documents for a given set of queries.
108
- To do so, initialize the ColBERT retriever with the index you want to search in, encode the queries and then retrieve the top-k documents to get the top matches ids and relevance scores:
109
-
110
- ```python
111
- # Step 1: Initialize the ColBERT retriever
112
- retriever = retrieve.ColBERT(index=index)
113
-
114
- # Step 2: Encode the queries
115
- queries_embeddings = model.encode(
116
- ["query for document 3", "query for document 1"],
117
- batch_size=32,
118
- is_query=True, # Ensure that it is set to True to indicate that these are queries
119
- show_progress_bar=True,
120
- )
121
-
122
- # Step 3: Retrieve top-k documents
123
- scores = retriever.retrieve(
124
- queries_embeddings=queries_embeddings,
125
- k=10, # Retrieve the top 10 matches for each query
126
- )
127
- ```
128
-
129
- ### Reranking
130
- If you only want to use the ColBERT model to rerank the results of a first-stage retrieval pipeline without building an index, you can use the `rank.rerank` function and pass it the queries and documents to rerank:
131
 
132
  ```python
133
- from pylate import rank, models
134
-
135
- queries = [
136
- "query A",
137
- "query B",
138
- ]
139
 
140
- documents = [
141
- ["document A", "document B"],
142
- ["document 1", "document C", "document B"],
143
- ]
144
 
145
- documents_ids = [
146
- [1, 2],
147
- [1, 3, 2],
148
- ]
149
 
150
- model = models.ColBERT(
151
- model_name_or_path=pylate_model_id,
152
- )
153
-
154
- queries_embeddings = model.encode(
155
- queries,
156
- is_query=True,
157
- )
158
-
159
- documents_embeddings = model.encode(
160
- documents,
161
- is_query=False,
162
- )
163
-
164
- reranked_documents = rank.rerank(
165
- documents_ids=documents_ids,
166
- queries_embeddings=queries_embeddings,
167
- documents_embeddings=documents_embeddings,
168
- )
169
  ```
170
 
171
- <!--
172
- ### Direct Usage (Transformers)
173
-
174
- <details><summary>Click to see the direct usage in Transformers</summary>
175
-
176
- </details>
177
- -->
178
-
179
- <!--
180
- ### Downstream Usage (Sentence Transformers)
181
-
182
- You can finetune this model on your own dataset.
183
-
184
- <details><summary>Click to expand</summary>
185
-
186
- </details>
187
- -->
188
-
189
- <!--
190
- ### Out-of-Scope Use
191
-
192
- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
193
- -->
194
-
195
- <!--
196
- ## Bias, Risks and Limitations
197
-
198
- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
199
- -->
200
-
201
- <!--
202
- ### Recommendations
203
-
204
- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
205
- -->
206
-
207
- ## Training Details
208
-
209
- ### Training Dataset
210
-
211
- #### Unnamed Dataset
212
-
213
-
214
- * Size: 9,959 training samples
215
- * Columns: <code>query</code>, <code>positive</code>, and <code>negative</code>
216
- * Approximate statistics based on the first 1000 samples:
217
- | | query | positive | negative |
218
- |:--------|:-------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|
219
- | type | string | string | string |
220
- | details | <ul><li>min: 128 tokens</li><li>mean: 128.0 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 32 tokens</li><li>mean: 108.34 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 6 tokens</li><li>mean: 79.95 tokens</li><li>max: 128 tokens</li></ul> |
221
- * Samples:
222
- | query | positive | negative |
223
- |:---|:---|:---|
224
- | <code>Here is the step-by-step reasoning to identify the correct code solution for reading an OVF descriptor file with robust error handling.<br><br>### 1. Identify the Kind of Code<br>The code required is a **Python utility function** (or a small script) that performs **file I/O operations**. Specifically, it needs to:<br>* Accept a file path as an input argument.<br>* Attempt to open and read the contents of a file (likely a text-based XML or text file, as OVF descriptors are XML).<br>* Implement **exception handling** to gracefully manage scenarios where the file does not exist or cannot be read due to permissions or corruption.<br>* Return the file content (string) or a parsed object (if XML parsing is included), or raise a specific, user-friendly error.<br><br>### 2. Relevant Programming Concepts & Patterns<br>* **File I/O and Context Managers**: The code must use the `with open(...)` statement. This ensures the file handle is properly closed even if an error occurs during reading, preventing resource leak...</code> | <code>def get_ovf_descriptor(ovf_path):<br> if path.exists(ovf_path):<br> with open(ovf_path, 'r') as f:<br> try:<br> ovfd = f.read()<br> f.close()<br> return ovfd<br> except:<br> print "Could not read file: %s" % ovf_path<br> exit(1)</code> | <code>def read_vnf_descriptor(vnfd_id, vnf_vendor, vnf_version):<br> if _catalog_backend is not None:<br> return _catalog_backend.read_vnf_descriptor(vnfd_id, vnf_vendor,<br> vnf_version)<br> return None</code> |
225
- | <code>Here is the step-by-step reasoning to identify the correct code solution for adding a custom 'Settings' link to the WordPress plugin action links.<br><br>### 1. What kind of code would answer this query?<br>The solution requires **PHP code** specifically designed for **WordPress plugin development**. It will not be a JavaScript snippet or a CSS style. The code must be a function that hooks into the WordPress plugin management system, likely using the `plugin_action_links_{plugin_basename}` filter.<br><br>### 2. Relevant Programming Concepts, Patterns, and Algorithms<br>* **WordPress Hooks (Filters):** The core mechanism is the `apply_filters()` system. Specifically, the dynamic filter `plugin_action_links_{plugin_basename}` allows developers to modify the array of action links (Activate, Deactivate, Edit, Delete, Settings) for a specific plugin.<br>* **Array Manipulation:** The action links are stored as an associative array where the key is the link text (or ID) and the value is the URL. The code must...</code> | <code>public<br> function plugin_add_settings_link(<br> $links<br> ) {<br> $settings_link_html = '<a href="' . esc_url( self::get_settings_url() ) . '">' . __( 'Settings', 'link-linkid' ) . '</a>';<br> array_unshift( $links, $settings_link_html );<br><br> return $links;<br> }</code> | <code>function plugin_settings_link( $links){ <br> $settings_link = '<a href="options-general.php?page=esbs-plugin-settings">Settings</a>'; <br> array_unshift($links, $settings_link); <br> return $links; <br> }</code> |
226
- | <code>### Reasoning Chain<br><br>1. **Identify the Goal**: The user wants to parse a JSON Web Token (JWT) in Go specifically to read the payload (claims) *without* performing the cryptographic signature verification. This is often needed for debugging, logging, or when the token is trusted from a different source (e.g., a trusted internal service) and signature validation is handled elsewhere.<br><br>2. **Analyze the JWT Structure**: A JWT consists of three parts: `header.payload.signature`. The `payload` is a JSON object containing the claims. To extract claims without verification, we need to:<br> * Decode the Base64URL-encoded payload.<br> * Unmarshal the JSON into a Go struct or `map[string]interface{}`.<br> * **Crucially**, skip the step where the library checks the signature against the provided key.<br><br>3. **Select the Library**: The standard library for JWT in Go is `github.com/golang-jwt/jwt/v5` (or the older `v4`). The older `jwt-go` library is deprecated.<br><br>4. **Determine the Implementa...</code> | <code>func ParseInsecure(token string, audience []string) (*SVID, error) {<br> return parse(token, audience, func(tok *jwt.JSONWebToken, td spiffeid.TrustDomain) (map[string]interface{}, error) {<br> // Obtain the token claims insecurely, i.e. without signature verification<br> claimsMap := make(map[string]interface{})<br> if err := tok.UnsafeClaimsWithoutVerification(&claimsMap); err != nil {<br> return nil, jwtsvidErr.New("unable to get claims from token: %v", err)<br> }<br><br> return claimsMap, nil<br> })<br>}</code> | <code>func ParseAndValidate(token string, bundles jwtbundle.Source, audience []string) (*SVID, error) {<br> return parse(token, audience, func(tok *jwt.JSONWebToken, trustDomain spiffeid.TrustDomain) (map[string]interface{}, error) {<br> // Obtain the key ID from the header<br> keyID := tok.Headers[0].KeyID<br> if keyID == "" {<br> return nil, jwtsvidErr.New("token header missing key id")<br> }<br><br> // Get JWT Bundle<br> bundle, err := bundles.GetJWTBundleForTrustDomain(trustDomain)<br> if err != nil {<br> return nil, jwtsvidErr.New("no bundle found for trust domain %q", trustDomain)<br> }<br><br> // Find JWT authority using the key ID from the token header<br> authority, ok := bundle.FindJWTAuthority(keyID)<br> if !ok {<br> return nil, jwtsvidErr.New("no JWT authority %q found for trust domain %q", keyID, trustDomain)<br> }<br><br> // Obtain and verify the token claims using the obtained JWT authority<br> claimsMap := make(map[string]interface{})<br> if err := tok.Claims(authority, &claimsMap); err != nil {<br> return nil, jwtsvidEr...</code> |
227
- * Loss: <code>pylate.losses.cached_contrastive.CachedContrastive</code>
228
-
229
- ### Training Hyperparameters
230
- #### Non-Default Hyperparameters
231
-
232
- - `per_device_train_batch_size`: 256
233
- - `per_device_eval_batch_size`: 256
234
- - `learning_rate`: 5e-06
235
- - `warmup_ratio`: 0.05
236
- - `bf16`: True
237
- - `tf32`: True
238
- - `dataloader_num_workers`: 8
239
- - `dataloader_prefetch_factor`: 4
240
- - `dataloader_persistent_workers`: True
241
-
242
- #### All Hyperparameters
243
- <details><summary>Click to expand</summary>
244
-
245
- - `overwrite_output_dir`: False
246
- - `do_predict`: False
247
- - `eval_strategy`: no
248
- - `prediction_loss_only`: True
249
- - `per_device_train_batch_size`: 256
250
- - `per_device_eval_batch_size`: 256
251
- - `per_gpu_train_batch_size`: None
252
- - `per_gpu_eval_batch_size`: None
253
- - `gradient_accumulation_steps`: 1
254
- - `eval_accumulation_steps`: None
255
- - `torch_empty_cache_steps`: None
256
- - `learning_rate`: 5e-06
257
- - `weight_decay`: 0.0
258
- - `adam_beta1`: 0.9
259
- - `adam_beta2`: 0.999
260
- - `adam_epsilon`: 1e-08
261
- - `max_grad_norm`: 1.0
262
- - `num_train_epochs`: 3
263
- - `max_steps`: -1
264
- - `lr_scheduler_type`: linear
265
- - `lr_scheduler_kwargs`: {}
266
- - `warmup_ratio`: 0.05
267
- - `warmup_steps`: 0
268
- - `log_level`: passive
269
- - `log_level_replica`: warning
270
- - `log_on_each_node`: True
271
- - `logging_nan_inf_filter`: True
272
- - `save_safetensors`: True
273
- - `save_on_each_node`: False
274
- - `save_only_model`: False
275
- - `restore_callback_states_from_checkpoint`: False
276
- - `no_cuda`: False
277
- - `use_cpu`: False
278
- - `use_mps_device`: False
279
- - `seed`: 42
280
- - `data_seed`: None
281
- - `jit_mode_eval`: False
282
- - `use_ipex`: False
283
- - `bf16`: True
284
- - `fp16`: False
285
- - `fp16_opt_level`: O1
286
- - `half_precision_backend`: auto
287
- - `bf16_full_eval`: False
288
- - `fp16_full_eval`: False
289
- - `tf32`: True
290
- - `local_rank`: 0
291
- - `ddp_backend`: None
292
- - `tpu_num_cores`: None
293
- - `tpu_metrics_debug`: False
294
- - `debug`: []
295
- - `dataloader_drop_last`: False
296
- - `dataloader_num_workers`: 8
297
- - `dataloader_prefetch_factor`: 4
298
- - `past_index`: -1
299
- - `disable_tqdm`: False
300
- - `remove_unused_columns`: True
301
- - `label_names`: None
302
- - `load_best_model_at_end`: False
303
- - `ignore_data_skip`: False
304
- - `fsdp`: []
305
- - `fsdp_min_num_params`: 0
306
- - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
307
- - `fsdp_transformer_layer_cls_to_wrap`: None
308
- - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
309
- - `deepspeed`: None
310
- - `label_smoothing_factor`: 0.0
311
- - `optim`: adamw_torch
312
- - `optim_args`: None
313
- - `adafactor`: False
314
- - `group_by_length`: False
315
- - `length_column_name`: length
316
- - `ddp_find_unused_parameters`: None
317
- - `ddp_bucket_cap_mb`: None
318
- - `ddp_broadcast_buffers`: False
319
- - `dataloader_pin_memory`: True
320
- - `dataloader_persistent_workers`: True
321
- - `skip_memory_metrics`: True
322
- - `use_legacy_prediction_loop`: False
323
- - `push_to_hub`: False
324
- - `resume_from_checkpoint`: None
325
- - `hub_model_id`: None
326
- - `hub_strategy`: every_save
327
- - `hub_private_repo`: None
328
- - `hub_always_push`: False
329
- - `gradient_checkpointing`: False
330
- - `gradient_checkpointing_kwargs`: None
331
- - `include_inputs_for_metrics`: False
332
- - `include_for_metrics`: []
333
- - `eval_do_concat_batches`: True
334
- - `fp16_backend`: auto
335
- - `push_to_hub_model_id`: None
336
- - `push_to_hub_organization`: None
337
- - `mp_parameters`:
338
- - `auto_find_batch_size`: False
339
- - `full_determinism`: False
340
- - `torchdynamo`: None
341
- - `ray_scope`: last
342
- - `ddp_timeout`: 1800
343
- - `torch_compile`: False
344
- - `torch_compile_backend`: None
345
- - `torch_compile_mode`: None
346
- - `dispatch_batches`: None
347
- - `split_batches`: None
348
- - `include_tokens_per_second`: False
349
- - `include_num_input_tokens_seen`: False
350
- - `neftune_noise_alpha`: None
351
- - `optim_target_modules`: None
352
- - `batch_eval_metrics`: False
353
- - `eval_on_start`: False
354
- - `use_liger_kernel`: False
355
- - `eval_use_gather_object`: False
356
- - `average_tokens_across_devices`: False
357
- - `prompts`: None
358
- - `batch_sampler`: batch_sampler
359
- - `multi_dataset_batch_sampler`: proportional
360
-
361
- </details>
362
-
363
- ### Training Logs
364
- <details><summary>Click to expand</summary>
365
-
366
- | Epoch | Step | Training Loss |
367
- |:------:|:----:|:-------------:|
368
- | 0.0256 | 1 | 2.3632 |
369
- | 0.0513 | 2 | 2.3367 |
370
- | 0.0769 | 3 | 2.448 |
371
- | 0.1026 | 4 | 2.4189 |
372
- | 0.1282 | 5 | 2.1217 |
373
- | 0.1538 | 6 | 2.1491 |
374
- | 0.1795 | 7 | 1.9582 |
375
- | 0.2051 | 8 | 1.9204 |
376
- | 0.2308 | 9 | 1.6757 |
377
- | 0.2564 | 10 | 1.4951 |
378
- | 0.2821 | 11 | 1.3773 |
379
- | 0.3077 | 12 | 1.1778 |
380
- | 0.3333 | 13 | 1.088 |
381
- | 0.3590 | 14 | 1.0256 |
382
- | 0.3846 | 15 | 1.0174 |
383
- | 0.4103 | 16 | 0.8424 |
384
- | 0.4359 | 17 | 0.9435 |
385
- | 0.4615 | 18 | 0.854 |
386
- | 0.4872 | 19 | 0.8846 |
387
- | 0.5128 | 20 | 0.9211 |
388
- | 0.5385 | 21 | 0.7185 |
389
- | 0.5641 | 22 | 0.8183 |
390
- | 0.5897 | 23 | 0.7488 |
391
- | 0.6154 | 24 | 0.696 |
392
- | 0.6410 | 25 | 0.6371 |
393
- | 0.6667 | 26 | 0.6456 |
394
- | 0.6923 | 27 | 0.6259 |
395
- | 0.7179 | 28 | 0.5277 |
396
- | 0.7436 | 29 | 0.7078 |
397
- | 0.7692 | 30 | 0.7901 |
398
- | 0.7949 | 31 | 0.6332 |
399
- | 0.8205 | 32 | 0.4658 |
400
- | 0.8462 | 33 | 0.6804 |
401
- | 0.8718 | 34 | 0.6232 |
402
- | 0.8974 | 35 | 0.611 |
403
- | 0.9231 | 36 | 0.6147 |
404
- | 0.9487 | 37 | 0.5991 |
405
- | 0.9744 | 38 | 0.6732 |
406
- | 1.0 | 39 | 0.5281 |
407
- | 1.0256 | 40 | 0.5556 |
408
- | 1.0513 | 41 | 0.4985 |
409
- | 1.0769 | 42 | 0.5527 |
410
- | 1.1026 | 43 | 0.4919 |
411
- | 1.1282 | 44 | 0.5443 |
412
- | 1.1538 | 45 | 0.6086 |
413
- | 1.1795 | 46 | 0.5949 |
414
- | 1.2051 | 47 | 0.5734 |
415
- | 1.2308 | 48 | 0.6677 |
416
- | 1.2564 | 49 | 0.5189 |
417
- | 1.2821 | 50 | 0.666 |
418
- | 1.3077 | 51 | 0.4927 |
419
- | 1.3333 | 52 | 0.5356 |
420
- | 1.3590 | 53 | 0.5792 |
421
- | 1.3846 | 54 | 0.4162 |
422
- | 1.4103 | 55 | 0.5923 |
423
- | 1.4359 | 56 | 0.4905 |
424
- | 1.4615 | 57 | 0.4645 |
425
- | 1.4872 | 58 | 0.7121 |
426
- | 1.5128 | 59 | 0.5809 |
427
- | 1.5385 | 60 | 0.4401 |
428
- | 1.5641 | 61 | 0.458 |
429
- | 1.5897 | 62 | 0.4659 |
430
- | 1.6154 | 63 | 0.5638 |
431
- | 1.6410 | 64 | 0.4875 |
432
- | 1.6667 | 65 | 0.4903 |
433
- | 1.6923 | 66 | 0.5373 |
434
- | 1.7179 | 67 | 0.3934 |
435
- | 1.7436 | 68 | 0.5693 |
436
- | 1.7692 | 69 | 0.4524 |
437
- | 1.7949 | 70 | 0.4949 |
438
- | 1.8205 | 71 | 0.466 |
439
- | 1.8462 | 72 | 0.4837 |
440
- | 1.8718 | 73 | 0.5391 |
441
- | 1.8974 | 74 | 0.5266 |
442
- | 1.9231 | 75 | 0.4747 |
443
- | 1.9487 | 76 | 0.4502 |
444
- | 1.9744 | 77 | 0.5449 |
445
- | 2.0 | 78 | 0.4349 |
446
- | 2.0256 | 79 | 0.4566 |
447
- | 2.0513 | 80 | 0.482 |
448
- | 2.0769 | 81 | 0.5553 |
449
- | 2.1026 | 82 | 0.4606 |
450
- | 2.1282 | 83 | 0.4938 |
451
- | 2.1538 | 84 | 0.4303 |
452
- | 2.1795 | 85 | 0.4068 |
453
- | 2.2051 | 86 | 0.4398 |
454
- | 2.2308 | 87 | 0.4359 |
455
- | 2.2564 | 88 | 0.4599 |
456
- | 2.2821 | 89 | 0.4835 |
457
- | 2.3077 | 90 | 0.404 |
458
- | 2.3333 | 91 | 0.5046 |
459
- | 2.3590 | 92 | 0.4678 |
460
- | 2.3846 | 93 | 0.3891 |
461
- | 2.4103 | 94 | 0.435 |
462
- | 2.4359 | 95 | 0.5688 |
463
- | 2.4615 | 96 | 0.4319 |
464
- | 2.4872 | 97 | 0.4667 |
465
- | 2.5128 | 98 | 0.5857 |
466
- | 2.5385 | 99 | 0.5194 |
467
- | 2.5641 | 100 | 0.4741 |
468
- | 2.5897 | 101 | 0.5226 |
469
- | 2.6154 | 102 | 0.4168 |
470
- | 2.6410 | 103 | 0.4488 |
471
- | 2.6667 | 104 | 0.4922 |
472
- | 2.6923 | 105 | 0.4309 |
473
- | 2.7179 | 106 | 0.4832 |
474
- | 2.7436 | 107 | 0.4496 |
475
- | 2.7692 | 108 | 0.5548 |
476
- | 2.7949 | 109 | 0.4355 |
477
- | 2.8205 | 110 | 0.4305 |
478
- | 2.8462 | 111 | 0.3955 |
479
- | 2.8718 | 112 | 0.2876 |
480
- | 2.8974 | 113 | 0.4263 |
481
- | 2.9231 | 114 | 0.4874 |
482
- | 2.9487 | 115 | 0.4602 |
483
- | 2.9744 | 116 | 0.4725 |
484
- | 3.0 | 117 | 0.5401 |
485
-
486
- </details>
487
-
488
- ### Framework Versions
489
- - Python: 3.12.3
490
- - Sentence Transformers: 4.0.2
491
- - PyLate: 1.2.0
492
- - Transformers: 4.48.2
493
- - PyTorch: 2.10.0a0+a36e1d39eb.nv26.01.42222806
494
- - Accelerate: 1.13.0
495
- - Datasets: 4.4.2
496
- - Tokenizers: 0.21.4
497
-
498
-
499
  ## Citation
500
 
501
- ### BibTeX
502
 
503
- #### Sentence Transformers
504
  ```bibtex
505
- @inproceedings{reimers-2019-sentence-bert,
506
- title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
507
- author = "Reimers, Nils and Gurevych, Iryna",
508
- booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
509
- month = "11",
510
- year = "2019",
511
- publisher = "Association for Computational Linguistics",
512
- url = "https://arxiv.org/abs/1908.10084"
513
  }
514
- ```
515
 
516
- #### PyLate
517
- ```bibtex
518
- @misc{PyLate,
519
- title={PyLate: Flexible Training and Retrieval for Late Interaction Models},
520
- author={Chaffin, Antoine and Sourty, Raphaël},
521
- url={https://github.com/lightonai/pylate},
522
- year={2024}
523
  }
524
- ```
525
 
526
- #### CachedContrastive
527
- ```bibtex
528
- @misc{gao2021scaling,
529
- title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
530
- author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
531
- year={2021},
532
- eprint={2101.06983},
533
- archivePrefix={arXiv},
534
- primaryClass={cs.LG}
535
  }
536
  ```
537
 
538
- <!--
539
- ## Glossary
540
-
541
- *Clearly define terms in order to be accessible across audiences.*
542
- -->
543
-
544
- <!--
545
- ## Model Card Authors
546
-
547
- *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
548
- -->
549
-
550
- <!--
551
- ## Model Card Contact
552
-
553
- *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
554
- -->
 
1
  ---
2
+ license: apache-2.0
3
+ language:
4
+ - en
5
+ - code
6
+ library_name: PyLate
7
  tags:
8
  - ColBERT
9
  - PyLate
10
  - sentence-transformers
11
+ - code-search
12
+ - code-retrieval
13
+ - late-interaction
14
+ - reasoning
15
+ base_model: lightonai/GTE-ModernColBERT-v1
16
+ datasets:
17
+ - nomic-ai/cornstack-python-v1
18
+ - nomic-ai/cornstack-java-v1
19
+ - nomic-ai/cornstack-javascript-v1
20
+ - nomic-ai/cornstack-php-v1
21
+ - nomic-ai/cornstack-go-v1
22
+ - nomic-ai/cornstack-ruby-v1
23
  pipeline_tag: sentence-similarity
 
24
  ---
25
 
26
+ # Reason-Code-ModernColBERT
27
 
28
+ The **first ColBERT (late-interaction) model specifically designed for code search and retrieval**.
29
 
30
+ It combines the token-level matching of ColBERT with reasoning-enhanced queries, extending the [ReasonIR methodology](https://arxiv.org/abs/2504.20595) to the code domain.
31
 
32
+ ## Why Late-Interaction for Code?
33
 
34
+ Leading code search models (CodeXEmbed, Nomic Embed Code, Voyage Code) use bi-encoder, single-vector architectures. ColBERT's late-interaction approach instead computes token-level similarity (MaxSim), which is particularly well suited to code because:
35
 
36
+ - Code has rich token-level structure (identifiers, operators, keywords, types)
37
+ - A query like "sort array in reverse order" needs to match specific code tokens (`.sort()`, `reverse=True`)
38
+ - MaxSim naturally captures partial matches between NL query tokens and code tokens
39
+ - On reasoning tasks, [Reason-ModernColBERT](https://huggingface.co/lightonai/Reason-ModernColBERT) (150M) outperformed 7B dense models
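As a rough illustration of the MaxSim scoring described in the list above, the sketch below scores one query against two documents using toy 2-dimensional vectors. The vectors and the `maxsim` helper are hypothetical stand-ins, not output of this model and not PyLate's implementation; real embeddings here are 128-dimensional, one per token.

```python
def maxsim(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best dot product against any document token."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy token embeddings (hypothetical values chosen for illustration only).
query = [[1.0, 0.0], [0.0, 1.0]]              # two query tokens
doc_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]  # has a close match for both query tokens
doc_b = [[0.9, 0.1], [0.6, 0.4]]              # has a close match for the first token only

score_a = maxsim(query, doc_a)  # roughly 0.9 + 0.8
score_b = maxsim(query, doc_b)  # roughly 0.9 + 0.4
print(score_a > score_b)
```

Each query token contributes its single best match, so a document that covers every query token well scores higher than one that matches only some of them.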
40
 
41
+ ## Model Details
42
 
43
+ | Property | Value |
44
+ |---|---|
45
+ | **Base model** | [lightonai/GTE-ModernColBERT-v1](https://huggingface.co/lightonai/GTE-ModernColBERT-v1) |
46
+ | **Architecture** | ColBERT (late-interaction, multi-vector) |
47
+ | **Parameters** | 150M |
48
+ | **Embedding dim** | 128 per token |
49
+ | **Document length** | 512 tokens |
50
+ | **Query length** | 128 tokens |
51
+ | **Similarity** | MaxSim |
52
+ | **Languages** | Python, Java, JavaScript, PHP, Go, Ruby |
53
+ | **License** | Apache 2.0 |
54
 
55
+ ## Training
 
 
56
 
57
+ ### Two-Stage Training Pipeline
58
 
59
+ **Stage 1: CoRNStack Base (1 epoch)**
60
+ - 100,000 high-quality code search pairs from [CoRNStack](https://huggingface.co/collections/nomic-ai/cornstack-67c60fda17322ce742fe9dac) (Apache 2.0)
61
+ - 6 languages: Python (25K), Java (20K), JavaScript (15K), PHP (15K), Go (15K), Ruby (10K)
62
+ - Loss: 2.42 → 0.63
63
 
64
+ **Stage 2: Reasoning-Enhanced Fine-Tuning (3 epochs)**
65
+ - 9,959 reasoning-intensive code search queries generated from CoRNStack code samples
66
+ - Queries require understanding algorithms, edge cases, design patterns, and complexity
67
+ - Each query includes a chain-of-thought reasoning prefix (ReasonIR methodology)
68
+ - Loss: 2.36 → 0.54
69
 
70
+ ### Training Configuration
71
 
72
  ```python
73
+ # Both stages
74
+ model = ColBERT(document_length=512, query_length=128)
75
+ loss = CachedContrastive(temperature=1.0, mini_batch_size=32)
76
+ batch_size = 256
77
+ optim = "adamw_torch"
78
+ bf16 = True
79
+
80
+ # Stage 1: lr=1e-5, 1 epoch, warmup=5%
81
+ # Stage 2: lr=5e-6, 3 epochs, warmup=5%
82
  ```
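For intuition about the objective, a contrastive loss of this kind can be read as softmax cross-entropy over similarity scores, with the positive document competing against the negatives. The sketch below assumes that simplified form and uses a hypothetical helper, `contrastive_loss`; it leaves out the gradient caching and in-batch negative handling that `CachedContrastive` performs, so treat it as an approximation rather than PyLate's implementation.

```python
import math


def contrastive_loss(scores, positive_index=0, temperature=1.0):
    """Cross-entropy of the positive score against all candidate scores."""
    scaled = [s / temperature for s in scores]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return log_z - scaled[positive_index]


# Hypothetical MaxSim scores: the positive first, then two negatives.
loss = contrastive_loss([2.0, 1.0, 0.0])
print(round(loss, 4))
```

With `temperature=1.0`, as in the configuration above, the scores are used unscaled; the loss shrinks as the positive's score pulls away from the negatives.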
83
 
84
+ ### Hardware
85
 
86
+ Trained on a single NVIDIA DGX Spark (GB10 Blackwell, 128GB unified memory).
87
+ - Stage 1: ~130 min (391 steps)
88
+ - Stage 2: ~37 min (117 steps)
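The step counts above follow from the dataset sizes and the batch size of 256. The sketch below reproduces them, along with a linear warmup-then-decay learning-rate schedule, assuming a single device, no gradient accumulation, and that the final partial batch is kept. `total_steps` and `linear_schedule_lr` are illustrative helpers, not part of any library.

```python
import math


def total_steps(num_samples, batch_size, epochs):
    """Optimizer steps when the last partial batch of each epoch is kept."""
    return math.ceil(num_samples / batch_size) * epochs


def linear_schedule_lr(step, total, peak_lr, warmup_ratio=0.05):
    """Linear warmup to peak_lr, then linear decay to zero."""
    warmup = math.ceil(total * warmup_ratio)
    if step < warmup:
        return peak_lr * step / max(1, warmup)
    return peak_lr * max(0.0, (total - step) / max(1, total - warmup))


stage1_steps = total_steps(100_000, 256, epochs=1)  # 391
stage2_steps = total_steps(9_959, 256, epochs=3)    # 117

print(stage1_steps, stage2_steps)
print(linear_schedule_lr(6, stage2_steps, peak_lr=5e-6))  # end of Stage 2 warmup
```

Under these assumptions Stage 2 runs 39 steps per epoch, which matches the 117 steps reported for three epochs.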
89
 
90
+ ## Usage
91
 
92
  ```python
93
+ from pylate import models
94
 
95
+ model = models.ColBERT(model_name_or_path="ctrltokyo/Reason-Code-ModernColBERT")
 
 
 
96
 
97
+ queries = ["function that sorts an array in descending order using a comparison-based algorithm"]
98
+ code_docs = ["def sort_desc(arr):\n return sorted(arr, reverse=True)"]
 
 
99
 
100
+ query_embeddings = model.encode(queries, is_query=True)
101
+ doc_embeddings = model.encode(code_docs, is_query=False)
102
  ```
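To go from embeddings to a ranking, one option is PyLate's `rank.rerank`, which the previous version of this README documented. The sketch below wraps that call in a hypothetical helper, `rerank_code`. It assumes PyLate is installed and downloads the model on first use; the import is deferred so that the helper can be defined without PyLate present.

```python
def rerank_code(query, candidates, model_id="ctrltokyo/Reason-Code-ModernColBERT"):
    """Rerank candidate code snippets for a single query with PyLate."""
    # Deferred import: PyLate is only needed when the helper is actually called.
    from pylate import models, rank

    model = models.ColBERT(model_name_or_path=model_id)

    # One query, and one list of candidate documents for that query.
    queries_embeddings = model.encode([query], is_query=True)
    documents_embeddings = model.encode([candidates], is_query=False)

    # Use each candidate's position as its id so results map back to the input list.
    documents_ids = [list(range(len(candidates)))]

    return rank.rerank(
        documents_ids=documents_ids,
        queries_embeddings=queries_embeddings,
        documents_embeddings=documents_embeddings,
    )
```

Calling `rerank_code("sort a list in descending order", [snippet_1, snippet_2])` would return PyLate's reranked results for that query, with ids referring to positions in the candidate list.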
103
104
  ## Citation
105
 
106
+ This model extends the methodology from:
107
 
 
108
  ```bibtex
109
+ @article{shao2025reasonir,
110
+ title={ReasonIR: Training Retrievers for Reasoning Tasks},
111
+ author={Shao, Rulin and Qiao, Rui and Kishore, Varsha and Muennighoff, Niklas and others},
112
+ journal={arXiv preprint arXiv:2504.20595},
113
+ year={2025}
 
 
 
114
  }
 
115
 
116
+ @misc{Reason-ModernColBERT,
117
+ title={Reason-ModernColBERT},
118
+ author={LightOn AI},
119
+ year={2025},
120
+ url={https://huggingface.co/lightonai/Reason-ModernColBERT}
 
 
121
  }
 
122
 
123
+ @inproceedings{cornstack2025,
124
+ title={CoRNStack: High-Quality Contrastive Data for Better Code Retrieval and Reranking},
125
+ author={Suresh, Tarun and Reddy, Revanth Gangi and others},
126
+ booktitle={ICLR},
127
+ year={2025}
 
 
 
 
128
  }
129
  ```
130
 
131
+ Built with [PyLate](https://github.com/lightonai/pylate) and [Sentence Transformers](https://www.sbert.net/).