ctrltokyo
/

Reason-Code-ModernColBERT

@@ -1,554 +1,131 @@
 ---
 tags:
 - ColBERT
 - PyLate
 - sentence-transformers
-- sentence-similarity
-- feature-extraction
-- generated_from_trainer
-- dataset_size:9959
-- loss:CachedContrastive
 pipeline_tag: sentence-similarity
-library_name: PyLate
 ---
-# PyLate
-This is a [PyLate](https://github.com/lightonai/pylate) model. It maps sentences and paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity using the MaxSim operator.
-## Model Details
-### Model Description
-- **Model Type:** PyLate model
-<!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
-- **Document Length:** 512 tokens
-- **Query Length:** 128 tokens
-- **Output Dimensionality:** 128 dimensions
-- **Similarity Function:** MaxSim
-<!-- - **Training Dataset:** Unknown -->
-<!-- - **Language:** Unknown -->
-<!-- - **License:** Unknown -->
-### Model Sources
-- **Documentation:** [PyLate Documentation](https://lightonai.github.io/pylate/)
-- **Repository:** [PyLate on GitHub](https://github.com/lightonai/pylate)
-- **Hugging Face:** [PyLate models on Hugging Face](https://huggingface.co/models?library=PyLate)
-### Full Model Architecture
-```
-ColBERT(
-  (0): Transformer({'max_seq_length': 127, 'do_lower_case': False}) with Transformer model: ModernBertModel
-  (1): Dense({'in_features': 768, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
-)
-```
-## Usage
-First install the PyLate library:
-```bash
-pip install -U pylate
-```
-### Retrieval
-PyLate provides a streamlined interface to index and retrieve documents using ColBERT models. The index leverages the Voyager HNSW index to efficiently handle document embeddings and enable fast retrieval.
-#### Indexing documents
-First, load the ColBERT model and initialize the Voyager index, then encode and index your documents:
 ```python
-from pylate import indexes, models, retrieve
-# Step 1: Load the ColBERT model
-model = models.ColBERT(
-    model_name_or_path=pylate_model_id,
-)
-# Step 2: Initialize the Voyager index
-index = indexes.Voyager(
-    index_folder="pylate-index",
-    index_name="index",
-    override=True,  # This overwrites the existing index if any
-)
-# Step 3: Encode the documents
-documents_ids = ["1", "2", "3"]
-documents = ["document 1 text", "document 2 text", "document 3 text"]
-documents_embeddings = model.encode(
-    documents,
-    batch_size=32,
-    is_query=False,  # Ensure that it is set to False to indicate that these are documents, not queries
-    show_progress_bar=True,
-)
-# Step 4: Add document embeddings to the index by providing embeddings and corresponding ids
-index.add_documents(
-    documents_ids=documents_ids,
-    documents_embeddings=documents_embeddings,
-)
 ```
-Note that you do not have to recreate the index and encode the documents every time. Once you have created an index and added the documents, you can re-use the index later by loading it:
-```python
-# To load an index, simply instantiate it with the correct folder/name and without overriding it
-index = indexes.Voyager(
-    index_folder="pylate-index",
-    index_name="index",
-)
-```
-#### Retrieving top-k documents for queries
-Once the documents are indexed, you can retrieve the top-k most relevant documents for a given set of queries.
-To do so, initialize the ColBERT retriever with the index you want to search in, encode the queries and then retrieve the top-k documents to get the top matches ids and relevance scores:
-```python
-# Step 1: Initialize the ColBERT retriever
-retriever = retrieve.ColBERT(index=index)
-# Step 2: Encode the queries
-queries_embeddings = model.encode(
-    ["query for document 3", "query for document 1"],
-    batch_size=32,
-    is_query=True,  # Ensure that it is set to True to indicate that these are queries
-    show_progress_bar=True,
-)
-# Step 3: Retrieve top-k documents
-scores = retriever.retrieve(
-    queries_embeddings=queries_embeddings,
-    k=10,  # Retrieve the top 10 matches for each query
-)
-```
-### Reranking
-If you only want to use the ColBERT model to perform reranking on top of your first-stage retrieval pipeline without building an index, you can simply use rank function and pass the queries and documents to rerank:
 ```python
-from pylate import rank, models
-queries = [
-    "query A",
-    "query B",
-]
-documents = [
-    ["document A", "document B"],
-    ["document 1", "document C", "document B"],
-]
-documents_ids = [
-    [1, 2],
-    [1, 3, 2],
-]
-model = models.ColBERT(
-    model_name_or_path=pylate_model_id,
-)
-queries_embeddings = model.encode(
-    queries,
-    is_query=True,
-)
-documents_embeddings = model.encode(
-    documents,
-    is_query=False,
-)
-reranked_documents = rank.rerank(
-    documents_ids=documents_ids,
-    queries_embeddings=queries_embeddings,
-    documents_embeddings=documents_embeddings,
-)
 ```
-<!--
-### Direct Usage (Transformers)
-<details><summary>Click to see the direct usage in Transformers</summary>
-</details>
--->
-<!--
-### Downstream Usage (Sentence Transformers)
-You can finetune this model on your own dataset.
-<details><summary>Click to expand</summary>
-</details>
--->
-<!--
-### Out-of-Scope Use
-*List how the model may foreseeably be misused and address what users ought not to do with the model.*
--->
-<!--
-## Bias, Risks and Limitations
-*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
--->
-<!--
-### Recommendations
-*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
--->
-## Training Details
-### Training Dataset
-#### Unnamed Dataset
-* Size: 9,959 training samples
-* Columns: <code>query</code>, <code>positive</code>, and <code>negative</code>
-* Approximate statistics based on the first 1000 samples:
-  |         | query                                                                                | positive                                                                             | negative                                                                           |
-  |:--------|:-------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|
-  | type    | string                                                                               | string                                                                               | string                                                                             |
-  | details | <ul><li>min: 128 tokens</li><li>mean: 128.0 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 32 tokens</li><li>mean: 108.34 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 6 tokens</li><li>mean: 79.95 tokens</li><li>max: 128 tokens</li></ul> |
-* Samples:
-  | query | positive | negative |
-  |:------|:---------|:---------|
-  | <code>Here is the step-by-step reasoning to identify the correct code solution for reading an OVF descriptor file with robust error handling.<br><br>### 1. Identify the Kind of Code<br>The code required is a **Python utility function** (or a small script) that performs **file I/O operations**. Specifically, it needs to:<br>*   Accept a file path as an input argument.<br>*   Attempt to open and read the contents of a file (likely a text-based XML or text file, as OVF descriptors are XML).<br>*   Implement **exception handling** to gracefully manage scenarios where the file does not exist or cannot be read due to permissions or corruption.<br>*   Return the file content (string) or a parsed object (if XML parsing is included), or raise a specific, user-friendly error.<br><br>### 2. Relevant Programming Concepts & Patterns<br>*   **File I/O and Context Managers**: The code must use the `with open(...)` statement. This ensures the file handle is properly closed even if an error occurs during reading, preventing resource leak...</code>    | <code>def get_ovf_descriptor(ovf_path):<br>    if path.exists(ovf_path):<br>        with open(ovf_path, 'r') as f:<br>            try:<br>                ovfd = f.read()<br>                f.close()<br>                return ovfd<br>            except:<br>                print "Could not read file: %s" % ovf_path<br>                exit(1)</code>                                                                                                                                                                                          | <code>def read_vnf_descriptor(vnfd_id, vnf_vendor, vnf_version):<br>    if _catalog_backend is not None:<br>        return _catalog_backend.read_vnf_descriptor(vnfd_id, vnf_vendor,<br>                                                    vnf_version)<br>    return None</code>                                                                                                                                     
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
-  | <code>Here is the step-by-step reasoning to identify the correct code solution for adding a custom 'Settings' link to the WordPress plugin action links.<br><br>### 1. What kind of code would answer this query?<br>The solution requires **PHP code** specifically designed for **WordPress plugin development**. It will not be a JavaScript snippet or a CSS style. The code must be a function that hooks into the WordPress plugin management system, likely using the `plugin_action_links_{plugin_basename}` filter.<br><br>### 2. Relevant Programming Concepts, Patterns, and Algorithms<br>*   **WordPress Hooks (Filters):** The core mechanism is the `apply_filters()` system. Specifically, the dynamic filter `plugin_action_links_{plugin_basename}` allows developers to modify the array of action links (Activate, Deactivate, Edit, Delete, Settings) for a specific plugin.<br>*   **Array Manipulation:** The action links are stored as an associative array where the key is the link text (or ID) and the value is the URL. The code must...</code>             | <code>public<br>		function plugin_add_settings_link(<br>			$links<br>		) {<br>			$settings_link_html = '<a href="' . esc_url( self::get_settings_url() ) . '">' . __( 'Settings', 'link-linkid' ) . 
'</a>';<br>			array_unshift( $links, $settings_link_html );<br><br>			return $links;<br>		}</code>                                                                                                                                                                                                                                                | <code>function plugin_settings_link( $links){ <br>        $settings_link = '<a href="options-general.php?page=esbs-plugin-settings">Settings</a>'; <br>        array_unshift($links, $settings_link); <br>        return $links; <br>    }</code>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
-  | <code>### Reasoning Chain<br><br>1.  **Identify the Goal**: The user wants to parse a JSON Web Token (JWT) in Go specifically to read the payload (claims) *without* performing the cryptographic signature verification. This is often needed for debugging, logging, or when the token is trusted from a different source (e.g., a trusted internal service) and signature validation is handled elsewhere.<br><br>2.  **Analyze the JWT Structure**: A JWT consists of three parts: `header.payload.signature`. The `payload` is a JSON object containing the claims. To extract claims without verification, we need to:<br>    *   Decode the Base64URL-encoded payload.<br>    *   Unmarshal the JSON into a Go struct or `map[string]interface{}`.<br>    *   **Crucially**, skip the step where the library checks the signature against the provided key.<br><br>3.  **Select the Library**: The standard library for JWT in Go is `github.com/golang-jwt/jwt/v5` (or the older `v4`). The older `jwt-go` library is deprecated.<br><br>4.  **Determine the Implementa...</code> | <code>func ParseInsecure(token string, audience []string) (*SVID, error) {<br>	return parse(token, audience, func(tok *jwt.JSONWebToken, td spiffeid.TrustDomain) (map[string]interface{}, error) {<br>		// Obtain the token claims insecurely, i.e. 
without signature verification<br>		claimsMap := make(map[string]interface{})<br>		if err := tok.UnsafeClaimsWithoutVerification(&claimsMap); err != nil {<br>			return nil, jwtsvidErr.New("unable to get claims from token: %v", err)<br>		}<br><br>		return claimsMap, nil<br>	})<br>}</code> | <code>func ParseAndValidate(token string, bundles jwtbundle.Source, audience []string) (*SVID, error) {<br>	return parse(token, audience, func(tok *jwt.JSONWebToken, trustDomain spiffeid.TrustDomain) (map[string]interface{}, error) {<br>		// Obtain the key ID from the header<br>		keyID := tok.Headers[0].KeyID<br>		if keyID == "" {<br>			return nil, jwtsvidErr.New("token header missing key id")<br>		}<br><br>		// Get JWT Bundle<br>		bundle, err := bundles.GetJWTBundleForTrustDomain(trustDomain)<br>		if err != nil {<br>			return nil, jwtsvidErr.New("no bundle found for trust domain %q", trustDomain)<br>		}<br><br>		// Find JWT authority using the key ID from the token header<br>		authority, ok := bundle.FindJWTAuthority(keyID)<br>		if !ok {<br>			return nil, jwtsvidErr.New("no JWT authority %q found for trust domain %q", keyID, trustDomain)<br>		}<br><br>		// Obtain and verify the token claims using the obtained JWT authority<br>		claimsMap := make(map[string]interface{})<br>		if err := tok.Claims(authority, &claimsMap); err != nil {<br>			return nil, jwtsvidEr...</code> |
-* Loss: <code>pylate.losses.cached_contrastive.CachedContrastive</code>
-### Training Hyperparameters
-#### Non-Default Hyperparameters
-- `per_device_train_batch_size`: 256
-- `per_device_eval_batch_size`: 256
-- `learning_rate`: 5e-06
-- `warmup_ratio`: 0.05
-- `bf16`: True
-- `tf32`: True
-- `dataloader_num_workers`: 8
-- `dataloader_prefetch_factor`: 4
-- `dataloader_persistent_workers`: True
-#### All Hyperparameters
-<details><summary>Click to expand</summary>
-- `overwrite_output_dir`: False
-- `do_predict`: False
-- `eval_strategy`: no
-- `prediction_loss_only`: True
-- `per_device_train_batch_size`: 256
-- `per_device_eval_batch_size`: 256
-- `per_gpu_train_batch_size`: None
-- `per_gpu_eval_batch_size`: None
-- `gradient_accumulation_steps`: 1
-- `eval_accumulation_steps`: None
-- `torch_empty_cache_steps`: None
-- `learning_rate`: 5e-06
-- `weight_decay`: 0.0
-- `adam_beta1`: 0.9
-- `adam_beta2`: 0.999
-- `adam_epsilon`: 1e-08
-- `max_grad_norm`: 1.0
-- `num_train_epochs`: 3
-- `max_steps`: -1
-- `lr_scheduler_type`: linear
-- `lr_scheduler_kwargs`: {}
-- `warmup_ratio`: 0.05
-- `warmup_steps`: 0
-- `log_level`: passive
-- `log_level_replica`: warning
-- `log_on_each_node`: True
-- `logging_nan_inf_filter`: True
-- `save_safetensors`: True
-- `save_on_each_node`: False
-- `save_only_model`: False
-- `restore_callback_states_from_checkpoint`: False
-- `no_cuda`: False
-- `use_cpu`: False
-- `use_mps_device`: False
-- `seed`: 42
-- `data_seed`: None
-- `jit_mode_eval`: False
-- `use_ipex`: False
-- `bf16`: True
-- `fp16`: False
-- `fp16_opt_level`: O1
-- `half_precision_backend`: auto
-- `bf16_full_eval`: False
-- `fp16_full_eval`: False
-- `tf32`: True
-- `local_rank`: 0
-- `ddp_backend`: None
-- `tpu_num_cores`: None
-- `tpu_metrics_debug`: False
-- `debug`: []
-- `dataloader_drop_last`: False
-- `dataloader_num_workers`: 8
-- `dataloader_prefetch_factor`: 4
-- `past_index`: -1
-- `disable_tqdm`: False
-- `remove_unused_columns`: True
-- `label_names`: None
-- `load_best_model_at_end`: False
-- `ignore_data_skip`: False
-- `fsdp`: []
-- `fsdp_min_num_params`: 0
-- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
-- `fsdp_transformer_layer_cls_to_wrap`: None
-- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
-- `deepspeed`: None
-- `label_smoothing_factor`: 0.0
-- `optim`: adamw_torch
-- `optim_args`: None
-- `adafactor`: False
-- `group_by_length`: False
-- `length_column_name`: length
-- `ddp_find_unused_parameters`: None
-- `ddp_bucket_cap_mb`: None
-- `ddp_broadcast_buffers`: False
-- `dataloader_pin_memory`: True
-- `dataloader_persistent_workers`: True
-- `skip_memory_metrics`: True
-- `use_legacy_prediction_loop`: False
-- `push_to_hub`: False
-- `resume_from_checkpoint`: None
-- `hub_model_id`: None
-- `hub_strategy`: every_save
-- `hub_private_repo`: None
-- `hub_always_push`: False
-- `gradient_checkpointing`: False
-- `gradient_checkpointing_kwargs`: None
-- `include_inputs_for_metrics`: False
-- `include_for_metrics`: []
-- `eval_do_concat_batches`: True
-- `fp16_backend`: auto
-- `push_to_hub_model_id`: None
-- `push_to_hub_organization`: None
-- `mp_parameters`:
-- `auto_find_batch_size`: False
-- `full_determinism`: False
-- `torchdynamo`: None
-- `ray_scope`: last
-- `ddp_timeout`: 1800
-- `torch_compile`: False
-- `torch_compile_backend`: None
-- `torch_compile_mode`: None
-- `dispatch_batches`: None
-- `split_batches`: None
-- `include_tokens_per_second`: False
-- `include_num_input_tokens_seen`: False
-- `neftune_noise_alpha`: None
-- `optim_target_modules`: None
-- `batch_eval_metrics`: False
-- `eval_on_start`: False
-- `use_liger_kernel`: False
-- `eval_use_gather_object`: False
-- `average_tokens_across_devices`: False
-- `prompts`: None
-- `batch_sampler`: batch_sampler
-- `multi_dataset_batch_sampler`: proportional
-</details>
-### Training Logs
-<details><summary>Click to expand</summary>
-| Epoch  | Step | Training Loss |
-|:------:|:----:|:-------------:|
-| 0.0256 | 1    | 2.3632        |
-| 0.0513 | 2    | 2.3367        |
-| 0.0769 | 3    | 2.448         |
-| 0.1026 | 4    | 2.4189        |
-| 0.1282 | 5    | 2.1217        |
-| 0.1538 | 6    | 2.1491        |
-| 0.1795 | 7    | 1.9582        |
-| 0.2051 | 8    | 1.9204        |
-| 0.2308 | 9    | 1.6757        |
-| 0.2564 | 10   | 1.4951        |
-| 0.2821 | 11   | 1.3773        |
-| 0.3077 | 12   | 1.1778        |
-| 0.3333 | 13   | 1.088         |
-| 0.3590 | 14   | 1.0256        |
-| 0.3846 | 15   | 1.0174        |
-| 0.4103 | 16   | 0.8424        |
-| 0.4359 | 17   | 0.9435        |
-| 0.4615 | 18   | 0.854         |
-| 0.4872 | 19   | 0.8846        |
-| 0.5128 | 20   | 0.9211        |
-| 0.5385 | 21   | 0.7185        |
-| 0.5641 | 22   | 0.8183        |
-| 0.5897 | 23   | 0.7488        |
-| 0.6154 | 24   | 0.696         |
-| 0.6410 | 25   | 0.6371        |
-| 0.6667 | 26   | 0.6456        |
-| 0.6923 | 27   | 0.6259        |
-| 0.7179 | 28   | 0.5277        |
-| 0.7436 | 29   | 0.7078        |
-| 0.7692 | 30   | 0.7901        |
-| 0.7949 | 31   | 0.6332        |
-| 0.8205 | 32   | 0.4658        |
-| 0.8462 | 33   | 0.6804        |
-| 0.8718 | 34   | 0.6232        |
-| 0.8974 | 35   | 0.611         |
-| 0.9231 | 36   | 0.6147        |
-| 0.9487 | 37   | 0.5991        |
-| 0.9744 | 38   | 0.6732        |
-| 1.0    | 39   | 0.5281        |
-| 1.0256 | 40   | 0.5556        |
-| 1.0513 | 41   | 0.4985        |
-| 1.0769 | 42   | 0.5527        |
-| 1.1026 | 43   | 0.4919        |
-| 1.1282 | 44   | 0.5443        |
-| 1.1538 | 45   | 0.6086        |
-| 1.1795 | 46   | 0.5949        |
-| 1.2051 | 47   | 0.5734        |
-| 1.2308 | 48   | 0.6677        |
-| 1.2564 | 49   | 0.5189        |
-| 1.2821 | 50   | 0.666         |
-| 1.3077 | 51   | 0.4927        |
-| 1.3333 | 52   | 0.5356        |
-| 1.3590 | 53   | 0.5792        |
-| 1.3846 | 54   | 0.4162        |
-| 1.4103 | 55   | 0.5923        |
-| 1.4359 | 56   | 0.4905        |
-| 1.4615 | 57   | 0.4645        |
-| 1.4872 | 58   | 0.7121        |
-| 1.5128 | 59   | 0.5809        |
-| 1.5385 | 60   | 0.4401        |
-| 1.5641 | 61   | 0.458         |
-| 1.5897 | 62   | 0.4659        |
-| 1.6154 | 63   | 0.5638        |
-| 1.6410 | 64   | 0.4875        |
-| 1.6667 | 65   | 0.4903        |
-| 1.6923 | 66   | 0.5373        |
-| 1.7179 | 67   | 0.3934        |
-| 1.7436 | 68   | 0.5693        |
-| 1.7692 | 69   | 0.4524        |
-| 1.7949 | 70   | 0.4949        |
-| 1.8205 | 71   | 0.466         |
-| 1.8462 | 72   | 0.4837        |
-| 1.8718 | 73   | 0.5391        |
-| 1.8974 | 74   | 0.5266        |
-| 1.9231 | 75   | 0.4747        |
-| 1.9487 | 76   | 0.4502        |
-| 1.9744 | 77   | 0.5449        |
-| 2.0    | 78   | 0.4349        |
-| 2.0256 | 79   | 0.4566        |
-| 2.0513 | 80   | 0.482         |
-| 2.0769 | 81   | 0.5553        |
-| 2.1026 | 82   | 0.4606        |
-| 2.1282 | 83   | 0.4938        |
-| 2.1538 | 84   | 0.4303        |
-| 2.1795 | 85   | 0.4068        |
-| 2.2051 | 86   | 0.4398        |
-| 2.2308 | 87   | 0.4359        |
-| 2.2564 | 88   | 0.4599        |
-| 2.2821 | 89   | 0.4835        |
-| 2.3077 | 90   | 0.404         |
-| 2.3333 | 91   | 0.5046        |
-| 2.3590 | 92   | 0.4678        |
-| 2.3846 | 93   | 0.3891        |
-| 2.4103 | 94   | 0.435         |
-| 2.4359 | 95   | 0.5688        |
-| 2.4615 | 96   | 0.4319        |
-| 2.4872 | 97   | 0.4667        |
-| 2.5128 | 98   | 0.5857        |
-| 2.5385 | 99   | 0.5194        |
-| 2.5641 | 100  | 0.4741        |
-| 2.5897 | 101  | 0.5226        |
-| 2.6154 | 102  | 0.4168        |
-| 2.6410 | 103  | 0.4488        |
-| 2.6667 | 104  | 0.4922        |
-| 2.6923 | 105  | 0.4309        |
-| 2.7179 | 106  | 0.4832        |
-| 2.7436 | 107  | 0.4496        |
-| 2.7692 | 108  | 0.5548        |
-| 2.7949 | 109  | 0.4355        |
-| 2.8205 | 110  | 0.4305        |
-| 2.8462 | 111  | 0.3955        |
-| 2.8718 | 112  | 0.2876        |
-| 2.8974 | 113  | 0.4263        |
-| 2.9231 | 114  | 0.4874        |
-| 2.9487 | 115  | 0.4602        |
-| 2.9744 | 116  | 0.4725        |
-| 3.0    | 117  | 0.5401        |
-</details>
-### Framework Versions
-- Python: 3.12.3
-- Sentence Transformers: 4.0.2
-- PyLate: 1.2.0
-- Transformers: 4.48.2
-- PyTorch: 2.10.0a0+a36e1d39eb.nv26.01.42222806
-- Accelerate: 1.13.0
-- Datasets: 4.4.2
-- Tokenizers: 0.21.4
 ## Citation
-### BibTeX
-#### Sentence Transformers
 ```bibtex
-@inproceedings{reimers-2019-sentence-bert,
-    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
-    author = "Reimers, Nils and Gurevych, Iryna",
-    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
-    month = "11",
-    year = "2019",
-    publisher = "Association for Computational Linguistics",
-    url = "https://arxiv.org/abs/1908.10084"
 }
-```
-#### PyLate
-```bibtex
-@misc{PyLate,
-title={PyLate: Flexible Training and Retrieval for Late Interaction Models},
-author={Chaffin, Antoine and Sourty, Raphaël},
-url={https://github.com/lightonai/pylate},
-year={2024}
 }
-```
-#### CachedContrastive
-```bibtex
-@misc{gao2021scaling,
-    title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
-    author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
-    year={2021},
-    eprint={2101.06983},
-    archivePrefix={arXiv},
-    primaryClass={cs.LG}
 }
 ```
-<!--
-## Glossary
-*Clearly define terms in order to be accessible across audiences.*
--->
-<!--
-## Model Card Authors
-*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
--->
-<!--
-## Model Card Contact
-*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
--->

 ---
+license: apache-2.0
+language:
+- en
+- code
+library_name: PyLate
 tags:
 - ColBERT
 - PyLate
 - sentence-transformers
+- code-search
+- code-retrieval
+- late-interaction
+- reasoning
+base_model: lightonai/GTE-ModernColBERT-v1
+datasets:
+- nomic-ai/cornstack-python-v1
+- nomic-ai/cornstack-java-v1
+- nomic-ai/cornstack-javascript-v1
+- nomic-ai/cornstack-php-v1
+- nomic-ai/cornstack-go-v1
+- nomic-ai/cornstack-ruby-v1
 pipeline_tag: sentence-similarity
 ---
+# Reason-Code-ModernColBERT
+This is the **first ColBERT (late-interaction) model specifically designed for code search and retrieval**.
+It combines ColBERT's token-level matching with reasoning-enhanced queries, extending the [ReasonIR methodology](https://arxiv.org/abs/2504.20595) to the code domain.
+## Why Late-Interaction for Code?
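+Current state-of-the-art code search models (CodeXEmbed, Nomic Embed Code, Voyage Code) use bi-encoder (single-vector) architectures. ColBERT's late-interaction approach instead keeps one embedding per token and scores a query against a document with MaxSim: each query token is matched to its most similar document token, and the per-token maxima are summed. This is particularly well suited to code because: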
+- Code has rich token-level structure (identifiers, operators, keywords, types)
+- A query like "sort array in reverse order" needs to match specific code tokens (`.sort()`, `reverse=True`)
+- MaxSim naturally captures partial matches between NL query tokens and code tokens
+- On reasoning tasks, [Reason-ModernColBERT](https://huggingface.co/lightonai/Reason-ModernColBERT) (150M) outperformed 7B dense models
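The MaxSim scoring described above can be sketched in a few lines of plain Python. This is a minimal illustration only, not PyLate's implementation: the function name `maxsim` and the 2-dimensional toy embeddings are hypothetical, whereas the real model produces 128-dimensional vectors per token.

```python
def maxsim(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best dot product with any document token."""
    total = 0.0
    for q_vec in query_vecs:
        # Similarity of this query token to every document token.
        sims = [sum(a * b for a, b in zip(q_vec, d_vec)) for d_vec in doc_vecs]
        # Keep only the best-matching document token.
        total += max(sims)
    return total


# Toy token embeddings (hypothetical values, for illustration only).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # covers both query tokens
doc_b = [[1.0, 0.0], [1.0, 0.0]]              # covers only the first query token

score_a = maxsim(query, doc_a)
score_b = maxsim(query, doc_b)
print(score_a, score_b)
```

Because each query token contributes independently, a document that matches only some of the query tokens still receives partial credit, which is the behaviour the bullet on partial matches refers to.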
+## Model Details
+| Property | Value |
+|---|---|
+| **Base model** | [lightonai/GTE-ModernColBERT-v1](https://huggingface.co/lightonai/GTE-ModernColBERT-v1) |
+| **Architecture** | ColBERT (late-interaction, multi-vector) |
+| **Parameters** | 150M |
+| **Embedding dim** | 128 per token |
+| **Document length** | 512 tokens |
+| **Query length** | 128 tokens |
+| **Similarity** | MaxSim |
+| **Languages** | Python, Java, JavaScript, PHP, Go, Ruby |
+| **License** | Apache 2.0 |
+## Training
+### Two-Stage Training Pipeline
+**Stage 1: CoRNStack Base (1 epoch)**
+- 100,000 high-quality code search pairs from [CoRNStack](https://huggingface.co/collections/nomic-ai/cornstack-67c60fda17322ce742fe9dac) (Apache 2.0)
+- 6 languages: Python (25K), Java (20K), JavaScript (15K), PHP (15K), Go (15K), Ruby (10K)
+- Training loss: 2.42 → 0.63
+**Stage 2: Reasoning-Enhanced Fine-Tuning (3 epochs)**
+- 9,959 reasoning-intensive code search queries generated from CoRNStack code samples
+- Queries require understanding algorithms, edge cases, design patterns, and complexity
+- Each query includes a chain-of-thought reasoning prefix (ReasonIR methodology)
+- Training loss: 2.36 → 0.54 (first step to final step)
+### Training Configuration
 ```python
+# Both stages
+model = ColBERT(document_length=512, query_length=128)
+loss = CachedContrastive(temperature=1.0, mini_batch_size=32)
+batch_size = 256
+optim = "adamw_torch"
+bf16 = True
+# Stage 1: lr=1e-5, 1 epoch, warmup=5%
+# Stage 2: lr=5e-6, 3 epochs, warmup=5%
 ```
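For intuition about what the loss value measures, here is a rough sketch of a softmax cross-entropy contrastive objective with a temperature, written in plain Python. It is an assumption-laden illustration rather than PyLate's `CachedContrastive` code: the function name `contrastive_loss` is hypothetical, the scores stand in for MaxSim similarities between one query and a set of candidate documents, and the gradient caching that lets the real loss handle large batches is omitted.

```python
import math


def contrastive_loss(scores, positive_index, temperature=1.0):
    """Cross-entropy of the positive document against all candidates."""
    scaled = [s / temperature for s in scores]
    # Numerically stable log-sum-exp over all candidate scores.
    top = max(scaled)
    log_sum_exp = top + math.log(sum(math.exp(s - top) for s in scaled))
    # Negative log-probability assigned to the positive document.
    return log_sum_exp - scaled[positive_index]


# With four equally scored candidates the loss is log(4).
uniform = contrastive_loss([0.0, 0.0, 0.0, 0.0], positive_index=0)

# Raising the positive's score relative to the negatives lowers the loss.
separated = contrastive_loss([5.0, 0.0, 0.0, 0.0], positive_index=0)
print(uniform, separated)
```

Under this reading, a falling training loss means the positive code snippet is being scored increasingly above the other candidates in the batch.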
+### Hardware
+Trained on a single NVIDIA DGX Spark (GB10 Blackwell, 128 GB unified memory).
+- Stage 1: ~130 min (391 steps)
+- Stage 2: ~37 min (117 steps)
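The step counts above follow from the dataset sizes and the batch size of 256. The short calculation below reproduces them, assuming a single device, no gradient accumulation, and a final partial batch that is kept rather than dropped; the helper names `total_steps` and `warmup_steps` are illustrative.

```python
import math


def total_steps(num_samples, batch_size, epochs):
    """Optimizer steps when the last partial batch of each epoch is kept."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch * epochs


def warmup_steps(num_steps, warmup_ratio):
    """Number of warmup steps for a given ratio, rounded up."""
    return math.ceil(num_steps * warmup_ratio)


stage1_steps = total_steps(100_000, 256, epochs=1)  # 391 steps
stage2_steps = total_steps(9_959, 256, epochs=3)    # 39 steps per epoch, 117 total

stage1_warmup = warmup_steps(stage1_steps, 0.05)
stage2_warmup = warmup_steps(stage2_steps, 0.05)
print(stage1_steps, stage2_steps, stage1_warmup, stage2_warmup)
```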
+## Usage
 ```python
+from pylate import models
+
+# Load the model from the Hugging Face Hub
+model = models.ColBERT(model_name_or_path="ctrltokyo/Reason-Code-ModernColBERT")
+
+queries = ["function that sorts an array in descending order using a comparison-based algorithm"]
+code_docs = ["def sort_desc(arr):\n    return sorted(arr, reverse=True)"]
+
+# Queries and documents are encoded differently, so set is_query accordingly
+query_embeddings = model.encode(queries, is_query=True)
+doc_embeddings = model.encode(code_docs, is_query=False)
 ```
 ## Citation
+This model extends the methodology from:
 ```bibtex
+@article{shao2025reasonir,
+  title={ReasonIR: Training Retrievers for Reasoning Tasks},
+  author={Shao, Rulin and others},
+  journal={arXiv preprint arXiv:2504.20595},
+  year={2025}
 }
+@misc{Reason-ModernColBERT,
+  title={Reason-ModernColBERT},
+  author={LightOn AI},
+  year={2025},
+  url={https://huggingface.co/lightonai/Reason-ModernColBERT}
 }
+@inproceedings{cornstack2025,
+  title={CoRNStack: High-Quality Contrastive Data for Better Code Retrieval and Reranking},
+  author={Suresh, Tarun and others},
+  booktitle={ICLR},
+  year={2025}
 }
 ```
+Built with [PyLate](https://github.com/lightonai/pylate) and [Sentence Transformers](https://www.sbert.net/).