Commit 1f76206 by chadvanderbilt · verified · 1 Parent(s): aecf178

Update README.md

Files changed (1)
  1. README.md +217 -3
README.md CHANGED
@@ -1,3 +1,217 @@
- ---
- license: mit
- ---
+ # TCGA & IMPACT Genomic Biomarker WSI Training Checkpoints
+
+ This repository hosts the full set of 200th-epoch classification checkpoints
+ used for genomic biomarker prediction across TCGA and IMPACT cohorts.
+
+ Checkpoints are organized strictly by:
+
+ - Dataset source (`TCGA` or `IMPACT`)
+ - Tumor type (e.g., `HNSC`, `UCS`, `BRCA`)
+ - Gene (e.g., `PIK3CA`, `FBXW7`, `BRAF`)
+ - Encoder (e.g., `virchow`, `gigapath_ft`)
+ - Data split index (`split_1`, `split_2`, ...)
+
+ ---
+
+ ## Repository Structure
+
+ The exact directory layout in this Hugging Face repo is:
+
+ ```text
+ TCGA_Genomic_Biomarker_WSI_Training/
+ ├── TCGA/
+ │   └── checkpoints/
+ │       └── <TUMOR>/
+ │           └── <GENE>/
+ │               └── TCGA_trained_<TUMOR>_<GENE>_<ENCODER>_gma_<SPLIT>_200.pth
+ │
+ └── IMPACT/
+     └── checkpoints/
+         └── <TUMOR>/
+             └── <GENE>/
+                 └── IMPACT_trained_<TUMOR>_<GENE>_<ENCODER>_gma_<SPLIT>_200.pth
+ ```
+
+ ### Examples
+
+ ```text
+ TCGA/checkpoints/HNSC/PIK3CA/
+   TCGA_trained_HNSC_PIK3CA_virchow_gma_1_200.pth
+   TCGA_trained_HNSC_PIK3CA_virchow_gma_2_200.pth
+   TCGA_trained_HNSC_PIK3CA_gigapath_ft_gma_1_200.pth
+
+ IMPACT/checkpoints/UCS/FBXW7/
+   IMPACT_trained_UCS_FBXW7_virchow_gma_1_200.pth
+   IMPACT_trained_UCS_FBXW7_gigapath_ft_gma_2_200.pth
+ ```
+
+ Each checkpoint filename is self-descriptive:
+
+ ```text
+ <SOURCE>_trained_<TUMOR>_<GENE>_<ENCODER>_gma_<SPLIT>_200.pth
+ ```
+
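Because the filename encodes every field, it can be split back into its parts. The following is a minimal sketch, not part of the repository: `CheckpointName` and `parse_checkpoint_name` are hypothetical names, and the pattern assumes that tumor and gene contain no underscores (the encoder may, as in `gigapath_ft`) and that the trailing `200` is the epoch number.

```python
import re
from typing import NamedTuple, Optional


class CheckpointName(NamedTuple):
    """Fields encoded in a checkpoint filename (hypothetical helper type)."""
    source: str
    tumor: str
    gene: str
    encoder: str
    split: int
    epoch: int


# Assumption: tumor and gene have no underscores; the encoder may contain them.
_PATTERN = re.compile(
    r"^(?P<source>TCGA|IMPACT)_trained_"
    r"(?P<tumor>[^_]+)_(?P<gene>[^_]+)_"
    r"(?P<encoder>.+)_gma_(?P<split>\d+)_(?P<epoch>\d+)\.pth$"
)


def parse_checkpoint_name(filename: str) -> Optional[CheckpointName]:
    """Return the parsed fields, or None if the name does not fit the pattern."""
    m = _PATTERN.match(filename)
    if m is None:
        return None
    return CheckpointName(
        source=m["source"],
        tumor=m["tumor"],
        gene=m["gene"],
        encoder=m["encoder"],
        split=int(m["split"]),
        epoch=int(m["epoch"]),
    )
```

For example, `parse_checkpoint_name("TCGA_trained_HNSC_PIK3CA_gigapath_ft_gma_1_200.pth")` yields an encoder of `gigapath_ft` and a split of `1`.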
+ ---
+
+ ## Downloading
+
+ ### 1. Clone with Git LFS (recommended)
+
+ ```bash
+ git lfs install
+ git clone https://huggingface.co/chadvanderbilt/TCGA_Genomic_Biomarker_WSI_Training
+ cd TCGA_Genomic_Biomarker_WSI_Training
+ ```
+
+ ### 2. Download an individual checkpoint
+
+ ```python
+ from huggingface_hub import hf_hub_download
+
+ ckpt_path = hf_hub_download(
+     repo_id="chadvanderbilt/TCGA_Genomic_Biomarker_WSI_Training",
+     filename="TCGA/checkpoints/HNSC/PIK3CA/TCGA_trained_HNSC_PIK3CA_virchow_gma_1_200.pth"
+ )
+ print(ckpt_path)
+ ```
+
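To fetch every checkpoint for one tumor/gene folder without cloning the whole repository, `snapshot_download` from `huggingface_hub` accepts glob patterns through `allow_patterns`. A minimal sketch, assuming the folder layout described under Repository Structure; `checkpoint_glob` and `download_subset` are hypothetical helper names, not part of this repo:

```python
REPO_ID = "chadvanderbilt/TCGA_Genomic_Biomarker_WSI_Training"


def checkpoint_glob(source: str, tumor: str, gene: str) -> str:
    """Glob matching all .pth files in one <SOURCE>/checkpoints/<TUMOR>/<GENE>/ folder."""
    return f"{source}/checkpoints/{tumor}/{gene}/*.pth"


def download_subset(source: str, tumor: str, gene: str) -> str:
    """Download only the matching checkpoints; returns the local snapshot directory."""
    # Imported lazily so checkpoint_glob works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=[checkpoint_glob(source, tumor, gene)],
    )


# Example call (performs a network download):
# local_dir = download_subset("TCGA", "HNSC", "PIK3CA")
```

The returned directory mirrors the repository layout, so the files sit under the same `TCGA/checkpoints/HNSC/PIK3CA/` path inside it.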
+ ---
+
+ ## Checksum Logs (SHA256)
+
+ Each upload run writes a checksum log under:
+
+ ```text
+ logs/checkpoint_checksums_YYYYMMDD_HHMMSS.json
+ ```
+
+ Each entry in this JSON file includes:
+
+ - `source` (`TCGA` or `IMPACT`)
+ - `tumor`
+ - `gene`
+ - `encoder`
+ - `split`
+ - `remote_path` (path inside this repo)
+ - `size_bytes`
+ - `sha256`
+ - `timestamp`
+
+ These logs allow you to verify that your local copies of the checkpoints
+ match the originals used at upload time.
+
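The fields listed above make it easy to look up the expected checksum for a particular checkpoint. The sketch below assumes the log is a JSON array of objects with those keys, as `verify_checkpoints.py` implies; `load_records` and `select_records` are hypothetical helpers, and the value types (for instance whether `split` is stored as a number or a string) are not specified here, so compare against whatever the log actually contains.

```python
import json
from pathlib import Path


def load_records(log_json):
    """Load the checksum log; assumed to be a JSON array of record objects."""
    with Path(log_json).open() as f:
        return json.load(f)


def select_records(records, **fields):
    """Return the records whose fields equal every given value.

    Example: select_records(records, source="TCGA", gene="PIK3CA")
    """
    return [
        rec for rec in records
        if all(rec.get(key) == value for key, value in fields.items())
    ]
```

A typical use would be `select_records(load_records(path), tumor="HNSC", encoder="virchow")`, then reading `sha256` and `size_bytes` from each match.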
+ ---
+
+ ## Verifying Checkpoints After Download
+
+ This repo includes a helper script `verify_checkpoints.py` for checksum verification.
+
+ ### Usage
+
+ From the root of the cloned repo:
+
+ ```bash
+ python verify_checkpoints.py logs/checkpoint_checksums_YYYYMMDD_HHMMSS.json
+ ```
+
+ The script will:
+
+ 1. Read the JSON log.
+ 2. For each record, look up the file at `remote_path` under the repo root.
+ 3. Recompute SHA256 and size.
+ 4. Compare with the logged `sha256` and `size_bytes`.
+
+ Example output:
+
+ ```text
+ OK       : 128
+ MISMATCH : 0
+ MISSING  : 0
+ ```
+
+ - **OK** – file exists and matches checksum and size.
+ - **MISMATCH** – file exists but checksum or size does not match the log.
+ - **MISSING** – file listed in the log is not present on disk.
+
+ The script exits with a non-zero status code if there are any mismatches or missing files.
+
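The script checks a full clone. When only a single checkpoint was downloaded (for example with `hf_hub_download`), the same comparison can be applied to one file. This is a sketch under the assumption that the log is a JSON array of records keyed as described in the Checksum Logs section; `verify_one` is a hypothetical helper, not part of the repository.

```python
import hashlib
from pathlib import Path


def sha256_file(path, buf=1024 * 1024):
    """Hash a file in chunks so large checkpoints are not read into memory at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(buf), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_one(records, remote_path, local_path):
    """Return "OK", "MISMATCH", or "MISSING" for a single downloaded file.

    Raises KeyError if remote_path has no entry in the log.
    """
    by_path = {rec["remote_path"]: rec for rec in records}
    if remote_path not in by_path:
        raise KeyError(f"no log entry for {remote_path}")
    rec = by_path[remote_path]

    local = Path(local_path)
    if not local.is_file():
        return "MISSING"

    matches = (
        local.stat().st_size == rec["size_bytes"]
        and sha256_file(local) == rec["sha256"]
    )
    return "OK" if matches else "MISMATCH"
```

Here `remote_path` is the path inside the repo (the `filename` passed to `hf_hub_download`) and `local_path` is the path that call returned.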
+ ---
+
+ ## `verify_checkpoints.py`
+
+ For convenience, the expected content of `verify_checkpoints.py` is:
+
+ ```python
+ import json, hashlib, sys
+ from pathlib import Path
+
+ def sha256_file(path, buf=1024*1024):
+     # Hash the file in 1 MiB chunks to keep memory use low.
+     h = hashlib.sha256()
+     with open(path, "rb") as f:
+         while True:
+             chunk = f.read(buf)
+             if not chunk:
+                 break
+             h.update(chunk)
+     return h.hexdigest()
+
+ def main(log_json: str):
+     log_file = Path(log_json)
+     if not log_file.is_file():
+         print(f"ERROR: log not found: {log_json}")
+         sys.exit(1)
+
+     with log_file.open() as f:
+         records = json.load(f)
+
+     # Paths in the log are resolved relative to the directory holding this script.
+     repo_root = Path(__file__).resolve().parent
+
+     ok = mismatch = missing = 0
+
+     for rec in records:
+         remote_path = rec["remote_path"]
+         expected_sha = rec["sha256"]
+         expected_size = rec["size_bytes"]
+
+         local_path = repo_root / remote_path
+
+         if not local_path.exists():
+             print(f"[MISSING] {remote_path}")
+             missing += 1
+             continue
+
+         actual_size = local_path.stat().st_size
+         actual_sha = sha256_file(local_path)
+
+         if actual_sha == expected_sha and actual_size == expected_size:
+             ok += 1
+         else:
+             mismatch += 1
+             print(f"[MISMATCH] {remote_path}")
+             print(f"  expected sha : {expected_sha}")
+             print(f"  actual sha   : {actual_sha}")
+             print(f"  expected size: {expected_size}")
+             print(f"  actual size  : {actual_size}")
+
+     print()
+     print(f"OK       : {ok}")
+     print(f"MISMATCH : {mismatch}")
+     print(f"MISSING  : {missing}")
+
+     # Non-zero exit status signals any mismatch or missing file.
+     if mismatch or missing:
+         sys.exit(1)
+
+ if __name__ == "__main__":
+     if len(sys.argv) != 2:
+         print("Usage: python verify_checkpoints.py logs/checkpoint_checksums_YYYYMMDD_HHMMSS.json")
+         sys.exit(1)
+     main(sys.argv[1])
+ ```
+
+ You can either copy this script into your local clone, or use the version
+ shipped directly in the repository (if present).
+
+
+ ---
+ license: mit
+ ---