Evgueni Poloukarov, Claude committed
Commit e5de9d8 · 1 Parent(s): e1f5207

fix: enable multi-GPU distribution and optimize for 2x24GB VRAM

Multi-GPU Support:
- Changed device_map from 'cuda' to 'auto' for automatic distribution
- Added GPU detection diagnostics (count and total VRAM logging)
- Enables HuggingFace Accelerate to distribute the model across all GPUs (see the sketch after this list)
- Fixes the single-GPU bottleneck that forced all weights onto GPU 0
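
In isolation, the loading path looks like this (a minimal sketch; the import path and the model id are assumptions, since the diff below only shows the from_pretrained call):

    import torch
    from chronos import Chronos2Pipeline  # assumed import path

    # device_map="auto" lets HuggingFace Accelerate shard layers across every
    # visible GPU instead of pinning all weights to GPU 0.
    pipeline = Chronos2Pipeline.from_pretrained(
        "amazon/chronos-2",  # illustrative id; the pipeline uses self.model_name
        device_map="auto",
        torch_dtype=torch.bfloat16,
    )

    # Diagnostics matching the commit: GPU count and total VRAM.
    if torch.cuda.is_available():
        gpu_count = torch.cuda.device_count()
        total_vram = sum(torch.cuda.get_device_properties(i).total_memory
                         for i in range(gpu_count))
        print(f"Detected {gpu_count} GPU(s), {total_vram / 1e9:.1f} GB total VRAM")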

Memory Optimization:
- Reduced batch_size from 128 to 64 (halves attention memory: 19GB -> 9.5GB)
- Improved tensor cleanup with gc.collect() between borders (helper sketch after this list)
- Prevents memory accumulation across 132 border forecasts
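
The cleanup pattern as a standalone helper (a sketch using only the two calls from the diff; one ordering note: empty_cache() can only return blocks Python has already released, so running gc.collect() first tends to free the most):

    import gc

    import torch

    def free_gpu_transients() -> None:
        """Release intermediate tensors left over from the previous border."""
        gc.collect()              # drop dead Python references to tensors
        torch.cuda.empty_cache()  # hand freed blocks back from PyTorch's caching allocator

Model weights stay resident throughout; only unreferenced intermediates are reclaimed, which is why neither the loaded parameters nor forecast accuracy are affected.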

Hardware Target:
- Primary: 2x24GB L4 GPUs (48 GB total)
- Scales to: 4x24GB L4 GPUs (96 GB total) via device_map='auto'

Context Window:
- 2,160 hours (3 months / 90 days)

Expected Memory Usage:
- Model: 0.24 GB (120M params, bfloat16)
- Attention (batch 64): 9.5 GB (back-of-envelope check after this list)
- Activations: 8-12 GB
- KV Cache: 6-10 GB
- Total: 24-32 GB (fits comfortably in the ~44 GB usable on the 2-GPU setup)
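
A back-of-envelope check on the attention figure (the 9.5 GB value is consistent with materialized bfloat16 attention-score matrices over the 2,160-step context; the 16-head count is an assumption, not stated in this commit):

    batch, seq_len, n_heads, bytes_per_elem = 64, 2160, 16, 2  # bfloat16 = 2 bytes
    attn_gb = batch * n_heads * seq_len**2 * bytes_per_elem / 1e9
    print(f"{attn_gb:.1f} GB")  # ~9.6 GB at batch 64; doubling batch to 128 gives ~19.1 GB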

Co-Authored-By: Claude <noreply@anthropic.com>

src/forecasting/chronos_inference.py CHANGED
@@ -73,7 +73,7 @@ class ChronosInferencePipeline:
 
         self._pipeline = Chronos2Pipeline.from_pretrained(
             self.model_name,
-            device_map=self.device,
+            device_map="auto",  # Auto-distribute across all available GPUs
             torch_dtype=dtype_map.get(self.dtype, torch.float32)
         )
 
@@ -83,8 +83,12 @@
         print(f"Model loaded in {time.time() - start_time:.1f}s")
         print(f"  Device: {next(self._pipeline.model.parameters()).device}")
 
-        # Memory profiling diagnostics
+        # GPU detection and memory profiling diagnostics
         if torch.cuda.is_available():
+            gpu_count = torch.cuda.device_count()
+            total_vram = sum(torch.cuda.get_device_properties(i).total_memory for i in range(gpu_count))
+            print(f"  [GPU] Detected {gpu_count} GPU(s)")
+            print(f"  [GPU] Total VRAM: {total_vram/1e9:.1f} GB")
             print(f"  [MEMORY] After model load:")
             print(f"    GPU memory allocated: {torch.cuda.memory_allocated()/1e9:.2f} GB")
             print(f"    GPU memory reserved: {torch.cuda.memory_reserved()/1e9:.2f} GB")
@@ -193,10 +197,12 @@
         for i, border in enumerate(forecast_borders, 1):
            # Clear GPU cache BEFORE each border to prevent memory accumulation
            # This releases tensors from previous border (no-op on first iteration)
-           # Does NOT affect model weights (710M params stay loaded)
+           # Does NOT affect model weights (120M params stay loaded)
            # Does NOT affect forecast accuracy (each border is independent)
            if i > 1:  # Skip on first border (clean GPU state)
                torch.cuda.empty_cache()
+               import gc
+               gc.collect()  # Force Python garbage collector to free tensors
 
            border_start = time.time()
            print(f"\n  [{i}/{len(forecast_borders)}] {border}...", flush=True)
@@ -223,7 +229,7 @@
            id_column='border',
            timestamp_column='timestamp',
            target='target',
-           batch_size=128,  # Increased from 32 for better temporal attention + faster inference
+           batch_size=64,  # Reduced from 128 for 2-GPU setup (halves attention memory: 19GB -> 9.5GB)
            quantile_levels=[0.01, 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99]  # 9 quantiles for volatility
        )
 
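
For reference, the edited call has this overall shape (a sketch; the method name predict_df is inferred from the keyword arguments, since the diff cuts off above them, and df stands for a long-format frame with one row per (border, timestamp) pair):

    quantiles_df = pipeline.predict_df(
        df,                            # columns: border, timestamp, target
        id_column="border",
        timestamp_column="timestamp",
        target="target",
        batch_size=64,
        quantile_levels=[0.01, 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99],
    )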