JacobLinCool committed on
Commit 66e0fbe · verified · 1 Parent(s): 975d5cb

Upload folder using huggingface_hub

Files changed (42)
  1. .gitignore +2 -1
  2. .ruff_cache/0.12.5/14293067367466839361 +0 -0
  3. README.md +68 -37
  4. exp/baseline1/__init__.py +0 -0
  5. exp/baseline1/data.py +128 -0
  6. exp/baseline1/eval.py +322 -0
  7. exp/baseline1/model.py +62 -0
  8. exp/baseline1/train.py +183 -0
  9. exp/baseline1/utils.py +53 -0
  10. exp/baseline2/__init__.py +0 -0
  11. exp/baseline2/data.py +137 -0
  12. exp/baseline2/eval.py +324 -0
  13. exp/baseline2/model.py +139 -0
  14. exp/baseline2/train.py +215 -0
  15. outputs/baseline1/beats/README.md +10 -0
  16. outputs/baseline1/beats/config.json +3 -0
  17. outputs/baseline1/beats/final/README.md +10 -0
  18. outputs/baseline1/beats/final/config.json +3 -0
  19. outputs/baseline1/beats/final/model.safetensors +3 -0
  20. outputs/baseline1/beats/logs/events.out.tfevents.1766351314.msiit232.1284330.0 +3 -0
  21. outputs/baseline1/beats/model.safetensors +3 -0
  22. outputs/baseline1/downbeats/README.md +10 -0
  23. outputs/baseline1/downbeats/config.json +3 -0
  24. outputs/baseline1/downbeats/final/README.md +10 -0
  25. outputs/baseline1/downbeats/final/config.json +3 -0
  26. outputs/baseline1/downbeats/final/model.safetensors +3 -0
  27. outputs/baseline1/downbeats/logs/events.out.tfevents.1766353075.msiit232.1284330.1 +3 -0
  28. outputs/baseline1/downbeats/model.safetensors +3 -0
  29. outputs/baseline2/beats/README.md +10 -0
  30. outputs/baseline2/beats/config.json +15 -0
  31. outputs/baseline2/beats/final/README.md +10 -0
  32. outputs/baseline2/beats/final/config.json +15 -0
  33. outputs/baseline2/beats/final/model.safetensors +3 -0
  34. outputs/baseline2/beats/logs/events.out.tfevents.1766356346.msiit232.1356098.0 +3 -0
  35. outputs/baseline2/beats/model.safetensors +3 -0
  36. outputs/baseline2/downbeats/README.md +10 -0
  37. outputs/baseline2/downbeats/config.json +15 -0
  38. outputs/baseline2/downbeats/final/README.md +10 -0
  39. outputs/baseline2/downbeats/final/config.json +15 -0
  40. outputs/baseline2/downbeats/final/model.safetensors +3 -0
  41. outputs/baseline2/downbeats/logs/events.out.tfevents.1766359276.msiit232.1356098.1 +3 -0
  42. outputs/baseline2/downbeats/model.safetensors +3 -0
.gitignore CHANGED
@@ -10,4 +10,5 @@ wheels/
.venv

outputs/*
- !outputs/baseline/
+ !outputs/baseline1/
+ !outputs/baseline2/
.ruff_cache/0.12.5/14293067367466839361 CHANGED
Binary files a/.ruff_cache/0.12.5/14293067367466839361 and b/.ruff_cache/0.12.5/14293067367466839361 differ
 
README.md CHANGED
@@ -24,7 +24,7 @@ The dataset is derived from Taiko no Tatsujin rhythm game charts, providing high

| Split | Tracks | Duration | Description |
|-------|--------|----------|-------------|
- | `train` | ~900 | 1-3 min each | Training data with beat/downbeat annotations |
+ | `train` | ~1000 | 1-3 min each | Training data with beat/downbeat annotations |
| `test` | ~100 | 1-3 min each | Held-out test set for final evaluation |

### Data Features
@@ -118,6 +118,17 @@ Downbeat Detection:
Combined Weighted F1: X.XXXX (average of beat and downbeat)
```

+ ### Benchmark Results
+
+ Results evaluated on 100 tracks from the test set:
+
+ | Model | Combined F1 | Beat F1 | Downbeat F1 | CMLt (Beat) | CMLt (Downbeat) |
+ |-------|-------------|---------|-------------|-------------|-----------------|
+ | **Baseline 1 (ODCNN)** | 0.0765 | 0.0861 | 0.0669 | 0.0731 | 0.0321 |
+ | **Baseline 2 (ResNet-SE)** | **0.2775** | **0.3292** | **0.2258** | **0.3287** | **0.1146** |
+
+ *Note: Baseline 2 (ResNet-SE) performs significantly better due to its larger context window and deeper architecture.*
+
---

## Quick Start
@@ -128,35 +139,40 @@ Combined Weighted F1: X.XXXX (average of beat and downbeat)
uv sync
```

- ### Train Baseline Model
+ ### Train Models

```bash
- # Train both beat and downbeat models
- uv run -m exp.baseline.train
+ # Train Baseline 1 (ODCNN)
+ uv run -m exp.baseline1.train

- # Train specific model only
- uv run -m exp.baseline.train --target beats
- uv run -m exp.baseline.train --target downbeats
+ # Train Baseline 2 (ResNet-SE)
+ uv run -m exp.baseline2.train
+
+ # Train specific target only (e.g. for Baseline 2)
+ uv run -m exp.baseline2.train --target beats
+ uv run -m exp.baseline2.train --target downbeats
```

### Run Evaluation

```bash
- # Basic evaluation
- uv run -m exp.baseline.eval
+ # Evaluation (replace baseline1 with baseline2 to evaluate the new model)
+ uv run -m exp.baseline1.eval

# Full evaluation with visualization and audio
- uv run -m exp.baseline.eval --visualize --synthesize --summary-plot
+ uv run -m exp.baseline1.eval --visualize --synthesize --summary-plot

# Evaluate on more samples with custom output directory
- uv run -m exp.baseline.eval --num-samples 50 --output-dir outputs/my_eval
+ uv run -m exp.baseline1.eval --num-samples 50 --output-dir outputs/eval_baseline1
```

### Evaluation Options

| Option | Description |
|--------|-------------|
- | `--model-dir DIR` | Model directory (default: `outputs/baseline`) |
+ | `--model-dir DIR` | Model directory (default: `outputs/baseline1`) |
| `--num-samples N` | Number of samples to evaluate (default: 20) |
| `--output-dir DIR` | Output directory (default: `outputs/eval`) |
| `--visualize` | Generate visualization plots for each track |
@@ -175,7 +191,7 @@ uv run -m exp.baseline.eval --num-samples 50 --output-dir outputs/my_eval
Generate plots comparing predicted vs ground truth beats:

```bash
- uv run -m exp.baseline.eval --visualize --viz-tracks 10
+ uv run -m exp.baseline1.eval --visualize --viz-tracks 10
```

Output: `outputs/eval/plots/track_XXX.png`
@@ -185,7 +201,7 @@ Output: `outputs/eval/plots/track_XXX.png`
Generate audio files with click sounds overlaid on the original music:

```bash
- uv run -m exp.baseline.eval --synthesize
+ uv run -m exp.baseline1.eval --synthesize
```

Output files in `outputs/eval/audio/`:
@@ -198,37 +214,48 @@ Output files in `outputs/eval/audio/`:
Generate bar charts summarizing F1 scores and continuity metrics:

```bash
- uv run -m exp.baseline.eval --summary-plot
+ uv run -m exp.baseline1.eval --summary-plot
```

Output: `outputs/eval/evaluation_summary.png`

---

- ## Baseline Model
+ ## Models

- The provided baseline implements the **Onset Detection CNN (ODCNN)** architecture:
+ ### Baseline 1: ODCNN

- ### Architecture
+ A decade-old baseline model: <https://ieeexplore.ieee.org/document/6854953>.
+
+ The original baseline implements the **Onset Detection CNN (ODCNN)** architecture:
+
+ #### Architecture
- **Input**: Multi-view mel spectrogram (3 window sizes: 23ms, 46ms, 93ms)
- **CNN Backbone**: 3 convolutional blocks with max pooling
- **Output**: Frame-level beat/downbeat probability
+ - **Inference**: ±7 frames context (±70ms)

- ### Training Details
+ ### Baseline 2: ResNet-SE

- - **Optimizer**: SGD with momentum (0.9)
- - **Learning Rate**: 0.05 with cosine annealing
- - **Loss**: Binary Cross-Entropy
- - **Epochs**: 50
- - **Batch Size**: 512
+ Inspired by Squeeze-and-Excitation networks: <https://arxiv.org/abs/1709.01507>.
+
+ A modernized architecture designed to capture longer temporal context:
+
+ #### Architecture
+ - **Input**: Mel spectrogram with larger context
+ - **Backbone**: ResNet with Squeeze-and-Excitation (SE) blocks
+ - **Context**: **±50 frames (~1s)** window
+ - **Features**: Deeper network (4 stages) with effective channel attention
+ - **Parameters**: ~400k (small and efficient)

- ### Inference Pipeline
+ ### Training Details

- 1. Compute multi-view mel spectrogram on GPU
- 2. Sliding window inference (±7 frames context = ±70ms)
- 3. Hamming window smoothing
- 4. Peak picking with threshold (0.5) and minimum distance (5 frames)
+ Both models use similar training loops:
+ - **Optimizer**: SGD (Baseline 1) / AdamW (Baseline 2)
+ - **Learning Rate**: Cosine annealing
+ - **Loss**: Binary Cross-Entropy
+ - **Epochs**: 50 (Baseline 1) / 3 (Baseline 2)
+ - **Batch Size**: 512 (Baseline 1) / 128 (Baseline 2)

---

@@ -237,21 +264,25 @@ The provided baseline implements the **Onset Detection CNN (ODCNN)** architectur
```
exp-onset/
├── exp/
- │   ├── baseline/            # Baseline model implementation
+ │   ├── baseline1/           # Baseline 1 (ODCNN)
│   │   ├── model.py           # ODCNN architecture
- │   │   ├── train.py         # Training script
- │   │   ├── eval.py          # Evaluation with viz/audio
- │   │   ├── data.py          # Dataset wrapper
- │   │   └── utils.py         # Spectrogram processing
+ │   │   ├── train.py
+ │   │   ├── eval.py
+ │   │   ├── data.py
+ │   │   └── utils.py
+ │   ├── baseline2/           # Baseline 2 (ResNet-SE)
+ │   │   ├── model.py         # ResNet-SE
+ │   │   ├── train.py
+ │   │   ├── eval.py
+ │   │   └── data.py
│   └── data/
│       ├── load.py            # Dataset loading & preprocessing
│       ├── eval.py            # Evaluation metrics (F1, CML, AML)
│       ├── audio.py           # Click track synthesis
│       └── viz.py             # Visualization utilities
├── outputs/
- │   ├── baseline/            # Trained models
- │   │   ├── beats/           # Beat detection model
- │   │   └── downbeats/       # Downbeat detection model
+ │   ├── baseline1/           # Trained models (Baseline 1)
+ │   ├── baseline2/           # Trained models (Baseline 2)
│   └── eval/                  # Evaluation outputs
│       ├── plots/             # Visualization images
│       ├── audio/             # Click track audio files
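Not part of the commit itself, but for orientation: a minimal sketch of how the uploaded Baseline 1 checkpoints could be loaded and run on a single file, reusing the helpers added in `exp/baseline1/eval.py` below. The audio path is a placeholder, and note that importing `exp.baseline1.eval` also imports the dataset loader from `exp/data/load.py`.

```python
# Sketch: load an uploaded checkpoint and track beats on one waveform.
# Assumes the repo layout from the README; "my_track.wav" is a placeholder.
import torch
import torchaudio

from exp.baseline1.model import ODCNN
from exp.baseline1.eval import get_activation_function, pick_peaks

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ODCNN.from_pretrained("outputs/baseline1/beats").to(device).eval()

waveform, sr = torchaudio.load("my_track.wav")                    # (channels, samples)
waveform = torchaudio.functional.resample(waveform, sr, 16000).mean(dim=0)

activations = get_activation_function(model, waveform, device)    # frame-level probabilities
beat_times = pick_peaks(activations)                               # beat times in seconds
print(beat_times[:10])
```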
exp/baseline1/__init__.py ADDED
File without changes
exp/baseline1/data.py ADDED
@@ -0,0 +1,128 @@
1
+ import torch
2
+ from torch.utils.data import Dataset
3
+ import numpy as np
4
+ from tqdm import tqdm
5
+ from .utils import extract_context
6
+
7
+
8
+ class BeatTrackingDataset(Dataset):
9
+ def __init__(
10
+ self, hf_dataset, target_type="beats", sample_rate=16000, hop_length=160
11
+ ):
12
+ """
13
+ Args:
14
+ hf_dataset: HuggingFace dataset object
15
+ target_type (str): "beats" or "downbeats". Determines which labels are treated as positive.
16
+ """
17
+ self.sr = sample_rate
18
+ self.hop_length = hop_length
19
+ self.target_type = target_type
20
+
21
+ # Context window size in samples (7 frames = 70ms at 100fps)
22
+ self.context_frames = 7
23
+ self.context_samples = (self.context_frames * 2 + 1) * hop_length + max(
24
+ [368, 736, 1488]
25
+ ) # extra for FFT window
26
+
27
+ # Cache audio arrays in memory for fast access
28
+ self.audio_cache = []
29
+ self.indices = []
30
+ self._prepare_indices(hf_dataset)
31
+
32
+ def _prepare_indices(self, hf_dataset):
33
+ """
34
+ Prepares balanced indices and caches audio.
35
+ Paper Section 4.5: Uses "Fuzzier" training examples (neighbors weighted less).
36
+ """
37
+ print(f"Preparing dataset indices for target: {self.target_type}...")
38
+
39
+ for i, item in tqdm(
40
+ enumerate(hf_dataset), total=len(hf_dataset), desc="Building indices"
41
+ ):
42
+ # Cache audio array (convert to numpy if tensor)
43
+ audio = item["audio"]["array"]
44
+ if hasattr(audio, "numpy"):
45
+ audio = audio.numpy()
46
+ self.audio_cache.append(audio)
47
+
48
+ # Calculate total frames available in audio
49
+ audio_len = len(audio)
50
+ n_frames = int(audio_len / self.hop_length)
51
+
52
+ # Select ground truth based on target_type
53
+ if self.target_type == "downbeats":
54
+ # Only downbeats are positives
55
+ gt_times = item["downbeats"]
56
+ else:
57
+ # All beats are positives (downbeats are also beats)
58
+ gt_times = item["beats"]
59
+
60
+ # Convert to list if tensor
61
+ if hasattr(gt_times, "tolist"):
62
+ gt_times = gt_times.tolist()
63
+
64
+ gt_frames = set([int(t * self.sr / self.hop_length) for t in gt_times])
65
+
66
+ # --- Positive Examples (with Fuzziness) ---
67
+ # "define a single frame before and after each annotated onset to be additional positive examples"
68
+ pos_frames = set()
69
+ for bf in gt_frames:
70
+ if 0 <= bf < n_frames:
71
+ self.indices.append((i, bf, 1.0)) # Center frame (Sharp onset)
72
+ pos_frames.add(bf)
73
+
74
+ # Neighbors weighted at 0.25
75
+ if 0 <= bf - 1 < n_frames:
76
+ self.indices.append((i, bf - 1, 0.25))
77
+ pos_frames.add(bf - 1)
78
+ if 0 <= bf + 1 < n_frames:
79
+ self.indices.append((i, bf + 1, 0.25))
80
+ pos_frames.add(bf + 1)
81
+
82
+ # --- Negative Examples ---
83
+ # Paper uses "all others as negative", but we balance 2:1 for stable SGD.
84
+ num_pos = len(pos_frames)
85
+ num_neg = num_pos * 2
86
+
87
+ count = 0
88
+ attempts = 0
89
+ while count < num_neg and attempts < num_neg * 5:
90
+ f = np.random.randint(0, n_frames)
91
+ if f not in pos_frames:
92
+ self.indices.append((i, f, 0.0))
93
+ count += 1
94
+ attempts += 1
95
+
96
+ print(
97
+ f"Dataset ready. {len(self.indices)} samples, {len(self.audio_cache)} tracks cached."
98
+ )
99
+
100
+ def __len__(self):
101
+ return len(self.indices)
102
+
103
+ def __getitem__(self, idx):
104
+ track_idx, frame_idx, label = self.indices[idx]
105
+
106
+ # Fast lookup from cache
107
+ audio = self.audio_cache[track_idx]
108
+ audio_len = len(audio)
109
+
110
+ # Calculate sample range for context window
111
+ center_sample = frame_idx * self.hop_length
112
+ half_context = self.context_samples // 2
113
+ start = center_sample - half_context
114
+ end = center_sample + half_context
115
+
116
+ # Handle padding if needed
117
+ pad_left = max(0, -start)
118
+ pad_right = max(0, end - audio_len)
119
+ start = max(0, start)
120
+ end = min(audio_len, end)
121
+
122
+ # Extract audio chunk
123
+ chunk = audio[start:end]
124
+ if pad_left > 0 or pad_right > 0:
125
+ chunk = np.pad(chunk, (pad_left, pad_right), mode="constant")
126
+
127
+ waveform = torch.tensor(chunk, dtype=torch.float32)
128
+ return waveform, torch.tensor([label], dtype=torch.float32)
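For readers of the diff: a small sketch (not in the commit) of the frame/label scheme `_prepare_indices` implements above — beat times are quantized to 100 fps frames (sr=16000, hop=160), the exact frame becomes a positive with weight 1.0, and its two neighbours get the "fuzzy" weight 0.25.

```python
# Toy illustration of the quantization and soft labels used above.
sr, hop = 16000, 160                      # 100 frames per second
beats = [0.50, 1.02, 1.51]                # toy annotations (seconds)

labels = {}
for t in beats:
    f = int(t * sr / hop)                 # beat time -> frame index
    labels[f] = 1.0                       # sharp onset frame
    labels.setdefault(f - 1, 0.25)        # fuzzy neighbours
    labels.setdefault(f + 1, 0.25)

print(sorted(labels.items()))
# [(49, 0.25), (50, 1.0), (51, 0.25), (101, 0.25), (102, 1.0), (103, 0.25),
#  (150, 0.25), (151, 1.0), (152, 0.25)]
```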
exp/baseline1/eval.py ADDED
@@ -0,0 +1,322 @@
1
+ import torch
2
+ import numpy as np
3
+ from tqdm import tqdm
4
+ from scipy.signal import find_peaks
5
+ import argparse
6
+ import os
7
+
8
+ from .model import ODCNN
9
+ from .utils import MultiViewSpectrogram
10
+ from ..data.load import ds
11
+ from ..data.eval import evaluate_all, format_results
12
+
13
+
14
+ def get_activation_function(model, waveform, device):
15
+ """
16
+ Computes probability curve over time.
17
+ """
18
+ processor = MultiViewSpectrogram().to(device)
19
+ waveform = waveform.unsqueeze(0).to(device)
20
+
21
+ with torch.no_grad():
22
+ spec = processor(waveform)
23
+
24
+ # Normalize
25
+ mean = spec.mean(dim=(2, 3), keepdim=True)
26
+ std = spec.std(dim=(2, 3), keepdim=True) + 1e-6
27
+ spec = (spec - mean) / std
28
+
29
+ # Batchify with sliding window
30
+ spec = torch.nn.functional.pad(spec, (7, 7)) # Pad time
31
+ windows = spec.unfold(3, 15, 1) # (1, 3, 80, Time, 15)
32
+ windows = windows.permute(3, 0, 1, 2, 4).squeeze(1) # (Time, 3, 80, 15)
33
+
34
+ # Inference
35
+ activations = []
36
+ batch_size = 512
37
+ for i in range(0, len(windows), batch_size):
38
+ batch = windows[i : i + batch_size]
39
+ out = model(batch)
40
+ activations.append(out.cpu().numpy())
41
+
42
+ return np.concatenate(activations).flatten()
43
+
44
+
45
+ def pick_peaks(activations, hop_length=160, sr=16000):
46
+ """
47
+ Smooth with Hamming window and report local maxima.
48
+ """
49
+ # Smoothing
50
+ window = np.hamming(5)
51
+ window /= window.sum()
52
+ smoothed = np.convolve(activations, window, mode="same")
53
+
54
+ # Peak Picking
55
+ peaks, _ = find_peaks(smoothed, height=0.5, distance=5)
56
+
57
+ timestamps = peaks * hop_length / sr
58
+ return timestamps.tolist()
59
+
60
+
61
+ def visualize_track(
62
+ audio: np.ndarray,
63
+ sr: int,
64
+ pred_beats: list[float],
65
+ pred_downbeats: list[float],
66
+ gt_beats: list[float],
67
+ gt_downbeats: list[float],
68
+ output_dir: str,
69
+ track_idx: int,
70
+ time_range: tuple[float, float] | None = None,
71
+ ):
72
+ """
73
+ Create and save visualizations for a single track.
74
+ """
75
+ from ..data.viz import plot_waveform_with_beats, save_figure
76
+
77
+ os.makedirs(output_dir, exist_ok=True)
78
+
79
+ # Full waveform plot
80
+ fig = plot_waveform_with_beats(
81
+ audio,
82
+ sr,
83
+ pred_beats,
84
+ gt_beats,
85
+ pred_downbeats,
86
+ gt_downbeats,
87
+ title=f"Track {track_idx}: Beat Comparison",
88
+ time_range=time_range,
89
+ )
90
+ save_figure(fig, os.path.join(output_dir, f"track_{track_idx:03d}.png"))
91
+
92
+
93
+ def synthesize_audio(
94
+ audio: np.ndarray,
95
+ sr: int,
96
+ pred_beats: list[float],
97
+ pred_downbeats: list[float],
98
+ gt_beats: list[float],
99
+ gt_downbeats: list[float],
100
+ output_dir: str,
101
+ track_idx: int,
102
+ click_volume: float = 0.5,
103
+ ):
104
+ """
105
+ Create and save audio files with click tracks for a single track.
106
+ """
107
+ from ..data.audio import create_comparison_audio, save_audio
108
+
109
+ os.makedirs(output_dir, exist_ok=True)
110
+
111
+ # Create comparison audio
112
+ audio_pred, audio_gt, audio_both = create_comparison_audio(
113
+ audio,
114
+ pred_beats,
115
+ pred_downbeats,
116
+ gt_beats,
117
+ gt_downbeats,
118
+ sr=sr,
119
+ click_volume=click_volume,
120
+ )
121
+
122
+ # Save audio files
123
+ save_audio(
124
+ audio_pred, os.path.join(output_dir, f"track_{track_idx:03d}_pred.wav"), sr
125
+ )
126
+ save_audio(audio_gt, os.path.join(output_dir, f"track_{track_idx:03d}_gt.wav"), sr)
127
+ save_audio(
128
+ audio_both, os.path.join(output_dir, f"track_{track_idx:03d}_both.wav"), sr
129
+ )
130
+
131
+
132
+ def main():
133
+ parser = argparse.ArgumentParser(
134
+ description="Evaluate beat tracking models with visualization and audio synthesis"
135
+ )
136
+ parser.add_argument(
137
+ "--model-dir",
138
+ type=str,
139
+ default="outputs/baseline1",
140
+ help="Base directory containing trained models (with 'beats' and 'downbeats' subdirs)",
141
+ )
142
+ parser.add_argument(
143
+ "--num-samples",
144
+ type=int,
145
+ default=116,
146
+ help="Number of samples to evaluate",
147
+ )
148
+ parser.add_argument(
149
+ "--output-dir",
150
+ type=str,
151
+ default="outputs/eval_baseline1",
152
+ help="Directory to save visualizations and audio",
153
+ )
154
+ parser.add_argument(
155
+ "--visualize",
156
+ action="store_true",
157
+ help="Generate visualization plots for each track",
158
+ )
159
+ parser.add_argument(
160
+ "--synthesize",
161
+ action="store_true",
162
+ help="Generate audio files with click tracks",
163
+ )
164
+ parser.add_argument(
165
+ "--viz-tracks",
166
+ type=int,
167
+ default=5,
168
+ help="Number of tracks to visualize/synthesize (default: 5)",
169
+ )
170
+ parser.add_argument(
171
+ "--time-range",
172
+ type=float,
173
+ nargs=2,
174
+ default=None,
175
+ metavar=("START", "END"),
176
+ help="Time range for visualization in seconds (default: full track)",
177
+ )
178
+ parser.add_argument(
179
+ "--click-volume",
180
+ type=float,
181
+ default=0.5,
182
+ help="Volume of click sounds relative to audio (0.0 to 1.0)",
183
+ )
184
+ parser.add_argument(
185
+ "--summary-plot",
186
+ action="store_true",
187
+ help="Generate summary evaluation plot",
188
+ )
189
+ args = parser.parse_args()
190
+
191
+ DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
192
+
193
+ # Load BOTH models using from_pretrained
194
+ beat_model = None
195
+ downbeat_model = None
196
+
197
+ has_beats = False
198
+ has_downbeats = False
199
+
200
+ beats_dir = os.path.join(args.model_dir, "beats")
201
+ downbeats_dir = os.path.join(args.model_dir, "downbeats")
202
+
203
+ if os.path.exists(os.path.join(beats_dir, "model.safetensors")):
204
+ beat_model = ODCNN.from_pretrained(beats_dir).to(DEVICE)
205
+ beat_model.eval()
206
+ has_beats = True
207
+ print(f"Loaded Beat Model from {beats_dir}")
208
+ else:
209
+ print(f"Warning: No beat model found in {beats_dir}")
210
+
211
+ if os.path.exists(os.path.join(downbeats_dir, "model.safetensors")):
212
+ downbeat_model = ODCNN.from_pretrained(downbeats_dir).to(DEVICE)
213
+ downbeat_model.eval()
214
+ has_downbeats = True
215
+ print(f"Loaded Downbeat Model from {downbeats_dir}")
216
+ else:
217
+ print(f"Warning: No downbeat model found in {downbeats_dir}")
218
+
219
+ if not has_beats and not has_downbeats:
220
+ print("No models found. Please run training first.")
221
+ return
222
+
223
+ predictions = []
224
+ ground_truths = []
225
+ audio_data = [] # Store audio for visualization/synthesis
226
+
227
+ # Eval on specified number of tracks
228
+ test_set = ds["train"].select(range(args.num_samples))
229
+
230
+ print("Running evaluation...")
231
+ for i, item in enumerate(tqdm(test_set)):
232
+ waveform = torch.tensor(item["audio"]["array"], dtype=torch.float32)
233
+ waveform_device = waveform.to(DEVICE)
234
+
235
+ pred_entry = {"beats": [], "downbeats": []}
236
+
237
+ # 1. Predict Beats
238
+ if has_beats:
239
+ act_b = get_activation_function(beat_model, waveform_device, DEVICE)
240
+ pred_entry["beats"] = pick_peaks(act_b)
241
+
242
+ # 2. Predict Downbeats
243
+ if has_downbeats:
244
+ act_d = get_activation_function(downbeat_model, waveform_device, DEVICE)
245
+ pred_entry["downbeats"] = pick_peaks(act_d)
246
+
247
+ predictions.append(pred_entry)
248
+ ground_truths.append({"beats": item["beats"], "downbeats": item["downbeats"]})
249
+
250
+ # Store audio for later visualization/synthesis
251
+ if args.visualize or args.synthesize:
252
+ if i < args.viz_tracks:
253
+ audio_data.append(
254
+ {
255
+ "audio": waveform.numpy(),
256
+ "sr": item["audio"]["sampling_rate"],
257
+ "pred": pred_entry,
258
+ "gt": ground_truths[-1],
259
+ }
260
+ )
261
+
262
+ # Run evaluation
263
+ results = evaluate_all(predictions, ground_truths)
264
+ print(format_results(results))
265
+
266
+ # Create output directory
267
+ if args.visualize or args.synthesize or args.summary_plot:
268
+ os.makedirs(args.output_dir, exist_ok=True)
269
+
270
+ # Generate visualizations
271
+ if args.visualize:
272
+ print(f"\nGenerating visualizations for {len(audio_data)} tracks...")
273
+ viz_dir = os.path.join(args.output_dir, "plots")
274
+ for i, data in enumerate(tqdm(audio_data, desc="Visualizing")):
275
+ time_range = tuple(args.time_range) if args.time_range else None
276
+ visualize_track(
277
+ data["audio"],
278
+ data["sr"],
279
+ data["pred"]["beats"],
280
+ data["pred"]["downbeats"],
281
+ data["gt"]["beats"],
282
+ data["gt"]["downbeats"],
283
+ viz_dir,
284
+ i,
285
+ time_range=time_range,
286
+ )
287
+ print(f"Saved visualizations to {viz_dir}")
288
+
289
+ # Generate audio with clicks
290
+ if args.synthesize:
291
+ print(f"\nSynthesizing audio for {len(audio_data)} tracks...")
292
+ audio_dir = os.path.join(args.output_dir, "audio")
293
+ for i, data in enumerate(tqdm(audio_data, desc="Synthesizing")):
294
+ synthesize_audio(
295
+ data["audio"],
296
+ data["sr"],
297
+ data["pred"]["beats"],
298
+ data["pred"]["downbeats"],
299
+ data["gt"]["beats"],
300
+ data["gt"]["downbeats"],
301
+ audio_dir,
302
+ i,
303
+ click_volume=args.click_volume,
304
+ )
305
+ print(f"Saved audio files to {audio_dir}")
306
+ print(" *_pred.wav - Original audio with predicted beat clicks")
307
+ print(" *_gt.wav - Original audio with ground truth beat clicks")
308
+ print(" *_both.wav - Original audio with both predicted and GT clicks")
309
+
310
+ # Generate summary plot
311
+ if args.summary_plot:
312
+ from ..data.viz import plot_evaluation_summary, save_figure
313
+
314
+ print("\nGenerating summary plot...")
315
+ fig = plot_evaluation_summary(results, title="Beat Tracking Evaluation Summary")
316
+ summary_path = os.path.join(args.output_dir, "evaluation_summary.png")
317
+ save_figure(fig, summary_path)
318
+ print(f"Saved summary plot to {summary_path}")
319
+
320
+
321
+ if __name__ == "__main__":
322
+ main()
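A self-contained sketch (not in the commit) of the peak-picking step defined above, using the same parameters as `pick_peaks`: a normalized 5-tap Hamming smoother, a 0.5 threshold, and a 5-frame (50 ms) minimum peak distance.

```python
# Synthetic activation curve with broad beat bumps every 0.5 s.
import numpy as np
from scipy.signal import find_peaks

act = np.zeros(300)
for f in (50, 100, 150, 200):
    act[f - 1 : f + 2] = [0.6, 1.0, 0.6]

window = np.hamming(5)
window /= window.sum()
smoothed = np.convolve(act, window, mode="same")     # same smoothing as pick_peaks
peaks, _ = find_peaks(smoothed, height=0.5, distance=5)

print(peaks * 160 / 16000)                           # [0.5 1.  1.5 2. ]
```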
exp/baseline1/model.py ADDED
@@ -0,0 +1,62 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ from huggingface_hub import PyTorchModelHubMixin
4
+
5
+
6
+ class ODCNN(nn.Module, PyTorchModelHubMixin):
7
+ def __init__(self, dropout_rate=0.5):
8
+ super().__init__()
9
+
10
+ # Input 3 channels, 80 bands
11
+ # Conv 1: 7x3 filters -> 10 maps
12
+ self.conv1 = nn.Conv2d(3, 10, kernel_size=(3, 7))
13
+ self.relu1 = nn.ReLU() # ReLU improvement
14
+ self.pool1 = nn.MaxPool2d(kernel_size=(3, 1), stride=(3, 1))
15
+
16
+ # Conv 2: 3x3 filters -> 20 maps
17
+ self.conv2 = nn.Conv2d(10, 20, kernel_size=(3, 3))
18
+ self.relu2 = nn.ReLU()
19
+ self.pool2 = nn.MaxPool2d(kernel_size=(3, 1), stride=(3, 1))
20
+
21
+ # Flatten size calculation based on architecture
22
+ # (20 feature maps * 8 freq bands * 7 time frames)
23
+ self.flatten_size = 20 * 8 * 7
24
+
25
+ # Dropout on FC inputs
26
+ self.dropout = nn.Dropout(p=dropout_rate)
27
+
28
+ # 256 Hidden Units
29
+ self.fc1 = nn.Linear(self.flatten_size, 256)
30
+ self.relu_fc = nn.ReLU()
31
+
32
+ # Output Unit
33
+ self.fc2 = nn.Linear(256, 1)
34
+ self.sigmoid = nn.Sigmoid()
35
+
36
+ def forward(self, x):
37
+ x = self.conv1(x)
38
+ x = self.relu1(x)
39
+ x = self.pool1(x)
40
+
41
+ x = self.conv2(x)
42
+ x = self.relu2(x)
43
+ x = self.pool2(x)
44
+
45
+ x = x.view(x.size(0), -1)
46
+
47
+ x = self.dropout(x)
48
+ x = self.fc1(x)
49
+ x = self.relu_fc(x)
50
+
51
+ x = self.dropout(x)
52
+ x = self.fc2(x)
53
+ x = self.sigmoid(x)
54
+
55
+ return x
56
+
57
+
58
+ if __name__ == "__main__":
59
+ from torchinfo import summary
60
+
61
+ model = ODCNN()
62
+ summary(model, (1, 3, 80, 15))
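As a sanity check on `flatten_size` above, a short sketch (not in the commit) that walks the (B, 3, 80, 15) input used in `train.py` through the layers:

```python
# Shape walk-through for ODCNN.flatten_size:
#   conv1 (3x7): 80 -> 78 mel bins, 15 -> 9 frames   -> (B, 10, 78, 9)
#   pool1 (3x1): 78 -> 26                            -> (B, 10, 26, 9)
#   conv2 (3x3): 26 -> 24, 9 -> 7                    -> (B, 20, 24, 7)
#   pool2 (3x1): 24 -> 8                             -> (B, 20, 8, 7)  => 20*8*7 = 1120
import torch
from exp.baseline1.model import ODCNN

with torch.no_grad():
    out = ODCNN().eval()(torch.randn(2, 3, 80, 15))
print(out.shape)  # torch.Size([2, 1]) -- one beat probability per example
```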
exp/baseline1/train.py ADDED
@@ -0,0 +1,183 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ import torch.optim as optim
4
+ from torch.utils.data import DataLoader
5
+ from torch.utils.tensorboard import SummaryWriter
6
+ from tqdm import tqdm
7
+ import argparse
8
+ import os
9
+
10
+ from .model import ODCNN
11
+ from .data import BeatTrackingDataset
12
+ from .utils import MultiViewSpectrogram
13
+ from ..data.load import ds
14
+
15
+
16
+ def train(target_type: str, output_dir: str):
17
+ # Note: Paper uses SGD with Momentum, Dropout, and ReLU
18
+ DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
19
+ BATCH_SIZE = 512
20
+ EPOCHS = 50
21
+ LR = 0.05
22
+ MOMENTUM = 0.9
23
+ NUM_WORKERS = 4
24
+
25
+ print(f"--- Training Model for target: {target_type} ---")
26
+ print(f"Output directory: {output_dir}")
27
+
28
+ # Create output directory
29
+ os.makedirs(output_dir, exist_ok=True)
30
+
31
+ # TensorBoard writer
32
+ writer = SummaryWriter(log_dir=os.path.join(output_dir, "logs"))
33
+
34
+ # Data - use existing train/test splits
35
+ train_dataset = BeatTrackingDataset(ds["train"], target_type=target_type)
36
+ val_dataset = BeatTrackingDataset(ds["test"], target_type=target_type)
37
+
38
+ train_loader = DataLoader(
39
+ train_dataset,
40
+ batch_size=BATCH_SIZE,
41
+ shuffle=True,
42
+ num_workers=NUM_WORKERS,
43
+ pin_memory=True,
44
+ prefetch_factor=4,
45
+ persistent_workers=True,
46
+ )
47
+ val_loader = DataLoader(
48
+ val_dataset,
49
+ batch_size=BATCH_SIZE,
50
+ shuffle=False,
51
+ num_workers=NUM_WORKERS,
52
+ pin_memory=True,
53
+ prefetch_factor=4,
54
+ persistent_workers=True,
55
+ )
56
+
57
+ print(f"Train samples: {len(train_dataset)}, Val samples: {len(val_dataset)}")
58
+
59
+ # Model
60
+ model = ODCNN(dropout_rate=0.5).to(DEVICE)
61
+
62
+ # GPU Spectrogram Preprocessor
63
+ preprocessor = MultiViewSpectrogram(sample_rate=16000, hop_length=160).to(DEVICE)
64
+
65
+ # Optimizer
66
+ optimizer = optim.SGD(model.parameters(), lr=LR, momentum=MOMENTUM)
67
+ scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)
68
+ criterion = nn.BCELoss() # Binary Cross Entropy
69
+
70
+ best_val_loss = float("inf")
71
+ global_step = 0
72
+
73
+ for epoch in range(EPOCHS):
74
+ # Training
75
+ model.train()
76
+ total_train_loss = 0
77
+ for waveform, y in tqdm(
78
+ train_loader,
79
+ desc=f"[{target_type}] Epoch {epoch + 1}/{EPOCHS} Train",
80
+ leave=False,
81
+ ):
82
+ waveform, y = waveform.to(DEVICE), y.to(DEVICE)
83
+
84
+ # Compute spectrogram on GPU
85
+ with torch.no_grad():
86
+ spec = preprocessor(waveform) # (B, 3, 80, T)
87
+ # Normalize
88
+ mean = spec.mean(dim=(2, 3), keepdim=True)
89
+ std = spec.std(dim=(2, 3), keepdim=True) + 1e-6
90
+ spec = (spec - mean) / std
91
+ # Extract center context (T should be ~15 frames)
92
+ x = spec[:, :, :, 7:22] # center 15 frames
93
+
94
+ optimizer.zero_grad()
95
+ output = model(x)
96
+ loss = criterion(output, y)
97
+ loss.backward()
98
+ optimizer.step()
99
+
100
+ total_train_loss += loss.item()
101
+ global_step += 1
102
+
103
+ # Log batch loss
104
+ writer.add_scalar("train/batch_loss", loss.item(), global_step)
105
+
106
+ avg_train_loss = total_train_loss / len(train_loader)
107
+
108
+ # Validation
109
+ model.eval()
110
+ total_val_loss = 0
111
+ with torch.no_grad():
112
+ for waveform, y in tqdm(
113
+ val_loader,
114
+ desc=f"[{target_type}] Epoch {epoch + 1}/{EPOCHS} Val",
115
+ leave=False,
116
+ ):
117
+ waveform, y = waveform.to(DEVICE), y.to(DEVICE)
118
+
119
+ # Compute spectrogram on GPU
120
+ spec = preprocessor(waveform) # (B, 3, 80, T)
121
+ # Normalize
122
+ mean = spec.mean(dim=(2, 3), keepdim=True)
123
+ std = spec.std(dim=(2, 3), keepdim=True) + 1e-6
124
+ spec = (spec - mean) / std
125
+ # Extract center context
126
+ x = spec[:, :, :, 7:22]
127
+
128
+ output = model(x)
129
+ loss = criterion(output, y)
130
+ total_val_loss += loss.item()
131
+
132
+ avg_val_loss = total_val_loss / len(val_loader)
133
+
134
+ # Log epoch metrics
135
+ writer.add_scalar("train/epoch_loss", avg_train_loss, epoch)
136
+ writer.add_scalar("val/loss", avg_val_loss, epoch)
137
+ writer.add_scalar("train/learning_rate", scheduler.get_last_lr()[0], epoch)
138
+
139
+ # Step the scheduler
140
+ scheduler.step()
141
+
142
+ print(
143
+ f"[{target_type}] Epoch {epoch + 1}/{EPOCHS} - "
144
+ f"Train Loss: {avg_train_loss:.4f}, Val Loss: {avg_val_loss:.4f}"
145
+ )
146
+
147
+ # Save best model
148
+ if avg_val_loss < best_val_loss:
149
+ best_val_loss = avg_val_loss
150
+ model.save_pretrained(output_dir)
151
+ print(f" -> Saved best model (val_loss: {best_val_loss:.4f})")
152
+
153
+ writer.close()
154
+
155
+ # Save final model
156
+ final_dir = os.path.join(output_dir, "final")
157
+ model.save_pretrained(final_dir)
158
+ print(f"Saved final model to {final_dir}")
159
+
160
+
161
+ if __name__ == "__main__":
162
+ parser = argparse.ArgumentParser()
163
+ parser.add_argument(
164
+ "--target",
165
+ type=str,
166
+ choices=["beats", "downbeats"],
167
+ default=None,
168
+ help="Train a model for 'beats' or 'downbeats'. If not specified, trains both.",
169
+ )
170
+ parser.add_argument(
171
+ "--output-dir",
172
+ type=str,
173
+ default="outputs/baseline1",
174
+ help="Directory to save model and logs",
175
+ )
176
+ args = parser.parse_args()
177
+
178
+ # Determine which targets to train
179
+ targets = [args.target] if args.target else ["beats", "downbeats"]
180
+
181
+ for target in targets:
182
+ output_dir = os.path.join(args.output_dir, target)
183
+ train(target, output_dir)
exp/baseline1/utils.py ADDED
@@ -0,0 +1,53 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ import torchaudio.transforms as T
4
+ import numpy as np
5
+
6
+
7
+ class MultiViewSpectrogram(nn.Module):
8
+ def __init__(self, sample_rate=16000, n_mels=80, hop_length=160):
9
+ super().__init__()
10
+ # Window sizes: 23ms, 46ms, 93ms
11
+ self.win_lengths = [368, 736, 1488]
12
+ self.transforms = nn.ModuleList()
13
+
14
+ for win_len in self.win_lengths:
15
+ n_fft = 2 ** int(np.ceil(np.log2(win_len)))
16
+ mel = T.MelSpectrogram(
17
+ sample_rate=sample_rate,
18
+ n_fft=n_fft,
19
+ win_length=win_len,
20
+ hop_length=hop_length,
21
+ f_min=27.5,
22
+ f_max=16000.0,
23
+ n_mels=n_mels,
24
+ power=1.0,
25
+ center=True,
26
+ )
27
+ self.transforms.append(mel)
28
+
29
+ def forward(self, waveform):
30
+ specs = []
31
+ for transform in self.transforms:
32
+ # Scale magnitudes logarithmically
33
+ s = transform(waveform)
34
+ s = torch.log(s + 1e-9)
35
+ specs.append(s)
36
+ return torch.stack(specs, dim=1)
37
+
38
+
39
+ def extract_context(spec, center_frame, context=7):
40
+ # Context of +/- 70ms (7 frames)
41
+ channels, n_mels, total_time = spec.shape
42
+ start = center_frame - context
43
+ end = center_frame + context + 1
44
+
45
+ pad_left = max(0, -start)
46
+ pad_right = max(0, end - total_time)
47
+
48
+ if pad_left > 0 or pad_right > 0:
49
+ spec = torch.nn.functional.pad(spec, (pad_left, pad_right))
50
+ start += pad_left
51
+ end += pad_left
52
+
53
+ return spec[:, :, start:end]
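A quick shape check (not in the commit) for `MultiViewSpectrogram`: the three mel views share the same hop length, so they stack into a single (B, 3, 80, T) tensor, with T = num_samples // 160 + 1 because `center=True`.

```python
import torch
from exp.baseline1.utils import MultiViewSpectrogram

proc = MultiViewSpectrogram(sample_rate=16000, n_mels=80, hop_length=160)
wave = torch.randn(2, 16000)          # 1 s of audio, batch of 2
spec = proc(wave)
print(spec.shape)                      # torch.Size([2, 3, 80, 101])
```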
exp/baseline2/__init__.py ADDED
File without changes
exp/baseline2/data.py ADDED
@@ -0,0 +1,137 @@
1
+ import torch
2
+ from torch.utils.data import Dataset
3
+ import numpy as np
4
+ from tqdm import tqdm
5
+
6
+
7
+ class BeatTrackingDataset(Dataset):
8
+ def __init__(
9
+ self,
10
+ hf_dataset,
11
+ target_type="beats",
12
+ sample_rate=16000,
13
+ hop_length=160,
14
+ context_frames=50,
15
+ ):
16
+ """
17
+ Args:
18
+ hf_dataset: HuggingFace dataset object
19
+ target_type (str): "beats" or "downbeats". Determines which labels are treated as positive.
20
+ context_frames (int): Number of frames before and after the center frame.
21
+ Total frames = 2 * context_frames + 1.
22
+ Default 50 means 101 frames (~1s).
23
+ """
24
+ self.sr = sample_rate
25
+ self.hop_length = hop_length
26
+ self.target_type = target_type
27
+
28
+ self.context_frames = context_frames
29
+ # Context window size in samples
30
+ # We need enough samples for the center frame +/- context frames
31
+ # PLUS the window size of the largest FFT to compute the edges correctly.
32
+ # Largest window in MultiViewSpectrogram is 1488.
33
+ self.context_samples = (self.context_frames * 2 + 1) * hop_length + 1488
34
+
35
+ # Cache audio arrays in memory for fast access
36
+ self.audio_cache = []
37
+ self.indices = []
38
+ self._prepare_indices(hf_dataset)
39
+
40
+ def _prepare_indices(self, hf_dataset):
41
+ """
42
+ Prepares balanced indices and caches audio.
43
+ Uses the same "Fuzzier" training examples strategy as the baseline.
44
+ """
45
+ print(f"Preparing dataset indices for target: {self.target_type}...")
46
+
47
+ for i, item in tqdm(
48
+ enumerate(hf_dataset), total=len(hf_dataset), desc="Building indices"
49
+ ):
50
+ # Cache audio array (convert to numpy if tensor)
51
+ audio = item["audio"]["array"]
52
+ if hasattr(audio, "numpy"):
53
+ audio = audio.numpy()
54
+ self.audio_cache.append(audio)
55
+
56
+ # Calculate total frames available in audio
57
+ audio_len = len(audio)
58
+ n_frames = int(audio_len / self.hop_length)
59
+
60
+ # Select ground truth based on target_type
61
+ if self.target_type == "downbeats":
62
+ gt_times = item["downbeats"]
63
+ else:
64
+ gt_times = item["beats"]
65
+
66
+ # Convert to list if tensor
67
+ if hasattr(gt_times, "tolist"):
68
+ gt_times = gt_times.tolist()
69
+
70
+ gt_frames = set([int(t * self.sr / self.hop_length) for t in gt_times])
71
+
72
+ # --- Positive Examples (with Fuzziness) ---
73
+ pos_frames = set()
74
+ for bf in gt_frames:
75
+ if 0 <= bf < n_frames:
76
+ self.indices.append((i, bf, 1.0)) # Center frame
77
+ pos_frames.add(bf)
78
+
79
+ # Neighbors weighted at 0.25
80
+ if 0 <= bf - 1 < n_frames:
81
+ self.indices.append((i, bf - 1, 0.25))
82
+ pos_frames.add(bf - 1)
83
+ if 0 <= bf + 1 < n_frames:
84
+ self.indices.append((i, bf + 1, 0.25))
85
+ pos_frames.add(bf + 1)
86
+
87
+ # --- Negative Examples ---
88
+ # Balance 2:1
89
+ num_pos = len(pos_frames)
90
+ num_neg = num_pos * 2
91
+
92
+ count = 0
93
+ attempts = 0
94
+ while count < num_neg and attempts < num_neg * 5:
95
+ f = np.random.randint(0, n_frames)
96
+ if f not in pos_frames:
97
+ self.indices.append((i, f, 0.0))
98
+ count += 1
99
+ attempts += 1
100
+
101
+ print(
102
+ f"Dataset ready. {len(self.indices)} samples, {len(self.audio_cache)} tracks cached."
103
+ )
104
+
105
+ def __len__(self):
106
+ return len(self.indices)
107
+
108
+ def __getitem__(self, idx):
109
+ track_idx, frame_idx, label = self.indices[idx]
110
+
111
+ # Fast lookup from cache
112
+ audio = self.audio_cache[track_idx]
113
+ audio_len = len(audio)
114
+
115
+ # Calculate sample range for context window
116
+ center_sample = frame_idx * self.hop_length
117
+ half_context = self.context_samples // 2
118
+
119
+ # We want the window centered around center_sample
120
+ start = center_sample - half_context
121
+ end = center_sample + half_context
122
+
123
+ # Handle padding if needed
124
+ pad_left = max(0, -start)
125
+ pad_right = max(0, end - audio_len)
126
+
127
+ valid_start = max(0, start)
128
+ valid_end = min(audio_len, end)
129
+
130
+ # Extract audio chunk
131
+ chunk = audio[valid_start:valid_end]
132
+
133
+ if pad_left > 0 or pad_right > 0:
134
+ chunk = np.pad(chunk, (pad_left, pad_right), mode="constant")
135
+
136
+ waveform = torch.tensor(chunk, dtype=torch.float32)
137
+ return waveform, torch.tensor([label], dtype=torch.float32)
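For reference, the context-window arithmetic above works out as follows (a sketch, not in the commit): ±50 frames at 100 fps gives a 101-frame label window, padded by the largest STFT window (1488 samples) so the edge frames can still be computed.

```python
hop, context_frames, max_win = 160, 50, 1488
context_samples = (2 * context_frames + 1) * hop + max_win
print(context_samples, context_samples / 16000)   # 17648 samples, ~1.10 s per example
```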
exp/baseline2/eval.py ADDED
@@ -0,0 +1,324 @@
1
+ import torch
2
+ import numpy as np
3
+ from tqdm import tqdm
4
+ from scipy.signal import find_peaks
5
+ import argparse
6
+ import os
7
+
8
+ from .model import ResNet
9
+ from ..baseline1.utils import MultiViewSpectrogram
10
+ from ..data.load import ds
11
+ from ..data.eval import evaluate_all, format_results
12
+
13
+
14
+ def get_activation_function(model, waveform, device):
15
+ """
16
+ Computes probability curve over time.
17
+ """
18
+ processor = MultiViewSpectrogram().to(device)
19
+ waveform = waveform.unsqueeze(0).to(device)
20
+
21
+ with torch.no_grad():
22
+ spec = processor(waveform)
23
+
24
+ # Normalize
25
+ mean = spec.mean(dim=(2, 3), keepdim=True)
26
+ std = spec.std(dim=(2, 3), keepdim=True) + 1e-6
27
+ spec = (spec - mean) / std
28
+
29
+ # Batchify with sliding window
30
+ # Context frames = 50, so total window = 101.
31
+ # Pad time by 50 on each side.
32
+ spec = torch.nn.functional.pad(spec, (50, 50)) # Pad time
33
+ windows = spec.unfold(3, 101, 1) # (1, 3, 80, Time, 101)
34
+ windows = windows.permute(3, 0, 1, 2, 4).squeeze(1) # (Time, 3, 80, 101)
35
+
36
+ # Inference
37
+ activations = []
38
+ batch_size = 128 # Reduced batch size
39
+ for i in range(0, len(windows), batch_size):
40
+ batch = windows[i : i + batch_size]
41
+ out = model(batch)
42
+ activations.append(out.cpu().numpy())
43
+
44
+ return np.concatenate(activations).flatten()
45
+
46
+
47
+ def pick_peaks(activations, hop_length=160, sr=16000):
48
+ """
49
+ Smooth with Hamming window and report local maxima.
50
+ """
51
+ # Smoothing
52
+ window = np.hamming(5)
53
+ window /= window.sum()
54
+ smoothed = np.convolve(activations, window, mode="same")
55
+
56
+ # Peak Picking
57
+ peaks, _ = find_peaks(smoothed, height=0.5, distance=5)
58
+
59
+ timestamps = peaks * hop_length / sr
60
+ return timestamps.tolist()
61
+
62
+
63
+ def visualize_track(
64
+ audio: np.ndarray,
65
+ sr: int,
66
+ pred_beats: list[float],
67
+ pred_downbeats: list[float],
68
+ gt_beats: list[float],
69
+ gt_downbeats: list[float],
70
+ output_dir: str,
71
+ track_idx: int,
72
+ time_range: tuple[float, float] | None = None,
73
+ ):
74
+ """
75
+ Create and save visualizations for a single track.
76
+ """
77
+ from ..data.viz import plot_waveform_with_beats, save_figure
78
+
79
+ os.makedirs(output_dir, exist_ok=True)
80
+
81
+ # Full waveform plot
82
+ fig = plot_waveform_with_beats(
83
+ audio,
84
+ sr,
85
+ pred_beats,
86
+ gt_beats,
87
+ pred_downbeats,
88
+ gt_downbeats,
89
+ title=f"Track {track_idx}: Beat Comparison",
90
+ time_range=time_range,
91
+ )
92
+ save_figure(fig, os.path.join(output_dir, f"track_{track_idx:03d}.png"))
93
+
94
+
95
+ def synthesize_audio(
96
+ audio: np.ndarray,
97
+ sr: int,
98
+ pred_beats: list[float],
99
+ pred_downbeats: list[float],
100
+ gt_beats: list[float],
101
+ gt_downbeats: list[float],
102
+ output_dir: str,
103
+ track_idx: int,
104
+ click_volume: float = 0.5,
105
+ ):
106
+ """
107
+ Create and save audio files with click tracks for a single track.
108
+ """
109
+ from ..data.audio import create_comparison_audio, save_audio
110
+
111
+ os.makedirs(output_dir, exist_ok=True)
112
+
113
+ # Create comparison audio
114
+ audio_pred, audio_gt, audio_both = create_comparison_audio(
115
+ audio,
116
+ pred_beats,
117
+ pred_downbeats,
118
+ gt_beats,
119
+ gt_downbeats,
120
+ sr=sr,
121
+ click_volume=click_volume,
122
+ )
123
+
124
+ # Save audio files
125
+ save_audio(
126
+ audio_pred, os.path.join(output_dir, f"track_{track_idx:03d}_pred.wav"), sr
127
+ )
128
+ save_audio(audio_gt, os.path.join(output_dir, f"track_{track_idx:03d}_gt.wav"), sr)
129
+ save_audio(
130
+ audio_both, os.path.join(output_dir, f"track_{track_idx:03d}_both.wav"), sr
131
+ )
132
+
133
+
134
+ def main():
135
+ parser = argparse.ArgumentParser(
136
+ description="Evaluate beat tracking models with visualization and audio synthesis"
137
+ )
138
+ parser.add_argument(
139
+ "--model-dir",
140
+ type=str,
141
+ default="outputs/baseline2",
142
+ help="Base directory containing trained models (with 'beats' and 'downbeats' subdirs)",
143
+ )
144
+ parser.add_argument(
145
+ "--num-samples",
146
+ type=int,
147
+ default=116,
148
+ help="Number of samples to evaluate",
149
+ )
150
+ parser.add_argument(
151
+ "--output-dir",
152
+ type=str,
153
+ default="outputs/eval_baseline2",
154
+ help="Directory to save visualizations and audio",
155
+ )
156
+ parser.add_argument(
157
+ "--visualize",
158
+ action="store_true",
159
+ help="Generate visualization plots for each track",
160
+ )
161
+ parser.add_argument(
162
+ "--synthesize",
163
+ action="store_true",
164
+ help="Generate audio files with click tracks",
165
+ )
166
+ parser.add_argument(
167
+ "--viz-tracks",
168
+ type=int,
169
+ default=5,
170
+ help="Number of tracks to visualize/synthesize (default: 5)",
171
+ )
172
+ parser.add_argument(
173
+ "--time-range",
174
+ type=float,
175
+ nargs=2,
176
+ default=None,
177
+ metavar=("START", "END"),
178
+ help="Time range for visualization in seconds (default: full track)",
179
+ )
180
+ parser.add_argument(
181
+ "--click-volume",
182
+ type=float,
183
+ default=0.5,
184
+ help="Volume of click sounds relative to audio (0.0 to 1.0)",
185
+ )
186
+ parser.add_argument(
187
+ "--summary-plot",
188
+ action="store_true",
189
+ help="Generate summary evaluation plot",
190
+ )
191
+ args = parser.parse_args()
192
+
193
+ DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
194
+
195
+ # Load BOTH models using from_pretrained
196
+ beat_model = None
197
+ downbeat_model = None
198
+
199
+ has_beats = False
200
+ has_downbeats = False
201
+
202
+ beats_dir = os.path.join(args.model_dir, "beats")
203
+ downbeats_dir = os.path.join(args.model_dir, "downbeats")
204
+
205
+ if os.path.exists(os.path.join(beats_dir, "model.safetensors")):
206
+ beat_model = ResNet.from_pretrained(beats_dir).to(DEVICE)
207
+ beat_model.eval()
208
+ has_beats = True
209
+ print(f"Loaded Beat Model from {beats_dir}")
210
+ else:
211
+ print(f"Warning: No beat model found in {beats_dir}")
212
+
213
+ if os.path.exists(os.path.join(downbeats_dir, "model.safetensors")):
214
+ downbeat_model = ResNet.from_pretrained(downbeats_dir).to(DEVICE)
215
+ downbeat_model.eval()
216
+ has_downbeats = True
217
+ print(f"Loaded Downbeat Model from {downbeats_dir}")
218
+ else:
219
+ print(f"Warning: No downbeat model found in {downbeats_dir}")
220
+
221
+ if not has_beats and not has_downbeats:
222
+ print("No models found. Please run training first.")
223
+ return
224
+
225
+ predictions = []
226
+ ground_truths = []
227
+ audio_data = [] # Store audio for visualization/synthesis
228
+
229
+ # Eval on specified number of tracks
230
+ test_set = ds["train"].select(range(args.num_samples))
231
+
232
+ print("Running evaluation...")
233
+ for i, item in enumerate(tqdm(test_set)):
234
+ waveform = torch.tensor(item["audio"]["array"], dtype=torch.float32)
235
+ waveform_device = waveform.to(DEVICE)
236
+
237
+ pred_entry = {"beats": [], "downbeats": []}
238
+
239
+ # 1. Predict Beats
240
+ if has_beats:
241
+ act_b = get_activation_function(beat_model, waveform_device, DEVICE)
242
+ pred_entry["beats"] = pick_peaks(act_b)
243
+
244
+ # 2. Predict Downbeats
245
+ if has_downbeats:
246
+ act_d = get_activation_function(downbeat_model, waveform_device, DEVICE)
247
+ pred_entry["downbeats"] = pick_peaks(act_d)
248
+
249
+ predictions.append(pred_entry)
250
+ ground_truths.append({"beats": item["beats"], "downbeats": item["downbeats"]})
251
+
252
+ # Store audio for later visualization/synthesis
253
+ if args.visualize or args.synthesize:
254
+ if i < args.viz_tracks:
255
+ audio_data.append(
256
+ {
257
+ "audio": waveform.numpy(),
258
+ "sr": item["audio"]["sampling_rate"],
259
+ "pred": pred_entry,
260
+ "gt": ground_truths[-1],
261
+ }
262
+ )
263
+
264
+ # Run evaluation
265
+ results = evaluate_all(predictions, ground_truths)
266
+ print(format_results(results))
267
+
268
+ # Create output directory
269
+ if args.visualize or args.synthesize or args.summary_plot:
270
+ os.makedirs(args.output_dir, exist_ok=True)
271
+
272
+ # Generate visualizations
273
+ if args.visualize:
274
+ print(f"\nGenerating visualizations for {len(audio_data)} tracks...")
275
+ viz_dir = os.path.join(args.output_dir, "plots")
276
+ for i, data in enumerate(tqdm(audio_data, desc="Visualizing")):
277
+ time_range = tuple(args.time_range) if args.time_range else None
278
+ visualize_track(
279
+ data["audio"],
280
+ data["sr"],
281
+ data["pred"]["beats"],
282
+ data["pred"]["downbeats"],
283
+ data["gt"]["beats"],
284
+ data["gt"]["downbeats"],
285
+ viz_dir,
286
+ i,
287
+ time_range=time_range,
288
+ )
289
+ print(f"Saved visualizations to {viz_dir}")
290
+
291
+ # Generate audio with clicks
292
+ if args.synthesize:
293
+ print(f"\nSynthesizing audio for {len(audio_data)} tracks...")
294
+ audio_dir = os.path.join(args.output_dir, "audio")
295
+ for i, data in enumerate(tqdm(audio_data, desc="Synthesizing")):
296
+ synthesize_audio(
297
+ data["audio"],
298
+ data["sr"],
299
+ data["pred"]["beats"],
300
+ data["pred"]["downbeats"],
301
+ data["gt"]["beats"],
302
+ data["gt"]["downbeats"],
303
+ audio_dir,
304
+ i,
305
+ click_volume=args.click_volume,
306
+ )
307
+ print(f"Saved audio files to {audio_dir}")
308
+ print(" *_pred.wav - Original audio with predicted beat clicks")
309
+ print(" *_gt.wav - Original audio with ground truth beat clicks")
310
+ print(" *_both.wav - Original audio with both predicted and GT clicks")
311
+
312
+ # Generate summary plot
313
+ if args.summary_plot:
314
+ from ..data.viz import plot_evaluation_summary, save_figure
315
+
316
+ print("\nGenerating summary plot...")
317
+ fig = plot_evaluation_summary(results, title="Beat Tracking Evaluation Summary")
318
+ summary_path = os.path.join(args.output_dir, "evaluation_summary.png")
319
+ save_figure(fig, summary_path)
320
+ print(f"Saved summary plot to {summary_path}")
321
+
322
+
323
+ if __name__ == "__main__":
324
+ main()
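A small sketch (not in the commit) of the `unfold`-based sliding-window batching used in `get_activation_function` above; the 500-frame spectrogram is synthetic.

```python
# Pad the time axis by 50 frames on each side, then unfold 101-frame windows
# so every original frame gets one centred window.
import torch

spec = torch.randn(1, 3, 80, 500)                   # (B, C, mels, T) after normalization
padded = torch.nn.functional.pad(spec, (50, 50))    # pad last (time) dimension
windows = padded.unfold(3, 101, 1)                  # (1, 3, 80, 500, 101)
windows = windows.permute(3, 0, 1, 2, 4).squeeze(1)
print(windows.shape)                                 # torch.Size([500, 3, 80, 101])
```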
exp/baseline2/model.py ADDED
@@ -0,0 +1,139 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ from huggingface_hub import PyTorchModelHubMixin
4
+
5
+
6
+ class SEBlock(nn.Module):
7
+ def __init__(self, channels, reduction=16):
8
+ super().__init__()
9
+ self.avg_pool = nn.AdaptiveAvgPool2d(1)
10
+ self.fc = nn.Sequential(
11
+ nn.Linear(channels, channels // reduction, bias=False),
12
+ nn.ReLU(inplace=True),
13
+ nn.Linear(channels // reduction, channels, bias=False),
14
+ nn.Sigmoid(),
15
+ )
16
+
17
+ def forward(self, x):
18
+ b, c, _, _ = x.size()
19
+ y = self.avg_pool(x).view(b, c)
20
+ y = self.fc(y).view(b, c, 1, 1)
21
+ return x * y.expand_as(x)
22
+
23
+
24
+ class ResBlock(nn.Module):
25
+ def __init__(self, in_channels, out_channels, stride=1, downsample=None):
26
+ super().__init__()
27
+ self.conv1 = nn.Conv2d(
28
+ in_channels,
29
+ out_channels,
30
+ kernel_size=3,
31
+ stride=stride,
32
+ padding=1,
33
+ bias=False,
34
+ )
35
+ self.bn1 = nn.BatchNorm2d(out_channels)
36
+ self.relu = nn.ReLU(inplace=True)
37
+ self.conv2 = nn.Conv2d(
38
+ out_channels, out_channels, kernel_size=3, padding=1, bias=False
39
+ )
40
+ self.bn2 = nn.BatchNorm2d(out_channels)
41
+ self.se = SEBlock(out_channels)
42
+ self.downsample = downsample
43
+
44
+ def forward(self, x):
45
+ identity = x
46
+ if self.downsample is not None:
47
+ identity = self.downsample(x)
48
+
49
+ out = self.conv1(x)
50
+ out = self.bn1(out)
51
+ out = self.relu(out)
52
+
53
+ out = self.conv2(out)
54
+ out = self.bn2(out)
55
+ out = self.se(out)
56
+
57
+ out += identity
58
+ out = self.relu(out)
59
+ return out
60
+
61
+
62
+ class ResNet(nn.Module, PyTorchModelHubMixin):
63
+ def __init__(
64
+ self, layers=[2, 2, 2, 2], channels=[16, 24, 48, 96], dropout_rate=0.5
65
+ ):
66
+ super().__init__()
67
+ self.in_channels = 16
68
+
69
+ # Stem
70
+ self.conv1 = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1, bias=False)
71
+ self.bn1 = nn.BatchNorm2d(16)
72
+ self.relu = nn.ReLU(inplace=True)
73
+
74
+ # Stages
75
+ self.layer1 = self._make_layer(channels[0], layers[0], stride=1)
76
+ self.layer2 = self._make_layer(channels[1], layers[1], stride=2)
77
+ self.layer3 = self._make_layer(channels[2], layers[2], stride=2)
78
+ self.layer4 = self._make_layer(channels[3], layers[3], stride=2)
79
+
80
+ self.dropout = nn.Dropout(p=dropout_rate)
81
+
82
+ # Final classification head
83
+ # H, W will reduce. Assuming input is (3, 80, 101)
84
+ # L1: (16, 80, 101) (stride 1)
85
+ # L2: (32, 40, 51) (stride 2)
86
+ # L3: (64, 20, 26) (stride 2)
87
+ # L4: (128, 10, 13) (stride 2)
88
+
89
+ self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
90
+ self.fc = nn.Linear(channels[3], 1)
91
+ self.sigmoid = nn.Sigmoid()
92
+
93
+ def _make_layer(self, out_channels, blocks, stride=1):
94
+ downsample = None
95
+ if stride != 1 or self.in_channels != out_channels:
96
+ downsample = nn.Sequential(
97
+ nn.Conv2d(
98
+ self.in_channels,
99
+ out_channels,
100
+ kernel_size=1,
101
+ stride=stride,
102
+ bias=False,
103
+ ),
104
+ nn.BatchNorm2d(out_channels),
105
+ )
106
+
107
+ layers = []
108
+ layers.append(ResBlock(self.in_channels, out_channels, stride, downsample))
109
+ self.in_channels = out_channels
110
+ for _ in range(1, blocks):
111
+ layers.append(ResBlock(self.in_channels, out_channels))
112
+
113
+ return nn.Sequential(*layers)
114
+
115
+ def forward(self, x):
116
+ # x: (B, 3, 80, 101)
117
+ x = self.conv1(x)
118
+ x = self.bn1(x)
119
+ x = self.relu(x)
120
+
121
+ x = self.layer1(x)
122
+ x = self.layer2(x)
123
+ x = self.layer3(x)
124
+ x = self.layer4(x)
125
+
126
+ x = self.avgpool(x) # (B, 128, 1, 1)
127
+ x = torch.flatten(x, 1) # (B, 128)
128
+ x = self.dropout(x)
129
+ x = self.fc(x)
130
+ x = self.sigmoid(x)
131
+
132
+ return x
133
+
134
+
135
+ if __name__ == "__main__":
136
+ from torchinfo import summary
137
+
138
+ model = ResNet()
139
+ summary(model, (1, 3, 80, 101))
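A minimal check (not in the commit) that the `SEBlock` above is shape-preserving: it pools each channel to a scalar, bottlenecks through `channels // reduction` units, and rescales the input feature map channel-wise.

```python
import torch
from exp.baseline2.model import SEBlock

x = torch.randn(2, 48, 20, 26)             # (B, C, H, W), e.g. a stage-3 feature map
y = SEBlock(channels=48, reduction=16)(x)  # per-channel gates in (0, 1)
print(y.shape)                              # torch.Size([2, 48, 20, 26])
```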
exp/baseline2/train.py ADDED
@@ -0,0 +1,215 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ import torch.optim as optim
4
+ from torch.utils.data import DataLoader
5
+ from torch.utils.tensorboard import SummaryWriter
6
+ from tqdm import tqdm
7
+ import argparse
8
+ import os
9
+
10
+ from .model import ResNet
11
+ from .data import BeatTrackingDataset
12
+ from ..baseline1.utils import MultiViewSpectrogram
13
+ from ..data.load import ds
14
+
15
+
16
+ def train(target_type: str, output_dir: str):
17
+ DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
18
+ BATCH_SIZE = 128 # Reduced batch size due to larger context
19
+ EPOCHS = 3
20
+ LR = 0.001 # Adjusted LR for Adam (ResNet usually prefers Adam/AdamW)
21
+ NUM_WORKERS = 4
22
+ CONTEXT_FRAMES = 50 # +/- 50 frames -> 101 frames total
23
+ PATIENCE = 5 # Early stopping patience
24
+
25
+ print(f"--- Training Model for target: {target_type} ---")
26
+ print(f"Output directory: {output_dir}")
27
+
28
+ # Create output directory
29
+ os.makedirs(output_dir, exist_ok=True)
30
+
31
+ # TensorBoard writer
32
+ writer = SummaryWriter(log_dir=os.path.join(output_dir, "logs"))
33
+
34
+ # Data
35
+ train_dataset = BeatTrackingDataset(
36
+ ds["train"], target_type=target_type, context_frames=CONTEXT_FRAMES
37
+ )
38
+ val_dataset = BeatTrackingDataset(
39
+ ds["test"], target_type=target_type, context_frames=CONTEXT_FRAMES
40
+ )
41
+
42
+ train_loader = DataLoader(
43
+ train_dataset,
44
+ batch_size=BATCH_SIZE,
45
+ shuffle=True,
46
+ num_workers=NUM_WORKERS,
47
+ pin_memory=True,
48
+ prefetch_factor=4,
49
+ persistent_workers=True,
50
+ )
51
+ val_loader = DataLoader(
52
+ val_dataset,
53
+ batch_size=BATCH_SIZE,
54
+ shuffle=False,
55
+ num_workers=NUM_WORKERS,
56
+ pin_memory=True,
57
+ prefetch_factor=4,
58
+ persistent_workers=True,
59
+ )
60
+
61
+ print(f"Train samples: {len(train_dataset)}, Val samples: {len(val_dataset)}")
62
+
63
+ # Model
64
+ model = ResNet(dropout_rate=0.5).to(DEVICE)
65
+
66
+ # GPU Spectrogram Preprocessor
67
+ preprocessor = MultiViewSpectrogram(sample_rate=16000, hop_length=160).to(DEVICE)
68
+
69
+ # Optimizer - Using AdamW for ResNet
70
+ optimizer = optim.AdamW(model.parameters(), lr=LR, weight_decay=1e-4)
71
+ scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)
72
+ criterion = nn.BCELoss() # Binary Cross Entropy
73
+
74
+ best_val_loss = float("inf")
75
+ patience_counter = 0
76
+ global_step = 0
77
+
78
+ for epoch in range(EPOCHS):
79
+ # Training
80
+ model.train()
81
+ total_train_loss = 0
82
+ for waveform, y in tqdm(
83
+ train_loader,
84
+ desc=f"[{target_type}] Epoch {epoch + 1}/{EPOCHS} Train",
85
+ leave=False,
86
+ ):
87
+ waveform, y = waveform.to(DEVICE), y.to(DEVICE)
88
+
89
+ # Compute spectrogram on GPU
90
+ with torch.no_grad():
91
+ spec = preprocessor(waveform) # (B, 3, 80, T_raw)
92
+ # Normalize
93
+ mean = spec.mean(dim=(2, 3), keepdim=True)
94
+ std = spec.std(dim=(2, 3), keepdim=True) + 1e-6
95
+ spec = (spec - mean) / std
96
+
97
+ T_curr = spec.shape[-1]
98
+ target_T = CONTEXT_FRAMES * 2 + 1
99
+
100
+ if T_curr > target_T:
101
+ start = (T_curr - target_T) // 2
102
+ x = spec[:, :, :, start : start + target_T]
103
+ elif T_curr < target_T:
104
+ # This shouldn't happen if dataset is correct, but just in case pad
105
+ pad = target_T - T_curr
106
+ x = torch.nn.functional.pad(spec, (0, pad))
107
+ else:
108
+ x = spec
109
+
110
+ optimizer.zero_grad()
111
+ output = model(x)
112
+ loss = criterion(output, y)
113
+ loss.backward()
114
+ optimizer.step()
115
+
116
+ total_train_loss += loss.item()
117
+ global_step += 1
118
+
119
+ # Log batch loss
120
+ writer.add_scalar("train/batch_loss", loss.item(), global_step)
121
+
122
+ avg_train_loss = total_train_loss / len(train_loader)
123
+
124
+ # Validation
125
+ model.eval()
126
+ total_val_loss = 0
127
+ with torch.no_grad():
128
+ for waveform, y in tqdm(
129
+ val_loader,
130
+ desc=f"[{target_type}] Epoch {epoch + 1}/{EPOCHS} Val",
131
+ leave=False,
132
+ ):
133
+ waveform, y = waveform.to(DEVICE), y.to(DEVICE)
134
+
135
+ # Compute spectrogram on GPU
136
+ spec = preprocessor(waveform) # (B, 3, 80, T)
137
+ # Normalize
138
+ mean = spec.mean(dim=(2, 3), keepdim=True)
139
+ std = spec.std(dim=(2, 3), keepdim=True) + 1e-6
140
+ spec = (spec - mean) / std
141
+
142
+ T_curr = spec.shape[-1]
143
+ target_T = CONTEXT_FRAMES * 2 + 1
144
+
145
+ if T_curr > target_T:
146
+ start = (T_curr - target_T) // 2
147
+ x = spec[:, :, :, start : start + target_T]
148
+ else:
149
+ pad = target_T - T_curr
150
+ x = torch.nn.functional.pad(spec, (0, pad))
151
+
152
+ output = model(x)
153
+ loss = criterion(output, y)
154
+ total_val_loss += loss.item()
155
+
156
+ avg_val_loss = total_val_loss / len(val_loader)
157
+
158
+ # Log epoch metrics
159
+ writer.add_scalar("train/epoch_loss", avg_train_loss, epoch)
160
+ writer.add_scalar("val/loss", avg_val_loss, epoch)
161
+ writer.add_scalar("train/learning_rate", scheduler.get_last_lr()[0], epoch)
162
+
163
+ # Step the scheduler
164
+ scheduler.step()
165
+
166
+ print(
167
+ f"[{target_type}] Epoch {epoch + 1}/{EPOCHS} - "
168
+ f"Train Loss: {avg_train_loss:.4f}, Val Loss: {avg_val_loss:.4f}"
169
+ )
170
+
171
+ # Save best model
172
+ if avg_val_loss < best_val_loss:
173
+ best_val_loss = avg_val_loss
174
+ patience_counter = 0
175
+ model.save_pretrained(output_dir)
176
+ print(f" -> Saved best model (val_loss: {best_val_loss:.4f})")
177
+ else:
178
+ patience_counter += 1
179
+ print(f" -> No improvement (patience: {patience_counter}/{PATIENCE})")
180
+
181
+ if patience_counter >= PATIENCE:
182
+ print("Early stopping triggered.")
183
+ break
184
+
185
+ writer.close()
186
+
187
+ # Save final model
188
+ final_dir = os.path.join(output_dir, "final")
189
+ model.save_pretrained(final_dir)
190
+ print(f"Saved final model to {final_dir}")
191
+
192
+
193
+ if __name__ == "__main__":
194
+ parser = argparse.ArgumentParser()
195
+ parser.add_argument(
196
+ "--target",
197
+ type=str,
198
+ choices=["beats", "downbeats"],
199
+ default=None,
200
+ help="Train a model for 'beats' or 'downbeats'. If not specified, trains both.",
201
+ )
202
+ parser.add_argument(
203
+ "--output-dir",
204
+ type=str,
205
+ default="outputs/baseline2",
206
+ help="Directory to save model and logs",
207
+ )
208
+ args = parser.parse_args()
209
+
210
+ # Determine which targets to train
211
+ targets = [args.target] if args.target else ["beats", "downbeats"]
212
+
213
+ for target in targets:
214
+ output_dir = os.path.join(args.output_dir, target)
215
+ train(target, output_dir)
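The spectrogram normalization and the center-crop/pad to `CONTEXT_FRAMES * 2 + 1` frames appear twice above, once in the training loop and once in validation. A possible refactor, sketched below and not part of this commit, is to pull that logic into a shared helper; `prepare_window` is a hypothetical name.

```python
import torch
import torch.nn.functional as F


def prepare_window(spec: torch.Tensor, context_frames: int = 50) -> torch.Tensor:
    """Normalize a (B, 3, 80, T) spectrogram batch and center-crop or
    right-pad it to 2 * context_frames + 1 frames, as train.py does inline."""
    mean = spec.mean(dim=(2, 3), keepdim=True)
    std = spec.std(dim=(2, 3), keepdim=True) + 1e-6
    spec = (spec - mean) / std

    target_t = context_frames * 2 + 1
    t_curr = spec.shape[-1]
    if t_curr > target_t:
        start = (t_curr - target_t) // 2
        return spec[:, :, :, start : start + target_t]
    if t_curr < target_t:
        return F.pad(spec, (0, target_t - t_curr))
    return spec
```

Both loops could then reduce to `x = prepare_window(preprocessor(waveform), CONTEXT_FRAMES)` under the appropriate gradient context.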
outputs/baseline1/beats/README.md ADDED
@@ -0,0 +1,10 @@
+---
+tags:
+- model_hub_mixin
+- pytorch_model_hub_mixin
+---
+
+This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
+- Code: [More Information Needed]
+- Paper: [More Information Needed]
+- Docs: [More Information Needed]
outputs/baseline1/beats/config.json ADDED
@@ -0,0 +1,3 @@
+{
+  "dropout_rate": 0.5
+}
outputs/baseline1/beats/final/README.md ADDED
@@ -0,0 +1,10 @@
+---
+tags:
+- model_hub_mixin
+- pytorch_model_hub_mixin
+---
+
+This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
+- Code: [More Information Needed]
+- Paper: [More Information Needed]
+- Docs: [More Information Needed]
outputs/baseline1/beats/final/config.json ADDED
@@ -0,0 +1,3 @@
+{
+  "dropout_rate": 0.5
+}
outputs/baseline1/beats/final/model.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:f0ee01ee41360f0b486e16d6022f896a19f9ead901be0180bdbd9cad2a3b8597
+size 1159372
outputs/baseline1/beats/logs/events.out.tfevents.1766351314.msiit232.1284330.0 ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:7b2d91a22ba01091bf072f5a5e8f12fc7d49801d6538914c973ccb2700978934
+size 17749022
outputs/baseline1/beats/model.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:1e7a0d5178bc5dfeee6da26345e7956aeb6bf64a21be7e541db4bcc37b290249
+size 1159372
outputs/baseline1/downbeats/README.md ADDED
@@ -0,0 +1,10 @@
+---
+tags:
+- model_hub_mixin
+- pytorch_model_hub_mixin
+---
+
+This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
+- Code: [More Information Needed]
+- Paper: [More Information Needed]
+- Docs: [More Information Needed]
outputs/baseline1/downbeats/config.json ADDED
@@ -0,0 +1,3 @@
+{
+  "dropout_rate": 0.5
+}
outputs/baseline1/downbeats/final/README.md ADDED
@@ -0,0 +1,10 @@
+---
+tags:
+- model_hub_mixin
+- pytorch_model_hub_mixin
+---
+
+This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
+- Code: [More Information Needed]
+- Paper: [More Information Needed]
+- Docs: [More Information Needed]
outputs/baseline1/downbeats/final/config.json ADDED
@@ -0,0 +1,3 @@
+{
+  "dropout_rate": 0.5
+}
outputs/baseline1/downbeats/final/model.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:870e3425ffd366be9a0e8fafcda62fa28b2c25917c8354570edc53a67e132d38
+size 1159372
outputs/baseline1/downbeats/logs/events.out.tfevents.1766353075.msiit232.1284330.1 ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:8744916b2c1a8255cd5379e6956a4ad2acbf8bcc1fcfaed21ca11285a771550c
+size 4272622
outputs/baseline1/downbeats/model.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:8895be0bff1c3210f46b04c490596490fe03081728e17fffb33c80369b472134
+size 1159372
outputs/baseline2/beats/README.md ADDED
@@ -0,0 +1,10 @@
+---
+tags:
+- model_hub_mixin
+- pytorch_model_hub_mixin
+---
+
+This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
+- Code: [More Information Needed]
+- Paper: [More Information Needed]
+- Docs: [More Information Needed]
outputs/baseline2/beats/config.json ADDED
@@ -0,0 +1,15 @@
+{
+  "channels": [
+    16,
+    24,
+    48,
+    96
+  ],
+  "dropout_rate": 0.5,
+  "layers": [
+    2,
+    2,
+    2,
+    2
+  ]
+}
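The fields in this `config.json` mirror the keyword arguments of `ResNet.__init__` above; `PyTorchModelHubMixin` records them at `save_pretrained` time so the architecture can be rebuilt without re-specifying hyperparameters. A minimal sketch follows, assuming the repository root is the working directory; note that this only reconstructs the architecture, whereas `ResNet.from_pretrained("outputs/baseline2/beats")` would also load the weights from `model.safetensors`.

```python
import json

from exp.baseline2.model import ResNet

# Rebuild the beat model with the exact hyperparameters stored by save_pretrained.
with open("outputs/baseline2/beats/config.json") as f:
    cfg = json.load(f)

model = ResNet(**cfg)  # channels=[16, 24, 48, 96], layers=[2, 2, 2, 2], dropout_rate=0.5
```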
outputs/baseline2/beats/final/README.md ADDED
@@ -0,0 +1,10 @@
+---
+tags:
+- model_hub_mixin
+- pytorch_model_hub_mixin
+---
+
+This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
+- Code: [More Information Needed]
+- Paper: [More Information Needed]
+- Docs: [More Information Needed]
outputs/baseline2/beats/final/config.json ADDED
@@ -0,0 +1,15 @@
+{
+  "channels": [
+    16,
+    24,
+    48,
+    96
+  ],
+  "dropout_rate": 0.5,
+  "layers": [
+    2,
+    2,
+    2,
+    2
+  ]
+}
outputs/baseline2/beats/final/model.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e60f98fe8152c2238d2bf7b2bc800a13aaf8a3cb665c3bfaa8e7dbc656362212
+size 1629940
outputs/baseline2/beats/logs/events.out.tfevents.1766356346.msiit232.1356098.0 ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e0b974a684478fa38672c299c1a441df5c051abc37e30e6f06d26502378c7c1d
+size 4245699
outputs/baseline2/beats/model.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e60f98fe8152c2238d2bf7b2bc800a13aaf8a3cb665c3bfaa8e7dbc656362212
+size 1629940
outputs/baseline2/downbeats/README.md ADDED
@@ -0,0 +1,10 @@
+---
+tags:
+- model_hub_mixin
+- pytorch_model_hub_mixin
+---
+
+This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
+- Code: [More Information Needed]
+- Paper: [More Information Needed]
+- Docs: [More Information Needed]
outputs/baseline2/downbeats/config.json ADDED
@@ -0,0 +1,15 @@
+{
+  "channels": [
+    16,
+    24,
+    48,
+    96
+  ],
+  "dropout_rate": 0.5,
+  "layers": [
+    2,
+    2,
+    2,
+    2
+  ]
+}
outputs/baseline2/downbeats/final/README.md ADDED
@@ -0,0 +1,10 @@
+---
+tags:
+- model_hub_mixin
+- pytorch_model_hub_mixin
+---
+
+This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
+- Code: [More Information Needed]
+- Paper: [More Information Needed]
+- Docs: [More Information Needed]
outputs/baseline2/downbeats/final/config.json ADDED
@@ -0,0 +1,15 @@
+{
+  "channels": [
+    16,
+    24,
+    48,
+    96
+  ],
+  "dropout_rate": 0.5,
+  "layers": [
+    2,
+    2,
+    2,
+    2
+  ]
+}
outputs/baseline2/downbeats/final/model.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:72c3a009b2e5d067d69d53755d96cee23a6305de3aa9b8336c3b817a0f0f8e77
+size 1629940
outputs/baseline2/downbeats/logs/events.out.tfevents.1766359276.msiit232.1356098.1 ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:bc03f9d0a5e525864dd519cd176d0ed71520315a89b9853403a72afac4e77921
+size 1011363
outputs/baseline2/downbeats/model.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:72c3a009b2e5d067d69d53755d96cee23a6305de3aa9b8336c3b817a0f0f8e77
+size 1629940