# H100 Lightweight Training Configuration Guide

This guide explains the new **H100 Lightweight (Rapid)** training configuration, optimized for rapid fine-tuning on H100 GPUs with a small, carefully selected dataset.

## 🎯 Overview

The H100 Lightweight configuration is designed for:

- **Rapid experimentation** on H100 GPUs
- **Efficient training** with 80K carefully selected samples
- **Quick iteration** for research and development
- **Cost-effective** training runs

## 🚀 Key Features

### **Optimized for H100**
- **Batch Size**: 16 (larger than the A100 configs)
- **Gradient Accumulation**: 4 (reduced for faster updates)
- **Learning Rate**: 8e-6 (slightly higher for rapid convergence)
- **Sequence Length**: 8192 (full context window)

### **Dataset Sampling**
- **Source**: OpenHermes-FR dataset
- **Sample Size**: 80,000 random samples
- **Validation**: 1,000 samples (if available)
- **Reproducibility**: Fixed random seed (42); a sampling sketch appears after the comparison table below

### **Training Optimizations**
- **Warmup Steps**: 50 (reduced for rapid training)
- **Evaluation**: Every 50 steps
- **Logging**: Every 5 steps
- **Saving**: Every 200 steps
- **Checkpoints**: Keep only 2 (to save storage)

## 📊 Configuration Details

### **Model Configuration**
```python
model_name="HuggingFaceTB/SmolLM3-3B"
max_seq_length=8192
use_flash_attention=True
use_gradient_checkpointing=True
```

### **Training Parameters**
```python
batch_size=16
gradient_accumulation_steps=4
learning_rate=8e-6
warmup_steps=50
max_epochs=1
```

### **H100-Specific Optimizations**
```python
dataloader_num_workers=4
dataloader_pin_memory=True
gradient_clipping=1.0
group_by_length=True
pad_to_multiple_of=8
```

### **Memory Optimizations**
```python
save_total_limit=2
early_stopping_patience=3
max_grad_norm=1.0
warmup_ratio=0.1
```

## 🔧 Usage

### **Interactive Selection**
```bash
./launch.sh
# Select "H100 Lightweight (Rapid)" when prompted
```

### **Expected Training Time**
- **H100**: ~2-4 hours (depending on the exact setup)
- **A100**: ~4-6 hours
- **V100**: ~6-8 hours

### **Memory Requirements**
- **GPU Memory**: 40GB+ (H100 recommended)
- **System RAM**: 32GB+
- **Storage**: 50GB+ for dataset and checkpoints

## 📈 Performance Characteristics

### **Training Speed**
- **Steps per Second**: ~2-3 (on H100)
- **Samples per Second**: ~32-48
- **Effective Batch Size**: 64 (16 × 4)

### **Convergence**
- **Expected Loss**: 1.2-1.8 (after 1 epoch)
- **Evaluation Frequency**: Every 50 steps
- **Early Stopping**: After 3 evaluations without improvement

### **Dataset Efficiency**
- **80K samples**: ~1.3% of full OpenHermes-FR
- **Random sampling**: Ensures diversity
- **Fixed seed**: Reproducible results

## 🎯 Use Cases

### **Perfect For**
- **Rapid prototyping** of new ideas
- **Hyperparameter tuning** experiments
- **Model comparison** studies
- **Research validation** before full training
- **Educational purposes** and learning

### **Not Recommended For**
- **Production models** (use Multiple Passes instead)
- **Competition submissions** (use the full dataset)
- **Research papers** (use complete training)

## 🔄 Comparison with Other Configurations

| Configuration | Dataset Size | Batch Size | Epochs | Training Time | Use Case |
|---------------|--------------|------------|--------|---------------|----------|
| **Basic Training** | Full SmolTalk | 2 | 3 | 6-8 hours | Learning |
| **H100 Lightweight** | 80K Hermes-FR | 16 | 1 | 2-4 hours | Rapid experiments |
| **A100 Large Scale** | Full Hermes-FR | 8 | 1.3 | 8-12 hours | Serious research |
| **Multiple Passes** | Full Hermes-FR | 6 | 4 | 24-36 hours | Production |
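For reference, the sampling behaviour described under **Dataset Sampling** (80K random training samples plus 1K validation samples, fixed seed 42) boils down to a shuffle-and-slice over the `datasets` library. The snippet below is a minimal sketch of that idea, not the launch script's exact code; the constant names are illustrative.

```python
# Minimal sketch of the sampling strategy described above (seed 42,
# 80K train / 1K validation). Assumes the standard `datasets` API; the
# launch script's own implementation may differ in detail.
from datasets import load_dataset

SEED = 42
TRAIN_SAMPLES = 80_000
VAL_SAMPLES = 1_000

# Load the full French OpenHermes dataset (train split only).
full = load_dataset("legmlai/openhermes-fr", split="train")

# Shuffle once with a fixed seed so the subset is reproducible.
shuffled = full.shuffle(seed=SEED)

# Take disjoint slices for training and validation.
train_subset = shuffled.select(range(TRAIN_SAMPLES))
val_subset = shuffled.select(range(TRAIN_SAMPLES, TRAIN_SAMPLES + VAL_SAMPLES))

print(f"train: {len(train_subset)}, validation: {len(val_subset)}")
```

Because the shuffle is seeded, rerunning the script (or changing only the sample-size constants, as shown in the customization section below) always yields the same subset.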
## 🛠️ Customization

### **Modifying Sample Size**
```bash
# In the launch script, you can modify:
DATASET_SAMPLE_SIZE=50000   # For 50K samples
DATASET_SAMPLE_SIZE=100000  # For 100K samples
```

### **Adjusting Training Parameters**
```python
# Modify in config/train_smollm3_h100_lightweight.py:
batch_size=12        # Smaller batch size
learning_rate=6e-6   # Lower learning rate
warmup_steps=100     # More warmup steps
```

### **Changing Dataset**
```python
# Modify the dataset name in the configuration:
dataset_name="your-custom-dataset"
```

## 📊 Monitoring and Results

### **Trackio Integration**
- **Real-time metrics**: Loss, learning rate, gradient norm
- **Training curves**: Visual progress tracking
- **Resource usage**: GPU utilization, memory consumption
- **Artifacts**: Model checkpoints, logs

### **Expected Metrics**
- **Training Loss**: Starts ~3.0, ends ~1.5
- **Validation Loss**: Should stay close to the training loss
- **Learning Rate**: Cosine decay from 8e-6 to 2e-6 (a quick scheduler sketch appears after the example workflow)
- **Gradient Norm**: Should stay below 1.0

### **Success Indicators**
- **Converging loss**: Steady decrease over time
- **Stable gradients**: Consistent gradient norms
- **Good validation**: Validation loss follows training loss
- **No overfitting**: Validation loss doesn't increase

## 🚨 Troubleshooting

### **Common Issues**

#### **Out of Memory (OOM)**
```python
# Reduce the batch size in the config:
batch_size=12                   # Instead of 16
gradient_accumulation_steps=6   # Instead of 4
```

#### **Slow Training**
```bash
# Check GPU utilization:
nvidia-smi

# Ensure CUDA is properly installed:
python -c "import torch; print(torch.cuda.is_available())"
```

#### **Poor Convergence**
```python
# Try a different learning rate:
learning_rate=6e-6   # Instead of 8e-6

# Or increase warmup:
warmup_steps=100     # Instead of 50
```

#### **Dataset Issues**
```bash
# Check dataset loading:
python -c "from datasets import load_dataset; print(len(load_dataset('legmlai/openhermes-fr')['train']))"
```

### **Performance Tips**

1. **Use an H100 if available**: Significantly faster than an A100
2. **Monitor GPU memory**: Keep utilization below 90%
3. **Check logs regularly**: Look for convergence issues
4. **Save checkpoints**: Don't lose progress
5. **Use early stopping**: Prevents overfitting

## 📋 Example Workflow

### **Complete H100 Lightweight Training**
```bash
# 1. Setup
python setup_launch.py

# 2. Check requirements
python check_requirements.py

# 3. Run the interactive pipeline
./launch.sh

# 4. Select the configuration
# Choose: "H100 Lightweight (Rapid)"

# 5. Monitor training
# Watch the Trackio Space for real-time progress

# 6. Check results
# The model will be pushed to the HF Hub
# Summary in training_summary.md
```

### **Expected Output**
```
✅ Dataset prepared: 80000 train samples, 1000 validation samples
📈 Training started with 5000 total steps
⏱️ Estimated time: 2-4 hours
📊 Monitor progress at: https://huggingface.co/spaces/...
```
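The learning-rate behaviour listed under **Expected Metrics** (a short warmup followed by cosine decay from the 8e-6 peak) can be inspected offline without launching a run. The snippet below is a rough sketch using the generic `get_cosine_schedule_with_warmup` helper from `transformers`; the step count and the 2e-6 floor are taken from this guide, the stock scheduler decays toward zero, and the real training code may configure its scheduler differently.

```python
# Standalone sketch of the warmup + cosine learning-rate shape mentioned
# under "Expected Metrics". Uses the stock transformers scheduler, which
# decays toward zero; the actual trainer may apply a non-zero floor
# (e.g. ending near 2e-6), so treat this purely as an illustration.
import torch
from transformers import get_cosine_schedule_with_warmup

PEAK_LR = 8e-6
WARMUP_STEPS = 50
TOTAL_STEPS = 5000  # matches the step count shown in the expected output above

# Dummy parameter and optimizer purely to drive the scheduler.
dummy = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.AdamW([dummy], lr=PEAK_LR)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=WARMUP_STEPS,
    num_training_steps=TOTAL_STEPS,
)

# Print the learning rate at a few checkpoints to see the curve:
# linear ramp over the first 50 steps, then a cosine decay.
for step in range(1, TOTAL_STEPS + 1):
    optimizer.step()
    scheduler.step()
    if step in (1, WARMUP_STEPS, 1000, 2500, TOTAL_STEPS):
        print(f"step {step:>5}: lr = {scheduler.get_last_lr()[0]:.2e}")
```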
## 🎉 Benefits

### **Speed**
- **3-4x faster** than full-dataset training
- **Rapid iteration** for research
- **Quick validation** of ideas

### **Efficiency**
- **Reduced costs** (less GPU time)
- **Lower storage** requirements
- **Faster experimentation** cycle

### **Quality**
- **Still high-quality** results
- **Good for prototyping**
- **Suitable for many use cases**

## 🔮 Future Enhancements

### **Planned Improvements**
- **Adaptive sampling**: Smart dataset selection
- **Multi-GPU support**: Distributed training
- **Advanced monitoring**: More detailed metrics
- **Auto-tuning**: Automatic hyperparameter optimization

### **Extensibility**
- **Custom datasets**: Easy integration
- **Different models**: Support for other architectures
- **Advanced sampling**: Stratified, balanced sampling

---

**Happy Rapid Training on H100! 🚀**