
Multimodal Gemma-270M

A multimodal vision-language model that combines Google Gemma-3-270M with a CLIP vision encoder, trained on the full LLaVA-Instruct-150K dataset.

🎯 Model Inference Examples

Here are real inference results from our trained model:

🐱 Animal Detection

Example images: cats on a couch and a white cat sleeping, with the model's predictions.

🐕 Dog Recognition

Example image: a golden retriever in a park, with the model's prediction.

🏠 Room & Scene Understanding

Example images: two kitchen scenes (a modern kitchen and a clean kitchen).

🍕 Food & Objects

Example images: a food scene and an apple on a table.

🛹 Activity & People

Example images: a skate park and a family dining.

📊 Training Details

| Parameter | Value |
|---|---|
| Training Samples | 157,712 (full LLaVA-Instruct-150K dataset) |
| Epochs | 3 |
| Final Training Loss | 1.333 |
| Final Validation Loss | 1.430 |
| Total Parameters | 539M |
| Trainable Parameters | 18.6M (3.4%) |
| GPU | NVIDIA A100 40GB |
| Training Time | ~9 hours |
| Batch Size | 20 (effective: 40) |
| Precision | bf16-mixed |
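
As a rough guide, the precision and batch settings above map onto a PyTorch Lightning trainer as sketched below. The gradient-accumulation factor of 2 is an assumption made to explain the effective batch size of 40 (20 samples per step × 2 accumulation steps); the actual training script may differ.

```python
# Hypothetical Lightning trainer mirroring the table above.
# accumulate_grad_batches=2 is an assumption: 20 samples/step x 2 steps = 40 effective.
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,                   # single NVIDIA A100 40GB
    precision="bf16-mixed",      # reported training precision
    max_epochs=3,                # reported number of epochs
    accumulate_grad_batches=2,   # assumed, to reach the effective batch size of 40
)
# trainer.fit(model, train_dataloader, val_dataloader)
```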

📈 Benchmark Results

| Benchmark | Score |
|---|---|
| Basic VQA | 53.8% (7/13 correct) |
| POPE Hallucination | 20.0% |

VQA Breakdown

  • ✅ Animal identification (cats, dogs)
  • ✅ Room identification (kitchen, living room)
  • ✅ Object presence detection
  • ⚠️ Color identification (moderate)
  • ⚠️ Detailed attributes (needs improvement)
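
For context, POPE-style evaluation asks yes/no questions about object presence and scores the answers against ground truth. The loop below is a minimal sketch of such an evaluation, reusing the `model.generate(image, prompt)` call from the usage example further down; the sample format is assumed, and this is not the script that produced the scores above.

```python
# Hypothetical sketch of a POPE-style yes/no evaluation loop.
# The (image_path, question, expected) sample format is an assumption.
from PIL import Image

def evaluate_yes_no(model, samples):
    """samples: list of (image_path, question, expected) with expected in {'yes', 'no'}."""
    correct = 0
    for image_path, question, expected in samples:
        image = Image.open(image_path)
        response = model.generate(image, question).strip().lower()
        predicted = "yes" if response.startswith("yes") else "no"
        correct += int(predicted == expected)
    return correct / len(samples)

# Example: accuracy = evaluate_yes_no(model, [("cat.jpg", "Is there a cat in the image?", "yes")])
```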

πŸ—οΈ Architecture

| Component | Details |
|---|---|
| Language Model | Google Gemma-3-270M with LoRA adapters |
| Vision Encoder | OpenAI CLIP ViT-Large/14 (frozen, 428M params) |
| Vision Projector | MLP (3.4M params) |
| LoRA | r=16, alpha=32, dropout=0.1 |
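
The LoRA row in the table maps directly onto a PEFT configuration, and the vision projector is a small MLP that maps frozen CLIP ViT-L/14 features into the language model's embedding space. The sketch below illustrates both; the LoRA target modules and the projector widths are assumptions (the card only reports parameter counts), not values taken from the training code.

```python
# Illustrative sketch of the adapter and projector setup described above.
# target_modules and the projector widths are assumptions, not from the card.
import torch.nn as nn
from peft import LoraConfig

# LoRA settings as reported: r=16, alpha=32, dropout=0.1
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)

# Vision projector: MLP from the CLIP ViT-L/14 feature width (1024) into the
# language model's embedding space. Hidden and output widths are illustrative.
clip_width, hidden_width, lm_width = 1024, 2048, 640
projector = nn.Sequential(
    nn.Linear(clip_width, hidden_width),
    nn.GELU(),
    nn.Linear(hidden_width, lm_width),
)
```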

🚀 Usage

```python
from src.models.multimodal_gemma import MultimodalGemma
import torch
from PIL import Image

# Load model (`config` is the same model configuration used for training)
model = MultimodalGemma(config)
checkpoint = torch.load("final_model.ckpt", map_location="cpu")
model.load_state_dict(checkpoint["state_dict"])
model.eval()

# Inference on a single image
image = Image.open("your_image.jpg")
prompt = "What do you see in this image?"
response = model.generate(image, prompt)
print(response)
```

πŸ“ Files

| File | Size | Description |
|---|---|---|
| final_model.ckpt | 1.2GB | Full model checkpoint |
| inference_results/ | 13.8MB | Example predictions with images |
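
If you prefer not to clone the repository, the checkpoint can also be fetched with the Hugging Face Hub client; the snippet below is a sketch that assumes `final_model.ckpt` sits at the top level of `sagar007/multigemma`.

```python
# Sketch: download the checkpoint directly from the Hub.
# Assumes final_model.ckpt is stored at the repository root.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(repo_id="sagar007/multigemma", filename="final_model.ckpt")
print(ckpt_path)  # local cache path; pass it to torch.load in the usage example above
```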

📄 License

Apache 2.0

πŸ™ Acknowledgments

  • Google for Gemma models
  • OpenAI for CLIP
  • LLaVA team for multimodal architecture inspiration
  • PyTorch Lightning team