Instructions to use Efficient-Large-Model/SANA1.5_1.6B_1024px_diffusers with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Sana
How to use Efficient-Large-Model/SANA1.5_1.6B_1024px_diffusers with Sana:
    # Load the model and infer image from text
    import torch
    from app.sana_pipeline import SanaPipeline
    from torchvision.utils import save_image

    sana = SanaPipeline("configs/sana_config/1024ms/Sana_1600M_img1024.yaml")
    sana.from_pretrained("hf://Efficient-Large-Model/SANA1.5_1.6B_1024px_diffusers")

    image = sana(
        prompt='a cyberpunk cat with a neon sign that says "Sana"',
        height=1024,
        width=1024,
        guidance_scale=5.0,
        pag_guidance_scale=2.0,
        num_inference_steps=18,
    )
- Diffusers
How to use Efficient-Large-Model/SANA1.5_1.6B_1024px_diffusers with Diffusers:
pip install -U diffusers transformers accelerate
    import torch
    from diffusers import DiffusionPipeline

    # switch to "mps" for apple devices
    pipe = DiffusionPipeline.from_pretrained(
        "Efficient-Large-Model/SANA1.5_1.6B_1024px_diffusers",
        dtype=torch.bfloat16,
        device_map="cuda",
    )

    prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
    image = pipe(prompt).images[0]
- Inference
- Notebooks
- Google Colab
- Kaggle
Update README.md
README.md
CHANGED
@@ -61,7 +61,7 @@ Source code is available at https://github.com/NVlabs/Sana.
 - **Model Description:** This is a model that can be used to generate and modify images based on text prompts.
 It is a Linear Diffusion Transformer that uses one fixed, pretrained text encoders ([Gemma2-2B-IT](https://huggingface.co/google/gemma-2-2b-it))
 and one 32x spatial-compressed latent feature encoder ([DC-AE](https://hanlab.mit.edu/projects/dc-ae)).
-- **Resources for more information:** Check out our [GitHub Repository](https://github.com/NVlabs/Sana) and the [
+- **Resources for more information:** Check out our [GitHub Repository](https://github.com/NVlabs/Sana) and the [SANA-1.5 report on arXiv](https://arxiv.org/abs/2501.18427).
 
 ### Model Sources
 
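
The model description in the diff mentions a 32x spatial-compressed latent feature encoder. As a rough illustration of what that factor implies for the 1024px setting, the sketch below computes the latent grid size for a given image size. The helper `latent_grid` is hypothetical and not part of Sana or Diffusers; it only does the arithmetic and assumes the image dimensions are exact multiples of the compression factor.

```python
def latent_grid(height_px: int, width_px: int, compression: int = 32) -> tuple[int, int]:
    """Return (latent_height, latent_width) for a given spatial compression factor.

    Hypothetical helper for illustration only; it is not an API of Sana or Diffusers.
    """
    if height_px % compression or width_px % compression:
        raise ValueError("image size must be a multiple of the compression factor")
    return height_px // compression, width_px // compression


# A 1024x1024 image with 32x spatial compression maps to a 32x32 latent grid.
latent_h, latent_w = latent_grid(1024, 1024)
print(latent_h, latent_w, latent_h * latent_w)
```

Under these assumptions, a 1024x1024 image corresponds to 32 x 32 = 1024 latent positions, compared with 128 x 128 = 16384 positions for an 8x autoencoder at the same resolution.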