Lumina Dimoo

An omni discrete diffusion model for multimodal generation and understanding. Built with a unified discrete diffusion architecture, Lumina Dimoo handles inputs and outputs across modalities with strong sampling efficiency and balanced quality.

Primary focus: text-to-image, image-to-image, and image understanding with practical performance on common tasks and benchmarks.

[Teaser image: Lumina Dimoo]

What is Lumina Dimoo?

Lumina Dimoo is a foundation model designed to work across multiple modalities using a fully discrete diffusion process. This means the same core approach is applied to producing images from text, editing existing images, and answering questions about images. A discrete formulation keeps the model’s interface consistent across tasks and allows straightforward sampling schedules.
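
To make the idea concrete, here is a minimal sketch of the masked-token denoising loop that discrete diffusion samplers typically follow: start from a fully masked sequence and commit the most confident predictions at each step. The toy denoiser, mask id, and schedule below are illustrative stand-ins, not the actual Lumina Dimoo components.

import torch

# Toy sketch of discrete (masked-token) diffusion sampling.
# The sequence starts fully masked; each step predicts every position
# and commits the most confident predictions, keeping the rest masked.
VOCAB, MASK, LENGTH, STEPS = 1024, -1, 32, 8

def toy_denoiser(tokens):
    # Stand-in for the real network: random logits per position.
    return torch.randn(tokens.shape[0], VOCAB)

tokens = torch.full((LENGTH,), MASK)
for step in range(STEPS):
    confidence, prediction = toy_denoiser(tokens).softmax(-1).max(-1)
    masked = tokens == MASK
    # Unmask a growing fraction of positions, highest confidence first.
    target_filled = int(LENGTH * (step + 1) / STEPS)
    k = target_filled - int((~masked).sum())
    if k > 0:
        candidates = torch.where(masked, confidence, torch.tensor(-1.0))
        chosen = candidates.topk(k).indices
        tokens[chosen] = prediction[chosen]
print(tokens)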

The model supports three practical families of tasks: text-to-image generation for prompts with simple or compound structure; image-to-image tasks such as editing, inpainting, and subject-driven changes; and image understanding tasks where the input is an image and the output is text. A single model supports these patterns, focusing on predictable behavior and clear controls.

Sampling is designed to be efficient. Compared with typical autoregressive or hybrid approaches, Lumina Dimoo aims to reduce the number of sampling steps while preserving fidelity and instruction following. A dedicated cache reuses intermediate states, cutting repeated computation and improving throughput.
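
The caching idea can be pictured with a small sketch: condition features (for example, the encoded prompt) do not change between denoising steps, so they can be computed once and reused. The cache class and encoder below are illustrative assumptions, not the model's actual mechanism.

# Illustrative sketch of reusing an intermediate state across sampling steps.
# Assumption: features of the fixed condition are identical at every step,
# so the expensive encoding runs once and later steps read the cached result.
class ConditionCache:
    def __init__(self):
        self._features = None

    def get(self, compute):
        if self._features is None:
            self._features = compute()  # expensive call happens only once
        return self._features

def encode_condition(prompt_tokens):
    # Stand-in for a heavy encoder forward pass.
    return [t * 2 for t in prompt_tokens]

cache = ConditionCache()
prompt_tokens = [3, 1, 4, 1, 5]
for step in range(8):
    cond = cache.get(lambda: encode_condition(prompt_tokens))
    # ...a real denoising step would combine `cond` with the current image tokens...
print(cond)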

Across public benchmarks, the model reaches strong scores for composition, counting, color consistency, and spatial relations, showing that a discrete diffusion backbone can deliver competitive quality on general tasks.

Key Ideas

Unified discrete diffusion

Inputs and outputs are tokenized and processed by a fully discrete diffusion pipeline. Using the same abstraction across tasks lowers the mental overhead for operators and makes it easier to bring new tasks into the same framework.
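
As a rough picture of what one abstraction across tasks looks like, the snippet below expresses three tasks as the same kind of token sequence with a masked output span. The task tags and layout are assumptions for illustration only, not the model's real tokenization.

# Illustrative only: three tasks expressed through one sequence pattern.
# "<mask>" marks the span the discrete diffusion process fills in.
def build_sequence(task, condition):
    # Hypothetical layout: [task tag] + [condition tokens] + [masked output span]
    return [f"<{task}>"] + condition + ["<mask>"] * 16

t2i  = build_sequence("text_to_image",  ["a", "red", "bicycle"])
edit = build_sequence("image_to_image", ["<img_1>", "<img_2>", "make", "it", "blue"])
vqa  = build_sequence("understanding",  ["<img_1>", "<img_2>", "what", "color", "?"])

for seq in (t2i, edit, vqa):
    print(seq[:6], "...")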

Multimodal coverage

Text-to-image for broad prompts and detailed scenes, image-to-image for editing and restoration, and image understanding for answers and descriptions. The scope is practical and covers frequent needs in content creation and analysis.

Sampling efficiency

A shorter schedule with a caching method to remove repeated work. This helps when running many prompts, batched edits, or interactive sessions where quick feedback is important.

Consistent results

Focus on instruction following, layout, counting, and attribute control. The model aims for steady behavior across prompt styles and input quality.

Practical capabilities

Text-to-image

  • Composed scenes with multiple subjects
  • Color and attribute control
  • Layout hints and relative placement
  • High-resolution output

Image-to-image

  • Editing and inpainting
  • Subject-driven generation
  • Background updates
  • Tone and color adjustments

Image understanding

  • Targeted Q&A
  • Captions and summaries
  • Attribute listing
  • Structure and layout descriptions

Efficiency and scale

  • Shorter sampling schedules
  • Cache for repeated states
  • Batch-friendly setup
  • Predictable memory use

How it fits day-to-day work

Text-to-image serves quick concept drafts, layout options, and reference scenes. It supports high-resolution output with prompts that spell out subjects, color, placement, and style. For longer prompts, the model aims to keep the main subject and its relations intact.
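
For a concrete feel of the drafting workflow, here is a placeholder-style outline in the spirit of the Installation script: a detailed prompt that spells out subjects, colors, and placement, followed by a few quick drafts. YourLuminaDimooPipeline and its arguments are assumptions, not the official API.

# Sketch only: the pipeline class and arguments are placeholders.
# pipe = YourLuminaDimooPipeline.from_pretrained("./models/lumina-dimoo").to("cuda")
prompt = (
    "three ceramic mugs on a wooden table, left mug navy blue, "
    "middle mug white with a green stripe, right mug matte black, "
    "soft morning light, shallow depth of field"
)
# for i in range(4):  # a handful of quick drafts to compare layouts
#     image = pipe(prompt, num_inference_steps=20).images[0]
#     image.save(f"draft_{i}.png")
print("Draft prompt ready:", len(prompt), "characters")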

Image-to-image supports common edits: change or extend backgrounds, fix color, swap subjects, or apply inpainting to fill missing parts. Subject-driven generation keeps the main subject recognizable after changes. For restoration, gentle updates help align tone and texture without strong artifacts.
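
A matching outline for an edit: load a source image, describe the change, and iterate upward from a gentle edit strength. The pipeline call and its strength argument follow the same placeholder convention and are assumptions, not a documented interface.

# Sketch only: pipeline, arguments, and file names are placeholders.
# from PIL import Image
# pipe = YourLuminaDimooPipeline.from_pretrained("./models/lumina-dimoo").to("cuda")
# source = Image.open("living_room.png")
instruction = "replace the curtains with light linen ones, keep everything else unchanged"
# Start gentle, then raise the edit strength only if the change is too subtle.
# for strength in (0.3, 0.5, 0.7):
#     edited = pipe(instruction, image=source, strength=strength,
#                   num_inference_steps=25).images[0]
#     edited.save(f"edited_{strength:.1f}.png")
print("Edit instruction:", instruction)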

Image understanding helps organize and describe images. It can answer targeted questions, summarize content, and list attributes. This is useful for search, tagging, and quality checks.
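
And an outline for understanding tasks, useful for tagging or quality checks: one image, several targeted questions. As above, the pipeline interface is a placeholder assumption.

# Sketch only: pipeline, return values, and file names are placeholders.
# pipe = YourLuminaDimooPipeline.from_pretrained("./models/lumina-dimoo").to("cuda")
# photo = Image.open("product_photo.png")
questions = [
    "How many items are visible?",
    "What are the dominant colors?",
    "Is there any visible text or logo?",
]
# for q in questions:
#     answer = pipe(q, image=photo)
#     print(q, "->", answer)
print("Prepared", len(questions), "questions")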

Pros and Cons

Pros

  • One model covers text-to-image, image-to-image, and image understanding
  • Efficient sampling with shorter schedules and state caching
  • Strong scores for composition, counting, color consistency, and spatial relations
  • Consistent interface and controls across tasks
  • Batch-friendly setup for many prompts or edits

Cons

  • Needs a GPU-ready setup for higher-resolution generation
  • Quality trades off against the number of sampling steps
  • Results can vary with prompt detail and input image quality

Getting started

You can set up Lumina Dimoo locally with common tools. A typical flow is to prepare a Python environment, install dependencies, obtain model files, and run a small script for text-to-image or image editing. For a step-by-step guide, see the Installation page in the navigation.

  • Create and activate a virtual environment
  • Install required packages
  • Download checkpoints
  • Run a quick script to verify output

Installation

This guide shows a simple way to run Lumina Dimoo locally for text-to-image and image-to-image tasks. Adjust steps to match your OS and hardware.

1. Prepare environment

  • Install Python 3.10 or later
  • Install Git
  • Have a GPU-ready setup if you plan to generate at higher resolution

2. Create a virtual environment

python3 -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate

3. Install dependencies

Install core libraries used for inference. Replace versions as needed for your system.

pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu121
pip install transformers diffusers accelerate safetensors pillow einops numpy
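
Before moving on, it can help to confirm that PyTorch imports correctly and can see the GPU; torch.cuda.is_available() is a standard PyTorch call.

import torch

# Confirms the PyTorch install and whether a CUDA-capable GPU is visible.
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())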

4. Obtain model files

Download the public checkpoints for Lumina Dimoo from your preferred source and place them in a directory, for example: models/lumina-dimoo/

  • Create a folder: models/lumina-dimoo
  • Put the weights and config files inside that folder
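
If the checkpoints are hosted on the Hugging Face Hub, huggingface_hub's snapshot_download can fetch them into that folder, assuming the huggingface_hub package is installed (pip install huggingface_hub). The repository id below is a placeholder to replace with the actual one.

from huggingface_hub import snapshot_download

# Downloads a full model repository into the local folder created above.
snapshot_download(
    repo_id="YOUR_ORG/lumina-dimoo",   # placeholder, not the actual repository id
    local_dir="models/lumina-dimoo",
)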

5. Run a quick test

The following is a placeholder Python script outline to confirm your environment works. Adapt it to your actual checkpoint paths.

import torch
from PIL import Image

# Placeholder: load your pipeline here
# pipe = YourLuminaDimooPipeline.from_pretrained("./models/lumina-dimoo").to("cuda")

prompt = "a calm lake with pine trees and soft sunlight"
# image = pipe(prompt, num_inference_steps=25).images[0]
# image.save("output.png")
print("Done")

6. Tips

  • Lower num_inference_steps for faster drafts; raise it for quality
  • Use a small negative prompt to avoid unwanted elements
  • For image-to-image, start with modest changes and iterate

Note: This page is a simple starter. Refer to official repositories and model cards for exact installation commands and options.

FAQs