Lumina-DiMOO: Diffusion Model for Multimodal Generation

Lumina-DiMOO is a unified discrete diffusion model for multimodal generation and understanding. It produces images from text, edits images under guidance, and interprets images by describing them and answering questions about them. A single backbone and a shared discrete token space keep the pipeline consistent across tasks.

Focus and scope

The model supports text-to-image generation, image-to-image editing, and image understanding. It aims for predictable behavior across prompts and inputs, with attention to layout, color, attributes, and relations between objects. A sampling schedule with fewer steps and a caching step help keep iteration times practical on standard hardware.

Design choices

Fully discrete diffusion for inputs and outputs.
Shared backbone for tasks to reduce complexity.
Sampling cache to avoid repeated work.
Controls for subject identity, color, placement, and attributes.
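
To make the idea of discrete diffusion with a short sampling schedule concrete, here is a minimal sketch of one common sampler for discrete token models: start from a fully masked sequence and commit the most confident predictions over a small number of steps. This is an illustration under assumptions, not Lumina-DiMOO's actual sampler. The mask-based formulation, the cosine schedule, and the names MASK, toy_denoiser, and sample are all hypothetical, and the random toy_denoiser stands in for the real backbone. The sampling cache is not modeled here.

```python
import math
import random

MASK = -1  # hypothetical sentinel id for a position that is still masked


def toy_denoiser(tokens, vocab_size, rng):
    """Stand-in for the model (assumption): for every masked position,
    return a guessed token id and a confidence score in [0, 1)."""
    return {
        i: (rng.randrange(vocab_size), rng.random())
        for i, t in enumerate(tokens)
        if t == MASK
    }


def sample(length, vocab_size, steps, seed=0):
    """Iteratively unmask a sequence of discrete tokens in `steps` rounds."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(steps):
        guesses = toy_denoiser(tokens, vocab_size, rng)
        if not guesses:
            break  # nothing left to unmask
        # Cosine schedule (assumption): number of positions that should
        # still be masked after this step; reaches 0 on the last step.
        keep_masked = int(length * math.cos(math.pi / 2 * (step + 1) / steps))
        n_commit = max(1, len(guesses) - keep_masked)
        # Commit the most confident guesses; the rest stay masked.
        ranked = sorted(guesses, key=lambda i: guesses[i][1], reverse=True)
        for i in ranked[:n_commit]:
            tokens[i] = guesses[i][0]
    return tokens


if __name__ == "__main__":
    # 16 tokens are filled in using only 4 denoising steps.
    print(sample(length=16, vocab_size=8, steps=4))
```

Because several positions are committed per step, the number of model calls is set by the step count rather than by the sequence length, which is the general reason a shorter schedule reduces iteration time.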

Who it is for

Lumina-DiMOO is aimed at practitioners who need consistent image generation, practical editing, and structured image descriptions. Teams can integrate it into content pipelines, research prototypes, and internal tools where reliability and speed matter.

This site summarizes public information about Lumina-DiMOO for educational purposes.
