Top-tier open-source video generation on VBench 2.0 with 12–18× faster inference than strong open-source baselines. State-of-the-art video editing: #1 on OpenVE-Bench, FiVE-Bench, and Reco-Bench, surpassing Kling O1.
Mamoda2.5 employs a fine-grained Mixture-of-Experts (MoE) design with 128 routed experts and Top-8 routing, scaling the DiT backbone to 25B total parameters while activating only ~3B (~12% of total) per forward pass. Combined with the high-compression Wan2.2 VAE (4×16×16), Mamoda2.5 completes 720p 93-frame video generation in 110 seconds on a single device: over 12× faster than Wan2.2 A14B, 5× faster than HunyuanVideo 1.5, and 18× faster than LongCat Video. Ablation studies confirm a ~2.2× convergence speedup over dense models at matched activated parameter count.
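To make the parameter budget concrete, here is a minimal PyTorch sketch of fine-grained Top-8-of-128 routing. The class name, hidden sizes, SiLU activation, and renormalized-softmax gating are illustrative assumptions; the report does not specify the router, load-balancing loss, or any shared-expert configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FineGrainedMoE(nn.Module):
    """Sketch of a fine-grained MoE FFN: 128 routed experts, Top-8 per token.
    Router and expert details are assumptions, not the confirmed Mamoda2.5 design."""
    def __init__(self, d_model=2048, d_expert=512, n_experts=128, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Fine-grained = many narrow experts instead of a few wide ones.
        self.w_in = nn.Parameter(torch.randn(n_experts, d_model, d_expert) * 0.02)
        self.w_out = nn.Parameter(torch.randn(n_experts, d_expert, d_model) * 0.02)

    def forward(self, x):                          # x: (tokens, d_model)
        logits = self.router(x)                    # (tokens, n_experts)
        gates, idx = logits.topk(self.top_k, dim=-1)
        gates = F.softmax(gates, dim=-1)           # renormalize over the Top-8
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            e = idx[:, k]                          # chosen expert id per token
            h = F.silu(torch.einsum('td,tdf->tf', x, self.w_in[e]))
            out += gates[:, k:k+1] * torch.einsum('tf,tfd->td', h, self.w_out[e])
        return out
```

With Top-8 of 128, only 8/128 = 6.25% of routed-expert weights fire per token; attention and other dense layers then bring the activated total to roughly 3B of 25B (~12%). On the VAE side, 4×16×16 compression maps a 93-frame 1280×720 clip to a 24×45×80 latent grid before patchification (assuming a causal temporal scheme, (93−1)/4+1 = 24 latent frames), which is the token budget behind the 110-second generation time.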
Mamoda2.5 is a single AR-Diffusion framework that pairs Qwen3-VL-8B for multimodal understanding with an MoE DiT backbone for generation. One unified model supports text-to-image, text-to-video, image editing, and video editing, eliminating the need for separate task-specific models. The 30-step editing model achieves 12.8× faster inference than VInO; with the distilled 4-step model, editing latency drops to just 9.2 seconds, a 95.9× speedup over VInO and 41.7× over OmniVideo2.
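The unified design means one entry point serves all four tasks, differing only in conditioning and step count. Below is an interface sketch in Python; every method name (encode, init_latents, dit_step, vae_decode) and the Request fields are hypothetical, chosen to illustrate the flow rather than the actual Mamoda2.5 API.

```python
from dataclasses import dataclass
from typing import Any, Protocol, Sequence

class UnifiedModel(Protocol):
    """Interface sketch only; method names are hypothetical, not the real API."""
    def encode(self, prompt: str, reference: Any) -> Any: ...
    def init_latents(self, task: str) -> Any: ...
    def timesteps(self, steps: int) -> Sequence[float]: ...
    def dit_step(self, latents: Any, t: float, cond: Any) -> Any: ...
    def vae_decode(self, latents: Any) -> Any: ...

@dataclass
class Request:
    task: str                 # "t2i" | "t2v" | "image_edit" | "video_edit"
    prompt: str
    reference: Any = None     # source image/video for the editing tasks
    steps: int = 30           # 4 with the distilled editing model

def run(model: UnifiedModel, req: Request) -> Any:
    # One entry point for all four tasks: the understanding side encodes the
    # prompt plus optional reference, and the MoE DiT denoises task latents.
    cond = model.encode(req.prompt, req.reference)
    latents = model.init_latents(req.task)
    for t in model.timesteps(req.steps):
        latents = model.dit_step(latents, t, cond)
    return model.vae_decode(latents)
```

The quoted latency figures are mutually consistent under linear scaling in step count: at 9.2 s for 4 steps, a 95.9× speedup puts VInO near 9.2 × 95.9 ≈ 882 s, so the 30-step model's 12.8× advantage implies 882 / 12.8 ≈ 69 s, exactly the 30/4 step ratio times 9.2 s.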
@article{mamoda2.5,
  title={Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE},
  author={Mamoda Team},
  journal={arXiv preprint arXiv:2605.02641},
  year={2025}
}