Seedream 3.0 Text-to-Image Model Technical Report Released

Written by Iris Vance
Updated on: June 29, 2025

An in-depth analysis of ByteDance's Seedream 3.0 technical report, exploring the latest breakthroughs in image generation.

Core content:
1. Seedream 3.0 performance improvement highlights and technical innovations
2. Application scenarios for core features such as native 2K output and 3-second image generation
3. Performance that competes with the industry's top models

Yang Fangxian, founder of 53AI and Tencent Cloud Most Valuable Expert (TVP)
The ByteDance Seed team has officially released the Seedream 3.0 technical report. Seedream 3.0 is a native high-resolution, bilingual text-to-image model. Compared with Seedream 2.0, overall performance is greatly improved, especially in resolution, structural accuracy of generated images, count accuracy, multi-object attribute relationships, small-font generation and text layout, aesthetic quality, and realism.

Specific highlights are as follows:

  • Native 2K output, adaptable to scenes at many scales: 2K-resolution images can be output directly without post-processing, covering visual needs from phone screens to large-format posters;

  • 3-second generation greatly improves creative efficiency: for poster design, visual creativity, and similar needs, it produces high-quality images in about 3 seconds, enabling real-time, "what you think is what you get" creative interaction;

  • More accurate small fonts and stronger text layout: optimized for industry-wide challenges such as high-fidelity small-font generation and semantically coherent multi-line text layout, giving the model commercial-grade graphic design capability;

  • Improved aesthetics and structure produce more expressive images: instruction following is further enhanced, structural collapse of human bodies and objects is reduced, and the "AI look" of output images is weakened, moving the bar from "clearly rendered" to "emotionally compelling".

arXiv: https://arxiv.org/abs/2504.11346

Project page: https://team.doubao.com/tech/seedream3_0

Development of Seedream 3.0 began at the end of 2024. By surveying the real needs of designers and other user groups, the Seedream team targeted not only industry-consensus metrics such as text-image alignment, structure, and aesthetics, but also set industry-wide challenges as core goals: small-font generation and complex text layout, native 2K high-definition output, and fast image generation.
In April 2025, Seedream 3.0 was officially launched and is now fully available on platforms such as Doubao and Jimeng.
The subjective and objective evaluation results in dimensions such as structure, aesthetics, portrait, text usability, and user preference (Elo) show that the overall performance of Seedream 3.0 has been significantly improved compared to version 2.0, especially in long text rendering and realistic portrait generation.
Seedream 3.0's performance across different dimensions; the data in each dimension is normalized against the best score.
In the Artificial Analysis arena, Seedream 3.0 competes against text-to-image models such as GPT-4o, Imagen 3, Midjourney v6.1, FLUX 1.1 Pro, and Ideogram 3.0, and at one point ranked first on the recent leaderboard.
Artificial Analysis Ranking (as of the afternoon of April 15)

Notably, Seedream 3.0 excels at poster design and creative generation, meeting designers' day-to-day work needs.

This article introduces the technical approach behind Seedream 3.0 across four areas: data collection and processing, pre-training, post-training, and inference acceleration.

 1. Data optimization: Defect-aware dataset expansion and improved data distribution 
Large-scale, high-quality training data is essential for generative AI. Seedream 3.0 optimizes the data collection and preprocessing process in the following three aspects:
  • Defect-aware training strategy greatly increases the amount of usable data
To ensure training-data quality, Seedream 2.0 adopted a relatively conservative screening strategy that discarded many images with minor defects (watermarks, subtitles, mosaics, etc.). In Seedream 3.0, the team adopted a new defect-aware training strategy: self-developed detectors precisely locate the position and extent of defects, images with small defects are retained, and latent-space masks applied during training prevent defective regions from affecting the loss function. This design expanded the effective dataset by more than 20% while keeping model training stable.
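As a minimal sketch of the latent-space masking idea, the loss below zeroes out defect regions and renormalizes by the number of clean elements. The function name, tensor shapes, and normalization are illustrative assumptions; the report does not publish implementation details.

```python
import numpy as np

def masked_latent_loss(pred, target, valid_mask):
    """MSE in latent space that ignores regions flagged as defects.

    pred, target: [C, H, W] latent tensors (prediction vs. training target)
    valid_mask:   [H, W], 1.0 for clean regions, 0.0 over detected defects
    (shapes and normalization are assumptions, not from the report)
    """
    per_elem = (pred - target) ** 2
    masked = per_elem * valid_mask          # broadcasts over the channel axis
    # normalize by the number of *valid* elements so the defect area
    # does not silently shrink the loss scale
    denom = valid_mask.sum() * pred.shape[0]
    return masked.sum() / max(denom, 1e-8)
```

With this formulation, a large reconstruction error confined to a masked defect pixel contributes nothing to the loss, so slightly defective images can still be used for training.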
  • Visual-semantic collaborative sampling strategy effectively balances the data distribution
Conventional pipelines for building text-to-image datasets usually suffer from uneven data distribution. To address this, the team proposed a joint visual-semantic sampling strategy: on the visual side, hierarchical clustering ensures balance across different visual modes; on the semantic side, TF-IDF (term frequency-inverse document frequency) mitigates the long-tail distribution of text descriptions. Joint optimization on both sides greatly improves the balance of semantic concepts across visual patterns.
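The semantic side of this strategy can be sketched with a toy IDF-based resampler that upweights captions containing rare terms. The exact scoring used by the team is not disclosed, so this scheme is an assumption.

```python
import math
from collections import Counter

def idf_sampling_weights(captions):
    """Weight each caption by the mean inverse document frequency of its
    words, so captions describing rare (long-tail) concepts are sampled
    more often. Illustrative only; the report's scoring is not public."""
    docs = [set(c.lower().split()) for c in captions]
    df = Counter(w for d in docs for w in d)      # document frequency
    n = len(docs)
    idf = {w: math.log(n / df[w]) for w in df}
    scores = [sum(idf[w] for w in d) / max(len(d), 1) for d in docs]
    total = sum(scores)
    # fall back to uniform weights when all captions are equally common
    return [s / total for s in scores] if total > 0 else [1.0 / n] * n
```

A caption mentioning a rare concept receives a higher sampling weight than duplicated common captions, which counteracts the long-tail imbalance.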
  • Self-developed image-text retrieval system further improves the data distribution
The Seedream 3.0 team developed an image-text retrieval system that achieves leading performance on public benchmarks. Using it, they filtered and recalibrated the distribution of existing datasets, further improving training-data quality and laying the foundation for training large text-to-image models.

 2. Pre-training: Focus on multi-resolution generation and semantic alignment 
During pre-training, the team improved the model architecture and training strategy to achieve multilingual semantic understanding, more accurate text rendering, and high-quality multi-resolution image output:
  • Cross-modality rotary position embedding further enhances text rendering
To further strengthen text-image alignment, the team extended the Scaling RoPE of the previous version into Cross-modality RoPE. Most prior methods apply 2D RoPE to image features and 1D RoPE to text features, which hinders alignment between the two modalities. In Cross-modality RoPE, text features are treated as a two-dimensional feature of shape [1, L] and given 2D RoPE, with the text's starting column ID computed from the ending column ID of the image's 2D RoPE. This design better models both the relationships between features of different modalities and the relative positions within each modality, and is one of the key factors behind Seedream 3.0's stronger text rendering.
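A sketch of the position-id layout this describes (the rotary-angle computation itself is omitted): image tokens keep their natural grid coordinates, and text tokens form a [1, L] row whose columns continue from the image's last column. The function name and details are assumptions.

```python
import numpy as np

def cross_modal_2d_positions(img_h, img_w, text_len):
    """Assign shared 2D position ids to image and text tokens.

    Image tokens get their (row, col) grid positions; text tokens are a
    [1, text_len] grid whose starting column follows the image's ending
    column, so both modalities live in one 2D RoPE coordinate system.
    """
    img_pos = [(r, c) for r in range(img_h) for c in range(img_w)]
    start_col = img_w  # continue from the image's ending column id (img_w - 1)
    txt_pos = [(0, start_col + i) for i in range(text_len)]
    return np.array(img_pos), np.array(txt_pos)
```

Because text columns continue the image's column axis, the relative offset between any text token and any image token is well defined under the same 2D rotary scheme.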
  • Mixed multi-resolution training enables direct 2K output
The previous version used an additional refiner to generate high-resolution images, which added inference overhead. In Seedream 3.0, the team exploited the Transformer architecture's ability to handle variable-length input sequences and adopted a mixed multi-resolution training strategy. In the first pre-training stage, the model is trained on low-resolution images averaging 256×256; in the second stage, images of varying resolutions and aspect ratios are mixed, with average resolutions from 512×512 up to 2048×2048. To improve training efficiency, the team also designed a load-balancing strategy that keeps sequence lengths roughly equal across GPUs. The final model generates images at multiple resolutions and produces 2K output without any extra refiner.
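The load-balancing idea can be sketched as greedy longest-first bin packing, assigning each sequence to the currently lightest GPU. The team's actual method is not disclosed, so this is a stand-in.

```python
import heapq

def balance_sequences(seq_lens, n_gpus):
    """Greedy longest-processing-time assignment: each sequence goes to
    the GPU with the smallest current token count, keeping per-GPU
    sequence lengths roughly equal. Illustrative stand-in only."""
    heap = [(0, g) for g in range(n_gpus)]        # (total_tokens, gpu_id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_gpus)]
    # place longest sequences first for a tighter balance
    for i in sorted(range(len(seq_lens)), key=lambda i: -seq_lens[i]):
        total, g = heapq.heappop(heap)
        assignment[g].append(i)
        heapq.heappush(heap, (total + seq_lens[i], g))
    return assignment
```

With mixed resolutions, sequence lengths vary widely (a 2048×2048 image yields far more latent tokens than a 256×256 one), so this kind of balancing prevents some GPUs from idling while others process long sequences.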
  • Flow matching and representation-alignment losses efficiently model the data distribution
Unlike Seedream 2.0, which used the score-matching loss of denoising diffusion models, Seedream 3.0 uses a flow matching loss to predict the conditional velocity field. To adapt to the changing signal-to-noise ratio under mixed multi-resolution training, the team dynamically adjusts the timestep distribution during flow matching training according to the average resolution in each training stage. The team also uses a representation alignment loss (REPA) to help the model converge faster during pre-training, which also validates REPA on an industrial-scale text-to-image model.
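A minimal sketch of the flow matching setup, plus a logistic-style timestep shift of the kind commonly used to favor noisier steps at higher resolutions; the report does not give its exact shifting formula, so that part is an assumption.

```python
import numpy as np

def shifted_timesteps(u, shift):
    """Map uniform samples u in (0, 1) to shifted timesteps. shift = 1
    leaves the schedule unchanged; larger shifts push sampling toward
    higher-noise steps, a common choice for higher-resolution training
    (the report's exact formula is not public)."""
    return shift * u / (1.0 + (shift - 1.0) * u)

def flow_matching_target(x0, noise, t):
    """Linear interpolation path x_t = (1 - t) * x0 + t * noise; the
    regression target for the velocity field is d x_t / dt = noise - x0."""
    x_t = (1.0 - t) * x0 + t * noise
    v_target = noise - x0
    return x_t, v_target

def flow_matching_loss(v_pred, v_target):
    # simple MSE between predicted and target velocity
    return float(np.mean((v_pred - v_target) ** 2))
```

The network sees `x_t` (plus `t` and the text condition) and regresses onto `v_target`; a perfect prediction drives the loss to zero.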

 2. Post-training RLHF: further improving aesthetics and raising the model's ceiling 
In post-training, the team designed multiple versions of aesthetic captions for the CT and SFT stages, and scaled up the reward model in the RLHF stage to give it multi-dimensional quality discrimination, comprehensively improving the generation model.
  • Multi-granularity aesthetic description
Seedream 3.0 trains multiple versions of caption models for CT- and SFT-stage data. These caption models provide accurate descriptions in professional domains such as aesthetics, style, and typography, ensuring the model responds effectively to a wide range of prompts. The multi-granularity captions not only improve the model's controllability but also work together with prompt engineering (PE) to lift overall performance.
  • Reward model expansion
Unlike Seedream 2.0, which used CLIP as the reward model, Seedream 3.0 further upgrades the reward model and increases its parameter count, adopting a vision-language model (VLM) as the reward model. Drawing on generative reward modeling experience from LLMs, the team models rewards in a way that makes accuracy and robustness easier to improve through the inherent scaling properties of LLMs. The reward model was scaled from 0.8B to more than 20B parameters, and the team observed clear reward-model scaling behavior.
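A generative reward of this kind is often computed as the probability the VLM assigns to a "yes" token when asked whether the image is good. The sketch below assumes that recipe, which the report does not spell out.

```python
import numpy as np

def generative_reward(yes_logit, no_logit):
    """Turn a VLM's logits for 'yes' vs. 'no' (e.g. to the question
    'Is this image high quality?') into a scalar reward in [0, 1].
    A common generative-RM recipe; the exact prompt and formulation
    used by the team are assumptions here."""
    z = np.array([yes_logit, no_logit], dtype=float)
    z -= z.max()                          # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()       # softmax over the two tokens
    return float(p[0])                    # probability of 'yes' as reward
```

Because the reward is just a token probability from the VLM's own head, scaling the VLM directly scales the reward model, which is consistent with the scaling behavior described above.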

 4. Efficient inference: end-to-end 1K image generation in only 3 seconds 
Seedream 3.0 accelerates inference with multiple strategies. Beyond model quantization, an important acceleration axis for diffusion models is distilling down the number of sampling steps at inference time. Seedream 3.0 uses a self-developed inference acceleration algorithm with the following key components:
  • Consistent noise prediction improves the stability of the sampling process
In conventional diffusion models, the predicted noise varies substantially from one timestep to the next during sampling, and this instability is one reason so many sampling steps are needed. To address it, the team instead has the network predict the global noise expectation, which remains consistent throughout the sampling process, effectively compressing the total number of sampling steps.
  • Important-timestep sampling accelerates distillation training
To improve distillation efficiency, the team proposed an important-timestep sampling technique: a network is trained to predict, for each sample, the distribution of important sampling timesteps, and the optimal timesteps for distillation are derived from this distribution. With this technique, the distillation training process completes within 64 GPU-days.
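Once a few important timesteps are selected, a distilled flow-matching sampler reduces to a handful of Euler steps. This sketch is illustrative; the actual distilled sampler is not public.

```python
import numpy as np

def euler_sample(velocity_fn, x_start, timesteps):
    """Integrate a flow-matching velocity field from noise (t = 1) down
    to data (t = 0) in a few Euler steps, e.g. over distilled timesteps.
    velocity_fn(x, t) stands in for the trained network."""
    ts = list(timesteps) + [0.0]          # always finish at t = 0
    x = x_start
    for t, t_next in zip(ts[:-1], ts[1:]):
        x = x + (t_next - t) * velocity_fn(x, t)   # dx = v * dt (dt < 0)
    return x
```

For the linear interpolation path the true velocity is constant (noise minus data), so even a 3-step Euler integration from pure noise recovers the data point exactly; a real model only approximates this, which is what distillation compensates for.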
With these techniques, the team achieved near-lossless acceleration: efficient image generation with text-image alignment, aesthetic quality, and structural accuracy almost unchanged. End-to-end generation of a 1K-resolution image takes only 3 seconds.

 Final thoughts 

Since its release, the Seedream 3.0 model has earned recognition for its improvements in poster creation, generation efficiency, structure, and aesthetics.

Going forward, the Seedream team hopes to explore the following directions: more efficient architecture design, building better, cheaper, and faster text-to-image models; a higher level of model intelligence, expanding the model's world knowledge and enabling interleaved text-image generation; and scaling behavior across data, model size, and reward models, feeding that accumulated insight into the next generation of models.

In the future, the team will continue to share technical experience and work with the industry to promote the development of visual generation.