Microsoft open-sources a real-time interactive world model to improve agents' ability to handle dynamic, complex environments

MineWorld, the latest world model from Microsoft Research, performs well in dynamic interactive environments.
Core content:
1. MineWorld is built around the game Minecraft, which serves as a testbed for evaluating an agent's capabilities
2. In benchmark tests the model clearly outperforms Oasis, demonstrating efficient dynamic interaction
3. A detailed explanation of the MineWorld architecture, including the Transformer decoder and the visual and action tokenizers
Microsoft Research has open-sourced MineWorld, a real-time interactive world model.
MineWorld is built around a Transformer and uses the popular sandbox game Minecraft as its environment. Games are among the best settings for evaluating and training agents' perception, decision-making, prediction, and overall ability to cope with dynamic, complex environments.
According to test data, MineWorld outperforms the well-known world model Oasis in several respects. In terms of video quality, the 300-million-parameter MineWorld achieves an FVD of 246, lower (better) than Oasis's 377, and an SSIM of 0.38, higher than Oasis's 0.36.
In terms of controllability, MineWorld's 300-million- and 700-million-parameter models reach an F1 score of 0.70 and the 1.2-billion-parameter model reaches 0.73, both well above Oasis's 0.41; the camera-control L1 loss is also lower. In terms of inference speed, MineWorld generates 5.91 frames per second, more than twice Oasis's 2.58.
Open source address: https://github.com/microsoft/MineWorld
MineWorld Architecture
The architecture of MineWorld consists of four main parts: a Transformer decoder, a visual tokenizer, an action tokenizer, and a parallel decoding algorithm.
The Transformer decoder is MineWorld's core module. It generates subsequent game scenes from the input token sequence. The researchers built it on the LLaMA architecture.
During training, visual tokens and action tokens are interleaved into one long sequence, and the Transformer decoder is trained on it autoregressively: at each step, the model predicts the next token from all previous tokens. This lets the model learn both how one game state follows from another and how actions relate to states.
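The interleaving and next-token objective described above can be illustrated with a small sketch. The function names, the toy token IDs, and the exact ordering (each frame followed by its action) are assumptions made for illustration; they are not taken from the MineWorld code.

```python
from typing import List, Tuple


def interleave(frames: List[List[int]], actions: List[List[int]]) -> List[int]:
    """Flatten [frame_0, action_0, frame_1, action_1, ...] into one token sequence."""
    if len(frames) != len(actions):
        raise ValueError("expected one action token group per frame")
    seq: List[int] = []
    for frame_tokens, action_tokens in zip(frames, actions):
        seq.extend(frame_tokens)
        seq.extend(action_tokens)
    return seq


def next_token_pairs(seq: List[int]) -> List[Tuple[List[int], int]]:
    """Build (context, target) pairs: the model predicts seq[i] from seq[:i]."""
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]


# Toy example: two frames of two visual tokens each, one action token per frame.
frames = [[10, 11], [12, 13]]
actions = [[900], [901]]
seq = interleave(frames, actions)   # [10, 11, 900, 12, 13, 901]
pairs = next_token_pairs(seq)
print(pairs[2])                     # ([10, 11, 900], 12)
```

In this toy sequence, the first token of the second frame (12) is predicted from the first frame and its action, which is the state-action conditioning the paragraph describes.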
In the inference phase, the Transformer decoder generates subsequent game scenes from the current game state and the input actions. In addition, because the model sees both action and state tokens during training, it has the potential to serve as a policy model, that is, to predict reasonable actions from the current state.
The role of the visual tokenizer is to convert the image data of a game scene into discrete tokens. The researchers took a pre-trained VQ-VAE model and fine-tuned it on the Minecraft dataset to adapt it to the specific characteristics of the game's scenes.
Each frame is downscaled from the original 360×640 to 224×384 and then divided into a 14×24 grid of image patches, each of which corresponds to one discrete token.
As a result, each game scene is represented as a token sequence of length 336 (14 × 24). This compression greatly reduces the amount of computation while retaining the main features of the image, providing an efficient data representation for model training.
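The token count follows directly from the numbers quoted above. A minimal sketch of the arithmetic, where the 16-pixel patch size is inferred from 224 / 14 = 384 / 24 = 16 rather than stated in the article, and `token_grid` is a hypothetical helper name:

```python
from typing import Tuple


def token_grid(height: int, width: int, patch: int) -> Tuple[int, int]:
    """Return (rows, cols) of the token grid for a frame split into square patches."""
    if height % patch or width % patch:
        raise ValueError("frame size must be a multiple of the patch size")
    return height // patch, width // patch


# 224x384 frame, 16-pixel patches (patch size inferred from the article's numbers).
rows, cols = token_grid(224, 384, 16)
tokens_per_frame = rows * cols
pixels_per_token = (224 * 384) // tokens_per_frame

print(rows, cols)          # 14 24
print(tokens_per_frame)    # 336
print(pixels_per_token)    # 256
```

Each token therefore stands in for a 16×16 block of 256 pixels, which is where the reduction in computation comes from.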
The role of the action tokenizer is to convert the player's operations (such as key presses and mouse movements) into discrete tokens. In Minecraft, the player's actions include continuous mouse movements and discrete keyboard operations.
To handle these different types of actions, the researchers used two methods:
1. Discretization of continuous actions. The camera rotation controlled by the mouse is quantized into discrete angle tokens: the rotation angles about the X-axis and the Y-axis are each divided into 11 intervals, and each interval corresponds to one discrete token.
2. Classification of discrete actions. Keyboard operations are divided into 7 categories based on mutual exclusivity (for example, moving forward and moving backward cannot happen at the same time), and each category corresponds to a unique token.
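The two methods can be sketched as follows. Only the figure of 11 intervals per axis comes from the article. The ±90 degree clipping range, the uniform bin widths, the group labels, and the handling of contradictory key presses are all assumptions made for illustration.

```python
NUM_BINS = 11       # from the article: 11 intervals per rotation axis
MAX_ANGLE = 90.0    # hypothetical clipping range in degrees, not from the article


def camera_bin(angle: float, max_angle: float = MAX_ANGLE, num_bins: int = NUM_BINS) -> int:
    """Map a rotation angle to a bin index in [0, num_bins - 1] using uniform intervals."""
    clipped = max(-max_angle, min(max_angle, angle))
    frac = (clipped + max_angle) / (2 * max_angle)   # position in [0.0, 1.0]
    return min(int(frac * num_bins), num_bins - 1)   # keep the upper edge in the last bin


# One mutually exclusive keyboard group (hypothetical labels).
MOVE_GROUP = ("none", "forward", "back")


def move_class(forward: bool, back: bool) -> int:
    """Return the class index for the forward/back group."""
    if forward and back:
        # Contradictory input; mapping it to "none" is an assumption of this sketch.
        return MOVE_GROUP.index("none")
    if forward:
        return MOVE_GROUP.index("forward")
    if back:
        return MOVE_GROUP.index("back")
    return MOVE_GROUP.index("none")


print(camera_bin(0.0))           # 5 (the middle of 11 bins)
print(camera_bin(250.0))         # 10 (clipped to the last bin)
print(move_class(True, False))   # 1
```

Grouping mutually exclusive keys means each group needs only one token per step, instead of one token per key.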
To achieve efficient real-time interaction, the researchers proposed a novel parallel decoding algorithm. Traditional autoregressive decoding predicts one token at a time when generating images or videos. This preserves generation quality, but it is inefficient for high-resolution images or long videos.
To speed up decoding, MineWorld's parallel decoding algorithm exploits the spatial redundancy between neighboring image tokens: once a token has been generated, the tokens in the adjacent row and column are predicted at the same time. The gain in generation efficiency is especially large for high-resolution images.
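One simple schedule consistent with this description is a diagonal wavefront, in which the token at row r and column c is decoded at step r + c, so its left and upper neighbors are already available. This is an illustrative assumption; the exact schedule used by MineWorld may differ, and `wavefront_schedule` is a hypothetical name.

```python
from typing import List, Tuple


def wavefront_schedule(rows: int, cols: int) -> List[List[Tuple[int, int]]]:
    """Group grid positions into steps; all positions in one step are decoded together."""
    steps: List[List[Tuple[int, int]]] = [[] for _ in range(rows + cols - 1)]
    for r in range(rows):
        for c in range(cols):
            steps[r + c].append((r, c))   # anti-diagonal index decides the step
    return steps


schedule = wavefront_schedule(14, 24)    # token grid size from the article
sequential_steps = 14 * 24               # one token per step
parallel_steps = len(schedule)

print(sequential_steps)                  # 336
print(parallel_steps)                    # 37
print(schedule[1])                       # [(0, 1), (1, 0)]
print(max(len(s) for s in schedule))     # 14 tokens decoded in the widest step
```

Under this schedule a 14×24 frame needs 37 decoding steps instead of 336, at the cost of predicting up to 14 tokens per step without seeing one another.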
Benefits of MineWorld for Agents
In a complex environment, an agent faces a large amount of visual and behavioral information. By converting game scenes and actions into discrete tokens, MineWorld helps agents understand the state of the environment and their own behavior, and learn the physical rules of Minecraft, such as how objects interact and how the environment changes. As a result, outdoor environments, wood details, and explosion effects are rendered accurately when subsequent game states are generated, giving agents an accurate perception of the environment and a solid foundation for decision-making.
As a world model, MineWorld can predict future game states from past observations and current actions. An agent can use these predictions to evaluate the consequences of different actions and choose the best one. For example, in the game it can decide whether to move forward, move backward, or take another action based on the predicted state, in order to achieve its goal.
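The idea of evaluating candidate actions with a world model can be sketched as a one-step lookahead. Everything here is a toy stand-in: the state is a position on a line, `toy_predict` plays the role of the world model, and the scoring function is hypothetical. None of it reflects MineWorld's actual interface.

```python
from typing import Callable, Sequence, TypeVar

S = TypeVar("S")   # state type
A = TypeVar("A")   # action type


def choose_action(
    state: S,
    actions: Sequence[A],
    predict: Callable[[S, A], S],
    score: Callable[[S], float],
) -> A:
    """Return the action whose predicted next state has the highest score."""
    return max(actions, key=lambda action: score(predict(state, action)))


def toy_predict(pos: int, action: str) -> int:
    """Toy world model: the state is a position on a line (stand-in for a real model)."""
    return pos + {"forward": 1, "back": -1, "stay": 0}[action]


GOAL = 5
ACTIONS = ["forward", "back", "stay"]


def closeness(pos: int) -> float:
    """Higher is better: negative distance to the goal position."""
    return -abs(GOAL - pos)


print(choose_action(2, ACTIONS, toy_predict, closeness))   # forward
print(choose_action(7, ACTIONS, toy_predict, closeness))   # back
```

A real agent would replace `toy_predict` with rollouts from the world model and could look several steps ahead rather than one.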
MineWorld also uses the state-action relationship learned during training to help intelligent agents better understand the effects of actions, accurately execute decisions, and improve the success rate of actions.
When interacting with the environment, real-time performance is critical. MineWorld uses an innovative parallel decoding algorithm to generate 4-7 frames per second and quickly respond to player input. This allows the agent to obtain the latest environmental information in a timely manner and respond accordingly when interacting with players or other agents.
MineWorld can predict game states and actions at the same time, so agents can use it as a standalone game and play autonomously. Given an initial game state and action, the agent keeps playing by iteratively predicting future states and actions. Along the way it can refine its strategy and explore action paths suited to different scenarios and goals, which supports its use in complex game environments and similar settings.