Does embodied intelligence need to start with ImageNet?

This article explores the development path and future prospects of embodied intelligence.
Core content:
1. The background behind the rise of embodied intelligence and why it matters
2. The relationship between human needs in the physical world and embodied intelligence
3. The failure modes and challenges embodied intelligence may face
Introduction
If your ambition lies in intelligence, then whether you work on embodied intelligence, large models, world models, or neuroscience, perhaps it is all the same thing.
I want to talk seriously about embodied intelligence. This article outlines the development of embodied intelligence along with some recent thoughts on "intelligence", though many of the judgments and speculations lack solid evidence; corrections and discussion are especially welcome. Readers interested in the technical issues of embodied intelligence can read the first half, while readers who care more about intelligence itself can skip to the final part, "Beyond Embodied Intelligence".
Why did embodied intelligence suddenly explode?
"Move bits, not atoms" is a piece of golden advice that scientists and entrepreneurs hear often. Bits flow at the speed of light, while moving a brick is slow and laborious. In the Internet era, e-commerce replaced physical stores and streaming replaced DVDs; in the AI era, ImageNet aggregated massive data from the Internet, and DeepSeek trained its models on electricity and bits. The latest large language models can write code and solve Olympiad math problems, approaching or even exceeding human intelligence. Yet embodied intelligence is booming anyway, because we have no choice but to care about it.
Why should we care about embodied intelligence? (Atoms still matter)
Humans themselves cannot be fully digitized. Neither video games nor the metaverse changes the fact that humans must live as physical beings, so the entropy-reducing work of the physical world always has to get done. When we talk about "impact", we really mean the extent to which human needs are met or changed, which explains why embodied intelligence is still so widely discussed.

If we accept that we must care about atoms, it becomes obvious that today's LLM-dominated agents are poorly aligned with the physical world. Lacking rich perceptual modalities and the ability to interact, their understanding of the physical world is often not good enough; a typical example is their weak grasp of spatial information. A large model's intelligence is first instilled and then refined by inference: it has read all the text in the world, even borrowed other people's eyes to see part of the world, and then used reinforcement learning to reason its way toward superintelligence. But most of that knowledge and data does not come from the model itself, so the model lacks closed-loop feedback; it cannot calibrate its own outputs and may never break out of the existing knowledge distribution.

It is worth adding that if we are serious about building silicon-based life, we should want these agents to have experiences of their own. Their unique sensors bring unique experiences; even when those experiences are shared as bits, they remain singular. A person who has lost the sense of touch can read and hear about "touch" but can never have the experience, while their hearing may grow exceptionally sharp. These experiences are what construct the "I".
Several failure modes of embodied intelligence
Embodied intelligence is a field with a clear goal but unclear paths. Unlike computer vision, which long ago settled on the three canonical tasks of classification, detection, and segmentation, embodied intelligence has many seemingly reasonable paths. I believe the following approaches will fail, and I would bet a KFC Crazy Thursday on it.
Pick the most interesting task and complete it at any cost
I do not mean to pour cold water on anyone, least of all roboticists. But a considerable share of traditional robotics research focuses on "special" robots or "special" tasks: a snake robot, a rat robot, a robot that makes dumplings, a robot that manipulates plastic bags or shakes out clothes. Each can be a paper, a best paper, even a paper in Science, because it is novel and unique, completes a hard task, applies a lot of control theory, or introduces a structural innovation. Valuable for science, yes, but of little use in advancing embodied intelligence.
This may be disappointing, but if we look back at deep learning and computer vision, the driving force came mainly from standardized datasets such as ImageNet and general models such as ResNet or the Transformer. You might say there are exceptions: what if "that task" is assembly-line sorting or parts polishing? Those are like speech-to-text or real-time translation. They may have economic value in the short term, enough to start a company, but in the long run they will be swept up by the progress of embodied intelligence as a whole. Think of it this way: if you were an expert in robotic cloth folding two years ago, you may feel deflated watching imitation learning fold clothes today; if two years ago you were injecting linguistic knowledge into a small model to summarize articles, you have surely become a "large model expert" by now, because your original job no longer exists.
Why not just create a virtual world and expect to solve all problems in the digital world?
People always hope the physical world can be fully converted into a digital one. Then, because everything is bits, we can scale the data quickly and replicate the success of large models.
Simulation is definitely useful, but a common failure mode is trying to build or replicate the target scene inside a physics engine as faithfully as possible. There are many problems here: a) Physics engines have inherent difficulty simulating things like fluids and soft bodies; you rarely see a simulated lump of plasticine with the same physical properties as a real plasticine product, even if it looks very much like one. b) There is always a trade-off between simulation speed and quality; there is no perfect answer to "fast is not accurate, accurate is not fast". c) Beyond the physics, even with the help of 3DGS, replicating a scene visually is extremely hard, especially for articulated objects, soft bodies, and low-texture or high-frequency-texture regions. You can pin your hopes on generative simulation or world models, but I still think expectations should stay modest; a world model may well be a harder problem than embodied intelligence itself.
Collect massive amounts of data and hope that existing algorithms, combined with enough data, will solve everything
Another approach is to compete on data: whoever holds more data trains the best model. The data problem has become perhaps the recognized core problem of embodied intelligence, and the accumulation of data is likely to be a decisive factor. But data is not simply a quantitative race. Even granted "sufficiently rich" data, I doubt sheer volume is a sufficient condition for success. Readers with some experience in imitation learning or VLA algorithms will often have seen the robot move correctly yet fail to complete the task, or even drift away from the object's position. We can partly blame the model for not being ideal at this kind of trajectory "recitation", but the deeper issue is that the data comes from humans. A simple analogy: a child learns to write with the teacher guiding their hand; once the teacher lets go, the child may still make real progress, but the characters are far from as beautiful as the ones written under the teacher's hand. It is easy to forget this distinction in data provenance, whether the data comes from the "model" or from "humans". My judgment here: a large amount of real data is surely necessary, but it cannot solve everything; it only provides the prior for the eventual solution.
Several decision points on the path of embodied intelligence
Embodied intelligence is simple to state: the robot receives sensory signals, makes a decision, and executes an action, after which the sensory signals update. As in autonomous driving, the modular approach climbs faster at first but has a lower ceiling, while the end-to-end approach requires accumulated data but may have a very high ceiling. Here we will, almost without thinking, discuss only the end-to-end approach: sensory signals (perhaps plus text describing the task or plan) go into a neural network, and actions come directly out.
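To make the end-to-end formulation concrete, here is a minimal sketch in PyTorch. Every name and dimension in it (the `EndToEndPolicy` class, the tiny CNN encoder, a 7-DoF action) is an illustrative assumption, not any published system.

```python
# A minimal sketch of the end-to-end loop: observation in, action out.
import torch
import torch.nn as nn

class EndToEndPolicy(nn.Module):
    """Maps raw observations (image + task embedding) to a single action."""
    def __init__(self, img_feat_dim=512, text_feat_dim=256, action_dim=7):
        super().__init__()
        # Vision encoder: any backbone works; a small CNN keeps the sketch self-contained.
        self.vision = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, img_feat_dim),
        )
        # Fuse image features with a pre-computed task embedding, output one action.
        self.head = nn.Sequential(
            nn.Linear(img_feat_dim + text_feat_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, image, task_embedding):
        feat = self.vision(image)                      # (B, img_feat_dim)
        x = torch.cat([feat, task_embedding], dim=-1)  # fuse vision and language
        return self.head(x)                            # (B, action_dim)

# One step of the closed loop described above: observe -> act -> observe again.
policy = EndToEndPolicy()
image = torch.randn(1, 3, 224, 224)   # camera frame
task = torch.randn(1, 256)            # embedding of the task instruction
action = policy(image, task)          # e.g. a 7-DoF end-effector command
```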
The dumbest question: Is the input visual signal two-dimensional or three-dimensional?
This sounds like a slightly silly question. If no image information is lost, a 3D signal strictly contains more information than a 2D one, so naturally we would choose 3D input. Yet to this day we still cannot commit to that choice.
Why is 2D image input still so popular, even mainstream? Try closing one eye for a few minutes. In theory we lose stereo 3D information, but in practice our daily tasks are barely affected, because the light, shadow, semantics, and geometric cues in a 2D image, combined with our prior knowledge of the world, are enough to complete a great many tasks. In other words, looking at a bottle in isolation, it may be hard to tell whether it is a small bottle nearby or a giant bottle statue far away; looking at the whole scene, we almost never judge wrongly. At the same time, 2D images are the easiest signal for everyday cameras to capture, so 2D image data enjoys an advantage of orders of magnitude.
So is the 3D signal still valuable? After pi0 [1] was released at CoRL last year, I discussed this with several friends: if image-based VLA can already perform well, do we still need 3D perception as input? After Columbia University released the diffusion policy [2], my Embodied Intelligence Laboratory (TEA Lab) at Tsinghua University developed the 3D diffusion policy (DP3) [3], which achieved significant performance gains, and our recently proposed H3DP [4] further improves imitation learning by exploiting depth maps. A preliminary conclusion: when data is scarce, 3D information does help model performance, which also suggests that future post-training may need 3D input. What happens when data is plentiful, we do not yet know.
Does this mean 3D cannot scale? Not really. There is a thread connecting 2D and 3D: monocular 3D reconstruction, as in the depth-anything [5] series. Whether the chain of massive video + motion data → monocular 3D reconstruction → large-scale pre-training beats training directly on video and motion data is still unknown; intuitively I lean toward yes, though I am not sure. Aligning two floating-point numbers at the input is much easier than aligning a gripper with a handle through pixels. This is also a rare advantage robots have over humans: they can read and work with precise numbers.
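As a concrete illustration of that 2D-to-3D lifting step, here is a minimal back-projection sketch under the pinhole camera model. The depth map is assumed to come from a monocular estimator in the depth-anything style; the intrinsics below are made-up placeholders, and real values would come from calibration.

```python
# Lift a predicted depth map into a point cloud (camera frame).
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Turn an (H, W) depth map into an (H*W, 3) point cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx   # pinhole model: u = fx * x / z + cx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Placeholder depth map standing in for a monocular depth model's output.
depth = np.random.uniform(0.3, 2.0, size=(480, 640)).astype(np.float32)
points = backproject(depth, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
print(points.shape)  # (307200, 3): a point cloud usable as 3D policy input
```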
The final piece of the embodied intelligence puzzle is still its Achilles’ heel: touch
The difficulty of embodied intelligence lies in manipulation, and manipulation requires touch. It is an extremely smooth chain of logic, and touch researchers (myself included) often suggest that touch is the last piece of the embodied intelligence puzzle. Yet when we survey touch research, there seems to be a huge rift between touch researchers and embodied intelligence.
What kind of tactile sensors do we need? I think the most important requirement for any embodied-intelligence hardware is "cheap". On top of cheapness we can then work on signal-to-noise ratio, consistency, and covering every fingertip or even the whole hand. Price and market size follow an approximately inverse curve, where price can mean production and time cost, or the selling price of the product.

Some readers may ask: why must tactile technology be tied to business? The best example is the robot dog. I do not know how many readers used the early robot dogs: endless breakage and overheating, but they were cheap. Compared with robot dogs ten times the price, people were happy to buy another or send one back for repair. With more people using them and more iterations, they genuinely became usable, and the algorithms on top of them multiplied, now fluent in parkour and off-road locomotion. So when we talk about the relationship between price and market size, we are really talking about how many clever heads are willing to iterate with you and build algorithms on your hardware.

Finally, a bold claim: a "hand" should cost one tenth of a "humanoid", and the total tactile sensing across all fingertips should cost no more than one tenth of the hand. Above that price, most buyers are coming to study "touch" rather than embodied intelligence. At TEA Lab we developed DTact [6] and 9DTact [7], which cost 200 RMB or even less; after refinements by skilled students, one person can make hundreds a day. The image quality is not as good as Gelsight, but neither is the price.
Talk more with people in touch sensing and you will find many people making touch but few using it. We certainly need better tactile sensors, just as we need sharper cameras, but figuring out how to use the tactile signals we already get is the real way to integrate touch into embodied intelligence. Turning a flat sensing surface into a curved one, adding a temperature channel, swapping the camera for optical fiber: all valuable improvements. But if we truly want touch inside embodied intelligence, we first need data, which means consistency (the same input should produce the same output) and volumes of data that can rival vision. So instead of inventing new functions, why not find a process that keeps the surface rubber consistent and durable? Gelsight breaks after two hours of high-intensity data collection, and DTact may break after a few dozen hours. Human skin regenerates; rubber does not.

Another issue is that adding touch often shows no performance gain. Cup stacking is already saturated by vision, and stroking a headphone cable is a very niche task. So when people read touch papers, the tasks look a bit contrived, and readers may even think: they really racked their brains to find a task that genuinely needs touch. I did this myself when doing tactile research. I really like our DO-Glove [8] work, which connects robot touch with human touch and found a series of tasks that require "force" or "tactile sense". But working this way creates a vicious cycle: people who work on touch only do tasks that need touch, which keeps them out of the big closed loop of embodied intelligence.
What is the big closed loop? It refers to the data-hungry VLA and RL models we will discuss later. Our recent RDP [9], in collaboration with Shanghai Jiao Tong University, and PolyTouch [10] by Ted Adelson, the inventor of Gelsight, both show efforts to bring haptics into this big loop. So in my view there are two paths for haptics. The first is to make touch extremely effective on very hard tasks (say, letting a glass of water slide within your hand without dropping it), but I suspect this path falls into the first failure mode. The second is to make touch cheap and robust, so cheap that people buy it casually and collect data along the way; there may lie a new world for haptics.
Replicating the success of large models: from imitation learning to VLA
In the past two years, imitation learning has gone from a dusty corner to the center of attention. There are several reasons: the improvement in data quality brought by the new Aloha [11] configuration, and the gains in fitting capacity and multimodal action prediction brought by diffusion models [2]. Another important point is predicting a sequence of actions rather than one action at a time; to some extent the actions supervise each other along the time dimension, making behavior more coherent and continuous. This is also intuitive: when we manipulate objects we often work backward from the end, holding a future goal and then producing a series of actions. The form of imitation learning is extremely simple: image in, action out, directly optimizing some distance between the generated action and the collected action. The simple form also brings fragility: it often fails to generalize under perturbations. To address this we built DemoGen [12], which synthesizes data to improve generalization (we livestreamed it a while ago), but this only patches the problem locally. Looking further ahead, two paths extend from here: one is VLA, the other is reinforcement learning. Let us start with VLA.
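Before moving on, a minimal sketch of the imitation recipe just described, under assumed shapes: the policy predicts a chunk of H future actions and is trained to minimize a distance to the demonstrated chunk. The MSE below stands in for "some distance"; a diffusion policy would replace it with a denoising objective.

```python
# Behavior cloning with action chunking: predict H actions at once.
import torch
import torch.nn as nn

H, action_dim, obs_dim = 16, 7, 512   # chunk length, action size, obs feature size

policy = nn.Sequential(
    nn.Linear(obs_dim, 1024), nn.ReLU(),
    nn.Linear(1024, H * action_dim),   # all H future actions in one shot
)

obs_feat = torch.randn(32, obs_dim)            # batch of encoded observations
demo_actions = torch.randn(32, H, action_dim)  # demonstrated action chunks

pred = policy(obs_feat).view(32, H, action_dim)
loss = nn.functional.mse_loss(pred, demo_actions)  # distance to the demonstration
loss.backward()
```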
VLA is first pre-trained on massive data and then post-trained on target-task data. Pre-training builds the model's basic capabilities, the most important being rich scenes and rich actions. Because pre-training has looser data requirements, trajectories that succeed by luck, or that recover from the brink of failure, can in principle help VLA acquire stronger capabilities. But as usual, we are here to poke at it, not to praise it: VLA may not even have found the right architecture yet. The structure represented by pi0 looks deeply inconsistent, an autoregressive VLM in front rigidly spliced to a diffusion module behind. Once the data volume is high enough, a plain full-scale Transformer [13] or DiT [14] may yet return to the throne.
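To show what "rigidly spliced" means structurally, here is a caricature (emphatically not pi0's actual code): a context vector standing in for autoregressive VLM features conditions a separate diffusion-style action head, and the two halves meet only through that vector.

```python
# Toy caricature of a "VLM + diffusion head" VLA; every module is a stand-in.
import torch
import torch.nn as nn

class DiffusionActionHead(nn.Module):
    def __init__(self, action_dim=7, ctx_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim + ctx_dim + 1, 512), nn.ReLU(),
            nn.Linear(512, action_dim),
        )

    def forward(self, noisy_action, context, t):
        # Predict a denoising direction, conditioned on VLM context and timestep t.
        x = torch.cat([noisy_action, context, t], dim=-1)
        return self.net(x)

vlm_context = torch.randn(1, 1024)   # stand-in for autoregressive VLM features
head = DiffusionActionHead()
action = torch.randn(1, 7)           # start from noise
for step in range(10):               # toy reverse-diffusion loop
    t = torch.full((1, 1), 1.0 - step / 10)
    action = action - 0.1 * head(action, vlm_context, t)  # crude denoising update
```

The point of the caricature is the seam: the VLM never sees the action head's errors, and the head never shapes the VLM's representation, which is exactly the inconsistency complained about above.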
The invisible elephant in the room: reinforcement learning
AlphaGo [15] set off a wave of enthusiasm, turning reinforcement learning from a niche field into a perceived panacea for artificial intelligence and the talk of the town. Having happened to start my PhD in 2016, I essentially participated in and witnessed the whole trend: everyone was doing two things, 1) looking for suitable tasks, and 2) improving the data efficiency and performance of algorithms (remember this, it will matter later!). In hindsight, that wave of reinforcement learning did not live up to expectations. After Go we solved Mahjong, Texas Hold'em, Atari games, StarCraft, and DOTA, and arrived at a conclusion: as long as the cost of data is low enough, reinforcement learning, or just PPO [16], can always solve the problem. A great many algorithmic advances faded away along with that conclusion. Meanwhile, OpenAI demonstrated its pursuit of the scaling law: the dexterous hand solving a Rubik's Cube [17], and the red-and-blue hide-and-seek agents [18] that Professor Wu Yi worked on. Unfortunately OpenAI also fell into the failure mode of relying entirely on simulation, and at some point pivoted to the natural-language track, where real data is abundant. The sim-to-real reinforcement learning track, meanwhile, gradually evolved into today's whole-body-control circuits of robot dancing and parkour.
Why don't robots use this approach for manipulation tasks? Because simulation is not good enough; it cannot be done well, not even for washing a handful of spinach under the tap. Then why not do reinforcement learning directly in the real world, as humans do? Because the data cost is too high. DeepSeek and GPT-o1 give us the key idea: the "prior". Language also has a huge output space, so why can it do RL? Because the output space has been pruned, and what pruned it is the pre-trained large model itself. Interested readers can read "The Second Half of Artificial Intelligence" [19]. I think of the monkeys laboring to type out Shakespeare. In theory they could, but the time required tends to infinity. Pre-trained monkeys, left on their own, might still never produce it; but give them a discriminator that keeps telling them whether the text sounds like Shakespeare, and they would probably write it quite quickly. And don't the aforementioned VLA and imitation learning correspond exactly to those pre-trained monkeys? By the time a robot applies reinforcement learning to grasping a cup, it has already roughly learned to grasp it, just not with 100% reliability; that is precisely the moment for reinforcement learning to shine.
Why can reinforcement learning do what VLA cannot? In one English word: "grounding", the tight coupling of data and task. The massive data in VLA is passive: when the model sees a demonstration of successful pouring, it simply conditions on the image and generates the corresponding action. When that condition is perturbed far enough, the model does not actually understand that grasping the handle is the crux. In reinforcement learning, by contrast, every reward keeps telling the model that only grasping the handle counts as success and everything else counts as failure. This closed loop, where the data comes from the agent itself, right and wrong exist, and feedback arrives, is what lets a robot ultimately reach a high success rate.
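A sketch of what that closed loop looks like mechanically: fine-tune a pretrained policy with a sparse success reward while staying near the prior. This is a generic REINFORCE-with-KL-penalty sketch under assumed shapes, not the algorithm of any particular paper cited here.

```python
# "Prior + RL": start from a pretrained policy, fine-tune with sparse reward.
import torch
import torch.nn as nn

obs_dim, action_dim = 512, 7
prior = nn.Linear(obs_dim, 2 * action_dim)   # frozen pretrained policy (mean, log_std)
for p in prior.parameters():
    p.requires_grad_(False)
policy = nn.Linear(obs_dim, 2 * action_dim)  # trainable copy, initialized from prior
policy.load_state_dict(prior.state_dict())
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

def gaussian(params):
    mean, log_std = params.chunk(2, dim=-1)
    return torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())

obs = torch.randn(64, obs_dim)                 # states gathered from rollouts
dist, prior_dist = gaussian(policy(obs)), gaussian(prior(obs))
actions = dist.sample()
reward = torch.randint(0, 2, (64, 1)).float()  # sparse success signal (0/1)

log_prob = dist.log_prob(actions).sum(-1, keepdim=True)
kl = torch.distributions.kl_divergence(dist, prior_dist).sum(-1, keepdim=True)
# Reinforce successful actions; the KL term keeps the policy close to the prior.
loss = -(log_prob * (reward - reward.mean())).mean() + 0.1 * kl.mean()
opt.zero_grad()
loss.backward()
opt.step()
```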
If reinforcement learning is so useful, is everyone using it now? Remember those efforts to improve data efficiency? Algorithms have come a long way. Starting from BEE [20], then DrM [21] and FoG [22], we are still pushing the data efficiency of algorithms, and because real-robot data is expensive, this work suddenly carries weight. More importantly, the VLA "prior" we mentioned is gradually taking shape; pi0.5, for example, is worth looking forward to. TEA Lab's MENTOR [23] and Berkeley's HIL-SERL [24] both did real-robot reinforcement learning before VLA arrived: given a suitable action space, we can train it. But things are not as rosy as they sound. Real-robot RL faces two big obstacles. One is "resetting" the environment, which needs someone standing by and may be as much work as data collection itself. The other is the reward model: we can lean on a VLM, but whether such sparse rewards suffice for training remains to be explored. Relatedly, the American company Dyna Robotics recently chose to train a "task progress" reward model. We have tried this too, without much success, mainly because such a reward model is often not monotonically increasing along a successful trajectory, so we look forward to further exploration here.
In short, embodied intelligence obviously needs reinforcement learning, but how to use it and when it becomes usable seem to have been ignored by almost everyone, which is why the elephant in the room goes unseen.
Final question: does embodied intelligence need an ImageNet moment before it can land?
The "ImageNet" moment of embodied intelligence is a false proposition, or at least a very misleading one. The beauty of ImageNet is that after collecting a huge amount of data, the evaluation only needs to pre-select a part of the images and record their categories as labels. The only difference between different people using ImageNet is the quality of the model they use , so ImageNet has become a recognized arena. If we look at embodied intelligence with this kind of analysis, it is easy to find that the requirements of the "ImageNet moment" are much higher - except for the provided "ImageNet", other components should be exactly the same for different users. This means: 1) the scene can be reproduced; 2) the visual conditions such as perspective and light are consistent; 3) the robot model is unified; 4) the robot is consistent across entities. Even if the sun is at noon, New York and Shanghai are different, so it is almost impossible to achieve the first four things. At Xinghaitu, we strive to give everyone a stable entity; at Stanford, there is a project called UMI[26] that attempts to align the forms of all data. Have you ever thought that if embodied intelligence must have an ImageNet moment, it is not necessarily a data set, but an entity?
Embodied intelligence is a latecomer. We are still scrambling after our ImageNet before the first half is over, while a prophet called the LLM has already been playing the second half for quite a while and offers us a glimpse of the future. So we still do not know how to validate the value of building a dataset, or whether we should instead evaluate directly on real scenes and tasks, the way language models now are. With methods and models not yet understood, we are already building and collecting piles of data, unsure even which embodiment to choose. Past and future seem intertwined. We now have a little bit of a clue, but only a little.
Beyond Embodied Intelligence
It doesn't matter, it's all the same. ——Zhang Beihai
The forms of intelligence may all converge in the same place. Vision, language, and robotics each have their own difficulties from an application standpoint, but from the standpoint of intelligence they are very likely solving the same problem. In the past, people studying natural language had to learn linguistics, people studying vision had to understand neuroscience, and roboticists had to master control theory. Now everyone uses Transformers and massive data. So if what we all care about is the final answer, it is very likely that every field gets stuck, or gets solved, at the same time.
Take the scaling laws we have mentioned repeatedly: it may well be that only when the sample size grows exponentially can we extract more essential regularities. This gets a bit mystical, and we have entirely abandoned rigor here, but in natural environments quantities such as practice time or sound in decibels do vary across orders of magnitude, and each new order brings new information and new ability. Interested readers can look up Bi Dao's popular-science piece on Benford's law. Both our perception of the world and the distribution of numbers found in the world are largely exponential, that is, closer to uniform after taking logarithms. The prior carried in our DNA may be precisely the abilities humans distilled after a long history of exponentially accumulating data.
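For readers who want to see Benford's law rather than take it on faith: the leading digit d of log-uniform "natural" magnitudes appears with probability log10(1 + 1/d), which the short numerical check below reproduces.

```python
# Benford's law: leading-digit frequencies of log-uniform magnitudes.
import numpy as np

d = np.arange(1, 10)
benford = np.log10(1 + 1 / d)   # theoretical P(leading digit = d)

rng = np.random.default_rng(0)
samples = 10 ** rng.uniform(0, 6, 100_000)  # log-uniform over six decades
leading = np.array([int(str(int(x))[0]) for x in samples])
empirical = np.bincount(leading, minlength=10)[1:10] / len(leading)

for digit, (p, q) in enumerate(zip(benford, empirical), start=1):
    print(f"digit {digit}: Benford {p:.3f}  sampled {q:.3f}")
```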
Another interesting perspective is representation: the Platonic Representation Hypothesis [25]. It posits that as neural networks grow larger and their training tasks more diverse, different models represent reality in increasingly consistent ways. What is a representation? I would say it is a new variable obtained by reasonably extracting and organizing the information in existing events or things, adding no new information, so that problems become easier to solve (in the chickens-and-rabbits puzzle, for example, a rabbit is represented simply as an animal with four legs). Here, of course, we mean the vector representations extracted by neural networks.
The story begins with the cave in Plato's Republic, a thought experiment about what reality is. In the allegory, a group of prisoners has been chained in a cave since childhood, knowing nothing of the world outside. Facing the wall, they can see only the shadows of things passing behind them, and over time those shadows become their "reality". Philosophers are like prisoners released from the cave: walking out into the sunlight, they gradually understand that the shadows on the wall are not reality but projections of it. Back to the Platonic Representation Hypothesis: if we believe the world has an underlying true existence, then pixels, language, touch, and the rest are the "shadows of reality" that we perceive, and extracting representations of these projections is, in a sense, extracting information about the reality behind them. Strikingly, researchers have found that two vision models trained on ImageNet and Places365 can swap some network layers, that large-language-model representations can be used in visual prediction, and even that LLM representations correlate strongly with those of the human brain. In essence, all models, our own brains included, are completing common tasks from huge amounts of data.
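The kind of evidence cited above is usually quantified with representation-similarity measures. Below is a sketch of linear CKA (centered kernel alignment), one standard choice, run on purely synthetic features: two random "views" of the same latent score much higher than unrelated features.

```python
# Linear CKA: how similar are two models' representations of the same inputs?
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between (n, d1) and (n, d2) feature matrices."""
    X = X - X.mean(axis=0)   # center each feature dimension
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(X.T @ Y, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
Z = rng.normal(size=(1000, 64))        # shared latent "reality"
X = Z @ rng.normal(size=(64, 128))     # model A's projection of it
Y = Z @ rng.normal(size=(64, 256)) + 0.1 * rng.normal(size=(1000, 256))  # model B's
W = rng.normal(size=(1000, 256))       # unrelated features, same shape as Y

print(linear_cka(X, Y))   # high: two views of the same latent align
print(linear_cka(X, W))   # near zero: unrelated features do not
```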
But why should representations converge? I agree with the conjecture in the original paper: doing 100 tasks places far stricter demands on a representation than doing 10, because each added task prunes away representations that would otherwise have sufficed. As data and tasks grow (in vision, language, or embodied intelligence), representations end up confined to similar spaces. And since larger models find this target space more easily, we have another answer to why everyone now recites "scaling law" or "compression is intelligence": more data and tasks yield higher-quality, more convergent representations, and larger models find such representations more easily. But recall the third "failure mode" above: blindly piling up data works in principle yet may not be the most reliable route, and what is reliable I have already discussed in the RL section.
Looking further out and more broadly: if your ambition lies in intelligence, then whether you work on embodied intelligence, large models, world models, or neuroscience, perhaps it is all the same thing.
In conclusion:
The original intention was to organize some of our group's research ideas and pull a few threads out of the tangled pool of embodied intelligence papers. By the end, scholarly rigor has admittedly been abandoned, and the article inevitably rambles: some academic discussion, some rash criticism, and some predictions that can be neither proved nor falsified. I only hope to provoke a little thought among those who have not, to this day, given up thinking.