In the DeepSeek era, can smart hardware bring a new "GPT moment"?

Written by
Silas Grey
Updated on: June 25, 2025
Recommendation

New forms and challenges of smart hardware under the wave of AI.

Core content:
1. The DeepSeek-R1 reasoning model pushes AI to new heights, and smart hardware is evolving into new forms
2. How large companies deploy smart hardware and respond to changes in computing platforms
3. The AICon Global Artificial Intelligence Conference focuses on innovation trends and practical cases in smart hardware



Since the beginning of the year, the DeepSeek-R1 reasoning model has pushed the AI wave to new heights, and industries have quickly followed suit; smart hardware is taking on new forms in the era of large models. So how is smart hardware evolving in terms of computing power? How should large companies deploy smart hardware and respond to changes in computing platforms?

Recently, with the AICon Global Artificial Intelligence Development and Application Conference 2025 (Shanghai) about to be held, InfoQ's "Geek Meet" x AICon livestream column invited Wang Song, co-founder and CTO of Future Intelligence, to host a conversation with Gu Jian, partner at Li Weike Technology and head of its algorithm laboratory, and Zhang Guangyong, head of AI Infra at NetEase Youdao, on the challenges and opportunities of smart hardware in the GPT era.

Some of the highlights are as follows:

  • Returning to the essence of hardware design, the key is to match the capabilities of hardware and software around the usage scenarios.

  • In the future, glasses will definitely have their own computing platform.

  • For devices like headphones and glasses, the user experience will continue to improve with the integration of scenarios and the fusion of AI and hardware.

At the AICon Global Artificial Intelligence Development and Application Conference to be held in Shanghai on May 23-24, we have set up a special topic on [Smart Hardware Implementation Practice]. This topic will focus on the innovation trends and industry changes in the field of smart hardware, and invite relevant manufacturers to share the latest technological progress and explore future development directions from multiple dimensions.

Check out the conference schedule to unlock more exciting content: https://aicon.infoq.cn/2025/shanghai/schedule

The following content is based on the livestream transcript and has been edited by InfoQ.

Technological evolution drives product innovation
 Wang Song: In the past year, is there any smart hardware or product form that makes you feel that "this is really different"? What are the essential improvements in core technology?

Gu Jian: I was impressed by the Ola Friend earbuds launched by ByteDance. At first, I didn't pay much attention to them, but after buying and using them, I found the experience very smooth. In particular, I was satisfied with the interaction with the large model: the speed and fluency of conversations with Doubao, the wake-up capability, and the noise-reduction effect. I think this is an entry-level large-model product, but it does meet my expectations for AI hardware.

 Wang Song: What are your usual usage scenarios?

Gu Jian: My children also like to chat with Doubao, including listening to songs and asking questions.

Zhang Guangyong: I don't pay special attention to any single product, but the field of smart hardware has made significant progress in recent years. For example, AI PCs, smart glasses, humanoid robots, and Youdao's dictionary pen and Q&A pen have all been combined with large models and moved from theory to practical application.

Overall, the portability, smoothness, and quality of these devices have progressed faster than expected. The most impressive thing is their low latency, which dispels the concern that large models might respond slowly. Once large models land on smart hardware, the user experience improves greatly, making communication between people and devices more natural.

 Wang Song: With the development of technologies such as model compression and quantization, which functions that were not possible in the past can now be implemented on the device side?

Gu Jian: We have made three generations of glasses. In the first generation, we used the Android system and applied functions such as SLAM to sports glasses with cameras. But in the second and third generations, we found it quite difficult to implement complex algorithms, let alone large-model algorithms, on the device side. For example, it is now possible to put a model that takes up several GB on a mobile phone, but in terms of power consumption and effect it still cannot meet the user's basic expectations. Our product is a pair of glasses weighing only a few dozen grams, so the challenge is even greater. Building an on-device large-model product that meets user requirements and runs smoothly in specific scenarios is indeed very difficult. Therefore, we still think cloud-based models are the best solution.

Zhang Guangyong: Initially, our functions were mainly focused on word lookup and translation. Now we have launched more large-model capabilities, such as the AI teacher's Q&A function, grammar explanations, and word explanations. As for the deployment mode, we have several options: one is pure cloud, and the other is a combination of cloud and local. Because the computing power of mobile phones is still far from what large models require, some large models cannot run locally, so we rely on base models plus cloud computing. For interactive scenarios such as speech recognition and OCR, we can use local computing power to handle tasks offline.

In addition to the cloud-plus-local mode, we have also launched offline large models. Although these are not as large as the cloud models, which have tens or even hundreds of billions of parameters, we have deployed offline models of 0.5B to 3B parameters. These models support Chinese-English translation and ancient poetry translation, and a single model can complete multiple tasks.

Compared with the original offline feature, using an offline large model for translation greatly improves quality, even exceeding that of online NMT.
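
To make this cloud-plus-local split concrete, here is a minimal sketch of a dispatcher that prefers the cloud model when the network is available and falls back to a small on-device model when offline. The class names, task labels, and fallback rules are illustrative assumptions, not Youdao's actual implementation.

```python
# Minimal sketch of a cloud/on-device hybrid dispatcher. All names are hypothetical.
from dataclasses import dataclass

@dataclass
class Request:
    task: str      # e.g. "translate_zh_en", "classical_poetry"
    text: str

class LocalSmallModel:
    """Stand-in for an on-device 0.5B-3B multi-task model."""
    SUPPORTED = {"translate_zh_en", "translate_en_zh", "classical_poetry"}

    def run(self, req: Request) -> str:
        # A real implementation would call a quantized on-device runtime here.
        return f"[local:{req.task}] {req.text}"

class CloudLargeModel:
    """Stand-in for a tens-of-billions-parameter cloud model."""
    def run(self, req: Request) -> str:
        return f"[cloud:{req.task}] {req.text}"

def dispatch(req: Request, online: bool) -> str:
    local, cloud = LocalSmallModel(), CloudLargeModel()
    # Prefer the cloud when available; fall back to the on-device model offline,
    # but only for tasks the small model was trained to handle.
    if online:
        return cloud.run(req)
    if req.task in LocalSmallModel.SUPPORTED:
        return local.run(req)
    return "This feature needs a network connection."

if __name__ == "__main__":
    print(dispatch(Request("translate_zh_en", "白日依山尽"), online=False))
```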

 Wang Song: Does the offline function you mentioned rely on the offline mode of the mobile phone?

Zhang Guangyong: No, our offline function is based on the dictionary pen itself. The dictionary pen can be used independently, which matters especially for students, because parents usually don't give their children mobile phones. We want the dictionary pen to operate on its own, so students can use it outdoors or in classrooms. Therefore, the offline function we deployed relies entirely on the computing power of the dictionary pen, without help from mobile phones or other devices. In this way, the dictionary pen can be used even without a network connection.

Technical implementation and cross-team collaboration
 Wang Song: Smart hardware requires deep collaboration among software, algorithms, hardware, and product teams. What key conflicts have you encountered during the collaboration process? How do you find the best balance?

Zhang Guangyong: From the perspective of the hardware team, hardware engineers pursue stable performance, controllable costs and mass production. The product team is more concerned with user experience and time to market, and usually needs to compress the development cycle. However, hardware development is different from APP development. Hardware not only requires research and development, but also involves many other factors, such as AI algorithm modules. The iteration speed of AI is slow, which brings time pressure to product development and leads to conflicts between hardware and product requirements.

From the perspective of software and algorithms, we hope to have flexible development capabilities. On the dictionary pen, we deployed a local model. Ideally, we hope that the local computing power and memory are as large as possible, but this will greatly increase the hardware cost, especially for small hardware products like ours, the cost pressure is relatively high. Due to the low market positioning and price of the dictionary pen, its computing power and memory are far less than that of a mobile phone. Therefore, with limited hardware, deploying multiple AI models faces greater challenges. In addition, the product team faces frequent changes in requirements, especially new products combined with AI, which makes the iteration of algorithms more complicated. On the APP side, due to limited memory, it is impossible to support too many underlying algorithm modules, and overall optimization is required. Ultimately, our goal is to create a product with excellent user experience and ensure that our smart hardware has the overall advantages of high quality, low latency, low cost, and low power consumption.

Gu Jian: The definition of hardware products is crucial in the early stages. First, we need to clarify the usage scenarios of the hardware and make reasonable compromises based on this. For example, in the glasses we designed, although AI glasses are for the mass market, we must ensure that users can wear them for a long time, and the appearance of the glasses should take into account the target group.

Functional definition is also crucial. As a product that focuses on voice interaction, we need to design specific functions for the glasses, such as noise reduction and specific vocabulary recognition. At the same time, we must find a balance between battery capacity and appearance design. The contradiction between performance and appearance will inevitably emerge in this process. For example, some people may want glasses to have the ability to interact with users for a long time, or even realize functions similar to smart assistants. Returning to the essence of hardware design, the key is to match the capabilities of hardware and software around usage scenarios.

 Wang Song: Have you ever been forced to simplify functions or even models due to device computing power limitations? Can you share a case study of a breakthrough through algorithm optimization or hardware adaptation?

Gu Jian: In the design process, we do face situations where we give up some functions. Our current design is built around a dispatch (routing) model. Many people ask which large model we used when designing the AI glasses. We need to explain this in more detail: we do not use just one large model, but a combination of models, including small models, a dispatch model, chat models, and agent models. In this setup, the dispatch step must be fast enough, so we may choose a smaller model for it.

But when chatting, in order to ensure accuracy and avoid wrong answers, we will use a larger model. When using a large model, the reply speed may be relatively slow. So how to solve the problem of waiting time in this process? Because the patience of users of glasses devices is very limited, user feedback must be given in a short time. These are very important and challenging parts of the design, especially in the Agent function. Many Agent access methods even involve different large models, which is also a more complex difficulty in hardware and software design.

Wang Song: Professor Gu mentioned a very critical point - different functions or scenarios may require different models. The dispatch layer at the front is effectively a mixture-of-experts (MoE)-style setup within your company.

Gu Jian: Yes. In many cases, if a user just says a simple "hello" but you still call the DeepSeek model, it wastes a lot of resources. The key is how to dispatch. For example, after routing the request, we can decide whether to call the Doubao model, the DeepSeek model, and so on. This design is very important.
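
As an illustration of this dispatch idea, here is a minimal sketch in which a cheap rule-based classifier routes a query either to a lightweight chat model or to a heavier reasoning model. The model names, keywords, and thresholds are hypothetical and not Li Weike's actual routing logic, which, as Gu Jian describes, uses a trained dispatch model rather than hand-written rules.

```python
# Minimal sketch of query routing between a small chat model and a large
# reasoning model. All model names and rules are illustrative assumptions.

SMALL_TALK = {"hello", "hi", "thanks", "good morning"}

def classify(query: str) -> str:
    """Toy intent classifier; a production system would use a trained model."""
    q = query.lower().strip()
    if q in SMALL_TALK or len(q.split()) <= 3:
        return "chat"          # greetings / short chit-chat
    if any(k in q for k in ("prove", "calculate", "plan", "why")):
        return "reasoning"     # multi-step reasoning
    return "chat"

def call_model(name: str, query: str) -> str:
    # Placeholder for an actual API call to the chosen backend.
    return f"{name} -> {query}"

def route(query: str) -> str:
    if classify(query) == "reasoning":
        # Larger reasoning model: slower but more accurate.
        return call_model("large-reasoning-model", query)
    # Smaller chat model: fast and cheap.
    return call_model("small-chat-model", query)

if __name__ == "__main__":
    print(route("hello"))
    print(route("Plan a three-day trip around West Lake and estimate the budget"))
```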

Zhang Guangyong: The dictionary pen uses both cloud and device-side models. For the cloud part, we use Youdao's self-developed "Zi Yue" education model. Computing power on the device side is small; for example, the dictionary pen is equipped with an A53 CPU, which is much weaker than a mobile phone chip. Therefore, from the second generation to the current seventh generation, we have made many optimizations in algorithms and engineering, including distillation, MoE, and quantization. Because third-party computing libraries did not perform well enough, we implemented some low-level compute libraries ourselves and adopted mixed-precision quantization so that our models can be deployed locally on the dictionary pen. Due to the limits of computing power and memory, our early models were small and performance optimization was insufficient, but that was not the end of the story. As performance has improved, the size of the models has also gradually increased; after the latest round of optimization, the parameter count has doubled. From 2018 to the present, we have shipped multiple optimized versions of offline machine translation, improving quality and reducing latency, which has greatly improved the user experience.
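
The mixed-precision idea can be sketched in a few lines: quantize most weight matrices to INT8 while keeping precision-sensitive layers in FP16. The layer names, the per-tensor scheme, and the numpy-only code below are illustrative assumptions, not Youdao's actual low-level libraries.

```python
# Minimal sketch of mixed-precision weight quantization (INT8 + FP16).
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor INT8 quantization: returns int8 weights and a scale."""
    scale = np.max(np.abs(w)) / 127.0 + 1e-12
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

def quantize_model(layers: dict, keep_fp16: set):
    """Quantize every layer to INT8 except those listed in keep_fp16."""
    packed = {}
    for name, w in layers.items():
        if name in keep_fp16:
            packed[name] = ("fp16", w.astype(np.float16))
        else:
            packed[name] = ("int8", *quantize_int8(w))
    return packed

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    layers = {"embedding": rng.normal(size=(1000, 64)),
              "ffn.0": rng.normal(size=(64, 256))}
    packed = quantize_model(layers, keep_fp16={"embedding"})
    kind, q, scale = packed["ffn.0"]
    err = np.abs(dequantize(q, scale) - layers["ffn.0"]).mean()
    print(kind, q.dtype, f"mean abs error {err:.4f}")
```

In practice, per-channel scales and calibration data would be used, but the memory-versus-accuracy trade-off is the same.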

In addition, we also have very close cooperation with upstream and downstream, especially with chip manufacturers. The implementation of smart hardware requires the support of AI capabilities, and the cooperation of chip manufacturers is crucial. The NPU chip used on the end side is more powerful than the CPU and consumes less power. For example, after our OCR model switched from CPU to NPU, the model size increased by 15 times, the error rate dropped by more than 60%, and the recognition speed increased by 50%.

User Experience and Interaction Design
 Wang Song: How to coordinate algorithm performance, hardware capabilities and user experience to achieve efficient hardware interaction design?

Zhang Guangyong: Although the model currently shipped on the dictionary pen is not an end-to-end multimodal model, the user-facing experience is already multimodal. Users can input not only through text but also through voice, and the pen's scanning input in particular is the most efficient and the users' favorite input method. Of course, this has been explored gradually; we added a camera to the 7th-generation dictionary pen to serve users better.

In the design of the dictionary pen, it originally only provided word lookup and translation functions, and its shape was relatively long and concentrated at the tip of the pen. With the addition of the question-answering function, especially the need for question explanation, we found that the full screen is more suitable for this function, so we upgraded it to a full-screen design, which improved the display effect of the screen and made it more convenient to use.

Gu Jian: Although some manufacturers may combine rings or mobile phones to control glasses, we have always believed that the integrated design, that is, the interaction method of the glasses themselves, is the most complete. Therefore, our core is still the voice experience. We believe that voice interaction is the most basic part of all interaction methods. In addition, we may add some simple sliding operations on the temples.

In terms of voice interaction, we focus on basic functions such as voice recognition and command recognition. Especially in the dual-chip design, how we optimize noise reduction and sound source localization is an important part of the interaction design. At the same time, we also consider wake-up words and simultaneous interpretation during the translation process, such as echo cancellation and language differentiation. The application of these functions in actual scenarios is very complex, so after determining the scenario, we will optimize the core capabilities of hardware and software around the scenario.

Wang Song: Many people believe that glasses are the next generation computing platform. What do you think?

Gu Jian: I have been working in the AR industry, and now I think that glasses still cannot be completely separated from mobile phones. However, in the future, glasses will definitely have their own computing platform. If glasses want to adapt to future technological changes, they may subvert the existing app stores and replace them with a system similar to the Agent store. Glasses must break free from the constraints of mobile phones.

In the future, glasses will have functions such as eye tracking, SIM cards, and cameras. Striking a compromise between battery life and weight while providing these functions and keeping the glasses light (under 40 grams, preferably 30 to 35 grams) will be a huge challenge. I think this goal may take another 3 to 5 years, or even longer, to achieve.

 Wang Song: I heard that Apple's Vision Pro 2 is already under development. Do you think its first generation product is successful?

Gu Jian: I think the first generation was not a success. It weighed more than 600 grams and the sales volume did not meet expectations. I think the second generation will focus more on optimization. It may be comparable to Meta's glasses or use new display technologies such as silicon carbide materials.

 Wang Song: What challenges will there be in the architecture design of AI Infra in future multimodal perception technologies?

Gu Jian: We expect to launch glasses with cameras next month. We have already used these camera glasses to test multimodal applications, such as identifying cultural relics in museums. I think there are several key points. The first is the multimodal transmission protocol: how to transmit data such as pictures to the cloud quickly while keeping latency and power consumption low. The second is vector storage, especially storing multimodal data and aligning it with text, which is also a technical difficulty.

In addition, parallel computing is also an important issue. During the transmission process, voice computing and other operations may need to be performed simultaneously. In addition, the interaction mode will also change greatly. For example, when you see a picture, the system may actively tell you what it is, or you can actively ask: "What is this picture?" How to make these interactions natural and smooth is a challenge in architectural design. I think the key parts of the underlying architecture include the design of vector storage and multimodal transmission protocols.
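
For the vector-storage piece, here is a minimal sketch: image and text embeddings live in one index so a photo from the glasses can be matched against stored descriptions. The random vectors stand in for a real aligned image/text encoder, and the whole class is an illustration of the idea rather than any production component.

```python
# Minimal sketch of a multimodal vector store using cosine similarity.
import numpy as np

class VectorStore:
    def __init__(self, dim: int):
        self.dim = dim
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.payloads = []

    def add(self, vec: np.ndarray, payload: dict):
        vec = vec / (np.linalg.norm(vec) + 1e-12)        # normalize for cosine similarity
        self.vectors = np.vstack([self.vectors, vec.astype(np.float32)])
        self.payloads.append(payload)

    def search(self, query: np.ndarray, k: int = 3):
        query = query / (np.linalg.norm(query) + 1e-12)
        scores = self.vectors @ query
        top = np.argsort(-scores)[:k]
        return [(float(scores[i]), self.payloads[i]) for i in top]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    store = VectorStore(dim=512)
    # In practice these vectors would come from aligned image/text encoders.
    store.add(rng.normal(size=512), {"type": "text", "desc": "bronze ding, museum exhibit"})
    store.add(rng.normal(size=512), {"type": "image", "desc": "photo taken by the glasses"})
    print(store.search(rng.normal(size=512), k=1))
```

A real deployment would also need persistence and approximate nearest-neighbor search once the index grows beyond a few thousand items.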

Zhang Guangyong: From the user's perspective, our dictionary pen is already a multimodal product, with pen scanning, camera photography, and voice input. Of course, in terms of the end-to-end solution, the current pipeline is still serial, and it will certainly evolve toward an end-to-end multimodal model in the future. With a full multimodal model, we might transfer captured pictures directly to the large model in the cloud for processing. For now, we use a combination of local and cloud models: we first perform OCR text recognition locally and then transfer only the text, which greatly reduces the amount of data transmitted.

At present, our technology for image capture and transmission is relatively mature and has been implemented in products. In the future, we may still focus on deploying multimodal models in the cloud, which brings challenges such as distributed parallel computing, data partitioning, and quantization. If more modalities and different network structures are added, deployment becomes more complicated, and it is necessary to design and develop against GPUs or other chips to ensure the model achieves high throughput while keeping latency low.

Scenario-based applications
 Wang Song: In educational hardware, how can we ensure the rapid response of the model and high-precision knowledge output through algorithm optimization and AI Infra support?

Zhang Guangyong: First, regarding hallucination, we use the "Zi Yue" education model together with RAG and knowledge-base technologies, relying on years of accumulated education data, to mitigate these problems. For low latency, we use mixed quantization (INT8, INT4, FP16) to make full use of local computing power at low precision. For accuracy, pure INT4 may not be sufficient, so mixing INT8 and FP16 ensures both accuracy and fast response.
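
The RAG part of that recipe can be sketched as follows: retrieve the most relevant passages from a curated knowledge base and instruct the model to answer only from them. The toy lexical retrieval and the prompt wording below are illustrative; a production system would use embedding-based retrieval over a real education corpus.

```python
# Minimal sketch of retrieval-augmented generation (RAG) to reduce hallucination.
def retrieve(question: str, knowledge_base: list[str], k: int = 2) -> list[str]:
    # Toy lexical-overlap retrieval; production systems use embeddings.
    def score(passage: str) -> int:
        return len(set(question.lower().split()) & set(passage.lower().split()))
    return sorted(knowledge_base, key=score, reverse=True)[:k]

def build_prompt(question: str, passages: list[str]) -> str:
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using ONLY the reference material below. "
        "If the answer is not in the material, say you are not sure.\n"
        f"Reference material:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

if __name__ == "__main__":
    kb = [
        "The Pythagorean theorem states a^2 + b^2 = c^2 for right triangles.",
        "Photosynthesis converts light energy into chemical energy in plants.",
    ]
    q = "What does the Pythagorean theorem state?"
    print(build_prompt(q, retrieve(q, kb)))  # the prompt would then go to the LLM
```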

 Wang Song: What is the OCR recognition rate on your end?

Zhang Guangyong: Our OCR recognition rate for regular text can usually reach more than 98%. Of course, the accuracy of recognition is related to the usage habits of the dictionary pen. If the user does not correctly aim at the scanning area or does not take a good photo, it may affect the recognition effect. In this case, the user sometimes rescans. If the scan is in place, there is no problem with the recognition of regular text. Even for some complex scenes, such as artistic characters, handwriting, etc., we can maintain a high accuracy.

 Wang Song: So for scenarios like OCR, is offline mode sufficient?

Zhang Guangyong: Yes, for most cases, offline capabilities are sufficient. However, for some complex scenarios, such as complex formula recognition, offline mode may not be able to handle it well due to computing power limitations. In this case, we will combine some online capabilities to solve it.

 Wang Song: Could you please share some specific algorithm optimization strategies or AI Infra architecture designs to demonstrate differentiated tuning practices in these two areas?

Gu Jian: Overall speed is still the key issue. For example, when deploying a model, we may first use a small model, in a manner similar to speculative sampling, to draft a sequence, and then verify it with a large model. In addition, during the design process we are committed to improving the user experience: compared with typing text into a phone chat, the feeling of interacting with glasses is completely different. We designed a dispatch strategy and trained our dispatch model by quickly processing a large amount of annotated corpus, including system corpus, chat corpus, and instruction corpus.
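
The draft-and-verify idea behind speculative sampling can be sketched with toy models: a small model proposes a few tokens cheaply, and the large model keeps the longest agreeing prefix plus its own correction. Real speculative sampling accepts draft tokens probabilistically from the two models' distributions; the greedy version below only shows the control flow and is not any vendor's implementation.

```python
# Minimal greedy sketch of speculative (draft-and-verify) decoding.
def speculative_step(tokens, draft_model, target_model, k=4):
    """One draft-and-verify step; each model maps a token sequence to the next token."""
    # 1) The small draft model proposes k tokens cheaply.
    draft, ctx = [], list(tokens)
    for _ in range(k):
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)
    # 2) The large target model verifies the draft: keep the agreeing prefix,
    #    then substitute its own token at the first mismatch.
    accepted, ctx = [], list(tokens)
    for t in draft:
        expected = target_model(ctx)
        if expected == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(expected)
            break
    return tokens + accepted

if __name__ == "__main__":
    # Toy "models": the draft model guesses the next letter of the alphabet;
    # the target model agrees, except it capitalizes every third position.
    draft_model = lambda ctx: chr(ord(ctx[-1].lower()) + 1)
    target_model = lambda ctx: (chr(ord(ctx[-1].lower()) + 1).upper()
                                if len(ctx) % 3 == 0
                                else chr(ord(ctx[-1].lower()) + 1))
    seq = ["a"]
    for _ in range(4):
        seq = speculative_step(seq, draft_model, target_model, k=3)
    print("".join(seq))
```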

For example, when a user asks about the weather, the system can quickly call up weather information. If the user says, "I'm full and want to go to West Lake," the system needs to decide whether to call up the navigation function or provide food recommendations around West Lake. All of this depends on our training strategy, which improves the overall speed by annotating a large amount of data.

In terms of user experience, we have also added some optimizations. For example, during the search process, the system will prompt users to wait for a while. When users receive similar feedback, they are usually willing to wait for a few seconds. At this time, we can provide better feedback and improve the user experience.

Wang Song: What technological innovations at the AI Infra level do you think can effectively improve the product's scenario adaptability and user experience? Can you share a successful case you have participated in or known about, and explain in detail how to achieve product scenario design and improve user value perception through the combination of algorithms and AI Infra?

Zhang Guangyong: Our work mainly focuses on two major scenarios: word lookup and translation, and the AI Q&A teacher. The lookup-and-translation function combines OCR, translation, and TTS technologies, while the AI Q&A function, with the support of a large model, can provide more value to users. The goal of the Q&A function is not to replace teachers but to supplement them. For example, traditional tutors have clear divisions of labor by subject, whereas an AI large model can handle problems across all subjects within the same model. If students run into a history question while studying mathematics, the AI large model can help answer that too.

In addition, AI big models can also provide a better interactive experience. Unlike the traditional fixed question-answering method, big models can achieve flexible interaction. Students can interrupt and ask questions at any time, asking the model for specific knowledge points or related encyclopedic knowledge, which can make the learning process more interesting and broaden students' knowledge.

Gu Jian: At present, we are working with exhibition and foreign-trade scenarios to create a complete solution. This solution includes multilingual translation, especially translation of less widely spoken languages, as well as all-day recording and summarization functions. This matters especially in professional settings; for example, I attended the Canton Fair a few days ago and really felt how much people from different countries needed translation services.

Although there are many English translation devices, there are still challenges in translating less common languages, professional vocabulary, and different accents. Our glasses let users communicate quickly, especially at exhibitions, where exhibitors need to record their conversations with customers. If you talk to 100 customers a day, it is normal not to remember all the details. Our solution helps exhibitors record the conversation, summarize it, and even keep the translation history and audio files for following up with potential buyers. I think this is an effective translation solution grounded in real scenarios.

Wang Song: If we can add video and photo functions to record the on-site situation and restore the scene, users may have a deeper impression.

Gu Jian: Indeed, after receiving feedback, we plan to add the function of taking business cards and group photos in the camera version, and insert these contents into the records, which can make the records more complete.

 Wang Song: When developing end-side capabilities, should we choose an open source model or a self-developed closed source solution? What dimensions should we consider?

Zhang Guangyong: Optimizing algorithms and models on the device depends on two parts: the algorithm itself and model engineering. We do deep optimization based on open-source models and our own data. In the cloud, there are many open-source inference frameworks that work well, but on the device side there are fewer, and their effectiveness is limited. The main reason is that the computing power and memory of the dictionary pen are very limited - only 1 GB of memory - while some models require hundreds of megabytes. In addition, third-party frameworks often cannot meet real-time requirements. Therefore, we chose to implement the underlying inference service ourselves, which not only improves speed but also reduces runtime memory, keeping memory consumption within a controllable range. This also reflects an important difference between deploying models on the device and in the cloud: the cloud can meet user needs by scaling out to more machines and more cards, but on the device a single chip has to support multiple functional modules at the same time, such as the offline large model, OCR, TTS, and ASR. This limitation makes deploying local models on the device much more challenging.
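
A rough memory budget shows why this matters on a roughly 1 GB device. Only the 1 GB figure and the 0.5B-3B offline-model range come from the interview; the module list, parameter counts, bit widths, and overhead factor below are illustrative assumptions.

```python
# Back-of-the-envelope memory budget for co-resident on-device models.
def model_memory_mb(params_millions: float, bits_per_weight: int, overhead: float = 1.2) -> float:
    """Approximate resident size: weights plus ~20% runtime overhead (activations, buffers)."""
    return params_millions * 1e6 * bits_per_weight / 8 / 1e6 * overhead

modules = {
    "offline LLM (0.5B, INT4)": model_memory_mb(500, 4),
    "OCR (50M, INT8)":          model_memory_mb(50, 8),
    "ASR (80M, INT8)":          model_memory_mb(80, 8),
    "TTS (30M, INT8)":          model_memory_mb(30, 8),
}

total = sum(modules.values())
for name, mb in modules.items():
    print(f"{name:28s} ~{mb:6.0f} MB")
print(f"{'total':28s} ~{total:6.0f} MB of a 1024 MB budget (plus OS and app)")
```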

Gu Jian: At present, we do not have a fully open source end-side solution because the computing power of glasses is limited. We believe that glasses and mobile phones are personal devices, and users' chat records and other data should be kept locally to ensure privacy. Therefore, we tend to use open source solutions that are optimized to suit mobile phones or glasses. However, it seems that there is still some distance for glasses to directly run end-side models.

 Wang Song: How long do you estimate it will take to be able to run offline models directly on the glasses?

Gu Jian: This is definitely closely tied to the development of battery technology; for example, some semi-solid-state batteries are already in use. If chip computing power improves, battery life can also increase. I think it is very likely to happen in the next one or two years; many companies are already exploring this possibility. On-device models will be relatively small and will probably solve problems in specific scenarios, rather than being the full offline large-model solution we discussed. So this kind of small offline model is very likely to be realized.

 Wang Song: What paradigm-level experience changes do you think will be brought about by deeply embedding AI Agents into hardware?

Gu Jian: We attach great importance to the concept of Agent, because we believe that glasses should be a portable device, like a personal assistant. For example, we have our own Agent store. In addition, we are also exploring the MCP solution, hoping to access more Agent frameworks. The advantage of Agent is that it can break through traditional data limitations, connect all apps, and record user usage, so as to provide more personalized assistance. We hope to implement similar functions on glasses, such as ordering coffee, takeout, and booking tickets through Agent. With Agent, you no longer need to open your phone, which is the future development trend.

Zhang Guangyong: Our AI Q&A teacher is essentially an agent with several features. First, it can achieve personalized teaching, teach students in accordance with their aptitude, and support full-subject teaching. In traditional education, each subject is relatively independent, but with the support of AI agents, students can learn across disciplines and break down barriers between disciplines.

Secondly, AI agents can enhance students’ learning experience and improve interactivity. For example, for subjects with a strong sense of space, such as solid geometry, students may find it difficult to understand, but if presented through videos or animations, learning will be more visual. AI can generate content based on students’ needs, and even allow students to draw and generate learning content based on their own ideas, rather than being limited to a fixed format.

In addition, AI can also promote students to shift from passive learning to active learning. In traditional education, students mainly receive lectures from teachers, but now, students can actively explore knowledge through scanning, taking pictures, voice interaction, etc. Interaction with AI allows students to ask questions at any time, stimulating a more active interest in learning. AI can not only generate videos and animations, but also create other works, which provides more possibilities for students.

Wang Song: I think there will be two modes of interaction between AI and humans in the future. The first is the Copilot mode, where the main activities are still led by humans, while AI provides highly intelligent assistance on the side. This mode is inevitable in the future because humans are always the protagonists. The second mode is that humans set the task at the beginning, and then AI completes it independently, and then notifies humans after the task is completed. In the future, I think these two modes will run in parallel for a long time. Humans will continue to participate, but they can also be "lazy" occasionally. Therefore, both modes are very important in the future.

 Wang Song: What is the most promising smart hardware scenario in the next 2-3 years?

Zhang Guangyong: For our own products, we mainly focus on intelligent hardware products that combine AI and education. For example, the SpaceOne Q&A pen we launched this year has a full screen and is more suitable for the implementation of large models. Based on these hardware, coupled with the capabilities of large language models, reasoning models, and multimodal models, our products can provide a very natural interactive experience, whether it is voice or photo taking, it can be carried out smoothly.

For other products, the user experience will get better and better. For example, I used the Doubao earphones. I originally thought that the delay was large and there would be lags, but after actual use, I found that its interaction was very natural and the response was very fast, which could easily solve various problems.

For devices like headphones and glasses, the user experience will continue to improve with the combination of scenarios and the integration of AI and hardware. Of course, smart hardware faces challenges, especially power consumption and weight issues, especially glasses need to be more portable. In the future, the defects of these devices will gradually be made up, and the experience will get better and better.

Gu Jian: I am still very optimistic about the development of glasses. For example, in future education, the myopia rate among children is very high, and many parents do not want their children to use mobile phones. If children wear glasses, they can use them to scan questions, prompt learning content, and even guide sitting posture correction, etc. I think this is an important application scenario of glasses in the field of education.

In addition to glasses, devices like necklaces and rings will also need to be combined with AI. Collecting personal data through these portable devices and training a personal assistant or auxiliary system will greatly improve the user experience. In the future, it may even be possible to combine this data with brain-computer interfaces or robotics so that users have an "avatar" to help complete many tasks. In this way, users can enjoy life more without worrying about trivial matters. The large model is just a starting point; as technology advances, human workload will gradually decrease, and we will rely more on these avatars to get work done.