Recommendation
Exploring new solutions for AI product deployment to reduce hardware costs and optimize resource utilization.
Core content:
1. Advantages and challenges of private deployment of AI products
2. KTransformer algorithm principle and its optimization capabilities
3. Practical applications and limitations of KTransformer in different scenarios
Yang Fangxian
Founder of 53AI/Most Valuable Expert of Tencent Cloud (TVP)
Recently, I surveyed the product managers around me and found that very few of them are actually working on AI, and those who are all work at companies that rely on APIs.
I believe that to truly build AI products, with their model knowledge bases and training, technical staff will recommend private deployment, but this requires the company to provide resources. Hardware alone is not enough: the development team must also be able to cooperate, provide support, and actually implement the AI technology stack. Only then can downloading AI models and deploying model-management tools be accomplished.

KTransformers' algorithm, in simple terms, reduces video-memory (VRAM) usage by offloading part of the model computation to the CPU and system memory. As mentioned above, the main design goal of KTransformers is an easy-to-use, template- and rule-based injection framework, so that multiple operator-level optimizations can easily be combined into a deployment-ready engine and conveniently tested and verified in different environments. This article will introduce this capability of KTransformers through three scenarios: MoE, multi-GPU, and on-demand CPU offload.
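To make the idea of a rule-based injection framework concrete, here is a minimal sketch in plain Python. It is my own illustration, not the real KTransformers API: rules match modules by their path, and matching modules are swapped for an "optimized" implementation that computes on CPU/RAM instead of the GPU. All class and rule names here are hypothetical.

```python
import re

class Linear:
    """Stand-in for a stock operator that would run on the GPU."""
    def __init__(self, name):
        self.name = name
        self.device = "cuda"

class CPUOffloadLinear(Linear):
    """Stand-in for an optimized operator that computes on CPU and system RAM."""
    def __init__(self, name):
        super().__init__(name)
        self.device = "cpu"

# Template rules: a regex over the module path -> replacement operator class.
# Here, MoE expert weights are offloaded to CPU while attention stays on GPU.
RULES = [
    (re.compile(r".*\.mlp\.experts\..*"), CPUOffloadLinear),
]

def inject(modules):
    """Return a copy of the model where every rule-matched module is replaced."""
    patched = {}
    for path, mod in modules.items():
        for pattern, replacement in RULES:
            if pattern.match(path):
                mod = replacement(path)
                break
        patched[path] = mod
    return patched

model = {
    "layers.0.self_attn.q_proj": Linear("layers.0.self_attn.q_proj"),
    "layers.0.mlp.experts.3.w1": Linear("layers.0.mlp.experts.3.w1"),
}
patched = inject(model)
print(patched["layers.0.mlp.experts.3.w1"].device)  # cpu
print(patched["layers.0.self_attn.q_proj"].device)  # cuda
```

The point of the design is that adding a new operator-level optimization only means adding a rule, not rewriting the model, which is why the same engine can be reconfigured for MoE offload, multi-GPU splits, or on-demand CPU offload.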
Overall, KT greatly reduces VRAM usage, allowing developers to deploy full-capacity models like DeepSeek on lower-VRAM hardware and no longer rely so heavily on high-VRAM graphics cards. When we ran KT, I found as a product manager that it still has several shortcomings. They will likely be resolved soon, but if you plan to build products on it, you should be aware of them.

1. No concurrency support

Currently KT can only serve a single user. If multiple users call it, they have to queue, much like the way ofo refunded deposits. Product managers will inevitably have to add a queuing mechanism. Note that for AI features, "concurrency" refers to the same time period: in non-AI products, a request and its response are often completed in milliseconds, while an AI task may involve web search or other capabilities (decompression software, office tools, compilers). Launching an AI reasoning task, as with Manus, may therefore take tens of minutes or even hours. So if agents are part of your plan, KT is not the best approach, because it does not support concurrency today. If agents are not a consideration, KT is still a good solution, especially for small teams and micro-enterprises.

2. Suitable scenarios: hospitals, stores, homes

Some time ago I saw Habitat, a sub-brand backed by Ideal, which proposed building private AI deployments for homes. Compared with the Xiao Ai and Tmall Genie we use today, the home AI Habitat provides is completely private and belongs entirely to the individual; Habitat only supplies system and hardware maintenance. KT is obviously suitable for such scenarios, not only because there are few concurrent users, but also because the construction cost is low.
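The queuing mechanism mentioned above can be sketched in a few lines of standard-library Python. This is an illustrative product-layer pattern, not KT code: a single worker thread serializes all requests to the one-user-at-a-time backend (`fake_kt_infer` below is a placeholder I made up), so callers wait in line instead of colliding.

```python
import queue
import threading

def fake_kt_infer(prompt):
    # Placeholder for the single-user KT inference call.
    return f"answer to: {prompt}"

jobs = queue.Queue()
results = {}

def worker():
    # One worker thread = one request in flight, matching KT's limitation.
    while True:
        job_id, prompt = jobs.get()
        if job_id is None:          # poison pill shuts the worker down
            break
        results[job_id] = fake_kt_infer(prompt)
        jobs.task_done()

t = threading.Thread(target=worker)
t.start()

# Multiple "users" submit at once; the queue serializes them.
for i, prompt in enumerate(["hello", "world"]):
    jobs.put((i, prompt))

jobs.put((None, None))
t.join()
print(results[0])  # answer to: hello
```

Since a single AI task can run for minutes, a real product would also need job IDs returned immediately and progress polling, but the serialization idea is the same.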
For example, in a hospital, although there may be thousands of employees, the number of medical staff using AI features and the medical system on their computers at the same moment is very small. Simultaneous use by hundreds of thousands of users is even less likely, because, as I said before, not every function requires AI, only some do, so high-concurrency use of any single AI module does not last long. KTransformers is a very good solution here: it avoids paying a high price while still serving a full-strength large model. After all, most medical staff are too busy to sit at a computer. Compared with decision-support servers costing millions in AI computing power, a setup costing tens of thousands that runs full-strength DeepSeek is far more attractive, and KT enables the latter.

3. The KT framework is also being upgraded

KT is about to receive a new update, committed to a concurrent mode that allows multiple people to use it; it is expected this week. Of course, according to reliable information, KT's concurrency will still slow down as more people use it, but compared with previous versions, KT will be usable by more than two people.

4. KT has driven up the price of the 4090 graphics card

After joining the community, I deeply felt how many technical people were paying attention to KT, from AR glasses to brain-computer interfaces.
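The hospital claim can be made concrete with a back-of-envelope concurrency estimate using Little's law (average concurrent users ≈ arrival rate × average task duration). The numbers below are illustrative assumptions of mine, not hospital data:

```python
# Little's law sketch with assumed, illustrative numbers:
# a 3,000-person hospital where 2% of staff trigger one AI task per hour,
# each task taking about 2 minutes of model time.
staff = 3000
share_using_ai_per_hour = 0.02   # assumption: 2% start an AI task each hour
task_duration_min = 2            # assumption: ~2 minutes per task

arrivals_per_min = staff * share_using_ai_per_hour / 60
avg_concurrent = arrivals_per_min * task_duration_min
print(round(avg_concurrent, 1))  # 2.0
```

Even under these generous assumptions, average concurrency is about two users, which is why a single low-cost KT deployment can plausibly serve a large organization whose AI use is sporadic.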
It was precisely for this reason that KT drove up the price of the 4090: most people can now spend tens of thousands to get full performance. When we were buying graphics cards, the supplier mentioned that the price of the 4090 changes every day, almost like the crypto-mining era. KT's approach gives many small and medium-sized enterprises the opportunity to deploy their own large models, which naturally leads to large numbers of 4090s being purchased instead of A100 or H100 cards.

5. Large companies currently struggle at the AI application layer

It is now very difficult for product managers from large companies, including those currently working in them, to make progress in AI. The main reason is that large technology companies provide AI resources to teams as cloud resources, that is, as APIs, with unified management similar to a data middle platform. Add the currently awkward input-output ratio (AI is almost always losing money), and it is hard to justify heavy investment to shareholders. Moreover, AI-related projects all require Python, which is awkward for older technology companies because they have very few people with that skill set. So you will find it very hard for product managers at large technology companies to pull together the resources they need, and traditional software companies face the same difficulty: if you want to call AI resources, they will push you toward APIs (because large companies have their own cloud servers).
If you really want to do the AI part yourself, you must have an AI computing server before you can even talk about training and algorithms. The development methods of technology companies, including the big players, therefore look out of date in the AI era, because both the talent structure and the required hardware resources have fundamentally changed.

Okay, that's all for today's sharing.