DeepSeek Local Deployment: Comparing the Real-World Results of the Distilled, Quantized, and Full Versions

Written by
Caleb Hayes
Updated on: July 16, 2025
Recommendation

Master how the three DeepSeek model variants compare when deployed locally.

Core content:
1. The background and motivation for deploying DeepSeek models locally
2. The hardware requirements and deployment process for the full, quantized, and distilled versions
3. A hands-on evaluation and comparison of the three versions in practice

Yang Fangxian, founder of 53AI, Tencent Cloud Most Valuable Expert (TVP)

Introduction


With DeepSeek attracting worldwide attention, its official web service is frequently unavailable. Besides calling APIs provided by third parties, local deployment has become a practical alternative.

This article covers local deployment of three DeepSeek variants: the full version, the 1.58-bit quantized version, and the distilled version (Qwen 1.5B). The same code-generation problem is used to evaluate each of them, and practical deployment suggestions are given at the end.


Hardware Requirements


Model | Parameters | Minimum hardware requirements
Full version | 671B | 16× H100 or 8× H200
1.58-bit quantized version | 671B (≈131 GB after 1.58-bit quantization) | 4× H100, 2× H200, or a CPU server
Distilled version | 1.5B | MacBook Air (16 GB) or RTX 3060


The above are only recommended configurations. As the community and hardware manufacturers continue to optimize, more cost-effective deployment options may appear in the future.
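
As a rough sanity check on these figures (our back-of-the-envelope estimate, not taken from the deployment guides): 671B parameters stored in FP8 occupy about 671 GB of weights alone, so 8× H200 (141 GB each, ≈1128 GB total) or 16× H100 (80 GB each, ≈1280 GB total) leave headroom for the KV cache and activations; at roughly 1.58 bits per weight the same parameters shrink to about 671e9 × 1.58 / 8 ≈ 132 GB, which is why 4× H100 (≈320 GB) or a large-memory CPU server becomes workable; and the 1.5B distilled model in FP16 needs only about 3 GB plus cache, comfortably within a 16 GB MacBook Air or an RTX 3060.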


Deployment Process


Model | Inference framework | Deployment steps
Full version | SGLang | https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3
1.58-bit quantized version | llama.cpp | https://docs.openwebui.com/tutorials/integrations/deepseekr1-dynamic/
Distilled version | llama.cpp (ollama) | ollama run deepseek-r1:1.5b


The deployment processes are fairly involved, so this article will not go through them in detail. If you have any questions, you can add the author on WeChat: Mistake113.


Effect Measurement


We ask DeepSeek to write CUDA code implementing the Softmax operator, which is essential in the Transformer architecture.
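
For reference, the numerically stable form of the operator that all three models end up discussing is softmax(x_i) = exp(x_i - max(x)) / sum_j exp(x_j - max(x)): subtracting the row maximum before exponentiating keeps the largest exponent at exp(0) = 1 and avoids overflow.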


Let’s look at the distilled version first:

<Thinking process>

Well, the user asked me to write a CUDA code to implement the softmax function. First, I have to recall what the Softmax function is. It is usually used in multi-classification problems to ensure that the sum of the probability distribution is 1 and each element is greater than 0.

Then, consider the characteristics of CUDA. CUDA is a GPU-based acceleration technology suitable for parallel computing. Therefore, my code must be designed to efficiently utilize the parallelism of the GPU.

Next, I need to analyze the specific steps of Softmax. Generally speaking, Softmax will first subtract the maximum value from each input vector (normalization), then exponentialize, and finally divide by the sum of these exponentials. This step is to avoid numerical overflow and ensure that the result is a probability distribution.

In terms of parallel processing, my code may need to split the input data into different blocks and then calculate their normalization factors separately. The elements in each block can use shared and shared memory to communicate efficiently, reducing the number of video memory accesses.

At the same time, I also need to consider the format of the data. Software is usually stored in the form of a matrix, so each CudaSharedMemoryBlock may need a float matrix to store the normalization factor. Then calculate the sum of the normalization factors in all blocks and finally normalize the result.

Additionally, there may be some edge cases that need to be watched out for, such as when all elements are identical, or very close values ​​lead to numerical instability, but in CUDA you can use fused variadic parameters to address these issues.

Finally, I need to organize these steps into an efficient parallel block of code that is logically clear, well-structured, and able to handle a variety of input situations. 

<End of Thought>

    #include <math.h>
    #include <stdio.h>
    #include <string>

    // Shared memory Block class
    class CudaSharedMemoryBlock {
    private:
        std::vector<std::vector<float>> shared;
    public:
        CudaSharedMemoryBlock() : _shared{} {}

        void push(std::vector<std::vector<float>>& out) {
            for (size_t i = 0; i < out.size(); ++i) {
                for (size_t j = 0; j < out[i].size(); ++j) {
                    out[i][j] = _shared shared[i][j];
                }
            }
        }

        size_t get_size() const { return _shared.size(); }
    };

    // Softmax function implementation
    void softmax(float *input, float *output, size_t block_size, int num_blocks) {
        CudaSharedMemoryBlock smb_block;

        // Allocate shared memory for all blocks
        for (size_t i = 0; i < num_blocks; ++i) {
            const float* out = new float[num_blocks];
            if (!sobsm_block.get_size() == output.size()) { *out = 0.0f; }
            // Store normalization factors in shared memory
            for (size_t j = 0; j < num_blocks; ++j) { out[j] = smb_block.push(sobsm_block.get_size()); }

        // Get the index of the value in shared memory
        for (size_t i = 0; i < num_blocks; ++i) {
            const float* sum = new float[num_blocks];
            if (!sobsm_block.get_size() == sum.size()) { *sum = 0.0f; }
            // Calculate the normalization factor
            for (size_t j = 0; j < num_blocks; ++j) { sum[j] = expf(sobsm_block.push(sobsm_block.get_size(), smb_block.get_size() * i)); }

        // Calculate the sum of normalization factors
        float total_sum = 0.0f;
        for (size_t j = 0; j < num_blocks; ++j) { total_sum += sum[j]; }

        // Normalize the normalization factor
        if (total_sum == 0.0f) return;

        for (size_t i = 0; i < num_blocks; ++i) {
            const float* normalized = new float[num_blocks];
            if (!sobsm_block.get_size() == normalized.size()) { *normalized = 0.0f; }
            for (size_t j = 0; j < num_blocks; ++j) {
                normalized[j] = smb_block.push(sobsm_block.get_size(), total_sum, i);
                normalized[j] /= total_sum;
            }

        // Release the result
        for (size_t i = 0; i < num_blocks; ++i) {
            delete[] out[i];
            delete[] sum[i];
            delete[] normalized[i];
        }
    }

The distilled version has the shortest thinking process and the simplest plan. The problem is that it only produces host-side C++ code and never writes a CUDA kernel. The generated code also fails to compile: it references the undeclared identifier sobsm_block, uses std::vector without including <vector>, and its braces do not balance.

Rather than repair it by hand, the author gave up on this version.

Next, let's look at the full version (671B). It took about 2 minutes to solve this coding problem:

    <Thinking process>
    Well, the user asked me to write a CUDA code to implement the softmax function. First, I need to confirm the specific implementation of softmax and how to implement it efficiently in CUDA. Softmax is usually used to convert the values ​​in a vector into a probability distribution, where the value of each element is between 0 and 1, and the sum of all elements is 1. Then, the calculation steps of softmax are to find the exponential of each element and then divide it by the sum of the exponentials of all elements. The formula is softmax(x_i) = exp(x_i) / sum(exp(x)). If implemented in CUDA, parallelization and numerical stability issues need to be considered. In terms of numerical stability, the maximum value is usually subtracted from the input data to prevent numerical overflow during exponential operations. So you need to find the maximum value in each data block first, then subtract this maximum value from each element, and then perform exponential operations. After this processing, the calculation will be more stable. Next, the question is how to efficiently calculate each element in CUDA. Because softmax usually operates on a vector, such as in the last layer of a neural network, assuming that the input is a two-dimensional tensor with a shape of (batch_size, feature_size), we need to perform softmax processing on the features in each batch. So in this case, each thread block processes a batch of samples, or shares memory to speed up the calculation.

    But in actual situations, you may need to process larger data, so you need to consider how to divide it into blocks. For example, for each input vector, the calculation is divided into two steps: the first step is to calculate the maximum value, the second step is to calculate the exponential sum, and the third step is to divide each element by the sum.

    In CUDA, reduction operations are often used to calculate maximums and sums. So you might need to write two kernel functions: one to calculate the maximum and the other to calculate the sum. Alternatively, you can do the two steps in the same kernel function. In addition, you can use existing CUDA libraries, such as BlockReduce in the CUB library, to simplify reduction operations.

    However, users may want to implement it manually, so they need to consider how to process in blocks. For example, for a vector, each thread block processes a vector, and the threads in the block collaborate to calculate the maximum value and sum.

    Now, the specific implementation steps may be as follows:

    1. Determine the dimension of the input, assuming that the input is a one-dimensional array, or a batch processing of a two-dimensional array. For example, the input is a matrix of shape (n, m), and softmax processing is performed on each row. Here it is assumed that each thread block processes a row and each thread processes an element in the row.

    2. The kernel function takes as parameters the input array, the output array, and the number of elements in each row. For example, the input array is d_x, the output array is d_y, and each row has cols elements.

    3. In the kernel function, first determine the row number that the current thread block is processing. For example, blockIdx.x corresponds to the row number, blockDim.x is the number of threads in the thread block, and the thread index is threadIdx.x.

    4. For each row, first find the maximum value. This requires all threads to cooperate in reduction. Each thread first reads the element it is responsible for (if it exists), and then performs the maximum value reduction within the block. For example, use shared memory to store intermediate results and reduce layer by layer to get the maximum value.

    5. After getting the maximum value, each thread calculates the exponent of its element and accumulates these exponent values. At this time, reduction and summation are also required. These two reductions (maximum value and summation) can be performed in two stages.

    6. Finally, each thread calculates the exponential value again and divides it by the sum to get the final softmax result.

    However, it should be noted that each step requires thread synchronization to ensure that all threads complete the calculation of the current stage, especially when sharing memory operations.

    Possible numerical problems: When dealing with very large numbers, exponential operations may cause overflow. So it is necessary to subtract the maximum value. For example, calculate x_i - max_x for each element and then take the exponent. This ensures that the largest exponent is exp(0) = 1, and the others are smaller values ​​without overflow.

    So the specific steps are:

    For each row vector:

    1. Find the maximum value max_val in the row.

    2. Calculate the exponential of each element and add them up to get sum_exp.

    3. Each element exponent is divided by sum_exp.

    Now, how to implement these steps in CUDA?

    First, for each row, a thread block is started. Assuming that each row has cols elements, the number of threads in the thread block is set to cols, or in a more practical case, when cols is larger, such as 1024, but the maximum number of threads in a CUDA thread block is 1024, so for cases exceeding this number, multiple threads need to be used for loop processing.

    Assume that the number of elements in each row is cols, and each thread block processes one row, and each thread processes multiple elements. Alternatively, if cols is larger, you can set the thread block size to 256 or 512, and let each thread process multiple elements, processing them through a loop. For example, each thread processes cols / blockDim.x elements.

    However, for simplicity, let's assume that the number of elements in each row does not exceed 1024, so that each thread block can process one row and each thread processes one element. This will be easier to implement.

    For example, a kernel function might have the following structure:

    global void softmax_kernel(float *d_x, float *d_y, int cols) {

    extern __shared__ float shared_mem[]; // shared memory for reduction int row = blockIdx.x; int tid = threadIdx.x; // Calculate the starting pointer of the row float *x_row = d_x + row * cols; float *y_row = d_y + row * cols; // Step 1: Find the maximum value float max_val = -INFINITY; for (int i = tid; i < cols; i += blockDim.x) { float val = x_row[i]; if (val > max_val) max_val = val; } //Put the max_val of each thread into shared memory and perform intra-block reduction shared_mem[tid] = max_val; __syncthreads(); // Reduce to find the maximum value, such as using tree reduction for (int s = blockDim.x / 2; s > 0; s >>= 1) { if (tid < s) { if (shared_mem[tid + s] > shared_mem[tid]) { shared_mem[tid] = shared_mem[tid + s]; } } __syncthreads(); } max_val = shared_mem[0]; __syncthreads(); // Step 2: Calculate exponential sum float exp_sum = 0.0f; for (int i = tid; i < cols; i += blockDim.x) { exp_sum += expf(x_row[i] - max_val); } // Accumulate the exp_sum of each thread to the shared memory shared_mem[tid] = exp_sum; __syncthreads(); // Reduce and sum for (int s = blockDim.x / 2; s > 0; s >>= 1) { if (tid < s) { shared_mem[tid] += shared_mem[tid + s]; }

    __syncthreads(); } exp_sum = shared_mem[0]; __syncthreads(); // Step 3: Calculate the value of each element and write the result if (exp_sum == 0) { // Avoid division by zero, you may need to handle exceptions, here assuming exp_sum is not zero } else { for (int i = tid; i < cols; i += blockDim.x) { y_row[i] = expf(x_row[i] - max_val) / exp_sum; } }

    }

    }

    There are a few issues to note here:

    1. The size of the shared memory needs to be sufficient. Here, it is assumed that the number of threads in each thread block is equal to the size of the shared memory. Therefore, when starting the kernel function, the size of the shared memory is dynamically allocated, for example, each thread block requires blockDim.x * sizeof(float) of shared memory. When calling the kernel function, the third parameter is the required shared memory size.

    For example, when launching this kernel:

    int threads_per_block = 256;

    int shared_mem_size = threads_per_block * sizeof(float);

    softmax_kernel<<<num_rows, threads_per_block, shared_mem_size>>>(d_x, d_y, cols);

    1. The number of elements per row cols may be greater than the number of threads in a thread block. For example, if cols is 1000 and threads_per_block is 256, then each thread needs to process multiple elements, using a loop such as i += blockDim.x.

    2. The first phase reduction is correct, but you need to make sure that all threads participate in the reduction. For example, the tree reduction may need to be adjusted when blockDim.x is not a power of 2. Or use a more stable reduction method, such as BlockReduce using CUB.

    However, for simplicity, it is assumed that threads_per_block is a power of 2, such as 256, 512, or 1024. This may require the caller to adjust the thread block configuration according to the size of cols.

    In addition, the number of thread blocks that process each row should be num_rows, that is, each thread block processes one row.

    For example, assuming the input is a two-dimensional array with shape (rows, cols), the kernel function launch configuration is:

    dim3 grid(rows);

    dim3 block(threads_per_block);

    This way, each thread block processes a row, and each thread processes one or more elements in that row.

    In addition, when calculating the exponential sum, you may need to pay attention to the precision issue and use expf instead of exp because the input data is of float type.

    Also, regarding numerical stability, max_val is already subtracted in the code, so that's fine.

    Now, consider the correctness of this kernel. For example, will the reduction be wrong if each thread processes multiple elements per row?

    For example, in the maximum value calculation phase, each thread's loop covers all elements of the row, calculates the maximum value among the elements it has processed, and then reduces these maximum values ​​within the block. This will get the maximum value of the row.

    After the reduction, all threads get the same max_val, which is used for subsequent exponential operations.

    Similarly, for the calculation of the sum of exponentials, each thread calculates the sum of the exponentials of the elements it processes and then reduces to the sum, so that the sum of the entire row is calculated correctly.

    Finally, each thread processes the elements it is responsible for again, calculates exp(x_i - max_val) / sum_exp, and obtains the final softmax value.

    This implementation is correct, but there may be room for optimization in terms of efficiency, such as using a more efficient reduction method or avoiding bank conflicts by utilizing shared memory.

    In addition, when cols is large, for example, it exceeds the maximum number of threads per thread block, then the threads in the thread block may need to process multiple elements, which may require multiple loop steps. For example, each thread processes cols / blockDim.x elements, rounded up.

    For example:

    int tid = threadIdx.x;

    int num_elements_per_thread = (cols + blockDim.x - 1) / blockDim.x;

    for (int idx = 0; idx < num_elements_per_thread; idx++) {

    int i = tid + idx * blockDim.x; if (i < cols) { // Process the i-th element }

    }

    This may be more efficient, especially when cols is much larger than blockDim.x.

    Therefore, it may be necessary to adjust the loop part of the kernel function accordingly, for example, dynamically calculate the number of elements processed by each thread.

    For example, in a maximum value calculation loop:

    for (int idx = 0; idx < num_elements_per_thread; idx++) {

    int i = tid + idx * blockDim.x; if (i < cols) { float val = x_row[i]; if (val > max_val) max_val = val; }

    }

    The same applies to the calculation of the exponential sum and the final writing step.

    This way, no matter how big or small cols is, it can be processed. So we need to calculate num_elements_per_thread.

    In summary, the complete CUDA code may be as follows:

    First include the necessary header files:

    #include <cuda_runtime.h> #include <math.h>

    Then define the kernel function:

    global void softmax_kernel(float *d_x, float *d_y, int cols) {

    extern __shared__ float shared_mem[]; float* max_shared = shared_mem; float* sum_shared = shared_mem + blockDim.x; int row = blockIdx.x; int tid = threadIdx.x; float *x_row = d_x + row * cols; float *y_row = d_y + row * cols; // Calculate the maximum value for each row float max_val = -INFINITY; int num_elements_per_thread = (cols + blockDim.x - 1) / blockDim.x; for (int idx = 0; idx < num_elements_per_thread; ++idx) { int i = tid + idx * blockDim.x; if (i < cols) { float val = x_row[i]; if (val > max_val) max_val = val; } } // Store the local maximum of the thread into shared memory for reduction max_shared[tid] = max_val; __syncthreads(); for (int s = blockDim.x / 2; s > 0; s >>= 1) { if (tid < s) { if (max_shared[tid + s] > max_shared[tid]) { max_shared[tid] = max_shared[tid + s]; } } __syncthreads(); } max_val = max_shared[0]; __syncthreads(); // Calculate exponential sum float exp_sum = 0.0f; for (int idx = 0; idx < num_elements_per_thread; ++idx) { int i = tid + idx * blockDim.x; if (i < cols) { exp_sum += expf(x_row[i] - max_val); } } // Reduce and sum sum_shared[tid] = exp_sum; __syncthreads(); for (int s = blockDim.x / 2; s > 0; s >>= 1) { if (tid < s) { sum_shared[tid] += sum_shared[tid + s]; } __syncthreads(); } exp_sum = sum_shared[0]; // Calculate softmax and write the result if (exp_sum != 0.0f) { for (int idx = 0; idx < num_elements_per_thread; ++idx) { int i = tid + idx * blockDim.x; if (i < cols) { y_row[i] = expf(x_row[i] - max_val) / exp_sum; } } }

    }

    Here we need to pay attention to the use of shared memory, because the two reduction operations of maximum value and summation are performed simultaneously in the same kernel function, so the shared memory needs to have enough space. The original code uses a shared memory array, but the reductions of maximum value and summation need to be stored separately. Therefore, the shared memory should be divided into two parts, one for storing the intermediate results of maximum value reduction and the other for storing the intermediate results of summation. For example, the size of shared memory is 2 * blockDim.x * sizeof(float). So when the kernel function starts, you need to allocate enough shared memory:

    int threads_per_block = 256; size_t shared_mem_size = 2 * threads_per_block * sizeof(float); softmax_kernel<<<num_rows, threads_per_block, shared_mem_size>>>(d_x, d_y, cols);

    In this way, max_shared points to the first blockDim.x element of the shared memory, and sum_shared points to the last blockDim.x element. This can avoid data overwriting problems during the reduction process.

    But in the previous code, I may have a problem with the use of shared memory, because when processing the maximum value and sum at the same time, if the shared memory is reused and there is not enough space, it will cause data overwriting. Therefore, here we must ensure that the reduction of the maximum value and summation uses different shared memory areas.

    That is why we need to split the shared memory into two parts: max_shared and sum_shared, each of which is blockDim.x elements in size, so that they do not interfere with each other during reduction.

    In addition, during the reduction process, for example, when calculating the maximum value, each thread stores its own maximum value in max_shared[tid], and then obtains the global maximum value through tree reduction. Similarly, when calculating the sum, each thread stores the local exp_sum in sum_shared[tid] and then reduces.

    After this modification, there will be no conflict in the use of shared memory.

    Finally, when writing the results, each thread again traverses all the elements it is responsible for, calculates exp(x_i - max_val)/exp_sum, and writes them to the output array.

    In addition, we need to consider the case when exp_sum is 0. At this time, all elements may be negative infinity, resulting in the exponential sum being 0. This situation may require special processing, but it may be rare in actual applications, or it can be avoided by adjusting the value. Here we assume that it will not happen, so we directly determine whether exp_sum is 0. If it is 0, we may need to do some processing, such as setting the result to 0 or other default values. However, this situation is not handled in the example, and the user may need to add it himself.

    In summary, this kernel function should correctly implement the softmax function. However, there is still room for optimization, such as using a faster reduction algorithm, or taking advantage of shared memory bank conflict avoidance. In addition, when processing very large rows, it may be necessary to split the data into multiple blocks for processing, or use dynamic parallelism. But for most cases, especially when the row length is moderate, this implementation should be feasible.

    When calling this kernel function, you need to ensure that each thread block processes one row and the number of thread blocks is equal to the number of rows of the input data. For example, if the input is a matrix of shape (rows, cols), then rows thread blocks are started, each with threads_per_block threads, and enough shared memory is allocated.

    In addition, it may be necessary to check whether the input pointer is valid before calling the kernel function, as well as CUDA error handling, etc., but users may only need the code of the kernel function itself.

    Finally, when testing this kernel function, you should use some sample data, such as the vector [1.0, 2.0, 3.0], to calculate whether the result after softmax is correct. For example, the correct result should be [e^1/(e^1+e^2+e^3), e^2/(...), e^3/(...)]. Check whether the numerical calculation process is correct, especially whether the result after subtracting the maximum value is correct. For example, in the example above, the maximum value is 3, so each element is subtracted by 3 to get [-2, -1, 0]. The sum after exponentiation is e^-2 + e^-1 + e^0 ≈ 0.1353 + 0.3679 +1 = 1.5032, and then divide each by the sum to get the corresponding probability.

    <End of Thought>

    #include <cuda_runtime.h>
    #include <math.h>

    __global__ void softmax_kernel(float *input, float *output, int num_rows, int num_cols) {
        extern __shared__ float shared_mem[];
        float *max_shared = shared_mem;
        float *sum_shared = shared_mem + blockDim.x;

        int row = blockIdx.x;
        int tid = threadIdx.x;

        if (row >= num_rows) return;

        float *row_input = input + row * num_cols;
        float *row_output = output + row * num_cols;

        // Step 1: Calculate the maximum value of each row
        float max_val = -INFINITY;
        for (int i = tid; i < num_cols; i += blockDim.x) {
            max_val = fmaxf(max_val, row_input[i]);
        }

        // Shared memory reduction to find the maximum value
        max_shared[tid] = max_val;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) {
                max_shared[tid] = fmaxf(max_shared[tid], max_shared[tid + s]);
            }
            __syncthreads();
        }
        max_val = max_shared[0];
        __syncthreads();

        // Step 2: Calculate the exponential sum
        float exp_sum = 0.0f;
        for (int i = tid; i < num_cols; i += blockDim.x) {
            exp_sum += expf(row_input[i] - max_val);
        }

        // Shared memory reduction to find the exponential sum
        sum_shared[tid] = exp_sum;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) {
                sum_shared[tid] += sum_shared[tid + s];
            }
            __syncthreads();
        }
        exp_sum = sum_shared[0];
        __syncthreads();

        // Step 3: Calculate and write Softmax results
        if (exp_sum != 0.0f) {
            for (int i = tid; i < num_cols; i += blockDim.x) {
                row_output[i] = expf(row_input[i] - max_val) / exp_sum;
            }
        }
    }

    void launch_softmax(float *input, float *output, int num_rows, int num_cols) {
        dim3 grid(num_rows);
        int block_size = 256;  // The thread block size can be adjusted according to actual conditions
        dim3 block(block_size);
        size_t shared_mem_size = 2 * block_size * sizeof(float);  // Allocate enough shared memory for reduction
        softmax_kernel<<<grid, block, shared_mem_size>>>(input, output, num_rows, num_cols);
    }
From the thinking process, the full version of DeepSeek behaves like a senior CUDA developer: it uses shared memory to reduce global memory accesses, and it also anticipates the overflow and divide-by-zero pitfalls in the softmax computation, which suggests real AI-engineering experience.

Let's write a simple correctness check: first run the calculation with PyTorch, then with the CUDA code, and compare the results.

PyTorch code (testbench.py):
    import torch
    a = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
    b = torch.softmax(a, 0)
    print(b)
Calculation results:

    # python testbench.py
    tensor([5.7661e-04, 1.5674e-03, 4.2606e-03, 1.1582e-02, 3.1482e-02, 8.5577e-02,
            2.3262e-01, 6.3233e-01])
Then write a corresponding C++ test program that calls the CUDA code above:
    #include <stdio.h>
    #include <cuda_runtime.h>

    // launch_softmax is the wrapper defined in the CUDA code above;
    // this test file is compiled together with it.
    void launch_softmax(float *input, float *output, int num_rows, int num_cols);

    int main(void)
    {
        float a[8] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0};
        float b[8] = {0.0f};
        float *d_a, *d_b;
        cudaMalloc(&d_a, 8 * sizeof(float));
        cudaMalloc(&d_b, 8 * sizeof(float));
        cudaMemcpy(d_a, a, 8 * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemset(d_b, 0, 8 * sizeof(float));
        launch_softmax(d_a, d_b, 1, 8);
        cudaMemcpy(b, d_b, 8 * sizeof(float), cudaMemcpyDeviceToHost);

        for (int i = 0; i < 8; i++) {
            printf("%e\n", b[i]);
        }
        cudaFree(d_a);
        cudaFree(d_b);
        return 0;
    }
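
For reference, assuming the generated kernel code and this test program are saved together in a single testbench.cu file (the file name is our choice), the pair can be built with NVIDIA's nvcc compiler, for example nvcc testbench.cu -o cuda_testbench, and then run directly.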

Calculation results:

    # ./cuda_testbench
    5.766128e-04
    1.567396e-03
    4.260624e-03
    1.158158e-02
    3.148199e-02
    8.557692e-02
    2.326222e-01
    6.323327e-01

The CUDA program's results match the PyTorch results, so correctness is not a problem!
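
Eyeballing eight printed values works here, but a programmatic comparison is less error-prone and scales to larger inputs. Below is a minimal sketch of such a check (our addition, not part of the original test): approx_equal and the hard-coded expected array are hypothetical names, the expected values are simply the PyTorch output above, and a relative tolerance of 1e-3 covers the four to five significant digits PyTorch prints.

    #include <math.h>
    #include <stdio.h>

    // Compare the CUDA output against the PyTorch reference, element by element.
    static int approx_equal(const float *got, const float *expected, int n, float rel_tol) {
        for (int i = 0; i < n; i++) {
            float diff = fabsf(got[i] - expected[i]);
            if (diff > rel_tol * fabsf(expected[i])) {
                printf("mismatch at index %d: got %e, expected %e\n", i, got[i], expected[i]);
                return 0;
            }
        }
        return 1;
    }

    // Usage inside the testbench above, after b has been copied back from the device:
    //   const float expected[8] = {5.7661e-04f, 1.5674e-03f, 4.2606e-03f, 1.1582e-02f,
    //                              3.1482e-02f, 8.5577e-02f, 2.3262e-01f, 6.3233e-01f};
    //   printf(approx_equal(b, expected, 8, 1e-3f) ? "PASS\n" : "FAIL\n");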


Next, let's look at the 1.58-bit quantized version. It spent 4 minutes thinking about this problem!

              <Thinking process>

              OK, the user asked me to write a CUDA code to implement the Softmax function. I have to think about the formula of Softmax first, which should be to perform an exponential operation on each element and then divide it by the sum of the exponentials of all elements. In CUDA, parallel processing is usually used, especially for the independent calculation of each element. However, Softmax involves summation, which may require synchronization or atomic operations, but this may not be efficient.

              First, I should consider how to split the input data. Assuming the input is a one-dimensional array, each thread can process one element. But the summation part may require the cooperation of all threads, or be divided into multiple steps. For example, first calculate the exponent of each element, then calculate the sum, and finally divide each element by the sum.

              However, if the array is large, it may be difficult to sum it directly because each thread needs to know the sum. In this case, you may need to use a parallel reduction method to sum it. For example, let each thread calculate its own exponent value first, then calculate the sum through multiple levels of reduction, and then each thread divides this sum by its own exponent value.

              In addition, memory management is also very important. The input array and the output array need to allocate memory on the GPU. Also, the exponential value may need to be stored in the middle, so space for intermediate storage may be required.

              The specific steps may be:

              1. Each thread calculates its own exponent value and stores it in a temporary array.

              2. Calculate the sum of all index values.

              3. Each thread divides its own index value by the sum to get the Softmax result.

              The structure of the code would be: first write a kernel function, divided into two stages. The first kernel calculates the exponent and stores it in a temporary array, then calls another kernel or uses different steps of the same kernel to calculate the sum, and then calculates each element divided by the sum.

              Or is it possible to do it in one kernel? It may require two different kernel calls. Because CUDA has limited synchronization mechanisms, especially synchronization across blocks, the summation may require a separate kernel or use CUDA library functions to do the summation, such as using Thrust or cuBLAS, but users may want pure CUDA code.

              Or consider using a parallel reduction approach, where each block first calculates its own partial sum, then adds these partial sums together in a second kernel to get the global sum, and then divides each element by the sum. However, this may require multiple kernel calls and additional memory storage.

              The specific steps may be as follows:

              1. Allocate input and output memory, as well as arrays for intermediate storage.

              2. First kernel: Each thread calculates the exponent of the corresponding element and stores it in an intermediate array.

              3. Call a reduction kernel that computes the sum of the intermediate arrays.

              4. Second kernel: Each thread divides the value of the intermediate array by the sum and writes to the output array.

              But such a process may require two kernel calls and an intermediate array, and synchronization issues need to be handled. Or is it possible to complete it in one kernel? Unlikely, because when calculating the sum, all threads need to contribute, which requires synchronization. In CUDA, synchronization within a block can be done with __syncthreads(), but global synchronization cannot be done.

              So a more practical approach is to split the calculation into three steps:

              1. Calculate the exponent of each element and save to a temporary array.

              2. Compute the sum of a temporary array.

              3. Divide each element by the sum.

              Step 2 may require multiple kernel calls, such as using parallel reduction methods.

              Now how to write the specific code?

              For example, the first kernel computes the exponential:

              global void compute_exp(float *input, float *temp, int size) { int idx = blockIdx.x * blockDim.x + threadIdx.x; if (idx < size) { temp[idx] = exp(input[idx]); } }

              The second kernel then computes the sum and may need a reduction function:

              global void sum_reduce(float *temp, float *sum, int size) { // Using the reduction algorithm, each block calculates a part of the sum, and then adds the atomic operation to the global sum // But the specific implementation may require multi-level processing, or use existing reduction methods}

              However, the reduction may be more complicated, so you can consider using CUDA library functions, such as Thrust's sum. However, if the user wants pure CUDA code, you may need to implement it yourself.

              Alternatively, it might be more efficient to have each thread compute its own exponent in the same kernel, then use shared memory to do the reduction and sum, but this can only be done at the block level, and if the array size exceeds one block, multiple stages of processing may be required.

              Alternatively, if the user input array is not particularly large, you could assume that each block processes a portion of it, then calculates the sum of each block in chunks, and then sums it up at the end. This may require multiple kernel steps.

              Another approach is to use dynamic parallelism, but there may be compatibility issues.

              Alternatively, it might be simpler to copy the temporary array back to the host after calculating the exponent, have the host calculate the sum, and then pass it back to the GPU for the division. This is not very efficient, especially for a GPU, but it might be acceptable as example code.

              However, users may want the entire operation to be completed on the GPU, so pure GPU processing is required.

              The possible steps are:

              • Allocate inputs, outputs, temporary arrays, and a variable to store the sum.

              • The first kernel computes the exponential of each element and stores it in a temporary array. : Compute exp(input[i]). . Then, another kernel is called to sum the temporary array to get the total.

              • Finally, the third kernel is called, and each thread uses temp[i] / sum to get the output.

              Now, how do we implement the reduction sum? You can refer to the standard CUDA reduction examples, such as the official reduction examples from NVIDIA.

              For example, in a reduction kernel, partial sums within a block are computed using shared memory, then written to an array of partial sums in global memory, and finally these partial sums are added together to get the total in another kernel.

              This will increase the complexity of the code, but it is necessary for large arrays.

              If the array is small, i.e. it can fit in one block, then it can be processed in one block and the sum can be calculated in shared memory. For example:

              Assuming the block size is large enough, like processing a smaller array, say 1024 elements, and one block has 1024 threads, then you can use shared memory for the reduction.

              For example, within the same kernel that computes exp, this may not work, because each thread computes its own exp and then reduces the sum.

              Alternatively, separate the computation of exp and the reduction.

              For example, first the compute_exp kernel is called, and then a reduction kernel is called to compute the sum.

              The reduction kernel might look like:

              global void sum_kernel(float *temp, float *sum, int size) { shared float shared_data[1024]; int idx = blockIdx.x * blockDim.x + threadIdx.x; int tid = threadIdx.x;

              // Each thread loads data into shared memory if (idx < size) { shared_data[tid] = temp[idx]; } else { shared_data[tid] = 0; } __syncthreads(); // Perform reduction, for example, for 1024 threads, add half each time for (int s = blockDim.x / 2; s > 0; s >>= 1) { if (tid < s) { shared_data[tid] += shared_data[tid + s]; } __syncthreads(); } // The first thread writes the sum of the chunks to global memory if (tid == 0) { atomicAdd(sum, shared_data[0]); }

              }

              But this method can only process one part in each block, and then add the sum of each block to the global sum through atomic operations. If the array is large, multiple blocks will be started, each block calculates its own partial sum, and then adds it atomically.

              Then, when all blocks are executed, the sum variable stores the sum of the entire array. Then, in the kernel of step 3, each thread uses temp[idx] / sum.

              But there are a few things to note here:

              • Atomic operations may cause performance issues, but modern GPUs are efficient with atomic operations. The "sum" variable should be a single variable, allocated with cudaMalloc and initialized to 0 before the kernel is launched.

              For example, before calling sum_kernel, initialize sum to 0:

              float *d_sum; cudaMalloc(&d_sum, sizeof(float)); cudaMemset(d_sum, 0, sizeof(float));

              Then in sum_kernel, each block atomically adds its partial sum to d_sum.

              Finally, d_sum is copied back to the host, or used directly in the third kernel.

              But the third kernel needs to access the sum value. Alternatively, the sum can be stored in device memory and the third kernel can read it directly.

              However, when the third kernel is started, sum has been processed by all blocks. Because CUDA kernel calls are executed sequentially, the third kernel is called after sum_kernel is called:

              global void divide_kernel(float *temp, float *sum, float *output, int size) { int idx = blockIdx.x * blockDim.x + threadIdx.x; if (idx < size) { output[idx] = temp[idx] / *sum; } }

              In this way, each thread reads the global sum value in divide_kernel and divides it by each temp element.

              However, there may be a problem: when the atomic operation in sum_kernel is completed, all blocks are processed and the value of sum is the correct sum.

              However, you may need to synchronize or ensure that sum_kernel has completed before calling divide_kernel.

              This can be achieved with CUDA streams, but by default kernels are executed sequentially, so as long as sum_kernel and divide_kernel are in the same stream and executed sequentially, the value of sum will be updated correctly.

              So, the overall steps are:

              1. Allocate GPU memory for input, output, temp, and sum.

              2. Copy input data to the GPU.

              3. Call the compute_exp kernel to calculate the temp array.

              4. Call sum_kernel to calculate the sum.

              5. Call divide_kernel to calculate the output array.

              6. Copies the output array back to the host.

              However, sum_kernel needs to process a part of each block, so the division of block and grid needs to be based on the size of the temp array.

              For example, suppose the temp array size is N, and each block processes a maximum of 1024 elements, so the size of the grid is (N + 1023) / 1024, and each block has 1024 threads.

              Then, in sum_kernel, each block processes its own part and adds the partial sum to the sum variable using atomic operations.

              Possible problem: Correctness of atomic operations. When multiple blocks perform atomic addition at the same time, is the result correct? Yes, because atomicAdd is performed sequentially, so the final sum will be the sum of all parts.

              Now, let’s translate these steps into code:

              First, initialize:

              float *d_input, *d_temp, *d_output, *d_sum; cudaMalloc(&d_input, size * sizeof(float)); cudaMalloc(&d_temp, size * sizeof(float)); cudaMalloc(&d_output, size * sizeof(float)); cudaMalloc(&d_sum, sizeof(float));

              Then, the input data is copied from the host to d_input.

              Then, the compute_exp kernel is called:

              int blockSize = 256; int gridSize = (size + blockSize - 1) / blockSize; compute_exp<<<gridSize, blockSize>>>(d_input, d_temp, size);

              Next, call sum_kernel to sum:

              cudaMemset(d_sum, 0, sizeof(float)); int sumBlockSize = 256; int sumGridSize = (size + sumBlockSize - 1) / sumBlockSize; sum_kernel<<<sumGridSize, sumBlockSize>>>(d_temp, d_sum, size);

              Here, the blockSize of sum_kernel may need to be a power of 2, such as 256, but the size of the shared memory needs to correspond. For example, in sum_kernel, the size of the shared memory is 256, so it needs to be declared as:

              global void sum_kernel(float *temp, float *sum, int size) { shared float shared_data[256]; int idx = blockIdx.x * blockDim.x + threadIdx.x; int tid = threadIdx.x;

              if (idx < size) { shared_data[tid] = temp[idx]; } else { shared_data[tid] = 0; } __syncthreads(); for (int s = blockDim.x / 2; s > 0; s >>= 1) { if (tid < s) { shared_data[tid] += shared_data[tid + s]; } __syncthreads(); } if (tid == 0) { atomicAdd(sum, shared_data[0]); }

              }

              Note that each block processes 256 elements and then adds the partial sums to sum.

              Finally, divide_kernel is called:

              divide_kernel<<<gridSize, blockSize>>>(d_temp, d_sum, d_output, size);

              Implementation of divide_kernel:

              global void divide_kernel(float *temp, float *sum, float *output, int size) { int idx = blockIdx.x * blockDim.x + threadIdx.x; if (idx < size) { output[idx] = temp[idx] / *sum; } }

              But there is a problem here. In divide_kernel, each thread has to read the value of sum, which is a pointer to a single float, so it needs to be read from the device memory. This may have efficiency problems because each thread will access the same global memory address, which may cause competition, but in practice, as long as sum has been calculated correctly and all threads read the same value, this is feasible.

              A more efficient approach would probably be to copy sum to the host first and then back into a device variable, or use a constant. For a code example though, it's probably fine.

              Now, we need to consider numerical stability, such as overflow when calculating exponentials, especially when the input value is large. In practical applications, the maximum value is usually subtracted to improve numerical stability. However, in user needs, it may only be a basic implementation, so this may not be considered. However, if necessary, before calculating the exponent, you can find the maximum value of the array, and then subtract the maximum value from each element before performing the exponential operation. This can avoid overflow caused by too large an exponent. For example, the formula for softmax is usually:

              softmax(x_i) = e^(x_i - max(x)) / sum(e^(x_j - max(x)))

              But users may want a simple implementation, so this part may not be included.

              To sum up, the complete code may be as follows:

              First include the header file:

              #include <cuda_runtime.h> #include <stdio.h> #include <math.h>

              Then define three kernel functions:

              global void compute_exp(float *input, float *temp, int size) { int idx = blockIdx.x * blockDim.x + threadIdx.x; if (idx < size) { temp[idx] = expf(input[idx]); } }

              global void sum_kernel(float *temp, float *sum, int size) { shared float shared_data[256]; int idx = blockIdx.x * blockDim.x + threadIdx.x; int tid = threadIdx.x;

              if (idx < size) { shared_data[tid] = temp[idx]; } else { shared_data[tid] = 0.0f; } __syncthreads(); for (int s = blockDim.x / 2; s > 0; s >>= 1) { if (tid < s) { shared_data[tid] += shared_data[tid + s]; } __syncthreads(); } if (tid == 0) { atomicAdd(sum, shared_data[0]); }

              }

              global void divide_kernel(float *temp, float *sum, float *output, int size) { int idx = blockIdx.x * blockDim.x + threadIdx.x; if (idx < size) { output[idx] = temp[idx] / *sum; } }

              Then, the call in the main function:

              void cuda_softmax(float *h_input, float *h_output, int size) { float *d_input, *d_temp, *d_output, *d_sum;

              cudaMalloc(&d_input, size * sizeof(float)); cudaMalloc(&d_temp, size * sizeof(float)); cudaMalloc(&d_output, size * sizeof(float)); cudaMalloc(&d_sum, sizeof(float)); cudaMemcpy(d_input, h_input, size * sizeof(float), cudaMemcpyHostToDevice); int blockSize = 256; int gridSize = (size + blockSize - 1) / blockSize; compute_exp<<<gridSize, blockSize>>>(d_input, d_temp, size); cudaMemset(d_sum, 0, sizeof(float)); int sumBlockSize = 256; int sumGridSize = (size + sumBlockSize - 1) / sumBlockSize; sum_kernel<<<sumGridSize, sumBlockSize>>>(d_temp, d_sum, size); divide_kernel<<<gridSize, blockSize>>>(d_temp, d_sum, d_output, size); cudaMemcpy(h_output, d_output, size * sizeof(float), cudaMemcpyDeviceToHost); cudaFree(d_input); cudaFree(d_temp); cudaFree(d_output); cudaFree(d_sum);

              }

              However, there may be several problems here:

              1. Is the atomic operation in sum_kernel correct? Assume that each block calculates the sum of the part it processes and then atomically adds it to d_sum. However, each block processes multiple elements. For example, each thread processes one element, but a block has 256 threads and processes 256 elements. Each block calculates the sum of the 256 elements and then atomically adds it to d_sum. When all blocks are processed, the value in d_sum is the sum of all elements.

              However, the original implementation of sum_kernel may have problems. For example, when size is not a multiple of 256, the part processed by each block may exceed the number of actual elements, causing the unprocessed thread position in shared_data to be set to 0, thus correctly accumulating?

              For example, if idx < size, the thread reads temp[idx], otherwise it sets it to 0. This may be OK, because unprocessed elements are set to 0, so the partial sum of each block is only for elements that actually exist.

              But each block processes 256 elements. If the total number of elements is not a multiple of 256, the last block's portion will contain the actual elements and extra 0s. Is the calculation of the partial sum correct? For example, suppose there are 257 elements. The first block processes 0-255 and the second block processes 256-511. However, only 256 actually exist. The remaining 255 are out of range, so their shared_data[tid] will be set to 0. So in the partial sum of the blocks, the sum of the first block is the sum of the exp of the first 256 elements, and the sum of the second block is the 257th element plus the following 255 0s. The sum is correct.

              However, suppose each block processes 256 elements, when the total number of elements is N, each block processes elements from i 256 to i 256 + 255. In this way, the partial sum calculated by each block is the sum of these 256 elements (if the element exists, otherwise it is 0), and then the partial sums of all blocks are added together, and the sum is correct.

              This way, sum_kernel should calculate the sum correctly.

              Another possible optimization is to use a more efficient reduction method in sum_kernel, such as processing more elements per block, using loops, but the code will be more complicated. For example, processing multiple elements per thread and then doing in-block reduction may be more suitable for processing larger arrays.

              However, as example code, the current approach may be sufficient.

              Another question is whether the value of sum is read correctly in divide_kernel. Because sum is a variable in device memory, each thread reads the value of the variable in divide_kernel, which has been atomically updated by all sum_kernel blocks. In CUDA, kernels are executed in the order of calls, so after sum_kernel is called, divide_kernel is called, and the value of sum has been completely updated.

              However, you may need to use cudaDeviceSynchronize() to ensure that the previous kernel execution is completed, but in the main function of the sample code, synchronization may be called implicitly, because CUDA's default stream is synchronized, so the call of each kernel is executed sequentially.

              In general, this code should be able to implement the Softmax function, but it may not be efficient or numerically stable enough. However, as an example code, it should be able to show the basic implementation ideas.

              Finally, let’s test the code. For example, pass in a small array, such as h_input is [1.0, 2.0, 3.0], calculate the exp of each element, the sum is exp(1)+exp(2)+exp(3), and then divide each element by the sum to see if the output is correct.

              There may be a question: when multiple threads read sum at the same time, will there be any synchronization problems? Since sum is in device memory and all write operations have been completed in sum_kernel, the read operation in divide_kernel should get the final value.

              In summary, this should be a viable CUDA implementation, although there may be room for optimization, such as dealing with numerical stability issues or a more efficient reduction implementation.

              <End of Thought>

    #include <cuda_runtime.h>
    #include <math.h>

    __global__ void compute_exp_and_sum(float *input, float *temp, float *sum, int size) {
        __shared__ float shared_data[256];
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        int tid = threadIdx.x;

        // Calculate the exponent and store it in a temporary array
        if (idx < size) {
            temp[idx] = expf(input[idx]);
            shared_data[tid] = temp[idx];
        } else {
            shared_data[tid] = 0.0f;
        }
        __syncthreads();

        // Reduction and summation within the block
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) {
                shared_data[tid] += shared_data[tid + s];
            }
            __syncthreads();
        }

        // Atomic operation adds the sum of the block to the global sum
        if (tid == 0) {
            atomicAdd(sum, shared_data[0]);
        }
    }

    __global__ void normalize(float *temp, float *sum, float *output, int size) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < size) {
            output[idx] = temp[idx] / *sum;
        }
    }

    void cuda_softmax(float *h_input, float *h_output, int size) {
        float *d_input, *d_temp, *d_output, *d_sum;

        // Allocate device memory
        cudaMalloc(&d_input, size * sizeof(float));
        cudaMalloc(&d_temp, size * sizeof(float));
        cudaMalloc(&d_output, size * sizeof(float));
        cudaMalloc(&d_sum, sizeof(float));

        // Copy input data to the device
        cudaMemcpy(d_input, h_input, size * sizeof(float), cudaMemcpyHostToDevice);

        int blockSize = 256;
        int gridSize = (size + blockSize - 1) / blockSize;

        // Initialize the sum to 0
        cudaMemset(d_sum, 0, sizeof(float));

        // Calculate exponents and sum
        compute_exp_and_sum<<<gridSize, blockSize>>>(d_input, d_temp, d_sum, size);

        // Normalization
        normalize<<<gridSize, blockSize>>>(d_temp, d_sum, d_output, size);

        // Copy the result back to the host
        cudaMemcpy(h_output, d_output, size * sizeof(float), cudaMemcpyDeviceToHost);

        // Release device memory
        cudaFree(d_input);
        cudaFree(d_temp);
        cudaFree(d_output);
        cudaFree(d_sum);
    }

Compared with the full version, the 1.58-bit quantized version omits the max-subtraction step that prevents overflow in the exponentials, and it additionally wraps the kernels in device-memory allocation and release code (a sketch of how the missing step could be added follows the test output below). Writing the same test case as before:

    #include <stdio.h>

    // cuda_softmax is defined in the CUDA code generated above;
    // this test file is compiled together with it.
    void cuda_softmax(float *h_input, float *h_output, int size);

    int main(void)
    {
        float a[8] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0};
        float b[8] = {0.0f};
        cuda_softmax(a, b, 8);
        for (int i = 0; i < 8; i++) {
            printf("%e\n", b[i]);
        }
        return 0;
    }

Run output:

    # ./cuda_testbench 2
    5.766128e-04
    1.567396e-03
    4.260624e-03
    1.158158e-02
    3.148199e-02
    8.557692e-02
    2.326222e-01
    6.323327e-01

The results are consistent with the PyTorch output.
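
For readers who want to close the numerical-stability gap noted above, here is a minimal sketch of how the missing max-subtraction could be added, following the same block-reduction pattern the quantized model already used. This is our sketch, not model output: the kernels block_max and final_max, the max_out buffer, and the modified compute_exp_and_sum signature are all assumptions.

    // Sketch: adding the max-subtraction step to the quantized version's pipeline.
    // Meant to be added to the file above, which already includes <cuda_runtime.h>
    // and <math.h>. The block size is assumed to be a power of two no larger than
    // 256, matching the existing kernels.

    __global__ void block_max(const float *input, float *block_maxima, int size) {
        __shared__ float shared_data[256];
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        int tid = threadIdx.x;

        // Each thread loads one element (or -inf when out of range)
        shared_data[tid] = (idx < size) ? input[idx] : -INFINITY;
        __syncthreads();

        // Tree reduction for the block-local maximum
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) {
                shared_data[tid] = fmaxf(shared_data[tid], shared_data[tid + s]);
            }
            __syncthreads();
        }
        if (tid == 0) {
            block_maxima[blockIdx.x] = shared_data[0];
        }
    }

    // Run with a single block: reduces the per-block maxima to one global maximum.
    __global__ void final_max(const float *block_maxima, float *max_out, int num_blocks) {
        __shared__ float shared_data[256];
        int tid = threadIdx.x;

        float m = -INFINITY;
        for (int i = tid; i < num_blocks; i += blockDim.x) {
            m = fmaxf(m, block_maxima[i]);
        }
        shared_data[tid] = m;
        __syncthreads();

        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) {
                shared_data[tid] = fmaxf(shared_data[tid], shared_data[tid + s]);
            }
            __syncthreads();
        }
        if (tid == 0) {
            *max_out = shared_data[0];
        }
    }

    // compute_exp_and_sum would then take an extra pointer and exponentiate the
    // shifted value, e.g. temp[idx] = expf(input[idx] - *max_out); and cuda_softmax
    // would launch the two kernels above before it:
    //   block_max<<<gridSize, blockSize>>>(d_input, d_block_max, size);
    //   final_max<<<1, blockSize>>>(d_block_max, d_max, gridSize);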


Conclusion


• When conditions permit, use the full version. It approaches problems most professionally and saves the most engineering effort.

• As a second choice, the quantized version offers a fairly complete chain of thought and average but usable code-generation quality.

• If hardware is truly limited, the small distilled model has the lowest requirements, but its generated code is often unusable and you will spend more time and energy debugging.