DeepSeek Smallpond

Written by
Silas Grey
Updated on: July 14th, 2025

DeepSeek Smallpond: Analyzing the emerging lightweight distributed data processing framework

Core content:
1. How the Smallpond framework extends DuckDB with distributed analysis capabilities
2. The performance advantages of the 3FS file system in AI and HPC workloads
3. Smallpond's practical installation and usage guide



You may have heard about smallpond from the Twitter/LinkedIn buzz. From all that buzz, you might conclude that Databricks and Snowflake are done for. But wait, that’s not the case.


While this open source technology is interesting and powerful, it is unlikely to be widely used in the analytics world anytime soon. Here is a concise breakdown to help you see through the noise. We’ll cover:


1. What is Smallpond and its supporting system 3FS?

2. Are they suitable for your use case?

3. How to use them if they are suitable


What is smallpond?

Smallpond is a lightweight distributed data processing framework recently launched by DeepSeek. It extends DuckDB (usually a single-node analytical database) to enable it to process larger data sets across multiple nodes. Smallpond enables DuckDB to manage distributed workloads by using distributed storage and computing systems. Main features:


1. Distributed analysis: By partitioning data and running analysis tasks in parallel, DuckDB can handle data sets that exceed the memory capacity of a single machine.


2. Open source deployment: If you can run it successfully, 3FS will provide you with powerful and high-performance storage at a much lower cost than other alternatives.


3. Manual partitioning: Users need to manually partition the data, and Smallpond will distribute these partitions to multiple nodes for parallel processing.


What is 3FS?

3FS, short for Fire-Flyer File System, is a high-performance parallel file system developed by DeepSeek. It is optimized specifically for AI and high-performance computing (HPC) workloads, delivering extremely high throughput and low latency by combining SSDs with RDMA networking.


3FS is the high-speed distributed storage backend that smallpond relies on, providing it with amazing performance. 3FS achieves a read throughput of 6.6 TiB/s on a cluster of 180 nodes, far exceeding many traditional distributed file systems.
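To put that figure in perspective, a quick back-of-the-envelope calculation gives the implied per-node read throughput:

```python
# Per-node throughput implied by the published 3FS benchmark:
# 6.6 TiB/s aggregate read throughput across a 180-node cluster.
aggregate_tib_per_s = 6.6
nodes = 180

per_node_gib_per_s = aggregate_tib_per_s * 1024 / nodes
print(f"~{per_node_gib_per_s:.1f} GiB/s per node")  # ~37.5 GiB/s
```

Roughly 37.5 GiB/s of sustained reads per node is well beyond what most networked file systems deliver, which is why the SSD + RDMA design matters.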


How to use it? First, install it like any other Python package: pip install smallpond. But to really take advantage of smallpond, more effort is required, depending on the size of your data and infrastructure:


1. Less than 10TB: Unless you have very specific distributed computing requirements, smallpond may not be necessary. A single-node DuckDB instance or a simpler storage solution will be simpler and may even perform better. Frankly speaking, using smallpond (without Ray or 3FS) with small data will probably be slower and more complicated than vanilla DuckDB.


2. 10TB to 1PB: Smallpond starts to show its advantages. You will need to set up a cluster and use 3FS or another high-speed storage backend to achieve fast parallel processing.


3. More than 1PB (PB-scale): Smallpond and 3FS are designed to handle massive data sets. At this scale, you need to deploy a larger cluster and make a lot of infrastructure investment.
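Whatever the scale, the API itself is small. A minimal run looks roughly like the following, adapted from the project's README (the file path and column names here are placeholder examples):

```python
import smallpond

# Initialize a session; locally this runs on one machine,
# on a Ray cluster it distributes work across nodes.
sp = smallpond.init()

# Load a Parquet dataset (path is hypothetical).
df = sp.read_parquet("prices.parquet")

# Manually partition the data, here by hashing the "ticker" column
# into 3 partitions.
df = df.repartition(3, hash_by="ticker")

# Run a DuckDB SQL query independently against each partition;
# {0} is substituted with the partition's data.
df = sp.partial_sql(
    "SELECT ticker, min(price), max(price) FROM {0} GROUP BY ticker", df
)

# Trigger execution: write results to Parquet and inspect them.
df.write_parquet("output/")
print(df.to_pandas())
```

Note that nothing executes until `write_parquet()` or `to_pandas()` is called; everything before that only builds a plan, as discussed below.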


Deployment typically consists of the following steps:

1. Set up a computing cluster (such as AWS EC2, Google Compute Engine, or a local server).


2. Deploy 3FS on the nodes, which need to be equipped with high-performance SSDs and an RDMA network.


3. Install smallpond via Python to run distributed DuckDB tasks on the cluster.


Among these, steps 1 and 3 are straightforward, but step 2 is hard. Since 3FS is new technology, there is no guide for setting it up on AWS or other cloud platforms (perhaps DeepSeek will provide support for this in the future?). You can certainly deploy it on bare-metal servers, but that puts you even deeper into DevOps territory.


I have tried running smallpond with S3 instead of 3FS, but for moderately large data sets it is unclear whether this offers any performance improvement over simply scaling up a single node.


Whether you need to use smallpond depends on several factors:

1. Data scale: If your dataset is less than 10TB, smallpond will introduce unnecessary complexity and overhead. For larger datasets, it can provide significant performance advantages.


2. Infrastructure capabilities: Smallpond and 3FS require strong infrastructure and DevOps expertise. Without a team skilled in cluster management, deployment and operation can be very challenging.


3. Analysis complexity: Smallpond excels at partition-level parallel processing but is less optimized for complex joins. If your workload requires complex join operations across partitions, performance may be limited.


Core Concepts

Session

  • The entry point into smallpond
  • Manages the execution environment, configuration, and resources
  • Provides methods for creating DataFrames
  • Manages Ray clusters and executors


DataFrame

  • The main abstraction for data processing
  • Provides interfaces for data transformation and manipulation
  • Supports SQL, map, flat_map, and other operations
  • Execution is deferred until compute() is called


Platform

  • Abstracts over differences between execution platforms (e.g., local, MPI, etc.)
  • Responsible for task launching and resource management
  • Provides platform-specific configuration and defaults


LogicalPlan

  • A DAG (directed acyclic graph) representing the entire computation
  • Composed of Nodes
  • Can be optimized and converted into an execution plan


Node (Logical Node)

  • DataSourceNode: Data source node
  • SqlEngineNode: SQL execution node
  • ArrowComputeNode: Arrow compute node
  • DataSinkNode: Data output node
  • ProjectionNode: Column selection node
  • UnionNode: Data merging node
  • ConsolidateNode: Partition merging node


ExecutionPlan

  • Converted from the LogicalPlan
  • Contains concrete execution tasks (Task)
  • Manages task dependencies and execution order


Task

  • The concrete unit of execution
  • Contains input, output, and execution logic
  • Supports retries and error handling
  • Can run on different executors
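The retry behavior can be pictured with a simple wrapper (an illustrative sketch of the idea, not smallpond's actual implementation):

```python
import time

def run_with_retries(task_fn, max_attempts=3, backoff_s=0.0):
    """Run a task, retrying on failure up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task_fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of retries: surface the failure
            time.sleep(backoff_s)  # optional pause between attempts

# A flaky task that fails twice before succeeding.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(run_with_retries(flaky))  # prints "ok" on the third attempt
```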


Scheduler

  • Responsible for task scheduling and allocation
  • Manages the lifecycle of tasks
  • Handles task failures and retries
  • Supports speculative execution


DataSet

  • Represents a set of data files
  • Supports different file formats (Parquet, CSV, JSON, etc.)
  • Manages the partitioning and distribution of data
  • Provides data reading and writing interfaces


WorkQueue

  • Manages the execution queue of tasks
  • Supports task prioritization
  • Handles task status changes
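A priority-ordered work queue of this kind can be sketched with the standard library (illustrative only; smallpond's internal queue is more involved):

```python
import heapq
import itertools

class WorkQueue:
    """Minimal priority work queue: lower priority number runs first."""
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order

    def push(self, task, priority=0):
        heapq.heappush(self._heap, (priority, next(self._counter), task))

    def pop(self):
        _, _, task = heapq.heappop(self._heap)
        return task

q = WorkQueue()
q.push("compact partitions", priority=5)
q.push("read source files", priority=1)
q.push("write output", priority=9)
print(q.pop())  # prints "read source files" — the highest-priority task
```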


Execution flow

Session
-> Create DataFrame
-> Build LogicalPlan
-> Optimizer
-> Planner generates ExecutionPlan
-> Scheduler schedules execution
-> Platform runs Tasks


  • Session manages the entire execution environment
  • DataFrame provides user interface
  • LogicalPlan describes the calculation logic
  • ExecutionPlan is responsible for specific execution
  • Platform provides runtime support
  • DataSet manages data storage and access


How it works

Lazy DAG execution: Operations such as map(), filter(), and partial_sql() are lazily evaluated. Smallpond does not execute them immediately; instead, it builds a logical execution plan represented as a directed acyclic graph (DAG), with each operation becoming a node (e.g. SqlEngineNode, HashPartitionNode, DataSourceNode).


The actual calculation will not start until you explicitly trigger the execution of an action. These triggers include:

  • write_parquet() — Writes data to disk
  • to_pandas() — Converts the result to a pandas DataFrame
  • compute() — Explicitly forces computation
  • count() — Counts rows
  • take() — Retrieves some rows of data

This lazy evaluation mechanism is very efficient because it avoids unnecessary calculations and optimizes the workflow.
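The idea can be illustrated with a toy lazy DAG (a deliberately simplified sketch, not smallpond's internals):

```python
class Node:
    """A node in a lazy computation DAG: records an operation, runs nothing."""
    def __init__(self, fn, parent=None):
        self.fn = fn
        self.parent = parent

    def map(self, fn):
        # Building the plan: just append a node; no computation happens yet.
        return Node(fn, parent=self)

    def compute(self):
        # Triggering an action: walk the DAG and execute each step in order.
        data = self.parent.compute() if self.parent else None
        return self.fn(data)

# Build the plan lazily: source -> map -> map.
source = Node(lambda _: [1, 2, 3, 4])
plan = source.map(lambda xs: [x * 2 for x in xs]).map(lambda xs: sum(xs))

# Nothing has run yet; compute() triggers the whole chain.
print(plan.compute())  # prints 20
```

Because the full chain is known before anything runs, redundant steps can be skipped and work can be scheduled as a whole rather than one operation at a time.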




From logical plan to execution plan: When you finally trigger an action, the logical plan is transformed into an execution plan consisting of concrete tasks (such as SqlEngineTask and HashPartitionTask), which are the actual units of work distributed and executed by Ray.




Ray Core and distributed processing: Smallpond's distributed capabilities rely on Ray Core, with scalability achieved through partitioning at the Python level. Partitions can be created manually, and Smallpond supports the following partitioning methods:


  • Hash partitioning (based on column values)
  • Even partitioning (by number of files or rows)
  • Shuffle partitions
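Hash partitioning, the first of these, amounts to bucketing rows by a stable hash of a key column, so that all rows sharing a key land in the same partition. A standard-library sketch of the idea (the rows and column names are made up for illustration):

```python
import zlib
from collections import defaultdict

def hash_partition(rows, key, num_partitions):
    """Assign each row to a partition by a stable hash of rows[key]."""
    partitions = defaultdict(list)
    for row in rows:
        bucket = zlib.crc32(str(row[key]).encode()) % num_partitions
        partitions[bucket].append(row)
    return partitions

rows = [
    {"ticker": "AAPL", "price": 190},
    {"ticker": "MSFT", "price": 410},
    {"ticker": "AAPL", "price": 192},
]
parts = hash_partition(rows, key="ticker", num_partitions=3)
# All rows with the same ticker land in the same partition, so each
# partition can later be queried independently by one DuckDB instance.
```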


Each partition runs independently in its own Ray task, using a DuckDB instance to handle SQL queries. This tight integration with Ray emphasizes horizontal scaling (adding more nodes) rather than vertical scaling (using a more powerful single node). To use it at scale, you need a Ray cluster. You can choose to run a Ray cluster on your own infrastructure or on a cloud provider like AWS, but if you just want to test it out, it's easier to get started with Anyscale.