DeepSeek Smallpond

Written by
Silas Grey
Updated on: July 14th, 2025

DeepSeek Smallpond: Analyzing the emerging lightweight distributed data processing framework

Core content:
1. How the Smallpond framework extends DuckDB with distributed analysis capabilities
2. The performance advantages of the 3FS file system in AI and HPC workloads
3. Smallpond's practical installation and usage guide



You may have heard about smallpond from the Twitter/LinkedIn buzz. From all that buzz, you might conclude that Databricks and Snowflake are done for. But wait, that’s not the case.


While this open source technology is interesting and powerful, it is unlikely to be widely used in the analytics world anytime soon. Here is a concise breakdown to help you see through the noise. We’ll cover:


1. What is Smallpond and its supporting system 3FS?

2. Are they suitable for your use case?

3. How to use them if they are suitable


What is smallpond?

Smallpond is a lightweight distributed data processing framework recently launched by DeepSeek. It extends DuckDB (usually a single-node analytical database) to enable it to process larger data sets across multiple nodes. Smallpond enables DuckDB to manage distributed workloads by using distributed storage and computing systems. Main features:


1. Distributed analysis: By partitioning data and running analysis tasks in parallel, DuckDB can handle data sets that exceed the memory capacity of a single machine.


2. Open source deployment: If you can run it successfully, 3FS will provide you with powerful and high-performance storage at a much lower cost than other alternatives.


3. Manual partitioning: Users need to manually partition the data, and Smallpond will distribute these partitions to multiple nodes for parallel processing.


What is 3FS?

3FS, short for Fire-Flyer File System, is a high-performance parallel file system developed by DeepSeek. It is optimized specifically for AI and high-performance computing (HPC) workloads, delivering extremely high throughput and low latency by combining SSDs with RDMA networking.


3FS is the high-speed distributed storage backend that smallpond relies on, providing it with amazing performance. 3FS achieves a read throughput of 6.6 TiB/s on a cluster of 180 nodes, far exceeding many traditional distributed file systems.
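To put that figure in perspective, a quick back-of-the-envelope calculation gives the implied per-node read throughput:

```python
# Per-node throughput implied by the published 3FS benchmark:
# 6.6 TiB/s aggregate read throughput across a 180-node cluster.
aggregate_tib_per_s = 6.6
nodes = 180

per_node_gib_per_s = aggregate_tib_per_s * 1024 / nodes
print(f"~{per_node_gib_per_s:.1f} GiB/s per node")  # ~37.5 GiB/s
```

Roughly 37.5 GiB/s of sustained reads per node is well beyond what most networked file systems deliver, which is why the SSD + RDMA design matters.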


How to use it? First, install it like any other Python package: pip install smallpond. But to really take advantage of smallpond, more effort is required, depending on the size of your data and infrastructure:


1. Less than 10TB: Unless you have very specific distributed computing requirements, smallpond may not be necessary. A single-node DuckDB instance or a simpler storage solution will be simpler and may even perform better. Frankly speaking, using smallpond (without Ray or 3FS) with small data will probably be slower and more complicated than vanilla DuckDB.


2. 10TB to 1PB: Smallpond starts to show its advantages. You will need to set up a cluster and use 3FS or another high-speed storage backend to achieve fast parallel processing.


3. More than 1PB (PB-scale): Smallpond and 3FS are designed to handle massive data sets. At this scale, you need to deploy a larger cluster and make a lot of infrastructure investment.
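Whatever the scale, the API itself is small. A minimal run looks roughly like the following, adapted from the project's README (the file path and column names here are placeholder examples):

```python
import smallpond

# Initialize a session; locally this runs on one machine,
# on a Ray cluster it distributes work across nodes.
sp = smallpond.init()

# Load a Parquet dataset (path is hypothetical).
df = sp.read_parquet("prices.parquet")

# Manually partition the data, here by hashing the "ticker" column
# into 3 partitions.
df = df.repartition(3, hash_by="ticker")

# Run a DuckDB SQL query independently against each partition;
# {0} is substituted with the partition's data.
df = sp.partial_sql(
    "SELECT ticker, min(price), max(price) FROM {0} GROUP BY ticker", df
)

# Trigger execution: write results to Parquet and inspect them.
df.write_parquet("output/")
print(df.to_pandas())
```

Note that nothing executes until `write_parquet()` or `to_pandas()` is called; everything before that only builds a plan, as discussed below.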


Deployment typically consists of the following steps:

1. Set up a computing cluster (such as AWS EC2, Google Compute Engine, or a local server).


2. Deploy 3FS on the nodes, which need to be equipped with high-performance SSDs and an RDMA network.


3. Install smallpond via Python to run distributed DuckDB tasks on the cluster.


Among these, steps 1 and 3 are straightforward, but step 2 is hard. Since 3FS is new technology, there is no guide for setting it up on AWS or other cloud platforms (perhaps DeepSeek will provide support for this in the future?). You can certainly deploy it on bare-metal servers, but that puts you even deeper into DevOps territory.


I have tried running smallpond with S3 instead of 3FS, but for moderately large data sets it is unclear whether this offers any performance improvement over simply scaling up a single node.


Whether you need to use smallpond depends on several factors:

1. Data scale: If your dataset is less than 10TB, smallpond will introduce unnecessary complexity and overhead. For larger datasets, it can provide significant performance advantages.


2. Infrastructure capabilities: Smallpond and 3FS require strong infrastructure and DevOps expertise. Without a team skilled in cluster management, deployment and operation can be very challenging.


3. Analysis complexity: Smallpond excels at partition-level parallel processing but is less optimized for complex joins. If your workload requires complex join operations across partitions, performance may be limited.


Core Concepts

Session

  • The entry point into smallpond
  • Manages the execution environment, configuration, and resources
  • Provides methods for creating DataFrames
  • Manages Ray clusters and executors


DataFrame

  • The main abstraction for data processing
  • Provides interfaces for data transformation and manipulation
  • Supports SQL, map, flat_map, and other operations
  • Execution is deferred until compute() is called


Platform

  • Abstracts over differences between execution platforms (e.g., local, MPI, etc.)
  • Responsible for task launching and resource management
  • Provides platform-specific configuration and defaults


LogicalPlan

  • A DAG (directed acyclic graph) representing the entire computation
  • Composed of Nodes
  • Can be optimized and converted into an execution plan


Node (Logical Node)

  • DataSourceNode: Data source node
  • SqlEngineNode: SQL execution node
  • ArrowComputeNode: Arrow compute node
  • DataSinkNode: Data output node
  • ProjectionNode: Column selection node
  • UnionNode: Data merging node
  • ConsolidateNode: Partition merging node


ExecutionPlan

  • Converted from the LogicalPlan
  • Contains concrete execution tasks (Task)
  • Manages task dependencies and execution order


Task

  • The concrete unit of execution
  • Contains input, output, and execution logic
  • Supports retries and error handling
  • Can run on different executors
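The retry behavior can be pictured with a simple wrapper (an illustrative sketch of the idea, not smallpond's actual implementation):

```python
import time

def run_with_retries(task_fn, max_attempts=3, backoff_s=0.0):
    """Run a task, retrying on failure up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task_fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of retries: surface the failure
            time.sleep(backoff_s)  # optional pause between attempts

# A flaky task that fails twice before succeeding.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(run_with_retries(flaky))  # prints "ok" on the third attempt
```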


Scheduler

  • Responsible for task scheduling and allocation
  • Manages the lifecycle of tasks
  • Handles task failures and retries
  • Supports speculative execution


DataSet

  • Represents a set of data files
  • Supports different file formats (Parquet, CSV, JSON, etc.)
  • Manages the partitioning and distribution of data
  • Provides data reading and writing interfaces


WorkQueue

  • Manages the execution queue of tasks
  • Supports task prioritization
  • Handles task status changes
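A priority-ordered work queue of this kind can be sketched with the standard library (illustrative only; smallpond's internal queue is more involved):

```python
import heapq
import itertools

class WorkQueue:
    """Minimal priority work queue: lower priority number runs first."""
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order

    def push(self, task, priority=0):
        heapq.heappush(self._heap, (priority, next(self._counter), task))

    def pop(self):
        _, _, task = heapq.heappop(self._heap)
        return task

q = WorkQueue()
q.push("compact partitions", priority=5)
q.push("read source files", priority=1)
q.push("write output", priority=9)
print(q.pop())  # prints "read source files" — the highest-priority task
```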


Execution flow

Session
-> Create DataFrame
-> Build LogicalPlan
-> Optimizer
-> Planner generates ExecutionPlan
-> Scheduler schedules execution
-> Platform runs Tasks


  • Session manages the entire execution environment
  • DataFrame provides user interface
  • LogicalPlan describes the calculation logic
  • ExecutionPlan is responsible for specific execution
  • Platform provides runtime support
  • DataSet manages data storage and access


How it works

Lazy DAG execution: Operations such as map(), filter(), and partial_sql() are lazily evaluated. Smallpond does not execute them immediately; instead, it builds a logical execution plan represented as a directed acyclic graph (DAG), with each operation becoming a node (e.g. SqlEngineNode, HashPartitionNode, DataSourceNode).


The actual calculation will not start until you explicitly trigger the execution of an action. These triggers include:

  • write_parquet() — Writes data to disk
  • to_pandas() — Converts the result to a pandas DataFrame
  • compute() — Explicitly forces computation
  • count() — Counts rows
  • take() — Retrieves some rows of data

This lazy evaluation mechanism is very efficient because it avoids unnecessary calculations and optimizes the workflow.
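The idea can be illustrated with a toy lazy DAG (a deliberately simplified sketch, not smallpond's internals):

```python
class Node:
    """A node in a lazy computation DAG: records an operation, runs nothing."""
    def __init__(self, fn, parent=None):
        self.fn = fn
        self.parent = parent

    def map(self, fn):
        # Building the plan: just append a node; no computation happens yet.
        return Node(fn, parent=self)

    def compute(self):
        # Triggering an action: walk the DAG and execute each step in order.
        data = self.parent.compute() if self.parent else None
        return self.fn(data)

# Build the plan lazily: source -> map -> map.
source = Node(lambda _: [1, 2, 3, 4])
plan = source.map(lambda xs: [x * 2 for x in xs]).map(lambda xs: sum(xs))

# Nothing has run yet; compute() triggers the whole chain.
print(plan.compute())  # prints 20
```

Because the full chain is known before anything runs, redundant steps can be skipped and work can be scheduled as a whole rather than one operation at a time.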




From logical plan to execution plan: When you finally trigger an action, the logical plan is transformed into an execution plan consisting of concrete tasks (such as SqlEngineTask and HashPartitionTask), which are the actual units of work distributed and executed by Ray.




Ray Core and distributed processing: Smallpond's distributed capabilities rely on Ray Core, with scalability achieved through partitioning at the Python level. Partitions can be created manually, and Smallpond supports the following partitioning methods:


  • Hash partitioning (based on column values)
  • Even partitioning (by number of files or rows)
  • Shuffle partitions
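Hash partitioning, the first of these, amounts to bucketing rows by a stable hash of a key column, so that all rows sharing a key land in the same partition. A standard-library sketch of the idea (the rows and column names are made up for illustration):

```python
import zlib
from collections import defaultdict

def hash_partition(rows, key, num_partitions):
    """Assign each row to a partition by a stable hash of rows[key]."""
    partitions = defaultdict(list)
    for row in rows:
        bucket = zlib.crc32(str(row[key]).encode()) % num_partitions
        partitions[bucket].append(row)
    return partitions

rows = [
    {"ticker": "AAPL", "price": 190},
    {"ticker": "MSFT", "price": 410},
    {"ticker": "AAPL", "price": 192},
]
parts = hash_partition(rows, key="ticker", num_partitions=3)
# All rows with the same ticker land in the same partition, so each
# partition can later be queried independently by one DuckDB instance.
```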


Each partition runs independently in its own Ray task, using a DuckDB instance to handle SQL queries. This tight integration with Ray emphasizes horizontal scaling (adding more nodes) rather than vertical scaling (using a more powerful single node). To use it at scale, you need a Ray cluster. You can choose to run a Ray cluster on your own infrastructure or on a cloud provider like AWS, but if you just want to test it out, it's easier to get started with Anyscale.