A true enterprise-grade solution! MindSQL open source: four training methods + private data + RAG + high concurrency in production

MindSQL is an open-source package for enterprise-grade text-to-SQL that helps you build high-concurrency Chat2SQL applications.
Core content:
1. Features of the MindSQL open-source package: free and open source, privacy protection, flexible construction
2. How it addresses enterprise pain points: improved accuracy, high-concurrency processing
3. How it works and its key functions: the RAG model, the two-step procedure, the core API components
MindSQL is a Python package for text-to-SQL generation. It is open source and supports RAG, which makes it a solid foundation for Chat2SQL applications.
Beyond the usual open-source software features, it offers:
• Free: the open-source package and its various front-end integrations are free for anyone to use.
• Privacy: it is designed to give you control over your company's data, without external monitoring.
• You're in control: you can build it any way you want and connect and integrate all of your business data.
MindSQL also solves two major difficulties in enterprise applications: low accuracy and inability to support high concurrency:
High accuracy
Accuracy with MindSQL depends on the quality of the training data you provide; more (and better) training data yields higher accuracy.
High concurrency
MindSQL's powerful architecture automatically scales to handle thousands of requests per minute.
How it works
MindSQL is a Python library that uses RAG (Retrieval Augmented Generation) with the help of large language models to create precise SQL queries tailored to your database.
MindSQL follows a simple two-step process to simplify your database interactions:
• First, it indexes database-specific information into a vector store, creating a RAG "model" fitted to your data.
• It then formulates SQL queries from your question for seamless execution against the database.
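The two steps above can be illustrated with a toy sketch in plain Python. This is not MindSQL's internals: it "indexes" schema snippets with a simple bag-of-words vectorizer, then retrieves the snippet closest to a question to serve as context for SQL generation (a real setup would use dense embeddings and an LLM).

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real vector store uses dense embeddings.
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Step 1: index database-specific information into a (toy) vector store.
store = [
    (embed(ddl), ddl)
    for ddl in [
        "CREATE TABLE employees (id INT, name VARCHAR(50), salary FLOAT)",
        "CREATE TABLE orders (id INT, employee_id INT, total FLOAT)",
    ]
]

# Step 2: retrieve the most relevant context for a question; an LLM would
# then turn this prompt into an executable SQL query.
question = "What is the average salary of employees?"
q_vec = embed(question)
_, context = max(store, key=lambda item: cosine(item[0], q_vec))
prompt = f"Schema:\n{context}\n\nQuestion: {question}\nSQL:"
print(context)
```

The retrieved context here is the `employees` DDL, since it shares the terms "employees" and "salary" with the question.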
To fully utilize the power of MindSQL, you primarily need to interact with two key methods:
• minds.index(...): this method plays the key role in the first step, letting you index database-specific details into the vector store. That information serves as the base corpus for subsequent SQL query generation.
• minds.ask_db(...): this method communicates directly with the database. It lets you ask questions and leverages the context stored in the vector store to build exactly the SQL query that answers them.
MindSQL consists of three basic components, each of which inherits and implements the API provided by the MindSQLCore class:
• Database: this component is dedicated to managing database-related functions.
• Vector Storage: this package focuses on storing and organizing the indexed database context, the reference corpus for query generation.
• LLMs: this package is responsible for integrating large language models into the MindSQL framework, giving the library its ability to understand requests and generate context-aware SQL queries.
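The three-component design can be sketched with illustrative interfaces. Note that the class names, method names, and signatures below are assumptions for illustration only, not MindSQL's actual API; they simply show how a core object might wire the pieces together.

```python
from abc import ABC, abstractmethod

# Illustrative interfaces only: names and signatures here are assumptions
# sketching the three-component design, not MindSQL's actual classes.
class Database(ABC):
    @abstractmethod
    def run_sql(self, sql: str): ...

class VectorStore(ABC):
    @abstractmethod
    def retrieve(self, query: str) -> list: ...

class LLM(ABC):
    @abstractmethod
    def generate_sql(self, question: str, context: list) -> str: ...

class Core:
    """Wires the three components together, in the spirit of MindSQLCore."""
    def __init__(self, llm, vectorstore, database):
        self.llm, self.vectorstore, self.database = llm, vectorstore, database

    def ask_db(self, question: str):
        context = self.vectorstore.retrieve(question)   # RAG retrieval
        sql = self.llm.generate_sql(question, context)  # SQL generation
        return self.database.run_sql(sql)               # execution

# Minimal stub implementations to show the wiring end to end.
class EchoDB(Database):
    def run_sql(self, sql):
        return sql

class StaticStore(VectorStore):
    def retrieve(self, query):
        return ["CREATE TABLE employees (...)"]

class StubLLM(LLM):
    def generate_sql(self, question, context):
        return "SELECT AVG(salary) FROM employees"

result = Core(StubLLM(), StaticStore(), EchoDB()).ask_db("What is the average salary?")
print(result)
```

Swapping one stub for another (a different database driver, vector store, or LLM) leaves the core flow untouched, which is the point of the component split.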
Installation and Usage
MindSQL is very easy to use. First, install the Python package:
pip install mindsql
The environment variables can then be set on the command line or in a .env file:
• API_KEY: the API key for authenticating with an LLM such as OpenAI, Gemini, or LLaMA
• DB_URL: the URL or connection string of the database
• EXAMPLE_PATH: the path to any sample JSON files or data you want to use for bulk indexing
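As a sketch of the .env approach, the variables can be loaded with a minimal hand-rolled parser (in practice you might use the python-dotenv package instead). The values below are placeholders, and this loader overwrites any existing variables for simplicity.

```python
import os
import tempfile

# A minimal .env loader; the variable names match the ones listed above.
def load_env(path: str) -> None:
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                # Overwrite any existing value for simplicity.
                os.environ[key.strip()] = value.strip()

# Demo: write a sample .env file and load it.
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as f:
    f.write("API_KEY=demo-key\nDB_URL=sqlite:///demo.db\nEXAMPLE_PATH=data.json\n")
    env_path = f.name

load_env(env_path)
print(os.getenv("API_KEY"))
```

After loading, the settings are available to the rest of the program via `os.getenv("API_KEY")`, `os.getenv("DB_URL")`, and `os.getenv("EXAMPLE_PATH")`.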
Common DB_URL connection-string formats are as follows:
• mysql://username:password@host:port/database
• postgresql://username:password@host:port/database
• mongodb://username:password@host:port/database
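These connection strings all follow the standard URL form `scheme://username:password@host:port/database`, so the standard library can pull the parts out, as this small sketch shows (the credentials and database name are placeholders):

```python
from urllib.parse import urlparse

# Parse a connection string of the form scheme://user:password@host:port/db.
parts = urlparse("postgresql://alice:secret@localhost:5432/sales")
print(parts.scheme)            # postgresql
print(parts.username)          # alice
print(parts.hostname)          # localhost
print(parts.port)              # 5432
print(parts.path.lstrip("/"))  # sales (the database name)
```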
The program is straightforward. First, import the relevant packages:
from mindsql.core import MindSQLCore
from mindsql.databases import Sqlite
from mindsql.llms import GoogleGenAi
from mindsql.vectorstores import ChromaDB
Pass in the LLM configuration and the vector store, and specify the database type, to create a MindSQLCore instance:
# Choose the vector store, LLM, and database you want to work with, and
# create a MindSQLCore instance with the configured LLM, vector store, and database
minds = MindSQLCore(
    llm=GoogleGenAi(config={"api_key": "YOUR-API-KEY"}),
    vectorstore=ChromaDB(),
    database=Sqlite()
)
Use the database connection string to connect to the database:
connection = minds.database.create_connection(url="YOUR_DATABASE_CONNECTION_URL")
Index all data structures and store them in the vector database:
minds.index_all_ddls(connection=connection, db_name='NAME_OF_THE_DB')
Batch index questions (RAG knowledge base) from the specified example path. If there are no example questions, this step can be skipped:
minds.index(bulk=True, path="your-qsn-sql-example.json")
Then you can ask the database a question:
response = minds.ask_db(
    question="YOUR_QUESTION",
    connection=connection,
    visualize=True
)
In Python, you can also extract and display charts from the response.
chart = response["chart"]
chart.show()
Finally, don’t forget to close the database connection:
connection.close()
The complete procedure is as follows:
from mindsql.core import MindSQLCore
from mindsql.databases import Sqlite
from mindsql.llms import GoogleGenAi
from mindsql.vectorstores import ChromaDB

config = {"api_key": "YOUR-API-KEY"}

minds = MindSQLCore(
    llm=GoogleGenAi(config=config),
    vectorstore=ChromaDB(),
    database=Sqlite()
)

connection = minds.database.create_connection(url="YOUR_DATABASE_CONNECTION_URL")
minds.index_all_ddls(connection=connection, db_name='NAME_OF_THE_DB')
minds.index(bulk=True, path="your-qsn-sql-example.json")

response = minds.ask_db(
    question="YOUR_QUESTION",
    connection=connection,
    visualize=True
)

chart = response["chart"]
chart.show()
connection.close()
Query data
The ask_db(...) method serves as the main interface for querying the system and extracting insights from the underlying data. It is structured as follows:
result = ask_db(
    question="What is the average salary of employees?",
    connection=my_connection,
    table_names=["employees"],
    visualize=False
)
The table_names parameter is optional and can be used to specify the relevant tables if needed. When table_names is provided, MindSQL does not retrieve the related tables from the vector store; instead it uses the given table names to fetch the relevant DDL.
If the visualize parameter is set to True, query results are visualized with Plotly charts (outside of demos, this parameter is usually left off).
Manually build an index knowledge base
Earlier we used a sample file to build the index knowledge base; we can also build it in code at any time.
There are several ways to build an index:
Add question-SQL pairs
When you add a question-SQL pair, you directly instruct the system how to interpret a specific user query. This is done by associating the question with its corresponding SQL query, allowing the system to understand the user's intent and accurately retrieve relevant data.
index(
    question="What is the average salary?",
    sql="SELECT AVG(salary) FROM employee"
)
Add DDL statements
Add Data Definition Language (DDL) statements, which define the structure of database objects, such as tables, columns, and data types. By incorporating DDL statements into the system's knowledge base, you can provide the necessary schema information, thereby enhancing the system's ability to effectively interpret and respond to queries about the database.
index(ddl="CREATE TABLE employees (id INT, name VARCHAR(50), salary FLOAT)")
Add documentation
Adding documentation strings involves providing specific contextual information that is critical for the system to understand the domain it operates in. This documentation can include details about business logic, industry-specific terminology, or any other relevant information required to accurately interpret a user query.
index(documentation="Employee salaries are calculated in US dollars ($)")
Batch data addition
To efficiently integrate multiple question-SQL pairs from external sources, use the index method with the bulk=True flag and specify the path to the JSON file (data.json). This allows for quick extraction of data and integration into the system's knowledge base.
index(bulk=True, path="data.json")
The data structure of the JSON file is:
[
    {
        "Question": "What is the average salary of employees?",
        "SQLQuery": "SELECT AVG(salary) FROM employees"
    }
]
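A quick way to build and sanity-check a bulk-indexing file in this format before passing its path to the index method is sketched below; the question-SQL pairs are illustrative examples.

```python
import json
import tempfile

# Build a bulk-indexing file matching the structure shown above.
pairs = [
    {"Question": "What is the average salary of employees?",
     "SQLQuery": "SELECT AVG(salary) FROM employees"},
    {"Question": "How many employees are there?",
     "SQLQuery": "SELECT COUNT(*) FROM employees"},
]

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(pairs, f, indent=2)
    path = f.name

# Sanity check: every entry must carry both keys before bulk indexing.
with open(path) as fh:
    loaded = json.load(fh)
assert all({"Question", "SQLQuery"} <= item.keys() for item in loaded)
print(len(loaded))
```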
Finally
In testing with more than 400 tables, simple queries generally worked, but errors sometimes crept in when choosing among similar fields and similar tables.
In that case, querying against explicitly specified tables works better, though in practice this needs support from the upper application layer, for example using AI to pre-select the tables.
Real production use still relies on private training data. Most customers' actual Chat2BI scenarios are limited in scope, and accuracy there can approach 100%.
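One lightweight way to implement the "pre-select the tables" idea is a scoring step before the query, sketched here with simple keyword overlap; the table names and hint vocabularies are hypothetical, and a production system would more likely use embeddings or an LLM call, then pass the result as table_names to ask_db(...).

```python
import re

# Hypothetical per-table keyword hints for pre-selection.
TABLE_HINTS = {
    "employees": "employee staff salary name department hire",
    "orders": "order total customer purchase amount date",
    "products": "product price stock sku category",
}

def select_tables(question: str, top_k: int = 2) -> list:
    """Rank tables by keyword overlap with the question; keep top matches."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    scores = {
        table: len(words & set(hints.split()))
        for table, hints in TABLE_HINTS.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [t for t in ranked[:top_k] if scores[t] > 0]

print(select_tables("What is the average salary of our staff?"))
```

With many similar tables, narrowing the candidate set this way lets ask_db(...) fetch DDL for only the relevant tables, which matches the observation above that specifying tables improves results.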