Getting started with RAG optimization using the Wikipedia dataset

Written by
Caleb Hayes
Updated on: June 27, 2025
Recommendation

Use the Wikipedia dataset to optimize a RAG pipeline and improve the quality of AI answers.

Core content:
1. Why RAG recall quality is often unsatisfactory
2. Methods and steps for optimizing RAG with the Wikipedia dataset
3. Using the PolarDB PostgreSQL database for data management


    The reasons are discussed elsewhere and not repeated here. This article takes the Wikipedia dataset as an example, manages it in a PolarDB PostgreSQL database, and shows how to improve recall coverage and recall accuracy through the following means:

    • Chunking strategy
    • QA-pair extraction
    • Tag extraction
    • Full-text search
    • Fuzzy search
    • Tag search
    • Vector search
    • Rerank
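    The overall flow behind these techniques is "multi-channel recall, then rerank". A minimal pure-Python sketch of that idea follows (illustrative only; the article implements each stage with SQL inside PolarDB, and all document ids and scores here are made up):

```python
# Toy sketch of hybrid recall followed by rerank.
# Each retrieval channel (full-text, fuzzy, tag, vector) returns candidate ids;
# we union them, then order by a separate rerank score (e.g. a cross-encoder).

def hybrid_recall(query_results: dict[str, list[str]]) -> list[str]:
    """Union candidate ids from every retrieval channel, keeping first-seen order."""
    seen, merged = set(), []
    for channel, ids in query_results.items():
        for doc_id in ids:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged

def rerank(candidates: list[str], scores: dict[str, float]) -> list[str]:
    """Order merged candidates by rerank score, highest first."""
    return sorted(candidates, key=lambda d: scores.get(d, 0.0), reverse=True)

if __name__ == "__main__":
    channels = {
        "fulltext": ["doc3", "doc1"],
        "vector":   ["doc2", "doc3"],
        "tag":      ["doc4"],
    }
    merged = hybrid_recall(channels)   # ['doc3', 'doc1', 'doc2', 'doc4']
    final = rerank(merged, {"doc2": 0.9, "doc3": 0.8, "doc1": 0.3, "doc4": 0.1})
    print(final)                       # ['doc2', 'doc3', 'doc1', 'doc4']
```

    The point of combining channels is coverage: a query that full-text search misses may still be recalled by vector or tag search, and the rerank stage restores precision over the merged set.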

    Prepare the environment

    This article assumes you have already deployed the environment described in the following articles:

    "Poor Guy Plays with PolarDB RAC Write-One-Read-Many Cluster Series | Using Loop Devices to Simulate Shared Storage in Docker Containers to Build PolarDB RAC"

    "Poor Guy Plays with PolarDB RAC Write-One-Read-Many Cluster Series | Accessing Private Large Model Services"

    Download Wikipedia dataset

    Use the huggingface_hub command-line tool:

    https://huggingface.co/docs/huggingface_hub/main/en/guides/cli

    pip install -U "huggingface_hub[cli]"  

    The Wikipedia dataset is shown below. Each record has four fields: id, url, title, text.

    https://huggingface.co/datasets/wikimedia/wikipedia

    We only need to download the Chinese subset:

    https://huggingface.co/datasets/wikimedia/wikipedia/tree/main/20231101.zh

    train-00000-of-00006.parquet  
    train-00001-of-00006.parquet  
    ...  
    train-00005-of-00006.parquet  

    Create a new directory

    mkdir ~/wiki  

    Download:

    HF_ENDPOINT=http://hf-mirror.com nohup huggingface-cli download wikimedia/wikipedia --include "20231101.zh/*" --repo-type dataset --local-dir ~/wiki --cache-dir ~/wiki/.cache --resume-download --max-workers 1 >~/wiki/download.log 2>&1 &    

    PS:

    DuckDB can read Hugging Face datasets directly, but it does not automatically use the mirror/proxy, so direct access fails:


    https://huggingface.co/docs/hub/en/datasets-duckdb

    $ HF_ENDPOINT=http://hf-mirror.com ./duckdb   
    v1.2.1 8e52ec4395  
    Enter ".help" for usage hints.   
    Connected to a transient in-memory database.  
    Use ".open FILENAME" to reopen on a persistent database.  
    D select * from 'hf://datasets/wikimedia/wikipedia/20231101.zh/*.parquet' limit 1;   
    IO Error:  
    SSL connection failed error for HTTP GET to 'https://huggingface.co/api/datasets/wikimedia/wikipedia/tree/main/20231101.zh' with status 1718054505  

    Importing Wikipedia dataset into PolarDB

    Place the downloaded parquet files in a location accessible from the PolarDB container:

    mv ~/wiki ~/data_volumn/  
      
      
    ./20231101.zh:   
    total 3481600  
    -rw-r--r-- 1 digoal staff 560M Apr 17 14:28 train-00000-of-00006.parquet  
    -rw-r--r-- 1 digoal staff 251M Apr 17 14:29 train-00001-of-00006.parquet  
    -rw-r--r-- 1 digoal staff 121M Apr 17 14:30 train-00002-of-00006.parquet  
    -rw-r--r-- 1 digoal staff 250M Apr 17 14:32 train-00003-of-00006.parquet  
    -rw-r--r-- 1 digoal staff 245M Apr 17 14:33 train-00004-of-00006.parquet  
    -rw-r--r-- 1 digoal staff 214M Apr 17 14:34 train-00005-of-00006.parquet  
    drwxr-xr-x 4 digoal staff 128B Apr 17 14:34 ..  
    drwxr-xr-x 8 digoal staff 256B Apr 17 14:34 .  

    Enter the PolarDB container

    docker exec -ti pb1 bash  

    Download duckdb

    cd  /data  
    wget https://github.com/duckdb/duckdb/releases/download/v1.2.2/duckdb_cli-linux-aarch64.zip  
    sudo apt install -y unzip  
    unzip duckdb_cli-linux-aarch64.zip   

    Querying the dataset using DuckDB

    ./duckdb  
    D load postgres_scanner;  
    100% ▕████████████████████████████████████████████████████████████████▏   
    D select * from  '/data/wiki/20231101.zh/*.parquet' limit  10000;  
    ┌─────────┬────────────────────────┬────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐  
    │ id │ url │ title │ text │  
    │ varchar │ varchar │ varchar │ varchar │  
    ├─────────┼────────────────────────┼──────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤  
    │ 13 │ https://zh.wikiped… │ Mathematics │ Mathematics is a discipline that studies concepts such as quantity, structure, and space and their changes. It is a kind of formal science. Mathematics uses abstraction and logical reasoning to develop from counting, calculation, measurement, and observation of the shape and movement of objects… │  
    │ 18 │ https://zh.wikiped… │ Philosophy │ Philosophy is a discipline that studies universal and fundamental issues, including existence, knowledge, value, reason, mind, language and other fields. Philosophy is different from other disciplines in that it has a unique way of thinking, such as critical thinking, common sense, and so on. │  
    │ 21 │ https://zh.wikiped… │ Literature │ Literature (), in a narrow sense, is a kind of language art, that is, an art that uses language and words as a means to vividly reflect objective social life and express subjective author's thoughts and feelings. Literature not only emphasizes the transmission of ideas and concepts, but also… │  
    │ 22 │ https://zh.wikiped… │ History │ History (modern Chinese term, classical Chinese called it history) refers to the past events and actions of human society, as well as the systematic recording, interpretation and research of these events and actions. History can help people today understand the past and serve as a basis for the future… │  
    │ 25 │ https://zh.wikipedia… │ Computer Science │ Computer science (, sometimes abbreviated as) is the systematic study of the theoretical foundations of information and computing and the practical techniques of how they are implemented and applied in computer systems. It is often described as the study of those who create, describe, and… │  
    │ 39 │ https://zh.wikiped… │ Nationality │ In the People's Republic of China, a nation is a community formed based on cultural and bloodline inheritance, and is recognized by the laws of the People's Republic of China. This word originated in Europe and the United States in the 19th century, and was introduced to China via Japan. In the United States, it is not a common word. │  
    │ 45 │ https://zh.wikiped… │ Drama │ Drama () is an art form in which actors perform a story or situation through dialogue, singing or action. Drama has four elements, including "actors", "story (situation)", "stage (performance venue)" and... │  
    │ 48 │ https://zh.wikiped… │ Film │ Film (/ ), characterized by motion/moving images (), is a work of visual art used to simulate experiences that convey ideas, stories, perceptions, feelings, beauty, or atmosphere through the use of moving images. These images are often accompanied by… │  
    │ 51 │ https://zh.wikiped… │ Music │ Music, in a broad sense, refers to any art composed of sound. The English word "Music" comes from the ancient Greek "μουσική" (mousike), which means the art of the goddess Muse. The Chinese word "music" is the same as the word "music" in Xu Shen's "Shuo… │  
    │ 53 │ https://zh.wikiped… │ Economics │ Economics (), is a social science branch that studies the relationship between goods and services, including all purchasing, production, distribution and consumption behaviors. \n\nGeneral economics focuses on the study of the power changes between economic subjects and objects,… │  
    │ 56 │ https://zh.wikiped… │ Political Science │ Political science, or political science, is a social science that conducts scientific research on politics, governance, and power systems, analyzes political activities, political systems, political ideas and behaviors, and related constitutions and laws. Political science involves… │  
    │ 57 │ https://zh.wikiped… │ Jurisprudence │ Jurisprudence (jurisprudence, legal theory), legal science, is a discipline in social science that studies the nature and laws of law, a specific social phenomenon. All order can be said to be a kind of "law… │  
    │ 59 │ https://zh.wikiped… │ Sociology │ Sociology () originated in the late 19th century and is a discipline that studies society. Sociology uses various research methods to conduct empirical investigations and critical analysis to develop and improve a set of theories about human social structure, social action, or social relations… │  
    │ 62 │ https://zh.wikiped… │ Military Science │ Military science is related to many fields, mainly war. In addition, military science itself contains various disciplines. Military is part of politics, and war is a continuation of politics. It is a country or group using violent means to achieve its own goals… │  
    │ 66 │ https://zh.wikiped… │ Information science │ Information science, formerly known as intelligence science (a Japanese loanword in Chinese), mainly refers to the research object of information, including the analysis, collection, classification, processing, storage, retrieval, dissemination and preservation of information… │  
    │ 67 │ https://zh.wikiped… │ Physics │ Physics (from, from, transliteration: phýsis, literal translation: nature) is the natural science that studies the nature and properties of matter and energy. Since matter and energy are the basic elements that must be involved in all scientific research, physics… │  
    │ 70 │ https://zh.wikiped… │ Astronomy │ Astronomy is a natural science that studies celestial bodies and astronomical phenomena. It uses mathematics, physics and chemistry to explain their origin and evolution. Astronomy studies objects including: planets, satellites, stars, nebulae, galaxies and comets, etc… │  
    │ 72 │ https://zh.wikiped… │ Mechanics │ Mechanics () is a branch of physics that studies energy and force and their relationship to the balance, deformation or motion of objects. \n\nHistory\nPeople use levers, water-pumping tools, etc. in their daily work, and gradually recognize… │  
    │ 74 │ https://zh.wikiped… │ Chemistry │ Chemistry is a sub-discipline of physics that studies the properties, composition, structure, and laws of change of matter. The objects of chemical research involve the relationship between substances, or the relationship between substances and energy. Traditional chemistry is often… │  
    │ 76 │ https://zh.wikiped… │ Geography │ Geography () is the study of the Earth and its characteristics, inhabitants and phenomena. It is a discipline that studies the interaction between the various layers of the Earth's surface, as well as their spatial differences and changes. \n\nEtymology\nThe English word geography comes from… │  
    │ · │ · │ · │ · │  
    │ · │ · │ · │ · │  
    │ · │ · │ · │ · │  
    │ 35312 │ https://zh.wikiped… │ Jiangmen City │ Jiangmen City (official transliteration: , traditional foreign language: , Kongmun, Kiangmoon), also known as Siyi, Wuyi, abbreviated as Yi, is a prefecture-level city under the jurisdiction of Guangdong Province, People's Republic of China. It is located in the southern part of Guangdong Province and is the Guangdong-Hong Kong-Macao Greater Bay Area… │  
    │ 35313 │ https://zh.wikiped… │ Xinhui District │ Xinhui (Cantonese: , Chinese Pinyin: ), abbreviated as Xin or Hui, was called Gangzhou in ancient times. It is also called Kuixiang because of its abundant production of palm trees. It is now a district under the jurisdiction of Jiangmen City, Guangdong Province, China. It is located in the western Pearl River Delta, at the confluence of the lower reaches of the Xijiang River and the Tanjiang River… │  
    │ 35314 │ https://zh.wikiped… │ PID controller │ PID controller (proportional-integral-derivative controller) consists of a proportional unit (Proportional), an integral unit (Integral) and a derivative unit (Derivative). The gains of these three units can be adjusted, and… │  
    │ 35319 │ https://zh.wikiped… │ Lattice (mathematics) │ In mathematics, a lattice () is a partially ordered set (poset) whose non-empty finite subsets have a supremum (called a union) and an infimum (called an intersection). A lattice can also be characterized as an algebraic structure that satisfies certain axiomatic identities. … │  
    │ 35325 │ https://zh.wikiped… │ Semilattice │ Let be a partially ordered set. If for any , there is a minimum upper bound (union), or for any , there is a maximum lower bound (intersection), then it is said to form a semilattice. \n\nA semilattice can also be defined as an algebraic structure. A semi… │  
    │ 35329 │ https://zh.wikiped… │ Partial order relation │ Partially ordered sets (, abbreviation) are sets equipped with a partial order relation in mathematics, especially in order theory. \nThis theory abstracts the intuitive concept of sorting, sequence or arrangement of the elements of a set. This ordering does not have to be complete… │  
    │ 35331 │ https://zh.wikiped… │ Binary relation │ In mathematics, a binary relation (or simply relation) is used to discuss the connection between two objects. For example, "greater than" and "equal to" in arithmetic, "similar" in geometry, or "element of" and "subset of" in set theory. │  
    │ 35333 │ https://zh.wikiped… │ Tasmanian tiger │ Tasmanian tiger (scientific name: ), now extinct, was once widely distributed in the tropical rainforests of New Guinea and the grasslands of Australia. Later, due to human activities, it was only distributed on the island of Tasmania… │  
    │ 35334 │ https://zh.wikiped… │ Metal materials │ Metal materials generally refer to pure metals or alloys used in industrial engineering applications. There are about 70 kinds of pure metals in nature, among which the common ones are gold, silver, copper, iron, aluminum, tin, nickel, lead, zinc, carbon, etc. Alloys usually refer to two or more kinds of metals. │  
    │ 35335 │ https://zh.wikiped… │ Cartesian product │ In mathematics, the Cartesian product () of two sets and , also known as the direct product, is represented in set theory as , which is the set of all possible ordered pairs where the first object of the ordered pair is a member of and the second object is a member of . … │  
    │ 35336 │ https://zh.wikiped… │ Ordered pair │ In mathematics, an ordered pair is a collection of two objects such that one can be distinguished as the "first element" and the other as the "second element" (the first element and the second element are also called the left projection and the right projection). With the first… │  
    │ 35340 │ https://zh.wikiped… │ Metal Matrix Composites │ Metal Matrix Composites (MMC) are generally composite materials with metal or alloy as the continuous phase and a second phase in the form of particles, whiskers or fibers. Currently, its preparation and processing are difficult, and the cost is high. │  
    │ 35343 │ https://zh.wikipedia… │ Equivalence relation │ In mathematics, an equivalence relation () is a binary relation that is reflexive, symmetric, and transitive. An equivalence relation is also called an equivalent relation. Some examples of equivalence relations include congruence on sets of integers, equivalent quantities () in Euclidean geometry,… │  
    │ 35344 │ https://zh.wikiped… │ 666 │ 666 (six hundred and sixty-six) is a natural number between 665 and 667. \n\nIn mathematics\n 666 is the 667th non-negative integer. \n 666 is the 333rd even number and the 544th composite number. (The 666th composite number is 806 = 2 × 13 ×… │  
    │ 35347 │ https://zh.wikiped… │ Total order relation │ A total order relation, also called a linear order (), is an antisymmetric, transitive, and complete binary relation on a set (generally referred to as ). \n\nIf the total order relation is satisfied, then the following statement holds for all and in :\n\n Against… │  
    │ 35356 │ https://zh.wikiped… │ Least Upper Bound │ Least Upper Bound, also known as supremum (, denoted as sup E), is an important concept in order theory in mathematics, and is widely used in fields such as lattice theory and mathematical analysis. \n\nDefinition\nGiven a partially ordered set (T,≤), for S⊆T, the supremum of S… │  
    │ 35361 │ https://zh.wikiped… │ Upper and lower bounds │ Let it be a partially ordered set. If there exists a real number that satisfies all the conditions, it is called the upper bound of the set. If there exists a real number that satisfies all the conditions, it is called the lower bound of the set. \n\nFor example, in real variables, if there is a real number that satisfies all the conditions, then it is the set… │  
    │ 35363 │ https://zh.wikiped… │ Greatest lower bound │ In mathematics, the infimum (or, denoted as inf E ) of a subset E of a set X is the largest element that is less than or equal to all other elements of E, which is not necessarily in E. Therefore, the term greatest lower bound (abbreviated as… │  
    │ 35366 │ https://zh.wikiped… │ Software Development │ Software development () is the process of building a software system or the software part of a system according to user requirements. Software development is a process that includes requirements acquisition, development planning, requirements analysis and design, programming implementation, software… │  
    │ 35372 │ https://zh.wikiped… │ Nanomaterials │ Nanomaterials are generally a general term for materials that have at least one dimension in the nanoscale range in three-dimensional space or are composed of materials in this scale range as basic structural units. Since nano-sized materials have very different properties from macroscopic materials… │  
    ├─────────┴────────────────────────┴────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤  
    │ 10000 rows (40 shown) 4 columns │  
    └──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘  
    D   

    Create a PolarDB database user

    $ psql  
      
    create role digoal encrypted password  '123456'  login superuser;  

    Import the Wikipedia dataset into PolarDB using the DuckDB postgres_scanner plugin.

    https://duckdb.org/docs/stable/extensions/postgres.html

    ./duckdb  
      
    CREATE SECRET polardb_pb1 (  
        TYPE postgres,  
        HOST  '127.0.0.1' ,  
        PORT 5432,  
        DATABASE postgres,  
        USER  'digoal' ,  
        PASSWORD  '123456'
    );  


    ATTACH  ''  AS pb1 (TYPE postgres, SECRET polardb_pb1);  

    create table pb1.wiki (id int, url text, title text, content text);  

    -- example   

    insert into pb1.wiki select * from  '/data/wiki/20231101.zh/*.parquet' limit  10000;   

    D select * from pb1.wiki  limit  1;  
    ┌───────┬────────────────────────┬───────────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐  
    │ id │ url │ title │ content │  
    │ int32 │ varchar │ varchar │ varchar │  
    ├───────┼────────────────────────┼─────────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤  
    │ 13 │ https://zh.wikiped… │ Mathematics │ Mathematics is a discipline that studies concepts such as quantity, structure, and space and their changes. It is a kind of formal science. Mathematics uses abstraction and logical reasoning to develop from counting, calculation, measurement, and observation of the shape and movement of objects. … │  
    └───────┴────────────────────────┴───────────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘  
    D   

    PS: If you have enough space, you can export parquet to csv and then use psql copy to import it, which will be faster.

    -- duckdb  
    ./duckdb  
    copy (select * from  '/data/wiki/20231101.zh/*.parquet'  ) to  '/data/a.csv'  (FORMAT csv, DELIMITER  '|' , HEADER);  
      
    --polardb  
    psql   
    copy wiki from  '/data/a.csv'  with (format csv, header on, DELIMITER  '|' , escape  '"' , QUOTE  '"' );  
    COPY 1384748  

    Data size

    postgres= # \dt+
                                               List of relations
     Schema | Name | Type | Owner | Persistence | Access method | Size | Description 
    --------+------------------+-------+----------+--------------+---------------+------------+-------------
     public | wiki | table | digoal | permanent | heap | 2241 MB | 
    (20 rows)

    Prepare plugins and local model services

    Create the corresponding plug-in

    $ psql  
    psql (PostgreSQL 15.12 (PolarDB 15.12.3.0 build e1e6d85b debug) on aarch64-linux-gnu)  
    Type "help" for help.  

    postgres= # \d wiki  
                         Table  "public.wiki"
     Column | Type | Collation | Nullable | Default   
    ---------+-------------------+-----------+----------+---------  
     id |  integer            | | |   
     url | character varying | | |   
     title | character varying | | |   
     content | character varying | | |   

    postgres= # select count(*) from wiki;  
      count    
    ---------  
     1384748  
    (1 row)  

    postgres= # create extension pg_bigm;  
    CREATE EXTENSION  
    postgres= # create extension pg_jieba;  
    CREATE EXTENSION  
    postgres= # create extension vector;  
    CREATE EXTENSION  

    create extension http;  
    create extension openai;   

    Using local/remote models

    $ ollama list  
    NAME ID SIZE MODIFIED       
    qwen2.5-coder:3b e7149271c296 1.9 GB 5 hours ago       
    ds-qwen2.5-1.5b-digoal:latest 2c6aaa8a497c 3.6 GB 7 weeks ago       
    qwen2.5:1.5b 65ec06548149 986 MB 7 weeks ago       
    DeepSeek-r1:7b 0a8c26691023 4.7 GB 2 months ago      
    qwen_1.5b_test1:latest 682ad25636bd 1.1 GB 2 months ago      
    deepseek-r1:1.5b a42b25d8c10a 1.1 GB 2 months ago      
    deepseek-r1:14b ea35dfe18182 9.0 GB 2 months ago      
    mxbai-embed-large:latest 468836162de7 669 MB 4 months ago   

    Start the local model service:

    OLLAMA_HOST=0.0.0.0:11434 OLLAMA_KEEP_ALIVE=-1 nohup ollama serve >> ~/.ollama.log 2>&1 &  

    Specify the model API address

    psql  
      
    alter database postgres  set  openai.api_uri =  'http://host.docker.internal:11434/v1/' ;  

    Data processing and retrieval optimization

    1. Segmentation/Slicing

    You can use pgai or postgresml to chunk the raw data:

    https://github.com/postgresml/postgresml?tab=readme-ov-file#chunk

    https://postgresml.org/docs/open-source/pgml/api/pgml.chunk

    https://www.cybertec-postgresql.com/en/pgai-importing-wikipedia-into-postgresql/

    https://python.langchain.com/docs/how_to/#text-splitters

    https://python.langchain.com/docs/integrations/splitters/writer_text_splitter/#setup

    How to choose a chunking algorithm:

    https://docs.dify.ai/zh-hans/guides/knowledge-base/create-knowledge-and-upload-documents/chunking-and-cleaning-text

    The following examples use pgml for chunking.

    First deploy Rust; refer to:

    • PostgreSQL pg_bm25 (open-sourced by ParadeDB): Elastic-quality full-text search inside Postgres, with 20x better performance than tsvector/ts_rank

    Deploy Rust

    curl --proto  '=https'  --tlsv1.2 -sSf https://sh.rustup.rs | sh  
    echo ". \"\$HOME/.cargo/env\""  >> ~/.bashrc   
    . ~/.bashrc  
      
      
    $ rustc --version  
    rustc 1.86.0 (05f9846f8 2025-03-31)  

    Configure a domestic (China) crates mirror

    echo "
    [source.crates-io]
    replace-with = 'ustc'

    [source.ustc]
    registry = \"sparse+https://mirrors.ustc.edu.cn/crates.io-index/\"
    " >> /home/postgres/.cargo/config.toml  

    Download the pgml plugin code

    cd  /data  
    git  clone  --depth 1 -b v2.10.0 https://github.com/postgresml/postgresml  

    Compile and install the pgml plugin dependency environment

    cd  /data/postgresml/pgml-extension/  

    # For the required cargo-pgrx version, refer to the Cargo.toml of the pgml release you use   
    cargo install --locked --version 0.12.9 cargo-pgrx      
    cargo pgrx init     # creates PGRX_HOME; exit immediately with Ctrl-C          
    cargo pgrx init --pg15=`which pg_config`     # ignore the warning     

       Validating /home/postgres/tmp_polardb_pg_15_base/bin/pg_config  
     Initializing data directory at /home/postgres/.pgrx/data-15  


    sudo apt-get update  
    sudo apt-get install -y lld libopenblas-dev  

    Compile and install the pgml plugin

    PGRX_IGNORE_RUST_VERSIONS=y cargo pgrx install --release --pg-config `which pg_config`   

    Install dependencies for running pgml

    sudo pip3 install xgboost lightgbm scikit-learn langchain  

    Configure postgresql.conf

    vi ~/primary/postgresql.conf  
      
    shared_preload_libraries= 'pgml,$libdir/polar_vfs,$libdir/polar_worker'  

    Restart PolarDB for PostgreSQL

    pg_ctl restart -m fast -D ~/primary  

    Install the plugin in the database

    postgres= # create extension pgml;  
    INFO: Python version: 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0], executable: /usr/bin/python3  
    INFO: Scikit-learn 1.6.1, XGBoost 3.0.0, LightGBM 4.6.0, NumPy 2.2.4  
    CREATE EXTENSION  

    postgres= # \dx  
                                                  List of installed extensions  
            Name | Version | Schema | Description                                   
    ---------------------+---------+------------+----------------------------------------------------------------------------  
     http | 1.6 | public | HTTP client  for  PostgreSQL, allows web page retrieval inside the database.  
     openai | 1.0 | public | OpenAI client.  
     pg_bigm | 1.2 | public | text similarity measurement and index searching based on biggrams  
     pg_bulkload | 3.1.22 | public | pg_bulkload is a high speed data loading utility  for  PostgreSQL  
     pg_jieba | 1.1.0 | public | a parser  for  full-text search of Chinese  
     pg_stat_statements | 1.10 | public | track planning and execution statistics of all SQL statements executed  
     pg_trgm | 1.6 | public | text similarity measurement and index searching based on trigrams  
     pgml | 2.10.0 | pgml | Machine Learning and AI  functions  from postgresml.org  
     pgstattuple | 1.5 | public | show tuple-level statistics  
     plpgsql | 1.0 | pg_catalog | PL/pgSQL procedural language  
     plpython3u | 1.0 | pg_catalog | PL/Python3U untrusted procedural language  
     polar_feature_utils | 1.0 | pg_catalog | PolarDB feature utilization  
     polar_vfs | 1.0 | public | polar virtual file system  for  different storage  
     sslinfo | 1.2 | public | information about SSL certificates  
     vector | 0.8.0 | public | vector data  type  and ivfflat and hnsw access methods  
    (15 rows)  

    Use pgml's chunk interface to slice the Wikipedia data imported earlier

    -- Create an index first  
    postgres= # create index on wiki (id);  
    CREATE INDEX  
      
    -- Slice function pgml.chunk test  
    postgres= # SELECT id, (pgml.chunk('character', content)).chunk_index from wiki where id=100;  
     id | chunk_index   
    -----+-------------  
     100 | 1  
     100 | 2  
     100 | 3  
     100 | 4  
     100 | 5  
    (5 rows)  

    The chunk function supports the following splitters; choose the one appropriate for your content.

    • https://github.com/postgresml/postgresml/blob/master/pgml-extension/src/bindings/langchain/langchain.py
    SPLITTERS = {  
        "character" : CharacterTextSplitter,  
        "latex" : LatexTextSplitter,  
        "markdown" : MarkdownTextSplitter,  
        "nltk" : NLTKTextSplitter,  
        "python" : PythonCodeTextSplitter,  
        "recursive_character" : RecursiveCharacterTextSplitter,  
        "spacy" : SpacyTextSplitter,  
    }  

    Chunk function parameters

    postgres= # \x  
    Expanded display is on.  
    postgres= # \df+ pgml.chunk  
    List of  functions
    -[ RECORD 1 ]-------+-----------------------------------------------------------------  
    Schema | pgml  
    Name | chunk  
    Result data  type     | TABLE(chunk_index bigint, chunk text)  
    Argument data types | splitter text, text text, kwargs jsonb DEFAULT  '{}' ::jsonb  
    Type | func  
    Volatility | immutable  
    Parallel | safe  
    Owner | postgres  
    Security | invoker  
    Access privileges |   
    Language | c  
    Source code | chunk_wrapper  
    Description |   

    The last parameter, kwargs, accepts the parameters of the corresponding langchain splitter. For example, CharacterTextSplitter:

    • https://python.langchain.com/docs/how_to/character_text_splitter/
    from langchain_text_splitters import CharacterTextSplitter  
      
    # Load an example document  
    with open( "state_of_the_union.txt" ) as f:  
        state_of_the_union = f.read()  

    text_splitter = CharacterTextSplitter(  
        separator = "\n\n" ,  
        chunk_size=1000,  
        chunk_overlap=200,  
        length_function=len,  
        is_separator_regex=False,  
    )  
    texts = text_splitter.create_documents([state_of_the_union])  
    print (texts[0])  

    Test the chunking parameters. Reducing the chunk size increases the number of chunks and makes each chunk smaller.


    postgres= # SELECT id, (pgml.chunk('character', content, '{"chunk_size": 500, "chunk_overlap": 100}'::jsonb)).chunk_index from wiki where id=100;  
     id | chunk_index   
    -----+-------------  
     100 | 1  
     100 | 2  
     100 | 3  
     ...  
     100 | 40  
    (40 rows)  
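    The chunk_size/chunk_overlap mechanics can be illustrated with a minimal pure-Python sketch. (This is not langchain's actual implementation, which splits on a separator and then merges pieces; the fixed-window version below only shows why a smaller chunk_size yields more, smaller, overlapping chunks.)

```python
# Illustrative fixed-window chunker: each chunk is chunk_size characters,
# and consecutive chunks share chunk_overlap characters.

def chunk(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    assert 0 <= chunk_overlap < chunk_size
    step = chunk_size - chunk_overlap          # how far the window advances
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

if __name__ == "__main__":
    doc = "x" * 2000
    print(len(chunk(doc, 1000, 200)))   # 3 chunks
    print(len(chunk(doc, 500, 100)))    # 5 chunks: smaller chunk_size, more chunks
```

    The overlap exists so that a sentence falling on a chunk boundary is still fully contained in at least one chunk, which helps recall at retrieval time.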

    postgres=# SELECT id, (pgml.chunk('character', content, '{"chunk_size": 500, "chunk_overlap": 100}'::jsonb)).* from wiki where id=100;
    -[ RECORD 1 ]-----------------------------------------------------------------------------------------------------
    id          | 100
    chunk_index | 1
    chunk       | Agriculture is a primary industry, including crop planting, animal husbandry, fishery, forestry and other activities, responsible for the supply of staple foods, non-staple foods and cash crops. The main products of agriculture are food, fiber, energy and raw materials (such as rubber); food includes grains, vegetables, fruits, edible oil, meat, dairy products, eggs and fungi. Global agriculture produces about 11 billion tons of food, 32 million tons of natural fiber and 4 billion cubic meters of wood each year, yet 14% of this food is wasted before it reaches the retail stage. Since the beginning of the 20th century, industrial agriculture based on single-crop cultivation has become the main source of world agricultural output.+
                | +
                | The emergence of agriculture was a milestone in the transition of human civilization to settled life. With the domestication, cultivation and breeding of wild animals and plants, people obtained sufficient food and resources, which promoted the development and formation of early cities. Humans began to gather wild grains 105,000 years ago, but crop cultivation only began about 11,500 years ago, and sheep, goats, pigs, cattle and other livestock were domesticated about 10,000 years ago. Crop cultivation developed independently in at least 11 regions of the world.
    -[ RECORD 2 ]-----------------------------------------------------------------------------------------------------
    id          | 100
    chunk_index | 2
    chunk       | The development of modern agricultural technology, plant breeding and agrochemical products (such as pesticides and fertilizers) has significantly increased crop yields, but has also caused many ecological and environmental problems. Selective breeding and modern animal husbandry technology have likewise increased meat production, while raising concerns about animal welfare and the environment. These environmental issues include climate change, depletion of underground aquifers, deforestation, antibiotic resistance and agriculture-related pollution. Agriculture is both a cause of environmental degradation and a victim of it: biodiversity loss, desertification, soil degradation and climate change all reduce crop yields. The share of sustainable agriculture, including traditional and organic farming focused on ecological balance and local food, is gradually increasing. Genetically modified crops are widely used, but are also banned in some countries.+
                | +
                | Definition+
                | +
                | According to the explanations in the Shuowen Jiezi of the Eastern Han Dynasty and the Kangxi Dictionary of the Qing Dynasty, the character "nong" (农) means farming; that is, in ancient China only planting was called agriculture. The modern definition of agriculture is broader: the use of natural resources to produce the goods needed to sustain life, such as food, fiber, forestry products, horticultural crops, and related services. Agriculture in the broad sense therefore includes planting, horticulture, animal husbandry (livestock raising, aquaculture, etc.) and forestry, although horticulture and forestry are sometimes excluded. In addition, agriculture can be divided into two categories by the object of production: plant agriculture, primarily concerned with the cultivation of crops, and animal agriculture, focused on agricultural animal products.+
                | +
                | History
    -[ RECORD 3 ]-----------------------------------------------------------------------------------------------------
    id          | 100
    chunk_index | 3
    chunk       | Origin+
                | +
                | The world's population has grown significantly since the hunter-gatherer era, thanks to the development of agriculture. Agriculture originated independently in several regions, with at least 11 centers of origin. As early as 105,000 years ago, people began to gather and eat wild cereals. Around 23,000 years ago, people in the Levant began to grow emmer wheat, barley and oats near the Sea of Galilee. The Chinese domesticated rice between 13,500 and 8,200 years ago, with the earliest known cultivation about 7,700 years ago; mung beans, soybeans and azuki beans were later domesticated in the same area. The Mesopotamians domesticated sheep between 13,000 and 11,000 years ago. Cattle were domesticated from aurochs living in what is now Turkey and Pakistan about 10,500 years ago. Domestic pigs were domesticated from wild boar independently in Europe, East Asia and Southeast Asia, the earliest about 10,500 years ago. In the Andes of South America, potatoes were domesticated between 10,000 and 7,000 years ago, followed by beans, coca, alpacas, llamas and guinea pigs. Sorghum was domesticated 7,000 years ago in the Sahel of Africa. Cotton was domesticated 5,600 years ago in Peru and later independently in Eurasia. Maize, derived from wild teosinte, was domesticated 6,000 years ago in Mesoamerica, and horses were domesticated on the Eurasian steppes around 3,500 BC. There has been a great deal of research on the origins of agriculture, and scholars have proposed many hypotheses. The transition from hunting and gathering to agricultural society was accompanied by intensification; in particular, with the change of lifestyle in the Natufian culture of the Levant and the early Neolithic cultures of China, people began to cultivate in their settlements the resources they had gathered from the wild, which led to the gradual domestication of these species.
    -[ RECORD 4 ]-----------------------------------------------------------------------------------------------------
    id          | 100
    chunk_index | 4
    chunk       | Ancient civilizations+
                | +
                | Around 8,000 BC, the Sumerians began to settle in villages, relying on the Tigris and Euphrates rivers to irrigate their crops. The plow appears in their pictographic symbols from 3,000 BC, and the seed plow from 2,300 BC. Sumerian farmers grew barley, wheat, vegetables and fruits. Agriculture in ancient Egypt began in the Predynastic Period at the end of the Paleolithic, around 10,000 BC, and relied mainly on the Nile and its seasonal flooding to irrigate crops; the main crops were barley and wheat, along with cash crops such as flax and papyrus. Ancestors in India domesticated wheat, barley and dates before 9,000 BC, and later domesticated sheep and goats. In the Mehrgarh culture of 8,000-6,000 BC, ancestors in Pakistan domesticated cattle, sheep and goats, and cotton was domesticated in the 5th-4th millennium BC. In addition, archaeological evidence shows that animal-drawn plows appeared in the Indus Valley Civilization around 2,500 BC.
    -[ RECORD 5 ]-----------------------------------------------------------------------------------------------------
    id          | 100
    chunk_index | 5
    chunk       | In China, granary construction and sericulture for silk production began in the 5th century BC. Water mills and irrigation came into use in the 1st century AD, and heavy plows with metal plowshares and moldboards appeared in the late 2nd century AD; these technologies then spread westward across Eurasia. According to molecular-clock estimates, rice was domesticated in Asia about 13,500-8,200 years ago in the Pearl River basin, its original species being wild rice. The main grains grown in ancient Greece and Rome were wheat, emmer and barley; vegetables included peas, beans and olives; sheep and goats were the main livestock, kept for dairy products.

    ...  
      

    Write the chunked data into a new table with the columns "PK, original ID, chunk index, chunk content, vector, tag array". Also add columns to the original table: an array column that stores, for each original article, the new-table PKs of all its chunks, plus a vector column.

    -- Store chunks and the tags extracted from them
    create table wiki_cks (
      id serial primary key,
      wiki_id int,
      ckid int,
      content text,
      vec vector(1024),
      tags text[]
    );

    -- Store QA pairs extracted from chunks
    -- Two separate q and a columns were the original plan, but the local model is not
    -- powerful enough to reliably emit array-formatted output, so the QA pair is stored as plain text.
    create table wiki_cks_qa (
      id serial primary key,
      wiki_id int,
      ckid int,
      qa text,
      vec vector(1024)
    );

    -- Add the chunk-ID mapping and a vector column to the original table
    alter table wiki add column ckids int[];
    alter table wiki add column vec vector(1024);

    The examples below use a small sample of the data, for demonstration purposes only.

    create table wiki1 (like wiki including all);
    insert into wiki1 select * from wiki limit 100;
    alter table wiki rename to wiki_full;
    alter table wiki1 rename to wiki;

    Write the chunks

    insert into wiki_cks (wiki_id, ckid, content) select id, chunk_index, chunk from (
      SELECT id, (pgml.chunk('character', content, '{"chunk_size": 500, "chunk_overlap": 100}'::jsonb)).* from wiki
    ) t;

    2. Process the segmented data

    The following uses the openai plugin; see:

    • "Poor Guy Plays with PolarDB RAC Write-Once-Read-Many Cluster Series | Accessing Private Large Model Services"

    1. Use the model to summarize Q/A pairs

    Adjust the default temperature (recreate openai.prompt with the temperature pinned to 0)

    CREATE OR REPLACE FUNCTION openai.prompt(context text, prompt text, model text DEFAULT NULL::text)
     RETURNS text
     LANGUAGE plpgsql
    AS $function$
    DECLARE
        js jsonb;
        req http_request;
        res http_response;
        api_key text;
        api_uri text;
        uri text;
        chat_path text := 'chat/completions';
    BEGIN

        -- Fetch settings for the API key and URI
        api_key := current_setting('openai.api_key', true);
        api_uri := current_setting('openai.api_uri', true);

        IF api_uri IS NULL THEN
            RAISE EXCEPTION 'OpenAI: the ''openai.api_uri'' is not currently set';
        END IF;

        IF api_key IS NULL THEN
            RAISE EXCEPTION 'OpenAI: the ''openai.api_key'' is not currently set';
        END IF;

        -- A user-specified model overrides the GUC
        IF model IS NULL THEN
            model := current_setting('openai.prompt_model', true);
        END IF;

        uri := api_uri || chat_path;
        RAISE DEBUG 'OpenAI: querying %', uri;

        -- https://platform.openai.com/docs/guides/text-generation#building-prompts
        js := jsonb_build_object('model', model, 'temperature', 0, -- pin the temperature to 0
            'messages', json_build_array(
                jsonb_build_object('role', 'system', 'content', context),
                jsonb_build_object('role', 'user', 'content', prompt)
                )
            );

        RAISE DEBUG 'OpenAI: payload %', js;

        -- Construct the HTTP request
        req := (
            'POST',
            uri,
            ARRAY[http_header('Authorization', 'Bearer ' || api_key)],
            'application/json',
            js
        )::http_request;

        -- Execute the HTTP request
        res := http(req);

        -- Log the response for debugging purposes
        RAISE DEBUG 'OpenAI: Response Status: %', res.status;
        RAISE DEBUG 'OpenAI: Content: %', res.content;

        -- Raise if the response status code is not 200
        IF res.status != 200 THEN
            js := res.content::jsonb;
            RAISE EXCEPTION 'OpenAI: request failed with status %, message: %', res.status, js->'error'->>'message';
        END IF;

        -- Extract the completion text from the JSON response
        js := res.content::jsonb;
        RETURN js->'choices'->0->'message'->>'content';
    END;
    $function$;
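    The jsonb payload that the function assembles is a standard chat-completions request body. The same construction can be sketched in Python to make the structure explicit (the model name is just the one used later in this article; `build_prompt_payload` is a hypothetical helper, not part of the plugin):

```python
import json

def build_prompt_payload(context: str, prompt: str, model: str) -> str:
    """Build the same chat/completions request body the PL/pgSQL function assembles."""
    payload = {
        "model": model,
        "temperature": 0,  # pinned to 0, matching the function above
        "messages": [
            {"role": "system", "content": context},
            {"role": "user", "content": prompt},
        ],
    }
    return json.dumps(payload, ensure_ascii=False)

body = build_prompt_payload("You are an expert...", "Summarize the content.", "qwen2.5:1.5b")
print(body)
```

    The system message carries the task instructions and output format, the user message carries the chunk to summarize; temperature 0 makes the summaries as deterministic as the model allows.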

    Set http request timeout

    SELECT http_set_curlopt('CURLOPT_TIMEOUT', '36000');

    Test QA summarization

    select openai.prompt(
      $$You are an expert in reading comprehension. You are good at summarizing key questions and related answers based on the content, and you return them strictly in the specified format without outputting any redundant content. An example of the output format:
        'Which country is the largest in the East? China.' 'What are China's four great inventions? Papermaking, gunpowder, the compass, movable-type printing.'$$,
      $$Please summarize the key questions and related answers based on the following content, no more than 5 pairs. <content>$$||content||'</content>',
      'qwen2.5:1.5b'
    ) from wiki_cks limit 1;

                                                         prompt                                                        
    -----------------------------------------------------------------------------------------------------------------  
     1. Which novel of the same name is the movie "Meeting Socrates at a Late-Night Gas Station" adapted from? +
        - It is based on Dan Millman's 1980 novel of the same name. +
                                                                                                                    +
     2. Who are the main actors in the movie? +
        - It stars Scott Mechlowicz, Nick Nolte and Amy Smart. +
                                                                                                                    +
     3. How was the film reviewed? +
        - Reviews were polarized: a Rotten Tomatoes rating of 25% with an average score of 4.7/10, but an audience score of 76%; Metacritic scored it 40, indicating mixed reviews.
    (1 row)  

    That feels like too many questions; it is better to generate just the single most important one. Generate the QA pairs:

    insert into wiki_cks_qa (wiki_id, ckid, qa)
    select wiki_id, ckid, openai.prompt(
      $$You are an expert in reading comprehension. You are good at summarizing the most critical question and its answer based on the content, and you return them strictly in the specified format without outputting any redundant content. An example of the output format:
      What are the four great inventions of China? Papermaking, gunpowder, the compass, movable-type printing.$$,
      $$Please summarize the most critical question and its answer based on the following content. <content>$$||content||'</content>',
      'qwen2.5:1.5b'
    ) from wiki_cks;
      
    INSERT 0 200  

    2. Use the model to summarize tags (keywords)

    Extract keyword examples

    select openai.prompt(
      $$You are an expert in reading comprehension, good at summarizing the core keywords based on the content. If there are multiple keywords, use ',' to separate the keywords.$$,
      $$Please summarize the keywords based on the following content. <content>$$||content||'</content>',
      'qwen2.5:1.5b'
    ) from wiki_cks limit 1;

    -[ RECORD 1 ]----------------
    prompt | Socrates, USA, Germany

    Generate keywords for each segment

    update wiki_cks set tags = ('{'||openai.prompt(
      $$You are an expert in reading comprehension, good at summarizing the core keywords based on the content. If there are multiple keywords, use ',' to separate the keywords.$$,
      $$Please summarize the keywords based on the content below, up to 5 keywords. <content>$$||content||'</content>',
      'qwen2.5:1.5b'
    )||'}')::text[];
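    The `('{'||...||'}')::text[]` cast relies on the model emitting a plain comma-separated list: PostgreSQL splits the braced string on commas to build the array, so a keyword that itself contains `,`, `{`, `}` or a double quote would break or distort the cast. A small Python sketch of the same splitting rule, using hypothetical model outputs:

```python
def tags_to_array(model_output: str) -> list[str]:
    """Mimic ('{' || output || '}')::text[]: split on commas, trim whitespace, drop empties."""
    return [t.strip() for t in model_output.split(",") if t.strip()]

print(tags_to_array("Socrates, USA, Germany"))  # ['Socrates', 'USA', 'Germany']
print(tags_to_array("a,,b"))                    # ['a', 'b']
```

    In production it would be safer to validate or escape the model output (or use `string_to_array`) before casting, since a stray brace from the model would make the whole UPDATE fail.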

    3. Vectorization

    Using embedding model: mxbai-embed-large:latest

    SELECT http_set_curlopt('CURLOPT_TIMEOUT', '36000');
    update wiki set vec = (openai.vector(content, 'mxbai-embed-large:latest'))::vector(1024);
    update wiki_cks set vec = (openai.vector(content, 'mxbai-embed-large:latest'))::vector(1024);
    update wiki_cks_qa set vec = (openai.vector(qa, 'mxbai-embed-large:latest'))::vector(1024);
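    Semantic search over these vector(1024) columns ranks rows by vector distance; with pgvector, the `<=>` operator computes cosine distance. A minimal Python sketch of that computation, assuming the embeddings are plain float lists (short 2-dimension vectors are used here just to keep the example readable):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance, as computed by pgvector's <=> operator: 1 - cos(a, b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # 0.0 - identical direction
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0 - orthogonal
```

    Smaller distance means more semantically similar, so recall queries later in the article `ORDER BY vec <=> query_vec LIMIT k`.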

    4. Create an index

    • Inverted (GIN) index on the Chinese word segmentation of the original text and Q/A: used for tokenized full-text search
    • Inverted (GIN) index on the original text and Q/A: used for fuzzy queries (choose either tokenized or fuzzy search)
    • Inverted (GIN) index on the tag array: used for tag matching (containment, overlap)
    • Vector index on the vector columns: used for semantic search (vector similarity)

    Configure PolarDB

    vi ~/primary/postgresql.conf
    shared_preload_libraries='pg_jieba,pg_bigm,pgml,$libdir/polar_vfs,$libdir/polar_worker'
      
    pg_ctl restart -m fast -D ~/primary  

    Test jieba word segmentation

    select to_tsvector('jiebacfg', content) from wiki limit 1;

    ' ':96,97,99,173,174,176,196,214,227,228,230,313,314,317,318,321,323,325,327,329,331,332,335,340,346,350,353,356,361,369,372,376 '10':267 '1980':57 '2.5':302 '2006':13,66,336,341,347 '2007':79 '25':257 '30':83 '4.7':263 '40':280 '73':252 '76':275 'dan':192 'metacritic':278 'millman':194 'socrates':208 ...

    Test whether tokenization for fuzzy queries works as expected. Recommended: the database lc_ctype should not be C.

    • "PostgreSQL fuzzy query enhancement plugins pgroonga, pg_bigm (covering single-character, two-character, multi-character, and multi-byte fuzzy queries) - also supports JSON fuzzy query, etc."
    • "PostgreSQL Fuzzy Query Best Practices (covering single-character, two-character, and multi-character fuzzy query methods)"
    postgres=# select show_bigm('刘德华');  
            show_bigm          
    -------------------------  
     {刘德,"华 ",德华," 刘"}  
    (1 row)  
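    pg_bigm works on 2-grams: the string is padded with one space on each side and split into every consecutive two-character window. A minimal Python approximation of that extraction (an illustration, not the plugin itself):

```python
def show_bigm(text: str) -> set[str]:
    """Approximate pg_bigm's 2-gram extraction: pad the string with one
    space on each side, then take every consecutive 2-character window."""
    padded = f" {text} "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

# A 3-character string yields four bigrams, including the padded edge grams.
assert show_bigm("刘德华") == {" 刘", "刘德", "德华", "华 "}
```

    The leading/trailing single-character grams are what let the GIN index match a pattern anchored at the start or end of the text.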

    1. Create a fuzzy query index

    CREATE INDEX ON wiki USING gin (content gin_bigm_ops);  
    CREATE INDEX ON wiki_cks USING gin (content gin_bigm_ops);  
    CREATE INDEX ON wiki_cks_qa USING gin (qa gin_bigm_ops);  

    2. Create a word segmentation query index. You can choose between fuzzy query and Chinese word segmentation.

    CREATE INDEX ON wiki USING gin (to_tsvector( 'jiebacfg' , content));  
    CREATE INDEX ON wiki_cks USING gin (to_tsvector( 'jiebacfg' , content));  
    CREATE INDEX ON wiki_cks_qa USING gin (to_tsvector( 'jiebacfg' , qa));  

    3. Create an inverted index of the label array

    CREATE INDEX ON wiki_cks USING gin (tags);  

    4. Create a vector field index. Note that recall depends not only on the parameters used when building the index but also on the parameters used at query time; refer to the following article.

    • 《It’s a headache! The results of index scan and full table scan are different. Can this vector database still be used? Teach you a trick to greatly improve the recall rate》
    set  max_parallel_maintenance_workers =2;    
    set  maintenance_work_mem = '256MB' ;    
    ALTER TABLE wiki SET (parallel_workers = 2);    
    ALTER TABLE wiki_cks SET (parallel_workers = 2);    
    ALTER TABLE wiki_cks_qa SET (parallel_workers = 2);    
    create index ON wiki USING hnsw (vec vector_cosine_ops) with (m = 100, ef_construction = 1000);     
    create index ON wiki_cks USING hnsw (vec vector_cosine_ops) with (m = 100, ef_construction = 1000);     
    create index ON wiki_cks_qa USING hnsw (vec vector_cosine_ops) with (m = 100, ef_construction = 1000);     

    5. Retrieval stage

    Hybrid search + rerank

    Semantically similar top_n
    Fuzzy query top_n
    Word segmentation query top_n
    Tag query top_n
    Merge the above results, rerank, and take top_k 
    Expand the context segment (for example, take N segments before and after) 
    RAG 
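    The pipeline above can be sketched in a few lines of Python. The four candidate lists and the scoring function here are stand-ins: in practice the lists come from the vector, fuzzy, full-text, and tag queries, and the score from the reranker model.

```python
def hybrid_rerank(candidate_lists, score, top_k):
    """Merge candidate chunk ids from several retrievers, dedupe
    (preserving first-seen order), rerank by score, take top_k."""
    merged, seen = [], set()
    for lst in candidate_lists:
        for ckid in lst:
            if ckid not in seen:
                seen.add(ckid)
                merged.append(ckid)
    return sorted(merged, key=score, reverse=True)[:top_k]

def expand_context(ckids, n=1):
    """Add the n chunks before and after each hit (context expansion)."""
    out = set()
    for ckid in ckids:
        out.update(range(ckid - n, ckid + n + 1))
    return sorted(c for c in out if c >= 0)

# Toy candidate lists standing in for the four retrieval queries.
semantic, fuzzy, fulltext, tags = [3, 7], [7, 9], [3, 12], [9]
top = hybrid_rerank([semantic, fuzzy, fulltext, tags],
                    score=lambda ckid: -ckid, top_k=3)  # toy score
print(top)                   # [3, 7, 9]
print(expand_context(top))   # [2, 3, 4, 6, 7, 8, 9, 10]
```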

    Test question: What story does the novel Mikan tell?

    1. Semantically similar top_n

    set  hnsw.ef_search = 1000;    
      
    with a as (  
      select (openai.vector( 'What story does the novel Mikan tell?', 'mxbai-embed-large:latest' ))::vector(1024) as v   
    ),  
    b as (  
      select a.*,c.* from a join lateral (select *, vec <=> a.v as simi from wiki_cks order by vec <=> a.v  limit  10) c on (1=1)   
    )  
    select simi, b.* from b;   

    Operators involved

    postgres= # \do+ <=>  
                                                List of operators  
     Schema | Name | Left arg  type  | Right arg  type  | Result  type     | Function | Description   
    --------+------+---------------+----------------+------------------+------------------------+-------------  
     public | <=> | halfvec | halfvec | double precision | public.cosine_distance |   
     public | <=> | sparsevec | sparsevec | double precision | public.cosine_distance |   
     public | <=> | vector | vector | double precision | public.cosine_distance |   
    (3 rows)  
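    All three `<=>` variants compute the cosine distance, i.e. 1 minus the cosine similarity. A quick self-contained check of what that means:

```python
import math

def cosine_distance(a, b):
    """pgvector's <=> operator: 1 - (a·b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

print(cosine_distance([1, 0], [1, 0]))   # 0.0 (same direction)
print(cosine_distance([1, 0], [0, 1]))   # 1.0 (orthogonal)
print(cosine_distance([1, 0], [-1, 0]))  # 2.0 (opposite)
```

    Smaller is more similar, which is why the queries order by `vec <=> ...` ascending.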

    2. Fuzzy query top_n

    Extract keywords

    select openai.prompt(       
      $$You are an expert in reading comprehension, good at summarizing the core keywords based on the sentence, and directly return the keywords instead of answering the question.$$,       
      $$Please summarize the keywords based on this sentence: <content> What story does the novel Mikan tell? </content>$$,       
      'qwen2.5:1.5b'       
    );  
      
     prompt   
    --------  
     Mikan  
    (1 row)  

    Use keyword fuzzy search

    select * from wiki_cks  where  content like  '%Mikan%' ;  

    postgres= # explain (analyze,verbose,timing,costs,buffers) select * from wiki_cks where content like '%Mikan%';  
                                                              QUERY PLAN                                                            
    ------------------------------------------------------------------------------------------------------------------------------  
     Bitmap Heap Scan on public.wiki_cks (cost=12.02..18.91 rows=2 width=1028) (actual time=0.140..0.152 rows=3 loops=1)  
       Output: id, wiki_id, ckid, content, vec, tags  
       Recheck Cond: (wiki_cks.content ~~  '%Mikan%' ::text)  
       Heap Blocks: exact=2  
       Buffers: shared hit=5  
       -> Bitmap Index Scan on wiki_cks_content_idx (cost=0.00..12.02 rows=2 width=0) (actual time=0.087..0.087 rows=3 loops=1)  
             Index Cond: (wiki_cks.content ~~  '%Mikan%' ::text)  
             Buffers: shared hit=3  
     Planning:  
       Buffers: shared hit=1  
     Planning Time: 1.425 ms  
     Execution Time: 0.270 ms  
    (12 rows)  

    Then use embeddings to filter the candidates by similarity again (details omitted here).
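    As a sketch of that embedding re-filtering step, rank the fuzzy-match candidates by cosine distance to the question vector (the chunk texts and toy 2-dimensional vectors below are hypothetical; in practice you would use the 1024-dimensional vectors in the vec column):

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

# Hypothetical candidates returned by the LIKE '%Mikan%' query,
# mapped to their (toy) embeddings.
candidates = {
    "chunk about the novel": [0.9, 0.1],
    "chunk that merely mentions the word": [0.1, 0.9],
}
question_vec = [1.0, 0.0]

# Order candidates by distance to the question; keep the closest first.
filtered = sorted(candidates,
                  key=lambda c: cosine_distance(question_vec, candidates[c]))
print(filtered[0])  # chunk about the novel
```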

    You can also use the pg_bigm plugin's similarity filtering, as shown below. However, it is not well suited to matching a short sentence against long texts: the similarity score comes out too low.

    • https://github.com/pgbigm/pg_bigm/blob/REL1_2_STABLE/docs/pg_bigm_en.md#pg_bigmsimilarity_limit
    explain (analyze,verbose,timing,costs,buffers) select * from wiki_cks  where  content =%  'Mikan' ;  

    Operators involved

    postgres= # \do+ =%  
                                            List of operators  
     Schema | Name | Left arg  type  | Right arg  type  | Result  type  | Function | Description   
    --------+------+---------------+----------------+-------------+--------------------+-------------  
     public | =% | text | text | boolean | bigm_similarity_op |   
    (1 row)  

    3. Word segmentation query top_n

    You can choose between word segmentation and fuzzy query. The following demonstrates how to use word segmentation query.

    Get keywords from the question

    postgres= # select openai.prompt(       
      $$You are an expert in reading comprehension, good at summarizing several core keywords based on sentences, using | to separate them, and directly returning keywords instead of answering questions.$$,       
      $$Please summarize the keywords based on this sentence: <content> What story does the novel Mikan tell? </content>$$,       
      'qwen2.5:1.5b'       
    );  
          prompt        
    ------------------  
     Mikan|Novel|Story  
    (1 row)  

    Use word search

    select * from wiki_cks  where  to_tsvector( 'jiebacfg', content ) @@ to_tsquery( 'jiebacfg' ,  '蜜桔|小说|故事' );  

    explain (analyze,verbose,timing,costs,buffers) select * from wiki_cks  where  to_tsvector( 'jiebacfg', content ) @@ to_tsquery( 'jiebacfg' ,  '蜜桔|小说|故事' );  

                                                                         QUERY PLAN                                                                        
    -----------------------------------------------------------------------------------------------------------------------------------------------------  
     Bitmap Heap Scan on public.wiki_cks (cost=28.02..38.73 rows=3 width=1028) (actual time=0.108..0.143 rows=10 loops=1)  
       Output: id, wiki_id, ckid, content, vec, tags  
       Recheck Cond: (to_tsvector('jiebacfg'::regconfig, wiki_cks.content) @@ '''蜜桔'' | ''小说'' | ''故事'''::tsquery)  
       Heap Blocks: exact=7  
       Buffers: shared hit=14  
       -> Bitmap Index Scan on wiki_cks_to_tsvector_idx (cost=0.00..28.02 rows=3 width=0) (actual time=0.082..0.083 rows=10 loops=1)  
             Index Cond: (to_tsvector('jiebacfg'::regconfig, wiki_cks.content) @@ '''蜜桔'' | ''小说'' | ''故事'''::tsquery)  
             Buffers: shared hit=7  
     Planning:  
       Buffers: shared hit=1  
     Planning Time: 0.671 ms  
     Execution Time: 0.222 ms  
    (12 rows)  

    Operators involved

    postgres= # \do+ @@  
                                                  List of operators  
       Schema | Name | Left arg  type  | Right arg  type  | Result  type  | Function | Description      
    ------------+------+---------------+----------------+-------------+----------------------+-------------------  
     pg_catalog | @@ | tsvector | tsquery | boolean | ts_match_vq | text search match  

    4. Tag query top_n

    Using an array containment query

    select * from wiki_cks  where  tags @>  '{蜜桔,小说,故事}' ::text[];  

    explain (analyze,verbose,timing,costs,buffers) select * from wiki_cks  where  tags @>  '{蜜桔,小说,故事}' ::text[];  

     Bitmap Heap Scan on public.wiki_cks (cost=16.00..20.01 rows=1 width=1028) (actual time=0.187..0.189 rows=0 loops=1)  
       Output: id, wiki_id, ckid, content, vec, tags  
       Recheck Cond: (wiki_cks.tags @>  '{蜜桔,小说,故事}' ::text[])  
       Buffers: shared hit=7  
       -> Bitmap Index Scan on wiki_cks_tags_idx (cost=0.00..16.00 rows=1 width=0) (actual time=0.180..0.181 rows=0 loops=1)  
             Index Cond: (wiki_cks.tags @>  '{蜜桔,小说,故事}' ::text[])  
             Buffers: shared hit=7  
     Planning:  
       Buffers: shared hit=1  
     Planning Time: 0.516 ms  
     Execution Time: 0.283 ms  
    (11 rows)  

    Using an array intersection (overlap) query

    select * from wiki_cks  where  tags &&  '{蜜桔,小说,故事}' ::text[];  

    explain (analyze,verbose,timing,costs,buffers) select * from wiki_cks  where  tags &&  '{蜜桔,小说,故事}' ::text[];  

     Bitmap Heap Scan on public.wiki_cks (cost=16.02..22.91 rows=2 width=1028) (actual time=0.152..0.156 rows=1 loops=1)  
       Output: id, wiki_id, ckid, content, vec, tags  
       Recheck Cond: (wiki_cks.tags &&  '{蜜桔,小说,故事}' ::text[])  
       Heap Blocks: exact=1  
       Buffers: shared hit=8  
       -> Bitmap Index Scan on wiki_cks_tags_idx (cost=0.00..16.01 rows=2 width=0) (actual time=0.128..0.128 rows=1 loops=1)  
             Index Cond: (wiki_cks.tags &&  '{蜜桔,小说,故事}' ::text[])  
             Buffers: shared hit=7  
     Planning:  
       Buffers: shared hit=4  
     Planning Time: 0.820 ms  
     Execution Time: 0.255 ms  
    (12 rows)  

    Operators involved

    postgres= # \do+ @>  
                                                    List of operators  
       Schema | Name | Left arg  type  | Right arg  type  | Result  type  | Function | Description   
    ------------+------+---------------+----------------+-------------+--------------------------------+-------------  
     pg_catalog | @> | anyarray | anyarray | boolean | arraycontains | contains  
      
    postgres= # \do+ &&  
                                                              List of operators  
       Schema | Name | Left arg  type  | Right arg  type  | Result  type  | Function | Description              
    ------------+------+---------------+----------------+-------------+--------------------------------+----------------------------------  
     pg_catalog | && | anyarray | anyarray | boolean | arrayoverlap | overlaps  
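    The two operators have simple set semantics: `tags @> '{...}'` tests whether the row's tags are a superset of the query array, while `tags && '{...}'` tests for a non-empty intersection. Illustrated with Python sets:

```python
def contains(tags, query):
    """Semantics of PostgreSQL's @> on arrays: superset test."""
    return set(query) <= set(tags)

def overlaps(tags, query):
    """Semantics of PostgreSQL's && on arrays: non-empty intersection."""
    return bool(set(tags) & set(query))

row_tags = ["小说", "电影"]
query = ["蜜桔", "小说", "故事"]
print(contains(row_tags, query))  # False: the row lacks 蜜桔 and 故事
print(overlaps(row_tags, query))  # True: 小说 is shared
```

    This matches the plans above: the `@>` query returned 0 rows while the `&&` query returned 1, so `&&` is the more forgiving choice for tag recall.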

    5. Merge the above results, rerank, and take top_k

    Using the rerank model

    For example
    ollama pull linux6200/bge-reranker-v2-m3  

    If you are using a large model cloud service, you can choose the corresponding rerank model.

    Using the ckids obtained in the steps above, fetch the corresponding chunk contents, use the reranker model to score each (question, content) pair (analogous to the embedding example above), and take the top_k.
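    A runnable sketch of this step; `rerank_score` stands in for the call to the reranker model (e.g. bge-reranker-v2-m3), which returns a relevance score for a (question, passage) pair. Here it is a toy word-overlap score so the example is self-contained:

```python
def rerank_score(question, passage):
    """Toy stand-in for a reranker model: fraction of question words
    that appear in the passage."""
    q, p = set(question.split()), set(passage.split())
    return len(q & p) / max(len(q), 1)

def rerank(question, chunks, top_k):
    """Score every chunk against the question and keep the top_k."""
    scored = sorted(chunks, key=lambda c: rerank_score(question, c),
                    reverse=True)
    return scored[:top_k]

chunks = ["the novel tells a story", "unrelated text",
          "a story about a novel"]
print(rerank("novel story", chunks, top_k=2))
```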

    6. Expand the context segment (for example, take 1 segment before and after)

    Using the ckids obtained in step 5, expand to the N fragments before and after each chunk (to supplement the context).

    7. RAG

    Provide the question plus all retrieved fragments as reference content to the model to generate the answer.
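    As a sketch, the final prompt can be assembled like this (the template wording and the `<fragment>` tags are illustrative):

```python
def build_prompt(question, fragments):
    """Assemble the RAG prompt: reference fragments first, then the question."""
    context = "\n\n".join(f"<fragment>{f}</fragment>" for f in fragments)
    return ("Answer the question using only the reference content below.\n"
            f"Reference content:\n{context}\n\n"
            f"Question: {question}")

prompt = build_prompt(
    "What story does the novel Mikan tell?",
    ["fragment one ...", "fragment two ..."],
)
print(prompt)
```

    The assembled prompt is then sent to the chat model (e.g. via openai.prompt) to produce the final answer.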