The Road to Knowledge Base Optimization (Part 2): Feeding AI Its Favorite Data Format, JSON

Written by
Iris Vance
Updated on: July 11, 2025
Recommendation

Explore the efficient path of AI-optimized knowledge base and reveal the charm of JSON format.

Core content:
1. The importance of JSON format in knowledge base optimization
2. The simplicity and AI-friendly features of JSON
3. Data types suitable for conversion to JSON format

Yang Fangxian
Founder of 53AI/Most Valuable Expert of Tencent Cloud (TVP)


It's finally here. Thank you for waiting so long; let's continue talking about the knowledge base.

This is the second article in the series "The Road to Knowledge Base Optimization".

This series is mainly aimed at ordinary novice users. It covers how to do some basic knowledge base optimization purely from the user side, with limited resources and technical skills. I hope it can be of some help to everyone.

The previous article explained that, when using the CherryStudio knowledge base, you can use tools to convert PDF files into Markdown format so that AI can better understand the information in the knowledge base.

The road to knowledge base optimization (I): Convert PDF files to Markdown format

However, not all data is suitable for conversion to Markdown. Some types of data will perform better if converted to JSON format.

What is JSON?

You may not have heard of it, but you definitely can’t live without it.

JSON is a lightweight, widely used data exchange format. You will find it almost anywhere data is transmitted, on today's Internet and beyond.

When you search for information, watch videos, shop, transfer money, chat, watch live broadcasts, and send comments online, countless pieces of information are transmitted rapidly on the Internet in JSON format.

It is no exaggeration to say that JSON is one of the cornerstones of the contemporary Internet.

What does it look like? Let me give you an example.

If you want to record a person's information, the simplest way is probably this:

Zhang San, male, Han nationality, 32 years old, 70 kg, 175 cm tall.

If you wanted to make this information clearer, you might write it like this:

Name: Zhang San
Gender: Male
Nationality: Han
Age: 32
Weight: 70 kg
Height: 175 cm

Congratulations, you invented the JSON format!

In JSON format, the above information looks like this:

{
    "Name": "Zhang San",
    "Gender": "Male",
    "Nationality": "Han",
    "Age": 32,
    "Weight": 70,
    "Height": 175
}

Is this basically the same as what you wrote?

Why does AI prefer JSON?

The core of JSON is the one-to-one "key-value pairs" in the example above (the first item is the "key", the second is the "value"): "Name" corresponds to "Zhang San", "Gender" corresponds to "Male", "Nationality" corresponds to "Han", and so on.

With this simple structured expression, JSON not only records information but also labels each piece with its attribute, making it easier for both the people and the machines receiving the information to understand.

Because JSON is a data format independent of any programming language, almost all programming languages support parsing and generating it. Compared with other data formats such as XML, JSON has a simpler syntax and is usually smaller in file size and more efficient to transmit.
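As a small illustration of parsing and generating JSON, here is a minimal sketch using Python's built-in `json` module. The record is the one from the example above; the variable names (`person`, `text`, `restored`) are only for illustration.

```python
import json

# The record from the example above, written as a Python dict
person = {
    "Name": "Zhang San",
    "Gender": "Male",
    "Nationality": "Han",
    "Age": 32,
    "Weight": 70,
    "Height": 175,
}

# Generate JSON text; ensure_ascii=False keeps non-ASCII characters readable
text = json.dumps(person, ensure_ascii=False, indent=4)

# Parse the JSON text back into a dict
restored = json.loads(text)

print(text)
print(restored["Name"], restored["Age"])
```

Each key in the dict becomes a "key" in the JSON text, and numbers such as the age stay numbers after the round trip instead of turning into strings.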

JSON is also a favorite of AI and large models, because large models are good at processing structured data. Much of the data used to train large models is in JSON format.

The JSON format clearly expresses the relationships between data items, making the data easier for large models to understand and use, which helps them learn and predict better. Its simplicity and wide language support also make parsing and generating JSON data very efficient, which lightens the processing load around large models.

Therefore, when interacting with large models, the JSON format is widely used in data exchange, prompt engineering, and result output.

What data is suitable for conversion to JSON?

From the explanation above, you should have a rough idea of which data is better suited for conversion to JSON format.

For example, the long text discussed in the previous article does not need to be converted. Some of you may have noticed that MinerU, the Markdown conversion tool introduced last time, can also convert PDF directly to JSON format. After looking at the output carefully, however, I found the results were not good, so I did not say more about it.

Most data suitable for conversion to JSON has a clear, fixed structure, including but not limited to the following:

  1. Test questions: This is a requirement for many people. You can first split the questions apart, then convert each question into a JSON object that keeps the question, answer, question type, related knowledge points, and solution ideas together.
  2. Customer service Q&A: This is a very classic usage scenario. The overall structure is similar to the one above, so you can follow the same approach.
  3. Product catalog: The product catalog of an e-commerce website or enterprise usually contains structured information such as product name, description, price, and specifications. This information can be converted into JSON format so that the embedding model can understand the characteristics and attributes of each product.
  4. Legal documents: Certain legal documents, such as contracts and clauses, can be converted into JSON if their key information (parties, subject matter, effective date, and so on) can be extracted in a structured manner.
  5. Electronic medical records: A patient's electronic medical records usually contain structured information such as diagnoses, prescriptions, and test results. This information can be converted into JSON format so that the embedding model can understand the patient's medical history and condition.
  6. Database export data: Data exported from a relational database usually has a clear table structure and field definitions. Each row can be converted into a JSON object, with each field corresponding to a key-value pair.
  7. Data returned by APIs: Most APIs return data in JSON format. This data is usually well structured and can be used directly as input for the embedding model.
  8. By analogy, data similar to the types listed above can also be converted into JSON format.
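To make the first item concrete, here is a minimal sketch of turning one test question into a JSON object. The field names (`question`, `answer`, `question_type`, `knowledge_points`, `solution`) and the sample question are assumptions chosen for illustration; use whatever fields your own material actually has.

```python
import json

def question_to_record(question, answer, question_type, knowledge_points, solution):
    """Bundle one test question and its related fields into a single dict."""
    return {
        "question": question,
        "answer": answer,
        "question_type": question_type,
        "knowledge_points": knowledge_points,
        "solution": solution,
    }

# Hypothetical sample data, for illustration only
records = [
    question_to_record(
        question="What is 12 x 12?",
        answer="144",
        question_type="short answer",
        knowledge_points=["multiplication"],
        solution="Multiply 12 by 10, then add 12 times 2.",
    ),
]

# One JSON array holding one object per question
output = json.dumps(records, ensure_ascii=False, indent=2)
print(output)
```

A customer service Q&A entry could follow the same pattern, with keys such as the question, the answer, and a category.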

How to convert to JSON format?

This problem is a bit complicated, but not difficult to solve.

It is complicated because there are many types of data and many file formats, and the usage scenarios differ. A universal tool that takes whatever data users add and converts it into a JSON file that fully meets their requirements is almost impossible to find.

It is not difficult to solve because we have the most powerful tool: AI.

Since there is no universal tool, we can have AI create specialized tools according to our needs.

There are two methods: have AI generate a web-based tool, or have it generate Python code to do the processing.

1 Let AI generate a web-based JSON conversion tool

I have previously written a tutorial on how to use AI to generate webpage tools:

Tutorial: One command turns DeepSeek into a super tool, a must-have skill in the workplace!

As for the problem at hand, to have AI generate a web-based tool tailored to your own needs, you can refer to the following prompt (taking a CSV file as an example):

To support the user's need to create a local knowledge base, you need to help the user generate a web-based tool that batch-converts the information the user provides into JSON format, so that it is easier for the embedding model to parse and understand.

Specific functional requirements:
1. Upload function: Provides file upload function interface and supports batch uploading of files.
2. Format recognition: Automatically recognize the format of the file, for example [CSV file, separated by commas, the first line is the header row].
3. Data extraction: Extract data from the file and convert it into JSON format according to the following rules:
       - Convert each row of data into a JSON object, using the header row as the key of the JSON object.
       - Data cleaning and transformation rules, e.g. converting age to integer type.
       - If a field is empty, it is set to null in the JSON.
4. JSON output: Each input file is converted into a JSON file; the generated file list is displayed; the function of downloading JSON files is provided; single download and package download are supported.
5. User Interface:
       Simple and intuitive user interface.
       Provides upload progress display.
       Provides error information.
       Allows users to preview the converted JSON data.
6. Technical requirements:
       Generate a single-page web tool. You can use HTML, CSS, and JavaScript, but everything must be in one HTML file.
       You may use existing browser-compatible libraries for file parsing and JSON conversion (e.g. Papa Parse for CSV files).

Other requirements:
   Consider performance optimization for handling large files.
   Please provide the complete code.

Special reminder:

  1. Please do not use the above prompt directly. It is provided only as a reference for ideas and methods.
  2. The content format of each document may be different. You need to adjust the prompt for different types of documents to generate more targeted tools.
  3. If you are not sure how to write the prompt, you can state your requirements clearly and let AI generate the prompt for you.
  4. After converting your data with the AI-generated tool, open the resulting JSON file in Notepad to check the content. If there are any problems, tell AI and let it continue to optimize the tool.
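Besides checking by eye, a quick automated check can catch syntax problems such as a missing comma. The following is a minimal sketch using Python's built-in `json` module; the helper name `check_json_text` and the sample strings are hypothetical, and it only verifies that the text is valid JSON, not that the content matches your requirements.

```python
import json

def check_json_text(text):
    """Return (True, parsed_data) if text is valid JSON, else (False, error_message)."""
    try:
        return True, json.loads(text)
    except json.JSONDecodeError as err:
        # lineno and colno point at where parsing failed
        return False, f"line {err.lineno}, column {err.colno}: {err.msg}"

good = '[{"Name": "Zhang San", "Age": 32}]'
bad = '{"Name": "Zhang San" "Age": 32}'  # missing comma between the two pairs

ok_good, data = check_json_text(good)
ok_bad, message = check_json_text(bad)

print(ok_good, len(data))
print(ok_bad, message)
```

To check a real file, read it first, for example with `open(path, encoding="utf-8").read()`, and pass the text to the function. The reported line and column are a useful detail to give AI when asking it to fix the tool.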

The tool I generated with the above prompt supports adding multiple CSV files, previewing the result directly after conversion, and package download, which basically meets the requirements in the prompt.

2 Use AI to generate Python code to convert to JSON

The overall idea of using Python is the same as above; only the implementation differs. Slightly modify the prompt and you can use it.

Relatively speaking, the barrier to entry for Python is a little higher, and Python needs to be installed locally first. If you find it troublesome or are not familiar with it, you do not need to try this method.

Python is suitable for large-scale, automated processing. The premise, of course, is that you have confirmed through repeated testing that the AI-generated tool produces JSON files that meet your requirements.
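For reference, the kind of script AI might produce for the CSV rules in the earlier prompt could look roughly like the sketch below. It uses only Python's standard library. The function name `csv_text_to_records`, the `int_fields` parameter, and the sample data are assumptions for illustration, not the code the author actually generated.

```python
import csv
import io
import json

def csv_text_to_records(csv_text, int_fields=("Age",)):
    """Convert CSV text (first row = header) into a list of dicts."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = {}
        for key, value in row.items():
            if key is None:
                # Extra cells beyond the header row; ignored in this sketch
                continue
            value = value.strip() if value is not None else ""
            if value == "":
                record[key] = None        # empty field -> JSON null
            elif key in int_fields:
                record[key] = int(value)  # cleaning rule; raises ValueError on bad input
            else:
                record[key] = value
        records.append(record)
    return records

# Hypothetical sample: the second row has an empty Gender field
sample = "Name,Gender,Age\nZhang San,Male,32\nLi Si,,28\n"
records = csv_text_to_records(sample)
print(json.dumps(records, ensure_ascii=False, indent=2))
```

For a real file, you could read the text with `open(path, newline="", encoding="utf-8")` and write the result with `json.dump`. Whether the encoding and delimiter match your files is something to confirm during testing.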

How well does the conversion work?

Taking my own actual use case as an example, I previously collected more than 2,000 ancient Chinese jokes, which were originally stored in a database file.

When I first built the knowledge base, I used Python to export them into a txt file.

When searching the knowledge base, I found that the retrieved fragments often cut a complete story off in the middle.

After I changed the input data to JSON format and looked at the search results again, each fragment was a complete story, that is, a complete JSON object.
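The difference can be illustrated with a toy comparison. This is only a sketch of the general idea, not how CherryStudio actually splits documents: `fixed_size_chunks` stands in for a naive splitter that cuts plain text at a fixed length, and `object_chunks` stands in for treating each JSON object as one fragment. The records are placeholders.

```python
import json

def fixed_size_chunks(text, size):
    """Naive splitter: cut the text every `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def object_chunks(records):
    """One chunk per JSON object, so a record is never cut in half."""
    return [json.dumps(r, ensure_ascii=False) for r in records]

# Placeholder records, for illustration only
records = [
    {"id": 1, "story": "First placeholder story, long enough to cross a chunk boundary."},
    {"id": 2, "story": "Second placeholder story, also kept deliberately long."},
]

plain_text = "\n".join(r["story"] for r in records)
naive = fixed_size_chunks(plain_text, 40)
whole = object_chunks(records)

for chunk in naive:
    print(repr(chunk))   # stories are cut mid-sentence
for chunk in whole:
    print(chunk)         # each chunk is one complete record
```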

In addition to the story itself, the other relevant data stays with it.

When AI gets such data fragments, it knows not only the content of the story but also its source, author, translation, number, and so on.
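A minimal sketch of exporting database rows so that each story keeps its metadata might look like this. It assumes an SQLite database with a hypothetical `jokes` table and hypothetical column names; the author's actual database type and schema are not given, and the row content here is a placeholder.

```python
import json
import sqlite3

# In-memory database standing in for the real database file (assumption)
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets each row be converted to a dict by column name
conn.execute(
    "CREATE TABLE jokes (number INTEGER, source TEXT, author TEXT, "
    "story TEXT, translation TEXT)"
)
conn.execute(
    "INSERT INTO jokes VALUES (?, ?, ?, ?, ?)",
    (1, "placeholder source", "placeholder author",
     "placeholder original text", "placeholder translation"),
)

# Each row becomes one JSON object; each column becomes one key-value pair
rows = conn.execute("SELECT * FROM jokes ORDER BY number").fetchall()
records = [dict(row) for row in rows]
conn.close()

print(json.dumps(records, ensure_ascii=False, indent=2))
```

Because every object carries its own number, source, author, and translation, a fragment retrieved from the knowledge base brings that context along with the story.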

Imagine if your knowledge base were filled with complete data fragments like this: the quality of AI-generated content would very likely improve.

This is where the benefit of the JSON format comes in.

Of course, it is not easy to convert data of various types and formats into JSON, but the improvement makes it worth trying.