50,000 fields in 3 hours! A practical guide to automatically generating data dictionaries with LLM

Use an LLM to generate a 50,000-field data dictionary in about 3 hours and raise your data-management efficiency.
Core content:
1. Why the data dictionary matters, and the pain points of writing one by hand
2. The core principle and workflow of LLM-based data dictionary generation
3. A step-by-step walkthrough in Python with SQLAlchemy
Believe me, you are not alone in this battle. Writing a data dictionary is tiring, thankless work, yet it is extremely important.
Last month, our team inherited a legacy database with 15 core tables and thousands of fields. Comments? Almost none. It took several weeks just to work out what the fields meant, and the project schedule fell into crisis mode. That is not only a time cost; it is also a latent source of errors and a compliance nightmare.
This article will get you out of that dilemma by showing how to build a data dictionary with a large language model (LLM) in just 5 steps: the goal is an efficient draft in about 3 hours, with predictable accuracy, so even compliance checks become less painful.
1. Why Handwritten Data Dictionaries Don't Last
Manually maintaining a data dictionary is a nightmare for data teams. There are three main pain points:
- Too slow: thousands of fields have to be checked and filled in one by one, which is time-consuming and laborious.
- Easy to miss: manual work is error-prone, and people often forget to update the document after fields change.
- Hard to maintain: the business iterates quickly, data structures change often, and delayed documentation updates become the norm.
Imagine the traditional approach on a medium-sized database of, say, 50,000 fields: even at just 1 minute per field (definition, type, business meaning), that is nearly 833 person-hours! With an LLM, this time can be cut dramatically.
A data dictionary that is not kept up to date is a stumbling block on the road to becoming data-driven.
2. Core Principle: Turning Metadata into Prompts
Sounds fancy? The principle is actually simple: extract the database's metadata (information_schema), feed it to a large language model (such as GPT-4), and have it draft field comments and business meanings, and even label sensitive data.
The whole process works like an intelligent assembly line.
The core of the process is high-quality metadata input plus well-designed prompts. The more accurate the information fed to the model, the more reliable the draft dictionary it produces. Remember: we are not asking AI to replace people entirely, but to act as an efficient "documentation assistant".
3. Five Practical Steps: Code + Explanation
Talk is cheap, show me the code! Next, we will walk through the implementation step by step using Python and SQLAlchemy.
Step 1: Extract Schema metadata
We need to get the table structure information from the database. SQLAlchemy is a powerful Python SQL toolkit that can help us do this easily.
# Installation: pip install sqlalchemy psycopg2-binary (using PostgreSQL as the example)
from sqlalchemy import create_engine, inspect

# Replace with your own database connection string
DATABASE_URI = "postgresql://user:password@host:port/dbname"

engine = create_engine(DATABASE_URI)
inspector = inspect(engine)

schema_info = {}
table_names = inspector.get_table_names()  # Get all table names

for table_name in table_names:
    columns_data = []
    columns = inspector.get_columns(table_name)  # Get the table's column information
    pk_constraint = inspector.get_pk_constraint(table_name)  # Get the primary key
    pk_columns = pk_constraint.get('constrained_columns', []) if pk_constraint else []
    # Get the table comment (how it is retrieved may vary slightly by database dialect)
    table_comment_info = inspector.get_table_comment(table_name)
    table_comment = (table_comment_info.get('text') or "") if table_comment_info else ""
    for col in columns:
        columns_data.append({
            "name": col['name'],
            "type": str(col['type']),
            "nullable": col['nullable'],
            "default": col.get('default'),
            "comment": col.get('comment', ''),  # Existing comment, if any
            "is_primary_key": col['name'] in pk_columns
        })
    schema_info[table_name] = {"columns": columns_data, "table_comment": table_comment}

# schema_info now contains metadata for all tables
# print(schema_info['your_table_name'])
Key point: make sure you extract the table names, field names, data types, nullability, default values, primary-key information, and any existing table and column comments. These are the raw materials for the prompts that follow.
Without complete metadata, even the most skilled cook cannot make a meal without rice.
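Before calling any LLM, it can be worth dumping the extracted metadata to disk for inspection and reuse. A minimal sketch (the file name is just an illustration):

import json

# Persist the extracted metadata so later stages can be re-run without re-querying the database
with open("schema_info.json", "w", encoding="utf-8") as f:
    json.dump(schema_info, f, indent=2, ensure_ascii=False, default=str)  # default=str handles non-JSON types such as server defaults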
Step 2: Build the prompt templates
The quality of the prompt directly determines the quality of the LLM's output. We need separate templates for generating table descriptions and column descriptions.
Table description prompt template example:
Database table name: {table_name}
Existing table comment: {existing_table_comment}
It contains the following fields:
{column_list_summary}
Based on the above information, please describe in one concise paragraph (in Chinese) the main business purpose and context of this table.
If a clear table comment already exists, give it priority.
Column description prompt template example:
Database table name: {table_name}
Main purpose of the table: {table_description_from_llm_or_manual}
Field name: {column_name}
Data type: {column_type}
Nullable: {is_nullable}
Default value: {default_value}
Primary key: {is_primary_key}
Existing field comment: {existing_column_comment}
Based on the above information, please provide the following (in Chinese, as JSON):
1. **business_meaning**: a clear, concise explanation of this field in its business context.
2. **data_characteristics**: its typical format (e.g. date YYYY-MM-DD) and any constraints or value ranges that can be inferred from the name, type, or existing comments.
3. **example_values**: 1-2 realistic, representative example values (if they can reasonably be inferred).
4. **sensitivity_analysis**: whether this field may contain sensitive data (such as PII or financial data); if so, explain why and suggest a sensitivity level (e.g. high, medium, low).
Pay special attention to any existing field comment; if the field name or table purpose is already clear, make sure your explanation stays accurate to it.
Core elements: clear instructions, sufficient context (table name, field name, type, existing comments, even the table description already generated by the LLM), and an explicit output format (JSON, for easy parsing).
A good prompt is the beginning of an efficient conversation with your LLM.
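To plug these templates into the orchestration code in Step 3, they can be kept as plain Python format strings. A minimal sketch, assuming the constant names match the placeholders used later (the English wording here is an abbreviation of the templates above):

TABLE_DESCRIPTION_PROMPT_TEMPLATE = """Database table name: {table_name}
Existing table comment: {existing_table_comment}
It contains the following fields:
{column_list_summary}
Based on the above information, describe in one concise paragraph the main business purpose and context of this table.
If a clear table comment already exists, give it priority."""

COLUMN_DESCRIPTION_PROMPT_TEMPLATE = """Database table name: {table_name}
Main purpose of the table: {table_description_from_llm_or_manual}
Field name: {column_name}
Data type: {column_type}
Nullable: {is_nullable}
Default value: {default_value}
Primary key: {is_primary_key}
Existing field comment: {existing_column_comment}
Return JSON with the keys business_meaning, data_characteristics, example_values, sensitivity_analysis."""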
Step 3: Call GPT-4 (or another LLM)
Here we use OpenAI's Python SDK as an example. You need to install the openai library and set your API key.
# Installation: pip install openai
import openai
import json  # used to parse JSON output

# openai.api_key = "YOUR_OPENAI_API_KEY"  # Setting the key via an environment variable is recommended

def generate_with_gpt4(prompt, is_json_output=False):
    try:
        response = openai.chat.completions.create(
            model="gpt-4o",  # or "gpt-4-turbo", "gpt-3.5-turbo"
            messages=[
                {"role": "system", "content": "You are a senior database architect, proficient in compiling data dictionaries and interpreting business meaning."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.2,  # A low temperature keeps the output stable and consistent
            max_tokens=800    # Adjust as needed
        )
        content = response.choices[0].message.content.strip()
        if is_json_output:
            # Strip Markdown code-block markers if present
            if content.startswith("```json"):
                content = content[7:]
            if content.endswith("```"):
                content = content[:-3]
            return json.loads(content)
        return content
    except Exception as e:
        print(f"Error calling LLM: {e}")
        if is_json_output:
            return {"error": str(e)}
        return f"Generation failed: {e}"
# --- Orchestration logic (pseudo-code) ---
# data_dictionary = {}
# for table_name, table_data in schema_info.items():
#     # 1. Generate the table description
#     column_list_for_table_prompt = "\n".join(
#         [f"- {c['name']} ({c['type']})" for c in table_data['columns'][:10]]  # Example: use only the first 10 fields
#     )
#     table_prompt_text = TABLE_DESCRIPTION_PROMPT_TEMPLATE.format(
#         table_name=table_name,
#         existing_table_comment=table_data['table_comment'],
#         column_list_summary=column_list_for_table_prompt
#     )
#     generated_table_description = generate_with_gpt4(table_prompt_text)
#     data_dictionary[table_name] = {"table_description": generated_table_description, "columns": []}
#     # 2. Iterate over the columns and generate column descriptions
#     for col_info in table_data['columns']:
#         column_prompt_text = COLUMN_DESCRIPTION_PROMPT_TEMPLATE.format(
#             table_name=table_name,
#             table_description_from_llm_or_manual=generated_table_description,  # Use the generated table description as context
#             column_name=col_info['name'],
#             column_type=col_info['type'],
#             is_nullable=col_info['nullable'],
#             default_value=col_info['default'],
#             is_primary_key=col_info['is_primary_key'],
#             existing_column_comment=col_info['comment']
#         )
#         generated_column_details = generate_with_gpt4(column_prompt_text, is_json_output=True)
#         # Merge the original metadata with the LLM-generated information
#         final_col_data = {**col_info, **generated_column_details}
#         data_dictionary[table_name]["columns"].append(final_col_data)
# print(json.dumps(data_dictionary, indent=2, ensure_ascii=False))
Notes:
- Model selection: GPT-4o/GPT-4 Turbo give better results but cost more. Choose GPT-3.5 Turbo or another LLM according to your budget and needs.
- Temperature: a lower temperature (e.g. 0.1-0.3) makes the output more stable and more faithful to the instructions.
- Error handling and retries: in real applications you need more robust error handling and retry logic around the API calls (a sketch follows these notes).
- Token limits: watch the input and output token counts to stay within the model's limits. For particularly wide tables, process the columns in batches.
Choosing the right model and parameters is the key to balancing cost and quality.
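As a starting point for the retry logic mentioned above, here is a minimal sketch with simple exponential backoff; the wrapper name, attempt count, and failure checks are illustrative assumptions layered on the generate_with_gpt4 function defined earlier, not part of the OpenAI SDK:

import time

def generate_with_retry(prompt, is_json_output=False, max_attempts=3):
    # Retry transient failures with exponential backoff: wait 2s, then 4s, then give up
    result = None
    for attempt in range(1, max_attempts + 1):
        result = generate_with_gpt4(prompt, is_json_output=is_json_output)
        failed = (isinstance(result, dict) and "error" in result) or (
            isinstance(result, str) and result.startswith("Generation failed")
        )
        if not failed:
            return result
        if attempt < max_attempts:
            time.sleep(2 ** attempt)  # Back off before the next attempt
    return result  # Return the last (failed) result for the caller to handle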
Step 4: Clean the results & manual review checklist
The LLM produces a rough draft; manual review and calibration are essential for accuracy and business relevance.
A checklist for reference:
- Business-meaning accuracy: does the generated description accurately reflect the field's real purpose in its business scenario?
- Completeness of data characteristics: are the format, constraints, and value ranges reasonable? Is anything missing?
- Plausibility of example values: do the examples match the data type and business logic?
- Sensitive-data judgment: is the sensitivity classification correct? Any misses or false positives?
- Consistency: are terminology and descriptive style consistent across the whole dictionary?
- Hallucinated content: has the LLM confidently made anything up? (For example, inventing complex business logic for a simple status code.)
- Use of existing comments: did the LLM appropriately incorporate the comments already in the database?
Tool assistance: the JSON results generated by the LLM can be imported into Excel or a data-governance tool for batch review and correction (a small export sketch follows).
AI is not a silver bullet; human judgment is the last line of defense for quality.
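As one way to get the draft into a spreadsheet for reviewers, the nested data_dictionary can be flattened with pandas. A minimal sketch, assuming pandas and openpyxl are installed (pip install pandas openpyxl); the output file name is just an illustration:

import pandas as pd

rows = []
for table_name, table_data in data_dictionary.items():
    for col in table_data.get("columns", []):
        rows.append({
            "table": table_name,
            "table_description": table_data.get("table_description", ""),
            "column": col.get("name", ""),
            "type": col.get("type", ""),
            "business_meaning": col.get("business_meaning", ""),
            "data_characteristics": col.get("data_characteristics", ""),
            "example_values": str(col.get("example_values", "")),
            "sensitivity_analysis": str(col.get("sensitivity_analysis", "")),
        })

# One row per column; reviewers can filter, comment, and correct directly in Excel
pd.DataFrame(rows).to_excel("data_dictionary_for_review.xlsx", index=False)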
Step 5: Export to Markdown/Excel and publish to Confluence
The reviewed data dictionary should be output in a format that is easy to consult and maintain.
- Markdown: clearly structured, suitable for pasting straight into Confluence, a wiki, or a Git repository's README.
- Excel/CSV: convenient for non-technical staff to review and analyze further.
Python example for Markdown output (simplified):
def generate_markdown_output(data_dictionary):
    markdown_string = ""
    for table_name, table_data in data_dictionary.items():
        markdown_string += f"## Table name: {table_name}\n\n"
        markdown_string += f"**Table description**: {table_data.get('table_description', 'N/A')}\n\n"
        markdown_string += "| Field name | Data type | Nullable | Primary key | Default value | Existing comment | Business meaning (LLM) | Data characteristics (LLM) | Example values (LLM) | Sensitivity analysis (LLM) |\n"
        markdown_string += "|---|---|---|---|---|---|---|---|---|---|\n"
        for col in table_data.get('columns', []):
            markdown_string += (
                f"| {col.get('name', '')} | {col.get('type', '')} | {col.get('nullable', '')} "
                f"| {col.get('is_primary_key', '')} | {col.get('default', '') if col.get('default') is not None else ''} "
                f"| {col.get('comment', '')} | {col.get('business_meaning', '')} | {col.get('data_characteristics', '')} "
                f"| {str(col.get('example_values', ''))} | {str(col.get('sensitivity_analysis', ''))} |\n"
            )
        markdown_string += "\n"
    return markdown_string

# md_output = generate_markdown_output(data_dictionary_after_review)
# with open("data_dictionary.md", "w", encoding="utf-8") as f:
#     f.write(md_output)
Continuous updates: once the process is in place, you can set up a cron job to run the script regularly and push the resulting Markdown/Excel to Confluence via its API, or commit it automatically to a Git repository, so the data dictionary stays continuously up to date (see the sketch below).
Only when documentation is kept alive does it deliver its full value.
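As one simple way to wire the export into a scheduled job, the regenerated file can be committed to a Git repository via the git CLI. A minimal sketch using subprocess; the function name, repository path, and commit message are illustrative assumptions:

import subprocess

def publish_to_git(repo_path, file_path, message="chore: refresh data dictionary"):
    # Stage and commit the regenerated dictionary, then push; run this from a cron job after the export step
    subprocess.run(["git", "add", file_path], cwd=repo_path, check=True)
    commit = subprocess.run(["git", "commit", "-m", message], cwd=repo_path)
    if commit.returncode != 0:
        print("Nothing to commit (the schema may not have changed)")
        return
    subprocess.run(["git", "push"], cwd=repo_path, check=True)

# publish_to_git("/path/to/docs-repo", "data_dictionary.md")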
4. Complete Case: E-commerce Database Customers & Orders
Take a simplified e-commerce database as an example, containing a Customers table and an Orders table.
❌ Handwritten, inefficient (original state): Customers table, selected fields (comparison table not reproduced here).
✅ After GPT generation and manual fine-tuning: Customers table, selected fields (comparison table not reproduced here).
The comparison makes it clear that the LLM fills in a great deal of information and performs a preliminary sensitivity analysis, greatly reducing the manual workload.
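To make the comparison concrete, here is a purely hypothetical sketch of the kind of column-level JSON the prompt from Step 2 might return for an email field in the Customers table; the values and the exact shape of sensitivity_analysis are illustrative assumptions, not actual model output:

# Hypothetical example only; real output will vary by model and prompt
example_column_details = {
    "business_meaning": "The customer's email address, used for login and order notifications.",
    "data_characteristics": "String in standard email format (local-part@domain), expected to be unique per customer.",
    "example_values": ["alice@example.com"],
    "sensitivity_analysis": {
        "is_sensitive": True,
        "reason": "Personally identifiable information (PII)",
        "level": "high"
    }
}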
5. Notes & Common Pitfalls
While enjoying the convenience an LLM brings, keep the following points in mind:
- Cost control (token estimation): calling an LLM API (especially GPT-4) costs money. Estimate token consumption as input prompt length plus output length, per field. For a large database the bill may not be small; test on a small table first, or use a cheaper model (such as GPT-3.5 Turbo) for non-core fields.
- Data security and privacy (metadata only): this approach only sends metadata such as table structures, field names, and data types, which usually contain no actual user data. If field names themselves contain sensitive terms, assess the risk or mask them first. Prefer LLMs that can be deployed locally, or enterprise services with data-privacy guarantees.
- Prompt tuning:
  - Few-shot learning: including a few high-quality examples in the prompt can significantly improve the format and quality of the LLM's output.
  - Temperature: lower the temperature (e.g. 0.1-0.3) to make the output more stable and focused.
  - Iterative optimization: keep adjusting the prompt based on the actual output; this is a continuous process.
- Regular updates and version control (cron + Git): the database structure will change, and the data dictionary must change with it. Put the script in a scheduled task (such as cron) and run it regularly; keep the generated documents under Git so the change history is traceable.
- Strategies for large databases: for very large databases with thousands of tables and hundreds of thousands of fields, processing everything in one pass may be unrealistic (API timeouts, high cost, context-window limits). Use batch processing (by schema, by business domain, or table by table) and incremental updates; see the sketch after this list.
- LLM hallucination: an LLM can sometimes confidently make things up. Given ambiguous field names or too little context, it may invent an inaccurate business meaning. This is why human review is crucial.
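For the batching strategy above, one simple option is to split a wide table's column list into fixed-size chunks and call the LLM once per chunk. A minimal sketch; the chunk size and the build_batch_prompt helper are illustrative assumptions:

def chunk_columns(columns, chunk_size=20):
    # Yield the column list in fixed-size batches so each prompt stays within the context window
    for start in range(0, len(columns), chunk_size):
        yield columns[start:start + chunk_size]

# for batch in chunk_columns(table_data['columns']):
#     prompt = build_batch_prompt(table_name, generated_table_description, batch)  # hypothetical helper
#     batch_results = generate_with_gpt4(prompt, is_json_output=True)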
There is no silver bullet, only continuously optimized engineering practices.
6. Summary & Next Steps
Using large language models to automatically generate data dictionaries brings a step change in efficiency for data teams. It frees us from tedious manual labor so we can focus on understanding the business value behind the data.
Although an LLM cannot yet fully replace manual work, the quality of its first draft is already impressive; when dealing with a large number of fields it can save upwards of 90% of the time and effort.
Next, you can:
- Try it yourself: fork the sample script, point it at a test database, and see the results.
- Optimize the prompts: customize more precise prompt templates for your business and database conventions.
- Integrate it into your workflow: wire the automation script into your data-governance process so the data dictionary is maintained continuously.
"Is it a waste of time for data engineers to write documentation? No: the real engineers are the ones who spend that time writing scripts and letting the machines write the docs."
I hope this practical guide can open a new door for you.