Early-stage data processing in AI applications: business scenarios and technical analysis of structured, semi-structured, and unstructured data

Master the key skills of artificial intelligence document processing and improve data processing capabilities.
Core content:
1. The importance and business scenarios of document processing in the field of artificial intelligence
2. Analysis of data structure and technical implementation of different types of documents
3. Difficulties and solutions for processing structured, semi-structured and unstructured data
" Document processing is an important part of artificial intelligence applications. Its business requirements are complex and the technical implementation is difficult. Therefore, how to process complex documents is an issue that every technician needs to consider. "
Document processing is a foundational task in today's AI industry. Whether for model training and fine-tuning, retrieval-augmented generation (RAG), or traditional search, whether general-purpose engines such as Baidu and Google or the internal search of e-commerce and other platforms, it is an indispensable step.
However, the sheer variety of document formats and types makes processing difficult, and different scenarios demand different handling even for documents of the same format and content. An open question-answering scenario and a precision-oriented enterprise service scenario, for example, place very different requirements on processing quality.
So today we will look at some common problems in document processing, along with the corresponding solutions and technology choices.
Document Processing
Document processing needs to be looked at from two angles: the business scenario and the technical implementation.
Business Scenario
Let's start with business scenarios. Document processing appears in many of them: the RAG, model training, and search-engine use cases mentioned above all depend on it. It is equally important in vertical, knowledge-intensive fields such as medicine, finance, and law, where the volume of knowledge is growing rapidly.
This is especially true when organizations need to quickly locate key information in large amounts of historical data accumulated over the years: besides the search algorithms and large models applied on top, the early-stage processing of documents is an indispensable step.
No matter how powerful the algorithm or model, it is hard to get accurate query or retrieval results from unprocessed data, so how documents are processed, and the quality of the results, directly affects recall.
Technical Implementation
The technical side also needs to be analyzed from more than one angle, mainly two: the document types involved and the technical solutions for handling them.
Document Type
Different document types and contents call for different processing methods. The documents discussed here are not only the familiar Word, PDF, and Markdown files, but also database exports, web pages, images, tables, and other document types.
In terms of form, documents come in many varieties: office suite files, Markdown documents, CSV files, database exports, web pages, log files, and so on.
From a technical point of view, though, whatever form a document takes, it falls into one of three categories:
Structured data
Semi-structured data
Unstructured data
Structured data mainly includes Excel and CSV files, database tables, XML documents, log files, and the like.
Semi-structured data mainly includes web pages, MongoDB documents, emails, and so on, which are partly structured and partly unstructured: HTML tags such as h1 and li in a web page, or the sender and recipient fields of an email, are structured, while the surrounding content is not.
Unstructured data is the most common type: Word, PDF, PPT, Markdown, plain text, and so on. Word, PDF, and Markdown in particular support rich text, tables, and images, and their complex structure creates many difficulties during processing, for example in project documents that mix ordinary text, images, and tables, or PDF documents full of architecture diagrams, structure diagrams, and flow charts.
In AI application scenarios we need to preserve not only the continuity of the content (avoiding problems such as splitting a table across chunks) but also its semantics and structure. Documents packed with structure diagrams and flow charts, for example, are still handled poorly by current tooling.
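To make the table-splitting concern concrete, here is a minimal sketch of a chunker that treats a Markdown table as an indivisible block so it never gets cut across chunks. The chunk-size budget and the rule that table rows start with "|" are simplifying assumptions for illustration, not a production design.

```python
# Minimal sketch: chunk Markdown-like text without splitting a table across chunks.
# Assumptions: table rows start with "|"; max_chars is a rough size budget.
def chunk_markdown(text: str, max_chars: int = 1000) -> list[str]:
    # First pass: group lines into blocks, keeping each table as a single block.
    blocks, current, in_table = [], [], False
    for line in text.splitlines():
        is_table_row = line.lstrip().startswith("|")
        if current and is_table_row != in_table:
            blocks.append("\n".join(current))
            current = []
        in_table = is_table_row
        current.append(line)
    if current:
        blocks.append("\n".join(current))

    # Second pass: pack blocks into chunks; a block is never split in half.
    chunks, buf = [], ""
    for block in blocks:
        if buf and len(buf) + len(block) > max_chars:
            chunks.append(buf)
            buf = ""
        buf = f"{buf}\n{block}" if buf else block
    if buf:
        chunks.append(buf)
    return chunks
```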
Faced with such complex document types and contents, how to process them effectively is a problem that many companies and fields need to think through and solve.
Technical Solution
Given the variety of document types described above, different data processing solutions are needed. Below we go through the three categories: structured, semi-structured, and unstructured.
Structured Data
Of the three, structured data is the easiest to handle. Whether it is a database table, JSON, XML, or another standard format, it can be processed directly according to that format, and thanks to years of tooling development this area is very mature. Python's pandas, for example, is well suited to working with database, JSON, and CSV data.
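As a minimal sketch of what this looks like in practice, the snippet below loads structured data with pandas from a CSV file, a JSON Lines file, and a SQLite table, then applies a few routine transformations. The file, table, and column names are placeholders for illustration.

```python
import sqlite3
import pandas as pd

# Structured sources map directly onto tabular structures, so "processing" is
# mostly loading plus standard column operations. Names below are placeholders.
orders = pd.read_csv("orders.csv")                    # CSV file
events = pd.read_json("events.json", lines=True)      # JSON Lines file

with sqlite3.connect("app.db") as conn:
    users = pd.read_sql("SELECT id, name, created_at FROM users", conn)

# Typical cleanup: drop duplicates, parse dates, aggregate.
orders = orders.drop_duplicates()
orders["created_at"] = pd.to_datetime(orders["created_at"])
per_day = orders.groupby(orders["created_at"].dt.date).size()
print(per_day.head())
```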
Semi-structured Data
Processing semi-structured data is somewhat more involved than structured data, but not as complicated as unstructured data.
Take web pages as an example: crawling them is a basic capability in the scraping world, and the fetched pages can then be parsed with regular expressions or a third-party HTML parsing library, with quite good results.
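A minimal sketch of that parsing step, assuming requests plus BeautifulSoup (one common third-party HTML parser) and a placeholder URL:

```python
import requests
from bs4 import BeautifulSoup  # third-party HTML parser (beautifulsoup4)

# Fetch and parse a page; the URL is a placeholder.
html = requests.get("https://example.com/article", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# The structured part: tags such as <title>, <h1>, <li> can be addressed directly.
title = soup.title.get_text(strip=True) if soup.title else ""
headings = [h.get_text(strip=True) for h in soup.find_all(["h1", "h2"])]
list_items = [li.get_text(strip=True) for li in soup.find_all("li")]

# The unstructured part: the remaining body text, extracted as plain text.
body_text = soup.get_text(separator="\n", strip=True)
```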
Unstructured Data
Within document processing, unstructured data is the category with the highest technical difficulty and the most complex processing methods.
Unstructured data is hard to process because, as noted above, its document structure is complex: text, images, tables, flow charts, and other formats can all appear in a single document. Add the differing quality requirements of different application scenarios, and there is no single unified way to process it; the best we can do is abstract certain steps of the pipeline.
For example, the text, images, and tables in a document can be extracted separately and then each processed in its own way; a multimodal model is often used for this extraction step.
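Here is a sketch of one way to do that first extraction pass, using pdfplumber as an example open-source library (an assumption, not a prescribed tool). The file name is a placeholder, and the extracted image regions would typically be handed to a multimodal model afterwards.

```python
import pdfplumber  # one common open-source choice for PDF extraction

# Pull text, tables, and image regions out of each page separately, so each
# modality can be routed to its own downstream step (text chunking, table
# parsing, a multimodal model for figures). "report.pdf" is a placeholder.
texts, tables, image_regions = [], [], []

with pdfplumber.open("report.pdf") as pdf:
    for page in pdf.pages:
        texts.append(page.extract_text() or "")
        tables.extend(page.extract_tables())   # each table is a list of rows
        image_regions.extend(page.images)      # bounding boxes of embedded images

print(f"{len(texts)} pages of text, {len(tables)} tables, {len(image_regions)} images")
```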
Second, for images that contain a large amount of descriptive text, technologies such as OCR can extract the content from the image so that it can then be handled as ordinary text data.
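A minimal OCR sketch using pytesseract, a Python wrapper around the Tesseract engine (which must be installed separately); the image file name and language code are placeholders:

```python
from PIL import Image
import pytesseract  # wrapper around the Tesseract OCR engine

# Extract the text embedded in an image so it can be handled like ordinary
# text data afterwards.
image = Image.open("architecture_note.png")
text = pytesseract.image_to_string(image, lang="eng")
print(text)
```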
Of course, there is also the simplest approach: have a multimodal model summarize the entire document and use that summary directly downstream.
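A sketch of that whole-document summarization approach, assuming the OpenAI Python SDK as the multimodal API and a page already rendered to a placeholder image file; any comparable multimodal model could be substituted.

```python
import base64
from openai import OpenAI  # used here only as an example multimodal API

# Summarize a document page by sending its rendered image to a multimodal model.
# The model name and the pre-rendered "page1.png" are assumptions for illustration.
client = OpenAI()

with open("page1.png", "rb") as f:
    page_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize the key points of this document page."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{page_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```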