Unlocking the Power of Snowflake Cortex: Document AI

Managing and extracting actionable insights from documents can be a time-consuming and error-prone task, especially when it comes to complex technical documents. These documents are often filled with critical information that needs to be efficiently analyzed and organized for operational decisions. That’s where Snowflake’s Document AI comes in.

Snowflake’s advanced AI-powered platform is revolutionizing the way we process and extract meaningful data from documents. By leveraging the power of AI and machine learning, Document AI enables businesses to automate the extraction of key details directly from unstructured documents. In this blog post, we’ll explore how this cutting-edge technology simplifies data extraction, enhances accuracy, and streamlines workflows in industries where precise, real-time information is essential for maintaining operational efficiency.

Use case

You work as a data engineer for SkiGear Co., a company that specializes in manufacturing ski equipment. One day, your team lead approaches you with the following problem. Management would like better insight into the health of the manufacturing process, more specifically into how often the machines require maintenance. SkiGear Co. already streams sensor data from several points in the manufacturing process to its Snowflake instance in near real time, but this data only captures whether a machine is broken or not. The company also carries out regular machine inspections to check whether a machine needs to be repaired before it breaks down and slows production. Unfortunately, these inspections are done manually by in-house inspectors and recorded on paper. The inspection results are exactly the data that is missing from front-end reporting, and management would like to incorporate them into their decision-making process.

You set out to tackle this challenge with the information provided. SkiGear Co. mainly uses Snowflake as its data platform, so a reasonable first step is to check what is possible within this existing environment. Fortunately, Snowflake has its own document processing feature, called Document AI. From a quick search through the Snowflake documentation, you gather that you can build a model that automatically extracts predefined values from documents and loads them into your data warehouse in a tabular format. Interested in this technology, you set out to create a Document AI model for this use case.

Gathering the data

To create a Document AI model, we first need some data to train it on. This means we need inspection documents that have already been filled in, which can be harder to obtain than data that is readily available in a data warehouse or data lake. You asked around for inspection documents, and fortunately Emily, one of the machine inspectors, was happy to provide some old ones that she no longer needs. She apologized that she could only provide 10 documents, which at first doesn’t seem like enough training data to build a reliable and accurate model. Fortunately, Snowflake’s Document AI uses few-shot learning to train its models, which means that 10 documents can suffice for building a reliable model.

Example of inspection document

Building the model

Now it’s time to build a Document AI model. We start by creating a blank model by navigating to “AI & ML” → “Document AI” → “+ Build” in the Snowflake web interface. We name our build and store it in a relevant database and schema.

Create a model in Document AI
Pop up to name and store model

After successfully creating the model, we can start training it. First, we have to upload the training data set, which consists of the 10 inspection documents kindly provided by Emily. Then, we can extract the desired values from each document by building an extraction framework. Don’t worry, this can all be done without writing a single line of code. Snowflake Document AI provides a simple interface where you can extract values from each inspection document PDF via prompting. The PDF from which we extract values is shown on the left-hand side of the interface, while the extracted values are shown on the right-hand side.

Document AI page to define values to be extracted

The values that we would like to extract from the inspection documents to enhance our existing data are defined as follows:

VALUE | DESCRIPTION | PROMPT
MACHINE | The inspected machine. | Which machine was inspected?
SERIAL_NUMBER | Serial number of the inspected machine. | What is the serial number?
INSPECTION_GRADE | Whether the machine passed or failed the inspection. | What is the inspection grade?
INSPECTOR | Person who performed the inspection. | Who performed the inspection?
INSPECTION_DATE | Date of the inspection. | What is the inspection date?
LIST_OF_UNITS | All units of the machine that were checked during the inspection. | What are all the units?
DEFECTIVE | All units that were found to be defective during the inspection. | Which unit is defective?

After entering these values with their corresponding questions, Document AI automatically starts to extract answers to these questions from the document.

List of defined values (part 1)
List of defined values (part 2)

It looks like all the extracted values match the information in the inspection document, so we click the blue “Accept all and review next” button to accept all the answers for this document. Now, we have to repeat this step for the other 9 training documents. Fortunately, we don’t have to enter the values and questions again for each document. Document AI remembers this information and automatically applies it to the other documents as well.

When we’re done reviewing all the training documents, we publish the model. Now, we can use our model to automatically extract data from other inspection documents.

Note

It’s possible that Document AI doesn’t immediately recognize the correct value and returns an incorrect answer to your question. In that case, a first step to remedy the issue is to rephrase your question.

E.g. “What unit is defective?” (singular) vs. “What units are defective?” (plural). With the plural version of the question, Document AI always expects multiple units to be defective and tries to return a list of answers. In our example, that’s not always the case. It’s entirely possible that no units are defective at all, or that only one unit is defective. Using the singular version of the question avoids this problem.

Extract initial values

Emily already gave us some new inspection reports, which we uploaded to an internal stage in our Snowflake instance. Now, we can use our model to automatically extract data from these new inspection reports. When we call the PREDICT() function on the model and refer to the stage in the SQL statement, Snowflake extracts data from the newly uploaded files based on the model that we trained and published. The PREDICT() function returns a JSON object with the defined values and the extracted values as key-value pairs.

Initial JSON data resulting from using the model on new documents
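
For reference, an internal stage for these reports could be set up along the lines of the sketch below. The stage placeholder follows the convention used in the example later in this post, and the local file path is hypothetical. The directory table and server-side encryption options reflect what the Snowflake documentation describes for stages used with Document AI, so it is worth checking the current documentation before relying on them.

```sql
-- Internal stage for new inspection reports.
-- DIRECTORY enables the directory table that is queried with DIRECTORY(@stage);
-- server-side encryption is assumed here so that Document AI can read the files.
CREATE STAGE IF NOT EXISTS <stage_name>
    DIRECTORY = (ENABLE = TRUE)
    ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');

-- Upload from a client such as SnowSQL (hypothetical local path).
PUT file:///tmp/inspection_reports/*.pdf @<stage_name> AUTO_COMPRESS = FALSE;

-- Refresh the directory table so the new files become visible.
ALTER STAGE <stage_name> REFRESH;

-- List the files that are now available for extraction.
SELECT RELATIVE_PATH, SIZE, LAST_MODIFIED
FROM DIRECTORY(@<stage_name>);
```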

After adding some extra steps to our SQL code, we can represent the JSON data in a traditional tabular format.

JSON data flattened into a table
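
The extra steps could look roughly like the sketch below. It assumes that the raw PREDICT() output is first stored in a hypothetical table named inspection_raw, and that each extracted value arrives as an array of objects with a score and a value field, keyed by the value name defined in the build. The table, column, and alias names are illustrative and not part of the original setup.

```sql
-- Assumption: hypothetical table that stores the raw PREDICT() output per file.
CREATE OR REPLACE TABLE inspection_raw AS
SELECT
    RELATIVE_PATH AS file_name,
    <database_name>.<schema_name>.<build_name>!PREDICT(
        GET_PRESIGNED_URL(@<stage_name>, RELATIVE_PATH), <build_version>
    ) AS json_content
FROM DIRECTORY(@<stage_name>);

-- One row per document: take the first answer of each single-valued field.
-- A missing answer (for example, no defective unit) simply yields NULL.
SELECT
    file_name,
    json_content:__documentMetadata.ocrScore::FLOAT AS ocr_score,
    json_content:MACHINE[0].value::STRING           AS machine,
    json_content:SERIAL_NUMBER[0].value::STRING     AS serial_number,
    json_content:INSPECTION_GRADE[0].value::STRING  AS inspection_grade,
    json_content:INSPECTOR[0].value::STRING         AS inspector,
    json_content:INSPECTION_DATE[0].value::STRING   AS inspection_date,
    json_content:DEFECTIVE[0].value::STRING         AS defective
FROM inspection_raw;

-- One row per defective unit, for documents where several answers come back.
-- OUTER => TRUE keeps documents that have no defective units at all.
SELECT
    r.file_name,
    d.value:value::STRING AS defective_unit
FROM inspection_raw r,
     LATERAL FLATTEN(input => r.json_content:DEFECTIVE, OUTER => TRUE) d;
```

Indexing with [0] keeps the first query at one row per document, while the FLATTEN variant is the better fit for values that can legitimately contain a list of answers.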

This data is now ready to enhance our front-end reporting. We could also automate this by creating a stream on the internal stage and setting up a task that automatically extracts data from the new inspection reports captured by the stream.
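
A minimal sketch of such an automation is shown below. It assumes the stage has a directory table enabled, and it uses hypothetical names for the stream, the task, the target table, and the warehouse placeholder. The five-minute schedule is an arbitrary choice for illustration.

```sql
-- Stream that tracks file changes in the stage's directory table.
CREATE STREAM IF NOT EXISTS inspection_stream ON STAGE <stage_name>;

-- Hypothetical target table for the raw extraction results.
CREATE TABLE IF NOT EXISTS inspection_raw (
    file_name    STRING,
    json_content VARIANT
);

-- Task that runs on a schedule, but only does work when the stream has new rows.
CREATE OR REPLACE TASK load_new_inspections
    WAREHOUSE = <warehouse_name>
    SCHEDULE = '5 MINUTE'
    WHEN SYSTEM$STREAM_HAS_DATA('inspection_stream')
AS
    INSERT INTO inspection_raw (file_name, json_content)
    SELECT
        RELATIVE_PATH,
        <database_name>.<schema_name>.<build_name>!PREDICT(
            GET_PRESIGNED_URL(@<stage_name>, RELATIVE_PATH), <build_version>)
    FROM inspection_stream
    WHERE METADATA$ACTION = 'INSERT';

-- Tasks are created in a suspended state; resume the task to start the schedule.
ALTER TASK load_new_inspections RESUME;

-- After uploading new files to an internal stage, refresh its directory table
-- so that the stream can pick them up.
ALTER STAGE <stage_name> REFRESH;
```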

Example

Use the following code snippet to extract previously defined values from new documents:

-- Names between <> are placeholder names and should be changed in a real application
-- Single document
SELECT <database_name>.<schema_name>.<build_name>!PREDICT(
    GET_PRESIGNED_URL(@<stage_name>, '<relative_file_path>'), <build_version>);

-- Multiple documents
SELECT <database_name>.<schema_name>.<build_name>!PREDICT(
    GET_PRESIGNED_URL(@<stage_name>, RELATIVE_PATH), <build_version>)
FROM DIRECTORY(@<stage_name>);

This code snippet returns JSON data with all the extracted values as key-value pairs.

Note

Retrain the model if the extracted values are incorrect. If the extracted values are still incorrect even after retraining, consider rephrasing the questions for those values and republishing the model.
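
One way to find candidates for review is to look at the confidence score that accompanies each extracted value. The sketch below assumes a hypothetical table inspection_raw(file_name, json_content) that holds the raw PREDICT() output, with each value stored as an array of objects containing score and value fields. The 0.8 threshold is purely illustrative and not a Snowflake recommendation.

```sql
-- List extracted values whose confidence score falls below a chosen threshold.
SELECT
    r.file_name,
    f.key                 AS value_name,
    a.value:value::STRING AS extracted_value,
    a.value:score::FLOAT  AS score
FROM inspection_raw r,
     LATERAL FLATTEN(input => r.json_content) f,  -- one row per defined value
     LATERAL FLATTEN(input => f.value) a          -- one row per returned answer
WHERE f.key <> '__documentMetadata'               -- skip document-level metadata
  AND a.value:score::FLOAT < 0.8                  -- illustrative threshold
ORDER BY score;
```

Documents that show up repeatedly in this list are reasonable candidates to add to the training set or to check against the wording of the corresponding question.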

Conclusion

In conclusion, Snowflake Document AI stands out as a powerful tool for organizations seeking to automate and streamline their document processing workflows. By leveraging Snowflake’s robust data platform and machine learning capabilities, it enables businesses to extract valuable insights from unstructured data within documents, such as PDFs and images. The integration of natural language processing (NLP) and optical character recognition (OCR) technologies allows for accurate data extraction and transformation, significantly reducing manual effort and errors.

Before we conclude this blog post, we would like to go over some pros and cons that we noticed during the setup on Snowflake:

Pros
  • Accurate results
  • Easy-to-understand interface
  • Simple question-answer-based extraction
  • Integrated within the Snowflake environment. The data is immediately available in any data pipeline without any extra setup.
Cons
  • Extracting answers needs some wrangling. Sometimes Document AI doesn’t immediately provide the correct answer, and the question needs to be rephrased or the model needs to be retrained multiple times to have a better chance of extracting the right answer.
  • Can’t define an area on the document where the AI needs to find an answer for a specific question. It scans the entire document for each question, increasing the risk of extracting a wrong answer. Adding this feature could help with extracting correct answers from standardized documents.