Managing and extracting actionable insights from documents can be a time-consuming and error-prone task, especially when it comes to complex technical documents. These documents are often filled with critical information that needs to be efficiently analyzed and organized for operational decisions. That’s where Snowflake’s Document AI comes in.

Snowflake’s advanced AI-powered platform is revolutionizing the way we process and extract meaningful data from documents. By leveraging the power of AI and machine learning, Document AI enables businesses to automate the extraction of key details directly from unstructured documents. In this blog post, we’ll explore how this cutting-edge technology simplifies data extraction, enhances accuracy, and streamlines workflows in industries where precise, real-time information is essential for maintaining operational efficiency.

Use case

You work as a data engineer for SkiGear Co., a company that specializes in manufacturing ski equipment. One day, your team lead approaches you with the following problem. Management would like better insight into the health of the manufacturing process, more specifically into how often the machines require maintenance. SkiGear Co. already streams sensor data from several points in the manufacturing process to its Snowflake instance in near real-time, but this data only captures whether a machine is broken or not. The company also carries out regular machine inspections to check whether a machine needs to be repaired before it breaks down and slows production. Unfortunately, these inspections are done manually by in-house inspectors and recorded on paper. The inspection results are exactly the data that is missing from front-end reporting, and management would like to incorporate them into its decision-making process.

You set out to tackle this challenge with the information provided. You know that SkiGear Co. mainly uses Snowflake as its data platform, so a reasonable first step is to check what is possible within this existing environment. Fortunately, Snowflake has its very own document processing feature, called Document AI. From a quick search through the Snowflake documentation, you gather that you can build a model that automatically extracts predefined values from documents and loads them into your data warehouse in a tabular format. Interested in this technology, you set out to create a Document AI model for this use case.

Gathering the data

To create a Document AI model, we first need some data to train the model on. This means that we need inspection documents that have already been filled in, which can be more difficult to obtain than data that is readily available in a data warehouse or data lake. You asked around for some inspection documents, and fortunately Emily, one of the machine inspectors, was happy to provide some old ones that she doesn’t need anymore. She apologized that she could only provide 10 documents, which at first doesn’t seem like enough training data to build a reliable and accurate model. Fortunately, Snowflake’s Document AI uses few-shot learning to train its models, which means that a small set like these 10 documents can suffice for building a reliable model.

Building the model

Now it’s time to build a Document AI model. We start by creating a blank model by navigating to “AI & ML” → “Document AI” → “+ Build” in the Snowflake web interface. We name our build and store it in a relevant database and schema.

After successfully creating the model, we can start training it. First, we have to upload the training data set, which consists of the 10 inspection documents kindly provided by Emily. Then, we can extract the desired values from each document by building an extraction framework. Don’t worry, this can all be done without writing a single line of code. Snowflake Document AI provides a simple interface where you can extract values from each inspection document PDF via prompting. The PDF from which we extract values is shown on the left-hand side, while the extracted values are shown on the right-hand side of the interface.

The values that we would like to extract from the inspection documents to enhance our existing data are defined as follows:

VALUE	DESCRIPTION	PROMPT
MACHINE	The inspected machine.	Which machine was inspected?
SERIAL_NUMBER	Serial number of the inspected machine.	What is the serial number?
INSPECTION_GRADE	Whether the machine passed or failed the inspection.	What is the inspection grade?
INSPECTOR	Person who performed the inspection.	Who performed the inspection?
INSPECTION_DATE	Date of the inspection.	What is the inspection date?
LIST_OF_UNITS	All units of the machine that were checked during the inspection.	What are all the units?
DEFECTIVE	All units that were found to be defective during the inspection.	Which unit is defective?

After entering these values with their corresponding questions, Document AI automatically starts to extract answers to these questions from the document.

It looks like all the extracted values match the information in the inspection document, so we click on the blue “Accept all and review next” button to accept all the answers for this document. Now, we have to repeat this step for the other 9 training documents. Fortunately, we don’t have to enter all the values and questions again for each document. Document AI remembers this information and automatically applies it to all other documents as well.

When we’re done reviewing all the training documents, we publish the model. Now, we can use our model to automatically extract data from other inspection documents.

Note

It’s possible that Document AI doesn’t immediately recognize the correct value and returns an incorrect answer to your question. In that case, a first step to remedy the issue is to rephrase your question.

E.g. “What unit is defective?” (singular) vs “What units are defective?” (plural). With the plural version of the question, Document AI always expects multiple defective units and tries to return a list of answers. In our example, that’s not always the case: it’s entirely possible that no units are defective at all, or that only one unit is defective. Using the singular version of the question avoids this problem.

Extract initial values

Emily already gave us some new inspection reports, which we uploaded to an internal stage in our Snowflake instance. Now, we can use our model to automatically extract data from these new inspection reports. When we call the PREDICT() function on the model and refer to the stage in the SQL statement, Snowflake extracts data from the newly uploaded files based on the model that we trained and published. The PREDICT() function returns a JSON object with the defined values and their extracted values as key-value pairs.
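
For reference, the internal stage mentioned above could be set up roughly as follows. This is a minimal sketch, not the exact setup used at SkiGear Co.: the stage name inspection_stage and the local file path are hypothetical, and the directory table and server-side encryption settings are assumptions based on what the Snowflake documentation lists for stages used with Document AI.

```sql
-- Hypothetical names: inspection_stage and the local path are illustrative only.
CREATE STAGE IF NOT EXISTS inspection_stage
  DIRECTORY = (ENABLE = TRUE)              -- directory table, so files can be listed with DIRECTORY(@stage)
  ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');   -- server-side encryption (assumed requirement for Document AI)

-- PUT must be run from a client such as SnowSQL, not from a Snowsight worksheet.
-- AUTO_COMPRESS = FALSE keeps the PDFs readable as PDFs.
PUT file:///tmp/inspections/*.pdf @inspection_stage AUTO_COMPRESS = FALSE;

-- Refresh the directory table so the new files become visible.
ALTER STAGE inspection_stage REFRESH;

-- Check which files are now available for extraction.
SELECT RELATIVE_PATH, SIZE, LAST_MODIFIED
FROM DIRECTORY(@inspection_stage);
```

Files can also be uploaded to the stage through the Snowflake web interface, which avoids the need for a separate client.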

After adding some extra steps to our SQL code, we can represent the JSON data in a traditional tabular format.
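
The extra steps could look something like the sketch below. It assumes that the returned JSON uses the value names from the build as keys and that each key holds an array of objects with a value and a score field; check the actual output of your own build before relying on these paths. Placeholders between <> follow the same convention as the example further down.

```sql
-- Sketch: turn the PREDICT() output into one row per document and defective unit.
-- Assumption: keys match the value names defined in the build (case-sensitive),
-- and every key contains an array of {"score": ..., "value": ...} objects.
WITH raw AS (
  SELECT
    RELATIVE_PATH AS file_name,
    <database_name>.<schema_name>.<build_name>!PREDICT(
      GET_PRESIGNED_URL(@<stage_name>, RELATIVE_PATH), <build_version>) AS json_content
  FROM DIRECTORY(@<stage_name>)
)
SELECT
  r.file_name,
  r.json_content['MACHINE'][0]['value']::STRING          AS machine,
  r.json_content['SERIAL_NUMBER'][0]['value']::STRING    AS serial_number,
  r.json_content['INSPECTION_GRADE'][0]['value']::STRING AS inspection_grade,
  r.json_content['INSPECTOR'][0]['value']::STRING        AS inspector,
  -- Kept as a string; casting to DATE depends on how the date is written on the form.
  r.json_content['INSPECTION_DATE'][0]['value']::STRING  AS inspection_date,
  d.value['value']::STRING                               AS defective_unit,
  d.value['score']::FLOAT                                AS defective_unit_score
FROM raw r,
  -- OUTER => TRUE keeps documents in which no defective unit was found.
  LATERAL FLATTEN(INPUT => r.json_content['DEFECTIVE'], OUTER => TRUE) d;
```

Flattening only the DEFECTIVE array keeps the result at one row per defective unit, while documents without defects still appear once with NULL in the defective_unit column.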

This data is now ready to enhance our front-end reporting. We could also automate this by creating a stream on the internal stage and setting up a task that automatically extracts data from the new inspection reports captured by the stream.
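
A minimal sketch of such an automation is shown below. The stream, table, and task names (inspection_stream, inspection_raw, extract_new_inspections) and the five-minute schedule are hypothetical choices, and the stage is assumed to have its directory table enabled, since a stream on a stage tracks changes to that directory table.

```sql
-- Hypothetical object names; placeholders between <> must be replaced.
-- The stream records files that are added to the stage's directory table.
CREATE STREAM IF NOT EXISTS inspection_stream ON STAGE <stage_name>;

-- Landing table for the raw PREDICT() output.
CREATE TABLE IF NOT EXISTS inspection_raw (
  file_name    VARCHAR,
  loaded_at    TIMESTAMP_LTZ,
  json_content VARIANT
);

-- The task only runs when the stream actually contains new files.
CREATE OR REPLACE TASK extract_new_inspections
  WAREHOUSE = <warehouse_name>
  SCHEDULE = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('inspection_stream')
AS
  INSERT INTO inspection_raw (file_name, loaded_at, json_content)
  SELECT
    RELATIVE_PATH,
    CURRENT_TIMESTAMP(),
    <database_name>.<schema_name>.<build_name>!PREDICT(
      GET_PRESIGNED_URL(@<stage_name>, RELATIVE_PATH), <build_version>)
  FROM inspection_stream
  WHERE METADATA$ACTION = 'INSERT';

-- Tasks are created in a suspended state and have to be resumed explicitly.
ALTER TASK extract_new_inspections RESUME;
```

Note that for an internal stage the directory table typically has to be refreshed (ALTER STAGE ... REFRESH) after an upload before the stream picks up the new files.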

Example

Use the following code snippet to extract previously defined values from new documents:

-- Names between <> are placeholder names and should be changed in a real application
-- Single document
SELECT <database_name>.<schema_name>.<build_name>!PREDICT(
  GET_PRESIGNED_URL(@<stage_name>, '<relative_file_path>'), <build_version>);

-- Multiple documents
SELECT <database_name>.<schema_name>.<build_name>!PREDICT(
  GET_PRESIGNED_URL(@<stage_name>, RELATIVE_PATH), <build_version>)
FROM DIRECTORY(@<stage_name>);

This code snippet returns JSON data with all the extracted values as key-value pairs.

Note

Retrain the model if the extracted values are incorrect. If the extracted values are still incorrect after retraining, consider rephrasing the questions for those values and republishing the model.
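
One way to spot candidates for retraining is to look at the confidence scores in the output. The sketch below assumes a hypothetical table inspection_raw(file_name, json_content) that holds the raw PREDICT() output, and assumes that each extracted answer carries a score field next to its value. The 0.8 threshold is an arbitrary example, not a recommended setting.

```sql
-- Sketch: list extracted answers with a low confidence score.
-- Assumptions: inspection_raw is a hypothetical table with the raw PREDICT() output,
-- and each answer is an object of the form {"score": ..., "value": ...}.
SELECT
  r.file_name,
  f.key                    AS value_name,
  a.value['value']::STRING AS extracted_value,
  a.value['score']::FLOAT  AS score
FROM inspection_raw r,
  LATERAL FLATTEN(INPUT => r.json_content) f,   -- one row per defined value
  LATERAL FLATTEN(INPUT => f.value) a           -- one row per extracted answer
WHERE f.key <> '__documentMetadata'             -- skip document-level metadata
  AND a.value['score']::FLOAT < 0.8             -- arbitrary example threshold
ORDER BY score;
```

Answers that repeatedly show up in this list for the same value name are a hint that the corresponding question may need rephrasing.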

Conclusion

In conclusion, Snowflake Document AI stands out as a powerful tool for organizations seeking to automate and streamline their document processing workflows. By leveraging Snowflake’s robust data platform and machine learning capabilities, it enables businesses to extract valuable insights from unstructured data within documents, such as PDFs and images. The integration of natural language processing (NLP) and optical character recognition (OCR) technologies allows for accurate data extraction and transformation, significantly reducing manual effort and errors.

Before we conclude this blog post, we would like to go over some pros and cons that we noticed during the setup on Snowflake:

Pros:

Accurate results
Easy to understand interface
Simple question-answer-based extraction
Integrated within the Snowflake environment. The data is immediately available in any data pipeline without any extra setup

Cons:

Extracting answers needs some wrangling. Sometimes Document AI doesn’t immediately provide the correct answer, and the question needs to be rephrased or the model needs to be retrained multiple times to have a better chance of extracting the right answer.
It’s not possible to define an area on the document where the AI needs to find the answer to a specific question. It scans the entire document for each question, increasing the risk of extracting a wrong answer. Adding this feature could help with extracting correct answers from standardized documents.
