
Snowflake Document AI

Authored by – Rajiv Gupta

What is Document AI?

Purpose

Document AI is a Snowflake feature that uses Arctic-TILT, a proprietary large language model (LLM), to extract data from documents.

Functionality

Data Extraction: Document AI extracts data from various document formats, including both text-heavy paragraphs and graphical content (e.g., logos, handwritten text, checkmarks).  

Zero-Shot Extraction: The foundation model can locate and extract information specific to a document type, even if it hasn’t seen that exact document before.

Fine-Tuning: Users can fine-tune the model for their specific use case by training it on relevant documents.

Privacy: The fine-tuned model and its training data remain private and are not shared with other Snowflake customers.


Document AI High-Level Layout

1. Model Creation and UI:

Document AI provides a user interface for creating model builds.

Model builds represent specific document types or use cases (e.g., invoice extraction).

2. Components of a Model Build:

Model: The AI model at the core, trained on various documents.

Data Values: Define which information to extract (fields, keywords, etc.).

Training Documents: Upload samples for model learning and evaluation.

3. Extraction Query:

Call the PREDICT method of a specific model build (<model_build_name>!PREDICT).

Extract relevant data from documents based on learned patterns.

4. Continuous Processing Pipelines:

Set up pipelines for ongoing document processing.

Integrate with streams and tasks for efficiency.
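As a minimal sketch of the extraction query in step 3, the statements below run a model build against every file in a stage. The names doc_stage and invoice_model, and the version number 1, are hypothetical placeholders rather than objects from this article. The stage options shown (a directory table and server-side encryption) are assumed to match your account setup.

```sql
-- Hypothetical names: doc_stage (stage), invoice_model (model build), version 1.
-- Internal stage with a directory table and server-side encryption.
CREATE STAGE IF NOT EXISTS doc_stage
  DIRECTORY = (ENABLE = TRUE)
  ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');

-- After uploading files, refresh the directory table so they are listed.
ALTER STAGE doc_stage REFRESH;

-- Call PREDICT once per staged file; the result is a JSON object per document.
SELECT
  relative_path AS file_name,
  invoice_model!PREDICT(
    GET_PRESIGNED_URL(@doc_stage, relative_path), 1
  ) AS extracted_json
FROM DIRECTORY(@doc_stage);
```

The query needs an active warehouse, so it incurs both warehouse and AI Services compute.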

Remember, Document AI combines zero-shot extraction and fine-tuning, making it a powerful tool for automating document processing tasks.

Use Cases

1. Structured Data Extraction:

Scenario: When you need to transform unstructured data from documents into structured data stored in tables.

Benefit: Document AI helps convert messy, text-heavy content into organized, query-friendly formats.

2. Continuous Document Processing:

Scenario: When you want to create pipelines for ongoing processing of new documents of a specific type (e.g., invoices, contracts).

Benefit: Document AI streamlines repetitive tasks by automating document handling and extraction.

3. Collaboration Between Business Users and Data Engineers:

Scenario: Business users with domain expertise prepare the model (defining what to extract), while data engineers work with SQL to set up processing pipelines.

Benefit: This collaborative approach effectively utilizes both business knowledge and technical skills.
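The continuous processing use case can be sketched with a stream on a stage plus a scheduled task. This is an illustration under assumptions, not a reference implementation: doc_stage, doc_stream, doc_results, doc_task, my_wh, and invoice_model are hypothetical names, and the stage is assumed to already exist with its directory table enabled.

```sql
-- Hypothetical names throughout; replace with your own objects.
-- Stream that records new files landing in the stage's directory table.
CREATE STREAM IF NOT EXISTS doc_stream ON STAGE doc_stage;

-- Target table for raw extraction results.
CREATE TABLE IF NOT EXISTS doc_results (
  file_name      STRING,
  file_size      NUMBER,
  last_modified  TIMESTAMP_LTZ,
  extracted_json VARIANT
);

-- Task that runs only when the stream has new rows.
CREATE TASK IF NOT EXISTS doc_task
  WAREHOUSE = my_wh
  SCHEDULE = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('doc_stream')
AS
  INSERT INTO doc_results
  SELECT
    relative_path,
    size,
    last_modified,
    invoice_model!PREDICT(GET_PRESIGNED_URL(@doc_stage, relative_path), 1)
  FROM doc_stream
  WHERE METADATA$ACTION = 'INSERT';

-- Tasks are created suspended; resume to start the schedule.
ALTER TASK doc_task RESUME;
```

In this division of labor, business users maintain the model build in Snowsight, while data engineers own the stream, task, and target table.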

Cost

Broadly, Document AI incurs cost in the three categories below.

1. AI Services compute: Extracting information from documents with the <model_build_name>!PREDICT method consumes AI Services compute.

2. User Compute (Virtual Warehouse): To execute queries within worksheets (including calls to the <model_build_name>!PREDICT method), you must choose a warehouse. Other data retrieval operations run in those worksheets also incur warehouse costs.

3. Storage: To evaluate the Document AI model, you upload documents to the Document AI user interface in Snowsight. You can review the results and, if needed, fine-tune the model through additional training. Storing these results within a Snowflake class object in your account may result in storage costs. Additionally, uploading documents to either an internal or external stage may incur storage costs when extracting information using SQL.

Credit consumption for Document AI depends on the following:

  • Number of pages: the total number of pages in multipage document formats (e.g., PDFs).
  • Number of documents: the quantity of individual documents processed.
  • Page density: how densely packed the content is on each page (e.g., text-heavy vs. sparse).
  • Number of data values: the specific information (fields, keywords, entities) to extract from the documents.

To review AI Services credit usage, query the metering history view:

SELECT * FROM SNOWFLAKE.ORGANIZATION_USAGE.METERING_DAILY_HISTORY
WHERE service_type ILIKE '%ai_services%';
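For trend monitoring, the same view can be aggregated by day. The sketch below assumes the view exposes usage_date, account_name, and credits_used columns; verify the column names in your account before relying on it. The AI Services service type may also cover other Snowflake AI features, so the totals are not necessarily specific to Document AI.

```sql
-- Daily AI Services credits per account, most recent first.
-- Column names are assumed; check them against the view definition.
SELECT
  usage_date,
  account_name,
  SUM(credits_used) AS ai_services_credits
FROM SNOWFLAKE.ORGANIZATION_USAGE.METERING_DAILY_HISTORY
WHERE service_type ILIKE '%ai_services%'
GROUP BY usage_date, account_name
ORDER BY usage_date DESC;
```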

Limitations

1. Supports English language processing; results for other languages may be unsatisfactory.

2. Processes specific document formats and sizes.

  • The documents must be no more than 125 pages long.
  • The documents must be 50 MB or less in size.
  • Document pages must have dimensions of 1200 x 1200 mm or less.
  • The images must be between 50 x 50 and 10,000 x 10,000 pixels.
  • The documents must be in one of the following formats: PDF, PNG, DOCX, EML, JPEG, JPG, HTML, TEXT, TXT, TIF, TIFF.

3. Maximum of 20 documents per query.

4. No whole table extraction in a single query.

5. No privilege inheritance between roles.

6. No simultaneous model building by multiple users in Snowsight.

7. Supports AWS and Microsoft Azure commercial regions, except for specific AWS regions (Asia Pacific Singapore, Asia Pacific Osaka, and EU Paris).
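Some of these limits can be checked before calling the model. The sketch below filters a stage's directory table by file size and extension and caps the batch at 20 files. The stage name doc_stage is a hypothetical placeholder. Page count, page dimensions, and image resolution are not visible in the directory table, so this filter does not cover them.

```sql
-- Hypothetical stage name: doc_stage.
-- Select a batch of files that satisfy the size, format, and per-query limits.
SELECT relative_path, size
FROM DIRECTORY(@doc_stage)
WHERE size <= 50 * 1024 * 1024            -- 50 MB limit, size is in bytes
  AND LOWER(SPLIT_PART(relative_path, '.', -1)) IN
      ('pdf', 'png', 'docx', 'eml', 'jpeg', 'jpg',
       'html', 'text', 'txt', 'tif', 'tiff')
ORDER BY last_modified
LIMIT 20;                                  -- at most 20 documents per query
```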

Demo Use Cases

1. Resume Analyzer – This utility scans a candidate's resume in the standard KIPI format and extracts key data points using Document AI. The extracted data is then used to summarize the candidate's profile, which helps the Tag team identify potential candidates without reading through thousands of pages.

2. Invoice Scanner – This utility scans invoice images, PDFs, and similar files and extracts key data points using Document AI. The extracted data is then inserted into a table for further transformation. This removes the heavy lifting of performing the extraction with a third-party tool or technology such as Python, which makes the pipeline thinner, more transparent, and easier to troubleshoot.
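For the invoice case, the extracted JSON can be turned into ordinary columns with a view. This is a sketch under assumptions: doc_results is a hypothetical table holding raw PREDICT output in a VARIANT column, invoice_number and total_amount are example data value names that would be defined in the model build, and the JSON is assumed to store each data value as an array of objects with score and value keys.

```sql
-- Hypothetical table: doc_results(file_name STRING, extracted_json VARIANT).
-- Hypothetical data values: invoice_number, total_amount.
CREATE OR REPLACE VIEW invoice_fields AS
SELECT
  file_name,
  extracted_json:__documentMetadata:ocrScore::FLOAT  AS ocr_score,
  extracted_json:invoice_number[0]:value::STRING     AS invoice_number,
  extracted_json:invoice_number[0]:score::FLOAT      AS invoice_number_score,
  extracted_json:total_amount[0]:value::STRING       AS total_amount
FROM doc_results;
```

Keeping the confidence score next to each value lets downstream transformations route low-confidence rows to manual review.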

Snowsight UI Sample

In the architecture below, documents are uploaded to a Snowflake stage. Document AI processes these documents through an ingestion pipeline and extracts the relevant information. The analytics derived from this data are then presented on a dashboard.

June 28, 2024