Authored by – Anusree K J and Vijaysai Turai
Introduction
In today’s data-driven world, where information reigns supreme, businesses are constantly seeking innovative solutions to efficiently manage and analyze their data . Snowflake, a leading cloud data platform, has been at the forefront of this revolution, consistently pushing the boundaries of what’s possible in data management. Their latest offering, Snowflake Cortex, represents a significant leap forward in data processing capabilities, promising to revolutionize how organizations harness the power of their data.
Understanding Snowflake Cortex
Snowflake Cortex is an advanced data processing engine built on top of Snowflake’s existing cloud data platform. It integrates seamlessly with Snowflake’s data warehouse and data lake capabilities, providing users a unified platform for storing, processing, and analyzing their data at scale. At its core, Cortex is designed to address modern data workloads’ growing complexity and diversity, offering unparalleled performance, scalability, and flexibility. By seamlessly integrating ML and AI capabilities into the Snowflake platform, Cortex enables users to leverage advanced analytics and gain deeper insights from their data.
Key Features and Capabilities
- AI-Powered Data Processing: Snowflake Cortex leverages AI and ML to optimize queries, identify data usage patterns, and predict future resource needs, enhancing data processing efficiency.
- Massively Parallel Processing (MPP): By distributing tasks across multiple cloud nodes, Cortex accelerates query performance and enables real-time analytics with high availability, fault tolerance, and scalability.
- Native Support for Semi-Structured Data: Cortex supports formats like JSON and Parquet, allowing users to ingest, query, and analyze various data types without complex transformations.
- Multi-Cloud Compatibility: Cortex operates across AWS, Azure, and Google Cloud, offering flexibility in deployment and seamless workload migration to avoid vendor lock-in.
- Advanced Security and Governance: With robust encryption, access controls, and auditing, Cortex ensures data security and compliance with regulatory standards, integrating smoothly with existing identity management systems.
Large Language Model (LLM) Functions
Snowflake Cortex gives you instant access to industry-leading large language models (LLMs) developed by researchers at companies like Mistral, Reka, Meta, and Google, including Snowflake Arctic, an open enterprise-grade model developed by Snowflake.
Since Snowflake fully hosts and manages these LLMs, using them requires no setup. Your data stays within Snowflake, giving you the performance, scalability, and governance you expect. Snowflake Cortex features are accessible as SQL functions and are also available in Python. The available functions are summarized below:
- COMPLETE: Generate responses to prompts or conversations.
- EXTRACT_ANSWER: Extract relevant information from unstructured data.
- EMBED_TEXT_768: Create vector embeddings of text for semantic comparison.
- SUMMARIZE: Automatically generate summaries of text.
- SENTIMENT: Analyze the mood or tone of text.
- TRANSLATE: Translate text documents to different languages.
Additionally, Snowflake Cortex offers text embedding and vector comparison features for semantic analysis, which are currently in private preview with select customers. These capabilities provide users with instant access to state-of-the-art LLMs without the need for complex setup, ensuring data security within the Snowflake environment.
LLM Function: COMPLETE
In data processing and sentiment analysis, Snowflake Cortex’s COMPLETE function is a standout tool for automating sentiment assessment in text data. This function generates responses using a chosen language model based on a given prompt. It supports simple use cases with single-string prompts and more complex interactive chat scenarios with multiple prompts and responses. Additionally, users can customize the output by specifying hyperparameter options for style and size.
SYNTAX
SNOWFLAKE.CORTEX.COMPLETE( <model>, <prompt_or_history> [ , <options> ] )
<model>: The COMPLETE function supports the following models.
snowflake-arctic mistral-large reka-flash Reka-core mixtral-8x7b | llama2-70b-chat llama3-8b llama3-70b mistral-7b Gemma-7b |
<prompt_or_history> : Your response to the prompt. It can be a string or a table.
Eg: Here review is the table
SELECT SNOWFLAKE.CORTEX.COMPLETE(
‘mistral-large’,
CONCAT(‘Critique this review in bullet points: <review>’, content, ‘</review>’)
) FROM reviews LIMIT 10;
<options>: Argument to control the inference hyperparameters in a single response. Specifying the options argument, even if it is an empty object ({}), affects how the prompt argument is interpreted and how the response is formatted.
Example: Here the hyperparameters are temperature and max_tokens
SELECT SNOWFLAKE.CORTEX.COMPLETE(
‘llama2-70b-chat’,
[
{
‘role’: ‘user’,
‘content’: ‘how does a snowflake get its unique pattern?’
}
],
{
‘temperature’: 0.7,
‘max_tokens’: 10
}
);
temperature: A value from 0 to 1 (inclusive) that controls the randomness of the output of the language model.
max_tokens: Sets the maximum number of output tokens in the response. Small values can result in truncated responses.
Example
Let’s explore a technical application of the COMPLETE function in real-world scenarios, specifically for analyzing sentiment in financial data.
Problem: Determine the sentiment of customer responses in the FINANCIAL_SENTIMENT_DATA table.
Solution: Using the llama2-70b-chat model and customer responses from the table, the COMPLETE function predicts whether each response is positive, negative, or neutral. The prompt includes the string and the sentence column from the FINANCIAL_SENTIMENT_DATA table, specifying that the response should be a single word. The output for each sentence is then added to a new column, COMPLETE_OUTPUT, in the same table.
OUTPUT:
The following output shows the CORTEX_RESULT column, which represents the output generated by the COMPLETE function model. This table demonstrates how the COMPLETE function assesses sentiment in customer responses, categorizing each as positive, negative, or neutral.
In conclusion, by utilizing Snowflake Cortex’s COMPLETE function and Snowflake’s robust data processing capabilities, organizations can automate sentiment analysis tasks and extract valuable insights from textual data efficiently.
LLM Function: SENTIMENT
SYNTAX
SNOWFLAKE.CORTEX.SENTIMENT(<text>)
<text>: A string containing the text for which the sentiment score is to be determined.
RETURNS:
A floating-point number ranging from -1 to 1 (inclusive) indicates the text’s sentiment level. Negative values represent negative sentiment, positive values indicate positive sentiment and values near 0 suggest neutral sentiment.
Example:
In this example, a table named FINANCIAL_SENTIMENT_DATA_OUTPUT contains a column named SENTIMENT with user text reviews. The query returns a sentiment score for each review.
SELECT SNOWFLAKE.CORTEX.SENTIMENT(SENTIMENT) AS SCORE, SENTIMENT
FROM FINANCIAL_SENTIMENT_DATA_OUTPUT
LIMIT 10;
Output:
The sentiment score of each review is added in the SCORE column.
LLM Function: SUMMARIZE
The SUMMARIZE function condenses a given English-language text into a concise summary.
SYNTAX
SNOWFLAKE.CORTEX.SUMMARIZE(<text>)
- <text>: A string containing the English text to be summarized.
Returns:
A string with the summary of the original text.
Example:
SELECT SENTENCE,
SNOWFLAKE.CORTEX.SUMMARIZE(SENTENCE) as SUMMARY
FROM CORTEX_DB.SENTIMENT_TEST_DATASET.FINANCIAL_SENTIMENT_DATA_OUTPUT
limit 100;
Output:
Consider one sentence from the above output:
SENTENCE: The percentages of shares and voting rights have been calculated in proportion to the total number of shares registered with the Trade Register and the total number of voting rights related to them.
SUMMARY: Shares and voting rights have been proportionally calculated based on the total registered shares and related voting right.
LLM Function: EXTRACT_ANSWER
The EXTRACT_ANSWER function retrieves an answer to a specific question from a text document.
SYNTAX
SNOWFLAKE.CORTEX.EXTRACT_ANSWER(
<source_document>, <question>)
- <source_document>: A string containing the plain-text or JSON document with the answer.
- <question>: A string containing the question.
Returns:
A string with the answer to the given question.
Example
In this example, SENTENCE is a column from the FINANCIAL_SENTIMENT_DATA_OUTPUT table:. To extract an answer from each row of the table:
SELECT SENTENCE,
SNOWFLAKE.CORTEX.EXTRACT_ANSWER(SENTENCE, ‘What dishes does this review mention?’) as ANSWER
FROM CORTEX_DB.SENTIMENT_TEST_DATASET.FINANCIAL_SENTIMENT_DATA_OUTPUT
LIMIT 10;
LLM Function: EMBED_TEXT_768 (preview feature)
The EMBED_TEXT_768 function generates a vector embedding from English-language text.
SYNTAX
SNOWFLAKE.CORTEX.EMBED_TEXT_768( <model>, <text> )
- <model>: A string specifying the embedding model to use (e.g., ‘snowflake-arctic-embed-m’, ‘e5-base-v2’).
- <text>: The text to be embedded.
Returns:
A vector embedding of type VECTOR.
Example
In this example, a vector embedding is generated for the phrase hello world using the snowflake-arctic-embed-m model:SELECT SNOWFLAKE.CORTEX.EMBED_TEXT_768(‘snowflake-arctic-embed-m’, ‘hello world’);
LLM Function: TRANSLATE
The TRANSLATE function translates text from one supported language to another
Syntax
SNOWFLAKE.CORTEX.TRANSLATE(
<text>, <source_language>, <target_language>)
- <text>: A string containing the text to be translated.
- <source_language>: A string specifying the current language code of the text.
- <target_language>: A string specifying the language code to translate the text into.
Returns:
A string containing the translated text.
Example
The following example translates each row of a table from English to German (in this example, review_content is a column from the reviews table):
SELECT SNOWFLAKE.CORTEX.TRANSLATE(review_content, ‘en’, ‘de’)FROM reviews
LIMIT 10;
Machine Learning (ML) Functions
Snowflake Cortex’s ML Functions empower analysts, data engineers, and business users to perform predictive analysis and gain valuable insights from their structured data. These functions, accessible through SQL, include:
- Forecasting: Predict future values based on historical data.
- Anomaly Detection: Identify potential outliers in datasets.
- Contribution Explorer: Determine factors influencing a value of interest.
Additionally, Snowflake Cortex offers advanced ML features, such as classification functions and a low-code web interface for ML functions within Snowsight, which is currently in private preview with select customers.
Example:
Pros:
Low-Code Implementation: Provides a simplified approach for anomaly detection and forecasting, requiring minimal coding effort.
Simple SQL Commands: Utilizes straightforward SQL commands for model training, making it accessible to a broader audience.
Minimal Data Preprocessing: Eliminates the need for complex data preprocessing steps such as normalization or one-hot encoding (though handling missing values is still necessary).
No Prior ML Knowledge Needed: Designed for data analysts and data engineers, removing the necessity for advanced data science or machine learning expertise.
Limitations:
Time Series Data Requirement: Applicable exclusively to time series data, with a minimum time difference between data points of one second.
Immutable Models: Once trained, models cannot be updated. New data requires training a separate model.
Training Data Limitation: Cannot detect anomalies or forecast values within the training data, potentially leading to issues with bias and variance.
When to Use:
Fixed Data Training: Ideal for training anomaly detection or forecasting models on time series data where no new data is expected.
Target Users: Perfect for data analysts and data engineers looking for an efficient, low-code ML solution.
Snowpark ML: Empowering Data Scientists
For data scientists and developers seeking more customization and control over ML models, Snowflake Cortex offers Snowpark ML. Snowpark ML provides a Python API that allows users to develop, deploy, and utilize their ML models directly within the Snowflake environment. This flexibility enables data scientists to tailor ML solutions to their specific needs and seamlessly integrate them into their data workflows.
In summary, Snowflake Cortex represents a significant advancement in data analytics and AI, providing users with a comprehensive suite of ML and AI capabilities directly within the Snowflake platform. From LLM Functions for processing unstructured text data to ML Functions for predictive analysis, Cortex offers a range of tools to unlock insights and drive innovation. With Snowpark ML, data scientists can further extend Cortex’s capabilities, empowering them to develop custom ML solutions tailored to their unique requirements.
Real-World Applications
The versatility and power of Snowflake Cortex make it well-suited for a wide range of use cases across industries:
- Business Intelligence and Analytics: Cortex enables organizations to perform complex analytics, generate actionable insights, and make data-driven decisions in real-time, driving business growth and competitive advantage.
- Data Warehousing and ETL: Cortex simplifies the process of building and maintaining data warehouses, accelerating ETL (Extract, Transform, Load) pipelines, and reducing the time-to-insight for data-driven initiatives.
- Predictive Modeling and Machine Learning: By combining Cortex’s AI capabilities with advanced ML algorithms, organizations can develop and deploy predictive models for forecasting, anomaly detection, and other data-driven applications.
- IoT (Internet of Things) and Sensor Data Analysis: Cortex can handle the massive volume and velocity of data generated by IoT devices and sensors, enabling organizations to derive valuable insights, optimize operations, and drive innovation in areas such as smart manufacturing, healthcare, and logistics.
Conclusion
Snowflake Cortex represents a paradigm shift in data processing technology, empowering organizations to unlock the full potential of their data assets and drive digital transformation at scale. By combining the power of AI, MPP, and cloud-native architecture, Cortex offers unmatched performance, scalability, and flexibility, making it the ideal platform for modern data-driven enterprises. As businesses continue to embrace the era of data-driven decision-making, Snowflake Cortex stands ready to lead the charge into a future where data is not just a resource but a strategic asset driving innovation and growth.