Author: Sudhendu Pandey
Introduction
If you are a Snowflake customer, the ease of maintenance is definitely one of the selling points. You can spin up your Snowflake instance in minutes, pay using your credit card, never worry about auto-scaling, and many other features are definitely an eye-catcher and provide a low entry barrier.
But with such simplicity, there are times when you need to get things done in a certain way. Data archival strategy is one good example where you have to think beyond what is offered OOTB. Especially to accommodate compliance requirements
Data Archival
Data archival is a practice in data warehousing (or any data application), where infrequent data is moved to low-cost, low-performance storage. This ensures two things:
- reduces data volume at the datawarehouse ensuring better (query) performance.
- store data past the ‘business window’ until the ‘compliance window’ end.
Data Archival ≠Data Backup
Although the mechanism might be the same, data archival is not the same as data backup. Data backup is the response to the Disaster Recovery Strategy and Business Continuity Plan (BCP), data archival is for performance, cost, and compliance. Backups are purpose-built for quick recovery, unlike archives, which can be slow in recovery.
Data Archival — Type
Data can be archived at various stages based on your business requirements. On a high level, there are three stages to the data archival journey:
- HOT: This is where your data is not yet archived. It is readily available with top performance and query time. Here, dollar to bytes cost is the highest and should be since it’s your business data. On the ground, this is data stored in your tables that the downstream applications (reporting, BI, Data Science, etc.) will be using.
- WARM: There can be an agreed-upon cut-off time (like 1 year), post which the data is not regularly requested by businesses. But it still needs to stay to ensure Just In Case needs and compliance needs. This is warm storage. Not too hot, yet not too cold. It can be easily moved out to the HOT region to make it business ready. These are typically copied tables or unloaded data stored in S3 (or any object storage) and compressed.
- COLD: This is where the data has moved out of business requirements and entered the compliance and security window. Here, the important factor is costs, hence, whichever storage mechanism is selected, need not be the best in query performance. When required by the compliance team, this data can be technically moved to the warm and then hot area to be queried.
Tip: Everything after this line is explaining the decision flowchart above. If you understand the flow, you might as well skip through! ?
While setting up the archival strategy, there are a few motivating factors that can help us decide which way to go forward.
Factor 1: Cost and Performance
Snowflake is rather cheap (if managed correctly). So if the main motivation is to manage costs and performance, there are a couple of options you can look at before moving forward with an elaborate archival strategy:
- Snowflake provides micro-partitioning and clustering keys which ensures query pruning is optimal. If your query performance is not at par, it will be a good idea to verify and optimize the clustering keys.
- If storage cost is the factor, know that Snowflake compresses all the table data and time travel data (5x reduction). So 10 TB of data will only require storage of 2 TB at Snowflake. If that still is the issue, it will be a good idea to move to the Snowflake Capacity Plan instead of the on-demand plan. The storage cost reduces by more than 40%!
Factor 2: The likelihood of data being requested post the warm or cold archival
Would there be a case where the user might want to query archived data?
Factor 3: Is your Snowflake database architecture layer? (Raw, Confirmed, Reporting, etc?)
If layered architecture, do we require a different pattern for each layer?
Factor 4: Is the Archiving strategy fully compliant with relevant compliance regulations?
There is no point if your archiving strategy is not in-line with the regulatory requirement that you follow. Archiving is one of those things which if half done is the same as not done at all. Please consult with your IT officers to verify compliance requirements before defining the strategy.
Factor 5: Is the organization willing to procure and spend on other low-cost storage tools?
Archiving in Snowflake is not rocket science. Here are a few very good articles and discussions pointing out how to get it done. The tools might depend on your underlining cloud provider, but the technique is the same everywhere.