This post is for Day 15 of Mercari Advent Calendar 2023, brought to you by @sathiya from the Mercari Data Operations team.
This article lists best practices, tips, and tricks from the nooks and corners of the BigQuery documentation. Some of these may already be familiar to you, and some may blow your mind. So, get ready to unleash performance and unlock cost optimization in your BigQuery data warehouse.
Organizing Data
There are many ways of organizing data in BigQuery, including sharded tables, partitioned tables, and clustered tables. With sharded tables, the data resides in many separate tables, and BigQuery has to maintain the schema and metadata for each of them. Given the cumbersome maintenance and weaker query performance, Google suggests using partitioning instead.
Partitioned tables are divided into segments based on either a chosen partitioning column or the ingestion time, via the `_PARTITIONTIME` pseudo-column. When a query against a partitioned table filters on the partitioning column, BigQuery scans only the relevant partitions.
If the query filters and the columns they use are known in advance, the performance of partitioned tables can be boosted further by defining clustering columns. The user sets this table property on partitioned tables based on the columns used in filters, and it helps queries fetch the relevant data faster and more cheaply.
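As a minimal sketch, this is what a partitioned and clustered table definition might look like; the project, dataset, and column names are hypothetical:

```sql
-- Hypothetical example: an events table partitioned by date and clustered
-- on the columns most commonly used in query filters.
CREATE TABLE `my_project.my_dataset.events`
(
  event_date DATE,
  user_id    STRING,
  event_name STRING,
  payload    STRING
)
PARTITION BY event_date          -- queries filtering on event_date scan only matching partitions
CLUSTER BY user_id, event_name;  -- co-locates rows sharing these filter values
```

A query filtering on `event_date` and `user_id` would then scan only the relevant partition and benefit from clustering on top of that.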
If you are wondering how to migrate from Sharded to Partitioned tables, here are the instructions on creating time-partitioned tables from sharded tables.
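For illustration only (the linked documentation is the authoritative reference), one common migration pattern is to query all the shards through a wildcard table and derive the partition key from `_TABLE_SUFFIX`; the table names here are hypothetical:

```sql
-- Hypothetical sketch: combine date-sharded tables (events_20231201, events_20231202, ...)
-- into a single date-partitioned table.
CREATE TABLE `my_project.my_dataset.events_partitioned`
PARTITION BY event_date
AS
SELECT
  PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) AS event_date,  -- the shard suffix becomes the partition key
  *
FROM `my_project.my_dataset.events_*`;
```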
Data Catalogs for Data Exploration
Data exploration and data catalogs go hand in hand. As a data-driven organization grows and expands, navigating and exploring the tables stored in BigQuery can become difficult. This is where data catalogs are quite helpful in making the existing data discoverable and useful.
To give a few ideas, data catalogs:
- Associate data assets with their respective data owners
- Help users understand data lineage – the relationships between datasets and tables
- Document and maintain datasets for everyone’s usability
- Define data products
- Act as an invaluable tool for the FinOps team
Although a data catalog is usually built by a specific team, it becomes truly helpful when the employees, A.K.A. the data citizens of an organization, contribute to enriching it by adding more documentation and properties to the datasets.
At Mercari, the data operations team maintains the organization’s data catalog and constantly improves it.
Demystifying Prorated BigQuery Storage Costs
BigQuery storage costs are straightforward, and the calculation is prorated. Storage is classified into active and long-term storage. The flat price for active storage is $0.02 per GB per month. When a table or partition goes unedited for 90 consecutive days (only SELECTs; no DDL or DML modifications), its storage class changes from active to long-term, which halves the price from $0.02 to $0.01 per GB per month.
Consider the following calculations:
| Storage Size | Period | Cost in Dollars [Active Storage] |
|---|---|---|
| 1000 MB | 1 month | $0.02 |
| 100 MB | 1 month | $0.002 |
| 100 MB | 1/2 month | $0.001 |
Beyond 90 days with no edits to the table/partition, the long term storage applies.
| Storage Size | Period | Cost in Dollars [Long-Term Storage] |
|---|---|---|
| 100 MB | 1/2 month | $0.0005 |
So while storing 100 MB of data for half a month costs $0.001 in active storage, the same data costs $0.0005 under long-term pricing. Similarly, consider storing 1 GB of data for 24 hours. This results in the following calculations:
| Storage Size | Period | Cost in Dollars [Active Storage] |
|---|---|---|
| 1000 MB | 730 hrs^ | $0.02 |
| 1000 MB | 24 hrs | $0.00066 |

^ 1 month consists of 730 hours
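If you want to see how this breaks down for your own tables, the `INFORMATION_SCHEMA.TABLE_STORAGE` view exposes active and long-term byte counts. A minimal sketch, assuming the US multi-region and the logical storage prices above (check the pricing for your region):

```sql
-- Sketch: estimate the monthly logical storage cost per table,
-- assuming $0.02/GB active and $0.01/GB long-term pricing.
SELECT
  table_schema,
  table_name,
  ROUND(active_logical_bytes    / POW(1024, 3) * 0.02, 4) AS active_usd_per_month,
  ROUND(long_term_logical_bytes / POW(1024, 3) * 0.01, 4) AS long_term_usd_per_month
FROM `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
ORDER BY active_usd_per_month DESC
LIMIT 10;
```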
If you are thinking about archiving your tables to Google Cloud Storage (GCS) for cost-saving purposes, you should consider the coldline/archive GCS pricing.
Here’s a comparison between BigQuery and GCS pricing.
- Active storage in BQ is similar to the Standard storage in GCS
- Long term storage in BQ is similar to the Nearline storage in GCS
So, it’s better to consider Coldline (or Archive) storage in GCS when archiving for more cost savings. When tables are exported to GCS, external queries can be used to fetch the data from the GCS location through BQ. There are no charges for data retrieval, but you pay for the slot usage. ⚠️ This behavior was observed when this article was written and may change in the future. ⚠️
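As a hedged sketch of that workflow (the bucket, paths, and table names are hypothetical), you could export a table with `EXPORT DATA` and later read the files back through an external table; note that the GCS storage class (e.g., Coldline) is a property of the bucket, not of the SQL:

```sql
-- Sketch: archive a table to GCS as Parquet files (bucket and path are hypothetical).
EXPORT DATA OPTIONS (
  uri = 'gs://my-archive-bucket/events/*.parquet',
  format = 'PARQUET',
  overwrite = true
)
AS SELECT * FROM `my_project.my_dataset.events`;

-- Later, query the archived files from BigQuery through an external table.
CREATE EXTERNAL TABLE `my_project.my_dataset.events_archive`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-archive-bucket/events/*.parquet']
);
```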
Table and Partition Expiry Settings
Table and partition expiry settings are often the least considered by BigQuery users, but they can bring huge cost savings in the long term. Expiry can be applied at the dataset level or the table level, which often leads to confusion. The rules can be simplified as follows (see the sketch after the list):
- Table-level properties take precedence over the dataset-level expiry definitions
- Partition expiry settings take precedence over the table expiry settings
- When a table-level partition expiry is applied, any partitions that have already passed the expiry are deleted immediately
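A minimal sketch of the three settings, with hypothetical names and retention periods:

```sql
-- Dataset-level default: new tables expire 90 days after creation unless overridden.
ALTER SCHEMA `my_project.my_dataset`
SET OPTIONS (default_table_expiration_days = 90);

-- Table-level expiry: overrides the dataset default for this table.
ALTER TABLE `my_project.my_dataset.events`
SET OPTIONS (expiration_timestamp = TIMESTAMP '2024-12-31 00:00:00 UTC');

-- Partition expiry: overrides the table expiry; partitions older than 30 days are dropped.
ALTER TABLE `my_project.my_dataset.events`
SET OPTIONS (partition_expiration_days = 30);
```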
Unlearn SELECT * FROM
BigQuery stores data in a columnar format, like all other modern data warehouses. We are accustomed to running `SELECT * FROM` in our queries. It’s about time we unlearn this habit and switch to `SELECT col1, col2 FROM` instead, fetching only the required columns while querying the tables. Since queries are billed by the number of bytes processed, this leads to massive cost benefits.
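A small illustration against a hypothetical events table:

```sql
-- Scans every column in the table: you pay for all of the bytes.
SELECT * FROM `my_project.my_dataset.events`;

-- Scans only the two columns needed: far fewer bytes processed.
SELECT user_id, event_name FROM `my_project.my_dataset.events`;

-- If you need almost everything, EXCEPT lets you skip just the heavy columns.
SELECT * EXCEPT (payload) FROM `my_project.my_dataset.events`;
```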
Using Table Previews & Temporary Tables
Why run a `SELECT … FROM` when we could preview the table instead? For FREE! Many times, we forget about the preview option in the BigQuery console, and it comes in quite handy during data exploration.
Fig 1 – Table Preview – Screenshot from Google BigQuery
Speaking of freebies, did you know that the results of every query are cached as a temporary table and retained for 24 hours? These temporary tables are deleted after 24 hours, and of course, their storage is free for that period.
This can be very helpful when creating tables for exploratory analysis, provided you don’t intend to share the temporary tables and don’t need to store them for more than a day. ⚠️ This feature existed at the time the article was written and may change in the future. ⚠️
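As a minimal sketch (table and column names are hypothetical), a TEMP table inside a multi-statement query gives you similar free, short-lived scratch space:

```sql
-- Sketch: a temporary table for exploratory analysis; it is cleaned up
-- automatically and incurs no storage cost in the meantime.
CREATE TEMP TABLE daily_actives AS
SELECT event_date, COUNT(DISTINCT user_id) AS dau
FROM `my_project.my_dataset.events`
GROUP BY event_date;

-- Reuse it within the same session without re-scanning the source table.
SELECT AVG(dau) AS avg_dau FROM daily_actives;
```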
Look before you Leap
Apart from the previous points, the following are some quirks one must keep in mind to bring out the best of BigQuery.
- It’s always good to keep an eye on the estimated bytes to be processed before a query is run. You can find this estimate in the top right corner of the query editor in the BigQuery console. It helps avoid unwanted spending on query runs and slot usage.
Fig 2 – Look Before You Leap – Screenshot from Google BigQuery
- While using Common Table Expressions (CTEs), A.K.A. the ‘WITH’ clause, if you are about to reference the same expression multiple times, consider defining it as a recursive CTE using the ‘WITH RECURSIVE’ keyword: BigQuery materializes the results of recursive CTEs, whereas non-recursive CTEs are re-evaluated each time they are referenced. You can read the full usage at this link, and see the sketch after this list.
- Use table constraints (unenforced primary and foreign keys) wherever you can establish relationships between BigQuery tables. BigQuery does not enforce them, but they give the query optimizer extra information, which can indirectly improve query execution and performance. A sketch follows after this list.
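As a minimal sketch of the ‘WITH RECURSIVE’ syntax, here is a recursive CTE that generates the numbers 1 to 5 and is referenced twice; its result is materialized rather than computed once per reference:

```sql
-- Sketch: a recursive CTE; BigQuery materializes its result set.
WITH RECURSIVE numbers AS (
  SELECT 1 AS n                          -- base case
  UNION ALL
  SELECT n + 1 FROM numbers WHERE n < 5  -- recursive step
)
SELECT a.n, b.n AS next_n
FROM numbers AS a
JOIN numbers AS b ON b.n = a.n + 1;      -- second reference reuses the materialized result
```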
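And a hedged sketch of table constraints (table and column names are hypothetical); BigQuery does not enforce these keys, but declaring them describes the relationships for the optimizer:

```sql
-- Sketch: declare unenforced primary and foreign keys between two tables.
ALTER TABLE `my_project.my_dataset.users`
ADD PRIMARY KEY (user_id) NOT ENFORCED;

ALTER TABLE `my_project.my_dataset.events`
ADD CONSTRAINT fk_events_user
FOREIGN KEY (user_id)
REFERENCES `my_project.my_dataset.users` (user_id) NOT ENFORCED;
```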
Conclusion
BigQuery is constantly being updated with new features. We believe the tips and tricks mentioned in this article will be useful for BigQuery users.
Tomorrow’s article will be by Lukas from the Design System Team. Stay tuned!