Materialized View in Google BigQuery
What is a Materialized View?
In the data analytics world, when a data analytics team creates a report/dashboard to get insights into the data they have to wait to load or refresh the data on the same. The reason behind this is the tool fetches the data from multiple tables of the database using joins or a big complex SQL query. Although, they can optimize the query as much as possible but still the load time is not as per the expectation.
They have another solution to get data by creating the view at the database side and call the view at the BI tool side. That will improve the performance of the report. But wait, there are two things on the database side.
- Materialized View
So, view is nothing more than a virtual table created by a table or multiple tables which depends on the written SQL query. It is the same as the database table but it is not stored physically in the disk. It works as a virtual table and whenever you want to fetch the information from View it normally fetches from the base table. So, it always shows updated data from the base table.
A Materialized View is a database object that contains the snapshot results of a query. It is populated only at the time of creation. Whenever you want to have the updated result from the Materialized View, you have to refresh it.
Materialized View is mostly used in the data warehouse or Business intelligence applications where frequent queries of the actual base tables can be expensive.
BigQuery is a very popular fully managed, serverless, cloud data warehouse service provided by Google where you can store and query a massive amount of data. It is a very cost-effective solution for the data warehouse. BigQuery has the potential to scale petabytes of data in seconds and it has real-time data analysis capabilities.
Materialized View in BigQuery
The Materialized View in BigQuery is a precomputed view that periodically caches the results of a query in order to improve performance and efficiency. BigQuery leverages pre-computed results from materialized views and read-only delta changes from the base table to compute up-to-date results.
The Materialized View results faster and consumes fewer resources. All the incremental data of the base table will automatically be added in the Materialized View.
Creating Materialized View in BigQuery
In order to demonstrate the performance of Materialized View, I’m using a MusicBrainz dataset which has a recording table and the size of the table is 1.49 GB and it has around 16 million rows.
There is a good feature of BigQuery that if the query which you have written in the editor is correct syntax wise then it will show you the amount of data that will fetch the query on the top right corner and if you run the below query, you can see that the query fetched 581.4 MB data when you fire it directly on the table and as per the execution plan the elapsed time took 29.7 seconds.
SELECT name, COUNT(*) AS name_count, SUM(artist_credit) AS sum_artist_credit FROM `bigquery-263309.musicbrainz.recording` where edits_pending=0 GROUP BY Name
After getting a result, you can also see in the Execution plan that the majority of the time the query takes to compute the data in the input stage and output stage. let’s have a look at it by creating Materialized View.
Create the materialized view
In order to create Materialized View for the same query, you will use the below query and when you execute the query, the query takes 0B to execute.
CREATE MATERIALIZED VIEW `bigquery-263309.musicbrainz.materialized_view` AS SELECT name, COUNT(*) AS name_count, SUM(artist_credit) AS sum_artist_credit FROM `bigquery-263309.musicbrainz.recording` where edits_pending=0 GROUP BY Name
Now, you can measure the performance and efficiency of Materialized View by running the below query-
SELECT * from `bigquery-263309.musicbrainz.materialized_view`
You will get to see that the query will take 332.3 MB to fetch the data and you will get the elapsed time 19.7 sec.
After getting the result of the query and you can also see that the majority of time is spent in the output stage as the result of the query is precomputed.
Alter the Materialized View
To alter the Materialized View for setting the enable refresh to true, run the below query–
ALTER MATERIALIZED VIEW `bigquery-263309.musicbrainz.materialized_view` SET OPTIONS (enable_refresh=true)
Drop the Materialized View
To drop the Materialized view –
DROP MATERIALIZED VIEW IF EXISTS `bigquery-263309.musicbrainz.materialized_view`
The alteration and deletion of this Materialized View take 0B to processed.
Refreshing the Materialized View
BigQuery allows automatic and manual refresh of Materialized View.
By default, the materialized view refreshes itself every 30 min. This can be enabled or disabled at any time by altering the view and setting enable_refresh as true or false.
Setting up frequency cap
You can also set the frequency cap means how often the automatic refresh run. You can enable and disable it at any time by altering the view and set refresh_interval_minutes = 60. The minimum and maximum frequency cap is 1 minute and 7 days respectively.
Materialized View can be refresh manually at any time. In order to do the same, you need to call BQ.REFRESH_MATERIALIZED_VIEW system procedure. BigQuery identifies the changes and reflects them in the materialized view.
Monitor the Materialized View Job Refresh
In BigQuery, obviously, the Materialized View also takes time to refresh itself whether it refresh automatically or manually. And as you are working on a very massive amount of data. You can monitor the Materialized View status/refresh interval/time in refresh from information_schema as shown with the help of the below query
SELECT creation_time, start_time, end_time, state, DATE_DIFF(end_time, start_time, SECOND) AS time_taken_in_second, job_id, total_slot_ms, total_bytes_processed FROM region-US.INFORMATION_SCHEMA.JOBS_BY_PROJECT WHERE job_id LIKE "%materialized_view_refresh_%" ORDER BY Creation_time DESC
BigQuery is good choice for analyzing massive amount of data. Materialized View provide fast response time to analytics queries.
To load data of MusicBrainz visit at – https://cloud.google.com/architecture/performing-etl-from-relational-database-into-bigquery?hl=ltShare