Many people know that BigQuery is fast, but far fewer understand why. Let's walk through a simple exercise to see the reason firsthand.
Compare these two queries.
Query 1:
-- Selecting all columns from the natality table
SELECT *
FROM `bigquery-public-data.samples.natality`
WHERE year = 2000
LIMIT 1000;
Query 2:
-- Select only weight_pounds from the same natality table
SELECT weight_pounds
FROM `bigquery-public-data.samples.natality`
WHERE year = 2000
LIMIT 1000;
In the second query, instead of SELECT *, we are selecting just one column, weight_pounds. Same table, same WHERE clause, same LIMIT. The only difference is how many columns we're asking for.
Both queries return at most 1,000 rows, thanks to the LIMIT clause. So will they cost the same to run? If not, which one do you think will be cheaper, and by how much?
When you run these queries in BigQuery, you'll see that the second query, where we only bring in one column, is dramatically cheaper.
This might seem obvious, but let's really think about why that is.
The reason is BigQuery's columnar storage architecture.
Most traditional databases store data row by row. Each row sits together on disk.
For example, in the case of this table, all of the columns - weight_pounds, plurality, mother_age, gestation_weeks - would all be packed sequentially.
If you want a single column, you still read every column, because they are physically next to each other on the disk.
BigQuery does the opposite of this approach. It stores data column by column. This is known as columnar architecture.
All of the values for weight_pounds are stored together, contiguously, on the disk, separate from the values for plurality, mother_age, gestation_weeks, and so on. Conceptually, you can think of each column as being stored separately from the others.
So when we write SELECT weight_pounds, BigQuery opens the storage area reserved for weight_pounds and only that storage area. It never touches the other columns. The data it reads is roughly proportional to the number and size of the columns your query references, not to the total number of columns in the table.
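To see the difference in miniature, here is a toy model in Python. The table contents, column names, and eight-bytes-per-value assumption are all made up for illustration; this is a sketch of the idea, not of BigQuery's actual storage engine:

```python
# Toy model of row-oriented vs column-oriented storage (illustrative only).

# Row store: each row keeps all of its column values together.
row_store = [
    {"weight_pounds": 7.5, "plurality": 1, "mother_age": 28, "gestation_weeks": 39},
    {"weight_pounds": 6.8, "plurality": 2, "mother_age": 33, "gestation_weeks": 37},
]

# Column store: each column's values are stored contiguously.
column_store = {
    "weight_pounds": [7.5, 6.8],
    "plurality": [1, 2],
    "mother_age": [28, 33],
    "gestation_weeks": [39, 37],
}

def bytes_read_row_store(columns_wanted, bytes_per_value=8):
    # A row store reads whole rows, so every column is touched
    # regardless of which columns the query asked for.
    n_rows = len(row_store)
    n_cols = len(row_store[0])
    return n_rows * n_cols * bytes_per_value

def bytes_read_column_store(columns_wanted, bytes_per_value=8):
    # A column store reads only the columns the query references.
    n_rows = len(column_store[columns_wanted[0]])
    return n_rows * len(columns_wanted) * bytes_per_value

print(bytes_read_row_store(["weight_pounds"]))     # 64: all 4 columns read
print(bytes_read_column_store(["weight_pounds"]))  # 16: only 1 column read
```

Asking the row store for one column still costs the full table width; asking the column store for one column costs only that column.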
BigQuery reads any column referenced anywhere in your query.
SELECT, WHERE, GROUP BY, ORDER BY, and JOIN conditions all require BigQuery to read the columns mentioned in those clauses.
A query that selects one column but filters in its WHERE clause on five other columns still reads six columns.
NOTE: LIMIT does not change the number of bytes scanned. It only changes the number of rows returned to you at the end.
This is why BigQuery charges by bytes scanned and not by rows returned.
The real cost driver is how many columns your query touches and how large those columns are.
The natality table in our example has about 50 columns. The first query read all of them, while the second query read one. That's where the difference comes from.
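To make the arithmetic concrete, here is a back-of-envelope sketch. The row count and per-value size are made-up round numbers for illustration, not the natality table's real statistics:

```python
# Hypothetical table statistics (illustrative, not real natality numbers).
N_ROWS = 100_000_000      # rows in the table
BYTES_PER_VALUE = 10      # average bytes per column value

def bytes_scanned(n_columns_referenced, limit=None):
    # LIMIT is deliberately ignored: it trims the result set *after*
    # the scan, so it never reduces the bytes read (and billed).
    return N_ROWS * n_columns_referenced * BYTES_PER_VALUE

# Query 1: SELECT * ... LIMIT 1000 -> references all ~50 columns.
select_star = bytes_scanned(50, limit=1000)

# Query 2: SELECT weight_pounds ... WHERE year = 2000 LIMIT 1000
# -> references 2 columns (weight_pounds and year).
one_column = bytes_scanned(2, limit=1000)

print(select_star / one_column)                            # 25.0x more data
print(bytes_scanned(50) == bytes_scanned(50, limit=1000))  # True: LIMIT is free
```

Note that under this model the second query touches two columns, not one, because the year column in the WHERE clause counts too.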
Another thing worth mentioning here is compression. Because each column contains values of the same type (floats, strings, integers, depending on your schema), and often the same repeated values (like the same year over and over), that data compresses much better than it would in a row store, where values of different types are interleaved within each row. Compression algorithms handle long, homogeneous runs very efficiently, so this is another advantage of the columnar approach.
It means less data read from disk and less data moved across the network inside GCP's infrastructure.
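A toy run-length encoder makes the point concrete. (BigQuery's real storage format uses far more sophisticated encodings; this sketch, with made-up values, only shows why contiguous, repetitive column data compresses so well.)

```python
def rle(values):
    """Run-length encode a sequence into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

# Columnar layout: the year column is stored contiguously, so a
# thousand identical years collapse into a single run.
year_column = [2000] * 1000
print(rle(year_column))        # [(2000, 1000)]

# Row layout: year values are interleaved with other fields, so the
# runs are broken and the same encoder can collapse nothing.
interleaved = []
for i in range(1000):
    interleaved.extend([2000, f"name{i}"])
print(len(rle(interleaved)))   # 2000 -- every value is its own run
```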
In a row store, every query reads the entire row anyway, so a narrow SELECT saves you nothing; the width of the table sets the cost of every read. In a column store, every column you add is a column that careless queries might scan, while targeted queries skip it for free. Keep this trade-off in mind as you design your schema.
Wide tables can cost more.
Denormalization can also cost more.
And not to beat a dead horse, but this is why SELECT * can be so much more costly in a columnar database.
It might seem strange to harp on this one fact about BigQuery, but this single insight - storage layout determines query cost - is the foundation for everything else in BigQuery. When we later cover partitioning, clustering, views and materialized views, and schema design, they will all turn out to be strategies for reading fewer bytes, and all of them build on BigQuery's columnar architecture.
It's worth asking: if the storage format changes the economics of every query, what else does it change? Which assumptions from row-store databases no longer hold when we use BigQuery?
I'll cover that in future lessons.