Understanding PostgreSQL: How bad query practices are hurting your backend
Greetings of the season to you! In between baking cakes and making dum biriyani with my family, I found some time to compose this piece for you. Do lemme know if you want my plum cake recipe. Attaching a photo of the same at the end of the newsletter. Thanks for reading! 😃
Today we will discuss PostgreSQL. It’s very important to understand your backend framework, be it Django or Spring. But it’s equally important to understand your database and how it carries out various operations. The reason is that your framework, in an effort to make it easier for you to work with and to generalise across all database systems, can make compromises in terms of performance.
Since I have been working with PostgreSQL for a while, I am familiar with many of its design patterns and how they can impact the performance of your backend application. I am going to highlight a few in this issue.
💡 Did you know?
When you update a row in PostgreSQL, it doesn’t update it in place; it creates a new row with the updated values and marks the old one as a ‘dead tuple’. The same goes for a delete operation: the row isn’t removed immediately, it’s just marked as dead. This is done so that ongoing transactions that still depend on the old values can continue to access them.
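If you’re curious how many dead tuples your tables have accumulated, Postgres exposes this through its statistics views. A minimal sketch (the view and columns are standard; the ordering is just for convenience):

-- Dead vs live tuples per table, worst offenders first
SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;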
Not using VACUUM, ANALYZE, and REINDEX effectively
The information given above is important, because it means that every update you do leaves behind dead tuples that take up unwanted space in the database. The VACUUM command helps clear this out. You can run VACUUM against a single table or the entire database; you can read more about it in the official documentation. This operation is quite expensive, so make sure you don’t run it when your web traffic is high.
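As a rough sketch, the command itself is simple (the table name here is hypothetical):

-- Reclaim space from dead tuples in one table
VACUUM customers;

-- Add VERBOSE to get a report of what was cleaned up
VACUUM VERBOSE customers;

-- With no table name, VACUUM processes every table you have access to
VACUUM;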
PostgreSQL also stores statistics about your database: things like how many rows each table has, which indexes exist on each table, and so on. When you execute a query, Postgres comes up with a plan so that it can perform the query optimally, and these stored statistics help it pick a good one. For example, if it knows that a particular column has an index on it, it can do an Index Scan rather than a Sequential Scan (the Postgres term for reading through the entire table). Since these statistics keep changing as you modify the database, it’s important to refresh them frequently. That’s where the ANALYZE command comes into play. You can read more about it in the documentation.
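A quick sketch of both sides of that loop (table and column names are hypothetical):

-- Refresh the planner's statistics for one table
ANALYZE customers;

-- EXPLAIN shows the plan the planner chose based on those statistics
EXPLAIN SELECT name, email_id FROM customers WHERE email_id = 'a@example.com';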
Autovacuum is enabled by default in Postgres and automates the execution of the VACUUM and ANALYZE commands. If you want to check whether it is enabled for your database, you can run the following query:
SELECT name, setting FROM pg_settings WHERE name ILIKE '%autovacuum%';
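You can also check when autovacuum last visited your tables; this sketch uses the same standard statistics view as above:

-- When were tables last vacuumed/analysed, manually or by autovacuum?
SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_user_tables;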
There can also be cases where an index gets corrupted or becomes bloated (containing a lot of empty pages, which can make your queries less than optimal). In such cases you may want to use the REINDEX command to rebuild your indexes. You can read more about it in the documentation.
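A minimal sketch, with hypothetical index and table names:

-- Rebuild a single index
REINDEX INDEX customers_email_idx;

-- Rebuild every index on a table
REINDEX TABLE customers;

-- On PostgreSQL 12+, CONCURRENTLY avoids blocking writes while rebuilding
REINDEX TABLE CONCURRENTLY customers;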
Using SELECT * for every query
While working with frameworks, the biggest mistake I see developers making is writing queries that fetch entire rows from the DB even when they don’t need most of the column values. Let’s take the example of Django; the following is the query you would use to get all columns from a hypothetical table named Customers:
Customer.objects.filter(is_order_placed=True)
Compare this with the query which retrieves only 3 columns from the same table
Customer.objects.filter(is_order_placed=True).values('name', 'email_id', 'num_orders')
See how it’s longer and more cumbersome compared to the previous one? No wonder we get lazy. There’s also that gnawing feeling: what if you need more columns in the future?
But this laziness is expensive. When you query like this, the database has to do more work to fetch all the columns: it has to convert the data from the binary form it’s stored in into something that can be transferred over the network, and there’s an associated bandwidth cost on top of that. This becomes evident when your query returns thousands of rows.
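For a rough idea of what actually hits the database (Django spells out every column name rather than emitting a literal SELECT *, but the effect is the same):

-- What the first ORM call boils down to: every column, serialised and shipped
SELECT * FROM customers WHERE is_order_placed = true;

-- What the second one boils down to: only the columns you actually need
SELECT name, email_id, num_orders FROM customers WHERE is_order_placed = true;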
Thankfully, Postgres has a strict policy that no row is allowed to span multiple pages; otherwise the impact of this would be drastic. But if your columns contain large text fields (for example, a product description), Postgres will store them using a technique called TOAST, which essentially keeps the value in a separate table. So when you query everything, it has to look things up in that separate table too!
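If you’re curious whether a table has an associated TOAST table, you can peek into the catalog (the table name here is hypothetical):

-- Shows the TOAST table backing a given table, if one exists
SELECT reltoastrelid::regclass AS toast_table
FROM pg_class
WHERE relname = 'customers';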
I am not getting into the discussion of column-store vs row-store storage in this issue, but I’ll leave a hint here: if your table follows the column-store model (entire columns stored sequentially), retrieving all columns is very expensive.
Extensive use of offsets
Everyone is a fan of pagination. Implementing pagination means that instead of sending 1,000 entries at once, we can send them in chunks of 10 and improve the user experience with lazy loading.
But when you do pagination, you typically end up with queries that include an OFFSET clause, which tells the database to return rows only after skipping a given number of them. For example:
SELECT * FROM customers WHERE is_order_placed=true OFFSET 1000 LIMIT 10
So what does Postgres do in the background with the above query? Does it only fetch 10 rows? No, in fact it has to fetch the first (1000 + 10) matching rows and then discard the first 1000, returning the remaining 10. It doesn’t seem very optimal anymore, does it?
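You can see this for yourself with EXPLAIN ANALYZE: the node under the Limit still reports producing around 1,010 rows even though only 10 come back.

-- The plan has to produce offset + limit rows before it can discard the offset
EXPLAIN ANALYZE SELECT * FROM customers WHERE is_order_placed = true OFFSET 1000 LIMIT 10;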
In such situations you need to be smart. Your framework will let you code this logic with ease, but as you can see it’s quite expensive. Instead, you will have to transform the query so that it filters on the id field rather than using an offset. For example, if you can figure out up to which ID you returned results on the previous page, you can change the above query to:
SELECT * FROM customers WHERE is_order_placed=true AND id > 1000 LIMIT 10
This can be done by passing a pagination cursor with every result set you send to the frontend. When the frontend needs the next page, it passes this cursor back, and you can use it to work out where the next page starts.
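One detail worth calling out: this keyset style of pagination only behaves predictably if the results have a stable ordering on the cursor column, which the simplified query above leaves out. A minimal sketch:

-- Keyset pagination: the cursor (1000 here) is simply the last id of the previous page
SELECT * FROM customers
WHERE is_order_placed = true AND id > 1000
ORDER BY id
LIMIT 10;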
Wrapping up
Everything that your database system does is also based on algorithms designed by humans. While they may be the best general-purpose logic possible, understanding the system you are working with helps you come up with the approach that is optimal for your application.
Keep in mind that these are for advanced use cases, and while your application is still in its infancy, you don’t have to worry about them so much. Since this topic is huge, I will be writing more on it in the upcoming issues. Stay tuned!
If you need guidance regarding scaling up your application, you can reach out to me at: dennysam14@gmail.com
If you liked this issue of my newsletter, consider buying me a coffee!
As promised, attaching a pic of my plum cake. Happy Weekend! 😊