Hi, Pablo here
My tips and tricks when using Postgres as a DWH
In November 2023, I joined Superhog (now called Truvi) to start its Data team. As part of that, I also drafted and deployed the first version of its data platform.
The context led me to choose Postgres for our DWH. In a time of Snowflakes, BigQueries and Redshifts, this might surprise some. But I can confidently say Postgres has done a great job for us, and I'd even dare to say it has provided a better experience than other, more trendy alternatives could have. I'll jot down my rationale for picking Postgres one of these days.
Back to the topic: Postgres is not intended to act as a DWH, so using it as such might feel a bit hacky at times. There are multiple ways to make your life better with it, as well as related tools and practices that you might enjoy, which I'll try to list here.
Use unlogged tables
The Write-Ahead Log is active by default for the tables you create, and for good reasons. But in the context of an ELT DWH, it is probably a good idea to deactivate it by making your tables unlogged. Unlogged tables give you much faster writes (roughly twice as fast), which will make data loading and transformation jobs inside your DWH much quicker.
You pay a price for this with a few trade-offs, the most notable being that if your Postgres server crashes, the contents of the unlogged tables will be lost. But, again, if you have an ELT DWH, you can survive that by running a backfill. At Truvi, we decided to keep the landing area of our DWH logged and everything else unlogged. This means that if we experienced a crash (which still hasn't happened, btw), we would recover by running a full-refresh dbt run.
If you are using dbt, you can easily apply this by adding this bit to your dbt_project.yml:

models:
  +unlogged: true
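If you're not on dbt, or just want to flip a specific table by hand, plain SQL works too. A minimal sketch (the table and columns are made up for illustration):

-- Create a table as unlogged from the start
CREATE UNLOGGED TABLE example_events (
    event_id   bigint,
    payload    jsonb,
    loaded_at  timestamptz DEFAULT now()
);

-- Or toggle an existing table either way
-- (this rewrites the table, so it can take a while on big ones)
ALTER TABLE example_events SET LOGGED;
ALTER TABLE example_events SET UNLOGGED;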
Tuning your server's parameters
Postgres has many parameters you can fiddle with, with plenty of chances to either improve or destroy your server's performance.
Postgres ships with default values for them, which are almost surely not the optimal ones for your needs, especially if you are going to use it as a DWH. Simple changes like adjusting work_mem will do wonders to speed up some of your heavier queries.
There are many parameters to get familiar with, and proper adjustment must take your specific context and needs into account. If you have no clue at all, this little web app can give you some suggestions you can start from.
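As an illustration of how you might apply one of these changes (the value here is just an example, not a recommendation for your server):

-- Per-session, e.g. for the connection your transformation jobs use
SET work_mem = '256MB';

-- Or persisted server-wide; work_mem only needs a config reload to take effect
ALTER SYSTEM SET work_mem = '256MB';
SELECT pg_reload_conf();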
Running VACUUM ANALYZE right after building your tables
Out of the box, Postgres will run VACUUM and ANALYZE jobs automatically. The thresholds that determine when each of those gets triggered can be adjusted with a few server parameters. If you follow an ELT pattern, re-building your non-staging tables will most surely cause Postgres to run them.
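Those thresholds can be set globally in postgresql.conf (e.g. autovacuum_analyze_scale_factor) or overridden per table. A sketch of the per-table form, with made-up values and a hypothetical table name:

-- Make autovacuum analyze this table more eagerly than the global defaults
ALTER TABLE example_events SET (
    autovacuum_analyze_scale_factor = 0.02,
    autovacuum_analyze_threshold    = 500
);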
But there's a detail that is easy to overlook: those automatic triggers kick in quite fast, but not right after you build each table. This poses a performance issue. If the intermediate sections of your DWH have tables that build upon tables, rebuilding a table and then rebuilding a dependent one before the first has been ANALYZE-d might hurt you.
Let me describe this with an example, because this one is a bit of a tongue twister: let's assume we have tables int_orders and int_order_kpis. int_orders holds all of our orders, and int_order_kpis derives some KPIs from them. Naturally, first you will materialize int_orders from some upstream staging tables, and once that is complete, you will use its contents to build int_order_kpis.
Having int_orders ANALYZE-d before you start building int_order_kpis is highly beneficial for the performance of that build. Why? Because having perfectly updated statistics and metadata on int_orders helps Postgres' query optimizer better plan the query needed to materialize int_order_kpis. This can improve performance by orders of magnitude in some queries, for example by allowing Postgres to pick the right kind of join strategy for the specific data you have.
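Outside dbt, the manual equivalent would look something like this (the KPI query and its columns are made up purely for illustration):

-- Refresh statistics on the upstream table first...
VACUUM ANALYZE int_orders;

-- ...so the planner has fresh stats when materializing the downstream table
CREATE TABLE int_order_kpis AS
SELECT order_id,
       count(*)    AS n_line_items,   -- illustrative columns,
       sum(amount) AS total_amount    -- not the real model
FROM int_orders
GROUP BY order_id;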
Now, will Postgres auto VACUUM ANALYZE the freshly built int_orders before you start building int_order_kpis? Hard to tell. It depends on how you build your DWH and how you've tuned your server's parameters. And the most dangerous bit is that you're not in full control: sometimes it happens, and other times it doesn't. Flaky and annoying. Some day I'll write a post on how this behaviour drove me mad for two months because it made a model sometimes build in a few seconds, and other times in >20 min.
My advice is to make sure you always VACUUM ANALYZE right after building your tables. If you're using dbt, you can easily achieve this by adding this to your project's dbt_project.yml:

models:
  +post-hook:
    sql: "VACUUM ANALYZE {{ this }}"
    transaction: false
  # ^ This makes dbt run a VACUUM ANALYZE on each model after building it.
  # It's pointless for views, but it doesn't matter because Postgres fails
  # silently without raising an unhandled exception.
Monitor queries with pg_stat_statements
pg_stat_statements is an extension that nowadays ships with Postgres by default. If activated, it records info on the queries executed in the server, which you can inspect afterwards. This includes many details, and the ones you'll probably care about the most are how frequently each query gets called and its min, max and mean execution time. Looking at those allows you to find queries that take long every time they run, and queries that get run a lot.
Another important piece of info that gets recorded is who ran the query. This is helpful because, if you use database users in a smart way, it can help you isolate expensive queries for different use cases or areas. For example, if you use different users to build the DWH and to give your BI tool read access (you do that... right?), you can easily tell apart dashboard-related queries from internal DWH transformation ones. Another example could be internal reporting vs embedded analytics in your product: you might have stricter performance SLAs for product-embedded, customer-facing queries than for internal dashboards. Using different users and pg_stat_statements makes it possible to dissect performance issues in those separate areas independently.
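A rough sketch of what this looks like in practice, assuming the extension is already in shared_preload_libraries and using the column names of Postgres 13+:

-- One-time activation in the database you want to monitor
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Slowest statements on average, with who ran them
SELECT r.rolname,
       s.calls,
       round(s.mean_exec_time::numeric, 1) AS mean_ms,
       round(s.max_exec_time::numeric, 1)  AS max_ms,
       left(s.query, 80)                   AS query_start
FROM pg_stat_statements AS s
JOIN pg_roles AS r ON r.oid = s.userid
ORDER BY s.mean_exec_time DESC
LIMIT 20;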
Dalibo's wonderful execution plan visualizer
Sometimes you'll have some nasty query you just need to sit down with and optimize. In my experience, in a DWH this ends up happening with queries that involve many large tables in sequential joining and aggregation steps (as in, you join a few tables, group to some granularity, join some more, group again, etc).
You can get the query's real execution details with EXPLAIN ANALYZE, but the output's readability is on par with morse-encoded regex patterns. I always got headaches dealing with it until I came across Dalibo's execution plan visualizer. You can paste the output of EXPLAIN ANALYZE there and see the query execution presented as a diagram. No amount of words will portray accurately how awesome the UX is, so I encourage you to try the tool with some nasty query and see for yourself.
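For reference, this is roughly how you'd capture a plan to feed into the visualizer; the query itself is a made-up placeholder, and BUFFERS plus the JSON format are optional extras that simply give the tool more detail to render (plain-text plans work too):

EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)
SELECT o.order_id,
       count(*) AS n_kpis            -- illustrative query, not a real model
FROM int_orders o
JOIN int_order_kpis k USING (order_id)
GROUP BY o.order_id;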
Local dev env + Foreign Data Wrapper
One of the awesome things about using Postgres is how trivial it is to spin up an instance. This makes goofing around much simpler than in setups where a new instance means paperwork, $$$, etc.
Data team members at Truvi have a dockerized Postgres running on their laptops that they can use when developing on our DWH dbt project. In the early days, you could grab a production dump with some subset of tables from our staging layer and, if you were patient, run significant chunks of our dbt DAG on your laptop.
A few hundred models later, this became increasingly difficult and finally impossible.
Luckily, we came across Postgres' Foreign Data Wrapper. There's quite a bit to it, but to keep it short here, just be aware that FDW allows you to make a Postgres server give access to a table that actually lives in a different Postgres server while pretending it is local. So you query table X in Postgres server A even though table X is actually stored in Postgres server B, and your query works just the same as if it were a genuine local table.
Setting these up is fairly trivial, and it has allowed our dbt project contributors to execute hybrid dbt runs where some data and tables are local to their laptop while some upstream data is read from the production server. The approach has been great so far, enabling them to actually test models in a convenient way before committing them to master.
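For the curious, the setup on the local dev instance looks roughly like this; the server name, connection details and schema names below are placeholders, not our actual config:

-- postgres_fdw ships with Postgres; activate it on the local dev database
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- Point a "foreign server" at the production DWH
CREATE SERVER prod_dwh
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'prod-dwh.example.internal', port '5432', dbname 'dwh');

-- Map your local user to a read-only user on the remote side
CREATE USER MAPPING FOR CURRENT_USER
    SERVER prod_dwh
    OPTIONS (user 'readonly_dev', password 'redacted');

-- Expose a whole remote schema as local-looking foreign tables
CREATE SCHEMA IF NOT EXISTS staging;
IMPORT FOREIGN SCHEMA staging
    FROM SERVER prod_dwh
    INTO staging;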