Hi, Pablo here
My tips and tricks when using Postgres as a DWH
In November 2023, I joined Superhog (now called Truvi) to start its Data team. As part of that, I also drafted and deployed the first version of its data platform.
The context led me to choose Postgres for our DWH. In a time of Snowflakes, BigQueries and Redshifts, this might surprise some. But I can confidently say Postgres has done a great job for us, and I'd even dare to say it has provided a better experience than other, more trendy alternatives could have. I'll jot down my rationale for picking Postgres one of these days.
Back to the topic: Postgres is not intended to act as a DWH, so using it as such might feel a bit hacky at times. There are multiple ways to make your life better with it, as well as related tools and practices that you might enjoy, which I'll try to list here.
Use unlogged tables
The Write-Ahead Log is active by default for the tables you create, and for good reasons. But in the context of an ELT DWH, it is probably a good idea to deactivate it by making your tables unlogged. Unlogged tables give you much faster writes (roughly twice as fast), which will make data loading and transformation jobs inside your DWH much quicker.
You pay a price for this with a few trade-offs, the most notable being that if your Postgres server crashes, the contents of the unlogged tables will be lost. But, again, if you have an ELT DWH, you can survive that by running a backfill. At Truvi, we decided to keep the landing area of our DWH logged and everything else unlogged. This means that if we experienced a crash (which still hasn't happened, btw), we would recover by running a full-refresh dbt run.
If you are using dbt, you can easily apply this by adding this bit to your dbt_project.yml:

models:
  +unlogged: true
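If you're not on dbt, or just want to flip a specific table by hand, plain SQL works too. A minimal sketch (the table and columns are made up for illustration):

-- Create a table as unlogged from the start
CREATE UNLOGGED TABLE example_events (
    event_id   bigint,
    payload    jsonb,
    loaded_at  timestamptz DEFAULT now()
);

-- Or toggle an existing table either way
-- (this rewrites the table, so it can take a while on big ones)
ALTER TABLE example_events SET LOGGED;
ALTER TABLE example_events SET UNLOGGED;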
Tuning your server's parameters
Postgres has many parameters you can fiddle with, with plenty of chances to either improve or destroy your server's performance.
Postgres ships with default values for them, which are almost surely not the optimal ones for your needs, especially if you are going to use it as a DWH. Simple changes like adjusting work_mem will do wonders to speed up some of your heavier queries.
There are many parameters to get familiar with, and proper adjustment must take your specific context and needs into account. If you have no clue at all, this little web app can give you some suggestions you can start from.
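As an illustration of how you might apply one of these changes (the value here is just an example, not a recommendation for your server):

-- Per-session, e.g. for the connection your transformation jobs use
SET work_mem = '256MB';

-- Or persisted server-wide; work_mem only needs a config reload to take effect
ALTER SYSTEM SET work_mem = '256MB';
SELECT pg_reload_conf();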
Running VACUUM ANALYZE right after building your tables
Out of the box, Postgres will run VACUUM and ANALYZE jobs automatically. The thresholds that determine when each of those gets triggered can be adjusted with a few server parameters. If you follow an ELT pattern, re-building your non-staging tables will most surely cause Postgres to run them.
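Those thresholds can be set globally in postgresql.conf (e.g. autovacuum_analyze_scale_factor) or overridden per table. A sketch of the per-table form, with made-up values and a hypothetical table name:

-- Make autovacuum analyze this table more eagerly than the global defaults
ALTER TABLE example_events SET (
    autovacuum_analyze_scale_factor = 0.02,
    autovacuum_analyze_threshold    = 500
);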
But there's a detail that is easy to overlook: those automatic triggers kick in quite fast, but not right after you build each table. This poses a performance issue. If the intermediate sections of your DWH have tables that build upon tables, rebuilding a table and then rebuilding a dependent one before the first has been ANALYZE-d might hurt you.
Let me describe this with an example, because this one is a bit of a tongue twister: let's assume we have tables int_orders and int_order_kpis. int_orders holds all of our orders, and int_order_kpis derives some KPIs from them. Naturally, first you will materialize int_orders from some upstream staging tables, and once that is complete, you will use its contents to build int_order_kpis.
Having int_orders ANALYZE-d before you start building int_order_kpis is highly beneficial for the performance of that build. Why? Because having perfectly updated statistics and metadata on int_orders helps Postgres' query optimizer better plan the query needed to materialize int_order_kpis. This can improve performance by orders of magnitude in some queries, for example by allowing Postgres to pick the right kind of join strategy for the specific data you have.
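Outside dbt, the manual equivalent would look something like this (the KPI query and its columns are made up purely for illustration):

-- Refresh statistics on the upstream table first...
VACUUM ANALYZE int_orders;

-- ...so the planner has fresh stats when materializing the downstream table
CREATE TABLE int_order_kpis AS
SELECT order_id,
       count(*)    AS n_line_items,   -- illustrative columns,
       sum(amount) AS total_amount    -- not the real model
FROM int_orders
GROUP BY order_id;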
Now, will Postgres auto VACUUM ANALYZE the freshly built int_orders before you start building int_order_kpis? Hard to tell. It depends on how you build your DWH and how you've tuned your server's parameters. And the most dangerous bit is that you're not in full control: sometimes it happens, and other times it doesn't. Flaky and annoying. Some day I'll write a post on how this behaviour drove me mad for two months because it made a model sometimes build in a few seconds, and other times in >20 min.
My advice is to make sure you always VACUUM ANALYZE right after building your tables. If you're using dbt, you can easily achieve this by adding this to your project's dbt_project.yml:

models:
  +post-hook:
    sql: "VACUUM ANALYZE {{ this }}"
    transaction: false
  # ^ This makes dbt run a VACUUM ANALYZE on each model after building it.
  # It's pointless for views, but it doesn't matter because Postgres fails
  # silently without raising an unhandled exception.
Monitor queries with pg_stat_statements
pg_stat_statements is an extension that nowadays ships with Postgres by default. If activated, it records info on the queries executed in the server, which you can inspect afterwards. This includes many details, and the ones you'll probably care about the most are how frequently each query gets called and its min, max and mean execution time. Looking at those allows you to find queries that take long every time they run, and queries that get run a lot.
Another important piece of info that gets recorded is who ran the query. This is helpful because, if you use database users in a smart way, it can help you isolate expensive queries for different use cases or areas. For example, if you use different users to build the DWH and to give your BI tool read access (you do that... right?), you can easily tell apart dashboard-related queries from internal DWH transformation ones. Another example could be internal reporting vs embedded analytics in your product: you might have stricter performance SLAs for product-embedded, customer-facing queries than for internal dashboards. Using different users and pg_stat_statements makes it possible to dissect performance issues in those separate areas independently.
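A rough sketch of what this looks like in practice, assuming the extension is already in shared_preload_libraries and using the column names of Postgres 13+:

-- One-time activation in the database you want to monitor
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Slowest statements on average, with who ran them
SELECT r.rolname,
       s.calls,
       round(s.mean_exec_time::numeric, 1) AS mean_ms,
       round(s.max_exec_time::numeric, 1)  AS max_ms,
       left(s.query, 80)                   AS query_start
FROM pg_stat_statements AS s
JOIN pg_roles AS r ON r.oid = s.userid
ORDER BY s.mean_exec_time DESC
LIMIT 20;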
Dalibo's wonderful execution plan visualizer
Sometimes you'll have some nasty query you just need to sit down with and optimize. In my experience, in a DWH this ends up happening with queries that involve many large tables in sequential joining and aggregation steps (as in, you join a few tables, group to some granularity, join some more, group again, etc).
You can get the query's real execution details with EXPLAIN ANALYZE, but the output's readability is on par with morse-encoded regex patterns. I always got headaches dealing with it until I came across Dalibo's execution plan visualizer. You can paste the output of EXPLAIN ANALYZE there and see the query execution presented as a diagram. No amount of words will portray accurately how awesome the UX is, so I encourage you to try the tool with some nasty query and see for yourself.
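For reference, this is roughly how you'd capture a plan to feed into the visualizer; the query itself is a made-up placeholder, and BUFFERS plus the JSON format are optional extras that simply give the tool more detail to render (plain-text plans work too):

EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)
SELECT o.order_id,
       count(*) AS n_kpis            -- illustrative query, not a real model
FROM int_orders o
JOIN int_order_kpis k USING (order_id)
GROUP BY o.order_id;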
Local dev env + Foreign Data Wrapper
One of the awesome things about using Postgres is how trivial it is to spin up an instance. This makes goofing around much simpler than in setups where a new instance means paperwork, $$$, etc.
Data team members at Truvi have a dockerized Postgres running on their laptops that they can use when developing on our DWH dbt project. In the early days, you could grab a production dump with some subset of tables from our staging layer and, if you were patient, run significant chunks of our dbt DAG on your laptop.
A few hundred models later, this became increasingly difficult and finally impossible.
Luckily, we came across Postgres' Foreign Data Wrapper. There's quite a bit to it, but to keep it short here, just be aware that FDW allows you to make a Postgres server give access to a table that actually lives in a different Postgres server while pretending it is local. So you query table X in Postgres server A even though table X is actually stored in Postgres server B, and your query works just the same as if it were a genuine local table.
Setting these up is fairly trivial, and it has allowed our dbt project contributors to execute hybrid dbt runs where some data and tables are local to their laptop while some upstream data is read from the production server. The approach has been great so far, enabling them to actually test models in a convenient way before committing them to master.
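For the curious, the setup on the local dev instance looks roughly like this; the server name, connection details and schema names below are placeholders, not our actual config:

-- postgres_fdw ships with Postgres; activate it on the local dev database
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- Point a "foreign server" at the production DWH
CREATE SERVER prod_dwh
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'prod-dwh.example.internal', port '5432', dbname 'dwh');

-- Map your local user to a read-only user on the remote side
CREATE USER MAPPING FOR CURRENT_USER
    SERVER prod_dwh
    OPTIONS (user 'readonly_dev', password 'redacted');

-- Expose a whole remote schema as local-looking foreign tables
CREATE SCHEMA IF NOT EXISTS staging;
IMPORT FOREIGN SCHEMA staging
    FROM SERVER prod_dwh
    INTO staging;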