Postgres and Parquet in the data lake (crunchydata.com)
90 points by pramsey on May 4, 2022 | 25 comments



I would suggest looking at Delta Lake (https://delta.io/) - it's built on top of Parquet, but has some advantages over plain Parquet:

- transactions - you don't get garbage in your table if a write fails

- support for update/delete/merge operations (in some implementations)

- metadata that allows faster discovery of data, for example if you have a lot of partitions on cloud storage

- metadata that also supports features like data skipping, so you can filter out files that don't contain the necessary data

- time travel - you can go back to previous versions of a table (merge and time travel are sketched below)
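
Here's a rough sketch of what the merge and time-travel items look like in Delta Lake's Spark SQL dialect - the table and column names are made up, and the exact syntax varies a bit between engines:

    -- upsert a staging batch into an existing Delta table
    MERGE INTO events AS t
    USING events_staging AS s
      ON t.event_id = s.event_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *;

    -- time travel: read the table as it was at an earlier version
    SELECT count(*) FROM events VERSION AS OF 42;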


> "Generating Parquet"

It is also useful for moving data from Postgres to BigQuery! (batch load)

https://cloud.google.com/bigquery/docs/loading-data-cloud-st...
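
For the batch-load route, BigQuery can ingest Parquet from Cloud Storage with a LOAD DATA statement along these lines (the dataset, table, and bucket names are placeholders):

    -- load Parquet files from GCS into a BigQuery table
    LOAD DATA INTO analytics.events
    FROM FILES (
      format = 'PARQUET',
      uris = ['gs://my-bucket/exports/events/*.parquet']
    );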

Thanks for the "ogr2ogr" trick! :-)

I hope the next blog post will be about GeoParquet and storing complex geometries in parquet format :-)

https://github.com/opengeospatial/geoparquet


GeoParquet just embeds the geometry as WKB, so getting PostGIS geometry is as simple as going "SELECT geom::geometry FROM my_parq_fdw_table". The spec doesn't yet have anything to say about how to get a good spatial sort or carry out a spatial filter, so it's all pretty ad hoc on that end. Thanks to ogr2ogr, generating a GeoParquet file is now pretty straightforward.
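
For anyone curious, that cast roughly looks like this through parquet_fdw - the server, file, and column names are just illustrative, and some setups also want a user mapping:

    -- assumes parquet_fdw and PostGIS are installed
    CREATE EXTENSION parquet_fdw;
    CREATE SERVER parquet_srv FOREIGN DATA WRAPPER parquet_fdw;

    CREATE FOREIGN TABLE my_parq_fdw_table (
      id   bigint,
      geom bytea              -- GeoParquet stores geometry as WKB
    ) SERVER parquet_srv
      OPTIONS (filename '/data/places.parquet');

    -- PostGIS casts the WKB bytea straight to geometry
    SELECT id, geom::geometry FROM my_parq_fdw_table;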


Cloud SQL has BigQuery connections that can be leveraged. But yeah, this seems like a nice solution if you have a Postgres instance outside of Cloud SQL. Another approach would be to write the CDC to a message queue and archive that to Parquet.


After reading a few comments here, here's the experiment I'm going to run this weekend to couple Postgres (AWS RDS) and DuckDB.

There's already an FDW for DuckDB [1]! But it's not available from within AWS RDS because they only allow a few extensions. So I would launch an EC2 instance and run a Postgres instance on it, then install the DuckDB FDW on that Postgres instance. Then, on the AWS RDS Postgres instance, I would use the postgres_fdw extension to connect to the Postgres instance running on EC2 and indirectly gain access to its DuckDB FDW (sketched below). I'm not sure if this chained FDW strategy will work out, but here's why I think it's a great idea:

1. Any data that I am willing to put into DuckDB is guaranteed to have backups and/or an ETL already set up to sync data. Therefore I don't need this EC2 setup to be "reliable". If it goes down, I'll just rerun the ETL.

2. I pay a small DevOps cost for some performance gain, but the analytics complexity remains unchanged because at the end of the day it's just Postgres. I get to continue using dbt against the original AWS RDS.

The only thing I need to see is whether there are any performance gains to be had with this setup. I won't take that for granted until it's tested.

[1] https://github.com/alitrack/duckdb_fdw
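
Roughly, the RDS side of the chain would look something like this - the host, credentials, and schema names are placeholders, and it assumes the EC2 Postgres already exposes the data through duckdb_fdw:

    -- on the RDS instance: point postgres_fdw at the EC2 Postgres
    CREATE EXTENSION IF NOT EXISTS postgres_fdw;

    CREATE SERVER duckdb_proxy FOREIGN DATA WRAPPER postgres_fdw
      OPTIONS (host 'ec2-host.internal', dbname 'analytics', port '5432');

    CREATE USER MAPPING FOR CURRENT_USER SERVER duckdb_proxy
      OPTIONS (user 'analytics_ro', password 'secret');

    -- pull in the foreign tables the EC2 instance exposes via duckdb_fdw
    CREATE SCHEMA IF NOT EXISTS duckdb;
    IMPORT FOREIGN SCHEMA public FROM SERVER duckdb_proxy INTO duckdb;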


How did it go? It sounds super interesting.


Wouldn't it be nice if PostgreSQL could gain a new native storage engine which serializes columnar data as Parquet files? Does this even make sense?

DuckDB already has fine Parquet support and I find it really helpful.


There is the cstore extension, which is pretty much "native" (at some point these terms start to lose meaning... it's running in the same process space, it's transaction aware, etc.)


No need to change the database itself; these kinds of features belong in foreign data wrappers. You can already treat CSV files as tables or access tables in other databases (MySQL, Oracle, etc.).
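
The CSV case is just the built-in file_fdw - something like this, with the columns and file path as placeholders:

    -- treat a CSV file on the server's filesystem as a table
    CREATE EXTENSION IF NOT EXISTS file_fdw;
    CREATE SERVER csv_files FOREIGN DATA WRAPPER file_fdw;

    CREATE FOREIGN TABLE sales_csv (
      sale_date date,
      amount    numeric
    ) SERVER csv_files
      OPTIONS (filename '/data/sales.csv', format 'csv', header 'true');

    SELECT sale_date, sum(amount) FROM sales_csv GROUP BY 1;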


With a massive performance impact (for reading CSVs, anyway).

It's a matter of how much you want to keep using Postgres versus whether you want interactive-level performance.

I'm not saying one is better or worse than the other but they are definitely not the same.


> Does this even make sense?

Yes, but it may not be practical. A non-trivial amount of code makes assumptions about the underlying data layout, so things like the cost-based optimizer and MVCC would basically need to be rewritten.


> Wouldn't it be nice if Postgresql could gain a native new storage engine which serializes columnar data as parquet files? Does this even make sense?

Surely that's exactly what parquet_fdw is?


For [Ex-]Googlers: I thought this sounded a bit like Capacitor, and indeed it is mentioned on this page:

https://cloud.google.com/blog/products/bigquery/inside-capac...

(I don't think there's a direct analogue to Postgres though? Maybe people sometimes take data out of Spanner and put it in a Capacitor file instead?)


And for Non-Googlers: BigQuery does support Parquet in the lake. If you don't need PQ compatibility this is an easy option. https://cloud.google.com/bigquery/external-data-cloud-storag...
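
A sketch of what that looks like as an external table over files in GCS - the dataset, table, and bucket names are placeholders:

    -- query Parquet files in Cloud Storage directly from BigQuery
    CREATE EXTERNAL TABLE analytics.events_ext
    OPTIONS (
      format = 'PARQUET',
      uris = ['gs://my-bucket/exports/events/*.parquet']
    );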

People dump data from the main OLTP database into the "lake" or some OLAP database and call it a data warehouse, whereas Internet-oriented companies like Google tend to focus more on querying the log and its aggregated data.


This is a great post demonstrating the wide variety of things you can do with postgres foreign data wrappers.

But I've worked on analytics systems built on this, and it should be clear that when you use this kind of system you're trading actual interactive-level performance for ease of operation and ease of integration.

At some point, if you want interactive-level performance on this historic data (which is very much possible), you'll have to start putting this data in ClickHouse or Elasticsearch, or some system on top of Postgres with a better sharding scheme.

If you wanted better performance without giving up the Postgres frontend, it would (to me) make more sense for folks to invest in building FDWs for Elasticsearch or ClickHouse rather than over Parquet or CSV. But that's just a theory.


While using a data lake for this case is a good idea, I'm not sure Parquet is a particularly good storage format for data like event logs or audio logs. That data is often row-oriented and doesn't take full advantage of a column-based format.


Another similar option is cstore_fdw [0] -- it's now part of Citus but can still be used standalone as a foreign data wrapper. We use it at my startup to do OLAP on Postgres. It has some advantages over parquet_fdw:

* Supports writes (actually generating Parquet files was also difficult in my testing: I used odbc2parquet [1], but thanks for the ogr2ogr tip!), so you can write directly to the foreign table by running INSERT INTO ... SELECT FROM ... (see the sketch after this list)

* Supports all PG datatypes (including types from extensions, like PostGIS)

* Performance was basically comparable in my limited testing (faster for a single-row SELECT with cstore_fdw in our case since we do partition pruning, same for a full-table scan-and-aggregation).
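
As a sketch of that write path (the extension setup and table names here are illustrative, and the exact options may differ by version):

    -- cstore_fdw as a writable columnar foreign table
    CREATE EXTENSION IF NOT EXISTS cstore_fdw;
    CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;

    CREATE FOREIGN TABLE events_columnar (
      event_time timestamptz,
      user_id    bigint,
      payload    jsonb
    ) SERVER cstore_server
      OPTIONS (compression 'pglz');

    -- write directly into the columnar table
    INSERT INTO events_columnar
    SELECT event_time, user_id, payload FROM events_staging;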

Re: performance overhead, with FDWs we have to re-munge the data into PostgreSQL's internal row-oriented TupleSlot format again. Postgres also doesn't run aggregations that can take advantage of the columnar format (e.g. CPU vectorization). Citus had some experimental code to get that working [2], but that was before FDWs supported aggregation pushdown. Nowadays it might be possible to basically have an FDW that hooks into the GROUP BY execution and runs a faster version of the aggregation that's optimized for columnar storage. We have a blog post series [3] about how we added agg pushdown support to Multicorn -- similar idea.

There's also DuckDB, which obliterates both of these options when it comes to performance. In my (again limited, not very scientific) benchmarking on a customer's 3M row table [4] (278MB in cstore_fdw, 140MB in Parquet), I see a 10-20x (1/2s -> 0.1/0.2s) speedup on some basic aggregation queries when querying a Parquet file with DuckDB as opposed to using cstore_fdw/parquet_fdw.
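
For reference, the DuckDB side of that comparison is just aggregating straight over the Parquet file, roughly like this (the file and column names are placeholders):

    -- DuckDB: aggregate directly over a Parquet file
    SELECT exporter, sum(volume) AS total_volume
    FROM read_parquet('table.parquet')
    GROUP BY exporter
    ORDER BY total_volume DESC;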

I think the dream is being able to use DuckDB from within a FDW as an OLAP query engine for PostgreSQL. duckdb_fdw [5] exists, but it basically took sqlite_fdw and connected it to DuckDB's SQLite interface, which means that a lot of operations get lost in translation and aren't pushed down to DuckDB, so it's not much better than plain parquet_fdw. I had a complex query in the PG dialect generated with dbt that used joins, CTEs and window functions. I don't remember the exact timings, but it was even slower on duckdb_fdw than with cstore_fdw, whereas I could take the same query and run it on DuckDB verbatim, only replacing the foreign table name with the Parquet filename.

This comment is already getting too long, but FDWs can indeed participate in partitions! There's this blog post [6] that I keep meaning to implement, where the setup is: a "coordinator" PG instance has a partitioned table, where each partition is a postgres_fdw foreign table that proxies to a "data" PG instance. The "coordinator" node doesn't store any data and only gathers execution results from the "data" nodes. In the article, the "data" nodes store plain old PG tables, but I don't think there's anything preventing them from being parquet_fdw/cstore_fdw tables instead.
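
That setup would look roughly like this on the coordinator - the server, host, and table names are placeholders:

    -- a partition that lives on a remote "data" node via postgres_fdw
    CREATE SERVER data_node_1 FOREIGN DATA WRAPPER postgres_fdw
      OPTIONS (host 'data-node-1.internal', dbname 'analytics');
    CREATE USER MAPPING FOR CURRENT_USER SERVER data_node_1
      OPTIONS (user 'reporting', password 'secret');

    CREATE TABLE events (
      event_time timestamptz,
      payload    jsonb
    ) PARTITION BY RANGE (event_time);

    CREATE FOREIGN TABLE events_2022
      PARTITION OF events
      FOR VALUES FROM ('2022-01-01') TO ('2023-01-01')
      SERVER data_node_1
      OPTIONS (schema_name 'public', table_name 'events_2022');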

[0] https://github.com/citusdata/cstore_fdw

[1] https://github.com/pacman82/odbc2parquet

[2] https://github.com/citusdata/postgres_vectorization_test

[3] https://www.splitgraph.com/blog/postgresql-fdw-aggregation-p...

[4] https://www.splitgraph.com/trase/supply-chains

[5] https://github.com/alitrack/duckdb_fdw

[6] https://swarm64.com/post/scaling-elastic-postgres-cluster/


Agg pushdown is a thing in FDWs; I hadn't really thought about the extent to which pulling rows into PgSQL and then summarizing them is a waste of time. I imagine actually implementing agg pushdown in parquet_fdw might be too much to ask of it (basically writing part of an execution engine in the FDW), but boy, it is an interesting thought, since so much data lake querying is aggregation.


If you were ready to throw away a bit of archival quality, I wonder about writing Apache Arrow straight to disk and then just mmapping your columns back in for a query.


It's certainly possible. But Parquet and Arrow are being built with each other in mind. I doubt the tradeoffs on storage, durability, and future-proofness are worth skipping the marginal overhead to load a Parquet file into memory.


Isn't that what Feather, the Arrow file format, is? https://arrow.apache.org/docs/python/feather.html


Please correct me if I'm wrong, but Parquet is a columnar store whereas PostgreSQL is row-based?


Parquet is a column-based file format. PostgreSQL is a SQL query engine, and the default storage is row-oriented, but the engine can do all kinds of fun things, hence the parquet_fdw extension and the cstore extension for example.



Thank you, this is really cool!



