Part 1: Continuous ETL from PostgreSQL to Elasticsearch with Apache Flink

Oct 6, 2022

Introduction


One of our customers built a platform to help recruiters source and manage candidates. Using modern cloud-native tools, a small team built a great product that scaled to a large number of customers. The core product is backed by an AWS Aurora PostgreSQL database and hosts a large dataset:

  • 20+TB of data

  • 450+ tables

  • 18+B rows


When the team set out to launch a new feature that required faceted search, they needed to find a way to set up a data pipeline between their PostgreSQL instance and Elasticsearch. Since the search functionality is part of the core product experience, it was important for the team to deliver an experience that was both fast in the serving path (customers are loath to wait on a spinner) and fresh (showing search results built from yesterday's data is a bad product experience).



Launching on a batch pipeline


The initial implementation relied on a daily batch pipeline. The team used a hosted solution to move data from PostgreSQL to Snowflake, where they ran a daily SQL transformation job that reshaped the OLTP-friendly data model into a format more suitable for Elasticsearch. The output of the SQL query was then loaded into Elasticsearch via a Python script.

Beyond scale, one of the more interesting properties of this workload is the complexity of the SQL expressing the transformation. To turn a highly normalized OLTP data model into a set of JSON documents suitable for building an Elasticsearch index, the team wrote hundreds of lines of SQL containing (a simplified sketch of the query's general shape follows the list):

  • 79 joins (37 inner, 41 left outer, and 1 full outer)

  • 49 distinct tables referenced

  • SQL window functions

  • Subqueries and CTEs

  • A variety of aggregate functions manipulating complex data types (e.g. arrays and JSON).
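
To give a sense of the shape of such a query (the schema below is hypothetical, not the customer's actual data model), a heavily simplified Snowflake-style sketch might combine CTEs, joins, a window function, and array/JSON construction to emit one search document per candidate:

  -- Hypothetical, heavily simplified sketch; the real query spans hundreds of lines.
  WITH candidate_skills AS (
      -- Collapse the many-to-many skill relationship into one array per candidate.
      SELECT cs.candidate_id,
             ARRAY_AGG(s.name) AS skills
      FROM candidate_skill cs
      JOIN skill s ON s.id = cs.skill_id
      GROUP BY cs.candidate_id
  ),
  latest_stage AS (
      -- Window function to pick the most recent pipeline stage per candidate.
      SELECT candidate_id,
             stage,
             ROW_NUMBER() OVER (PARTITION BY candidate_id ORDER BY updated_at DESC) AS rn
      FROM application_event
  )
  SELECT c.id AS candidate_id,
         -- Assemble the denormalized JSON document destined for the Elasticsearch index.
         OBJECT_CONSTRUCT(
             'name',     c.full_name,
             'location', c.location,
             'skills',   COALESCE(cs.skills, ARRAY_CONSTRUCT()),
             'stage',    ls.stage
         ) AS document
  FROM candidate c
  LEFT OUTER JOIN candidate_skills cs ON cs.candidate_id = c.id
  LEFT OUTER JOIN latest_stage ls ON ls.candidate_id = c.id AND ls.rn = 1;

The production query layers dozens of joins and aggregates of this kind on top of one another, which is what makes the workload so difficult to re-express incrementally.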


While the batch pipeline allowed the team to launch the feature, it left several of their goals unfulfilled:

  • Efficiency. The batch pipeline needs to reevaluate the entire dataset daily, which in most cases takes approximately 14h end-to-end to populate the index. Not only is this cost inefficient, it is also operationally fragile because the job cannot be resumed partway through: an error means rerunning the whole pipeline and missing the daily SLA.

  • Data freshness. User experience suffers because the index is at a minimum 14h out of date.


At this point, many organizations look at moving to an incremental approach to improve efficiency and freshness. However, the complexity of the data model and transformation makes it infeasible to move to an incremental model without significant refactoring of the app.



Continuous loads with Rubicon


The team investigated a variety of different streaming-based approaches to get to a continuous load model that would improve efficiency and provide near real-time data freshness. They ultimately chose to work with Rubicon for the following reasons:

  • Expansive SQL compatibility. Rubicon's native SQL support handles their complex transformation workload out of the box. Competing platforms either place significant limitations on SQL support or require a non-SQL interface altogether.

  • Managed service. Rubicon is a serverless platform for continuous transformation, so no server or resource management is necessary.

  • CDC-based replication. Rubicon uses a low-overhead, CDC-based approach to act as a live read replica of upstream sources, which protects the production database from taking on additional query load that could slow it down.

  • Scalability. Rubicon is a cloud native service built for scale and performance.

  • Continuous processing. Rubicon is designed from the ground up for continuous processing. Within seconds of updates to the upstream PostgreSQL instance, Rubicon automatically materializes the corresponding changes to the output of the transformation SQL query, ready to be served to the team's customers.



Product Experience

Visit our product page to learn more about Rubicon and our upcoming features.



Architecture


At Rubicon, we use a good deal of open source software. In particular, we rely on Apache Flink as the base stream processing engine for continuous SQL transformation. After deploying our system against a handful of real-world production workloads, we've learned that while Flink has a really strong foundation, it cannot handle many of the workloads we encountered out of the box, including the one described in this post.
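
As a rough illustration of what continuous SQL transformation on Flink looks like (Rubicon's own pipeline is not shown here; the connector names, options, and schema below are illustrative, based on the open-source Flink CDC and Elasticsearch connectors), a minimal Flink SQL job might declare a CDC source on PostgreSQL, an Elasticsearch sink, and a continuously running INSERT that keeps the index up to date:

  -- Illustrative Flink SQL sketch; table schemas and connection details are hypothetical.

  -- Source: a change-data-capture view of a PostgreSQL table.
  CREATE TABLE candidate (
      id        INT,
      full_name STRING,
      location  STRING,
      PRIMARY KEY (id) NOT ENFORCED
  ) WITH (
      'connector'     = 'postgres-cdc',
      'hostname'      = 'aurora-host',
      'port'          = '5432',
      'username'      = 'flink',
      'password'      = '...',
      'database-name' = 'recruiting',
      'schema-name'   = 'public',
      'table-name'    = 'candidate',
      'slot.name'     = 'flink_candidate'
  );

  -- Sink: documents written (and updated in place) in an Elasticsearch index.
  CREATE TABLE candidate_index (
      candidate_id INT,
      full_name    STRING,
      location     STRING,
      PRIMARY KEY (candidate_id) NOT ENFORCED
  ) WITH (
      'connector' = 'elasticsearch-7',
      'hosts'     = 'http://elasticsearch:9200',
      'index'     = 'candidates'
  );

  -- Continuous job: every change captured on the source is materialized into the index.
  INSERT INTO candidate_index
  SELECT id, full_name, location
  FROM candidate;

A transformation as complex as the one described above pushes well past this kind of minimal pipeline, which is where the modifications described in Part 2 come in.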



Continued in Part 2: Modifying Flink for large scale continuous ETL