At Toplyne, we have architected & polished our data strategy iteratively on the Snowflake cloud data warehouse. We are now diversifying our infra by incorporating open technologies for both storage & compute. After a good 8 months of exploration & PoCs, we are preparing our first MVP using Apache Hudi, Apache Spark & Dagster.
This article is about how we are setting ourselves up on Apache Hudi. In the coming weeks, we’ll publish more about the downstream systems.
what do we do at Toplyne?
Toplyne is a startup serving AI solutions 🤖 to the likes of Cloudflare, Notion, Zapier, Clickup, and many other SaaS and D2C companies. We are constantly crunching first-party data 🥣. Our data tech enables sales and marketing teams in different organizations to identify & engage with their high-intent customers & deliver more value 💼.
For a small company with no right to exist 😇, we needed to gain access to our customers’ data in the easiest way possible. This, in turn, made Toplyne easy for our customers to adopt. The solution we went with was bulk loading existing data that had already been collected 💾 by Segment, Amplitude, Mixpanel, etc.
However, the inherent bulkiness of this data presents its own set of challenges & expenses 💰 to Toplyne’s data teams. Given its bulk, it is only cost-effective to ingest & crunch this data very, very slowly 🐌. Slowly as in once-a-day level slow 🚲. Plus, there is constant performance tuning activity 🔧 to ensure that our systems are purring along as efficiently as possible.
To read more about Toplyne’s current tech stack, hop onto an article we published sometime back [1].
the data stack problem statement.
Over time, we have learned a few things about our data product.
A significant percentage of bulk-loaded data is not relevant to our use case. We require only a few precise data points to build our ML models.
Our customers want us to act on their data points more quickly than we currently do, which means we need to crunch their data more frequently 🏎️ to deliver value quickly. As of now, in the worst-case scenario, it takes us a few days to incorporate customers’ new data points into our ML models.
The first-party data is synced into our systems, but we don’t own it. These syncs 💿 are quite expensive, primarily due to their high bandwidth requirements. Plus, different tools are required to sync the data depending on the source, owing to varying compatibility specifications.
The combination of these three factors increases the TCO (total cost of ownership) at the data ingestion layer. We have written multiple iterations ♻️ of our ETL application, which can understand a wide variety of upstream data schemas and produce the data in a standardized schema. Every new first-party data source increases the complexity & TCO of our architecture. As Toplyne grows & scales up 🚀, this cost will become unsustainable.
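To make the pain concrete, here is a deliberately simplified, hypothetical sketch of the kind of normalization branch our ETL has to grow for every new source (the field names are illustrative, not our production schema):

```python
# Hypothetical, simplified sketch of the schema-normalization step the ETL
# performs for every first-party source (field names are illustrative only).
def normalize_event(source: str, raw: dict) -> dict:
    """Map a source-specific event payload into one standardized schema."""
    if source == "amplitude":
        return {
            "user_id": raw["user_id"],
            "event_name": raw["event_type"],
            "event_ts": raw["event_time"],
            "properties": raw.get("event_properties", {}),
        }
    if source == "segment":
        return {
            "user_id": raw["userId"],
            "event_name": raw["event"],
            "event_ts": raw["timestamp"],
            "properties": raw.get("properties", {}),
        }
    raise ValueError(f"Unknown source: {source}")
```

Every new source adds another branch like this (plus tests, backfills and monitoring), which is exactly the complexity we want to stop paying for.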
motivating a solution.
In response to these limitations, we have now started building our own ingestion layer. Our customers can integrate ⚓ our ingestion solutions directly into their applications, which lets them stream their first-party data straight into our systems 🌊. This does away entirely with the latency concerns we have faced till now. All the data is ingested directly in a standard schema, so there are no ETL requirements either. In one fell swoop 🏋️, we have solved our ingestion & ETL problems.
Till now, since we had bulk-load requirements, our go-to solution had been the Snowflake Cloud Data Warehouse ❄️. With the addition of our own custom ingestion layer, we started exploring lakehouse storage solutions. The lakehouse architecture 🛖 brings with it a wide array of compute choices [2], significantly reduces vendor lock-in, and enables us engineers to go deeper into the big data stack.
the solution is a lakehouse on Apache Hudi.
We have now selected Apache Hudi as our storage layer 🏡. Apache Hudi was originally designed by Uber to solve for streaming data ingestion [3], which is pretty much our use case as well; this also justifies our choice of Hudi over Apache Iceberg and Delta Lake. We are using Apache Kafka as middleware between our SDK’s HTTP application and the Apache Hudi lakehouse.
This architecture is rather straightforward and can be composed by following Hudi’s documentation [4].
Moving data from the HTTP 🌐 endpoint all the way into Kafka is a well-documented design and trivial to implement. We have decided to use Amazon MSK (managed Kafka) primarily to give ourselves time to focus on the data pipeline & MLOps efforts in the medium term 🕰️.
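A minimal sketch of what that HTTP-to-Kafka hop looks like, assuming kafka-python and a per-customer topic naming convention; the broker addresses, topic prefix and field names are placeholders, and the MSK IAM auth configuration is omitted for brevity:

```python
# Minimal sketch: the ingestion service pushing a standardized event into Kafka.
# Assumes kafka-python; broker addresses and topic names are placeholders,
# and the MSK IAM auth configuration is omitted for brevity.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["<msk-broker-1>:9098", "<msk-broker-2>:9098"],
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def ingest(customer_id: str, event: dict) -> None:
    """Write one standardized event to that customer's topic, keyed by user."""
    topic = f"ingest.{customer_id}"  # hypothetical per-customer naming convention
    producer.send(topic, key=event["user_id"], value=event)
```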
Moving data from Kafka to Hudi, however, turned out to be an interesting and moderately non-trivial implementation. We’ll zoom 🧐 into how we identified a suitable integration solution for moving data from Kafka to Hudi. We’ll also quickly throw light 🔦 on our choice of the Hive catalog and its relevance.
kafka → hudi.
Putting data into Hudi from Kafka has two well-known techniques:
Kafka Connect [5]
Hoodie Streamer [6]
kafka connect.
Since Kafka Connect was the shiny new toy 😊, we decided to take it up first. For us, it was a head-first dive into unknown territory: we were brand new to Kafka, MSK, Kafka Connect & Hudi 🧑🏼🎓, so we had a significantly high onboarding cost.
Unfortunately, as of v0.14 & v0.15, Hudi’s Kafka Connect sink is still a work in progress and has a bunch of silent error scenarios. These errors show up without proper messages, making them harder to debug. Plus, there isn’t much documentation and there aren’t many demo solutions to follow, so bootstrapping the solution is quite time-consuming.
Another issue is that MSK Connect is on Kafka Connect v2.7. Considering Kafka is on v3.7, MSK Connect seems super outdated. To circumvent this, we would need to self-host our own instance of Kafka Connect 🏗️, and this additional service increases the TCO.
Another wrinkle is that we are looking to use MSK with IAM auth, and the Hudi Kafka Connect docs don’t do a great job of explaining how to configure this. All in all, we spent 4 days trying to set up Kafka Connect with MSK, but faced challenges at every step, which significantly reduced our confidence in this choice of technology ⚡.
Hudi v1.0 will come with significant improvements to Kafka Connect, but as of the latest stable release, v0.15, the above-mentioned problems persist. Chris has covered the limitations of Kafka Connect in his publication as well [7]. We’ll take another stab at the Kafka Connect setup down the line & publish our detailed PoC 🤞🏻.
tl;dr, we evaluated it to be unusually expensive to maintain & run, so we chose not to go with it.
hoodie streamer.
Hoodie Streamer is a Spark 💥 application that incrementally reads data from a specified Kafka topic into an RDD and then writes this data into Hudi tables 🌈.
The simplicity is very hard to miss. Unlike Kafka Connect’s push mechanism, this is a pull mechanism. If seconds-level latency is not a requirement, I believe that Hoodie Streamer is a super viable mechanism. We have set up EMR for starters, and we use Airflow to trigger the streamer every few minutes ⌛ by submitting the Spark job as an EMR step. The application easily syncs the data to the AWS Glue Data Catalog, and auth with Kafka is easy to set up as well.
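To make this concrete, here is a sketch of the Airflow task that submits one streamer run as an EMR step, loosely following the Hudi streaming-ingestion docs [6]. The bucket, cluster, topic/table names, the 15-minute cadence and some flag names (which have shifted across Hudi versions) are placeholders rather than our exact production setup; the Kafka bootstrap servers and the SASL_SSL / AWS_MSK_IAM auth properties live in the properties file passed via --props.

```python
# Sketch: Airflow submits one Hoodie Streamer run as an EMR step.
# Bucket, cluster, topic/table names and the cadence are illustrative placeholders.
from datetime import datetime
from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator

SPARK_SUBMIT_ARGS = [
    "spark-submit",
    "--class", "org.apache.hudi.utilities.streamer.HoodieStreamer",
    "s3://<bucket>/jars/hudi-utilities-bundle.jar",
    "--table-type", "COPY_ON_WRITE",
    "--source-class", "org.apache.hudi.utilities.sources.JsonKafkaSource",
    "--source-ordering-field", "event_ts",
    "--target-base-path", "s3://<bucket>/hudi/acme/events",
    "--target-table", "acme_events",
    "--enable-sync",  # sync the table to the Glue Data Catalog (via Hive sync)
    # the Kafka topic, bootstrap.servers and the SASL_SSL / AWS_MSK_IAM
    # auth properties are kept in this properties file:
    "--props", "s3://<bucket>/conf/acme-kafka-source.properties",
]

with DAG(
    dag_id="hudi_streamer_acme_events",
    start_date=datetime(2024, 1, 1),
    schedule_interval="*/15 * * * *",  # "every few minutes"; tune per customer
    catchup=False,
) as dag:
    EmrAddStepsOperator(
        task_id="submit_streamer_step",
        job_flow_id="<emr-cluster-id>",
        steps=[{
            "Name": "hoodie-streamer-acme-events",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {"Jar": "command-runner.jar", "Args": SPARK_SUBMIT_ARGS},
        }],
    )
```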
Since we are ingesting data for multiple customers, we create topics per customer. Every streamer application has a 1:1 mapping between the topic and the destination table. Hence, there are a lot of Spark applications running regularly, at varying time intervals, to perform the data sync.
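Because the wiring is identical for every customer, the per-customer tasks can be generated from a small mapping instead of being hand-written. A hypothetical sketch, where build_spark_submit_args is an assumed helper that assembles the arguments from the previous snippet for a given topic/table pair:

```python
# Hypothetical sketch: one streamer task per customer topic/table pair.
# build_spark_submit_args is an assumed helper that fills in the spark-submit
# arguments from the previous snippet for the given topic and table.
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator

CUSTOMER_TABLES = {
    "acme":   {"topic": "ingest.acme",   "table": "acme_events"},
    "globex": {"topic": "ingest.globex", "table": "globex_events"},
}

for customer, cfg in CUSTOMER_TABLES.items():
    # One task per customer (or one DAG per customer if sync cadences differ).
    EmrAddStepsOperator(
        task_id=f"sync_{customer}",
        job_flow_id="<emr-cluster-id>",
        steps=[{
            "Name": f"hoodie-streamer-{cfg['table']}",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": build_spark_submit_args(cfg["topic"], cfg["table"]),
            },
        }],
    )
```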
catalog.
Apache Hudi doesn’t have any dependency on a data catalog, but it supports syncing to a Hive metastore 🐝. Since we have historically used Snowflake as our daily driver, we are extremely comfortable with the `db.schema.table` syntax. Replacing it with POSIX-style file system paths would make our code, let’s just say, weird. The conciseness 🛝 of the former syntax is simply too hard to give up. Hence, we have decided to incorporate the AWS Glue Data Catalog. Plus, we are leveraging AWS IAM to get RBAC on all the data that is going to sit in Hudi tables 📋.
Hoodie Streamer makes it easy to sync table metadata into the AWS Glue Data Catalog, and it is pretty straightforward to query this data via Athena. Hence, on day 0, we can validate from Athena itself that all the data is landing in our Hudi tables.
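For instance, a day-0 sanity check can be as small as an Athena count over the Glue-synced table; a sketch with boto3, where the database, table and results-bucket names are placeholders:

```python
# Sketch of a day-0 sanity check: count rows in the Glue-synced Hudi table via Athena.
# Database, table and results-bucket names are placeholders.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="SELECT count(*) FROM acme_events",     # placeholder table name
    QueryExecutionContext={"Database": "hudi_ingest"},   # placeholder Glue database
    ResultConfiguration={"OutputLocation": "s3://<bucket>/athena-results/"},
)
print(response["QueryExecutionId"])  # poll get_query_execution / get_query_results next
```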
monitoring.
Both MSK and Hoodie Streamer emit CloudWatch metrics. We are tracking these metrics to monitor the health of our system and infra. As we onboard more customers onto this stack, we’ll define more refined observability guidelines for our system.
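As a starting point, we put simple alarms on a handful of these metrics. A sketch with boto3, where the choice of MSK’s consumer-group offset-lag metric, the dimensions and the threshold are assumptions to adapt to your own cluster and monitoring level:

```python
# Sketch: alarm if the streamer's consumer group starts lagging behind on MSK.
# Metric/dimension names follow the AWS/Kafka namespace; confirm they are
# exposed at your cluster's monitoring level, and tune the threshold to traffic.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="hudi-streamer-offset-lag",
    Namespace="AWS/Kafka",
    MetricName="SumOffsetLag",
    Dimensions=[
        {"Name": "Cluster Name", "Value": "<msk-cluster>"},
        {"Name": "Consumer Group", "Value": "<streamer-consumer-group>"},
        {"Name": "Topic", "Value": "ingest.acme"},
    ],
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=3,
    Threshold=100000,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)
```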
fin.
Now that the data has landed in our system, we are building out new feature engineering pipelines. Till now, we have had long-running ETL applications that require multi-step orchestration and trigger Snowflake SQL. We have an elaborate setup on Airflow for this.
But our new setup doesn’t require ETL, so we can directly jump into feature engineering 🧑🏻🔬 on Spark. For this, we have started writing test pipelines on Dagster.
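An early sketch of what such a Dagster pipeline looks like, assuming PySpark with the Hudi bundle on the classpath; the asset, table path and feature logic are placeholders, not our production code:

```python
# Early sketch of a Dagster asset reading a Hudi table with PySpark.
# Table path and feature logic are placeholders, not our production pipeline.
from dagster import asset
from pyspark.sql import SparkSession, functions as F

@asset
def user_activity_features():
    spark = (SparkSession.builder
             .appName("feature-engineering")
             .getOrCreate())
    events = spark.read.format("hudi").load("s3://<bucket>/hudi/acme/events")
    return (events
            .groupBy("user_id")
            .agg(F.count("*").alias("event_count"),
                 F.max("event_ts").alias("last_seen_ts")))
```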
We are excited about what the future holds for us and we can’t wait to share it with everyone as well. More on that coming soon.
Bibliography
[1] https://medium.com/towards-data-science/cooking-with-snowflake-833a1139ab01
[2] https://arxiv.org/abs/2308.05368
[3] https://www.oreilly.com/content/ubers-case-for-incremental-processing-on-hadoop/
[4] https://hudi.apache.org/docs/docker_demo
[5] https://github.com/apache/hudi/tree/master/hudi-kafka-connect
[6] https://hudi.apache.org/docs/hoodie_streaming_ingestion/
[7] https://materializedview.io/i/145590156/improving-kafka-connect