A transformation framework that understands your data: our investment in Tobiko Data

As the demand for data increases and the cost of storage decreases, most data pipelines have moved to an ELT paradigm: loading all data into a warehouse in unaltered form, then building a series of transformations for downstream consumers.

Historically, building these pipelines required the expertise of centralized data teams who could develop in analytics engines like Spark. But in recent years, we’ve seen the rise of open-source, easy-to-use frameworks that democratized the development of data pipelines with SQL. 

These tools work great at small scale, but as an organization grows – in terms of headcount, data volume, or number of models – they start to break. That’s because these platforms typically re-run all computations after any change to data or a model.

As a result, pipelines take hours to run and cloud costs balloon because huge volumes of data are processed unnecessarily. Developers don’t know if a change will break someone else’s model downstream, and must wait hours each time the pipeline runs. Maintaining development/staging and production environments is complex; each deployment typically involves a full re-run. Running backfills and forward-only changes requires custom infrastructure or manual workarounds.  

Tobiko Data solves these problems with an open-source data transformation framework that is just as simple to develop in with SQL, while significantly reducing costs and bringing DevOps best practices to data teams at scale. We are thrilled to lead their $17.3 million Series A, joined by Unusual Ventures and angels including George Fraser, CEO of Fivetran, and Jordan Tigani, CEO of MotherDuck.

Tobiko Data: a transformation framework that understands your data

The journey of Tobiko started with an open-source project called SQLGlot, a no-dependency SQL parser, transpiler, optimizer, and engine. Co-founder Toby Mao, then at Netflix, realized how useful it would be to have a simple Python library that could parse and transpile across the various SQL dialects his team used.

Because SQLGlot allows you to understand the meaning of and build abstract syntax trees (ASTs) for any SQL code, Toby and co-founders Tyson Mao and Iaroslav Zeigerman realized it could be the foundation for a new type of data transformation framework built to address some of the large-scale data pipeline challenges they saw at Airbnb, Netflix, and Google.

The core problem with current data transformation frameworks is that they (1) don’t understand how the code you write corresponds to flows of data through the pipeline, and (2) are stateless. Because of this, the entire computational graph is re-run after any change to a model or data, unless a user implements complex custom logic and instrumentation.
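The value of statefulness can be sketched in a few lines. The following is an illustrative toy (not SQLMesh's actual implementation, and the model names are made up): a framework that persists a fingerprint of each model between runs can skip every model whose definition has not changed, instead of re-running the whole graph.

```python
# Illustrative sketch (not SQLMesh's actual implementation): a stateful
# framework fingerprints each model definition and skips re-running
# models whose SQL is unchanged since the last run.
import hashlib

def fingerprint(sql: str) -> str:
    """Stable hash of a model's definition."""
    return hashlib.sha256(sql.encode()).hexdigest()

# State persisted between runs: model name -> fingerprint at last run.
previous_state = {
    "staging.orders": fingerprint("SELECT * FROM raw.orders"),
    "marts.revenue": fingerprint("SELECT SUM(amount) FROM staging.orders"),
}

# Current model definitions; only marts.revenue was edited.
current_models = {
    "staging.orders": "SELECT * FROM raw.orders",
    "marts.revenue": "SELECT SUM(amount) AS revenue FROM staging.orders",
}

changed = [
    name for name, sql in current_models.items()
    if previous_state.get(name) != fingerprint(sql)
]
print(changed)  # only the edited model needs recomputation
```

A stateless framework has no `previous_state` to consult, so its only safe default is to recompute everything on every run.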

Enter SQLMesh – an open-source transformation framework based on a semantic understanding of SQL. This allows SQLMesh to keep track of data as it flows through data pipelines, and when your team updates models and data over time. This helps organizations develop more effectively and save huge amounts of time and money, providing:

  • Column-level lineage and incremental loads by default – with any change to the data or models, SQLMesh will only re-run the parts of the pipeline that are directly impacted
  • Virtual data environments for development/staging, allowing you to test changes and roll back/forward easily and without recomputing any completed models
  • CI-runnable tests to provide instant feedback on change impacts; automated categorization of breaking vs non-breaking changes
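The first bullet rests on knowing the pipeline's dependency graph. A minimal sketch of that idea, with a hypothetical five-model graph: given which model changed, a breadth-first walk over downstream edges yields the exact subset that must be recomputed, leaving everything else untouched.

```python
# Illustrative sketch of how lineage limits recomputation: given a
# dependency graph of models, only the changed model and its downstream
# dependents need to re-run. Model names here are hypothetical.
from collections import deque

# parent -> models that read from it
downstream = {
    "raw.events": ["staging.events"],
    "staging.events": ["marts.daily_active", "marts.funnels"],
    "marts.daily_active": [],
    "marts.funnels": ["marts.exec_dashboard"],
    "marts.exec_dashboard": [],
}

def impacted(changed: str) -> set:
    """Breadth-first walk collecting every model affected by a change."""
    seen, queue = {changed}, deque([changed])
    while queue:
        for child in downstream[queue.popleft()]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# A change to staging.events touches four of the five models;
# raw.events never needs to be reloaded.
print(sorted(impacted("staging.events")))
```

Because SQLMesh derives this graph from the SQL itself (via SQLGlot's parsing), developers get this minimal re-run behavior without hand-maintaining dependency metadata.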

In addition to the core transformation framework, SQLMesh includes an orchestrator, CI/CD testing framework, and virtual data environments to manage the promotion of changes to production – all in the open-source library.

SQLMesh has a robust community and partnerships with companies including Harness, Fivetran, Pipe, Wealthsimple, Textio, and Dreamhaven.

For Harness, switching to SQLMesh reduced their cloud warehouse spend by 30-40% by avoiding unnecessary recalculations. It also made their developers more productive, reducing model build time by 80% and flagging breaking changes to allow for fast iteration.

This month, Tobiko is launching a managed version of SQLMesh called Tobiko Cloud. This will allow any organization to easily run SQLMesh without managing pipeline state, while all data processing remains on the customer’s own infrastructure. 

Conclusion

As the importance of data continues to grow, the biggest challenge companies face is how to manage it effectively. How can data analysts build pipelines quickly without giving data platform engineers a headache? How can teams support larger and more complex pipelines without blowing performance and cost out of the water?

We are so excited to partner with Tobiko Data, who are ending the historical tradeoff between data transformation usability and scalability to enable the next generation of great data companies.
