A transformation framework that understands your data: our investment in Tobiko Data

As the demand for data increases and the cost of storage decreases, most data pipelines have moved to an ELT paradigm: loading all data into a warehouse in unaltered form, then building a series of transformations for downstream consumers.

Historically, building these pipelines required the expertise of centralized data teams who could develop in analytics engines like Spark. But in recent years, we’ve seen the rise of open-source, easy-to-use frameworks that democratized the development of data pipelines with SQL. 

These tools work great at small scale, but as an organization grows – in terms of headcount, data volume, or number of models – they start to break. That’s because these platforms typically re-run all computations after any change to data or a model.

As a result, pipelines take hours to run and cloud costs balloon because huge volumes of data are processed unnecessarily. Developers don’t know if a change will break someone else’s model downstream, and must wait hours each time the pipeline runs. Maintaining development/staging and production environments is complex; each deployment typically involves a full re-run. Running backfills and forward-only changes requires custom infrastructure or manual workarounds.  

Tobiko Data solves these problems with an open-source data transformation framework that is just as simple to develop in with SQL, while significantly reducing costs and bringing DevOps best practices to data teams at scale. We are thrilled to lead their $17.3 million Series A, joined by Unusual Ventures and angels including George Fraser, CEO of Fivetran, and Jordan Tigani, CEO of MotherDuck.

Tobiko Data: a transformation framework that understands your data

The journey of Tobiko started with an open-source project called SQLGlot, a no-dependency SQL parser, transpiler, optimizer, and engine. Co-founder Toby Mao, then at Netflix, realized how useful it would be to have a simple Python library that could parse and transpile across the various SQL dialects his team used.

Because SQLGlot allows you to understand the meaning of and build abstract syntax trees (ASTs) for any SQL code, Toby and co-founders Tyson Mao and Iaroslav Zeigerman realized it could be the foundation for a new type of data transformation framework built to address some of the large-scale data pipeline challenges they saw at Airbnb, Netflix, and Google.

The core problem with current data transformation frameworks is that they (1) don’t understand how the code you write corresponds to flows of data through the pipeline, and (2) are stateless. Because of this, the entire computational graph is re-run after any change to a model or data, unless a user implements complex custom logic and instrumentation.
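The value of statefulness can be sketched in a few lines. The following is an illustrative toy (not SQLMesh's actual implementation, and the model names are made up): a framework that persists a fingerprint of each model between runs can skip every model whose definition has not changed, instead of re-running the whole graph.

```python
# Illustrative sketch (not SQLMesh's actual implementation): a stateful
# framework fingerprints each model definition and skips re-running
# models whose SQL is unchanged since the last run.
import hashlib

def fingerprint(sql: str) -> str:
    """Stable hash of a model's definition."""
    return hashlib.sha256(sql.encode()).hexdigest()

# State persisted between runs: model name -> fingerprint at last run.
previous_state = {
    "staging.orders": fingerprint("SELECT * FROM raw.orders"),
    "marts.revenue": fingerprint("SELECT SUM(amount) FROM staging.orders"),
}

# Current model definitions; only marts.revenue was edited.
current_models = {
    "staging.orders": "SELECT * FROM raw.orders",
    "marts.revenue": "SELECT SUM(amount) AS revenue FROM staging.orders",
}

changed = [
    name for name, sql in current_models.items()
    if previous_state.get(name) != fingerprint(sql)
]
print(changed)  # only the edited model needs recomputation
```

A stateless framework has no `previous_state` to consult, so its only safe default is to recompute everything on every run.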

Enter SQLMesh – an open-source transformation framework based on a semantic understanding of SQL. This allows SQLMesh to keep track of data as it flows through data pipelines, and when your team updates models and data over time. This helps organizations develop more effectively and save huge amounts of time and money, providing:

  • Column-level lineage and incremental loads by default – with any change to the data or models, SQLMesh will only re-run the parts of the pipeline that are directly impacted
  • Virtual data environments for development/staging, allowing you to test changes and roll back/forward easily and without recomputing any completed models
  • CI-runnable tests to provide instant feedback on change impacts; automated categorization of breaking vs non-breaking changes
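The first bullet rests on knowing the pipeline's dependency graph. A minimal sketch of that idea, with a hypothetical five-model graph: given which model changed, a breadth-first walk over downstream edges yields the exact subset that must be recomputed, leaving everything else untouched.

```python
# Illustrative sketch of how lineage limits recomputation: given a
# dependency graph of models, only the changed model and its downstream
# dependents need to re-run. Model names here are hypothetical.
from collections import deque

# parent -> models that read from it
downstream = {
    "raw.events": ["staging.events"],
    "staging.events": ["marts.daily_active", "marts.funnels"],
    "marts.daily_active": [],
    "marts.funnels": ["marts.exec_dashboard"],
    "marts.exec_dashboard": [],
}

def impacted(changed: str) -> set:
    """Breadth-first walk collecting every model affected by a change."""
    seen, queue = {changed}, deque([changed])
    while queue:
        for child in downstream[queue.popleft()]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# A change to staging.events touches four of the five models;
# raw.events never needs to be reloaded.
print(sorted(impacted("staging.events")))
```

Because SQLMesh derives this graph from the SQL itself (via SQLGlot's parsing), developers get this minimal re-run behavior without hand-maintaining dependency metadata.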

In addition to the core transformation framework, SQLMesh includes an orchestrator, CI/CD testing framework, and virtual data environments to manage the promotion of changes to production – all in the open-source library.

SQLMesh has a robust community and partnerships with companies including Harness, Fivetran, Pipe, Wealthsimple, Textio, and Dreamhaven.

For Harness, switching to SQLMesh reduced their cloud warehouse spend by 30-40% by avoiding unnecessary recalculations. It also made their developers more productive, reducing model build time by 80% and flagging breaking changes to allow for fast iteration.

This month, Tobiko is launching a managed version of SQLMesh called Tobiko Cloud. This will allow any organization to easily run SQLMesh without managing pipeline state, while all data processing remains on the customer’s own infrastructure. 

Conclusion

As the importance of data continues to grow, the biggest challenge companies face is how to manage it effectively. How can data analysts build pipelines quickly without giving data platform engineers a headache? How can teams support larger and more complex pipelines without blowing performance and cost out of the water?

We are so excited to partner with Tobiko Data, who are ending the historical tradeoff between data transformation usability and scalability to enable the next generation of great data companies.
