Trino set to advance open source SQL query performance

The rebranding of PrestoSQL as Trino has been a boon to the open source effort, with both the capabilities and the adoption of the query technology growing in 2021.

The open source Trino distributed SQL query engine has had a big year in 2021 and is gearing up for more innovation in the year to come.

At the recent Trino Summit virtual event, supporters and users of Trino detailed use cases for the open source distributed SQL query engine. The event was sponsored by commercial Trino vendor Starburst, one of the leading contributors to the Trino open source project.

Until late 2020, Trino was known as PrestoSQL, a project that competed with PrestoDB, a related technology backed by The Linux Foundation and now simply called Presto.

At the Trino Summit, multiple users including LinkedIn, Electronic Arts, Robinhood and DoorDash took the virtual stage to explain how their organizations are using Trino at scale to enable distributed data queries.

"We use Trino to build our core data query platform that empowers us to make data-driven analysis and decisions," said Grace Lu, a senior software engineer at investing app vendor Robinhood, during a user session on Oct. 22.

How Trino helps Robinhood with a distributed SQL engine

Robinhood uses Trino for its own internal-facing applications. Those applications include data analytics and business intelligence, as well as overall platform visibility to help troubleshoot availability and performance problems.

Robinhood has multiple Trino clusters that connect to different data sources and enable the company's users to run queries against those data sources.

Among the data sources are multiple PostgreSQL databases that serve as Robinhood's primary transactional data source. Robinhood also uses an Alation data catalog and the Looker analytics platform, both of which connect to Robinhood's data sources through Trino so that users can query data.

DoorDash is onboarding Trino for distributed SQL queries

The pandemic has sparked an upsurge in business for food delivery services, including DoorDash. In a user session on Oct. 21, Akshat Nair, engineering manager at the San Francisco-based company, detailed how the organization uses Trino to enable distributed data queries.

DoorDash has a complex data architecture that uses PostgreSQL, Apache Cassandra and CockroachDB as core data sources. For real-time event streaming, DoorDash uses Kafka. Some of the data lands in a Snowflake cloud data warehouse, while other data flows to an Amazon S3-based data lake.

DoorDash is now in an early adoption phase for Trino and is using it to enable queries across its data architecture, Nair said. DoorDash's initial use case is similar to that of Robinhood, enabling internal users to run data analytics on business processes and operations.

"We are in an adoption phase at this point in time, so the volume of queries is not huge, but the data being processed is measured in terabytes and petabytes for some of these tables," Nair said.
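To make the idea of querying across a mixed data architecture concrete, here is a minimal sketch of how a single Trino query can address tables in different systems. Trino names tables as catalog.schema.table, where each catalog maps to a configured data source. The catalog, schema, table and column names below are hypothetical and are not taken from DoorDash's deployment; the helper only builds the SQL text and does not connect to a cluster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableRef:
    """A table addressed the way Trino addresses it: catalog.schema.table."""
    catalog: str
    schema: str
    table: str

    def qualified(self) -> str:
        return f"{self.catalog}.{self.schema}.{self.table}"


def build_federated_join(left: TableRef, right: TableRef, key: str) -> str:
    """Return SQL text that joins two tables living in different catalogs."""
    return (
        f"SELECT l.{key}, count(*) AS events\n"
        f"FROM {left.qualified()} AS l\n"
        f"JOIN {right.qualified()} AS r ON l.{key} = r.{key}\n"
        f"GROUP BY l.{key}"
    )


# Hypothetical names for illustration only.
orders = TableRef("postgresql", "public", "orders")       # transactional store
events = TableRef("hive", "lake", "delivery_events")      # S3-backed data lake
FEDERATED_SQL = build_federated_join(orders, events, "order_id")
print(FEDERATED_SQL)
```

The point of the sketch is that the join itself is ordinary SQL; only the catalog prefix tells the engine which underlying system holds each table.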

Figure: DoorDash has a complex data architecture and is now starting to use Trino to enable distributed SQL queries.

The state of Trino moving forward

Martin Traverso, co-creator of Presto and Trino and CTO of Starburst, gave insight during a keynote presentation on Oct. 21 into the technical progress Trino has made this year and where the project is headed.

Traverso explained that PrestoSQL, which was rebranded as Trino in December 2020, and PrestoDB really began to diverge in 2019. He noted that while the two projects have a shared history, more than 40% of the changes made to Trino have occurred since 2019, and all of those changes are exclusive to Trino.

A number of new capabilities will come to Trino over the coming months, Traverso said. Among them is a capability that Traverso referred to as granular fault tolerance.

One of the big limitations of Trino now is that if a query exceeds the amount of memory available in a cluster, the query will fail. With the granular fault tolerance capability, the query engine will be able to retry a query to help it succeed, instead of just failing entirely.
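The difference between failing a whole query and retrying at a finer grain can be illustrated with a small conceptual sketch. It is not Trino's implementation: the way work is split into tasks, the `TaskFailure` exception, the retry limit and all function names are assumptions made for illustration.

```python
from typing import Callable, List


class TaskFailure(Exception):
    """Stand-in for any failure of one unit of work, such as running out of memory."""


def run_all_or_nothing(tasks: List[Callable[[], int]]) -> List[int]:
    """Whole-query model: the first task failure fails the entire query."""
    return [task() for task in tasks]


def run_with_task_retry(tasks: List[Callable[[], int]], max_attempts: int = 3) -> List[int]:
    """Granular model: re-run only the task that failed, up to max_attempts times."""
    results = []
    for task in tasks:
        for attempt in range(1, max_attempts + 1):
            try:
                results.append(task())
                break
            except TaskFailure:
                if attempt == max_attempts:
                    raise  # give up only after the retry budget is spent
    return results


def make_flaky_task(value: int, failures: int) -> Callable[[], int]:
    """Build a task that raises TaskFailure `failures` times, then returns value."""
    remaining = {"n": failures}

    def task() -> int:
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise TaskFailure(f"task {value} failed")
        return value

    return task


# The second task fails once; only that task is retried.
RESULT = run_with_task_retry(
    [make_flaky_task(1, 0), make_flaky_task(2, 1), make_flaky_task(3, 0)]
)
print(RESULT)  # [1, 2, 3]
```

With `run_all_or_nothing`, the same set of tasks would raise on the second task and discard the work already done, which mirrors the current behavior the article describes.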

Trino uses the Java programming language at its foundation. Traverso noted that Trino currently is based on Java 11, which is several years old. In the coming months Trino is moving to the newer Java 17 as a foundation.

"We've actually started doing some benchmarking with Java 17, and we see that we get 20% improvement in performance," Traverso said. "So it is very important to be able to move to Java 17 as the platform on top of which Trino is built."
