Jul 10, 2019

Conversation with Tomer Shiran

Podcast #66: How Dremio Makes Analytics Easier

Tomer Shiran is the Co-Founder and CEO of Dremio, which is a new company that’s doing a lot of really interesting work at the intersection of cloud technology and data. In our conversation he outlines the challenges that have faced different waves of big data technology, and the origins for the Dremio solution. He shares the company’s unique approach to enabling big data analytics from a variety of data sources. He also shares the implications of recent consolidation in the market, and opines on the broad potential value add from applied analytics across industries.



The Hard Thing About Hard Things – by Ben Horowitz



Good day everyone, and welcome to another Momenta Podcast. This is Ed Maguire, Insights Partner at Momenta, and our guest today is Tomer Shiran, Co-Founder and CEO of Dremio, which is a new company that’s doing a lot of really interesting work at the intersection of cloud technology, and data. We’re going to dive into what they’re doing, and a little bit of the context around their technology and their solutions, and then open things up a bit. So, Tomer thank you so much for joining us.

Thanks for having me here.

First of all, could you provide a bit of context about yourself, your background, and what has led you to your current role as the CEO and co-founder of Dremio?

I can go a long way back, but maybe we can start in 2009, it’s when I joined a company in the big data space called MapR, it was one of the Hadoop vendors. At the time we were just four people when I joined, these were the days when everybody was watching what Google had done internally, and it seemed like maybe that would be applicable to the enterprise. But it was a time when really no company out there was using this thing called Hadoop, so that’s when I joined the company with the goal of changing how companies can interact with data.

Great. Tell us a bit about the context that led you to found Dremio.

So, going back to around that timeframe, the thinking was, if we could create a single platform where companies could put all their data into one place, what we now call a data lake, then magic would happen, anybody could access that data, anybody could do anything with that. The goal was always to create what we call data as a service. If you think about the last 20 years, starting with, say, Salesforce introducing software as a service, and then Amazon introducing infrastructure as a service in 2006, and Microsoft shortly after; even Uber and Lyft now providing transportation as a service, everything is becoming an as-a-service model, very on-demand, yet when it comes to data it’s an engineering project every single time. So, if I’m an analyst, or somebody else who wants to consume data in order to do my job, I’m really dependent today on an IT team, or on a data engineering team in order to do anything.

Of course, data engineers don’t like doing reactive work, doing one-off projects every time anyone else wants to do something. So, when I was at MapR in those early days, it seemed like if we could create that single platform, that data lake that all data could be dumped into, that would solve the problem, and would make it so that companies could become data driven. The reality was it didn’t quite materialize that way, and what happened is first of all we quickly realized companies couldn’t realistically get all their data into one place. So, as much as they tried, there were still a lot of data sources out there within the organization, whether it’s Oracle databases, SQL Server databases, NoSQL databases etc. that still had a lot of live and valuable data that needed to be analyzed, in addition to what had been loaded into the data lake.

The second problem was, the performance on top of these data lakes was just not there, it was way too slow and inefficient for say a BI user, someone using Tableau or Power BI from Microsoft, to be able to explore and analyze that data. So, that didn’t really work out.

Then the third problem was that it was really designed for engineers, so it was just too hard for the typical user to take advantage of these data lakes, and companies ended up building these really complex stacks and data infrastructure which involved exporting data from the data lake, into these data warehouses, and data marts, maybe the last 30 days or some aggregate level data. That wasn’t even fast enough so they would export that data into cubes, or into BI extracts, or aggregation tables, and then finally maybe a BI user could interact with a little piece of data at decent speed, but that complexity really eliminated the chance to have a self-service experience.

So that was the reason for starting Dremio. We thought the goal of having data as a service is the holy grail of analytics, and I believe there will be many tens of billions of dollars in opportunity for the company that solves that. But the approach of physically moving all the data into one of these object stores, or any kind of system, was just not realistic, and so we set about to solve that in a much more modern way, taking into account all the lessons that we had learned.

Before we jump into what you’re doing specifically at Dremio, could you talk about how the transition away from traditional data warehousing, and the promise of the concept of the data lake, had really opened up a lot of experimental approaches to analytics that in a sense created a lot more inherent problems? In this regard I think you alluded to it: with the traditional data warehouse, or if you’re doing OLAP, you would have an ETL process where you would scrub the data, and of course by the time it gets loaded into the data warehouse there’s an enormous amount of pre-prep involved, creating the data models and structures so that they could be analyzed appropriately, and of course mapped with the appropriate metadata, so this enormous amount of pre-prep.

I’m thinking of the initial days of business intelligence, companies like BusinessObjects and Cognos, and Hyperion in that space. When this concept of a data lake emerged with Hadoop around 2009, there was an enormous amount of promise: storage is cheap, you don’t have to be constricted by rigid hierarchical data structures, now you have all this freedom to store whatever you want. But as you alluded to, it created a whole host of problems after the fact that made it very difficult to get to the insights.

Could you talk about what you had learned in the process of being there at the early stages of the Hadoop wave?

That’s an interesting point. Data warehouses have been around for a long time, and the limiting factor was the lack of agility there. They were designed, like you said, with the assumption that you would have many, many weeks, if not months, to get the data ready for analysis and build all the models; but at the rate at which businesses are moving now, competing, and with digital transformation being the norm, that lack of agility just wasn’t good enough. On top of that, we are now living in a world where more and more people are comfortable with and interested in using data to do their job; these are the millions of business analysts that we have now. Everybody now learns this in school: you go to college, even high school, and you’re learning how to write Python code, how to use R and do those kinds of things, how to write SQL queries.

That didn’t use to be the case 20 years ago. So, in order for all these users to be able to take advantage of data, it can’t be something that has to be prepared with so much expertise, and so much effort involved; it has to be much more agile, and much more self-service. That was the premise of the data lake: the storage would be really cheap and scalable, so you could have all your data accessible, it wouldn’t be limited just to something that people had time to prepare and organize in advance, and then it would be very open so you can use different types of processing capabilities on top of that.

But I think one of the things maybe we didn’t quite think about and appreciate back then was that getting the data into one place isn’t the end game; it has to be fast enough. So, even if you were able to get all the data into one place, if the ability to query that data is not fast enough, if every time you drag the mouse in your BI tool it takes 15 minutes or an hour to come back, then you can’t really explore the data, and you can’t really analyze it on your own, and you’re back to, ‘I’m going to run a nightly batch report’, which is not what we want anymore.

So, those were the challenges that we ran into, on top of just how difficult the whole Hadoop platform was to work with. So we, and at the time our competitors, who are now generally our partners, ended up having to sell a lot of professional services to help companies that didn’t have big engineering teams be successful. So, that was an inhibitor to the success.

That’s very typical of any category of software in the early stages of development. I’d love to dive into your product and understand a little bit about the platform that you’re developing.

Basically, what we’re doing at Dremio, if you look at that dream or that vision of having data as a service, or analytics as a service, that’s what we’re after here at Dremio, that’s what we’re trying to provide. So, if you think about the challenges that people faced with data lake 1.0, those are the problems that we’re solving. So, Dremio is really data lake analytics built for the cloud, and what I mean by that is, we’re a platform that you can connect to your object store, things like S3 and ADLS, as well as HDFS. But you can also connect it to other sources of data that you might have, so things like SQL Server, Oracle, MongoDB and Elasticsearch. You can connect to multiple data lakes, you can connect to data lakes and data warehouses, so it really lets you span a broader range of data. That’s the first thing that makes data lake analytics for the cloud different.

I’d say the second thing is the element of performance. We’ve developed a lot of unique technology that makes it possible to run a query directly on your data where it lives now, and to achieve a sub-second response time, so you can have that interactive experience with your data, which may be many terabytes in a single dataset, and still have a very fast response time, which you couldn’t achieve by just running a SQL engine on a data lake.

The third thing, we provide a self-service semantic layer. It looks visually like Google Docs for your data, where users can interact with datasets and create new virtual datasets and share them with their colleagues and build on top of each other, and IT can use the semantic layer to provide data governance, masking, and security. So, it really makes the platform a much more agile experience, so you’re not having to create lots and lots of copies of data in order to provide performance, or to prepare data.

Typically, how would a customer get started with the platform? As you discussed the different components of the platform it sounds like there’s a discovery process, I’m interested in diving a little bit deeper into how you get to your results.

Sure, the only thing I was going to add is, we’re big believers in not having vendor lock-in. I think a lot of people were burnt by that in the days of the data warehouses; they got locked into a data warehouse, all the custom extensions of that data warehouse, and the data being stored in a proprietary format, which made it really hard then to get off of that system. We see that with both on-premises data warehouses and cloud data warehouses; basically they suffer from that same problem. So our belief is that the data should be stored in an open format, something that is open source, things like Parquet, ORC, etc., and the platform should be able to run on any cloud as well as on premises, and entirely in the company’s cloud account, so their AWS account or their Azure account, or if it’s on prem then on their Hadoop cluster, or Kubernetes.

But in terms of to your question about how we do this, and what the components are, there are two main components outside of the user interface and the experience.

  • A query and acceleration engine. This is the ability to run a SQL query and get an interactive response time, no matter how big the data is. That’s a combination of really fast performance, leveraging Apache Arrow, which is an open source project we created, and something we call Data Reflections, which allow us to accelerate, by orders of magnitude, the speed at which a query comes back.

  • What we call the Governed Self-Service semantic layer, so that’s the layer that allows users, both the data engineering teams as well as the data consumers, to create virtual datasets, and to collaborate through that semantic layer.

So those are the two main components of the system.

That sounds really straightforward when you’re thinking about getting the insights on a heterogenous set of data. What are some of the use cases that you’re finding some of your customers and initial users are using your technology for?

By and large a very common use case for us is to make the cloud data lake work and make it consumption friendly. Today most companies when they think about the cloud they’re putting data on S3, and they’re putting data on ADLS in Azure, but then they don’t really have a good enough way to analyze the data where it is, and so they end up having to load that data into a data warehouse, into these data marts, and to create the BI extracts, cubes, and all these copies, transformations, and movement of data, just so they can get fast performance on data in these systems. So, that’s a very common use case for us, and then mostly these companies have other sources, maybe relational databases for example, where they have data.

So, I can give you an example of this; one of the companies that uses Dremio extensively, and they’ve talked about this at a recent O’Reilly conference, is Royal Caribbean Cruise Lines. They basically built a data lake on Azure and used Dremio to power that data lake. They brought in data from over 20 different systems, ranging from reservations, to the casino, to property management, and created that customer 360 type of environment for them. So, they land all this data on ADLS, what they call the raw zone for the data, and then they utilize Dremio’s query layer to query all that data using Microsoft Power BI, as well as other data science applications.

But then they also have a semantic zone which is where these virtual datasets in Dremio reside, and that semantic zone is where they can create curated versions of the data, and users can interact with that data in a much more self-service way, and then the system automatically takes care of accelerating access to that data. They have some datasets in other systems as well, outside of ADLS, things like MongoDB and graph databases, and SQL Server, so it’s really a way for the company to become data driven, and to really take advantage of all the data that they have about their customers, the people that go on cruises and do a lot of research before the cruise, and provide feedback after the cruise, having that single point of view on the customer.

The way you’ve described the virtual semantic layer strikes me as being super-important in terms of the flexibility and usability. How do you handle changes in the underlying data when you’ve set up some structure, you’ve set up some reports, organized some aggregates of data for analysis, how does your system handle changes that may happen in the underlying systems?

There are basically two aspects to the system here. One is, when we think about the semantic zone it really consists of virtual datasets, so for these datasets we don’t store any data; these are entirely views on top of the physical datasets, and other virtual datasets in the system. So, users can create these views, whether it’s IT creating views to mask some of the data, or to clean some of the data, or it’s users creating these views or virtual datasets on top of those things. That creates this logical world of data, an entirely virtual layer where we’re not creating any copies of the data, or storing anything, but if I’m connecting to Dremio with a BI tool like Power BI or Tableau, I can directly interact with all of these virtual datasets in the semantic zone.
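As a rough illustration of the virtual-dataset idea described above, the sketch below uses SQLite views from Python’s standard library. It is not Dremio’s API: SQLite simply stands in for a physical source, and the table, view, and column names are made up for the example.

```python
import sqlite3

# Illustration only: SQLite stands in for the physical data source.
# Table, view, and column names are hypothetical and not part of Dremio.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reservations (guest_email TEXT, ship TEXT, fare REAL)")
conn.executemany(
    "INSERT INTO reservations VALUES (?, ?, ?)",
    [
        ("ann@example.com", "Aurora", 1200.0),
        ("bob@example.com", "Aurora", 900.0),
        ("cy@example.com", "Borealis", 1500.0),
    ],
)

# An IT-owned view that masks the email column; no rows are copied.
conn.execute(
    "CREATE VIEW reservations_masked AS "
    "SELECT substr(guest_email, 1, 1) || '***' AS guest, ship, fare "
    "FROM reservations"
)

# A user-owned view layered on top of the masked view.
conn.execute(
    "CREATE VIEW revenue_by_ship AS "
    "SELECT ship, SUM(fare) AS revenue FROM reservations_masked GROUP BY ship"
)


def revenue_by_ship():
    """Query the top-level view; it is evaluated against the base table each time."""
    return conn.execute(
        "SELECT ship, revenue FROM revenue_by_ship ORDER BY ship"
    ).fetchall()


# Because views store no data, a new physical row is visible immediately.
before = revenue_by_ship()
conn.execute("INSERT INTO reservations VALUES ('di@example.com', 'Borealis', 500.0)")
after = revenue_by_ship()
```

The point of the sketch is only that a change in the underlying table shows up in every view built on it, without any copy having to be refreshed.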

Underneath the hood we have this layer which we call data reflections. These are basically various materializations of data that we store on ADLS or S3 or HDFS, and these materializations are various aggregations and partitionings of various pieces of data, but those are never exposed to the end-user; so if I’m a Tableau user connecting to Dremio, or just a user, a data consumer interacting with the user interface, I will never even be aware of these data reflections, and they may change over time to optimize various workloads. The data reflections themselves get updated through an incremental update process, so as more data is added to, say, one of the datasets in my data lake, these reflections get updated incrementally, and basically that can happen on a schedule, or based on an SLA that the company has around that dataset.
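The incremental update idea can be sketched in a few lines of Python. This is a toy model, not Dremio’s implementation: the `SumReflection` class, its fields, and the sample rows are hypothetical, and it assumes an append-only source so that a simple high-water mark is enough to find new rows.

```python
from collections import defaultdict


class SumReflection:
    """Hypothetical stand-in for a materialized aggregate (not Dremio's implementation)."""

    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.totals = defaultdict(float)
        self.rows_seen = 0  # high-water mark into the append-only source

    def refresh(self, source_rows):
        # Incremental update: fold in only the rows appended since the last refresh.
        for row in source_rows[self.rows_seen:]:
            self.totals[row[self.key]] += row[self.value]
        self.rows_seen = len(source_rows)

    def query(self):
        # Queries are answered from the stored aggregate, not by rescanning the source.
        return dict(self.totals)


source = [{"turbine": "t1", "kwh": 5.0}, {"turbine": "t2", "kwh": 3.0}]
reflection = SumReflection(key="turbine", value="kwh")
reflection.refresh(source)
first = reflection.query()

# New data lands in the source; the next refresh processes only the new row.
source.append({"turbine": "t1", "kwh": 2.0})
reflection.refresh(source)
second = reflection.query()
```

In a real system the refresh would be triggered on a schedule or to meet an SLA, as described above; here it is called by hand.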

That seems very straightforward and pretty powerful. As you look at potential use cases more broadly, how might users think about applying Dremio in an industrial IoT context; do you have flexibility to be able to connect to different types of data sources that may incorporate maybe some traditional legacy, industrial data, or proprietary types of stores?

The way this mostly happens, and actually we have quite a few customers now using Dremio Enterprise for industrial IoT use cases, these are some fairly large players in the IoT space. By and large what ends up happening is, IoT data and industrial IoT data is a lot of time series data. So, we’re talking about everything from robots that are painting cars, so at some of the large car manufacturers, the robots that are painting cars on the manufacturing line, and doing other operations, all that data is being collected into a Dremio-powered data lake.

Another example for us is a very large power company with a lot of wind turbines which are kind of aggregated into wind farms, and these systems are all generating terabytes of data a week, so very large volumes of data; but it tends to be time series data where each of these robots or devices is producing various measurements, could be temperature, could be air pressure and things like that. Then the operators of these systems and environments are interested in being able to understand various trends, looking at historically what happened at specific time periods and understanding various cause and effect type of things. So, the common thing is first to try to create this data lake, and leveraging cloud storage such as S3 and ADLS is a common pattern there because these systems are very cheap and very scalable.

So, they load the data, they collect the data from these systems into their environment, and then they’re able to use Dremio on top of that to query the data directly. Obviously this data is often too big to put into a data warehouse, and also that’s not a very efficient use of resources.
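To make the time series pattern concrete, here is a small Python sketch of the kind of trend question an operator might ask: the average of a measurement per device per hour. The device names, timestamps, and readings are invented for the example, and a real deployment would run a query like this against the data lake rather than an in-memory list.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical readings: (device, timestamp, temperature).
readings = [
    ("turbine-1", datetime(2019, 7, 1, 9, 5), 41.0),
    ("turbine-1", datetime(2019, 7, 1, 9, 35), 43.0),
    ("turbine-1", datetime(2019, 7, 1, 10, 10), 50.0),
    ("turbine-2", datetime(2019, 7, 1, 9, 20), 38.0),
]


def hourly_mean(rows):
    """Group readings by (device, hour) and average the values in each bucket."""
    buckets = defaultdict(list)
    for device, ts, value in rows:
        hour = ts.replace(minute=0, second=0, microsecond=0)  # truncate to the hour
        buckets[(device, hour)].append(value)
    return {key: mean(values) for key, values in sorted(buckets.items())}


trend = hourly_mean(readings)
```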

Yes, it sounds super-flexible. So, as you look forward could you share a little bit of your vision of where you’d like to take the company, are there any partnerships, industries, or specific types of customers you guys are focused on working with?

We’re actually working across a very broad range of customers, which is one of the things that I really enjoy about the job, everything from tech companies like Microsoft and Intel, to the largest financial services firms, companies like UBS, all the way to the German Train Company. We’ve had a lot of success in a variety of verticals; financial services has been a very strong one for us, multiple areas there, but risk analysis for example is something that we see a lot, and we see a lot of use cases in supply chain as well. More broadly though, I think companies in all industries have got to a point where they realize they have to be data driven; the primary asset that most companies have is their data, and if they don’t take advantage of that data then another company, their competitors, will come and do just that.

When I think about the future, if you look at how our lives have changed from a consumer standpoint, our personal lives, I have four kids, three in elementary school right now, they talk to Google, they have this little Google Home device in their bedrooms, they ask Google a question, they get an answer; yet I come to work and I have questions based on data, and it’s much harder to answer that question. So, our goal as a company is to create that same experience that we now have in our personal lives, to do that in terms of our experience with data, and data specifically in the workplace, in the business.

I think you’re surfing on probably a multi-decade trend, the value of data and being able to turn that into insights and into action, is something that’s probably going to last throughout our lifetimes. It’s interesting too, just in the last couple of weeks we’ve seen Google has just paid 2.6 billion for Looker, and then four days later Salesforce paying almost 16 billion in stock for Tableau. I think it clearly underscores there’s economic value there, but it’s also a real priority. How does the consolidation impact on the way that you look at the market?

It’s interesting, both of those companies are very close partners of Dremio. A lot of our customers use Tableau and Looker, and I know the executives of those companies very well, so, it’s cool to see. I think it’s a great proof point in terms of how strategic data has become for companies, and obviously the buyers in these two cases recognize that. Both Tableau and Looker had products that were easy to use, and that customers liked. That’s another thing I think is really important, and that we pay a lot of attention to as a software company, much like those companies: how do you take something that was previously really, really difficult and make that easy? I think that ends up driving a lot of adoption and creating a lot of value.

I think you’re definitely onto something there. The old saying about data warehousing, I think this goes back 25 years, was the 2-2-50 rule: projects would take two years, cost $2 million, and have a 50 percent chance of success! That’s a long way in the rearview mirror. I think it’s great to see a lot more ease of use; just as the data problems become more complex, as long as the tools make it easier it’s definitely a positive development.

One final question for folks that are interested in getting a bit more information about the work you guys are doing, do you have some recommendations you might be able to share with our audience?

I think there’s a lot of information on the Dremio website, just Google for Dremio and you’ll find it. We’ve made it really easy now to try it out, there’s a deploy button at the top right of the website, you can start it up on Azure, or AWS with almost zero effort, so that should make things really easy, and there’s a lot of videos and tutorials there. I’d recommend that.

For people that are interested in data science in Open Source, check out the Apache Arrow Project, that’s something we started and now has over 4 million downloads a month, it’s really more about data science and accelerating the speed of analytics. And then if you’re interested in how to build companies and startups, I recently read a great book called, ‘The Hard Thing About Hard Things’, by Ben Horowitz, which I really enjoyed, I’d recommend that.

Yes, that’s a terrific recommendation. I hear he’s got another one, I don’t know if it’s out or on the way, but he’s certainly got the scars to prove his success; he earned his success, there’s no doubt.

It’s been a pleasure talking to you. Again, we’ve been speaking with Tomer Shiran, co-founder and CEO of Dremio. This has been Ed Maguire, Insights Partner at Momenta, with another episode of our Momenta Podcast. Tomer thank you so much for joining us.

Thank you.