Monday, December 1, 2014

Hortonworks accelerates the big data mashup between Hadoop and HP Haven

This latest BriefingsDirect deep-dive big data thought leadership interview examines how Hortonworks is working with HP on improved management of very large -- and very active -- datasets.

We'll explore how HP and Hortonworks are integrating Hadoop into more of the HP Haven family to make it easier for developers and data scientists to access business intelligence (BI) and analytics as a service.

Listen to the podcast. Find it on iTunes. Read a full transcript or download a copy. 

To learn how, BriefingsDirect sat down with Mitch Ferguson, Vice President of Business Development at Hortonworks at the recent HP Big Data 2014 Conference in Boston. The discussion is moderated by me, Dana Gardner, Principal Analyst at Interarbor Solutions.

Here are some excerpts:

Gardner: We heard the news earlier this year about HP taking a $50-million stake in Hortonworks, and then about Hortonworks' IPO plans. Please fill us in a little bit about why Hortonworks and HP are coming together.

Ferguson: There are two core parts to that answer. One is that the majority of Hadoop came out of Yahoo. Hortonworks was formed when the major Hadoop engineers at Yahoo moved over to the new company, all in complete cooperation with Yahoo, to help evolve the technology faster. We believe the ecosystem around Hadoop is critical to the success of Hadoop and critical to the success of how enterprises will take advantage of big data.

If you look at HP, it's a major provider of technology to enterprises, not just at the compute and storage level, but also at the data management, analytics, and systems management levels. The complementary nature of Hadoop as part of the modern data architecture, combined with HP's hardware and software assets, provides a very strong foundation for enterprises to create the next-generation modern data architecture.

Gardner: I'm hearing a lot about the challenges of getting big data into a single place and managing these very large datasets.
Users are also trying to figure out how to migrate from SQL or other data stores into Hadoop and into HP Vertica. It’s a challenge for them to understand a roadmap. How do you see these datasets as they grow larger, and we know they will, in terms of movement and integration? How is that path likely to unfold?
Machine data

Ferguson: Look at the enterprises that have been adopting Hadoop. Very early adopters like eBay, LinkedIn, Facebook, and Twitter are generating significant amounts of machine data. Then, we started seeing large enterprises that are aggressive users of technology adopt it.

One of the core things is that the majority of data being created every day in an enterprise is not coming from traditional enterprise resource planning (ERP), customer relationship management (CRM), or financial management systems. It's coming from website clickstream data, log data, or sensor data. The reason there is so much interest in Hadoop is that it allows companies to cost-effectively capture very large amounts of data.

Then, they begin to understand patterns across semi-structured, structured, and unstructured data, and to glean value from that data. From there, they leverage that data in other technologies like Vertica or analytics tools, use it in applications, or move it back into the enterprise data warehouse.

As a major player in the Hadoop market, Hortonworks held as a core tenet from day one that the ecosystem is critical to the success of Hadoop. So, from the beginning, we've worked very closely with vendors like Microsoft, HP, and others to optimize how their technologies work with Hadoop.

SQL has been around for a long time. Many people and enterprises understand SQL, and that's a critical access mechanism to get data out of Hadoop. We've worked with both HP and Microsoft. Who knows SQL better than anyone? Microsoft. We're trying to optimize how SQL access to Hadoop can be leveraged by the existing tools that enterprises already know: analytics tools, data management tools, whatever.

That's just one way that we're looking at leveraging existing integration points or access mechanisms that enterprises are used to, to help them more quickly adopt Hadoop.
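
To make that concrete, here is a minimal sketch of what SQL access to Hadoop can look like from a client tool, using the PyHive library against a HiveServer2 endpoint. The host, port, table, and column names are illustrative assumptions, not details from the interview.

```python
# A minimal sketch of SQL access to Hadoop, assuming a HiveServer2 endpoint.
# Host, table, and column names are illustrative placeholders.
from pyhive import hive

conn = hive.connect(host="hiveserver2.example.com", port=10000, username="analyst")
cursor = conn.cursor()

# Any tool that speaks SQL can issue the same kind of query against Hadoop.
cursor.execute(
    "SELECT page, COUNT(*) AS hits "
    "FROM clickstream_logs "
    "WHERE log_date = '2014-12-01' "
    "GROUP BY page "
    "ORDER BY hits DESC "
    "LIMIT 10"
)

for page, hits in cursor.fetchall():
    print(page, hits)

cursor.close()
conn.close()
```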

Gardner: But isn’t it clear that what happens in many cases is that they run out of gas with a certain type of database and that they seek alternatives? Is that not what's driving the market for Hadoop?

Ferguson: It's not that they're running out of gas with an enterprise data warehouse (EDW) or relational database. As I said earlier, it's the sheer amount of data. By far, the majority of data is not coming from those traditional ERP, CRM, or transactional systems. As a result, a technology like Hadoop is optimized to allow an enterprise to capture very, very large amounts of that data.

Some of that data may be relevant today. Some of that data may be relevant three months or six months from now, but if I don't start capturing it, I won't know. That's why companies are looking at leveraging Hadoop.

Many of the earlier adopters are looking at leveraging Hadoop to drive a competitive advantage, whether they're providing a high level of customer service, doing things more cost-effectively than their competitors, or selling more to their existing customers.

The reason they're able to do that is because they're now able to leverage more of the data that their businesses create on a daily basis, understand that data, and then use it for business value.

More than size

Gardner: So this is an alternative for an entirely new class of data problem for them in many cases, but there's more to it than just size. We also heard that there's interest in moving from a batch approach to a streaming approach, an area where HP Vertica is very popular.

What's the path that you see for Hortonworks and for Hadoop in terms of allowing it to be used in more than a batch sense, perhaps more toward this streaming and real-time analytics approach?

Ferguson: That movement is under way. Hadoop 1.0 was very batch-oriented. We're now at 2.0, and it's not only batch, but also interactive and real-time, and there's a common layer within Hadoop that Hortonworks has been very influential in evolving. It's called YARN. Think of it as a data operating system that is part of Hadoop, sitting on top of the file system.

YARN provides the access mechanisms, whether they're for batch-oriented applications, interactive integration, or real-time engines like streaming or Spark. Those payloads or applications, when they leverage Hadoop, go through these various batch, interactive, and real-time integration points.

They don't need to worry about where the data resides within Hadoop. They'll get the data via their batch, interactive, or real-time access point, based on what they need, and YARN will take care of moving that data in and out of those applications. Streaming is just one way of moving data into Hadoop; it's very common for sensor data, and it's also a way to move data out. SQL is another way, among others, to move data.
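
As a rough illustration of that multi-engine idea, here is a sketch of a batch job and a streaming job sharing one YARN-managed cluster, written against the Spark 1.x Python API. The cluster mode, HDFS path, and socket source are assumptions for the example, not details from the interview.

```python
# A sketch of two engines sharing one YARN-managed cluster: a batch job over
# historical sensor data in HDFS, and a streaming job over live events.
# The cluster mode, HDFS path, and socket source are assumed placeholders.
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

# Both workloads are submitted to YARN, which schedules them against the
# same data; the application never has to know where the blocks live.
conf = SparkConf().setMaster("yarn-client").setAppName("sensor-demo")
sc = SparkContext(conf=conf)

# Batch: count readings per sensor across data already landed in HDFS.
batch_counts = (sc.textFile("hdfs:///data/sensors/2014/*")
                  .map(lambda line: (line.split(",")[0], 1))
                  .reduceByKey(lambda a, b: a + b))
print(batch_counts.take(10))

# Streaming: the same logic over live sensor events, in 10-second batches.
ssc = StreamingContext(sc, batchDuration=10)
live = ssc.socketTextStream("sensor-gateway.example.com", 9999)
(live.map(lambda line: (line.split(",")[0], 1))
     .reduceByKey(lambda a, b: a + b)
     .pprint())

ssc.start()
ssc.awaitTermination()
```
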
Gardner: So this is giving us choice about how to manage larger scales of data. We're seeing choice about the way in which we access that data. There's also choice around the type of underlying infrastructure, to reduce costs and increase performance. I'm thinking about in-memory or columnar approaches.

What is there about the Hadoop community and Hortonworks, in particular, that allows you to throw the right horsepower at the problem?

Ferguson: It was very important, from Hortonworks' perspective, to evolve the Hadoop technology as fast as possible from day one. We decided to do everything in open source to move the technology very quickly and to leverage the community effect of open source, meaning lots of different individuals helping to evolve this technology fast.

The ability for the ecosystem to easily and optimally integrate with Hadoop is also important. So there are very common integration points. For example, for systems management, there is Ambari, the Hadoop services integration point.

Whether it's HP OpenView or System Center in the Microsoft world, that allows those tools to manage and monitor Hadoop along with the other IT assets they already integrate with.
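
For a sense of what that integration point looks like, here is a minimal sketch of polling Ambari's REST API the way a monitoring tool might. The host, cluster name, and credentials are placeholder assumptions.

```python
# A minimal sketch of the Ambari REST API calls a monitoring tool might make
# to watch Hadoop services alongside other IT assets. The host, cluster name,
# and credentials below are placeholder assumptions.
import requests

AMBARI = "http://ambari.example.com:8080/api/v1"
AUTH = ("admin", "admin")               # placeholder credentials
HEADERS = {"X-Requested-By": "ambari"}  # header Ambari expects from clients

# List the services in a cluster, then report each service's current state.
resp = requests.get(AMBARI + "/clusters/mycluster/services",
                    auth=AUTH, headers=HEADERS)
resp.raise_for_status()

for item in resp.json()["items"]:
    name = item["ServiceInfo"]["service_name"]
    detail = requests.get(item["href"], auth=AUTH, headers=HEADERS).json()
    print(name, detail["ServiceInfo"]["state"])  # e.g. HDFS STARTED
```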

Access points

Then there's SQL access via Hive, an access point that allows any technology that integrates with or understands SQL to access Hadoop.

Storm and Spark are other access points. So, common, open integration points that are well understood by the ecosystem are designed to help optimize how various technologies, whether at the virtualization layer, the operating system layer, or the data movement, data management, and access layers, can leverage Hadoop.

Gardner: One of the things I hear a lot from folks who don't yet understand how things will unfold is where data and analytics applications align with the creation of other applications or services, perhaps in a cloud setting like platform as a service (PaaS).

It seems to me that, at some point, more and more application development will be done through PaaS with an associated or integrated cloud. We're also seeing a parallel trajectory here with the data, along the same lines of moving from traditional systems of record into relational, and now into big data and analytics in a cloud setting. It makes a lot of sense.

I've talked to a lot of people about that. So the question, Mitch, is how do we see the paths of PaaS for general application development and PaaS for BI services, or BI as a service, commingling or even intersecting?

Ferguson: I'll answer that question in two ways. One is about the companies that are using Hadoop today, and using it very aggressively. Their goal is to provide Hadoop as a service, irrespective of whether it's on premises or in the cloud.

Then we'll talk about what we see with HP, for example, with their whole cloud strategy, and how that will evolve into a very interesting hybrid opportunity and maybe pure cloud play.

When you think about PaaS in the cloud, the majority of enterprise data today is on premises, so there's a physics issue in trying to run all of my big data in the cloud. As a result, a number of people are pursuing a concept called the data lake. They're provisioning large Hadoop clusters on premises and moving large amounts of data into this data lake.

That provides data as a service to the business units that need data in Hadoop (structured, semi-structured, and unstructured) for new applications, for existing analytics processes, and for new analytics processes. They're effectively providing data as a service, capturing it all in this data lake that continues to evolve.
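
As one hypothetical way to picture that, here is a sketch that carves out landing zones in a data lake through Hadoop's WebHDFS REST interface. The namenode address and directory layout are illustrative assumptions; real lakes organize raw and refined data in many different ways.

```python
# A hypothetical sketch of creating data-lake landing zones via Hadoop's
# WebHDFS REST interface. The namenode address and directory layout are
# illustrative assumptions only.
import requests

NAMENODE = "http://namenode.example.com:50070/webhdfs/v1"
USER = "hdfs"

# One zone per source, so business units can find structured,
# semi-structured, and unstructured feeds in predictable places.
zones = [
    "/lake/raw/clickstream",
    "/lake/raw/sensor",
    "/lake/raw/logs",
    "/lake/refined/analytics",
]

for path in zones:
    resp = requests.put(NAMENODE + path,
                        params={"op": "MKDIRS", "user.name": USER})
    resp.raise_for_status()
    print(path, resp.json())  # {"boolean": true} on success
```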

Then think about how companies may want to leverage a PaaS. It's the same thing on premises. If my data is on premises, because that's where the physics requires it to be, I can leverage various development tools or application frameworks on top of that data to create new business apps. About 60 percent of our initial sales at Hortonworks are for new business applications by an enterprise, with both business and IT involved.

Leveraging datasets

Within the first five months, 20 percent of those customers begin to migrate to the data-lake concept, where now they are capturing more data and allowing other business entities within the company to leverage these datasets for additional applications or additional analytics processes. We're seeing Hadoop as a service on premises already. When we move to the cloud, we'll begin to see more of a hybrid model.

We're already starting to see this with one of Hortonworks' large partners, where you move archive data from on premises into low-cost storage in the cloud. I think HP will have that same opportunity with Hadoop and their cloud strategy.

Already, through an initiative at HP, they're providing Hadoop as a service in the cloud for those entities that would like to run Hadoop in a managed service environment.

That's the first step of HP beginning to provide Hadoop in a managed-service environment off premises. I believe you'll begin to see that migrate to on-premises/off-premises integration in a hybrid model at some companies as their data moves off premises. Other companies will just want to run all of their big-data services, or have Hadoop as a service, completely in the HP cloud, for example.

Gardner: So we're entering an era now where we're going to be rationalizing how we take our applications as workloads and continue to use them either on premises, in the cloud, or in a hybrid model. At the same time, over on the side, we're thinking along the same lines architecturally with our data, but the two are interdependent.

You can’t necessarily do a lot with the data without applications, and the applications aren’t as valuable without access to the analytics and the data. So how do these start to come together? Do you have a vision on that yet? Does HP have a vision? How do you see it?

Ferguson: The Hadoop market is very young. The vision today is that companies are implementing Hadoop to capture data they were previously letting fall on the floor. Now they're capturing it. The majority of that data is on premises. They're capturing that data and beginning to use it in new business applications or existing analytics processes.
As they begin to capture that data and develop new applications, and as vendors like HP, working in combination with Hortonworks, provide the ability to effectively move data between on premises and off premises, and to govern where that data resides in a secure and organized fashion, you'll begin to see much tighter integration of new big-data applications being developed on premises, off premises, or across the two. It won't matter.

Listen to the podcast. Find it on iTunes. Read a full transcript or download a copy. Sponsor: HP.
