Moving Big Data to the cloud: A big problem?

Originally published on Data Centre Review.

Digital transformation is overhauling the IT approach of many organizations and data is at the center of it all. As a result, organizations are going through a significant shift in where and how they manage, store and process this data.

In the not-so-distant past, enterprises managed big data by building an on-premises Hadoop cluster, using a commercial distribution such as Cloudera, Hortonworks, or MapR, to process large volumes of data.

The data analyzed was mostly structured, and the approach required a large upfront capital expenditure to purchase the necessary hardware. Adding to this, Hadoop is a complex infrastructure to manage and monitor, requiring organizations to employ individuals with a specialized skill set that is hard to come by.

To tackle these issues, many organizations have been looking towards the cloud. Yet most organizations have not realized the benefits promised from moving big data projects to the cloud and, as a result, data lakes are still being left on-premises.

Heading for the clouds

By creating their big data architecture in the cloud, or migrating it there, organizations can take advantage of tremendous operational cost savings, nearly limitless data processing power, and the instant scaling options the cloud provides. Additionally, they avoid large upfront capital expenditure and no longer need intimate knowledge of Hadoop.

Many enterprises are going through this ‘lift and shift,’ where they move their on-premises data cluster to the cloud. But historically this has also come with its own inherent issues, and a lot of the challenges with moving big data projects to the cloud have centered around simply getting the right data in the right place.

It comes down to skills and cost

Moving big data to the cloud sounds simple enough. But migrating on-premises data lakes to the cloud and then connecting cloud-based big data environments with diverse data sources, while also creating Apache Spark pipelines to transform that data, requires highly technical knowledge and continuous coding resources from data engineers and core IT groups.

Developers must write code to integrate with each application’s programming interface (API) and authentication mechanisms, thereby enabling the data to freely move between the applications and the data lake. Not only is this an incredibly time-consuming process, it is also error-prone, two realities that are magnified during the maintenance stage of cloud-based big data projects.
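To make the point concrete, here is a minimal sketch of what hand-coded integration tends to look like, assuming two hypothetical sources ("crm" and "billing") that authenticate differently and return differently shaped records. The source names, credentials and field layouts are illustrative assumptions, not any real vendor's API, and no network calls are made.

```python
import base64
from typing import Callable, Dict, List

# Everything below is illustrative: the two sources, their auth schemes,
# and their record layouts are assumptions made up for this sketch.

def basic_auth_header(user: str, password: str) -> Dict[str, str]:
    """Build an HTTP Basic Authorization header (hypothetical source 'crm')."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}

def bearer_auth_header(token: str) -> Dict[str, str]:
    """Build a bearer-token Authorization header (hypothetical source 'billing')."""
    return {"Authorization": f"Bearer {token}"}

def normalize_crm(record: dict) -> dict:
    """Map the assumed 'crm' record layout onto the data lake's common schema."""
    return {"customer_id": str(record["Id"]), "email": record["Email"].lower()}

def normalize_billing(record: dict) -> dict:
    """Map the assumed 'billing' record layout onto the same common schema."""
    return {
        "customer_id": str(record["cust"]["id"]),
        "email": record["contact_email"].lower(),
    }

# One entry per source: every new application means another auth builder
# and another normalizer to write, test and maintain.
CONNECTORS: Dict[str, Dict[str, Callable]] = {
    "crm": {
        "auth": lambda: basic_auth_header("svc_user", "placeholder-password"),
        "normalize": normalize_crm,
    },
    "billing": {
        "auth": lambda: bearer_auth_header("placeholder-token"),
        "normalize": normalize_billing,
    },
}

def to_data_lake_rows(source: str, raw_records: List[dict]) -> List[dict]:
    """Convert raw records from one source into rows tagged with their origin."""
    normalize = CONNECTORS[source]["normalize"]
    return [dict(normalize(r), source=source) for r in raw_records]
```

Even in this toy form, each additional source adds its own authentication routine and its own mapping logic, and any change to a source's record layout breaks the matching normalizer. That is the maintenance burden described above.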

As with any other software project, code decays over time and must be updated. If the developer who wrote the code leaves the company, often the IT organization’s ability to understand the pipeline that is being used at the code level also vanishes.

This time drain on critical IT staff is one of the biggest issues organizations have had to overcome in moving to cloud-based big data projects. The intensive management and monitoring required ultimately result in prohibitive operational costs, a longer time-to-value, and a strategy that does nothing to address the steadily emerging OpEx and skills gaps.

Finding individuals with the necessary skills and experience to build big data and cloud pipelines is a tough process. Unsurprisingly, it is made tougher by the current skills gap in the IT landscape.

Research from Experis has shown demand for big data skills and professionals has grown 78% in the last year, whilst demand for cloud skills and professionals has grown 30% in the same time frame.

With these individuals in such short supply, if you do manage to have them in your IT team, having them focused solely on managing and maintaining the big data environment before, during and after migration to the cloud is frankly a waste of resources. It also has a big impact on the second big issue in moving to the cloud: cost.

If you employ people who are highly skilled, you want them to deliver significant and strategic benefit to the business, focusing on higher-value tasks and projects that will help the organization innovate. The flexibility and scalability of the cloud can be a huge benefit in a drive towards innovation. But the time-to-innovation projected at the start of the cloud migration will never be realized if teams are focused solely on the infrastructure management needed to make the big data project work.

Buy vs build

The solution to this problem is relatively simple, and it all comes down to buy vs build. Unless you’re Google, chances are you’re not going to self-build every aspect of your IT estate. So why should you self-build all the connections you need too?

For big data projects to flourish in the cloud sooner, organizations should look towards implementing a completely managed data architecture, including data integrations (iPaaS), processing (BDaaS) and storage (SaaS).

By doing this, organizations should be able to deliver large data sets to and from their cloud-based data lakes with little effort, regardless of where the data is coming from. This approach can also increase productivity by eliminating cumbersome manual tasks around adding information and transforming data, allowing teams to focus on value-delivering activities instead.

By supporting this managed data architecture with self-service, organizations can free up even more time within the IT team. Self-service integration is making it fast and easy for organizations to create automated data pipelines with no coding, and self-service analytics is making it easy for analysts and business users to manipulate data without IT intervention.

With self-service tools like these, it's not just organizations with fully fleshed-out IT teams that can benefit. Businesses struggling to recruit coding talent can also develop their own cloud big data pipelines as part of their managed data architecture in the cloud.

Removing the complexities

Running big data projects in the cloud should be simple. All organizations, regardless of size, should be able to realize all the benefits cloud provides as soon as they get up and running, not years down the line. It’s only in taking a step back at the planning stage and removing the complexities surrounding cloud migration and integration that businesses will finally be able to use their big data projects for innovation and to deliver business value.