Three considerations for Hadoop-to-cloud migration

Enterprises consider their data and analytics platforms strategic assets that are crucial to digital transformation and business continuity. Yet even as these systems increasingly form the foundation of enterprise business models, some of them remain a massive challenge for the organizations that run them. On-premises Hadoop deployments are a prime example: complex, difficult to scale, and an increasing burden on IT departments.

That’s why more and more enterprises are migrating away from Hadoop and towards modern cloud-based platforms.

Why migrate?

There are numerous forces driving enterprises away from Hadoop. Often, it's a combination of inherent Hadoop limitations and demands from the field for advanced analytics services that Hadoop can't effectively provide. More specifically, enterprise teams are looking to leave Hadoop due to:

Project roadblocks

Enterprises are discovering that Hadoop can't keep up with their business goals. If only samples of big data can be processed rather than entire petabyte-scale datasets, or if computations that should take days stretch into weeks or months, then the viability of a Hadoop deployment is clearly in question.

Unreliable and unscalable

When clusters can’t scale up to meet computing requirements or scale down to cut costs, enterprises relying on Hadoop are frequently left in data, productivity, and budgetary limbo. And the problem isn’t just with the usage and output of these systems — maintaining, patching, and upgrading Hadoop is an operational and human resources burden, too.

Questionable long-term viability

We’ve discussed in previous articles the (rather dire) long-term outlook for on-premises Hadoop. And we’re not the only ones who think so. Even enterprises still strategically committed to Hadoop question the platform’s technological viability and the business stability of its vendors. This is leading enterprises to view Hadoop not only as an impediment, but also as a liability.

Three top Hadoop-to-cloud migration considerations

Once the decision to move away from Hadoop has been made, here are three questions to consider before implementation:

1. What’s the scale of the data migration?

As a rule, the larger the scale, the more complex the migration. And while numerous options exist for small data volumes, few work well at scale. Migrating large volumes of data takes time. So, if you're migrating data over a network, calculate how long it will take based on your network's bandwidth, factoring in the schedule and size of other workloads sharing that link.
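As a back-of-the-envelope check, this arithmetic is simple enough to script. Below is a minimal sketch with assumed numbers (a 1 PB dataset, a 10 Gb/s link, and half the bandwidth left free by other workloads); substitute your own figures.

```python
# Rough estimate of network migration time. All inputs are illustrative
# assumptions; plug in your own dataset size, link speed, and the share
# of bandwidth other workloads leave free.

def migration_days(dataset_tb: float, link_gbps: float, usable_fraction: float) -> float:
    """Days to move dataset_tb terabytes over a link_gbps link when only
    usable_fraction of that link is available to the migration."""
    dataset_bits = dataset_tb * 1e12 * 8              # TB -> bits
    effective_bps = link_gbps * 1e9 * usable_fraction
    return dataset_bits / effective_bps / 86_400      # seconds -> days

# Example: 1 PB over a 10 Gb/s link, 50% of which is free for migration
print(f"{migration_days(1_000, 10, 0.5):.0f} days")   # ~19 days
```

Even under these optimistic assumptions, the transfer alone takes weeks, which is why the schedule of other workloads matters as much as the raw link speed.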

2. How much data change occurs in your Hadoop environment?

Business disruption is a top concern for planned Hadoop migration projects, and handling on-premises data changes during migration is a key challenge noted by enterprises that have already migrated Hadoop data to the cloud. It is challenging because typical Hadoop production environments are very active, with high rates of data ingest and updates. Measurements at one customer implementation showed peak loads on their on-premises Hadoop deployment reaching upwards of 100,000 file system events per second, with loads over a 24-hour period averaging 20,000 file system events per second. At the average rate alone, that is more than 1.7 billion changes per day that a migration must somehow account for. This ongoing activity adds to migration time and complexity, leaving enterprises with three options for managing changes during migration (a rough sizing sketch follows the list):

  1. Don’t allow changes to happen (leads to system downtime and business disruption)

  2. Develop a custom solution to manage changes

  3. Leverage tools (like Cirata) that are purpose-built to handle changes
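To see why option 1 is rarely tenable and option 2 is a serious engineering effort in its own right, it helps to put numbers on the change volume. A minimal sizing sketch, assuming the average event rate cited above and a hypothetical 19-day migration window (borrowed from the earlier estimate):

```python
# How many file system changes accumulate while a migration runs?
# The event rate comes from the measurements above; the migration
# duration is an assumed input.

EVENTS_PER_SEC = 20_000     # average rate from the measurements above
SECONDS_PER_DAY = 86_400

def events_during_migration(days: float) -> int:
    """Total file system events the migration must reconcile."""
    return int(days * SECONDS_PER_DAY * EVENTS_PER_SEC)

# Example: a 19-day network migration
print(f"{events_during_migration(19):,} events")   # 32,832,000,000 events
```

Tens of billions of events is not a backlog that can be reconciled by hand, which is what pushes most teams toward option 3.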

3. Will your migration approach require manual or custom development efforts?

There are a number of Hadoop-to-cloud data migration methodologies and approaches, each with its own considerations. For example, data transfer devices like the Azure Data Box can get petabyte-scale datasets from point A to point B. Yet these solutions may require system downtime or some method for handling data changes that occur during the transfer process. Similarly, network-based data transfer with manual reconciliation of data changes may work for small volumes, but isn't viable at scale.

Hadoop comes packaged with DistCp, a free tool that is frequently used to start data migration projects…but less so to finish them. The problem is that DistCp was designed for inter/intra-cluster copy of data at a specific point in time — not for ongoing changes. DistCp requires multiple passes and custom code or scripts to accommodate changing data, making it impractical for an enterprise-class migration.
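To make the "multiple passes and custom code" point concrete, here is a minimal sketch of the kind of wrapper script teams end up writing around DistCp. The paths and pass count are hypothetical placeholders; -update and -delete are standard DistCp options.

```python
# Sketch of an iterative DistCp "catch-up" loop: repeat -update passes
# until the remaining delta is small, then take downtime for a final pass.
# The source and target URIs below are hypothetical placeholders.

import subprocess

SOURCE = "hdfs://onprem-cluster/data"
TARGET = "abfs://container@account.dfs.core.windows.net/data"

def distcp_pass() -> None:
    # -update copies only files that differ from the target;
    # -delete removes target files that no longer exist at the source.
    subprocess.run(
        ["hadoop", "distcp", "-update", "-delete", SOURCE, TARGET],
        check=True,
    )

for _ in range(5):   # arbitrary cap on catch-up passes
    distcp_pass()
    # A real script would measure the remaining delta here and, once it is
    # small enough, halt ingest for a final pass -- exactly the downtime
    # and custom tooling burden described above.
```

Each pass rescans the namespace, and nothing in the loop captures changes made while a pass is running, which is why this approach rarely finishes a large migration cleanly.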

Finally, there are next-gen automated migration tools (like Cirata LiveData Migrator) that allow migrations to occur while production data continues to change — with no system downtime or business disruption. These solutions enable IT resources to focus on strategic development efforts, not on migration code.

The bottom line

As enterprises migrate away from Hadoop in favor of cloud-based platforms, they are looking more closely not just at the end results of migration, but at the process itself. Large-scale enterprise data migration is, without question, a massive undertaking. Yet by choosing the right tools for the job, tools that enable business data to flow freely and core business functions to continue unhindered even during petabyte-scale migration, the viability of this strategic shift increases dramatically.
