DistCp for Hadoop-to-Cloud migrations: beware of the risks

Distributed copy (DistCp) seems to be the go-to Hadoop-to-cloud migration tool simply because it is free and most Hadoop admins are already familiar with it for inter-cluster copying. But does that mean DistCp is the best choice for Hadoop-to-cloud migrations?

Not necessarily.

Cloud architects are struggling with cloud migration, often facing tight deadlines to make the transition. Perhaps a data center is being sunset, existing equipment is nearing the end of its lifecycle, a lease is expiring, or key software is reaching end of support. Whatever the reason, architects are looking for a solution, and often the most immediately available answer is distributed copy, commonly known as DistCp.

This free tool comes bundled with Hadoop and was originally designed for large inter- and intra-cluster copies of data at a single point in time. As organizations embarked on migrating their data and infrastructure to the cloud, DistCp began to be viewed as part of the migration project.

The problem is that DistCp was never designed for large-scale Hadoop-to-cloud migrations.

To be clear, DistCp has its purposes, but it has very distinct limitations as well. Let’s take a look at what DistCp is good for, how it is problematic for cloud migration, and the risks of using it to move massive data sets to the cloud.

The benefits of DistCp for Hadoop-to-cloud migrations

DistCp is a viable solution for copying relatively small volumes of infrequently changing data between Hadoop clusters. It is appropriate when data volumes are modest (e.g., less than 100 TB) and there are minimal data changes during the migration. If the data sets to be moved are not undergoing rapid change (e.g., fewer than 50-100 change events per second), DistCp can be a cost-efficient migration technology. It is ideal for migrating historical data that doesn't change at all.
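
As a concrete illustration of that static case, the sketch below wraps a plain hadoop distcp invocation in Python to copy an unchanging historical data set from HDFS to an S3 bucket. The namenode address, paths, bucket name, and tuning values are illustrative assumptions rather than recommendations; -m caps the number of map tasks and -bandwidth caps each map's throughput in MB/s so the copy doesn't saturate the outbound link.

```python
import subprocess

# Hypothetical endpoints -- adjust the namenode, path, and bucket for your environment.
SOURCE = "hdfs://namenode:8020/warehouse/archive/2019"           # static, historical data
TARGET = "s3a://example-migration-bucket/warehouse/archive/2019"

# A single one-shot DistCp run is reasonable here because the source is not changing.
cmd = [
    "hadoop", "distcp",
    "-m", "50",            # at most 50 parallel map tasks
    "-bandwidth", "100",   # cap each map at ~100 MB/s
    SOURCE, TARGET,
]

subprocess.run(cmd, check=True)
```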

The drawbacks of using DistCp for Hadoop-to-cloud migrations

The problems start when cloud architects, Hadoop administrators, or others migrating data to the cloud assume that DistCp can migrate petabytes of data to the cloud just as simply as it moves data between Hadoop clusters. There are three common traps that companies fall into when making DistCp the go-to solution for cloud migration.

  1. Companies wrongly assume small data and big data are the same when it comes to migration.

    Companies make the incorrect assumption that the approaches used for small data will work just as well for large volumes of data. If the data set is static, then the question is whether there is enough bandwidth to migrate the data or whether there is enough time to load it onto a bulk transfer device, such as AWS Snowball or Azure Data Box, and have that device shipped to the cloud service provider and then uploaded. These kinds of solutions are rarely practical, as most large data sets are constantly changing, and companies can rarely afford to shut down operations, even for a short period of time.

  2. Companies miss updates that are made to data at the source after the migration has begun.

    When migrating data that is actively changing, one option is to allow the data to continue to change at the source. In that case, all of those changes need to be accounted for so that, when the migration is complete, the copy in the cloud isn't already woefully out of date.

    To prevent data inconsistencies between source and target, there needs to be a way to identify and migrate any changes that have occurred. The typical approach is to perform multiple iterations: rescan the data set and catch the changes made since the last pass (a sketch of this loop follows this list). This method lets architects iteratively approach a consistent state. However, if a huge volume of data is changing frequently, it may be impossible to ever completely catch up with the changes being made at the source.

  3. Companies make the mistake of selecting a migration approach that requires data to be frozen at the source, resulting in business disruption.

    Another option is to freeze the data at the source so that no changes can occur. This makes the migration task much simpler: administrators can be confident that the copy uploaded to the new location is consistent with what exists at the source, because no changes were allowed during the migration. The problem with this approach is that systems need to be shut down, causing an unacceptable period of business disruption. Moving a petabyte of data over a one-gigabit link would take over 90 days (see the back-of-the-envelope calculation after this list), and for the vast majority of organizations, days, weeks, or months of downtime is simply not acceptable.
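
To make the rescan-and-catch-up pattern from item 2 concrete, here is a minimal sketch of the iterative approach using DistCp's -update option. The endpoints and pass count are illustrative assumptions; each pass re-copies only files that differ from the target, and the hope is that successive passes shrink until the remaining delta fits into a short cut-over window -- which, if the source changes fast enough, may never happen.

```python
import subprocess

# Hypothetical endpoints -- adjust for your cluster and bucket.
SOURCE = "hdfs://namenode:8020/data/events"
TARGET = "s3a://example-migration-bucket/data/events"
MAX_PASSES = 5  # arbitrary cap; each pass should copy less than the one before

for i in range(MAX_PASSES):
    print(f"DistCp pass {i + 1} of {MAX_PASSES}")
    # -update copies only files that differ from what is already at the target,
    # so each pass picks up whatever changed at the source since the last pass.
    subprocess.run(["hadoop", "distcp", "-update", SOURCE, TARGET], check=True)
    # In practice you would inspect the job counters (files/bytes copied) and
    # either stop or schedule a cut-over once a pass copies little enough that
    # the remaining delta can be absorbed during a short freeze.
```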
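
The "over 90 days" figure in item 3 falls out of simple arithmetic, shown below under idealized assumptions: a dedicated, fully utilized one-gigabit link with no protocol overhead, which real migrations rarely achieve.

```python
# Back-of-the-envelope transfer-time estimate for 1 PB over a 1 Gbit/s link.
data_bytes = 1 * 10**15           # 1 PB (decimal petabyte)
link_bits_per_second = 1 * 10**9  # 1 Gbit/s, fully saturated

seconds = data_bytes * 8 / link_bits_per_second
days = seconds / 86_400           # 86,400 seconds per day

print(f"{days:.0f} days")         # roughly 93 days, before any overhead or retries
```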

Using DistCp for Hadoop-to-cloud migrations is risky

Many firms are initially drawn to DistCp because it is familiar -- they've used it for inter- and intra-cluster data movement -- and because it's free. Using DistCp for cloud migration, however, very often results in hidden costs, project delays, and some form of business disruption, because of the mismatch between the on-premises bulk-copy use case DistCp was built for and the cloud-based model that architects are moving toward. Adapting DistCp to modern data architectures and cloud strategies is labor-intensive: it requires custom script development, manual migration management, and complex reconciliation of changes, all of which add hidden cost, time, and effort to the total migration project. What looked like a cost-efficient and simple way to migrate data to the cloud often turns into a complex problem with no easy solution.

Data Migrator: A powerful alternative to DistCp

Because it was purpose-built for modern data migration and modernization strategies, Data Migrator provides an automated solution that lets petabytes of data be migrated while production data continues to change. This means businesses can perform migrations with no system downtime or business disruption. IT departments can easily handle data changes that happen during migration, as Data Migrator is designed specifically for data migration and offers additional configurability -- such as limits on network bandwidth usage -- so it won't impact current production activity.

With Data Migrator, companies can confidently hit migration deadlines by eliminating manual workarounds and knowing exactly how long the migration will take. They can also gain business value from the data more quickly, because the migration runs in the background rather than as a point-in-time copy. Ultimately, staff can focus their time and energy on making sure that applications running in the cloud work effectively and perform as designed. DistCp has its uses, but when it comes to migrating petabytes of data to the cloud, it pays to give Data Migrator a try.
