The company had started their journey to cloud computing by adopting Microsoft Azure.
Initially, the company’s cloud program was very application focused. This is a large company with thousands of applications, and as they examined their application journey, one of their big realizations was that the challenge wasn’t so much the applications as the data. This was especially true for the data science office, so the organization started to think about a “data-first” approach: finding a way to get the data to the cloud and start realizing benefits early. They wanted to quickly demonstrate the business benefits of the move, including accelerating their business, improving customer experience, revenue, and customer retention, and achieving cost savings.
The challenge was the sheer volume of existing data (tens of petabytes), combined with the fact that new data was continuously being ingested. How could the company migrate that much data efficiently and without impacting current business operations?
The company looked at a variety of solutions, including data transfer devices, ETL tools, and open source tools such as DistCp-based software. None of these was fit for purpose. With data transfer devices, the data must be copied onto multiple devices, the devices must be transported by truck to the cloud data center, and the data must then be copied from the devices onto cloud storage. Once the company realized multiple trucks would be needed, the security risks of shipping data by truck became clear. They would also either need to bring their production systems down to prevent changes from occurring during the transfer or develop custom code to identify and migrate any new or changed data. They quickly discarded transfer devices as a viable approach. ETL and open source tools had similar issues in handling changing data, and the company estimated that the custom solutions they would need to build around those tools would be too costly to develop and maintain. A better alternative was required.
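To illustrate why continuously changing data was the sticking point, the following is a minimal, hypothetical Python sketch of the kind of change-detection logic a hand-rolled, DistCp-style migration would need. The file paths, sizes, and timestamps are invented for illustration; none of this represents the company’s actual code.

```python
# Hypothetical sketch only: the kind of change detection a custom migration
# would need, given listings that map path -> (size_bytes, modification_time)
# for the source cluster and the cloud target.

def files_to_copy(source_listing, target_listing):
    """Return paths that are new on the source or differ from the target copy."""
    changed = []
    for path, meta in source_listing.items():
        if target_listing.get(path) != meta:
            changed.append(path)
    return changed

# Invented example data: one file already in sync, one modified, one new.
source = {
    "/data/events/part-0001": (1_048_576, 1700000000),
    "/data/events/part-0002": (2_097_152, 1700005000),  # modified since last run
    "/data/events/part-0003": (4_194_304, 1700009000),  # new since last run
}
target = {
    "/data/events/part-0001": (1_048_576, 1700000000),
    "/data/events/part-0002": (2_097_152, 1700001000),
}

print(files_to_copy(source, target))
# ['/data/events/part-0002', '/data/events/part-0003']
```

Even this toy version ignores deletions, renames, files changing mid-copy, and retry handling, which is where the ongoing development and maintenance cost of such custom tooling comes from.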
The company ended up selecting Cirata Data Migrator to enable the data-first approach they were looking for and to automate the Hadoop-to-Azure data migration process without requiring any business disruption.
Initially, a short production pilot was conducted in which Data Migrator transferred 100 TB of Hadoop data directly from the company’s on-premises production environment to ADLS Gen2 storage. The pilot was performed over a weekend, without any custom development and without any impact on the production systems, and the data was immediately available for use by Azure services.
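For context, here is a rough back-of-envelope estimate (ours, not a figure reported by the company) of the sustained throughput implied by moving 100 TB over an assumed 48-hour weekend window:

```python
# Back-of-envelope estimate: sustained throughput implied by moving 100 TB
# over an assumed 48-hour window, using decimal units.

pilot_bytes = 100 * 10**12      # 100 TB
window_seconds = 48 * 3600      # assumed 48-hour weekend window

gbits_per_second = pilot_bytes * 8 / window_seconds / 10**9
print(f"~{gbits_per_second:.1f} Gbit/s sustained")  # prints ~4.6 Gbit/s
```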
The pilot was very successful and showed the company that they could achieve their data-first strategy with Data Migrator. To speed up the migration further, they ordered additional network bandwidth; however, nothing prevented them from proceeding with the bandwidth currently available. Their goal was to migrate about 1 PB per month and get the initial set of data migrated from their on-premises Hadoop cluster into Azure within 12 months.
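A similar rough estimate (again ours, assuming a 30-day month and decimal units) shows what a 1 PB-per-month pace implies for sustained throughput and for the 13 PB cited in the quote below:

```python
# Back-of-envelope estimate: sustained throughput for roughly 1 PB per month,
# assuming a 30-day month and decimal units.

monthly_bytes = 10**15              # 1 PB
month_seconds = 30 * 24 * 3600      # 30-day month

gbits_per_second = monthly_bytes * 8 / month_seconds / 10**9
print(f"~{gbits_per_second:.1f} Gbit/s sustained")  # prints ~3.1 Gbit/s

# At that pace, the 13 PB mentioned in the closing quote would take roughly
# 13 months of continuous transfer; extra bandwidth shortens that timeline.
days_for_13_pb = 13 * 10**15 * 8 / (gbits_per_second * 10**9) / 86400
print(f"~{days_for_13_pb:.0f} days for 13 PB")      # prints ~390 days
```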
“We cut our entire cloud data migration timeline for moving 13 petabytes in half.”
– Vice President of Data and Analytics, Global Telecommunications Company