Data and analytics are critical to the organization's aspiration of moving money and information to move the world. This includes their on-premises data lake, which consists of two on-premises clusters running Cloudera Distributed Hadoop (CDH) 6.3. They have also started to use cloud analytics solutions from Databricks, and one of their objectives is to build out a completely new set of analytics use cases in the Azure cloud so they can:
The organization's on-premises data lake includes data that they do not want to transfer or that, for regulatory reasons, their legal department has not approved for the cloud. They need to be able to easily select and control which data is transferred and which data remains on-premises. Because the on-premises data lake is business critical, it must be available 24x7 for analytics as well as for the data ingest and changes that occur daily; the organization cannot afford any system downtime or business disruption. The organization also established throughput requirements that the data transfer process needs to achieve.
In summary, the key challenges and requirements include:
Following a proof of concept (PoC), the organization selected Cirata Data Migrator to transfer data from their on-premises data lake to the Azure cloud. Data Migrator is an automated, scalable, high-performance, cloud-agnostic data integration solution that makes data available and immediately usable across on-premises environments and any cloud platform. The PoC demonstrated that Data Migrator would meet all of their requirements and address their data transfer challenges.
The organization also evaluated alternatives such as DistCp (the distributed copy tool that ships with Hadoop) and AzCopy (Microsoft's command-line copy utility for Azure Storage). They reported that they could not reach their throughput requirements with AzCopy and saw a similar “performance lag” with DistCp. Furthermore, DistCp and AzCopy are designed to copy data as of a single point in time: any data ingested or changed after the copy process starts is not picked up, so subsequent scans are needed to capture ongoing changes. Avoiding that gap would require taking the production system offline during the copy, which was unacceptable.
Data Migrator performs the initial data transfer using a single scan of the source storage, while also supporting continuous replication of any ongoing changes from source to target with zero disruption to current production systems.
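The difference between a point-in-time copy and a single scan combined with ongoing change capture can be illustrated with a toy model. The sketch below is only an illustration under stated assumptions: storage is modeled as an in-memory dictionary, the function names and file paths are hypothetical, and it does not represent how Data Migrator, DistCp, or AzCopy are implemented.

```python
def point_in_time_copy(source, changes_during_copy):
    """Toy model: copy a listing taken at the start of the job."""
    target = dict(source)               # snapshot of the source at scan time
    source.update(changes_during_copy)  # writes that land while the copy runs
    return target                       # the target only reflects the snapshot


def scan_plus_change_capture(source, changes_during_copy):
    """Toy model: one scan, then replay the changes captured while it ran."""
    target = dict(source)               # initial scan of existing data
    source.update(changes_during_copy)  # the source keeps taking writes
    for path, content in changes_during_copy.items():
        target[path] = content          # apply each captured change to the target
    return target


# Hypothetical paths and contents, used only for illustration.
initial = {"/data/a.parquet": "v1", "/data/b.parquet": "v1"}
changes = {"/data/b.parquet": "v2", "/data/c.parquet": "v1"}

src1 = dict(initial)
stale = point_in_time_copy(src1, changes)

src2 = dict(initial)
live = scan_plus_change_capture(src2, changes)

# Paths that a follow-up scan would still have to pick up.
missed = sorted(p for p in src1 if stale.get(p) != src1[p])
print(missed)
print(live == src2)
```

In this model the point-in-time target is missing the modified and newly added files until another scan runs, while the target that also receives captured changes matches the source without the source ever being paused.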
Data Migrator is installed on an edge node of the source cluster. Deployment can be performed in minutes and does not require any custom coding or changes to source applications. The organization was able to easily configure data transfer jobs to meet their specific requirements, such as the data sets to transfer, exclusion rules, bandwidth management, and more. Verification capabilities confirm that all data is transferred, and the product user interface allows the full data transfer process to be managed and monitored from a single console.
Data Migrator enabled the organization to:
Data Migration as a Service: The organization elected to use Cirata’s fixed-price professional services offering, in which Cirata data integration specialists manage the migration setup and assist throughout the entire migration. This enabled their team to focus on other elements of the new analytics platform.
“We selected Cirata Data Migrator to transfer data from our on-premises data lake to the cloud. Data Migrator provided superior performance and throughput over the alternatives we evaluated, and the organization delivered excellent support during the initial proof of concept and the overall project, and continues to do so today.”
– Senior Director Technical Operations, Global leader in payments and financial technology