Migrate on-premises CDH clusters to Amazon S3 using Cirata LiveData Platform

Summary

This pattern provides guidance for migrating on-premises CDH (Cloudera Distribution including Apache Hadoop) clusters to a data lake on Amazon Web Services (AWS) using Cirata LiveData Platform. This pattern incorporates Amazon Elastic Compute Cloud (Amazon EC2) compute capacity, Amazon Simple Storage Service (Amazon S3) storage, and Amazon Virtual Private Cloud (Amazon VPC). Cirata LiveData Platform enables you to migrate your clusters without downtime and without blocking client or application activity.

Cirata provides the ability to continuously replicate data between your on-premises Apache Hadoop environment and Amazon S3, guaranteeing strong consistency between data residing on premises and in the cloud.

You can also customize this pattern to enable disaster recovery scenarios for your on-premises Hadoop cluster by provisioning an Amazon EMR cluster that references data replicated into Amazon S3. (Note that this pattern doesn't cover the deployment of Amazon EMR.) Amazon EMR can provide an effective low-cost disaster recovery environment in the event of on-premises Hadoop cluster failure. Migrating from an on-premises Hadoop cluster to Amazon EMR can also reduce the cost of maintaining physical infrastructure, particularly where hardware is over-provisioned or idle.

Prerequisites and limitations

Prerequisites

  • An active AWS account.
  • A subscription to the Amazon Machine Image (AMI) for Cirata LiveData Platform in AWS Marketplace; you can choose from a no-commitment, metered option or use the Bring Your Own License (BYOL) option for a 14-day trial, after which you can purchase a license by contacting Cirata.
  • Cirata LiveData Platform servers located in the on-premises CDH cluster require a license and installer (contact Cirata for details).
  • Root access on the CDH edge nodes.
  • Root access on the Cloudera Management Server.
  • Administrator access to CDH.
  • Default options used during the installation of Cirata LiveData Platform on CDH edge nodes.

Architecture

Source technology stack

  • Cirata LiveData Platform (Hadoop)
  • CDH cluster

Source architecture

  • CDH on-premises cluster
  • Two on-premises edge nodes deployed in the CDH cluster for Cirata LiveData Platform

Target technology stack

  • Data lake on AWS
  • Cirata LiveData Platform
  • Amazon S3

Target architecture

  • A virtual private cloud (VPC) configured with an Availability Zone
  • An AWS Direct Connect (DX) setup that enables network connectivity between the on-premises CDH edge nodes and the VPC
  • An AWS Identity and Access Management (IAM) role to control access to the resources created (this role is used to control Cirata LiveData Platform access to Amazon S3 for data synchronization)
  • AWS Auto Scaling to establish the initial configuration and connectivity between instances
  • An S3 bucket to store the content synchronized by Cirata LiveData Platform
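
As an illustration of the IAM role in the target architecture, the following is a minimal sketch of a policy document that scopes access to a single S3 bucket. The bucket name, the function name `build_s3_sync_policy`, and the exact list of S3 actions are assumptions made for this example; the source does not state which permissions Cirata LiveData Platform requires, so check the vendor documentation before attaching a policy like this to a role.

```python
import json


def build_s3_sync_policy(bucket_name):
    """Return an IAM policy document limited to one S3 bucket.

    The action list below is an assumption for illustration, not a
    vendor-confirmed minimum set of permissions.
    """
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level actions apply to the bucket ARN itself.
                "Sid": "ListTargetBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                "Resource": bucket_arn,
            },
            {
                # Object-level actions apply to keys under the bucket.
                "Sid": "ReadWriteObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }


if __name__ == "__main__":
    # "example-livedata-target-bucket" is a hypothetical bucket name.
    policy = build_s3_sync_policy("example-livedata-target-bucket")
    print(json.dumps(policy, indent=2))
```

Separating bucket-level and object-level statements keeps each `Resource` aligned with the actions that apply to it.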

Design considerations

AWS recommends deploying workloads into private subnets for security purposes, and Cirata LiveData Platform on AWS is launched on EC2 instances within a VPC. Your on-premises Cirata LiveData Platform components need to establish connections with these VPC-resident services. For more information, see the Amazon Virtual Private Cloud Connectivity Options whitepaper.

One option, reflected in the target architecture above, is to use AWS Direct Connect to connect on-premises data center nodes directly to AWS.

Tools used

Cirata LiveData Platform - Cirata LiveData Platform is a software application that replicates data between Hadoop Compatible File System (HCFS) deployments (for example, Apache Hadoop), even where clusters run different vendor distributions or versions of Apache Hadoop. It also supports moving data between Apache Hadoop and Amazon S3. Cirata LiveData Platform provides:

  • A virtual file system for Apache Hadoop, compatible with all Apache Hadoop applications.
  • A single, virtual namespace that integrates storage from different types of Apache Hadoop deployments, including CDH, Hortonworks Data Platform (HDP), Dell EMC Isilon, Amazon S3, EMR File System (EMRFS), and MapR.
  • Storage that can be globally distributed.
  • WAN replication using Cirata’s LiveData technology, which delivers single-copy consistent Hadoop Distributed File System (HDFS) data, replicated between geographically dispersed data centers.

Setup and installation steps

1. Prepare your AWS account

Task | Description | Skills Required
Create an AWS account. | See https://aws.amazon.com for guidance. | General AWS
Select an AWS Region. | Use the region selector in the navigation bar to choose the AWS Region where you want to deploy the stack on AWS. | General AWS
Create a key pair. | Create an access/secret key pair in your preferred AWS Region. | General AWS
If necessary, request a service limit increase. | Depending on requirements, the Cirata LiveData Platform server(s) may require a larger EC2 instance type. | General AWS

2. Configure the network and Amazon EC2

Task | Description | Skills Required
Configure the AWS Region, VPCs, Availability Zone, and subnets. | Configure the AWS infrastructure, including the Availability Zone, CIDR ranges, and subnets. | General AWS
Configure the key pair. | Configure the public/private SSH key pair, which allows you to connect securely to the instance(s) after launch. | General AWS
Configure bastion host CIDR ranges (optional). | Configure the CIDR IP range that allows external SSH access to the bastion host instances. | General AWS
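
The CIDR planning in this step can be checked before anything is provisioned. The following is a minimal sketch using Python's standard `ipaddress` module. All address ranges shown are hypothetical examples, and the helper names are invented for this illustration. It carves subnets out of a VPC range, confirms that the VPC range does not overlap the on-premises network reached over Direct Connect, and tests whether a source address falls inside the bastion host SSH range.

```python
import ipaddress
from itertools import islice


def plan_subnets(vpc_cidr, new_prefix, count):
    """Return the first `count` subnets of size /new_prefix inside the VPC range."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = list(islice(vpc.subnets(new_prefix=new_prefix), count))
    if len(subnets) < count:
        raise ValueError(f"{vpc_cidr} cannot hold {count} /{new_prefix} subnets")
    return subnets


def ranges_overlap(cidr_a, cidr_b):
    """True if the two networks share any addresses (a routing conflict risk)."""
    return ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))


def ssh_source_allowed(source_ip, bastion_cidr):
    """True if source_ip falls inside the CIDR range allowed to reach the bastion."""
    return ipaddress.ip_address(source_ip) in ipaddress.ip_network(bastion_cidr)


if __name__ == "__main__":
    # Hypothetical ranges: a /16 VPC, a 192.168.0.0/16 on-premises network,
    # and a documentation-range /24 permitted to SSH to the bastion host.
    for subnet in plan_subnets("10.0.0.0/16", 24, 2):
        print("subnet:", subnet)
    print("overlaps on-premises:", ranges_overlap("10.0.0.0/16", "192.168.0.0/16"))
    print("ssh allowed:", ssh_source_allowed("203.0.113.10", "203.0.113.0/24"))
```

Checking for overlap early matters because overlapping VPC and on-premises ranges cannot be routed cleanly between the edge nodes and the EC2 instances.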

3. Configure the VPC and DX

Task | Description | Skills Required
Configure the Availability Zone. | This is the Availability Zone to use for the subnets in the VPC. | General AWS
Configure AWS Direct Connect (DX). | Set up DX between the on-premises CDH cluster Cirata LiveData Platform nodes and the VPC. | General AWS

4. Download the Cirata LiveData Platform installer for CDH

Task | Description | Skills Required
Download the installer file for CDH distributions. | Download the installer file and place it on the two edge nodes designated for Cirata LiveData Platform on the CDH cluster. | System Admin
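
Because the installer is copied to two edge nodes, it can be useful to confirm that both copies are identical before running them. The following is a minimal sketch, assuming you compute a SHA-256 digest on each node and compare the values by hand. The file name in the usage example is hypothetical, and the source does not say whether Cirata publishes a reference checksum.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large installers do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_installer(digest_a, digest_b):
    """Compare two hex digests, ignoring case and surrounding whitespace."""
    return digest_a.strip().lower() == digest_b.strip().lower()
```

For example, run `sha256_of("livedata-installer.sh")` (a hypothetical file name) on each edge node and pass the two results to `same_installer`.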

5. Complete the initial configuration of Cirata LiveData Platform for CDH

Task | Description | Skills Required
Specify the CDH version of the on-premises cluster. | During the CLI portion of the Cirata LiveData Platform installation, select which version of CDH is being used. You can leave all other options at their default values. | System Admin

6. Configure the Cirata LiveData Platform application for CDH

Task | Description | Skills Required
Access the web URL of the Cirata LiveData Platform node. | To proceed with the UI portion of the Cirata LiveData Platform installation, use the fully qualified domain name (FQDN) of the Cirata LiveData Platform node to access the web URL on port 8083. | System Admin
Upload the Cirata LiveData Platform license. | Provide the local desktop path to the Cirata LiveData Platform license key for the on-premises installation of Cirata LiveData Platform. | System Admin
Configure the FQDN of the LiveData Platform node network interface. | Provide the hostname of the LiveData Platform server. This hostname must be accessible to and from the EC2 instances. | System Admin
Provide the zone name and node name. | Provide the name that identifies the operating zone for the Cirata LiveData Platform server, and the name that was given to the local node. | System Admin
Confirm the URI selection. | Use the default setting, HDFS URI with HDFS, for live replication. | System Admin
Provide the Cloudera Manager configuration details. | Provide the Cloudera Manager hostname, port, user name, and password. Check whether Secure Sockets Layer (SSL) is enabled on the CDH UI, and adjust the port accordingly. | System Admin
Provide Kerberos details, if required. | Provide the configuration file path of the Key Distribution Center (KDC), the keytab file path, and the principal name for the Cirata LiveData Platform system user on the Cirata LiveData Platform node. | System Admin; Security Admin
(Optional) Enable HTTP authentication and API authorization. | If Kerberos is enabled, you can enable HTTP authentication by providing the keytab file path for the HTTP principal. You can also enable API authorization, if desired. | System Admin; Security Admin
(Optional) Provide a Cirata LiveData Platform administrator user name. | Provide a different user name, if desired. Note the generated password, or generate a new one. | System Admin
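
The endpoint details in this step can be sketched as follows. This is a minimal illustration with several assumptions. The host names are hypothetical. The `http` scheme for the LiveData Platform UI is assumed, because the source gives only port 8083. The Cloudera Manager ports 7180 (without TLS) and 7183 (with TLS) are the commonly used defaults and may differ in your cluster. The `is_port_open` helper checks only that a TCP connection can be established, which is a quick way to confirm reachability from an edge node or an EC2 instance.

```python
import socket


def livedata_ui_url(fqdn, port=8083):
    """Build the installer UI URL; the http scheme is an assumption."""
    return f"http://{fqdn}:{port}/"


def cloudera_manager_url(host, ssl_enabled, port=None):
    """Build the Cloudera Manager base URL, adjusting the port for SSL.

    7180 and 7183 are assumed defaults; pass `port` to override them.
    """
    if port is None:
        port = 7183 if ssl_enabled else 7180
    scheme = "https" if ssl_enabled else "http"
    return f"{scheme}://{host}:{port}"


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Hypothetical host names used only for illustration.
    print(livedata_ui_url("edge01.example.internal"))
    print(cloudera_manager_url("cm.example.internal", ssl_enabled=True))
```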

7. Install the Cirata LiveData Platform client on the CDH cluster

Task | Description | Skills Required
Provide the location for parcel distribution on the Cloudera Management Server. | Specify the file system location for the Cirata LiveData Platform client parcel on the Cloudera Management Server. | System Admin

Tags: amazon vpc, amazon ec2, amazon s3, apache, hadoop, hadoop distributed file system, hdfs, container, dataset, aws auto scaling, hybrid, mapr, amazon emr, cdh, hortonworks data platform, hdp, isilon
