This pattern provides guidance for migrating on-premises CDH (Cloudera Distribution including Apache Hadoop) clusters to a data lake on Amazon Web Services (AWS) using Cirata LiveData Platform. This pattern incorporates Amazon Elastic Compute Cloud (Amazon EC2) compute capacity, Amazon Simple Storage Service (Amazon S3) storage, and Amazon Virtual Private Cloud (Amazon VPC). Cirata LiveData Platform enables you to migrate your clusters without downtime and without blocking client or application activity.
Cirata provides the ability to continuously replicate data between your on-premises Apache Hadoop environment and Amazon S3, guaranteeing strong consistency between data residing on premises and in the cloud.
You can also customize this pattern to enable disaster recovery scenarios for your on-premises Hadoop cluster by provisioning an Amazon EMR cluster that references data replicated into Amazon S3. (Note that this pattern doesn't cover the deployment of Amazon EMR.) Amazon EMR can provide an effective, low-cost disaster recovery environment in the event of on-premises Hadoop cluster failure. Migrating from an on-premises Hadoop cluster to Amazon EMR can also reduce the cost of maintaining physical infrastructure, including hardware that is over-provisioned or idle.
AWS recommends deploying workloads into private subnets for security purposes, and Cirata LiveData Platform on AWS is launched on EC2 instances within a VPC. Your on-premises Cirata LiveData Platform components need to establish connections with these VPC-resident services. For more information, see the Amazon Virtual Private Cloud Connectivity Options whitepaper.
Another option for this connectivity is to use AWS Direct Connect to connect on-premises data center nodes directly to AWS.
Cirata LiveData Platform - A software application that replicates data between Hadoop Compatible File System (HCFS) deployments (for example, Apache Hadoop), even when the clusters run different vendor distributions or different versions of Apache Hadoop. It also supports moving data between Apache Hadoop and Amazon S3. Cirata LiveData Platform provides:
Task | Description | Skills Required |
---|---|---|
Create an AWS account. | See https://aws.amazon.com for guidance. | General AWS |
Select an AWS Region. | Use the region selector in the navigation bar to choose the AWS Region where you want to deploy the stack on AWS. | General AWS |
Create a key pair. | Create an Amazon EC2 key pair in your preferred AWS Region. | General AWS |
If necessary, request a service limit increase. | Depending on requirements, the Cirata LiveData Platform server(s) may require a larger EC2 instance type. | General AWS |
Task | Description | Skills Required |
---|---|---|
Configure the AWS Region, VPCs, Availability Zone, and subnets. | Configure the AWS infrastructure, including the Availability Zone, CIDR ranges, and subnets. | General AWS |
Configure the key pair. | Configure the public/private SSH key pair, which allows you to connect securely to the instance(s) after launch. | General AWS |
Configure bastion host CIDR ranges (optional). | Configure the CIDR IP range that allows external SSH access to the bastion host instances. | General AWS |
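
Before entering the CIDR ranges above, it can help to sanity-check the plan offline. The following is a minimal sketch that uses only the Python standard library. The function name `check_network_plan` and all example addresses are hypothetical and are not part of the Cirata or AWS tooling. It checks that each subnet falls inside the VPC range, that subnets do not overlap, that subnet sizes are within the /16 to /28 range that Amazon VPC accepts, and that the bastion SSH range is not open to every address.

```python
import ipaddress


def check_network_plan(vpc_cidr, subnet_cidrs, bastion_ssh_cidr=None):
    """Return a list of human-readable problems found in a CIDR plan.

    All arguments are CIDR strings such as "10.0.0.0/16". An empty list
    means no problems were detected by these (deliberately simple) checks.
    """
    problems = []
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = [ipaddress.ip_network(cidr) for cidr in subnet_cidrs]

    for subnet in subnets:
        # Every subnet must sit inside the VPC address range.
        if not subnet.subnet_of(vpc):
            problems.append(f"{subnet} is outside VPC {vpc}")
        # Amazon VPC accepts IPv4 subnet sizes between /16 and /28.
        if not 16 <= subnet.prefixlen <= 28:
            problems.append(f"{subnet} is not between /16 and /28")

    # Subnets in the same VPC must not overlap one another.
    for index, first in enumerate(subnets):
        for second in subnets[index + 1:]:
            if first.overlaps(second):
                problems.append(f"{first} overlaps {second}")

    # Warn when SSH access to the bastion host is open to all addresses.
    if bastion_ssh_cidr is not None:
        bastion = ipaddress.ip_network(bastion_ssh_cidr)
        if bastion.prefixlen == 0:
            problems.append("bastion SSH range allows every address")

    return problems
```

For example, `check_network_plan("10.0.0.0/16", ["10.0.0.0/24", "10.0.1.0/24"], "203.0.113.0/24")` returns an empty list, while a plan that contains overlapping subnets or a `0.0.0.0/0` bastion range returns one message per problem.
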
Task | Description | Skills Required |
---|---|---|
Configure the Availability Zone. | This is the Availability Zone to use for the subnets in the VPC. | General AWS |
Configure AWS Direct Connect (DX). | Set up DX between the on-premises CDH cluster Cirata LiveData Platform nodes and the VPC. | General AWS |
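
After the Direct Connect link is in place, a quick TCP probe from an on-premises Cirata LiveData Platform node can confirm that the VPC-resident instances are reachable. The sketch below is an illustration only: the helper names are hypothetical, and which hosts and ports you probe depends on your deployment (this pattern only states that the web UI listens on port 8083).

```python
import socket


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def check_targets(targets, timeout=3.0):
    """Probe each (host, port) pair and return a {(host, port): bool} mapping."""
    return {
        (host, port): tcp_reachable(host, port, timeout)
        for host, port in targets
    }
```

A call such as `check_targets([("ldp-aws.example.internal", 8083)])` (a placeholder hostname) returns a dictionary that shows which endpoints accepted a connection. A successful probe only shows that the network path and security groups allow the connection. It does not verify that replication is configured correctly.
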
Task | Description | Skills Required |
---|---|---|
Download the installer file for CDH distributions. | Download the installer file and place it on the two edge nodes designated for Cirata LiveData Platform on the CDH cluster. | System Admin |
Task | Description | Skills Required |
---|---|---|
Specify the CDH version of the on-premises cluster. | During the CLI portion of the Cirata LiveData Platform installation, select which version of CDH is being used. You can leave all other options at their default values. | System Admin |
Task | Description | Skills Required |
---|---|---|
Access the web URL of the Cirata LiveData Platform node. | To proceed with the UI portion of the Cirata LiveData Platform installation, use the fully qualified domain name (FQDN) of the Cirata LiveData Platform node to access the web URL on port 8083. | System Admin |
Upload the Cirata LiveData Platform license. | Browse to the Cirata LiveData Platform license key file on your local desktop and upload it for use with the on-premises installation of Cirata LiveData Platform. | System Admin |
Configure the FQDN of the LiveData Platform node network interface. | Provide the hostname of the LiveData Platform server. This hostname must be reachable from the EC2 instances, and the server must be able to reach the EC2 instances. | System Admin |
Provide the zone name and node name. | Provide the name that identifies the operating zone for the Cirata LiveData Platform server, and the name that was given to the local node. | System Admin |
Confirm the URI selection. | Keep the default setting, HDFS URI with HDFS, for live replication. | System Admin |
Provide the Cloudera Manager configuration details. | Provide the Cloudera Manager hostname, port, user name, and password. Check whether Secure Sockets Layer (SSL) is enabled on the CDH UI, and adjust the port accordingly. | System Admin |
Provide Kerberos details, if required. | Provide the configuration file path of the Key Distribution Center (KDC), the keytab file path, and the principal name for the Cirata LiveData Platform system user on the Cirata LiveData Platform node. | System Admin; Security Admin |
(Optional) Enable HTTP authentication and API authorization. | If Kerberos is enabled, you can enable HTTP authentication by providing the keytab file path for the HTTP principal. You can also enable API authorization, if desired. | System Admin; Security Admin |
(Optional) Provide a Cirata LiveData Platform administrator user name. | Provide a different user name, if desired. Note the generated password, or generate a new one. | System Admin |
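
The endpoint details in the steps above can be assembled ahead of time so that they are ready when the installer asks for them. The following sketch makes several assumptions that you should verify for your environment: the function names and hostnames are hypothetical, the web UI is assumed to be served over plain HTTP on port 8083, and Cloudera Manager is assumed to use its default ports (7180 without TLS, 7183 with TLS).

```python
LIVEDATA_UI_PORT = 8083   # web UI port named in this pattern
CM_HTTP_PORT = 7180       # Cloudera Manager default without TLS (assumed unchanged)
CM_HTTPS_PORT = 7183      # Cloudera Manager default with TLS (assumed unchanged)


def livedata_ui_url(fqdn):
    """Build the installer web URL from the node's fully qualified domain name.

    Assumes plain HTTP; adjust the scheme if your node is configured otherwise.
    """
    if "." not in fqdn:
        raise ValueError(f"{fqdn!r} does not look like an FQDN")
    return f"http://{fqdn}:{LIVEDATA_UI_PORT}"


def cloudera_manager_url(host, ssl_enabled, port=None):
    """Build the Cloudera Manager URL, choosing scheme and port from the SSL setting.

    Pass an explicit port if your Cloudera Manager does not use the defaults.
    """
    scheme = "https" if ssl_enabled else "http"
    if port is None:
        port = CM_HTTPS_PORT if ssl_enabled else CM_HTTP_PORT
    return f"{scheme}://{host}:{port}"
```

For example, `cloudera_manager_url("cm.example.internal", ssl_enabled=True)` produces `https://cm.example.internal:7183`, which reflects the instruction to adjust the port when SSL is enabled.
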
Task | Description | Skills Required |
---|---|---|
Provide the location for parcel distribution on the Cloudera Management Server. | Specify the file system location for the Cirata LiveData Platform client parcel on the Cloudera Management Server. | System Admin |