CDP part 2: CDP Public Cloud deployment on AWS

The Cloudera Data Platform (CDP) Public Cloud provides the foundation upon which full featured data lakes are created.

In a previous article, we introduced the CDP platform. This article is the second in a series of six that shows how to build end-to-end big data architectures with CDP.

More specifically, we are going to:

  1. Create a credential that permits CDP to manage resources on AWS
  2. Configure an AWS CloudFormation stack that serves as root of our deployment
  3. Deploy a CDP Environment including a Data Lake to AWS

The configuration and deployment can be accomplished via the web interfaces of Cloudera and Amazon - generally referred to as the CDP console and the AWS console - or via their respective CLI tools. We cover both approaches. First, we demonstrate how to perform all preparatory steps and the actual deployment via the consoles. Second, we provide the commands to perform the same tasks from a terminal using the CLI tools.

Before we begin, a couple of important remarks:

  1. This deployment is based on the AWS quickstart documentation by Cloudera and aims to provide a usable environment as quickly as possible. It is not optimized for production use, and it is also not suitable for use cases in which you want to use existing infrastructure components - such as VPCs and subnet groups - instead of CDP-managed ones.

  2. If you decide to follow along, be aware that CDP creates resources on your AWS account that incur costs. You will find a list of the resources created during this deployment and a ballpark estimate of the associated costs at the end of the article. Always make sure to delete cloud resources that are no longer in use to avoid unwanted costs.

With that said, let’s begin by configuring our CDP and AWS accounts. As a reminder, you need at least Power User privileges on CDP and Administrator access on AWS to follow along.

Deploy using the CDP and AWS Web Interfaces

This approach is recommended if you are new to CDP and/or AWS. It is slower but gives you a better idea of the various steps involved in the deployment process. If you did not install and configure the CDP CLI and the AWS CLI as described in the first part of the series, this is also your only option.

If you want to go faster and use the terminal to manage your deployment, scroll down to the Deploy from the Terminal section. Note that you still have to use the CDP console to create your CDP credential. We recommend that you follow the steps below up to the point where you copy the Amazon Resource Name (ARN) of your cross-account access role.

Create a CDP Credential

CDP Public Cloud creates and manages AWS resources on your behalf. It is therefore necessary to delegate access to your AWS account via a cross-account access role. Our first step is to create this role in your AWS account and store it in your CDP account as a credential.

  • To begin, log in to the Cloudera console and access the Management Console:

    Navigate to management console

  • Navigate to Shared Resources > Credentials and click on Create Credential on the top right:

    Navigate to Shared Resources > Credentials

  • In the Create Credential menu, select AWS, then enter a name and optionally a description for your credential. This name and description are used on the CDP-side of your architecture.

    Create credential menu

  • Copy the AWS IAM policy that is available under Create Cross-account Access Policy. Be sure to select the version with Default permissions, not the one with Minimal permissions.

  • In a new browser tab, navigate to Identity and Access Management (IAM) - Policies in your AWS Console and click Create Policy.

    AWS IAM: Create policy

  • Paste the policy document you have copied from the CDP console:

    AWS IAM: Paste policy document

  • Click Next, optionally add tags and click Next again.

  • Review the policy document, provide a name and an optional description. AWS displays a warning message that you may ignore. Click Create policy.

    AWS IAM: Review and create policy

  • Stay in your AWS IAM console and navigate to Roles, then select Create role:

    AWS IAM: Create role

  • Under Trusted Entity Type select AWS Account. Select Another AWS account below and tick the option Require external ID:

    AWS IAM: Create role

  • Return to your CDP console and copy the Service Manager Account ID and the External ID into the corresponding fields on AWS.

    CDP: Copy IDs

  • In the AWS IAM console, click Next after you have pasted the two IDs:

    AWS IAM: Click next

  • Under Permissions policies, find the policy you created earlier and tick the checkbox on the left, then click Next:

    AWS IAM: Add permissions

  • Under Name, review, and create, enter a name and optionally a description for your role. Scroll down, optionally add tags and then click Create:

    AWS IAM: Create role

  • Find your newly created role in the AWS IAM console:

    AWS IAM: Find role

  • Copy the ARN of your newly created role:

    AWS IAM: Copy arn

  • Go back to your CDP console and paste the ARN of your cross-account access role into the corresponding field, then click Create:

    CDP: Paste AWS ARN

Congratulations, you have set up your credential to manage AWS resources via CDP.
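Behind the console steps above, the cross-account access role is an ordinary IAM role whose trust policy allows Cloudera's account to assume it only when the expected external ID is presented. The following is a minimal sketch of that trust policy and of the equivalent AWS CLI calls, not Cloudera's official procedure. The helper names make_trust_policy and create_cdp_role are hypothetical, and the account ID and external ID still have to be copied from the CDP console.

```shell
# Sketch only: make_trust_policy and create_cdp_role are hypothetical helpers.
# Both IDs are the values shown in the CDP "Create Credential" dialog.

# Print a trust policy that lets the given account assume the role,
# but only when it presents the given external ID.
make_trust_policy() {
  local account_id="$1" external_id="$2"
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::${account_id}:root" },
      "Action": "sts:AssumeRole",
      "Condition": { "StringEquals": { "sts:ExternalId": "${external_id}" } }
    }
  ]
}
EOF
}

# Create the role with that trust policy and attach the permissions policy
# you created earlier (passed in as an ARN). Defined here, not executed.
create_cdp_role() {
  local role_name="$1" account_id="$2" external_id="$3" policy_arn="$4"
  aws iam create-role \
    --role-name "${role_name}" \
    --assume-role-policy-document "$(make_trust_policy "${account_id}" "${external_id}")" \
  && aws iam attach-role-policy \
    --role-name "${role_name}" \
    --policy-arn "${policy_arn}"
}
```

Calling make_trust_policy with the two IDs prints the JSON document, which is a convenient way to review the trust relationship before any role is created.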

Configure an AWS CloudFormation Stack

Next, we create a CloudFormation stack. This stack is going to contain the basic IAM policies, roles and instance profiles that are used by our CDP resources as well as the basic configuration of our data lake.

  • To start, download the CloudFormation stack template provided by Cloudera.

  • Next, access your AWS console and navigate to the CloudFormation service.

  • Important: Make sure you are connected to the AWS region you want to create your stack in. For the purpose of this tutorial, we stay in the EU Ireland (eu-west-1) region.

  • Click on Create stack.

    AWS CloudFormation: create stack

  • Select Template is ready and Upload template file, then use the file upload dialog to upload the stack template you downloaded earlier. When done, click Next.

    AWS CloudFormation: Upload template

  • Configure your stack as follows:

    1. Choose a stack name, for example my-cdp-stack
    2. Choose an S3 bucket and directory to store backups, for example my-unique-cdp-bucket/backups
    3. Choose an S3 bucket and directory to store logs, for example my-unique-cdp-bucket/logs
    4. Choose an S3 bucket and directory to store data, for example my-unique-cdp-bucket/data
    5. Decide a prefix to use for all IAM resources generated by this stack, for example cdp

    AWS CloudFormation: Configure stack

Remember that your S3 bucket name must be globally unique. Be sure to use the same bucket for all three storage locations (/backups, /logs, and /data).

  • Click Next, optionally add tags for your stack but change nothing else and click Next again.

  • Under Review stack, scroll all the way to the bottom and confirm you acknowledge that AWS CloudFormation might create IAM resources with custom names. Click Submit to create your stack.

    AWS CloudFormation: Confirm stack

  • Wait for the stack creation to finish. You see a green CREATE COMPLETE message in CloudFormation once the process has completed successfully.

    AWS CloudFormation: Stack created

And that’s it! You now have a stack on which you may deploy a CDP Public Cloud Environment in AWS.
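If you prefer to confirm the stack state from a terminal instead of watching the console, the sketch below reads the status with the AWS CLI and maps it to a simple verdict. The function names are hypothetical, and the mapping is a simplification that treats every state other than a completed create or update, or an operation in progress, as a failure.

```shell
# Sketch only: get_stack_status and classify_stack_status are hypothetical helpers.

# Read the current status of a stack (for example CREATE_COMPLETE).
get_stack_status() {
  local stack_name="$1" region="${2:-eu-west-1}"
  aws cloudformation describe-stacks \
    --stack-name "${stack_name}" \
    --region "${region}" \
    --query 'Stacks[0].StackStatus' \
    --output text
}

# Map a CloudFormation status string to ready, pending or failed.
classify_stack_status() {
  case "$1" in
    CREATE_COMPLETE|UPDATE_COMPLETE) echo ready ;;
    *_IN_PROGRESS)                   echo pending ;;
    *)                               echo failed ;;
  esac
}
```

A typical call would be classify_stack_status "$(get_stack_status my-cdp-stack)", where my-cdp-stack is the example stack name used above.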

Create an SSH Key Pair

When you create your CDP environment, you are required to provide an SSH key pair. While you have the option to create a new key pair as you register the environment, it is preferable to create it in advance.

  • To create a new SSH key pair, access your AWS console and navigate to EC2 > Network & Security > Key Pairs. Make sure you are in the region you want to create your environment in and click Create key pair:

    AWS EC2: Create key pair

  • Under Create key pair, provide a name for your key pair. You are going to need this name later when you create your environment. Choose RSA as Key pair type and .pem as Private key file format. Optionally add some tags and click Create key pair.

    AWS EC2: Create key pair
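When you create the key pair in the console, your browser downloads the private key as a .pem file. SSH clients typically refuse private keys that other users can read, so it is worth moving the file to a safe place and restricting its permissions. The sketch below assumes a hypothetical helper named install_private_key; the download location is whatever your browser uses.

```shell
# Sketch only: install_private_key is a hypothetical helper.
# Move a downloaded private key into a target directory (default: ~/.ssh)
# and make it readable by the owner only.
install_private_key() {
  local src="$1" dest_dir="${2:-${HOME}/.ssh}"
  local dest="${dest_dir}/$(basename "${src}")"
  mkdir -p "${dest_dir}" \
    && mv "${src}" "${dest}" \
    && chmod 400 "${dest}"
}
```

For example, install_private_key ~/Downloads/my-key.pem would place the key in ~/.ssh/my-key.pem with mode 400; the file name here is only an illustration.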

Register a CDP Environment in AWS

With all the setup complete, we are now finally ready to launch our CDP environment on AWS.

Before we proceed it is important to remind you that the resources launched by CDP are not free. If you decide to follow along, you will incur some cost on your AWS account. Whenever you practice with any cloud service, be sure to remove resources when done.

  • To begin deploying an environment via the CDP console, navigate to Management Console > Environments and click Register Environment:

    CDP: Navigate to Management Console

    CDP: Navigate to Environments

  • In the Register Environment dialog, provide a name and optionally a description for your environment. Select AWS as Cloud Provider and pick the credential you created earlier, then click Next:

    CDP: Register Environment

  • Provide a name and select a runtime version for your data lake. Always select the latest available runtime version unless you have a specific requirement for an earlier version.

    CDP: Configure Datalake

  • Under Data Access and Audit select the roles, instance profiles and storage locations you created when you registered your stack.

    CDP: Configure Access

  • If you don’t remember the details, look them up in AWS CloudFormation. Simply click on your stack and select the Parameters tab:

    AWS CloudFormation: View stack parameters

  • Under Scale, select the desired configuration of your data lake. Light Duty should be sufficient for our use case. Click Next.

    CDP: Select scale

  • In Region, Networking and Security, apply the following configuration:

    • Region: Select the AWS region you created your stack in

    • Network: Select Create new network

    • Be sure to enable Public Endpoint Access Gateway

    CDP: Configure Region, Networking, Security

  • Leave the proxy configuration at the default setting Do not use Proxy Configuration.

  • Under Security Access Settings, leave the default setting Create New Security Groups with an access CIDR of 0.0.0.0/0.

  • In SSH Settings, choose Existing SSH public key and select the key you created earlier from the drop down.

    CDP: Configure security

  • Optionally add some tags. These tags are applied to all AWS resources created by this step. We recommend always tagging your resources for easier monitoring and deletion. When done, click Next.

  • Under Logger Instance Profile enter the [YOUR-PREFIX]-log-access-instance-profile as well as the log and backup location base created in your stack. Check your CloudFormation console for the correct parameters in case you are not sure.

    CDP: Configure logs

  • Click Register Environment to start the environment creation.

And that’s it! You have now launched the deployment of a CDP Public Cloud environment on AWS. Monitor your progress via the Cloudera console:

CDP: Monitor Environment Status

Remove your CDP Environment

As soon as you no longer use your environment, you should remove it from AWS to avoid incurring unwanted costs. Note that your base stack and the S3 bucket you created via CloudFormation remain, so that you may re-deploy your environment later starting from Register a CDP Environment in AWS.

To delete your environment via the Cloudera console:

  • Navigate to Environments in the Cloudera Management Console. Tick the checkbox next to the environment you want to delete and click Delete Environment:

    CDP: Delete environment

  • In the Confirmation dialog, enter the name of the resource you want to delete and tick the first two boxes, then click Delete:

    CDP: Confirm environment deletion

Be aware that there is a chance that the environment deletion process does not complete successfully. Always double check in your AWS console that all resources managed by CDP have been removed from your account. You can use the CloudFormation service or AWS resource tags (if you configured them during deployment) to look for CDP managed resources.
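If you tagged your resources during deployment, the AWS Resource Groups Tagging API offers a quick way to list whatever still carries that tag after the deletion. The sketch below is one possible check, with hypothetical helper names; the tag key and value are the ones you chose yourself. Keep in mind that not every AWS service is covered by this API, so an empty result is a good sign rather than proof.

```shell
# Sketch only: tag_filter and list_tagged_resources are hypothetical helpers.

# Build the --tag-filters argument for a single key/value pair.
tag_filter() {
  printf 'Key=%s,Values=%s' "$1" "$2"
}

# List the ARNs of all resources in a region that carry the given tag.
list_tagged_resources() {
  local key="$1" value="$2" region="${3:-eu-west-1}"
  aws resourcegroupstaggingapi get-resources \
    --tag-filters "$(tag_filter "${key}" "${value}")" \
    --region "${region}" \
    --query 'ResourceTagMappingList[].ResourceARN' \
    --output text
}
```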

Deploy from the Terminal

Deploying via the terminal is recommended for experienced users who want to launch their environment quickly. You need to have the CDP CLI and the AWS CLI installed on your system as described in the first part of the series. jq is also required for the commands below to work.

The order of operations is the same as if you deployed via the web interface: first, create a credential (which requires the use of the web interface), then create your CloudFormation stack and SSH key pair before you launch your environment.

Register Your CDP Credential

Use the web interface to create a Cross-account access role in your AWS account as described above. Follow the steps up to the point where you copy the ARN of the newly created role, then register it in CDP with the following command:

# Set your AWS Role ARN
export CDP_AWS_CROSS_ACCOUNT_ROLE_ARN=[your-role-arn]
# Register your CDP credential
export CDP_CREDENTIAL_NAME=${USER}-aws-credential
export CDP_CREDENTIAL_DESC="CDP AWS credential by ${USER}"

cdp environments create-aws-credential \
 --credential-name ${CDP_CREDENTIAL_NAME} \
 --role-arn ${CDP_AWS_CROSS_ACCOUNT_ROLE_ARN} \
 --description "${CDP_CREDENTIAL_DESC}"

There is no immediate feedback if you successfully created your credential. To validate that your credential was created use this command:

# Check the existence of a CDP credential
cdp environments list-credentials \
  --credential-name=${CDP_CREDENTIAL_NAME}

Create a CloudFormation Stack

The next step in the deployment process is the creation of a CloudFormation stack. To create the stack via the AWS CLI based on the template provided by Cloudera, use the following commands:

# Download and save the template
curl \
  -o ~/aws-cdp-template.json \
  https://docs.cloudera.com/cdp-public-cloud/cloud/quickstart-files/cloud-formation-setup.json
# Set additional environment variables
export CDP_BASE_STACK_NAME=aws-${USER}-env
export CDP_RESOURCE_PREFIX=cdp
export AWS_S3_BUCKET=cdp-${USER}-$RANDOM
export AWS_S3_BUCKET_DATA=${AWS_S3_BUCKET}/data
export AWS_S3_BUCKET_LOGS=${AWS_S3_BUCKET}/logs
export AWS_S3_BUCKET_BACKUPS=${AWS_S3_BUCKET}/backups
# Optionally set a region. If not set, the next command defaults to eu-west-1.
export AWS_REGION=eu-west-1
# Create your AWS CloudFormation stack
aws cloudformation deploy \
  --template-file ~/aws-cdp-template.json \
  --stack-name ${CDP_BASE_STACK_NAME} \
  --parameter-overrides \
      StorageLocationBase=${AWS_S3_BUCKET_DATA} \
      LogsLocationBase=${AWS_S3_BUCKET_LOGS} \
      BackupLocationBase=${AWS_S3_BUCKET_BACKUPS} \
      prefix=${CDP_RESOURCE_PREFIX} \
  --region ${AWS_REGION:-eu-west-1} \
  --capabilities CAPABILITY_NAMED_IAM

The progress of the stack creation process is displayed in your terminal.
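Because the bucket name above is derived from ${USER}, it can end up containing characters that S3 does not accept, such as uppercase letters or underscores. A small pre-flight check can catch this before CloudFormation fails. The sketch below uses a hypothetical helper, is_valid_bucket_name, and covers only the basic S3 naming rules (3 to 63 characters; lowercase letters, digits, dots and hyphens; starting and ending with a letter or digit). It does not check global uniqueness or the less common restrictions.

```shell
# Sketch only: is_valid_bucket_name is a hypothetical helper that checks
# the basic S3 naming rules, not the complete rule set.
is_valid_bucket_name() {
  printf '%s' "$1" | grep -Eq '^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$'
}
```

Running is_valid_bucket_name "${AWS_S3_BUCKET}" || echo "invalid bucket name" before the deploy command gives early feedback.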

Create an SSH Key Pair

You need to provide an SSH key pair when you register your environment. Use these commands to create a new key pair:

# Set a name for your key pair
export AWS_SSH_KEY=aws-cdp-${USER}
# Create the key pair and save the private key
# (--query 'KeyMaterial' writes only the key itself to the .pem file)
aws ec2 create-key-pair \
  --key-name ${AWS_SSH_KEY} \
  --query 'KeyMaterial' \
  --region ${AWS_REGION:-eu-west-1} \
  --output text > ${HOME}/.ssh/${AWS_SSH_KEY}.pem \
  && chmod 400 ${HOME}/.ssh/${AWS_SSH_KEY}.pem

There is no feedback if you successfully created your key pair. Use this command to validate if the operation was successful:

aws ec2 describe-key-pairs \
  --key-names ${AWS_SSH_KEY} \
  --region ${AWS_REGION:-eu-west-1}

Launch your Environment and Data Lake

With all the setup done, you are now ready to launch your CDP Public Cloud Environment and Data Lake. This requires three steps that are to be executed in order:

  1. Create the base CDP environment
  2. Configure ID broker mappings
  3. Create the data lake itself

Before we begin, let’s ensure all environment variables are available in the current shell session:

# CDP resource naming
export CDP_ENV_NAME=aws-${USER}
export CDP_DATALAKE_NAME=aws-${USER}-datalake
export CDP_RESOURCE_PREFIX=$(aws cloudformation describe-stacks \
  --stack-name ${CDP_BASE_STACK_NAME:-aws-${USER}-env} \
  | jq -r '.Stacks[].Parameters[] | select (.ParameterKey=="prefix").ParameterValue')
export AWS_S3_BUCKET=$(aws cloudformation describe-stacks \
  --stack-name ${CDP_BASE_STACK_NAME:-aws-${USER}-env} \
  | jq -r '.Stacks[].Parameters[] | select(.ParameterKey=="StorageLocationBase").ParameterValue' \
  | grep -Po '[a-z0-9-]*(?=/)')
export AWS_S3_BUCKET_DATA=${AWS_S3_BUCKET}/data
export AWS_S3_BUCKET_LOGS=${AWS_S3_BUCKET}/logs
export AWS_S3_BUCKET_BACKUPS=${AWS_S3_BUCKET}/backups

# AWS resource roles and instance profiles
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity | grep -Po "(?<=\"Account\": \")[0-9]*")
export AWS_LOG_ACCESS_INSTANCE_PROFILE_ARN=arn:aws:iam::${AWS_ACCOUNT_ID}:instance-profile/${CDP_RESOURCE_PREFIX}-log-access-instance-profile
export AWS_DATA_ADMIN_ROLE_ARN=arn:aws:iam::${AWS_ACCOUNT_ID}:role/${CDP_RESOURCE_PREFIX}-datalake-admin-role
export AWS_DATA_ADMIN_INSTANCE_PROFILE_ARN=arn:aws:iam::${AWS_ACCOUNT_ID}:instance-profile/${CDP_RESOURCE_PREFIX}-data-access-instance-profile
export AWS_RANGER_AUDIT_ROLE_ARN=arn:aws:iam::${AWS_ACCOUNT_ID}:role/${CDP_RESOURCE_PREFIX}-ranger-audit-role

# AWS resource tagging
export AWS_TAG_GENERAL_KEY=ENVIRONMENT_PROVIDER
export AWS_TAG_GENERAL_VALUE=CLOUDERA
export AWS_TAG_SERVICE_KEY=CDP_SERVICE
export AWS_TAG_SERVICE_ENVIRONMENT=CDP_ENVIRONMENT
export AWS_TAG_SERVICE_DATALAKE=CDP_DATALAKE
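Several of these variables are derived from command output, so a failed lookup silently leaves them empty and the error only surfaces later in a long CDP command. A short guard can make this visible up front. The sketch below assumes a hypothetical helper named require_vars and relies on printenv, so it only sees exported variables, which is how all variables in this article are defined.

```shell
# Sketch only: require_vars is a hypothetical helper.
# Return non-zero and report on stderr if any named variable is
# unset, not exported, or empty.
require_vars() {
  local missing=0 name
  for name in "$@"; do
    if [ -z "$(printenv "${name}")" ]; then
      echo "Missing or empty variable: ${name}" >&2
      missing=1
    fi
  done
  return "${missing}"
}
```

For example, require_vars CDP_RESOURCE_PREFIX AWS_S3_BUCKET AWS_ACCOUNT_ID reports every variable that did not resolve before you go on to create the environment.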

Now we begin by creating our AWS environment:

# Create the base environment
cdp environments create-aws-environment \
  --environment-name ${CDP_ENV_NAME:-aws-${USER}} \
  --credential-name ${CDP_CREDENTIAL_NAME:-${USER}-aws-credential} \
  --region ${AWS_REGION:-eu-west-1} \
  --security-access cidr=${CDP_SECURITY_ACCESS:-0.0.0.0/0} \
  --tags key=${AWS_TAG_GENERAL_KEY},value=${AWS_TAG_GENERAL_VALUE} key=${AWS_TAG_SERVICE_KEY},value=${AWS_TAG_SERVICE_ENVIRONMENT} \
  --endpoint-access-gateway-scheme ${CDP_GATEWAY_SCHEME:-PUBLIC} \
  --enable-tunnel \
  --authentication publicKeyId=${AWS_SSH_KEY:-aws-cdp-${USER}} \
  --log-storage storageLocationBase=s3a://${AWS_S3_BUCKET_LOGS},backupStorageLocationBase=s3a://${AWS_S3_BUCKET_BACKUPS},instanceProfile=${AWS_LOG_ACCESS_INSTANCE_PROFILE_ARN} \
  --network-cidr ${AWS_NETWORK_CIDR:-10.10.0.0/16} \
  --create-private-subnets \
  --no-create-service-endpoints \
  --free-ipa instanceCountByGroup=${CDP_IPA_INSTANCE_COUNT:-2}

Next, we set our ID broker mappings:

# Configure ID broker mappings
cdp environments set-id-broker-mappings \
  --environment-name ${CDP_ENV_NAME:-aws-${USER}} \
  --data-access-role ${AWS_DATA_ADMIN_ROLE_ARN} \
  --ranger-audit-role ${AWS_RANGER_AUDIT_ROLE_ARN} \
  --set-empty-mappings

And finally, we create the data lake:

# Create a data lake
cdp datalake create-aws-datalake \
  --datalake-name ${CDP_DATALAKE_NAME:-aws-${USER}-datalake} \
  --environment-name ${CDP_ENV_NAME:-aws-${USER}} \
  --cloud-provider-configuration instanceProfile=${AWS_DATA_ADMIN_INSTANCE_PROFILE_ARN},storageBucketLocation=s3a://${AWS_S3_BUCKET_DATA} \
  --tags key=${AWS_TAG_GENERAL_KEY},value=${AWS_TAG_GENERAL_VALUE} key=${AWS_TAG_SERVICE_KEY},value=${AWS_TAG_SERVICE_DATALAKE} \
  --scale ${CDP_DATALAKE_SCALE:-LIGHT_DUTY} \
  --runtime ${CDP_DATALAKE_RUNTIME:-7.2.15} \
  --no-enable-ranger-raz

Monitor your environment and data lake status with the following commands:

# Check the status of the environment
cdp environments describe-environment \
  --environment-name ${CDP_ENV_NAME:-aws-${USER}} \
  | jq -r '.environment.status'
# Check the status of the data lake
cdp datalake describe-datalake \
  --datalake-name ${CDP_DATALAKE_NAME:-aws-${USER}-datalake} \
  | jq -r '.datalake.status'

If deployed successfully, your environment status is AVAILABLE, and your data lake status is RUNNING.
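Since each of the three steps depends on the previous one having finished, it can be convenient to poll the status instead of re-running the commands by hand. The sketch below is one way to do this; wait_for_status, env_status and datalake_status are hypothetical helpers built around the status commands shown above. It does not stop early on failure states, it simply gives up after the configured number of attempts.

```shell
# Sketch only: wait_for_status, env_status and datalake_status are
# hypothetical helpers.

# Usage: wait_for_status EXPECTED INTERVAL_SECONDS MAX_TRIES COMMAND [ARGS...]
# Runs COMMAND repeatedly until it prints EXPECTED or the tries run out.
wait_for_status() {
  local expected="$1" interval="$2" tries="$3" status
  shift 3
  while [ "${tries}" -gt 0 ]; do
    status="$("$@")"
    echo "current status: ${status}" >&2
    if [ "${status}" = "${expected}" ]; then
      return 0
    fi
    tries=$((tries - 1))
    if [ "${tries}" -gt 0 ]; then
      sleep "${interval}"
    fi
  done
  return 1
}

# Thin wrappers around the status commands from the article.
env_status() {
  cdp environments describe-environment --environment-name "$1" \
    | jq -r '.environment.status'
}

datalake_status() {
  cdp datalake describe-datalake --datalake-name "$1" \
    | jq -r '.datalake.status'
}
```

As an illustration, wait_for_status AVAILABLE 60 90 env_status "${CDP_ENV_NAME}" checks once a minute for up to 90 minutes; the interval and the number of attempts are arbitrary choices.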

Tear Down Your Resources

Once you no longer use your environment, it is highly recommended that you remove your AWS resources in order to avoid unwanted cost. Issue the following command to delete your environment and all associated resources:

# Delete a CDP environment and all related resources
cdp environments delete-environment \
  --environment-name ${CDP_ENV_NAME:-aws-${USER}} \
  --cascading

Be sure to always validate that your resources have been deleted completely. The best way to verify that all resources have been removed is to check your AWS CloudFormation Console.
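Deleting the CDP environment leaves the base CloudFormation stack and its S3 bucket in place. If you do not plan to re-deploy, the sketch below shows one way to remove them as well. It assumes the bucket is managed by the base stack, as described in this article, and empties it first because CloudFormation cannot delete a bucket that still contains objects. The helpers run and remove_base_stack and the DRY_RUN switch are hypothetical; by default the commands are only printed, not executed.

```shell
# Sketch only: run, remove_base_stack and DRY_RUN are hypothetical.

# Print the command when DRY_RUN is 1 (the default), execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Empty the bucket, delete the base stack and wait for the deletion to finish.
remove_base_stack() {
  local stack="$1" bucket="$2" region="${3:-eu-west-1}"
  run aws s3 rm "s3://${bucket}" --recursive --region "${region}" \
    && run aws cloudformation delete-stack \
         --stack-name "${stack}" --region "${region}" \
    && run aws cloudformation wait stack-delete-complete \
         --stack-name "${stack}" --region "${region}"
}
```

Review the printed commands first, then set DRY_RUN=0 to actually run them. Emptying the bucket permanently deletes your data, logs and backups.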

Resources and Costs

While Cloudera’s CDP Public Cloud documentation is extensive, determining which resources are created as part of your deployment is not a trivial task. Based on our observations, the deployment we describe in this article - with a Light Duty configuration for the Data Lake - creates the following resources:

Hourly and other costs are for the EU Ireland region, as observed in June 2023. AWS resource pricing varies by region and can change over time. Consult AWS Pricing to see the current pricing for your region.

| CDP Component | AWS Resource Created | Resource Count | Resource Cost (Hour) | Resource Cost (Other) |
|---|---|---|---|---|
| Base* | S3: Bucket | 1 | n/a | AWS S3 Pricing |
| Base | IAM: Role | 4 | No charge | No charge |
| Base | IAM: Instance Profile | 2 | No charge | No charge |
| Base | IAM: Managed Policy | 6 | No charge | No charge |
| Base | CloudFormation: Stack | 1 | No charge | Handling costs |
| Environment | EC2 Instance: m5.large | 2 | $0.107 | Data Transfer Cost |
| Environment | EC2: Elastic IP Address | 3 | $0.005** | No charge |
| Environment | EC2: EBS - GP2 100 GB | 2 | n/a | $0.11 per GB Month (see EBS pricing) |
| Environment | EC2: Security Group | 1 | No charge | No charge |
| Environment | VPC: NAT Gateway | 3 | $0.048 | $0.048 per GB processed (see VPC pricing) |
| Environment | VPC: Internet Gateway | 1 | No charge | No charge |
| Environment | VPC: Route Table | 4 | No charge | No charge |
| Environment | VPC: Subnet Group | 6 | No charge | No charge |
| Environment | VPC: Virtual Private Cloud | 1 | No charge | No charge |
| Environment | CloudFormation: Stack | 2 | No charge | Handling costs |
| Data Lake | EC2 Instance: t3.medium | 1 | $0.0456 | Data Transfer Cost |
| Data Lake | EC2 Instance: r5.2xlarge | 1 | $0.564 | Data Transfer Cost |
| Data Lake | RDS PostgreSQL DB Instance: db.m5.large | 1 | $0.197 | Additional RDS charges |
| Data Lake | RDS DB Snapshot | 1 | n/a | DB Snapshot Export charges |
| Data Lake | EC2: EBS - GP2 100 GB | 2 | n/a | $0.11 per GB Month (see EBS pricing) |
| Data Lake | EC2: EBS - GP2 512 GB | 1 | n/a | $0.11 per GB Month (see EBS pricing) |
| Data Lake | EC2: Network Load Balancer | 2 | $0.0252 | $0.006 per NLCU hour |
| Data Lake | EC2: Network Target Groups | 2 | No charge | No charge |
| Data Lake | EC2: Security Group | 3 | No charge | No charge |
| Data Lake | RDS: DBSubnetGroup | 1 | No charge | No charge |
| Data Lake | CloudFormation: Stack | 2 | No charge | Handling costs |

* Base refers to the AWS resources created on your account by the initial CloudFormation stack. These resources remain on your account even if the deployment is deleted until you remove the stack.

** Per running EC2 instance, one Elastic IP Address is free of charge

Not accounting for costs that scale with usage, such as data transfer costs, and monthly costs that are pro-rated on an hourly basis, such as EBS storage costs, this basic deployment has an hourly cost of approximately $1.17.

Next step: activate Data Services

Of course, there is not much you can do yet with your brand new CDP Public Cloud environment. To completely deploy and use our end-to-end architecture, we will see in the next article how to activate managed Data Services.
