
Confluent Tools Cluster

This folder contains a Terraform module for running a cluster of Confluent tools such as Schema Registry and REST Proxy. Under the hood, the cluster is powered by the server-group module, so it supports attaching ENIs and EBS Volumes, zero-downtime rolling deployment, and auto-recovery of failed nodes.

Quick start
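
As a rough sketch (not a complete, working configuration), using this module looks something like the following. Every ID, name, and path below is a placeholder, the full set of inputs is documented in the Reference section, and the examples mentioned throughout this README contain real, working code.

   module "confluent_tools_cluster" {
     # Placeholder source path; point this at the confluent-tools-cluster module in your own setup.
     source = "../../modules/confluent-tools-cluster"

     cluster_name  = "confluent-tools-stage"
     cluster_size  = 3
     instance_type = "t2.medium"

     # An AMI built from a Packer template such as the one in the confluent-oss-ami example (placeholder ID)
     ami_id = "ami-0123456789abcdef0"

     # User Data script that runs run-schema-registry, run-kafka-rest, and run-health-checker;
     # you can also render it with templatefile(), as shown in the User Data section below.
     user_data = file("${path.module}/confluent-tools-cluster-user-data.sh")

     aws_region = "us-east-1"
     vpc_id     = "vpc-0123456789abcdef0"                              # placeholder VPC ID
     subnet_ids = ["subnet-aaa111", "subnet-bbb222", "subnet-ccc333"]  # placeholder private subnets

     # Lock down which CIDR ranges may talk to Schema Registry and REST Proxy
     allowed_inbound_cidr_blocks = ["10.0.0.0/16"]
   }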

Key considerations for using this module

Here are the key things to take into account when using this module:

Confluent Open Source AMI

You specify the AMI to run in the cluster using the ami_id input variable. We recommend creating a Packer template to define the AMI, with the Confluent tools and health-checker installed.

See the confluent-oss-ami example for working sample code.

User Data

When your servers are booting, you need to tell them to start Schema Registry, REST Proxy, and health-checker. The easiest way to do that is to specify a User Data script via the user_data input variable that runs the run-schema-registry, run-kafka-rest, and run-health-checker scripts. See confluent-tools-cluster-user-data.sh for an example.
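One way to wire that up is to render the example script with Terraform's templatefile function. This is only a sketch: the file path and the template variables shown are hypothetical, so match them to whatever your copy of the script actually interpolates.

   locals {
     # Illustrative only: render the example User Data script and pass
     # local.confluent_tools_user_data to the module's user_data input.
     confluent_tools_user_data = templatefile(
       "${path.module}/confluent-tools-cluster-user-data.sh",
       {
         schema_registry_port = 8081  # hypothetical template variable
         rest_proxy_port      = 8082  # hypothetical template variable
       }
     )
   }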

Kafka

REST Proxy exposes a RESTful API on top of Kafka, and Schema Registry stores schemas for data stored in Kafka. Therefore, both of these services depend on Kafka. The easiest way to run Kafka is with the kafka-cluster module. Check out the kafka-zookeeper-confluent-oss-standalone-clusters example for how to run Kafka, ZooKeeper, and the Confluent tools in separate clusters, and the kafka-zookeeper-confluent-oss-colocated-cluster example for how to run all services co-located in the same cluster.

ZooKeeper

REST Proxy and Schema Registry depend on Kafka, and Kafka depends on ZooKeeper. The easiest way to run ZooKeeper is with terraform-aws-zookeeper. Check out the kafka-zookeeper-standalone-clusters example for how to run Kafka and ZooKeeper in separate clusters, and the kafka-zookeeper-colocated-confluent-oss-cluster example for how to run Kafka and ZooKeeper co-located in the same cluster.

Hardware

Schema Registry

Schema Registry hardware requirements are relatively light. Based on the official Schema Registry hardware recommendations:

  • Memory: 1 GB of memory should be more than sufficient, assuming a reasonable number of schemas.
  • CPU: Schema Registry requires minimal CPU.
  • Disk: Because all data is stored in Kafka and only log4j logging uses the disk, Schema Registry requires minimal disk performance.
  • Network: Standard networking is sufficient.

Based on the above, a t2.small is the minimum for a production cluster, but because Schema Registry will likely be co-located with other tools such as REST Proxy, a t2.medium is usually the better fit.

REST Proxy

The number and type of servers you need for REST Proxy depend on the usage patterns you anticipate. For example, if REST Proxy will be used mostly for administrative actions, you can allocate just 1 GB of memory per EC2 Instance (in addition to whatever else is running on that EC2 Instance). But if REST Proxy is also serving consumers, you'll need to do some basic math around how many consumers you expect and the average memory buffer each consumer will use.
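For example (with purely hypothetical figures), if you expect 100 concurrent consumers and budget roughly 16 MB of buffer per consumer, that works out to about 1.6 GB for consumer buffers on top of the ~1 GB baseline, so you would want at least 3 GB of memory available for REST Proxy alone.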

Because hardware requirements depend on your particular business needs, we cannot make a specific recommendation. For all the details, see the official REST Proxy hardware recommendations.

EBS Volumes

Neither Schema Registry nor REST Proxy writes to the local disk (EBS Volume) for core operations; rather, all state is stored in Kafka. Therefore, we make no special accommodations for a "sticky" EBS Volume or high disk performance.

Health checks

We strongly recommend associating an Elastic Load Balancer (ELB) with your Confluent Tools cluster and configuring it to perform TCP health checks on the Schema Registry port (8081 by default) and the REST Proxy port (8082 by default). But an ELB only supports health checks against a single port, so we wrote health-checker to expose a simple HTTP server that checks one or more TCP ports. This way, a single ELB health check can verify that both Schema Registry and REST Proxy are running.
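For illustration, an ELB health check aimed at health-checker might look something like the sketch below. It uses the standard aws_elb resource; the name, subnets, and thresholds are placeholders, and it assumes health-checker is listening on its default port of 5500.

   resource "aws_elb" "confluent_tools" {
     name    = "confluent-tools-stage"   # placeholder name
     subnets = var.subnet_ids            # assumed variable: the same private subnets as the cluster

     # Keep the ELB internal, since we recommend running the cluster in private subnets
     internal = true

     listener {
       instance_port     = 8081
       instance_protocol = "tcp"
       lb_port           = 8081
       lb_protocol       = "tcp"
     }

     # A single health check against health-checker, which in turn verifies that both the
     # Schema Registry port (8081) and the REST Proxy port (8082) are accepting connections.
     health_check {
       target              = "HTTP:5500/"   # health-checker's default port
       interval            = 15
       timeout             = 5
       healthy_threshold   = 2
       unhealthy_threshold = 2
     }
   }

You would then pass the ELB's name into this module's elb_names input so that rolling deployments and auto-recovery use its health checks.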

The confluent-tools-cluster module allows you to associate an ELB with the Confluent Tools cluster, using the ELB's health checks to perform zero-downtime deployments (i.e., ensuring the previous node is passing health checks before deploying the next one) and to detect when a server is down and needs to be automatically replaced.

You may optionally connect to either REST Proxy or Schema Registry via the ELB, but keep in mind that an ELB's underlying IP addresses may change from time to time, and that the ELB isn't aware of which Schema Registry or REST Proxy node is currently the master. Therefore, we prefer that your Schema Registry and REST Proxy clients be configured to attempt connections to multiple endpoints, in which case they can use the static DNS names of the Confluent Tools servers. But if your client only supports a single endpoint, use the ELB, so that requests are only routed to healthy nodes.

Check out the kafka-zookeeper-confluent-oss-standalone-clusters example for working sample code that includes an ELB.

Rolling deployments

To deploy updates to the Confluent Tools cluster, such as rolling out a new version of the AMI, you need to do the following:

  1. Shut down one of the Confluent Tools servers.
  2. Deploy the new code on a new server.
  3. Wait for the new code to come up successfully and start passing health checks.
  4. Repeat the process with the remaining servers.

This module can carry out this process for you automatically using the server-group module's support for zero-downtime rolling deployments.
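The knobs that control this behavior are ordinary module inputs (see the Reference section below). For example, a hypothetical configuration that replaces two servers at a time and logs verbosely might look like this:

   module "confluent_tools_cluster" {
     # ... other arguments as shown in the earlier sketch ...

     # Replace 2 servers at a time during a rolling deployment
     deployment_batch_size = 2

     # Log at DEBUG level while troubleshooting the rolling deploy script
     script_log_level = "DEBUG"
   }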

Data backup

Because Schema Registry and REST Proxy store all data in Kafka itself, no backup strategy is needed beyond what's already in place for Kafka.

Connecting to the Confluent Tools

Once you've used this module to deploy the Confluent tools, you'll want to connect to them via their RESTful APIs. You can simply use curl, or look for a more sophisticated client in your programming language of choice that offers a more convenient interface and smarter retry and error handling.

Because these services may be listening for inbound requests over HTTPS, the preferred way to address the services is by their DNS names as specified in var.dns_names or var.dns_name_common_portion.

Note that these DNS names will resolve to the static IP address of an Elastic Network Interface (ENI), not the ephemeral IP address assigned to an EC2 Instance at boot. That means you can rely on the DNS names you select rarely, if ever, changing.
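For example, enabling ENIs with static DNS names might look like the following sketch; the hosted zone ID and domain are placeholders, and all four inputs are described in the Reference section below.

   module "confluent_tools_cluster" {
     # ... other arguments as shown in the earlier sketch ...

     # Attach an ENI to each node and create 0.confluent.acme.com, 1.confluent.acme.com, etc.
     attach_eni              = true
     dns_name_common_portion = "confluent.acme.com"   # placeholder domain
     route53_hosted_zone_id  = "Z1234567890ABC"       # placeholder Hosted Zone ID
     dns_ttl                 = 300
   }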

TLS/SSL and Security

All tools in the Confluent stack -- Kafka, Kafka Connect, Schema Registry, REST Proxy -- support SSL/TLS connections so that you can encrypt data in motion. But the interconnections between these tools can be confusing. Here is a general guide to the kinds of security settings that are possible for each tool, as well as whether the Gruntwork modules support a given security feature:

  • Kafka

    • Accept inbound connections from clients over SSL/TLS (supported)
    • Communicate between Kafka brokers over SSL/TLS (supported)
    • Authenticate users using SSL or SASL (not yet supported)
    • Limit access to Kafka using SASL or ACLs (not yet supported)
  • Schema Registry

    • Accept inbound connections from clients over SSL/TLS (supported)
    • Connect to Kafka brokers via SSL/TLS (supported)
    • Authenticate to Kafka via SSL/SASL (not yet supported)
  • REST Proxy

    • Accept inbound connections from clients over SSL/TLS (supported)
    • Connect to Kafka brokers via SSL/TLS (supported)
    • Connect to Schema Registry via SSL/TLS (supported)
    • Authenticate to Kafka via SSL/SASL (not yet supported)
  • Kafka Connect

    • Accept inbound connections from clients over SSL/TLS (supported)
    • Connect to Kafka brokers via SSL/TLS (supported)
    • Authenticate with SSL/SASL (not yet supported)
    • Limit access using ACLs (not yet supported)
    • Authenticate to Kafka via SSL/SASL (not yet supported)

In general, we try to enable the ability to connect to any service via SSL/TLS, but we do not yet support authentication and authorization. If it becomes apparent these features are in high demand, we will happily add them!

Note that you must generate TLS/SSL certificates separately for the Kafka and Confluent Tools clusters as described in the kafka-ami and confluent-oss-ami READMEs, respectively.

Reference

Required

allowed_inbound_cidr_blocks (list(string), required)

A list of CIDR-formatted IP address ranges that will be allowed to connect to Schema Registry and REST Proxy.

allowed_inbound_security_group_ids (list(string), required)

A list of security group IDs that will be allowed to connect to the Confluent Tools cluster (Schema Registry and REST Proxy).

ami_id (string, required)

The ID of the AMI to run in this cluster. Should be an AMI that has the Confluent Tools installed by the install-confluent-tool module.

aws_region (string, required)

The AWS region to deploy into.

cluster_name (string, required)

The name of the Confluent Tools cluster (e.g. confluent-tools-stage). This variable is used to namespace all resources created by this module.

cluster_size (number, required)

The number of nodes to have in the cluster.

The ID of the Security Group associated with the ELB that fronts the Confluent Tools cluster.

instance_type (string, required)

The type of EC2 Instances to run for each node in the cluster (e.g. t2.micro).

The number of security group IDs in allowed_inbound_security_group_ids. We should be able to compute this automatically, but due to a Terraform limitation, we can't: https://github.com/hashicorp/terraform/issues/14677#issuecomment-302772685

subnet_ids (list(string), required)

The subnet IDs into which the EC2 Instances should be deployed. You should typically pass in one subnet ID per node in the cluster_size variable. We strongly recommend that you run the Confluent tools in private subnets.

user_data (string, required)

A User Data script to execute while the server is booting. We recommend passing in a bash script that executes the run-kafka-rest and run-schema-registry scripts, which should have been installed in the AMI with gruntwork-install.

vpc_id (string, required)

The ID of the VPC in which to deploy the cluster.

Optional

additional_security_group_ids (list(string), optional; default: [])

A list of Security Group IDs that should be added to the Auto Scaling Group's Launch Configuration used to launch the Confluent Tools cluster EC2 Instances.

allowed_ssh_cidr_blocks (list(string), optional; default: [])

A list of CIDR-formatted IP address ranges from which the EC2 Instances will allow SSH connections.

allowed_ssh_security_group_ids (list(string), optional; default: [])

A list of security group IDs from which the EC2 Instances will allow SSH connections.

If set to true, associate a public IP address with each EC2 Instance in the cluster. We strongly recommend against making these nodes publicly accessible.

Default: false

attach_eni (bool, optional; default: false)

Set to true to attach an Elastic Network Interface (ENI) to each server. This gives each server an IP address that will remain static, even if the underlying servers are replaced.

custom_tags (map(string), optional; default: {})

Custom tags to apply to the Confluent Tools nodes and all related resources (i.e., security groups, EBS Volumes, ENIs).

deployment_batch_size (number, optional; default: 1)

How many servers to deploy at a time during a rolling deployment. For example, if you have 10 servers and set this variable to 2, then the deployment will a) undeploy 2 servers, b) deploy 2 replacement servers, c) repeat the process for the next 2 servers.

dns_name_common_portion (string, optional; default: null)

The common portion of the DNS name to assign to each ENI in the Confluent Tools server group. For example, if set to confluent.acme.com, this module will create DNS records 0.confluent.acme.com, 1.confluent.acme.com, etc. Note that this value must be a valid record name for the Route 53 Hosted Zone ID specified in route53_hosted_zone_id.

dns_names (list(string), optional; default: [])

A list of DNS names to assign to the ENIs in the Confluent Tools server group. Make sure the list has n entries, where n = cluster_size. If this var is specified, it will override dns_name_common_portion. Example: [0.acme.com, 1.acme.com, 2.acme.com]. Note that the list entries must be valid records for the Route 53 Hosted Zone ID specified in route53_hosted_zone_id.

dns_ttl (number, optional; default: 300)

The TTL (Time to Live) to apply to any DNS records created by this module.

elb_names (list(string), optional; default: [])

A list of Elastic Load Balancer (ELB) names to associate with the Confluent Tools nodes. We recommend using an ELB for health checks. If you're using an Application Load Balancer (ALB), use target_group_arns instead.

Enable detailed CloudWatch monitoring for the servers. This gives you more granularity with your CloudWatch metrics, but also costs more money.

Default: false

enable_elastic_ips (bool, optional; default: false)

If true, create an Elastic IP Address for each ENI and associate it with the ENI.

enabled_metrics (list(string), optional; default: [])

A list of metrics the ASG should enable for monitoring all instances in a group. The allowed values are GroupMinSize, GroupMaxSize, GroupDesiredCapacity, GroupInServiceInstances, GroupPendingInstances, GroupStandbyInstances, GroupTerminatingInstances, GroupTotalInstances.

Example:

   enabled_metrics = [
     "GroupDesiredCapacity",
     "GroupInServiceInstances",
     "GroupMaxSize",
     "GroupMinSize",
     "GroupPendingInstances",
     "GroupStandbyInstances",
     "GroupTerminatingInstances",
     "GroupTotalInstances"
   ]

Time, in seconds, after an instance comes into service before checking health.

Default: 300

health_check_type (string, optional; default: "EC2")

Controls how health checking is done. Must be one of EC2 or ELB.

The port number on which health-checker (https://github.com/gruntwork-io/health-checker) accepts inbound HTTP connections. This is the port the ELB Health Check will actually use. Specify null to disable this Security Group rule.

Default: 5500

ports (list(object(…)), optional)

The port numbers that will be open on the server cluster from the given allowed_inbound_cidr_blocks or allowed_inbound_security_group_ids. Expects a list of maps, where each map has the keys 'port' and 'description', which correspond to the port to be opened and the description to be added to the Security Group Rule, respectively.

Type:

   list(object({
     port        = number
     description = string
   }))

Default:

   [
     {
       description = "Confluent Schema Registry"
       port        = 8081
     },
     {
       description = "Confluent REST Proxy"
       port        = 8082
     },
     {
       description = "Kafka Connect worker"
       port        = 8083
     }
   ]

Whether the root volume should be destroyed on instance termination.

Default: true

If true, the launched EC2 instance will be EBS-optimized.

Default: false

root_volume_size (number, optional; default: 50)

The size, in GB, of the root EBS volume.

root_volume_type (string, optional; default: "gp2")

The type of volume. Must be one of: standard, gp2, or io1.

route53_hosted_zone_id (string, optional; default: null)

The ID of the Route 53 Hosted Zone in which we will create the DNS records specified by dns_names. Must be non-empty if dns_name_common_portion or dns_names is non-empty.

script_log_level (string, optional; default: "INFO")

The log level to use with the rolling deploy script. It can be useful to set this to DEBUG when troubleshooting the script.

skip_health_check (bool, optional; default: false)

If set to true, skip the health check and start a rolling deployment without waiting for the server group to be in a healthy state. This is primarily useful if the server group is in a broken state and you want to force a deployment anyway.

If set to true, skip the rolling deployment, and destroy all the servers immediately. You should typically NOT enable this in prod, as it will cause downtime! The main use case for this flag is to make testing and cleanup easier. It can also be handy in case the rolling deployment code has a bug.

Default: false

ssh_key_name (string, optional; default: null)

The name of an EC2 Key Pair that can be used to SSH to the EC2 Instances in this cluster. Set to an empty string to not associate a Key Pair.

ssh_port (number, optional; default: 22)

The port used for SSH connections.

target_group_arns (list(string), optional; default: [])

A list of target group ARNs of Application Load Balancer (ALB) targets to associate with the Confluent Tools nodes. We recommend using an ELB for health checks. If you're using an Elastic Load Balancer (AKA ELB Classic), use elb_names instead.

tenancy (string, optional; default: "default")

The tenancy of the instance. Must be one of: default or dedicated.

A maximum duration that Terraform should wait for ASG instances to be healthy before timing out. Setting this to '0' causes Terraform to skip all Capacity Waiting behavior.

"10m"