Something that consistently comes up in my imaginary yak shaving is distributed systems, and distributed systems often need a foundation of distributed consensus. So, I'm now running Apache ZooKeeper on 5 nodes. Here are some notes on this.
Of course this is a toy deployment without particular performance guarantees. ZooKeeper is often deployed in a single data center or tightly-connected ones and can supposedly process thousands of transactions per second.
Many people may not be familiar with Apache ZooKeeper. It's a distributed database with an extreme focus on consistency at the expense of everything else. It runs on a cluster of several computers (nodes), and write commands are not acknowledged until they have been durably committed to a majority (more than 50%) of nodes. Since any two majorities have at least one node in common, this means that after any arbitrary sequence of node crashes and restarts, as long as a majority of nodes are online and connected, at least one node has the latest data and can replicate it to the others.
If fewer than a majority of nodes are online, or a group of nodes is disconnected from the rest, the minority takes itself offline and the database is down until this is fixed.
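To make the quorum arithmetic concrete: with 5 nodes, any 3 form a majority, so the cluster keeps working with up to 2 nodes down and stops entirely with 3 down. For reference, a minimal sketch of how a 5-node ensemble is declared in each server's zoo.cfg (hostnames are placeholders, not my real dn42 names; 2888/3888 are the default quorum and leader-election ports):

# same server list on every node; quorum = floor(5/2)+1 = 3
server.1=zk1.example.dn42:2888:3888
server.2=zk2.example.dn42:2888:3888
server.3=zk3.example.dn42:2888:3888
server.4=zk4.example.dn42:2888:3888
server.5=zk5.example.dn42:2888:3888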
ZooKeeper is typically used as a foundation to store a relatively small amount of data to assist a faster system. For example, Apache Kafka uses it to remember which Kafka nodes exist and which data segments are stored on which nodes.
I want to run ZooKeeper across my dn42 network as a base for future distributed systems projects. If possible I would like to run it directly on dn42. If that's not possible because of security concerns, I'll set up a separate VPN layer for ZooKeeper.
If ZooKeeper runs on dn42, I want to be able to give other people access to the same cluster in the future.
Performance expectations for the toy cluster are low - one write every second would be acceptable. Reads should be faster since they are served locally without waiting for cluster-wide consensus.
I'd like the cluster to be available 100% of the time, but nothing critical will break if it's down.
Unauthorized reads and writes and remote code execution definitely mustn't be possible. If ZooKeeper can't guarantee this by itself, then it has to run on a separate VPN.
A typical ZooKeeper deployment, with all nodes in the same data center, has clients connect to servers chosen at random or round-robin from the whole cluster. Since this one is geographically distributed, it makes sense to connect to the closest node, so I created an anycast address. Currently, nodes aren't removed from the anycast rotation if they crash.
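The anycast part is nothing ZooKeeper-specific: every node carries the same service address and the existing dn42 routing announces it, so each client reaches whichever node is topologically closest. A rough sketch of the per-node setup, assuming the usual dummy-interface approach (the address is a made-up placeholder, not my real prefix):

# add the shared service address on every ZooKeeper node; the node's existing
# routing daemon then originates the /32 into dn42 like any other local address
ip link add zkanycast type dummy
ip addr add 172.22.99.53/32 dev zkanycast
ip link set zkanycast up

Since nothing withdraws the route when ZooKeeper itself dies, a client whose nearest node is dead won't be redirected automatically.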
I don't have to worry about ZK traffic crossing through other dn42 networks. All border routers reject routes to my address space from outside my network, even if the path begins from my ASN. This is part of the standard dn42 configuration.
Because my network is dynamically routed, someone who compromises any router could intercept all ZK traffic. This is not currently a concern. If it ever became one, it would have to be fixed at the network layer, not the ZK layer.
ZK servers will authenticate each other with mutual TLS. This is because they'll be open to connections from all of dn42. It may be possible to firewall the server-to-server communications separately from server-to-client, but I'm not familiar enough with ZooKeeper to trust that I've done it correctly.
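For reference, the quorum-side TLS settings in zoo.cfg look roughly like this (a sketch; the paths and throwaway password match the test keystores generated further down, not a production setup):

# mutual TLS between ensemble members
sslQuorum=true
ssl.quorum.keyStore.location=/etc/zookeeper/zks_1_quorum.jks
ssl.quorum.keyStore.password=password
ssl.quorum.trustStore.location=/etc/zookeeper/zktrust_quorum.jks
ssl.quorum.trustStore.password=password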
ZooKeeper appears completely unsafe against resource-consumption denial of service. This is something to live with when exposing it to dn42. There's a rate limit on new connections, but that's no use against Slowloris, spamming auth packets to fill up the server's memory with credentials (it stores all of them), or plain old request spam. It would probably be wise to limit ZK's resource usage and keep track of who has been given a client certificate.
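Concretely, the plan is to cap what ZK can consume rather than trying to prevent the abuse. A sketch of the two layers I have in mind (numbers are arbitrary): ZooKeeper's own per-client-IP connection cap in zoo.cfg, plus an OS-level ceiling, for example via a systemd drop-in for the service.

# zoo.cfg: limit concurrent connections from a single client IP
maxClientCnxns=10

# systemd drop-in (e.g. zookeeper.service.d/limits.conf): cap the whole process
[Service]
MemoryMax=512M
CPUQuota=50%
TasksMax=256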
A ZK cluster always remains consistent, even with servers arbitrarily crashing, but it might not be able to process requests. Luckily, nothing very bad will happen if this one goes down due to someone causing all nodes to hit their memory limits.
ZooKeeper is written in Java, so buffer overflows are very unlikely. There have still been RCE vulnerabilities in Java programs in the past, which is another reason to put ZK in a Docker-like container. I don't think one is likely in ZK itself - it just doesn't have much attack surface.
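A minimal sketch of that container, assuming Docker and the official zookeeper image with its /conf and /data conventions (the image tag, host paths and published port are assumptions, not my actual setup):

# run ZK with hard resource caps; config and data stay on the host
docker run -d --name zk \
    --memory=512m --cpus=1 \
    --security-opt no-new-privileges \
    -v /srv/zookeeper/conf:/conf \
    -v /srv/zookeeper/data:/data \
    -p 2281:2281 \
    zookeeper:3.9

The published 2281 assumes secureClientPort=2281 in the mounted zoo.cfg (see the TLS configuration below).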
As part of Kerberos authentication, arbitrary commands can be specified in the configuration file, then executed. This isn't a concern for a static configuration file. If dynamic configuration is enabled, anyone who can edit the dynamic configuration might be able to execute arbitrary commands.
The most worrying possibility with a public ZK cluster is some authentication bypass. I reviewed the code. Connection establishment takes place in phases: first a TCP connection is established, then (or at any time subsequently) the client sends an auth packet to establish their authentication parameters according to some method supported by the server, then it can send a createSession packet which allocates resources across the whole cluster, to allow the client to reconnect to a different node later on.
Two authentication methods appeal to me: IP address, and TLS client certificate. The IP address authentication method allows anyone to authenticate with a username equal to their IP address - so the ACLs on individual nodes are the only thing preventing an unknown client from accessing data (which is fine if ACLs are set correctly). The client certificate (X.509) authentication method denies authentication unless the client holds a trusted client certificate. The enforce.auth.enabled and enforce.auth.schemes options can be set to deny access to clients who don't authenticate using a specific method.
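In zoo.cfg terms, the client-facing side then looks roughly like this (a sketch; the port, paths and password are placeholders matching the test keystores below). The last two lines are what turn "authenticate if you feel like it" into "x509 or nothing":

# TLS listener for clients, with certificate-based authentication required
secureClientPort=2281
serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory
ssl.keyStore.location=/etc/zookeeper/zks_1_server.jks
ssl.keyStore.password=password
ssl.trustStore.location=/etc/zookeeper/zktrust_client.jks
ssl.trustStore.password=password
authProvider.x509=org.apache.zookeeper.server.auth.X509AuthenticationProvider
enforce.auth.enabled=true
enforce.auth.schemes=x509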
In my code review I found a case where ACLs are apparently not checked when they should be - so client certificate authentication it is. I reported it to the ZooKeeper security contact.
Who has access to which nodes? Each data node (znode) has an ACL which specifies exactly who gets each of five permissions: read, write, create [children], delete [children], and admin (change ACL/quota). Only the ACL of the node being accessed is checked - the hierarchy is irrelevant.
Security principals are identified by the name of their authentication scheme, and a scheme-specific component. For X.509 certificate authentication, the scheme-specific component is the string representation of the certificate's dname. In our case, only the CN field is used, so the principal identifier is something like x509:CN=test.
The default nodes /, /zookeeper, and /zookeeper/quota are open to anyone. Their ACLs were changed from world:anyone:rwcda to x509:CN=admin:rwcda as part of the cluster's initial configuration.
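Roughly, that one-time lockdown from the ZooKeeper CLI, connected with the admin certificate, looked like this:

# restrict the default nodes to the admin certificate's principal
setAcl / x509:CN=admin:rwcda
setAcl /zookeeper x509:CN=admin:rwcda
setAcl /zookeeper/quota x509:CN=admin:rwcda
# verify
getAcl /zookeeper/quota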
The default node /zookeeper/config has the default ACL world:anyone:r. After checking that it doesn't contain any sensitive data, it's left alone, since it lets clients discover the other servers in the ensemble besides the one they connected to.
A local process will trust any remote process that proves control of the private key for a certificate signed by any certificate in the local process's trust store. Three separate 'trust groups' of certificates are in play:
1. certificates that servers present to other servers (quorum traffic, signed by quorumRoot);
2. certificates that clients present to servers (signed by clientRoot);
3. certificates that servers present to clients (signed by serverRoot).
It's possible that 1 and 3 could be the same group (servers could use the same certificates when talking to clients as when talking to other servers) but it makes sense to keep them separate to allow them to have different key rotation policies.
Creating and signing the certificates is simple in concept but fiddly in practice. It took a few tries because some uses of a certificate require particular extension flags (KeyUsage, ExtendedKeyUsage, BasicConstraints) to be set. Here are the commands I used for the test system, in case they help other people:
# generate root keys
keytool -genkeypair -keystore zk_root.jks -storepass password -alias quorumroot -keyalg ed25519 -dname "cn=quorumRoot, ou=zookeeper, o=immibis_dn42, c=42" -validity 999999 -ext KeyUsage=digitalSignature,keyCertSign -ext BasicConstraints=ca:true,PathLen:2
keytool -genkeypair -keystore zk_root.jks -storepass password -alias serverroot -keyalg ed25519 -dname "cn=serverRoot, ou=zookeeper, o=immibis_dn42, c=42" -validity 999999 -ext KeyUsage=digitalSignature,keyCertSign -ext BasicConstraints=ca:true,PathLen:2
keytool -genkeypair -keystore zk_root.jks -storepass password -alias clientroot -keyalg ed25519 -dname "cn=clientRoot, ou=zookeeper, o=immibis_dn42, c=42" -validity 999999 -ext KeyUsage=digitalSignature,keyCertSign -ext BasicConstraints=ca:true,PathLen:2

# generate trust store files
keytool -exportcert -keystore zk_root.jks -storepass password -alias quorumroot | keytool -importcert -keystore zktrust_quorum.jks -storepass password -noprompt
keytool -exportcert -keystore zk_root.jks -storepass password -alias clientroot | keytool -importcert -keystore zktrust_client.jks -storepass password -noprompt
keytool -exportcert -keystore zk_root.jks -storepass password -alias serverroot | keytool -importcert -keystore zktrust_server.jks -storepass password -noprompt

# generate per-server keys
for SERVERID in 1 2; do
    keytool -genkeypair -keystore zks_${SERVERID}_quorum.jks -storepass password -alias quorum -keyalg ed25519 -dname "cn=${SERVERID}_q, ou=zookeeper, o=immibis_dn42, c=42" -validity 999999
    keytool -genkeypair -keystore zks_${SERVERID}_server.jks -storepass password -alias server -keyalg ed25519 -dname "cn=$SERVERID, ou=zookeeper, o=immibis_dn42, c=42" -validity 999999
done

# sign per-server keys with root, and import root certs into per-server keystore (needed to establish chain?)
# Separate keystore per key type, as expected by zookeeper. Supposedly, javax.net.ssl.X509KeyManager
# will pick a key from the keystore based on negotiation.
for SERVERID in 1 2; do
    for whichcert in quorum server; do
        keytool -exportcert -keystore zk_root.jks -storepass password -alias ${whichcert}root | \
            keytool -importcert -keystore zks_${SERVERID}_${whichcert}.jks -storepass password -alias ${whichcert}root -noprompt
        # clientAuth flag not needed for server cert - TODO
        keytool -certreq -keystore zks_${SERVERID}_${whichcert}.jks -storepass password -alias $whichcert | \
            keytool -gencert -keystore zk_root.jks -storepass password -keyalg ed25519 -alias ${whichcert}root -ext KeyUsage=digitalSignature,dataEncipherment,keyEncipherment,keyAgreement -ext ExtendedKeyUsage=serverAuth,clientAuth -ext SubjectAlternativeName:c=DNS:localhost,IP:127.0.0.1 | \
            keytool -importcert -keystore zks_${SERVERID}_${whichcert}.jks -storepass password -alias $whichcert
    done
done

# generate per-client keys
for CLIENTID in test; do
    keytool -exportcert -keystore zk_root.jks -storepass password -alias clientroot | \
        keytool -importcert -keystore zkc_${CLIENTID}.jks -storepass password -alias clientroot -noprompt
    keytool -genkeypair -keystore zkc_$CLIENTID.jks -storepass password -alias mycert -keyalg ed25519 -dname "cn=$CLIENTID" -validity 999999 -ext KeyUsage=digitalSignature,dataEncipherment,keyEncipherment,keyAgreement -ext ExtendedKeyUsage=clientAuth -ext SubjectAlternativeName:c=DNS:localhost,IP:127.0.0.1
    keytool -certreq -keystore zkc_${CLIENTID}.jks -storepass password -alias mycert | \
        keytool -gencert -keystore zk_root.jks -storepass password -keyalg ed25519 -alias clientroot | \
        keytool -importcert -keystore zkc_${CLIENTID}.jks -storepass password -alias mycert
done
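A quick sanity check after all of that is to list one of the keystores and confirm the signed certificate chains back to the intended root:

# the leaf's issuer should match the corresponding *Root dname from zk_root.jks
keytool -list -v -keystore zks_1_server.jks -storepass password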
The ZooKeeper CLI needs a lot of parameters for TLS and has some issues connecting to an IPv6 address. Netty mode is required for TLS, but it tries to connect to the IPv6 address with an IPv4 socket. Setting io.netty.transport.noNative=true disables the broken code path and falls back to one that works.
This command works for me:
java -Dio.netty.transport.noNative=true \
    -Dzookeeper.ssl.trustStore.location=zktrust_server.jks \
    -Dzookeeper.ssl.keyStore.location=zkc_test.jks \
    -Dzookeeper.ssl.trustStore.password=password \
    -Dzookeeper.ssl.keyStore.password=password \
    -Dzookeeper.clientCnxnSocket=org.apache.zookeeper.ClientCnxnSocketNetty \
    -Dzookeeper.client.secure=true \
    -cp 'lib/*' org.apache.zookeeper.ZooKeeperMain -server localhost:2181