Lab Exercise #c - Orchestration Cluster Operations and Resilience

Let's work through a few more exercises to better understand Camunda 8 orchestration cluster operations.

Q. The current orchestration cluster topology has 3 broker nodes with 3 partitions.
Cluster topology is one of the key decisions an architect makes, taking the non-functional requirements into account.
Visit Landing Page -> Topology to see the current topology and partition distribution.

Let's plan to increase the partition count to 5.

  • Note: The partition count can only be increased. It cannot be reduced to a lower number.

# Start in GCP Cloud Shell
export UNAMESPACE=$UNAMESPACE

gcloud container clusters get-credentials c8-labs-sravana --region asia-south2 --project c8-labs
# run this in background
kubectl port-forward "svc/$UNAMESPACE-zeebe-gateway" 9600:9600 -n $UNAMESPACE &

# You may dry run first to preview the partition distribution by appending ?dryRun=true to the request
curl -X 'PATCH' \
  'http://localhost:9600/orchestration/actuator/cluster' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{ "partitions": { "count": 5, "replicationFactor": 3 } }'

# Wait for a minute or so. Check the status using below.
# When complete, the lastChange status will show "COMPLETED" (this appears in the last two lines of the output json)
curl --request GET 'http://localhost:9600/orchestration/actuator/cluster'

# Lastly, let's stop the port forward
pkill -f "kubectl port-forward"
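
Instead of re-running the status request by hand, the wait can be scripted. The following is a minimal sketch, not part of the official lab steps. The helper names last_change_status and wait_for_scaling are hypothetical, and the sketch assumes the response contains a lastChange object whose status field reads COMPLETED once scaling has finished, as described in the comments above. It has to run while the port forward is still active, that is, before the pkill step.

```shell
# Hypothetical helper: read the cluster JSON on stdin and print lastChange.status.
# Assumes "status" sits inside the "lastChange" object, before any nested object.
last_change_status() {
  tr -d '\n' |
    sed -n 's/.*"lastChange"[^}]*"status"[[:space:]]*:[[:space:]]*"\([A-Z_]*\)".*/\1/p'
}

# Hypothetical helper: poll the endpoint until lastChange.status is COMPLETED.
# Arguments: URL (defaults to the lab endpoint), attempts (default 30),
# delay between attempts in seconds (default 10).
wait_for_scaling() {
  local url="${1:-http://localhost:9600/orchestration/actuator/cluster}"
  local attempts="${2:-30}"
  local delay="${3:-10}"
  local i=1 status
  while [ "$i" -le "$attempts" ]; do
    # An unreachable endpoint or an unexpected payload yields an empty status.
    status="$(curl --silent --request GET "$url" | last_change_status || true)"
    echo "attempt $i: lastChange status = ${status:-unknown}"
    if [ "$status" = "COMPLETED" ]; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

Calling wait_for_scaling with no arguments polls the lab endpoint every 10 seconds, up to 30 times, and returns a non-zero exit code if COMPLETED never shows up.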

Visit Landing Page -> Topology again to see the new topology and partition distribution.

Q. Resilient platform operations are one of the key ingredients of the Camunda 8 platform architecture.

General IT operations view
  • In the tech operations world, a single node failing is a normal Tuesday, but two nodes failing at the very same time is incredibly rare.
  • Statistically speaking, this is because independent failures follow a rule of multiplication.
    If the chance of one node crashing today is 1 in 1,000, the chance of two specific nodes crashing on the exact same day is 1 in 1,000,000 (one in a million).
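
The multiplication rule can be checked with a line of shell arithmetic. This only illustrates the 1 in 1,000 example. joint_odds is a hypothetical helper name, and the result holds only if the two failures are truly independent: nodes that share a zone, a rack, or a faulty rollout can fail together far more often.

```shell
# Hypothetical helper: given "1 in A" and "1 in B" odds for two independent
# failures, print N such that both happening together is "1 in N".
joint_odds() {
  echo $(( $1 * $2 ))
}

echo "Both nodes failing on the same day: 1 in $(joint_odds 1000 1000)"
```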
Let's do some work to disrupt the broker nodes and see the resiliency in action.
  • Option #1. Evict the zeebe-0 pod (or any other zeebe pod) and check the outcome.
  • Option #2. Simulate a hardware failure. Evict a cluster node (VM node) from the Kubernetes cluster. This can be done during live sessions only (not applicable to the self-paced subscription).

# Start in GCP Cloud Shell
gcloud config set project c8-labs
export UNAMESPACE=$UNAMESPACE

gcloud container clusters get-credentials c8-labs-sravana --region asia-south2 --project c8-labs
# Evict the pod from Project Lens or via kubectl command
export POD_NAME=$(kubectl get pods -n $UNAMESPACE | grep zeebe-0 | cut -f 1 -d " ")
# Let it be re-added by the system. It may take a while.
kubectl delete pod -n $UNAMESPACE $POD_NAME
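
To follow the recovery from the command line, the pod's Ready condition can be polled until it returns to True. This is a sketch only. wait_for_pod_ready is a hypothetical helper, and it assumes the replacement pod comes back under the same name as $POD_NAME, which is how StatefulSet-managed pods behave; adjust it if your deployment differs.

```shell
# Hypothetical helper: poll a pod until its Ready condition is "True".
# Arguments: namespace, pod name, attempts (default 60),
# delay between attempts in seconds (default 5).
wait_for_pod_ready() {
  local ns="$1" pod="$2"
  local attempts="${3:-60}" delay="${4:-5}"
  local i=1 ready
  while [ "$i" -le "$attempts" ]; do
    # Prints True or False; stays empty while the replacement pod does not exist yet.
    ready="$(kubectl get pod "$pod" -n "$ns" \
      -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null || true)"
    echo "attempt $i: $pod Ready=${ready:-unknown}"
    if [ "$ready" = "True" ]; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

For example, wait_for_pod_ready "$UNAMESPACE" "$POD_NAME" checks every 5 seconds for up to 5 minutes and returns a non-zero exit code if the pod is still not Ready.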

Pick an earlier failed process instance in Operate and retry it.
Are you able to continue with transaction(s) while the rejig is happening in the background?


## Option 2 - Run by the Administrator
# Start in GCP Cloud Shell
gcloud config set project c8-labs
export UNAMESPACE=$UNAMESPACE

gcloud compute instances delete vm-node-name --zone=zone-id
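
The vm-node-name and zone-id values above are placeholders that the administrator has to fill in. One way to look them up is sketched below, under stated assumptions: node_of_pod and zone_of_node are hypothetical helpers, and the zone is read from the standard topology.kubernetes.io/zone node label, which is assumed to be set on the cluster nodes.

```shell
# Hypothetical helper: print the name of the node a pod is scheduled on.
# Arguments: namespace, pod name.
node_of_pod() {
  kubectl get pod "$2" -n "$1" -o jsonpath='{.spec.nodeName}'
}

# Hypothetical helper: print the zone of a node, read from the
# topology.kubernetes.io/zone label (assumed to be present).
# Argument: node name.
zone_of_node() {
  kubectl get node "$1" -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}'
}
```

For example, NODE=$(node_of_pod "$UNAMESPACE" "$POD_NAME") followed by ZONE=$(zone_of_node "$NODE") gives candidate values for the two placeholders. On GKE the node name normally matches the Compute Engine instance name, but verify both values before deleting anything.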

Pick an earlier failed process instance in Operate and retry it.
Are you able to continue with transaction(s) while the rejig is happening in the background?