Lab Exercise #c - Orchestration Cluster Operations and Resilience
Let's work through a few more exercises to better understand Camunda 8 orchestration cluster operations.
Q. The current orchestration cluster topology has 3 broker nodes with 3 partitions.
Cluster topology is one of the key decisions an architect makes, taking the non-functional requirements into account.
Visit Landing Page -> Topology to see the current topology and partition distribution.
Let's plan to increase the partition count to 5.
- Note: The partition count can only be increased. Once increased, it cannot be reduced to a lower number.
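Because the partition count cannot be lowered again, it is worth previewing the change before applying it. The sketch below wraps the dry run mentioned in the commands that follow. It assumes the port-forward to localhost:9600 from those commands is already running. The names `CLUSTER_URL`, `partition_payload`, and `dry_run_scale` are hypothetical helpers introduced here for illustration, not part of Camunda.

```shell
# Sketch: preview a partition scale-up without applying it.
# Assumption: the port-forward to localhost:9600 is already running.
CLUSTER_URL="${CLUSTER_URL:-http://localhost:9600/orchestration/actuator/cluster}"

# Build the JSON body used by the PATCH request.
partition_payload() {
  local count="$1" replication="$2"
  printf '{"partitions":{"count":%d,"replicationFactor":%d}}' "$count" "$replication"
}

# Send the request with ?dryRun=true appended so that nothing is changed.
dry_run_scale() {
  curl -s -X PATCH "${CLUSTER_URL}?dryRun=true" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d "$(partition_payload "$1" "$2")"
}

# Usage (5 partitions, replication factor 3):
#   dry_run_scale 5 3
```

If the previewed distribution looks right, run the real PATCH request without the `dryRun` parameter.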
# Start in GCP Cloud Shell
export UNAMESPACE=$UNAMESPACE
gcloud container clusters get-credentials c8-labs-sravana --region asia-south2 --project c8-labs
# Run the port-forward in the background
kubectl port-forward "svc/$UNAMESPACE-zeebe-gateway" 9600:9600 -n "$UNAMESPACE" &
# Optionally, do a dry run first to preview the partition distribution by appending ?dryRun=true to the request URL
curl -X 'PATCH' \
  'http://localhost:9600/orchestration/actuator/cluster' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "partitions": {
      "count": 5,
      "replicationFactor": 3
    }
  }'
# Wait a minute or so, then check the status with the request below.
# When the change is complete, the lastChange status shows "COMPLETED" (it appears in the last two lines of the output JSON)
curl --request GET 'http://localhost:9600/orchestration/actuator/cluster'
# Lastly, let's stop the port-forward
pkill -f "kubectl port-forward"
Visit Landing Page -> Topology again to see the new topology and partition distribution.
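Instead of re-running the status request by hand, the wait step can be scripted. This is a minimal sketch, not a definitive implementation. It assumes the response contains a `lastChange` object with a direct `status` field, as the comments above describe. The helpers `last_change_status` and `wait_for_scaling` are hypothetical names. The port-forward must still be running, so call it before the `pkill` step.

```shell
# Extract lastChange.status from the cluster JSON read on stdin.
# Assumption: "status" is a direct (non-nested) field of "lastChange".
last_change_status() {
  tr -d '\n' \
    | grep -o '"lastChange"[[:space:]]*:[[:space:]]*{[^}]*}' \
    | grep -o '"status"[[:space:]]*:[[:space:]]*"[^"]*"' \
    | sed 's/.*"\([^"]*\)"$/\1/'
}

# Poll the cluster endpoint until lastChange reports COMPLETED.
# Arguments: URL, number of attempts, delay in seconds between attempts.
wait_for_scaling() {
  local url="${1:-http://localhost:9600/orchestration/actuator/cluster}"
  local attempts="${2:-30}" delay="${3:-10}" status i
  for i in $(seq "$attempts"); do
    status=$(curl -s "$url" | last_change_status)
    echo "lastChange status: ${status:-unknown}"
    [ "$status" = "COMPLETED" ] && return 0
    sleep "$delay"
  done
  return 1
}

# Usage (defaults: 30 attempts, 10 seconds apart):
#   wait_for_scaling
```

The text-based parsing keeps the sketch free of extra tools. If `jq` is available in your shell, `jq -r '.lastChange.status'` is a sturdier way to read the same field.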
Q. Resilient platform operations are one of the key ingredients of the Camunda 8 platform architecture.
Are you able to continue with transaction(s) while the rejig is happening in the background?
General IT operations view
Let's do some work to disrupt the broker nodes and see the resiliency in action.
- In the tech operations world, a single node failing is a normal Tuesday, but two nodes failing at the very same time is incredibly rare.
- Statistically speaking, this is because independent failures follow a rule of multiplication. If the chance of one node crashing today is 1 in 1,000, the chance of two specific nodes crashing on the exact same day is 1 in 1,000,000 (one in a million).
- Option #1. Evict zeebe-0 pod (or any other zeebe pod) and check the outcome.
- Option #2. Simulate hardware failure. Evict a cluster node (VM node) from the Kubernetes cluster. This can be done during live sessions only. (not applicable for self-paced subscription)
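The rule of multiplication mentioned above can be checked with a few lines of shell arithmetic. The helper name `joint_odds` is made up for this illustration, and the calculation only holds under the stated assumption that the failures are independent.

```shell
# Odds, written as "1 in N", that several independent events all happen together.
# Each argument is the N of a "1 in N" chance.
joint_odds() {
  local n=1 x
  for x in "$@"; do
    n=$(( n * x ))
  done
  echo "1 in $n"
}

# Two specific nodes, each with a 1 in 1,000 chance of crashing today:
joint_odds 1000 1000   # prints "1 in 1000000"
```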
# Start in GCP Cloud Shell
gcloud config set project c8-labs
export UNAMESPACE=$UNAMESPACE
gcloud container clusters get-credentials c8-labs-sravana --region asia-south2 --project c8-labs
# Evict the pod from Project Lens or via the kubectl commands below
# Look up the full name of the zeebe-0 pod
export POD_NAME=$(kubectl get pods -n "$UNAMESPACE" | grep zeebe-0 | cut -f 1 -d " ")
# Delete the pod. The system re-adds it automatically; this may take a while.
kubectl delete pod -n "$UNAMESPACE" "$POD_NAME"
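To see when the evicted broker is back, you can wait for the pod to report Ready again rather than checking repeatedly. The sketch below assumes the pod is recreated under the same name, which is how StatefulSet pods such as zeebe-0 normally behave. `wait_for_pod_ready` is a hypothetical helper name. It retries because the replacement pod may not exist yet immediately after the delete.

```shell
# Wait until a pod reports Ready again after it was deleted.
# Arguments: pod name, namespace, number of attempts (default 30).
wait_for_pod_ready() {
  local pod="$1" ns="$2" attempts="${3:-30}" i
  for i in $(seq "$attempts"); do
    # kubectl wait fails right away if the pod does not exist yet, so retry.
    if kubectl wait --for=condition=Ready "pod/$pod" -n "$ns" --timeout=20s; then
      return 0
    fi
    sleep 2
  done
  return 1
}

# Usage:
#   wait_for_pod_ready "$POD_NAME" "$UNAMESPACE"
```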
Pick a previously failed process instance in Operate and retry it.
Are you able to continue with transaction(s) while the rejig is happening in the background?
## Option 2 - Run by the Administrator
# Start in GCP Cloud Shell
gcloud config set project c8-labs
export UNAMESPACE=$UNAMESPACE
# Replace vm-node-name and zone-id with the VM node and zone to remove
gcloud compute instances delete vm-node-name --zone=zone-id
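Before deleting a VM, it helps to know which Kubernetes node hosts which broker pod, so that the failure hits a broker. A minimal sketch follows, assuming cluster credentials are already configured as in Option 1. `pods_with_nodes` is a hypothetical helper name.

```shell
# List each pod in a namespace together with the node (VM) it runs on.
# Argument: namespace.
pods_with_nodes() {
  kubectl get pods -n "$1" \
    -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName
}

# Usage: show only the zeebe pods and their nodes
#   pods_with_nodes "$UNAMESPACE" | grep zeebe
```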
Pick a previously failed process instance in Operate and retry it.
Are you able to continue with transaction(s) while the rejig is happening in the background?