Lab Exercise #c - Orchestration Cluster Operations and Resilience

Let's work through a few more exercises to better understand how a Camunda 8 orchestration cluster is operated.

Q. The current orchestration cluster topology has 3 broker nodes with 3 partitions.
Cluster topology is one of the key decisions an architect makes, taking the non-functional requirements into account.
Visit Landing Page -> Topology to see the current topology and partition distribution.

Let's plan to increase the partition count to 5.

Note: The partition count can only be increased. Once increased, it cannot be reduced to a lower number.


# Start in GCP Cloud Shell
export UNAMESPACE=$UNAMESPACE


gcloud container clusters get-credentials c8-labs-sravana --region asia-south2 --project c8-labs


# Run the port-forward in the background
kubectl port-forward "svc/$UNAMESPACE-zeebe-gateway" 9600:9600 -n "$UNAMESPACE" &
# Optional: dry-run first to preview the partition distribution by appending ?dryRun=true to the request URL
curl -X 'PATCH' \
   'http://localhost:9600/orchestration/actuator/cluster' \
   -H 'accept: application/json' \
   -H 'Content-Type: application/json' \
   -d '{
        "partitions": {
          "count": 5,
          "replicationFactor": 3
        }
      }'
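To try the dry run mentioned in the comment above before sending the real request, the call can be wrapped in a small helper. This is only a sketch: `patch_partitions` and its `dry-run` argument are hypothetical names introduced here, while the URL, headers, and the `dryRun=true` parameter are the ones used in this exercise. It assumes the port-forward on 9600 is still running.

```shell
# patch_partitions COUNT REPLICATION_FACTOR [dry-run]
# Hypothetical helper: sends the same PATCH as above, optionally as a dry run.
patch_partitions() {
  local count=$1 replication_factor=$2 mode=${3:-}
  local url="http://localhost:9600/orchestration/actuator/cluster"

  # A dry run only previews the resulting partition distribution.
  if [ "$mode" = "dry-run" ]; then
    url="${url}?dryRun=true"
  fi

  curl -s -X PATCH "$url" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d "{\"partitions\": {\"count\": ${count}, \"replicationFactor\": ${replication_factor}}}"
}
```

Calling `patch_partitions 5 3 dry-run` previews the change, and `patch_partitions 5 3` applies it.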

# Wait a minute or so, then check the status with the request below.
# When the change is complete, the lastChange status shows "COMPLETED" (it appears in the last two lines of the output JSON)
curl --request GET 'http://localhost:9600/orchestration/actuator/cluster'
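Instead of re-running the status request by hand, the check can be polled. The sketch below is not part of the lab tooling: `change_completed` and `wait_until` are hypothetical helper names, it assumes `jq` is installed (it is in Cloud Shell), and it assumes the response exposes the status as `.lastChange.status` and an in-flight change as `.pendingChange`.

```shell
CLUSTER_URL="${CLUSTER_URL:-http://localhost:9600/orchestration/actuator/cluster}"

# Succeeds only when no change is pending and the last change is COMPLETED.
# Assumption: the JSON fields are named lastChange.status and pendingChange.
change_completed() {
  curl -s --request GET "$CLUSTER_URL" \
    | jq -e '(.pendingChange == null) and (.lastChange.status == "COMPLETED")' > /dev/null
}

# wait_until MAX_ATTEMPTS SLEEP_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is exhausted.
wait_until() {
  local max_attempts=$1 sleep_seconds=$2
  shift 2
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      echo "succeeded on attempt $attempt"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$sleep_seconds"
  done
  echo "gave up after $max_attempts attempts" >&2
  return 1
}
```

For example, `wait_until 30 10 change_completed` checks every 10 seconds for up to 5 minutes.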


# Lastly, let's stop the port-forward
pkill -f "kubectl port-forward"

Visit Landing Page -> Topology again to see the new topology and partition distribution.

Q. Resilient platform operations are one of the key ingredients of the Camunda 8 platform architecture.

    General IT operations view

    In the tech operations world, a single node failing is a normal Tuesday, but two nodes failing at the very same time is incredibly rare.

    Statistically speaking, this is because independent failures follow a rule of multiplication.

    If the chance of one node crashing today is 1 in 1,000, the chance of two specific nodes crashing on the exact same day is 1 in 1,000,000 (one in a million).
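The multiplication rule above can be checked with a one-line calculation. This is an illustrative sketch only: `joint_failure_odds` is a hypothetical helper, and it assumes the failures are fully independent and equally likely.

```shell
# joint_failure_odds ODDS_DENOMINATOR NODE_COUNT
# For independent failures with a chance of 1 in ODDS_DENOMINATOR each,
# prints the chance that NODE_COUNT specific nodes all fail on the same day.
joint_failure_odds() {
  awk -v d="$1" -v k="$2" 'BEGIN { printf "1 in %d\n", d ^ k }'
}

joint_failure_odds 1000 2   # prints: 1 in 1000000
```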

Let's do some work to disrupt the broker nodes and see the resiliency in action.

Option #1. Evict the zeebe-0 pod (or any other zeebe pod) and check the outcome.
Option #2. Simulate a hardware failure by evicting a cluster node (VM node) from the Kubernetes cluster. This can be done during live sessions only (not applicable to the self-paced subscription).


# Start in GCP Cloud Shell
gcloud config set project c8-labs
export UNAMESPACE=$UNAMESPACE


gcloud container clusters get-credentials c8-labs-sravana --region asia-south2 --project c8-labs


# Look up the full name of the zeebe-0 pod
export POD_NAME=$(kubectl get pods -n "$UNAMESPACE" | grep zeebe-0 | cut -f 1 -d " ")


# Evict the pod, either from Project Lens or via the kubectl command below.
# The system re-adds the pod afterwards. It may take a while.
kubectl delete pod -n "$UNAMESPACE" "$POD_NAME"
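To see when the broker is back, the replacement pod can be waited on rather than checked manually. The sketch below uses a hypothetical helper, `wait_for_pod_ready`, and assumes the pod is re-created under the same name (as is the case for pods managed by a StatefulSet).

```shell
# wait_for_pod_ready POD NAMESPACE [TIMEOUT]
# Assumption: the replacement pod comes back with the same name.
wait_for_pod_ready() {
  local pod=$1 namespace=$2 timeout=${3:-300s}
  local attempt=1

  # The replacement pod may not exist for a few seconds after the delete,
  # and kubectl wait fails on a missing resource, so poll for it first.
  until kubectl get pod "$pod" -n "$namespace" > /dev/null 2>&1; do
    if [ "$attempt" -ge 30 ]; then
      echo "pod $pod did not reappear" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep 2
  done

  # Block until the pod reports the Ready condition or the timeout expires.
  kubectl wait --for=condition=Ready "pod/$pod" -n "$namespace" --timeout="$timeout"
}
```

A typical call would be `wait_for_pod_ready "$POD_NAME" "$UNAMESPACE"`.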

In Operate, pick a process instance that failed earlier and retry it.
Are you able to continue with transaction(s) while the cluster recovers in the background?


## Option 2 - Run by the Administrator
# Start in GCP Cloud Shell
gcloud config set project c8-labs
export UNAMESPACE=$UNAMESPACE


# Replace vm-node-name and zone-id with the actual VM node name and its zone
gcloud compute instances delete vm-node-name --zone=zone-id
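The `vm-node-name` and `zone-id` placeholders can be looked up from the cluster itself. This sketch uses two hypothetical helpers, `node_of_pod` and `zone_of_node`. It assumes that the Kubernetes node name matches the Compute Engine instance name and that the node carries the standard `topology.kubernetes.io/zone` label.

```shell
# node_of_pod POD NAMESPACE: prints the Kubernetes node the pod runs on.
node_of_pod() {
  kubectl get pod "$1" -n "$2" -o jsonpath='{.spec.nodeName}'
}

# zone_of_node NODE: prints the node's zone from its topology label.
# Dots inside the label key are escaped so jsonpath treats it as one key.
zone_of_node() {
  kubectl get node "$1" -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}'
}
```

For example, `NODE=$(node_of_pod "$POD_NAME" "$UNAMESPACE")` followed by `ZONE=$(zone_of_node "$NODE")` yields values that can be passed to the `gcloud compute instances delete` command.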

In Operate, pick a process instance that failed earlier and retry it.
Are you able to continue with transaction(s) while the cluster recovers in the background?