## Cluster Operations
Here is the list of operations managed by CassKop at the cluster level, each of which has a dedicated status in each rack. These operations are applied at the Cassandra cluster level, as opposed to pod operations, which are executed at the pod level and are discussed in the next section.

Cluster operations must only be triggered by a change made on the `CassandraCluster` object.

Some updates to the `CassandraCluster` CRD object are forbidden and will be quietly dismissed by CassKop:
- `spec.dataCapacity`
- `spec.dataStorageClass`
Some updates to the `CassandraCluster` CRD object will trigger a rolling update of the whole cluster, such as:

- `spec.resources`
- `spec.baseImage`
- `spec.version`
- `spec.configMapName`
- `spec.runAsUser`
- `spec.fsGroup`
Some updates to the `CassandraCluster` CRD object will not trigger any change on the cluster but only affect the future behavior of CassKop:

- `spec.autoPilot`
- `spec.autoUpdateSeedList`
- `spec.deletePVC`
- `spec.hardAntiAffinity`
- `spec.rollingPartition`
- `spec.maxPodUnavailable`
- `noCheckStsAreEqual`
CassKop manages rolling updates for each statefulset in the cluster. Each statefulset then performs the rolling update of its pods according to the partition defined for it in `spec.topology.dc[].rack[].rollingPartition`.
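As a hedged illustration of this field (the path comes from the text above; the values are hypothetical), a per-rack partition can hold back part of a rack during a rolling update:

```yaml
topology:
  dc:
    - name: dc1
      rack:
        - name: rack1
          rollingPartition: 0   # update every pod in rack1
        - name: rack2
          rollingPartition: 1   # pods with ordinal < 1 in rack2 stay on the old spec until this is lowered
```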
### Initializing
The first operation required in a Cassandra cluster is the initialization. In this phase, CassKop creates the `CassandraCluster.Status` section with an entry for each DC/rack declared in the `CassandraCluster.spec.topology` section.

We may also see an Initializing status later on if we decide to add a DC to our topology.
#### With no topology defined

For this demo, we create a `CassandraCluster` without a topology section:
apiVersion: "db.orange.com/v2"
kind: "CassandraCluster"
metadata:
name: cassandra-demo
labels:
cluster: k8s.pic
spec:
nodesPerRacks: 2
baseImage: cassandra
version: latest
rollingPartition: 0
dataCapacity: "3Gi"
dataStorageClass: "local-storage"
hardAntiAffinity: false
deletePVC: true
autoPilot: false
config:
jvm-options:
log_gc: "true"
autoUpdateSeedList: true
resources:
requests:
cpu: '2'
memory: 2Gi
limits:
cpu: '2'
memory: 2Gi
If no topology has been specified, CassKop creates the default topology and status. The default topology added by CassKop is:
```yaml
...
topology:
  dc:
    - name: dc1
      rack:
        - name: rack1
```
`CassandraCluster.spec.nodesPerRacks` defines the number of Cassandra nodes CassKop must create in each of its racks. In our example, there is only one default rack, so CassKop will create only 2 nodes.

With the default topology there is no Kubernetes node affinity to spread the Cassandra nodes across the cluster: CassKop creates only one rack and one DC for Cassandra. This is not recommended, as you may lose data in case of hardware failure.
When initialization has ended, you should have a status similar to:
```yaml
status:
  cassandraRackStatus:
    dc1-rack1:
      cassandraLastAction:
        Name: Initializing
        endTime: 2018-09-18T15:10:51Z
        status: Done
      phase: Running
      podLastOperation: {}
  lastClusterAction: Initializing
  lastClusterActionStatus: Done
  phase: Running
  seedlist:
  - cassandra-demo-dc1-rack1-0.cassandra-demo-dc1-rack1.cassandra-test
  - cassandra-demo-dc1-rack1-1.cassandra-demo-dc1-rack1.cassandra-test
```
- The status of `dc1-rack1` is `Initializing=Done`.
- The status of the cluster is `Initializing=Done`.
- The phase is `Running`, which means that each rack has the desired number of nodes.
We asked for 2 `nodesPerRacks` and we have one default rack, so we end up with 2 Cassandra nodes in our cluster. The Cassandra seed list has been initialized and stored in `CassandraCluster.status.seedlist`. It has also been configured in each of the Cassandra pods.
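As a hedged check (assuming the image keeps its configuration in /etc/cassandra, which may differ in your build), you can verify the seed list inside a pod:

```bash
kubectl exec -ti cassandra-demo-dc1-rack1-0 -- grep -A4 "seed_provider" /etc/cassandra/cassandra.yaml
```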
We can also confirm that Cassandra knows about the DC and rack names we have deployed:
```bash
$ kubectl exec -ti cassandra-demo-dc1-rack1-0 nodetool status
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
UN  172.18.72.28   65.86 KiB  32      100.0%            fdc1a9e9-c5c3-4169-ae47-e6843efa096d  rack1
UN  172.18.120.12  65.86 KiB  32      100.0%            509ca725-fbf9-422f-a8e0-5e2a55474f70  rack1
```
#### With topology defined

In this example, we add a topology section defining 2 Cassandra DCs and 3 racks in total:
```yaml
apiVersion: "db.orange.com/v2"
kind: "CassandraCluster"
metadata:
  name: cassandra-demo
  labels:
    cluster: k8s.pic
spec:
  nodesPerRacks: 2
  baseImage: cassandra
  version: latest
  rollingPartition: 0
  dataCapacity: "3Gi"
  dataStorageClass: "local-storage"
  hardAntiAffinity: false
  deletePVC: true
  autoPilot: true
  autoUpdateSeedList: true
  resources:
    requests:
      cpu: '2'
      memory: 2Gi
    limits:
      cpu: '2'
      memory: 2Gi
  topology:
    dc:
      - name: dc1
        labels:
          failure-domain.beta.kubernetes.io/region: europe-west1
        rack:
          - name: rack1
            labels:
              failure-domain.beta.kubernetes.io/zone: europe-west1-b
          - name: rack2
            labels:
              failure-domain.beta.kubernetes.io/zone: europe-west1-c
      - name: dc2
        nodesPerRacks: 3
        config:
          cassandra-yaml:
            num_tokens: 32
        labels:
          failure-domain.beta.kubernetes.io/region: europe-west1
        rack:
          - name: rack1
            labels:
              failure-domain.beta.kubernetes.io/zone: europe-west1-d
```
This topology section also references Kubernetes node labels, which will be used to spread the Cassandra nodes of each rack onto different groups of Kubernetes servers. We can also set a specific number of pods for dc2 (`nodesPerRacks: 3`), and give its Cassandra pods a different `num_tokens` configuration using the appropriate parameter in its `config` section.
CassKop will create a statefulset for each rack and start building the Cassandra cluster, beginning with the nodes of the first rack. When CassKop finishes its operations on rack1, it processes the next rack, and so on. The status may be similar to:
```yaml
status:
  cassandraRackStatus:
    dc1-rack1:
      cassandraLastAction:
        Name: Initializing
        status: Ongoing
      phase: Initializing
      podLastOperation: {}
    dc1-rack2:
      cassandraLastAction:
        Name: Initializing
        status: Ongoing
      phase: Initializing
      podLastOperation: {}
    dc2-rack1:
      cassandraLastAction:
        Name: Initializing
        status: Ongoing
      phase: Initializing
      podLastOperation: {}
  lastClusterAction: Initializing
  lastClusterActionStatus: Ongoing
  phase: Initializing
  seedlist:
  - cassandra-demo-dc1-rack1-0.cassandra-demo-dc1-rack1.cassandra-test
  - cassandra-demo-dc1-rack1-1.cassandra-demo-dc1-rack1.cassandra-test
  - cassandra-demo-dc1-rack2-0.cassandra-demo-dc1-rack2.cassandra-test
  - cassandra-demo-dc2-rack1-0.cassandra-demo-dc2-rack1.cassandra-test
  - cassandra-demo-dc2-rack1-1.cassandra-demo-dc2-rack1.cassandra-test
  - cassandra-demo-dc2-rack1-2.cassandra-demo-dc2-rack1.cassandra-test
```
The creation of the cluster is ongoing. We can see that, based on the cluster topology, CassKop has created the seed list: it computes a seed list with 3 nodes in each datacenter (when possible), and the Cassandra seeds are always the first Cassandra nodes of a statefulset (starting with index 0). When all racks are in status `Done`, `CassandraCluster.status.lastClusterActionStatus` changes to `Done`.
We can see that Cassandra internally also knows the desired topology:
```bash
kubectl exec -ti cassandra-demo-dc1-rack1-0 nodetool status
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load        Tokens  Owns (effective)  Host ID                               Rack
UN  172.18.112.6   126.05 KiB  32      38.5%             1512da3c-f6b2-469f-95d1-2d060043777a  rack1
UN  172.18.64.10   137.08 KiB  32      32.0%             8149054f-4bc3-4093-a6ef-80910c018122  rack2
UN  172.18.88.9    154.54 KiB  32      30.2%             dbe44aa6-6763-4bc1-825a-9ea7d21690e3  rack2
UN  172.18.120.15  119.88 KiB  32      33.7%             c87e858d-66a8-4544-9d28-718a1f94955b  rack1
Datacenter: dc2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load        Tokens  Owns (effective)  Host ID                               Rack
UN  172.18.72.8    119.65 KiB  32      26.9%             8688abd3-08b6-44e0-8805-05bd3650eea6  rack1
UN  172.18.104.8   153.08 KiB  32      38.8%             62adf02d-8c55-4d95-a459-45b1c9c3aa91  rack1
```
### UpdateConfigMap
You can find in the cassandra-configuration section how to use the `spec.configMapName` parameter. CassKop does not currently monitor changes made inside the ConfigMap itself: if you want to change a parameter in a file of the current ConfigMap, you must create a new ConfigMap with the updated version and then point CassKop to the new ConfigMap name.

If we add, change, or remove `CassandraCluster.spec.configMapName`, CassKop starts a rolling update of the Cassandra nodes in each rack, beginning with the first rack defined in the topology.
```yaml
apiVersion: "db.orange.com/v2"
kind: "CassandraCluster"
metadata:
  name: cassandra-demo
  labels:
    cluster: k8s.pic
spec:
  nodesPerRacks: 2
  baseImage: cassandra
  version: latest
  rollingPartition: 0
  dataCapacity: "3Gi"
  dataStorageClass: "local-storage"
  hardAntiAffinity: false
  deletePVC: true
  autoPilot: true
  autoUpdateSeedList: true
  configMapName: cassandra-configmap-v1
  ...
```
First we create the example ConfigMap:

```bash
kubectl apply -f config/samples/cassandra-configmap-v1.yaml
```

Then we apply the changes in the `CassandraCluster`.
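Alternatively, a minimal sketch of the same switch with a single patch (assuming the cassandracluster resource name is served by the CRD; the v2 ConfigMap name is hypothetical):

```bash
kubectl patch cassandracluster cassandra-demo --type merge \
  -p '{"spec":{"configMapName":"cassandra-configmap-v2"}}'
```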
We can see the `CassandraCluster.Status` updated by CassKop:
```yaml
status:
  cassandraRackStatus:
    dc1-rack1:
      cassandraLastAction:
        Name: UpdateConfigMap
        startTime: 2018-09-21T12:24:24Z
        status: Ongoing
      phase: Pending
      podLastOperation: {}
    dc1-rack2:
      cassandraLastAction:
        Name: Initializing
        endTime: 2018-09-21T10:33:10Z
        status: Done
      phase: Running
      podLastOperation: {}
    dc2-rack1:
      cassandraLastAction:
        Name: Initializing
        endTime: 2018-09-21T10:34:47Z
        status: Done
      phase: Running
      podLastOperation: {}
  lastClusterAction: UpdateConfigMap
  lastClusterActionStatus: Ongoing
```
CassKop won't start the rolling update of the next rack until the status of the current rack becomes `Done`. The operation proceeds rack by rack:
```yaml
status:
  cassandraRackStatus:
    dc1-rack1:
      cassandraLastAction:
        Name: UpdateConfigMap
        endTime: 2018-09-21T12:26:10Z
        startTime: 2018-09-21T12:24:24Z
        status: Done
      phase: Running
      podLastOperation: {}
    dc1-rack2:
      cassandraLastAction:
        Name: UpdateConfigMap
        endTime: 2018-09-21T12:27:25Z
        startTime: 2018-09-21T12:26:10Z
        status: Done
      phase: Running
      podLastOperation: {}
    dc2-rack1:
      cassandraLastAction:
        Name: UpdateConfigMap
        startTime: 2018-09-21T12:27:27Z
        status: Ongoing
      phase: Pending
      podLastOperation: {}
  lastClusterAction: UpdateConfigMap
  lastClusterActionStatus: Ongoing
```
### UpdateDockerImage
CassKop allows you to change the Cassandra docker image and gracefully redeploy your whole cluster. If we change `CassandraCluster.spec.baseImage` and/or `CassandraCluster.spec.version`, CassKop performs a rolling update on the whole cluster (each rack sequentially) in order to change the version of the Cassandra docker image on all nodes.
You can change the docker image in order to:

- change the version of Cassandra
- change the version of Java
- change some configuration parameters for Cassandra or the JVM, if you don't overwrite them with a ConfigMap
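For instance, a minimal sketch of such a change (the target tag is hypothetical; use any tag available for your base image):

```yaml
spec:
  baseImage: cassandra
  version: 3.11.4   # was "latest"; changing this triggers the rolling update
```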
The status may be similar to:
```yaml
status:
  cassandraRackStatus:
    dc1-rack1:
      cassandraLastAction:
        Name: UpdateDockerImage
        startTime: 2018-09-18T16:08:59Z
        status: Ongoing
      phase: Pending
      podLastOperation: {}
    dc1-rack2:
      cassandraLastAction:
        Name: Initializing
        endTime: 2018-09-18T16:05:51Z
        status: Done
      phase: Running
      podLastOperation: {}
    dc2-rack1:
      cassandraLastAction:
        Name: Initializing
        endTime: 2018-09-18T16:07:52Z
        status: Done
      phase: Running
      podLastOperation: {}
  lastClusterAction: UpdateDockerImage
  lastClusterActionStatus: Ongoing
  phase: Pending
  seedlist:
  - cassandra-demo-dc1-rack1-0.cassandra-demo-dc1-rack1.cassandra-test
  - cassandra-demo-dc1-rack1-1.cassandra-demo-dc1-rack1.cassandra-test
  - cassandra-demo-dc1-rack2-0.cassandra-demo-dc1-rack2.cassandra-test
  - cassandra-demo-dc2-rack1-0.cassandra-demo-dc2-rack1.cassandra-test
  - cassandra-demo-dc2-rack1-1.cassandra-demo-dc2-rack1.cassandra-test
  - cassandra-demo-dc2-rack1-2.cassandra-demo-dc2-rack1.cassandra-test
```
We can see that CassKop has started to update `dc1-rack1` and has changed `lastClusterAction` and `lastClusterActionStatus` accordingly. Once it has finished the first rack, it processes the next one:
```yaml
status:
  cassandraRackStatus:
    dc1-rack1:
      cassandraLastAction:
        Name: UpdateDockerImage
        endTime: 2018-09-18T16:10:51Z
        startTime: 2018-09-18T16:08:59Z
        status: Done
      phase: Running
      podLastOperation: {}
    dc1-rack2:
      cassandraLastAction:
        Name: UpdateDockerImage
        startTime: 2018-09-18T16:10:51Z
        status: Ongoing
      phase: Pending
      podLastOperation: {}
    dc2-rack1:
      cassandraLastAction:
        Name: Initializing
        endTime: 2018-09-18T16:07:52Z
        status: Done
      phase: Running
      podLastOperation: {}
  lastClusterAction: UpdateDockerImage
  lastClusterActionStatus: Ongoing
```
And when all racks are `Done`:
```yaml
cassandraRackStatus:
  dc1-rack1:
    cassandraLastAction:
      Name: UpdateDockerImage
      endTime: 2018-09-18T16:10:51Z
      startTime: 2018-09-18T16:08:59Z
      status: Done
    phase: Running
    podLastOperation: {}
  dc1-rack2:
    cassandraLastAction:
      Name: UpdateDockerImage
      endTime: 2018-09-18T16:12:42Z
      startTime: 2018-09-18T16:10:51Z
      status: Done
    phase: Running
    podLastOperation: {}
  dc2-rack1:
    cassandraLastAction:
      Name: UpdateDockerImage
      endTime: 2018-09-18T16:14:52Z
      startTime: 2018-09-18T16:12:42Z
      status: Done
    phase: Running
    podLastOperation: {}
lastClusterAction: UpdateDockerImage
lastClusterActionStatus: Done
phase: Running
```
This provides a central view to monitor what is happening in the Cassandra cluster.
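For instance, a quick way to follow the cluster-level action from the command line (assuming the cassandracluster resource name is served by the CRD):

```bash
kubectl get cassandracluster cassandra-demo \
  -o jsonpath='{.status.lastClusterAction}{" -> "}{.status.lastClusterActionStatus}{"\n"}'
```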
### UpdateResources
CassKop allows you to configure your Cassandra pods' resources (CPU and memory). If we change `CassandraCluster.spec.resources`, CassKop starts a rolling update on the whole cluster (each rack sequentially) to apply the new resources on all nodes. For example, to increase the memory requests and/or limits:
```yaml
requests:
  cpu: '2'
  memory: 3Gi
limits:
  cpu: '2'
  memory: 3Gi
```
Then CassKop should output the status:
```yaml
status:
  cassandraRackStatus:
    dc1-rack1:
      cassandraLastAction:
        Name: UpdateResources
        startTime: 2018-09-21T15:28:43Z
        status: Ongoing
      phase: Pending
      podLastOperation: {}
    dc1-rack2:
      cassandraLastAction:
        Name: UpdateResources
        startTime: 2018-09-21T15:28:43Z
        status: ToDo
      phase: Running
      podLastOperation: {}
    dc2-rack1:
      cassandraLastAction:
        Name: UpdateResources
        startTime: 2018-09-21T15:28:43Z
        status: ToDo
      phase: Running
      podLastOperation: {}
  lastClusterAction: UpdateResources
  lastClusterActionStatus: Ongoing
```
We can see that it has staged the `UpdateResources` action in all racks (`status=ToDo`) and has started the action in the first rack (`status=Ongoing`). Once `Done`, it moves on to the next rack, and so on. Upon completion, the status may look like:
```yaml
cassandraRackStatus:
  dc1-rack1:
    cassandraLastAction:
      Name: UpdateResources
      endTime: 2018-09-21T15:30:31Z
      startTime: 2018-09-21T15:28:43Z
      status: Done
    phase: Running
    podLastOperation: {}
  dc1-rack2:
    cassandraLastAction:
      Name: UpdateResources
      endTime: 2018-09-21T15:32:12Z
      startTime: 2018-09-21T15:30:32Z
      status: Done
    phase: Running
    podLastOperation: {}
  dc2-rack1:
    cassandraLastAction:
      Name: UpdateResources
      endTime: 2018-09-21T15:34:07Z
      startTime: 2018-09-21T15:32:13Z
      status: Done
    phase: Running
    podLastOperation: {}
lastClusterAction: UpdateResources
lastClusterActionStatus: Done
```
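A hedged way to confirm that the new values landed on a pod after the rolling update:

```bash
kubectl get pod cassandra-demo-dc1-rack1-0 \
  -o jsonpath='{.spec.containers[0].resources}{"\n"}'
```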
### Scaling the cluster
The scaling of the cluster is managed through the `nodesPerRacks` parameter and through the number of DCs and racks defined in the topology section; see the NodesPerRacks section. If a scale up (or scale down) changes the seed list and `spec.autoUpdateSeedList` is set to `true`, CassKop schedules a new operation, `UpdateSeedList`, which triggers a rolling update to apply the new seed list on all nodes once the scaling is done.
#### ScaleUp
CassKop allows you to scale up your Cassandra cluster. The global parameter `CassandraCluster.spec.nodesPerRacks` specifies the number of Cassandra nodes we want in each rack. It is possible to override this for a particular DC with `CassandraCluster.spec.topology.dc[<idx>].nodesPerRacks`. Example:
```yaml
topology:
  dc:
    - name: dc1
      rack:
        - name: rack1
        - name: rack2
    - name: dc2
      nodesPerRacks: 3   # <-- we increase this value by one
      rack:
        - name: rack1
```
In this case, we ask to scale up the nodes of the second DC, `dc2`. CassKop takes the new target into account and starts applying the modifications to the cluster:
```yaml
...
dc1-rack1:
  cassandraLastAction:
    Name: UpdateSeedList
    status: Configuring
  phase: Running
  podLastOperation: {}
dc1-rack2:
  cassandraLastAction:
    Name: UpdateSeedList
    status: Configuring
  phase: Running
  podLastOperation: {}
dc2-rack1:
  cassandraLastAction:
    Name: ScaleUp
    startTime: 2018-09-27T15:02:21Z
    status: Configuring
  phase: Pending
lastClusterAction: ScaleUp
lastClusterActionStatus: Ongoing
...
```
We can see that CassKop:

- has started the `ScaleUp` action in `dc2-rack1`
- has found that the seed list must be updated and, because `autoUpdateSeedList=true`, has staged (`status=Configuring`) the `UpdateSeedList` operation for `dc1-rack1` and `dc1-rack2`
When CassKop ends the ScaleUp action in `dc2-rack1`, it also stages that rack with `UpdateSeedList=Configuring`. Once all racks are in this state, CassKop turns each rack to status `UpdateSeedList=ToDo`, meaning the operation can start. From then on, CassKop iterates on each rack, one after the other, with the statuses:

- `UpdateSeedList=Ongoing`: it is currently doing a rolling update on the rack to update the seed list (it also sets the `startTime`)
- `UpdateSeedList=Done`: the operation is done (it then sets the `endTime`)

See the evolution of the status:
```yaml
status:
  cassandraRackStatus:
    dc1-rack1:
      cassandraLastAction:
        Name: UpdateSeedList
        endTime: 2018-09-27T15:05:00Z
        startTime: 2018-09-27T15:03:13Z
        status: Done
      phase: Running
      podLastOperation: {}
    dc1-rack2:
      cassandraLastAction:
        Name: UpdateSeedList
        startTime: 2018-09-27T15:03:13Z
        status: Ongoing
      phase: Pending
      podLastOperation: {}
    dc2-rack1:
      cassandraLastAction:
        Name: UpdateSeedList
        status: ToDo
      phase: Running
  lastClusterAction: UpdateSeedList
  lastClusterActionStatus: Finalizing
  phase: Pending
```
Here is the final topology seen from nodetool:
```bash
$ kubectl exec -ti cassandra-demo-dc1-rack1-0 nodetool status
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load        Tokens  Owns (effective)  Host ID                               Rack
UN  172.18.88.6    211.95 KiB  32      27.4%             dbe44aa6-6763-4bc1-825a-9ea7d21690e3  rack2
UN  172.18.112.5   231.49 KiB  32      29.2%             1512da3c-f6b2-469f-95d1-2d060043777a  rack1
UN  172.18.64.10   188.36 KiB  32      27.6%             8149054f-4bc3-4093-a6ef-80910c018122  rack2
UN  172.18.120.14  237.62 KiB  32      29.8%             c87e858d-66a8-4544-9d28-718a1f94955b  rack1
Datacenter: dc2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load        Tokens  Owns (effective)  Host ID                               Rack
UN  172.18.72.8    228.54 KiB  32      22.0%             8688abd3-08b6-44e0-8805-05bd3650eea6  rack1
UN  172.18.104.8   212.34 KiB  32      32.2%             62adf02d-8c55-4d95-a459-45b1c9c3aa91  rack1
UN  172.18.88.9    148.34 KiB  32      31.7%             fecdfb5d-3ad4-4204-8ca5-cc7f1c4c19c4  rack1
```
Note that nodetool prints the IPs of the nodes, while Kubernetes works with names:
```bash
$ kubectl get pods -o wide -l app=cassandracluster
NAME                         READY   STATUS    RESTARTS   AGE   IP              NODE      NOMINATED NODE
cassandra-demo-dc1-rack1-0   1/1     Running   0          14m   172.18.112.5    node006   <none>
cassandra-demo-dc1-rack1-1   1/1     Running   0          15m   172.18.120.14   node003   <none>
cassandra-demo-dc1-rack2-0   1/1     Running   0          13m   172.18.88.6     node005   <none>
cassandra-demo-dc1-rack2-1   1/1     Running   0          13m   172.18.64.10    node004   <none>
cassandra-demo-dc2-rack1-0   1/1     Running   0          10m   172.18.72.8     node008   <none>
cassandra-demo-dc2-rack1-1   1/1     Running   0          11m   172.18.104.8    node007   <none>
cassandra-demo-dc2-rack1-2   1/1     Running   0          12m   172.18.88.9     node005   <none>
```
After the ScaleUp has finished, CassKop must execute a Cassandra `cleanup` on each node of the cluster. This can be triggered manually by setting the appropriate labels on each pod, as sketched below. CassKop can automate this if `spec.autoPilot` is `true`: it sets the labels on each pod of the cluster with a `ToDo` state and then finds those pods to sequentially execute the actions. See the podOperation Cleanup section.
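A hedged sketch of the manual trigger (the exact label keys are an assumption here; check the podOperation documentation referenced above):

```bash
kubectl label pod cassandra-demo-dc2-rack1-0 \
  operation-name=cleanup operation-status=ToDo --overwrite
```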
#### UpdateScaleDown
For a scale down, CassKop must perform a clean Cassandra `decommission` before actually scaling down the cluster at the Kubernetes level. Currently, CassKop requests the decommission through a Jolokia call and waits for it to complete (Cassandra node status = decommissioned) before updating the Kubernetes statefulset (removing the pod). If we ask to scale down by more than 1 node at a time, CassKop iterates on single-node scale downs until it reaches the requested number of nodes. CassKop will also refuse a scale down to 0 for a DC if some data is still replicated to it.
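While a decommission is in progress, a hedged way to watch it from the pod itself (the node's mode goes from NORMAL to LEAVING before the pod is removed):

```bash
kubectl exec -ti cassandra-demo-dc2-rack1-2 -- nodetool netstats | head -1
# Mode: LEAVING
```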
To launch a scale down, we simply decrease the value of `nodesPerRacks`:
```yaml
topology:
  dc:
    - name: dc1
      rack:
        - name: rack1
        - name: rack2
    - name: dc2
      nodesPerRacks: 2   # <-- back to 2
      rack:
        - name: rack1
```
We can see in the example below that CassKop:

- has started the `ScaleDown` action in `dc2-rack1`
- has found that the seed list must be updated, and has staged (`status=ToDo`) the update for `dc1-rack1` and `dc1-rack2`
When CassKop completes the ScaleDown in `dc2-rack1`, it also stages that rack with `UpdateSeedList=ToDo`. Once all racks are in this state, CassKop turns each rack to status `UpdateSeedList=Ongoing`, meaning the operation can start, and also sets the `startTime`. Then CassKop iterates on each rack, one after the other, with the statuses:

- `UpdateSeedList=Finalizing`: it is currently doing a rolling update on the rack to update the seed list
- `UpdateSeedList=Done`: the operation is done (it then sets the `endTime`)
```yaml
status:
  cassandraRackStatus:
    dc1-rack1:
      cassandraLastAction:
        Name: UpdateSeedList
        status: ToDo
      phase: Running
      podLastOperation: {}
    dc1-rack2:
      cassandraLastAction:
        Name: UpdateSeedList
        status: ToDo
      phase: Running
      podLastOperation: {}
    dc2-rack1:
      cassandraLastAction:
        Name: ScaleDown
        startTime: 2018-09-27T15:22:23Z
        status: Ongoing
      phase: Running
      podLastOperation:
        Name: decommission
        pods:
        - cassandra-demo-dc2-rack1-2
        startTime: 2018-09-27T15:22:23Z
        status: Ongoing
  lastClusterAction: ScaleDown
  lastClusterActionStatus: Ongoing
```
When the ScaleDown is `Done`, CassKop starts the UpdateSeedList operation:
```yaml
status:
  cassandraRackStatus:
    dc1-rack1:
      cassandraLastAction:
        Name: UpdateSeedList
        startTime: 2018-09-27T15:23:54Z
        status: Finalizing
      phase: Pending
      podLastOperation: {}
    dc1-rack2:
      cassandraLastAction:
        Name: UpdateSeedList
        startTime: 2018-09-27T15:23:54Z
        status: Ongoing
      phase: Running
      podLastOperation: {}
    dc2-rack1:
      cassandraLastAction:
        Name: UpdateSeedList
        startTime: 2018-09-27T15:23:54Z
        status: Ongoing
      phase: Running
      podLastOperation:
        Name: decommission
        endTime: 2018-09-27T15:23:51Z
        podsOK:
        - cassandra-demo-dc2-rack1-2
        startTime: 2018-09-27T15:22:23Z
        status: Done
  lastClusterAction: UpdateSeedList
  lastClusterActionStatus: Finalizing
  phase: Pending
```
It also shows that the `podLastOperation` `decommission` is `Done`. CassKop then rolling-updates all racks one by one in order to update the Cassandra seed list.
### UpdateSeedList
The UpdateSeedList operation is performed automatically by CassKop when the parameter `CassandraCluster.spec.autoUpdateSeedList` is `true` (the default).
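If you prefer to manage the seed list yourself, a minimal sketch is to disable the automation (the field is shown earlier in this page):

```yaml
spec:
  autoUpdateSeedList: false   # CassKop stops scheduling UpdateSeedList operations
```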
### CorrectCRDConfig
The `CassandraCluster` CRD is used to define your cluster configuration. Some fields can't be updated in a Kubernetes cluster. Because those fields are taken from the CRD to configure the underlying objects, and to make sure we don't update them (which would leave Kubernetes objects in error), CassKop is configured to simply ignore/revert unauthorized changes to the CRD.
Example, with this CRD deployed:
```yaml
spec:
  nodesPerRacks: 2
  baseImage: cassandra
  version: latest
  imagePullSecret:
    name: advisedev                   # to authenticate on the docker registry
  rollingPartition: 0
  dataCapacity: "3Gi"                 # <-- can't be changed
  dataStorageClass: "local-storage"   # <-- can't be changed
  hardAntiAffinity: false
  deletePVC: true
  autoPilot: true
  autoUpdateSeedList: true
```
If we try to update `dataCapacity` or `dataStorageClass`, nothing happens, and we can see messages like these in the CassKop logs:
```
time="2018-09-27T17:44:13+02:00" level=warning msg="[cassandra-demo]: CassKop has refused the changed on DataCapacity from [3Gi] to NewValue[4Gi]"
time="2018-09-27T17:44:35+02:00" level=warning msg="[cassandra-demo]: CassKop has refused the changed on DataStorageClass from [local-storage] to NewValue[local-storag]"
```
If you made the modification by updating your local CRD file and applying it with kubectl, you must revert it to the old value.
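A hedged way to spot these refusals (the operator deployment name is an assumption; adapt it to your installation):

```bash
kubectl logs deploy/casskop-cassandra-operator -f | grep "has refused"
```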
### Delete a DC
- Prior to deleting a DC, you must have scaled down all of its racks to 0; otherwise CassKop will refuse and correct the CRD.
- Prior to scaling down to 0, CassKop ensures that no more data is replicated to the DC; otherwise it will refuse and correct the CRD. Because CassKop requires the same number of pods in all racks of a DC, we decided not to allow removing only a single rack; such a change will be reverted as well.

You must scale down to 0 before you remove a DC, and you must change the replication factor before scaling a DC down to 0 (see the sketch below).
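A hedged illustration of that replication change (the keyspace name demo and the remaining replication settings are hypothetical):

```bash
kubectl exec -ti cassandra-demo-dc1-rack1-0 -- cqlsh -e \
  "ALTER KEYSPACE demo WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};"
```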
### Kubernetes node maintenance operation
In a normal production environment, CassKop will have spread its Cassandra pods on different Kubernetes nodes. If the team in charge of the machines needs to perform an operation on a host, they can drain it. The Kubernetes drain command asks the scheduler to evict all pods from the node, and for many workloads Kubernetes will reschedule them on other machines. CassKop's Cassandra pods, however, won't be scheduled on another host, because they use local-storage and are pinned to a specific host through their PersistentVolumeClaim.
Example: we drain node008 for a maintenance operation:

```bash
kubectl drain node008 --ignore-daemonsets --delete-local-data
```
All pods are evicted; those that can be are rescheduled on other hosts. Our Cassandra pod won't be able to be scheduled elsewhere because of its PVC, and we can see this message in the Kubernetes events:
```
0s Warning FailedScheduling Pod 0/8 nodes are available: 1 node(s) were unschedulable, 2 node(s) had taints
that the pod didn't tolerate, 5 node(s) had volume node affinity conflict.
```
This explains that 1 node is unschedulable (the one we just drained) and that 5 other nodes can't host our pod because of a volume node affinity conflict (our pod has an affinity to node008). Once the team has finished the maintenance operation, they can bring the host back into the Kubernetes cluster. From then on, Kubernetes can reschedule the Cassandra pod onto it so that it re-joins the ring:
```bash
$ kubectl uncordon node008
node/node008 uncordoned
```
The pending pod is immediately rescheduled and started on the host. If the interruption was short, there is nothing more to do: the node joins the ring and re-synchronises with the cluster. If it was too long, you may need to schedule some of the PodOperations described in the next sections of this document.
### The PodDisruptionBudget (PDB) protection
If a Kubernetes admin asks to drain a node, the eviction may not be allowed, depending on the current state of the CassandraCluster and the configuration of its PDB (usually only 1 pod is allowed to be in disruption). Example:

```bash
$ kubectl drain node008 --ignore-daemonsets --delete-local-data
error when evicting pod "cassandra-demo-dc2-rack1-0" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
```

node008 is flagged as SchedulingDisabled, so that it won't take new workload, and all possible pods are evicted; but if there is already an ongoing disruption on the Cassandra cluster, the drain won't be allowed to evict the Cassandra pod.
Example of a PDB:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  labels:
    app: cassandracluster
    cassandracluster: cassandra-test
    cluster: k8s.pic
  name: cassandra-test
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: cassandracluster
      cassandracluster: cassandra-test
      cluster: k8s.pic
status:
  currentHealthy: 13
  desiredHealthy: 13
  disruptedPods: null
  disruptionsAllowed: 0
  expectedPods: 14
  observedGeneration: 1
```
In this example, only 1 pod may be unavailable; the cluster wants 14 pods but only 13 are healthy, which is why the PDB won't allow the eviction of an additional pod. To be able to continue, we need to wait, or take the appropriate actions, so that the Cassandra cluster no longer has any unavailable nodes.
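A hedged pre-drain check (the PDB name is taken from the example above):

```bash
kubectl get pdb cassandra-test
# ALLOWED DISRUPTIONS must be >= 1 for a drain to evict a Cassandra pod
```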
### K8S host major failure: replacing a cassandra node
In the case of a major host failure, it may not be possible to bring the node back to life. In that case we can consider the Cassandra node lost and will want to replace it on another host. There are 2 possible solutions, both of which require some manual actions:
#### Remove old node and create new one
- First, use the CassKop plugin to schedule a Cassandra `removenode` for the failing node:

```bash
kubectl casskop remove --pod <pod_name> [--previous-ip <previous_ip_pod>] {--from-pod <pod_name> | --crd <crd_name>}
```

This triggers the PodOperation `removenode` by setting the appropriate labels on a Cassandra pod.

- Once the node is properly removed, free the link between the pod and the failing host by deleting the associated PersistentVolumeClaim:

```bash
kubectl delete pvc data-cassandra-test-dc1-rack2-1
```

This allows Kubernetes to reschedule the pod on another free host.

- Once the node is back in the cluster, apply a cleanup on all nodes:

```bash
kubectl casskop cleanup start
```

You can pause the cleanup and check its status with:

```bash
kubectl casskop cleanup pause
kubectl casskop cleanup status
```
#### Replace node with a new one
In some cases it may be preferable to replace the node instead. Because we use a statefulset to deploy the Cassandra pods, all pods are by definition identical, and we can't execute specific actions on a specific node at startup. For that, CassKop provides the ability to execute a `pre_run.sh` script that can be changed through the CRD ConfigMap; see Overriding Configuration using configMap. For example, if we want to replace the node cassandra-test-dc1-rack2-1, we first need to retrieve its IP address, from nodetool status for example:
```bash
$ nodetool status
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns (effective)  Host ID                               Rack
UN  172.31.184.189  35.18 GiB  256     51.2%             9455a9bd-7a61-405e-8c3f-ee1f72f63500  rack1
UN  172.31.180.138  37 GiB     256     51.0%             1ad1b4b7-c719-4683-8109-31aa9722c1ee  rack2
UN  172.31.179.248  37.86 GiB  256     47.4%             69cbf178-2477-4420-ac71-6fad10f93759  rack2
UN  172.31.182.120  41.76 GiB  256     50.2%             a4ffac86-990d-4487-80a0-b2e177d8e06e  rack1
DN  172.31.183.213  31.14 GiB  256     51.9%             e45107ba-fe7b-4904-98cf-1373d1946bb5  rack2
UN  172.31.181.193  33.15 GiB  256     48.4%             35806f73-17fb-4d91-b2e7-8333f393189b  rack1
```
Then we edit the ConfigMap to change the pre_run.sh script:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cassandra-configmap-pre-run
data:
  pre_run.sh: |-
    echo "** this is a pre-script for run.sh that can be edited with a configmap"
    test "$(hostname)" == 'cassandra-demo-dc1-rack3-0' && echo "-Dcassandra.replace_address_first_boot=172.31.183.213" > /etc/cassandra/jvm.options
    echo "** end of pre_run.sh script, continue with run.sh"
```
The operation is:

- Edit the ConfigMap with the appropriate `replace_address_first_boot` IP for the targeted pod name.
- Delete the PVC `data-cassandra-test-dc1-rack2-1`.
- The pod boots and executes the pre_run.sh script prior to /run.sh.
- The new pod replaces the dead one by re-syncing its content, which can take some time depending on the data size.
- Don't forget to edit the ConfigMap again afterwards and remove the specific replace_address line.