Shutdown Kubernetes

Posted on 2022-01-30 in Programming

It can be a good idea to shut down part of your infrastructure when you don't need it: it will help you reduce costs and your environmental impact. It is, however, not always easy to actually do.

In this article, I'll talk about how to shut down a Kubernetes based infrastructure hosted on GCP. The ideas are general enough that they should apply to other cloud providers as well once properly adapted. Please note that while this is possible with GKE, it may not be possible (or as simple) on other kinds of Kubernetes clusters: your cluster needs to support scaling down to 0 nodes. We will of course do this automatically, without needing a developer's computer to be on. To achieve this, we will rely on the Cloud Scheduler service, which can call HTTP endpoints or connect to internal GCP APIs directly at defined intervals.

Shutting down/starting up the cluster

First, we will shut down the cluster. If you have a GKE standard mode cluster (where you manage the nodes yourself), you should use the container REST API to shut down all nodes with the setSize endpoint. You can call this endpoint directly from Cloud Scheduler, provided you configure the task with a service account that has the Kubernetes Engine Cluster Admin role.

Note

I have an autopilot cluster, so I couldn't test this method. I provide it here nonetheless based on my research and understanding of GCP infrastructure. If you do test it, please let me know (and tell me how to do it properly) in a comment.
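To make the endpoint concrete, here is a sketch of the request such a Cloud Scheduler task would issue. It is untested (per the note above), and the project, location, cluster, and pool names are placeholders:

```python
# Hypothetical sketch of the nodePools.setSize call for a *standard* GKE
# cluster. PROJECT, LOCATION, CLUSTER and POOL values are placeholders.
import json
import urllib.request

API_ROOT = "https://container.googleapis.com/v1"


def set_size_request(project, location, cluster, pool, node_count, token):
    """Build the authenticated POST request for nodePools.setSize."""
    name = (f"projects/{project}/locations/{location}"
            f"/clusters/{cluster}/nodePools/{pool}")
    return urllib.request.Request(
        f"{API_ROOT}/{name}:setSize",
        data=json.dumps({"nodeCount": node_count}).encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )

# urllib.request.urlopen(set_size_request(...)) would send the request; in
# practice the Cloud Scheduler task performs this call for you with an
# OAuth token, so the sketch only shows the endpoint and payload shape.
```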

If, like me, you have an autopilot cluster, you will need to scale all your deployments to 0. GKE will then automatically shrink the number of nodes to 0. Since you are only charged for the resources your workloads request, you won't pay for the nodes even if the cluster still has some. Sadly, there isn't a GCP API to do this directly; you need to rely on the Kubernetes API. Given the constraints of Cloud Scheduler tasks (they can only make basic HTTP calls), I decided to make the task call a Cloud Function that does the actual work. Here's how you can do it:

  1. Create a service account so your Cloud Function can connect to your Kubernetes cluster. See this page to learn how to do that. Let's call it cloud-function-k8s.

  2. Create two secrets to store private cluster data:

    • cloud-function-k8s-access-key will hold the service account access key in JSON.
    • kubeconfig-for-cloud-functions will hold the content of kubeconfig.yaml.

    These secrets will then be mounted into the function.

  3. Create a Cloud Function to do the actual job. It will be triggered by an HTTP call (made by the Cloud Scheduler task) and must require HTTPS and authentication (you don't want just anyone to call it). You also must:

    • Configure the KUBECONFIG=/k8s-cfg/kubeconfig.yaml and GOOGLE_APPLICATION_CREDENTIALS=/gsa-key/gsa-key.json environment variables so our code will be able to find the credentials and configuration.

    • Under the Security tab, mount our two secrets as volumes:

      • cloud-function-k8s-access-key mounted to /gsa-key with the latest version mounted as gsa-key.json.
      • kubeconfig-for-cloud-functions mounted to /k8s-cfg with the latest version mounted as kubeconfig.yaml.
    • Under the Code tab:

      • I used Python 3.9 but you can use any other runtime as long as it has a Kubernetes SDK.

      • I named my entry point accept_request.

      • I used this requirements.txt to install the Kubernetes SDK:

        kubernetes==21.7.0
        
      • This main.py:

        from kubernetes import client, config
        
        
        def accept_request(request):
            """Triggered by an HTTP call made by the Cloud Scheduler task.
            Args:
                 request (flask.Request): The request whose JSON body holds
                 the desired number of replicas.
            """
            message = request.get_json()
            replicas = message['replicas']
        
            _scale(replicas)
        
            return {'success': True}
        
        
        def _scale(replicas, api_client=None):
            config.load_kube_config(persist_config=False)
            apps_api = client.AppsV1Api(api_client)
        
            for deployment in apps_api.list_deployment_for_all_namespaces().items:
                if deployment.metadata.namespace.startswith(
                        "gatekeeper-system"
                ) or deployment.metadata.namespace.startswith("kube-"):
                    continue
        
                print(f"Scaling {deployment.metadata.name} to {replicas}")
                apps_api.patch_namespaced_deployment_scale(
                    deployment.metadata.name,
                    deployment.metadata.namespace,
                    {"spec": {"replicas": replicas}},
                )
        
  4. Once the function is created, you must create another service account with the Cloud Functions Invoker role and add it in the Authorization section of the function. It will be used so that the task, and only the task, is allowed to run the function. Let's call it cron-maintenance.

  5. Create the Cloud Scheduler task so it calls your HTTP function with the POST method. You must set the Content-Type HTTP header to application/json; charset=utf-8 and use this payload:

    {
      "replicas": 0
    }
    

    Under the authentication section, add the cron-maintenance service account with an OIDC token.
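Before wiring everything into Cloud Scheduler, you can sanity-check the scaling loop from main.py locally with a stubbed Kubernetes client, so no cluster or kubeconfig is needed. The stub names below are mine, not part of the original function:

```python
# Check the namespace-skipping logic of the scaling loop locally,
# with a stubbed Kubernetes client instead of a real cluster.
from types import SimpleNamespace
from unittest import mock


def fake_deployment(name, namespace):
    return SimpleNamespace(metadata=SimpleNamespace(name=name, namespace=namespace))


apps_api = mock.Mock()
apps_api.list_deployment_for_all_namespaces.return_value = SimpleNamespace(
    items=[
        fake_deployment("api", "default"),
        fake_deployment("coredns", "kube-system"),  # must be skipped
    ]
)


def scale(replicas, apps_api):
    """Same loop as main.py's _scale, with the client injected and the
    scaled deployment names returned so the logic can be checked."""
    scaled = []
    for deployment in apps_api.list_deployment_for_all_namespaces().items:
        namespace = deployment.metadata.namespace
        if namespace.startswith("gatekeeper-system") or namespace.startswith("kube-"):
            continue
        apps_api.patch_namespaced_deployment_scale(
            deployment.metadata.name, namespace, {"spec": {"replicas": replicas}}
        )
        scaled.append(deployment.metadata.name)
    return scaled


print(scale(0, apps_api))  # ['api']
```

Only the deployment in the default namespace is patched; system namespaces are left alone, which is exactly what we want in production.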

To scale up and restart the cluster, simply create another task that uses a different number of replicas.

Note

OAuth tokens can only be used with internal GCP APIs; that's why the task authenticates to the function with an OIDC token instead.

Shutting down/starting up the database

This one is easier: we can create a task that uses the Cloud SQL Admin API directly. Here's how it goes:

  1. Create the task and configure when it must run.

  2. For its execution:

    • Select this HTTP target: https://sqladmin.googleapis.com/v1/projects/<PROJECT_ID>/instances/<INSTANCE_ID>. Replace <PROJECT_ID> with your project ID and <INSTANCE_ID> with your instance ID.

    • Use PATCH as the HTTP method.

    • Set the Content-Type HTTP header to application/json; charset=utf-8.

    • Set the body to this JSON:

      {
        "settings": {
          "activationPolicy": "NEVER"
        }
      }
      
    • Configure authorization to use an OAuth token from a service account with the Cloud SQL Editor role. It's probably best to have a dedicated account for this.

To restart the database, you simply have to use "ALWAYS" instead of "NEVER" in the JSON payload in another task.
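For reference, the same PATCH can be reproduced outside Cloud Scheduler, for instance from Python. This is a sketch with placeholder names; the access token would come from something like `gcloud auth print-access-token`:

```python
# Hypothetical sketch of the Cloud SQL Admin PATCH the Scheduler task makes.
# PROJECT and INSTANCE are placeholders; policy is "NEVER" to stop the
# instance and "ALWAYS" to start it again.
import json
import urllib.request

SQLADMIN = "https://sqladmin.googleapis.com/v1"


def activation_request(project, instance, policy, token):
    """Build the PATCH request flipping the instance's activationPolicy."""
    return urllib.request.Request(
        f"{SQLADMIN}/projects/{project}/instances/{instance}",
        data=json.dumps({"settings": {"activationPolicy": policy}}).encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json; charset=utf-8"},
        method="PATCH",
    )
```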

Note

These operations take a few minutes to complete, so it's best to shut down the cluster first and then the database, and to do it in the reverse order when starting up. If you don't, you will get errors.

Conclusion

I hope you found this useful. It took me some time to research, test and think through. Once you have the solution, it should be relatively easy to implement. If you have any questions or remarks, please leave a comment.

As a bonus, here is a script you can launch locally to start/shut down a Kubernetes based infrastructure. It cannot be automated as-is, because we can neither run Bash scripts nor install gcloud or kubectl in Cloud Functions.

#!/usr/bin/env bash

set -eu
set -o pipefail

source "$(dirname "$0")/_lib.sh"  # provides use-k8s-cluster and list-k8s-non-system-namespaces

function usage() {
    echo "$0 start|stop pp|dev|all"
}

function shutdown() {
    local k8s_cluster="$1"
    local project="$2"
    local db_instance="$3"
    local namespace

    echo "Shutting down cluster ${k8s_cluster}"
    use-k8s-cluster "${k8s_cluster}"
    for namespace in $(list-k8s-non-system-namespaces); do
        echo "Shutting down all deployments in namespace ${namespace}"
        kubectl scale deploy --namespace "${namespace}" --replicas=0 --all
    done

    echo "Shutting down database ${db_instance}"
    gcloud --project "${project}" sql instances patch "${db_instance}" --activation-policy=NEVER

    echo "Done shutting down ${k8s_cluster} and ${db_instance}"
}

function startup() {
    local k8s_cluster="$1"
    local project="$2"
    local db_instance="$3"
    local namespace

    echo "Starting cluster ${k8s_cluster}"
    use-k8s-cluster "${k8s_cluster}"
    for namespace in $(list-k8s-non-system-namespaces); do
        echo "Starting all deployments in namespace ${namespace}"
        kubectl scale deploy --namespace "${namespace}" --replicas=1 --all
    done

    echo "Starting database ${db_instance}"
    gcloud --project "${project}" sql instances patch "${db_instance}" --activation-policy=ALWAYS

    echo "Done starting up ${k8s_cluster} and ${db_instance}"
}

function main() {
    if [[ "$#" != 2 ]]; then
        echo "Invalid number of arguments" >&2
        usage
        exit 1
    fi

    case "$1" in
        start)
            case "$2" in
                pp)
                    echo -e "Starting \033[1mPP\033[0m, press enter to continue or ^C to quit."
                    read -r
                    startup preprod-autopilot-cluster preprod api-preprod-pg13
                    ;;
                dev)
                    echo -e "Starting \033[1mdev\033[0m, press enter to continue or ^C to quit."
                    read -r
                    startup dev-autopilot-cluster dev api-dev-pg13
                    ;;
                all)
                    echo -e "Starting \033[1mall\033[0m, press enter to continue or ^C to quit."
                    read -r
                    startup preprod-autopilot-cluster preprod api-preprod-pg13
                    startup dev-autopilot-cluster dev api-dev-pg13
                    ;;
                *)
                    echo "Invalid argument $2" >&2
                    usage
                    exit 1
                    ;;
            esac
            ;;
        stop)
            case "$2" in
                pp)
                    echo -e "Shutting down \033[1mPP\033[0m, press enter to continue or ^C to quit."
                    read -r
                    shutdown preprod-autopilot-cluster preprod api-preprod-pg13
                    ;;
                dev)
                    echo -e "Shutting down \033[1mdev\033[0m, press enter to continue or ^C to quit."
                    read -r
                    shutdown dev-autopilot-cluster dev api-dev-pg13
                    ;;
                all)
                    echo -e "Shutting down \033[1mall\033[0m, press enter to continue or ^C to quit."
                    read -r
                    shutdown preprod-autopilot-cluster preprod api-preprod-pg13
                    shutdown dev-autopilot-cluster dev api-dev-pg13
                    ;;
                *)
                    echo "Invalid argument $2" >&2
                    usage
                    exit 1
                    ;;
            esac
            ;;
        *)
            echo "Invalid argument $1" >&2
            usage
            exit 1
            ;;
    esac
}

main "$@"