Create test environments in Kubernetes

Posted on 2022-02-20 in Programmation

Context and possible solutions

If you are working in a team with multiple developers, you will probably need to be able to test your code and make the product team validate your features. You will also need to be able to deploy new versions of your code. This implies to have only validated (technically and functionality wise) in your production branch, so you can easily and at any time deploy it in production. There are two pitfalls to avoid:

  • If you merge code too early, you probably will need to fix issues once more testing is done by the product team. This will hamper your ability to deploy: you either need to wait to fix all issues, revert the code or deploy broken code. None of these "solutions" are satisfactory.
  • If you merge too rarely, you will hamper our ability to deploy frequently. And you want to deploy frequently. I won't dig into this here, you'll find many articles online about this, but here are a few reasons:
    • Your feature branches will get outdated resulting in hard rebase and conflicts. It will also be harder to build upon one another works.
    • You want to provide features and fixes to your users as fast as possible.
    • Deploy often means deploying less, which means less scary deployments. And if something goes wrong, it's easier to spot and fix.

To avoid these pitfalls, you can use a workflow similar to this:

  1. Develop your feature in a branch.
  2. Make the tech review.
  3. Deploy in a test env so the product team can review the feature/bug fix.
  4. If code changes are needed, make sure they are reviewed again. This may imply to iterate a few times between product and tech reviews. If everyone does their job properly, I don't think you shouldn't generally iterate more than 2 or 3 times.
  5. Merge the branch once the development is ready.

This way, your work is merged as soon as possible and you only have validated features/bug fixes in your production branch. If you have a feature that will take multiple sprints, you can create a long lived feature branch and merge multiple smaller branches in it and merge it at the very end in the production branch. You will have to do a few rebases to avoid a big one at the end. If the feature will take a very long time, you should split it into smaller chunks that can be merged to avoid having code that lives too long on its own. For instance, it can be done by hiding the feature behind a flag. It's pretty much on you to figure this out depending on your context. I can only suggest you don't do this often if you can avoid it as it makes your job harder.

To achieve this workflow, you have many possibilities:

  • Have a shared test environment. It will create bottlenecks (only one dev can propose one branch at a time) and will be hard to use in practice: you need coordination to use this env and one test might be replaced by a more urgent one wreaking havoc in everybody's schedule.
  • Have one environment per developer. It removes the coordination problem but only reduces the bottleneck one: a developer may want to test many branches in parallel.
  • Have one environment per branch. It removes all the previous problems but can become expensive if not done properly. You can mitigate the cost issue by reducing the allowed resources for each test, disabling unneeded services and cleanup resources as soon as we don't need them and shutting dow the tests environment during nights and week-ends when you won't use it. That's the approach I'll detail here.

Now that we know what we want, how are we going to do this? Again, you have many possibilities:

  • On a dedicated server, you can manage everything including deployment. You should be able to do it, but you will need to create the deploy and cleanup scripts as well as provision a sufficiently big machine to support all these tests apps. As your team grow, this can become more and more difficult.
  • Some platforms like Heroku seems to provide tools to help.
  • If you have a Kubernetes cluster, you can rely on namespaces to create a set of services for each tests. You create a namespace when you push your branch and you delete it with all its associated resources when you are done. I'll think it's best to do it into a dedicated cluster (and maybe even a dedicated project in your cloud provider) to avoid polluting the production one and ease shutdown of resources. As you probably guessed from the title, that's the solution I'll describe here.

Deploying

The idea is to deploy a branch in the dedicated test namespace automatically. The namespace will be created or updated depending on whether it already exists. This is to ease deployment of fixes if needed. Where I work, we rely on GCP's CloudBuild service to do that, but it can of course be adapted to any CI/CD system.

In a nutshell, we use two Bash scripts: one to build the docker image and one to deploy it. These scripts are written in a way that allow them to be used both for production and tests deployments.

Before trying to deploy to the test environment, you need to:

  1. Create the cluster.
  2. Create a file to configure the values used by the deployment. We use Helm for our deployments, so we created a values.dev.yaml file adapted from a values.production.yaml.
  3. Make sure all your configuration values are in a Helm values file or a Helm secret. It's easier not to have anything to read from the CI/CD environment when you deploy multiple services with different configurations. With Helm values files, all your services will behave the same.

Attention!

You must not associate each Helm resources with a namespace inside each YAML file like this:

{{ if .Values.namespace -}}
namespace: "{{ .Values.namespace }}"
{{- end }}

If you do this, when you will try to deploy another namespace it will replace the deployments of the previous branch. So, you will never be able to deploy many branches in parallel. You must associate the resources with a namespace with the --namespace "${namespace}" option of helm. See below.

Note

If you choose not to support HTTPS, be sure you disable all HTTPS related configuration in your code. This includes HSTS, blocking mixed content with CSP, forcing requests upgrade with your services or CSP…

Building the image

First, we needed to create a local .env containing all the values required for the image to be build. It's very useful when building frontend images, a bit less when building backend ones (they will read their configuration from the environment on startup). This can be done with the tip I give in my Extract kubectl configmap/secret to .env file article.

We then needed to correct some of these values to match your actual namespace. I'm mostly thinking of the URLs used in a frontend app to communicate with the API. Depending on your use cases, you can have more (or none). All these values depend on the namespace and thus will all need to be overridden here.

We can then build the image.

This can be done with a script that looks something like this:

set -eu
set -o pipefail

# Read parameters.
readonly ENV="$1"
readonly IMAGE="$2"
readonly PROJECT_ID="$3"
readonly COMMIT_SHA="$4"
readonly BRANCH_NAME="$5"
readonly CUSTOM_REPO="$6"

# Functions shared across multiple scripts, like extract-namespace-from-branch
# or config variable like IMAGE_TO_DOCKER_FILE
# See below for details.
source "$(dirname "$0")/_lib.sh"

# Not detailed here, see above.
"$(dirname "$0")/create-dot-env.sh"

if [[ "${ENV}" == 'dev' && -f "${env_file}" ]]; then
    namespace=$(extract-namespace-from-branch "${BRANCH_NAME}")
    api_domain="api--${namespace}.dev.example.com"
    api_url="https://${api_domain}"

    # Remove existing value, then replace by the new one.
    # It could be done in one go, but I find it easier this way (no complex regexp).
    sed -i '/REACT_APP_API_BASE_URL/d' "${env_file}"
    echo "REACT_APP_API_BASE_URL=${api_url}" >> "${env_file}"
fi

docker build \
    -t "${CUSTOM_REPO}/${PROJECT_ID}/${IMAGE}:${COMMIT_SHA}" \
    --build-arg "COMMIT_SHA=${COMMIT_SHA}" \
    --build-arg "ENV=${ENV}" \
    -f "${IMAGE_TO_DOCKER_FILE[${IMAGE}]}" \
    .

And the relevant shared code (this must be placed in a file named _lib.sh next to the deploy script):

declare -rA IMAGE_TO_DOCKER_FILE=(
    [api]="./backend/Dockerfile"
    [frontend]="./frontend/Dockerfile"
)

function extract-namespace-from-branch() {
    # This will either return the ticket number of the full branch name.
    echo $1 | sed --regexp-extended --expression 's@^[a-zA-Z]+(/[a-zA-Z]+)?/([0-9]+).*$@\2@' --expression 's@/@-@g'
}

Note

The namespace name cannot contain slashes since it would not be a valid HTTP domain name. The extract-namespace-from-branch function will replace them with dashes instead. It's also meant to extract the ticket number from the branch name to ease identification of the test namespace (and thus URL) for everybody. We use this convention: (feat|fix|test)/.*/\d+-.*. For instance: feat/jujens/789-add-smileys-to-actions will extract 789 as the namespace while with feat/jujens/add-smileys-to-actions it will extract feat-jujens-add-smileys-to-actions.

Deploying the image

Time to deploy the newly built image! We rely on a deploy function to which we can pass extra arguments. This can be done with the $@ variable in bash. This allows us to have a base function we can use without arguments for production and with extra --set and --namespace options for tests. It's like this:

function deploy() {
    local container_image
    container_image="${CUSTOM_REPO}/${PROJECT_ID}/${IMAGE},container.image.tag=${COMMIT_SHA}"
    # See following section for this proxy.
    if [[ "${IMAGE}" == dev-proxy ]]; then
        container_image="docker.io/nginx,container.image.tag=stable"
    fi

    helm upgrade --install --debug \
            "${IMAGE}" "./devops/helm/${IMAGE}" \
            -f "./devops/helm/${IMAGE}/values.${ENV}.yaml" \
            --set "container.image.repository=${container_image}" \
            "$@"
}

This allows to have a body like this:

set -eu
set -o pipefail

readonly ENV="$1"
readonly IMAGE="$2"
readonly PROJECT_ID="$3"
readonly COMMIT_SHA="$4"
readonly BRANCH_NAME="$5"
readonly CUSTOM_REPO="$6"

source "$(dirname "$0")/_lib.sh"

if [[ "${ENV}" == "dev" ]]; then
    namespace=$(extract-namespace-from-branch "${BRANCH_NAME}")
    if [[ "${IMAGE}" == dev-proxy ]]; then
        namespace=default
    fi

    app_url="https://frontend-app--${namespace}.dev.example.com"
    api_domain="api--${namespace}.dev.example.com"
    api_url="https://${api_domain}"

    deploy --namespace "${namespace}" \
        --create-namespace \
        --set "configmap.FRONTEND_BASE_URL=${app_url}" \
        --set "configmap.REACT_APP_API_BASE_URL=${api_url}"
else
    deploy
fi

See that we reuse our extract-namespace-from-branch function. For production, we can just deploy the image. In dev, we need to update our ConfigMap so it contains proper values. For instance, our backend API may need to have the proper frontend URL. We use --set "PATH=value" to override values from our values file. We associate our resources to the proper namespace with --namespace "${namespace}" so each one are correctly isolated from one another. --create-namespace allows us to create the namespace if it doesn't already exists (it required --install).

So far so good. But what if we had secrets? It's better to keep them out of git and thus Helm won't deploy them. Since it's a test environment, we decided to create them once in the default namespace and to use the same username and password for all tests namespaces. We then copy them into the proper namespace so they can be used. This way, they stay secured on the cluster, it's easy to change them if needed and they are still easy to use. To do this, we had to insert this code before calling the deploy function:

# We must create the namespace ourselves to copy secrets to it.
create-namespace "${namespace}"
copy-secret "${IMAGE}" "${namespace}"

We also need these two functions:

function create-namespace() {
    local namespace="$1"

    echo "Creating namespace"
    kubectl create namespace "${namespace}" || echo "Namespace already exists, continuing"
}

function copy-secret() {
    local secret_name="$1"
    local namespace="$2"

    if kubectl get secret "${IMAGE}" > /dev/null 2>&1; then
        echo "Copying secret ${secret_name}"
        # We export the secret, correct its namespace and remove metadata for the copy to succeed.
        kubectl get secret "${secret_name}" -o yaml |
            sed "s/namespace: default/namespace: '${namespace}'/" |
            sed '/^[[:blank:]]*uid/d' |
            sed '/^[[:blank:]]*creationTimestamp/d' |
            sed '/^[[:blank:]]*resourceVersion/d' |
            sed '/^[[:blank:]]*kubectl.kubernetes.io\/last-applied-configuration/,+1 d' |
            kubectl apply --overwrite=true -f -
    else
        echo "Secret ${secret_name} doesn't exists, skipping"
    fi
    echo "Done copying secrets"
}

Configuring a proxy to forward request to the proper service

Since our service must be available, we had to go one step further. We could have configured a load balancer for each to allow external traffic to the cluster. However, these take a long time to startup. So instead, we deployed only one public service named dev-proxy which has a load balancer and routes our requests to the proper service inside the cluster.

This service will be a very basic nginx with its configuration mounted from a volume. If you are reading this, I expect you to be enough at ease with Kubernetes and Helm to write its configuration yourself. In case of trouble, you can always leave a comment.

Let's focus on the configuration of nginx. It has several particularities you should be aware of:

  • We had to specify the resolver manually like this: resolver kube-dns.kube-system.svc.cluster.local valid=10s;. Otherwise, nginx will fail to resolve internal cluster domain names like api.namespace.svc.cluster.local. Meaning it won't be able to forward traffic.
  • We need to parse the server_name directive to extract the name of the service and the namespace into variables. This is what will allow you to forward traffic with the proxy_pass directive later on. This can be done with server_name ~^(?<service>.+)--(?<namespace>.+)\.dev\.example\.com$ for a service accessible at api--namespace.dev.example.com. We use two dashes (--) instead of a dot (.) to ease the management of HTTPS certificate, see the Enabling HTTPS section below.

Here is the full configuration file:

 1 server {
 2     resolver kube-dns.kube-system.svc.cluster.local valid=10s;
 3 
 4     listen 80;
 5     root /var/www/;
 6     client_max_body_size 1G;
 7     server_tokens off;
 8     server_name ~^(?<service>.+)--(?<namespace>.+)\.dev\.example\.com$;
 9 
10     index index.html;
11 
12     gzip on;
13     gzip_vary on;
14     gzip_min_length 1024;
15     gzip_proxied expired no-cache no-store private auth;
16     gzip_types text/plain text/css text/xml text/javascript application/javascript application/x-javascript application/xml;
17     gzip_disable "MSIE [1-6]\.";
18 
19     access_log stdout;
20     error_log  stderr;
21 
22     location / {
23         {{ if .Values.container.nginx.enableBasicAuth -}}
24         auth_basic           "Pre-Production. Access Restricted";
25         auth_basic_user_file /etc/nginx/conf.d/.htpasswd;
26         {{- end }}
27 
28         add_header Permissions-Policy "interest-cohort=()" always;
29         add_header Cross-Origin-Opener-Policy  same-origin always;
30         add_header Cross-Origin-Resource-Policy same-site always;
31         add_header Cross-Origin-Embedder-Policy unsafe-none always;
32         # Uncomment if HTTPS is supported.
33         # add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
34         add_header X-Frame-Options DENY;
35         add_header X-XSS-Protection "1; mode=block";
36         add_header X-Content-Type-Options nosniff;
37 
38         location /nghealth {
39             {{ if .Values.container.nginx.enableBasicAuth -}}
40             auth_basic off;
41             {{- end }}
42             return 200;
43         }
44 
45         location /api {
46             # Disable auth for API path to avoid requests failing because our XHR request didn't supply auth.
47             {{ if .Values.container.nginx.enableBasicAuth -}}
48             auth_basic off;
49             {{- end }}
50 
51             add_header Cache-Control 'no-store, no-cache, must-revalidate, proxy-revalidate, max-age=0, no-transform';
52             try_files /$uri @proxy;
53         }
54 
55         location /auth {
56             # Disable auth for Auth endpoint so we can login or reset our password easily.
57             {{ if .Values.container.nginx.enableBasicAuth -}}
58             auth_basic off;
59             {{- end }}
60 
61             add_header Cache-Control 'no-store, no-cache, must-revalidate, proxy-revalidate, max-age=0, no-transform';
62             try_files /$uri @proxy;
63         }
64 
65         add_header Cache-Control 'no-store, no-cache, must-revalidate, proxy-revalidate, max-age=0, no-transform';
66         try_files /$uri @proxy;
67     }
68 
69     location @proxy {
70         # Uncomment if HTTPS is supported.
71         # add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
72 
73         proxy_connect_timeout 30;
74         proxy_send_timeout 30;
75         proxy_read_timeout 30;
76         send_timeout 30;
77         proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
78         # We have another proxy in front of this one. It will capture traffic
79         # as HTTPS, so we must not set X-Forwarded-Proto here since it's already
80         # set with the proper value.
81         # proxy_set_header X-Forwarded-Proto $schema;
82 
83         proxy_set_header Host $http_host;
84         proxy_redirect off;
85         proxy_pass http://$service.$namespace.svc.cluster.local$request_uri;
86     }
87 }

Creating and filling the database

We created a database for the test site as part as our initContainer for the backend. This container will create a new empty database, fill it and apply migrations (if any) to it.

We cannot share a database between all the instance because they may each have dedicated migrations. These migrations may not be compatible with one another. So a shared database would cause a lot of trouble or slow our development process down if we want to avoid them which we don't want.

However, We can have a dedicated and managed SQL instance that holds all the databases. All site have the same connection information, only the name of the database is different. This will make connection to the database way easier, and it's sufficient for a testing service.

Creating

To create the database, we rely on a Django command (since we use the Django web framework), but you can use something else if needed or even install psql to do this. It takes the name of the database to create and will run a CREATE DATABASE SQL command with the connection. For this to work, since the database for the site doesn't exist yet, we have to run it like this:

DB_NAME=postgres python manage.py create_dev_database $DB_NAME

It requires your database instance to have a postgres database (that's mostly likely the case) that is accessible with the account you will use to connect to all the database. This account must have admin permissions to create (and then delete) the database. Our Django install will use this database temporarily so our command can start and perform its required actions. Without an existing database to connect to, it would fail. You also need to configure your settings so that the DB_NAME environment variable is used to identify the database.

Since we can deploy a test site many times, our command must be able to detect whether the database already exists or not and don't try to recreate it. Or again, it will fail. Since there is no CREATE DATABASE IF NOT EXISTS, the command tries to create it and catch an error if it already exists.

Note

I don't use psql because it wasn't initially installed in these Docker images and I wanted to keep them small. I later had to add the psql client to correctly fill the database and didn't update this part that was working correctly. Since a simple to Django command does the trick, I don't think I need to do more than this.

Here is the command. Despite using raw SQL (Django's ORM don't have command to create a database), we still rely on the psycopg2 driver to escape the name of the database. Here, we know what we send so you can think of it as non required, but it's still a good practice to do, so let's not deviate from it.

import logging

from django.core.management import BaseCommand
from django.core.management.base import CommandParser
from django.db import connection
from django.db.utils import ProgrammingError
from psycopg2 import sql

logger = logging.getLogger(__name__)


class Command(BaseCommand):
    help = "Create the supplied database."

    def add_arguments(self, parser: CommandParser):
        parser.add_argument("db_name", nargs=1, help="Name of the database to create.")

    def handle(self, *args, **options):
        with connection.cursor() as cursor:
            sql_statement = sql.SQL("CREATE DATABASE {};").format(
                sql.Identifier(options["db_name"][0])
            )
            try:
                logger.info(f'Creating dev database {options["db_name"][0]}')
                cursor.execute(sql_statement)
            except ProgrammingError as e:
                logger.info(f"Failed to create database probably because it already exist: {e}.")
            else:
                logger.info("Successfully created database.")

Filling

Starting with an empty database for the test website, means that we should fill it manually before having a chance to do anything useful. That's so impractical, that we had to find something. There are three ways to fill a database for testing purposes:

  • Loading an existing database: if it already exists it can be an easier to fill the test one with actual data. It's probably not a good idea to use the production one (even an extract of it): it can contain too much data as well as sensitive data about your customers that must not leak. You can use the same as in preproduction (if you have one) though: it'll probably be up to date with lots of tests data your tester can use. They also should be able to reuse some of the data easily.
  • Loading a dedicated database file: it's good to start with a very clean state but it can take time to build and maintain.
  • Using a script to create objects in the database: probably the best. You still need to agree to what must go in there, create the script and maintain it. Since it's manage with the rest of the code, it should also always be up to date. It's probably the best if you want end to end tests.

Currently, we don't have end to end tests and we have a good preproduction database already known by the tests team. So given these current requirements and since creating a dedicated script would take a lot of time, we decided to load a dump from preproduction to achieve our need. We create this dump at regular interval (once a month right now) and use this dump with a Django command to import it.

Later, when we add end to end tests, we will need to build a more reliable set of data for our testing, but it can wait.

This command is launched just after the create_dev_database one like this:

python ./manage.py import_sql_file preprod-latest.sql --download

We then launch migrations as you would expect to apply new migrations for our code:

python ./manage.py migrate

Our command:

  1. Downloads a GZ compressed SQL file from our bucket.
  2. Decompresses it with shutil.copyfileobj to limit the memory impact. This will decompress the file using system tools and won't load the whole file in memory inside Python.
  3. Imports the file with psql. It seems the more efficient way to do it: we could read the file and let Django import it, but this test file from preproduction can grow and our init container can't use much memory. So, leaving this task at psql which can import big files efficiently, seems more reasonable. And we don't want our deployment to fail because of that.

As with database creation, the command will import the file if the database is empty to prevent errors and needless duplications. Here it is if you want to take a look:

import gzip
import logging
import os
import shutil
import subprocess

from django.conf import settings
from django.core.management import BaseCommand
from django.core.management.base import CommandParser
from django.db.utils import ProgrammingError
from storages.backends.gcloud import GoogleCloudStorage

from myproject.apps.banks.models import MyModel

logger = logging.getLogger(__name__)


class Command(BaseCommand):
    help = "Import a SQL file into the database."

    def add_arguments(self, parser: CommandParser):
        parser.add_argument("sql_file", nargs=1, help="Path to the file to import.")
        parser.add_argument(
            "--database",
            required=False,
            help="The database to fill (default value is read from Django settings).",
            default=settings.DATABASES["default"]["NAME"],
        )
        parser.add_argument(
            "--download",
            action="store_true",
            required=False,
            help="Download the file from Python and the proper bucket.",
        )

    def handle(self, *args, **options):
        try:
            objects_count = MyModel.objects.all().count()
        except ProgrammingError:
            # If the database is not filled, this will result in an error.
            objects_count = 0

        if objects_count > 0:
            logger.info("Database already imported, skipping.")
            return

        if options["download"]:
            self._download(options["sql_file"][0])

        logger.info("Importing SQL file.")
        subprocess.run(
            [
                "psql",
                "--echo-errors",
                "--host",
                settings.DATABASES["default"]["HOST"],
                "--username",
                settings.DATABASES["default"]["USER"],
                "--dbname",
                options["database"],
                "--file",
                options["sql_file"][0],
            ],
            check=True,
            shell=False,
            env={
                **os.environ,
                "PGPASSWORD": settings.DATABASES["default"]["PASSWORD"],
            },
        )
        logger.info("Done importing.")

    def _download(self, sql_filename):
        logger.info("Downloading.")
        storage = GoogleCloudStorage(bucket_name=settings.SQL_STORAGE_BUCKET_NAME)
        with storage.open("preprod-latest.sql.gz", "rb") as sql_file_from_storage, open(
            f"{sql_filename}.gz", "wb"
        ) as sql_file:
            sql_file.write(sql_file_from_storage.read())

        with gzip.open(f"{sql_filename}.gz", "rb") as gzip_sql_file, open(
            sql_filename, "wb"
        ) as sql_file:
            shutil.copyfileobj(gzip_sql_file, sql_file)

        logger.info("Done downloading.")

Enabling HTTPS

It's a good thing to enable HTTPS:

  • The app will be configure closer to how it is in other environment.
  • We will use the same username/password as in preproduction and it would be best not to leak these.

While in production and preproduction we use ManagedCertificate provide to us by Kubernetes and GCP, we cannot do this in our development environment. According to the documentation, these managed certificate don't support wildcard certificate which we need to protect all our domains. We could provision one ManagedCertificate per test site, but it would be long and costly.

So, like many websites on the internet, we decided to use Let's encrypt. There are project like cert-manager which can help manage Let's Encrypt certificates in a Kubernetes cluster. However, there are two down sides:

  • It requires to install lots of dedicated and custom components in the cluster.
  • Since we need a wildcard certificate, we need to use a DNS challenge. cert-manager only supports these challenges when the DNS is managed by AWS or GCP which is not our case.

So, at least for now, we decided to manage them by hand. In a nutshell, we use the method describe in this article.

We initially though about storing these certificates in a persistent volume mounted into the dev-proxy pod, but letting HTTPS traffic through the load balancer was more complex than we though. We tested many possible solutions and none of them worked. So, we created a dedicated secret with them inside and linked it to the Ingress as describe in the official Kubernetes documentation.

Note

It's not possible to have multi-level wildcard certificate, ie a certificate that would handle things like *.*.dev.example.com. In our original idea, that's how we wanted to structure our domain: SERVICE.NAMESPACE.dev.example.com. Since it's not possible, we changed how we do things into SERVICE--NAMESPACE.dev.example.com. It works equally well and is compatible with HTTPS and our way to generate certificates.

Note

We will (at least for a short time), handle renewal manually in a similar process as the one used to create the first certificate. Since it only happens once every 3 months, it will be manageable and better for now (to avoid wasting time on this). The main problem being the DNS challenge we would need to correctly automate with our DNS provider or to migrate our DNS to GCP.

Cleaning the environment

To clean the Kubernetes namespace and its database, we run a cleaning script named clean-k8s-dev-namespace.sh manually. Ideally, this would be automatic and run as soon as the branch is merged in GitHub. However, it would required extra work we didn't want to do right now. And it's not a big deal to run this once in a while (at least for now).

The script has two tasks:

  1. Cleaning the database which must be done first.
  2. Cleaning the namespace.

Note

In all these sample, the namespace variable will be filled with the namespace we are cleaning by the script.

Cleaning the database

Just like we created it, we will run a Django command to clean it. For this to work, we need to connect to the postgres database and not the proper database. To ease things, we connect to a pod in the proper namespace to run the command: it's there and running since we haven't cleaned it yet. We do this with in two steps:

  1. Select the pod to use with

    backend_api_pod=$(kubectl get pods --namespace "${namespace}" --field-selector=status.phase=Running |
        grep '^api' |
        cut -d ' ' -f 1)
    
  2. Connect to it and run the command:

    kubectl exec -it "${backend_api_pod}" -c api --namespace "${namespace}" \
        -- bash -c "DB_NAME=postgres python manage.py drop_dev_database '${namespace}'"
    

The command like the previous one is simple and run SQL code directly with the driver. We use the same pattern and the same arguments. It's handle function looks like this:

def handle(self, *args, **options):
    with connection.cursor() as cursor:
        sql_statement = sql.SQL("DROP DATABASE IF EXISTS {};").format(
            sql.Identifier(options["db_name"][0])
        )
        try:
            logger.info(f'Deleting dev database {options["db_name"][0]}')
            cursor.execute(sql_statement)
        except ProgrammingError as e:
            logger.info(f"Failed to delete database: {e}.")
        else:
            logger.info("Successfully deleted database.")

Cleaning the Kubernetes namespace

At first glance, it looks like we could simply run:

kubectl delete all --all --namespace "${namespace}"
kubectl delete namespace "${namespace}"

However, this would delete resources in the wrong order and our deployment would recreate the backend pod. Since this pod recreates the database in its initContainer, we would end up with a database. To avoid that, we need to delete some resources manually to be sure the deletion process will work as expected:

kubectl delete services --all --namespace "${namespace}"
kubectl delete deployment --all --namespace "${namespace}"
kubectl delete pods --all --namespace "${namespace}"
kubectl delete all --all --namespace "${namespace}"
kubectl delete namespace "${namespace}"

Conclusion

That's it. We have been using this for over a month now and it works perfectly. I hope you found it useful too. If you have any questions or remark, please leave a comment below.

I'd also like to give two tips before completing this article:

  • If you wont to shutdown your test Kubernetes cluster, I wrote an article for that not long ago.

  • You can use this code to ask for validation before doing something (cleaning a namespace for instance):

    # \033[1m will make the text bold, \033[0m will reset the display.
    echo -e "Deleting namespace \033[1m${namespace}\033[0m and all its resources. Press enter to continue or ^C-C to quit."
        read -r
    
  • You can use something like this to print the resources used by your test cluster:

    #!/usr/bin/env bash
    
    set -eu
    set -o pipefail
    
    current_cluster=MY_TEST_CLUSTER
    
    kubectl config use-context "${current_cluster}"
    
    number_nodes=$(kubectl get nodes | tail --lines +2 | wc -l)
    number_non_system_namespaces=$(list-k8s-non-system-namespaces | wc -l)
    number_pods=$(kubectl get pods --all-namespaces | tail --lines +2 | grep -v '^gatekeeper-system' | grep -v '^kube-' | wc -l)
    number_running_pods=$(kubectl get pods --all-namespaces --field-selector=status.phase=Running | tail --lines +2 | grep -v '^gatekeeper-system' | grep -v '^kube-' | wc -l)
    number_deployment=$(kubectl get deployments --all-namespaces | tail --lines +2 | grep -v '^gatekeeper-system' | grep -v '^kube-' | wc -l)
    number_services=$(kubectl get services --all-namespaces | tail --lines +2 | grep -v '^gatekeeper-system' | grep -v '^kube-' | wc -l)
    current_namespaces=$(kubectl get namespaces | tail --lines +2 | grep -v '^gatekeeper-system' | grep -v '^kube-' | cut -d ' ' -f 1 | sort | tr '\n' ', ')
    
    echo "Cluster ${current_cluster} has:"
    echo -e "\t- ${number_nodes} nodes"
    echo -e "\t- ${number_non_system_namespaces} non system namespaces"
    echo -e "\t- ${number_pods} pods (${number_running_pods} running)"
    echo -e "\t- ${number_deployment} deployments"
    echo -e "\t- ${number_services} services"
    echo -e "\t- Current namespaces: ${current_namespaces}"