
Intros

  • Hello! I'm Jérôme (@jpetazzo, Enix SAS)

  • The training will run from 9:30 to 13:00

  • There will be a break at (approximately) 11:00

  • Feel free to interrupt for questions at any time

  • Especially when you see full screen container pictures!

logistics.md

2/404

A brief introduction

  • This was initially written by Jérôme Petazzoni to support in-person, instructor-led workshops and tutorials

  • Credit is also due to multiple contributors — thank you!

  • You can also follow along on your own, at your own pace

  • We included as much information as possible in these slides

  • We recommend having a mentor to help you ...

  • ... Or be comfortable spending some time reading the Kubernetes documentation ...

  • ... And looking for answers on StackOverflow and other outlets

k8s/intro.md

3/404

Accessing these slides now

  • We recommend that you open these slides in your browser:

    https://2021-02-enix.container.training/

  • Use arrows to move to next/previous slide

    (up, down, left, right, page up, page down)

  • Type a slide number + ENTER to go to that slide

  • The slide number is also visible in the URL bar

    (e.g. .../#123 for slide 123)

shared/about-slides.md

4/404

Accessing these slides later

shared/about-slides.md

5/404

These slides are open source

  • You are welcome to use, re-use, share these slides

  • These slides are written in markdown

  • The sources of these slides are available in a public GitHub repository:

    https://github.com/jpetazzo/container.training

  • Typos? Mistakes? Questions? Feel free to hover over the bottom of the slide ...

👇 Try it! The source file will be shown, and you can view it on GitHub, fork it, and edit it.

shared/about-slides.md

6/404

Extra details

  • This slide has a little magnifying glass in the top left corner

  • This magnifying glass indicates slides that provide extra details

  • Feel free to skip them if:

    • you are in a hurry

    • you are new to this and want to avoid cognitive overload

    • you want only the most essential information

  • You can review these slides another time if you want, they'll be waiting for you ☺

shared/about-slides.md

7/404

Chat room

  • We've set up a chat room that we will monitor during the workshop

  • Don't hesitate to use it to ask questions, or get help, or share feedback

  • The chat room will also be available after the workshop

  • Join the chat room: Gitter

  • Say hi in the chat room!

shared/chat-room-im.md

8/404

Module 5

(auto-generated TOC)

shared/toc.md

13/404

Image separating from the next module

14/404

Pre-requirements

(automatically generated title slide)

15/404

Pre-requirements

  • Kubernetes concepts

    (pods, deployments, services, labels, selectors)

  • Hands-on experience working with containers

    (building images, running them; doesn't matter how exactly)

  • Familiar with the UNIX command-line

    (navigating directories, editing files, using kubectl)

k8s/prereqs-admin.md

16/404

Labs and exercises

  • We are going to build and break multiple clusters

  • Everyone will get their own private environment(s)

  • You are invited to reproduce all the demos (but you don't have to)

  • All hands-on sections are clearly identified, like the gray rectangle below

k8s/prereqs-admin.md

17/404

Private environments

  • Each person gets their own private set of VMs

  • Each person should have a printed card with connection information

  • We will connect to these VMs with SSH

    (if you don't have an SSH client, install one now!)

k8s/prereqs-admin.md

18/404

Doing or re-doing this on your own?

  • We are using basic cloud VMs with Ubuntu LTS

  • Kubernetes packages or binaries have been installed

    (depending on what we want to accomplish in the lab)

  • We disabled IP address checks

    • we want to route pod traffic directly between nodes

    • most cloud providers will treat pod IP addresses as invalid

    • ... and filter them out; so we disable that filter

k8s/prereqs-admin.md

19/404

Image separating from the next module

20/404

Kubernetes architecture

(automatically generated title slide)

21/404

Kubernetes architecture

We can arbitrarily split Kubernetes in two parts:

  • the nodes, a set of machines that run our containerized workloads;

  • the control plane, a set of processes implementing the Kubernetes APIs.

Kubernetes also relies on underlying infrastructure:

  • servers, network connectivity (obviously!),

  • optional components like storage systems, load balancers ...

k8s/architecture.md

22/404

Control plane location

The control plane can run:

  • in containers, on the same nodes that run other application workloads

    (examples: Minikube, kind; a single node runs everything)

  • on a dedicated node

    (example: a cluster installed with kubeadm)

  • on a dedicated set of nodes

    (example: Kubernetes The Hard Way; kops)

  • outside of the cluster

    (example: most managed clusters like AKS, EKS, GKE)

k8s/architecture.md

23/404

What runs on a node

  • Our containerized workloads

  • A container engine like Docker, CRI-O, containerd...

    (in theory, the choice doesn't matter, as the engine is abstracted by Kubernetes)

  • kubelet: an agent connecting the node to the cluster

    (it connects to the API server, registers the node, receives instructions)

  • kube-proxy: a component used for internal cluster communication

    (note that this is not an overlay network or a CNI plugin!)

k8s/architecture.md

25/404

What's in the control plane

  • Everything is stored in etcd

    (it's the only stateful component)

  • Everyone communicates exclusively through the API server:

    • we (users) interact with the cluster through the API server

    • the nodes register and get their instructions through the API server

    • the other control plane components also register with the API server

  • API server is the only component that reads/writes from/to etcd
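
If you're curious, once our own cluster is running (later in this workshop), you can peek at the keys written by the API server; a sketch, assuming an insecure local etcd on 127.0.0.1:2379 and the etcdctl v3 client:

    ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 \
      get /registry --prefix --keys-only | head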

k8s/architecture.md

26/404

Communication protocols: API server

  • The API server exposes a REST API

    (except for some calls, e.g. to attach interactively to a container)

  • Almost all requests and responses are JSON following a strict format

  • For performance, the requests and responses can also be done over protobuf

    (see this design proposal for details)

  • In practice, protobuf is used for all internal communication

    (between control plane components, and with kubelet)
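
If we want to issue REST requests ourselves, kubectl proxy is a convenient way to do it (a quick sketch; the proxy handles authentication for us, and kill %1 stops it afterwards):

    kubectl proxy --port=8001 &
    curl http://localhost:8001/api/v1/namespaces/default/pods
    kill %1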

k8s/architecture.md

27/404

Communication protocols: on the nodes

The kubelet agent uses a number of special-purpose protocols and interfaces, including:

  • CRI (Container Runtime Interface)

    • used for communication with the container engine
    • abstracts the differences between container engines
    • based on gRPC+protobuf
  • CNI (Container Network Interface)

    • used for communication with network plugins
    • network plugins are implemented as executable programs invoked by kubelet
    • network plugins provide IPAM
    • network plugins set up network interfaces in pods

k8s/architecture.md

28/404

Image separating from the next module

30/404

The Kubernetes API

(automatically generated title slide)

31/404

The Kubernetes API is declarative

  • We cannot tell the API, "run a pod"

  • We can tell the API, "here is the definition for pod X"

  • The API server will store that definition (in etcd)

  • Controllers will then wake up and create a pod matching the definition
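
For example, "the definition for pod X" could be submitted like this (a minimal sketch; the pod and image names are arbitrary):

    kubectl apply -f- <<EOF
    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-x
    spec:
      containers:
      - name: web
        image: nginx
    EOF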

k8s/architecture.md

33/404

The core features of the Kubernetes API

  • We can create, read, update, and delete objects

  • We can also watch objects

    (be notified when an object changes, or when an object of a given type is created)

  • Objects are strongly typed

  • Types are validated and versioned

  • Storage and watch operations are provided by etcd

    (note: the k3s project allows us to use sqlite instead of etcd)

k8s/architecture.md

34/404

Let's experiment a bit!

  • For the exercises in this section, connect to the first node of the test cluster
  • SSH to the first node of the test cluster

  • Check that the cluster is operational:

    kubectl get nodes
  • All nodes should be Ready

k8s/architecture.md

35/404

Create

  • Let's create a simple object
  • Create a namespace with the following command:
    kubectl create -f- <<EOF
    apiVersion: v1
    kind: Namespace
    metadata:
      name: hello
    EOF

This is equivalent to kubectl create namespace hello.

k8s/architecture.md

36/404

Read

  • Let's retrieve the object we just created
  • Read back our object:
    kubectl get namespace hello -o yaml

We see a lot of data that wasn't here when we created the object.

Some data was automatically added to the object (like spec.finalizers).

Some data is dynamic (typically, the content of status.)

k8s/architecture.md

37/404

API requests and responses

  • Almost every Kubernetes API payload (requests and responses) has the same format:

    apiVersion: xxx
    kind: yyy
    metadata:
      name: zzz
      (more metadata fields here)
    (more fields here)
  • The fields shown above are mandatory, except for some special cases

    (e.g.: in lists of resources, the list itself doesn't have a metadata.name)

  • We show YAML for convenience, but the API uses JSON

    (with optional protobuf encoding)
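
To see the JSON equivalent of an object (same payload as the YAML we've been looking at):

    kubectl get namespace hello -o json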

k8s/architecture.md

38/404

API versions

  • The apiVersion field corresponds to an API group

  • It can be either v1 (aka the "core" or "legacy" group), or group/version; e.g.:

    • apps/v1
    • rbac.authorization.k8s.io/v1
    • extensions/v1beta1
  • It does not indicate which version of Kubernetes we're talking about

  • It indirectly indicates the version of the kind

    (which fields exist, their format, which ones are mandatory...)

  • A single resource type (kind) is rarely versioned alone

    (e.g.: the batch API group contains jobs and cronjobs)
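
To list the API groups and versions available on our cluster (and see which resources belong to which group):

    kubectl api-versions
    kubectl api-resources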

k8s/architecture.md

39/404

Group-Version-Kind, or GVK

  • A particular type will be identified by the combination of:

    • the API group it belongs to (core, apps, metrics.k8s.io, ...)

    • the version of this API group (v1, v1beta1, ...)

    • the "Kind" itself (Pod, Role, Job, ...)

  • "GVK" appears a lot in the API machinery code

  • Conversions are possible between different versions and even between API groups

    (e.g. when Deployments moved from extensions to apps)
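
kubectl explain displays the group, version, and kind of a resource (along with its fields):

    kubectl explain deployment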

k8s/architecture.md

40/404

Update

  • Let's update our namespace object

  • There are many ways to do that, including:

    • kubectl apply (and provide an updated YAML file)
    • kubectl edit
    • kubectl patch
    • many helpers, like kubectl label, or kubectl set
  • In each case, kubectl will:

    • get the current definition of the object
    • compute changes
    • submit the changes (with PATCH requests)
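
For instance, a merge patch on the namespace we created earlier could look like this (a sketch; it sets the same label as the kubectl label command shown on the next slide):

    kubectl patch namespace hello --type=merge \
      -p '{"metadata":{"labels":{"color":"purple"}}}'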

k8s/architecture.md

41/404

Adding a label

  • For demonstration purposes, let's add a label to the namespace

  • The easiest way is to use kubectl label

  • In one terminal, watch namespaces:

    kubectl get namespaces --show-labels -w
  • In the other, update our namespace:

    kubectl label namespaces hello color=purple

We demonstrated update and watch semantics.

k8s/architecture.md

42/404

What's special about watch?

  • The API server itself doesn't do anything: it's just a fancy object store

  • All the actual logic in Kubernetes is implemented with controllers

  • A controller watches a set of resources, and takes action when they change

  • Examples:

    • when a Pod object is created, it gets scheduled and started

    • when a Pod belonging to a ReplicaSet terminates, it gets replaced

    • when a Deployment object is updated, it can trigger a rolling update

k8s/architecture.md

43/404

Watch events

  • kubectl get --watch shows changes

  • If we add --output-watch-events, we can also see:

    • the difference between ADDED and MODIFIED resources

    • DELETED resources

  • In one terminal, watch pods, displaying full events:

    kubectl get pods --watch --output-watch-events
  • In another, run a short-lived pod:

    kubectl run pause --image=alpine --rm -ti --restart=Never -- sleep 5

k8s/architecture.md

44/404

Image separating from the next module

45/404

Other control plane components

(automatically generated title slide)

46/404

Other control plane components

  • API server ✔️

  • etcd ✔️

  • Controller manager

  • Scheduler

k8s/architecture.md

47/404

Controller manager

  • This is a collection of loops watching all kinds of objects

  • That's where the actual logic of Kubernetes lives

  • When we create a Deployment (e.g. with kubectl create deployment web --image=nginx),

    • we create a Deployment object

    • the Deployment controller notices it, and creates a ReplicaSet

    • the ReplicaSet controller notices the ReplicaSet, and creates a Pod
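
We can see that chain in the objects' ownerReferences; for instance, assuming we created the Deployment web mentioned above, something like this should print Deployment (a sketch using JSONPath):

    kubectl get replicasets -l app=web \
      -o jsonpath='{.items[0].metadata.ownerReferences[0].kind}{"\n"}'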

k8s/architecture.md

48/404

Scheduler

  • When a pod is created, it is in Pending state

  • The scheduler (or rather: a scheduler) must bind it to a node

    • Kubernetes comes with an efficient scheduler with many features

    • if we have special requirements, we can add another scheduler
      (example: this demo scheduler uses the cost of nodes, stored in node annotations)

  • A pod might stay in Pending state for a long time:

    • if the cluster is full

    • if the pod has special constraints that can't be met

    • if the scheduler is not running (!)

49/404

:EN:- Kubernetes architecture review :FR:- Passage en revue de l'architecture de Kubernetes

k8s/architecture.md

19,000 words

They say, "a picture is worth one thousand words."

The following 19 slides show what really happens when we run:

kubectl create deployment web --image=nginx

k8s/deploymentslideshow.md

50/404

Image separating from the next module

70/404

Building our own cluster

(automatically generated title slide)

71/404

Building our own cluster

  • Let's build our own cluster!

    Perfection is attained not when there is nothing left to add, but when there is nothing left to take away. (Antoine de Saint-Exupéry)

  • Our goal is to build a minimal cluster allowing us to:

    • create a Deployment (with kubectl create deployment)
    • expose it with a Service
    • connect to that service
  • "Minimal" here means:

    • smaller number of components
    • smaller number of command-line flags
    • smaller number of configuration files

k8s/dmuc.md

72/404

Non-goals

  • For now, we don't care about security

  • For now, we don't care about scalability

  • For now, we don't care about high availability

  • All we care about is simplicity

k8s/dmuc.md

73/404

Our environment

  • We will use the machine indicated as dmuc1

    (this stands for "Dessine Moi Un Cluster" or "Draw Me A Cluster",
    in homage to Saint-Exupéry's "The Little Prince")

  • This machine:

    • runs Ubuntu LTS

    • has Kubernetes, Docker, and etcd binaries installed

    • but nothing is running

k8s/dmuc.md

74/404

Checking our environment

  • Let's make sure we have everything we need first
  • Log into the dmuc1 machine

  • Get root:

    sudo -i
  • Check available versions:

    etcd -version
    kube-apiserver --version
    dockerd --version

k8s/dmuc.md

75/404

The plan

  1. Start API server

  2. Interact with it (create Deployment and Service)

  3. See what's broken

  4. Fix it and go back to step 2 until it works!

k8s/dmuc.md

76/404

Dealing with multiple processes

  • We are going to start many processes

  • Depending on what you're comfortable with, you can:

    • open multiple windows and multiple SSH connections

    • use a terminal multiplexer like screen or tmux

    • put processes in the background with &
      (warning: log output might get confusing to read!)

k8s/dmuc.md

77/404

Starting API server

  • Try to start the API server:
    kube-apiserver
    # It will fail with "--etcd-servers must be specified"

Since the API server stores everything in etcd, it cannot start without it.

k8s/dmuc.md

78/404

Starting etcd

  • Try to start etcd:
    etcd

Success!

Note the last line of output:

serving insecure client requests on 127.0.0.1:2379, this is strongly discouraged!

Sure, that's discouraged. But thanks for telling us the address!

k8s/dmuc.md

79/404

Starting API server (for real)

  • Try again, passing the --etcd-servers argument

  • That argument should be a comma-separated list of URLs

  • Start API server:
    kube-apiserver --etcd-servers http://127.0.0.1:2379

Success!

k8s/dmuc.md

80/404

Interacting with API server

  • Let's try a few "classic" commands
  • List nodes:

    kubectl get nodes
  • List services:

    kubectl get services

We should get No resources found. and the kubernetes service, respectively.

Note: the API server automatically created the kubernetes service entry.

k8s/dmuc.md

81/404

What about kubeconfig?

  • We didn't need to create a kubeconfig file

  • By default, the API server is listening on localhost:8080

    (without requiring authentication)

  • By default, kubectl connects to localhost:8080

    (without providing authentication)

k8s/dmuc.md

82/404

Creating a Deployment

  • Let's run a web server!
  • Create a Deployment with NGINX:
    kubectl create deployment web --image=nginx

Success?

k8s/dmuc.md

83/404

Checking our Deployment status

  • Look at pods, deployments, etc.:
    kubectl get all

Our Deployment is in bad shape:

NAME                  READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/web   0/1     0            0           2m26s

Also, there is no ReplicaSet and no Pod.

k8s/dmuc.md

84/404

What's going on?

  • We stored the definition of our Deployment in etcd

    (through the API server)

  • But there is no controller to do the rest of the work

  • We need to start the controller manager

k8s/dmuc.md

85/404

Starting the controller manager

  • Try to start the controller manager:
    kube-controller-manager

The final error message is:

invalid configuration: no configuration has been provided

But the logs include another useful piece of information:

Neither --kubeconfig nor --master was specified.
Using the inClusterConfig. This might not work.

k8s/dmuc.md

86/404

Reminder: everyone talks to API server

  • The controller manager needs to connect to the API server

  • It does not have a convenient localhost:8080 default

  • We can pass the connection information in two ways:

    • --master and a host:port combination (easy)

    • --kubeconfig and a kubeconfig file

  • For simplicity, we'll use the first option

k8s/dmuc.md

87/404

Starting the controller manager (for real)

  • Start the controller manager:
    kube-controller-manager --master http://localhost:8080

Success!

k8s/dmuc.md

88/404

Checking our Deployment status

  • Check all our resources again:
    kubectl get all

We now have a ReplicaSet.

But we still don't have a Pod.

k8s/dmuc.md

89/404

What's going on?

In the controller manager logs, we should see something like this:

E0404 15:46:25.753376 22847 replica_set.go:450] Sync "default/web-5bc9bd5b8d"
failed with No API token found for service account "default", retry after the
token is automatically created and added to the service account

  • The service account default was automatically added to our Deployment

    (and to its pods)

  • The service account default exists

  • But it doesn't have an associated token

    (the token is a secret; creating it requires a signature, and therefore a CA)

k8s/dmuc.md

90/404

Solving the missing token issue

There are many ways to solve that issue.

We are going to list a few (to get an idea of what's happening behind the scenes).

Of course, we don't need to perform all the solutions mentioned here.

k8s/dmuc.md

91/404

Option 1: disable service accounts

  • Restart the API server with --disable-admission-plugins=ServiceAccount

  • The API server will no longer add a service account automatically

  • Our pods will be created without a service account
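
Concretely, that restart could look like this (a sketch, reusing the flags from our earlier invocation):

    kube-apiserver --etcd-servers http://127.0.0.1:2379 \
      --disable-admission-plugins=ServiceAccount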

k8s/dmuc.md

92/404

Option 2: do not mount the (missing) token

  • Add automountServiceAccountToken: false to the Deployment spec

    or

  • Add automountServiceAccountToken: false to the default ServiceAccount

  • The ReplicaSet controller will no longer create pods referencing the (missing) token

  • Programmatically change the default ServiceAccount:
    kubectl patch sa default -p "automountServiceAccountToken: false"

k8s/dmuc.md

93/404

Option 3: set up service accounts properly

  • This is the most complex option!

  • Generate a key pair

  • Pass the private key to the controller manager

    (to generate and sign tokens)

  • Pass the public key to the API server

    (to verify these tokens)
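
A sketch of what that could look like (the flags are real; the paths are arbitrary; recent Kubernetes versions also require extra flags for token projection):

    openssl genrsa -out /tmp/sa.key 2048
    openssl rsa -in /tmp/sa.key -pubout -out /tmp/sa.pub
    kube-controller-manager --master http://localhost:8080 \
      --service-account-private-key-file=/tmp/sa.key
    kube-apiserver --etcd-servers http://127.0.0.1:2379 \
      --service-account-key-file=/tmp/sa.pub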

k8s/dmuc.md

94/404

Continuing without service account token

  • Once we patch the default service account, the ReplicaSet can create a Pod
  • Check that we now have a pod:
    kubectl get all

Note: we might have to wait a bit for the ReplicaSet controller to retry.

If we're impatient, we can restart the controller manager.

k8s/dmuc.md

95/404

What's next?

  • Our pod exists, but it is in Pending state

  • Remember, we don't have a node so far

    (kubectl get nodes shows an empty list)

  • We need to:

    • start a container engine

    • start kubelet

k8s/dmuc.md

96/404

Starting a container engine

  • We're going to use Docker (because it's the default option)
  • Start the Docker Engine:
    dockerd

Success!

Feel free to check that it actually works with e.g.:

docker run alpine echo hello world

k8s/dmuc.md

97/404

Starting kubelet

  • If we start kubelet without arguments, it will start

  • But it will not join the cluster!

  • It will start in standalone mode

  • Just like with the controller manager, we need to tell kubelet where the API server is

  • Alas, kubelet doesn't have a simple --master option

  • We have to use --kubeconfig

  • We need to write a kubeconfig file for kubelet

k8s/dmuc.md

98/404

Writing a kubeconfig file

  • We can copy/paste a bunch of YAML

  • Or we can generate the file with kubectl

  • Create the file ~/.kube/config with kubectl:
    kubectl config \
    set-cluster localhost --server http://localhost:8080
    kubectl config \
    set-context localhost --cluster localhost
    kubectl config \
    use-context localhost

k8s/dmuc.md

99/404

Our ~/.kube/config file

The file that we generated looks like the one below.

That one has been slightly simplified (removing extraneous fields), but it is still valid.

apiVersion: v1
kind: Config
current-context: localhost
contexts:
- name: localhost
  context:
    cluster: localhost
clusters:
- name: localhost
  cluster:
    server: http://localhost:8080

k8s/dmuc.md

100/404

Starting kubelet

  • Start kubelet with that kubeconfig file:
    kubelet --kubeconfig ~/.kube/config

Success!

k8s/dmuc.md

101/404

Looking at our 1-node cluster

  • Let's check that our node registered correctly
  • List the nodes in our cluster:
    kubectl get nodes

Our node should show up.

Its name will be its hostname (it should be dmuc1).

k8s/dmuc.md

102/404

Are we there yet?

  • Let's check if our pod is running
  • List all resources:
    kubectl get all

Our pod is still Pending. 🤔

Which is normal: it needs to be scheduled.

(i.e., something needs to decide which node it should go on.)

k8s/dmuc.md

105/404

Scheduling our pod

  • Why do we need a scheduling decision, since we have only one node?

  • The node might be full, unavailable; the pod might have constraints ...

  • The easiest way to schedule our pod is to start the scheduler

    (we could also schedule it manually)

k8s/dmuc.md

106/404

Starting the scheduler

  • The scheduler also needs to know how to connect to the API server

  • Just like for controller manager, we can use --kubeconfig or --master

  • Start the scheduler:
    kube-scheduler --master http://localhost:8080
  • Our pod should now start correctly

k8s/dmuc.md

107/404

Checking the status of our pod

  • Our pod will go through a short ContainerCreating phase

  • Then it will be Running

  • Check pod status:
    kubectl get pods

Success!

k8s/dmuc.md

108/404

Scheduling a pod manually

  • We can schedule a pod in Pending state by creating a Binding, e.g.:

    kubectl create -f- <<EOF
    apiVersion: v1
    kind: Binding
    metadata:
      name: name-of-the-pod
    target:
      apiVersion: v1
      kind: Node
      name: name-of-the-node
    EOF
  • This is actually how the scheduler works!

  • It watches pods, makes scheduling decisions, and creates Binding objects

k8s/dmuc.md

109/404

Connecting to our pod

  • Let's check that our pod correctly runs NGINX
  • Check our pod's IP address:

    kubectl get pods -o wide
  • Send some HTTP request to the pod:

    curl X.X.X.X

We should see the Welcome to nginx! page.

k8s/dmuc.md

110/404

Exposing our Deployment

  • We can now create a Service associated with this Deployment
  • Expose the Deployment's port 80:

    kubectl expose deployment web --port=80
  • Check the Service's ClusterIP, and try connecting:

    kubectl get service web
    curl http://X.X.X.X

This won't work. We need kube-proxy to enable internal communication.

k8s/dmuc.md

112/404

Starting kube-proxy

  • kube-proxy also needs to connect to the API server

  • It can work with the --master flag

    (although that will be deprecated in the future)

  • Start kube-proxy:
    kube-proxy --master http://localhost:8080

k8s/dmuc.md

113/404

Connecting to our Service

  • Now that kube-proxy is running, we should be able to connect
  • Check the Service's ClusterIP again, and retry connecting:
    kubectl get service web
    curl http://X.X.X.X

Success!

k8s/dmuc.md

114/404

How kube-proxy works

  • kube-proxy watches Service resources

  • When a Service is created or updated, kube-proxy creates iptables rules

  • Check out the OUTPUT chain in the nat table:

    iptables -t nat -L OUTPUT
  • Traffic is sent to KUBE-SERVICES; check that too:

    iptables -t nat -L KUBE-SERVICES

For each Service, there is an entry in that chain.

k8s/dmuc.md

115/404

Diving into iptables

  • The last command showed a chain named KUBE-SVC-... corresponding to our service
  • Check that KUBE-SVC-... chain:

    iptables -t nat -L KUBE-SVC-...
  • It should show a jump to a KUBE-SEP-... chain; check it out too:

    iptables -t nat -L KUBE-SEP-...

This is a DNAT rule to rewrite the destination address of the connection to our pod.

This is how kube-proxy works!

k8s/dmuc.md

116/404

kube-router, IPVS

  • With recent versions of Kubernetes, it is possible to tell kube-proxy to use IPVS

  • IPVS is a more powerful load balancing framework

    (remember: iptables was primarily designed for firewalling, not load balancing!)

  • It is also possible to replace kube-proxy with kube-router

  • kube-router uses IPVS by default

  • kube-router can also perform other functions

    (e.g., we can use it as a CNI plugin to provide pod connectivity)

k8s/dmuc.md

117/404

What about the kubernetes service?

  • If we try to connect, it won't work

    (by default, it should be 10.0.0.1)

  • If we look at the Endpoints for this service, we will see one endpoint:

    host-address:6443

  • By default, the API server expects to be running directly on the nodes

    (it could be as a bare process, or in a container/pod using the host network)

  • ... And it expects to be listening on port 6443 with TLS

118/404

:EN:- Building our own cluster from scratch :FR:- Construire son cluster à la main

k8s/dmuc.md

Image separating from the next module

119/404

Adding nodes to the cluster

(automatically generated title slide)

120/404

Adding nodes to the cluster

  • So far, our cluster has only 1 node

  • Let's see what it takes to add more nodes

  • We are going to use another set of machines: kubenet

k8s/multinode.md

121/404

The environment

  • We have 3 identical machines: kubenet1, kubenet2, kubenet3

  • The Docker Engine is installed (and running) on these machines

  • The Kubernetes binaries are installed, but nothing is running

  • We will use kubenet1 to run the control plane

k8s/multinode.md

122/404

The plan

  • Start the control plane on kubenet1

  • Join the 3 nodes to the cluster

  • Deploy and scale a simple web server

  • Log into kubenet1

k8s/multinode.md

123/404

Running the control plane

  • We will use a Compose file to start the control plane components
  • Clone the repository containing the workshop materials:

    git clone https://github.com/jpetazzo/container.training
  • Go to the compose/simple-k8s-control-plane directory:

    cd container.training/compose/simple-k8s-control-plane
  • Start the control plane:

    docker-compose up

k8s/multinode.md

124/404

Checking the control plane status

  • Before moving on, verify that the control plane works
  • Show control plane component statuses:

    kubectl get componentstatuses
    kubectl get cs
  • Show the (empty) list of nodes:

    kubectl get nodes

k8s/multinode.md

125/404

Differences from dmuc

  • Our new control plane listens on 0.0.0.0 instead of the default 127.0.0.1

  • The ServiceAccount admission plugin is disabled

k8s/multinode.md

126/404

Joining the nodes

  • We need to generate a kubeconfig file for kubelet

  • This time, we need to put the public IP address of kubenet1

    (instead of localhost or 127.0.0.1)

  • Generate the kubeconfig file:
    kubectl config set-cluster kubenet --server http://X.X.X.X:8080
    kubectl config set-context kubenet --cluster kubenet
    kubectl config use-context kubenet
    cp ~/.kube/config ~/kubeconfig

k8s/multinode.md

127/404

Distributing the kubeconfig file

  • We need that kubeconfig file on the other nodes, too
  • Copy kubeconfig to the other nodes:
    for N in 2 3; do
    scp ~/kubeconfig kubenet$N:
    done

k8s/multinode.md

128/404

Starting kubelet

  • Reminder: kubelet needs to run as root; don't forget sudo!
  • Join the first node:

    sudo kubelet --kubeconfig ~/kubeconfig
  • Open more terminals and join the other nodes to the cluster:

    ssh kubenet2 sudo kubelet --kubeconfig ~/kubeconfig
    ssh kubenet3 sudo kubelet --kubeconfig ~/kubeconfig

k8s/multinode.md

129/404

Checking cluster status

  • We should now see all 3 nodes

  • At first, their STATUS will be NotReady

  • They will move to Ready state after approximately 10 seconds

  • Check the list of nodes:
    kubectl get nodes

k8s/multinode.md

130/404

Deploy a web server

  • Let's create a Deployment and scale it

    (so that we have multiple pods on multiple nodes)

  • Create a Deployment running NGINX:

    kubectl create deployment web --image=nginx
  • Scale it:

    kubectl scale deployment web --replicas=5

k8s/multinode.md

131/404

Check our pods

  • The pods will be scheduled on the nodes

  • The nodes will pull the nginx image, and start the pods

  • What are the IP addresses of our pods?

  • Check the IP addresses of our pods:
    kubectl get pods -o wide

🤔 Something's not right ... Some pods have the same IP address!

k8s/multinode.md

133/404

What's going on?

  • Without the --network-plugin flag, kubelet defaults to "no-op" networking

  • It lets the container engine use a default network

    (in that case, we end up with the default Docker bridge)

  • Our pods are running on independent, disconnected, host-local networks

k8s/multinode.md

134/404

What do we need to do?

  • On a normal cluster, kubelet is configured to set up pod networking with CNI plugins

  • This requires:

    • installing CNI plugins

    • writing CNI configuration files

    • running kubelet with --network-plugin=cni

k8s/multinode.md

135/404

Using network plugins

  • We need to set up a better network

  • Before diving into CNI, we will use the kubenet plugin

  • This plugin creates a cbr0 bridge and connects the containers to that bridge

  • This plugin allocates IP addresses from a range:

    • either specified to kubelet (e.g. with --pod-cidr)

    • or stored in the node's spec.podCIDR field

See here for more details about this kubenet plugin.

k8s/multinode.md

136/404

What kubenet does and does not do

  • It allocates IP addresses to pods locally

    (each node has its own local subnet)

  • It connects the pods to a local bridge

    (pods on the same node can communicate with each other, but not with pods on other nodes)

  • It doesn't set up routing or tunneling

    (we get pods on separate networks; we need to connect them somehow)

  • It doesn't allocate subnets to nodes

    (this can be done manually, or by the controller manager)

k8s/multinode.md

137/404

Setting up routing or tunneling

  • On each node, we will add routes to the other nodes' pod networks

  • Of course, this is not convenient or scalable!

  • We will see better techniques to do this; but for now, hang on!

k8s/multinode.md

138/404

Allocating subnets to nodes

  • There are multiple options:

    • passing the subnet to kubelet with the --pod-cidr flag

    • manually setting spec.podCIDR on each node

    • allocating node CIDRs automatically with the controller manager

  • The last option would be implemented by adding these flags to controller manager:

    --allocate-node-cidrs=true --cluster-cidr=<cidr>
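
For the second option, a sketch (this only works while spec.podCIDR is still unset, since the field cannot be changed once populated):

    kubectl patch node kubenet2 -p '{"spec":{"podCIDR":"10.C.2.0/24"}}'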

k8s/multinode.md

139/404

The pod CIDR field is not mandatory

  • kubenet needs the pod CIDR, but other plugins don't need it

    (e.g. because they allocate addresses in multiple pools, or a single big one)

  • The pod CIDR field may eventually be deprecated and replaced by an annotation

    (see kubernetes/kubernetes#57130)

k8s/multinode.md

140/404

Restarting kubelet with pod CIDR

  • We need to stop and restart all our kubelets

  • We will add the --network-plugin and --pod-cidr flags

  • We all have a "cluster number" (let's call that C) printed on our VM info cards

  • We will use pod CIDR 10.C.N.0/24 (where N is the node number: 1, 2, 3)

  • Stop all the kubelets (Ctrl-C is fine)

  • Restart them all, adding --network-plugin=kubenet --pod-cidr 10.C.N.0/24

k8s/multinode.md

141/404

What happens to our pods?

  • When we stop (or kill) kubelet, the containers keep running

  • When kubelet starts again, it detects the containers

  • Check that our pods are still here:
    kubectl get pods -o wide

🤔 But our pods still use local IP addresses!

k8s/multinode.md

142/404

Recreating the pods

  • The IP address of a pod cannot change

  • kubelet doesn't automatically kill/restart containers with "invalid" addresses
    (in fact, from kubelet's point of view, there is no such thing as an "invalid" address)

  • We must delete our pods and recreate them

  • Delete all the pods, and let the ReplicaSet recreate them:

    kubectl delete pods --all
  • Wait for the pods to be up again:

    kubectl get pods -o wide -w

k8s/multinode.md

143/404

Adding kube-proxy

  • Let's start kube-proxy to provide internal load balancing

  • Then see if we can create a Service and use it to contact our pods

  • Start kube-proxy:

    sudo kube-proxy --kubeconfig ~/.kube/config
  • Expose our Deployment:

    kubectl expose deployment web --port=80

k8s/multinode.md

144/404

Test internal load balancing

  • Retrieve the ClusterIP address:

    kubectl get svc web
  • Send a few requests to the ClusterIP address (with curl)

Sometimes it works, sometimes it doesn't. Why?

k8s/multinode.md

146/404

Routing traffic

  • Our pods have new, distinct IP addresses

  • But they are on host-local, isolated networks

  • If we try to ping a pod on a different node, it won't work

  • kube-proxy merely rewrites the destination IP address

  • But we need that IP address to be reachable in the first place

  • How do we fix this?

    (hint: check the title of this slide!)

k8s/multinode.md

147/404

Important warning

  • The technique that we are about to use doesn't work everywhere

  • It only works if:

    • all the nodes are directly connected to each other (at layer 2)

    • the underlying network allows the IP addresses of our pods

  • If we are on physical machines connected by a switch: OK

  • If we are on virtual machines in a public cloud: NOT OK

    • on AWS, we need to disable "source and destination checks" on our instances

    • on OpenStack, we need to disable "port security" on our network ports

k8s/multinode.md

148/404

Routing basics

  • We need to tell each node:

    "The subnet 10.C.N.0/24 is located on node N" (for all values of N)

  • This is how we add a route on Linux:

    ip route add 10.C.N.0/24 via W.X.Y.Z

    (where W.X.Y.Z is the internal IP address of node N)

  • We can see the internal IP addresses of our nodes with:

    kubectl get nodes -o wide

k8s/multinode.md

149/404

Firewalling

  • By default, Docker prevents containers from using arbitrary IP addresses

    (by setting up iptables rules)

  • We need to allow our containers to use our pod CIDR

  • For simplicity, we will insert a blanket iptables rule allowing all traffic:

    iptables -I FORWARD -j ACCEPT

  • This has to be done on every node

k8s/multinode.md

150/404

Setting up routing

  • Create all the routes on all the nodes

  • Insert the iptables rule allowing traffic

  • Check that you can ping all the pods from one of the nodes

  • Check that you can curl the ClusterIP of the Service successfully

k8s/multinode.md

151/404

What's next?

  • We did a lot of manual operations:

    • allocating subnets to nodes

    • adding command-line flags to kubelet

    • updating the routing tables on our nodes

  • We want to automate all these steps

  • We want something that works on all networks

152/404

:EN:- Connecting nodes and pods :FR:- Interconnecter les nœuds et les pods

k8s/multinode.md

Image separating from the next module

153/404

The Container Network Interface

(automatically generated title slide)

154/404

The Container Network Interface

  • Allows us to decouple network configuration from Kubernetes

  • Implemented by plugins

  • Plugins are executables that will be invoked by kubelet

  • Plugins are responsible for:

    • allocating IP addresses for containers

    • configuring the network for containers

  • Plugins can be combined and chained when it makes sense

k8s/cni.md

155/404

Combining plugins

  • Interface could be created by e.g. vlan or bridge plugin

  • IP address could be allocated by e.g. dhcp or host-local plugin

  • Interface parameters (MTU, sysctls) could be tweaked by the tuning plugin

The reference plugins are available here.

Look in each plugin's directory for its documentation.

k8s/cni.md

156/404

How does kubelet know which plugins to use?

  • The plugin (or list of plugins) is set in the CNI configuration

  • The CNI configuration is a single file in /etc/cni/net.d

  • If there are multiple files in that directory, the first one is used

    (in lexicographic order)

  • That path can be changed with the --cni-conf-dir flag of kubelet

k8s/cni.md

157/404

CNI configuration in practice

  • When we set up the "pod network" (like Calico, Weave...) it ships a CNI configuration

    (and sometimes, custom CNI plugins)

  • Very often, that configuration (and plugins) is installed automatically

    (by a DaemonSet featuring an initContainer with hostPath volumes)

  • Examples:

k8s/cni.md

158/404

Conf vs conflist

  • There are two slightly different configuration formats

  • Basic configuration format:

    • holds configuration for a single plugin
    • typically has a .conf name suffix
    • has a type string field in the top-most structure
    • examples
  • Configuration list format:

    • can hold configuration for multiple (chained) plugins
    • typically has a .conflist name suffix
    • has a plugins list field in the top-most structure
    • examples
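
For illustration, a minimal conflist chaining the bridge plugin (with host-local IPAM) and the tuning plugin might look like this (a sketch; the name, bridge, subnet, and sysctl values are arbitrary):

    {
      "cniVersion": "0.3.1",
      "name": "mynet",
      "plugins": [
        {
          "type": "bridge",
          "bridge": "cbr0",
          "isDefaultGateway": true,
          "ipam": {
            "type": "host-local",
            "subnet": "10.1.1.0/24"
          }
        },
        {
          "type": "tuning",
          "sysctl": {
            "net.core.somaxconn": "512"
          }
        }
      ]
    }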

k8s/cni.md

159/404

How plugins are invoked

  • Parameters are given through environment variables, including:

    • CNI_COMMAND: desired operation (ADD, DEL, CHECK, or VERSION)

    • CNI_CONTAINERID: container ID

    • CNI_NETNS: path to network namespace file

    • CNI_IFNAME: what the network interface should be named

  • The network configuration must be provided to the plugin on stdin

    (this avoids race conditions that could happen by passing a file path)
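
To get a feel for it, we can invoke a plugin by hand; a sketch using the host-local IPAM plugin (assuming the CNI plugins are installed in /opt/cni/bin; the network name and subnet are arbitrary):

    echo '{"cniVersion": "0.3.1", "name": "demo",
           "ipam": {"type": "host-local", "subnet": "10.9.8.0/24"}}' |
      CNI_COMMAND=ADD CNI_CONTAINERID=demo CNI_NETNS=/dev/null \
      CNI_IFNAME=eth0 CNI_PATH=/opt/cni/bin /opt/cni/bin/host-local

The plugin should print a JSON result containing the allocated IP address.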

k8s/cni.md

160/404

In practice: kube-router

  • We are going to set up a new cluster

  • For this new cluster, we will use kube-router

  • kube-router will provide the "pod network"

    (connectivity with pods)

  • kube-router will also provide internal service connectivity

    (replacing kube-proxy)

k8s/cni.md

161/404

How kube-router works

  • Very simple architecture

  • Does not introduce new CNI plugins

    (uses the bridge plugin, with host-local for IPAM)

  • Pod traffic is routed between nodes

    (no tunnel, no new protocol)

  • Internal service connectivity is implemented with IPVS

  • Can provide pod network and/or internal service connectivity

  • kube-router daemon runs on every node

k8s/cni.md

162/404

What kube-router does

  • Connect to the API server

  • Obtain the local node's podCIDR

  • Inject it into the CNI configuration file

    (we'll use /etc/cni/net.d/10-kuberouter.conflist)

  • Obtain the addresses of all nodes

  • Establish a full mesh BGP peering with the other nodes

  • Exchange routes over BGP

k8s/cni.md

163/404

What's BGP?

  • BGP (Border Gateway Protocol) is the protocol used between internet routers

  • It scales pretty well (it is used to announce the 700k CIDR prefixes of the internet)

  • It is spoken by many hardware routers from many vendors

  • It also has many software implementations (Quagga, Bird, FRR...)

  • Experienced network folks generally know it (and appreciate it)

  • It is also used by Calico (another popular network system for Kubernetes)

  • Using BGP allows us to interconnect our "pod network" with other systems

k8s/cni.md

164/404

The plan

  • We'll work in a new cluster (named kuberouter)

  • We will run a simple control plane (like before)

  • ... But this time, the controller manager will allocate podCIDR subnets

    (so that we don't have to manually assign subnets to individual nodes)

  • We will create a DaemonSet for kube-router

  • We will join nodes to the cluster

  • The DaemonSet will automatically start a kube-router pod on each node

k8s/cni.md

165/404

Logging into the new cluster

  • Log into node kuberouter1

  • Clone the workshop repository:

    git clone https://github.com/jpetazzo/container.training
  • Move to this directory:

    cd container.training/compose/kube-router-k8s-control-plane

k8s/cni.md

166/404

Checking the CNI configuration

  • By default, kubelet gets the CNI configuration from /etc/cni/net.d
  • Check the content of /etc/cni/net.d

(On most machines, at this point, /etc/cni/net.d doesn't even exist.)

k8s/cni.md

167/404

Our control plane

  • We will use a Compose file to start the control plane

  • It is similar to the one we used with the kubenet cluster

  • The API server is started with --allow-privileged

    (because we will start kube-router in privileged pods)

  • The controller manager is started with extra flags too:

    --allocate-node-cidrs and --cluster-cidr

  • We need to edit the Compose file to set the Cluster CIDR

k8s/cni.md

168/404

Starting the control plane

  • Our cluster CIDR will be 10.C.0.0/16

    (where C is our cluster number)

  • Edit the Compose file to set the Cluster CIDR:

    vim docker-compose.yaml
  • Start the control plane:

    docker-compose up

k8s/cni.md

169/404

The kube-router DaemonSet

  • In the same directory, there is a kuberouter.yaml file

  • It contains the definition for a DaemonSet and a ConfigMap

  • Before we load it, we also need to edit it

  • We need to indicate the address of the API server

    (because kube-router needs to connect to it to retrieve node information)

k8s/cni.md

170/404

Creating the DaemonSet

  • The address of the API server will be http://A.B.C.D:8080

    (where A.B.C.D is the public address of kuberouter1, running the control plane)

  • Edit the YAML file to set the API server address:

    vim kuberouter.yaml
  • Create the DaemonSet:

    kubectl create -f kuberouter.yaml

Note: the DaemonSet won't create any pods (yet) since there are no nodes (yet).

k8s/cni.md

171/404

Generating the kubeconfig for kubelet

  • This is similar to what we did for the kubenet cluster
  • Generate the kubeconfig file (replacing X.X.X.X with the address of kuberouter1):
    kubectl config set-cluster cni --server http://X.X.X.X:8080
    kubectl config set-context cni --cluster cni
    kubectl config use-context cni
    cp ~/.kube/config ~/kubeconfig

k8s/cni.md

172/404

Distributing kubeconfig

  • We need to copy that kubeconfig file to the other nodes
  • Copy kubeconfig to the other nodes:
    for N in 2 3; do
    scp ~/kubeconfig kuberouter$N:
    done

k8s/cni.md

173/404

Starting kubelet

  • We don't need the --pod-cidr option anymore

    (the controller manager will allocate these automatically)

  • We need to pass --network-plugin=cni

  • Join the first node:

    sudo kubelet --kubeconfig ~/kubeconfig --network-plugin=cni
  • Open more terminals and join the other nodes:

    ssh kuberouter2 sudo kubelet --kubeconfig ~/kubeconfig --network-plugin=cni
    ssh kuberouter3 sudo kubelet --kubeconfig ~/kubeconfig --network-plugin=cni

k8s/cni.md

174/404

Checking the CNI configuration

  • At this point, kube-router should have installed its CNI configuration

    (in /etc/cni/net.d)

  • Check the content of /etc/cni/net.d
  • There should be a file created by kube-router

  • The file should contain the node's podCIDR

k8s/cni.md

175/404

Setting up a test

  • Let's create a Deployment and expose it with a Service
  • Create a Deployment running a web server:

    kubectl create deployment web --image=jpetazzo/httpenv
  • Scale it so that it spans multiple nodes:

    kubectl scale deployment web --replicas=5
  • Expose it with a Service:

    kubectl expose deployment web --port=8888

k8s/cni.md

176/404

Checking that everything works

  • Get the ClusterIP address for the service:

    kubectl get svc web
  • Send a few requests there:

    curl X.X.X.X:8888

Note that if you send multiple requests, they are load-balanced in a round robin manner.

This shows that we are using IPVS (vs. iptables, which picked random endpoints).

k8s/cni.md

177/404

Troubleshooting

  • What if we need to check that everything is working properly?
  • Check the IP addresses of our pods:

    kubectl get pods -o wide
  • Check our routing table:

    route -n
    ip route

We should see the local pod CIDR connected to kube-bridge, and a route to each other node's pod CIDR, with that node as the gateway.

k8s/cni.md

178/404

More troubleshooting

  • We can also look at the output of the kube-router pods

    (with kubectl logs)

  • kube-router also comes with a special shell that gives lots of useful info

    (we can access it with kubectl exec)

  • But with the current setup of the cluster, these options may not work!

  • Why?

k8s/cni.md

179/404

Trying kubectl logs / kubectl exec

  • Try to show the logs of a kube-router pod:

    kubectl -n kube-system logs ds/kube-router
  • Or try to exec into one of the kube-router pods:

    kubectl -n kube-system exec kube-router-xxxxx bash

These commands will give an error message that includes:

dial tcp: lookup kuberouterX on 127.0.0.11:53: no such host

What does that mean?

k8s/cni.md

180/404

Internal name resolution

  • To execute these commands, the API server needs to connect to kubelet

  • By default, it creates a connection using the kubelet's name

    (e.g. http://kuberouter1:...)

  • This requires our nodes names to be in DNS

  • We can change that by setting a flag on the API server:

    --kubelet-preferred-address-types=InternalIP

k8s/cni.md

181/404

Another way to check the logs

  • We can also ask the logs directly to the container engine

  • First, get the container ID, with docker ps or like this:

    CID=$(docker ps -q \
    --filter label=io.kubernetes.pod.namespace=kube-system \
    --filter label=io.kubernetes.container.name=kube-router)
  • Then view the logs:

    docker logs $CID

k8s/cni.md

182/404

Other ways to distribute routing tables

  • We don't need kube-router and BGP to distribute routes

  • The list of nodes (and associated podCIDR subnets) is available through the API

  • This shell snippet generates the commands to add all required routes on a node:

NODES=$(kubectl get nodes -o name | cut -d/ -f2)
for DESTNODE in $NODES; do
  if [ "$DESTNODE" != "$HOSTNAME" ]; then
    echo $(kubectl get node $DESTNODE -o go-template="
    route add -net {{.spec.podCIDR}} gw {{(index .status.addresses 0).address}}")
  fi
done
  • This could be useful for embedded platforms with very limited resources

    (or lab environments for learning purposes)

183/404

:EN:- Configuring CNI plugins :FR:- Configurer des plugins CNI

k8s/cni.md

Image separating from the next module

184/404

Interconnecting clusters

(automatically generated title slide)

185/404

Interconnecting clusters

  • We assigned different Cluster CIDRs to each cluster

  • This allows us to connect our clusters together

  • We will leverage kube-router BGP abilities for that

  • We will peer each kube-router instance with a route reflector

  • As a result, we will be able to ping each other's pods

k8s/interco.md

186/404

Disclaimers

  • There are many methods to interconnect clusters

  • Depending on your network implementation, you will use different methods

  • The method shown here only works for nodes with direct layer 2 connection

  • We will often need to use tunnels or other network techniques

k8s/interco.md

187/404

The plan

  • Someone will start the route reflector

    (typically, that will be the person presenting these slides!)

  • We will update our kube-router configuration

  • We will add a peering with the route reflector

    (instructing kube-router to connect to it and exchange route information)

  • We should see the routes to other clusters on our nodes

    (in the output of e.g. route -n or ip route show)

  • We should be able to ping pods of other nodes

k8s/interco.md

188/404

Starting the route reflector

  • Only do this slide if you are doing this on your own

  • There is a Compose file in the compose/frr-route-reflector directory

  • Before continuing, make sure that you have the IP address of the route reflector

k8s/interco.md

189/404

Configuring kube-router

  • This can be done in two ways:

    • with command-line flags to the kube-router process

    • with annotations to Node objects

  • We will use the command-line flags

    (because it will automatically propagate to all nodes)

Note: with Calico, this is achieved by creating a BGPPeer CRD.

k8s/interco.md

190/404

Updating kube-router configuration

  • We need to pass two command-line flags to the kube-router process
  • Edit the kuberouter.yaml file

  • Add the following flags to the kube-router arguments:

    - "--peer-router-ips=X.X.X.X"
    - "--peer-router-asns=64512"

    (Replace X.X.X.X with the route reflector address)

  • Update the DaemonSet definition:

    kubectl apply -f kuberouter.yaml

k8s/interco.md

191/404

Restarting kube-router

  • The DaemonSet will not update the pods automatically

    (it is using the default updateStrategy, which is OnDelete)

  • We will therefore delete the pods

    (they will be recreated with the updated definition)

  • Delete all the kube-router pods:
    kubectl delete pods -n kube-system -l k8s-app=kube-router

Note: the other updateStrategy for a DaemonSet is RollingUpdate.
For critical services, we might want to precisely control the update process.

k8s/interco.md

192/404

Checking peering status

  • We can see informative messages in the output of kube-router:

    time="2019-04-07T15:53:56Z" level=info msg="Peer Up"
    Key=X.X.X.X State=BGP_FSM_OPENCONFIRM Topic=Peer
  • We should see the routes of the other clusters show up

  • For debugging purposes, the reflector also exports a route to 1.0.0.2/32

  • That route will show up like this:

    1.0.0.2 172.31.X.Y 255.255.255.255 UGH 0 0 0 eth0
  • We should be able to ping the pods of other clusters!

k8s/interco.md

193/404

If we wanted to do more ...

  • kube-router can also export ClusterIP addresses

    (by adding the flag --advertise-cluster-ip)

  • They are exported individually (as /32)

  • This would allow us to easily access other clusters' services

    (without having to resolve the individual addresses of pods)

  • Even better if it's combined with DNS integration

    (to facilitate name → ClusterIP resolution)

194/404

:EN:- Interconnecting clusters :FR:- Interconnexion de clusters

k8s/interco.md

Image separating from the next module

195/404

CNI internals

(automatically generated title slide)

196/404

CNI internals

  • Kubelet looks for a CNI configuration file

    (by default, in /etc/cni/net.d)

  • Note: if we have multiple files, the first one will be used

    (in lexicographic order)

  • If no configuration can be found, kubelet holds off on creating containers

    (except if they are using hostNetwork)

  • Let's see how exactly plugins are invoked!

k8s/cni-internals.md

197/404

General principle

  • A plugin is an executable program

  • It is invoked by kubelet to set up / tear down networking for a container

  • It doesn't take any command-line argument

  • However, it uses environment variables to know what to do, which container, etc.

  • It reads JSON on stdin, and writes back JSON on stdout

  • There will generally be multiple plugins invoked in a row

    (at least IPAM + network setup; possibly more)

k8s/cni-internals.md

198/404

Environment variables

  • CNI_COMMAND: ADD, DEL, CHECK, or VERSION

  • CNI_CONTAINERID: opaque identifier

    (container ID of the "sandbox", i.e. the container running the pause image)

  • CNI_NETNS: path to network namespace pseudo-file

    (e.g. /var/run/netns/cni-0376f625-29b5-7a21-6c45-6a973b3224e5)

  • CNI_IFNAME: interface name, usually eth0

  • CNI_PATH: path(s) with plugin executables (e.g. /opt/cni/bin)

  • CNI_ARGS: "extra arguments" (see next slide)

k8s/cni-internals.md

199/404

CNI_ARGS

  • Extra key/value pair arguments passed by "the user"

  • "The user", here, is "kubelet" (or in an abstract way, "Kubernetes")

  • This is used to pass the pod name and namespace to the CNI plugin

  • Example:

    IgnoreUnknown=1
    K8S_POD_NAMESPACE=default
    K8S_POD_NAME=web-96d5df5c8-jcn72
    K8S_POD_INFRA_CONTAINER_ID=016493dbff152641d334d9828dab6136c1ff...

Note that technically, it's a ;-separated list, so it really looks like this:

CNI_ARGS=IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=web-96d...

k8s/cni-internals.md

200/404

JSON in, JSON out

  • The plugin reads its configuration on stdin

  • It writes back results in JSON

    (e.g. allocated address, routes, DNS...)

⚠️ "Plugin configuration" is not always the same as "CNI configuration"!

k8s/cni-internals.md

201/404

Conf vs Conflist

  • The CNI configuration can be a single plugin configuration

    • it will then contain a type field in the top-most structure

    • it will be passed "as-is"

  • It can also be a "conflist", containing a chain of plugins

    (it will then contain a plugins field in the top-most structure)

  • Plugins are then invoked in order (reverse order for DEL action)

  • In that case, the input of the plugin is not the whole configuration

    (see details on next slide)

k8s/cni-internals.md

202/404

List of plugins

  • When invoking a plugin in a list, the JSON input will be:

    • the configuration of the plugin

    • augmented with name (matching the conf list name)

    • augmented with prevResult (which will be the output of the previous plugin)

  • Conceptually, a plugin (generally the first one) will do the "main setup"

  • Other plugins can do tuning / refinement (firewalling, traffic shaping...)

k8s/cni-internals.md

203/404

Analyzing plugins

  • Let's see what goes in and out of our CNI plugins!

  • We will create a fake plugin that:

    • saves its environment and input

    • executes the real plugin with the saved input

    • saves the plugin output

    • passes the saved output (and exit status) back to the caller

k8s/cni-internals.md

204/404

Our fake plugin

#!/bin/sh
# Name under which we were invoked (e.g. "ptp" if we are symlinked as ptp)
PLUGIN=$(basename $0)
# Save the JSON configuration received on stdin
cat > /tmp/cni.$$.$PLUGIN.in
# Save the environment (CNI_COMMAND, CNI_CONTAINERID, etc.)
env | sort > /tmp/cni.$$.$PLUGIN.env
# Record which process invoked us
echo "PPID=$PPID, $(readlink /proc/$PPID/exe)" > /tmp/cni.$$.$PLUGIN.parent
# Run the real plugin with the saved input, and save its output
$0.real < /tmp/cni.$$.$PLUGIN.in > /tmp/cni.$$.$PLUGIN.out
EXITSTATUS=$?
# Relay the output and exit status back to the caller
cat /tmp/cni.$$.$PLUGIN.out
exit $EXITSTATUS

Save this script as /opt/cni/bin/debug and make it executable.

k8s/cni-internals.md

205/404

Substituting the fake plugin

  • For each plugin that we want to instrument:

    • rename the plugin from e.g. ptp to ptp.real

    • symlink ptp to our debug plugin

  • There is no need to change the CNI configuration or restart kubelet
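
For example, to instrument the ptp plugin:

    cd /opt/cni/bin
    sudo mv ptp ptp.real
    sudo ln -s debug ptp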

k8s/cni-internals.md

206/404

Create some pods and look at the results

  • Create a pod

  • For each instrumented plugin, there will be files in /tmp:

    cni.PID.pluginname.in (JSON input)

    cni.PID.pluginname.env (environment variables)

    cni.PID.pluginname.parent (parent process information)

    cni.PID.pluginname.out (JSON output)
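
For instance, to get a quick overview of what was captured (exact file names will vary):

    ls -lt /tmp/cni.*
    cat /tmp/cni.*.parent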

❓️ What is calling our plugins?

207/404

:EN:- Deep dive into CNI internals :FR:- La Container Network Interface (CNI) en détails

k8s/cni-internals.md

Image separating from the next module

208/404

API server availability

(automatically generated title slide)

209/404

API server availability

  • When we set up a node, we need the address of the API server:

    • for kubelet

    • for kube-proxy

    • sometimes for the pod network system (like kube-router)

  • How do we ensure the availability of that endpoint?

    (what if the node running the API server goes down?)

k8s/apilb.md

210/404

Option 1: external load balancer

  • Set up an external load balancer

  • Point kubelet (and other components) to that load balancer

  • Put the node(s) running the API server behind that load balancer

  • Update the load balancer if/when an API server node needs to be replaced

  • On cloud infrastructures, some mechanisms provide automation for this

    (e.g. on AWS, an Elastic Load Balancer + Auto Scaling Group)

  • Example in Kubernetes The Hard Way

k8s/apilb.md

211/404

Option 2: local load balancer

  • Set up a load balancer (like NGINX, HAProxy...) on each node

  • Configure that load balancer to send traffic to the API server node(s)

  • Point kubelet (and other components) to localhost

  • Update the load balancer configuration when API server nodes are updated
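
As an illustration, a local HAProxy could be configured with something like this (the control plane addresses below are placeholders):

    # /etc/haproxy/haproxy.cfg (fragment)
    listen kube-api
        bind 127.0.0.1:6443
        mode tcp
        option tcp-check
        server master1 10.0.0.11:6443 check
        server master2 10.0.0.12:6443 check
        server master3 10.0.0.13:6443 check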

k8s/apilb.md

212/404

Updating the local load balancer config

  • Distribute the updated configuration (push)

  • Or regularly check for updates (pull)

  • The latter requires an external, highly available store

    (it could be an object store, an HTTP server, or even DNS...)

  • Updates can be facilitated by a DaemonSet

    (but remember that it can't be used when installing a new node!)

k8s/apilb.md

213/404

Option 3: DNS records

  • Put all the API server nodes behind a round-robin DNS

  • Point kubelet (and other components) to that name

  • Update the records when needed

  • Note: this option is not officially supported

    (but since kubelet supports reconnection anyway, it should work)

k8s/apilb.md

214/404

Option 4: ....................

  • Many managed clusters expose a high-availability API endpoint

    (and you don't have to worry about it)

  • You can also use HA mechanisms that you're familiar with

    (e.g. virtual IPs)

  • Tunnels are also fine

    (e.g. k3s uses a tunnel to allow each node to contact the API server)

215/404

:EN:- Ensuring API server availability :FR:- Assurer la disponibilité du serveur API

k8s/apilb.md

Image separating from the next module

216/404

Kubernetes Internal APIs

(automatically generated title slide)

217/404

Kubernetes Internal APIs

  • Almost every Kubernetes component has some kind of internal API

    (some components even have multiple APIs on different ports!)

  • At the very least, these can be used for healthchecks

    (you should leverage this if you are deploying and operating Kubernetes yourself!)

  • Sometimes, they are used internally by Kubernetes

    (e.g. when the API server retrieves logs from kubelet)

  • Let's review some of these APIs!

k8s/internal-apis.md

218/404

API hunting guide

This is how we found and investigated these APIs:

  • look for open ports on Kubernetes nodes

    (worker nodes or control plane nodes)

  • check which process owns that port

  • probe the port (with curl or other tools)

  • read the source code of that process

    (in particular when looking for API routes)

OK, now let's see the results!

k8s/internal-apis.md

219/404

etcd

  • 2379/tcp → etcd clients

    • should be HTTPS and require mTLS authentication
  • 2380/tcp → etcd peers

    • should be HTTPS and require mTLS authentication
  • 2381/tcp → etcd healthcheck

    • HTTP without authentication

    • exposes two API routes: /health and /metrics
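
On a node running etcd (e.g. on a cluster deployed with kubeadm, where the healthcheck listens on localhost), this can be probed with:

    curl http://localhost:2381/health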

k8s/internal-apis.md

220/404

kubelet

  • 10248/tcp → healthcheck

    • HTTP without authentication

    • exposes a single API route, /healthz, that just returns ok

  • 10250/tcp → internal API

    • should be HTTPS and require mTLS authentication

    • used by the API server to obtain logs, kubectl exec, etc.

k8s/internal-apis.md

221/404

kubelet API

  • We can authenticate with e.g. our TLS admin certificate

  • The following routes should be available:

    • /healthz
    • /configz (serves kubelet configuration)
    • /metrics
    • /pods (returns desired state)
    • /runningpods (returns current state from the container runtime)
    • /logs (serves files from /var/log)
    • /containerLogs/<namespace>/<podname>/<containername> (can add e.g. ?tail=10)
    • /run, /exec, /attach, /portForward
  • See kubelet source code for details!

k8s/internal-apis.md

222/404

Trying the kubelet API

The following example should work on a cluster deployed with kubeadm.

  1. Obtain the key and certificate for the cluster-admin user.

  2. Log into a node.

  3. Copy the key and certificate on the node.

  4. Find out the name of the kube-proxy pod running on that node.

  5. Run the following command, updating the pod name:

    curl -d cmd=ls -k --cert admin.crt --key admin.key \
    https://localhost:10250/run/kube-system/kube-proxy-xy123/kube-proxy

... This should show the content of the root directory in the pod.

k8s/internal-apis.md

223/404

kube-proxy

  • 10249/tcp → healthcheck

    • HTTP, without authentication

    • exposes a few API routes: /healthz (just returns ok), /configz, /metrics

  • 10256/tcp → another healthcheck

    • HTTP, without authentication

    • also exposes a /healthz API route (but this one shows a timestamp)
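
Both can be probed with plain HTTP requests from the node, for instance:

    curl http://localhost:10249/healthz
    curl http://localhost:10256/healthz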

k8s/internal-apis.md

224/404

kube-controller and kube-scheduler

  • 10257/tcp → kube-controller

    • HTTPS, with optional mTLS authentication

    • /healthz doesn't require authentication

    • ... but /configz and /metrics do (use e.g. admin key and certificate)

  • 10259/tcp → kube-scheduler

    • similar to kube-controller, with the same routes
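
Since /healthz doesn't require authentication, we can check these endpoints from the control plane node with e.g.:

    curl -k https://localhost:10257/healthz
    curl -k https://localhost:10259/healthz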
225/404

:EN:- Kubernetes internal APIs :FR:- Les APIs internes de Kubernetes

k8s/internal-apis.md

Image separating from the next module

226/404

Static pods

(automatically generated title slide)

227/404

Static pods

  • Hosting the Kubernetes control plane on Kubernetes has advantages:

    • we can use Kubernetes' replication and scaling features for the control plane

    • we can leverage rolling updates to upgrade the control plane

  • However, there is a catch:

    • deploying on Kubernetes requires the API to be available

    • the API won't be available until the control plane is deployed

  • How can we get out of that chicken-and-egg problem?

k8s/staticpods.md

228/404

A possible approach

  • Since each component of the control plane can be replicated...

  • We could set up the control plane outside of the cluster

  • Then, once the cluster is fully operational, create replicas running on the cluster

  • Finally, remove the replicas that are running outside of the cluster

What could possibly go wrong?

k8s/staticpods.md

229/404

Sawing off the branch you're sitting on

  • What if anything goes wrong?

    (During the setup or at a later point)

  • Worst case scenario, we might need to:

    • set up a new control plane (outside of the cluster)

    • restore a backup from the old control plane

    • move the new control plane to the cluster (again)

  • This doesn't sound like a great experience

k8s/staticpods.md

230/404

Static pods to the rescue

  • Pods are started by kubelet (an agent running on every node)

  • To know which pods it should run, the kubelet queries the API server

  • The kubelet can also get a list of static pods from:

    • a directory containing one (or multiple) manifests, and/or

    • a URL (serving a manifest)

  • These "manifests" are basically YAML definitions

    (As produced by kubectl get pod my-little-pod -o yaml)

k8s/staticpods.md

231/404

Static pods are dynamic

  • Kubelet will periodically reload the manifests

  • It will start/stop pods accordingly

    (i.e. it is not necessary to restart the kubelet after updating the manifests)

  • When connected to the Kubernetes API, the kubelet will create mirror pods

  • Mirror pods are copies of the static pods

    (so they can be seen with e.g. kubectl get pods)

k8s/staticpods.md

232/404

Bootstrapping a cluster with static pods

  • We can run control plane components with these static pods

  • They can start without requiring access to the API server

  • Once they are up and running, the API becomes available

  • These pods are then visible through the API

    (We cannot upgrade them from the API, though)

This is how kubeadm has initialized our clusters.

k8s/staticpods.md

233/404

Static pods vs normal pods

  • The API only gives us read-only access to static pods

  • We can kubectl delete a static pod...

    ...But the kubelet will re-mirror it immediately

  • Static pods can be selected just like other pods

    (So they can receive service traffic)

  • A service can select a mixture of static and other pods

k8s/staticpods.md

234/404

From static pods to normal pods

  • Once the control plane is up and running, it can be used to create normal pods

  • We can then set up a copy of the control plane in normal pods

  • Then the static pods can be removed

  • The scheduler and the controller manager use leader election

    (Only one is active at a time; removing an instance is seamless)

  • Each instance of the API server adds itself to the kubernetes service

  • Etcd will typically require more work!

k8s/staticpods.md

235/404

From normal pods back to static pods

  • Alright, but what if the control plane is down and we need to fix it?

  • We restart it using static pods!

  • This can be done automatically with the Pod Checkpointer

  • The Pod Checkpointer automatically generates manifests of running pods

  • The manifests are used to restart these pods if API contact is lost

    (More details in the Pod Checkpointer documentation page)

  • This technique is used by bootkube

k8s/staticpods.md

236/404

Where should the control plane run?

Is it better to run the control plane in static pods, or normal pods?

  • If I'm a user of the cluster: I don't care, it makes no difference to me

  • What if I'm an admin, i.e. the person who installs, upgrades, repairs... the cluster?

  • If I'm using a managed Kubernetes cluster (AKS, EKS, GKE...) it's not my problem

    (I'm not the one setting up and managing the control plane)

  • If I already picked a tool (kubeadm, kops...) to set up my cluster, the tool decides for me

  • What if I haven't picked a tool yet, or if I'm installing from scratch?

    • static pods = easier to set up, easier to troubleshoot, less risk of outage

    • normal pods = easier to upgrade, easier to move (if nodes need to be shut down)

k8s/staticpods.md

237/404

Static pods in action

  • On our clusters, the staticPodPath is /etc/kubernetes/manifests
  • Have a look at this directory:
    ls -l /etc/kubernetes/manifests

We should see YAML files corresponding to the pods of the control plane.

k8s/staticpods.md

238/404

Running a static pod

  • We are going to add a pod manifest to the directory, and kubelet will run it
  • Copy a manifest to the directory:

    sudo cp ~/container.training/k8s/just-a-pod.yaml /etc/kubernetes/manifests
  • Check that it's running:

    kubectl get pods

The output should include a pod named hello-node1.

k8s/staticpods.md

239/404

Remarks

In the manifest, the pod was named hello.

apiVersion: v1
kind: Pod
metadata:
  name: hello
  namespace: default
spec:
  containers:
  - name: hello
    image: nginx

The -node1 suffix was added automatically by kubelet.

If we delete the pod (with kubectl delete), it will be recreated immediately.

To delete the pod, we need to delete (or move) the manifest file.

240/404

:EN:- Static pods :FR:- Les static pods

k8s/staticpods.md

Image separating from the next module

241/404

Upgrading clusters

(automatically generated title slide)

242/404

Upgrading clusters

  • It's recommended to run consistent versions across a cluster

    (mostly to have feature parity and latest security updates)

  • It's not mandatory

    (otherwise, cluster upgrades would be a nightmare!)

  • Components can be upgraded one at a time without problems

k8s/cluster-upgrade.md

243/404

Checking what we're running

  • It's easy to check the version for the API server
  • Log into node test1

  • Check the version of kubectl and of the API server:

    kubectl version
  • In an HA setup with multiple API servers, they can have different versions

  • Running the command above multiple times can return different values

k8s/cluster-upgrade.md

244/404

Node versions

  • It's also easy to check the version of kubelet
  • Check node versions (includes kubelet, kernel, container engine):
    kubectl get nodes -o wide
  • Different nodes can run different kubelet versions

  • Different nodes can run different kernel versions

  • Different nodes can run different container engines

k8s/cluster-upgrade.md

245/404

Control plane versions

  • If the control plane is self-hosted (running in pods), we can check it
  • Show image versions for all pods in kube-system namespace:
    kubectl --namespace=kube-system get pods -o json \
    | jq -r '
    .items[]
    | [.spec.nodeName, .metadata.name]
    +
    (.spec.containers[].image | split(":"))
    | @tsv
    ' \
    | column -t

k8s/cluster-upgrade.md

246/404

What version are we running anyway?

  • When I say, "I'm running Kubernetes 1.15", is that the version of:

    • kubectl

    • API server

    • kubelet

    • controller manager

    • something else?

k8s/cluster-upgrade.md

247/404

Other versions that are important

  • etcd

  • kube-dns or CoreDNS

  • CNI plugin(s)

  • Network controller, network policy controller

  • Container engine

  • Linux kernel

k8s/cluster-upgrade.md

248/404

General guidelines

  • To update a component, use whatever was used to install it

  • If it's a distro package, update that distro package

  • If it's a container or pod, update that container or pod

  • If you used configuration management, update with that

k8s/cluster-upgrade.md

249/404

Know where your binaries come from

  • Sometimes, we need to upgrade quickly

    (when a vulnerability is announced and patched)

  • If we are using an installer, we should:

    • make sure it's using upstream packages

    • or make sure that whatever packages it uses are current

    • make sure we can tell it to pin specific component versions

k8s/cluster-upgrade.md

250/404

Important questions

  • Should we upgrade the control plane before or after the kubelets?

  • Within the control plane, should we upgrade the API server first or last?

  • How often should we upgrade?

  • How long are versions maintained?

  • All the answers are in the documentation about version skew policy!

  • Let's review the key elements together ...

k8s/cluster-upgrade.md

251/404

Kubernetes uses semantic versioning

  • Kubernetes versions look like MAJOR.MINOR.PATCH; e.g. in 1.17.2:

    • MAJOR = 1
    • MINOR = 17
    • PATCH = 2
  • It's always possible to mix and match different PATCH releases

    (e.g. 1.16.1 and 1.16.6 are compatible)

  • It is recommended to run the latest PATCH release

    (but it's mandatory only when there is a security advisory)

k8s/cluster-upgrade.md

252/404

Version skew

  • API server must be more recent than its clients (kubelet and control plane)

  • ... Which means it must always be upgraded first

  • All components support a difference of one¹ MINOR version

  • This allows live upgrades (since we can mix e.g. 1.15 and 1.16)

  • It also means that going from 1.14 to 1.16 requires going through 1.15

¹Except kubelet, which can be up to two MINOR behind API server, and kubectl, which can be one MINOR ahead or behind API server.

k8s/cluster-upgrade.md

253/404

Release cycle

  • There is a new PATCH release whenever necessary

    (every few weeks, or "ASAP" when there is a security vulnerability)

  • There is a new MINOR release every 3 months (approximately)

  • At any given time, three MINOR releases are maintained

  • ... Which means that MINOR releases are maintained for approximately 9 months

  • We should expect to upgrade at least every 3 months (on average)

k8s/cluster-upgrade.md

254/404

In practice

  • We are going to update a few cluster components

  • We will change the kubelet version on one node

  • We will change the version of the API server

  • We will work with cluster test (nodes test1, test2, test3)

k8s/cluster-upgrade.md

255/404

Updating the API server

  • This cluster has been deployed with kubeadm

  • The control plane runs in static pods

  • These pods are started automatically by kubelet

    (even when kubelet can't contact the API server)

  • They are defined in YAML files in /etc/kubernetes/manifests

    (this path is set by a kubelet command-line flag)

  • kubelet automatically updates the pods when the files are changed

k8s/cluster-upgrade.md

256/404

Changing the API server version

  • We will edit the YAML file to use a different image version
  • Log into node test1

  • Check API server version:

    kubectl version
  • Edit the API server pod manifest:

    sudo vim /etc/kubernetes/manifests/kube-apiserver.yaml
  • Look for the image: line, and update it to e.g. v1.16.0

k8s/cluster-upgrade.md

257/404

Checking what we've done

  • The API server will be briefly unavailable while kubelet restarts it
  • Check the API server version:
    kubectl version

k8s/cluster-upgrade.md

258/404

Was that a good idea?

259/404

Was that a good idea?

No!

260/404

Was that a good idea?

No!

  • Remember the guideline we gave earlier:

    To update a component, use whatever was used to install it.

  • This control plane was deployed with kubeadm

  • We should use kubeadm to upgrade it!

k8s/cluster-upgrade.md

261/404

Updating the whole control plane

  • Let's make it right, and use kubeadm to upgrade the entire control plane

    (note: this is possible only because the cluster was installed with kubeadm)

  • Check what will be upgraded:
    sudo kubeadm upgrade plan

Note 1: kubeadm thinks that our cluster is running 1.16.0.
It is confused by our manual upgrade of the API server!

Note 2: kubeadm itself is still version 1.15.9.
It doesn't know how to upgrade to 1.16.X.

k8s/cluster-upgrade.md

262/404

Upgrading kubeadm

  • First things first: we need to upgrade kubeadm
  • Upgrade kubeadm:

    sudo apt install kubeadm
  • Check what kubeadm tells us:

    sudo kubeadm upgrade plan

Problem: kubeadm doesn't know how to handle upgrades from version 1.15.

This is because we installed version 1.17 (or even later).

We need to install kubeadm version 1.16.X.

k8s/cluster-upgrade.md

263/404

Downgrading kubeadm

  • We need to go back to version 1.16.X (e.g. 1.16.6)
  • View available versions for package kubeadm:

    apt show kubeadm -a | grep ^Version | grep 1.16
  • Downgrade kubeadm:

    sudo apt install kubeadm=1.16.6-00
  • Check what kubeadm tells us:

    sudo kubeadm upgrade plan

kubeadm should now agree to upgrade to 1.16.6.

k8s/cluster-upgrade.md

264/404

Upgrading the cluster with kubeadm

  • Ideally, we should revert our image: change

    (so that kubeadm executes the right migration steps)

  • Or we can try the upgrade anyway

  • Perform the upgrade:
    sudo kubeadm upgrade apply v1.16.6

k8s/cluster-upgrade.md

265/404

Updating kubelet

  • These nodes have been installed using the official Kubernetes packages

  • We can therefore use apt or apt-get

  • Log into node test3

  • View available versions for package kubelet:

    apt show kubelet -a | grep ^Version
  • Upgrade kubelet:

    sudo apt install kubelet=1.16.6-00

k8s/cluster-upgrade.md

266/404

Checking what we've done

  • Log into node test1

  • Check node versions:

    kubectl get nodes -o wide
  • Create a deployment and scale it to make sure that the node still works

k8s/cluster-upgrade.md

267/404

Was that a good idea?

268/404

Was that a good idea?

Almost!

269/404

Was that a good idea?

Almost!

  • Yes, kubelet was installed with distribution packages

  • However, kubeadm took care of configuring kubelet

    (when doing kubeadm join ...)

  • We were supposed to run a special command before upgrading kubelet!

  • That command should be executed on each node

  • It will download the kubelet configuration generated by kubeadm

k8s/cluster-upgrade.md

270/404

Upgrading kubelet the right way

  • We need to upgrade kubeadm, upgrade kubelet config, then upgrade kubelet

    (after upgrading the control plane)

  • Download the configuration on each node, and upgrade kubelet:
    for N in 1 2 3; do
      ssh test$N "
        sudo apt install kubeadm=1.16.6-00 &&
        sudo kubeadm upgrade node &&
        sudo apt install kubelet=1.16.6-00"
    done

k8s/cluster-upgrade.md

271/404

Checking what we've done

  • All our nodes should now be updated to version 1.16.6
  • Check nodes versions:
    kubectl get nodes -o wide

k8s/cluster-upgrade.md

272/404

Skipping versions

  • This example worked because we went from 1.15 to 1.16

  • If you are upgrading from e.g. 1.14, you will have to go through 1.15 first

  • This means upgrading kubeadm to 1.15.X, then using it to upgrade the cluster

  • Then upgrading kubeadm to 1.16.X, etc.

  • Make sure to read the release notes before upgrading!

273/404

:EN:- Best practices for cluster upgrades :EN:- Example: upgrading a kubeadm cluster

:FR:- Bonnes pratiques pour la mise à jour des clusters :FR:- Exemple : mettre à jour un cluster kubeadm

k8s/cluster-upgrade.md

Image separating from the next module

274/404

Backing up clusters

(automatically generated title slide)

275/404

Backing up clusters

  • Backups can have multiple purposes:

    • disaster recovery (servers or storage are destroyed or unreachable)

    • error recovery (human or process has altered or corrupted data)

    • cloning environments (for testing, validation...)

  • Let's see the strategies and tools available with Kubernetes!

k8s/cluster-backup.md

276/404

Important

  • Kubernetes helps us with disaster recovery

    (it gives us replication primitives)

  • Kubernetes helps us clone / replicate environments

    (all resources can be described with manifests)

  • Kubernetes does not help us with error recovery

  • We still need to back up/snapshot our data:

    • with database backups (mysqldump, pgdump, etc.)

    • and/or snapshots at the storage layer

    • and/or traditional full disk backups

k8s/cluster-backup.md

277/404

In a perfect world ...

  • The deployment of our Kubernetes clusters is automated

    (recreating a cluster takes less than a minute of human time)

  • All the resources (Deployments, Services...) on our clusters are under version control

    (never use kubectl run; always apply YAML files coming from a repository)

  • Stateful components are either:

    • stored on systems with regular snapshots

    • backed up regularly to an external, durable storage

    • outside of Kubernetes

k8s/cluster-backup.md

278/404

Kubernetes cluster deployment

  • If our deployment system isn't fully automated, it should at least be documented

  • Litmus test: how long does it take to deploy a cluster...

    • for a senior engineer?

    • for a new hire?

  • Does it require external intervention?

    (e.g. provisioning servers, signing TLS certs...)

k8s/cluster-backup.md

279/404

Plan B

  • Full machine backups of the control plane can help

  • If the control plane is in pods (or containers), pay attention to storage drivers

    (if the backup mechanism is not container-aware, the backups can take way more resources than they should, or even be unusable!)

  • If the previous sentence worries you:

    automate the deployment of your clusters!

k8s/cluster-backup.md

280/404

Managing our Kubernetes resources

  • Ideal scenario:

    • never create a resource directly on a cluster

    • push to a code repository

    • a special branch (production or even master) gets automatically deployed

  • Some folks call this "GitOps"

    (it's the logical evolution of configuration management and infrastructure as code)

k8s/cluster-backup.md

281/404

GitOps in theory

  • What do we keep in version control?

  • For very simple scenarios: source code, Dockerfiles, scripts

  • For real applications: add resources (as YAML files)

  • For applications deployed multiple times: Helm, Kustomize...

    (staging and production count as "multiple times")

k8s/cluster-backup.md

282/404

GitOps tooling

  • Various tools exist (Weave Flux, GitKube...)

  • These tools are still very young

  • You still need to write YAML for all your resources

  • There is no tool to:

    • list all resources in a namespace

    • get resource YAML in a canonical form

    • diff YAML descriptions with current state

k8s/cluster-backup.md

283/404

GitOps in practice

  • Start describing your resources with YAML

  • Leverage a tool like Kustomize or Helm

  • Make sure that you can easily deploy to a new namespace

    (or even better: to a new cluster)

  • When tooling matures, you will be ready

k8s/cluster-backup.md

284/404

Plan B

  • What if we can't describe everything with YAML?

  • What if we manually create resources and forget to commit them to source control?

  • What about global resources, that don't live in a namespace?

  • How can we be sure that we saved everything?

k8s/cluster-backup.md

285/404

Backing up etcd

  • All objects are saved in etcd

  • etcd data should be relatively small

    (and therefore, quick and easy to back up)

  • Two options to back up etcd:

    • snapshot the data directory

    • use etcdctl snapshot

k8s/cluster-backup.md

286/404

Making an etcd snapshot

  • The basic command is simple:

    etcdctl snapshot save <filename>
  • But we also need to specify:

    • an environment variable to specify that we want etcdctl v3

    • the address of the server to back up

    • the path to the key, certificate, and CA certificate
      (if our etcd uses TLS certificates)

k8s/cluster-backup.md

287/404

Snapshotting etcd on kubeadm

  • The following command will work on clusters deployed with kubeadm

    (and maybe others)

  • It should be executed on a master node

docker run --rm --net host -v $PWD:/vol \
    -v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd:ro \
    -e ETCDCTL_API=3 k8s.gcr.io/etcd:3.3.10 \
    etcdctl --endpoints=https://[127.0.0.1]:2379 \
            --cacert=/etc/kubernetes/pki/etcd/ca.crt \
            --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
            --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
            snapshot save /vol/snapshot
  • It will create a file named snapshot in the current directory

k8s/cluster-backup.md

288/404

How can we remember all these flags?

  • Older versions of kubeadm did add a healthcheck probe with all these flags

  • That healthcheck probe was calling etcdctl with all the right flags

  • With recent versions of kubeadm, we're on our own!

  • Exercise: write the YAML for a batch job to perform the backup

    (how will you access the key and certificate required to connect?)

k8s/cluster-backup.md

289/404

Restoring an etcd snapshot

  • Execute exactly the same command, but replacing save with restore

    (Believe it or not, doing that will not do anything useful!)

  • The restore command does not load a snapshot into a running etcd server

  • The restore command creates a new data directory from the snapshot

    (it's an offline operation; it doesn't interact with an etcd server)

  • It will create a new data directory in a temporary container

    (leaving the running etcd node untouched)

k8s/cluster-backup.md

290/404

When using kubeadm

  1. Create a new data directory from the snapshot:

    sudo rm -rf /var/lib/etcd
    docker run --rm -v /var/lib:/var/lib -v $PWD:/vol \
    -e ETCDCTL_API=3 k8s.gcr.io/etcd:3.3.10 \
    etcdctl snapshot restore /vol/snapshot --data-dir=/var/lib/etcd
  2. Provision the control plane, using that data directory:

    sudo kubeadm init \
    --ignore-preflight-errors=DirAvailable--var-lib-etcd
  3. Rejoin the other nodes

k8s/cluster-backup.md

291/404

The fine print

  • This only saves etcd state

  • It does not save persistent volumes and local node data

  • Some critical components (like the pod network) might need to be reset

  • As a result, our pods might have to be recreated, too

  • If we have proper liveness checks, this should happen automatically

k8s/cluster-backup.md

292/404

More information about etcd backups

k8s/cluster-backup.md

293/404

Don't forget ...

  • Also back up the TLS information

    (at the very least: CA key and cert; API server key and cert)

  • With clusters provisioned by kubeadm, this is in /etc/kubernetes/pki

  • If you don't:

    • you will still be able to restore etcd state and bring everything back up

    • you will need to redistribute user certificates

TLS information is highly sensitive!
Anyone who has it has full access to your cluster!
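
A simple (but sensitive!) way to save it on a kubeadm cluster:

    sudo tar czf kubernetes-pki-backup.tar.gz /etc/kubernetes/pki

Store that archive with the same care as the keys themselves (encrypted, with restricted access).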

k8s/cluster-backup.md

294/404

Stateful services

  • It's totally fine to keep your production databases outside of Kubernetes

    Especially if you have only one database server!

  • Feel free to put development and staging databases on Kubernetes

    (as long as they don't hold important data)

  • Using Kubernetes for stateful services makes sense if you have many

    (because then you can leverage Kubernetes automation)

k8s/cluster-backup.md

295/404

Snapshotting persistent volumes

k8s/cluster-backup.md

296/404

More backup tools

  • Stash

    back up Kubernetes persistent volumes

  • ReShifter

    cluster state management

  • Heptio Ark Velero

    full cluster backup

  • kube-backup

    simple scripts to save resource YAML to a git repository

  • bivac

    Backup Interface for Volumes Attached to Containers

297/404

:EN:- Backing up clusters :FR:- Politiques de sauvegarde

k8s/cluster-backup.md

Image separating from the next module

298/404

Securing the control plane

(automatically generated title slide)

299/404

Securing the control plane

  • Many components accept connections (and requests) from others:

    • API server

    • etcd

    • kubelet

  • We must secure these connections:

    • to deny unauthorized requests

    • to prevent eavesdropping on secrets, tokens, and other sensitive information

  • Disabling authentication and/or authorization is strongly discouraged

    (but it's possible to do it, e.g. for learning / troubleshooting purposes)

k8s/control-plane-auth.md

300/404

Authentication and authorization

  • Authentication (checking "who you are") is done with mutual TLS

    (both the client and the server need to hold a valid certificate)

  • Authorization (checking "what you can do") is done in different ways

    • the API server implements a sophisticated permission logic (with RBAC)

    • some services will defer authorization to the API server (through webhooks)

    • some services require a certificate signed by a particular CA / sub-CA

k8s/control-plane-auth.md

301/404

In practice

  • We will review the various communication channels in the control plane

  • We will describe how they are secured

  • When TLS certificates are used, we will indicate:

    • which CA signs them

    • what their subject (CN) should be, when applicable

  • We will indicate how to configure security (client- and server-side)

k8s/control-plane-auth.md

302/404

etcd peers

  • Replication and coordination of etcd happens on a dedicated port

    (typically port 2380; the default port for normal client connections is 2379)

  • Authentication uses TLS certificates with a separate sub-CA

    (otherwise, anyone with a Kubernetes client certificate could access etcd!)

  • The etcd command line flags involved are:

    --peer-client-cert-auth=true to activate it

    --peer-cert-file, --peer-key-file, --peer-trusted-ca-file

k8s/control-plane-auth.md

303/404

etcd clients

  • The only¹ thing that connects to etcd is the API server

  • Authentication uses TLS certificates with a separate sub-CA

    (for the same reasons as for etcd inter-peer authentication)

  • The etcd command line flags involved are:

    --client-cert-auth=true to activate it

    --trusted-ca-file, --cert-file, --key-file

  • The API server command line flags involved are:

    --etcd-cafile, --etcd-certfile, --etcd-keyfile

¹Technically, there is also the etcd healthcheck. Let's ignore it for now.

k8s/control-plane-auth.md

304/404

etcd authorization

  • etcd supports RBAC, but Kubernetes doesn't use it by default

    (note: etcd RBAC is completely different from Kubernetes RBAC!)

  • By default, etcd access is "all or nothing"

    (if you have a valid certificate, you get in)

  • Be very careful if you use the same root CA for etcd and other things

    (if etcd trusts the root CA, then anyone with a valid cert gets full etcd access)

  • For more details, check the following resources:

k8s/control-plane-auth.md

305/404

API server clients

  • The API server has a sophisticated authentication and authorization system

  • For connections coming from other components of the control plane:

    • authentication uses certificates (trusting the certificates' subject or CN)

    • authorization uses whatever mechanism is enabled (most oftentimes, RBAC)

  • The relevant API server flags are:

    --client-ca-file, --tls-cert-file, --tls-private-key-file

  • Each component connecting to the API server takes a --kubeconfig flag

    (to specify a kubeconfig file containing the CA cert, client key, and client cert)

  • Yes, that kubeconfig file follows the same format as our ~/.kube/config file!

k8s/control-plane-auth.md

306/404

Kubelet and API server

  • Communication between kubelet and API server can be established both ways

  • Kubelet → API server:

    • kubelet registers itself ("hi, I'm node42, do you have work for me?")

    • connection is kept open and re-established if it breaks

    • that's how the kubelet knows which pods to start/stop

  • API server → kubelet:

    • used to retrieve logs, exec, attach to containers

k8s/control-plane-auth.md

307/404

Kubelet → API server

  • Kubelet is started with --kubeconfig with API server information

  • The client certificate of the kubelet will typically have:

    CN=system:node:<nodename> and the group O=system:nodes

  • Nothing special on the API server side

    (it will authenticate like any other client)
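
On a kubeadm-provisioned node, we can verify this by inspecting the kubelet's client certificate (the path below is the kubeadm default):

    sudo openssl x509 -noout -subject \
        -in /var/lib/kubelet/pki/kubelet-client-current.pem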

k8s/control-plane-auth.md

308/404

API server → kubelet

  • Kubelet is started with the flag --client-ca-file

    (typically using the same CA as the API server)

  • API server will use a dedicated key pair when contacting kubelet

    (specified with --kubelet-client-certificate and --kubelet-client-key)

  • Authorization uses webhooks

    (enabled with --authorization-mode=Webhook on kubelet)

  • The webhook server is the API server itself

    (the kubelet sends back a request to the API server to ask, "can this person do that?")

k8s/control-plane-auth.md

309/404

Scheduler

  • The scheduler connects to the API server like an ordinary client

  • The certificate of the scheduler will have CN=system:kube-scheduler

k8s/control-plane-auth.md

310/404

Controller manager

  • The controller manager is also a normal client to the API server

  • Its certificate will have CN=system:kube-controller-manager

  • If we use the CSR API, the controller manager needs the CA cert and key

    (passed with flags --cluster-signing-cert-file and --cluster-signing-key-file)

  • We usually want the controller manager to generate tokens for service accounts

  • These tokens deserve some details (on the next slide!)

k8s/control-plane-auth.md

311/404

How are these permissions set up?

k8s/control-plane-auth.md

312/404

Service account tokens

  • Each time we create a service account, the controller manager generates a token

  • These tokens are JWT tokens, signed with a particular key

  • These tokens are used for authentication with the API server

    (and therefore, the API server needs to be able to verify their integrity)

  • This uses another keypair:

    • the private key (used for signature) is passed to the controller manager
      (using flags --service-account-private-key-file and --root-ca-file)

    • the public key (used for verification) is passed to the API server
      (using flag --service-account-key-file)
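
We can take a look at such a token through the Secret associated with a ServiceAccount (on the Kubernetes versions used here, these Secrets are generated automatically; the jsonpath below assumes that layout):

    SECRET=$(kubectl get serviceaccount default -o jsonpath={.secrets[0].name})
    kubectl get secret $SECRET -o jsonpath={.data.token} | base64 -d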

k8s/control-plane-auth.md

313/404

kube-proxy

  • kube-proxy is "yet another API server client"

  • In many clusters, it runs as a DaemonSet

  • In that case, it will have its own Service Account and associated permissions

  • It will authenticate using the token of that Service Account

k8s/control-plane-auth.md

314/404

Webhooks

  • We mentioned webhooks earlier; how does that really work?

  • The Kubernetes API has special resource types to check permissions

  • One of them is SubjectAccessReview

  • To check if a particular user can do a particular action on a particular resource:

    • we prepare a SubjectAccessReview object

    • we send that object to the API server

    • the API server responds with allow/deny (and optional explanations)

  • Using webhooks for authorization = sending SAR to authorize each request

k8s/control-plane-auth.md

315/404

Subject Access Review

Here is an example showing how to check if jean.doe can get some pods in kube-system:

kubectl -v9 create -f- <<EOF
apiVersion: authorization.k8s.io/v1beta1
kind: SubjectAccessReview
spec:
  user: jean.doe
  group:
  - foo
  - bar
  resourceAttributes:
    #group: blah.k8s.io
    namespace: kube-system
    resource: pods
    verb: get
    #name: web-xyz1234567-pqr89
EOF
316/404

:EN:- Control plane authentication :FR:- Sécurisation du plan de contrôle

k8s/control-plane-auth.md

Image separating from the next module

317/404

Generating user certificates

(automatically generated title slide)

318/404

Generating user certificates

  • The most popular ways to authenticate users with Kubernetes are:

    • TLS certificates

    • JSON Web Tokens (OIDC or ServiceAccount tokens)

  • We're going to see how to use TLS certificates

  • We will generate a certificate for a user and give them some permissions

  • Then we will use that certificate to access the cluster

k8s/user-cert.md

319/404

Heads up!

  • The demos in this section require that we have access to our cluster's CA

  • This is easy if we are using a cluster deployed with kubeadm

  • Otherwise, we may or may not have access to the cluster's CA

  • We may or may not be able to use the CSR API instead

k8s/user-cert.md

320/404

Check that we have access to the CA

  • Make sure that you are logged on the node hosting the control plane

    (if a cluster has been provisioned for you for a training, it's node1)

  • Check that the CA key is here:
    sudo ls -l /etc/kubernetes/pki

The output should include ca.key and ca.crt.

k8s/user-cert.md

321/404

How it works

  • The API server is configured to accept all certificates signed by a given CA

  • The certificate contains:

    • the user name (in the CN field)

    • the groups the user belongs to (as multiple O fields)

  • Check which CA is used by the Kubernetes API server:
    sudo grep crt /etc/kubernetes/manifests/kube-apiserver.yaml

This is the flag that we're looking for:

--client-ca-file=/etc/kubernetes/pki/ca.crt

k8s/user-cert.md

322/404

Generating a key and CSR for our user

  • These operations could be done on a separate machine

  • We only need to transfer the CSR (Certificate Signing Request) to the CA

    (we never need to expose the private key)

  • Generate a private key:

    openssl genrsa 4096 > user.key
  • Generate a CSR:

    openssl req -new -key user.key -subj /CN=jerome/O=devs/O=ops > user.csr

k8s/user-cert.md

323/404

Generating a signed certificate

  • This has to be done on the machine holding the CA private key

    (copy the user.csr file if needed)

  • Verify the CSR parameters:

    openssl req -in user.csr -text | head
  • Generate the certificate:

    sudo openssl x509 -req \
    -CA /etc/kubernetes/pki/ca.crt -CAkey /etc/kubernetes/pki/ca.key \
    -in user.csr -days 1 -set_serial 1234 > user.crt

If you are using two separate machines, transfer user.crt to the other machine.

k8s/user-cert.md

324/404

Adding the key and certificate to kubeconfig

  • We have to edit our .kube/config file

  • This can be done relatively easily with kubectl config

  • Create a new user entry in our .kube/config file:
    kubectl config set-credentials jerome \
    --client-key=user.key --client-certificate=user.crt

The configuration file now points to our local files.

We could also embed the key and certs with the --embed-certs option.

(So that the kubeconfig file can be used without user.key and user.crt.)

k8s/user-cert.md

325/404

Using the new identity

  • At the moment, we probably use the admin certificate generated by kubeadm

    (with CN=kubernetes-admin and O=system:masters)

  • Let's edit our context to use our new certificate instead!

  • Edit the context:

    kubectl config set-context --current --user=jerome
  • Try any command:

    kubectl get pods

Access will be denied, but we should see that we were correctly authenticated as jerome.

k8s/user-cert.md

326/404

Granting permissions

  • Let's add some read-only permissions to the devs group (for instance)
  • Switch back to our admin identity:

    kubectl config set-context --current --user=kubernetes-admin
  • Grant permissions:

    kubectl create clusterrolebinding devs-can-view \
    --clusterrole=view --group=devs

k8s/user-cert.md

327/404

Testing the new permissions

  • As soon as we create the ClusterRoleBinding, all users in the devs group get access

  • Let's verify that we can e.g. list pods!

  • Switch to our user identity again:

    kubectl config set-context --current --user=jerome
  • Test the permissions:

    kubectl get pods
328/404

:EN:- Authentication with user certificates :FR:- Identification par certificat TLS

k8s/user-cert.md

Image separating from the next module

329/404

The CSR API

(automatically generated title slide)

330/404

The CSR API

  • The Kubernetes API exposes CSR resources

  • We can use these resources to issue TLS certificates

  • First, we will go through a quick reminder about TLS certificates

  • Then, we will see how to obtain a certificate for a user

  • We will use that certificate to authenticate with the cluster

  • Finally, we will grant some privileges to that user

k8s/csr-api.md

331/404

Reminder about TLS

  • TLS (Transport Layer Security) is a protocol providing:

    • encryption (to prevent eavesdropping)

    • authentication (using public key cryptography)

  • When we access an https:// URL, the server authenticates itself

    (it proves its identity to us; as if it were "showing its ID")

  • But we can also have mutual TLS authentication (mTLS)

    (client proves its identity to server; server proves its identity to client)

k8s/csr-api.md

332/404

Authentication with certificates

  • To authenticate, someone (client or server) needs:

    • a private key (that remains known only to them)

    • a public key (that they can distribute)

    • a certificate (associating the public key with an identity)

  • A message encrypted with the private key can only be decrypted with the public key

    (and vice versa)

  • If I use someone's public key to encrypt/decrypt their messages,
    I can be certain that I am talking to them / they are talking to me

  • The certificate proves that I have the correct public key for them

k8s/csr-api.md

333/404

Certificate generation workflow

This is what I do if I want to obtain a certificate.

  1. Create public and private keys.

  2. Create a Certificate Signing Request (CSR).

    (The CSR contains the identity that I claim and a public key.)

  3. Send that CSR to the Certificate Authority (CA).

  4. The CA verifies that I can claim the identity in the CSR.

  5. The CA generates my certificate and gives it to me.

The CA (or anyone else) never needs to know my private key.

k8s/csr-api.md

334/404

The CSR API

  • The Kubernetes API has a CertificateSigningRequest resource type

    (we can list them with e.g. kubectl get csr)

  • We can create a CSR object

    (= upload a CSR to the Kubernetes API)

  • Then, using the Kubernetes API, we can approve/deny the request

  • If we approve the request, the Kubernetes API generates a certificate

  • The certificate gets attached to the CSR object and can be retrieved

k8s/csr-api.md

335/404

Using the CSR API

  • We will show how to use the CSR API to obtain user certificates

  • This will be a rather complex demo

  • ... And yet, we will take a few shortcuts to simplify it

    (but it will illustrate the general idea)

  • The demo also won't be automated

    (we would have to write extra code to make it fully functional)

k8s/csr-api.md

336/404

Warning

  • The CSR API isn't really suited to issue user certificates

  • It is primarily intended to issue control plane certificates

    (for instance, deal with kubelet certificates renewal)

  • The API was expanded a bit in Kubernetes 1.19 to encompass broader usage

  • There are still lots of gaps in the spec

    (e.g. how to specify expiration in a standard way)

  • ... And no other implementation to this date

    (but cert-manager might eventually get there!)

k8s/csr-api.md

337/404

General idea

  • We will create a Namespace named "users"

  • Each user will get a ServiceAccount in that Namespace

  • That ServiceAccount will give read/write access to one CSR object

  • Users will use that ServiceAccount's token to submit a CSR

  • We will approve the CSR (or not)

  • Users can then retrieve their certificate from their CSR object

  • ...And use that certificate for subsequent interactions

k8s/csr-api.md

338/404

Resource naming

For a user named jean.doe, we will have:

  • ServiceAccount jean.doe in Namespace users

  • CertificateSigningRequest user=jean.doe

  • ClusterRole user=jean.doe giving read/write access to that CSR

  • ClusterRoleBinding user=jean.doe binding ClusterRole and ServiceAccount

k8s/csr-api.md

339/404

About resource name constraints

  • Most Kubernetes identifiers and names are fairly restricted

  • They generally are DNS-1123 labels or subdomains (from RFC 1123)

  • A label is lowercase letters, numbers, dashes; can't start or finish with a dash

  • A subdomain is one or multiple labels separated by dots

  • Some resources have more relaxed constraints, and can be "path segment names"

    (uppercase are allowed, as well as some characters like #:?!,_)

  • This includes RBAC objects (like Roles, RoleBindings...) and CSRs

  • See the Identifiers and Names design document and the Object Names and IDs documentation page for more details

k8s/csr-api.md

340/404

Creating the user's resources

If you want to use another name than jean.doe, update the YAML file!

  • Create the global namespace for all users:

    kubectl create namespace users
  • Create the ServiceAccount, ClusterRole, ClusterRoleBinding for jean.doe:

    kubectl apply -f ~/container.training/k8s/user=jean.doe.yaml

k8s/csr-api.md

341/404

Extracting the user's token

  • Let's obtain the user's token and give it to them

    (the token will be their password)

  • List the user's secrets:

    kubectl --namespace=users describe serviceaccount jean.doe
  • Show the user's token:

    kubectl --namespace=users describe secret jean.doe-token-xxxxx

k8s/csr-api.md

342/404

Configure kubectl to use the token

  • Let's create a new context that will use that token to access the API
  • Add a new identity to our kubeconfig file:

    kubectl config set-credentials token:jean.doe --token=...
  • Add a new context using that identity:

    kubectl config set-context jean.doe --user=token:jean.doe --cluster=kubernetes

    (Make sure to adapt the cluster name if yours is different!)

  • Use that context:

    kubectl config use-context jean.doe

k8s/csr-api.md

343/404

Access the API with the token

  • Let's check that our access rights are set properly
  • Try to access any resource:

    kubectl get pods

    (This should tell us "Forbidden")

  • Try to access "our" CertificateSigningRequest:

    kubectl get csr user=jean.doe

    (This should tell us "NotFound")

k8s/csr-api.md

344/404

Create a key and a CSR

  • There are many tools to generate TLS keys and CSRs

  • Let's use OpenSSL; it's not the best one, but it's installed everywhere

    (many people prefer cfssl, easyrsa, or other tools; that's fine too!)

  • Generate the key and certificate signing request:
    openssl req -newkey rsa:2048 -nodes -keyout key.pem \
    -new -subj /CN=jean.doe/O=devs/ -out csr.pem

The command above generates:

  • a 2048-bit RSA key, without encryption, stored in key.pem
  • a CSR for the name jean.doe in group devs

k8s/csr-api.md

345/404

Inside the Kubernetes CSR object

  • The Kubernetes CSR object is a thin wrapper around the CSR PEM file

  • The PEM file needs to be encoded to base64 on a single line

    (we will use base64 -w0 for that purpose)

  • The Kubernetes CSR object also needs to list the right "usages"

    (these are flags indicating how the certificate can be used)

k8s/csr-api.md

346/404

Sending the CSR to Kubernetes

  • Generate and create the CSR resource:
    kubectl apply -f - <<EOF
    apiVersion: certificates.k8s.io/v1beta1
    kind: CertificateSigningRequest
    metadata:
      name: user=jean.doe
    spec:
      request: $(base64 -w0 < csr.pem)
      usages:
      - digital signature
      - key encipherment
      - client auth
    EOF

k8s/csr-api.md

347/404

Adjusting certificate expiration

  • Edit the static pod definition for the controller manager:

    sudo vim /etc/kubernetes/manifests/kube-controller-manager.yaml
  • In the list of flags, add the following line:

    - --experimental-cluster-signing-duration=1h

k8s/csr-api.md

348/404

Verifying and approving the CSR

  • Let's inspect the CSR, and if it is valid, approve it
  • Switch back to cluster-admin:

    kctx -
  • Inspect the CSR:

    kubectl describe csr user=jean.doe
  • Approve it:

    kubectl certificate approve user=jean.doe

k8s/csr-api.md

349/404

Obtaining the certificate

  • Switch back to the user's identity:

    kctx -
  • Retrieve the updated CSR object and extract the certificate:

    kubectl get csr user=jean.doe \
    -o jsonpath={.status.certificate} \
    | base64 -d > cert.pem
  • Inspect the certificate:

    openssl x509 -in cert.pem -text -noout

k8s/csr-api.md

350/404

Using the certificate

  • Add the key and certificate to kubeconfig:

    kubectl config set-credentials cert:jean.doe --embed-certs \
    --client-certificate=cert.pem --client-key=key.pem
  • Update the user's context to use the key and cert to authenticate:

    kubectl config set-context jean.doe --user cert:jean.doe
  • Confirm that we are seen as jean.doe (but don't have permissions):

    kubectl get pods

k8s/csr-api.md

351/404

What's missing?

We have just shown, step by step, a method to issue short-lived certificates for users.

To be usable in real environments, we would need to add:

  • a kubectl helper to automatically generate the CSR and obtain the cert

    (and transparently renew the cert when needed)

  • a Kubernetes controller to automatically validate and approve CSRs

    (checking that the subject and groups are valid)

  • a way for the users to know the groups to add to their CSR

    (e.g.: annotations on their ServiceAccount + read access to the ServiceAccount)

k8s/csr-api.md

352/404

Is this realistic?

  • Larger organizations typically integrate with their own directory

  • The general principle, however, is the same:

    • users have long-term credentials (password, token, ...)

    • they use these credentials to obtain other, short-lived credentials

  • This provides enhanced security:

    • the long-term credentials can use long passphrases, 2FA, HSM...

    • the short-term credentials are more convenient to use

    • we get strong security and convenience

  • Systems like Vault also have certificate issuance mechanisms

353/404

:EN:- Generating user certificates with the CSR API :FR:- Génération de certificats utilisateur avec la CSR API

k8s/csr-api.md

Image separating from the next module

354/404

OpenID Connect

(automatically generated title slide)

355/404

OpenID Connect

  • The Kubernetes API server can perform authentication with OpenID Connect

  • This requires an OpenID provider

    (external authorization server using the OAuth 2.0 protocol)

  • We can use a third-party provider (e.g. Google) or run our own (e.g. Dex)

  • We are going to give an overview of the protocol

  • We will show it in action (in a simplified scenario)

k8s/openid-connect.md

356/404

Workflow overview

  • We want to access our resources (a Kubernetes cluster)

  • We authenticate with the OpenID provider

    • we can do this directly (e.g. by going to https://accounts.google.com)

    • or maybe a kubectl plugin can open a browser page on our behalf

  • After authenticating us, the OpenID provider gives us:

    • an id token (a short-lived signed JSON Web Token, see next slide)

    • a refresh token (to renew the id token when needed)

  • We can now issue requests to the Kubernetes API with the id token

  • The API server will verify that token's content to authenticate us

k8s/openid-connect.md

357/404

JSON Web Tokens

  • A JSON Web Token (JWT) has three parts:

    • a header specifying algorithms and token type

    • a payload (indicating who issued the token, for whom, which purposes...)

    • a signature generated by the issuer (the issuer = the OpenID provider)

  • Anyone can verify a JWT without contacting the issuer

    (except to obtain the issuer's public key)

  • Pro tip: we can inspect a JWT with https://jwt.io/

k8s/openid-connect.md

358/404

How the Kubernetes API uses JWT

  • Server side

    • enable OIDC authentication

    • indicate which issuer (provider) should be allowed

    • indicate which audience (or "client id") should be allowed

    • optionally, map or prefix user and group names

  • Client side

    • obtain JWT as described earlier

    • pass JWT as authentication token

    • renew JWT when needed (using the refresh token)

k8s/openid-connect.md

359/404

Demo time!

  • We will use Google Accounts as our OpenID provider

  • We will use the Google OAuth Playground as the "audience" or "client id"

  • We will obtain a JWT through Google Accounts and the OAuth Playground

  • We will enable OIDC in the Kubernetes API server

  • We will use the JWT to authenticate

If you can't or won't use a Google account, you can try to adapt this to another provider.

k8s/openid-connect.md

360/404

Checking the API server logs

  • The API server logs will be particularly useful in this section

    (they will indicate e.g. why a specific token is rejected)

  • Let's keep an eye on the API server output!

  • Tail the logs of the API server:
    kubectl logs kube-apiserver-node1 --follow --namespace=kube-system

k8s/openid-connect.md

361/404

Authenticate with the OpenID provider

  • We will use the Google OAuth Playground for convenience

  • In a real scenario, we would need our own OAuth client instead of the playground

    (even if we were still using Google as the OpenID provider)

  • Open the Google OAuth Playground:

    https://developers.google.com/oauthplayground/
  • Enter our own custom scope in the text field:

    https://www.googleapis.com/auth/userinfo.email
  • Click on "Authorize APIs" and allow the playground to access our email address

k8s/openid-connect.md

362/404

Obtain our JSON Web Token

  • The previous step gave us an "authorization code"

  • We will use it to obtain tokens

  • Click on "Exchange authorization code for tokens"
  • The JWT is the very long id_token that shows up on the right hand side

    (its first segment is a base64-encoded JSON header, so the token should start with eyJ)

k8s/openid-connect.md

363/404

Using our JSON Web Token

  • We need to create a context (in kubeconfig) for our token

    (if we just add the token or use kubectl --token, our certificate will still be used)

  • Create a new authentication section in kubeconfig:

    kubectl config set-credentials myjwt --token=eyJ...
  • Try to use it:

    kubectl --user=myjwt get nodes

We should get an Unauthorized response, since we haven't enabled OpenID Connect in the API server yet. We should also see invalid bearer token in the API server log output.
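
If we'd rather not pass --user every time, we can also create a dedicated context; this is a sketch assuming a kubeadm-style kubeconfig where the cluster entry is named kubernetes:

    # Pair the existing cluster entry with our JWT credentials
    kubectl config set-context myjwt --cluster=kubernetes --user=myjwt
    # Then use that context explicitly (or make it the default with "use-context")
    kubectl --context=myjwt get nodes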

k8s/openid-connect.md

364/404

Enabling OpenID Connect

  • We need to add a few flags to the API server configuration

  • These two are mandatory:

    --oidc-issuer-url → URL of the OpenID provider

    --oidc-client-id → app requesting the authentication
    (in our case, that's the ID for the Google OAuth Playground)

  • This one is optional:

    --oidc-username-claim → which field should be used as user name
    (we will use the user's email address instead of an opaque ID)

  • See the API server documentation for more details about all available flags

k8s/openid-connect.md

365/404

Updating the API server configuration

  • The instructions below will work for clusters deployed with kubeadm

    (or where the control plane is deployed in static pods)

  • If your cluster is deployed differently, you will need to adapt them

  • Edit /etc/kubernetes/manifests/kube-apiserver.yaml

  • Add the following lines to the list of command-line flags:

    - --oidc-issuer-url=https://accounts.google.com
    - --oidc-client-id=407408718192.apps.googleusercontent.com
    - --oidc-username-claim=email

k8s/openid-connect.md

366/404

Restarting the API server

  • The kubelet monitors the files in /etc/kubernetes/manifests

  • When we save the pod manifest, kubelet will restart the corresponding pod

    (using the updated command line flags)

  • After making the changes described on the previous slide, save the file

  • Issue a simple command (like kubectl version) until the API server is back up

    (it might take between a few seconds and one minute for the API server to restart)

  • Restart the kubectl logs command to view the logs of the API server
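
One possible way to script those last two steps (a sketch; any command hitting the API would do):

    # Poll until the API server answers again, then re-attach to its logs
    until kubectl version >/dev/null 2>&1; do
      echo "Waiting for the API server to come back..."
      sleep 2
    done
    kubectl logs kube-apiserver-node1 --follow --namespace=kube-system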

k8s/openid-connect.md

367/404

Using our JSON Web Token

  • Now that the API server is set up to recognize our token, try again!
  • Try an API command with our token:
    kubectl --user=myjwt get nodes
    kubectl --user=myjwt get pods

We should see a message like:

Error from server (Forbidden): nodes is forbidden: User "jean.doe@gmail.com"
cannot list resource "nodes" in API group "" at the cluster scope

→ We were successfully authenticated, but not authorized.

k8s/openid-connect.md

368/404

Authorizing our user

  • As an extra step, let's grant read access to our user

  • We will use the pre-defined ClusterRole view

  • Create a ClusterRoleBinding allowing us to view resources:

    kubectl create clusterrolebinding i-can-view \
    --user=jean.doe@gmail.com --clusterrole=view

    (make sure to put your Google email address there)

  • Confirm that we can now list pods with our token:

    kubectl --user=myjwt get pods

k8s/openid-connect.md

369/404

From demo to production

This was a very simplified demo! In a real deployment...

  • We wouldn't use the Google OAuth Playground

  • We probably wouldn't even use Google at all

    (it doesn't seem to provide a way to include groups!)

  • Some popular alternatives: self-hosted providers like Dex or Keycloak, or commercial identity providers

  • We would use a helper (like the kubelogin plugin) to automatically obtain tokens

k8s/openid-connect.md

370/404

Service Account tokens

  • The tokens used by Service Accounts are JSON Web Tokens as well

  • They are signed and verified using a special service account key pair

  • Extract the token of a service account in the current namespace:

    kubectl get secrets -o jsonpath={..token} | base64 -d
  • Copy-paste the token to a verification service like https://jwt.io

  • Notice that it says "Invalid Signature"

k8s/openid-connect.md

371/404

Verifying Service Account tokens

  • JSON Web Tokens embed the URL of the "issuer" (=OpenID provider)

  • The issuer provides its public key through a well-known discovery endpoint

    (similar to https://accounts.google.com/.well-known/openid-configuration)

  • There is no such endpoint for the Service Account key pair

  • But we can provide the public key ourselves for verification
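
For comparison, this is how we could peek at a real provider's discovery document (assuming curl and jq are available):

    # Fetch Google's OIDC discovery document and show where its signing keys live
    curl -s https://accounts.google.com/.well-known/openid-configuration | jq .jwks_uri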

k8s/openid-connect.md

372/404

Verifying a Service Account token

  • On clusters provisioned with kubeadm, the Service Account key pair is:

    /etc/kubernetes/pki/sa.key (used by the controller manager to generate tokens)

    /etc/kubernetes/pki/sa.pub (used by the API server to validate the same tokens)

  • Display the public key used to sign Service Account tokens:

    sudo cat /etc/kubernetes/pki/sa.pub
  • Copy-paste the key in the "verify signature" area on https://jwt.io

  • It should now say "Signature Verified"

373/404

:EN:- Authenticating with OIDC :FR:- S'identifier avec OIDC

k8s/openid-connect.md

Image separating from the next module

374/404

Pod Security Policies

(automatically generated title slide)

375/404

Pod Security Policies

  • By default, our pods and containers can do everything

    (including taking over the entire cluster)

  • We are going to show an example of a malicious pod

  • Then we will explain how to avoid this with PodSecurityPolicies

  • We will enable PodSecurityPolicies on our cluster

  • We will create a couple of policies (restricted and permissive)

  • Finally we will see how to use them to improve security on our cluster

k8s/podsecuritypolicy.md

376/404

Setting up a namespace

  • For simplicity, let's work in a separate namespace

  • Let's create a new namespace called "green"

  • Create the "green" namespace:

    kubectl create namespace green
  • Change to that namespace:

    kns green

k8s/podsecuritypolicy.md

377/404

Creating a basic Deployment

  • Just to check that everything works correctly, deploy NGINX
  • Create a Deployment using the official NGINX image:

    kubectl create deployment web --image=nginx
  • Confirm that the Deployment, ReplicaSet, and Pod exist, and that the Pod is running:

    kubectl get all

k8s/podsecuritypolicy.md

378/404

One example of malicious pods

  • We will now show an escalation technique in action

  • We will deploy a DaemonSet that adds our SSH key to the root account

    (on each node of the cluster)

  • The Pods of the DaemonSet will do so by mounting /root from the host

  • Check the file k8s/hacktheplanet.yaml with a text editor:

    vim ~/container.training/k8s/hacktheplanet.yaml
  • If you would like, change the SSH key (by changing the GitHub user name)

k8s/podsecuritypolicy.md

379/404

Deploying the malicious pods

  • Let's deploy our "exploit"!
  • Create the DaemonSet:

    kubectl create -f ~/container.training/k8s/hacktheplanet.yaml
  • Check that the pods are running:

    kubectl get pods
  • Confirm that the SSH key was added to the node's root account:

    sudo cat /root/.ssh/authorized_keys

k8s/podsecuritypolicy.md

380/404

Cleaning up

  • Before setting up our PodSecurityPolicies, clean up that namespace
  • Remove the DaemonSet:

    kubectl delete daemonset hacktheplanet
  • Remove the Deployment:

    kubectl delete deployment web

k8s/podsecuritypolicy.md

381/404

Pod Security Policies in theory

  • To use PSPs, we need to activate their specific admission controller

  • That admission controller will intercept each pod creation attempt

  • It will look at:

    • who/what is creating the pod

    • which PodSecurityPolicies they can use

    • which PodSecurityPolicies can be used by the Pod's ServiceAccount

  • Then it will compare the Pod with each PodSecurityPolicy one by one

  • If a PodSecurityPolicy accepts all the parameters of the Pod, the Pod is created

  • Otherwise, the Pod creation is denied and it won't even show up in kubectl get pods

k8s/podsecuritypolicy.md

382/404

Pod Security Policies fine print

  • With RBAC, using a PSP corresponds to the verb use on the PSP

    (that makes sense, right?)

  • If no PSP is defined, no Pod can be created

    (even by cluster admins)

  • Pods that are already running are not affected

  • If we create a Pod directly, it can use a PSP to which we have access

  • If the Pod is created by e.g. a ReplicaSet or DaemonSet, it's different:

    • the ReplicaSet / DaemonSet controllers don't have access to our policies

    • therefore, we need to give access to the PSP to the Pod's ServiceAccount
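
Granting that use verb is plain RBAC; as a sketch, a ClusterRole for a PSP named restricted could look like this (the actual definitions used later live in k8s/psp-restricted.yaml):

    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: psp:restricted
    rules:
    - apiGroups: [ policy ]
      resources: [ podsecuritypolicies ]
      resourceNames: [ restricted ]
      verbs: [ use ]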

k8s/podsecuritypolicy.md

383/404

Pod Security Policies in practice

  • We are going to enable the PodSecurityPolicy admission controller

  • At that point, we won't be able to create any more pods (!)

  • Then we will create a couple of PodSecurityPolicies

  • ...And associated ClusterRoles (granting the use verb on the policies)

  • Then we will create RoleBindings to grant these roles to ServiceAccounts

  • We will verify that we can't run our "exploit" anymore

k8s/podsecuritypolicy.md

384/404

Enabling Pod Security Policies

  • To enable Pod Security Policies, we need to enable their admission plugin

  • This is done by adding a flag to the API server

  • On clusters deployed with kubeadm, the control plane runs in static pods

  • These pods are defined in YAML files located in /etc/kubernetes/manifests

  • Kubelet watches this directory

  • Each time a file is added/removed there, kubelet creates/deletes the corresponding pod

  • Updating a file causes the pod to be deleted and recreated

k8s/podsecuritypolicy.md

385/404

Updating the API server flags

  • Let's edit the manifest for the API server pod
  • Have a look at the static pods:

    ls -l /etc/kubernetes/manifests
  • Edit the one corresponding to the API server:

    sudo vim /etc/kubernetes/manifests/kube-apiserver.yaml

k8s/podsecuritypolicy.md

386/404

Adding the PSP admission plugin

  • There should already be a line with --enable-admission-plugins=...

  • Let's add PodSecurityPolicy on that line

  • Locate the line with --enable-admission-plugins=

  • Add PodSecurityPolicy

    It should read: --enable-admission-plugins=NodeRestriction,PodSecurityPolicy

  • Save, quit

k8s/podsecuritypolicy.md

387/404

Waiting for the API server to restart

  • The kubelet detects that the file was modified

  • It kills the API server pod, and starts a new one

  • During that time, the API server is unavailable

  • Wait until the API server is available again

k8s/podsecuritypolicy.md

388/404

Check that the admission plugin is active

  • Normally, we can't create any Pod at this point
  • Try to create a Pod directly:
    kubectl run testpsp1 --image=nginx --restart=Never
  • Try to create a Deployment:

    kubectl create deployment testpsp2 --image=nginx
  • Look at existing resources:

    kubectl get all

We can get hints at what's happening by looking at the ReplicaSet and Events.

k8s/podsecuritypolicy.md

389/404

Introducing our Pod Security Policies

  • We will create two policies:

    • privileged (allows everything)

    • restricted (blocks some unsafe mechanisms)

  • For each policy, we also need an associated ClusterRole granting use

k8s/podsecuritypolicy.md

390/404

Creating our Pod Security Policies

  • We have a couple of files, each defining a PSP and associated ClusterRole:

    • k8s/psp-privileged.yaml: policy privileged, role psp:privileged
    • k8s/psp-restricted.yaml: policy restricted, role psp:restricted
  • Create both policies and their associated ClusterRoles:
    kubectl create -f ~/container.training/k8s/psp-restricted.yaml
    kubectl create -f ~/container.training/k8s/psp-privileged.yaml

k8s/podsecuritypolicy.md

391/404

Check that we can create Pods again

  • We haven't bound the policy to any user yet

  • But cluster-admin can implicitly use all policies

  • Check that we can now create a Pod directly:

    kubectl run testpsp3 --image=nginx --restart=Never
  • Create a Deployment as well:

    kubectl create deployment testpsp4 --image=nginx
  • Confirm that the Deployment is not creating any Pods:

    kubectl get all

k8s/podsecuritypolicy.md

392/404

What's going on?

  • We can create Pods directly (thanks to our root-like permissions)

  • The Pods corresponding to a Deployment are created by the ReplicaSet controller

  • The ReplicaSet controller does not have root-like permissions

  • We need to either:

    • grant permissions to the ReplicaSet controller

    or

    • grant permissions to our Pods' ServiceAccount
  • The first option would allow anyone to create pods

  • The second option will allow us to scope the permissions better

k8s/podsecuritypolicy.md

393/404

Binding the restricted policy

  • Let's bind the role psp:restricted to ServiceAccount green:default

    (aka the default ServiceAccount in the green Namespace)

  • This will allow Pod creation in the green Namespace

    (because these Pods will be using that ServiceAccount automatically)

  • Create the following RoleBinding:
    kubectl create rolebinding psp:restricted \
    --clusterrole=psp:restricted \
    --serviceaccount=green:default

k8s/podsecuritypolicy.md

394/404

Trying it out

  • The Deployments that we created earlier will eventually recover

    (the ReplicaSet controller periodically retries creating those Pods)

  • If we create a new Deployment now, it should work immediately

  • Create a simple Deployment:

    kubectl create deployment testpsp5 --image=nginx
  • Look at the Pods that have been created:

    kubectl get all

k8s/podsecuritypolicy.md

395/404

Trying to hack the cluster

  • Let's create the same DaemonSet we used earlier
  • Create a hostile DaemonSet:

    kubectl create -f ~/container.training/k8s/hacktheplanet.yaml
  • Look at the state of the namespace:

    kubectl get all

k8s/podsecuritypolicy.md

396/404

What's in our restricted policy?

  • The restricted PSP is similar to the one provided in the docs, but:

    • it allows containers to run as root

    • it doesn't drop capabilities

  • Many containers run as root by default, and would require additional tweaks to run as non-root

  • Many containers use e.g. chown, which requires a specific capability

    (that's the case for the NGINX official image, for instance)

  • We still block: hostPath, privileged containers, and much more!
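
As an illustration only (the real policy is in k8s/psp-restricted.yaml), the key parts of such a policy could look like this:

    apiVersion: policy/v1beta1
    kind: PodSecurityPolicy
    metadata:
      name: restricted
    spec:
      privileged: false          # no privileged containers
      hostNetwork: false         # no host namespaces
      hostPID: false
      hostIPC: false
      runAsUser:
        rule: RunAsAny           # running as root is still allowed
      seLinux:
        rule: RunAsAny
      supplementalGroups:
        rule: RunAsAny
      fsGroup:
        rule: RunAsAny
      volumes:                   # hostPath is deliberately absent from this list
      - configMap
      - secret
      - emptyDir
      - downwardAPI
      - projected
      - persistentVolumeClaim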

k8s/podsecuritypolicy.md

397/404

The case of static pods

  • If we list the pods in the kube-system namespace, kube-apiserver is missing

  • However, the API server is obviously running

    (otherwise, kubectl get pods --namespace=kube-system wouldn't work)

  • The API server Pod is created directly by kubelet

    (without going through the PSP admission plugin)

  • Then, kubelet creates a "mirror pod" representing that Pod in etcd

  • That "mirror pod" creation goes through the PSP admission plugin

  • And it gets blocked!

  • This can be fixed by binding psp:privileged to group system:nodes
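
For instance, that fix could look like this (a sketch; the binding name is arbitrary):

    kubectl create clusterrolebinding psp:privileged:nodes \
        --clusterrole=psp:privileged --group=system:nodes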

k8s/podsecuritypolicy.md

398/404

Before moving on...

  • Our cluster is currently broken

    (we can't create pods in namespaces kube-system, default, ...)

  • We need to either:

    • disable the PSP admission plugin

    • allow use of PSP to relevant users and groups

  • For instance, we could:

    • bind psp:restricted to the group system:authenticated

    • bind psp:privileged to the ServiceAccount kube-system:default
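
If we went down that second route, the bindings could look like this (a sketch; binding names are arbitrary):

    kubectl create clusterrolebinding psp:restricted:authenticated \
        --clusterrole=psp:restricted --group=system:authenticated
    kubectl create rolebinding psp:privileged:kube-system \
        --namespace=kube-system \
        --clusterrole=psp:privileged --serviceaccount=kube-system:default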

k8s/podsecuritypolicy.md

399/404

Fixing the cluster

  • Let's disable the PSP admission plugin
  • Edit the Kubernetes API server static pod manifest

  • Remove the PSP admission plugin

  • This can be done with this one-liner:

    sudo sed -i s/,PodSecurityPolicy// /etc/kubernetes/manifests/kube-apiserver.yaml

400/404

:EN:- Preventing privilege escalation with Pod Security Policies :FR:- Limiter les droits des conteneurs avec les Pod Security Policies

k8s/podsecuritypolicy.md

That's all, folks!
Questions?

end

shared/thankyou.md

401/404

Image separating from the next module

402/404

(Extra content)

(automatically generated title slide)

403/404

(Extra content)

  • k8s/apiserver-deepdive.md
  • k8s/setup-overview.md
  • k8s/setup-devel.md
  • k8s/setup-managed.md
  • k8s/setup-selfhosted.md

5.yml

404/404
