Any test that fails occasionally is "flaky". Since our merges only proceed when all tests are green, and we have a number of different CI systems running the tests in various combinations, even a small percentage of flakes results in a lot of pain for people waiting for their PRs to merge.
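As an illustration (the numbers here are made up): if a merge requires 30 independent test jobs to pass and each job has just a 1% chance of flaking, then roughly 1 - 0.99^30, or about 26%, of merge attempts will be blocked by a flake rather than by a real problem.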
Therefore, it's very important that we write tests defensively. Situations that "almost never happen" happen with some regularity when run thousands of times in resource-constrained environments. Since flakes can often be quite hard to reproduce while still being common enough to block merges occasionally, it's additionally important that the test logs be useful for narrowing down exactly what caused the failure.
Note that flakes can occur in unit tests, integration tests, or end-to-end tests, but probably occur most commonly in end-to-end tests.
Because flakes may be rare and hard to reproduce, it's very important that all relevant logs be discoverable from the issue that is filed for the flake.
Note that we won't randomly assign these issues to you unless you've opted in or you're part of a group that has opted in. We are more than happy to accept help from anyone in fixing these, but due to the severity of the problem when merges are blocked, we need reasonably quick turn-around time on test flakes. Therefore we have the following guidelines:
Try the stress command.

First install it:

$ go install golang.org/x/tools/cmd/stress

Then build your test binary:

$ go test -c -race

Then run it under stress:

$ stress ./package.test -test.run=FlakyTest

It runs the command and writes output to /tmp/gostress-* files when it fails. It periodically reports with run counts. Be careful with tests that use the net/http/httptest package; they could exhaust the available ports on your system!
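As a concrete sketch, with a hypothetical package path and test name, the whole loop might look like this (with go test -c the binary is named after the package directory, here tools.test):

$ cd pkg/tools
$ go test -c -race
$ stress ./tools.test -test.run=TestFlaky -test.v

Passing -test.v makes the test binary log verbosely, which can make the /tmp/gostress-* output captured from a failing run easier to interpret.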
Sometimes unit tests are flaky. This means that due to (usually) race conditions, they will occasionally fail, even though most of the time they pass.
We have a goal of 99.9% flake-free tests, i.e. a test should fail at most once in one thousand runs.
Running a test 1000 times on your own machine can be tedious and time consuming. Fortunately, there is a better way to achieve this using Kubernetes.
Note: these instructions are mildly hacky for now; they will improve as we get run-once semantics and better logging.
There is a testing image brendanburns/flake on Docker Hub. We will use this image to test our fix.
Create a replication controller with the following config:
apiVersion: v1
kind: ReplicationController
metadata:
  name: flakecontroller
spec:
  replicas: 24
  template:
    metadata:
      labels:
        name: flake
    spec:
      containers:
      - name: flake
        image: brendanburns/flake
        env:
        - name: TEST_PACKAGE
          value: pkg/tools
        - name: REPO_SPEC
          value: https://github.com/kubernetes/kubernetes
Note that we omit the labels and the selector fields of the replication controller, because they will be populated from the labels field of the pod template by default.
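For reference, the defaulted selector is equivalent to writing this explicitly under the controller's spec (shown only to illustrate the defaulting; you do not need to add it):

  selector:
    name: flake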
kubectl create -f ./controller.yaml
This will spin up 24 instances of the test. They will run to completion, then exit, and the kubelet will restart them, accumulating more and more runs of the test.
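To confirm the pods are running and cycling, you can list them by the label from the pod template (assuming the config above, where the label is name: flake):

kubectl get pods -l name=flake

The restart count for each pod gives a rough idea of how many runs it has completed, whether those runs passed or failed.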
You can examine the recent runs of the test by calling docker ps -a and looking for tasks that exited with non-zero exit codes. Unfortunately, docker ps -a only keeps around the exit status of the last 15-20 containers with the same image, so you have to check them frequently.
You can use this script to automate checking for failures, assuming your cluster is running on GCE and has four nodes:
echo "" > output.txt
for i in {1..4}; do
echo "Checking kubernetes-node-${i}"
echo "kubernetes-node-${i}:" >> output.txt
gcloud compute ssh "kubernetes-node-${i}" --command="sudo docker ps -a" >> output.txt
done
grep "Exited ([^0])" output.txt
Eventually you will have sufficient runs for your purposes. At that point you can delete the replication controller by running:
kubectl delete replicationcontroller flakecontroller
If you do a final check for flakes with docker ps -a, ignore tasks that exited -1, since that's what happens when you stop the replication controller.
Happy flake hunting!