cybercyber - Using rsync to copy files to and from a kubernetes pod

Using rsync to copy files to and from a kubernetes pod

So you are running some application in a pod on kubernetes. The data is stored in a volume mounted into the pod. So far so easy.

Now you want to do backups. If your storage solution (and application!) supports ReadWriteMany volumes, it’s easy: you can start a Job that also mounts the volume and copies the data. But what if you are using ReadWriteOnce (“RWO”) volumes, which can only be mounted by a single node at a time?

I have tried two ways of getting to files in a Pod from a job:

running kubectl cp in the job
running kubectl exec ... -- tar -C /volume -cp . in the job

Since copying a large number of files (or fewer files from a slow storage solution) takes a long time, both ways run into timeouts. Neither can pick up where it stopped; both need to copy all files from the beginning.

That sounds like a problem I have heard of before

Yes! rsync was designed to solve this exact problem: Copying files with the ability to restart and to pick up changes over time. There is a reason multiple solutions exist for backing up using rsync…

First, we will need to get an rsync “server” into our pod; I have created this Dockerfile that just contains the rsync binary. We add it as a sidecar to our application Pod like so:

containers:
  ...
  - name: rsync
    image: toelke158/docker-rsync
    volumeMounts:
      - name: home
        mountPath: /data

To reach this rsync, we can use kubectl exec; we just have to tell rsync to use it instead of ssh or the native rsync protocol. I have developed¹ a flexible shell-script that does exactly that. It boils down to doing kubectl exec -i <pod> -- "$@" (where $@ means “all remaining arguments given to the script”). rsync will call the script like <script> <pod> rsync --server ..., and kubectl will thus start exactly that command in the pod.
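To make the mechanism concrete, here is a minimal sketch of such a wrapper. It is not the script linked above, just an illustration under a few assumptions: the file name krsh.sh, the KUBECTL variable and the krsh function are made up for this example, and the pod is addressed by its plain name only. It relies on the fact that rsync, when given a custom remote shell, passes the host name (here: the pod) as the first argument, followed by the command to run on the remote side.

```shell
#!/bin/sh
# krsh.sh: hypothetical remote-shell wrapper for rsync (illustrative sketch only).
# rsync runs the --rsh program as: <script> <host> rsync --server ...
# so $1 is the pod name and everything after it is the remote command.

# Assumption: kubectl is on PATH; the variable makes it overridable for dry runs.
KUBECTL="${KUBECTL:-kubectl}"

krsh() {
    pod="$1"
    shift
    # -i keeps stdin open so the local rsync can talk to the remote one
    # through the pipe that kubectl exec provides.
    "$KUBECTL" exec -i "$pod" -- "$@"
}

# Only act when called with arguments, as rsync does.
if [ "$#" -gt 0 ]; then
    krsh "$@"
fi
```

Assuming the sketch is saved as an executable ./krsh.sh, a call could look like rsync -av --blocking-io --rsh=./krsh.sh mypod:/data/ ./backup/ (mypod and ./backup/ are placeholders). --blocking-io tells rsync to use blocking I/O when talking to the remote shell; whether it is needed may depend on your setup. When the same command is run again after an interruption, files that already arrived are skipped.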

If you read the script, you will see that it supports specifying the rsync “server” in multiple ways:

pod if the pod is in the default namespace and the first container is the rsync container.
pod.container to specify the container to access.
kind#pod or kind#pod.container to address the Pod through another kind of resource, e.g. a Deployment; this is useful when the name of the Pod is not stable.
pod@namespace, kind#pod@namespace or kind#pod.container@namespace if you need to specify the namespace
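The notation above can be taken apart with plain POSIX parameter expansion. The following is a sketch of one way to do it, not the code from the actual script: parse_target and kubectl_args are hypothetical helpers, and they work on the complete string. In a real wrapper the part after the @ would probably arrive separately, because rsync splits user@host itself and passes the "user" via -l.

```shell
#!/bin/sh
# Sketch: split "kind#pod.container@namespace" into its parts.
# parse_target and kubectl_args are hypothetical names for this example.

parse_target() {
    spec="$1"
    namespace="" kind="" container=""
    case "$spec" in
        *@*) namespace=${spec##*@}; spec=${spec%@*} ;;        # text after the last @
    esac
    case "$spec" in
        *'#'*) kind=${spec%%\#*}; spec=${spec#*\#} ;;         # text before the first #
    esac
    case "$spec" in
        *.*) container=${spec#*.}; spec=${spec%%.*} ;;        # text after the first .
    esac
    pod="$spec"
}

# Print the kubectl arguments that correspond to a target specification.
kubectl_args() {
    parse_target "$1"
    args="exec -i"
    if [ -n "$namespace" ]; then args="-n $namespace $args"; fi
    if [ -n "$kind" ]; then args="$args $kind/$pod"; else args="$args $pod"; fi
    if [ -n "$container" ]; then args="$args -c $container"; fi
    echo "$args"
}
```

For instance, kubectl_args 'deployment#app.rsync@backup' yields arguments that let kubectl pick a Pod of the Deployment app in the namespace backup and run the command in its rsync container (all three names are placeholders).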

In the README of this GitHub repository I show a complete example of getting the data and then using restic to actually store the backup.

¹ And by “developed” I mean, of course, blatantly copied from StackOverflow and adapted…