Entering Podman containers

Published: Dec 3 2022

Updated: Dec 3 2022

Some commands for interacting with the namespaces of Podman containers.

Table of contents

Features
- Builtin podman features
- Using nsenter with podman containers
Applications
Security considerations
- Attack 1: attacking a hybrid process executing programs from the container filesystem
- Attack 2: debugging the hybrid process from the container
  - Case 1: container process and hybrid process running as root in the container
  - Case 2: container process and hybrid process running as non-root in the container
References

Features

Builtin podman features

Let us create a podman container:

podman run -it --rm \
  --name foo docker.io/library/python \
  python -m http.server --bind 127.0.0.1 --directory /etc 8000

We can execute some command in the container with:

podman exec foo ls /
podman exec -it foo python3 # Get an interactive Python shell

We can mount the container filesystem using podman mount. As a regular (non-root) user, we cannot mount in the host filesystem. However, we can create a new user namespace and a new mount namespace: this makes us root in the new user namespace and allows us to mount in the associated (new) mount namespace. We can achieve this using the podman unshare command:

podman unshare sh -c 'cd "$(podman mount foo)" ; pwd ; ls'

Note

This mount is not visible from the host filesystem/namespace. It is only visible in the process subtree below podman unshare.

Using nsenter with podman containers

We can use the nsenter command to launch some process attached to the namespaces of the container (either all of the namespaces of the container or a subset thereof):

# Get PID of a process in the container:
pid="$(podman inspect foo -f '{{.State.Pid}}')"

# Launch a process by attaching to these namespaces:
nsenter -t "$pid" -U -m -n -p -C -i /bin/bash

Warning: time-of-check to time-of-use vulnerability

There is a race condition introduced by the fact that we are accessing the namespace via PID: if the container main process dies, its PID might be recycled for a new process. This might happen by the time we are opening its namespace handles (by opening /proc/$pid/ns/{cgroup,ipc,mnt,net,pid,time,user,uts}). In this case, we might attach to the wrong namespaces.

As a result, this pattern may be exposed to a time-of-check to time-of-use (TOCTOU) vulnerability: if the container main process dies for some reason, a malicious user on the host system may try to exploit PID recycling in order to trick the nsenter process into joining the wrong namespaces.

As long as nsenter is executing as non-root, it would not be able to join namespaces from another user. In this case, the only attack would be to attempt to make nsenter join another (more recent) container from the same user. This does not seem to be easily exploitable.

I suspect that several container-related codebases have TOCTOU vulnerabilities related to PID recycling anyway. Podman seems to have some (e.g. in podman run and podman mount). These vulnerabilities are probably quite theoretical, however.
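One way to narrow this window, sketched below under stated assumptions, is to open the namespace handles first and only then ask Podman again whether the container is still the one we looked up. If the container main process is still running with the same PID and start time after the handles were opened, the handles should belong to it. The function name enter_pinned is hypothetical, the {{.State.StartedAt}} field is assumed to be available in your Podman version, and only the user and network namespaces are pinned here.

```shell
# enter_pinned CONTAINER COMMAND...
# Hypothetical helper: join the user and network namespaces of CONTAINER
# through file descriptors opened *before* re-checking the container state.
# The body runs in a subshell so that descriptors 3 and 4 do not leak.
enter_pinned() (
  container="$1"
  shift
  fmt='{{.State.Pid}} {{.State.StartedAt}}'

  # First lookup: PID and start time of the container main process.
  before="$(podman inspect "$container" -f "$fmt")" || exit 1
  pid="${before%% *}"
  [ "${pid:-0}" -gt 0 ] || exit 1   # PID 0: the container is not running

  # Pin the namespaces: these descriptors keep referring to the same
  # namespaces even if the PID is recycled afterwards.
  exec 3<"/proc/$pid/ns/user" 4<"/proc/$pid/ns/net" || exit 1

  # Second lookup: if nothing changed, the PID was not recycled in between.
  after="$(podman inspect "$container" -f "$fmt")" || exit 1
  [ "$before" = "$after" ] || exit 1

  # Join the pinned namespaces through the inherited descriptors.
  exec nsenter --user=/proc/self/fd/3 --net=/proc/self/fd/4 "$@"
)

# Usage (commented out so that the snippet can be sourced safely):
# enter_pinned foo ip addr
```

This does not remove similar races inside Podman itself; it only avoids resolving the PID a second time when joining the namespaces.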

Using this nsenter command, we can choose which container namespaces we want to enter:

-U for the user namespace (this is required when running as non-root);
-m for the mount namespace;
-n for the network namespace;
-p for the PID namespace;
-C for the cgroup namespace;
-i for the IPC namespace.

For example, we can spawn a process which is using both:

the host filesystem namespace (so we can execute the program from the host filesystem);
and the container network namespace (so we can communicate to localhost-bound container sockets).
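As a small illustration of such a hybrid process, the sketch below runs a host program in the container user and network namespaces only, so the program itself comes from the host filesystem. The helper name in_container_net is hypothetical, and the usage line assumes that curl is installed on the host and that the foo container created above is still serving on 127.0.0.1:8000. The PID race described in the warning above applies here as well.

```shell
# in_container_net CONTAINER COMMAND...
# Hypothetical helper: run a host program in the user and network
# namespaces of CONTAINER, while keeping the host mount namespace.
in_container_net() {
  pid="$(podman inspect "$1" -f '{{.State.Pid}}')" || return 1
  shift
  nsenter -t "$pid" -U -n "$@"
}

# Usage (commented out so that the snippet can be sourced safely):
# in_container_net foo curl -s http://127.0.0.1:8000/
```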

Applications

Exposing a container localhost-bound socket through a host Unix socket

This can for example be used to expose a localhost-bound TCP socket of the container through a path-based Unix socket of the host by executing socat from the host filesystem:

# Expose the container localhost-bound socket as a host Unix socket:
nsenter -t "$pid" -U -n socat UNIX-LISTEN:./proxy.sock,fork TCP:127.0.0.1:8000

# Test from host:
curl --unix-socket ./proxy.sock http://example/

Note: motivations

we cannot --publish a localhost-bound service;
podman does not let us publish a service through a Unix socket.

An alternative solution without nsenter would be:

podman run --rm \
  --network=container:foo \
  -v $(pwd):/mnt \
  docker.io/library/debian \
  sh -c 'apt update && apt install -y socat && socat UNIX-LISTEN:/mnt/proxy.sock,fork TCP:127.0.0.1:8000'

Alternatively, this hack could be used in order to execute programs of the host filesystem:

podman run --rm -it \
  --network=container:foo \
  -v /:/mnt \
  docker.io/library/alpine \
  chroot /mnt socat "UNIX-LISTEN:$(pwd)/proxy.sock,fork" TCP:127.0.0.1:8000

Exposing a container Unix socket through a host Unix socket

Forwarding a container Unix socket through a host Unix socket can be achieved using podman mount in order to access both filesystems at the same time:

podman unshare sh -c '
  dir="$(podman mount foo)"
  # Workaround for Unix socket address max length:
  mount --bind "$dir" /mnt/
  socat UNIX-LISTEN:/run/user/1000/test.sock,fork UNIX:/mnt/run/test.sock
'
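The workaround mentioned in the comment is needed because the address of a path-based Unix socket has to fit in the sun_path field of struct sockaddr_un, which is 108 bytes on Linux, while the path returned by podman mount can be long. A minimal sketch of a length check follows. The helper name fits_sun_path is hypothetical, and it counts characters, which equals bytes only for ASCII paths.

```shell
# Size of sun_path in struct sockaddr_un on Linux (see unix(7)).
SUN_PATH_MAX=108

# fits_sun_path PATH
# Hypothetical helper: succeed if PATH plus a terminating null byte
# fits in sun_path. ${#1} counts characters, not bytes.
fits_sun_path() {
  [ "${#1}" -lt "$SUN_PATH_MAX" ]
}

# Usage (commented out): warn before handing a long path to socat.
# fits_sun_path "$dir/run/test.sock" || echo "path too long, bind-mount it" >&2
```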

Exposing a host Unix socket through a container Unix socket

We can do it the other way around:

podman unshare sh -c '
  dir="$(podman mount foo)"
  # Workaround for Unix socket address max length:
  mount --bind "$dir" /mnt/
  socat UNIX-LISTEN:/mnt/run/test.sock,fork UNIX:/run/user/1000/test.sock
'

Forwarding Unix socket between two containers

We can forward Unix socket communications between two containers:

podman unshare sh -c '
  foo="$(podman mount foo)"
  bar="$(podman mount bar)"
  mount -t tmpfs -o size=1M tmpfs /mnt/
  mkdir /mnt/foo
  mkdir /mnt/bar
  mount --bind "$foo/run" /mnt/foo
  mount --bind "$bar/run" /mnt/bar
  socat UNIX-LISTEN:/mnt/bar/test.sock,fork UNIX:/mnt/foo/test.sock
'

Note

Directly bind-mounting the socket from one container to the other (mount --bind "$foo/run/test.sock" "$bar/run/test.sock") does not appear to work. The process in one container is not allowed to connect to the socket from the other container for some reason.

Security considerations

Question: is it safe to use such a “hybrid” process (a process which is attached to some namespaces of the container and some namespaces of the host)? Could this be used for privilege escalation? Could a compromised process in the container try to exploit the hybrid process in order to get access to resources in the host system?

Summary:

Joining the container PID namespace makes your process visible inside the container.
- It can be killed by a container process (running as the same user as seen from outside of the container).
- It could possibly be debugged by a container process which could then try to access other resources outside of the container through the hybrid process. This currently does not work in my case (see below for details).
If the hybrid process executes code from the container filesystem, a container process with the right access could modify this code in order to hijack the hybrid process.

Attack 1: attacking a hybrid process executing programs from the container filesystem

If the hybrid process is running code from the container filesystem, a compromised process in the container could hijack the hybrid process by modifying this code on the container filesystem.

Attack 2: debugging the hybrid process from the container

Could a compromised process in the container try to debug (e.g. using gdb, ptrace, /proc/$pid/mem, process_vm_readv/process_vm_writev, pidfd_getfd, etc.) the hybrid process?

This assumes that the hybrid process has joined the container PID namespace (-p) in order to be able to be seen from the container processes.
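Whether the hybrid process actually shares a given namespace with the container can be checked by comparing the namespace links under /proc, as described in namespaces(7). The sketch below makes some assumptions: the helper name same_ns is hypothetical, and it has to run from a process that is allowed to read both /proc entries.

```shell
# same_ns TYPE PID1 PID2
# Hypothetical helper: succeed if both processes are in the same namespace
# of the given TYPE (pid, net, mnt, user, ...). The links look like
# "pid:[4026531836]", so equal link targets mean the same namespace.
same_ns() {
  a="$(readlink "/proc/$2/ns/$1")" || return 1
  b="$(readlink "/proc/$3/ns/$1")" || return 1
  [ "$a" = "$b" ]
}

# Usage (commented out), with $hybrid_pid and $pid being the host PIDs of
# the hybrid process and of the container main process:
# same_ns pid "$hybrid_pid" "$pid" && echo "visible from the container"
```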

Case 1: container process and hybrid process running as root in the container

If the container process is running as root in the container, the container process could manage to debug the hybrid process. In practice, my attempts failed with “ptrace: Operation not permitted.”.

The reason appears to be the following condition from ptrace documentation:

Deny access if neither of the following is true:

The caller and the target process are in the same user namespace, and the caller's capabilities are a superset of the target process's permitted capabilities.

The caller has the CAP_SYS_PTRACE capability in the target process's user namespace.

In my case, the capabilities (which can be seen using cat /proc/$pid/status | grep -i ^Cap) were:

0x00000000800405fb in the container;
0x000001ffffffffff for the hybrid process (either from nsenter or podman unshare).

The first condition does not hold because the caller's (container) capabilities are more restricted than the target's (hybrid process) ones. The second condition does not hold either because CAP_SYS_PTRACE is not set in the container. However, if Podman were to include CAP_SYS_PTRACE in the container capabilities, it would probably be possible for a container process to debug the hybrid process.
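The capability part of these two conditions can be checked by hand on the masks above with shell arithmetic. This is only a sketch: the helper names are hypothetical, CAP_SYS_PTRACE is taken to be capability number 19 as listed in capabilities(7), and the user namespace part of each condition is not modelled.

```shell
# Capability number of CAP_SYS_PTRACE (see capabilities(7)).
CAP_SYS_PTRACE=19

# has_cap MASK NUMBER: succeed if capability NUMBER is set in the hex MASK.
has_cap() {
  [ "$(( ($1 >> $2) & 1 ))" -eq 1 ]
}

# is_superset CALLER TARGET: succeed if every capability of TARGET
# is also present in CALLER.
is_superset() {
  [ "$(( $1 & $2 ))" -eq "$(( $2 ))" ]
}

# Masks reported above for the container and for the hybrid process.
container=0x00000000800405fb
hybrid=0x000001ffffffffff

is_superset "$container" "$hybrid" || echo "condition 1 fails: not a superset"
has_cap "$container" "$CAP_SYS_PTRACE" || echo "condition 2 fails: no CAP_SYS_PTRACE"
```

With the masks above, both messages are printed, which matches the “Operation not permitted” result.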

Case 2: container process and hybrid process running as non-root in the container

If the processes are running as the same (non-root) user in the container, the capability set is usually empty for both processes and the container process is able to debug the hybrid process.

References

Podman
bubblewrap, unprivileged sandboxing tool
unshare(1) — Linux manual page
nsenter(1) — Linux manual page
namespaces(7) — Linux manual page
lsns(8) — Linux manual page