Entering in Podman containers
Some commands for interacting with the namespaces of Podman containers.
Features
Builtin podman features
Let us create a podman container:
podman run -it --rm \
--name foo docker.io/library/python \
python -m http.server --bind 127.0.0.1 --directory /etc 8000
We can execute some command in the container with:
podman exec foo ls /
podman exec -it foo python3 # Get an interactive Python shell
We can mount the container filesystem using podman mount.
As a regular (non-root) user, we cannot mount in the host filesystem.
However, we can create a new user namespace
and a new mount namespace:
this makes us root in the new user namespace
and allows us to mount in the associated (new) mount namespace.
We can achieve this using the podman unshare command:
podman unshare sh -c 'cd "$(podman mount foo)" ; pwd ; ls'
Note
This mount is not visible from the host filesystem/namespace.
It is only visible in the process subtree below podman unshare.
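The mount/unshare pattern above can be wrapped in a small helper which also releases the mount afterwards. This is only a sketch: the function name with_container_rootfs is hypothetical, and it assumes that podman umount is the right way to release a mount obtained through podman mount in this setup.

```shell
# Hypothetical helper: run a command from the root of the container
# filesystem (inside "podman unshare"), then release the mount.
with_container_rootfs() {
    container="$1"
    shift
    podman unshare sh -c '
        container="$1"
        shift
        dir="$(podman mount "$container")" || exit 1
        # Run the command in a subshell so that "cd" does not leak.
        (cd "$dir" && "$@")
        status=$?
        podman umount "$container" >/dev/null
        exit "$status"
    ' sh "$container" "$@"
}
```

For example, with_container_rootfs foo ls etc would list the /etc directory of the container from the host.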
Using nsenter with podman containers
We can use the nsenter command
to launch some process attached to the namespaces
of the container
(either all of the namespaces of the container or a subset thereof):
# Get PID of a process in the container:
pid="$(podman inspect foo -f '{{.State.Pid}}')"
# Launch a process by attaching to these namespaces:
nsenter -t "$pid" -U -m -n -p -C -i /bin/bash
Warning: time-of-check to time-of-use vulnerability
There is a race condition introduced by the fact that we are
accessing the namespace via PID:
if the container main process dies, its PID might be recycled
for a new process.
This might happen by the time we are opening its namespace handles
(by opening /proc/$pid/ns/{cgroup,ipc,mnt,net,pid,time,user,uts}).
In this case, we might attach to the wrong namespaces.
As a result, this pattern may be subject
to a time-of-check to time-of-use (TOCTOU) vulnerability:
if the container main process dies for some reason,
a malicious user on the host system may try
to exploit PID recycling in order to trick the nsenter process
into joining the wrong namespaces.
As long as nsenter is executing as non-root,
it would not be able to join namespaces from another user.
In this case, the only attack would be to attempt to make nsenter join another
(more recent) container from the same user.
This does not seem to be easily exploitable.
I suspect several container-related codebases
have similar TOCTOU vulnerabilities related to PID recycling anyway.
Podman seems to have some such vulnerabilities
(e.g. in podman run, podman mount).
These vulnerabilities are probably quite theoretical, however.
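One possible way to narrow this race is to open the namespace handles first and only then check that the container still reports the same main process. The following is a minimal sketch, not a vetted fix: the function name nsenter_pinned_net is hypothetical, the file descriptor numbers 8 and 9 are arbitrary, and it assumes that podman inspect exposes .State.StartedAt and that nsenter accepts namespace files through --user=FILE and --net=FILE.

```shell
# Hypothetical helper: run a host command in the user and network
# namespaces of a container, pinning the namespaces before trusting the PID.
nsenter_pinned_net() (
    container="$1"
    shift
    fmt='{{.State.Pid}} {{.State.StartedAt}}'
    before="$(podman inspect "$container" -f "$fmt")" || exit 1
    pid="${before%% *}"
    # 1. Open the namespace handles first: what they refer to can no
    #    longer change, even if the PID is recycled afterwards.
    exec 8<"/proc/$pid/ns/user" 9<"/proc/$pid/ns/net" || exit 1
    # 2. Check that the container still reports the same PID and start time.
    after="$(podman inspect "$container" -f "$fmt")" || exit 1
    if [ "$before" != "$after" ]; then
        echo "container $container changed while attaching" >&2
        exit 1
    fi
    # 3. Join the pinned namespaces through the inherited file descriptors.
    nsenter --user=/proc/self/fd/8 --net=/proc/self/fd/9 "$@"
)
```

If the main process was still the container's main process at step 2, its PID cannot have been recycled before step 1, so the handles opened at step 1 belong to the container. The function body runs in a subshell, so the two file descriptors do not leak into the calling shell.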
Using this nsenter command, we can choose which container namespaces we want to enter:
- -U for the user namespace (this is required when running as non-root);
- -m for the mount namespace;
- -n for the network namespace;
- -p for the PID namespace;
- -C for the cgroup namespace;
- -i for the IPC namespace.
For example, we can spawn a process which is using both:
- the host filesystem namespace (so we can execute the program from the host filesystem);
- and the container network namespace (so we can communicate to localhost-bound container sockets).
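To make the choice of namespaces explicit in scripts, the flags can be derived from namespace names. This is a small sketch; the function name nsenter_flags is hypothetical and it always adds -U, on the assumption that nsenter runs as a non-root user.

```shell
# Hypothetical helper: translate namespace names into nsenter flags.
nsenter_flags() {
    # -U comes first: joining the user namespace is required as non-root.
    flags="-U"
    for ns in "$@"; do
        case "$ns" in
            mnt)    flags="$flags -m" ;;
            net)    flags="$flags -n" ;;
            pid)    flags="$flags -p" ;;
            cgroup) flags="$flags -C" ;;
            ipc)    flags="$flags -i" ;;
            *)      echo "unknown namespace: $ns" >&2; return 1 ;;
        esac
    done
    echo "$flags"
}
```

For the hybrid process described above (host filesystem, container network), this would be used as nsenter -t "$pid" $(nsenter_flags net) followed by the host program to run. The command substitution is left unquoted on purpose so that each flag becomes a separate argument.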
Applications
Exposing a container localhost-bound socket through a host Unix socket
This can for example be used to expose a localhost-bound TCP socket of the container
through a path-based Unix socket of the host by executing socat from the host filesystem:
# Expose the container localhost-bound socket as a host Unix socket:
nsenter -t "$pid" -U -n socat UNIX-LISTEN:./proxy.sock,fork TCP:127.0.0.1:8000
# Test from host:
curl --unix-socket ./proxy.sock http://example/
Note: motivations
- we cannot --publish a localhost-bound service;
- podman does not let us publish a service through a Unix socket.
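The nsenter/socat command shown earlier can be wrapped in a function which also handles a stale socket file and restricts the socket permissions. This is a sketch under assumptions: the function name expose_container_port is hypothetical, and the socat address options unlink-early and mode=600 are additions which are not part of the original command.

```shell
# Hypothetical helper: expose TCP port $3, bound to the container localhost,
# as the host Unix socket $2, for the container whose main PID is $1.
expose_container_port() {
    # unlink-early removes a stale socket file left by a previous run;
    # mode=600 restricts access to the owner of the socket file.
    nsenter -t "$1" -U -n \
        socat "UNIX-LISTEN:$2,fork,unlink-early,mode=600" "TCP:127.0.0.1:$3"
}
```

For example, expose_container_port "$pid" ./proxy.sock 8000 would serve the same purpose as the command above.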
An alternative solution without nsenter would be:
podman run --rm \
--network=container:foo \
-v "$(pwd)":/mnt \
docker.io/library/debian \
sh -c 'apt update && apt install -y socat && socat UNIX-LISTEN:/mnt/proxy.sock,fork TCP:127.0.0.1:8000'
Alternatively, this hack could be used in order to execute programs of the host filesystem:
podman run --rm -it \
--network=container:foo \
-v /:/mnt \
docker.io/library/alpine \
chroot /mnt socat "UNIX-LISTEN:$(pwd)/proxy.sock,fork" TCP:127.0.0.1:8000
Exposing a container Unix socket through a host Unix socket
Forwarding a container Unix socket through a host Unix socket can be achieved using
podman mount in order to access both filesystems at the same time:
podman unshare sh -c '
dir="$(podman mount foo)"
# Workaround for Unix socket address max length:
mount --bind "$dir" /mnt/
socat UNIX-LISTEN:/run/user/1000/test.sock,fork UNIX:/mnt/run/test.sock
'
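The bind mount in the snippet above works around the size of the sun_path field of Unix socket addresses, which is 108 bytes on Linux: a socket path should therefore be at most 107 bytes long in order to leave room for the terminating NUL byte. The directory returned by podman mount is typically a long path under the container storage directory, so the resulting socket path can easily exceed this limit. A minimal check could look like this (the function name fits_sun_path is hypothetical):

```shell
# Succeeds if the path $1 fits in sun_path (108 bytes on Linux,
# including the terminating NUL byte).
fits_sun_path() {
    # Count bytes (not characters) and normalize the output of wc.
    len=$(( $(printf '%s' "$1" | wc -c) ))
    [ "$len" -le 107 ]
}
```

Inside the podman unshare script, something like fits_sun_path "$dir/run/test.sock" could be used to decide whether the bind mount is needed at all.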
Exposing a host Unix socket through a container Unix socket
We can do it the other way around:
podman unshare sh -c '
dir="$(podman mount foo)"
# Workaround for Unix socket address max length:
mount --bind "$dir" /mnt/
socat UNIX-LISTEN:/mnt/run/test.sock,fork UNIX:/run/user/1000/test.sock
'
Forwarding Unix socket between two containers
We can forward Unix socket communications between two containers:
podman unshare sh -c '
foo="$(podman mount foo)"
bar="$(podman mount bar)"
mount -t tmpfs -o size=1M tmpfs /mnt/
mkdir /mnt/foo
mkdir /mnt/bar
mount --bind "$foo/run" /mnt/foo
mount --bind "$bar/run" /mnt/bar
socat UNIX-LISTEN:/mnt/bar/test.sock,fork UNIX:/mnt/foo/test.sock
'
Note
Directly bind-mounting the socket from one container to the other
(mount --bind "$foo/run/test.sock" "$bar/run/test.sock")
does not appear to work.
The process in one container is not allowed to connect to the socket
from the other container for some reason.
Security considerations
Question: is it safe to use such a “hybrid” process (a process which is attached to some namespaces of the container and some namespaces of the host)? Could this be used for privilege escalation? Could a compromised process in the container try to exploit the hybrid process in order to get access to resources in the host system?
Summary:
- Joining the container PID namespace makes your process visible inside the container.
- It can be killed by a container process (running as the same user as seen from outside of the container).
- It could possibly be debugged by a container process which could then try to access other resources outside of the container through the hybrid process. This currently does not work in my case (see below for details).
- If the hybrid process executes code from the container filesystem, a container process with the right access could modify this code in order to hijack the hybrid process.
Attack 1: attacking a hybrid process executing programs from the container filesystem
If the hybrid process is running code from the container filesystem, a compromised process in the container could hijack the hybrid process by modifying this code on the container filesystem.
Attack 2: debugging the hybrid process from the container
Could a compromised process in the container try to debug
(e.g. using gdb, ptrace, /proc/$pid/mem, process_vm_readv/process_vm_writev, pidfd_getfd, etc.)
the hybrid process?
This assumes that the hybrid process has joined the container PID namespace (-p)
in order to be visible to the container processes.
Case 1: container process and hybrid process running as root in the container
If the container process is running as root in the container, the container process could manage to debug the hybrid process. In practice, my attempts failed with “ptrace: Operation not permitted.”.
The reason appears to be the following condition from the ptrace documentation:
Deny access if neither of the following is true:
- The caller and the target process are in the same user namespace, and the caller's capabilities are a superset of the target process's permitted capabilities.
- The caller has the CAP_SYS_PTRACE capability in the target process's user namespace.
In my case, the capabilities (which can be seen using cat /proc/$pid/status | grep -i ^Cap) were:
- 0x00000000800405fb in the container;
- 0x000001ffffffffff for the hybrid process (either from nsenter or podman unshare).
The first condition does not hold because the caller's (container) capabilities
are more restricted than the target's (hybrid process).
The second condition does not hold either because CAP_SYS_PTRACE
is not set in the container.
However, if Podman were to include CAP_SYS_PTRACE
in the container capabilities,
it would probably be possible for a container process to debug the hybrid process.
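The two checks on the capability masks can be reproduced with shell arithmetic. This is only a sketch: the function names are hypothetical, the bit number 19 for CAP_SYS_PTRACE comes from linux/capability.h, and which capability set (CapEff, CapPrm, etc.) is relevant for each process is left as a parameter.

```shell
# Bit number of CAP_SYS_PTRACE (from <linux/capability.h>).
CAP_SYS_PTRACE=19

# Succeeds if the capability mask $1 contains the capability number $2.
has_cap() {
    [ $(( ($1 >> $2) & 1 )) -eq 1 ]
}

# Succeeds if the capability mask $1 is a superset of the mask $2.
caps_superset() {
    [ $(( $1 & $2 )) -eq $(( $2 )) ]
}

# Print one capability mask (for example CapEff or CapPrm) of process $1.
cap_mask() {
    awk -v key="$2:" '$1 == key { print "0x" $2 }' "/proc/$1/status"
}
```

With the values above, has_cap 0x00000000800405fb "$CAP_SYS_PTRACE" fails (bit 19 is not set in the container mask) and caps_superset 0x00000000800405fb 0x000001ffffffffff fails as well, which matches the two conditions not being met. Where libcap is installed, capsh --decode=0x00000000800405fb gives a human-readable list of the capabilities in a mask.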
Case 2: container process and hybrid process running as non-root in the container
If the processes are running as the same (non-root) user in the container, the capability set is usually empty for both processes and the container process is able to debug the hybrid process.
References
- Podman
- bubblewrap, unprivileged sandboxing tool
- unshare(1) — Linux manual page
- nsenter(1) — Linux manual page
- namespaces(7) — Linux manual page
- lsns(8) — Linux manual page