Containers 101
- A container is a process the kernel lies to about what it can see and use. That is namespaces (what it sees) plus cgroups (how much it gets).
- Image ≠ container. Image is the filesystem + metadata on disk; container is a running (or stopped) instance of it.
- latest is a lie. Pin by digest (@sha256:…) in anything you care about.
- Containers share the host kernel. They are a boundary, not a wall. If you need a wall, use a VM.
- Rootless is the default you should be starting from in 2026, not an advanced setting.
- Every RUN in a Dockerfile is a layer. Order them by churn rate: slow-changing things first, source code last.
- What a container actually is
- Image vs container vs registry
- OCI: the standard that makes it all interchangeable
- Layers, union filesystems, and caching
- Rootless, and why it matters
- The security boundary (honestly)
- Digests vs tags: latest is a lie
- A minimal Dockerfile and the rules it obeys
- Multi-stage builds
- Docker and Podman side-by-side
What a container actually is
A container is not a virtual machine. It is a regular Linux process that the kernel has been asked to isolate using two features:
- Namespaces: what the process can see. Each namespace type (pid, mnt, net, uts, ipc, user, cgroup, time) gives the process its own view of that slice of the system. A process in its own PID namespace sees itself as PID 1 and cannot see the host's other processes.
- cgroups (v2): what the process can use. CPU shares, memory limits, IO bandwidth, number of processes, device allow-lists.
That is it. There is no hypervisor. There is no "container daemon" in the kernel. When you run docker run alpine sh, you get a process that is still running on your host's kernel — it just thinks it's alone on a very small machine.
You can prove this to yourself without a container runtime:
# On any Linux host with util-linux:
sudo unshare --pid --fork --mount-proc --uts --net --ipc /bin/bash
# You are now in a new PID, mount, UTS, net, and IPC namespace.
# ps -ef -> shows only bash and the ps command itself
# hostname -> changeable without affecting host
# ip addr -> shows only lo (no interfaces until you build one)
That shell is, in every sense that matters, a container. A container runtime like Podman or Docker wraps this with an image, a working directory, cgroup limits, a writeable layer, optional networking, and a lifecycle API, but the underlying kernel mechanism is the same one the unshare command above uses.
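The namespace membership described above is visible under /proc without any runtime. The following is a minimal sketch, assuming a Linux host with /proc mounted; ns_id is a hypothetical helper name, not part of any tool.

```shell
# ns_id TYPE [PID]: print the namespace identifier for a process.
# Two processes share a namespace exactly when these identifiers match.
ns_id() {
  readlink "/proc/${2:-self}/ns/$1"
}

# Print the identifiers for the current shell, one per namespace type.
for ns in pid mnt net uts ipc user; do
  printf '%s\t%s\n' "$ns" "$(ns_id "$ns" $$)"
done
```

Run it once on the host and once inside the unshare shell: the pid, mnt, net, uts, and ipc identifiers should differ, while user stays the same because that command does not create a user namespace.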
Image vs container vs registry
The three things people confuse the most:
| Thing | What it is | Lives on |
|---|---|---|
| Image | An immutable, content-addressed filesystem + JSON metadata (entrypoint, env, exposed ports). | Disk (local cache) or a registry. |
| Container | A running (or stopped) instance of an image, plus a writeable top layer and runtime config. | The host where it was created. |
| Registry | A content-addressed store that serves images over HTTPS using the OCI distribution spec. | A URL. Docker Hub, GHCR, ECR, a self-hosted Harbor, a local registry:2. |
Think of it the way you'd think of executables: the image is the binary on disk, the container is the running process, and the registry is the package mirror that shipped you the binary.
OCI: the standard that makes it all interchangeable
The Open Container Initiative is the spec that makes "containers" a portable concept. Three documents matter:
- Image spec: how an image's layers, config, and manifest are laid out on disk.
- Runtime spec: how a runtime (runc, crun, youki) turns an unpacked image into a running process.
- Distribution spec: how registries serve images over HTTP.
If a tool says "OCI-compatible", it means: any registry can serve its images, any runtime can run them, and any build tool can produce them. This is why Podman can pull a Docker-built image from Docker Hub and run it under crun. They are not Docker images; they are OCI images that Docker also happens to produce.
Layers, union filesystems, and caching
An image is a stack of tarballs called layers. Each layer adds or removes files relative to the layer below it. At runtime, a union filesystem (usually overlayfs) merges them into a single view and adds one writeable layer on top — that layer is the container's ephemeral state.
┌──────────────────────────┐ writeable layer (container)
├──────────────────────────┤ COPY app.jar (image layer, your code)
├──────────────────────────┤ RUN apk add curl (image layer)
├──────────────────────────┤ FROM alpine:3.20 (image layer, base OS)
└──────────────────────────┘
Layers are content-addressed by the SHA-256 of their tarball, which means:
- Identical layers across two images are stored once on disk and pulled once over the network.
- A layer cache hit in a build is deterministic: if a step's instruction, the layers beneath it, and any files it copies in haven't changed, the builder reuses the existing layer.
- Reordering your Dockerfile to put slow-changing things first (installed packages) and fast-changing things last (source code) is not a premature optimisation; it is the whole point of the layer system.
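Content addressing is easy to see with ordinary files. This is a minimal sketch, assuming GNU coreutils (sha256sum, mktemp); layer_digest is a hypothetical helper, and the two files are stand-ins for layer tarballs rather than real layers.

```shell
# layer_digest FILE: print the sha256 digest of a file in the
# "sha256:<hex>" form used for image layers.
layer_digest() {
  printf 'sha256:%s\n' "$(sha256sum "$1" | cut -d' ' -f1)"
}

# Two files with identical bytes get the same digest, so a store keyed
# by digest only ever needs to keep one copy.
tmp=$(mktemp -d)
printf 'same bytes' > "$tmp/layer-a.tar"
printf 'same bytes' > "$tmp/layer-b.tar"
layer_digest "$tmp/layer-a.tar"
layer_digest "$tmp/layer-b.tar"
rm -rf "$tmp"
```

Both calls print the same value. A registry applies the same idea to the layer blob as it is stored, which is usually the compressed tarball.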
Rootless, and why it matters
"Rootless" means the container runtime itself, and the container processes, run as an unprivileged user on the host. It uses user namespaces to give the container an ID range that looks like root inside but is a regular user outside:
# Inside the container:
$ id
uid=0(root) gid=0(root)
# On the host, the same process:
$ ps -eo user,pid,cmd | grep myapp
alice 31234 /usr/bin/myapp # not root on the host
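The mapping is configured in /etc/subuid and /etc/subgid. A user gets an allocated range of sub-UIDs (typically 65,536 per user). Container UID 0 maps to the user's own host UID, and container UIDs 1 and up map into the sub-UID range, so container UID 1 becomes, say, host UID 100000.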
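The arithmetic behind a user-namespace mapping can be sketched in a few lines. The kernel exposes the active map in /proc/PID/uid_map as lines of "inside outside count". In the sketch below, map_uid is a hypothetical helper and the example map is an assumption: a typical rootless layout for a user with UID 1000 and a sub-UID range starting at 100000, which matches the ps output above where the process shows up as alice.

```shell
# map_uid CONTAINER_UID: read uid_map-style lines ("inside outside count")
# on stdin and print the host UID that CONTAINER_UID maps to.
# Exits non-zero if the UID is not covered by any line.
map_uid() {
  awk -v uid="$1" '
    uid >= $1 && uid < $1 + $3 { print $2 + (uid - $1); found = 1; exit }
    END { if (!found) exit 1 }
  '
}

# Assumed example map, not read from a real system.
example_map='0 1000 1
1 100000 65536'

printf '%s\n' "$example_map" | map_uid 0    # container root -> 1000
printf '%s\n' "$example_map" | map_uid 33   # -> 100032
```

To check a real container, feed the function the contents of /proc/PID/uid_map for one of its processes instead of the example map.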
The security boundary (honestly)
A shared kernel is a shared attack surface. A kernel bug that lets an in-container process break isolation affects every container on the host. That is not hypothetical — it happens — and the mitigations are layered:
- Run rootless. An escape from an unprivileged container means the attacker is an unprivileged host user, not host root.
- Drop capabilities. The default set is much smaller than root-on-host, but it is still bigger than most apps need. Use --cap-drop=ALL, then add back only what you need (most web apps need none).
- Seccomp filters the syscalls a container can call. The default profile blocks the dangerous ones (keyctl, ptrace in most configs, etc.). Don't disable it.
- SELinux / AppArmor gives you MAC on top of DAC. On Fedora/RHEL, SELinux labels containers automatically; don't put --privileged in the launch line to make an error go away.
- Read-only root filesystem (--read-only) plus tmpfs for /tmp stops most persistence tricks.
- Don't run as root inside the container either: add a USER to the Dockerfile.
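Put together, the flags from the list above make a fairly long command line, so it can help to generate it in one place. This is a minimal sketch: hardened_run_cmd is a hypothetical helper, the image name myapp:1.2.3 and the UID:GID 1000:1000 are placeholders, and the function only prints the command so you can review it before running it.

```shell
# hardened_run_cmd IMAGE [ARGS...]: print (not execute) a run command
# with all capabilities dropped, a read-only root filesystem, a tmpfs
# on /tmp, and a non-root user. ENGINE defaults to podman.
# Seccomp and SELinux/AppArmor stay at their defaults because nothing
# here disables them.
hardened_run_cmd() {
  printf '%s run --rm --cap-drop=ALL --read-only --tmpfs /tmp --user 1000:1000' "${ENGINE:-podman}"
  printf ' %s' "$@"
  printf '\n'
}

hardened_run_cmd myapp:1.2.3 --port 8080
```

If the app needs a capability back, add a single --cap-add for it rather than removing --cap-drop=ALL.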
Digests vs tags: latest is a lie
A tag (myimage:1.2.3) is a mutable pointer. The registry owner can repoint it tomorrow. A digest (myimage@sha256:abc123…) is the image's content hash: the content cannot change without the digest changing too.
# Get the current digest for a tag
docker buildx imagetools inspect nginx:1.27-alpine
# Or after pulling:
docker inspect --format '{{index .RepoDigests 0}}' nginx:1.27-alpine
# Pin by digest in production:
# FROM nginx@sha256:1234abcd...
- latest is just a tag like any other. It is whatever the publisher last pushed with no tag. It changes silently. Never deploy it.
- Semver tags (1.27-alpine) are better but still mutable: the publisher can push a new image to the same tag.
- Digests are the only immutable reference. Use them in production Dockerfiles, Kubernetes manifests, and CI caches. Use a tool like Dependabot or Renovate to bump digests in PRs so you still get updates.
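A check like the one below can guard a deploy script against tag-only references. It is a minimal sketch, assuming grep with -E support; is_digest_pinned is a hypothetical helper that only inspects the shape of the reference and does not contact a registry.

```shell
# is_digest_pinned REF: succeed only if REF ends in
# "@sha256:" followed by exactly 64 lowercase hex characters.
is_digest_pinned() {
  printf '%s\n' "$1" | grep -Eq '@sha256:[0-9a-f]{64}$'
}

# Both of these are tags, so both are reported.
for ref in nginx:latest nginx:1.27-alpine; do
  is_digest_pinned "$ref" || echo "not pinned: $ref"
done
```

A reference that carries both a tag and a digest (name:tag@sha256:…) also passes; in that form the digest is what identifies the image.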
A minimal Dockerfile and the rules it obeys
# Pin a specific base by tag; pin by digest for prod.
FROM python:3.12-slim-bookworm
# Create a non-root user early so subsequent steps can chown to it.
RUN groupadd --system app && useradd --system --gid app --home /app app
WORKDIR /app
# Dependencies first — they change least often, maximising layer cache hits.
COPY --chown=app:app requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Source code last — it changes on every build.
COPY --chown=app:app src/ ./src/
USER app
EXPOSE 8080
# Exec form (no shell) is required for correct signal handling.
ENTRYPOINT ["python", "-m", "src.app"]
CMD ["--port", "8080"]
The rules this Dockerfile follows:
- One concern per image. This is the app image; logging, metrics, and a reverse proxy are other containers.
- Pinned base image. Consider digest-pinning in production.
- Dependencies before source. A source-only change still reuses the pip-install layer.
- --no-cache-dir on installers: the cache bloats the image for no runtime benefit.
- A non-root USER.
- Exec form (["python", "-m", "..."]) so PID 1 receives SIGTERM from the runtime. Shell form (python -m ...) wraps the command in /bin/sh -c, so the shell is PID 1 and the signal does not reach the app.
- No secrets in ENV. ARGs and ENVs end up in the image history; use build-time secrets (RUN --mount=type=secret) if you need them during build.
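Two of these rules are mechanical enough to check in CI. The sketch below assumes awk and grep are available; check_dockerfile is a hypothetical helper that does simple text matching, not a real Dockerfile parser. It ignores line continuations and, in a multi-stage file, looks at the last USER in the whole file.

```shell
# check_dockerfile FILE: report a missing non-root USER and a
# shell-form ENTRYPOINT. Returns non-zero if either is found.
check_dockerfile() {
  status=0
  # The last USER instruction decides who the container runs as.
  user=$(awk 'toupper($1) == "USER" { u = $2 } END { print u }' "$1")
  case "$user" in
    ''|root|0|root:*|0:*)
      echo "runs as root: add a non-root USER"
      status=1
      ;;
  esac
  # Exec form starts with a JSON array, i.e. ENTRYPOINT [ ... ].
  if grep -Eiq '^[[:space:]]*ENTRYPOINT[[:space:]]+[^[[:space:]]' "$1"; then
    echo "ENTRYPOINT uses shell form: switch to exec form"
    status=1
  fi
  return $status
}

# Usage: check_dockerfile Dockerfile || exit 1
```

For anything beyond a quick guard, a dedicated linter is a better fit than text matching.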
Multi-stage builds
Multi-stage builds let you use a heavy toolchain to produce a binary, then copy only the binary into a small runtime image. The intermediate layers never ship to production.
# --- build stage ---------------------------------------------------
FROM golang:1.22-alpine AS build
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -ldflags="-s -w" -o /out/app ./cmd/app
# --- runtime stage -------------------------------------------------
FROM gcr.io/distroless/static:nonroot
COPY --from=build /out/app /app
USER nonroot:nonroot
ENTRYPOINT ["/app"]
The resulting image is the Go binary plus the distroless base (~2 MB). The Go toolchain, module cache, and source tree never leave the build stage. This is the pattern for Go, Rust, Java (build on JDK, run on JRE), C/C++, and most compiled languages.
docker build --target build . stops at the build stage — useful for running tests against the build image in CI without rebuilding.
Docker and Podman side-by-side
Podman is a near drop-in replacement for the Docker CLI. Most commands are identical; the defaults differ (Podman is rootless and daemonless).
| Task | Docker | Podman |
|---|---|---|
| Run an image interactively | docker run --rm -it alpine sh | podman run --rm -it alpine sh |
| Build from a Dockerfile | docker build -t myapp . | podman build -t myapp . |
| List running containers | docker ps | podman ps |
| List images | docker images | podman images |
| Pull by digest | docker pull nginx@sha256:… | podman pull nginx@sha256:… |
| Log in to a registry | docker login ghcr.io | podman login ghcr.io |
| Exec into a container | docker exec -it web sh | podman exec -it web sh |
| Stop and remove | docker rm -f web | podman rm -f web |
| Inspect JSON | docker inspect web | podman inspect web |
| View a container's resources | docker stats | podman stats |
| Map a port | -p 8080:80 | -p 8080:80 (rootless can't bind host ports below 1024 unless net.ipv4.ip_unprivileged_port_start is lowered) |
| Generate systemd units | (third party) | Quadlet (podman generate systemd still exists but is deprecated) |
| Auto-update | (Watchtower, third party) | podman auto-update + label |
The two big behavioural differences:
- Daemon vs. fork/exec. Docker talks to a long-running daemon; Podman forks a runtime directly. That means you can strace a Podman container as the user that launched it, and there is no root daemon whose socket gives every user in the docker group root-equivalent access.
- Pods. Podman has a first-class pod concept (a shared network/IPC/UTS namespace for a set of containers) that Docker does not. It's the same abstraction Kubernetes pods use. See Podman Basics.
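Because the two CLIs accept the same commands for everyday tasks, a script can choose whichever is installed. This is a minimal sketch; container_engine is a hypothetical helper, and preferring Podman over Docker is a choice made here, not a requirement.

```shell
# container_engine: print the first available engine, preferring Podman.
# Returns non-zero if neither podman nor docker is on PATH.
container_engine() {
  for engine in podman docker; do
    if command -v "$engine" >/dev/null 2>&1; then
      printf '%s\n' "$engine"
      return 0
    fi
  done
  echo "no container engine found" >&2
  return 1
}

# Usage: "$(container_engine)" run --rm -it alpine sh
```

This only covers the commands in the table that are identical on both sides; engine-specific features such as pods or Quadlet still need a branch.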
Next up: Podman basics for the daily workflow, Docker Compose for multi-container local dev, and Kubernetes Light when one host isn't enough.