Containers 101

A container is a Linux process with namespaces and cgroups, not a tiny VM. This page covers images, registries, OCI, layers, union filesystems, rootless mode, the security boundary, image digests, and a minimal Dockerfile.

If you only remember six things

A container is a process the kernel lies to about what it can see and use. That is namespaces (what it sees) plus cgroups (how much it gets).
Image ≠ container. Image is the filesystem + metadata on disk; container is a running (or stopped) instance of it.
latest is a lie. Pin by digest (@sha256:…) in anything you care about.
Containers share the host kernel. They are a boundary, not a wall. If you need a wall, use a VM.
Rootless is the default you should be starting from in 2026, not an advanced setting.
Every RUN, COPY, and ADD in a Dockerfile creates a layer. Order them by churn rate: slow-changing things first, source code last.

On this page

What a container actually is
Image vs container vs registry
OCI: the standard that makes it all interchangeable
Layers, union filesystems, and caching
Rootless, and why it matters
The security boundary (honestly)
Digests vs tags: latest is a lie
A minimal Dockerfile and the rules it obeys
Multi-stage builds
Docker and Podman side-by-side

What a container actually is

A container is not a virtual machine. It is a regular Linux process that the kernel has been asked to isolate using two features:

Namespaces — what the process can see. Each namespace type (pid, mnt, net, uts, ipc, user, cgroup, time) gives the process its own view of that slice of the system. A process in its own PID namespace sees itself as PID 1 and cannot see the host's other processes.
cgroups (v2) — what the process can use. CPU shares, memory limits, IO bandwidth, number of processes, device allow-lists.

That is it. There is no hypervisor. There is no "container daemon" in the kernel. When you run docker run alpine sh, you get a process that is still running on your host's kernel — it just thinks it's alone on a very small machine.
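To see that namespaces are ordinary per-process kernel state rather than something a runtime invents, the sketch below lists the namespaces of the current shell by reading /proc/self/ns. It assumes a Linux host with procfs mounted; the function name list_namespaces is made up for illustration.

```shell
# List the namespaces the current process belongs to.
# Each entry in /proc/self/ns is a symlink whose target looks like "pid:[<inode>]".
# Two processes share a namespace when the bracketed inode numbers match.
list_namespaces() {
  for ns in /proc/self/ns/*; do
    printf '%-20s %s\n' "$(basename "$ns")" "$(readlink "$ns")"
  done
}

list_namespaces
```

Running the same function on the host and then inside a container should show different inode numbers for every namespace the runtime unshared.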

You can prove this to yourself without a container runtime:

# On any Linux host with util-linux:
sudo unshare --pid --fork --mount-proc --uts --net --ipc /bin/bash
# You are now in a new PID, mount, UTS, net, and IPC namespace.
# ps -ef      -> shows only bash
# hostname    -> changeable without affecting host
# ip addr     -> shows only lo (no interfaces until you build one)

That shell is, in every sense that matters, a container. A container runtime like Podman or Docker wraps this with an image, a working directory, cgroup limits, a writeable layer, optional networking, and a lifecycle API, but the underlying kernel mechanism is the same one that single unshare command used.
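The cgroup side can be inspected from userspace in the same way. The following is a minimal sketch that assumes a cgroup v2 host with the hierarchy mounted at /sys/fs/cgroup (the common default, but still an assumption); show_cgroup_limits is a hypothetical helper name.

```shell
# Print the cgroup v2 path of the current process and a few of its limits.
show_cgroup_limits() {
  # On cgroup v2, /proc/self/cgroup has a single line of the form "0::/some/path".
  cg_path=$(sed -n 's/^0:://p' /proc/self/cgroup 2>/dev/null | head -n 1)
  if [ -z "$cg_path" ]; then
    echo "no cgroup v2 entry found (cgroup v1 host, or not Linux)"
    return 0
  fi
  echo "cgroup: $cg_path"
  # Assumes the cgroup2 filesystem is mounted at /sys/fs/cgroup.
  for f in memory.max cpu.max pids.max; do
    file="/sys/fs/cgroup${cg_path}/${f}"
    if [ -r "$file" ]; then
      echo "$f: $(cat "$file")"
    else
      echo "$f: (not available here)"
    fi
  done
}

show_cgroup_limits
```

A value of max means no limit is set. Inside a container started with a memory or PID limit, the corresponding file should show the number the runtime wrote there.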

Contrast with VMs. A VM has its own kernel and its own hardware abstractions (virtual CPU, NIC, disk). Containers share the host kernel. That is what makes them fast to start and cheap to run, and it is also the source of every meaningful security caveat below.

Image vs container vs registry

The three things people confuse the most:

Thing	What it is	Lives on
Image	An immutable, content-addressed filesystem + JSON metadata (entrypoint, env, exposed ports).	Disk (local cache) or a registry.
Container	A running (or stopped) instance of an image, plus a writeable top layer and runtime config.	The host where it was created.
Registry	A content-addressed store that serves images over HTTPS using the OCI distribution spec.	A URL. Docker Hub, GHCR, ECR, a self-hosted Harbor, a local `registry:2`.

Think of it the way you'd think of executables: the image is the binary on disk, the container is the running process, and the registry is the package mirror that shipped you the binary.

OCI: the standard that makes it all interchangeable

The Open Container Initiative is the spec that makes "containers" a portable concept. Three documents matter:

Image spec — how an image's layers, config, and manifest are laid out on disk.
Runtime spec — how a runtime (runc, crun, youki) turns an image into a running process.
Distribution spec — how registries serve images over HTTP.

If a tool says "OCI-compatible", it means: any registry can serve its images, any runtime can run them, and any build tool can produce them. This is why Podman can pull a Docker-built image from Docker Hub and run it under crun. They are not Docker images; they are OCI images that Docker also happens to produce.
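To make the image spec less abstract, here is a rough sketch that assembles a one-layer manifest by hand. The media types are the ones the OCI image spec defines; the file names, the amd64 architecture, and the blob_descriptor helper are illustrative assumptions, and real build tools write more fields than this.

```shell
# Build a throwaway root filesystem and pack it into an uncompressed layer tarball.
work=$(mktemp -d)
mkdir -p "$work/rootfs"
echo "hello" > "$work/rootfs/hello.txt"
tar -C "$work/rootfs" -cf "$work/layer.tar" .

# Minimal image config: platform plus the digest of each uncompressed layer (diff_ids).
printf '{"architecture":"amd64","os":"linux","rootfs":{"type":"layers","diff_ids":["sha256:%s"]}}' \
  "$(sha256sum "$work/layer.tar" | cut -d' ' -f1)" > "$work/config.json"

# A descriptor points at a blob by media type, digest, and size in bytes.
blob_descriptor() {
  printf '{"mediaType":"%s","digest":"sha256:%s","size":%s}' \
    "$1" "$(sha256sum "$2" | cut -d' ' -f1)" "$(wc -c < "$2" | tr -d ' ')"
}

# The manifest ties one config descriptor to an ordered list of layer descriptors.
manifest=$(printf '{"schemaVersion":2,"mediaType":"application/vnd.oci.image.manifest.v1+json","config":%s,"layers":[%s]}' \
  "$(blob_descriptor application/vnd.oci.image.config.v1+json "$work/config.json")" \
  "$(blob_descriptor application/vnd.oci.image.layer.v1.tar "$work/layer.tar")")

echo "$manifest"
```

Everything in the manifest is a reference by digest, which is what lets any registry store the blobs and any runtime verify them after download.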

Layers, union filesystems, and caching

An image is a stack of tarballs called layers. Each layer adds or removes files relative to the layer below it. At runtime, a union filesystem (usually overlayfs) merges them into a single view and adds one writeable layer on top — that layer is the container's ephemeral state.

┌──────────────────────────┐  writeable layer (container)
├──────────────────────────┤  COPY app.jar       (image layer, your code)
├──────────────────────────┤  RUN apk add curl   (image layer)
├──────────────────────────┤  FROM alpine:3.20   (image layer, base OS)
└──────────────────────────┘
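The lookup order in that stack can be imitated without mounting anything. This is a simplified simulation, not how overlayfs is implemented: plain directories stand in for layers, and a file named .wh.<name> plays the role of a whiteout, which is how OCI layer tarballs record a deletion. Opaque directories and whiteouts on parent directories are ignored. The function and directory names are hypothetical.

```shell
# Resolve a relative path through a stack of layer directories, topmost first.
resolve_in_layers() {
  path=$1; shift
  dir=$(dirname "$path"); base=$(basename "$path")
  for layer in "$@"; do
    # A whiteout in a higher layer hides the file in every layer below it.
    if [ -e "$layer/$dir/.wh.$base" ]; then
      echo "(deleted)"; return 1
    fi
    if [ -e "$layer/$path" ]; then
      echo "$layer/$path"; return 0
    fi
  done
  echo "(not found)"; return 1
}

# Demo: three throwaway directories standing in for base, package, and writeable layers.
demo=$(mktemp -d)
mkdir -p "$demo/base/etc" "$demo/pkg/usr/bin" "$demo/upper/etc"
echo "base"   > "$demo/base/etc/motd"
echo "host"   > "$demo/base/etc/hostname"
echo "curl"   > "$demo/pkg/usr/bin/curl"
echo "edited" > "$demo/upper/etc/motd"     # the container modified this file
: > "$demo/upper/etc/.wh.hostname"          # the container deleted this file

resolve_in_layers etc/motd     "$demo/upper" "$demo/pkg" "$demo/base"          # copy in upper wins
resolve_in_layers usr/bin/curl "$demo/upper" "$demo/pkg" "$demo/base"          # found in the pkg layer
resolve_in_layers etc/hostname "$demo/upper" "$demo/pkg" "$demo/base" || true  # hidden by the whiteout
```

The image layers are never modified: the edit and the deletion both live only in the top directory, which is why removing a container discards them.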

Layers are content-addressed by the SHA-256 of their tarball, which means:

Identical layers across two images are stored once on disk and pulled once over the network.
A layer cache hit in a build is deterministic: the builder reuses an existing layer when the instruction and its parent layer are unchanged and, for COPY and ADD, the checksums of the copied files are unchanged too.
Reordering your Dockerfile to put slow-changing things first (installed packages) and fast-changing things last (source code) is not a premature optimisation — it is the whole point of the layer system.
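The deduplication claim follows directly from naming blobs by their hash. Below is a small sketch of a content-addressed store; the blobs/sha256/ directory shape mirrors the OCI image layout, while store_blob and the file names are invented for the example, and the "layers" are plain text files standing in for tarballs.

```shell
# Store a file under the SHA-256 of its contents and print the resulting digest.
store_blob() {
  src=$1; store=$2
  digest=$(sha256sum "$src" | cut -d' ' -f1)
  mkdir -p "$store/blobs/sha256"
  # Identical content hashes to the same name, so a second copy is a no-op.
  [ -e "$store/blobs/sha256/$digest" ] || cp "$src" "$store/blobs/sha256/$digest"
  echo "sha256:$digest"
}

tmp=$(mktemp -d)
printf 'same bytes\n'  > "$tmp/layer-from-image-a.tar"
printf 'same bytes\n'  > "$tmp/layer-from-image-b.tar"
printf 'other bytes\n' > "$tmp/layer-from-image-c.tar"

a=$(store_blob "$tmp/layer-from-image-a.tar" "$tmp/store")
b=$(store_blob "$tmp/layer-from-image-b.tar" "$tmp/store")
c=$(store_blob "$tmp/layer-from-image-c.tar" "$tmp/store")

echo "a=$a"
echo "b=$b"
echo "c=$c"
echo "blobs on disk: $(ls "$tmp/store/blobs/sha256" | wc -l | tr -d ' ')"
```

Three files go in but only two blobs are stored, because a and b have the same digest. A registry pull works the same way: a layer whose digest is already present locally is skipped.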

Rootless, and why it matters

"Rootless" means the container runtime itself, and the container processes, run as an unprivileged user on the host. It uses user namespaces to give the container an ID range that looks like root inside but is a regular user outside:

# Inside the container:
$ id
uid=0(root) gid=0(root)

# On the host, the same process:
$ ps -eo user,pid,cmd | grep myapp
alice   31234   /usr/bin/myapp    # not root on the host

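The mapping is configured in /etc/subuid and /etc/subgid. A user gets an allocated range of sub-UIDs (typically 65,536 per user). Container UID 0 maps to the user's own host UID, which is why the process above shows up as alice, and container UIDs from 1 upward map into the allocated range (starting at, say, host UID 100000).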

Why you want this. If a container process escapes its namespaces, it lands as an unprivileged host user, not as host root. That is a meaningful defence in depth and it costs you almost nothing. Podman is rootless by default. Docker has a rootless mode; use it unless you have a specific reason not to.
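The UID translation is simple arithmetic over the lines the kernel exposes in /proc/<pid>/uid_map, each of the form "first UID inside, first UID outside, count". The sketch below applies that arithmetic to an assumed example: a user with host UID 1000 and a sub-UID range starting at 100000, which is the shape a rootless runtime typically sets up. map_uid is a hypothetical helper, and the numbers are illustrative.

```shell
# Translate a container UID to a host UID using uid_map-style lines:
#   <first UID inside> <first UID outside> <count>
map_uid() {
  want=$1; map=$2
  echo "$map" | while read -r inside outside count; do
    if [ "$want" -ge "$inside" ] && [ "$want" -lt $((inside + count)) ]; then
      echo $((outside + want - inside))
      break
    fi
  done
}

# Assumed example: alice has host UID 1000 and the /etc/subuid entry "alice:100000:65536".
example_map='0 1000 1
1 100000 65536'

map_uid 0  "$example_map"   # container root            -> prints 1000
map_uid 33 "$example_map"   # a service account, UID 33 -> prints 100032
```

To see the real map for a running rootless container, read /proc/<pid>/uid_map for one of its processes on the host.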

The security boundary (honestly)

A shared kernel is a shared attack surface. A kernel bug that lets an in-container process break isolation affects every container on the host. That is not hypothetical — it happens — and the mitigations are layered:

Run rootless. An escape from an unprivileged container means the attacker is an unprivileged host user, not host root.
Drop capabilities. The default set is much smaller than root-on-host, but it is still bigger than most apps need. --cap-drop=ALL then add back only what you need (most web apps need none).
Seccomp filters the syscalls a container can call. The default profile blocks the dangerous ones (keyctl, ptrace in most configs, etc.). Don't disable it.
SELinux / AppArmor gives you MAC on top of DAC. On Fedora/RHEL, SELinux labels containers automatically; don't put --privileged in the launch line to make an error go away.
Read-only root filesystem (--read-only) plus tmpfs for /tmp stops most persistence tricks.
Don't run as root inside the container either — add a USER to the Dockerfile.
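Several items from the list above translate directly into flags. The sketch below only prints the resulting command (a dry run) so it can be reviewed before use; the image name and the 10001 UID are placeholders, and hardened_run_cmd is a made-up helper. The same flags exist in the Docker CLI.

```shell
# Assemble a hardened run command and print it without executing it.
# Run the printed command as a non-root user to get rootless Podman.
hardened_run_cmd() {
  image=$1
  echo podman run --rm \
    --cap-drop=ALL \
    --read-only \
    --tmpfs /tmp \
    --user 10001:10001 \
    "$image"
}

hardened_run_cmd registry.example.com/myapp:1.2.3
```

Seccomp and SELinux or AppArmor need no flag here: the default profiles stay in force as long as the command does not add --privileged or switch them off with --security-opt.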

Containers do not replace VMs for hostile multi-tenancy. If you are running untrusted customer code, use a VM boundary (Firecracker, Kata Containers, a real hypervisor). Containers are great for isolating your own software from itself.

Digests vs tags: `latest` is a lie

A tag (myimage:1.2.3) is a mutable pointer. The registry owner can repoint it tomorrow. A digest (myimage@sha256:abc123…) is the hash of the image's manifest, so the content cannot change without the digest changing with it.

# Get the current digest for a tag
docker buildx imagetools inspect nginx:1.27-alpine

# Or after pulling:
docker inspect --format '{{index .RepoDigests 0}}' nginx:1.27-alpine

# Pin by digest in production:
# FROM nginx@sha256:1234abcd...

latest is just a tag like any other. It is whatever the publisher last pushed with no tag. It changes silently. Never deploy it.
Semver tags (1.27-alpine) are better but still mutable — the publisher can force-push a new image to the same tag.
Digests are the only immutable reference. Use them in production Dockerfiles, Kubernetes manifests, and CI caches. Use a tool like Dependabot or Renovate to bump digests in PRs so you still get updates.
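One way to enforce this in CI is a check that rejects any image reference not pinned by digest. The following is a minimal sketch under that assumption; is_digest_pinned is a hypothetical helper, and the digest in the demo is generated from the string "demo", so it does not correspond to any real image.

```shell
# Succeeds only if an image reference ends in a full SHA-256 digest.
is_digest_pinned() {
  printf '%s\n' "$1" | grep -Eq '@sha256:[0-9a-f]{64}$'
}

# A made-up 64-hex-digit digest for the demo; not a real image digest.
fake_digest=$(printf 'demo' | sha256sum | cut -d' ' -f1)

for ref in "nginx:latest" "nginx:1.27-alpine" "nginx@sha256:$fake_digest"; do
  if is_digest_pinned "$ref"; then
    echo "pinned:   $ref"
  else
    echo "mutable:  $ref"
  fi
done
```

The check is purely syntactic. It confirms a reference is immutable, not that the digest exists in the registry or is the version you intended.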

A minimal Dockerfile and the rules it obeys

# Pin a specific base by tag; pin by digest for prod.
FROM python:3.12-slim-bookworm

# Create a non-root user early so subsequent steps can chown to it.
RUN groupadd --system app && useradd --system --gid app --home /app app

WORKDIR /app

# Dependencies first — they change least often, maximising layer cache hits.
COPY --chown=app:app requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Source code last — it changes on every build.
COPY --chown=app:app src/ ./src/

USER app
EXPOSE 8080

# Exec form (no shell) is required for correct signal handling.
ENTRYPOINT ["python", "-m", "src.app"]
CMD ["--port", "8080"]

The rules this Dockerfile follows:

One concern per image. This is the app image; logging, metrics, and a reverse proxy are other containers.
Pinned base image. Consider digest-pinning in production.
Dependencies before source. A source-only change still reuses the pip-install layer.
--no-cache-dir on installers — the cache bloats the image for no runtime benefit.
A non-root USER.
Exec form (["python", "-m", "..."]) so your process is PID 1 and receives SIGTERM from the runtime. Shell form (python -m ...) runs under /bin/sh -c, so the shell is PID 1 and typically does not pass the signal on.
No secrets in ENV. ARGs and ENVs end up in the image history; use build-time secrets (RUN --mount=type=secret) if you need them during build.
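A few of these rules are mechanical enough to check with grep. The sketch below is a rough lint, not a Dockerfile parser: it ignores line continuations and multi-stage nuances, accepts USER root as a USER, and flags only variable names containing secret, password, or token. The function name and the sample Dockerfile are illustrative.

```shell
# Rough lint for three rules: non-root USER, exec-form ENTRYPOINT/CMD, no secrets in ENV/ARG.
lint_dockerfile() {
  file=$1; status=0
  grep -Eq '^[[:space:]]*USER[[:space:]]+' "$file" \
    || { echo "no USER instruction: container would run as root"; status=1; }
  # Exec form starts with "[" after the instruction; anything else is shell form.
  if grep -E '^[[:space:]]*(ENTRYPOINT|CMD)[[:space:]]+' "$file" \
       | grep -Ev '^[[:space:]]*(ENTRYPOINT|CMD)[[:space:]]+\[' | grep -q .; then
    echo "shell-form ENTRYPOINT/CMD: PID 1 will be /bin/sh"; status=1
  fi
  grep -Eiq '^[[:space:]]*(ENV|ARG)[[:space:]]+[^[:space:]]*(secret|password|token)' "$file" \
    && { echo "possible secret in ENV/ARG"; status=1; }
  return $status
}

# Demo input that breaks all three rules.
tmpd=$(mktemp -d)
cat > "$tmpd/Dockerfile.bad" <<'EOF'
FROM python:3.12-slim-bookworm
ENV API_TOKEN=changeme
ENTRYPOINT python -m src.app
EOF

lint_dockerfile "$tmpd/Dockerfile.bad" || echo "lint failed"
```

For real projects a dedicated linter will catch far more than this; the point here is only that the rules are concrete enough to automate.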

Multi-stage builds

Multi-stage builds let you use a heavy toolchain to produce a binary, then copy only the binary into a small runtime image. The intermediate layers never ship to production.

# --- build stage ---------------------------------------------------
FROM golang:1.22-alpine AS build
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -ldflags="-s -w" -o /out/app ./cmd/app

# --- runtime stage -------------------------------------------------
FROM gcr.io/distroless/static:nonroot
COPY --from=build /out/app /app
USER nonroot:nonroot
ENTRYPOINT ["/app"]

The resulting image contains only the Go binary and the distroless static base, which is itself only about 2 MB. The Go toolchain, module cache, and source tree never leave the build stage. This is the pattern for Go, Rust, Java (build on JDK, run on JRE), C/C++, and most compiled languages.

Target a stage. docker build --target build . stops at the build stage — useful for running tests against the build image in CI without rebuilding.

Docker and Podman side-by-side

Podman is a near drop-in replacement for the Docker CLI. Most commands are identical; the defaults differ (Podman is rootless and daemonless).

Task	Docker	Podman
Run an image interactively	`docker run --rm -it alpine sh`	`podman run --rm -it alpine sh`
Build from a Dockerfile	`docker build -t myapp .`	`podman build -t myapp .`
List running containers	`docker ps`	`podman ps`
List images	`docker images`	`podman images`
Pull by digest	`docker pull nginx@sha256:…`	`podman pull nginx@sha256:…`
Log in to a registry	`docker login ghcr.io`	`podman login ghcr.io`
Exec into a container	`docker exec -it web sh`	`podman exec -it web sh`
Stop and remove	`docker rm -f web`	`podman rm -f web`
Inspect JSON	`docker inspect web`	`podman inspect web`
View a container's resources	`docker stats`	`podman stats`
Map a port	`-p 8080:80`	`-p 8080:80` (rootless can't bind host ports below 1024 unless net.ipv4.ip_unprivileged_port_start is lowered)
Generate systemd units	(third party)	Quadlet (`podman generate systemd` still works but is deprecated)
Auto-update	(Watchtower, third party)	`podman auto-update` + label

The two big behavioural differences:

Daemon vs. fork/exec. Docker talks to a long-running daemon; Podman forks a runtime directly. That means you can strace a Podman container as the user that launched it, and there is no root-owned daemon socket through which every user in the docker group effectively gets root.
Pods. Podman has a first-class pod concept (a shared network/IPC/UTS namespace for a set of containers) that Docker does not. It's the same abstraction Kubernetes pods use. See Podman Basics.

Next up: Podman basics for the daily workflow, Docker Compose for multi-container local dev, and Kubernetes Light when one host isn't enough.
