On a busy Kubernetes node, the kubelet isn’t just “another daemon.” It behaves like a tiny operating system dedicated to pods: it boots services, schedules work, tracks processes, kills them, frees resources, and keeps reporting health upstream. When we look closely at pkg/kubelet/kubelet.go, we’re really looking at this pod micro‑OS kernel in action.
We’ll dissect that kernel: how it boots, how the main control loop dispatches work, and how the pod lifecycle is implemented through the SyncPod, SyncTerminatingPod, and SyncTerminatedPod trio. I’m Mahmoud Zalt, an AI solutions architect, and we’ll use this file as a case study in designing a resilient, event‑driven “micro‑OS” around a complex runtime.
The core lesson is simple: treat your orchestrator like an operating system kernel. Separate boot phases, centralize dispatch, drive each object through a reentrant lifecycle state machine, and make cleanup safely repeatable. Everything that follows serves that idea.
- Booting the pod micro‑OS
- The main loop as the kernel dispatcher
- The three-step pod lifecycle state machine
- Running under pressure
- Patterns you can reuse
Booting the pod micro‑OS
Before kubelet can act like a pod micro‑OS, it has to boot its own subsystems: storage, runtime, metrics, garbage collection, and node status. That wiring lives around NewMainKubelet, initializeModules, and initializeRuntimeDependentModules.
pkg/
  kubelet/
    kubelet.go       <-- core Kubelet orchestration
    container/       (kubecontainer interfaces)
    pleg/            (PodLifecycleEventGenerator)
    status/          (status.Manager)
    volumemanager/
    server/          (HTTP & PodResources servers)
    cm/              (ContainerManager)
    metrics/
NewMainKubelet
  -> status.NewManager
  -> volumemanager.NewVolumeManager
  -> eviction.NewManager
  -> lease.NewController
  -> nodeshutdown.NewManager

Run
  -> initializeModules
  -> initializeRuntimeDependentModules (via updateRuntimeUp)
  -> statusManager.Start
  -> volumeManager.Run
  -> evictionManager.Start
  -> syncLoop (main control loop)
Kubelet as a micro‑kernel orchestrating managers.
The pattern is deliberate:
- NewMainKubelet only wires dependencies. It constructs managers (status, volume, eviction, runtime, plugin), configures backoff policies, sets up feature‑gated behavior, and returns a fully assembled *Kubelet. It does not start work.
- initializeModules starts what does not depend on a healthy runtime: metrics registration, filesystem layout, image manager, certificate manager, OOM watcher, and resource analyzer.
- initializeRuntimeDependentModules waits for the runtime to be healthy (via updateRuntimeUp and runtimeState), then starts cAdvisor, the container manager, eviction manager, container log manager, plugin manager, and shutdown manager.
This two‑phase boot is one of the file’s key design ideas: treat runtime‑dependent modules as a later boot phase, guarded by health checks and backoff. That’s how kubelet avoids thrashing when the runtime (containerd, CRI‑O, …) is down or slow.
Rule of thumb: If a module can’t function until a dependency is healthy (like the container runtime), don’t start it optimistically. Gate it behind a health‑checked initialization step, as kubelet does with initializeRuntimeDependentModules.
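To make the gate concrete, here is a minimal sketch of the same idea in Go: poll a health check with exponential backoff and only then start the dependent modules. The names (runtimeHealthy, startRuntimeDependentModules) are illustrative stand-ins, not kubelet's actual functions.

// Sketch: two-phase boot where runtime-dependent modules start only after
// a health check passes. runtimeHealthy and startRuntimeDependentModules
// are hypothetical stand-ins for kubelet's updateRuntimeUp path.
package main

import (
    "context"
    "log"
    "time"
)

type node struct {
    runtimeDepsStarted bool
}

// runtimeHealthy stands in for a CRI Status probe.
func (n *node) runtimeHealthy(ctx context.Context) bool { return true }

func (n *node) startRuntimeDependentModules() {
    log.Println("runtime-dependent modules started")
}

// waitForRuntime polls with exponential backoff and starts the dependent
// modules exactly once, only after the runtime reports healthy.
func (n *node) waitForRuntime(ctx context.Context) {
    backoff := 100 * time.Millisecond
    const maxBackoff = 5 * time.Second
    for !n.runtimeDepsStarted {
        if n.runtimeHealthy(ctx) {
            n.startRuntimeDependentModules()
            n.runtimeDepsStarted = true
            return
        }
        log.Printf("runtime not ready, retrying in %v", backoff)
        select {
        case <-ctx.Done():
            return
        case <-time.After(backoff):
        }
        if backoff *= 2; backoff > maxBackoff {
            backoff = maxBackoff
        }
    }
}

func main() {
    n := &node{}
    n.waitForRuntime(context.Background())
}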
The main loop as the kernel dispatcher
Once bootstrapped, kubelet behaves like an OS kernel dispatcher: it listens to many “interrupts” and then tells pod workers what to do. That logic lives in syncLoop and syncLoopIteration.
func (kl *Kubelet) syncLoop(ctx context.Context, updates <-chan kubetypes.PodUpdate, handler SyncHandler) {
    klog.InfoS("Starting kubelet main sync loop")
    syncTicker := time.NewTicker(time.Second)
    defer syncTicker.Stop()
    housekeepingTicker := time.NewTicker(housekeepingPeriod)
    defer housekeepingTicker.Stop()
    plegCh := kl.pleg.Watch()
    const (
        base   = 100 * time.Millisecond
        max    = 5 * time.Second
        factor = 2
    )
    duration := base

    if kl.dnsConfigurer != nil && kl.dnsConfigurer.ResolverConfig != "" {
        kl.dnsConfigurer.CheckLimitsForResolvConf(klog.FromContext(ctx))
    }

    for {
        if err := kl.runtimeState.runtimeErrors(); err != nil {
            klog.ErrorS(err, "Skipping pod synchronization")
            time.Sleep(duration)
            duration = time.Duration(math.Min(float64(max), factor*float64(duration)))
            continue
        }
        duration = base

        kl.syncLoopMonitor.Store(kl.clock.Now())
        if !kl.syncLoopIteration(ctx, updates, handler, syncTicker.C, housekeepingTicker.C, plegCh) {
            break
        }
        kl.syncLoopMonitor.Store(kl.clock.Now())
    }
}
syncLoop — health‑gated event loop with exponential backoff.
syncLoop itself is simple: check runtime health, back off if unhealthy, then delegate to syncLoopIteration. The inner function is where the dispatcher behavior appears: it selects over different “interrupt lines” and hands work to pod workers.
Reading the select in syncLoopIteration from top to bottom, we see:
- Configuration changes from files, HTTP, or the API server (configCh) → HandlePodAdditions, HandlePodUpdates, HandlePodRemoves, HandlePodReconcile.
- PLEG events (PodLifecycleEventGenerator) from the runtime (plegCh) → when containers die or are created, resync just those pods.
- Periodic sync (syncCh) → getPodsToSync decides which pods need attention; workers are scheduled accordingly.
- Housekeeping (housekeepingCh) → HandlePodCleanups cleans up pods that finished without a final sync.
- Probe result streams (liveness, readiness, startup) → update status and, if needed, re‑sync affected pods.
- ContainerManager updates (device/resource changes) → re‑sync pods whose allocations changed.
This is the heart of the micro‑OS metaphor: syncLoop is the scheduler and interrupt handler that takes signals from across the node and decides which pods to send back through the lifecycle state machine.
Mental model: Think of syncLoop as an air‑traffic control tower. It doesn’t fly planes (pods) itself; it listens on all the radios (config, runtime events, probes, timers) and hands each plane off to the right controller (pod workers).
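To make that shape concrete, here is a stripped-down sketch of a select-based dispatcher in the spirit of syncLoopIteration. The channel names, event types, and handler interface are hypothetical, not kubelet's real signatures.

// Sketch: one goroutine selecting over several "interrupt lines" and
// delegating each signal to a handler. All types and names are illustrative.
package loop

import (
    "context"
    "time"
)

type podUpdate struct{}      // a config delta (add/update/remove/reconcile)
type lifecycleEvent struct{} // a runtime event (container started/died)

type handler interface {
    HandleConfigUpdate(podUpdate)
    HandleRuntimeEvent(lifecycleEvent)
    HandlePeriodicSync()
    HandleCleanup()
}

// dispatch processes exactly one signal and reports whether the loop
// should keep running, mirroring the "iteration" structure in the text.
func dispatch(ctx context.Context, configCh <-chan podUpdate, plegCh <-chan lifecycleEvent,
    syncTick, housekeepingTick <-chan time.Time, h handler) bool {
    select {
    case u, ok := <-configCh:
        if !ok {
            return false // config source closed: stop the loop
        }
        h.HandleConfigUpdate(u)
    case e := <-plegCh:
        h.HandleRuntimeEvent(e) // resync only the affected pod
    case <-syncTick:
        h.HandlePeriodicSync() // catch anything the events missed
    case <-housekeepingTick:
        h.HandleCleanup() // clean up finished pods
    case <-ctx.Done():
        return false
    }
    return true
}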
The three-step pod lifecycle state machine
With the dispatcher in place, kubelet enforces a clear three‑step lifecycle for each pod:
- Running: SyncPod — converge the pod into its desired running state.
- Terminating: SyncTerminatingPod — stop all containers and finalize status.
- Terminated: SyncTerminatedPod — clean up volumes, cgroups, user namespaces, and final status.
Each pod has a dedicated worker (via podWorkers). That worker decides which of these phases to invoke based on state. Together they form the pod lifecycle state machine at the core of this micro‑OS.
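A hedged sketch of how a worker might walk a pod through those phases, using a simplified phase enum and function fields rather than kubelet's real podWorkers types:

// Sketch: a per-pod worker picking the lifecycle step that matches its
// current phase. The enum and function fields are illustrative only.
package worker

import "context"

type podPhase int

const (
    phaseRunning podPhase = iota
    phaseTerminating
    phaseTerminated
)

type podWorker struct {
    phase              podPhase
    syncPod            func(context.Context) (terminal bool, err error)
    syncTerminatingPod func(context.Context) error
    syncTerminatedPod  func(context.Context) error
}

// syncOnce runs one lifecycle step and advances the phase when a step
// completes, so repeated calls walk the pod through the state machine.
func (w *podWorker) syncOnce(ctx context.Context) error {
    switch w.phase {
    case phaseRunning:
        terminal, err := w.syncPod(ctx) // converge toward the desired state
        if err != nil {
            return err
        }
        if terminal {
            w.phase = phaseTerminating
        }
        return nil
    case phaseTerminating:
        if err := w.syncTerminatingPod(ctx); err != nil { // stop containers
            return err
        }
        w.phase = phaseTerminated
        return nil
    default: // phaseTerminated
        return w.syncTerminatedPod(ctx) // release volumes, cgroups, namespaces
    }
}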
Step 1: SyncPod — converge to running
SyncPod is a transaction script that does everything required to make a pod match its spec. It is intentionally reentrant: you can call it repeatedly, and it continues to converge towards the desired state instead of assuming one successful pass.
func (kl *Kubelet) SyncPod(ctx context.Context, updateType kubetypes.SyncPodType,
    pod, mirrorPod *v1.Pod, podStatus *kubecontainer.PodStatus) (isTerminal bool, err error) {

    ctx, otelSpan := kl.tracer.Start(ctx, "syncPod", ...)
    defer func() { ... otelSpan.End() }()
    logger := klog.FromContext(ctx) // used by the statusManager calls below

    // 1. Observe latency vs firstSeen annotation
    if updateType == kubetypes.SyncPodCreate { ... }

    // 2. Resize conditions for in-place vertical scaling
    if utilfeature.DefaultFeatureGate.Enabled(features.InPlacePodVerticalScaling) {
        if kl.containerRuntime.IsPodResizeInProgress(pod, podStatus) {
            kl.statusManager.SetPodResizeInProgressCondition(...)
        } else if generation, cleared := kl.statusManager.ClearPodResizeInProgressCondition(pod.UID); cleared {
            kl.recorder.Eventf(pod, v1.EventTypeNormal, events.ResizeCompleted, ...)
        }
    }

    // 3. Synthesize API pod status and propagate IPs
    apiPodStatus := kl.generateAPIPodStatus(pod, podStatus, false)
    podStatus.IPs = ... // from apiPodStatus

    // 4. Short-circuit terminal pods
    if apiPodStatus.Phase == v1.PodSucceeded || apiPodStatus.Phase == v1.PodFailed {
        kl.statusManager.SetPodStatus(logger, pod, apiPodStatus)
        isTerminal = true
        return isTerminal, nil
    }

    // 5. Record pod start latency
    existingStatus, ok := kl.statusManager.GetPodStatus(pod.UID)
    if !ok || existingStatus.Phase == v1.PodPending && apiPodStatus.Phase == v1.PodRunning { ... }
    kl.statusManager.SetPodStatus(logger, pod, apiPodStatus)

    // 6. Enforce network readiness (except hostNetwork pods)
    if err := kl.runtimeState.networkErrors(); err != nil && !kubecontainer.IsHostNetworkPod(pod) {
        kl.recorder.Eventf(pod, v1.EventTypeWarning, events.NetworkNotReady, ...)
        return false, fmt.Errorf("%s: %v", NetworkNotReadyErrorMsg, err)
    }

    // 7. Register secrets/configMaps and set up pod cgroups
    // 8. Reconcile mirror pod for static pods
    // 9. Ensure pod data dirs and volumes
    // 10. Add pod to probeManager and call containerRuntime.SyncPod(...)
}
SyncPod — reentrant transaction for converging a pod to running.
The structure is consistent:
- Observation and metrics first: latency, resize conditions, OpenTelemetry span.
- Status synthesis: generateAPIPodStatus merges runtime state and kubelet's view; only that synthesized status is written via statusManager.
- Early exit for terminal pods: once a pod is Succeeded or Failed, SyncPod sets status, returns isTerminal = true, and leaves further work to terminating/terminated flows.
- Guardrails: if the network isn't ready and the pod isn't host network, kubelet refuses to start it and records a clear event.
- Side‑effect orchestration: register secrets/configmaps, ensure cgroups, reconcile mirror pods, create on‑disk directories, wait for volumes, register probes, then call containerRuntime.SyncPod.
This is effectively the “launch process” system call of the pod micro‑OS: compose address space (volumes), credentials (secrets/configmaps), process groups (cgroups), health checks (probes), then ask the “hardware” (CRI runtime) to run containers.
Design note: SyncPod is large, but each block is a distinct step in a transaction. The codebase itself recommends extracting helpers (e.g. ensurePodStorage, ensurePodCgroupsAndResources) to lower cognitive load without changing behavior: make steps explicit, keep semantics identical.
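As a sketch of that suggestion, a long transaction can be expressed as an ordered list of named, reentrant steps. The receiver type and helper names below (ensurePodStorage, ensurePodCgroups, and so on) are illustrative, mirroring the helpers the code comments propose rather than functions that exist in kubelet today.

// Sketch: a transaction script decomposed into named, reentrant steps.
// The receiver type and helper names are hypothetical.
package syncsteps

import (
    "context"
    "fmt"
)

type podSpec struct{ Name string }

type kubeletLike struct{}

// Each helper treats "already done" as success so the whole transaction
// stays safe to re-run after a partial failure. Bodies are stubs here.
func (kl *kubeletLike) ensureNetworkReady(ctx context.Context, p *podSpec) error { return nil }
func (kl *kubeletLike) ensurePodStorage(ctx context.Context, p *podSpec) error   { return nil }
func (kl *kubeletLike) ensurePodCgroups(ctx context.Context, p *podSpec) error   { return nil }
func (kl *kubeletLike) registerProbes(ctx context.Context, p *podSpec) error     { return nil }

func (kl *kubeletLike) syncPodSteps(ctx context.Context, pod *podSpec) error {
    steps := []struct {
        name string
        run  func(context.Context, *podSpec) error
    }{
        {"network readiness", kl.ensureNetworkReady},
        {"storage", kl.ensurePodStorage},
        {"cgroups and resources", kl.ensurePodCgroups},
        {"probes", kl.registerProbes},
    }
    for _, s := range steps {
        if err := s.run(ctx, pod); err != nil {
            return fmt.Errorf("sync step %q for pod %s: %w", s.name, pod.Name, err)
        }
    }
    return nil
}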
Step 2: SyncTerminatingPod — stopping containers safely
When a pod should no longer run (deletion, eviction, restart policy), the worker invokes SyncTerminatingPod. Here kubelet stops behaving like a launcher and acts as a careful reaper.
func (kl *Kubelet) SyncTerminatingPod(_ context.Context, pod *v1.Pod,
    podStatus *kubecontainer.PodStatus, gracePeriod *int64,
    podStatusFn func(*v1.PodStatus)) (err error) {

    ctx := context.Background() // TODO: thread caller context
    logger := klog.FromContext(ctx)

    apiPodStatus := kl.generateAPIPodStatus(pod, podStatus, false)
    if podStatusFn != nil {
        podStatusFn(&apiPodStatus)
    }
    kl.statusManager.SetPodStatus(logger, pod, apiPodStatus)

    kl.probeManager.StopLivenessAndStartup(pod)

    p := kubecontainer.ConvertPodStatusToRunningPod(kl.getRuntime().Type(), podStatus)
    if err := kl.killPod(ctx, pod, p, gracePeriod); err != nil { ... return err }

    kl.probeManager.RemovePod(pod)

    stoppedPodStatus, err := kl.containerRuntime.GetPodStatus(ctx, pod.UID, pod.Name, pod.Namespace)
    if err != nil { return err }
    preserveDataFromBeforeStopping(stoppedPodStatus, podStatus)

    // Verify no containers are still running (CRI contract)
    ... if len(runningContainers) > 0 { return fmt.Errorf("CRI violation: %v", runningContainers) }

    if utilfeature.DefaultFeatureGate.Enabled(features.DynamicResourceAllocation) {
        if err := kl.UnprepareDynamicResources(ctx, pod); err != nil { return err }
    }

    apiPodStatus = kl.generateAPIPodStatus(pod, stoppedPodStatus, true)
    kl.statusManager.SetPodStatus(logger, pod, apiPodStatus)
    return nil
}
Key properties:
- Idempotency: if SyncTerminatingPod runs again, killing already‑stopped containers is harmless, and GetPodStatus just confirms nothing is running.
- Contract enforcement: after killPod, kubelet explicitly checks for remaining running containers and treats that as a CRI violation. That guards against buggy runtimes.
- Ordered side‑effects: only after containers stop does kubelet unprepare dynamic resources, avoiding races with controllers that might reassign resources.
From the micro‑OS perspective, this is the controlled shutdown path: stop all processes in the pod, verify they’re gone, then free their dynamic resources.
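A minimal sketch of that stop, verify, then release ordering, assuming illustrative interfaces rather than the real CRI types:

// Sketch: controlled shutdown. Stop everything, verify nothing is left
// running, and only then release resources. Interfaces are illustrative.
package teardown

import (
    "context"
    "fmt"
)

type containerRuntime interface {
    StopAll(ctx context.Context, podUID string) error
    RunningContainers(ctx context.Context, podUID string) ([]string, error)
}

type dynamicResources interface {
    Unprepare(ctx context.Context, podUID string) error
}

func terminatePod(ctx context.Context, rt containerRuntime, res dynamicResources, podUID string) error {
    if err := rt.StopAll(ctx, podUID); err != nil {
        return err
    }
    // Contract check: after StopAll, a healthy runtime must report zero
    // running containers; anything else is treated as a hard error.
    running, err := rt.RunningContainers(ctx, podUID)
    if err != nil {
        return err
    }
    if len(running) > 0 {
        return fmt.Errorf("runtime still reports running containers: %v", running)
    }
    // Resources are released only after the verification above, so nothing
    // is handed back while containers might still be using it.
    return res.Unprepare(ctx, podUID)
}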
Step 3: SyncTerminatedPod — cleaning up the pod shell
When containers are gone, a “shell” of the pod still exists: volumes, directories, cgroups, user namespaces. SyncTerminatedPod tears down that shell in a way that survives restarts and partial failures.
func (kl *Kubelet) SyncTerminatedPod(ctx context.Context, pod *v1.Pod,
    podStatus *kubecontainer.PodStatus) error {

    ctx, otelSpan := kl.tracer.Start(ctx, "syncTerminatedPod", ...)
    defer otelSpan.End()
    logger := klog.FromContext(ctx) // used by the status and cgroup calls below

    apiPodStatus := kl.generateAPIPodStatus(pod, podStatus, true)
    kl.statusManager.SetPodStatus(logger, pod, apiPodStatus)

    // 1. Wait for volumes to unmount
    if err := kl.volumeManager.WaitForUnmount(ctx, pod); err != nil { return err }

    // 2. Wait until volume paths are actually gone (background GC)
    if err := wait.PollUntilContextCancel(ctx, 100*time.Millisecond, true,
        func(ctx context.Context) (bool, error) {
            volumesExist := kl.podVolumesExist(pod.UID)
            return !volumesExist, nil
        }); err != nil { return err }

    // 3. Unregister secrets/configMaps
    if kl.secretManager != nil { kl.secretManager.UnregisterPod(pod) }
    if kl.configMapManager != nil { kl.configMapManager.UnregisterPod(pod) }

    // 4. Destroy cgroups (if using per-QoS cgroups)
    if kl.cgroupsPerQOS {
        pcm := kl.containerManager.NewPodContainerManager()
        name, _ := pcm.GetPodContainerName(pod)
        if err := pcm.Destroy(logger, name); err != nil { return err }
    }

    // 5. Release user namespaces and mark pod terminated in statusManager
    kl.usernsManager.Release(logger, pod.UID)
    kl.statusManager.TerminatePod(logger, pod)
    return nil
}
There’s an important resilience constraint behind this: kubelet has no durable local store for pod metadata, so all cleanup steps must be reentrant. If kubelet restarts mid‑cleanup, periodic GC and HandlePodCleanups must be able to finish the job based solely on the external world (runtime, volumes, cgroups), without relying on in‑memory state.
Resilience pattern: Treat cleanup as “eventually consistent” background work that is safe to run multiple times. If your process can crash halfway through a cleanup, you want to be able to simply try again.
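A tiny sketch of that style: each step checks the external world and treats "already gone" as success. The helper and its parameters are hypothetical.

// Sketch: cleanup that is safe to run any number of times. Names and
// parameters are illustrative, not kubelet's cleanup API.
package cleanup

import (
    "fmt"
    "os"
)

func cleanupPodShell(podDir string, releaseCgroup func() error) error {
    // os.RemoveAll returns nil when the path is already gone, which is
    // exactly the "already done counts as success" behavior every step needs.
    if err := os.RemoveAll(podDir); err != nil {
        return fmt.Errorf("remove pod dir: %w", err)
    }
    // releaseCgroup is expected to follow the same contract: releasing an
    // already-released cgroup is a no-op, not an error.
    if err := releaseCgroup(); err != nil {
        return fmt.Errorf("release cgroup: %w", err)
    }
    return nil
}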
Running under pressure
So far we focused on correctness. But this micro‑OS is built to run under load: hundreds or thousands of pods per node, noisy neighbors, slow runtimes, and an overloaded API server. The file encodes several strategies to keep kubelet responsive in those conditions.
Event-driven plus periodic scanning
Kubelet does not rely on a single mechanism to keep pods in sync. It combines:
- Evented signals: PLEG events when containers die, probe result updates, container manager updates.
- Config deltas: ADD, UPDATE, REMOVE, RECONCILE from configuration sources.
- Periodic sweeps: syncCh ticking every second, scanning for pods that still need work.
This hybrid model is common in distributed systems: react when events arrive, and periodically double‑check in case you missed something.
Scoped concurrency with per-pod workers
Instead of letting any component race to modify a pod, kubelet centralizes lifecycle transitions through podWorkers. Each pod gets a single worker goroutine that sequences calls to SyncPod, SyncTerminatingPod, and SyncTerminatedPod. Other components (eviction manager, shutdown manager, probe handlers) don’t manipulate containers directly; they enqueue work to the pod worker.
This shrinks the concurrency problem from “many goroutines might touch pod X” to “at most one worker manages pod X’s lifecycle,” dramatically reducing race risks around restarts, cgroup changes, or volume teardown.
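A condensed sketch of that pattern: at most one goroutine per key, fed through a queue. This is a simplified stand-in for podWorkers, not its real API.

// Sketch: one worker goroutine per pod UID; everyone else only enqueues.
// The types and the string-based update are illustrative.
package podworkers

import "sync"

type workers struct {
    mu     sync.Mutex
    queues map[string]chan string // pod UID -> pending updates
    syncFn func(podUID, update string)
}

func newWorkers(syncFn func(podUID, update string)) *workers {
    return &workers{
        queues: make(map[string]chan string),
        syncFn: syncFn,
    }
}

// Enqueue hands an update to the pod's single worker, starting that
// worker lazily the first time the pod is seen.
func (w *workers) Enqueue(podUID, update string) {
    w.mu.Lock()
    ch, ok := w.queues[podUID]
    if !ok {
        ch = make(chan string, 16)
        w.queues[podUID] = ch
        go func() {
            // The only goroutine that ever syncs this pod, so lifecycle
            // steps for one pod can never race with each other.
            for u := range ch {
                w.syncFn(podUID, u)
            }
        }()
    }
    w.mu.Unlock()
    ch <- update
}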
Health gating and backoff
When the container runtime isn’t healthy, hammering it just makes things worse. runtimeState and updateRuntimeUp implement a simple pattern:
- Track runtime and network readiness via the CRI Status call.
- If unhealthy, let syncLoop sleep with exponential backoff (100ms → 5s) before trying again.
- Only initialize dependent modules (cAdvisor, containerManager, pluginManager, evictionManager) after the runtime is up.
This protects both the runtime and kubelet from “thundering herd” behavior during outages.
Observability on the hot paths
The code highlights several metrics tied directly to these control paths:
- kubelet_sync_pod_duration_seconds — latency of SyncPod per pod.
- kubelet_sync_loop_iteration_seconds — duration of each syncLoopIteration.
- kubelet_runtime_errors_total — counts of runtime/network readiness errors from runtimeState.
- kubelet_pod_worker_queue_length — backlog of pods pending worker processing.
- kubelet_housekeeping_duration_seconds — time spent on housekeeping versus its 1s period.
Because these metrics align with the path we just traced (loop iterations, per‑pod syncs, runtime health), they give a direct view into when the micro‑OS is falling behind: high sync durations or long loop iterations mean pod operations are slow; rising runtime errors signal a flapping runtime; long housekeeping suggests cleanup starvation.
Ops takeaway: If you adopt a similar event‑driven kernel, instrument the main loop and the lifecycle transaction scripts, not just individual helpers. That’s how you detect systemic slowness.
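As one way to do that, here is a sketch using prometheus/client_golang with invented metric names (not kubelet's instrumentation): wrap the loop iteration and the per-object sync rather than each individual helper.

// Sketch: instrument the dispatcher iteration and the whole lifecycle
// sync. Metric names here are invented for illustration.
package instrumentation

import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
)

var (
    loopIteration = prometheus.NewHistogram(prometheus.HistogramOpts{
        Name:    "controller_loop_iteration_seconds",
        Help:    "Duration of one dispatcher iteration.",
        Buckets: prometheus.DefBuckets,
    })
    objectSync = prometheus.NewHistogram(prometheus.HistogramOpts{
        Name:    "controller_object_sync_seconds",
        Help:    "Duration of one full lifecycle sync for an object.",
        Buckets: prometheus.DefBuckets,
    })
)

func init() {
    prometheus.MustRegister(loopIteration, objectSync)
}

// runLoop times every iteration of the dispatcher.
func runLoop(iterate func() bool) {
    for {
        start := time.Now()
        more := iterate()
        loopIteration.Observe(time.Since(start).Seconds())
        if !more {
            return
        }
    }
}

// timeSync wraps a per-object lifecycle sync so slow pod work shows up
// as a separate signal from a slow dispatcher.
func timeSync(do func()) {
    t := prometheus.NewTimer(objectSync)
    defer t.ObserveDuration()
    do()
}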
Patterns you can reuse
kubelet.go is big, and the Kubelet struct is undeniably a “god object.” The code itself calls that out and suggests extracting controllers (for example, a NodeStatusController) and splitting large functions like SyncPod and NewMainKubelet. Even so, several architectural patterns are immediately reusable.
Separate desired, actual, and reported state
Kubelet draws a hard line between:
- Desired state — podManager: what pods should exist, based on configuration.
- Actual lifecycle state — podWorkers: what pods are actually running, terminating, or terminated on the node.
- Reported status — statusManager: the synthesized PodStatus published to the API server.
The separation is why the system tolerates force‑deleted pods, restarts, and partial failures: each layer has a single job and its own notion of truth.
Why it matters: If you collapse desired, actual, and reported state into one object, you will eventually have impossible situations (“this says the thing is running, but the process is gone”) with no clean recovery path.
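A compact sketch of that separation, with illustrative types: each store has one owner, and the reconciler only ever writes the reported side.

// Sketch: desired, actual, and reported state kept in separate stores.
// Types are illustrative; none of this is kubelet's real layout.
package state

type PodSpec struct{ Name string }
type PodStatus struct{ Phase string }

type DesiredStore struct { // what should exist (from config / API server)
    Specs map[string]PodSpec
}

type ActualStore struct { // what the node is really doing right now
    Phases map[string]string // e.g. "Running", "Terminating", ""
}

type ReportedStore struct { // the status published upstream
    Statuses map[string]PodStatus
}

// Reconcile reads desired and actual state and only writes reported
// state, so disagreements between layers stay visible and recoverable.
func Reconcile(d *DesiredStore, a *ActualStore, r *ReportedStore, uid string) {
    _, wanted := d.Specs[uid]
    phase := a.Phases[uid]
    switch {
    case !wanted && phase == "":
        delete(r.Statuses, uid) // gone everywhere: stop reporting it
    case !wanted:
        r.Statuses[uid] = PodStatus{Phase: "Terminating"}
    default:
        r.Statuses[uid] = PodStatus{Phase: phase}
    }
}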
Use reentrant transaction scripts for lifecycle
SyncPod, SyncTerminatingPod, and SyncTerminatedPod are classic “transaction scripts” for multi‑step operations, written to be reentrant and idempotent:
- They recompute status on every call instead of depending on prior partial work.
- They treat “already done” as success: existing cgroups, mounted/unmounted volumes, containers already killed.
- They avoid hidden mutable intermediate state, relying instead on runtimes and managers to reflect reality.
That style is robust under retries, process restarts, and partial failures, which is exactly what you want in controllers.
Localize cross-cutting concerns
Cross‑cutting concerns — metrics, tracing, context cancellation, feature gates, and even intentionally insecure pieces like the insecureContainerLifecycleHTTPClient — are handled via named managers and consistent patterns:
- OpenTelemetry spans at the top of major lifecycle methods.
- Central metrics registration in initializeModules and predictable metric names per path.
- Feature gates for controlled behavioral changes.
- Carefully documented “dangerous” bits constrained to narrow surfaces.
The code suggests going further (for example, wrapping os.Exit to improve testability), but the basic pattern is sound: if a concern touches many parts of the system, give it a well‑named manager or helper instead of sprinkling logic everywhere.
Accept hubs, but manage their cost
The Kubelet struct is a hub: it coordinates pods, volumes, cgroups, node status, plugins, and more. That coupling is partly inherent to its role. The file manages this with:
- Interfaces and DI: cadvisor.Interface, kubecontainer.Runtime, secret.Manager, volumeManager.VolumeManager, and others injected via a Dependencies struct.
- Dedicated managers for big concerns (status, volumes, eviction, runtime class, plugin, shutdown).
- Functional options (the Option type) so configuration doesn't explode constructor parameters further; see the sketch after this list.
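For the last point, here is a generic sketch of functional options with an invented Agent type, not the real Kubelet constructor:

// Sketch: functional options keep a constructor signature small as knobs
// accumulate. Agent and its fields are illustrative.
package options

import "time"

type Agent struct {
    syncPeriod time.Duration
    maxPods    int
}

// Option mutates an Agent during construction.
type Option func(*Agent)

func WithSyncPeriod(d time.Duration) Option {
    return func(a *Agent) { a.syncPeriod = d }
}

func WithMaxPods(n int) Option {
    return func(a *Agent) { a.maxPods = n }
}

// NewAgent applies defaults first, then every Option in order, so adding
// a new knob never changes the constructor's signature.
func NewAgent(opts ...Option) *Agent {
    a := &Agent{syncPeriod: time.Second, maxPods: 110}
    for _, opt := range opts {
        opt(a)
    }
    return a
}

// Usage: agent := NewAgent(WithSyncPeriod(2*time.Second), WithMaxPods(250))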
There are still clear refactor targets: extracting a NodeStatusController out of Kubelet.Run, or splitting SyncPod into named helpers. But even as a “god object,” kubelet leans heavily on interfaces and composition to keep behavior testable and evolvable.
| Current pattern | Suggested improvement | Benefit |
|---|---|---|
| Monolithic SyncPod (200+ lines) | Extract helpers: ensureNetworkAndRegistrations, ensurePodStorage, etc. | Lower cognitive load; easier unit testing of each step. |
| Node status & leases mixed into Kubelet.Run | Introduce NodeStatusController owning lease & status loops | Clearer ownership; node health logic evolves without touching pod lifecycle. |
| Direct os.Exit in runtime‑dependent initialization | Wrap in a fatal error handler or return fatal errors to main() | Improved testability; fewer surprises when embedding kubelet logic. |
Closing thoughts
Reading kubelet.go as just a big Go file is intimidating. Reading it as the kernel of a pod‑focused micro‑OS makes the structure clear:
- Boot in phases, gated by dependency health.
- Dispatch events through a single main loop that feeds per‑pod workers.
- Drive lifecycle with a three‑step, reentrant state machine (SyncPod → SyncTerminatingPod → SyncTerminatedPod).
- Instrument hot paths so you can see when the system falls behind.
The primary lesson is to design orchestrators as kernels: explicitly model desired, actual, and reported state; centralize dispatch; implement lifecycle as reentrant transaction scripts; and make cleanup safe to repeat after restarts. That's how kubelet stays resilient in the face of an unreliable, high‑latency runtime.
If you’re building controllers, operators, or any long‑running orchestrator, you can adapt these patterns directly:
- Model desired vs actual vs reported state explicitly.
- Use per‑object workers and reentrant transaction scripts for lifecycle steps.
- Gate complex modules behind health checks instead of assuming they’re always up.
- Make cleanup idempotent so restarts just resume work.
Kubelet has grown organically over years and carries historical weight, but underneath that it’s a rich example of a resilient, scalable micro‑OS built around a complex runtime. If we treat it that way—as a kernel to learn from rather than a heap of code—we can bring those lessons into any large‑scale system we design.