Mahmoud Zalt

Originally published at zalt.me

The Tiny Struct That Boots Grafana

We’re examining how Grafana boots, runs, and shuts down as a single coherent process. Grafana is a large observability platform, but at its core, there’s a modest Go file, server.go, that quietly coordinates the entire application lifecycle. Inside it lives a Server struct that wires dependencies, bridges to the OS, and enforces a safe Init–Run–Shutdown contract. I’m Mahmoud Zalt, an AI solutions architect and software engineer, and we’ll use this struct as a blueprint for designing reliable lifecycles in our own services.


We’ll see how this one type acts as a composition root, why its lifecycle methods are safe to over-call, how it isolates OS-specific concerns, and how its failure behavior shapes the design. By the end, you should have a concrete pattern for building a tiny, focused orchestration type that keeps complex systems predictable.





The Server Struct as Composition Root


server.go lives near the top of Grafana’s package tree and acts as the process and lifecycle orchestrator. Downstream packages implement HTTP, background services, access control, provisioning, metrics, and tracing. The Server type doesn’t do that work itself; it just coordinates when those subsystems start and stop.

Project: grafana

pkg/
  server/
    server.go <-- process & lifecycle orchestrator
  api/
    http_server.go (used as *api.HTTPServer)
  infra/
    log/
    metrics/
    tracing/
  registry/
    backgroundsvcs/
      adapter/
        manager_adapter.go (wrapped by managerAdapter)
  services/
    accesscontrol/
    featuremgmt/
    provisioning/
  setting/

Call graph (simplified):

New --> newServer --> &Server{...}
  |
  +-> injects: cfg, HTTPServer, RoleRegistry, ProvisioningService,
  |            BackgroundServiceRegistry, TracingService, FeatureToggles, promReg
  +-> s.Init()
        |
        +-> writePIDFile()
        +-> metrics.SetEnvironmentInformation()
        +-> roleRegistry.RegisterFixedRoles() [conditional]
        +-> provisioningService.RunInitProvisioners()

Run --> Init() [idempotent]
    --> tracerProvider.Start("server.Run")
    --> notifySystemd("READY=1")
    --> managerAdapter.Run()

Shutdown --> managerAdapter.Shutdown() [once]
         --> context deadline check
The Server type as composition root, orchestrating lower-level services.

The heart of this file is a single struct that owns almost no business logic but all of the orchestration:

type Server struct {
    context       context.Context
    log           log.Logger
    cfg           *setting.Cfg
    shutdownOnce  sync.Once
    isInitialized bool
    mtx           sync.Mutex
    pidFile       string
    version       string
    commit        string
    buildBranch   string

    backgroundServiceRegistry registry.BackgroundServiceRegistry
    tracerProvider            *tracing.TracingService
    features                  featuremgmt.FeatureToggles

    HTTPServer          *api.HTTPServer
    roleRegistry        accesscontrol.RoleRegistry
    provisioningService provisioning.ProvisioningService
    promReg             prometheus.Registerer
    managerAdapter      *adapter.ManagerAdapter
}

Think of Server as an air traffic controller. Subsystems like the HTTP server, background jobs, and provisioning are the planes. Server decides when they take off (Init), keep flying (Run), and land safely (Shutdown), but it never flies them itself.

Rule of thumb: it’s acceptable for a top-level type to depend on many subsystems if it only coordinates them and doesn’t implement their internal logic.
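To make the composition-root idea concrete, here is a simplified, hypothetical sketch of the wiring pattern: a constructor that only accepts collaborators and stores them, doing no real work. The parameter list mirrors the call graph above, but the real newServer in Grafana carries more dependencies and options, and the logger constructor shown is an assumption.

func newServer(
    ctx context.Context,
    cfg *setting.Cfg,
    httpServer *api.HTTPServer,
    roleRegistry accesscontrol.RoleRegistry,
    provisioningService provisioning.ProvisioningService,
    backgroundServiceRegistry registry.BackgroundServiceRegistry,
    tracerProvider *tracing.TracingService,
    features featuremgmt.FeatureToggles,
    promReg prometheus.Registerer,
    managerAdapter *adapter.ManagerAdapter,
) *Server {
    // Store collaborators only; nothing starts until Init/Run are called.
    return &Server{
        context:                   ctx,
        log:                       log.New("server"), // assumed logger constructor
        cfg:                       cfg,
        HTTPServer:                httpServer,
        roleRegistry:              roleRegistry,
        provisioningService:       provisioningService,
        backgroundServiceRegistry: backgroundServiceRegistry,
        tracerProvider:            tracerProvider,
        features:                  features,
        promReg:                   promReg,
        managerAdapter:            managerAdapter,
    }
}

The payoff is that anyone reading the constructor sees the complete dependency surface of the process in one place.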




A Safe Init–Run–Shutdown Contract


Once we see Server as an orchestrator, the core question becomes: how do we make starting and stopping safe to call under real-world conditions—multiple callers, retries, partial failures?

Idempotent initialization


Idempotent initialization means you can call Init multiple times, but only the first call performs work; later calls leave the system in the same final state. Grafana implements this with a mutex and a boolean guard:

func (s *Server) Init() error {
    s.mtx.Lock()
    defer s.mtx.Unlock()
    if s.isInitialized {
        return nil
    }
    s.isInitialized = true

    if err := s.writePIDFile(); err != nil {
        return err
    }

    if err := metrics.SetEnvironmentInformation(s.promReg, s.cfg.MetricsGrafanaEnvironmentInfo); err != nil {
        return err
    }

    //nolint:staticcheck // not yet migrated to OpenFeature
    if !s.features.IsEnabledGlobally(featuremgmt.FlagPluginStoreServiceLoading) {
        if err := s.roleRegistry.RegisterFixedRoles(s.context); err != nil {
            return err
        }
    }

    return s.provisioningService.RunInitProvisioners(s.context)
}

The sequence is linear and guarded:


  1. Lock so only one goroutine can initialize.
  2. Skip if initialization already happened.
  3. Write the PID file.
  4. Register environment information with Prometheus.
  5. Conditionally register fixed roles behind a feature flag.
  6. Run provisioning init.

Any failure short-circuits and returns an error. This keeps initialization predictable and prevents “half-initialized” states.

Mental model: treat Init like flipping the main breaker in a building. Do it once, in a fixed order, and stop immediately if something looks unsafe.


Run: one entry point, fully instrumented


After initialization, the Run method is intentionally small:

func (s *Server) Run() error {
    if err := s.Init(); err != nil {
        return err
    }

    ctx, span := s.tracerProvider.Start(s.context, "server.Run")
    defer span.End()

    s.notifySystemd("READY=1")
    return s.managerAdapter.Run(ctx)
}

This packs a few important decisions:


  • Always call Init first: because Init is idempotent, callers can safely just call Run and know initialization happened.
  • Wrap execution in a tracing span: the entire run phase is grouped under a server.Run span.
  • Signal readiness to systemd: the OS learns when Grafana considers itself “up.”
  • Delegate continuous work to managerAdapter.Run, which owns background services.

From the outside, Run is the single entry point that guarantees initialization, instrumentation, and OS readiness signaling.
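On the calling side, this contract keeps main very small. The sketch below is hypothetical rather than Grafana's actual cmd wiring: it assumes a server.New constructor and simply wires OS signals to Shutdown before handing control to Run.

package main

import (
    "context"
    "log"
    "os"
    "os/signal"
    "syscall"
    "time"

    "github.com/grafana/grafana/pkg/server" // illustrative import path
)

func main() {
    // Hypothetical constructor; the real one takes Grafana's config and injected dependencies.
    s, err := server.New()
    if err != nil {
        log.Fatalf("failed to build server: %v", err)
    }

    // Translate OS signals into a single, bounded Shutdown call.
    go func() {
        sigs := make(chan os.Signal, 1)
        signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)
        sig := <-sigs

        ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
        defer cancel()
        if err := s.Shutdown(ctx, sig.String()); err != nil {
            log.Printf("shutdown error: %v", err)
        }
    }()

    // Run calls Init internally, so this is the only entry point we need.
    if err := s.Run(); err != nil {
        log.Fatalf("server exited with error: %v", err)
    }
}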

Shutdown: at-most-once, context-aware


Shutdown has the opposite problem to initialization: you want to make sure shutdown logic runs at most once, even if multiple parts of the system try to trigger it. Grafana uses sync.Once for this:

func (s *Server) Shutdown(ctx context.Context, reason string) error {
    var err error
    s.shutdownOnce.Do(func() {
        s.log.Info("Shutdown started", "reason", reason)

        if shutdownErr := s.managerAdapter.Shutdown(ctx, "shutdown"); shutdownErr != nil {
            s.log.Error("Failed to shutdown background services", "error", shutdownErr)
        }

        select {
        case <-ctx.Done():
            s.log.Warn("Timed out while waiting for server to shut down")
            err = fmt.Errorf("timeout waiting for shutdown")
        default:
            s.log.Debug("Finished waiting for server to shut down")
        }
    })

    return err
}

The contract this enforces:


  • Only the first caller actually initiates shutdown; later calls are no-ops.
  • Callers control patience via the ctx deadline or timeout.
  • Background services are stopped through a single adapter, keeping the surface area small.
  • If the context expires, Shutdown returns a timeout error and logs a warning.

Refinement opportunity: shutdown failures are currently only logged. Returning those errors as wrapped values along with timeouts would make automation and tests more informative.




Bridging to the OS Without Leaking Complexity


Server is also where Grafana touches OS-level concerns like PID files and systemd readiness. Keeping those bridges here prevents lower-level packages from knowing anything about process IDs or Unix sockets.

PID file: small, sharp, and fail-fast


A PID file is a tiny file containing the process ID so external tools can find and signal the process. Server owns writing it:

func (s *Server) writePIDFile() error {
    if s.pidFile == "" {
        return nil
    }

    if err := os.MkdirAll(filepath.Dir(s.pidFile), 0700); err != nil {
        s.log.Error("Failed to verify pid directory", "error", err)
        return fmt.Errorf("failed to verify pid directory: %s", err)
    }

    pid := strconv.Itoa(os.Getpid())
    if err := os.WriteFile(s.pidFile, []byte(pid), 0644); err != nil {
        s.log.Error("Failed to write pidfile", "error", err)
        return fmt.Errorf("failed to write pidfile: %s", err)
    }

    s.log.Info("Writing PID file", "path", s.pidFile, "pid", pid)
    return nil
}

Key characteristics:


  • Opt-in: if no PID path is configured, it returns immediately.
  • Ensures directory existence: calls MkdirAll to avoid runtime surprises.
  • Logs failures with enough context for operators.
  • Fails initialization on error, because a broken PID setup is treated as a configuration bug.

The code currently wraps errors with %s; switching to %w would preserve original errors for inspection and unwrapping, which is useful for debugging.
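As a small illustration (a sketch, not a patch from the Grafana repository), wrapping with %w keeps the root cause available to errors.Is and errors.As:

// Sketch: %w keeps the original error in the chain instead of flattening it into a string.
if err := os.WriteFile(s.pidFile, []byte(pid), 0644); err != nil {
    s.log.Error("Failed to write pidfile", "error", err)
    return fmt.Errorf("failed to write pidfile: %w", err)
}

// A caller or test can then match on the underlying cause:
//
//    if errors.Is(err, os.ErrPermission) {
//        // point the operator at directory or file permissions
//    }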

Systemd readiness: best-effort notification


On systemd-based Linux systems, services can send readiness notifications over a Unix datagram socket. Server implements this as a non-fatal, best-effort operation:

func (s *Server) notifySystemd(state string) {
    notifySocket := os.Getenv("NOTIFY_SOCKET")
    if notifySocket == "" {
        s.log.Debug("NOTIFY_SOCKET environment variable empty or unset, can't send systemd notification")
        return
    }

    socketAddr := &net.UnixAddr{Name: notifySocket, Net: "unixgram"}
    conn, err := net.DialUnix(socketAddr.Net, nil, socketAddr)
    if err != nil {
        s.log.Warn("Failed to connect to systemd", "err", err, "socket", notifySocket)
        return
    }
    defer func() {
        if err := conn.Close(); err != nil {
            s.log.Warn("Failed to close connection", "err", err)
        }
    }()

    if _, err = conn.Write([]byte(state)); err != nil {
        s.log.Warn("Failed to write notification to systemd", "err", err)
    }
}

The decisions here are deliberate:


  • If NOTIFY_SOCKET is unset, it only logs a debug line and returns.
  • Connection and write failures are logged as warnings but do not fail Run.
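Because the notification is just a datagram written to a Unix socket, the behavior is easy to exercise from a test in the same package. This is a hypothetical sketch, not a test from the Grafana repository, and the logger construction in particular is an assumption.

func TestNotifySystemd(t *testing.T) {
    // Listen on a temporary unixgram socket standing in for systemd.
    sockPath := filepath.Join(t.TempDir(), "notify.sock")
    conn, err := net.ListenUnixgram("unixgram", &net.UnixAddr{Name: sockPath, Net: "unixgram"})
    if err != nil {
        t.Fatalf("failed to listen on unixgram socket: %v", err)
    }
    defer conn.Close()

    // Point the server at our fake socket and send readiness.
    t.Setenv("NOTIFY_SOCKET", sockPath)
    s := &Server{log: log.New("test")} // assumed logger constructor
    s.notifySystemd("READY=1")

    // The notification should arrive as a single datagram.
    buf := make([]byte, 64)
    _ = conn.SetReadDeadline(time.Now().Add(time.Second))
    n, _, err := conn.ReadFromUnix(buf)
    if err != nil {
        t.Fatalf("did not receive systemd notification: %v", err)
    }
    if got := string(buf[:n]); got != "READY=1" {
        t.Fatalf("unexpected notification payload: %q", got)
    }
}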

Compare this to PID handling: PID failures abort initialization, while systemd failures are tolerated. A misconfigured PID file is a clear operator mistake; a missing NOTIFY_SOCKET is often just “not running under systemd.”

Architectural win: all OS-specific behavior (PID files and systemd sockets) is confined to server.go. The rest of Grafana stays portable and doesn’t depend on platform details.




How Failure Behavior Shapes the Design


The clarity of Server comes partly from how it treats failures at each stage of the lifecycle. The rules are simple but consistent.

Startup: fail fast, avoid half-starts


During construction and Init, all serious problems are treated as hard failures:


  • PID directory creation or file write fails.
  • Metrics environment information registration fails.
  • Fixed role registration fails when the feature flag requires it.
  • Provisioning initialization fails.

This reflects a stance that it is better not to start than to start in a broken, opaque state. If provisioning or access control setup fails, operators get a clear error instead of a running process with partially applied configuration.

Run: narrow error surface


Run only returns errors from:


  1. Init(), covering all startup safety checks.
  2. managerAdapter.Run(ctx), representing the core background services.

Systemd notification issues are logged but not returned. That keeps the meaning of a Run error narrow: either startup failed, or the main run loop encountered a problem.

Shutdown: more visibility would help


Shutdown currently only returns an error when the shutdown context expires; failures from managerAdapter.Shutdown are logged but not surfaced to the caller. A more informative design would wrap both timeout and shutdown errors.


Why surfacing shutdown errors matters

In automated deployments, orchestrators and test suites often need to know if a service shut down cleanly. If Shutdown only signals timeouts, persistent shutdown bugs can hide behind “success” as long as they complete before the context deadline. Propagating those errors lets higher-level tooling fail fast and draw attention to misbehaving components.
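One way to do that, sketched below rather than taken from the Grafana codebase, is to collect both failure modes with errors.Join (Go 1.20+) so callers see everything that went wrong:

// Sketch only: a variant of Shutdown that surfaces both background-service
// failures and the timeout instead of only logging the former.
func (s *Server) Shutdown(ctx context.Context, reason string) error {
    var err error
    s.shutdownOnce.Do(func() {
        s.log.Info("Shutdown started", "reason", reason)

        if shutdownErr := s.managerAdapter.Shutdown(ctx, "shutdown"); shutdownErr != nil {
            s.log.Error("Failed to shutdown background services", "error", shutdownErr)
            err = errors.Join(err, fmt.Errorf("background services shutdown: %w", shutdownErr))
        }

        select {
        case <-ctx.Done():
            s.log.Warn("Timed out while waiting for server to shut down")
            err = errors.Join(err, fmt.Errorf("timeout waiting for shutdown: %w", ctx.Err()))
        default:
            s.log.Debug("Finished waiting for server to shut down")
        }
    })

    return err
}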




What to Steal for Your Own Systems


Stepping back, this tiny Server type encodes a clear pattern: use a single orchestration struct to own the application lifecycle, keep it thin, and make its contract safe to over-call. That pattern transfers well to almost any stack.

1. Define a single orchestration type


Create a top-level type whose responsibility is only to coordinate: wire dependencies, initialize them, run the main loop, and shut everything down. Inject actual work via interfaces or collaborators. This keeps main small and your wiring explicit.

2. Make Init and Shutdown safe to over-call


Use a mutex plus a boolean guard for initialization and a Once-like primitive for shutdown. That way, multiple callers, retries, or defensive calls don’t introduce races or double work.
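Stripped of any Grafana specifics, the guard pattern is only a few lines. This is a generic sketch of the idea, not code from the repository:

// Generic lifecycle guards: Init runs its work at most once successfully,
// Shutdown runs at most once, period.
type Lifecycle struct {
    mtx           sync.Mutex
    isInitialized bool
    shutdownOnce  sync.Once
}

func (l *Lifecycle) Init(initFn func() error) error {
    l.mtx.Lock()
    defer l.mtx.Unlock()
    if l.isInitialized {
        return nil
    }
    if err := initFn(); err != nil {
        return err // not marked initialized, so a later call can retry
    }
    l.isInitialized = true
    return nil
}

func (l *Lifecycle) Shutdown(shutdownFn func() error) error {
    var err error
    l.shutdownOnce.Do(func() {
        err = shutdownFn()
    })
    return err
}

One deliberate difference from Grafana's Init: Grafana marks itself initialized before running the steps, so a failed Init is not retried, while this sketch only marks success and therefore allows retries. Either choice is defensible; just make it consciously.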

3. Isolate OS-specific behavior


Keep PID management, systemd notifications, or other platform quirks in a thin layer near the top of your process. The rest of your system should be oblivious to how readiness is signaled or how the process is discovered.

4. Treat startup failures as configuration bugs


If provisioning, metrics environment setup, or core access control wiring fail, stop the process and surface a clear error. Don’t limp into a partially initialized state that operators can’t reason about.

5. Instrument lifecycle, not just requests


Even though server.go doesn’t expose them directly, the design naturally suggests metrics like initialization duration, shutdown duration, and shutdown timeouts. Tracking these gives you a view into lifecycle health—the part of the system that’s most stressed during deploys and rollbacks.
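A hedged sketch of what those metrics could look like with the Prometheus client library; the metric names and buckets here are invented for illustration:

// Illustrative lifecycle metrics; they would be registered once against the
// process's prometheus.Registerer.
var (
    initDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
        Name:    "app_init_duration_seconds",
        Help:    "Time spent in Init.",
        Buckets: prometheus.DefBuckets,
    })
    shutdownTimeouts = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "app_shutdown_timeouts_total",
        Help: "Shutdowns that hit the context deadline.",
    })
)

// timeInit wraps an Init-style function and records how long it took.
func timeInit(initFn func() error) error {
    start := time.Now()
    err := initFn()
    initDuration.Observe(time.Since(start).Seconds())
    return err
}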

The primary lesson from Grafana’s Server is that a small, focused orchestration type can make a large system’s lifecycle predictable. By centralizing wiring, enforcing idempotent Init and at-most-once Shutdown, and isolating OS bridges, you get services that start and stop reliably under pressure. Bring this pattern into your own codebase—even for smaller services—and you reduce surprise at exactly the moments where failure is most costly.

