Outages complicate system monitoring
When the grid you're observing also powers the thing observing it, the failure modes get interesting. A field-level look at why monitoring a mini-grid is not the same problem as monitoring a server.
If you've ever monitored a server rack, your mental model of "monitoring" probably looks like this: the thing being monitored is up, the thing doing the monitoring is up, and the network between them works most of the time. The interesting edge cases live at the boundaries. On a mini-grid, none of those assumptions hold. The grid drops, then the logger drops, then the cell tower drops, and you find yourself diagnosing an outage with the same instruments that the outage took out.
This first essay is a tour of where the chain breaks. The next three will go deeper into specific links: communication, power, and the long tail of mobile-network failure modes.
The four-link chain
From the meter at the bottom to the cloud dashboard at the top, there are usually four moving parts that have to be working in concert. In practice, they fail in different combinations, and the order in which they fail tells you something about what's going on.
The hardest part of monitoring a mini-grid isn't reading the meter. It's noticing when you've stopped reading the meter — and being right about why.
Power loss at the logger
The most common, and most interesting, failure is the one where the logger monitoring the grid is itself powered by the grid. When the grid drops, the logger drops with it, and the absence of data is itself the signal. The trick is telling that absence apart from a network failure, a configuration error, or a logger that has quietly died on its own schedule.
An anecdote from a site visit in 2023: a "logger outage" alert turned out to be a network outage, which turned out to be a regional cell tower outage, which turned out to be an outage on the grid that powers the cell tower. The mini-grid itself was fine.
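One way to reason about a silent logger is to combine the last thing it said with independent evidence about the network. The sketch below is a minimal illustration, not a field-tested implementation: the `Evidence` fields, the `classify_silence` name, the three-missed-heartbeats threshold, and the idea of a separate uplink check are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

HEARTBEAT_PERIOD_S = 60    # assumed: the logger sends a heartbeat every 60 seconds
MISSED_BEFORE_SILENT = 3   # hypothetical: this many missed heartbeats counts as silence


@dataclass
class Evidence:
    now: float                         # current time, epoch seconds
    last_heartbeat: float              # last heartbeat received from the logger
    last_grid_down: Optional[float]    # last "grid_down" event received, if any
    network_reachable: Optional[bool]  # independent uplink check; None if unknown


def classify_silence(ev: Evidence) -> str:
    """Return a best guess at why the logger has gone quiet."""
    silent_for = ev.now - ev.last_heartbeat
    if silent_for < MISSED_BEFORE_SILENT * HEARTBEAT_PERIOD_S:
        return "not_silent"

    # A grid_down event close to the final heartbeat suggests the logger
    # lost power along with the grid it was watching.
    if (ev.last_grid_down is not None
            and ev.last_grid_down >= ev.last_heartbeat - 2 * HEARTBEAT_PERIOD_S):
        return "grid_suspected"

    # No warning from the logger: fall back on what is known about the network.
    if ev.network_reachable is False:
        return "network_suspected"
    if ev.network_reachable is True:
        return "logger_suspected"  # network is fine, so look at the logger or its config
    return "unknown"


if __name__ == "__main__":
    ev = Evidence(now=1000.0, last_heartbeat=700.0,
                  last_grid_down=None, network_reachable=False)
    print(classify_silence(ev))
```

The labels are deliberately phrased as suspicions. As the anecdote shows, the first plausible cause is often one layer short of the real one, so a result like `network_suspected` is better treated as a prompt to check the next link than as a diagnosis.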
What a backup buys you
A small UPS or battery on the logger doesn't solve the problem — the upstream network is still down — but it changes the shape of the data you keep. Even thirty minutes of post-outage logging is often enough to capture the recovery transient, which is where most of the interesting electrical behavior is.
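    # a minimal "is the logger up?" heartbeat
    # read_grid_voltage, log, and heartbeat stand in for site-specific code
    import time

    while True:
        if read_grid_voltage() < 50:  # volts; below this, treat the grid as down
            log("grid_down", ts=time.time())
        heartbeat()  # sent on every pass, outage or not
        time.sleep(60)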
Network as silent failure
Mobile networks fail in more ways than most monitoring systems are prepared to detect. The connection can be up at the modem layer and broken at the application layer; data can queue locally for hours and then arrive all at once, with stale timestamps that the receiving system happily accepts as live. The next essay in this series catalogs the failure modes I've seen on real installations.
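The stale-timestamp problem can at least be made visible on the receiving side by comparing when a reading was measured with when it arrived. This is a rough sketch under stated assumptions: the `Reading` fields, the `label_reading` function, and the 120-second cutoff are hypothetical, and it presumes the logger stamps each reading at measurement time and the server stamps it again on arrival.

```python
from dataclasses import dataclass

MAX_LIVE_LAG_S = 120.0  # hypothetical cutoff: older than this on arrival is not "live"


@dataclass
class Reading:
    measured_at: float  # stamped by the logger, epoch seconds
    received_at: float  # stamped by the server on arrival, epoch seconds
    value: float


def label_reading(r: Reading, max_lag_s: float = MAX_LIVE_LAG_S) -> str:
    """Label a reading as live, backfilled, or suspect based on its arrival lag."""
    lag = r.received_at - r.measured_at
    if lag < 0:
        return "clock_skew"  # measured "in the future": logger and server clocks disagree
    if lag > max_lag_s:
        return "backfill"    # queued locally and delivered late
    return "live"


if __name__ == "__main__":
    # A batch that queued during an outage and arrived all at once at t=4000.
    batch = [Reading(measured_at=t, received_at=4000.0, value=230.0)
             for t in (100.0, 1900.0, 3950.0)]
    print([label_reading(r) for r in batch])
```

Keeping both timestamps, rather than overwriting one with the other, is the design choice that matters here. Backfilled readings are still valuable for reconstructing what happened, but they should not drive live alerts or be shown as the current state. A real deployment would likely also want a small tolerance on the negative side, since minor clock drift is normal.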
What this series is actually about
It's tempting to write about monitoring as a software problem. It mostly isn't. It's a problem about physical things being unreliable in correlated ways, and software being one of the easier ways to notice and absorb that unreliability. The rest of this series is an attempt to be honest about that.