
My Home Assistant Backup Strategy After 18 Months Running 50+ Devices

Independent technologist · 200+ HA devices · GriswoldLabs

The most expensive lesson I’ve learned with Home Assistant didn’t come from a hardware failure. It came from a botched manual restore where I overwrote a working install with a corrupted snapshot, lost about three weeks of automations and history data, and had to rebuild from a partial config dump in git. That was 18 months ago. Since then, my backup strategy has been deliberate.

This post is the strategy I actually run, the retention numbers, the test cadence, and the holes I haven’t filled yet.

What I’m protecting against

Three failure modes, in order of how often they happen:

  1. I broke something. Most common by far. Bad YAML edit, wrong service call in a script, a 2 AM half-asleep automation tweak that turns out to have been wrong. I want to be able to roll back to “this morning” without thinking.
  2. HA broke something. Less common but real — an integration update changes behavior, a core release deprecates a YAML pattern I was using, a database migration goes sideways. I want to be able to roll back to “before the update.”
  3. The hardware failed. Rare. The HA host (a small Intel NUC) has been online for over a year with no issues. But SD cards die, NVMe drives die, power supplies die. I want to be able to rebuild on new hardware in under an hour.

The strategy has three tiers, one per failure mode. Each tier is independent — losing a tier doesn’t break the others.

Tier 1: HA’s built-in snapshots, every day

HA’s built-in backup mechanism (Settings → System → Backups) creates a .tar snapshot archive containing the config, the database, the secrets, and any installed add-on data. The system can be restored from a snapshot in about 5 minutes if it’s still booting; longer if you have to set up a fresh OS first.

I have an automation that creates a snapshot every morning at 4 AM and keeps the last seven:

automation:
  - id: 'daily_backup'
    alias: 'Daily HA Backup'
    triggers:
      - trigger: time
        at: "04:00:00"
    actions:
      - action: backup.create_automatic
      - delay: "00:05:00"
      - action: notify.charles
        data:
          title: "HA Backup"
          message: "Daily snapshot created"
          data:
            tag: ha-backup

backup.create_automatic creates a backup using the settings configured under Settings → System → Backups → Configure → Automatic backups, including the retention policy, which I have set to “keep last 7.” So every day creates a new snapshot, and once there are more than seven, the oldest one is deleted automatically.

These snapshots live on the HA host’s internal storage. They cover failure mode 1 (I broke something) — within seconds I have access to “what HA looked like yesterday morning” and can restore from there.

Tier 2: Off-host pulls to the NAS, every week

The Tier 1 snapshots are on the same machine as the HA install. If the HA host’s drive fails, those backups go with it. Tier 2 fixes that by pulling them to the NAS.

Every Sunday at 3 AM, a small script on my Unraid server runs rsync against the HA snapshot directory and pulls anything new to /mnt/user/backups/homeassistant/. The retention is 12 weeks (so I always have the last three months of weekly snapshots even after the Tier 1 retention drops them).

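#!/usr/bin/env bash
# /mnt/user/scripts/ha-backup-sync.sh
# Abort on the first error so a failed pull never reaches the trim step.
set -euo pipefail

SRC="ha:/config/backups/"
DST="/mnt/user/backups/homeassistant/"

# Pull new snapshots (only *.tar files; -a preserves their modification times)
rsync -av --include='*.tar' --exclude='*' "$SRC" "$DST"

# Trim anything older than 12 weeks (84 days)
find "$DST" -type f -name '*.tar' -mtime +84 -delete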

The ha: host alias in ~/.ssh/config points at the HA host with a key that has read-only access to /config/backups/. The script runs from the User Scripts plugin on Unraid, on a cron schedule.

This tier covers failure mode 2 (HA itself broke something) and gives me a longer history than the seven-day rolling Tier 1 — useful if the breakage happens but I don’t notice for a couple of weeks.
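The trim step above deletes purely by age, so if new snapshots ever stopped arriving, it would eventually empty the directory. A minimal sketch of a guard, assuming GNU find and a hypothetical prune_backups helper that is not part of the script above: delete old archives oldest-first, but stop once only a minimum number remain.

```bash
#!/usr/bin/env bash
set -eu

# prune_backups DIR MAX_AGE_DAYS MIN_KEEP  (hypothetical helper)
# Deletes *.tar files in DIR that find considers older than MAX_AGE_DAYS,
# oldest first, and stops as soon as only MIN_KEEP archives remain.
prune_backups() {
  local dir="$1" max_age_days="$2" min_keep="$3" remaining path
  remaining=$(find "$dir" -maxdepth 1 -type f -name '*.tar' | wc -l)
  # Emit "mtime<TAB>path", sort oldest first, then drop the mtime column.
  find "$dir" -maxdepth 1 -type f -name '*.tar' -mtime +"$max_age_days" \
      -printf '%T@\t%p\n' | sort -n | cut -f2- |
  while IFS= read -r path; do
    [ "$remaining" -gt "$min_keep" ] || break
    rm -f -- "$path"
    remaining=$((remaining - 1))
    echo "pruned: $path"
  done
}
```

Called as prune_backups /mnt/user/backups/homeassistant 84 4, this would behave like the age-based trim in normal operation and keep the four newest archives no matter how old they get. The value 4 is an arbitrary illustration.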

Tier 3: Monthly cold copy to USB drive

The first two tiers are both on hot storage that’s always online. If something goes badly wrong with the home network — ransomware, a fire, a really bad cascading failure — they could go together. Tier 3 is a cold copy.

On the first of each month, I plug a USB drive into the Unraid server and copy /mnt/user/backups/homeassistant/ to it. The drive lives in a fireproof safe between copies. There are two physical drives that I rotate (so one is always offsite-equivalent — i.e., not in the running computer — at any given time).

This is manual. I haven’t automated it because the manual step is the protection — if there’s a software vulnerability that can encrypt my backups, it can’t reach the drive that’s not plugged in.

I get an iOS calendar reminder at 8 AM on the first of each month. It takes about 90 seconds to swap drives and another 5 to 10 minutes for the copy to finish.
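A manual copy is easy to trust without checking. As a sketch only, assuming GNU coreutils (sha256sum, cp) and two hypothetical helpers, the copy can be followed by a checksum comparison between the NAS directory and the USB drive:

```bash
#!/usr/bin/env bash
set -eu

# verify_copy SRC_DIR DST_DIR  (hypothetical helper)
# Succeeds only if every *.tar in SRC_DIR has a copy with the same SHA-256
# under the same name in DST_DIR.
verify_copy() {
  local src="$1" dst="$2"
  # Checksums use relative names, so they can be checked from inside DST_DIR.
  ( cd "$src" && sha256sum -- *.tar ) |
    ( cd "$dst" && sha256sum --check --quiet - )
}

# copy_to_cold_drive SRC_DIR DST_DIR  (hypothetical helper)
copy_to_cold_drive() {
  local src="$1" dst="$2"
  mkdir -p "$dst"
  cp -p -- "$src"/*.tar "$dst"/
  sync   # flush pending writes before reading the copies back
  verify_copy "$src" "$dst"
}
```

One caveat: immediately after a copy, the read-back may be served from the page cache rather than the drive itself. Unmounting and remounting the drive before running verify_copy gives a stricter check.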

The actual retention policy

Putting all three tiers together:

Tier | Source                     | Frequency       | Retention     | Storage
1    | HA backup.create_automatic | Daily 4 AM      | 7 days        | HA host internal
2    | rsync to NAS               | Weekly Sun 3 AM | 12 weeks      | Unraid (mirrored)
3    | USB cold copy              | Monthly 1st     | Last 2 months | Fireproof safe

The “last 2 months” on Tier 3 is because I rotate two drives — at any given moment one drive has last month’s copy and one has the month before. When I do the monthly copy, the older drive gets overwritten.
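The schedule also implies how stale each tier is allowed to get. The Sunday 3 AM pull runs before that day’s 4 AM snapshot, so the newest file on the NAS is about a day old right after a pull and about eight days old just before the next one. Anything older suggests a missed pull. A rough sketch of that check, assuming GNU find and hypothetical helper names:

```bash
#!/usr/bin/env bash
set -eu

# newest_age_days DIR  (hypothetical helper)
# Prints the age, in whole days, of the most recently modified *.tar in DIR.
# Fails if DIR contains no archives.
newest_age_days() {
  local dir="$1" newest now
  newest=$(find "$dir" -maxdepth 1 -type f -name '*.tar' -printf '%T@\n' |
             sort -n | tail -n 1)
  [ -n "$newest" ] || return 1
  now=$(date +%s)
  # %T@ has a fractional part; strip it before doing integer arithmetic.
  echo $(( (now - ${newest%.*}) / 86400 ))
}

# is_fresh DIR MAX_DAYS: true when the newest archive is at most MAX_DAYS old.
is_fresh() {
  local age
  age=$(newest_age_days "$1") || return 1
  [ "$age" -le "$2" ]
}
```

For the Tier 2 directory, something like is_fresh /mnt/user/backups/homeassistant 9 would leave a day of slack over the eight-day worst case. The threshold is an assumption to tune, not a measured value.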

The test cadence (don’t skip this)

A backup you’ve never restored isn’t a backup. The single most valuable thing I do for my backup strategy is the quarterly restore drill.

Once every three months, I:

  1. Spin up a fresh Home Assistant VM on the Unraid server (just for testing, separate from production)
  2. Restore the most recent Tier 2 snapshot to that VM
  3. Boot it, confirm it loads my config, check that a few entities show their last-known state
  4. Tear the VM down

This catches three things that aren’t visible from “the snapshot file exists”:

  • Snapshot corruption. Once, a Tier 1 snapshot was incomplete because HA was in the middle of a database write when the backup ran. The file was there but the restore failed mid-way through. I now know about that class of failure because I exercise the restore path.
  • Compatibility drift. If I’m running HA core 2026.4 in production but the snapshot is from core 2025.10, will it restore cleanly? Mostly yes, but there have been a few breaking changes in that span. The drill catches “the snapshot is too old to restore directly” before I need it.
  • Process drift. The restore steps change between HA versions. The drill keeps my muscle memory current. The first time you do a restore should not be at 11 PM with the family wondering why the lights aren’t responding.
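Between drills, a much cheaper check can catch the grossest form of snapshot corruption: an archive that tar cannot even list. The sketch below uses a hypothetical helper and assumes the snapshots are plain .tar files, as in the sync script. It only walks the outer archive; inner archives, which may be compressed or encrypted, are not opened, so it complements the restore drill rather than replacing it.

```bash
#!/usr/bin/env bash
set -eu

# find_unreadable_archives DIR  (hypothetical helper)
# Prints the path of every *.tar in DIR that tar cannot list end to end.
# Returns non-zero if at least one such archive is found.
find_unreadable_archives() {
  local dir="$1" bad=0 f
  for f in "$dir"/*.tar; do
    [ -e "$f" ] || continue   # the glob matched nothing
    if ! tar -tf "$f" >/dev/null 2>&1; then
      echo "$f"
      bad=1
    fi
  done
  return "$bad"
}
```

Run after the weekly pull, a non-zero exit would be the cue to look at the listed files before the Tier 1 originals age out.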

I also write a one-paragraph note after each drill describing what I did, what worked, and what didn’t. It lives in /work/homeassistant/AUDIT-YYYY-MM-DD.md alongside the configs in git.

What I learned from the near-miss

The original incident, 18 months ago, was a sequence:

  1. I made a YAML change that broke a critical automation
  2. I tried to “restore from yesterday’s backup” through the UI
  3. The UI offered me a list of backups, including one I’d never seen before — labeled with a partial timestamp
  4. I selected what I thought was the most recent one
  5. It turned out to be a manually-created snapshot from three weeks earlier that I had mistaken for the most recent one
  6. The restore overwrote everything and erased three weeks of work

The fixable parts of that:

  • I now name backups deliberately. Auto-generated daily backups have a clear timestamp. Anything I create manually gets a name describing why I created it (pre-update-2026-04-22, etc.). The list is no longer ambiguous.
  • I never restore directly to production for a non-emergency rollback. If something is broken, I do the restore on the test VM first, confirm the state I want is in there, and then either copy specific files back or do a full restore.
  • I keep configs in git, separately from the snapshot. /config/automations/, /config/scripts/, /config/packages/, and configuration.yaml are all in a git repo that pushes to a private remote. Even if every snapshot were lost, I could rebuild the config from git (the database history would be lost but everything else would survive).

That last one — git for configs — does most of the heavy lifting for “I broke something” recovery now. I can git diff to see exactly what I changed today, git checkout to roll back, and the snapshot is mostly insurance for “and the database too.”
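As an illustration of that workflow, here are two thin wrappers around standard git commands. The function names and the repository path in the usage note are hypothetical; only git diff and git checkout themselves come from the text above.

```bash
#!/usr/bin/env bash
set -eu

# changed_since REPO REV  (hypothetical helper)
# Lists files whose working-tree content differs from revision REV.
changed_since() {
  git -C "$1" diff --name-only "$2" --
}

# rollback_file REPO REV PATH  (hypothetical helper)
# Restores PATH in the working tree (and index) to its content at REV.
rollback_file() {
  git -C "$1" checkout "$2" -- "$3"
}
```

For example, changed_since /config HEAD would show what has been edited since the last commit, and rollback_file /config HEAD configuration.yaml would discard today’s edits to that one file. Rolling back a single file leaves the rest of the working tree untouched.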

Holes I haven’t filled

Two things on the to-do list that I haven’t done yet:

Off-site cloud backup. Tier 3 is offline-but-local. If the house burns down, I lose everything. A small Backblaze B2 bucket holding encrypted Tier 2 snapshots would close that gap for under $5/month. I keep meaning to set it up.

Automated restore testing. The quarterly drill is manual. It would be straightforward to write a script that spins up the test VM, restores the most recent snapshot, exercises a few HA service calls to confirm it’s alive, and tears the VM down. Haven’t done it. Would let me run the drill weekly instead of quarterly.
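The “confirm it’s alive” step is the easiest piece to sketch. Home Assistant’s REST API answers GET /api/ with a small JSON message when it is up and the token is valid. The helper names below are hypothetical, and the check assumes curl plus a long-lived access token for the test instance. It says nothing about whether the restored config is correct, only that the instance came up.

```bash
#!/usr/bin/env bash
set -eu

# is_alive_response BODY  (hypothetical helper)
# True when BODY looks like the REST API's liveness reply,
# {"message": "API running."}
is_alive_response() {
  case "$1" in
    *'"message"'*'API running.'*) return 0 ;;
    *) return 1 ;;
  esac
}

# ha_is_alive BASE_URL TOKEN  (hypothetical helper)
# Queries GET BASE_URL/api/ with a bearer token; fails on any HTTP or
# network error, or if the body is not the liveness reply.
ha_is_alive() {
  local body
  body=$(curl -fsS --max-time 10 \
           -H "Authorization: Bearer $2" \
           -H "Content-Type: application/json" \
           "$1/api/") || return 1
  is_alive_response "$body"
}
```

A drill script could call ha_is_alive "http://<test-vm>:8123" "$TOKEN" in a retry loop after the restore, then go on to check a few entity states before tearing the VM down.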

Both are on the list. Neither has been a high-enough priority to bump above the regular backlog.

If you’re starting from no backups

The order I’d suggest:

  1. Today: turn on HA’s built-in automatic backups with a 7-day retention. This is one toggle. Don’t put it off.
  2. This week: set up an rsync (or equivalent) pull from the HA host to wherever your bulk storage lives. Schedule it weekly.
  3. This month: do your first restore drill. Do it on a test VM, not production. Confirm it actually works.
  4. This quarter: figure out an offline or off-site storage strategy (USB rotation, cloud, or both) for the case where your home network is compromised.

The temptation when starting is to build the whole stack at once. Don’t. Get Tier 1 working first; live with it for a week; then add Tier 2; live with it for a week; then add Tier 3. Each tier independently is dramatically better than no backups, and the incremental approach keeps you from getting overwhelmed and skipping the whole thing.

The goal isn’t to have the world’s most elaborate backup system. The goal is to be able to sleep through a botched YAML edit and fix it tomorrow morning with a five-minute restore.

Tags: #home-assistant #backups #disaster-recovery #infrastructure