Infrastructure Hardening
A Security Audit, Offsite Backups, and Locking Down the Stack
The Context
Two weeks after the TrueNAS migration, my homelab was running smoothly. Services were stable, the compute/storage separation was working well, and everything had been humming along without incident. That's usually the dangerous moment, when things seem fine and you stop looking closely.
I decided to conduct a thorough infrastructure audit: examine every host, every service, every firewall rule, and every NFS export. The goal was to find the gaps I knew existed but hadn't quantified, and to address the biggest one of all: I had no offsite backups. If my house burned down or was burglarized, every photo, document, and piece of data I'd accumulated over the years would be gone.
The Audit (January 31)
I spent the day systematically examining every component of the infrastructure. The audit uncovered findings across four severity levels. Some were already mitigated by existing architecture decisions I'd made earlier. Others needed immediate attention.
What Needed Fixing
The most pressing issues discovered:
- NFS exports were wide open to the entire subnet. Any machine on the management network with root access could read and write all NFS-shared data, including photos, documents, and file shares. The exports used no_root_squash with no host restrictions.
- The Proxmox SSD thin pool was at 84% capacity. LVM thin pools start degrading above 85% and risk catastrophic failure. I was one busy backup cycle away from potential data corruption.
- No offsite backups existed. All data lived in one physical location. The ZFS mirror protected against drive failure, but not against theft, fire, or flood.
- The public VPS had no intrusion protection. The only machine with a public IP was running without any brute-force detection or automated banning.
The Fixes
Relieving the Storage Pressure
The most time-sensitive issue was the SSD thin pool at 84%. I used Proxmox's live disk migration to move VM 600 (the metrics server running Grafana and InfluxDB) from the SSD to the 1TB HDD added during the TrueNAS migration. The command qm move-disk 600 scsi0 hdd-storage --delete 1 mirrored the disk live with zero downtime, then cleaned up the old copy. The SSD pool dropped from 84% to 70%, safely out of the danger zone.
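The sequence can be sketched as follows. The storage ID hdd-storage and the volume group name pve are from my setup; verify your own names with lvs and the Proxmox storage configuration before running anything like this:

```shell
# Check thin pool usage on the Proxmox host (watch the Data% column;
# trouble starts above ~85%).
lvs -o lv_name,data_percent,metadata_percent pve

# Live-migrate VM 600's disk from the SSD pool to the HDD storage,
# deleting the source copy once the mirror completes. The VM keeps
# running throughout.
qm move-disk 600 scsi0 hdd-storage --delete 1
```

The --delete 1 flag removes the old disk automatically; without it, the source volume lingers as an unused disk entry until cleaned up manually.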
Locking Down NFS
Using the TrueNAS middleware CLI, I restricted every NFS export from the entire subnet to specific host IPs. The Immich, Nextcloud, and Paperless data shares now only accept connections from the application server VM. The InfluxDB share only accepts connections from the metrics VM. The Proxmox backup share only accepts the two hypervisor hosts.
This means that even if an attacker compromises another VM or gains access to any other machine on the management network, they can't mount or access application data over NFS. The attack surface went from "any root user on the subnet" to "only the specific VMs that need the data."
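TrueNAS manages exports through its middleware rather than a hand-edited file, but the effect shows up in the generated /etc/exports. A before/after sketch, with illustrative placeholder addresses rather than my actual IPs:

```
# Before: open to the whole management subnet, with root access
"/mnt/tank/immich" 10.0.10.0/24(rw,no_root_squash)

# After: pinned to the single host that actually needs the share
"/mnt/tank/immich" 10.0.10.21(rw,no_root_squash)
```

The no_root_squash option remains because the application containers need root-owned writes; the host restriction is what closes the hole.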
Securing the Public VPS
The VPS at DigitalOcean is the only machine with a public IP address, making it the primary target for internet-based attacks. I installed Fail2Ban with an aggressive SSH configuration: three failed login attempts trigger a 24-hour ban. Within minutes of activation, multiple malicious IPs were already being blocked. It was a good reminder that public-facing infrastructure is under constant automated attack.
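The jail configuration behind this is small. A sketch of what "aggressive" means here, in the standard /etc/fail2ban/jail.local:

```ini
[sshd]
enabled  = true
maxretry = 3
findtime = 10m
bantime  = 24h
```

Three failures within the findtime window and the source IP is banned for a day. Fail2Ban's defaults are far more forgiving, which makes sense for shared machines but not for a single-admin VPS.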
I also discovered the Caddy reverse proxy's apt repository had an expired GPG key, silently blocking security updates for the public-facing web server. The key was refreshed and Caddy updated to v2.10.2 the following day.
Establishing Offsite Backups
This was the most important outcome of the entire audit. I configured Backblaze B2 as an offsite backup destination with rclone crypt encryption, meaning the data is encrypted before it ever leaves my network. Even Backblaze can't read it.
Six automated cloud sync tasks now run nightly from TrueNAS:
- Photos (Immich): ~105 GB, syncs at 2:00 AM
- Documents (Paperless): ~2.3 GB, syncs at 2:15 AM
- Files (Nextcloud): ~17.7 GB, syncs at 2:30 AM
- Metrics (InfluxDB): ~678 MB, syncs at 2:45 AM
- Shared storage: ~640 GB, syncs at 3:00 AM
- VM backups (Proxmox): ~171 GB, syncs at 4:00 AM
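TrueNAS drives these through its built-in Cloud Sync tasks, but underneath it's rclone with a crypt remote layered over B2. The setup looks roughly like this; remote names, bucket name, and credentials are placeholders:

```ini
# rclone.conf (sketch)
[b2]
type = b2
account = <key-id>
key = <application-key>

[b2-crypt]
type = crypt
remote = b2:my-backup-bucket
password = <obscured-password>
filename_encryption = standard
```

Each nightly task is then equivalent to something like rclone sync /mnt/tank/photos b2-crypt:photos. Because the crypt remote wraps the B2 remote, file contents and names are encrypted client-side before anything leaves the network.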
Total estimated cost: about $5.62 per month for ~935 GB of encrypted cloud storage. B2 has free egress through their Cloudflare partnership, so restoring data wouldn't incur additional charges. For the peace of mind of knowing a house fire won't erase years of photos and documents, that's a bargain.
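The cost estimate is simple arithmetic at B2's storage rate of $6/TB-month ($0.006/GB-month); the six sizes listed above sum to just under the ~935 GB quoted:

```python
# Nightly sync sizes in GB, from the list above
sizes_gb = {
    "photos": 105,
    "documents": 2.3,
    "files": 17.7,
    "metrics": 0.678,
    "shared": 640,
    "vm_backups": 171,
}

RATE_PER_GB_MONTH = 0.006  # Backblaze B2 storage: $6/TB/month

total_gb = sum(sizes_gb.values())
monthly_cost = total_gb * RATE_PER_GB_MONTH

print(f"{total_gb:.1f} GB -> ${monthly_cost:.2f}/month")  # 936.7 GB -> $5.62/month
```

Egress is excluded from the estimate because, as noted, B2's Cloudflare partnership makes restores free.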
The Cleanup
Reclaiming Wasted Space
The audit also uncovered significant waste. An obsolete 414 GB tarball from the original December migration attempt was still sitting on the backup pool. Ten old migration snapshots lingered on the primary storage pool. An empty dataset and two invalid storage references on the secondary Proxmox node were cluttering the configuration.
After cleanup, the backup pool went from nearly full to 169 GB used out of 730 GB available. I also ran the first-ever scrub on the backup pool, which had never been checked for data integrity since creation. The scrub completed with zero errors.
Verifying What Already Worked
Not everything in the audit was bad news. Several items I'd flagged as potential problems turned out to already be resolved:
- DNS failover: All five VLAN DHCP scopes already pushed both Pi-hole DNS servers. If the primary goes down, clients automatically fall back to the backup.
- NFS mount ordering: All NFS mounts on the application server already used x-systemd.automount, eliminating boot-order race conditions with TrueNAS. The "unmounted" status that initially concerned me was just normal automount behavior, where mounts stay in a waiting state until first access.
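In fstab terms, such a mount looks something like this (hostname and paths are illustrative, not my actual layout):

```
# /etc/fstab on the application server (sketch)
truenas:/mnt/tank/immich  /mnt/immich  nfs  x-systemd.automount,_netdev,noatime  0  0
```

With x-systemd.automount, systemd places an automount unit at the mount point and only performs the real NFS mount on first access. Boot never blocks waiting for TrueNAS, and the mount sits in a waiting state until something touches it, which is exactly the status I had misread as a problem.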
It's a good feeling to discover that past decisions held up under scrutiny.
The Experiment That Failed
Not everything went smoothly. I attempted to harden the network switch trunk port by changing its default VLAN (PVID) to a dead VLAN with no routing. The idea was sound: if someone plugs an untagged device into the trunk port, it should land in a black hole instead of an active network.
The change caused an immediate management network outage. The root cause turned out to be an OpenWrt DSA bridge driver quirk where the CPU port doesn't correctly retain its "PVID Egress Untagged" setting after a network restart. All infrastructure connectivity was lost.
I reverted all changes and everything came back up. The fix is documented and achievable with physical console access via the JetKVM, but since the trunk port currently has no device connected, the risk is theoretical. It goes on the list for when I actually need that port.
Automation Groundwork
The audit also prompted some automation work. I created a Docker update script that handles all five compose projects on the application server: Immich, Paperless-ngx, Umami (web analytics), the personal website, and Dockhand (Docker management). The script pulls new images, restarts updated containers, and prunes old images, all logged for review. The systemd timer to run it weekly is ready to deploy.
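A minimal sketch of what such a script can look like. The project directory layout under /opt and the log path are assumptions about my setup, not prescriptions:

```shell
#!/bin/sh
# Update all compose projects: pull new images, restart changed
# containers, prune unreferenced images. Everything is logged.
set -eu

LOG=/var/log/docker-update.log
PROJECTS="immich paperless umami website dockhand"  # directories under /opt

{
  date
  for p in $PROJECTS; do
    echo "== $p =="
    cd "/opt/$p"
    docker compose pull
    docker compose up -d   # only recreates containers whose image changed
  done
  docker image prune -f    # drop now-unreferenced old images
} >> "$LOG" 2>&1
```

Because docker compose up -d is a no-op for containers whose image didn't change, the script is safe to run on a timer: unchanged services are left alone.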
Where Things Stand
After the audit and hardening session, the infrastructure is in a significantly stronger position:
- Offsite backups: All critical data encrypted and synced to Backblaze B2 nightly
- NFS security: Exports locked to specific host IPs instead of the entire subnet
- VPS hardened: Fail2Ban active, Caddy updated, GPG keys current
- Storage healthy: SSD thin pool at 70%, backup pool scrubbed with zero errors
- Cleanup complete: 414 GB of obsolete data removed, stale snapshots and configs cleaned up
The most important lesson from this exercise: infrastructure that runs without problems isn't necessarily infrastructure without problems. You have to look.