
The Deep Audit

Verifying Every Detail, Fixing What Broke, and Locking the Doors

The Context

After the infrastructure hardening sprint at the end of January, I had a detailed document describing my entire homelab: every host, every service, every firewall rule, every NFS mount. The problem was that I'd been making rapid changes during the hardening session, and I wasn't confident the document still matched reality. Storage metrics had drifted. Services may have been reconfigured without updating the docs. Things that "should" be running might not be.

I decided to do something I'd never attempted before: a comprehensive, automated verification of every verifiable claim in the infrastructure document. Not just spot-checking a few things, but systematically SSHing into every host and comparing what was actually there against what the documentation said should be there. I used Claude Code as an AI assistant to help orchestrate this, letting it drive the SSH sessions while I supervised and made decisions.

The Automated Audit (February 4)

Methodology

The approach was straightforward but thorough: take every factual claim in the infrastructure document and verify it. CPU models, RAM amounts, disk SMART status, BIOS versions, kernel versions, service versions, firewall rules, DHCP scopes, DNS configurations, VLAN assignments, backup jobs, cloud sync tasks, Tailscale nodes, Docker containers, NFS exports, port mappings, and more. If the document said it, we checked it.

This meant SSH sessions into all seven hosts: the router, both Proxmox hypervisors, the TrueNAS storage server, both application VMs, and the VPS. The audit also hit external endpoints to verify public-facing services were responding correctly.
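
At its core, each check reduced to comparing a documented value against a live one and flagging drift. A minimal sketch of that diff logic, assuming hypothetical host names and values (in the real run, the "actual" values came from commands executed over SSH, e.g. `uname -r`, `df -h`, `smartctl -H`):

```python
def audit(claims: dict[str, str], actual: dict[str, str]) -> list[str]:
    """Compare documented claims against observed values, one report line each."""
    report = []
    for key, documented in claims.items():
        live = actual.get(key)
        if live is None:
            report.append(f"MISSING {key}: documented={documented}, not observed")
        elif live == documented:
            report.append(f"OK      {key}")
        else:
            report.append(f"DRIFT   {key}: documented={documented}, actual={live}")
    return report

# Hypothetical sample: two storage metrics that drifted since the docs were written
claims = {"nas/ssd-pool-use": "70%", "ha-vm/disk-use": "84%"}
actual = {"nas/ssd-pool-use": "65%", "ha-vm/disk-use": "75%"}
for line in audit(claims, actual):
    print(line)
```

The value of framing it this way is that every claim in the document becomes a key-value pair, so "verify everything" turns into one loop instead of dozens of ad-hoc spot checks.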

What Matched

The good news: the vast majority of the document was accurate. All six VMs were running. All 13 Docker containers were healthy. The five-VLAN network architecture was correctly configured with proper DHCP scopes and firewall zones. All NFS exports had the host restrictions applied during the January hardening. The Backblaze B2 offsite backup tasks were running on schedule, with four completed and two large initial syncs still in progress. Fail2Ban on the VPS had already banned over 250 malicious IPs since installation. All external URLs were responding with correct content and security headers.

Disk SMART checks passed across all drives. The enterprise SAS drives in the TrueNAS server continued to show clean health. Backup jobs were running, with 39 successful backups on file across both Proxmox hosts.

What Didn't

The audit uncovered several discrepancies, ranging from cosmetic to significant:

  • collectd was not running on the router. The monitoring data flow diagram showed router metrics flowing to Grafana, but collectd had actually been crashing silently. The process manager had given up restarting it after repeated failures. This meant the router monitoring dashboards were showing stale data with nobody noticing.
  • Storage metrics had drifted significantly. The SSD thin pool documented at 70% was actually at 65%. The Tailscale VM disk documented at 83% was down to 53%. The Home Assistant disk documented at 84% had dropped to 75%. InfluxDB data had more than doubled. None of these were problems, but the document was out of date.
  • One VM had no SSH key deployed. The metrics server couldn't be accessed via SSH key authentication because nobody had ever added the authorized key. It had been set up before I standardized SSH access.
  • A USB device was misidentified. The document listed an ESP32 connected via USB, but lsusb showed it was actually a Nabu Casa Zigbee coordinator. A different USB device listed as connected wasn't detected at all, likely unplugged at some point.

Fixing the Gaps (February 4-5)

The collectd Crash Investigation

This was the most interesting fix of the audit. The router's collectd service had been crashing repeatedly, with the process manager logging eight consecutive crashes before giving up entirely. The service was dead, and since nobody was alerting on it, nobody knew.

The root cause turned out to be three problematic plugins. The processes plugin required kernel TASKSTATS support that wasn't compiled into the OpenWrt kernel. The tcpconns plugin was throwing netlink errors. And rrdtool was configured but unnecessary since the data was being shipped to InfluxDB anyway.

The fix was to create a simplified configuration with only the stable, useful plugins: CPU, memory, load, network interfaces, thermal, uptime, connection tracking, DHCP leases, and WiFi statistics. After restarting with the lean config, collectd ran stably. I verified data was flowing end-to-end: router collectd to Telegraf to InfluxDB to Grafana dashboards. The monitoring gap was closed.
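
A lean `/etc/collectd.conf` along these lines can be sketched as a short plugin whitelist. This is an illustration, not my exact file: the interval, the Telegraf listener address, and the OpenWrt-specific plugin names (`dhcpleases`, `iwinfo`) are assumptions.

```
# /etc/collectd.conf -- hedged sketch of a minimal, stable config.
# The problematic plugins (processes, tcpconns, rrdtool) are simply absent.
Interval 30

LoadPlugin cpu
LoadPlugin memory
LoadPlugin load
LoadPlugin interface
LoadPlugin thermal
LoadPlugin uptime
LoadPlugin conntrack
LoadPlugin dhcpleases   # OpenWrt-specific DHCP lease counts (assumed package)
LoadPlugin iwinfo       # OpenWrt WiFi statistics (assumed package)
LoadPlugin network

<Plugin network>
  # Ship metrics to the Telegraf collectd listener; address is an assumption
  Server "192.168.1.50" "25826"
</Plugin>
```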

This was a perfect example of why you need alerting, not just monitoring. Having beautiful Grafana dashboards doesn't help if you don't notice when they stop updating.

Tailscale Cleanup

The audit revealed two stale phone entries in the Tailscale network from devices I no longer use. These were removed via the Tailscale API, bringing the mesh network from nine nodes down to seven. I also verified the subnet routing configuration: the primary router and backup were correctly designated with no conflicts.

Docker Update Automation

With 13 containers running across five Docker Compose projects, manually checking for updates was tedious and easily forgotten. I deployed an automated update script that runs weekly via a systemd timer. It pulls new images for each compose project, recreates containers if updates are available, and prunes old images to prevent disk bloat. Everything is logged for review after each run.
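
A setup like this can be sketched as a oneshot service plus a weekly timer. The unit names, the `/opt/compose/` project layout, and the log path below are assumptions for illustration, not my actual files:

```
# /etc/systemd/system/compose-update.service -- sketch
[Unit]
Description=Weekly Docker Compose image updates
Wants=network-online.target
After=network-online.target

[Service]
Type=oneshot
# For each compose project: pull new images, recreate if changed, then prune
ExecStart=/bin/sh -c 'for d in /opt/compose/*/; do docker compose --project-directory "$d" pull && docker compose --project-directory "$d" up -d; done; docker image prune -f'
StandardOutput=append:/var/log/compose-update.log
StandardError=append:/var/log/compose-update.log

# /etc/systemd/system/compose-update.timer -- sketch
[Unit]
Description=Run compose-update weekly

[Timer]
OnCalendar=weekly
Persistent=true

[Install]
WantedBy=timers.target
```

`Persistent=true` matters here: if the host happens to be down at the scheduled time, the run fires on next boot instead of silently skipping a week.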

Nextcloud Sync Troubleshooting

A separate issue surfaced when Obsidian's sync plugin started throwing 403 Forbidden errors against Nextcloud. The root cause was subtle: a new Obsidian plugin had created a local directory that didn't exist on the Nextcloud server, and Nextcloud returned 403 (not 404) when the sync tried to upload files to a nonexistent path. Creating the missing directory and running Nextcloud's file scanner resolved it.

This also revealed that Nextcloud's internal log file hadn't been updating since May 2025. The Apache access logs were the reliable source for troubleshooting, which I documented for next time.

The Peer Review (February 5)

With the infrastructure document freshly verified and corrected, I submitted it for external review from three different AI assistants. These AI reviewers assessed the complete document, each bringing different perspectives on security, architecture, and operational practices.

Key Feedback

The reviewers were broadly positive about the VLAN segmentation, the offsite backup strategy, and the overall architecture. But they found real gaps:

  • SSH was still accepting passwords. Most hosts still allowed password-based SSH login alongside key authentication. Two reviewers pointed out this left a brute-force attack surface open on every host in the infrastructure.
  • No swap on the main application server. The VM running 13 Docker containers had no swap configured. Under heavy load, it would OOM-kill processes with no graceful degradation path.
  • Home Assistant sprawl. Two reviewers noted the smart home setup had hundreds of entities but very few active automations, with many sensors showing as unavailable. The kindest description was "a setup that got built out ambitiously and then partially abandoned."

The feedback led to a complete reorganization of my priority list into four tiers, from immediate quick wins to items that could wait for a planned house move.

The SSH Lockdown (February 6)

Closing the Password Door

Prompted by the reviewers' feedback, I conducted a full SSH authentication audit across all seven hosts. The results were worse than expected: five of seven hosts still accepted password-based SSH logins. Three of those also allowed root login with a password. My SSH key was deployed everywhere and working, but the password fallback left every host vulnerable to brute-force attacks.

The fix was systematic. On each Linux host, I disabled password authentication and restricted root login to key-only access. The router required a different approach since it uses Dropbear instead of OpenSSH, but the end result was the same. After each change, I verified I could still connect with my key before moving to the next host. TrueNAS was the only host that was already correctly configured as key-only.
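
On the OpenSSH hosts, the lockdown amounts to a handful of `sshd_config` directives. The block below is a sketch of a typical key-only policy rather than a dump of my configs; the Dropbear equivalent is shown in comments since OpenWrt stores it in UCI config instead:

```
# /etc/ssh/sshd_config -- key-only policy (sketch)
PubkeyAuthentication yes
PasswordAuthentication no
KbdInteractiveAuthentication no
PermitRootLogin prohibit-password

# OpenWrt Dropbear equivalent (/etc/config/dropbear):
#   option PasswordAuth 'off'
#   option RootPasswordAuth 'off'
```

Validate with `sshd -t` and keep an existing session open while testing the new config; a typo here plus a closed terminal means a trip to the physical console.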

The result: every host in the infrastructure now requires key-based SSH authentication. No passwords accepted anywhere. I also exported a backup of my SSH keys for secure storage in a password manager, since losing that key would now mean console-only access to everything.

Other Fixes (February 5-6)

Preventing an OOM Crisis

The main application server was running 13 Docker containers on under 8 GB of RAM with zero swap space. Under normal load, it sat at about 85% memory usage. A burst of activity from Immich's machine learning service or Collabora Online could push it over the edge, triggering the OOM killer and potentially corrupting database state.

I added a 4 GB swap file to give the system breathing room. Swap isn't a replacement for adequate RAM, but it's the difference between graceful degradation and sudden process death. I also cleaned up a stale swap partition reference from the original Debian installation that was pointing at a partition that no longer existed.
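
For reference, a swap file like this is created once and then persisted via `/etc/fstab`. The commands and path below are the standard Linux recipe, shown as an assumption rather than my exact steps:

```
# One-time creation (as root):
#   fallocate -l 4G /swapfile
#   chmod 600 /swapfile
#   mkswap /swapfile
#   swapon /swapfile
#
# Persistent /etc/fstab entry (replaces any stale swap-partition line):
/swapfile none swap sw 0 0
```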

Nextcloud Upload Limits

Nextcloud's PHP configuration still had default upload limits: 2 MB maximum file size. This made it essentially useless for anything larger than a small document. I raised the limits to 2 GB and increased the memory and timeout settings to handle large file transfers. A simple fix that should have been caught during initial setup.
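
The limits live in PHP's configuration rather than in Nextcloud itself. A hedged sketch of the overrides follows; the `conf.d` path and PHP version are assumptions, as are the specific memory and timeout values (only the 2 GB upload limit is from the change described above):

```
; e.g. /etc/php/8.2/fpm/conf.d/99-nextcloud.ini -- sketch
upload_max_filesize = 2G
post_max_size = 2G
memory_limit = 512M
max_input_time = 3600
max_execution_time = 3600
```

Note that `post_max_size` must be at least as large as `upload_max_filesize`, or uploads will still fail at the old limit.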

Tailscale Key Expiry

The backup Pi-hole's Tailscale key was set to expire in two weeks. If it had expired unnoticed, every device on the Tailscale mesh using it as a DNS resolver would have lost DNS resolution. I rotated the key and disabled automatic expiry on both Tailscale nodes to prevent this from becoming a recurring maintenance burden.

Documentation Consolidation

I merged the separate smart home documentation into the main infrastructure document, expanding it from a summary into a comprehensive reference covering all Home Assistant areas, integrations, automations, and known issues. Having everything in one document eliminates the drift that inevitably happens when related information lives in separate files.

What I Learned

Monitoring Without Alerting Is Just a Dashboard

The collectd outage was the most valuable finding of the entire audit. I had beautiful Grafana dashboards for router metrics, and the data source had been dead for an unknown period of time. Nobody noticed because nobody was watching the dashboards 24/7, which is exactly the point of automated alerting. Grafana alerting is now near the top of my priority list.

Document, Then Verify, Then Repeat

An infrastructure document that isn't regularly verified against reality will drift. It took less than a week of rapid changes during the January hardening sprint for multiple metrics to become stale. The automated audit approach, systematically checking every claim, is something I plan to repeat periodically.

External Review Finds What You Miss

When you're deep in the weeds of NFS security and ZFS replication, it's easy to overlook the simple things. Several items I'd been deprioritizing for days were immediately flagged as critical by fresh eyes. Independent reviewers, even AI reviewers, bring perspective that's impossible to maintain when you're the one building and operating the system.

AI-Assisted Auditing Works

Using Claude Code to drive the verification process was remarkably effective. It could hold the entire infrastructure document in context, SSH into each host, run the appropriate verification commands, and flag discrepancies. The process that would have taken me many days of manual checking was completed in a few focused hours. I still made all the decisions about what to fix and how, but the tedious verification work was automated away.

Where Things Stand

After the deep audit and remediation sessions, here's the current state:

  • SSH hardened: Key-only authentication across all seven hosts. No password-based SSH access anywhere.
  • Monitoring restored: Router metrics flowing again after collectd fix. Full data pipeline verified end-to-end.
  • Memory safety net: Swap space added to the main application server, preventing OOM kills under load.
  • Documentation verified: Every verifiable claim in the infrastructure document checked against reality and corrected.
  • Offsite backups confirmed: All six B2 cloud sync tasks verified running on schedule with encryption.
  • Docker updates automated: Weekly timer handles image pulls and container recreation across all compose projects.
  • Tailscale cleaned up: Stale devices removed, key expiry disabled, routing conflicts verified absent.

The big remaining items from the peer review: Grafana alerting (the most important operational improvement) and a backup drive upgrade to enable local ZFS replication alongside the existing offsite backups.

The infrastructure is in the strongest position it's ever been. But if this audit taught me anything, it's that "things seem fine" is a dangerous assumption. I'll keep verifying.

Claude Code SSH collectd Grafana Tailscale Nextcloud Docker OpenWrt systemd