Manage Your Ubuntu Server Like a Pro with NodeJS

Date: 3/18/2026

If you're currently working with a VPS for NestJS/Next.js, you've probably come across the usual content: “Install Nginx, start Node, point your domain, done.” And yes, that's how you get an app online. But what these articles rarely provide is what people with the highest purchase intent are actually searching for: “How do I make sure the setup runs stably, with alerts, logs, backups, and a plan for when things go wrong?” That's exactly where the difference lies between a server that “somehow runs” and a server that supports your business.

This article is therefore not a deployment guide, but a production stack: opinionated, pragmatic, and copy-paste ready. You'll build a three-layer monitoring system (external, services, host), set up logs so you can find errors in minutes instead of hours, create database backups that you can actually restore, harden SSH without locking yourself out, configure Nginx/PM2 for deployments without 502 shocks, and in the end, you'll have an incident runbook that supports you like a handrail in an emergency.

If you take away only one core message: Stability is not a feature you “add later.” Stability is a chain, and it always breaks at the weakest link. In practice, these are almost always missing alerts, untested backups, open SSH doors, log sprawl, and a lack of a plan when the pager icon (or Slack ping) lights up.

Uptime Monitoring That Really Wakes You Up

If you've ever had a night where you wake up in the morning, open Slack, and your first thought is “Please no…”, then you know that monitoring isn't just "nice to have." Monitoring is peace of mind. The most common reason why VPS projects seem unstable is rarely a single big bug. It's a small, nasty bundle of things: a process hangs, the certificate expires, the disk fills up, a deploy briefly causes a 502, or the provider has network hiccups. That's exactly why you need monitoring as an early warning system before the damage reaches your business.

The cleanest mental model for Node apps on a VPS is a three-layer monitoring system: outside, inside, and underneath. Outside means: What does the user see? Inside means: Are dependencies (DB/Redis/Queue) alive? Underneath means: How is the host doing (CPU/RAM/Disk/Network)?

[Figure: The three-layer monitoring system: external, services, host]

Why is this three-part division so important? Because otherwise, during an incident, you ask the wrong questions. If only “Host CPU high” alerts, you don't know if users are affected. If only “Homepage 200 OK” alerts, you don't notice your DB is slowly dying. Good stability means you can determine what is broken in 30 seconds.

For the external layer, a tool like Uptime Kuma is popular because you can self-host it very quickly and get many monitor types. The official install guide shows a very direct Docker run example, and that's exactly how you want it in a VPS stack: small, clear, and reproducible.

docker run -d --restart=unless-stopped \
  -p 3001:3001 \
  -v uptime-kuma:/app/data \
  --name uptime-kuma \
  louislam/uptime-kuma:2

If you use Docker here, --restart=unless-stopped is a pragmatic choice because, in most cases, you want monitoring to come back after a host reboot. Docker documents restart policies and how they behave in certain situations (e.g., if you manually stop containers). These “small” details are often the difference between: “Everything was back after the reboot” and “Why was my monitoring down exactly when I needed it?”

Uptime Kuma's docs also warn that the filesystem must support POSIX file locks to avoid SQLite corruption. So don't put your data volume on a quirky NFS setup. If you map /app/data cleanly to a local disk or local volume, you're safe from this classic failure mode.

What should you monitor? Not just “/”. Monitor what really hurts your business. Ideally, in NestJS, you have a health endpoint like /healthz that only returns 200 if the app and key dependencies are okay. Add a “real” path that hits the user flow, such as /login or /api/auth/session, or a read-only endpoint that truly traverses your entire stack. Also monitor certificate expiry, because otherwise, you'll always notice it on Sunday evening.
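Certificate expiry is also easy to check mechanically. Here's a small sketch (assuming openssl and GNU date, as shipped on Ubuntu) that prints the days remaining on a PEM certificate; the function name is mine, and for a live site you'd first fetch the certificate as shown in the comment:

```shell
#!/usr/bin/env bash
# days_until_expiry: print how many days remain before a PEM certificate
# expires. Sketch only. For a live host, fetch the cert first, e.g.:
#   openssl s_client -connect example.com:443 -servername example.com \
#     </dev/null 2>/dev/null | openssl x509 -outform PEM > /tmp/live.pem
days_until_expiry() {
  local cert="$1"
  local end epoch_end epoch_now
  # "notAfter=Mar 18 12:00:00 2026 GMT" -> keep only the date part
  end=$(openssl x509 -enddate -noout -in "$cert" | cut -d= -f2)
  epoch_end=$(date -d "$end" +%s)   # GNU date, fine on Ubuntu
  epoch_now=$(date +%s)
  echo $(( (epoch_end - epoch_now) / 86400 ))
}
```

Wire it into cron and alert below, say, 14 days; that gives you a week of slack even if the first ping gets missed.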

Now for the host layer, which many underestimate. If you want a stable system on a server, you can't avoid “host metrics.” The standard building block for this is Node Exporter, which (officially documented) provides a wide range of hardware- and kernel-level metrics that can be scraped by Prometheus.
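If you install Node Exporter from the release tarball rather than Docker, a minimal systemd unit is enough to keep it running and scrapeable. This is a sketch; the binary path and the dedicated node_exporter user are assumptions you'd adapt to your install:

```ini
# /etc/systemd/system/node_exporter.service (sketch)
[Unit]
Description=Prometheus Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
ExecStart=/usr/local/bin/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

After a `sudo systemctl enable --now node_exporter`, metrics should appear on the default port at http://localhost:9100/metrics.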

You don't have to set up the whole “Prometheus + Grafana” universe right away. You can start small. But small doesn't mean “blind.” Small means: You choose the alerts that protect you from real disasters. In practice, these are five things that almost always cause the biggest fires: disk fill level, RAM/memory pressure, persistently high CPU/load, HTTP 5xx, and latency outliers (P95/P99).
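For the disk alert, the most common fire of them all, even a cron-driven script beats nothing. A minimal sketch; the threshold is a placeholder, and the webhook line is whatever your alert channel happens to be:

```shell
#!/usr/bin/env bash
# Warn when the root filesystem crosses a usage threshold.
THRESHOLD=85
USAGE=$(df --output=pcent / | tail -n 1 | tr -dc '0-9')
if [ "$USAGE" -ge "$THRESHOLD" ]; then
  echo "ALERT: / is at ${USAGE}% (threshold ${THRESHOLD}%)"
  # e.g. curl -fsS -X POST "$SLACK_WEBHOOK_URL" -d "{\"text\":\"disk ${USAGE}%\"}"
fi
```

Drop it into /etc/cron.hourly (or a cron entry) and you've covered the single most frequent outage cause with ten lines.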

Log Stack That Doesn't Eat Your Disk

Monitoring tells you that something's burning. Logs tell you why exactly it's burning and where the source is. Without logs, you're debugging in the dark. With bad logs, you're debugging in the dark while losing your disk in the process. On a server, "disk full due to logs" and "backups fail because the disk is full" are among the most common chain reactions, which is why a log stack is not a luxury but a stability foundation.

You want logs to work like good bookkeeping. You want to find them, filter them, rotate them, retain them, and you don't want them to eat your system.

The pragmatic way is two-stage. First, clean locally, then central if you really need it. Local means: Nginx access/error logs, Node app logs via stdout/stderr, and systemd/journal as a base. If you use PM2 for Node, log rotation is practically mandatory. PM2 itself documents the logrotate module and explicitly mentions the installation path, as this prevents oversized log files in production.

pm2 install pm2-logrotate
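The module's defaults are sensible, but it's worth stating your log budget explicitly. These `pm2 set` keys are the ones the pm2-logrotate module documents; the numbers are assumptions to adjust to your disk:

```shell
pm2 set pm2-logrotate:max_size 10M      # rotate when a file reaches 10 MB
pm2 set pm2-logrotate:retain 14         # keep 14 rotated files
pm2 set pm2-logrotate:compress true     # gzip rotated logs
```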

If you use PM2, you also want your logs always accessible and your processes to survive a reboot. PM2 describes that it can generate startup scripts to keep your process list intact across restarts.

pm2 start ecosystem.config.js --env production
pm2 save
pm2 startup

If you use PM2, it has its own log world. You can follow your logs live with

pm2 logs

The log files themselves live in the PM2 home directory, and if you run services via systemd, journalctl is your friend. PM2 explicitly documents that logs land in ~/.pm2/logs by default and that you can stream them live.

Now for the second stage: central logs. You don't need them if you debug once a month and you're alone. You need them if you debug more often, have multiple services, or find yourself jumping between SSH sessions to find the right server. A common approach here is Loki, because (simply put) it doesn't index the entire log text but works with labels, which often makes it cheaper to operate.

An important note for everyone setting up their log stack today: Promtail is now deprecated. Grafana will only maintain Promtail in LTS mode after February 13, 2025, with end-of-life on March 2, 2026, and clearly recommends Alloy as the primary collector for logs to Loki for new setups. So if you're starting fresh, plan directly with Alloy. If you're already using Promtail, you don't need to rush to rebuild everything, but you should plan the migration consciously instead of relying on a component that's being phased out.

When debugging logs, you want to answer typical questions quickly. Why do I have 502? Then you look in Nginx error.log, because that's where “Upstream refused/timeout” hints are. Why is the app suddenly gone? Then check PM2 status and the last lines, as many crashes are explained in the last 20 log lines. Why are logs “suddenly gone”? It's usually rotation/retention. Why is disk full? It's usually log growth plus backups plus Docker layers plus journald.

If you want to tattoo a single opinionated best-practice sentence into your mind: Logs need a budget. Set a limit. Decide consciously: How many days do you keep locally? How big may a single log file get? What must never end up in logs (secrets, tokens, full request bodies)? Then implement it technically, first locally with rotation, later centrally with a collector.
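The same budget thinking applies to the systemd journal, which grows quietly in the background. journald has its own size caps in /etc/systemd/journald.conf; a sketch with placeholder values:

```ini
# /etc/systemd/journald.conf (excerpt, sketch)
[Journal]
SystemMaxUse=500M
MaxRetentionSec=2week
```

Apply with `sudo systemctl restart systemd-journald`; for an immediate one-off cleanup, `journalctl --vacuum-size=500M` works too.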

DB Backups You Can Actually Restore

Backups are a strange topic. Everyone knows you need them. And yet they're missing in a shocking number of VPS setups. Why? Because backups don't provide instant gratification. They only provide it when everything goes wrong, and then it's too late to build them quickly.

For Node apps, the database is often the real single point of failure, and “backup” doesn't mean “there's a file somewhere,” but: I can reproducibly restore on a fresh machine, within a time my business accepts.

The classic, extremely robust guideline is the 3-2-1 rule: three copies, two different media, one copy offsite. It's quoted so often because it immediately corrects the misconception that a backup on the same server counts. In practice, this means data on the VPS is not a backup copy.

A pragmatic target for a VPS is to make a daily logical backup of the database, compress it, keep a short local retention (e.g., 7 days), and push it offsite encrypted. If you do this well, it's not complicated, but it is consistent.

For MySQL, mysqldump is the classic for logical backups. The MySQL docs describe it very clearly: mysqldump generates SQL statements that can reproduce database objects and table data. That's exactly what you want for an “easy restore.”

Here's a backup script you can use directly on an Ubuntu VPS. It's deliberately simple: one database, one gzipped dump, retention via find. That's how you start—not perfect, but stable.

#!/usr/bin/env bash
set -euo pipefail

BACKUP_DIR="/var/backups/mysql"
DB_NAME="app_db"
DATE="$(date +%F)"
FILE="${BACKUP_DIR}/${DB_NAME}_${DATE}.sql.gz"

mkdir -p "$BACKUP_DIR"
chmod 700 "$BACKUP_DIR"

mysqldump \
  --single-transaction \
  --routines \
  --triggers \
  --databases "$DB_NAME" \
  | gzip -c > "$FILE"

find "$BACKUP_DIR" -type f -name "${DB_NAME}_*.sql.gz" -mtime +7 -delete

echo "Backup written: $FILE"

Three details are important to know. First, --single-transaction with InnoDB is a proven way to take a consistent snapshot without hard locking everything. The mysqldump manpage mentions exactly this online backup pattern for InnoDB.
Second, you include --routines and --triggers because otherwise, something will be missing during a restore, and you'll often only notice when it's too late.
Third, you put the backup in /var/backups/mysql and give the folder restrictive permissions because backups are highly sensitive.
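A backup that depends on you remembering it isn't a backup, so wire the script into cron. The script path and schedule here are assumptions:

```conf
# /etc/cron.d/mysql-backup (sketch): run daily at 02:30 as root
30 2 * * * root /usr/local/bin/mysql-backup.sh >> /var/log/mysql-backup.log 2>&1
```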

How do you restore this backup? You decompress to stdout and feed it to the MySQL client. This is the most common restore form with gzip because it's fast, clear, and works.

gunzip -c app_db_2026-03-09.sql.gz | mysql -u root -p

Restore tests are not optional. Backup files often exist without being usable. Wrong credentials, broken dumps, missing permissions, missing routines, wrong charset/collation, or too little disk during restore mean some backups are useless. That's why the "fire drill" once a month is such a powerful best practice. You take a fresh VM or container, restore the last dump, do a smoke test (login, two core endpoints), and measure the time. That's how you get a feel for your actual RTO.
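Between full fire drills, a cheap automated sanity check catches the worst failures (a corrupt or empty dump) early. A sketch; the function name is mine, and it's an early smoke alarm, not a substitute for a real restore test:

```shell
#!/usr/bin/env bash
# check_dump: verify a gzipped SQL dump is intact and actually contains SQL.
check_dump() {
  local dump="$1"
  gunzip -t "$dump" || { echo "corrupt gzip: $dump"; return 1; }
  gunzip -c "$dump" | head -n 50 | grep -Eq 'CREATE|INSERT|mysqldump' \
    || { echo "no SQL found in: $dump"; return 1; }
  echo "dump OK: $dump"
}
```

Run it right after the backup job and alert on a non-zero exit.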

If you want to go a step further, there's Point-in-Time Recovery. You need this when “as of last night” isn't enough because someone accidentally deleted something at 3:17 PM and you want the state just before that. The MySQL docs describe PITR via binary logs and the mysqlbinlog utility, including the ability to select and apply parts of the binlogs by time or position.

Offsite is the next stability lever. Offsite doesn't just mean copying away, but also not readable if the storage is compromised. This is where restic shines, as it's built as a backup tool with an encrypted repository and can be placed on various backends. The restic docs describe initializing a repository and that it can be local or remote. The practical effect for you is that you can do it offsite without having to encrypt the data yourself.

A minimalist start that's truly realistic would be as follows. Install restic on the VPS, initialize a repository on an S3-compatible bucket or SFTP target, and push your backup files there. Restic explicitly emphasizes that without the password, you can't access the data and that a lost password is not recoverable. That's inconvenient, but that's exactly what security is.

If you want a “copy/paste” skeleton for this, here's a usable, very minimalist start for an S3 repo (you'll need to adjust the repository URL and credentials):

export RESTIC_REPOSITORY="s3:s3.amazonaws.com/your-bucket-name"
export RESTIC_PASSWORD="CHANGE-ME-TO-A-LONG-PASSPHRASE"

restic init
restic backup /var/backups/mysql

The nice thing is that with such a setup, you can start immediately and add retention, forget/prune, and monitoring later. It doesn't have to be perfect, but it should exist and be offsite.
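When you're ready for the retention part, restic's forget/prune flags map directly onto the budget idea. A cron sketch; the env file path is my assumption (it would hold RESTIC_REPOSITORY and RESTIC_PASSWORD, with 600 permissions):

```conf
# /etc/cron.d/restic (sketch)
15 3 * * * root . /etc/restic.env && restic backup /var/backups/mysql
45 3 * * * root . /etc/restic.env && restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 6 --prune
```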

SSH Hardening Without Locking Yourself Out

There are two ways people approach SSH hardening. They leave everything as is until they see bots in the log, or they change three options, hit enter, and lock themselves out. You want neither. You want: Consistently secure, but safely implementable.

If your VPS is on the internet, it will be scanned. That's not a feeling, that's reality. That's why SSH hardening is the biggest lever per minute invested. You cut off an entire class of attacks by eliminating password login and closing root access.

The core revolves around a few directives in the sshd configuration. The OpenSSH and Ubuntu manpages describe very specifically what options like PermitRootLogin mean (including variants like prohibit-password) and how password authentication works.

The most important rule to keep in mind is simple and saves you from the classic mistake: Keep an existing SSH session open, open a second session to test, and only when the second session works do you disable password login. This sentence is so banal that many forget it, but so crucial that you should always keep it in mind.

I also recommend: On modern Ubuntu versions, use a drop-in file instead of editing the main file. It's not a must, but it makes your changes clean and versionable. So, for example, create a file under /etc/ssh/sshd_config.d/:

sudo nano /etc/ssh/sshd_config.d/99-hardening.conf

And here's a conservative set that works as a starting point for many VPS setups:

PermitRootLogin no
PasswordAuthentication no
KbdInteractiveAuthentication no
PubkeyAuthentication yes
AllowUsers deploy

Before you reload, test the configuration. That's the second anti-lockout rule. Validate first, then load.

sudo sshd -t
sudo systemctl reload ssh

Knowing that sshd -t is the sanctioned way to check the configuration before activating changes is one of those small tricks that are worth gold in an emergency; without it, a typo can leave you with an unreachable server.

Now for the firewall. Many think “securing SSH” just means “no passwords.” But network exposure is just as important. Ubuntu documents ufw as the standard tool for managing firewall rules in a user-friendly way. The crucial beginner's mistake is activating the firewall before allowing SSH. That's why you should always do it in this order: default deny incoming, allow outgoing, allow SSH, allow HTTP/HTTPS, then enable.

sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow OpenSSH
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw enable
sudo ufw status verbose

And then there's the topic of “slowing down bots.” This is where Fail2Ban comes in. Fail2Ban scans log files like /var/log/auth.log and bans IPs that generate too many failed attempts by updating firewall rules. It's not a replacement for key-only SSH, but it's a very good guardrail against unnecessary noise and brute-force spam.
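Fail2Ban's packaged defaults already cover sshd on Ubuntu, but it's worth pinning your intent in a jail.local so package upgrades don't surprise you. A conservative sketch:

```ini
# /etc/fail2ban/jail.local (sketch)
[sshd]
enabled  = true
maxretry = 5
findtime = 10m
bantime  = 1h
```

Restart with `sudo systemctl restart fail2ban` and inspect the jail with `sudo fail2ban-client status sshd`.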

One last security building block that's often forgotten is automatic security updates. Ubuntu documents that security updates are applied automatically via unattended-upgrades and describes the available options, up to automatic reboots if necessary. Even if the package is installed, verify that it's actually active, and consciously decide whether to allow automatic reboots or to plan them in maintenance windows.

If you want to activate this cleanly, this is sufficient in many cases:

sudo apt update
sudo apt install unattended-upgrades
sudo dpkg-reconfigure --priority=low unattended-upgrades
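The reboot decision lives in /etc/apt/apt.conf.d/50unattended-upgrades. A sketch of the relevant excerpt; leave reboots off unless you've consciously picked a window:

```conf
// /etc/apt/apt.conf.d/50unattended-upgrades (excerpt, sketch)
Unattended-Upgrade::Automatic-Reboot "false";
// or, if you accept automatic reboots, pin them to a maintenance window:
// Unattended-Upgrade::Automatic-Reboot "true";
// Unattended-Upgrade::Automatic-Reboot-Time "04:00";
```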

Reverse Proxy, WebSockets, Security Headers, and PM2 Mindset

Nginx in your stack is not just a “reverse proxy.” Nginx is your bouncer. It sits in front, terminates TLS (or forwards it), can limit payload sizes, do rate limiting, configure buffering sensibly, and is exactly the place where you catch many security and stability issues before they even reach your Node processes. The NGINX documentation describes reverse proxy configuration as a core task, including modifying request headers and response buffering.

When self-hosting Next.js apps, it's explicitly recommended to put a reverse proxy like Nginx in front of the Next.js server instead of exposing it directly to the internet, not least because the reverse proxy can handle things like malformed requests, slow-connection attacks, payload size limits, and rate limiting.

“Why 502 Bad Gateway after deploy?” Once you understand this, 502 suddenly seems less like a mystery and more like a symptom. A 502 from Nginx almost always means that Nginx couldn't communicate cleanly with the upstream. The most common reasons are trivial. Your Node process was briefly down during reload, it's listening on a different port than expected, it crashed, it takes longer to start than Nginx is patient, or WebSockets/SSE are misconfigured, leading to upgrades/timeouts.

That's why a solid Nginx server block template is helpful. It's not a cure-all, but it's a stable foundation to iterate on. It sets the classic X-Forwarded headers, enables WebSocket upgrade headers, and sets timeouts in a range that doesn't immediately break on slow requests.

server {
  listen 80;
  server_name example.com;

  client_max_body_size 20m;

  location / {
    proxy_pass http://127.0.0.1:3000;

    proxy_http_version 1.1;

    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;

    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";

    proxy_read_timeout 60s;
    proxy_send_timeout 60s;
  }
}

The next classic is HTTPS/redirects behaving oddly, cookies not being “secure,” and redirect loops. In Node stacks, this often happens when your app layer doesn't correctly understand that the original request was HTTPS, while internally (between Nginx and Node) it's HTTP. Here, X-Forwarded-Proto and “trust proxy” are key. Behind proxies, you should use the trust-proxy setting so values like client IP and other request properties are derived correctly. This isn't nice-to-have; it's the basis for your app to correctly perceive what came from the client in production.

The same goes for security headers: don't blindly copy-paste a wall of headers from some tutorial. That's often where a setup loses more quality than it gains. The better way is to start small, understand the effect of each header, and then expand purposefully. Not every header makes sense in every scenario, and more isn't automatically more secure.

PM2 also plays a crucial role for a stable Node VPS, as it not only keeps your processes alive but also ensures your app comes back up after a reboot. That's one of the invisible differences between “somehow runs” and “runs stably.” If Nginx forwards cleanly but PM2 isn't properly set up as a startup service, your server appears unstable from the outside, even though the real problem is just an incomplete process setup.

What saves many in production is the Memory Threshold Auto Restart. PM2 documents that you can automatically reload/restart processes when they reach a memory threshold. This doesn't fix a memory leak. But it prevents a leak from taking your app completely offline before you've even found the cause. That's why it's a very useful stability tool for small teams.

You can activate it like this:

pm2 start "npm run start" --name app --max-memory-restart 400M
pm2 save

But don't forget, this is an airbag, not a fix. If you see this restart, it's a signal that you need profiling and memory analysis or that you have an underprovisioning problem.

The 502 after deploy issue is usually not solved with an Nginx trick, but with deployment discipline. The most important lever is a zero-downtime mindset. That means you don't hard-stop the process, you reload in an orderly fashion, you give your process time to finish requests cleanly, and you make sure new instances are ready before the old ones die. PM2 can help with this in cluster mode, alternatively systemd with two instances, or you use “blue/green” via port switch. Which way you choose depends on your setup, but the core is always: The upstream must not briefly disappear while Nginx is still pointing to it.
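In PM2 terms, that deployment discipline mostly comes down to cluster mode plus a graceful-ready handshake. A sketch of the relevant ecosystem.config.js fields; the script path is an assumption, and wait_ready requires that your app call process.send('ready') once it's actually listening:

```javascript
// ecosystem.config.js (sketch)
module.exports = {
  apps: [{
    name: "app",
    script: "dist/main.js",   // adjust to your build output
    instances: 2,
    exec_mode: "cluster",     // required for rolling reloads
    wait_ready: true,         // app signals readiness via process.send('ready')
    listen_timeout: 10000,    // ms to wait for the ready signal
    kill_timeout: 5000        // ms for old instances to drain in-flight requests
  }]
};
```

With this in place, `pm2 reload app` swaps instances one by one instead of dropping them all at once, which is exactly what keeps the upstream visible to Nginx.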

When Things Go Wrong: Act Without Panic

Here's a runbook flow you can really internalize. It's deliberately low-threshold because in an alarm situation you don't need “complex,” you need “clear.”

[Figure: Compact incident-response flowchart]

The first step is the most important: Scope. Instead of digging through logs, you immediately see how big the fire is. Is the system completely down (no TCP/HTTP)? Is it degraded (slow, 5xx)? Or does it only affect one route? If you separate this cleanly once, you avoid two hours of DB debugging when the disk was actually full.

Now you need a 5-minute diagnostic set. These are the few commands you can run in any situation to identify 80% of causes. These commands aren't glamorous, but they're exactly what you want in an alarm: always the same, always fast, always a signal.

# Host quick check
uptime
df -h
free -h

# Nginx quick check
sudo nginx -t
sudo systemctl status nginx --no-pager
sudo tail -n 80 /var/log/nginx/error.log

# PM2/Node quick check
pm2 status
pm2 logs --lines 120

And because the most common incidents in the VPS context are very recurring, you can train three typical patterns from the article (and in your own mind).

Pattern one is “502 Bad Gateway after deploy.” You see 502 spikes right after a restart. In Nginx error.log, you find hints like “connect() failed,” “upstream timed out,” or “refused.” It's almost never because NGINX is just “acting up.” It's almost always “upstream was briefly unavailable” or “upstream responded too slowly.” The mitigation is: immediate rollback or restart to a previous build, or an orderly reload, possibly adjust timeouts, and in the long term, build your deploy flow so the upstream doesn't disappear (cluster/reload/graceful shutdown).

Pattern two is “Everything is slow.” You see high load, high CPU, memory pressure, or slow DB queries. Mitigation can be: throttle traffic (rate limit), reduce workers, stop a heavy cron job, activate cache, or short-term vertical scaling. In the long term, the answer is you need metrics (host + app) so you can distinguish whether CPU, RAM, disk IO, or DB is the bottleneck.

Pattern three is “Suddenly down.” Very often it's disk full, an expired certificate, or a reboot without autostart. It sounds trivial, but that's exactly the reality. And that's why your stack is built this way: disk alerts, TLS monitoring, PM2 startup, unattended upgrades consciously configured.

After the incident comes the part that makes you stable in the long run: the mini-postmortem. You don't need a novel. You need a concise, blameless process. What was the trigger? What was the root cause? What helped, and what was missing? And which 1–3 action items prevent a repeat? This approach ensures you don't bleed twice in the same place.

If you enjoyed this article and want to read more interesting blog posts on similar topics, subscribe to my newsletter. I'll keep you up to date.

