Most teams don’t have a bad DevOps infrastructure. They have an incomplete one.
The CI/CD pipeline exists, but deployments still make everyone nervous. There’s monitoring, sort of – a few dashboards nobody checks. Backups run, probably. The problem isn’t any one missing piece. The problem is that the pieces were added in the wrong order, so nothing connects cleanly and everything feels fragile.
There’s a sequence that works. Here’s what it looks like.
Stage 1: Chaos → Functional DevOps infrastructure
Where you might be: Services running on individual servers, deployments done manually, someone SSH-ing into production to fix things.
The question: Can we deploy without downtime?
Until you can answer yes to that, nothing else matters. Monitoring a system you can’t reliably deploy to is pointless. CI/CD pipelines that push to a manually managed server are flawed from the start.
The work at this stage is foundational:
- Containerize everything. Move services into Docker containers with docker-compose for local development. This eliminates environment mismatch, the “works on my machine” problem, permanently.
- Add basic orchestration. Docker Swarm handles scheduling, restarts and load distribution. Health checks on every service so failures are caught automatically, not by users.
- Set up a reverse proxy with auto-TLS. Traefik handles incoming traffic and manages SSL certificates automatically. Manual certificate management will fail at the worst possible time.
- Manage secrets properly. Remove hardcoded credentials from everything. One leaked credential can end your company. This is not a step to defer. A minimal stack sketch covering all four pieces follows this list.
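To make the foundation concrete, here is a minimal sketch of a single Swarm stack file that touches all four pieces. Treat it as a sketch under assumptions: the image path, domain, port and secret name (registry.example.com/app, app.example.com, 8080, db_password) are placeholders, not values from the blueprint.

```yaml
# stack.yml -- minimal Stage 1 sketch; all names and values are placeholders
version: "3.8"

services:
  traefik:
    image: traefik:v2.11
    command:
      - --providers.docker.swarmMode=true
      - --entrypoints.websecure.address=:443
      - --certificatesresolvers.le.acme.email=ops@example.com
      - --certificatesresolvers.le.acme.storage=/letsencrypt/acme.json
      - --certificatesresolvers.le.acme.tlschallenge=true
    ports:
      - "443:443"
    volumes:
      - letsencrypt:/letsencrypt
      - /var/run/docker.sock:/var/run/docker.sock:ro

  app:
    image: registry.example.com/app:1.0.0     # placeholder image
    secrets:
      - db_password                           # injected at runtime, never in git
    healthcheck:                              # Swarm replaces containers that fail this
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 5s
      retries: 3
    deploy:
      replicas: 2
      labels:                                 # Traefik routes traffic and issues TLS certs
        - traefik.enable=true
        - traefik.http.routers.app.rule=Host(`app.example.com`)
        - traefik.http.routers.app.tls.certresolver=le
        - traefik.http.services.app.loadbalancer.server.port=8080

secrets:
  db_password:
    external: true   # created once with `docker secret create`, never committed

volumes:
  letsencrypt:
```

Deployed with `docker stack deploy -c stack.yml app`, this gives you containerized services, health-checked orchestration, automatic TLS and runtime secrets in one file.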
The trap: Skipping straight to CI/CD before this foundation is in place. A CI/CD pipeline that deploys a broken system just deploys it faster.
You’re done with Stage 1 when: You can deploy without touching production directly, SSL certificates renew on their own, and no credentials exist in any git repository.
Stage 2: Functional → Observable DevOps infrastructure
Where you might be: Deployments work, but you find out about problems when users complain.
The question: Can we see problems before users report them?
Most teams underestimate how much time gets burned at this stage. Not because the tooling is complex, but because they skip it entirely and go straight to building more features. Then an incident happens: no logs are centralized anywhere, no metrics exist, and debugging takes four times longer than it should.
What to work on:
- Centralized logging. Loki with Promtail collects logs from every container into one place. You should be able to investigate any issue without SSH access to any server.
- Metrics collection. Prometheus scrapes CPU, memory, network and disk usage per container in real time, typically via cAdvisor; basic container metrics need almost no custom configuration.
- Dashboards. Grafana with one dashboard per service minimum. Metrics that nobody looks at are useless – the dashboard is what makes monitoring real.
- Alerting. Start with 3–5 critical alerts only (see the sketch after this list). Alert fatigue is a real failure mode: too many notifications train the team to ignore all of them.
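As a sketch of what “start narrow” can look like, here is a first Prometheus rules file with three alerts. The expressions assume standard exporters (the `up` metric, node_exporter filesystem metrics, an `http_requests_total` counter), and the thresholds are illustrative assumptions, not the blueprint’s production values.

```yaml
# alerts.yml -- an assumed starting set of critical alerts; tune thresholds to your baseline
groups:
  - name: critical
    rules:
      - alert: ServiceDown
        expr: up == 0                # any scraped target is unreachable
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} has been unreachable for 2 minutes"

      - alert: HighErrorRate
        # assumes services export an http_requests_total counter with a status label
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "More than 5% of requests are failing"

      - alert: DiskAlmostFull
        # assumes node_exporter is running on each host
        expr: node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes < 0.10
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} has less than 10% disk space left"
```

Everything else starts life as a dashboard panel, not a page; promote a panel to an alert only after it has proven it predicts real incidents.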
The trap: Too many alerts, too soon. An engineer woken twice for something non-critical will start ignoring pages. Start narrow and expand.
You’re done with Stage 2 when: Mean time to detect an issue is under 5 minutes, you can investigate problems without touching servers and the team checks dashboards as a habit.
Our blueprint includes the exact Prometheus, Loki and Grafana configuration patterns for this stage along with the alert thresholds we use in production. Download it free here.
Stage 3: Observable → Reliable DevOps infrastructure
Where you might be: You see problems quickly, but recovery is still manual and stressful.
The question: Can we recover from failures without human intervention?
Visibility without recovery procedures means you watch things break in real time. The work at this stage turns incident response from a fire drill into a routine.
- Automated backups. Daily database dumps and volume snapshots, uploaded to S3. Test restores monthly; untested backups are just Schrödinger’s backups. Data loss is the only truly irrecoverable failure.
- Automated recovery. Failed containers restart and replace themselves via health checks. Recovery time drops from hours to seconds.
- CI/CD pipeline. Code pushed to Bitbucket triggers a build, the image is pushed to Harbor, and a Portainer webhook redeploys the stack (a pipeline sketch follows this list). Deployment becomes boring, which is the goal.
- Written disaster recovery plan. Documented recovery procedures for each service, practiced quarterly. The time to create a DR plan is before you need it.
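A minimal sketch of that pipeline as a bitbucket-pipelines.yml, assuming Harbor credentials and the Portainer webhook URL live in repository variables (HARBOR_USER, HARBOR_PASSWORD, PORTAINER_WEBHOOK_URL and the registry path are assumed names, not values from the blueprint):

```yaml
# bitbucket-pipelines.yml -- minimal sketch; registry path and variable names are assumptions
pipelines:
  branches:
    main:
      - step:
          name: Build and push image
          services:
            - docker
          script:
            - docker build -t harbor.example.com/project/app:$BITBUCKET_COMMIT .
            - echo "$HARBOR_PASSWORD" | docker login harbor.example.com -u "$HARBOR_USER" --password-stdin
            - docker push harbor.example.com/project/app:$BITBUCKET_COMMIT
      - step:
          name: Redeploy via Portainer webhook
          script:
            # Portainer pulls the freshly pushed image and restarts the stack's services
            - curl -fsS -X POST "$PORTAINER_WEBHOOK_URL"
```

Two steps, no cleverness: any engineer on the team can read it, and any engineer can debug it.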
The trap: Building a CI/CD setup so complex only one person understands it. Keep it simple. A pipeline that any engineer can debug is worth more than an elegant one only the author can maintain.
You’re done with Stage 3 when: Any single service recovers in under 15 minutes, deployments are one-click or fully automated and the team has actually practiced a database restore.
Stage 4: Reliable → Scalable DevOps infrastructure
Where you might be: Things work reliably, but growth requires manual intervention.
The question: Can we handle 10x traffic without 10x the team?
This is where most teams want to start. It’s the last stage to address, not the first.
- Auto-scaling. Define replica counts and resource thresholds per service, and wire them to a scaling trigger; Swarm spreads load across replicas and replaces failed ones, but it won’t add replicas on its own. Manual scaling means someone has to be awake when traffic spikes. See the sketch after this list.
- Performance optimization. Database query tuning and a Redis caching layer before throwing more hardware at the problem. Scaling is expensive; optimization is cheap.
- Cost monitoring. Track infrastructure costs per service and alert on unusual increases. Scaling without cost visibility means runaway bills.
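As a sketch, the per-service deploy block might look like this; the replica counts and limits are illustrative assumptions, and the scaling trigger (anything that calls `docker service scale` when a metric threshold is crossed) is up to you:

```yaml
# excerpt from a stack file -- replica counts and resource limits are placeholders
services:
  app:
    image: registry.example.com/app:1.0.0
    deploy:
      replicas: 4                  # baseline capacity
      update_config:
        parallelism: 1             # roll one replica at a time for zero-downtime updates
        order: start-first
      resources:
        limits:
          cpus: "1.0"              # hard ceiling per container
          memory: 512M
        reservations:
          cpus: "0.25"             # guaranteed minimum, informs Swarm's scheduling
          memory: 128M
```

A scheduled job that watches the Stage 2 Prometheus metrics and runs `docker service scale app_app=8` when thresholds are crossed is the simplest form of the automation; the point is that the decision is encoded, not made by whoever happens to be awake.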
You’re done with Stage 4 when: Traffic spikes are handled automatically, time to add a new service is under two hours and the infrastructure team is never the bottleneck.
Why the order matters
Teams get into trouble when they build these stages out of sequence. Auto-scaling a system with no observability means you have no idea what it’s doing at scale. Adding CI/CD to a system with no containers means you’re automating a fundamentally broken deployment model.
The sequence isn’t arbitrary. Each stage creates the preconditions for the next one to work properly.
The good news: each stage is achievable quickly when you have the right tooling and configuration patterns in front of you.
Want the full technical spec for each stage?
The Ascendro DevOps Infrastructure Blueprint covers every tool, configuration pattern and success metric across all four stages: the same setup we run for production clients in automotive, manufacturing and fintech. Download it free.
As a dedicated software development team with expertise in nearshore software development, software development outsourcing, IT staff augmentation and more, we specialize in delivering innovative solutions for a range of industries, from custom manufacturing software to business process optimization, ensuring our clients can operate competitively and efficiently. See our software development projects here.

