Docker Swarm Deployment — Session Summary

Starting Point

A working .NET API (AverAzure), verified locally with docker compose and on the VPS via compose. Image pushed to GHCR. Goal: deploy via Docker Swarm using docker-stack.yml.


Problem 1 — Subnet Pool Exhaustion (Coolify VPS)

What happened

First docker swarm init and docker stack deploy attempt on the Coolify VPS. Every service was rejected with:

invalid pool request: Pool overlaps with other one on this address space

Diagnosis

  • docker service ps aver_api --no-trunc revealed the pool overlap error
  • docker network inspect $(docker network ls -q) | grep Subnet showed Coolify had exhausted most of the 10.0.0.0/8 address space with dozens of bridge networks (see the one-pass version below)
  • ip route confirmed host routing table had 10.0.x.x routes for every Coolify network
  • Swarm's overlay network kept getting assigned 10.0.0.0/24 or 10.0.1.0/24 — both already taken
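
For the record, the same subnet survey as a one-pass sketch (both commands are standard docker CLI; output shape varies by version):

# Print every network's name and subnet in one pass
docker network ls -q | xargs docker network inspect \
  --format '{{.Name}}: {{range .IPAM.Config}}{{.Subnet}} {{end}}'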

Attempts

  • Tried specifying explicit subnets in the stack file (10.0.10.0/24, 192.168.100.0/24, 172.20.0.0/24, 172.16.1.0/24) — all clashed
  • Modified daemon.json to add 172.30.0.0/16 as a second pool (shape sketched below) — ignored by Swarm for overlay networks
  • Manually recreated the ingress network on 172.16.0.0/24 — this broke the routing mesh
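
For reference, the daemon.json shape used in that attempt, sketched with only the added entry. This key only governs pools for locally scoped (bridge) networks; overlay allocation is set at swarm init, which is what Problem 2's fix does:

{
  "default-address-pools": [
    { "base": "172.30.0.0/16", "size": 24 }
  ]
}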

Root Cause

Docker's default address pool 10.0.0.0/8 was exhausted by Coolify. Swarm overlay networks couldn't find a free subnet. Additionally, manually recreating the ingress network broke the routing mesh permanently on that Swarm instance.

Decision

Full Docker reinstall on the VPS — Coolify was broken anyway.


Problem 2 — Subnet Clash on Fresh Install

What happened

Even after the full Docker reinstall and a fresh docker swarm init, the same pool overlap error appeared immediately.

Diagnosis

  • docker network ls showed only default networks — clean slate confirmed
  • docker network inspect bridge ingress | grep Subnet revealed:
    • docker0 bridge on 10.0.0.0/24
    • ingress on 10.0.0.0/24 — same subnet, clash with docker0
    • docker_gwbridge on 10.0.1.0/24 — created by Swarm itself
    • aver_aver-overlay also trying to get 10.0.1.0/24 — clash with docker_gwbridge
  • This is a known Docker Swarm issue — the default pool 10.0.0.0/8 conflicts with Docker's own internal networks (docker0, docker_gwbridge)

Fix

Docker documentation states: "Default address pools can only be configured on swarm init and cannot be altered after cluster creation."

docker swarm leave --force
docker swarm init --default-addr-pool 10.20.0.0/16 --default-addr-pool-mask-length 24

This moves all Swarm overlay network allocation to 10.20.x.x, completely clear of Docker's internal networks. The stack file needed no explicit subnet declarations after this.
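
The network declaration in docker-stack.yml can stay minimal after this. A sketch, assuming the overlay network is declared as aver-overlay (the name behind aver_aver-overlay seen earlier):

networks:
  aver-overlay:
    driver: overlay
    # no subnet/ipam block needed; Swarm allocates from 10.20.0.0/16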


Problem 3 — RabbitMQ Consumer Crashing App on Startup

What happened

The API container kept crashing on Swarm because RabbitMQ wasn't ready when the API started. docker stack deploy ignores depends_on, so nothing held the app back until the broker was up, and it crashed during startup.

Root Cause

RabbitMqConnection constructor called factory.CreateConnection() eagerly. When DI resolved RabbitMqConnection during host startup and RabbitMQ wasn't ready, it threw BrokerUnreachableException. This propagated out of BackgroundService.StartAsync() with abortOnFirstException: true, crashing the entire host.
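
For contrast, the shape of the original constructor, reconstructed from the description above rather than taken from the actual source:

// Anti-pattern (reconstructed): connecting inside the constructor means the
// first DI resolution during host startup throws BrokerUnreachableException
// whenever the broker isn't reachable yet. Config key name is illustrative.
public RabbitMqConnection(IConfiguration config)
{
    _factory = new ConnectionFactory { Uri = new Uri(config["RabbitMq:ConnectionString"]) };
    _connection = _factory.CreateConnection(); // throws at startup if RabbitMQ is down
}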

Fix — Two changes

RabbitMqConnection.cs — Lazy initialisation: Constructor stores only the factory configuration. CreateConnection() is deferred to CreateChannel() — first actual use.

// Factory built once; connection created lazily on first use
private readonly ConnectionFactory _factory;
private readonly object _lock = new();
private IConnection? _connection;

public RabbitMqConnection(IConfiguration config)
{
    // Key name illustrative; read the broker URI from wherever the app stores it
    var connectionString = config["RabbitMq:ConnectionString"];

    _factory = new ConnectionFactory
    {
        Uri = new Uri(connectionString),
        DispatchConsumersAsync = true,
        AutomaticRecoveryEnabled = true,
        NetworkRecoveryInterval = TimeSpan.FromSeconds(5)
    };
}

public IModel CreateChannel()
{
    lock (_lock)
    {
        if (_connection == null || !_connection.IsOpen)
            _connection = _factory.CreateConnection();

        return _connection.CreateModel();
    }
}

RabbitMqConsumer.cs — Retry loop in ExecuteAsync: Wraps the entire channel setup in a while loop with try/catch. If connection fails, logs a warning, waits 5 seconds, and retries. Exception never propagates out of ExecuteAsync so the host never aborts.

protected override async Task ExecuteAsync(CancellationToken ct)
{
    while (!ct.IsCancellationRequested)
    {
        try
        {
            _channel = _connection.CreateChannel();
            // ... setup queues, consumers ...
            await Task.Delay(Timeout.Infinite, ct);
        }
        catch (OperationCanceledException) { break; }
        catch (Exception ex)
        {
            _logger.LogWarning(ex, "[CONSUMER] Failed to connect. Retrying in 5s...");
            _channel?.Dispose();
            _channel = null;
            await Task.Delay(TimeSpan.FromSeconds(5), ct);
        }
    }
}

Result: App starts cleanly, consumer retries in background, connects when RabbitMQ is ready. This is the canonical pattern — same approach used internally by MassTransit and NServiceBus.
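
For context, a minimal wiring sketch. The registration calls are standard Microsoft.Extensions.DependencyInjection; the actual Program.cs wasn't shown in the session:

// Neither registration touches the broker at startup: the connection wrapper
// only stores a ConnectionFactory, and the consumer retries inside ExecuteAsync.
builder.Services.AddSingleton<RabbitMqConnection>();
builder.Services.AddHostedService<RabbitMqConsumer>();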


Problem 4 — Health Endpoint Hanging

What happened

All three services showed 1/1 replicas. docker service logs showed [CONSUMER READY] and Kestrel's Now listening on: http://[::]:8080. But curl http://localhost:8080/health hung indefinitely.

Diagnosis

  • ss -tlnp | grep 8080 — port listening on host, mesh accepting connections
  • docker service inspect aver_api | grep VirtualIPs — VIP assigned correctly on ingress
  • Direct container IP also hung — ruled out mesh routing as the issue
  • curl -4 http://localhost:8080/health worked immediately (reproduced below)
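
The split reproduces directly with curl's address-family flags (a sketch; --max-time just bounds the hang):

# IPv4 path responds; IPv6 path enters the IPv4-only mesh and hangs
curl -4 --max-time 5 http://localhost:8080/health   # responds
curl -6 --max-time 5 http://localhost:8080/health   # times out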

Root Cause

Kestrel binds to http://[::]:8080, the IPv6 wildcard. Swarm overlay networks have EnableIPv6: false. When curl localhost:8080 runs on Linux, localhost resolves to ::1 (IPv6 loopback) first. The request enters the Swarm mesh, which is IPv4-only, and hangs instead of failing fast.

In Compose this wasn't an issue because the bridge network has IPv6 routing set up properly. In Swarm overlay it doesn't.
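
The IPv6 setting can be read directly off the overlay network (network name taken from the stack above; the format string is standard Go templating):

docker network inspect aver_aver-overlay --format '{{.EnableIPv6}}'
# prints: false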

Fix

Change ASPNETCORE_URLS in docker-stack.yml from http://+:8080 to http://0.0.0.0:8080 — forces Kestrel to bind IPv4 only, consistent with what the Swarm mesh routes.

environment:
  - ASPNETCORE_URLS=http://0.0.0.0:8080

Note: This only affects curl localhost on the server. Browser access via domain name or public IP works regardless because domain names resolve to IPv4 addresses.
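
A quick way to confirm the new binding from inside the running container, assuming the image ships ss (netstat -tln works the same way if not):

# Expect an IPv4 socket (0.0.0.0:8080) after the fix; it was [::]:8080 before
docker exec $(docker ps -qf name=aver_api) ss -tln | grep 8080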


Final State

  • Docker Swarm running on VPS with clean address pool (10.20.0.0/16)
  • aver stack deployed: aver_api, aver_rabbitmq, aver_seq all 1/1
  • API responding on http://31.97.237.138:8080
  • Consumer connecting to RabbitMQ with retry loop
  • Image in GHCR: ghcr.io/abhishek052off/averazure:latest

Key Learnings

  • docker swarm init --default-addr-pool must be set at init time — cannot change after
  • Docker's default pool conflicts with its own internal networks on a fresh install — always use a custom pool
  • Swarm has no depends_on — apps must be resilient to dependency unavailability at startup
  • BackgroundService exceptions during StartAsync abort the host — always catch inside ExecuteAsync
  • Swarm overlay networks are IPv4 only by default — bind Kestrel to 0.0.0.0 not +
  • Swarm routing mesh (IPVS) handles port publishing — no host-level port listeners needed

What's Next

  • Fix ASPNETCORE_URLS=http://0.0.0.0:8080 in stack file, rebuild image, redeploy
  • Verify full flow on Swarm — upload invoice, event through RabbitMQ, logs in Seq
  • GitHub Actions CI/CD pipeline — auto build and push to GHCR on push to main
  • Interview narrative — be able to defend every decision made here