Docker Swarm Deployment — Session Summary
Starting Point
A working .NET API (AverAzure), verified locally via docker compose and on the VPS via compose. Image pushed to GHCR. Goal: deploy via Docker Swarm using docker-stack.yml.
Problem 1 — Subnet Pool Exhaustion (Coolify VPS)
What happened
First `docker swarm init` and `docker stack deploy` attempt on the Coolify VPS. Every service was rejected with:

```text
invalid pool request: Pool overlaps with other one on this address space
```
Diagnosis
- `docker service ps aver_api --no-trunc` revealed the pool overlap error
- `docker network inspect $(docker network ls -q) | grep Subnet` showed Coolify had exhausted most of the `10.0.0.0/8` address space with dozens of bridge networks (a per-network variant is sketched below)
- `ip route` confirmed the host routing table had `10.0.x.x` routes for every Coolify network
- Swarm's overlay network kept getting assigned `10.0.0.0/24` or `10.0.1.0/24` — both already taken
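For reference, the same information can be pulled per network in one pass; a sketch using Docker's standard Go-template formatting:

```bash
# Print each network's name alongside its subnet(s): the labelled
# equivalent of the grep above. Networks without IPAM config print nothing.
docker network inspect \
  --format '{{.Name}}: {{range .IPAM.Config}}{{.Subnet}} {{end}}' \
  $(docker network ls -q)
```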
Attempts
- Tried specifying explicit subnets in the stack file (`10.0.10.0/24`, `192.168.100.0/24`, `172.20.0.0/24`, `172.16.1.0/24`) — all clashed (see the sketch after this list)
- Modified `daemon.json` to add `172.30.0.0/16` as a second pool — ignored by Swarm for overlay networks
- Manually recreated the ingress network on `172.16.0.0/24` — broke the routing mesh
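For illustration, the explicit-subnet attempts looked roughly like this in the stack file (a reconstruction, using the overlay network name from the stack and one of the subnets tried):

```yaml
# Hypothetical reconstruction of the attempted override; every subnet tried still clashed
networks:
  aver-overlay:
    driver: overlay
    ipam:
      config:
        - subnet: 10.0.10.0/24
```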
Root Cause
Docker's default address pool `10.0.0.0/8` was exhausted by Coolify. Swarm overlay networks couldn't find a free subnet. Additionally, manually recreating the ingress network broke the routing mesh permanently on that Swarm instance.
Decision
Full Docker reinstall on the VPS — Coolify was broken anyway.
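For the record, a full reinstall on an apt-based host looks roughly like this (a sketch following Docker's documented uninstall steps, not the exact commands run in this session):

```bash
# Remove Docker packages and wipe all state (destructive: networks, images, volumes all go)
sudo apt-get purge docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo rm -rf /var/lib/docker /var/lib/containerd
# Reinstall via Docker's convenience script
curl -fsSL https://get.docker.com | sudo sh
```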
Problem 2 — Subnet Clash on Fresh Install
What happened
Even after a full Docker reinstall and a fresh `docker swarm init`, the same pool overlap error appeared immediately.
Diagnosis
- `docker network ls` showed only default networks — clean slate confirmed
- `docker network inspect bridge ingress | grep Subnet` revealed:
  - `docker0` bridge on `10.0.0.0/24`
  - `ingress` on `10.0.0.0/24` — same subnet, clash with `docker0`
  - `docker_gwbridge` on `10.0.1.0/24` — created by Swarm itself
  - `aver_aver-overlay` also trying to get `10.0.1.0/24` — clash with `docker_gwbridge`
- This is a known Docker Swarm issue — the default pool `10.0.0.0/8` conflicts with Docker's own internal networks (`docker0`, `docker_gwbridge`)
Fix
Docker documentation states: "Default address pools can only be configured on swarm init and cannot be altered after cluster creation."

```bash
docker swarm leave --force
docker swarm init --default-addr-pool 10.20.0.0/16 --default-addr-pool-mask-length 24
```

This moves all Swarm overlay network allocation to `10.20.x.x` — completely clear of Docker's internal networks. The stack file required no explicit subnet declarations after this.
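A quick way to confirm the new pool took effect (standard Docker CLI formatting):

```bash
# After re-init, Swarm-created networks should land in 10.20.x.x
docker network inspect ingress --format '{{ (index .IPAM.Config 0).Subnet }}'
# expected output: something like 10.20.0.0/24
```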
Problem 3 — RabbitMQ Consumer Crashing App on Startup
What happened
The API container kept crashing on Swarm because RabbitMQ wasn't ready when the API started. Without `depends_on` (not supported in Swarm), the app crashed during startup.
Root Cause
The `RabbitMqConnection` constructor called `factory.CreateConnection()` eagerly. When DI resolved `RabbitMqConnection` during host startup and RabbitMQ wasn't ready, it threw `BrokerUnreachableException`. This propagated out of `BackgroundService.StartAsync()` with `abortOnFirstException: true`, crashing the entire host.
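The wiring that triggers this is the usual hosted-service registration; a sketch of the assumed Program.cs setup (registration shape assumed, not taken from the repo):

```csharp
var builder = WebApplication.CreateBuilder(args);

// Assumed registration: the connection wrapper as a singleton, the consumer
// as a hosted service. Before the fix, building RabbitMqConsumer at startup
// resolved RabbitMqConnection, whose constructor dialled RabbitMQ eagerly
// and threw BrokerUnreachableException if the broker wasn't up yet.
builder.Services.AddSingleton<RabbitMqConnection>();
builder.Services.AddHostedService<RabbitMqConsumer>();

var app = builder.Build();
app.Run();
```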
Fix — Two changes
`RabbitMqConnection.cs` — Lazy initialisation: the constructor stores only the factory configuration; `CreateConnection()` is deferred to `CreateChannel()`, the first actual use. (Snippet below reconstructed with assumed field declarations and config key for completeness.)
```csharp
private readonly ConnectionFactory _factory;
private readonly object _lock = new();
private IConnection? _connection;

public RabbitMqConnection(IConfiguration config)
{
    // Config key assumed for illustration; no network I/O happens here.
    var connectionString = config.GetConnectionString("RabbitMq");

    _factory = new ConnectionFactory
    {
        Uri = new Uri(connectionString),
        DispatchConsumersAsync = true,
        AutomaticRecoveryEnabled = true,
        NetworkRecoveryInterval = TimeSpan.FromSeconds(5)
    };
}

public IModel CreateChannel()
{
    lock (_lock)
    {
        // First actual use dials RabbitMQ (and re-dials if the connection dropped).
        if (_connection == null || !_connection.IsOpen)
            _connection = _factory.CreateConnection();

        return _connection.CreateModel();
    }
}
```
`RabbitMqConsumer.cs` — Retry loop in `ExecuteAsync`: wraps the entire channel setup in a while loop with try/catch. If the connection fails, it logs a warning, waits 5 seconds, and retries. The exception never propagates out of `ExecuteAsync`, so the host never aborts.

```csharp
protected override async Task ExecuteAsync(CancellationToken ct)
{
    while (!ct.IsCancellationRequested)
    {
        try
        {
            _channel = _connection.CreateChannel();
            // ... setup queues, consumers ...

            // Park here until shutdown; cancellation unwinds via OperationCanceledException.
            await Task.Delay(Timeout.Infinite, ct);
        }
        catch (OperationCanceledException) { break; }
        catch (Exception ex)
        {
            _logger.LogWarning(ex, "[CONSUMER] Failed to connect. Retrying in 5s...");
            _channel?.Dispose();
            _channel = null;
            await Task.Delay(TimeSpan.FromSeconds(5), ct);
        }
    }
}
```
Result: App starts cleanly, consumer retries in background, connects when RabbitMQ is ready. This is the canonical pattern — same approach used internally by MassTransit and NServiceBus.
Problem 4 — Health Endpoint Hanging
What happened
All three services showing 1/1 replicas. `docker service logs` showed `[CONSUMER READY]` and Kestrel `Now listening on: http://[::]:8080`. But `curl http://localhost:8080/health` hung indefinitely.
Diagnosis
- `ss -tlnp | grep 8080` — port listening on host, mesh accepting connections
- `docker service inspect aver_api | grep VirtualIPs` — VIP assigned correctly on ingress
- Direct container IP also hung — ruled out mesh routing as the issue
- `curl -4 http://localhost:8080/health` — worked immediately
Root Cause
Kestrel binds to `http://[::]:8080` — the IPv6 wildcard. Swarm overlay networks have `EnableIPv6: false`. When `curl localhost:8080` runs on Linux, it resolves localhost to `::1` (IPv6 loopback) first. The request enters the Swarm mesh, which is IPv4 only, and hangs instead of failing.
In Compose this wasn't an issue because the bridge network has IPv6 routing set up properly. In Swarm overlay it doesn't.
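Forcing each address family makes the split visible, consistent with the diagnosis above:

```bash
curl -4 --max-time 5 http://localhost:8080/health   # IPv4 path: responds immediately
curl -6 --max-time 5 http://localhost:8080/health   # IPv6 path: hangs until the timeout
```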
Fix
Change `ASPNETCORE_URLS` in docker-stack.yml from `http://+:8080` to `http://0.0.0.0:8080` — forces Kestrel to bind IPv4 only, consistent with what the Swarm mesh routes.

```yaml
environment:
  - ASPNETCORE_URLS=http://0.0.0.0:8080
```

Note: this only affects `curl localhost` on the server. Browser access via domain name or public IP works regardless because the domain resolves to an IPv4 address here.
Final State
- Docker Swarm running on the VPS with a clean address pool (`10.20.0.0/16`)
- `aver` stack deployed: `aver_api`, `aver_rabbitmq`, `aver_seq` all `1/1`
- API responding on http://31.97.237.138:8080
- Consumer connecting to RabbitMQ with retry loop
- Image in GHCR: `ghcr.io/abhishek052off/averazure:latest`
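A quick check of that state (stack name `aver` as above):

```bash
docker stack services aver              # aver_api, aver_rabbitmq, aver_seq, all 1/1
curl http://31.97.237.138:8080/health   # public IPv4, so no localhost/IPv6 ambiguity
```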
Key Learnings
- `docker swarm init --default-addr-pool` must be set at init time — cannot be changed after
- Docker's default pool conflicts with its own internal networks on a fresh install — always use a custom pool
- Swarm has no `depends_on` — apps must be resilient to dependency unavailability at startup
- `BackgroundService` exceptions during `StartAsync` abort the host — always catch inside `ExecuteAsync`
- Swarm overlay networks are IPv4 only by default — bind Kestrel to `0.0.0.0`, not `+`
- Swarm routing mesh (IPVS) handles port publishing — no host-level port listeners needed
What's Next
- Fix `ASPNETCORE_URLS=http://0.0.0.0:8080` in the stack file, rebuild the image, redeploy
- Verify full flow on Swarm — upload invoice, event through RabbitMQ, logs in Seq
- GitHub Actions CI/CD pipeline — auto build and push to GHCR on push to main (a workflow sketch follows this list)
- Interview narrative — be able to defend every decision made here
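For the CI/CD item, a minimal workflow sketch, assuming the standard docker/login-action and docker/build-push-action flow and the GHCR image name above; untested here:

```yaml
# .github/workflows/build.yml (hypothetical, not yet in the repo)
name: build-and-push
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ghcr.io/abhishek052off/averazure:latest
```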