Self-hosted Words.Cloud container OOM below container limit

james-ontra · June 18, 2026, 2:54pm

We’re self-hosting the words-cloud docker image (version 25.6, also testing out 26.5) with a ~2GB memory resource limit. A managed OOM is being thrown (System.OutOfMemoryException) from the compare online API while the container memory peak is at ~1.5GiB. This seems to line up with .NET’s default 75% hard limit.

I also observed that the default ceiling is 1536MiB, as reported by GC.GetGCMemoryInfo().TotalAvailableMemoryBytes. Setting DOTNET_GCHeapHardLimitPercent=0x5A (i.e. 90%) raises it to 1843MiB. With the hard limit raised, concurrent compares that previously hit OOM at a parallelism of around 5 now run cleanly up to 6. So this change seems to help at the margin.
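
For reference, the arithmetic behind those two ceilings can be sketched as below. This is only a rough Python sketch: it assumes the ~2GB limit is exactly 2 GiB (2048 MiB) and that the result is truncated to whole MiB, and the helper names are hypothetical rather than part of any .NET API. The one runtime detail it relies on is that the DOTNET_GCHeapHardLimitPercent value is read as hexadecimal.

```python
def parse_gc_percent(env_value: str) -> int:
    """Parse a DOTNET_GCHeapHardLimitPercent value, which the runtime reads as hex."""
    return int(env_value, 16)


def heap_hard_limit_mib(container_limit_mib: int, percent: int) -> int:
    """GC heap hard limit in whole MiB for a given container limit and percentage."""
    return container_limit_mib * percent // 100


# Assumption: the ~2GB container limit is exactly 2 GiB.
container_limit_mib = 2048

# 75 is the documented default percentage when a container memory limit is set.
default_limit = heap_hard_limit_mib(container_limit_mib, 75)
raised_limit = heap_hard_limit_mib(container_limit_mib, parse_gc_percent("0x5A"))

print(f"default (75%): {default_limit} MiB")
print(f"raised (0x5A = {parse_gc_percent('0x5A')}%): {raised_limit} MiB")
```

Under those assumptions the two results match the 1536MiB and 1843MiB figures above.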

However, at 90% the peak usage is right up to the ~2GB memory limit so this leaves little room for native memory. While I’m seeing graceful per-request OOM, the concern is the risk of hard kernel OOM-kills causing full pod restarts. For example, on 26.5 at least, I saw a one-off SIGSEGV under a test with high concurrency.

Do you recommend configuring DOTNET_GCHeapHardLimitPercent or DOTNET_GCHeapHardLimit? Or would you advise that we don’t override the default? If so, what value would you suggest? Some guidance on the safe headroom between the GC heap hard limit and the container memory limit would be helpful.

Thanks.

yaroslaw.ekimov · June 18, 2026, 3:43pm

Given your setup, I wouldn’t override any of the defaults. Instead, I’d raise the container memory limit, maybe up to 3GB. If you override the heap limit to 90%, the next allocation could push the container over its limit, triggering an OOM kill and a pod restart. Alternatively, you could set up horizontal scaling to increase the number of pods when your load is high.
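
To put rough numbers on that trade-off, here is a minimal sketch of how much memory is left outside the GC heap cap under each option. It assumes binary units (2 GiB and 3 GiB) and the default 75% hard limit; `headroom_mib` is a hypothetical helper, and actual native memory usage is not modeled.

```python
MIB_PER_GIB = 1024


def headroom_mib(container_limit_gib: int, heap_percent: int) -> int:
    """Memory (MiB) left for native allocations once the GC heap cap is reached."""
    limit_mib = container_limit_gib * MIB_PER_GIB
    heap_cap_mib = limit_mib * heap_percent // 100
    return limit_mib - heap_cap_mib


# (container limit in GiB, GC heap hard limit percent); values are illustrative.
options = {
    "2 GiB limit, 75% heap cap": (2, 75),
    "2 GiB limit, 90% heap cap": (2, 90),
    "3 GiB limit, 75% heap cap": (3, 75),
}

for label, (limit_gib, percent) in options.items():
    print(f"{label}: {headroom_mib(limit_gib, percent)} MiB headroom")
```

Whether any of these margins is enough depends on the native memory the workload actually needs, which would have to be measured.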

james-ontra · June 18, 2026, 4:43pm

Thanks for the feedback. I was thinking we’re possibly leaving some memory on the table, but perhaps the safer buffer at 75% is worth the cost at the end of the day. I think what you’re alluding to is that any allocation carries a risk once current usage is high enough.

So it makes sense to forgo adjusting the heap hard limit for now. We’ll look into tuning our autoscaling instead.

Thanks again.