[dev.icinga.com #11917] Icinga2 agent OOM's when replaying large transaction logs #4271
Labels
area/distributed: Distributed monitoring (master, satellites, clients)
bug: Something isn't working
core/crash: Shouldn't happen, requires attention
This issue has been migrated from Redmine: https://dev.icinga.com/issues/11917
Created by ziaunys on 2016-06-08 19:11:31 +00:00
Assignee: (none)
Status: New
Target Version: (none)
Last Update: 2016-06-08 19:11:31 +00:00 (in Redmine)
In my Icinga2 test environment I have around 80 agents that report to a single master instance. While testing, I left the Icinga2 master off for over a day. The next time I started the master, it quickly consumed all the memory on the system (2 GB). Normally it uses around 500-800 MB. The same thing happened on several consecutive restarts before it finally stabilized. I don't have log data to support this, but we have never seen this issue with our production Icinga2 master. The only differences are that the stage master was offline for over a day while its agents kept running, and that I'm not limiting log_duration on the Endpoint object. I do have perfdata in Graphite that shows a large log lag when starting the master after a day.
I think another part of the problem is that the test PostgreSQL instance wasn't keeping up with the data produced by replaying the logs, so the queue of pending writes appeared to grow without bound. It seems like the intent is to limit the length of the queue and pause replaying the log until the queue drains, but this is based on my naive understanding of the Icinga2 internals, so I could be wrong.
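To illustrate the behavior I would expect, here is a minimal sketch of a replay loop with a bounded queue. It is not Icinga2 code: `replay_with_backpressure`, `write_to_db`, and `max_pending` are hypothetical names I made up, and the sketch assumes a single writer thread draining the queue. The point is only that the replay side blocks once the queue is full, so memory use stays capped while the database catches up.

```python
import queue
import threading


def replay_with_backpressure(log_entries, write_to_db, max_pending=1000):
    """Replay entries through a bounded queue and return the peak queue depth.

    Hypothetical illustration only; none of these names come from Icinga2.
    Error handling is omitted: if write_to_db raises, the producer would block.
    """
    pending = queue.Queue(maxsize=max_pending)  # hard cap on queued entries
    done = object()  # sentinel telling the writer thread to stop
    peak = 0

    def writer():
        # Drain the queue one entry at a time, like a slow database writer.
        while True:
            item = pending.get()
            if item is done:
                break
            write_to_db(item)

    worker = threading.Thread(target=writer)
    worker.start()

    for entry in log_entries:
        # put() blocks while the queue is full, which pauses the replay
        # until the writer has drained at least one entry.
        pending.put(entry)
        peak = max(peak, pending.qsize())

    pending.put(done)
    worker.join()
    return peak


if __name__ == "__main__":
    written = []
    peak = replay_with_backpressure(range(500), written.append, max_pending=10)
    print(len(written), peak <= 10)
```

With an unbounded queue the replay would read the whole log into memory as fast as it can, which matches what I observed. With the cap, the replay speed is limited by how fast the database accepts writes.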