[dev.icinga.com #11917] Icinga2 agent OOM's when replaying large transaction logs #4271

Closed
icinga-migration opened this issue Jun 8, 2016 · 1 comment
Labels: area/distributed (Distributed monitoring: master, satellites, clients), bug (Something isn't working), core/crash (Shouldn't happen, requires attention)

Comments

@icinga-migration

This issue has been migrated from Redmine: https://dev.icinga.com/issues/11917

Created by ziaunys on 2016-06-08 19:11:31 +00:00

Assignee: (none)
Status: New
Target Version: (none)
Last Update: 2016-06-08 19:11:31 +00:00 (in Redmine)

Icinga Version: 2.4.10
Backport?: Not yet backported
Include in Changelog: 1

In my Icinga2 test environment I have around 80 agents that report to a single master instance. While testing, I left the Icinga2 master off for over a day. The next time I tried to start the Icinga2 instance, it quickly consumed all the memory on the system (2 GB); normally it uses around 500-800 MB. It kept doing this across several restarts and then finally stabilized. I don't have log data to support this, but we have never seen this issue with our production Icinga2 master. The only differences are that the staging Icinga2 master was offline for over a day while its agents kept running, and that I'm not limiting log_duration on the Endpoint objects. I do have perfdata in Graphite that shows a large replay log lag when starting the master after a day.
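For reference, limiting the replay log would mean setting log_duration on the master-side Endpoint definitions for the agents. A minimal sketch follows; the endpoint name and the duration value are illustrative, not taken from this environment:

```
// Hypothetical master-side endpoint definition for one agent.
// log_duration caps how much replay (transaction) log the master keeps
// for this endpoint while it is disconnected; the value is illustrative.
object Endpoint "agent01.example.com" {
  host = "agent01.example.com"
  log_duration = 2h
}
```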

I think another part of the problem is that the test instance of PostgreSQL wasn't keeping up with the data produced by replaying the logs; it seems like the work was queuing indefinitely. My impression is that the intent is to limit the length of the queue and pause replaying the log until the queue drains, but this is based on my naive understanding of the Icinga2 internals, so I could be wrong.

@icinga-migration icinga-migration added bug Something isn't working area/distributed Distributed monitoring (master, satellites, clients) labels Jan 17, 2017
@gunnarbeutner gunnarbeutner added the core/crash Shouldn't happen, requires attention label Feb 7, 2017
@dnsmichi
Contributor

dnsmichi commented Sep 6, 2018

This is a fairly old issue, and I believe it no longer occurs. Our recommendation for clients is to use log_duration = 0 in order to keep the reconnect footprint small.
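For illustration, that recommendation applied to an agent's endpoint definition for its master might look like the sketch below; the endpoint name is hypothetical:

```
// Hypothetical agent-side endpoint definition for the master.
// log_duration = 0 disables the replay log for this connection,
// so no transaction log accumulates while the peer is disconnected.
object Endpoint "master.example.com" {
  host = "master.example.com"
  log_duration = 0
}
```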

@dnsmichi dnsmichi closed this as completed Sep 6, 2018