[dev.icinga.com #11020] Master reloads with agents generate false alarms #3871
Comments
Updated by mfriedrich on 2016-01-25 09:59:19 +00:00
Updated by mfriedrich on 2016-01-25 10:30:57 +00:00
Updated by ziaunys on 2016-01-28 18:08:11 +00:00 tgelf wrote:
I just started to encounter this issue. I'm not sure whether it's because I now have a lot of agents, 270 in total. In my environment, when Puppet runs and adds a new host, Icinga reloads and most of the agent cluster-zone checks fail once. I have 2 attempts set, so sometimes a handful of checks will fail and page our on-call person. That is confusing because it's usually a random set of hosts, and from the on-call person's perspective it looks like a bunch of hosts have gone down.
Updated by mfriedrich on 2016-02-05 13:59:51 +00:00
Updated by mfriedrich on 2016-02-05 14:11:06 +00:00 I'll look into it, per a customer requirement.
Updated by mfriedrich on 2016-02-08 12:46:44 +00:00
Updated by gbeutner on 2016-02-23 09:59:37 +00:00
Updated by gbeutner on 2016-02-23 09:59:54 +00:00
This issue has been migrated from Redmine: https://dev.icinga.com/issues/11020
Created by tgelf on 2016-01-22 15:23:50 +00:00
Assignee: gbeutner
Status: Resolved (closed on 2016-02-23 09:59:37 +00:00)
Target Version: 2.4.2
Last Update: 2016-02-23 09:59:54 +00:00 (in Redmine)
The most convenient configuration variant for Icinga 2 agents is command endpoints. In such an environment, a master reload generates a lot of superfluous state changes (ok/unknown/ok). I haven't tried it, but I guess that slow reloads combined with typical retry_interval settings would let a check reach a hard state pretty fast, resulting in false alarms. Even if not, this causes overhead in the IDO, might influence SLA reports, and so on. We need some kind of "reload awareness" or grace period to handle this.
Best,
Thomas
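To make the "grace period" idea concrete, here is a minimal Python sketch. It is not the fix that went into 2.4.2 (the thread does not describe it) and it uses no Icinga API. The `Check` class, `process`, `simulate`, and the `grace` parameter are hypothetical names invented for this illustration. It assumes 2 check attempts, a reload at t=0, and a 30 second retry interval.

```python
from dataclasses import dataclass


@dataclass
class Check:
    """Toy model of a check with soft/hard states (hypothetical, not Icinga code)."""
    max_check_attempts: int = 2
    failures: int = 0  # consecutive non-OK results counted so far

    def process(self, state: str, now: float, last_reload: float, grace: float) -> bool:
        """Apply one check result; return True when a hard problem state is first reached."""
        if state == "OK":
            self.failures = 0
            return False
        # Reload awareness: ignore non-OK results that arrive shortly after a reload.
        if now - last_reload < grace:
            return False
        self.failures += 1
        return self.failures == self.max_check_attempts


def simulate(grace: float) -> list:
    """Replay ok/unknown/ok flapping after a reload at t=0 (assumed timings)."""
    check = Check()
    results = [(5.0, "UNKNOWN"), (35.0, "UNKNOWN"), (65.0, "OK")]
    return [check.process(state, t, last_reload=0.0, grace=grace) for t, state in results]


if __name__ == "__main__":
    print("no grace period:", simulate(0.0))
    print("60s grace period:", simulate(60.0))
```

Without a grace period the second UNKNOWN result reaches the hard state and would page someone. With a 60 second window both UNKNOWN results are dropped. A real implementation would also have to decide how to treat genuine failures that happen to fall inside the window.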
Changesets
2016-02-08 08:46:01 +00:00 by (unknown) 6d5014b
2016-02-23 09:51:12 +00:00 by (unknown) b8195be