[dev.icinga.com #11020] Master reloads with agents generate false alarms #3871

Closed
icinga-migration opened this issue Jan 22, 2016 · 8 comments
Labels
  • area/distributed: Distributed monitoring (master, satellites, clients)
  • blocker: Blocks a release or needs immediate attention
  • bug: Something isn't working
Milestone: 2.4.2

Comments

@icinga-migration

This issue has been migrated from Redmine: https://dev.icinga.com/issues/11020

Created by tgelf on 2016-01-22 15:23:50 +00:00

Assignee: gbeutner
Status: Resolved (closed on 2016-02-23 09:59:37 +00:00)
Target Version: 2.4.2
Last Update: 2016-02-23 09:59:54 +00:00 (in Redmine)

Icinga Version: 2.4.1
Backport?: Already backported
Include in Changelog: 1

The most convenient configuration variant for Icinga 2 agents is command endpoints. In such an environment, every master reload generates a lot of superfluous state changes (OK/UNKNOWN/OK). I haven't tried it out, but I suspect that slow reloads combined with typical retry_interval settings would let a check reach a hard state pretty fast, resulting in false alarms. Even if they don't, these state changes cause overhead in the IDO, might influence SLA reports, and so on. We need some kind of "reload awareness" or grace period to handle this.

Best,
Thomas
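
For reference, a minimal sketch of the command-endpoint setup described above (host, endpoint, and address names are made up for illustration, not taken from this report). The OK/UNKNOWN/OK flapping presumably comes from the connection to the agent being briefly unavailable while the master reloads:

    // Illustrative only: an agent host whose checks are executed via command_endpoint.
    object Endpoint "agent01.example.com" {
    }

    object Zone "agent01.example.com" {
      endpoints = [ "agent01.example.com" ]
      parent = "master"
    }

    object Host "agent01.example.com" {
      check_command = "hostalive"
      address = "192.0.2.10"
      vars.agent_endpoint = name    // mark this host as an agent
    }

    apply Service "disk" {
      check_command = "disk"
      // The master schedules the check, the agent executes it.
      command_endpoint = host.vars.agent_endpoint

      check_interval = 5m
      retry_interval = 30s          // a short retry_interval makes a hard state easy to reach during a slow reload
      max_check_attempts = 3

      assign where host.vars.agent_endpoint
    }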

Changesets

2016-02-08 08:46:01 +00:00 by (unknown) 6d5014b

Increase grace period for agent-based checks

refs #11020

2016-02-23 09:51:12 +00:00 by (unknown) b8195be

Increase grace period for agent-based checks

refs #11020
@icinga-migration

Updated by mfriedrich on 2016-01-25 09:59:19 +00:00

  • Category set to Cluster
  • Target Version set to 2.5.0

@icinga-migration

Updated by mfriedrich on 2016-01-25 10:30:57 +00:00

  • Target Version changed from 2.5.0 to 2.4.2

@icinga-migration

Updated by ziaunys on 2016-01-28 18:08:11 +00:00

tgelf wrote:

The most convenient configuration variant for Icinga 2 agents is command endpoints. In such an environment, every master reload generates a lot of superfluous state changes (OK/UNKNOWN/OK). I haven't tried it out, but I suspect that slow reloads combined with typical retry_interval settings would let a check reach a hard state pretty fast, resulting in false alarms. Even if they don't, these state changes cause overhead in the IDO, might influence SLA reports, and so on. We need some kind of "reload awareness" or grace period to handle this.

Best,
Thomas

I just started to encounter this issue; I'm not sure if it's because I now have a lot of agents (270 in total). In my environment, when Puppet runs and adds a new host, Icinga 2 reloads and most of the agent cluster-zone checks fail once. I have 2 check attempts set, so sometimes a handful of checks reach a hard state and page our on-call person, which is confusing because it's usually a random set of hosts and, from their perspective, it looks like a bunch of hosts have gone down.
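
A sketch of the kind of agent health check described here, following the common cluster-zone pattern from the distributed monitoring docs (the service name, intervals, and the assign rule are assumptions, not taken from this environment). Raising max_check_attempts only papers over the symptom; the actual fix referenced in the changesets above increases the grace period for agent-based checks:

    // Illustrative only: connection check for each agent zone.
    apply Service "agent-health" {
      check_command = "cluster-zone"

      // Check the connection to the agent's zone (zone name assumed to match the host name).
      vars.cluster_zone = host.name

      check_interval = 1m
      retry_interval = 30s
      // With max_check_attempts = 2, a single reload-induced failure is only one
      // retry away from a hard state and a page; a higher value is a possible stopgap.
      max_check_attempts = 5

      assign where host.vars.agent_endpoint
    }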

@icinga-migration

Updated by mfriedrich on 2016-02-05 13:59:51 +00:00

  • Status changed from New to Assigned
  • Assigned to set to mfriedrich
  • Priority changed from Normal to High

@icinga-migration

Updated by mfriedrich on 2016-02-05 14:11:06 +00:00

I'll look into it, as per a customer requirement.

@icinga-migration

Updated by mfriedrich on 2016-02-08 12:46:44 +00:00

  • Assigned to changed from mfriedrich to gbeutner

@icinga-migration

Updated by gbeutner on 2016-02-23 09:59:37 +00:00

  • Status changed from Assigned to Resolved

@icinga-migration

Updated by gbeutner on 2016-02-23 09:59:54 +00:00

  • Backport? changed from Not yet backported to Already backported

@icinga-migration added the blocker, bug, and area/distributed labels on Jan 17, 2017
@icinga-migration added this to the 2.4.2 milestone on Jan 17, 2017