[dev.icinga.com #11041] Status update storm on three node cluster when using "command_endpoint" on services #3879

Closed
icinga-migration opened this issue Jan 27, 2016 · 22 comments
Labels: area/distributed (Distributed monitoring: master, satellites, clients), bug (Something isn't working)


This issue has been migrated from Redmine: https://dev.icinga.com/issues/11041

Created by carljohnston on 2016-01-27 06:13:36 +00:00

Assignee: (none)
Status: Closed (closed on 2016-12-07 21:53:36 +00:00)
Target Version: (none)
Last Update: 2016-12-07 21:53:36 +00:00 (in Redmine)

Icinga Version: 2.4.1-1
Backport?: Not yet backported
Include in Changelog: 1

Hi Devs,

While configuring a three-node HA master zone (which will eventually have two-node HA satellite zones attached), I have come across an issue where API status updates are stormed (hundreds per second) across all three nodes.

My setup is:

  • Three nodes in the 'master' zone; no global zones.
  • Configuration sync through zones.d
  • Using the chef-icinga2 cookbook, but I have found myself hand-editing config files to track this issue down.
  • Only using the "cluster", "icinga", "ido", "cluster-zone" and "ping" checks on these three nodes; all checks are set to 30s intervals.
  • Local checks are pinned to the nodes using "command_endpoint"; these are for the "cluster", "icinga", "ido" and "cluster-zone" checks. The "ping" check is not pinned. (A minimal sketch of this kind of pinned check follows below.)
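
A minimal sketch of such a pinned check, for illustration only; the host/endpoint name "master1" and the intervals are placeholders, not taken from the actual configuration:

object Host "master1" {
  check_command = "hostalive"
  address = "127.0.0.1"
}

apply Service "icinga" {
  check_command = "icinga"
  check_interval = 30s
  retry_interval = 30s

  // Pin execution of this check to the node itself, as described above for
  // the "cluster", "icinga", "ido" and "cluster-zone" checks.
  command_endpoint = host.name

  assign where host.name == "master1"
}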

What I've found is that:

  • When only two nodes are running, status updates between nodes are normal and there is no storm; the storm starts ~5 seconds after the third node is started, and stops immediately when the third node is stopped.
  • When a Service definition (whether created via an apply rule or defined individually) has "command_endpoint" set, the storm occurs; if all "command_endpoint" definitions are removed, the storm does not occur.

Debug log has countless entries that look like:
[2016-01-27 01:07:43 -0500] notice/JsonRpcConnection: Received 'event::CheckResult' message from 'master2'
[2016-01-27 01:07:43 -0500] debug/Checkable: command_endpoint found for object 'master1!icinga', setting master1 as check_source.
[2016-01-27 01:07:43 -0500] debug/DbEvents: add checkable check history for 'master1!icinga'
[2016-01-27 01:07:43 -0500] notice/ApiListener: Relaying 'event::CheckResult' message
[2016-01-27 01:07:43 -0500] notice/ApiListener: Sending message to 'master3'
[2016-01-27 01:07:43 -0500] debug/DbObject: Endpoint node: 'master1' status update for 'master1!icinga'

This appears to affect:

  • The CPU usage / load on the nodes (obviously)
  • Perfdata logs get filled very fast
  • I've noticed that GraphiteWriter is sending multiple metrics for the same service and timestamp (upwards of 20).

Thanks for any help you can provide,

Carl



Updated by mfriedrich on 2016-01-29 09:23:35 +00:00

  • Status changed from New to Feedback
  • Assigned to set to carljohnston

That is probably the same issue we've been debugging and fixing at a customer lately. Check #11014 for details. Can you test the snapshot packages please?


Updated by mfriedrich on 2016-01-29 09:23:43 +00:00

  • Relates set to 11014


Updated by carljohnston on 2016-02-01 22:48:26 +00:00

dnsmichi wrote:

That is probably the same issue we've been debugging and fixing at a customer lately. Check #11014 for details. Can you test the snapshot packages please?

Hi dnsmichi,

Thanks for taking the time to look at this.

I tried a few snapshots yesterday - these were from snapshot201601292015 and earlier - which came along with some API corruption bugs. These seemed to fix the update storm but broke core functionality: it appeared that the non-zone-master cluster members weren't connected correctly to the cluster, and command_endpoint-enabled checks returned "Unknown" as the members weren't contactable.

I've tried the snapshots from today (snapshot201602012014; version: v2.4.1-159-gec050dd); the storm still exists with three nodes connected, but core functionality is restored.

Can I provide you with any other information to assist with resolving this?

Thank you,

Carl


Updated by carljohnston on 2016-02-01 23:25:43 +00:00

I've tried the snapshots from today (snapshot201602012014; version: v2.4.1-159-gec050dd); the storm still exists with three nodes connected, but core functionality is restored.

Hi dnsmichi,

I've just tried a few more snapshots; snapshot201602011352 does not have the update storm, but core functionality is broken (endpoints appear disconnected in icingaweb2); snapshot201602011404 re-introduces the update storm but fixes the disconnected endpoints.

Thank you,

Carl


Updated by mfriedrich on 2016-02-24 20:12:08 +00:00

Does the issue still exist with 2.4.3?


Updated by karambol on 2016-02-25 11:17:40 +00:00

I have been seeing this issue since version 2.4.2 (and it is still present in 2.4.3).


Updated by mnardin on 2016-02-26 14:57:51 +00:00

Hi,
we are experiencing similar behavior on our two-master setup with two zones (one satellite per zone), running Icinga v2.4.2. I've noticed that the pnp4nagios instance on the master was complaining about an "illegal attempt to update".
See the following /var/log/pnp4nagios/perfdata.log snippet:

2016-02-26 15:45:39 [31793] [0] RRDs::update /var/lib/pnp4nagios/icingam02-p/lxload_load_1_min.rrd 1456497904:0.16
2016-02-26 15:45:39 [31793] [0] RRDs::update ERROR rrdcached: illegal attempt to update using time 1456497904.000000 when last update time is 1456497904.000000 (minimum one second step)
2016-02-26 15:45:39 [31793] [0] RRDs::update /var/lib/pnp4nagios/icingam02-p/lxload_load_5_min.rrd 1456497904:0.09
2016-02-26 15:45:39 [31793] [0] RRDs::update ERROR rrdcached: illegal attempt to update using time 1456497904.000000 when last update time is 1456497904.000000 (minimum one second step)
2016-02-26 15:45:39 [31793] [0] RRDs::update /var/lib/pnp4nagios/icingam02-p/lxload_load_15_min.rrd 1456497904:0.08
2016-02-26 15:45:39 [31793] [0] RRDs::update ERROR rrdcached: illegal attempt to update using time 1456497904.000000 when last update time is 1456497904.000000 (minimum one second step)
2016-02-26 15:45:39 [31793] [0] RRDs::update /var/lib/pnp4nagios/icingam01-p/check-icinga2-proc_proc.rrd 1456497920:1
2016-02-26 15:45:39 [31793] [0] RRDs::update ERROR rrdcached: illegal attempt to update using time 1456497920.000000 when last update time is 1456497920.000000 (minimum one second step)
2016-02-26 15:45:39 [31793] [0] RRDs::update /var/lib/pnp4nagios/icingam01-p/check-icinga2-proc_mem.rrd 1456497920:5.1
2016-02-26 15:45:39 [31793] [0] RRDs::update ERROR rrdcached: illegal attempt to update using time 1456497920.000000 when last update time is 1456497920.000000 (minimum one second step)
2016-02-26 15:45:39 [31793] [0] RRDs::update /var/lib/pnp4nagios/icingam01-p/check-icinga2-proc_cpu.rrd 1456497920:6.4
2016-02-26 15:45:39 [31793] [0] RRDs::update ERROR rrdcached: illegal attempt to update using time 1456497920.000000 when last update time is 1456497920.000000 (minimum one second step)
2016-02-26 15:45:39 [31793] [0] RRDs::update /var/lib/pnp4nagios/icingam01-p/check-icinga2-proc_rss.rrd 1456497920:201724.0
2016-02-26 15:45:39 [31793] [0] RRDs::update ERROR rrdcached: illegal attempt to update using time 1456497920.000000 when last update time is 1456497920.000000 (minimum one second step)
2016-02-26 15:45:39 [31793] [0] RRDs::update /var/lib/pnp4nagios/icingam01-p/check-icinga2-proc_vsz.rrd 1456497920:1273904.0
2016-02-26 15:45:39 [31793] [0] RRDs::update ERROR rrdcached: illegal attempt to update using time 1456497920.000000 when last update time is 1456497920.000000 (minimum one second step)
2016-02-26 15:45:39 [31793] [0] RRDs::update /var/lib/pnp4nagios/icingam02-p/memory-linux-icinga_ram_used.rrd 1456497920:1332304
2016-02-26 15:45:39 [31793] [0] RRDs::update ERROR rrdcached: illegal attempt to update using time 1456497920.000000 when last update time is 1456497920.000000 (minimum one second step)
2016-02-26 15:45:39 [31793] [0] RRDs::update /var/lib/pnp4nagios/icingam02-p/memory-linux-icinga_swap_used.rrd 1456497920:17116
2016-02-26 15:45:39 [31793] [0] RRDs::update ERROR rrdcached: illegal attempt to update using time 1456497920.000000 when last update time is 1456497920.000000 (minimum one second step)
2016-02-26 15:45:39 [31793] [0] RRDs::update /var/lib/pnp4nagios/grafcrypbal01-t/memory-linux_ram_used.rrd 1456497926:561956
2016-02-26 15:45:39 [31793] [0] RRDs::update ERROR rrdcached: illegal attempt to update using time 1456497926.000000 when last update time is 1456497926.000000 (minimum one second step)
2016-02-26 15:45:39 [31793] [0] RRDs::update /var/lib/pnp4nagios/grafcrypbal01-t/memory-linux_swap_used.rrd 1456497926:0
2016-02-26 15:45:39 [31793] [0] RRDs::update ERROR rrdcached: illegal attempt to update using time 1456497926.000000 when last update time is 1456497926.000000 (minimum one second step)
2016-02-26 15:45:39 [31793] [0] RRDs::update /var/lib/pnp4nagios/icingam02-p/CPU_cpu_prct_used.rrd 1456497929:6.5
2016-02-26 15:45:39 [31793] [0] RRDs::update ERROR rrdcached: illegal attempt to update using time 1456497929.000000 when last update time is 1456497929.000000 (minimum one second step)

Most of the RRDs affected by this problem belong to service objects that have the command_endpoint property set to a specific Icinga 2 instance.

Hope this helps.


Updated by mfriedrich on 2016-02-26 15:03:36 +00:00

Please add the following details.

  • zones.conf
  • an example host and service using command_endpoint


Updated by mnardin on 2016-02-26 15:40:54 +00:00

  • File added service-object.txt

I've used ".fqdn" to omit our internal fqdn. I don't know if I can share stuff like this, but I think it shouldn't be a problem in this case.
I've also have attached all the objects attribute of the following example:

object Host "icingam02-p.fqdn" {
  import "standard-host"

  display_name = "icingam02-p"
  address = "10.139.163.131"
  command_endpoint = "icingam01-p.fqdn"

  vars.os = "linux"
  vars.zone = "master"
  vars.checker = "icingam01-p.fqdn"
}

apply Service "CPU" {
  import "standard-service"
  display_name = "CPU State"
  check_command = "snmp-load"
  vars.check_protocol = "snmp"
  check_interval = 1m
  retry_interval = 30s

  if (host.vars.checker) {
    command_endpoint = host.vars.checker
  }

  assign where host.vars.os == "windows" || host.vars.os == "linux"
}

All zones.conf:

[root@icingam01-p ~]# cat /etc/icinga2/zones.conf
object Zone "director-global" {
        global = true
}

object Zone "ris-global" {
        global = true
}

object Endpoint "icingam01-p.fqdn" {
        host = "icingam01-p.fqdn"
}

object Endpoint "icingam02-p.fqdn" {
        host = "icingam02-p.fqdn"
}

object Zone "master" {
        endpoints = [ "icingam01-p.fqdn", "icingam02-p.fqdn" ]
}

object Endpoint "icingas01-p.fqdn" {
        host = "icingas01-p.fqdn"
}

object Zone "RZ" {
        endpoints = [ "icingas01-p.fqdn" ]
        parent = "master"
}

object Endpoint "icingas01-d.fqdn" {
        host = "icingas01-d.fqdn"
}

object Zone "DR" {
        endpoints = [ "icingas01-d.fqdn" ]
        parent = "master"
}
----------------------------------------
[root@icingam02-p ~]# cat /etc/icinga2/zones.conf
object Zone "director-global" {
        global = true
}

object Zone "ris-global" {
        global = true
}

object Endpoint "icingam01-p.fqdn" {
        host = "icingam01-p.fqdn"
}

object Endpoint "icingam02-p.fqdn" {
        host = "icingam02-p.fqdn"
}

object Zone "master" {
        endpoints = [ "icingam01-p.fqdn", "icingam02-p.fqdn" ]
}

object Endpoint "icingas01-p.fqdn" {
        host = "icingas01-p.fqdn"
}

object Zone "RZ" {
        endpoints = [ "icingas01-p.fqdn" ]
        parent = "master"
}

object Endpoint "icingas01-d.fqdn" {
        host = "icingas01-d.fqdn"
}

object Zone "DR" {
        endpoints = [ "icingas01-d.fqdn" ]
        parent = "master"
}
----------------------------------------
[root@icingas01-p zones]# cat /etc/icinga2/zones.conf
object Zone "director-global" {
        global = true
}

object Zone "ris-global" {
        global = true
}

object Endpoint "icingam01-p.fqdn" {
}

object Endpoint "icingam02-p.fqdn" {
}

object Zone "master" {
        endpoints = [ "icingam01-p.fqdn", "icingam02-p.fqdn" ]
}

object Endpoint "icingas01-p.fqdn" {
        host = "icingas01-p.fqdn"
}

object Zone "RZ" {
        endpoints = [ "icingas01-p.fqdn" ]
        parent = "master"
}
----------------------------------------
[root@icingas01-d ~]# cat /etc/icinga2/zones.conf
object Zone "director-global" {
        global = true
}

object Zone "ris-global" {
        global = true
}

object Endpoint "icingam01-p.fqdn" {
}

object Endpoint "icingam02-p.fqdn" {
}

object Zone "master" {
        endpoints = [ "icingam01-p.fqdn", "icingam02-p.fqdn" ]
}

object Endpoint "icingas01-d.fqdn" {
        host = "icingas01-d.fqdn"
}

object Zone "DR" {
        endpoints = [ "icingas01-d.fqdn" ]
        parent = "master"
}


Updated by mfriedrich on 2016-02-26 18:02:42 +00:00

Hm, so you are trying to pin the check inside the same zone on a specific host (master01 should check master02 via command_endpoint and vice versa).

Since you're using the Director and the API already, can you connect to the /v1/events endpoint, adding a queue, types=CheckResult and a filter for your host/service name (see the example below)? I would guess there are multiple check results involved, causing these update loops.
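
A sketch of such an event stream subscription, assuming an API user "root" with password "icinga" on the default API port 5665; the queue name and the filter value are placeholders to adapt:

# Sketch only: subscribe to CheckResult events for one host via the API event stream.
curl -k -s -u root:icinga -H 'Accept: application/json' -X POST \
  'https://localhost:5665/v1/events?queue=stormdebug&types=CheckResult&filter=event.host==%22icingam02-p.fqdn%22'

Each matching check result then arrives as a single JSON line; the same result showing up repeatedly with near-identical timestamps would confirm that it is being relayed more than once.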


Updated by mnardin on 2016-02-29 15:25:24 +00:00

I was able to get some data. These are the events seen through the API on both masters. I get two events on the first master, to which the check is pinned:
icingam01-p:

{"check_result":{"active":true,"check_source":"icingam01-p.fqdn","command":["/opt/monitoring-plugins/manubulon/plugins/check_snmp_load.pl","-C","public","-H","10.139.163.131","-T","stand","-c","95","-f","-t","5","-w","85"],"execution_end":1456758891.0741410255,"execution_start":1456758891.0006890297,"exit_status":0.0,"output":"2 CPU, average load 8.5% < 85% : OK ","performance_data":["cpu_prct_used=8.5%;85;95"],"schedule_end":1456758891.0742049217,"schedule_start":1456758951.0,"state":0.0,"type":"CheckResult","vars_after":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0},"vars_before":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0}},"host":"icingam02-p.fqdn","service":"CPU","timestamp":1456758891.0743079185,"type":"CheckResult"}
{"check_result":{"active":true,"check_source":"icingam01-p.fqdn","command":["/opt/monitoring-plugins/manubulon/plugins/check_snmp_load.pl","-C","public","-H","10.139.163.131","-T","stand","-c","95","-f","-t","5","-w","85"],"execution_end":1456758891.0741410255,"execution_start":1456758891.0006890297,"exit_status":0.0,"output":"2 CPU, average load 8.5% < 85% : OK ","performance_data":["cpu_prct_used=8.5%;85;95"],"schedule_end":1456758891.0742049217,"schedule_start":1456758951.0,"state":0.0,"type":"CheckResult","vars_after":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0},"vars_before":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0}},"host":"icingam02-p.fqdn","service":"CPU","timestamp":1456758891.1166100502,"type":"CheckResult"}
{"check_result":{"active":true,"check_source":"icingam01-p.fqdn","command":["/opt/monitoring-plugins/manubulon/plugins/check_snmp_load.pl","-C","public","-H","10.139.163.131","-T","stand","-c","95","-f","-t","5","-w","85"],"execution_end":1456758951.0681591034,"execution_start":1456758951.0007469654,"exit_status":0.0,"output":"2 CPU, average load 6.5% < 85% : OK ","performance_data":["cpu_prct_used=6.5%;85;95"],"schedule_end":1456758951.0682110786,"schedule_start":1456759011.0000002384,"state":0.0,"type":"CheckResult","vars_after":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0},"vars_before":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0}},"host":"icingam02-p.fqdn","service":"CPU","timestamp":1456758951.0683379173,"type":"CheckResult"}
{"check_result":{"active":true,"check_source":"icingam01-p.fqdn","command":["/opt/monitoring-plugins/manubulon/plugins/check_snmp_load.pl","-C","public","-H","10.139.163.131","-T","stand","-c","95","-f","-t","5","-w","85"],"execution_end":1456758951.0681591034,"execution_start":1456758951.0007469654,"exit_status":0.0,"output":"2 CPU, average load 6.5% < 85% : OK ","performance_data":["cpu_prct_used=6.5%;85;95"],"schedule_end":1456758951.0682110786,"schedule_start":1456759011.0000002384,"state":0.0,"type":"CheckResult","vars_after":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0},"vars_before":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0}},"host":"icingam02-p.fqdn","service":"CPU","timestamp":1456758951.0710589886,"type":"CheckResult"}

icingam02-p:

{"check_result":{"active":true,"check_source":"icingam01-p.fqdn","command":["/opt/monitoring-plugins/manubulon/plugins/check_snmp_load.pl","-C","public","-H","10.139.163.131","-T","stand","-c","95","-f","-t","5","-w","85"],"execution_end":1456758891.0741410255,"execution_start":1456758891.0006890297,"exit_status":0.0,"output":"2 CPU, average load 8.5% < 85% : OK ","performance_data":["cpu_prct_used=8.5%;85;95"],"schedule_end":1456758891.0742049217,"schedule_start":1456758951.0,"state":0.0,"type":"CheckResult","vars_after":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0},"vars_before":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0}},"host":"icingam02-p.fqdn","service":"CPU","timestamp":1456758891.0760920048,"type":"CheckResult"}
{"check_result":{"active":true,"check_source":"icingam01-p.fqdn","command":["/opt/monitoring-plugins/manubulon/plugins/check_snmp_load.pl","-C","public","-H","10.139.163.131","-T","stand","-c","95","-f","-t","5","-w","85"],"execution_end":1456758951.0681591034,"execution_start":1456758951.0007469654,"exit_status":0.0,"output":"2 CPU, average load 6.5% < 85% : OK ","performance_data":["cpu_prct_used=6.5%;85;95"],"schedule_end":1456758951.0682110786,"schedule_start":1456759011.0000002384,"state":0.0,"type":"CheckResult","vars_after":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0},"vars_before":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0}},"host":"icingam02-p.fqdn","service":"CPU","timestamp":1456758951.0696389675,"type":"CheckResult"}


Updated by mnardin on 2016-02-29 16:58:45 +00:00

I was trying to gather data regarding another problem we are experiencing right now: high load peaks on the satellites after pushing a config.
In /var/log/pnp4nagios/perfdata.log I've noticed that on every config reload both masters receive two check results (hostalive) for almost every host in our config. This could be the culprit for our load problems.


Updated by mfriedrich on 2016-03-04 15:54:18 +00:00

  • Parent Id set to 11313


Updated by mfriedrich on 2016-03-18 11:19:46 +00:00

  • Relates set to 11196


Updated by mfriedrich on 2016-03-18 14:27:42 +00:00

There was a bug in 2.4.2 which caused multiple check updates. Can you re-test with 2.4.4 to see whether the issue you're having persists, please?


Updated by mnardin on 2016-04-05 17:11:14 +00:00

The issue is still present with 2.4.4. I've tested with the same object as above.


Updated by mfriedrich on 2016-04-06 16:01:36 +00:00

  • Status changed from Feedback to New
  • Assigned to deleted carljohnston

Ok, we'll have to look into that. Thanks for your feedback.

Kind regards,
Michael


Updated by gbeutner on 2016-07-25 07:45:59 +00:00

  • Status changed from New to Feedback
  • Assigned to set to carljohnston

Can you please test whether this problem still occurs with the current snapshot packages? As far as I can see this should have been fixed as part of #12179.


Updated by gbeutner on 2016-07-25 07:46:16 +00:00

  • Duplicates set to 12179


Updated by mnardin on 2016-09-14 10:01:56 +00:00

This problem seems to be fixed in 2.5.4.


Updated by mfriedrich on 2016-11-09 14:52:18 +00:00

  • Parent Id deleted 11313


Updated by mfriedrich on 2016-12-07 21:53:36 +00:00

  • Status changed from Feedback to Closed
  • Assigned to deleted carljohnston

Cool thanks.

icinga-migration added the area/distributed (Distributed monitoring: master, satellites, clients) and bug (Something isn't working) labels on Jan 17, 2017