[dev.icinga.com #11196] High load when pinning command endpoint on HA cluster #3954

Closed

icinga-migration opened this issue on Feb 22, 2016 · 12 comments

Labels: area/distributed (Distributed monitoring: master, satellites, clients), bug (Something isn't working)
Milestone: 2.5.0

Comments

This issue has been migrated from Redmine: https://dev.icinga.com/issues/11196

Created by peckel on 2016-02-22 09:53:43 +00:00

Assignee: (none)
Status: Closed (closed on 2016-08-04 16:07:08 +00:00)
Target Version: 2.5.0
Last Update: 2016-08-04 16:07:17 +00:00 (in Redmine)

Icinga Version: 2.4.1
Backport?: Not yet backported
Include in Changelog: 1

I wrote a note about this a while ago (Dec 7th) on the mailing list, but at that time I could not pin down the problem exactly. After some additional testing, I have found what is causing the issue.

The basic setup is that I need to build a new monitoring environment for a customer who has several security zones in which monitoring must be kept local. The general idea is to set up an HA cluster with Icinga 2, Graphite, Grafana and the IDO DB in the central zone, plus one satellite cluster within each of the security zones that does the data collection. In general this works perfectly.
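
For illustration, the zone layout looks roughly like this in zones.conf (the master endpoint names and the zone name "security-zone-a" are only placeholders; the satellite names are the real ones used further down):

object Endpoint "icinga2-master1.example.com" { }
object Endpoint "icinga2-master2.example.com" { }

object Zone "master" {
    endpoints = [ "icinga2-master1.example.com", "icinga2-master2.example.com" ]
}

object Endpoint "icinga2-satellite1.demo.hindenburgring.com" { }
object Endpoint "icinga2-satellite2.demo.hindenburgring.com" { }

object Zone "security-zone-a" {
    endpoints = [ "icinga2-satellite1.demo.hindenburgring.com", "icinga2-satellite2.demo.hindenburgring.com" ]
    parent = "master"
}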

The problem is that the checks on the cluster nodes themselves can't be pinned to the individual nodes. This is perfectly OK for HTTP checks, pingers and the standard SSH check, but the trouble starts with disk, memory, CPU and the other local checks: when these run somewhere on the cluster, it can't be determined which node an alarm or its perfdata belongs to, which is not acceptable here. I've also reviewed feature requests 10977, 10679 and 10040 on this issue. As the monitoring infrastructure is very critical to the customer, we need to monitor the local resources on the cluster nodes as well.

My naive approach was to pin the command endpoint to the local node for the local resource checks on the cluster nodes. But as soon as I do that, the load of the icinga2 process on the cluster where this is configured immediately increases substantially, both in memory and CPU. From the Graphite logs I can see that the satellite cluster with the pinned command endpoints starts sending the same perfdata over and over again, for the local checks and for the checks on all connected Icinga 2 clients.

The only change in the configuration is the setting for the pinned command endpoint on the satellite cluster:

object Host "icinga2-satellite1.demo.hindenburgring.com" {
    import "generic-host"
    address = "icinga2-satellite1.demo.hindenburgring.com"
    vars.os = "Linux"

    vars.command_endpoint = "icinga2-satellite1.demo.hindenburgring.com"
}

object Host "icinga2-satellite2.demo.hindenburgring.com" {
    import "generic-host"
    address = "icinga2-satellite2.demo.hindenburgring.com"
    vars.os = "Linux"

    vars.command_endpoint = "icinga2-satellite2.demo.hindenburgring.com"
}

[...]

In the service definitions I have the following configuration:

apply Service "load" {
    import "generic-service"

    check_command = "load"

    command_endpoint = host.vars.command_endpoint
    assign where host.vars.command_endpoint
}

apply Service "procs" {
    import "generic-service"

    check_command = "procs"

    command_endpoint = host.vars.command_endpoint
    assign where host.vars.command_endpoint
}

[...]

The graphs for Graphite updateOperations vs. metricsReceived show nicely that the number of incoming metrics increases dramatically when the above setup is activated, but there are no real updates, as the data points are essentially all the same.

As soon as command endpoint pinning is disabled, the load immediately drops back to normal (compare the log entries to the Graphite stats, they correlate nicely). This can now be reproduced at will, so if you need any data to help fix this, I can provide it.

Updated by peckel on 2016-02-23 18:39:32 +00:00

Update:

The issue persists in 2.4.2.

Updated by peckel on 2016-02-25 09:22:35 +00:00

Update: The issue is still showing in 2.4.3.

Additionally, it looks very similar to the one in #11041.

Updated by mfriedrich on 2016-03-04 15:54:22 +00:00

  • Parent Id set to 11313

Updated by mfriedrich on 2016-03-18 11:19:45 +00:00

  • Relates set to 11041

Updated by mfriedrich on 2016-03-18 11:22:27 +00:00

Pinning a check inside an HA zone onto a specific endpoint is currently neither supported nor implemented. You may run into undefined behaviour: one node is responsible for scheduling the check, another one acts as the command_endpoint and executes it, and the check result is then synced back to all involved nodes.

There is a feature request to allow such behaviour, but for now I'd suggest re-thinking your zone design. If you want to run specific checks on defined nodes, assign those nodes their own zones and make them a third level below the satellite zone, roughly as sketched below.
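
Roughly, such a zone design could look like this (the node and zone names are only placeholders; "satellite" stands for your existing satellite HA zone):

object Endpoint "checknode1.example.com" { }

object Zone "checknode1.example.com" {
    endpoints = [ "checknode1.example.com" ]
    parent = "satellite"
}

Since such a zone contains exactly one endpoint, checks assigned to it always run on that node and cannot fail over.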

Updated by peckel on 2016-03-18 11:32:46 +00:00

Hi Michael,

thanks for your input. However, there is a certain catch-22:

I need to have a redundant pair of satellites to execute remote checks in the specific zone. No real problem here.

To ensure the availability of the monitoring setup, I also need to monitor those satellites themselves for the usual system resources, e.g. disk, memory, CPU - the standard stuff. All works fine (at least that's what it looks like) while both satellites are up, but when one of them goes down, the other one not only takes over the remote checks (which is desired) but also the checks for local resources (which most definitely is not, as it then starts monitoring its own disks and recording perfdata for them instead of those of the failed node).

Putting the satellite node in its own zone isn't an option as it is already a member of the satellite cluster zone, and it can't be in two zones at the same time.

I don't see how to address this with a modified zone design. Or am I missing something here?

Thanks and best regards,

Peter.

Updated by gbeutner on 2016-07-25 07:45:04 +00:00

  • Status changed from New to Assigned
  • Assigned to set to peckel

Can you please test whether this problem still occurs with the current snapshot packages? As far as I can see this should have been fixed as part of #12179.

Updated by gbeutner on 2016-07-25 07:45:12 +00:00

  • Status changed from Assigned to Feedback

Updated by gbeutner on 2016-07-25 07:45:29 +00:00

  • Duplicates set to 12179

Updated by peckel on 2016-07-26 17:49:30 +00:00

Hi Gunnar,

thanks for the update.

I've just upgraded my test environment to today's snapshot, and it seems the problem has disappeared. Pinning the endpoint for certain services (e.g. local disks, processes etc.) to a particular cluster node instead of having them fail over works now as well.

Great news, thanks!

Best regards,

Peter.

Updated by mfriedrich on 2016-08-04 16:07:09 +00:00

  • Status changed from Feedback to Closed
  • Assigned to deleted peckel
  • Target Version set to 2.5.0
  • Done % changed from 0 to 100

Thanks for the kind feedback.

Updated by mfriedrich on 2016-08-04 16:07:18 +00:00

  • Parent Id deleted 11313

@icinga-migration icinga-migration added bug Something isn't working area/distributed Distributed monitoring (master, satellites, clients) labels Jan 17, 2017
@icinga-migration icinga-migration added this to the 2.5.0 milestone Jan 17, 2017