[dev.icinga.com #11041] Status update storm on three node cluster when using "command_endpoint" on services #3879
Comments
Updated by mfriedrich on 2016-01-29 09:23:35 +00:00
That is probably the same issue we've been debugging and fixing at a customer lately. Check #11014 for details. Can you test the snapshot packages please?
Updated by mfriedrich on 2016-01-29 09:23:43 +00:00
Updated by carljohnston on 2016-02-01 22:48:26 +00:00
Hi dnsmichi, thanks for taking the time to look at this. I tried a few snapshots yesterday (snapshot201601292015 and earlier), which came along with some API corruption bugs. These seemed to fix the update storm but broke core functionality: the non-zone-master cluster members weren't connected correctly to the cluster, and command_endpoint-enabled checks returned "Unknown" because the members weren't contactable. I've tried today's snapshots (snapshot201602012014; version v2.4.1-159-gec050dd): the storm still exists with three nodes connected, but core functionality is restored. Can I provide you with any other information to assist with resolving this? Thank you, Carl
Updated by carljohnston on 2016-02-01 23:25:43 +00:00
Hi dnsmichi, I've just tried a few more snapshots: snapshot201602011352 does not have the update storm, but core functionality is broken (endpoints appear disconnected in Icinga Web 2); snapshot201602011404 reintroduces the update storm but fixes the disconnected endpoints. Thank you, Carl
Updated by mfriedrich on 2016-02-24 20:12:08 +00:00 Does the issue still exist with 2.4.3?
Updated by karambol on 2016-02-25 11:17:40 +00:00 I have had this issue since version 2.4.2 (and still with 2.4.3).
Updated by mnardin on 2016-02-26 14:57:51 +00:00 Hi,
Most of the RRDs where this problem is happening belong to service objects that have the command_endpoint property set to a specific icinga2 instance. Hope this helps.
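For readers following along, a service pinned this way would look roughly like the following sketch (illustrative only; the object, command, and endpoint names are invented, not taken from this setup):

```
// Illustrative only: pin execution of a check to one specific
// endpoint inside the zone via command_endpoint.
apply Service "cluster-health" {
  check_command = "icinga"
  command_endpoint = "master1"
  assign where host.name == "master2.fqdn"
}
```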
Updated by mfriedrich on 2016-02-26 15:03:36 +00:00 Please add the following details.
Updated by mnardin on 2016-02-26 15:40:54 +00:00
I've used ".fqdn" to omit our internal fqdn. I don't know if I can share stuff like this, but I think it shouldn't be a problem in this case.
All zones.conf:
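For comparison, a three-node HA master zone is typically declared along these lines in zones.conf (an illustrative sketch with placeholder names, not the reporter's redacted config):

```
object Endpoint "master1" { host = "master1.fqdn" }
object Endpoint "master2" { host = "master2.fqdn" }
object Endpoint "master3" { host = "master3.fqdn" }

// All three endpoints share one HA master zone.
object Zone "master" {
  endpoints = [ "master1", "master2", "master3" ]
}
```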
Updated by mfriedrich on 2016-02-26 18:02:42 +00:00 Hm, so you are trying to pin the check inside the same zone to a specific host (master01 should check master02 via command_endpoint, and vice versa). Since you're already using the Director and the API, can you connect to the /v1/events endpoint, add a queue with type=CheckResult and a filter for your host/service name? I would guess there are multiple check results involved, causing these update loops.
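For reference, the event-stream subscription described above might look like this with curl (host, port, credentials, queue name, and the filtered host name are placeholders; this requires a running Icinga 2 API):

```
curl -k -s -u root:icinga -H 'Accept: application/json' -X POST \
  'https://localhost:5665/v1/events?queue=storm-debug&types=CheckResult&filter=event.host==%22master1%22'
```

The connection stays open and streams one JSON CheckResult event per line, so duplicate results for the same host/service show up as repeated lines with near-identical timestamps.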
Updated by mnardin on 2016-02-29 15:25:24 +00:00 I was able to get some data. These are the events seen through the API on both masters. I get two events on the first master, where the check is pinned:
icingam02-p:
Updated by mnardin on 2016-02-29 16:58:45 +00:00 I was trying to gather data regarding another problem that we are experiencing right now: high load peaks on the satellites after pushing a config.
Updated by mfriedrich on 2016-03-04 15:54:18 +00:00
Updated by mfriedrich on 2016-03-18 11:19:46 +00:00
Updated by mfriedrich on 2016-03-18 14:27:42 +00:00 There was a bug in 2.4.2 which caused multiple check updates. Can you re-test with 2.4.4 to see whether the issue persists, please?
Updated by mnardin on 2016-04-05 17:11:14 +00:00 The issue is still present with 2.4.4. I've tested with the same object as above.
Updated by mfriedrich on 2016-04-06 16:01:36 +00:00
Ok, we'll have to look into that. Thanks for your feedback. Kind regards,
Updated by gbeutner on 2016-07-25 07:45:59 +00:00
Can you please test whether this problem still occurs with the current snapshot packages? As far as I can see this should have been fixed as part of #12179.
Updated by gbeutner on 2016-07-25 07:46:16 +00:00
Updated by mnardin on 2016-09-14 10:01:56 +00:00 This problem seems to be fixed in 2.5.4.
Updated by mfriedrich on 2016-11-09 14:52:18 +00:00
Updated by mfriedrich on 2016-12-07 21:53:36 +00:00
Cool, thanks.
This issue has been migrated from Redmine: https://dev.icinga.com/issues/11041
Created by carljohnston on 2016-01-27 06:13:36 +00:00
Assignee: (none)
Status: Closed (closed on 2016-12-07 21:53:36 +00:00)
Target Version: (none)
Last Update: 2016-12-07 21:53:36 +00:00 (in Redmine)
Hi Devs,
When configuring a three-node HA master zone (which will eventually have two-node HA satellite zones attached), I have come across an issue where API status updates are stormed (hundreds per second) to all three nodes.
My setup is:
What I've found is that:
Debug log has countless entries that look like:
```
[2016-01-27 01:07:43 -0500] notice/JsonRpcConnection: Received 'event::CheckResult' message from 'master2'
[2016-01-27 01:07:43 -0500] debug/Checkable: command_endpoint found for object 'master1!icinga', setting master1 as check_source.
[2016-01-27 01:07:43 -0500] debug/DbEvents: add checkable check history for 'master1!icinga'
[2016-01-27 01:07:43 -0500] notice/ApiListener: Relaying 'event::CheckResult' message
[2016-01-27 01:07:43 -0500] notice/ApiListener: Sending message to 'master3'
[2016-01-27 01:07:43 -0500] debug/DbObject: Endpoint node: 'master1' status update for 'master1!icinga'
```
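The relay pattern in the log above can be illustrated with a toy model. This is a deliberately simplified sketch, not Icinga 2's actual cluster code: the function, node names, and the hop cap are all invented for illustration. It shows why forwarding every received message to all peers in a full mesh multiplies traffic, while suppressing the sender keeps the volume bounded per hop:

```python
# Toy relay model (NOT Icinga 2's actual implementation): three fully
# meshed endpoints forward every CheckResult message they receive to
# all connected peers. Counting deliveries within a fixed hop budget
# shows how quickly duplicates accumulate when the sender is not
# excluded from the forwarding targets.
from collections import deque

def relay(nodes, suppress_origin, max_hops=5):
    """Count deliveries of one check result injected at nodes[0]."""
    deliveries = 0
    # Each queued item: (node holding the message, who sent it, hops so far)
    queue = deque([(nodes[0], None, 0)])
    while queue:
        node, sender, hops = queue.popleft()
        deliveries += 1
        if hops >= max_hops:
            continue
        for peer in nodes:
            if peer == node:
                continue  # never send to self
            if suppress_origin and peer == sender:
                continue  # don't echo the message back to its sender
            queue.append((peer, node, hops + 1))
    return deliveries

masters = ["master1", "master2", "master3"]
naive = relay(masters, suppress_origin=False)    # 63 deliveries in 5 hops
smarter = relay(masters, suppress_origin=True)   # 11 deliveries in 5 hops
print(naive, smarter)
```

In the naive variant each delivery spawns two more, so the count doubles per hop; with the sender excluded the message still circulates but only linearly, which is the difference between a storm and background chatter.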
This appears to affect:
Thanks for any help you can provide,
Carl
Attachments
Relations: