[dev.icinga.com #11273] Services status updated multiple times within check_interval even though no retry was triggered #3990

icinga-migration · 2016-03-02T09:37:02Z

This issue has been migrated from Redmine: https://dev.icinga.com/issues/11273

Created by ralph_b on 2016-03-02 09:37:02 +00:00

Assignee: mfriedrich
Status: Resolved (closed on 2016-03-11 08:36:17 +00:00)
Target Version: 2.4.4
Last Update: 2016-03-24 09:37:53 +00:00 (in Redmine)

Icinga Version: 2.4.3
Backport?: Already backported
Include in Changelog: 1

Hi to all,

icinga2 runs new checks on a service multiple times before check_interval has elapsed.

icinga2 --version:
icinga2 - The Icinga 2 network monitoring daemon (version: v2.4.3)
Copyright © 2012-2016 Icinga Development Team (https://www.icinga.org/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Application information:
Installation root: /usr
Sysconf directory: /etc
Run directory: /var/run
Local state directory: /var
Package data directory: /usr/share/icinga2
State path: /var/lib/icinga2/icinga2.state
Modified attributes path: /var/lib/icinga2/modified-attributes.conf
Objects path: /var/cache/icinga2/icinga2.debug
Vars path: /var/cache/icinga2/icinga2.vars
PID path: /var/run/icinga2/icinga2.pid

System information:
Platform: Red Hat Enterprise Linux Server
Platform version: 6.7 (Santiago)
Kernel: Linux
Kernel version: 2.6.32-573.18.1.el6.x86_64
Architecture: x86_64

icinga2 feature list:
Disabled features: compatlog debuglog gelf icingastatus livestatus opentsdb perfdata statusdata syslog
Enabled features: api checker command graphite ido-pgsql mainlog notification

The output of the following command can be found in the attached file:
SELECT
status.servicestatus_id,
status.status_update_time,
status.last_check,
status.next_check,
age(status.next_check, status.last_check) as delta,
to_char(services.check_interval * 60 , '99') as check_interval,
status.service_object_id,
status.current_state,
status.has_been_checked,
status.current_check_attempt,
status.should_be_scheduled,
status.is_flapping

FROM
public.icinga_servicestatus status inner join icinga_services services on (status.service_object_id = services.service_object_id)

where status.status_update_time > '2016-03-02 10:27:00+01'::timestamp
and age(status.next_check, status.last_check) < '00:00:58'::time

order by status_update_time;

Please pay attention to the "delta" column.

I cannot make out any rule or pattern by which these checks are fired.
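
The filter used in the query above can also be sketched outside the database. The following is a minimal illustration and not part of the original report: the row layout and the sample timestamps are made up, and only the 58-second threshold is taken from the query.

```python
from datetime import datetime, timedelta

# Threshold copied from the query's "< '00:00:58'::time" condition.
THRESHOLD = timedelta(seconds=58)


def suspicious_rows(rows, threshold=THRESHOLD):
    """Return (row, delta) pairs where next_check - last_check is below threshold."""
    flagged = []
    for row in rows:
        delta = row["next_check"] - row["last_check"]
        if delta < threshold:
            flagged.append((row, delta))
    return flagged


# Hypothetical sample rows; the real data is in the attached icinga_servicestatus.txt.
rows = [
    {
        "service_object_id": 1,
        "last_check": datetime(2016, 3, 2, 10, 27, 0),
        "next_check": datetime(2016, 3, 2, 10, 28, 0),
    },
    {
        "service_object_id": 2,
        "last_check": datetime(2016, 3, 2, 10, 27, 30),
        "next_check": datetime(2016, 3, 2, 10, 27, 41),
    },
]

for row, delta in suspicious_rows(rows):
    print(row["service_object_id"], delta)
```

With a check_interval of one minute, a row is only expected to show a delta of about 60 seconds, so anything the filter returns is a candidate for the behavior described here.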

Attachments

troubleshooting.txt ralph_b - 2016-03-02 09:25:34 +00:00
icinga_servicestatus.txt ralph_b - 2016-03-02 09:28:32 +00:00
Capture.PNG rgrey - 2016-03-07 11:46:48 +00:00
07-03-2016 17-10-27.png ralph_b - 2016-03-07 16:15:03 +00:00
09-03-2016 12-30-52.png ralph_b - 2016-03-09 11:34:34 +00:00

Changesets

2016-03-05 17:15:03 +00:00 by mfriedrich b8e3d61

Revert "Properly set the next check time for active and passive checks"

This reverts commit 2a11b27972e4325bf80e9abc9017eab7dd03e712.

This patch does not properly work and breaks the check_interval setting
for passive checks. Requires a proper patch.

refs #11248
refs #11257
refs #11273

(the old issue)
refs #7287

2016-03-05 17:16:49 +00:00 by mfriedrich ef532f2

Revert "Fix check scheduling w/ retry_interval"

This reverts commit a51e647cc760bd5f7c4de6182961a477478c11a9.

This patch causes trouble with check results received
1) passively 2) throughout the cluster. A proper patch
for setting the retry_interval on NOT-OK state changes
is required.

refs #11248
refs #11257
refs #11273

(the old issue)
refs #7287

2016-03-11 14:55:03 +00:00 by mfriedrich 8344f74

Revert "Properly set the next check time for active and passive checks"

This reverts commit 2a11b27972e4325bf80e9abc9017eab7dd03e712.

This patch does not properly work and breaks the check_interval setting
for passive checks. Requires a proper patch.

refs #11248
refs #11257
refs #11273

(the old issue)
refs #7287

2016-03-11 14:55:14 +00:00 by mfriedrich f99feab

Revert "Fix check scheduling w/ retry_interval"

This reverts commit a51e647cc760bd5f7c4de6182961a477478c11a9.

This patch causes trouble with check results received
1) passively 2) throughout the cluster. A proper patch
for setting the retry_interval on NOT-OK state changes
is required.

refs #11248
refs #11257
refs #11273

(the old issue)
refs #7287


icinga-migration · 2016-03-02T16:14:12Z

Updated by mfriedrich on 2016-03-02 16:14:12 +00:00

Category set to Checker
Status changed from New to Feedback
Assigned to set to ralph_b

Are you using an Icinga 2 Cluster, or any nodes actually executing these checks? Please add the relevant zones.conf entries.

icinga-migration · 2016-03-02T20:53:31Z

Updated by ralph_b on 2016-03-02 20:53:31 +00:00

No, we don't actually use an Icinga 2 cluster. The troubleshooting file contains the whole master1 zone definition. An icinga2 agent is installed on almost all clients, but the services.conf on these clients is empty. At the moment all checks are triggered by the master. Communication between master and clients is a one-way street (admin network to customer network).

icinga-migration · 2016-03-03T08:17:16Z

Updated by mfriedrich on 2016-03-03 08:17:16 +00:00

Status changed from Feedback to New
Assigned to changed from ralph_b to mfriedrich

Ok, thanks. I'll try to reproduce the issue.

Cheers,
Michael

icinga-migration · 2016-03-03T08:19:34Z

Updated by mfriedrich on 2016-03-03 08:19:34 +00:00

Status changed from New to Assigned

icinga-migration · 2016-03-03T08:44:41Z

Updated by ralph_b on 2016-03-03 08:44:41 +00:00

Additional information: I reduced the scenario to the master plus one single client without an icinga2 agent. In this scenario the master shows the same behavior.

Cheers,
Ralph

icinga-migration · 2016-03-03T17:28:54Z

Updated by rgrey on 2016-03-03 17:28:54 +00:00

I think I'm experiencing the same issue on Ubuntu. Single node reporting in hundreds of times a second. Let me know if/what further info I can provide to help.

icinga2 - The Icinga 2 network monitoring daemon (version: r2.4.3-1)

Copyright © 2012-2016 Icinga Development Team (https://www.icinga.org/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Application information:
Installation root: /usr
Sysconf directory: /etc
Run directory: /run
Local state directory: /var
Package data directory: /usr/share/icinga2
State path: /var/lib/icinga2/icinga2.state
Modified attributes path: /var/lib/icinga2/modified-attributes.conf
Objects path: /var/cache/icinga2/icinga2.debug
Vars path: /var/cache/icinga2/icinga2.vars
PID path: /run/icinga2/icinga2.pid

System information:
Platform: Ubuntu
Platform version: 14.04.4 LTS, Trusty Tahr
Kernel: Linux
Kernel version: 3.13.0-79-generic
Architecture: x86_64

icinga-migration · 2016-03-04T09:49:16Z

Updated by rgrey on 2016-03-04 09:49:16 +00:00

So, some single node stats (aggregated through Graylog) for a node running for the last 5 minutes

Value     %        Count
icinga    54.50%   1,496
load      10.53%   289
procs     10.27%   282
disk       5.17%   142
swap       5.10%   140
disk /     5.06%   139
ssh        5.03%   138
users      4.12%   113
apt        0.15%   4
ping4      0.07%   2
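
To put these counts in perspective, the shares and a per-minute rate can be recomputed from the table. This is only a rough sketch: the counts and the 5-minute window come from the comment above, while the per-minute figure is a derived value and says nothing about the configured check_interval, which is not stated here.

```python
# Counts copied from the table above; the window length is taken from the comment.
counts = {
    "icinga": 1496,
    "load": 289,
    "procs": 282,
    "disk": 142,
    "swap": 140,
    "disk /": 139,
    "ssh": 138,
    "users": 113,
    "apt": 4,
    "ping4": 2,
}
WINDOW_MINUTES = 5

total = sum(counts.values())

for name, count in counts.items():
    share = 100 * count / total          # percentage of all messages
    per_minute = count / WINDOW_MINUTES  # average results per minute
    print(f"{name:8} {share:6.2f}% {per_minute:7.1f}/min")
```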

icinga-migration · 2016-03-04T14:03:43Z

Updated by mfriedrich on 2016-03-04 14:03:43 +00:00

https://monitoring-portal.org/index.php?thread/35412-services-checks-werden-mehrfach-ausgef%C3%BChrt/&postID=225805#post225805 (for reference)

icinga-migration · 2016-03-04T15:27:26Z

Updated by mfriedrich on 2016-03-04 15:27:26 +00:00

Relates set to 11257

icinga-migration · 2016-03-04T15:27:37Z

Updated by mfriedrich on 2016-03-04 15:27:37 +00:00

Relates set to 11248

icinga-migration · 2016-03-04T15:27:49Z

Updated by mfriedrich on 2016-03-04 15:27:49 +00:00

Relates set to 11226

icinga-migration · 2016-03-04T15:31:42Z

Updated by mfriedrich on 2016-03-04 15:31:42 +00:00

Relates deleted ~~11257~~

icinga-migration · 2016-03-04T15:33:36Z

Updated by mfriedrich on 2016-03-04 15:33:36 +00:00

Relates deleted ~~11248~~

icinga-migration · 2016-03-04T15:33:39Z

Updated by mfriedrich on 2016-03-04 15:33:39 +00:00

Relates deleted ~~11226~~

icinga-migration · 2016-03-04T15:33:46Z

Updated by mfriedrich on 2016-03-04 15:33:46 +00:00

Parent Id set to 11310

icinga-migration · 2016-03-05T17:40:18Z

Updated by mfriedrich on 2016-03-05 17:40:18 +00:00

I've reverted 2 commits which might be causing trouble here. Can you please re-test the current git master?

icinga-migration · 2016-03-07T11:48:01Z

Updated by rgrey on 2016-03-07 11:48:01 +00:00

File added Capture.PNG

dnsmichi wrote:

I've reverted 2 commits which might be causing trouble here. Can you please re-test the current git master?

I've downloaded and built the master from git and deployed that build to one node.

Results: last 5 minutes: > 13,000 service check messages sent to my Graylog instance - see the attached image.

icinga-migration · 2016-03-07T14:53:25Z

Updated by mfriedrich on 2016-03-07 14:53:25 +00:00

Hm, that's fairly strange. I'm using a 3 node cluster (2 nodes in master zone, 1 satellite for command_endpoint checks using the latest icinga2 --version v2.4.3-232-gef532f2) and I don't see such behavior.

@rgrey
Can you please add more details, such as the zones.conf from both the master and the client, as well as the output of "icinga2 --version"?

icinga-migration · 2016-03-07T15:06:57Z

Updated by rgrey on 2016-03-07 15:06:57 +00:00

Hmm, I must have done something wrong, as my icinga2 --version on the node still says r2.4.3-1 rather than a git version. I'll do some more work ... sorry. Also, I only built and deployed this to my single remote node. I hadn't changed my master installation. Please advise.

icinga2 - The Icinga 2 network monitoring daemon (version: r2.4.3-1)

Copyright © 2012-2016 Icinga Development Team (https://www.icinga.org/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Application information:
Installation root: /usr/local
Sysconf directory: /usr/local/etc
Run directory: /usr/local/var/run
Local state directory: /usr/local/var
Package data directory: /usr/local/share/icinga2
State path: /usr/local/var/lib/icinga2/icinga2.state
Modified attributes path: /usr/local/var/lib/icinga2/modified-attributes.conf
Objects path: /usr/local/var/cache/icinga2/icinga2.debug
Vars path: /usr/local/var/cache/icinga2/icinga2.vars
PID path: /usr/local/var/run/icinga2/icinga2.pid

System information:
Platform: Ubuntu
Platform version: 14.04.4 LTS, Trusty Tahr
Kernel: Linux
Kernel version: 3.13.0-79-generic
Architecture: x86_64

icinga-migration · 2016-03-07T15:34:56Z

Updated by mfriedrich on 2016-03-07 15:34:56 +00:00

Fixed the snapshot package repository for ubuntu trusty, you should see the latest packages available over there.

Please update the affected node and the master.

icinga-migration · 2016-03-07T15:39:44Z

Updated by ralph_b on 2016-03-07 15:39:44 +00:00

Hi michael,

I tried to build from GitHub. Sorry, I have never installed it this way. I am looking for a how-to or documentation so I can test it on my box.

icinga-migration · 2016-03-07T15:43:42Z

Updated by mfriedrich on 2016-03-07 15:43:42 +00:00

@ralph_b

Change the repository to use the snapshot package repository instead of stable. Then you are able to install the icinga2 snapshot packages just like normal.

icinga-migration · 2016-03-07T16:09:25Z

Updated by rgrey on 2016-03-07 16:09:25 +00:00

Initial results look promising! I've updated my master using the snapshot repository, and the master itself is now showing the expected number of service checks, rather than multiple results within the same immediate timeframe.

Building (correctly!) from git master branch on my remote node currently ... although that now might be moot.

Great job.

icinga-migration · 2016-03-07T16:15:35Z

Updated by ralph_b on 2016-03-07 16:15:35 +00:00

File added 07-03-2016 17-10-27.png

Hi Michael,

thank you for the hint. I got it.

icinga2 - The Icinga 2 network monitoring daemon (version: v2.4.3-233-g7439633)

Copyright © 2012-2016 Icinga Development Team (https://www.icinga.org/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Application information:
Installation root: /usr
Sysconf directory: /etc
Run directory: /var/run
Local state directory: /var
Package data directory: /usr/share/icinga2
State path: /var/lib/icinga2/icinga2.state
Modified attributes path: /var/lib/icinga2/modified-attributes.conf
Objects path: /var/cache/icinga2/icinga2.debug
Vars path: /var/cache/icinga2/icinga2.vars
PID path: /var/run/icinga2/icinga2.pid

System information:
Platform: Red Hat Enterprise Linux Server
Platform version: 6.7 (Santiago)
Kernel: Linux
Kernel version: 2.6.32-573.18.1.el6.x86_64
Architecture: x86_64

Locally triggered checks are working fine now, but checks started remotely on Icinga clients are still showing strange behavior (see the attached screenshot 07-03-2016 17-10-27.png).

icinga-migration · 2016-03-08T15:16:20Z

Updated by rgrey on 2016-03-08 15:16:20 +00:00

FYI - this seems resolved by running the latest snapshot on my master node. Client nodes are still running stock latest Ubuntu stable release 2.4.3-1.

Master

icinga2 - The Icinga 2 network monitoring daemon (version: v2.4.3-236-g19cb781)

Copyright © 2012-2016 Icinga Development Team (https://www.icinga.org/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Application information:
Installation root: /usr
Sysconf directory: /etc
Run directory: /run
Local state directory: /var
Package data directory: /usr/share/icinga2
State path: /var/lib/icinga2/icinga2.state
Modified attributes path: /var/lib/icinga2/modified-attributes.conf
Objects path: /var/cache/icinga2/icinga2.debug
Vars path: /var/cache/icinga2/icinga2.vars
PID path: /run/icinga2/icinga2.pid

System information:
Platform: Ubuntu
Platform version: 14.04.4 LTS, Trusty Tahr
Kernel: Linux
Kernel version: 3.13.0-79-generic
Architecture: x86_64

Client Node

icinga2 - The Icinga 2 network monitoring daemon (version: r2.4.3-1)

Copyright © 2012-2016 Icinga Development Team (https://www.icinga.org/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Application information:
Installation root: /usr
Sysconf directory: /etc
Run directory: /run
Local state directory: /var
Package data directory: /usr/share/icinga2
State path: /var/lib/icinga2/icinga2.state
Modified attributes path: /var/lib/icinga2/modified-attributes.conf
Objects path: /var/cache/icinga2/icinga2.debug
Vars path: /var/cache/icinga2/icinga2.vars
PID path: /run/icinga2/icinga2.pid

System information:
Platform: Ubuntu
Platform version: 14.04.4 LTS, Trusty Tahr
Kernel: Linux
Kernel version: 3.13.0-79-generic
Architecture: x86_64

icinga-migration · 2016-03-09T10:40:02Z

Updated by mfriedrich on 2016-03-09 10:40:02 +00:00

Priority changed from Normal to High
Target Version set to 2.4.4

Ok, thanks for the tests. I suspect the problem lies in updating the next check time when receiving a new check result without passing the cluster message origin. Besides that, the reverted commits merely affect the passive check results. A proper fix is discussed in #11336.

I'll assign this issue for 2.4.4 - it would be great if you could do further tests with 1) the same snapshot version on all clients 2) ntp running on all nodes (I suspect there could be a time sync problem here as well).

icinga-migration · 2016-03-09T11:36:42Z

Updated by ralph_b on 2016-03-09 11:36:42 +00:00

File added 09-03-2016 12-30-52.png

Hi Michael,

there are three client hosts with icinga2 agents in my small landscape (2 Linux boxes and 1 Windows box), and all of them are now updated to the snapshot. Two of them had time differences because ntpd was not running (I have to talk to the server guys). One Linux box (host ID 97) still shows multiple checks within check_interval (please see the attached screenshot). I am looking for what distinguishes it from the other hosts.

Cheers,
Ralph

icinga-migration · 2016-03-09T12:32:07Z

Updated by ralph_b on 2016-03-09 12:32:07 +00:00

Good news for the icinga2 team. I found the reason for host ID 97: services.conf was filled with the content it ships with, but it has to be empty, so the locally installed icinga2 agent fired checks by itself in addition to the master (my own mistake).
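
The effect described here can be illustrated with a toy model. It assumes that both the master and the local agent schedule the same service on a fixed interval; the function names, the 60-second interval, and the 20-second offset are hypothetical and do not reflect Icinga 2 internals.

```python
def schedule(start, interval, count):
    """Return `count` check timestamps (in seconds) beginning at `start`."""
    return [start + i * interval for i in range(count)]


def short_gaps(timestamps, interval):
    """Return consecutive timestamp pairs that are closer together than `interval`."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a < interval]


CHECK_INTERVAL = 60  # assumed interval in seconds

master = schedule(0, CHECK_INTERVAL, 3)   # checks triggered by the master
agent = schedule(20, CHECK_INTERVAL, 3)   # checks fired by the local agent itself

# A single scheduler never produces two results within one interval.
print(short_gaps(master, CHECK_INTERVAL))

# Two independent schedulers for the same service do.
print(short_gaps(master + agent, CHECK_INTERVAL))
```

In this model every consecutive pair of results is closer together than the interval once the second scheduler is active, which is the kind of pattern a "delta" filter like the one in the original query would flag.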

icinga-migration · 2016-03-11T08:36:17Z

Updated by mfriedrich on 2016-03-11 08:36:17 +00:00

Status changed from Assigned to Resolved
Done % changed from 0 to 100

Ok thanks.

icinga-migration · 2016-03-11T14:56:08Z

Updated by mfriedrich on 2016-03-11 14:56:08 +00:00

Backport? changed from Not yet backported to Already backported

icinga-migration · 2016-03-24T09:37:53Z

Updated by mfriedrich on 2016-03-24 09:37:53 +00:00

Parent Id deleted ~~11310~~

icinga-migration closed this as completed Mar 11, 2016

icinga-migration added the blocker, bug, and Checker labels Jan 17, 2017

icinga-migration added this to the 2.4.4 milestone Jan 17, 2017
