[dev.icinga.com #11273] Services status updated multiple times within check_interval even though no retry was triggered #3990

Closed
icinga-migration opened this issue Mar 2, 2016 · 31 comments
Labels: blocker, bug

@icinga-migration

This issue has been migrated from Redmine: https://dev.icinga.com/issues/11273

Created by ralph_b on 2016-03-02 09:37:02 +00:00

Assignee: mfriedrich
Status: Resolved (closed on 2016-03-11 08:36:17 +00:00)
Target Version: 2.4.4
Last Update: 2016-03-24 09:37:53 +00:00 (in Redmine)

Icinga Version: 2.4.3
Backport?: Already backported
Include in Changelog: 1

Hi to all,

icinga2 fires new checks on a service multiple times before check_interval has elapsed.

icinga2 --version:
icinga2 - The Icinga 2 network monitoring daemon (version: v2.4.3)
Copyright © 2012-2016 Icinga Development Team (https://www.icinga.org/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Application information:
Installation root: /usr
Sysconf directory: /etc
Run directory: /var/run
Local state directory: /var
Package data directory: /usr/share/icinga2
State path: /var/lib/icinga2/icinga2.state
Modified attributes path: /var/lib/icinga2/modified-attributes.conf
Objects path: /var/cache/icinga2/icinga2.debug
Vars path: /var/cache/icinga2/icinga2.vars
PID path: /var/run/icinga2/icinga2.pid

System information:
Platform: Red Hat Enterprise Linux Server
Platform version: 6.7 (Santiago)
Kernel: Linux
Kernel version: 2.6.32-573.18.1.el6.x86_64
Architecture: x86_64

icinga2 feature list:
Disabled features: compatlog debuglog gelf icingastatus livestatus opentsdb perfdata statusdata syslog
Enabled features: api checker command graphite ido-pgsql mainlog notification

The output of the following command is in the attached file:
SELECT
status.servicestatus_id,
status.status_update_time,
status.last_check,
status.next_check,
age(status.next_check, status.last_check) as delta,
to_char(services.check_interval * 60 , '99') as check_interval,
status.service_object_id,
status.current_state,
status.has_been_checked,
status.current_check_attempt,
status.should_be_scheduled,
status.is_flapping

FROM
public.icinga_servicestatus status inner join icinga_services services on (status.service_object_id = services.service_object_id)

where status.status_update_time > '2016-03-02 10:27:00+01'::timestamp
and age(status.next_check, status.last_check) < '00:00:58'::time

order by status_update_time;

Please pay attention to column "delta".

I cannot see any rule or pattern behind when these checks are fired.
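The comparison the query makes can be sketched in a few lines of Python. This is only an illustration: the sample rows are made up (they are not taken from the attached file), `early_rechecks` is a hypothetical helper name, and `check_interval` is assumed to be stored in minutes, as the `* 60` in the query suggests. The 2-second tolerance mirrors the query's 58-second cutoff for a 60-second interval.

```python
from datetime import datetime, timedelta


def early_rechecks(rows, tolerance=timedelta(seconds=2)):
    """Return (service_object_id, delta, interval) for rows whose
    next_check - last_check is shorter than the configured check_interval."""
    flagged = []
    for row in rows:
        delta = row["next_check"] - row["last_check"]
        # Assumption: check_interval is stored in minutes.
        interval = timedelta(minutes=row["check_interval"])
        if delta < interval - tolerance:
            flagged.append((row["service_object_id"], delta, interval))
    return flagged


# Hypothetical sample rows for illustration only.
rows = [
    {"service_object_id": 101,
     "last_check": datetime(2016, 3, 2, 10, 27, 0),
     "next_check": datetime(2016, 3, 2, 10, 28, 0),
     "check_interval": 1},
    {"service_object_id": 102,
     "last_check": datetime(2016, 3, 2, 10, 27, 5),
     "next_check": datetime(2016, 3, 2, 10, 27, 17),
     "check_interval": 1},
]

for service_id, delta, interval in early_rechecks(rows):
    print(service_id, delta, interval)
```

Only the second row is flagged here, because its next check is scheduled 12 seconds after the last one although the interval is one minute.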

Attachments

Changesets

2016-03-05 17:15:03 +00:00 by mfriedrich b8e3d61

Revert "Properly set the next check time for active and passive checks"

This reverts commit 2a11b27972e4325bf80e9abc9017eab7dd03e712.

This patch does not properly work and breaks the check_interval setting
for passive checks. Requires a proper patch.

refs #11248
refs #11257
refs #11273

(the old issue)
refs #7287

2016-03-05 17:16:49 +00:00 by mfriedrich ef532f2

Revert "Fix check scheduling w/ retry_interval"

This reverts commit a51e647cc760bd5f7c4de6182961a477478c11a9.

This patch causes trouble with check results received
1) passively 2) throughout the cluster. A proper patch
for setting the retry_interval on NOT-OK state changes
is required.

refs #11248
refs #11257
refs #11273

(the old issue)
refs #7287

2016-03-11 14:55:03 +00:00 by mfriedrich 8344f74

Revert "Properly set the next check time for active and passive checks"

This reverts commit 2a11b27972e4325bf80e9abc9017eab7dd03e712.

This patch does not properly work and breaks the check_interval setting
for passive checks. Requires a proper patch.

refs #11248
refs #11257
refs #11273

(the old issue)
refs #7287

2016-03-11 14:55:14 +00:00 by mfriedrich f99feab

Revert "Fix check scheduling w/ retry_interval"

This reverts commit a51e647cc760bd5f7c4de6182961a477478c11a9.

This patch causes trouble with check results received
1) passively 2) throughout the cluster. A proper patch
for setting the retry_interval on NOT-OK state changes
is required.

refs #11248
refs #11257
refs #11273

(the old issue)
refs #7287
@icinga-migration

Updated by mfriedrich on 2016-03-02 16:14:12 +00:00

  • Category set to Checker
  • Status changed from New to Feedback
  • Assigned to set to ralph_b

Are you using an Icinga 2 Cluster, or any nodes actually executing these checks? Please add the relevant zones.conf entries.

@icinga-migration

Updated by ralph_b on 2016-03-02 20:53:31 +00:00

No, we don't actually use an Icinga 2 cluster. The troubleshooting file contains the whole master1 zone definition. An icinga2 agent is installed on almost all clients, but the services.conf on these clients is empty. At the moment all checks are triggered by the master. Communication between master and clients is a one-way street (admin network to customer network).

@icinga-migration

Updated by mfriedrich on 2016-03-03 08:17:16 +00:00

  • Status changed from Feedback to New
  • Assigned to changed from ralph_b to mfriedrich

Ok, thanks. I'll try to reproduce the issue.

Cheers,
Michael

@icinga-migration

Updated by mfriedrich on 2016-03-03 08:19:34 +00:00

  • Status changed from New to Assigned

@icinga-migration

Updated by ralph_b on 2016-03-03 08:44:41 +00:00

Additional information: I reduced the scenario to master -> one single client without an icinga2 agent. In this scenario the master shows the same behavior.

Cheers,
Ralph

@icinga-migration

Updated by rgrey on 2016-03-03 17:28:54 +00:00

I think I'm experiencing the same issue on Ubuntu. Single node reporting in hundreds of times a second. Let me know if/what further info I can provide to help.

icinga2 - The Icinga 2 network monitoring daemon (version: r2.4.3-1)

Copyright © 2012-2016 Icinga Development Team (https://www.icinga.org/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Application information:
Installation root: /usr
Sysconf directory: /etc
Run directory: /run
Local state directory: /var
Package data directory: /usr/share/icinga2
State path: /var/lib/icinga2/icinga2.state
Modified attributes path: /var/lib/icinga2/modified-attributes.conf
Objects path: /var/cache/icinga2/icinga2.debug
Vars path: /var/cache/icinga2/icinga2.vars
PID path: /run/icinga2/icinga2.pid

System information:
Platform: Ubuntu
Platform version: 14.04.4 LTS, Trusty Tahr
Kernel: Linux
Kernel version: 3.13.0-79-generic
Architecture: x86_64

@icinga-migration

Updated by rgrey on 2016-03-04 09:49:16 +00:00

So, some single node stats (aggregated through Graylog) for a node running for the last 5 minutes

Value     %        Count

icinga    54.50%   1,496
load      10.53%     289
procs     10.27%     282
disk       5.17%     142
swap       5.10%     140
disk /     5.06%     139
ssh        5.03%     138
users      4.12%     113
apt        0.15%       4
ping4      0.07%       2
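As a cross-check of the figures above, the percentages can be recomputed from the counts. The sketch below is a hedged illustration: the function names are hypothetical, and the 5-minute window is an assumption based on the wording of the comment.

```python
# Message counts per service as listed in the table above.
counts = {
    "icinga": 1496, "load": 289, "procs": 282, "disk": 142, "swap": 140,
    "disk /": 139, "ssh": 138, "users": 113, "apt": 4, "ping4": 2,
}


def shares(counts):
    """Percentage share of each service, rounded to two decimals."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


def per_minute(counts, window_minutes=5):
    """Average messages per minute, assuming a 5-minute window."""
    return {name: n / window_minutes for name, n in counts.items()}


print(sum(counts.values()))
for name, share in shares(counts).items():
    print(f"{name:8s} {share:6.2f}%  {per_minute(counts)[name]:7.1f}/min")
```

Under that assumption the "icinga" check alone averages close to 300 results per minute, far more than a regular check_interval would produce.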

@icinga-migration

Updated by mfriedrich on 2016-03-04 14:03:43 +00:00

https://monitoring-portal.org/index.php?thread/35412-services-checks-werden-mehrfach-ausgef%C3%BChrt/&postID=225805#post225805 (for reference)

@icinga-migration

Updated by mfriedrich on 2016-03-04 15:27:26 +00:00

  • Relates set to 11257

@icinga-migration

Updated by mfriedrich on 2016-03-04 15:27:37 +00:00

  • Relates set to 11248

@icinga-migration

Updated by mfriedrich on 2016-03-04 15:27:49 +00:00

  • Relates set to 11226

@icinga-migration

Updated by mfriedrich on 2016-03-04 15:31:42 +00:00

  • Relates deleted 11257

@icinga-migration

Updated by mfriedrich on 2016-03-04 15:33:36 +00:00

  • Relates deleted 11248

@icinga-migration

Updated by mfriedrich on 2016-03-04 15:33:39 +00:00

  • Relates deleted 11226

@icinga-migration

Updated by mfriedrich on 2016-03-04 15:33:46 +00:00

  • Parent Id set to 11310

@icinga-migration

Updated by mfriedrich on 2016-03-05 17:40:18 +00:00

I've reverted 2 commits which might be causing trouble here. Can you please re-test the current git master?

@icinga-migration

Updated by rgrey on 2016-03-07 11:48:01 +00:00

  • File added Capture.PNG

dnsmichi wrote:

I've reverted 2 commits which might be causing trouble here. Can you please re-test the current git master?

I've downloaded and built the master from git and deployed that build to one node.

Results: last 5 minutes: > 13,000 service check messages sent to my Graylog instance - see the attached image.

@icinga-migration

Updated by mfriedrich on 2016-03-07 14:53:25 +00:00

Hm, that's fairly strange. I'm using a 3 node cluster (2 nodes in master zone, 1 satellite for command_endpoint checks using the latest icinga2 --version v2.4.3-232-gef532f2) and I don't see such behavior.

@rgrey
Can you please add more details, such as the zones.conf from both the master and the client, and the output of "icinga2 --version"?

@icinga-migration

Updated by rgrey on 2016-03-07 15:06:57 +00:00

Hmm, I must have done something wrong, as my icinga2 --version on the node still says r2.4.3-1 rather than a git version. I'll do some more work ... sorry. Also, I only built and deployed this to my single remote node. I hadn't changed my master installation. Please advise.

icinga2 - The Icinga 2 network monitoring daemon (version: r2.4.3-1)

Copyright © 2012-2016 Icinga Development Team (https://www.icinga.org/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Application information:
Installation root: /usr/local
Sysconf directory: /usr/local/etc
Run directory: /usr/local/var/run
Local state directory: /usr/local/var
Package data directory: /usr/local/share/icinga2
State path: /usr/local/var/lib/icinga2/icinga2.state
Modified attributes path: /usr/local/var/lib/icinga2/modified-attributes.conf
Objects path: /usr/local/var/cache/icinga2/icinga2.debug
Vars path: /usr/local/var/cache/icinga2/icinga2.vars
PID path: /usr/local/var/run/icinga2/icinga2.pid

System information:
Platform: Ubuntu
Platform version: 14.04.4 LTS, Trusty Tahr
Kernel: Linux
Kernel version: 3.13.0-79-generic
Architecture: x86_64

@icinga-migration

Updated by mfriedrich on 2016-03-07 15:34:56 +00:00

I fixed the snapshot package repository for Ubuntu Trusty; you should now see the latest packages there.

Please update the affected node and the master.

@icinga-migration

Updated by ralph_b on 2016-03-07 15:39:44 +00:00

Hi Michael,

I tried to build from GitHub. Sorry, I have never installed it this way. I am looking for a how-to or documentation so I can test it on my box.

@icinga-migration

Updated by mfriedrich on 2016-03-07 15:43:42 +00:00

@ralph_b

Change the repository to use the snapshot package repository instead of stable. Then you are able to install the icinga2 snapshot packages just like normal.

@icinga-migration

Updated by rgrey on 2016-03-07 16:09:25 +00:00

Initial results look promising! I've updated my master using the snapshot repository, and it is now showing the expected number of service checks rather than multiple results within the same immediate timeframe.

Building (correctly!) from git master branch on my remote node currently ... although that now might be moot.

Great job.

@icinga-migration

Updated by ralph_b on 2016-03-07 16:15:35 +00:00

  • File added 07-03-2016 17-10-27.png

Hi Michael,

thank you for the hint. I got it.

icinga2 - The Icinga 2 network monitoring daemon (version: v2.4.3-233-g7439633)

Copyright © 2012-2016 Icinga Development Team (https://www.icinga.org/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Application information:
Installation root: /usr
Sysconf directory: /etc
Run directory: /var/run
Local state directory: /var
Package data directory: /usr/share/icinga2
State path: /var/lib/icinga2/icinga2.state
Modified attributes path: /var/lib/icinga2/modified-attributes.conf
Objects path: /var/cache/icinga2/icinga2.debug
Vars path: /var/cache/icinga2/icinga2.vars
PID path: /var/run/icinga2/icinga2.pid

System information:
Platform: Red Hat Enterprise Linux Server
Platform version: 6.7 (Santiago)
Kernel: Linux
Kernel version: 2.6.32-573.18.1.el6.x86_64
Architecture: x86_64

Locally triggered checks are working fine now, but checks started remotely on Icinga clients are still showing strange behavior (see the attached screenshot).

@icinga-migration

Updated by rgrey on 2016-03-08 15:16:20 +00:00

FYI - this seems resolved by running the latest snapshot on my master node. Client nodes are still running stock latest Ubuntu stable release 2.4.3-1.

Master

icinga2 - The Icinga 2 network monitoring daemon (version: v2.4.3-236-g19cb781)

Copyright © 2012-2016 Icinga Development Team (https://www.icinga.org/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Application information:
Installation root: /usr
Sysconf directory: /etc
Run directory: /run
Local state directory: /var
Package data directory: /usr/share/icinga2
State path: /var/lib/icinga2/icinga2.state
Modified attributes path: /var/lib/icinga2/modified-attributes.conf
Objects path: /var/cache/icinga2/icinga2.debug
Vars path: /var/cache/icinga2/icinga2.vars
PID path: /run/icinga2/icinga2.pid

System information:
Platform: Ubuntu
Platform version: 14.04.4 LTS, Trusty Tahr
Kernel: Linux
Kernel version: 3.13.0-79-generic
Architecture: x86_64

Client Node

icinga2 - The Icinga 2 network monitoring daemon (version: r2.4.3-1)

Copyright © 2012-2016 Icinga Development Team (https://www.icinga.org/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Application information:
Installation root: /usr
Sysconf directory: /etc
Run directory: /run
Local state directory: /var
Package data directory: /usr/share/icinga2
State path: /var/lib/icinga2/icinga2.state
Modified attributes path: /var/lib/icinga2/modified-attributes.conf
Objects path: /var/cache/icinga2/icinga2.debug
Vars path: /var/cache/icinga2/icinga2.vars
PID path: /run/icinga2/icinga2.pid

System information:
Platform: Ubuntu
Platform version: 14.04.4 LTS, Trusty Tahr
Kernel: Linux
Kernel version: 3.13.0-79-generic
Architecture: x86_64

@icinga-migration

Updated by mfriedrich on 2016-03-09 10:40:02 +00:00

  • Priority changed from Normal to High
  • Target Version set to 2.4.4

Ok, thanks for the tests. I suspect the problem lies in updating the next check time when a new check result is received without passing the cluster message origin. Besides that, the reverted commits merely affect passive check results. A proper fix is discussed in #11336.

I'll assign this issue to 2.4.4. It would be great if you could do further tests with 1) the same snapshot version on all clients and 2) ntp running on all nodes (I suspect a time sync problem might be involved here as well).
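One way to picture why a missing message origin could matter is the toy model below. It is not Icinga 2 code and makes no claim about how Icinga 2 relays cluster messages internally; the node names, the `peers` map, and `relay_count` are all hypothetical. It only shows the general pattern: if a receiver does not know where an update came from, it may treat it as a local change and send it back to the sender.

```python
from collections import deque


def relay_count(pass_origin, peers, start, max_messages=20):
    """Count delivered messages after `start` announces a next_check update.

    Toy model only: each receiver relays the update to its peers and skips
    the sender only if the origin was passed along with the message.
    """
    # Queue entries are (receiver, sender) pairs for in-flight messages.
    queue = deque((peer, start) for peer in peers[start])
    delivered = 0
    while queue and delivered < max_messages:
        receiver, sender = queue.popleft()
        delivered += 1
        # Without an origin the update looks like a local change.
        origin = sender if pass_origin else None
        for peer in peers[receiver]:
            if peer != origin:
                queue.append((peer, receiver))
    return delivered


# Hypothetical two-node setup.
peers = {"master": ["client"], "client": ["master"]}

print(relay_count(True, peers, "master"))
print(relay_count(False, peers, "master"))
```

With the origin passed, a single message is delivered and the exchange stops. Without it, the two nodes keep bouncing the update until the artificial `max_messages` cap is hit.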

@icinga-migration
Copy link
Author

Updated by ralph_b on 2016-03-09 11:36:42 +00:00

  • File added 09-03-2016 12-30-52.png

Hi Michael,

there are three client hosts with icinga2 agents in my small landscape (2 Linux boxes and 1 Windows box), all of which are now updated to the snapshot. Two of them had time differences because ntpd was not running (I have to talk to the server guys). One Linux box (host ID 97) still shows multiple checks within check_interval (please see the attached screenshot). I am looking for what makes it different from the other hosts.

Cheers,
Ralph

@icinga-migration

Updated by ralph_b on 2016-03-09 12:32:07 +00:00

Good news for the icinga2 team. I found the reason for host ID 97: its services.conf still contained the default content it ships with, but it has to be empty. As a result, the locally installed icinga2 agent fired checks by itself in addition to the master (my own mistake).
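The effect of two schedulers running the same check can be illustrated with a short sketch. The 60-second interval and the 25-second offset between master and agent are made-up values, and `merged_deltas` is a hypothetical helper, not part of Icinga 2.

```python
def merged_deltas(interval, offsets, horizon):
    """Gaps (in seconds) between consecutive check results when several
    schedulers run the same check with the same interval.

    `offsets` holds each scheduler's start time; `horizon` is the last
    second considered.
    """
    times = sorted(
        offset + n * interval
        for offset in offsets
        for n in range((horizon - offset) // interval + 1)
    )
    return [later - earlier for earlier, later in zip(times, times[1:])]


# Master only: results arrive one full interval apart.
print(merged_deltas(60, [0], 180))      # [60, 60, 60]

# Master plus a local agent that also schedules the check (25 s later).
print(merged_deltas(60, [0, 25], 180))  # [25, 35, 25, 35, 25, 35]
```

Each scheduler on its own respects the interval, but the combined stream of results shows gaps well below check_interval, which matches what the status table looked like for host ID 97.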

@icinga-migration

Updated by mfriedrich on 2016-03-11 08:36:17 +00:00

  • Status changed from Assigned to Resolved
  • Done % changed from 0 to 100

Ok thanks.

@icinga-migration

Updated by mfriedrich on 2016-03-11 14:56:08 +00:00

  • Backport? changed from Not yet backported to Already backported

@icinga-migration

Updated by mfriedrich on 2016-03-24 09:37:53 +00:00

  • Parent Id deleted 11310

@icinga-migration added the blocker, bug, and Checker labels on Jan 17, 2017
@icinga-migration added this to the 2.4.4 milestone on Jan 17, 2017