[dev.icinga.com #8833] service checks stuck in "pending" or last check and ignores force check #2810
Comments
Updated by mfriedrich on 2015-03-26 16:21:08 +00:00
Enable the debug log and trace the check from the external command through to the actual execution. This might involve multiple instances and cluster relay messages. Post that here.
Updated by rhillmann on 2015-03-30 15:01:34 +00:00
Unfortunately I couldn't find where the check goes (i.e. whether it was sent to a checker/worker server), but here is the debug log of the master-1 server, where the forced check was scheduled. Look for "server_to_check.domain.com".
Updated by gbeutner on 2015-03-31 06:31:05 +00:00 Can you please show me the output of 'icinga2 object list' (from both instances) for some of the affected hosts and services?
Updated by rhillmann on 2015-03-31 09:07:02 +00:00
Attached you will find an example of the hostalive check for one host, which has currently been stuck for over an hour.
Updated by rhillmann on 2015-04-14 20:10:28 +00:00 It seems to hit checks that were last checked in the same "time frame", for example all checks whose last check was between 03:49 and 03:53.
Updated by mfriedrich on 2015-06-18 08:55:25 +00:00 Can you re-test that with 2.3.5 please?
Updated by gbeutner on 2015-07-16 08:12:28 +00:00
Updated by dgoetz on 2015-07-17 08:13:47 +00:00 I have seen this problem with checks staying in pending (while having status information from the check running one time) and unknown (telling me the endpoint is not connected). Updating to 2.3.7 fixed this for me in the environment I am currently working on.
Updated by rhillmann on 2015-07-31 07:54:35 +00:00 Since 2.3.7 it is quite a bit better, but the problem still exists for some (random) checks.
Updated by sudeshkumar on 2016-01-14 14:45:56 +00:00
I too have the same issue. My setup is a three-node cluster in a single zone. At random, the check results of one of the nodes are not syncing. I have enabled the debug log and confirmed that the check is happening, but the check results are not syncing. I don't see the relay message entries ("notice/ApiListener: Relaying") in the debug log of the affected node when this issue happens. When I went through the code, it seemed the check results are pushed into m_RelayQueue and never processed. I can also see that the work queue size keeps increasing on the affected node. Please find the attached screenshot.
Updated by rhillmann on 2016-01-14 14:54:46 +00:00 @sudeshkumar Which version are you running?
Updated by sudeshkumar on 2016-01-14 15:03:24 +00:00 @rhillmann
Updated by sudeshkumar on 2016-01-18 13:47:01 +00:00 For some reason "m_Spawned" is set to true by default, before it is assigned inside the WorkQueue::Enqueue method. As a result, the worker thread for the ApiListener relay messages is never created, and that causes the issue. I confirmed it by printing some debug statements and using manual builds. It doesn't happen every time, only sometimes when stopping and starting Icinga on one of the nodes, and I have been unable to pin down the exact scenario, as the result is indeterminate. Because of this, the OOM (out-of-memory) killer sometimes kills Icinga after it consumes too much memory. Is anybody else having the same issue? Currently I am using my lab instance to test the cluster performance with 6000+ hosts and 38000+ services, all using the check_dummy plugin. Please help me to resolve this.
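To make the failure mode concrete, here is a minimal, self-contained C++ sketch of the lazy worker-spawn pattern described above. This is not the actual Icinga 2 source: the locking, the queue type, and the main() driver are simplified assumptions; only the names m_Spawned and Enqueue and the lazy-spawn idea come from the comment. If m_Spawned started out as true, Enqueue() would never create the worker thread, so relay messages would pile up unprocessed until the OOM killer steps in, which is exactly the symptom reported.

```cpp
// Simplified sketch of a lazily-spawned work queue (not the Icinga 2 source).
#include <chrono>
#include <condition_variable>
#include <deque>
#include <functional>
#include <iostream>
#include <mutex>
#include <thread>

class WorkQueue {
public:
    void Enqueue(std::function<void()> task) {
        std::unique_lock<std::mutex> lock(m_Mutex);

        // The worker thread is created on the first Enqueue() call only.
        // If m_Spawned defaulted to true (the bug described above), this
        // branch would never run, no worker would exist, and the queue
        // would grow without bound.
        if (!m_Spawned) {
            m_Spawned = true;
            std::thread(&WorkQueue::WorkerLoop, this).detach();
        }

        m_Tasks.push_back(std::move(task));
        m_CV.notify_one();
    }

private:
    void WorkerLoop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m_Mutex);
                m_CV.wait(lock, [this] { return !m_Tasks.empty(); });
                task = std::move(m_Tasks.front());
                m_Tasks.pop_front();
            }
            task(); // e.g. relaying a check result to other endpoints
        }
    }

    std::mutex m_Mutex;
    std::condition_variable m_CV;
    std::deque<std::function<void()>> m_Tasks;
    bool m_Spawned = false; // correct default; 'true' here reproduces the bug
};

int main() {
    WorkQueue q;
    q.Enqueue([] { std::cout << "relay message processed\n"; });
    // Give the detached worker a moment to run before the process exits.
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
}
```

With the flag defaulting to false this prints "relay message processed"; flipping the default to true makes the program print nothing, mirroring the ever-growing m_RelayQueue described above.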
Updated by mfriedrich on 2016-03-04 15:54:12 +00:00
Updated by mjbrooks on 2016-03-10 15:52:28 +00:00 I can confirm that I've seen this bug in the wild on 2.4.3-r1; downgrading to 2.3.11 resolved the issue. Bisecting between those versions might shed some light on the problem.
Updated by vsakhart on 2016-04-15 21:43:09 +00:00 I am also having this issue. I am on version 2.4.4-1.
Updated by 00stromy00 on 2016-06-09 12:31:04 +00:00 I can also confirm this behavior in 2.4.4.1.
Updated by rglemaire on 2016-06-12 16:47:29 +00:00
I have this issue, and a restart isn't enough. After a lot of restarts it has been reduced to 22 PENDING. Today I restarted 2 hosts; Icinga 2 detected it and put 6 services in CRITICAL and 3 in UNKNOWN. Even zipped, my debug file is too big (6.5 MB). I can search for something and extract the part you want.
Updated by mfriedrich on 2016-11-09 14:52:12 +00:00
Updated by rglemaire on 2016-11-15 09:05:12 +00:00 Hi, good news! It works for me. My mistake: the livestatus feature was disabled. Thanks.
Updated by saravanakumar on 2016-11-17 07:07:59 +00:00 Hi, I am also getting the same error: after an icinga2 restart, not all services come up in the UI; they stay in the pending state for more than 40 minutes. A check works once after clicking the check-now button in the UI. I am using a standalone server, not a cluster.
Details: I have installed from source and am using source builds for PostgreSQL, httpd, and Icinga Web; both Icinga Web and the Icinga core are version 2.
sh-4.1$ /home/saravana/selfmonitoring/icinga2/icinga_server/lib64/icinga2/sbin/icinga2 -V
Copyright © 2012-2016 Icinga Development Team (https://www.icinga.org/)
Application information:
System information:
Updated by gbeutner on 2016-11-17 08:58:43 +00:00 @saravanakumar: Consider testing this with 2.5.4. Also, you might want to use packages instead.
Updated by saravanakumar on 2016-11-17 12:41:29 +00:00 gunnarbeutner wrote:
Which package should I use? Is there any possibility to merge that into version 2.4.6?
I consider this fixed with v2.6.1. Please upgrade to the latest stable and supported version; 2.4.x is already EOL.
This issue has been migrated from Redmine: https://dev.icinga.com/issues/8833
Created by rhillmann on 2015-03-21 22:36:46 +00:00
Assignee: (none)
Status: New
Target Version: (none)
Last Update: 2016-11-17 12:41:29 +00:00 (in Redmine)
I have seen a big problem in our cluster environment and can't see what could be wrong, so I guess this is a critical bug in Icinga 2.
A lot of checks are stuck in the pending state or are not getting rescheduled. Some checks should have run in the past, but are stuck at their last check time (>6h ago!).
Scheduling the next check (with force) doesn't solve the problem: the check only gets a new schedule time, but the check never happens.
I have double-checked the services with the Classic UI and the new Web 2. It's almost the same in both, so the problem must be located in the core.
I have observed this issue since 2.3.0, but I am not sure whether it was present in earlier releases.
The cluster nodes are not under heavy load, so it's not a performance issue; they are almost always at a load of 2.
Attachments
zone.conf