
[dev.icinga.com #8833] service checks stuck in "pending" or last check and ignores force check #2810

Closed
icinga-migration opened this issue Mar 21, 2015 · 24 comments
Labels
area/distributed Distributed monitoring (master, satellites, clients) bug Something isn't working

Comments

@icinga-migration

This issue has been migrated from Redmine: https://dev.icinga.com/issues/8833

Created by rhillmann on 2015-03-21 22:36:46 +00:00

Assignee: (none)
Status: New
Target Version: (none)
Last Update: 2016-11-17 12:41:29 +00:00 (in Redmine)

Icinga Version: 2.3.2
Backport?: Not yet backported
Include in Changelog: 1

I have seen a big problem in our cluster environment and cannot see what could be wrong, so I suspect this is a critical bug in Icinga 2.
A lot of checks are stuck in the pending state or are not getting rescheduled. Some checks should have been executed in the past, but are stuck at their last check time (more than 6 hours ago!).

Scheduling the next check (with force) does not solve the problem: the check only gets a new schedule time, but the check never happens.
I have double-checked the services with the Classic UI and the new Web 2. The result is almost the same in both, so the problem must be located in the core.

I have observed this issue since 2.3.0, but I am not sure whether it was present in earlier releases.
The cluster nodes are not under heavy load, so it is not a performance issue; they sit at a load of about 2.
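
For reference, a forced reschedule can also be issued from the shell through the external command pipe, which makes it easier to correlate with the debug log. A minimal sketch, assuming the default command pipe path and placeholder host/service names:

# force-reschedule a single service check via the external command pipe
now=$(date +%s)
printf "[%s] SCHEDULE_FORCED_SVC_CHECK;server_to_check.domain.com;some-service;%s\n" \
  "$now" "$now" > /var/run/icinga2/cmd/icinga2.cmd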

zone.conf

object Endpoint "hydrogenguest6.a2.rz1.domain.com" {
  host = "hydrogenguest6.a2.rz1.domain.com" 
  log_duration = 2h
}
object Endpoint "fluorineguest6.b3.rz1.domain.com" {
  host = "fluorineguest6.b3.rz1.domain.com" 
  log_duration = 2h
}
object Endpoint "hydrogenguest4.a1.rz0.domain.com" {
  host = "hydrogenguest4.a1.rz0.domain.com" 
  log_duration = 2h
}
object Endpoint "dresdenguest6.muc.domain.com" {
  host = "dresdenguest6.muc.domain.com" 
  log_duration = 2h
}
object Endpoint "hydrogenguest3.a1.rz0.domain.com" {
  host = "hydrogenguest3.a1.rz0.domain.com" 
  log_duration = 2h
}
object Endpoint "chlorineguest10.a2.rz1.domain.com" {
  host = "chlorineguest10.a2.rz1.domain.com" 
  log_duration = 2h
}
object Endpoint "dresdenguest7.muc.domain.com" {
  host = "dresdenguest7.muc.domain.com" 
  log_duration = 2h
}
object Endpoint "hydrogenguest9.b2.rz1.domain.com" {
  host = "hydrogenguest9.b2.rz1.domain.com" 
  log_duration = 2h
}
object Zone "global" {
  global = true
}

object Zone "frontend" {
  endpoints = ["hydrogenguest4.a1.rz0.domain.com","dresdenguest7.muc.domain.com",]
}

object Zone "master" {
  endpoints = ["hydrogenguest6.a2.rz1.domain.com","fluorineguest6.b3.rz1.domain.com",]
  parent = "frontend" 
}

object Zone "worker" {
  endpoints = ["dresdenguest6.muc.domain.com","hydrogenguest3.a1.rz0.domain.com","chlorineguest10.a2.rz1.domain.com","hydrogenguest9.b2.rz1.domain.com",]
  parent = "master" 

Attachments

@icinga-migration

Updated by mfriedrich on 2015-03-26 16:21:08 +00:00

  • Status changed from New to Feedback
  • Assigned to set to rhillmann

Enable the debug log and trace the check from the external command through the actual execution. This might involve multiple instances and cluster relay messages. Post that here.
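
A minimal sketch of that, assuming a default package installation (paths and service manager may differ):

# enable the debug log on every involved instance and reload
icinga2 feature enable debuglog
service icinga2 reload

# then follow the affected object through the log on each instance
grep 'server_to_check.domain.com' /var/log/icinga2/debug.log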

@icinga-migration

Updated by rhillmann on 2015-03-30 15:01:34 +00:00

  • File added master-1.debug.1.tar.gz
  • File added master-1.debug.2.tar.gz

Unfortunately I could not find where the check went (i.e. whether it was sent to a checker/worker server), but here is the debug log of the master-1 server, where the forced check was scheduled. Look for "server_to_check.domain.com".
The log is split into two parts (the 5 MB attachment limit is a pain...).

@icinga-migration

Updated by gbeutner on 2015-03-31 06:31:05 +00:00

Can you please show me the output of 'icinga2 object list' (from both instances) for some of the affected hosts and services?
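
The output can be narrowed down to the affected objects with the --type and --name filters; a sketch with placeholder names:

icinga2 object list --type Host --name 'server_to_check.domain.com'
icinga2 object list --type Service --name 'server_to_check.domain.com!*'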

@icinga-migration

Updated by rhillmann on 2015-03-31 09:07:02 +00:00

  • File added object-list.txt

Attached you will find an example of the hostalive check for one host, which has now been stuck for over an hour.
The objects seem to exist on all nodes.
Could this be related to Bug #8712?

@icinga-migration

Updated by rhillmann on 2015-04-14 20:10:28 +00:00

It seems to hit checks whose last check falls into the same time frame, for example all checks whose last check was between 03:49 and 03:53.

@icinga-migration

Updated by mfriedrich on 2015-06-18 08:55:25 +00:00

Can you re-test that with 2.3.5 please?

@icinga-migration

Updated by gbeutner on 2015-07-16 08:12:28 +00:00

  • Status changed from Feedback to New
  • Assigned to deleted rhillmann

@icinga-migration

Updated by dgoetz on 2015-07-17 08:13:47 +00:00

I have seen this problem with checks staying in PENDING (while having status information from a single check run) and UNKNOWN (telling me the endpoint is not connected). Updating to 2.3.7 fixed this for me in the environment I am currently working on.

@icinga-migration

Updated by rhillmann on 2015-07-31 07:54:35 +00:00

Since 2.3.7 it is quite a bit better, but the problem still occurs for some (random) checks.

@icinga-migration

Updated by sudeshkumar on 2016-01-14 14:45:56 +00:00

  • File added workqueue.PNG

I have the same issue. My setup is a three-node cluster in a single zone. At random, the check results of one of the nodes stop syncing. I have enabled the debug log and confirmed that the checks are executed, but the check results are not synced.

I don't see any relay message entries ("notice/ApiListener: Relaying") in the debug log of the affected node while this issue is happening. Going through the code, it seems the check results are pushed into m_RelayQueue but never processed. I can also see the work queue size keep increasing on the affected node. Please find the attached screenshot.
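
For comparison, a quick way to see whether an instance is still relaying at all (assuming the default debug log path) is to count those entries on every node; on the affected node the count stays flat while it keeps growing on the healthy ones:

grep -c 'ApiListener: Relaying' /var/log/icinga2/debug.log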

@icinga-migration

Updated by rhillmann on 2016-01-14 14:54:46 +00:00

@sudeshkumar which version are you running?
We are currently running 2.3.11, which is working fine now.

@icinga-migration

Updated by sudeshkumar on 2016-01-14 15:03:24 +00:00

@rhillmann
We are also running 2.3.11, but the behaviour is inconsistent; sometimes it works fine, probably right after a reload or restart of Icinga.

@icinga-migration

Updated by sudeshkumar on 2016-01-18 13:47:01 +00:00

For some reason the "m_Spawned" is set to true by default before assigning it inside the " WorkQueue::Enqueue" method. So the worker thread for API Listener relay message has not created and that caused the issue.

I can confirm it by print some debug statements & used the manual builds. It wasn't happening always, but for sometime when stop & start icinga in one of the node and unable to find the exact scenarios as the result is indeterminate. Due to that sometimes the OOM (Out Of Memorymanagement) killer kills icinga because of it took more memory.

Does anybody having the same issue?, Currently I am using my lab instance to test the cluster performance with 6000+ hosts & 38000+ services. All using the check_dummy plugin.

Please help me to resolve this.
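
As a side note, the OOM kills and the memory growth can be confirmed with standard tooling; a sketch, nothing Icinga-specific:

# did the kernel OOM killer terminate icinga2?
dmesg | grep -i -E 'out of memory|killed process'

# watch the resident memory of the icinga2 processes grow
watch -n 60 'ps -C icinga2 -o pid,rss,cmd'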

@icinga-migration

Updated by mfriedrich on 2016-03-04 15:54:12 +00:00

  • Parent Id set to 11313

@icinga-migration

Updated by mjbrooks on 2016-03-10 15:52:28 +00:00

I can confirm that I've seen this bug in the wild on 2.4.3-r1; downgrading to 2.3.11 resolved the issue. So bisecting between those versions might shed some light on the problem.

@icinga-migration

Updated by vsakhart on 2016-04-15 21:43:09 +00:00

I am also having this issue, on version 2.4.4-1ppa1precise1. I have an HA cluster with 2 nodes, and I managed to circumvent the issue by disabling my slave node and having only the master run checks. After a restart, all the pending checks were processed and the checks that were stuck (with "check result is late") and would never execute were cleared.
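
For the record, one way to let only the master run checks is to disable the checker feature on the secondary node (a sketch; the original setup may simply have shut the node down instead):

# on the secondary endpoint
icinga2 feature disable checker
service icinga2 restart

# revert later with: icinga2 feature enable checker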

@icinga-migration

Updated by 00stromy00 on 2016-06-09 12:31:04 +00:00

I can also confirm this behavior on 2.4.4.1.
After a restart (service icinga2 restart) all is fine.
Eventually the "bug" comes back and another restart is required.

@icinga-migration

Updated by rglemaire on 2016-06-12 16:47:29 +00:00

  • File added troubleshooting-2016-06-12_16_25_32.log

I have this issue and a restart isn't enough.
My setup:
6 hosts
69 services
1 master: icinga2 2.4.10-1 on Debian Jessie
4 clients: 2 running icinga2 2.4.10-1 on Debian Jessie and 2 running icinga2 2.4.7-1 on openSUSE Leap

After a lot of restarts the number of stuck checks has been reduced to 22 PENDING.

Today I restarted 2 hosts. Icinga 2 detected it and put 6 services into CRITICAL and 3 into UNKNOWN.
The services have been UP again for more than an hour, but for Icinga 2 they are still CRITICAL or UNKNOWN.
Checks are executed for OK and WARNING services and hosts.
No checks are executed for CRITICAL, UNKNOWN or PENDING ones.
The affected check can be either a remote check or a master check.

Even zipped, my debug log is too big (6.5 MB). I can search for something or extract whichever part you want.
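
If it helps, a filtered and compressed extract along these lines usually stays well below the attachment limit (the host name is a placeholder):

grep -E 'name-of-affected-host|Relaying' /var/log/icinga2/debug.log | gzip > debug-extract.log.gz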

@icinga-migration

Updated by mfriedrich on 2016-11-09 14:52:12 +00:00

  • Parent Id deleted 11313

@icinga-migration

Updated by rglemaire on 2016-11-15 09:05:12 +00:00

Hi,

Good news! It works for me now.
icinga2 version: r2.5.4-1

My mistake: the livestatus feature was disabled.
After enabling livestatus, everything works fine.

Thanks.

@icinga-migration

Updated by saravanakumar on 2016-11-17 07:07:59 +00:00

Hi, I am also getting the same error: after an icinga2 restart not all services come up in the UI; they stay in the pending state for more than 40 minutes. A check only happens after clicking the "Check now" button in the UI. I am using a standalone server, not a cluster.

Details:

I have installed from source and am using source builds for PostgreSQL, httpd and Icinga Web; both Icinga Web and the Icinga core are version 2.

sh-4.1$ /home/saravana/selfmonitoring/icinga2/icinga_server/lib64/icinga2/sbin/icinga2 -V
icinga2 - The Icinga 2 network monitoring daemon (version: r2.4.6-1)

Copyright © 2012-2016 Icinga Development Team (https://www.icinga.org/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Application information:
Installation root: /home/saravana/selfmonitoring/icinga2/icinga_server
Sysconf directory: /home/saravana/selfmonitoring/icinga2/icinga_server/etc
Run directory: /home/saravana/selfmonitoring/icinga2/icinga_server/var/run
Local state directory: /home/saravana/selfmonitoring/icinga2/icinga_server/var
Package data directory: /home/saravana/selfmonitoring/icinga2/icinga_server/share/icinga2
State path: /home/saravana/selfmonitoring/icinga2/icinga_server/var/lib/icinga2/icinga2.state
Modified attributes path: /home/saravana/selfmonitoring/icinga2/icinga_server/var/lib/icinga2/modified-attributes.conf
Objects path: /home/saravana/selfmonitoring/icinga2/icinga_server/var/cache/icinga2/icinga2.debug
Vars path: /home/saravana/selfmonitoring/icinga2/icinga_server/var/cache/icinga2/icinga2.vars
PID path: /home/saravana/selfmonitoring/icinga2/icinga_server/var/run/icinga2/icinga2.pid

System information:
Platform: Red Hat Enterprise Linux Server
Platform version: 6.7 (Santiago)
Kernel: Linux
Kernel version: 2.6.32-573.el6.x86_64
Architecture: x86_64

-sh-4.1$ /home/saravana/selfmonitoring/icinga2/icinga_server/lib64/icinga2/sbin/icinga2 feature list
Disabled features: api
Enabled features: checker command compatlog debuglog gelf graphite icingastatus ido-pgsql livestatus mainlog notification opentsdb perfdata statusdata syslog

I am using the check_oracle_health plugin and NRPE plugins to monitor a remote database/host; there are only 3 hosts in my test environment.

Note: my installation is different; I have edited the output above.
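
As a side note, the same forced reschedule behind the "Check now" button can also be triggered from the shell once the api feature is enabled; a sketch with placeholder credentials and service name (the exact action parameters depend on the Icinga 2 version, see the API documentation):

icinga2 feature enable api    # plus 'icinga2 api setup' on first use
service icinga2 restart

curl -k -s -u apiuser:secret -H 'Accept: application/json' \
  -X POST 'https://localhost:5665/v1/actions/reschedule-check' \
  -d '{ "type": "Service", "filter": "service.name==\"oracle-health\"" }'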

@icinga-migration

Updated by gbeutner on 2016-11-17 08:58:43 +00:00

@saravanakumar: Consider testing this with 2.5.4. Also, you might want to use packages instead.
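
On RHEL 6 that would look roughly like this; treat the repository release URL as an assumption and check packages.icinga.com for the exact package name:

rpm -ivh https://packages.icinga.com/epel/icinga-rpm-release-6-latest.noarch.rpm
yum install icinga2 icinga2-ido-pgsql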

@icinga-migration

Updated by saravanakumar on 2016-11-17 12:41:29 +00:00

gunnarbeutner wrote:

@saravanakumar: Consider testing this with 2.5.4. Also, you might want to use packages instead.

Which package should I use? Is there any possibility to merge that into version 2.4.6?

@icinga-migration icinga-migration added bug Something isn't working area/distributed Distributed monitoring (master, satellites, clients) labels Jan 17, 2017
@dnsmichi

dnsmichi commented Feb 6, 2017

I consider this fixed with v2.6.1. Please upgrade to the latest stable and supported version; 2.4.x is already EOL.

@dnsmichi dnsmichi closed this as completed Feb 6, 2017