This repository has been archived by the owner on Jan 15, 2019. It is now read-only.

[dev.icinga.com #4494] reload while check executing causes skip of remaining check attempts #1320

Closed
icinga-migration opened this issue Jul 31, 2013 · 21 comments

Comments


This issue has been migrated from Redmine: https://dev.icinga.com/issues/4494

Created by mckslim on 2013-07-31 22:10:24 +00:00

Assignee: (none)
Status: Rejected (closed on 2013-10-03 02:12:55 +00:00)
Target Version: (none)
Last Update: 2013-10-03 02:12:55 +00:00 (in Redmine)

Icinga Version: 1.8.4
OS Version: Red Hat Enterprise Linux Server release 5.9 (Tikanga)

This happens on 1.8.4 at least; I don't know about other versions.
Looking at the log snippet here (max_attempts on this host check is 5):

[2013-07-31 00:41:31] HOST ALERT: atl-stg-dmscol-01a;UP;HARD;1;sup
[2013-07-31 00:41:51] HOST ALERT: atl-stg-dmscol-01a;DOWN;SOFT;1;sean check 30
[2013-07-31 00:42:41] HOST ALERT: atl-stg-dmscol-01a;DOWN;SOFT;2;sean check 30
check begin
[2013-07-31 00:43:18] Caught SIGHUP, restarting...
check end
[2013-07-31 00:43:48] HOST ALERT: atl-stg-dmscol-01a;DOWN;HARD;1;sean check 30

Note that after attempt 2, I run the check from the web interface (which starts the check script), then reload Icinga before the check script ends, and the check script finishes after the reload.
Note that the next HOST ALERT shows 'DOWN;HARD;1' when it should be 'DOWN;SOFT;3'; this triggers a notification earlier than we expect one to be generated, which creates extra grind for our ops people.

Can you do something to keep the above from prematurely going DOWN;HARD?


Relations:


Updated by mckslim on 2013-07-31 23:16:05 +00:00

And btw, Nagios 3.5.0 has the same problem.
And, the problem occurs on restart as well as on reload.


Updated by mfriedrich on 2013-08-05 17:14:06 +00:00

can you attach the configuration sample for that specific host/service, as well as the relevant parts from

  • objects.cache
  • status.dat
  • retention.dat (!)

and some debug logs covering the time window of that check, including the reload.


Updated by mckslim on 2013-08-05 20:47:09 +00:00

ok, here's a bunch of info, let me know if you need something else:


icinga.log:
[2013-08-05 18:56:06] PASSIVE HOST CHECK: atl-stg-dmscol-01a;0;sup
[2013-08-05 18:56:06] HOST ALERT: atl-stg-dmscol-01a;UP;HARD;1;sup
[2013-08-05 18:57:16] HOST ALERT: atl-stg-dmscol-01a;DOWN;SOFT;1;sean check 30
[2013-08-05 18:57:56] HOST ALERT: atl-stg-dmscol-01a;DOWN;SOFT;2;sean check 30
[2013-08-05 18:59:50] Caught SIGHUP, restarting...
[2013-08-05 19:00:10] HOST ALERT: atl-stg-dmscol-01a;DOWN;HARD;1;sean check 30


host defn:
define host{
use CPE-ACTIVE,PING-HOST
hostgroups DMG-Ping
host_name atl-stg-dmscol-01a
alias atl-stg-dmscol-01a
#address 207.231.140.56
address 1.1.1.1
notification_options n
check_command check_sean
checks_enabled 1
max_check_attempts 5
}


from 'objects.cache':
define host {
host_name atl-stg-dmscol-01a
alias atl-stg-dmscol-01a
address 1.1.1.1
check_command check_sean
contact_groups soc
notification_period 24x7
initial_state o
check_interval 5.000000
retry_interval 4.000000
max_check_attempts 5
active_checks_enabled 1
passive_checks_enabled 1
obsess_over_host 1
event_handler_enabled 1
low_flap_threshold 0.000000
high_flap_threshold 0.000000
flap_detection_enabled 1
flap_detection_options o,d,u
freshness_threshold 0
check_freshness 0
notification_options n
notifications_enabled 1
notification_interval 120.000000
first_notification_delay 0.000000
stalking_options n
process_perf_data 1
failure_prediction_enabled 1
notes_url /tools/create_super_report.php
action_url /tools/handle_localservice.php?device=$HOSTNAME$
retain_status_information 1
retain_nonstatus_information 1
}


'status.dat', after the down;hard;1 status of the check after reload:
hoststatus {
host_name=atl-stg-dmscol-01a
modified_attributes=2
check_command=check_sean
check_period=
notification_period=24x7
check_interval=5.000000
retry_interval=4.000000
event_handler=
has_been_checked=1
should_be_scheduled=0
check_execution_time=30.059
check_latency=5.701
check_type=0
current_state=1
last_hard_state=0
last_event_id=70
current_event_id=71
current_problem_id=18
last_problem_id=17
plugin_output=sean check 30
long_plugin_output=
performance_data=
last_check=1375729177
next_check=1375729510
check_options=0
current_attempt=5
max_attempts=5
state_type=1
last_state_change=1375729210
last_hard_state_change=1375729210
last_time_up=1375728966
last_time_down=1375729210
last_time_unreachable=0
last_notification=0
next_notification=0
no_more_notifications=0
current_notification_number=0
current_down_notification_number=0
current_unreachable_notification_number=0
current_notification_id=0
notifications_enabled=1
problem_has_been_acknowledged=0
acknowledgement_type=0
acknowledgement_end_time=0
active_checks_enabled=0
passive_checks_enabled=1
event_handler_enabled=1
flap_detection_enabled=1
failure_prediction_enabled=1
process_performance_data=1
obsess_over_host=1
last_update=1375735428
is_flapping=0
percent_state_change=28.42
scheduled_downtime_depth=0
}


'retention.dat', right after doing the reload in the middle of attempt 3:
host {
host_name=atl-stg-dmscol-01a
modified_attributes=2
check_command=check_sean
check_period=
notification_period=24x7
event_handler=
has_been_checked=1
check_execution_time=30.056
check_latency=8.823
check_type=0
current_state=1
last_state=1
last_hard_state=0
last_event_id=69
current_event_id=70
current_problem_id=18
last_problem_id=17
plugin_output=sean check 30
long_plugin_output=
performance_data=
last_check=1375729039
next_check=1375729172
check_options=0
current_attempt=3
max_attempts=5
normal_check_interval=5.000000
retry_check_interval=5.000000
state_type=0
last_state_change=1375729076
last_hard_state_change=1375728966
last_time_up=1375728966
last_time_down=1375729076
last_time_unreachable=0
notified_on_down=0
notified_on_unreachable=0
last_notification=0
current_notification_number=0
current_down_notification_number=0
current_unreachable_notification_number=0
current_notification_id=0
notifications_enabled=1
problem_has_been_acknowledged=0
acknowledgement_type=0
acknowledgement_end_time=0
active_checks_enabled=0
passive_checks_enabled=1
event_handler_enabled=1
flap_detection_enabled=1
failure_prediction_enabled=1
process_performance_data=1
obsess_over_host=1
is_flapping=0
percent_state_change=29.21
check_flapping_recovery_notification=0
state_history=1,1,1,0,1,1,0,1,1,1,1,1,1,1,1,1,1,1,0,1,1
}

'retention.dat', after the check completed after the interim reload:
host {
host_name=atl-stg-dmscol-01a
modified_attributes=2
check_command=check_sean
check_period=
notification_period=24x7
event_handler=
has_been_checked=1
check_execution_time=30.056
check_latency=8.823
check_type=0
current_state=1
last_state=1
last_hard_state=0
last_event_id=69
current_event_id=70
current_problem_id=18
last_problem_id=17
plugin_output=sean check 30
long_plugin_output=
performance_data=
last_check=1375729039
next_check=1375729172
check_options=0
current_attempt=3
max_attempts=5
normal_check_interval=5.000000
retry_check_interval=5.000000
state_type=0
last_state_change=1375729076
last_hard_state_change=1375728966
last_time_up=1375728966
last_time_down=1375729076
last_time_unreachable=0
notified_on_down=0
notified_on_unreachable=0
last_notification=0
current_notification_number=0
current_down_notification_number=0
current_unreachable_notification_number=0
current_notification_id=0
notifications_enabled=1
problem_has_been_acknowledged=0
acknowledgement_type=0
acknowledgement_end_time=0
active_checks_enabled=0
passive_checks_enabled=1
event_handler_enabled=1
flap_detection_enabled=1
failure_prediction_enabled=1
process_performance_data=1
obsess_over_host=1
is_flapping=0
percent_state_change=29.21
check_flapping_recovery_notification=0
state_history=1,1,1,0,1,1,0,1,1,1,1,1,1,1,1,1,1,1,0,1,1
}


'icinga.debug':
[2013-08-05 18:57:56.072140] [016.1] [pid=27204] Handling check result for host 'atl-stg-dmscol-01a'...
[2013-08-05 18:57:56.072150] [016.1] [pid=27204] **** Handling async check result for host 'atl-stg-dmscol-01a'...
[2013-08-05 18:57:56.072238] [016.1] [pid=27204] HOST: atl-stg-dmscol-01a, ATTEMPT=2/5, CHECK TYPE=ACTIVE, STATE TYPE=SOFT, OLD STATE=1, NEW STATE=1
[2013-08-05 18:57:56.072249] [016.1] [pid=27204] Host was DOWN/UNREACHABLE.
[2013-08-05 18:57:56.072254] [016.1] [pid=27204] Host is still DOWN/UNREACHABLE.
[2013-08-05 18:57:56.072260] [016.1] [pid=27204] Pre-handle_host_state() Host: atl-stg-dmscol-01a, Attempt=2/5, Type=SOFT, Final State=1
[2013-08-05 18:57:56.072630] [016.1] [pid=27204] Post-handle_host_state() Host: atl-stg-dmscol-01a, Attempt=2/5, Type=SOFT, Final State=1
[2013-08-05 18:57:56.072641] [016.1] [pid=27204] Checking host 'atl-stg-dmscol-01a' for flapping...
[2013-08-05 18:57:56.072662] [016.1] [pid=27204] Rescheduling next check of host at Mon Aug 5 19:01:56 2013
[2013-08-05 18:57:56.072739] [016.1] [pid=27204] **** Async check result for host 'atl-stg-dmscol-01a' handled: new state=1
...
[2013-08-05 18:59:37.449352] [016.0] [pid=27204] Scheduling a forced, active check of host 'atl-stg-dmscol-01a' @ Mon Aug 5 18:59:32 2013
[2013-08-05 18:59:37.701073] [024.1] [pid=27204] Run a few checks before executing a host check for 'atl-stg-dmscol-01a'.
[2013-08-05 18:59:37.701117] [016.0] [pid=27204] Attempting to run scheduled check of host 'atl-stg-dmscol-01a': check options=1, latency=5.701000
[2013-08-05 18:59:37.701130] [016.0] [pid=27204] **** Running async check of host 'atl-stg-dmscol-01a'...
[2013-08-05 18:59:37.701167] [016.0] [pid=27204] Checking host 'atl-stg-dmscol-01a'...
...
reload [2013-08-05 18:59:50.530934] [004.0] [pid=27204] Original UID/GID: 611/611
...
[2013-08-05 19:00:10.018035] [016.1] [pid=27204] Handling check result for host 'atl-stg-dmscol-01a'...
[2013-08-05 19:00:10.018051] [016.1] [pid=27204] **** Handling async check result for host 'atl-stg-dmscol-01a'...
[2013-08-05 19:00:10.018110] [016.1] [pid=27204] HOST: atl-stg-dmscol-01a, ATTEMPT=1/5, CHECK TYPE=ACTIVE, STATE TYPE=SOFT, OLD STATE=1, NEW STATE=1
[2013-08-05 19:00:10.018121] [016.1] [pid=27204] Host was DOWN/UNREACHABLE.
[2013-08-05 19:00:10.018126] [016.1] [pid=27204] Host is still DOWN/UNREACHABLE.
[2013-08-05 19:00:10.018132] [016.1] [pid=27204] Pre-handle_host_state() Host: atl-stg-dmscol-01a, Attempt=1/5, Type=HARD, Final State=1
[2013-08-05 19:00:10.018472] [016.1] [pid=27204] Post-handle_host_state() Host: atl-stg-dmscol-01a, Attempt=1/5, Type=HARD, Final State=1
[2013-08-05 19:00:10.018483] [016.1] [pid=27204] Checking host 'atl-stg-dmscol-01a' for flapping...
[2013-08-05 19:00:10.018492] [016.1] [pid=27204] Rescheduling next check of host at Mon Aug 5 19:05:10 2013
[2013-08-05 19:00:10.018608] [016.1] [pid=27204] **** Async check result for host 'atl-stg-dmscol-01a' handled: new state=1
...


Updated by mckslim on 2013-08-15 19:27:13 +00:00

hello, when might you get to fixing this?


Updated by mfriedrich on 2013-08-15 21:09:05 +00:00

there are various possibilities

  • you create a possible fix and attach it (preferred way)
  • a core dev looks into the problem, is able to reproduce it, and tries to fix it
  • anyone else shares their ideas/experiences

i myself don't have any spare time left for working on core issues in the near future.


Updated by mfriedrich on 2013-08-15 21:09:53 +00:00

furthermore, you're still running 1.8.4 - you should try to reproduce it with 1.9.3 too.


Updated by mckslim on 2013-08-21 17:26:46 +00:00

dnsmichi wrote:

furthermore, you're still running 1.8.4 - you should try to reproduce it with 1.9.3 too.

Problem occurs on 1.9.3:
$ bin/icinga
Icinga 1.9.3
Copyright © 2009-2013 Icinga Development Team (http://www.icinga.org)

$ cat /opt/icinga/mgd/var/icinga.log | egrep -v 'STATE|EXT' | egrep 'HOST|SIG' | egrep 'stg-dms|SIG' | format_log_ts.pl | tail
[2013-08-21 16:29:29] HOST ALERT: atl-stg-dmscol-01a;DOWN;SOFT;3;(Host Check Timed Out)
[2013-08-21 16:32:20] Caught SIGTERM, shutting down...
[2013-08-21 17:10:41] HOST ALERT: atl-stg-dmscol-01a;DOWN;SOFT;2;(Host Check Timed Out)
[2013-08-21 17:12:12] Caught SIGHUP, restarting...
[2013-08-21 17:13:02] HOST ALERT: atl-stg-dmscol-01a;DOWN;SOFT;2;sean check 30
[2013-08-21 17:14:30] Caught SIGHUP, restarting...
[2013-08-21 17:15:30] HOST ALERT: atl-stg-dmscol-01a;DOWN;SOFT;2;sean check 30
[2013-08-21 17:16:40] HOST ALERT: atl-stg-dmscol-01a;DOWN;SOFT;3;sean check 30
check begin
[2013-08-21 17:19:09] Caught SIGHUP, restarting...
check end
[2013-08-21 17:19:39] HOST ALERT: atl-stg-dmscol-01a;DOWN;HARD;1;sean check 30


Updated by mckslim on 2013-08-22 19:17:42 +00:00

Hello again, just wondering quickly: how long before someone on your Icinga team might get to working on this problem? thank you


Updated by mfriedrich on 2013-08-22 19:51:23 +00:00

we're individuals with private lives, and everyone splits his/her time across different tasks. i myself won't look into it soon, unless you convince me that grabbing some beer on the weekend should be exchanged for icinga coding.

for the rest - i don't know. we usually don't give dates or timelines for when issues will be worked on. it's hard enough to follow the roadmap and its linked issues given the current low resources.

maybe you could ask a company providing professional support for icinga to fix the issue faster, providing a patch which we then push upstream and test.


Updated by mckslim on 2013-08-23 15:01:35 +00:00

Understood, and thanks to you and the others for working on these projects!
Can you give me a little pointer as to where in the code the problem might lie?


Updated by mfriedrich on 2013-08-23 15:15:23 +00:00

my guess would be that the sighandler for SIGHUP invokes the config re-read and somewhere along the way the counter (the current_attempt struct attribute) gets overridden. it might happen when the object is re-created in common/objects.c, or through some weird behaviour in base/checks.c during the actual check, resetting it to 1 / hard due to the former state and the checkresult handling of your long-running check.


Updated by mfriedrich on 2013-09-27 20:48:24 +00:00

hm, the base/checks.c checkresult handlers only mark passive host checks with current_attempt=1 ... but most likely adjust_host_check_attempt_3x() changes current_attempt to 1 before actually rescheduling the next check - and then the checkresult from the long-running check is parsed afterwards. and it does not matter whether this is a normal async host check or a triggered on-demand check - both adjust the current attempt counter by calling adjust_host_check_attempt_3x().

some more verbose debug logs would likely help shed some light on the problem (though imho a check lasting 3min+ should be made a cronjob passing a passive checkresult back to the core).


Updated by mckslim on 2013-09-30 22:12:04 +00:00

I reran with the following settings (let me know if you need a higher debug_level):
debug_level=20
debug_verbosity=2

Here's the icinga.log with my added check info:

21:39:32 - check begin (attempt 2 of 4)
[2013-09-30 21:40:07] Caught SIGHUP, restarting...
21:42:17 - check finish
[2013-09-30 21:42:17] HOST ALERT: P00000008028456;DOWN;HARD;1;CRITICAL: Connection timed out to '1.1.1.1' after 160 seconds (user 'ibm_hlth_chk'). Expected prompt not found. Last output was ''.  Attempt 2 of 4.

Here's the debug, winnowed down to relevant lines:

...
[2013-09-30 21:39:32.010804] [016.0] [pid=31452] Scheduling a forced, active check of host 'P00000008028456' @ Mon Sep 30 21:39:27 2013
[2013-09-30 21:39:32.010827] [016.2] [pid=31452] Scheduling new host check event for 'P00000008028456' @ Mon Sep 30 21:39:27 2013
[2013-09-30 21:39:32.262369] [024.1] [pid=31452] Run a few checks before executing a host check for 'P00000008028456'.
[2013-09-30 21:39:32.262413] [024.1] [pid=31452] We're not executing host checks right now, so we'll skip this event.
[2013-09-30 21:39:32.262430] [016.0] [pid=31452] Attempting to run scheduled check of host 'P00000008028456': check options=1, latency=5.262000
[2013-09-30 21:39:32.262446] [016.0] [pid=31452] ** Running async check of host 'P00000008028456'...
[2013-09-30 21:39:32.262465] [016.2] [pid=31452] Adjusting check attempt number for host 'P00000008028456': current attempt=1/4, state=1, state type=0
[2013-09-30 21:39:32.262475] [016.2] [pid=31452] New check attempt number = 2
[2013-09-30 21:39:32.262509] [2320.2] [pid=31452] Raw Command Input: $USER1$/checkSSH.pl --device_id=$_HOSTDEVICEID$ --hostname=$HOSTNAME$ --ip=$HOSTADDRESS$ --username='ibm_hlth_chk' --connect-timeout=160 --use-virtualip=0 --crypt-passphrase='$USER2$' --max-host-attempts=$MAXHOSTATTEMPTS$ --host-attempt=$HOSTATTEMPT$
...
[2013-09-30 21:40:06.169321] [004.2] [pid=31452] Writing status data to temp file '/opt/icinga/mgd/var/icinga.tmpXXXXXX'
[2013-09-30 21:40:07.020861] [004.2] [pid=31452] Writing retention data to temp file '/opt/icinga/mgd/var/icinga.tmp3KXUfG'
[2013-09-30 21:40:07.131151] [004.0] [pid=31452] Original UID/GID: 611/611
[2013-09-30 21:40:07.166896] [016.1] [pid=31452] Checking host 'P00000008027955' for flapping...
[2013-09-30 21:40:07.166921] [016.2] [pid=31452] LFT (low_threshold): 5.00, HFT (high_threshold): 20.00, CPC (curved_percent_change)
: 0.00, PSC (curved_percent_change): 0.00%
...
[2013-09-30 21:40:07.198858] [016.2] [pid=31452] Host 'border_routers' next_check_event populated
[2013-09-30 21:40:07.198915] [004.2] [pid=31452] Writing status data to temp file '/opt/icinga/mgd/var/icinga.tmpXXXXXX'
[2013-09-30 21:40:17.116211] [016.0] [pid=31452] Starting to reap check results.
...
[2013-09-30 21:42:10.223949] [024.1] [pid=31452] Skip event, removing host 'STG000008020418' from list.
[2013-09-30 21:42:17.025552] [016.0] [pid=31452] Starting to reap check results.
[2013-09-30 21:42:17.026044] [016.1] [pid=31452] Starting to read check result queue '/opt/icinga/mgd/var/spool/checkresults'...
[2013-09-30 21:42:17.026271] [016.2] [pid=31452] Found a check result (#1) to handle...
[2013-09-30 21:42:17.026297] [016.1] [pid=31452] Handling check result for host 'P00000008028456'...
[2013-09-30 21:42:17.026307] [016.1] [pid=31452] ** Handling async check result for host 'P00000008028456'...
[2013-09-30 21:42:17.026316] [016.2] [pid=31452]        Check Type:         Active
[2013-09-30 21:42:17.026325] [016.2] [pid=31452]        Check Options:      0
[2013-09-30 21:42:17.026333] [016.2] [pid=31452]        Scheduled Check?:   Yes
[2013-09-30 21:42:17.026341] [016.2] [pid=31452]        Reschedule Check?:  Yes
[2013-09-30 21:42:17.026349] [016.2] [pid=31452]        Exited OK?:         Yes
[2013-09-30 21:42:17.026357] [016.2] [pid=31452]        Exec Time:          161.674
[2013-09-30 21:42:17.026374] [016.2] [pid=31452]        Latency:            28.797
[2013-09-30 21:42:17.026416] [016.2] [pid=31452]        Return Status:      2
[2013-09-30 21:42:17.026425] [016.2] [pid=31452]        Output:             CRITICAL: Connection timed out to '1.1.1.1' after 160 seconds (user 'ibm_hlth_chk'). Expected prompt not found. Last output was ''.  Attempt 2 of 4.   \n\n[sudo: no tty present and no askpass program specified\nUse of uninitialized value in string eq at /opt/icinga/include/UtilsGeneralObj.pm line 1259.] 
[2013-09-30 21:42:17.026456] [016.2] [pid=31452] Parsing check output...
[2013-09-30 21:42:17.026465] [016.2] [pid=31452] Short Output: CRITICAL: Connection timed out to '1.1.1.1' after 160 seconds (user 'ibm_hlth_chk'). Expected prompt not found. Last output was ''.  Attempt 2 of 4.
[2013-09-30 21:42:17.026474] [016.2] [pid=31452] Long Output:  \n[sudo: no tty present and no askpass program specified\nUse of uninitialized value in string eq at /opt/icinga/include/UtilsGeneralObj.pm line 1259.] 
[2013-09-30 21:42:17.026483] [016.2] [pid=31452] Perf Data:    NULL
[2013-09-30 21:42:17.026492] [016.1] [pid=31452] HOST: P00000008028456, ATTEMPT=1/4, CHECK TYPE=ACTIVE, STATE TYPE=SOFT, OLD STATE=1, NEW STATE=1
[2013-09-30 21:42:17.026501] [016.1] [pid=31452] Host was DOWN/UNREACHABLE.
[2013-09-30 21:42:17.026510] [016.1] [pid=31452] Host is still DOWN/UNREACHABLE.
[2013-09-30 21:42:17.026519] [016.2] [pid=31452] Determining state of host 'P00000008028456': current state=1
[2013-09-30 21:42:17.026528] [016.2] [pid=31452] At least one parent (border_routers) is up, so host is DOWN.
[2013-09-30 21:42:17.026537] [016.1] [pid=31452] Pre-handle_host_state() Host: P00000008028456, Attempt=1/4, Type=HARD, Final State=1
[2013-09-30 21:42:17.026858] [2320.2] [pid=31452] Raw Command Input: /opt/icinga/tools/queue_ticket.pl --notification_time="$SHORTDATETIME$" --type="CPE" --device="$_HOSTDEVICEID$" --ip="$HOSTADDRESS$" --host="$_HOSTMACHINEHOSTNAME$" --customer="$_HOSTCUSTOMERID$" --site="$_HOSTSITEID$" --state="$HOSTSTATE$" --alert="DOWN" --output="$HOSTOUTPUT$" >> /opt/icinga/var/ticket_debug.log
[2013-09-30 21:42:17.026887] [2320.2] [pid=31452] Expanded Command Output: /opt/icinga/tools/queue_ticket.pl --notification_time="$SHORTDATETIME$" --type="CPE" --device="$_HOSTDEVICEID$" --ip="$HOSTADDRESS$" --host="$_HOSTMACHINEHOSTNAME$" --customer="$_HOSTCUSTOMERID$" --site="$_HOSTSITEID$" --state="$HOSTSTATE$" --alert="DOWN" --output="$HOSTOUTPUT$" >> /opt/icinga/var/ticket_debug.log
[2013-09-30 21:42:17.157990] [016.1] [pid=31452] Post-handle_host_state() Host: P00000008028456, Attempt=1/4, Type=HARD, Final State=1
[2013-09-30 21:42:17.158153] [016.1] [pid=31452] Checking host 'P00000008028456' for flapping...
[2013-09-30 21:42:17.158170] [016.2] [pid=31452] LFT (low_threshold): 5.00, HFT (high_threshold): 20.00, CPC (curved_percent_change): 17.30, PSC (curved_percent_change): 17.30%
[2013-09-30 21:42:17.158188] [016.1] [pid=31452] Rescheduling next check of host at Mon Sep 30 21:47:17 2013
[2013-09-30 21:42:17.158258] [016.1] [pid=31452] ** Async check result for host 'P00000008028456' handled: new state=1
[2013-09-30 21:42:17.158398] [016.1] [pid=31452] Deleted check result file '(null)'
[2013-09-30 21:42:17.158503] [016.0] [pid=31452] Finished reaping 1 check results
[2013-09-30 21:42:17.158579] [004.2] [pid=31452] Writing status data to temp file '/opt/icinga/mgd/var/icinga.tmpXXXXXX'

thank you!


Updated by mfriedrich on 2013-09-30 22:39:00 +00:00

so the adjustment does indeed happen before the actual restart.

[2013-09-30 21:39:32.262465] [016.2] [pid=31452] Adjusting check attempt number for host 'P00000008028456': current attempt=1/4, state=1, state type=0

given the above log entry, the reset already happened before.

[2013-09-30 21:39:32.262475] [016.2] [pid=31452] New check attempt number = 2

that one means it just incremented the check attempt, as it would on any other check.

what's puzzling me is

We're not executing host checks right now.

that would indicate that this host does not have active checks enabled and only receives passive checks - and that even before a reload happens. so this could've happened anywhere between the initial startup and the reload, i.e. via modified_attributes.

i quote

'status.dat', after the down;hard;1 status of the check after reload:
hoststatus {
host_name=atl-stg-dmscol-01a
modified_attributes=2
...
active_checks_enabled=0

$ grep '#define MODATTR' include/common.h
#define MODATTR_NONE                            0
#define MODATTR_NOTIFICATIONS_ENABLED           1
#define MODATTR_ACTIVE_CHECKS_ENABLED           2
#define MODATTR_PASSIVE_CHECKS_ENABLED          4
#define MODATTR_EVENT_HANDLER_ENABLED           8
#define MODATTR_FLAP_DETECTION_ENABLED          16
#define MODATTR_FAILURE_PREDICTION_ENABLED      32
#define MODATTR_PERFORMANCE_DATA_ENABLED        64
#define MODATTR_OBSESSIVE_HANDLER_ENABLED       128
#define MODATTR_EVENT_HANDLER_COMMAND           256
#define MODATTR_CHECK_COMMAND                   512
#define MODATTR_NORMAL_CHECK_INTERVAL           1024
#define MODATTR_RETRY_CHECK_INTERVAL            2048
#define MODATTR_MAX_CHECK_ATTEMPTS              4096
#define MODATTR_FRESHNESS_CHECKS_ENABLED        8192
#define MODATTR_CHECK_TIMEPERIOD                16384
#define MODATTR_CUSTOM_VARIABLE                 32768
#define MODATTR_NOTIFICATION_TIMEPERIOD         65536

conclusion: someone disabled active host checks at runtime. and something then triggers the forced active check - either manually, or via a freshness check (which appears to be disabled, judging from objects.cache)?

obviously passive checks are treated as HARD by default, and that explains the immediate SOFT->HARD transition and, further, the reset of the check attempt - the host is now passively checked only, so the max_check_attempts setting no longer applies.

i'm not sure what you're trying to do here with the passive checks (is that intended at runtime? did someone fiddle something wrong?). passive_host_checks_are_soft=1 would change the default passive state to SOFT, but i doubt it will solve the original issue.

so, why are these modified attributes set disabling active host checks at runtime?


Updated by mckslim on 2013-09-30 23:07:20 +00:00

I purposely disabled active checks before running these tests, just to control what was shown in the log - disregard the fact that active checks are disabled. I'm running the checks manually as needed.


Updated by mfriedrich on 2013-09-30 23:48:23 +00:00

  • Status changed from New to Feedback

i cannot disregard that - having active host checks disabled at runtime is what causes the observed behaviour, with the attempt counter reset on what is now a passive check.

i cannot see a bug here anymore when you change the host check's behaviour on purpose.


Updated by mckslim on 2013-10-01 01:28:53 +00:00

I really don't think it becomes a passive check. I am running the check manually using the web UI's 'Re-schedule the next check of this host' command, which is not a passive check - just an active check run on my own schedule.
Here's the log for this forced check cmd:
[2013-09-30 22:40:58] EXTERNAL COMMAND: SCHEDULE_FORCED_HOST_CHECK;P0dfadf;2013-09-30 22:40:58

Looking at 'base/checks.c', this appears to be the important code area:

        /* set the state type */
        /* we've maxed out on the retries */
        if (hst->current_attempt == hst->max_attempts)
          hst->state_type = HARD_STATE;
        /* the host was in a hard problem state before, so it still is now */
        else if (hst->current_attempt == 1) 
          hst->state_type = HARD_STATE;
        /* the host is in a soft state and the check will be retried */
        else
          hst->state_type = SOFT_STATE;

would changing this line:

        else if (hst->current_attempt == 1)

to this:

        else if (hst->current_attempt == 1 && hst->state_type != SOFT_STATE)

fix the issue and work ok in general?


Updated by mfriedrich on 2013-10-01 07:58:46 +00:00

no, no and no. stop here.

please answer the following questions

  1. you are disabling the host check via gui / external command before running any tests?
  2. you are aware of the fact that "disabling an active host check" will lead to a host with passive checks enabled only?

if 1) and 2) can be answered with yes, read on


a) look into status.dat before doing anything. you will recognize modified_attributes=0 for that host.
b) disable host checks via external command / gui
c) look again into status.dat. you'll now see modified_attributes=2

that in turn means that your host is now a passively checked host (which may still be force-checked, i.e. manually).

d) once you force the check, it will be handled as passive.
e) passive checks are always HARD, which means the host automatically changes to HARD
f) this passive check is "new", so the current check attempt is reset to 1

that means your active host check was converted into a passive host check, behaving totally differently from before.

the reload/restart has nothing to do with your problem


now, you've got 2 options


  1. set passive_host_checks_are_soft=1 so the passive host check gets only a SOFT state and honors max_check_attempts
    OR
  2. do not disable host checks before doing anything. just force-reschedule the host check and then send the SIGHUP.

make sure to clean retention.dat of modified_attributes=2 for that host. set it to modified_attributes=0

# service icinga stop
# vim retention.dat
:%s/modified_attributes=2/modified_attributes=0/
# service icinga start
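a non-interactive equivalent of the vim substitution above, demonstrated here on a scratch copy (assumes GNU sed's -i flag; the /tmp path is illustrative). note that, like the :%s command, this resets modified_attributes for every object in the file, not just the one host:

```shell
# build a scratch retention file to demonstrate the substitution
printf 'host {\nhost_name=atl-stg-dmscol-01a\nmodified_attributes=2\n}\n' > /tmp/retention.demo
# reset the modified-attributes bitmask (GNU sed in-place edit)
sed -i 's/^modified_attributes=2$/modified_attributes=0/' /tmp/retention.demo
grep 'modified_attributes' /tmp/retention.demo   # → modified_attributes=0
```

on a real system you'd stop icinga first, run the sed on the actual retention.dat, and start icinga again, as in the service commands above.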

conclusion

by changing the host check type from active to passive you are changing the core's behaviour in handling checkresults and state (counters). that is not a bug in the core, but a misconfiguration on your side.


Updated by mckslim on 2013-10-01 23:34:37 +00:00

Wow, that is interesting, thanks for the explanation.
I tried things with host checks disabled and got the down;hard;1.
Then I enabled host checks and could not get the 'down;hard;1'.

So, I guess it's working as designed. I'll probably try 1.9.3 now to verify it works the same there.
thank you!


Updated by mfriedrich on 2013-10-03 02:12:55 +00:00

  • Status changed from Feedback to Rejected

it does. and git next with upcoming 1.10.0 does too. if there's anything else, i'll reopen the issue then.


Updated by mfriedrich on 2014-10-24 22:25:55 +00:00

  • Relates set to 7277
