[dev.icinga.com #1782] Icinga forgets to schedule non-24x7 checks #712
Comments
Updated by mfriedrich on 2011-08-04 10:41:49 +00:00
|
Updated by mcp on 2011-08-04 10:59:16 +00:00
config example attached. icingastats output attached. I don't have graphs for icingastats, some others attached. |
Updated by mcp on 2011-08-04 11:01:18 +00:00 this is a quadcore machine btw. |
Updated by mcp on 2011-08-05 11:30:51 +00:00 yesterday I tried using use_retained_scheduling_info=0 and today's morning |
Updated by mfriedrich on 2011-08-05 12:10:30 +00:00 well monitoring the overall core performance on the icingastats output over time would give a better idea than just doing wild guesses. did it work with previous versions than 1.4.x ? |
Updated by mcp on 2011-08-05 12:21:41 +00:00 well, the problem isn't a performance problem at all. I wasn't aware of the problem with icinga 1.3.x but I think and please notice, its just for service !24x7. the other ones |
Updated by mfriedrich on 2011-08-05 13:45:22 +00:00 so to speak a timeperiods problem. performance would have been a guess in regards of "forgetting things over time", meaning no space for calculations. which are the system defaults - UTC, something else? did this happen now in DST? and just a wild guess from the example
turn on debugging and check if the event for rescheduling your service. it might also be interesting, which state the current service is in ... just for the record - i've just looked into base/events.c where the logic happens.
so given the todo i would expect that this is exactly the case for you. git blame tells that
so let's have a look into ... |
Updated by mcp on 2011-08-05 13:48:50 +00:00 the services are almost in state OK. |
Updated by mcp on 2011-08-05 14:13:33 +00:00 default system timezone is Europe/Berlin (Fri Aug 5 16:13:11 CEST 2011) |
Updated by mfriedrich on 2011-08-05 15:27:04 +00:00 ok, to trigger the reschedule (if this really happens), the run_event must be set to false. this is the case if
it would be generally interesting from your debuglog if you can catchup with these messages (set the debug level to match both "checks" and "events")
|
Updated by mfriedrich on 2011-08-05 15:42:08 +00:00 after having analyzed this, more event output could be the lead to the topic (in init_timing_loop which is called before starting the event_loop for monitoring things) either on of those 2
|
Updated by mcp on 2011-08-05 16:52:37 +00:00 Hi dnsmichi, ok, I set debug_level to 24, and debug_verbosity to 1. is 1 enough for the verbosity or should I set it to 2? thanks! |
Updated by mfriedrich on 2011-08-05 16:54:58 +00:00 be the most verbose, so 2 ... |
Updated by mcp on 2011-08-06 08:26:13 +00:00 Preferred Time is Invalid In Timeperiod '07-22 Uhr - x7': 1312603714 --> Sat Aug 6 06:08:34 2011 but I think that is because of it should start at 7 am. at startup I see some with "Service check should not be scheduled" but that's OK because at 06.08.2011 06:02:02: [1312603322.977948] [008.2] [pid=31568] Service 'Updates' on host 'loghost' and if I understand this correctly, it is totally wrong, because that and for the Update Service on host loghost: Last Check Time: 2011-08-05 18:12:06 for the other missing 450 for today yet it's the same. |
Updated by mfriedrich on 2011-08-06 10:56:11 +00:00 mcp wrote:
hmm that sounds like a DST problem again.
so the core calculates a check time based on the assumption that DST is not active, which then results in wrong check times which are not executed because their check_period is different. that actually requires debugging ... |
Updated by mfriedrich on 2011-08-06 12:06:18 +00:00 ok, ... the next check time gets calculated from various factors, first, the CIB (current interleave block) plus IBI (interleave block index) multiplied by TIB (total interleave blocks)
creating the mult factor, which is being used to calculate the next_check
which then results in your output. the below code block checks, if the preferred check time is a valid time within the timeperiods.
given your example, it fails and spits out the error msg in debug log.
note: those two entries do not match! but the calculation is wrong either way.
actually it doesn't because the next_check time stays the same afterwards.
so conclusion for this analysis - the calculated next_check time does not fit within the timeperiod throws an error, but the retry of getting a valid check time does something wrong, and probably ignores the timeperiod setting (and/or has DST problems). otherwise the Actual Check Time would be corrected. |
Updated by mfriedrich on 2011-08-06 12:29:40 +00:00 and now for the worse part, reading on nagios bug tracker and lists... http://tracker.nagios.org/view.php?id=31 and seems like we got a winner on an old and long lasting bug :( edit: |
Updated by mfriedrich on 2011-08-06 12:45:07 +00:00 one last question if we can build a proposed patch on that ... which exact version of icinga is installed, 1.4.2 release tarball or something different? testing will require a recompiled source install. |
Updated by mcp on 2011-08-08 06:28:59 +00:00 1.4.2 release tarball. |
Updated by mfriedrich on 2011-08-08 06:50:37 +00:00 ok. as i do think that this hurts your production, and i currently can't reproduce the overall problem, i'd propose a workaround for that problem
regarding the problem, i'd love to see the full debuglog (with level=-1 and verbosity=2) from -1 to +1 in the time window). if it contains not-accurate-for-public info, targz it and send it to my mailbox michael.friedrich (at) univie.ac.at |
Updated by mfriedrich on 2011-08-08 09:14:25 +00:00 i've created a test config, and i can't reproduce it. checks are rescheduled like i want them to be due to the timperiods being set.
|
Updated by mfriedrich on 2011-08-08 13:38:32 +00:00 this thread is very interesting, also the proposed patch. http://thread.gmane.org/gmane.network.nagios.devel/5530/focus=5554 currently, if the next_check time is not a valid time and not in the timeperiod, the check becomes reschedule for the next week (does this happen in here?). by changing that to happen at the next check cycle, the core could retry that once more, when hitting the changed timeperiod / changed time window. |
Updated by mcp on 2011-08-08 14:33:42 +00:00 next_check is 0 (N/A) |
Updated by mcp on 2011-08-15 14:06:32 +00:00 I've narrowed the problem down. If you restart Icinga during the time when the check period should NOT check it won't get rescheduled when it should i.e. when the timeperiod begins again. |
Updated by mcp on 2011-09-02 07:00:56 +00:00 any news on this? |
Updated by mfriedrich on 2011-09-04 16:00:05 +00:00 still, even if narrowed down, i'd love to see the debug file to see exactly what happens. |
Updated by mfriedrich on 2011-09-16 12:35:29 +00:00 possible this valuable patch from andreas could resolve the rescheduling issue with check periods on (changed) timeperiods.
|
Updated by mcp on 2011-09-16 12:49:16 +00:00 ok, will test. some very little modifications, mostly spaces, and it'll apply Will test this evening. More to come ... |
Updated by mcp on 2011-09-19 08:07:23 +00:00 the patch does not fix this problem. |
Updated by mcp on 2011-09-20 15:42:29 +00:00 http://www.nagios-portal.org/wbb/index.php?page=Thread&postID=85322 |
Updated by mfriedrich on 2011-10-22 15:29:32 +00:00 host check viability when running an asynchronous host check, is_valid_time gets set by that, determining if the check should be run and if time_is_valid is true or false.
time_is_valid is used in here /* run a scheduled host check asynchronously */ /* attempt to run the check */ /* an error occurred, so reschedule the check */ /* the host could not be rescheduled properly - set the next check time for next week */ time_is_valid is set to FALSE if the check could not be executed. so this is not the actual check but a timeperiod check before. preferred_time is then calculated too, being matched against as next valid time
the question is, why is next_valid_time == preferred_time and not having a pre-checked valid time from the check attempt to set a reschedule time 1 week in the future. for the services, this is a bit different.
so over here, time_is_valid is FALSE too if the check could not happen, but the next_valid_time calculated from previous get_next_valid_time does not fit in check_time_against_period so it still gets rescheduled. i would suspect a bug in that function, because is_valid_time is always FALSE (so TRUE on the condition, and unless the timeperiod does not match the next_valid_time, the condition is always matched and therefore checks are rescheduled next week, but as a matter of fact, this only occurs if using non 24x7 timeperiods. deeper analysis on both needed
|
Updated by mfriedrich on 2011-10-22 15:31:35 +00:00
|
Updated by mfriedrich on 2011-10-22 15:48:59 +00:00
compared to
which could be true too - the timeperiod checks are not valid and therefore causing within the check viability an error, and afterwards on a reschedule retry they are immediately rescheduled away once more because of probably the same bug. this happens within milliseconds, not to say nanoseconds, to determine
|
Updated by mfriedrich on 2011-10-22 17:48:44 +00:00 get_next_valid_time contains a nested check with check_time_against_period and preferred_time against tperiod.
0 causing get_earliest_time to use
and then taking the timeperiod exclusions into account. i expect that no timeperiod exclusion is used currently? if so this would be called recursively, incrementing the level and also hitting the else tree. why the heck is valid_time +1 over there?
anyhow, the actual logic happens within _get_next_valid_time_per_timeperiod this is where a lot of is_dst=-1 are happening, but that should be set by the system and then being checked and compared by the application, like done with shift? |
Updated by mfriedrich on 2011-10-22 17:50:45 +00:00
can you save dst.c and let it compile somewhere with
and run it with
and post the output? maybe there is a daylight saving problem? |
Updated by mfriedrich on 2011-10-22 17:58:04 +00:00 just checked ...
isdst should be determined with
which is actually done within
even calculating the shift then.
but as a matter of fact, not used in each place where new times are calculated. could be a problem too. |
Updated by mfriedrich on 2011-10-22 18:28:47 +00:00 can i get the full debuglog what's happening there, sent to michael.friedrich(at)univie.ac.at icinga.cfg debug_level=16 |
Updated by mcp on 2011-10-23 09:50:14 +00:00 root@lokalhorst:/root/ # ./dst |
Updated by mcp on 2011-10-25 10:16:52 +00:00 ok debug log is almost ready. I have to replace customer names etc. but when I'm finished I'll send it to you. |
Updated by mcp on 2011-10-25 11:46:52 +00:00 Just sent the debug output via email to you. |
Updated by elagon on 2012-08-27 15:30:52 +00:00
@mcp Can you try the latest stable version and tell us if you notice the same issue? |
Updated by mcp on 2012-08-27 15:56:31 +00:00 I will. |
Updated by mcp on 2012-09-12 16:17:05 +00:00 Moin elagon, nice, after several days I did not have 1 check which wasn't scheduled. No Icinga restarts at 7am and 8am anymore. Thank you very much! Nice work :) |
Updated by mfriedrich on 2012-09-12 17:19:05 +00:00 interesting. i did not change anything in this code region. lemme guess - we are in the wrong timezone (dst)? |
Updated by mfriedrich on 2012-09-12 17:32:35 +00:00 we could test a patch, which was originally done on the nagios tree. |
Updated by mcp on 2012-09-12 18:07:16 +00:00 to test what? breakage again? ;-) |
Updated by mcp on 2012-09-13 08:09:56 +00:00 grmpf, for whatever fucking reason, this morning I had 554 services which were not scheduled :-( |
Updated by mjbrooks on 2012-09-14 00:23:59 +00:00 @mcp please keep your language family friendly. Updates go out to other places such as Twitter. |
Updated by mfriedrich on 2012-09-24 10:42:10 +00:00 as said, we can test that patch. though, i won't take that into 1.8 without longterm tests. |
Updated by mcp on 2012-10-08 10:06:44 +00:00 OK, applied and I will test it. |
Updated by mcp on 2012-10-09 15:13:32 +00:00 the patch does not fix the problem :( |
Updated by shankarpatel on 2013-07-15 06:02:07 +00:00 I am also facing that same issue so I registered 4368 issue for the same and now waiting for reply here as per dnsmichi. |
Updated by mfriedrich on 2014-06-20 09:12:59 +00:00 You might try Icinga 2. The event scheduler handling was (like the rest) rewritten from scratch, and Icinga 2 won't forget about tasks being handled. |
Updated by mfriedrich on 2014-07-19 12:51:08 +00:00 I guess that's a long-term won't fix unless someone comes up with a patch that works here. If there is none in the next months, I'll close the issue. |
Updated by mfriedrich on 2014-08-20 20:45:03 +00:00
|
Updated by mfriedrich on 2015-03-12 19:14:02 +00:00
Closing as wontfix. If you come up with a patch, we can re-open the issue. |
This issue has been migrated from Redmine: https://dev.icinga.com/issues/1782
Created by mcp on 2011-08-04 10:34:46 +00:00
Assignee: (none)
Status: Rejected (closed on 2015-03-12 19:14:02 +00:00)
Target Version: (none)
Last Update: 2015-03-12 19:14:02 +00:00 (in Redmine)
Hi,
I'm using Icinga v1.4.2, but had the same problems also with v1.4.1 and v1.4.0,
but it seems to become more worse the past days.
Icinga forgets to schedule checks which are not 24x7.
I have a bunch of checks doing checks from 7am to 10pm and another bunch
of checks from 8am to 6pm, but those checks are not scheduled to start again
when a new day and the timeperiod has begun. Today I had 341 service checks
for 7am and 228 for 8am.
all those checks have "Next Scheduled Check: N/A".
When I restart Icinga after 7am but before 8am all checks which should start
at 7am are scheduled again, but not the ones who should start at 8am. I have
to restart Icinga once more after 8am to get these checks scheduled also.
I'm using the same config as I used for Nagios 3 the past years and didn't
have that problem with Nagios.
any idea?
thanks!
Attachments
Changesets
2011-10-22 18:09:37 +00:00 by mfriedrich 348a795
Relations:
The text was updated successfully, but these errors were encountered: