This repository has been archived by the owner on Jan 15, 2019. It is now read-only.

[dev.icinga.com #2688] triggered downtimes for child hosts are missing after icinga restart #1006

Closed
icinga-migration opened this issue Jun 14, 2012 · 19 comments

This issue has been migrated from Redmine: https://dev.icinga.com/issues/2688

Created by mlucka on 2012-06-14 15:39:21 +00:00

Assignee: mfriedrich
Status: Resolved (closed on 2013-04-10 18:48:09 +00:00)
Target Version: 1.9
Last Update: 2013-04-10 18:48:09 +00:00 (in Redmine)

Icinga Version: 1.6.0
OS Version: Debian

Hi,

there's an issue with the triggered downtime feature, seen on icinga 1.7.0 and nagios 3.2.3 and above...

triggered downtimes for child hosts (type: fixed, child hosts: schedule triggered downtime for all child hosts) are deleted during an icinga restart. the downtime on the master host (parent) is not affected.
This should be easy to reproduce with just 2 hosts. If you need further information on this subject, don't hesitate to get in touch with me.

Best regards

Michael

Attachments

Changesets

2012-10-30 20:12:47 +00:00 by mfriedrich 2b671f4

add test case refs #2688

Updated by mlucka on 2012-06-14 15:42:00 +00:00

I did not test this on earlier icinga versions, but nagios 3.0.6 and 3.2.1 don't have this issue. Maybe this information helps you a bit while investigating...


Updated by mfriedrich on 2012-06-14 15:45:43 +00:00

please provide some sample configs, as well as the logs generated from this.


Updated by mlucka on 2012-06-14 15:48:58 +00:00

This could be the reason/solution: http://tracker.nagios.org/view.php?id=338

Found it just a few seconds ago...


Updated by mfriedrich on 2012-06-14 16:07:35 +00:00

  • Target Version deleted 1.7

no. nagios 3.4.x took an icinga patch from 2 years ago, which has since been rewritten in icinga upstream.

icinga handles a restart (and therefore a downtime that is not yet in effect) differently, see common/downtime.c starting with

        /* else we are just starting the scheduled downtime */
        else {
...
                /* this happens after restart of icinga */
                if (temp_downtime->is_in_effect != TRUE) {

that patch addresses hosts in downtime no longer being persistent after a restart.

you are talking about child hosts triggered by the parent, which is a different story. so please provide your configs and logs (plus debug logs in this special case) so we can see whether your bug report is valid and reproducible.
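
To make the restart handling described above a bit more concrete, here is a small, self-contained toy model (not the upstream code) of the decision the quoted snippet guards: after a restart, a downtime restored from retention.dat is only started again (notification sent, downtime depth raised) if it was not already flagged as in effect before the restart. All names below are illustrative assumptions.

#include <stdio.h>

/* illustrative subset of the downtime fields kept in retention.dat */
typedef struct {
        unsigned long downtime_id;
        unsigned long triggered_by;   /* downtime_id of the parent, 0 if none */
        int fixed;
        int is_in_effect;             /* 1 if the downtime was already active */
} toy_downtime;

/* returns 1 if the downtime has to be (re)started after a restart,
 * 0 if its effects were already restored from retention data */
static int needs_start_after_restart(const toy_downtime *dt) {
        if (dt->is_in_effect != 1)
                return 1;   /* not yet active before the restart: start it */
        return 0;           /* already active: do not notify/increment again */
}

int main(void) {
        toy_downtime parent = { 63, 0,  1, 1 };
        toy_downtime child  = { 64, 63, 1, 1 };

        printf("parent (id 63) needs start: %d\n", needs_start_after_restart(&parent));
        printf("child  (id 64) needs start: %d\n", needs_start_after_restart(&child));
        return 0;
}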


Updated by mlucka on 2012-06-15 14:54:43 +00:00

  • File added icinga-bug-2688.tgz

Hi,

please find attached the sample config, logs and screenshots...

I reproduced the behavior as follows (debian squeeze, up2date, 32bit):

  • setup fresh icinga 1.7.0 installation from backports-squeeze
  • adjusted the icinga config for the test setup, forced checks, stopped icinga, cleaned up the logs, started icinga again
  • saved status.dat into status.dat_after_first_start
  • scheduled a fixed downtime for localhost, triggered for all child hosts (test in this case)
  • saved status.dat into status.dat_before_first_stop
  • stopped icinga
  • saved retention.dat into retention.dat_after_first_stop
  • started icinga
  • the triggered host downtime (for child host test) was missing on the downtime page (icinga-bug-2688-02.jpg) but not on the host itself (icinga-bug-2688-01.jpg)
  • triggered downtime was also included in the current status file (status.dat_after_first_start)
  • stopped icinga again and saved retention.dat into retention.dat_after_second_stop
  • the triggered downtime child host test was still included in retention.dat_after_second_stop
  • started up icinga again
  • triggered downtime for child host test was missing on downtime page (icinga-bug-2688-04.jpg) and the host itself (icinga-bug-2688-03.jpg)
  • saved status.dat into status.dat_after_second_start, triggered downtime for child host test was missing here as well
  • stopped icinga
  • saved retention.dat into retention.dat_after_third_stop, just one downtime from the parent included

I think you can easily reproduce it yourself. An icinga 1.6.2 installation was tested as well, showing the same results.

Best regards

Michael
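
For reference, the downtime scheduled above for localhost (fixed, propagated as triggered downtimes to all child hosts) can also be submitted through the external command pipe instead of the classic UI. The sketch below uses the standard SCHEDULE_AND_PROPAGATE_TRIGGERED_HOST_DOWNTIME external command; the command file path and the author/comment strings are assumptions and have to match your icinga.cfg.

#include <stdio.h>
#include <time.h>

int main(void) {
        /* assumption: default Debian path, see command_file in icinga.cfg */
        const char *cmd_file = "/var/lib/icinga/rw/icinga.cmd";

        time_t now   = time(NULL);
        time_t start = now;
        time_t end   = now + 3600;

        FILE *fp = fopen(cmd_file, "w");
        if (fp == NULL) {
                perror("fopen command file");
                return 1;
        }

        /* fixed 1h downtime on localhost, propagated as triggered downtimes
         * to all of its child hosts (trigger id 0 for the parent downtime) */
        fprintf(fp, "[%lu] SCHEDULE_AND_PROPAGATE_TRIGGERED_HOST_DOWNTIME;"
                    "localhost;%lu;%lu;1;0;3600;icingaadmin;test2688\n",
                (unsigned long)now, (unsigned long)start, (unsigned long)end);

        fclose(fp);
        return 0;
}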


Updated by mfriedrich on 2012-06-15 15:33:31 +00:00

  • Status changed from New to Assigned
  • Assigned to set to mfriedrich

thanks for the detailed report, i'll put it on my todo list after 1.7.1 is out, plus when i am a puppet master.


Updated by mlucka on 2012-10-01 17:20:59 +00:00

Hi,

is there any schedule for when this issue might be fixed?

Best regards

Michael


Updated by mfriedrich on 2012-10-02 08:06:32 +00:00

  • Icinga Version set to 1
  • OS Version set to Debian

haven't had the time yet. hopefully others have - otherwise it will remain a todo.


Updated by mfriedrich on 2012-10-24 18:37:05 +00:00

  • Target Version set to 1.9


Updated by mfriedrich on 2012-10-30 20:10:35 +00:00

  • File added 2688.cfg
  • File added status_retention_dat_2688.zip

testing with f78e443 as latest commit.

status_dat_before_first_stop

hoststatus {
        host_name=2688localhost-test

        scheduled_downtime_depth=1
        }

hoststatus {
        host_name=2688localhost-test-p1

        scheduled_downtime_depth=1
        }

hoststatus {
        host_name=2688localhost-test-p1-2


        scheduled_downtime_depth=1
        }

hoststatus {
        host_name=2688localhost-test-p2

        scheduled_downtime_depth=1
        }

retention_dat_after_first_stop

hostdowntime {
host_name=2688localhost-test
downtime_id=63
entry_time=1351624983
start_time=1351624960
end_time=1351628560
triggered_by=0
fixed=1
duration=3600
is_in_effect=1
author=icingademo
comment=test2688
trigger_time=1351624983
}
hostdowntime {
host_name=2688localhost-test-p2
downtime_id=64
entry_time=1351624983
start_time=1351624960
end_time=1351628560
triggered_by=63
fixed=1
duration=3600
is_in_effect=1
author=icingademo
comment=test2688
trigger_time=1351624983
}
hostdowntime {
host_name=2688localhost-test-p1-2
downtime_id=65
entry_time=1351624983
start_time=1351624960
end_time=1351628560
triggered_by=63
fixed=1
duration=3600
is_in_effect=1
author=icingademo
comment=test2688
trigger_time=1351624983
}
hostdowntime {
host_name=2688localhost-test-p1
downtime_id=66
entry_time=1351624983
start_time=1351624960
end_time=1351628560
triggered_by=63
fixed=1
duration=3600
is_in_effect=1
author=icingademo
comment=test2688
trigger_time=1351624983
}

status_dat_after_first_start

hoststatus {
        host_name=2688localhost-test

        scheduled_downtime_depth=1
        }

hoststatus {
        host_name=2688localhost-test-p1

        scheduled_downtime_depth=1
        }

hoststatus {
        host_name=2688localhost-test-p1-2


        scheduled_downtime_depth=1
        }

hoststatus {
        host_name=2688localhost-test-p2

        scheduled_downtime_depth=1
        }

retention_dat_after_second_stop

hostdowntime {
host_name=2688localhost-test-p1
downtime_id=66
entry_time=1351624983
start_time=1351624960
end_time=1351628560
triggered_by=63
fixed=1
duration=3600
is_in_effect=1
author=icingademo
comment=test2688
trigger_time=1351624983
}
hostdowntime {
host_name=2688localhost-test-p1-2
downtime_id=65
entry_time=1351624983
start_time=1351624960
end_time=1351628560
triggered_by=63
fixed=1
duration=3600
is_in_effect=1
author=icingademo
comment=test2688
trigger_time=1351624983
}
hostdowntime {
host_name=2688localhost-test-p2
downtime_id=64
entry_time=1351624983
start_time=1351624960
end_time=1351628560
triggered_by=63
fixed=1
duration=3600
is_in_effect=1
author=icingademo
comment=test2688
trigger_time=1351624983
}
hostdowntime {
host_name=2688localhost-test
downtime_id=63
entry_time=1351624983
start_time=1351624960
end_time=1351628560
triggered_by=0
fixed=1
duration=3600
is_in_effect=1
author=icingademo
comment=test2688
trigger_time=1351624983
}

status_dat_after_second_start

hoststatus {
        host_name=2688localhost-test

        scheduled_downtime_depth=1
        }

hoststatus {
        host_name=2688localhost-test-p1

        scheduled_downtime_depth=0
        }

hoststatus {
        host_name=2688localhost-test-p1-2

        scheduled_downtime_depth=0
        }


hoststatus {
        host_name=2688localhost-test-p2

        scheduled_downtime_depth=0
        }

retention_dat_after_third_stop

hostdowntime {
host_name=2688localhost-test
downtime_id=63
entry_time=1351624983
start_time=1351624960
end_time=1351628560
triggered_by=0
fixed=1
duration=3600
is_in_effect=1
author=icingademo
comment=test2688
trigger_time=1351624983
}


Updated by mfriedrich on 2012-10-30 20:23:19 +00:00

so, as it's a bit late today - i can reproduce and see it, but i am not sure where exactly this is being hit or ignored. might need some deep debug sessions.


Updated by mlucka on 2013-02-28 15:44:21 +00:00

  • File added 99_fix_triggered_downtimes.dpatch

Hello,

attached are the collected works of a colleague who is trying to migrate from Nagios to Icinga 1.7.1 (Debian 7), with the request to review and integrate the patch.

Regards, Micha.

Attached is a patch that fixes the downtime problem in Icinga
(against icinga 1.7.1, but the relevant code is identical in
current Git).

The problem is:

  • A child downtime is not accepted when reading retention.dat/status.dat
    if its parent downtime (the "Trigger ID" in the classic GUI) does not
    exist yet.

Quoting common/downtime.c:add_downtime:
/* don't add triggered downtimes that don't have a valid parent */

  • The downtimes are sorted ascending by start time and otherwise
    "unfavorably", and no longer -- as in Nagios so far -- ascending by
    downtime_id.
    Because of that, the child downtimes are written to status.dat before
    their parent downtimes (this can also be seen in the classic UI).

retention.dat/status.dat, however, is only read and processed
sequentially, so icinga tries to create the child downtimes first.

Assumptions (made by the patch):

  • parent and child downtimes have the same start time
  • downtime_id is assigned in ascending numeric order, with the
    parent/trigger ID before the child

Side note: why the sort function was written the way it was is somewhat
puzzling. Instead of
(d1->start_time < d2->start_time) ? -1 : (d1->start_time - d2->start_time);
the following would also have been sufficient:
(d1->start_time < d2->start_time)
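
Since the dpatch itself is only attached, here is a minimal, compilable sketch of the sorting idea described above: start_time stays the primary sort key, but ties are broken on ascending downtime_id, so a parent downtime (lower id, triggered_by = 0) always ends up in front of the children that reference it and survives the sequential read of retention.dat/status.dat. Struct and function names are illustrative assumptions, not the literal upstream code or the attached patch.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* illustrative subset of scheduled_downtime */
typedef struct {
        unsigned long downtime_id;
        unsigned long triggered_by;   /* downtime_id of the parent, 0 if none */
        time_t start_time;
} toy_downtime;

/* qsort() comparator: ascending start_time, ties broken by downtime_id,
 * so parents are written to (and re-added from) the file before children */
static int downtime_compar(const void *p1, const void *p2) {
        const toy_downtime *d1 = *(const toy_downtime * const *)p1;
        const toy_downtime *d2 = *(const toy_downtime * const *)p2;

        if (d1->start_time != d2->start_time)
                return (d1->start_time < d2->start_time) ? -1 : 1;
        if (d1->downtime_id != d2->downtime_id)
                return (d1->downtime_id < d2->downtime_id) ? -1 : 1;
        return 0;
}

int main(void) {
        /* the parent/child pair from the retention.dat excerpts above */
        toy_downtime parent = { 63, 0,  (time_t)1351624960 };
        toy_downtime child  = { 64, 63, (time_t)1351624960 };
        toy_downtime *list[] = { &child, &parent };

        qsort(list, 2, sizeof(list[0]), downtime_compar);

        for (int i = 0; i < 2; i++)
                printf("downtime_id=%lu triggered_by=%lu\n",
                       list[i]->downtime_id, list[i]->triggered_by);
        return 0;
}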


Updated by mlucka on 2013-02-28 16:41:49 +00:00

CORRECTION

Instead of
(d1->start_time < d2->start_time) ? -1 : (d1->start_time - d2->start_time);
the following would also have been sufficient:
(d1->start_time - d2->start_time)


Updated by mfriedrich on 2013-03-04 19:58:27 +00:00

that looks like a hell of an idea, thanks.

once i get a little more dev time, i'll try to re-think and test it.


Updated by mfriedrich on 2013-03-10 14:51:59 +00:00

  • File added icinga_1.9_fix_child_downtimes_2688.png

currently exists in my mfriedrich/core dev branch.

a final test after committing shows that the child triggered downtimes are still there.

icinga_1.9_fix_child_downtimes_2688.png


Updated by mfriedrich on 2013-03-13 23:18:34 +00:00

  • Status changed from Assigned to 7
  • Done % changed from 0 to 70


Updated by mfriedrich on 2013-03-13 23:20:41 +00:00

used the wrong commit id
https://dev.icinga.org/projects/icinga-core/repository/revisions/161d5117fa585e6cbbbef27b51a6701dfa2a8eeb


Updated by mfriedrich on 2013-04-06 22:02:32 +00:00

@MLucka

are you able to test current git master/next?


Updated by mfriedrich on 2013-04-10 18:48:09 +00:00

  • Status changed from 7 to Resolved
  • Done % changed from 70 to 100
