[dev.icinga.com #11382] Downtimes are not always activated/expired on restart #4031
Comments
Updated by ClemensBW on 2016-03-15 15:38:28 +00:00 Same problem here. dnsmichi wrote:

object Endpoint "$master" { ... }
object Zone "master" { ... }

object Endpoint "987" { ... }
object Zone "987" { ... }

////////////////////////

object Endpoint "123" { ... }
object Zone "456" { ... }
Updated by delgaty on 2016-03-15 18:31:39 +00:00 dnsmichi wrote:

object Endpoint "i2master" { ... }
object Zone "master" { ... }
object Zone "i2slave" { ... }
object Zone "global-templates" { ... }

I have removed my recurring downtime file until I resolve this issue. I have been scheduling downtimes through icinga-web and icingaweb2 with the same result.
Updated by jcarterch on 2016-03-21 21:13:33 +00:00 I am encountering what seems to be the same problem. I was able to consistently reproduce it as follows: all active downtimes (scheduled via the API) are lost/disabled when a secondary master restarts while the primary configuration master is down (a zone of two masters in total). The downtime events remain present in the database (icinga_scheduleddowntime) but are not removed by API requests, even though DEL_HOST_DOWNTIME and DEL_SVC_DOWNTIME commands are logged in the debug log as written to icinga_externalcommands. No errors are logged by Icinga for a failed external command. The downtimes are visible in icingaweb2 but are not "in effect", and they do not return to being in effect after forced service checks are executed. Bringing the primary up again does not return the downtimes to an active state.
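For reference, the downtimes were scheduled roughly like this via the REST API (a sketch only — port 5665, the "root" ApiUser, the credentials, and the host/service names are placeholders, not my actual setup):

# Schedule a fixed two-hour service downtime via the Icinga 2 REST API.
curl -k -s -u root:icinga -H 'Accept: application/json' \
  -X POST 'https://localhost:5665/v1/actions/schedule-downtime' \
  -d '{ "type": "Service",
        "filter": "host.name == \"examplehost\" && service.name == \"examplesvc\"",
        "start_time": 1458594000, "end_time": 1458601200,
        "fixed": true, "author": "admin", "comment": "maintenance" }'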
Updated by mfriedrich on 2016-03-22 21:32:24 +00:00 Notes:

RemoveDowntimeInternal() does not only delete the entry but also updates the downtime history and status tables (downtime_depth).

It would certainly help to get a debug log containing all "IdoMysql" entries from before and after the restart to dig deeper.
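A sketch of how to capture that, assuming the default debug log path:

# Enable the debug log feature and restart so it takes effect.
icinga2 feature enable debuglog
systemctl restart icinga2

# Collect the IDO-related entries from before and after the restart.
grep 'IdoMysql' /var/log/icinga2/debug.log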
Updated by mfriedrich on 2016-03-23 09:15:41 +00:00 A question to all the reporters here: once you've added a downtime or a comment, you should see its (legacy)_id (note the one from inside your database dump). The table entry also includes the object_id (note that one as well).

Better: enable the Core REST API, which is independent of the faulty DB IDO updates. Then you'll know the current downtime values. Pick a legacy_id which is fairly large, not the first or tenth one. Note the __name field and write it down, along with the legacy_id. Then fire a restart. Call the REST API again and look for the __name. Is the legacy_id the same, or did it change? Please post your findings here, including all outputs.

(OTOH: I guess the DELETE query using the object_id and internal_downtime_id in the WHERE clause is incorrect because the downtime_id (legacy_id) changed over restarts. In 2.4 we changed the way downtimes are inserted (delete up front) but never looked into the delete statement. Icinga 1.x omits the legacy_id as it could change over time.)
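A minimal sketch of that check (assuming the default API port 5665 and an ApiUser named "root"; adjust the credentials to your setup):

# Dump all downtimes; note the __name and legacy_id attributes in the output.
curl -k -s -u root:icinga -H 'Accept: application/json' \
  'https://localhost:5665/v1/objects/downtimes' | python -m json.tool

# Restart Icinga 2, run the same query again, and compare whether the
# legacy_id for a given __name changed.
systemctl restart icinga2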
Updated by mfriedrich on 2016-03-23 13:08:20 +00:00

This could be the culprit: the DB IDO delete-then-insert happens with the new legacy_id, leaving the old one behind. I haven't found a clean way to reproduce it yet (I always end up with the same number of downtimes), but I've found a possible fix by re-adding the legacy_id field to the state file. Please test the current snapshot packages from git master.
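One way to look for such leftovers (a sketch assuming a MySQL IDO database named "icinga"; multiple rows per object are only a hint, not proof of stale entries):

# List objects with more than one row in icinga_scheduleddowntime --
# a possible sign that a delete-then-insert left an old row behind.
mysql icinga -e "
  SELECT object_id, COUNT(*) AS row_count,
         GROUP_CONCAT(internal_downtime_id) AS legacy_ids
  FROM icinga_scheduleddowntime
  GROUP BY object_id
  HAVING COUNT(*) > 1;"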
Updated by delgaty on 2016-04-01 17:08:23 +00:00 I have been running the snapshot in my test environment for over a week and have had no issues with downtimes. Both manual and recurring downtimes are activating and expiring as expected. The issue seems to be resolved in this snapshot. Thanks.
Updated by mfriedrich on 2016-04-04 14:05:28 +00:00

OK, thanks for your kind feedback on your tests. I'd be happy to see the other reporters on this issue fire up their test stages as well. Kind regards,
This issue has been migrated from Redmine: https://dev.icinga.com/issues/11382
Created by delgaty on 2016-03-14 19:22:33 +00:00
Assignee: mfriedrich
Status: Resolved (closed on 2016-04-07 08:16:14 +00:00)
Target Version: 2.4.5
Last Update: 2016-05-02 13:27:26 +00:00 (in Redmine)
Hello,
I am running icinga-2.4.3-1 with icinga-web 1.13.1 on Fedora 23 in a distributed environment. I am able to execute most external commands, such as acknowledgements, scheduling checks, and disabling notifications, without incident, but I am having trouble with downtimes. I am able to schedule a downtime and see it in the downtime pane fine. The problem is that the downtime isn't always activating: if it is fixed for 2 hours and the service goes down within that timeframe, alerts are not suppressed and the downtime icon does not appear for the service. I am also seeing downtimes that should have expired still in the downtime pane, and deleting them does not work either. I have run the debug log and do not see any errors.
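For reference, a downtime scheduled from icinga-web reaches the core as an external command; a minimal sketch of the same operation done by hand (the command pipe path is the Icinga 2 default, and the host, service, author, and comment are placeholder values):

# SCHEDULE_SVC_DOWNTIME;<host>;<service>;<start>;<end>;<fixed>;<trigger_id>;<duration>;<author>;<comment>
now=$(date +%s)
printf '[%s] SCHEDULE_SVC_DOWNTIME;cloudweb07;CPU;%s;%s;1;0;7200;jen;test 1\n' \
  "$now" "$now" "$((now + 7200))" > /var/run/icinga2/cmd/icinga2.cmd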
When deleting the downtime that did not activate, this is the entry in the log:
[2016-03-13 13:35:16 -0600] information/ExternalCommandListener: Executing external command: [1457897716] DEL_SVC_DOWNTIME;1
[2016-03-13 13:35:16 -0600] debug/IdoMysqlConnection: Query: INSERT INTO icinga_externalcommands (command_args, command_name, command_type, endpoint_object_id, entry_time, instance_id) VALUES ('1', 'DEL_SVC_DOWNTIME', '79', 7626, FROM_UNIXTIME(1457897716), 1)
But the entry in the database never actually gets deleted.
Nor does the corresponding file in /var/lib/icinga2/api/packages/_api/xxx.xxx.xxx-1457480846-0/conf.d/downtimes.
/var/lib/icinga2/api/packages/_api/xxx.xxx.xxx-1457480846-0/conf.d/downtimes

-rw-r--r-- 1 icinga icinga 344 Mar 13 13:22 cloudweb07xxx.xxx.xxx-1457896976-0.conf

object Downtime "xxx.xxx.xxx-1457896976-0" ignore_on_error {
author = "jen"
comment = "test 1"
config_owner = ""
duration = 7200.000000
end_time = 1457904211.000000
fixed = true
host_name = "cloudweb07"
scheduled_by = ""
service_name = "CPU"
start_time = 1457897011.000000
triggered_by = ""
version = 1457896976.122614
}
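For reference: DEL_SVC_DOWNTIME takes the downtime's numeric legacy id (the "1" in the log entry above), which is exactly the value that can change across restarts. A sketch, again assuming the default command pipe path:

# DEL_SVC_DOWNTIME;<downtime_id> -- the id is the numeric legacy_id,
# not the downtime's __name.
printf '[%s] DEL_SVC_DOWNTIME;1\n' "$(date +%s)" > /var/run/icinga2/cmd/icinga2.cmd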
Sometimes, however, the downtimes do work. This is info from one that did activate:
-rw-r--r-- 1 icinga icinga 350 Mar 14 12:36 LinuxSMTPxxx.xxx.xxx-1457980581-0.conf

object Downtime "xxx.xxx.xxxl-1457980581-0" ignore_on_error {
author = "jen"
comment = "test"
config_owner = ""
duration = 7200.000000
end_time = 1457981430.000000
fixed = true
host_name = "LinuxSMTP"
scheduled_by = ""
service_name = "procs"
start_time = 1457980590.000000
triggered_by = ""
version = 1457980581.536455
}
I cannot find anything in common among the downtimes that are not working correctly. Thanks for any insight.
Changesets
2016-03-23 13:05:09 +00:00 by mfriedrich 0447e81
2016-04-20 08:07:22 +00:00 by mfriedrich 521580f