[dev.icinga.com #11173] Notifications for hosts/services in downtime after config reload #3944

Closed
icinga-migration opened this issue Feb 17, 2016 · 15 comments
Labels
area/distributed Distributed monitoring (master, satellites, clients) bug Something isn't working

Comments

@icinga-migration

This issue has been migrated from Redmine: https://dev.icinga.com/issues/11173

Created by Reavermaster on 2016-02-17 10:45:40 +00:00

Assignee: (none)
Status: Closed (closed on 2017-01-09 15:06:11 +00:00)
Target Version: (none)
Last Update: 2017-01-09 15:06:11 +00:00 (in Redmine)

Icinga Version: 2.4.1
Backport?: Not yet backported
Include in Changelog: 1

I've observed that Icinga 2 sends out notifications for hosts or services in downtime after a config reload was started.

Sometimes notifications were also resent from the other cluster node after a reload.

System:
CentOS 7.2.1511
icinga2 v2.4.1


Relations:

@icinga-migration

Updated by mfriedrich on 2016-02-24 19:51:08 +00:00

  • Status changed from New to Feedback
  • Assigned to set to Reavermaster

Do you happen to have more details, e.g. (debug) logs providing more insight into why the downtime is ignored for these services, allowing notifications to be sent?
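
(For reference, the debug log can be enabled roughly like this; a minimal sketch assuming a default package install and the usual log path:)

# enable the debuglog feature, reload the daemon, then follow the debug log
icinga2 feature enable debuglog
systemctl reload icinga2
tail -f /var/log/icinga2/debug.log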

@icinga-migration

Updated by Reavermaster on 2016-02-29 08:06:28 +00:00

Okay, here are some lines from icinga2.log:

[2016-02-29 08:44:25 +0100] information/Application: Got reload command: Starting new instance.
[2016-02-29 08:44:26 +0100] information/Application: Received request to shut down.
[2016-02-29 08:44:26 +0100] information/Application: Shutting down...
[2016-02-29 08:44:26 +0100] information/CheckerComponent: Checker stopped.
[2016-02-29 08:44:28 +0100] critical/Socket: accept() failed with error code 9, "Bad file descriptor"
[2016-02-29 08:44:28 +0100] critical/LivestatusListener: Cannot accept new connection.
[2016-02-29 08:44:28 +0100] warning/IcingaStatusWriter: This feature was deprecated in 2.4 and will be removed in future Icinga 2 releases.
Context:
        (0) Activating object 'icinga-status' of type 'IcingaStatusWriter'

[2016-02-29 08:44:28 +0100] information/DbConnection: Resuming IDO connection: ido-mysql
[2016-02-29 08:44:28 +0100] information/LivestatusListener: Created UNIX socket in '/run/icinga2/cmd/livestatus'.
[2016-02-29 08:44:29 +0100] information/ApiListener: Copying 186 zone configuration files for zone 'location1' to '/var/lib/icinga2/api/zones/location1'.
[2016-02-29 08:44:29 +0100] information/ApiListener: Copying 17 zone configuration files for zone 'location2' to '/var/lib/icinga2/api/zones/location2'.
[2016-02-29 08:44:29 +0100] information/ApiListener: Copying 20 zone configuration files for zone 'configurations' to '/var/lib/icinga2/api/zones/configurations'.
[2016-02-29 08:44:29 +0100] information/ApiListener: Adding new listener on port '5665'
[2016-02-29 08:44:29 +0100] information/ConfigItem: Activated all objects.
[2016-02-29 08:44:29 +0100] information/JsonRpcConnection: Reconnecting to API endpoint 'icingahost2.my.location.com' via host '10.20.30.3' and port '5665'
[2016-02-29 08:44:29 +0100] information/JsonRpcConnection: Reconnecting to API endpoint 'icingahost3.my.location.com' via host '10.20.30.4' and port '5665'
[2016-02-29 08:44:29 +0100] information/ConfigCompiler: Compiling config file: /var/lib/icinga2/modified-attributes.conf
[2016-02-29 08:44:29 +0100] information/ApiListener: New client connection for identity 'icingahost2.my.location.com'
[2016-02-29 08:44:29 +0100] information/ApiListener: Sending config updates for endpoint 'icingahost2.my.location.com'.
[2016-02-29 08:44:29 +0100] information/ApiListener: Syncing zone 'meschede' to endpoint 'icingahost2.my.location.com'.
[2016-02-29 08:44:29 +0100] information/ApiListener: Syncing zone 'nuttlar' to endpoint 'icingahost2.my.location.com'.
[2016-02-29 08:44:29 +0100] information/ApiListener: Syncing global zone 'configurations' to endpoint 'icingahost2.my.location.com'.
[2016-02-29 08:44:29 +0100] information/ApiListener: Syncing runtime objects to endpoint 'icingahost2.my.location.com'.
[2016-02-29 08:44:29 +0100] information/ApiListener: Finished sending config updates for endpoint 'icingahost2.my.location.com'.
[2016-02-29 08:44:29 +0100] information/ApiListener: Sending replay log for endpoint 'icingahost2.my.location.com'.
[2016-02-29 08:44:29 +0100] information/ApiListener: Replayed 8 messages.
[2016-02-29 08:44:29 +0100] information/ApiListener: Finished sending replay log for endpoint 'icingahost2.my.location.com'.
[2016-02-29 08:44:29 +0100] information/ApiListener: New client connection for identity 'icingahost3.my.location.com'
[2016-02-29 08:44:29 +0100] information/ApiListener: Sending config updates for endpoint 'icingahost3.my.location.com'.
[2016-02-29 08:44:29 +0100] information/ApiListener: Syncing zone 'nuttlar' to endpoint 'icingahost3.my.location.com'.
[2016-02-29 08:44:29 +0100] information/ApiListener: Syncing global zone 'configurations' to endpoint 'icingahost3.my.location.com'.
[2016-02-29 08:44:29 +0100] information/ApiListener: Syncing runtime objects to endpoint 'icingahost3.my.location.com'.
[2016-02-29 08:44:29 +0100] information/ApiListener: Finished sending config updates for endpoint 'icingahost3.my.location.com'.
[2016-02-29 08:44:29 +0100] information/ApiListener: Sending replay log for endpoint 'icingahost3.my.location.com'.
[2016-02-29 08:44:29 +0100] information/ApiListener: Finished sending replay log for endpoint 'icingahost3.my.location.com'.
[2016-02-29 08:44:30 +0100] warning/ApiListener: Ignoring config update. 'api' does not accept config.
[2016-02-29 08:44:33 +0100] information/Notification: Sending notification 'checkhost!twinnotification' for user 'icingaadmin'
[2016-02-29 08:44:33 +0100] information/Notification: Sending notification 'checkhost!twinnotification' for user 'user1'
[2016-02-29 08:44:33 +0100] information/Notification: Sending notification 'checkhost!twinnotification' for user 'user2'
[2016-02-29 08:44:33 +0100] information/Notification: Sending notification 'checkhost!twinnotification' for user 'user3'
[2016-02-29 08:44:33 +0100] information/Notification: Sending notification 'checkhost!twinnotification' for user 'user4'
[2016-02-29 08:44:33 +0100] information/Notification: Sending notification 'checkhost!twinnotification' for user 'user5'
[2016-02-29 08:44:33 +0100] information/Notification: Sending notification 'checkhost!twinnotification' for user 'user6'
[2016-02-29 08:44:33 +0100] information/Notification: Sending notification 'checkhost!twinnotification' for user 'user7'
[2016-02-29 08:44:33 +0100] information/Notification: Sending notification 'checkhost!twinnotification' for user 'user8'
[2016-02-29 08:44:33 +0100] information/Notification: Sending notification 'checkhost!twinnotification' for user 'user9'
[2016-02-29 08:44:33 +0100] information/Notification: Completed sending notification 'checkhost!twinnotification' for checkable 'checkhost'
[2016-02-29 08:44:33 +0100] information/Notification: Completed sending notification 'checkhost!twinnotification' for checkable 'checkhost'
[2016-02-29 08:44:33 +0100] information/Notification: Completed sending notification 'checkhost!twinnotification' for checkable 'checkhost'
[2016-02-29 08:44:33 +0100] information/Notification: Completed sending notification 'checkhost!twinnotification' for checkable 'checkhost'
[2016-02-29 08:44:33 +0100] information/Notification: Completed sending notification 'checkhost!twinnotification' for checkable 'checkhost'
[2016-02-29 08:44:33 +0100] information/Notification: Completed sending notification 'checkhost!twinnotification' for checkable 'checkhost'
[2016-02-29 08:44:33 +0100] information/Notification: Completed sending notification 'checkhost!twinnotification' for checkable 'checkhost'
[2016-02-29 08:44:33 +0100] information/Notification: Completed sending notification 'checkhost!twinnotification' for checkable 'checkhost'
[2016-02-29 08:44:33 +0100] information/Notification: Completed sending notification 'checkhost!twinnotification' for checkable 'checkhost'
[2016-02-29 08:44:34 +0100] information/Notification: Completed sending notification 'checkhost!twinnotification' for checkable 'checkhost'
[2016-02-29 08:44:39 +0100] information/Checkable: Checking for configured notifications for object 'checkhost!ping'
[2016-02-29 08:44:55 +0100] information/DbConnection: Pausing IDO connection: ido-mysql

The problem is that "checkhost" is in downtime. Here is the database export:

scheduleddowntime_id    instance_id downtime_type   object_id   entry_time  author_name comment_data    internal_downtime_id    triggered_by_id is_fixed    duration    scheduled_start_time    scheduled_end_time  was_started actual_start_time   actual_start_time_usec  is_in_effect    trigger_time    endpoint_object_id  name
105529  2   2   12169   26.02.2016 13:41    user2   Legacy  194 0   1   0   26.02.2016 13:41    05.03.2016 15:41    1   26.02.2016 13:45    374527  1   26.02.2016 13:45    5   checkhost!icingahost1.my.location.com-1456490504-65
105532  2   1   12190   26.02.2016 13:41    user2   Legacy  195 0   1   0   26.02.2016 13:41    05.03.2016 15:41    1   26.02.2016 13:44    271035  1   26.02.2016 13:44    5   checkhost!CPU-load!icingahost1.my.location.com-1456490504-66
105535  2   1   30659   26.02.2016 13:41    user2   Legacy  196 0   1   0   26.02.2016 13:41    05.03.2016 15:41    1   26.02.2016 13:41    423257  1   26.02.2016 13:41    5   checkhost!Citrix-Services!icingahost1.my.location.com-1456490504-67
105538  2   1   30650   26.02.2016 13:41    user2   Legacy  197 0   1   0   26.02.2016 13:41    05.03.2016 15:41    0   0000-00-00 00:00:00 0   0   0000-00-00 00:00:00 5   checkhost!DNS-Check!icingahost1.my.location.com-1456490504-68
105541  2   1   12187   26.02.2016 13:41    user2   Legacy  198 0   1   0   26.02.2016 13:41    05.03.2016 15:41    1   26.02.2016 13:42    573000  1   26.02.2016 13:42    5   checkhost!HDD-C!icingahost1.my.location.com-1456490504-69
105544  2   1   12178   26.02.2016 13:41    user2   Legacy  199 0   1   0   26.02.2016 13:41    05.03.2016 15:41    1   26.02.2016 13:44    961209  1   26.02.2016 13:44    5   checkhost!NSClient-Version!icingahost1.my.location.com-1456490504-70
105547  2   1   12193   26.02.2016 13:41    user2   Legacy  200 0   1   0   26.02.2016 13:41    05.03.2016 15:41    1   26.02.2016 13:43    142214  1   26.02.2016 13:43    5   checkhost!Office-Scan-Service!icingahost1.my.location.com-1456490504-71
105550  2   1   12184   26.02.2016 13:41    user2   Legacy  201 0   1   0   26.02.2016 13:41    05.03.2016 15:41    1   26.02.2016 13:46    949558  1   26.02.2016 13:46    5   checkhost!RAM-Usage!icingahost1.my.location.com-1456490504-72
105553  2   1   32115   26.02.2016 13:41    user2   Legacy  202 0   1   0   26.02.2016 13:41    05.03.2016 15:41    1   26.02.2016 13:42    612967  1   26.02.2016 13:42    5   checkhost!SNMP-Service!icingahost1.my.location.com-1456490504-73
105556  2   1   12196   26.02.2016 13:41    user2   Legacy  203 0   1   0   26.02.2016 13:41    05.03.2016 15:41    1   26.02.2016 13:44    621216  1   26.02.2016 13:44    5   checkhost!Standard-Checks!icingahost1.my.location.com-1456490504-74
105559  2   1   12172   26.02.2016 13:41    user2   Legacy  204 0   1   0   26.02.2016 13:41    05.03.2016 15:41    1   26.02.2016 13:43    594219  1   26.02.2016 13:43    5   checkhost!ping!icingahost1.my.location.com-1456490504-75
105562  2   1   12181   26.02.2016 13:41    user2   Legacy  205 0   1   0   26.02.2016 13:41    05.03.2016 15:41    1   26.02.2016 13:45    153484  1   26.02.2016 13:45    5   checkhost!uptime!icingahost1.my.location.com-1456490504-76
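
For reference, an export like the one above can be produced with a query along these lines (a sketch against the IDO schema; the database name "icinga" and the host name are placeholders):

# dump all scheduled downtimes stored for the host 'checkhost' in the IDO database
mysql icinga -e "
  SELECT d.*
  FROM icinga_scheduleddowntime d
  JOIN icinga_objects o ON o.object_id = d.object_id
  WHERE o.name1 = 'checkhost';"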

Do you need more information?

Edit: The update to Icinga 2 version 2.4.3 did not fix this problem. The log and SQL output above is from the updated version, by the way.
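
For comparison with the IDO export above, the core's own view of the host can be checked via the REST API (a sketch; the API credentials and endpoint are placeholders):

# a downtime_depth of 0 would mean the core no longer considers 'checkhost' in downtime
curl -k -s -u root:icinga -H 'Accept: application/json' \
  'https://localhost:5665/v1/objects/hosts/checkhost?attrs=downtime_depth&attrs=last_notification'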

@icinga-migration

Updated by essener61 on 2016-03-03 09:05:18 +00:00

We can confirm the problem. All non-acknowledged services are alerted on each reload.
This applies not only to services in downtime but also to other services.

@icinga-migration

Updated by mfriedrich on 2016-03-03 09:09:25 +00:00

  • Status changed from Feedback to Assigned
  • Assigned to changed from Reavermaster to mfriedrich

@icinga-migration

Updated by mfriedrich on 2016-03-04 15:35:57 +00:00

  • Parent Id set to 11311

@icinga-migration

Updated by phsc on 2016-03-09 15:30:15 +00:00

We have the same issue when reloading the icinga2 service in a cluster setup consisting of 2 nodes and 4 satellite zones (each zone has 2 satellite servers).

This is the procedure I used to start the cluster:

  • Start icinga2 service on the first node
  • Active endpoint is the first node (obviously)
  • Copy icinga2.state file from the first to second node
  • Start icinga2 service on the second node
  • After a few seconds the active endpoint changes from the first to the second node

When I reload the icinga2 service on the first node, from where I distribute my config, the downtimes remain as they should. But when I reload the icinga2 service on the second node, all downtimes turn ineffective, although I can still see them under "Downtimes" in Icinga Web 2 and in the database.
We use a custom dashboard which filters out handled services and hosts, and it now shows the downtimed items as well (I think they would also be notified, as essener61 mentioned, but at the moment we don't use notifications). Additionally, I'm not able to delete configured downtimes at all; I get no error message, neither in the web interface nor in the log files.

A few seconds after stopping the icinga2 service on the second node, the first node becomes the active endpoint. At this point all the previously configured downtimes become effective again and deleting works again.

Please let me know if you need more testing or log files.

Thanks
Phil

@icinga-migration

Updated by mfriedrich on 2016-03-09 15:34:02 +00:00

  • Relates set to 11012

@icinga-migration

Updated by mfriedrich on 2016-08-02 13:23:18 +00:00

  • Category changed from Notifications to Cluster

A guess from reading the comments: the secondary node does not know anything about the runtime-created comment/downtime objects. Once it reloads and takes over the active/active enable_ha IDO connection, it will flush/remove the visible downtimes/comments (as the core thinks they do not exist).

@phsc
Can you check whether your secondary node has all the downtimes synced over to /var/lib/icinga2/api/packages/_api? I would assume they are not there.
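
A quick way to compare the runtime-created downtime objects on both nodes could look like this (a sketch; it assumes runtime objects live under conf.d/downtimes inside the active _api stage, and the peer host name is a placeholder):

# downtime objects known locally
find /var/lib/icinga2/api/packages/_api -path '*/conf.d/downtimes/*.conf' | sort
# downtime objects on the other HA node
ssh icingahost2 "find /var/lib/icinga2/api/packages/_api -path '*/conf.d/downtimes/*.conf' | sort"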

Changing the category to "cluster" as it seems this is only affecting HA setups.

Using a local standalone setup, this is not reproducible:

  1. "disk" is in WARNING state
  2. schedule a downtime (e.g. via the API, as sketched below)
  3. watch it trigger
  4. kill -HUP $(pidof icinga2)
  5. tail -f var/log/icinga2/icinga2.log
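
The downtime in step 2 can be scheduled via the REST API roughly like this (a sketch; credentials, object names and the epoch timestamps are placeholders):

# schedule a fixed one-hour downtime for the 'disk' service on 'checkhost'
curl -k -s -u root:icinga -H 'Accept: application/json' \
  -X POST 'https://localhost:5665/v1/actions/schedule-downtime' \
  -d '{ "type": "Service",
        "filter": "host.name==\"checkhost\" && service.name==\"disk\"",
        "author": "icingaadmin", "comment": "test downtime",
        "start_time": 1456732800, "end_time": 1456736400,
        "fixed": true }'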

@icinga-migration

Updated by mfriedrich on 2016-08-05 13:58:45 +00:00

I tried to reproduce the issue with the snapshot packages, but I am not able to reproduce any notifications upon HA cluster node restart. A similar issue is here: https://dev.icinga.org/issues/11012#note-12

Can you please deploy the current snapshot packages in your environment and check whether your problem is solved?

@icinga-migration

Updated by phsc on 2016-08-29 12:47:40 +00:00

@dnsmichi
At the moment I run my 2-node cluster with icinga2.service disabled on the second node. If I start icinga2.service on the second node, the "active endpoint" switches from the first to the second node. At that moment most of the downtimes go inactive and the downtimed problems appear unhandled. When I compare /var/lib/icinga2/api/packages/_api on both nodes, I can see that the downtimes were not synchronized from the first to the second node. After stopping icinga2.service on the second node, the "active endpoint" switches back to the first node, and after a while the downtimes are in effect again.

I guess as well that it's only an issue in a cluster scenario since I don't encounter any problems with downtimes with only one active node.

@icinga-migration

Updated by mfriedrich on 2016-08-29 13:34:43 +00:00

  • Status changed from Assigned to Feedback
  • Assigned to deleted mfriedrich

Ah. So the culprit is that downtimes are not in sync between the two HA nodes. Anything else that causes trouble (notifications, etc.) is probably just a consequence of this problem. If you re-sync that directory on both nodes, does it work again?
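
Re-syncing could look roughly like this (a sketch; the peer name, the icinga:icinga ownership and stopping the secondary first are assumptions, not a verified procedure):

# on the secondary node: stop the daemon while the files are replaced
systemctl stop icinga2
# on the primary node: push the runtime package to the secondary
rsync -a /var/lib/icinga2/api/packages/_api/ icingahost2:/var/lib/icinga2/api/packages/_api/
# on the secondary node: fix ownership and start again
chown -R icinga:icinga /var/lib/icinga2/api/packages/_api
systemctl start icinga2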

@icinga-migration

Updated by phsc on 2016-08-29 14:52:49 +00:00

OK, I copied the downtime files manually from the first to the second node, adjusted the permissions and started icinga2.service on the second node. After Icinga 2 switched the active endpoint to the second node, I can see that the downtimes for which I manually copied the files to the second host are still in effect. This seems to work.
But other downtimed hosts and services now appear unhandled. So I checked the icinga_scheduleddowntime table in the database and I can see a big difference compared to the downtime files (on both nodes, obviously). The database table contains a lot of scheduled downtimes for which I can't find any downtime file. How exactly does downtime handling work?
What appears weird to me is that the nodes in the cluster handle downtimes differently, but have the same downtime files (after copying them manually) and use the same database (Galera Cluster).
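
One way to see the discrepancy between the runtime files and the IDO table (a sketch; the database name "icinga" is a placeholder):

# downtime objects the core has on disk on this node
find /var/lib/icinga2/api/packages/_api -path '*/conf.d/downtimes/*.conf' | wc -l
# downtimes the IDO currently marks as in effect
mysql icinga -e "SELECT COUNT(*) FROM icinga_scheduleddowntime WHERE is_in_effect = 1;"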

@icinga-migration

Updated by mfriedrich on 2016-11-09 14:55:33 +00:00

  • Parent Id deleted 11311

@icinga-migration

Updated by phsc on 2016-12-23 07:34:05 +00:00

After upgrading all cluster members to 2.6 and manually cleaning up all downtime files and database entries, the problem seems to be solved. Thanks!
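
The cleanup might have looked roughly like this (a sketch only; it assumes both nodes are stopped and that the whole IDO downtime table can be emptied, which may not fit every setup):

# on both nodes, with icinga2 stopped: drop the stale runtime downtime objects
systemctl stop icinga2
find /var/lib/icinga2/api/packages/_api -path '*/conf.d/downtimes/*.conf' -delete
# remove the stale rows from the IDO database
mysql icinga -e "DELETE FROM icinga_scheduleddowntime;"
systemctl start icinga2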

@icinga-migration

Updated by mfriedrich on 2017-01-09 15:06:11 +00:00

  • Status changed from Feedback to Closed

Ok, thanks for the feedback!

Kind regards,
Michael

@icinga-migration icinga-migration added bug Something isn't working area/distributed Distributed monitoring (master, satellites, clients) labels Jan 17, 2017