Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[dev.icinga.com #11387] IDO: historical contact notifications table column notification_id is off-by-one #4035

Closed
icinga-migration opened this issue Mar 15, 2016 · 15 comments
Labels
area/db-ido Database output blocker Blocks a release or needs immediate attention bug Something isn't working
Milestone

Comments

@icinga-migration
Copy link

This issue has been migrated from Redmine: https://dev.icinga.com/issues/11387

Created by elippmann on 2016-03-15 09:41:05 +00:00

Assignee: mfriedrich
Status: Resolved (closed on 2016-03-29 13:15:03 +00:00)
Target Version: 2.4.5
Last Update: 2016-05-06 14:50:39 +00:00 (in Redmine)

Icinga Version: 2.4.3
Backport?: Already backported
Include in Changelog: 1

Hi,

Icinga 2 has pending updates to contactnotifications not written "immediately" into the IDO after a notification has been sent. The contacts of the first notification are always written into the database. But the contacts of the second notification are not written into the database until a third notification has been sent and so on. This can be observed independently for every object.

Best regards,
Eric

Changesets

2016-03-23 16:58:38 +00:00 by mfriedrich df2adb1

IDO: Properly cache notification_id for contactnotifications

1) We cannot "pass" the notification insert id directly
to the following contactnotifications queries. Instead
we need to properly initialize the cache and later fetch
the correct notification_id
2) There's the possibility that 2+ notifications happen
for the same notification. In that case the old notification_id
cache is invalid as it does not respect the timestamp. This patch
adds such a cache.

3) We need to clean unused notification_id entries
from the cache. Otherwise we would leak memory holding references
to the object pointer and timestamp.

refs #11387

2016-03-29 13:12:24 +00:00 by mfriedrich 83e0bcd

IDO: Ensure that the notification insert id is passed to contactnotification inserts

We cannot add the notification insert id directly
to the following contactnotifications queries. Instead
we need to pass a DbValue reference which gets set after
the notification insert id is calculated.
The contactnotifications insert queries are fired as query
group calling CanExecuteQuery() and FieldToEscapedString().
If the notification id does not exist yet, the queries are
moved to the end of the queue.
Once the contactnotifications queries are fired the DbValue
reference is destroyed.

This patch also removes the old notification insert id cache
which was broken in many ways.

fixes #11387

2016-04-20 08:07:24 +00:00 by mfriedrich 0cbedf4

IDO: Ensure that the notification insert id is passed to contactnotification inserts

We cannot add the notification insert id directly
to the following contactnotifications queries. Instead
we need to pass a DbValue reference which gets set after
the notification insert id is calculated.
The contactnotifications insert queries are fired as query
group calling CanExecuteQuery() and FieldToEscapedString().
If the notification id does not exist yet, the queries are
moved to the end of the queue.
Once the contactnotifications queries are fired the DbValue
reference is destroyed.

This patch also removes the old notification insert id cache
which was broken in many ways.

fixes #11387

Relations:

@icinga-migration
Copy link
Author

Updated by elippmann on 2016-03-15 09:41:26 +00:00

https://rt.netways.de/Ticket/Display.html?id=457655

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2016-03-15 09:59:00 +00:00

  • Parent Id set to 11311

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2016-03-18 10:08:00 +00:00

  • Status changed from New to Assigned
  • Assigned to set to mfriedrich

I need to look into that.

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2016-03-18 10:08:49 +00:00

  • Parent Id changed from 11311 to 11412

@icinga-migration
Copy link
Author

Updated by elippmann on 2016-03-21 09:24:03 +00:00

FYI: When restarting Icinga 2 the "queue" is lost.

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2016-03-21 09:56:55 +00:00

  • Relates set to 5103

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2016-03-21 09:57:30 +00:00

  • Relates set to 5265

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2016-03-21 10:40:01 +00:00

Hm, I think I found it. The notification_id is off by one which causes the Web2 query to fail. Once you restart the core the notification_id cache is invalid and it starts again.

MariaDB [icinga]> select * from icinga_notifications order by start_time desc limit 10;
+-----------------+-------------+-------------------+---------------------+-----------+---------------------+-----------------+---------------------+---------------+-------+------------------------------------------------------------------------------------------------------------+-------------+-----------+-------------------+--------------------+
| notification_id | instance_id | notification_type | notification_reason | object_id | start_time          | start_time_usec | end_time            | end_time_usec | state | output                                                                                                     | long_output | escalated | contacts_notified | endpoint_object_id |
+-----------------+-------------+-------------------+---------------------+-----------+---------------------+-----------------+---------------------+---------------+-------+------------------------------------------------------------------------------------------------------------+-------------+-----------+-------------------+--------------------+
|              17 |           1 |                 1 |                   8 |       248 | 2016-03-21 11:37:24 |          569113 | 2016-03-21 11:37:24 |        569113 |     3 | IDO type 'IdoPgsqlConnection' is invalid.                                                                  |             |         0 |                 2 |                  3 |
|              16 |           1 |                 1 |                   8 |       248 | 2016-03-21 11:36:56 |          252660 | 2016-03-21 11:36:56 |        252660 |     3 | IDO type 'IdoPgsqlConnection' is invalid.                                                                  |             |         0 |                 2 |                  3 |
|              15 |           1 |                 1 |                   8 |       248 | 2016-03-21 11:36:41 |          709774 | 2016-03-21 11:36:41 |        709774 |     3 | IDO type 'IdoPgsqlConnection' is invalid.                                                                  |             |         0 |                 2 |                  3 |
|              14 |           1 |                 1 |                   8 |       269 | 2016-03-21 10:50:36 |          246917 | 2016-03-21 10:50:36 |        246917 |     1 | DISK WARNING - free space: / 26227 MB (11% inode=11%):                                                     |             |         0 |                 2 |                  3 |
|              13 |           1 |                 1 |                   8 |       269 | 2016-03-21 10:50:09 |          949886 | 2016-03-21 10:50:09 |        949886 |     1 | DISK WARNING - free space: / 26227 MB (11% inode=11%):                                                     |             |         0 |                 2 |                  3 |
|              12 |           1 |                 1 |                   8 |       269 | 2016-03-21 10:49:42 |          173005 | 2016-03-21 10:49:42 |        173005 |     1 | DISK WARNING - free space: / 26227 MB (11% inode=11%):                                                     |             |         0 |                 2 |                  3 |
|              11 |           1 |                 1 |                   8 |       269 | 2016-03-21 10:48:30 |          557801 | 2016-03-21 10:48:30 |        557801 |     2 | DISK CRITICAL - free space: / 25436 MB (10% inode=10%):                                                    |             |         0 |                 2 |                  3 |
|              10 |           1 |                 1 |                   8 |       269 | 2016-03-21 10:41:39 |          824237 | 2016-03-21 10:41:39 |        824237 |     2 | DISK CRITICAL - free space: / 25316 MB (10% inode=10%):                                                    |             |         0 |                 2 |                  3 |
|               8 |           1 |                 1 |                   0 |       269 | 2016-03-21 10:38:58 |          138411 | 2016-03-21 10:38:58 |        138411 |     2 | DISK CRITICAL - free space: / 25318 MB (10% inode=10%):                                                    |             |         0 |                 2 |                  3 |
|               9 |           1 |                 1 |                   0 |       286 | 2016-03-21 10:38:58 |          146011 | 2016-03-21 10:38:58 |        146011 |     2 | DISK CRITICAL - free space: / 25318 MB (10% inode=10%): /Volumes/Install GitKraken 142 MB (36% inode=36%): |             |         0 |                 2 |                  3 |
+-----------------+-------------+-------------------+---------------------+-----------+---------------------+-----------------+---------------------+---------------+-------+------------------------------------------------------------------------------------------------------------+-------------+-----------+-------------------+--------------------+
10 rows in set (0.00 sec)

MariaDB [icinga]> select * from icinga_contactnotifications order by start_time desc limit 10;
+------------------------+-------------+-----------------+-------------------+---------------------+-----------------+---------------------+---------------+--------------------+
| contactnotification_id | instance_id | notification_id | contact_object_id | start_time          | start_time_usec | end_time            | end_time_usec | endpoint_object_id |
+------------------------+-------------+-----------------+-------------------+---------------------+-----------------+---------------------+---------------+--------------------+
|                     33 |           1 |              16 |               220 | 2016-03-21 11:37:24 |          569113 | 2016-03-21 11:37:24 |        569113 |               NULL |
|                     32 |           1 |              16 |               219 | 2016-03-21 11:37:24 |          569113 | 2016-03-21 11:37:24 |        569113 |               NULL |
|                     31 |           1 |              15 |               220 | 2016-03-21 11:36:56 |          252660 | 2016-03-21 11:36:56 |        252660 |               NULL |
|                     30 |           1 |              15 |               219 | 2016-03-21 11:36:56 |          252660 | 2016-03-21 11:36:56 |        252660 |               NULL |
|                     29 |           1 |              15 |               219 | 2016-03-21 11:36:41 |          709774 | 2016-03-21 11:36:41 |        709774 |               NULL |
|                     28 |           1 |              15 |               220 | 2016-03-21 11:36:41 |          709774 | 2016-03-21 11:36:41 |        709774 |               NULL |
|                     27 |           1 |              14 |               219 | 2016-03-21 10:50:36 |          246917 | 2016-03-21 10:50:36 |        246917 |               NULL |
|                     25 |           1 |              12 |               219 | 2016-03-21 10:50:09 |          949886 | 2016-03-21 10:50:09 |        949886 |               NULL |
|                     26 |           1 |              12 |               220 | 2016-03-21 10:50:09 |          949886 | 2016-03-21 10:50:09 |        949886 |               NULL |
|                     23 |           1 |              11 |               219 | 2016-03-21 10:49:42 |          173005 | 2016-03-21 10:49:42 |        173005 |               NULL |
+------------------------+-------------+-----------------+-------------------+---------------------+-----------------+---------------------+---------------+--------------------+
10 rows in set (0.00 sec)

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2016-03-21 12:30:21 +00:00

At first glance the notification is inserted and the notification_id is properly stored for the queries into contactnotifications, so the relationship in icingaweb2 works.

[2016-03-21 13:24:53 +0100] debug/IdoMysqlConnection: Query: INSERT INTO icinga_externalcommands (command_args, command_name, command_type, endpoint_object_id, entry_time, instance_id) VALUES ('mbmif.int.netways.de;ido-pgsql;2;icingaadmin;1', 'SEND_CUSTOM_SVC_NOTIFICATION', '160', 3, FROM_UNIXTIME(1458563092), 1)
[2016-03-21 13:24:53 +0100] debug/IdoMysqlConnection: Query: UPDATE icinga_servicestatus SET current_notification_number = '668',  last_notification = FROM_UNIXTIME(1458563092),  next_notification = FROM_UNIXTIME(1458554929) WHERE instance_id = 1 AND service_object_id = 248
[2016-03-21 13:24:53 +0100] debug/IdoMysqlConnection: Query: INSERT INTO icinga_notifications (contacts_notified, end_time, end_time_usec, endpoint_object_id, escalated, instance_id, long_output, notification_reason, notification_type, object_id, output, start_time, start_time_usec, state) VALUES ('2', FROM_UNIXTIME(1458563092), '616389', 3, '0', 1, '', '8', '1', 248, 'IDO type \'IdoPgsqlConnection\' is invalid.', FROM_UNIXTIME(1458563092), '616389', '3')
[2016-03-21 13:24:53 +0100] debug/IdoMysqlConnection: Query: INSERT INTO icinga_logentries (endpoint_object_id, entry_time, entry_time_usec, instance_id, logentry_data, logentry_time, logentry_type, object_id) VALUES (3, FROM_UNIXTIME(1458563092), '621438', 1, 'SERVICE NOTIFICATION: icingaadmin;mbmif.int.netways.de;ido-pgsql;CUSTOM (UNKNOWN);ido;IDO type \'IdoPgsqlConnection\' is invalid.;icingaadmin;1', FROM_UNIXTIME(1458563092), '524288', 248)
[2016-03-21 13:24:53 +0100] debug/IdoMysqlConnection: Query: INSERT INTO icinga_logentries (endpoint_object_id, entry_time, entry_time_usec, instance_id, logentry_data, logentry_time, logentry_type, object_id) VALUES (3, FROM_UNIXTIME(1458563092), '624157', 1, 'SERVICE NOTIFICATION: icingaadmin2;mbmif.int.netways.de;ido-pgsql;CUSTOM (UNKNOWN);ido;IDO type \'IdoPgsqlConnection\' is invalid.;icingaadmin;1', FROM_UNIXTIME(1458563092), '524288', 248)
[2016-03-21 13:24:53 +0100] debug/IdoMysqlConnection: Query: COMMIT
[2016-03-21 13:24:53 +0100] debug/IdoMysqlConnection: Query: BEGIN
[2016-03-21 13:24:53 +0100] debug/IdoMysqlConnection: saving contactnotification notification_id=18
[2016-03-21 13:24:54 +0100] debug/IdoMysqlConnection: Query: INSERT INTO icinga_contactnotifications (contact_object_id, end_time, end_time_usec, instance_id, notification_id, start_time, start_time_usec) VALUES (220, FROM_UNIXTIME(1458563092), '616389', 1, 18, FROM_UNIXTIME(1458563092), '616389')
[2016-03-21 13:24:54 +0100] debug/IdoMysqlConnection: Query: INSERT INTO icinga_contactnotifications (contact_object_id, end_time, end_time_usec, instance_id, notification_id, start_time, start_time_usec) VALUES (219, FROM_UNIXTIME(1458563092), '616389', 1, 18, FROM_UNIXTIME(1458563092), '616389')

This time the notification_id is incremented and stored after the contactnotifications are inserted. That way the notification does not have any related contactnotifications.

[2016-03-21 13:25:54 +0100] debug/IdoMysqlConnection: Query: INSERT INTO icinga_externalcommands (command_args, command_name, command_type, endpoint_object_id, entry_time, instance_id) VALUES ('mbmif.int.netways.de;ido-pgsql;2;icingaadmin;2', 'SEND_CUSTOM_SVC_NOTIFICATION', '160', 3, FROM_UNIXTIME(1458563154), 1)
[2016-03-21 13:25:54 +0100] debug/IdoMysqlConnection: Query: UPDATE icinga_servicestatus SET current_notification_number = '670',  last_notification = FROM_UNIXTIME(1458563154),  next_notification = FROM_UNIXTIME(1458554929) WHERE instance_id = 1 AND service_object_id = 248
[2016-03-21 13:25:54 +0100] debug/IdoMysqlConnection: Query: INSERT INTO icinga_notifications (contacts_notified, end_time, end_time_usec, endpoint_object_id, escalated, instance_id, long_output, notification_reason, notification_type, object_id, output, start_time, start_time_usec, state) VALUES ('2', FROM_UNIXTIME(1458563154), '126420', 3, '0', 1, '', '8', '1', 248, 'IDO type \'IdoPgsqlConnection\' is invalid.', FROM_UNIXTIME(1458563154), '126420', '3')
[2016-03-21 13:25:54 +0100] debug/IdoMysqlConnection: Query: INSERT INTO icinga_contactnotifications (contact_object_id, end_time, end_time_usec, instance_id, notification_id, start_time, start_time_usec) VALUES (220, FROM_UNIXTIME(1458563154), '126420', 1, 18, FROM_UNIXTIME(1458563154), '126420')
[2016-03-21 13:25:54 +0100] debug/IdoMysqlConnection: Query: INSERT INTO icinga_contactnotifications (contact_object_id, end_time, end_time_usec, instance_id, notification_id, start_time, start_time_usec) VALUES (219, FROM_UNIXTIME(1458563154), '126420', 1, 18, FROM_UNIXTIME(1458563154), '126420')
[2016-03-21 13:25:54 +0100] debug/IdoMysqlConnection: Query: INSERT INTO icinga_logentries (endpoint_object_id, entry_time, entry_time_usec, instance_id, logentry_data, logentry_time, logentry_type, object_id) VALUES (3, FROM_UNIXTIME(1458563154), '129637', 1, 'SERVICE NOTIFICATION: icingaadmin2;mbmif.int.netways.de;ido-pgsql;CUSTOM (UNKNOWN);ido;IDO type \'IdoPgsqlConnection\' is invalid.;icingaadmin;2', FROM_UNIXTIME(1458563154), '524288', 248)
[2016-03-21 13:25:54 +0100] debug/IdoMysqlConnection: Query: INSERT INTO icinga_logentries (endpoint_object_id, entry_time, entry_time_usec, instance_id, logentry_data, logentry_time, logentry_type, object_id) VALUES (3, FROM_UNIXTIME(1458563154), '129802', 1, 'SERVICE NOTIFICATION: icingaadmin;mbmif.int.netways.de;ido-pgsql;CUSTOM (UNKNOWN);ido;IDO type \'IdoPgsqlConnection\' is invalid.;icingaadmin;2', FROM_UNIXTIME(1458563154), '524288', 248)
[2016-03-21 13:25:54 +0100] debug/IdoMysqlConnection: Query: COMMIT
[2016-03-21 13:25:54 +0100] debug/IdoMysqlConnection: Query: BEGIN
[2016-03-21 13:25:54 +0100] debug/IdoMysqlConnection: saving contactnotification notification_id=19

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2016-03-21 19:24:41 +00:00

The culprit is that we do not have any notification object similar to hosts, host groups where we could easily store and fetch the insert id in a "common" way. The previous implementation in #5103 and #5265 worked fine as long as the queries are executed one after the other. The recent changes for the IdoMysqlConnection driver using client multi statements does not necessarily set the notification_id after the notifications table was populated (and leaves the old insert id in place).

Having the old notification_id when inserting the contact notifications explains the current buggy behaviour. The fix is rather ugly - invalidate the cache for the current notification object and its insert id before the notifications insert query is run, and update it afterwards. This happens inside the whole ExecuteQuery "transaction". That way any other query for contact notifications depending on the notification_id will be added to the end of the query queue. One problem which could arise are two notifications fired at the same time for the same notification object.

 # echo "[1458572444] SEND_CUSTOM_SVC_NOTIFICATION;mbmif.int.netways.de;ido-pgsql;2;icingaadmin;dfsdfsdsffds" > /usr/local/icinga2/var/run/icinga2/cmd/icinga2.cmd ; echo "[1458572444] SEND_CUSTOM_SVC_NOTIFICATION;mbmif.int.netways.de;ido-pgsql;2;icingaadmin;zzzzzzzz" > /usr/local/icinga2/var/run/icinga2/cmd/icinga2.cmd

Again breaks the queries.

So in addition to fixing and invalidating the cache we need a way to determine for one notification with a different historical timeline and multiple notifications happening in a short amount of time.

I found a way to pass the current timestamp ("now" as double) next to the notification object to the query which then is used as unique key pair inside the m_NotificationInsertID map. In order to resolve the value later (and to check whether it is valid or not) I've also added a new DbValue which is then used in FieldToEscapedString().

The culprit lies in the fact the FieldToEscapedString only accepts the Value type, so we cannot pass an std::pair here for the notification object and timestamp. The ugly workaround is to use a Dictionary which is then used to select the accurate notification insert id.

One thing I need to check with that cache - we may leak memory over time as the m_NotificationInsertID list grows. The m_ObjectInsertID list does not grow historically, compared to the notification id being an historical event. It probably makes sense to clear the list in a timer and remove timestamps older than X seconds.

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2016-03-22 15:23:30 +00:00

  • Subject changed from IDO: contactnotifications have pending updates per object to IDO: historical contact notifications table column notification_id is off-by-one
  • Priority changed from Normal to High
  • Target Version set to 2.4.5

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2016-03-29 13:15:04 +00:00

  • Status changed from Assigned to Resolved
  • Done % changed from 0 to 100

Applied in changeset 83e0bcd.

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2016-04-06 16:53:52 +00:00

  • Duplicated set to 11537

@icinga-migration
Copy link
Author

Updated by gbeutner on 2016-04-20 08:15:59 +00:00

  • Backport? changed from Not yet backported to Already backported

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2016-05-06 14:50:40 +00:00

  • Parent Id deleted 11412

@icinga-migration icinga-migration added blocker Blocks a release or needs immediate attention bug Something isn't working area/db-ido Database output labels Jan 17, 2017
@icinga-migration icinga-migration added this to the 2.4.5 milestone Jan 17, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/db-ido Database output blocker Blocks a release or needs immediate attention bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant