
[dev.icinga.com #8714] Add priority queue for disconnect/programstatus update events #2746

Closed
icinga-migration opened this issue Mar 12, 2015 · 55 comments
Labels: area/db-ido (Database output), blocker (Blocks a release or needs immediate attention), bug (Something isn't working)
Milestone: 2.4.2

Comments

This issue has been migrated from Redmine: https://dev.icinga.com/issues/8714

Created by TheSerapher on 2015-03-12 13:05:24 +00:00

Assignee: gbeutner
Status: Resolved (closed on 2015-12-10 16:10:03 +00:00)
Target Version: 2.4.2
Last Update: 2016-02-23 09:58:14 +00:00 (in Redmine)

Icinga Version: 2.3.0
Backport?: Already backported
Include in Changelog: 1

The master is not updating the IDO table and setting itself as running in time (which may explain why we can't run two masters at the same time):

Process Info

Program Start Time  12/03/2015 1:35 PM
Last Status Update  3m 14s ago
Last External Command Check 3m 14s ago
Last Log File Rotation  N/A
Global Service Event Handler    N/A
Global Host Event Handler   N/A
icinga is not running

Interestingly, the Last Status Update time sometimes jumps back down, but only by a few seconds, as if the last update did happen but with a large delay. Running the query confirms that the time is off by about 2.5 minutes:

mysql> select * from icinga_programstatus \G select now();
*************************** 1. row ***************************
              programstatus_id: 51516
                   instance_id: 1
               program_version: r2.3.0-1
            status_update_time: 2015-03-12 14:00:03
            program_start_time: 2015-03-12 13:49:52
              program_end_time: 0000-00-00 00:00:00
                 endpoint_name: icinga2-master-1.monitor.preview.fra1.xing.com
          is_currently_running: 1
                    process_id: 45334
                   daemon_mode: 1
            last_command_check: 2015-03-12 14:00:03
             last_log_rotation: 0000-00-00 00:00:00
         notifications_enabled: 1
     disable_notif_expire_time: 0000-00-00 00:00:00
 active_service_checks_enabled: 1
passive_service_checks_enabled: 1
    active_host_checks_enabled: 1
   passive_host_checks_enabled: 1
        event_handlers_enabled: 1
        flap_detection_enabled: 1
    failure_prediction_enabled: 0
      process_performance_data: 1
             obsess_over_hosts: 0
          obsess_over_services: 0
      modified_host_attributes: 0
   modified_service_attributes: 0
     global_host_event_handler: NULL
  global_service_event_handler: NULL
       config_dump_in_progress: 0
            endpoint_object_id: NULL
1 row in set (0.01 sec)

+---------------------+
| now()               |
+---------------------+
| 2015-03-12 14:02:39 |
+---------------------+
1 row in set (0.01 sec)

What would be the cause? Is the master not processing the SQL queue fast enough, so that updates are inserted with a delay? We have about 24,000 service checks on 1,500 hosts, executed by 4 checkers. Maybe this causes issues?

I was able to confirm that the database has no significant load or any system issues.
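
For reference, the lag can also be measured in a single query instead of comparing two result sets by hand. This is only a sketch against the icinga_programstatus columns shown above (MySQL syntax):

-- Minimal sketch, assuming the schema shown above and instance_id = 1:
SELECT TIMESTAMPDIFF(SECOND, status_update_time, NOW()) AS status_update_lag_seconds,
       TIMESTAMPDIFF(SECOND, last_command_check, NOW()) AS command_check_lag_seconds
  FROM icinga_programstatus
 WHERE instance_id = 1;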

Changesets

2015-12-10 16:06:00 +00:00 by (unknown) 15ca998

Implement support for priorities in the WorkQueue class

fixes #8714

2015-12-14 09:34:12 +00:00 by (unknown) 372cf07

Clean up the FinishAsyncQueries code a bit

refs #8714

2015-12-15 10:58:50 +00:00 by mfriedrich da3d210

Fix priority ordering for IDO queries

fixes #10829
refs #8714

2016-02-23 08:09:06 +00:00 by (unknown) a40fc65

Implement support for priorities in the WorkQueue class

fixes #8714

2016-02-23 08:09:06 +00:00 by (unknown) 02184ad

Clean up the FinishAsyncQueries code a bit

refs #8714

2016-02-23 08:09:06 +00:00 by mfriedrich 2bc1d32

Fix priority ordering for IDO queries

fixes #10829
refs #8714

Relations:

Updated by TheSerapher on 2015-03-12 13:07:28 +00:00

This is just a guess, but it could explain the deadlocks and lock timeouts we see when using IDO DB with two masters running. If the real master is not updating its status properly, the secondary master (which uses the status table to decide whether the actual master is active or not) may enable its IDO connection because the active master has not reported itself as active in time?

Updated by TheSerapher on 2015-03-12 13:12:12 +00:00

As I wrote that, the master finally "caught up" and updated its status:

Process Info

Program Start Time  12/03/2015 1:49 PM
Last Status Update  0m 13s ago
Last External Command Check 0m 13s ago
Last Log File Rotation  N/A
Global Service Event Handler    N/A
Global Host Event Handler   N/A
icinga has been up and running with PID 45334 since 21m 54s

Updated by TheSerapher on 2015-03-12 15:57:02 +00:00

I did some strace on the server to see if it is inserting into that table. It is, but the timestamps being inserted are already way off:

[pid 30814] write(15, "7\0\0\0\3DELETE FROM icinga_programstatus WHERE instance_id = 1", 59) = 59
[pid 30814] write(15, "h\2\0\0\3INSERT INTO icinga_programstatus (active_host_checks_enabled, active_service_checks_enabled, daemon_mode, endpoint_name, event_handlers_enabled, flap_detection_enabled, instance_id, is_currently_running, last_command_check, notifications_enabled, passive_host_checks_enabled, passive_service_checks_enabled, process_id, process_performance_data, program_start_time, program_version, status_update_time) VALUES ('1', '1', '1', 'icinga2-master-1.monitor.dc1', '1', '1', 1, '1', FROM_UNIXTIME(1426175471), '1', '1', '1', '30798', '1', FROM_UNIXTIME(1426175130), 'r2.3.1-1', FROM_UNIXTIME(1426175471))", 620 

Checking against the table, the FROM_UNIXTIME matches the inserted record:

mysql> select * from icinga2_preview_ido.icinga_programstatus \G select now(); select FROM_UNIXTIME(1426175471);
*************************** 1. row ***************************
              programstatus_id: 52572
                   instance_id: 1
               program_version: r2.3.1-1
            status_update_time: 2015-03-12 16:52:41
            program_start_time: 2015-03-12 16:45:30
              program_end_time: 0000-00-00 00:00:00
                 endpoint_name: icinga2-master-1.monitor.dc1
          is_currently_running: 1
                    process_id: 30798
                   daemon_mode: 1
            last_command_check: 2015-03-12 16:52:41
             last_log_rotation: 0000-00-00 00:00:00
         notifications_enabled: 1
     disable_notif_expire_time: 0000-00-00 00:00:00
 active_service_checks_enabled: 1
passive_service_checks_enabled: 1
    active_host_checks_enabled: 1
   passive_host_checks_enabled: 1
        event_handlers_enabled: 1
        flap_detection_enabled: 1
    failure_prediction_enabled: 0
      process_performance_data: 1
             obsess_over_hosts: 0
          obsess_over_services: 0
      modified_host_attributes: 0
   modified_service_attributes: 0
     global_host_event_handler: NULL
  global_service_event_handler: NULL
       config_dump_in_progress: 0
            endpoint_object_id: NULL
1 row in set (0.01 sec)

+---------------------+
| now()               |
+---------------------+
| 2015-03-12 16:55:56 |
+---------------------+
1 row in set (0.01 sec)

+---------------------------+
| FROM_UNIXTIME(1426175471) |
+---------------------------+
| 2015-03-12 16:51:11       |
+---------------------------+
1 row in set (0.01 sec)

This seems to point at a full SQL queue in the master process. How can I confirm this or check the master's existing queues for outstanding queries? Is there a way to improve this?

Updated by mfriedrich on 2015-03-12 16:06:51 +00:00

The `icinga` check provides performance data for queued ido items inside the feature objects. Might be a good idea to check on that.

Updated by TheSerapher on 2015-03-12 16:25:32 +00:00

dnsmichi wrote:

The `icinga` check provides performance data for queued ido items inside the feature objects. Might be a good idea to check on that.

Thanks, I will certainly check that out tomorrow. Hopefully this will give some useful information.

Updated by TheSerapher on 2015-03-13 08:18:18 +00:00

Any chance you have some documentation for us to read? We are unsure how to add this check.

Scratch that, Grep FTW :-)

Updated by mfriedrich on 2015-03-13 08:36:33 +00:00

There's a search on docs.icinga.org as well now.

http://docs.icinga.org/icinga2/latest/doc/module/icinga2/toc#!/icinga2/latest/doc/module/icinga2/chapter/icinga-template-library#itl-icinga

Updated by TheSerapher on 2015-03-13 08:51:55 +00:00

So we enabled this module, but it doesn't look like the output tells us how long the IDO SQL queue is:

Performance data    
Label   Value   Min Max Warning Critical
active_host_checks  5.75    -   -   -   -
active_host_checks_1min 48.00   -   -   -   -
active_host_checks_5min 48.00   -   -   -   -
active_host_checks_15min    48.00   -   -   -   -
active_service_checks   29.94   -   -   -   -
active_service_checks_1min  250.00  -   -   -   -
active_service_checks_5min  250.00  -   -   -   -
active_service_checks_15min 250.00  -   -   -   -
max_latency 13.09   -   -   -   -
avg_latency 0.01    -   -   -   -
max_execution_time  13.09   -   -   -   -
avg_execution_time  0.77    -   -   -   -
num_services_ok 22,857.00   -   -   -   -
num_services_warning    13.00   -   -   -   -
num_services_critical   281.00  -   -   -   -
num_services_unknown    1,814.00    -   -   -   -
num_services_pending    1,538.00    -   -   -   -
num_services_flapping   12.00   -   -   -   -
uptime  8.50    -   -   -   -
num_hosts_up    1,910.00    -   -   -   -
num_hosts_down  12.00   -   -   -   -

At the same time, we have now reduced the number of checkers. The idea behind this: when checkers start, we see high load and (we assume) all service checks being triggered (see https://dev.icinga.org/issues/8670), whose results are sent to the master for processing and insertion into the DB. To reduce this initial load on the master, we shut down the entire cluster, started only the master and added two hardware checkers to it. As expected, those triggered the service checks and ran under higher load, but this time the master was able to keep up, and since then status updates have been fine. Overnight we had a 3 - 4 minute gap between the actual time and the master's updates. Looks okay now:

Process Info

Program Start Time  13/03/2015 9:35 AM
Last Status Update  0m 4s ago
Last External Command Check 0m 4s ago
Last Log File Rotation  N/A
Global Service Event Handler    N/A
Global Host Event Handler   N/A
icinga has been up and running with PID 45551 since 

I kept reloading the status information to see if any delay was adding up, but it looked okay so far. At one point it went up to about 20s before refreshing back down to 5s, but then it caught up completely again.

After re-adding one additional checker and checking the status updates, we are seeing the same issues again:

Process Info

Program Start Time  13/03/2015 9:35 AM
Last Status Update  1m 19s ago
Last External Command Check 1m 19s ago
Last Log File Rotation  N/A
Global Service Event Handler    N/A
Global Host Event Handler   N/A
icinga is not running

This is a smaller VM than our hardware but this should not affect the status update intervals on the master.

I am not sure what else I can do to help here, but it's pretty apparent that something isn't working right with many checker nodes active. In this state we can't switch our monitoring to Icinga 2 until we see proper performance from our masters.

EDIT: Three checkers seem to be the magic threshold. Once we add the third checker, the master's status updates start lagging and get worse. Removing the checker doesn't recover the status updates, but at least they stop getting worse. Restarting the master does not fix this - maybe because the still-running checkers post their check results once it's back online.

EDIT2: Started the master node with the debug log enabled; it seems that a master inserts the entire system state into the IDO DB when starting up. No checkers have been turned on, but we see tons of queries (INSERT/UPDATE):

[2015-03-13 10:13:09 +0100] debug/IdoMysqlConnection: Query: INSERT INTO icinga_conninfo (instance_id, connect_time, last_checkin_time, agent_name, agent_version, connect_type, data_start_time) VALUES (1, NOW(), NOW(), 'icinga2 db_ido_mysql', 'r2.3.1-1', 'INITIAL', NOW())
[...]
[2015-03-13 10:13:09 +0100] debug/IdoMysqlConnection: Query: DELETE FROM icinga_contact_addresses WHERE instance_id = 1
[...]
[2015-03-13 10:13:09 +0100] debug/IdoMysqlConnection: Query: DELETE FROM icinga_contact_notificationcommands WHERE instance_id = 1
[...]
[2015-03-13 10:13:10 +0100] debug/IdoMysqlConnection: Query: INSERT INTO icinga_endpoints (config_type, endpoint_object_id, identity, instance_id, node) VALUES ('1', 4, 'icinga2-checker-1.monitor.preview.dc1', 1, 'icinga2-master-1.monitor.preview.dc1')
[...]
[2015-03-13 10:14:40 +0100] debug/IdoMysqlConnection: Query: INSERT INTO icinga_customvariables (config_type, has_been_modified, instance_id, is_json, object_id, varname, varvalue) VALUES ('1', '0', 1, '0', 4934, 'critical', '800')
[2015-03-13 10:14:40 +0100] debug/DbObject: object customvar key: 'impact' value: 'high' overridden: 0
[2015-03-13 10:14:40 +0100] debug/IdoMysqlConnection: Query: INSERT INTO icinga_customvariables (config_type, has_been_modified, instance_id, is_json, object_id, varname, varvalue) VALUES ('1', '0', 1, '0', 4934, 'impact', 'high')
[2015-03-13 10:14:40 +0100] debug/DbObject: object customvar key: 'parameter' value: 'jvm:threads:count' overridden: 0
[2015-03-13 10:14:40 +0100] debug/IdoMysqlConnection: Query: INSERT INTO icinga_customvariables (config_type, has_been_modified, instance_id, is_json, object_id, varname, varvalue) VALUES ('1', '0', 1, '0', 4934, 'parameter', 'jvm:threads:count')
[2015-03-13 10:14:40 +0100] debug/DbObject: object customvar key: 'warning' value: '500' overridden: 0

After the startup finally completes (which takes a while with that many hosts and services) the master is running and starts updating the status table with correct timestamps:

[2015-03-13 10:18:04 +0100] debug/IdoMysqlConnection: Query: DELETE FROM icinga_programstatus WHERE instance_id = 1
[2015-03-13 10:18:04 +0100] notice/WorkQueue: #3 tasks: 2
[2015-03-13 10:18:04 +0100] debug/IdoMysqlConnection: Query: INSERT INTO icinga_programstatus (active_host_checks_enabled, active_service_checks_enabled, daemon_mode, endpoint_name, event_handlers_enabled, flap_detection_enabled, instance_id, is_currently_running, last_command_check, notifications_enabled, passive_host_checks_enabled, passive_service_checks_enabled, process_id, process_performance_data, program_start_time, program_version, status_update_time) VALUES ('1', '1', '1', 'icinga2-master-1.monitor.dc1', '1', '1', 1, '1', FROM_UNIXTIME(1426238284), '1', '1', '1', '8708', '1', FROM_UNIXTIME(1426237983), 'r2.3.1-1', FROM_UNIXTIME(1426238284))

Adding a single checker confirms what I suspected: it starts triggering the plugins, increasing system load and sending all results to the master. This again takes quite some time, and during that phase we see no updates to the status table, which also causes the delays in the interface. Once the checker completes this initial boot phase, the master updates the status table again as expected.

Since the active master is not updating the status, the passive one assumes it's down and would take over? We have disabled the secondary master since we had issues with that as soon as we added our service checks.

Updated by gbeutner on 2015-03-13 11:16:03 +00:00

This might not solve your problem but maybe it makes diagnosing this a bit easier: The latest snapshots have a new check for this:

Graphite:

Config:

root@mon:~# cat /etc/icinga2/features-enabled/ido-mysql.conf
/**
 * The db_ido_mysql library implements IDO functionality
 * for MySQL.
 */

library "db_ido_mysql"

object IdoMysqlConnection "ido-mysql" {
  user = "icinga"
  password = "none-of-your-beeswax"
  host = "127.0.0.1"
  database = "icinga"
}

object Service "ido-mysql" {
  import "generic-service"

  host_name = NodeName

  check_command = "ido"
  vars.ido_type = "IdoMysqlConnection"
  vars.ido_name = "ido-mysql"
}
root@mon:~#

Updated by TheSerapher on 2015-03-13 11:19:39 +00:00

Excellent, we will look at this in more detail starting Monday. So far, one odd thing is that the master stops doing any updates as soon as a client connects via the API and pumps in its initial data. I am sure that's not intended, since it would cause a secondary master to take over even though there is no reason to do so.

Updated by gbeutner on 2015-03-13 12:00:54 +00:00

As far as I can see we're updating the icinga_customvariablestatus table even though the custom variables haven't been changed. This seems to cause an insane amount of redundant queries.
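
To get a feeling for the scale of that, the number of custom variable status rows can be counted directly. This is a sketch only, with column names assumed from the IDO schema rather than taken from this thread:

-- Sketch: upper bound of rows rewritten on every status dump
-- (object_id/instance_id columns assumed from the IDO schema).
SELECT COUNT(*) AS customvar_status_rows,
       COUNT(DISTINCT object_id) AS objects_with_customvars
  FROM icinga_customvariablestatus
 WHERE instance_id = 1;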

Updated by TheSerapher on 2015-03-13 12:40:51 +00:00

gunnarbeutner wrote:

As far as I can see we're updating the icinga_customvariablestatus table even though the custom variables haven't been changed. This seems to cause an insane amount of redundant queries.

That's a great start. Granted, our master may just not be powerful enough, but looking at the load on it ( 13:29:02 up 7 days, 5:19, 1 user, load average: 0.22, 0.20, 0.22 ) and CPU usage (iostat: 93.07% IDLE) this should really not be an issue. Just for fun I will run the master in debug mode and check the queries again.

Updated by gbeutner on 2015-03-13 12:50:24 +00:00

  • Relates set to 8736

Updated by TheSerapher on 2015-03-13 13:04:59 +00:00

Here is my quick and dirty run over a debug log covering about 4 minutes 30 seconds of master startup:

# gawk -f queries.awk /var/log/icinga2/debug.log
Updates : 301281
Inserts : 182243
Deletes : 357
Tables  : 36

icinga_hostgroups accessed 484 times
icinga_scheduleddowntime accessed 1 times
icinga_customvariablestatus accessed 260664 times
icinga_programstatus accessed 18 times
icinga_contacts accessed 1 times
icinga_host_contacts accessed 1 times
icinga_hostgroup_members accessed 19341 times
icinga_endpoints accessed 15 times
icinga_contact_notificationcommands accessed 1 times
icinga_host_contactgroups accessed 1 times
icinga_services accessed 24965 times
icinga_timeperiods accessed 3 times
icinga_contactgroup_members accessed 319 times
icinga_logentries accessed 24 times
icinga_contactstatus accessed 1 times
icinga_servicestatus accessed 83318 times
icinga_commands accessed 120 times
icinga_conninfo accessed 1 times
icinga_objects accessed 4 times
icinga_timeperiod_timeranges accessed 16 times
icinga_hostdependencies accessed 1 times
icinga_host_parenthosts accessed 1 times
icinga_customvariables accessed 80429 times
icinga_comments accessed 1 times
icinga_hoststatus accessed 9897 times
icinga_runtimevariables accessed 45 times
icinga_contactgroups accessed 317 times
icinga_hosts accessed 1922 times
icinga_servicegroups accessed 3 times
icinga_endpointstatus accessed 17 times
icinga_contact_addresses accessed 1 times
icinga_statehistory accessed 19 times
icinga_servicegroup_members accessed 1927 times
icinga_servicedependencies accessed 1 times
icinga_service_contacts accessed 1 times
icinga_service_contactgroups accessed 1 times

And the same filter with the master having been active for 10 minutes before gathering another 4 minutes 30 seconds of data:

# gawk -f queries.awk /var/log/icinga2/debug.log
Updates : 384349
Inserts : 28984
Deletes : 373
Tables  : 15

icinga_customvariablestatus accessed 257931 times
icinga_programstatus accessed 56 times
icinga_contacts accessed 1 times
icinga_contactgroup_members accessed 318 times
icinga_services accessed 6963 times
icinga_logentries accessed 42 times
icinga_contactstatus accessed 1 times
icinga_servicestatus accessed 115864 times
icinga_objects accessed 2 times
icinga_customvariables accessed 14381 times
icinga_hoststatus accessed 17651 times
icinga_runtimevariables accessed 140 times
icinga_endpointstatus accessed 2 times
icinga_contactgroups accessed 317 times
icinga_statehistory accessed 37 times

Maybe this helps too :-)
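
As a cross-check on the database side, similar counters can be read from MySQL itself; the values are cumulative since server start, so two samples need to be diffed (sketch only):

-- Sketch: MySQL's own statement counters as a cross-check for the awk totals.
SHOW GLOBAL STATUS LIKE 'Com_insert';
SHOW GLOBAL STATUS LIKE 'Com_update';
SHOW GLOBAL STATUS LIKE 'Com_delete';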

Updated by gbeutner on 2015-03-13 13:11:06 +00:00

I've triggered a manual rebuild for the snapshot packages (https://build.icinga.org/jenkins/job/icinga2-rpm-snapshot-packages/). Can you re-test this problem with the latest snapshots (once the build is done)?

Updated by TheSerapher on 2015-03-13 13:14:23 +00:00

gunnarbeutner wrote:

I've triggered a manual rebuild for the snapshot packages (https://build.icinga.org/jenkins/job/icinga2-rpm-snapshot-packages/). Can you re-test this problem with the latest snapshots (once the build is done)?

Running on Ubuntu 14.04 :/-

Found them, will test!

Updated by TheSerapher on 2015-03-13 13:40:49 +00:00

Startup phase prior to the first update to icinga_programstatus. The time from the first SELECT on the table until the first write was about 3.5 minutes:

Updates : 146124
Inserts : 182189
Deletes : 341
Tables  : 36

icinga_hostgroups accessed 484 times
icinga_scheduleddowntime accessed 1 times
icinga_customvariablestatus accessed 160854 times
icinga_programstatus accessed 2 times
icinga_contacts accessed 1 times
icinga_host_contacts accessed 1 times
icinga_hostgroup_members accessed 19342 times
icinga_endpoints accessed 15 times
icinga_contact_notificationcommands accessed 1 times
icinga_host_contactgroups accessed 1 times
icinga_services accessed 24965 times
icinga_timeperiods accessed 3 times
icinga_contactgroup_members accessed 319 times
icinga_logentries accessed 20 times
icinga_contactstatus accessed 1 times
icinga_servicestatus accessed 34179 times
icinga_commands accessed 121 times
icinga_conninfo accessed 1 times
icinga_objects accessed 4 times
icinga_timeperiod_timeranges accessed 16 times
icinga_hostdependencies accessed 1 times
icinga_host_parenthosts accessed 1 times
icinga_customvariables accessed 80427 times
icinga_comments accessed 1 times
icinga_hoststatus accessed 3686 times
icinga_runtimevariables accessed 5 times
icinga_contactgroups accessed 317 times
icinga_hosts accessed 1922 times
icinga_servicegroups accessed 3 times
icinga_endpointstatus accessed 17 times
icinga_contact_addresses accessed 1 times
icinga_statehistory accessed 12 times
icinga_servicegroup_members accessed 1927 times
icinga_servicedependencies accessed 1 times
icinga_service_contacts accessed 1 times
icinga_service_contactgroups accessed 1 times

Runtime information about 4 minutes after first status update:

Updates : 185111
Inserts : 265
Deletes : 80
Tables  : 6

icinga_programstatus accessed 80 times
icinga_logentries accessed 36 times
icinga_servicestatus accessed 157936 times
icinga_hoststatus accessed 27175 times
icinga_runtimevariables accessed 200 times
icinga_statehistory accessed 29 times

Looks way, way better! Nice work :-)

Now back to the initial startup issues. I would expect a master to do status updates to the program status table no matter what else is going on. In the end, that is what determines whether a secondary master will take over or not. If we don't put any status updates into the database until the initial startup is completed, due to wiping and re-filling the tables, another master that may already be running could take over, leading to lock waits or deadlocks?

Updated by mfriedrich on 2015-03-13 13:52:38 +00:00

  • Category set to DB IDO

Updated by mfriedrich on 2015-03-13 13:58:34 +00:00

TheSerapher wrote:

Now back to the initial startup issues. I would expect a master to do status updates to the program status table no matter what else is going on. In the end, that is what determines whether a secondary master will take over or not. If we don't put any status updates into the database until the initial startup is completed, due to wiping and re-filling the tables, another master that may already be running could take over, leading to lock waits or deadlocks?

In Icinga 1.x we had the 'config_dump_in_progress' column as a workaround to signal the interfaces that something is blocking the program status updates. Your description sounds familiar in terms of the query queue blocking the program status updates.

Since we rely on the program status update time in DB IDO HA mode, this becomes a critical problem - you're absolutely right about that.
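
To illustrate that reliance on the database side, here is a sketch of the kind of staleness check an HA peer (or an interface) could run against the programstatus columns shown above; the 60-second threshold is only an assumption for illustration, not the actual failover timeout implementation:

-- Sketch only; columns as in the icinga_programstatus dumps above, threshold assumed.
SELECT is_currently_running,
       config_dump_in_progress,
       (status_update_time < NOW() - INTERVAL 60 SECOND) AS status_is_stale
  FROM icinga_programstatus
 WHERE instance_id = 1;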

The program status updates should probably be sent through a different connection (or a priority queue) so that they always reach the database in time. Not sure about the right implementation though.

Updated by TheSerapher on 2015-03-13 14:04:57 +00:00

I can't get the IDO plugin to work:

critical/config: Error: Invalid value: Object 'ido-mysql' (Type: 'Service') at /var/lib/icinga2/api/zones/checker/services/icinga2.conf:43 -> Key check_command: Object 'ido' of type 'CheckCommand' does not exist.

# icinga2 -V
icinga2 - The Icinga 2 network monitoring daemon (version: v2.3.0-30-g100eb0c)

##### Master
# grep ido /etc/icinga2/zones.d/checker/services/icinga2.conf
apply Service "ido-mysql" {
  check_command = "ido

# service icinga2 checkconfig
 * checking Icinga2 configuration                                                                                             [ OK ]

# icinga2 -V
icinga2 - The Icinga 2 network monitoring daemon (version: v2.3.0-30-g100eb0c)

EDIT: I was being silly again and tried to run this on a checker, so I moved it to the master, but it's in state Pending. Will try to enable the checker feature on our master and hope it gets checked properly then.

Updated by TheSerapher on 2015-03-13 14:30:52 +00:00

Here is the result from the check command, taken after the server restart as soon as the first status update had completed:

Performance data    
Label   Value   Min Max Warning Critical
queries 369.28  -   -   -   -
queries_1min    22,157.00   -   -   -   -
queries_5mins   22,157.00   -   -   -   -
queries_15mins  22,157.00   -   -   -   -
pending_queries 51,409.00   -   -   -   -

51k outstanding queries :-/

Updated by TheSerapher on 2015-03-13 14:35:06 +00:00

And here just a few intervals later:

Performance data    
Label   Value   Min Max Warning Critical
queries 1,615.63    -   -   -   -
queries_1min    96,938.00   -   -   -   -
queries_5mins   288,059.00  -   -   -   -
queries_15mins  288,059.00  -   -   -   -
pending_queries 154,201.00  -   -   -   -

Performance data    
Label   Value   Min Max Warning Critical
queries 1,346.37    -   -   -   -
queries_1min    80,782.00   -   -   -   -
queries_5mins   370,279.00  -   -   -   -
queries_15mins  370,279.00  -   -   -   -
pending_queries 113,254.00  -   -   -   -

Performance data    
Label   Value   Min Max Warning Critical
queries 1,216.58    -   -   -   -
queries_1min    72,995.00   -   -   -   -
queries_5mins   420,917.00  -   -   -   -
queries_15mins  444,210.00  -   -   -   -
pending_queries 36,829.00   -   -   -   -

Performance data    
Label   Value   Min Max Warning Critical
queries 762.93  -   -   -   -
queries_1min    45,776.00   -   -   -   -
queries_5mins   383,830.00  -   -   -   -
queries_15mins  491,088.00  -   -   -   -
pending_queries 1.00    -   -   -   -

Process Info

Program Start Time  13/03/2015 3:25 PM
Last Status Update  0m 6s ago
Last External Command Check 0m 6s ago
Last Log File Rotation  N/A
Global Service Event Handler    N/A
Global Host Event Handler   N/A
icinga has been up and running with PID 29712 since 

So it is safe to say that a server/cluster restart will cause issues when it comes to status updates.

Is there any way to run these status updates outside the regular SQL queue, so that they keep running all the time as soon as the IDO connection is established? That would help with active/passive master setups. And what about the service checks on each node that are sent to the master after reloads and restarts? Could these maybe be skipped in a clustered scenario until a master asks a checker node for an active check?

Just throwing in ideas here. We really want this setup to work since it's far superior to any other setup we had so far. Ironing those issues out would make this system usable for us!

Updated by TheSerapher on 2015-03-13 14:49:15 +00:00

Tried to do a failover by restarting Icinga 2 on the active master. As expected, the master stopped running status updates; after 1 minute the passive master took over while the old active one was still starting up and running queries.

New Master

# Takeover after 1 Minute timeout
[2015-03-13 15:43:34 +0100] information/IdoMysqlConnection: MySQL IDO instance id: 1 (schema version: '1.13.0')
[...]
[2015-03-13 15:44:25 +0100] critical/IdoMysqlConnection: Error "Lock wait timeout exceeded; try restarting transaction" when executing query "DELETE FROM icinga_customvariables WHERE instance_id = 1"
Context:
    (0) Reconnecting to MySQL IDO database 'ido-mysql'

[2015-03-13 15:44:25 +0100] critical/IdoMysqlConnection: Exception during database operation: Verify that your database is operational!
[2015-03-13 15:44:25 +0100] information/IdoMysqlConnection: MySQL IDO instance id: 1 (schema version: '1.13.0')
[...]
[2015-03-13 15:45:16 +0100] critical/IdoMysqlConnection: Error "Lock wait timeout exceeded; try restarting transaction" when executing query "DELETE FROM icinga_customvariables WHERE instance_id = 1"
Context:
    (0) Reconnecting to MySQL IDO database 'ido-mysql'

[2015-03-13 15:45:16 +0100] critical/IdoMysqlConnection: Exception during database operation: Verify that your database is operational!
[2015-03-13 15:45:16 +0100] information/IdoMysqlConnection: MySQL IDO instance id: 1 (schema version: '1.13.0')
[...]
[2015-03-13 15:46:07 +0100] critical/IdoMysqlConnection: Error "Lock wait timeout exceeded; try restarting transaction" when executing query "DELETE FROM icinga_customvariables WHERE instance_id = 1"
Context:
    (0) Reconnecting to MySQL IDO database 'ido-mysql'

[2015-03-13 15:46:07 +0100] critical/IdoMysqlConnection: Exception during database operation: Verify that your database is operational!
[2015-03-13 15:46:07 +0100] information/IdoMysqlConnection: MySQL IDO instance id: 1 (schema version: '1.13.0')

It looks like the master/master setup stops working as soon as we are dealing with long startup times and initialisation phases. Even a simple restart or reload of an active or passive master can cause these lock waits. And the lock waits even make some sense: we don't want the restarting master to delete all entries from the startup tables while the new master is updating them!
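
(A generic InnoDB diagnostic can help identify which statement holds the conflicting lock during such a wait; this is not taken from this thread, and the information_schema tables below assume MySQL 5.5-5.7:)

-- Sketch: show the waiting and the blocking statement while a lock wait is in progress.
SELECT r.trx_query AS waiting_query,
       b.trx_query AS blocking_query
  FROM information_schema.innodb_lock_waits w
  JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id
  JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id;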

Hopefully there will be a solution to this.

Updated by TheSerapher on 2015-03-13 15:14:00 +00:00

Totally missed your update there dnsmichi!

I agree that this is indeed a critical issue and it makes clusters with a lot of services and checkers difficult. As far as I can tell we have two issues here:

  • Status updates not given high priority over all other updates
  • Startups of nodes completely dropping all database content when initializing

I think for the first issue, a high priority queue would be a good idea. This can also be used for other important cross system messages that rely on up to date information.

The second issue is way more complicated, but fixing it would benefit cluster restart times and recoveries. I can see why data is wiped and re-inserted (service changes or removals), but maybe a first step could be to batch the inserts: instead of a single query for every row, use one query that inserts many rows. Or find a way to compare the database contents with the actual configuration objects and only remove/change those rows in the DB that have been modified or removed in the config.
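
As an illustration of that batching idea, a multi-row version of the customvariables insert from the debug log above could look like this; this is only a sketch of the suggestion, not what Icinga 2 actually emits, and the values are copied from the log lines earlier in this thread:

INSERT INTO icinga_customvariables
  (config_type, has_been_modified, instance_id, is_json, object_id, varname, varvalue)
VALUES
  ('1', '0', 1, '0', 4934, 'critical', '800'),
  ('1', '0', 1, '0', 4934, 'impact', 'high'),
  ('1', '0', 1, '0', 4934, 'parameter', 'jvm:threads:count');

The same multi-row VALUES form is accepted by both MySQL and PostgreSQL.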

Updated by mfriedrich on 2015-03-13 15:32:15 +00:00

TheSerapher wrote:

I think for the first issue, a high priority queue would be a good idea. This can also be used for other important cross system messages that rely on up to date information.

Feel free to create a patch :)

The second issue is way more complicated, but fixing it would benefit cluster restart times and recoveries. I can see why data is wiped and re-inserted (service changes or removals), but maybe a first step could be to batch the inserts: instead of a single query for every row, use one query that inserts many rows. Or find a way to compare the database contents with the actual configuration objects and only remove/change those rows in the DB that have been modified or removed in the config.

I don't see this as part of the original problem. That's by design, and if you consider changing or discussing it, please open a new issue. The config dump itself, and also initial state dump is a different story.

Imho, fixing the status updates not reaching the programstatus table in time, as well as other (yet-to-be-found) performance tweaks, would help solve the problem.

Other than that, we could just signal the query processing thread to fire a new program status update before executing any other query. A workaround for your problem - an ugly one - would be an external cronjob on the box updating the status_update_time and node name from a separate connection (but only on the active master, and control that with pacemaker or similar).
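
A rough sketch of what such a cron-driven statement might look like, purely to illustrate the workaround; the column names come from the programstatus dumps above and the endpoint name is just the example used in this thread:

-- Hypothetical cronjob query, run from a separate connection on the active master only:
UPDATE icinga_programstatus
   SET status_update_time = NOW(),
       endpoint_name = 'icinga2-master-1.monitor.dc1'
 WHERE instance_id = 1;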

Updated by TheSerapher on 2015-03-13 15:38:33 +00:00

If I were able to code anything in C/C++ I would jump in - but I really can't, and I don't wanna mess things up ;-)

I agree that handling the status update properly would already help a lot. But what about a proper failover too? Let that init phase take a while, but at least fail over to a second node if one is available during that time. Masters could also use the API to update their states, and only if both the API and the IDO status are negative, do the failover.

Yet I think the initial load needs addressing since no updates will be done until the load completes. Will make a separate issue for that then.

Updated by gbeutner on 2015-03-13 19:30:35 +00:00

  • Relates set to 8738

Updated by gbeutner on 2015-03-13 19:32:37 +00:00

I guess we could prioritize certain kinds of queries.

On another note #8738 should improve query performance for IDO (MySQL only for now) significantly as it eliminates round-trips for most database queries.

Updated by TheSerapher on 2015-03-13 19:43:38 +00:00

If you've got another snapshot I will give it a whirl :) It is kind of what I was thinking of, just that instead of a single INSERT INTO .... VALUES (...), (...), (...), [...] you went with INSERT INTO ...; INSERT INTO ....;

The former would probably make it compliant with Postgres as well while doing the same thing.

EDIT Word

Updated by gbeutner on 2015-03-13 20:34:47 +00:00

FYI the Ubuntu trusty packages are built for each commit. So by now you should already have updated packages on packages.icinga.org.

Implementing this for PostgreSQL will probably be a bit more difficult unfortunately because they don't seem to support multiple statements (well, they do, but you only get a result object for the last statement which makes things a bit awkward).

Updated by mfriedrich on 2015-03-14 09:25:54 +00:00

Reducing the config dump to a minimum is always desirable, but even if we solve it for pgsql as well, it wouldn't solve the original issue with the DB IDO HA feature failure.

I'm not sure if a priority queue is the best solution here, although it would indicate when the entire connection is blocking (i.e. the db server is slow). A different idea would be to run the programstatus table updates in a separate thread with its own connection. It would have to wait for the instance_id, and could check for a query rate != 0 (and a signal that a config dump is running), for example. As long as icinga2 is running, the table gets updated and illustrates what the main process is doing.

Not sure though about the impact that has on other tables and delayed updates (interfaces might rely on programstatus only). Web 2 visualizes that via last/next check and interval relative to now(), so stale data is not a problem there.

Updated by TheSerapher on 2015-03-14 09:35:25 +00:00

Yes, the config dump is a separate issue. Haven't had a chance to open a new ticket yet.

I think for the status update it should be a combination of high-priority IDO queries for the masters and the API in the config HA zone. We already have a heartbeat system between the masters using the API, so why not expand that and focus on the API as the means for masters to communicate state information? Using IDO could be an addition to that, to avoid a split-brain scenario in case the API fails for some reason.

And yes, the status table is used by frontends so it still would require updates unless frontends can retrieve that information from a process directly (Public API?).

Updated by TheSerapher on 2015-03-16 08:34:59 +00:00

Created a ticket for server startup config dump: https://dev.icinga.org/issues/8756

Updated by TheSerapher on 2015-03-16 10:26:06 +00:00

gunnarbeutner wrote:

I guess we could prioritize certain kinds of queries.

On another note #8738 should improve query performance for IDO (MySQL only for now) significantly as it eliminates round-trips for most database queries.

Updated masters to latest snapshot and seeing the UPDATE ...; UPDATE ...; queries in the strace now.

Updated by mfriedrich on 2015-03-19 09:53:28 +00:00

  • Subject changed from Master IDO Status Update Delayed to Prioritize db queries for programstatus table (DB IDO HA)
  • Status changed from New to Assigned
  • Assigned to set to gbeutner
  • Target Version set to 2.3.3
  • Estimated Hours set to 8

Updated by mfriedrich on 2015-03-23 13:58:15 +00:00

  • Target Version changed from 2.3.3 to 2.3.4

Updated by mfriedrich on 2015-04-15 07:19:34 +00:00

  • Target Version changed from 2.3.4 to 2.3.5

Updated by mfriedrich on 2015-04-15 07:22:21 +00:00

We'll re-schedule these issues in favor of tagging 2.3.4 in the next couple of days.

Updated by gbeutner on 2015-05-13 07:53:10 +00:00

  • Target Version deleted 2.3.5

Updated by mfriedrich on 2015-06-23 13:26:54 +00:00

  • Target Version set to Backlog

Updated by dgoetz on 2015-07-09 09:39:10 +00:00

I see the same problem here, especially caused by the custom variables, which get dropped and refilled during startup/reload.
(So perhaps this comment would fit better in #8756.)

Updated by mfriedrich on 2015-10-28 08:48:56 +00:00

  • Subject changed from Prioritize db queries for programstatus table (DB IDO HA) to Add priority queue for disconnect/programstatus update events
  • Status changed from Assigned to New
  • Assigned to deleted gbeutner
  • Priority changed from Normal to High

Updated by mfriedrich on 2015-11-03 09:39:32 +00:00

  • Estimated Hours changed from 8 to 32

Updated by mfrosch on 2015-12-02 18:25:23 +00:00

I think it might be reasonable to add several update types to the IDO:

  • program status
  • historical events
  • checkable status

Status updates should be able to be merged, i.e. not every update gets queued, only the fact that an update type has to be refreshed - merging multiple updates into one while they are still in the queue.

Updated by Anonymous on 2015-12-10 16:10:03 +00:00

  • Status changed from New to Resolved
  • Done % changed from 0 to 100

Applied in changeset 15ca998.

Updated by gbeutner on 2015-12-10 16:17:26 +00:00

  • Assigned to set to gbeutner
  • Target Version changed from Backlog to 2.4.2

Note that this patch does not merge duplicate queries. This is definitely something we should implement though (i.e., next week).

Updated by gbeutner on 2015-12-10 16:17:57 +00:00

  • Relates set to 10822

Updated by mfriedrich on 2015-12-14 09:22:09 +00:00

  • Relates set to 10829

Updated by mfriedrich on 2015-12-15 09:39:52 +00:00

  • Relates set to 10840

Updated by mfriedrich on 2015-12-15 10:54:36 +00:00

  • Relates set to 8976

Updated by mfriedrich on 2015-12-15 12:00:25 +00:00

  • Relates set to 10765

Updated by mfriedrich on 2015-12-15 16:08:59 +00:00

  • Relates set to 10855

Updated by mfriedrich on 2015-12-18 09:47:09 +00:00

  • Relates set to 10883

Updated by mfriedrich on 2016-02-08 10:02:04 +00:00

  • Relates set to 11103

Updated by gbeutner on 2016-02-23 09:58:14 +00:00

  • Backport? changed from Not yet backported to Already backported

icinga-migration added the blocker (Blocks a release or needs immediate attention), bug (Something isn't working) and area/db-ido (Database output) labels on Jan 17, 2017
icinga-migration added this to the 2.4.2 milestone on Jan 17, 2017