
[dev.icinga.com #10731] Icinga2 locks up after initialization and state dump with IDO enabled #3717

Closed
icinga-migration opened this issue Nov 26, 2015 · 15 comments

This issue has been migrated from Redmine: https://dev.icinga.com/issues/10731

Created by vytenis on 2015-11-26 14:35:40 +00:00

Assignee: (none)
Status: Closed (closed on 2016-03-18 15:30:17 +00:00)
Target Version: (none)
Last Update: 2016-03-30 16:45:03 +00:00 (in Redmine)

Icinga Version: 2.4.0
Backport?: Not yet backported
Include in Changelog: 1

Icinga2 locks up after initialization and the state dump with IDO enabled. The API and Livestatus remain responsive, but no checks are made, passive check results are not processed, and no database queries are executed.
Attaching gdb `bt`/`thread apply all bt` output. The debug log looks like this:

[2015-11-26 14:22:06 +0000] notice/ThreadPool: Pool #1: Pending tasks: 1127; Average latency: 0ms; Threads: 64; Pool utilization: 100%
[2015-11-26 14:22:07 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:07 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:07 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:07 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:07 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:07 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:07 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:07 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:08 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:08 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:08 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:08 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:08 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:08 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:08 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:08 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:09 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:09 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:09 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:09 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:09 +0000] information/ApiListener: New client connection (no client certificate)
[2015-11-26 14:22:09 +0000] notice/ApiListener: New HTTP client
[2015-11-26 14:22:09 +0000] debug/HttpRequest: line: GET /v1/status HTTP/1.1, tokens: 3
[2015-11-26 14:22:09 +0000] information/HttpServerConnection: Request: GET /v1/status (nagdowntime)
[2015-11-26 14:22:09 +0000] notice/LivestatusListener: Client connected
[2015-11-26 14:22:09 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:09 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:09 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:09 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:09 +0000] warning/TlsStream: TLS stream was disconnected.

[2015-11-26 14:22:21 +0000] notice/ThreadPool: Pool #1: Pending tasks: 1129; Average latency: 0ms; Threads: 64; Pool utilization: 100%
[2015-11-26 14:22:22 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2



Updated by vytenis on 2015-11-26 15:17:07 +00:00

  • File added gdb2.txt

Same with a recent snapshot:

icinga2 - The Icinga 2 network monitoring daemon (version: v2.4.0-23-g9705e08)

Copyright (c) 2012-2015 Icinga Development Team (https://www.icinga.org)
License GPLv2+: GNU GPL version 2 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Application information:
  Installation root: /usr
  Sysconf directory: /etc
  Run directory: /run
  Local state directory: /var
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid

System information:
  Platform: Debian GNU/Linux
  Platform version: 8 (jessie)
  Kernel: Linux
  Kernel version: 3.18.19
  Architecture: x86_64


Updated by vytenis on 2015-11-26 16:34:45 +00:00

  • File added Screen Shot 2015-11-26 at 18.20.17.png

So the issue is that the IDO queue is full. For some reason, however, it never drains: SQL execution just stops.


Updated by gbeutner on 2015-11-26 18:11:38 +00:00

  • Status changed from New to Feedback
  • Assigned to set to vytenis

Can you attach gdb to the icinga2 process once it has locked up and post the output of `thread apply all bt full` here?
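
For reference, a minimal non-interactive way to capture that (a sketch, assuming the default PID path shown in the version output above; gdb's batch mode also disables pagination):

# Attach to the running daemon and dump full backtraces of all threads.
gdb -batch -p "$(cat /run/icinga2/icinga2.pid)" \
    -ex 'thread apply all bt full' > gdb_bt_full.txt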


Updated by vytenis on 2015-11-26 18:16:49 +00:00

  • File added Screen Shot 2015-11-26 at 20.03.59.png

Bumping the queue from 500k to 5M allows Icinga2 to start successfully, but with 400k services (mostly passive) and a lot of host/service groups it apparently takes tens of millions of queries to do its job. However, if the queue still got full during init, everything would block (no checks would be performed).


Updated by vytenis on 2015-11-26 19:54:18 +00:00

  • File added gdb_bt_full.txt

Attaching `bt full` output


Updated by vytenis on 2015-11-26 19:57:59 +00:00

  • File added gdb_bt_full.txt.gz

Apparently, even if you turn off pagination, gdb still thinks you're insane for outputting it all in one go. The real, full backtrace is >20 MB without compression.


Updated by mfriedrich on 2015-12-17 09:41:18 +00:00

  • Category set to DB IDO

Can you re-test this with the latest git master, please? :)


Updated by vytenis on 2015-12-20 20:18:17 +00:00

Much better performance (fewer queries) on restart, but once the config is big enough to hit the queue limit, icinga2 is still unresponsive.


Updated by vytenis on 2015-12-20 21:56:13 +00:00

Never mind, my bad: a bad config with way too many objects made it freeze.

[2015-12-20 21:54:18 +0000] information/CheckerComponent: Pending checkables: 76; Idle checkables: 79239; Checks/s: 3.4
[2015-12-20 21:54:18 +0000] information/IdoMysqlConnection: Query queue items: 5000000, query rate: 8989.32/s (539359/min 2780669/5min 5190445/15min); empty in infinite time, your database isn't able to keep up
[2015-12-20 21:54:18 +0000] information/ThreadPool: Thread pool; current: 5; adjustment: -1
[2015-12-20 21:54:19 +0000] information/ThreadPool: Thread pool; current: 4; adjustment: -1
[2015-12-20 21:54:19 +0000] information/ThreadPool: Thread pool; current: 3; adjustment: -1
[2015-12-20 21:54:20 +0000] information/ThreadPool: Thread pool; current: 3; adjustment: 2
[2015-12-20 21:54:20 +0000] information/ThreadPool: Thread pool; current: 5; adjustment: -1
[2015-12-20 21:54:21 +0000] information/ThreadPool: Thread pool; current: 5; adjustment: -1
[2015-12-20 21:54:21 +0000] information/ThreadPool: Thread pool; current: 5; adjustment: -1
[2015-12-20 21:54:22 +0000] information/ThreadPool: Thread pool; current: 3; adjustment: 2
[2015-12-20 21:54:23 +0000] information/ThreadPool: Thread pool; current: 5; adjustment: -1
[2015-12-20 21:54:23 +0000] information/CheckerComponent: Pending checkables: 0; Idle checkables: 79315; Checks/s: 60
[2015-12-20 21:54:23 +0000] information/ThreadPool: Thread pool; current: 4; adjustment: -1

Stats are updating correctly now, and even icingaweb2 thinks icinga is alive, although most of the data is out of date.


Updated by vytenis on 2015-12-20 22:02:39 +00:00

This is weird though:

[2015-12-20 21:54:33 +0000] information/IdoMysqlConnection: Query queue items: 5000000, query rate: 10748.7/s (644924/min 2851429/5min 5365788/15min); empty in infinite time, your database isn't able to keep up
[2015-12-20 21:54:33 +0000] information/ThreadPool: Thread pool; current: 2; adjustment: -1
[2015-12-20 21:54:34 +0000] information/ThreadPool: Thread pool; current: 1; adjustment: 2
...
[2015-12-20 21:54:44 +0000] information/IdoMysqlConnection: Finished reconnecting to MySQL IDO database in 599.299 second(s).
...
[2015-12-20 21:54:48 +0000] information/IdoMysqlConnection: Query queue items: 5000000, query rate: 10102.5/s (606148/min 2798867/5min 5493930/15min); empty in infinite time, your database isn't able to keep up
...
[2015-12-20 22:00:40 +0000] information/IdoMysqlConnection: Query queue items: 5000000, query rate: 8654.23/s (519254/min 2859120/5min 8423944/15min); empty in infinite time, your database isn't able to keep up
[2015-12-20 22:00:40 +0000] information/ThreadPool: Thread pool; current: 5; adjustment: -1

However, everything works; this is probably related to the fact that I previously created 10+M objects that have not been cleaned up.


Updated by mfriedrich on 2016-02-24 21:40:27 +00:00

At some point the work queue is full (when the database is not able to process queries as fast as the core pushes them in). Then, for example, check result processing is blocked. Suggestions on how to mitigate a slow database from the core side are welcome.


Updated by mfriedrich on 2016-03-18 15:30:17 +00:00

  • Status changed from Feedback to Closed
  • Assigned to deleted vytenis

This is merely a problem with your database not being able to keep up. Once the query queue is full (you've already modified the upstream source from 1 million to 5 million), threads lock up and wait for the queue to empty again. I'm closing this issue for now; please look into your database performance.


Updated by lewiseason on 2016-03-30 15:57:26 +00:00

Is there any tooling available to get information about the query queue, or to flush it completely? Occasionally this is the easiest way to get back into a good state, and I don't mind losing a few minutes of checks.


Updated by mfriedrich on 2016-03-30 16:02:32 +00:00

Either define the `ido` check (see the docs for details), which sends performance data metrics including the queue size, or fetch it from the REST API at `/v1/status`.
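
For the REST API route, a sketch along these lines should work (assumptions: the default API port 5665, an ApiUser with placeholder credentials, and the IDO feature being listed under the IdoMysqlConnection status type):

# The queue metrics appear in the perfdata of the IDO status component.
curl -k -s -u root:icinga 'https://localhost:5665/v1/status/IdoMysqlConnection'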


Updated by lewiseason on 2016-03-30 16:45:03 +00:00

Does either of those methods require the query queue/my database to be responsive?

At the moment I have a cron job that looks at the log file and grabs the last query queue length, which is at least fairly robust. What I'd really like to be able to do is say "OK icinga, just drop everything in the queue and carry on from now" (at least until I can get the performance issues under control). Is this something I can do?
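
For what it's worth, a sketch of that log-scraping approach, matched against the "Query queue items: N, query rate: ..." message format quoted earlier in this thread (the log path is the Debian default and may differ):

# Print the most recent IDO query queue length from the main log.
grep 'Query queue items' /var/log/icinga2/icinga2.log | tail -n 1 \
  | sed -E 's/.*Query queue items: ([0-9]+),.*/\1/'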

icinga-migration added the bug and area/db-ido labels on Jan 17, 2017