
[dev.icinga.com #10731] Icinga2 locks up after initialization and state dump with IDO enabled #3717

Closed
icinga-migration opened this issue Nov 26, 2015 · 15 comments

This issue has been migrated from Redmine: https://dev.icinga.com/issues/10731

Created by vytenis on 2015-11-26 14:35:40 +00:00

Assignee: (none)
Status: Closed (closed on 2016-03-18 15:30:17 +00:00)
Target Version: (none)
Last Update: 2016-03-30 16:45:03 +00:00 (in Redmine)

Icinga Version: 2.4.0
Backport?: Not yet backported
Include in Changelog: 1

Icinga2 locks up after initialization and the state dump with IDO enabled. The API and Livestatus remain responsive, but no checks are made, passive check results are not processed, and no database queries are executed.
Attaching gdb `bt`/`thread apply all bt` output. The debug log looks like this:

[2015-11-26 14:22:06 +0000] notice/ThreadPool: Pool #1: Pending tasks: 1127; Average latency: 0ms; Threads: 64; Pool utilization: 100%
[2015-11-26 14:22:07 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:07 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:07 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:07 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:07 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:07 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:07 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:07 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:08 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:08 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:08 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:08 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:08 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:08 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:08 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:08 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:09 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:09 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:09 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:09 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:09 +0000] information/ApiListener: New client connection (no client certificate)
[2015-11-26 14:22:09 +0000] notice/ApiListener: New HTTP client
[2015-11-26 14:22:09 +0000] debug/HttpRequest: line: GET /v1/status HTTP/1.1, tokens: 3
[2015-11-26 14:22:09 +0000] information/HttpServerConnection: Request: GET /v1/status (nagdowntime)
[2015-11-26 14:22:09 +0000] notice/LivestatusListener: Client connected
[2015-11-26 14:22:09 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:09 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:09 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:09 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-11-26 14:22:09 +0000] warning/TlsStream: TLS stream was disconnected.

[2015-11-26 14:22:21 +0000] notice/ThreadPool: Pool #1: Pending tasks: 1129; Average latency: 0ms; Threads: 64; Pool utilization: 100%
[2015-11-26 14:22:22 +0000] notice/ThreadPool: Thread pool; current: 16; adjustment: 2



Updated by vytenis on 2015-11-26 15:17:07 +00:00

  • File added gdb2.txt

Same with a recent snapshot:

icinga2 - The Icinga 2 network monitoring daemon (version: v2.4.0-23-g9705e08)

Copyright (c) 2012-2015 Icinga Development Team (https://www.icinga.org)
License GPLv2+: GNU GPL version 2 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Application information:
  Installation root: /usr
  Sysconf directory: /etc
  Run directory: /run
  Local state directory: /var
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid

System information:
  Platform: Debian GNU/Linux
  Platform version: 8 (jessie)
  Kernel: Linux
  Kernel version: 3.18.19
  Architecture: x86_64


Updated by vytenis on 2015-11-26 16:34:45 +00:00

  • File added Screen Shot 2015-11-26 at 18.20.17.png

So the issue is that the IDO queue is full. For some reason, however, it never drains: SQL execution just stops.


Updated by gbeutner on 2015-11-26 18:11:38 +00:00

  • Status changed from New to Feedback
  • Assigned to set to vytenis

Can you attach gdb to the icinga2 process once it has locked up and post the output of `thread apply all bt full` here?
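
For reference, a minimal non-interactive way to capture that (a sketch, assuming the default PID path shown in the version output above; gdb's batch mode also disables pagination):

# Attach to the running daemon and dump full backtraces of all threads.
gdb -batch -p "$(cat /run/icinga2/icinga2.pid)" \
    -ex 'thread apply all bt full' > gdb_bt_full.txt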


Updated by vytenis on 2015-11-26 18:16:49 +00:00

  • File added Screen Shot 2015-11-26 at 20.03.59.png

Bumping the queue from 500k to 5M allows Icinga2 to start successfully, but with 400k services (mostly passive) and a lot of host/service groups it apparently takes tens of millions of queries to do its job. However, if the queue still got full during init, everything would block (no checks would be performed).


Updated by vytenis on 2015-11-26 19:54:18 +00:00

  • File added gdb_bt_full.txt

Attaching `bt full` output


Updated by vytenis on 2015-11-26 19:57:59 +00:00

  • File added gdb_bt_full.txt.gz

Apparently, even if you turn off pagination, gdb still thinks you're insane for outputting it all in one go. The real, full backtrace is >20 MB without compression.


Updated by mfriedrich on 2015-12-17 09:41:18 +00:00

  • Category set to DB IDO

Can you re-test this with the latest git master, please? :)


Updated by vytenis on 2015-12-20 20:18:17 +00:00

Much better performance (fewer queries) on restart, but once the config is big enough to hit the queue limit, icinga2 is still unresponsive.


Updated by vytenis on 2015-12-20 21:56:13 +00:00

Never mind, my bad: a bad config with way too many objects made it freeze.

[2015-12-20 21:54:18 +0000] information/CheckerComponent: Pending checkables: 76; Idle checkables: 79239; Checks/s: 3.4
[2015-12-20 21:54:18 +0000] information/IdoMysqlConnection: Query queue items: 5000000, query rate: 8989.32/s (539359/min 2780669/5min 5190445/15min); empty in infinite time, your database isn't able to keep up
[2015-12-20 21:54:18 +0000] information/ThreadPool: Thread pool; current: 5; adjustment: -1
[2015-12-20 21:54:19 +0000] information/ThreadPool: Thread pool; current: 4; adjustment: -1
[2015-12-20 21:54:19 +0000] information/ThreadPool: Thread pool; current: 3; adjustment: -1
[2015-12-20 21:54:20 +0000] information/ThreadPool: Thread pool; current: 3; adjustment: 2
[2015-12-20 21:54:20 +0000] information/ThreadPool: Thread pool; current: 5; adjustment: -1
[2015-12-20 21:54:21 +0000] information/ThreadPool: Thread pool; current: 5; adjustment: -1
[2015-12-20 21:54:21 +0000] information/ThreadPool: Thread pool; current: 5; adjustment: -1
[2015-12-20 21:54:22 +0000] information/ThreadPool: Thread pool; current: 3; adjustment: 2
[2015-12-20 21:54:23 +0000] information/ThreadPool: Thread pool; current: 5; adjustment: -1
[2015-12-20 21:54:23 +0000] information/CheckerComponent: Pending checkables: 0; Idle checkables: 79315; Checks/s: 60
[2015-12-20 21:54:23 +0000] information/ThreadPool: Thread pool; current: 4; adjustment: -1

Stats are updating correctly now, and even icingaweb2 thinks icinga is alive, although most of the data is out of date.


Updated by vytenis on 2015-12-20 22:02:39 +00:00

This is weird though:

[2015-12-20 21:54:33 +0000] information/IdoMysqlConnection: Query queue items: 5000000, query rate: 10748.7/s (644924/min 2851429/5min 5365788/15min); empty in infinite time, your database isn't able to keep up
[2015-12-20 21:54:33 +0000] information/ThreadPool: Thread pool; current: 2; adjustment: -1
[2015-12-20 21:54:34 +0000] information/ThreadPool: Thread pool; current: 1; adjustment: 2
...
[2015-12-20 21:54:44 +0000] information/IdoMysqlConnection: Finished reconnecting to MySQL IDO database in 599.299 second(s).
...
[2015-12-20 21:54:48 +0000] information/IdoMysqlConnection: Query queue items: 5000000, query rate: 10102.5/s (606148/min 2798867/5min 5493930/15min); empty in infinite time, your database isn't able to keep up
...
[2015-12-20 22:00:40 +0000] information/IdoMysqlConnection: Query queue items: 5000000, query rate: 8654.23/s (519254/min 2859120/5min 8423944/15min); empty in infinite time, your database isn't able to keep up
[2015-12-20 22:00:40 +0000] information/ThreadPool: Thread pool; current: 5; adjustment: -1

However, everything works; this is probably related to the fact that I previously created 10+M objects that have not been cleaned up.


Updated by mfriedrich on 2016-02-24 21:40:27 +00:00

At some point the work queue is full (when the database is not able to process queries as fast as the core pushes them in). Then, for example, check result processing is blocked. Suggestions on how to mitigate a slow database from the core side are welcome.


Updated by mfriedrich on 2016-03-18 15:30:17 +00:00

  • Status changed from Feedback to Closed
  • Assigned to deleted vytenis

This is merely a problem with your database not being able to keep up. Once the query queue is full (you've already modified the upstream source from 1 million to 5 million), threads lock up and wait for the queue to empty again. I'm closing this issue for now; please look into your database performance.


Updated by lewiseason on 2016-03-30 15:57:26 +00:00

Is there any tooling available to get information about the query queue, or to flush it completely? Occasionally this is the easiest way to get back into a good state, and I don't mind losing a few minutes of checks.


Updated by mfriedrich on 2016-03-30 16:02:32 +00:00

Either define the `ido` check (see the docs for details), which sends performance data metrics including the queue size, or fetch it from the REST API at `/v1/status`.
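
For the REST API route, a sketch along these lines should work (assumptions: the default API port 5665, an ApiUser with placeholder credentials, and the IDO feature being listed under the IdoMysqlConnection status type):

# The queue metrics appear in the perfdata of the IDO status component.
curl -k -s -u root:icinga 'https://localhost:5665/v1/status/IdoMysqlConnection'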


Updated by lewiseason on 2016-03-30 16:45:03 +00:00

Does either of those methods require the query queue/my database to be responsive?

At the moment I have a cron job that looks at the log file and grabs the last query queue length, which is at least fairly robust. What I'd really like to be able to do is say "OK icinga, just drop everything in the queue and carry on from now" (at least until I can get the performance issues under control). Is this something I can do?
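
For what it's worth, a sketch of that log-scraping approach, matched against the "Query queue items: N, query rate: ..." message format quoted earlier in this thread (the log path is the Debian default and may differ):

# Print the most recent IDO query queue length from the main log.
grep 'Query queue items' /var/log/icinga2/icinga2.log | tail -n 1 \
  | sed -E 's/.*Query queue items: ([0-9]+),.*/\1/'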

icinga-migration added the bug and area/db-ido labels on Jan 17, 2017