
[dev.icinga.com #10963] high load and memory consumption on icinga2 agent v2.4.1 #3835

Closed
icinga-migration opened this issue Jan 13, 2016 · 32 comments
Labels
area/distributed Distributed monitoring (master, satellites, clients) bug Something isn't working
Milestone
2.4.2

Comments


This issue has been migrated from Redmine: https://dev.icinga.com/issues/10963

Created by elabedzki on 2016-01-13 15:36:54 +00:00

Assignee: gbeutner
Status: Resolved (closed on 2016-02-23 09:59:37 +00:00)
Target Version: 2.4.2
Last Update: 2016-02-23 09:59:53 +00:00 (in Redmine)

Icinga Version: v2.4.1
Backport?: Already backported
Include in Changelog: 1

Hi guys,

we noticed a high load and memory consumption problem with some Icinga 2 agents (version 2.4.1); it isn't really clear what is going on behind the scenes.

One of our customers has a huge setup, described as follows...

  • 1494 running agents
  • all agents are installed with v2.4.1 (released build)
  • the operating systems the agents run on are a mix of Red Hat distributions (RHEL 6.5, 6.6, 7.1, 7.2)
  • an agent has between 13 and 20 checks defined/loaded (some of them from apply rules, some dedicated checks)
  • the masters are installed as a standard cluster with the Icinga 2 v2.4.1 snapshot package "snapshot201601112014.el7.centos"
  • The behavior of the master systems is harder to assess due to the known performance issues. However, no such load/memory problems have occurred there since installing the snapshots.

Has anyone seen similar problems in their setup?

The CPU load is extremely high, along with the memory utilization: Icinga2 eats up to 70% of 2GB RAM and generates a load of 12 on a single-core system.

At the same time, we noticed in the log files on the masters that all agents are trying to reconnect, as you can see:

[2016-01-13 11:42:59 +0100] warning/ApiListener: Removing API client for endpoint 'mon-icingamaster-02.lxprod.obsfucated.customer.domain'. 106 API clients left.
[2016-01-13 11:42:59 +0100] information/JsonRpcConnection: No messages for identity 'mon-icingamaster-02.lxprod.obsfucated.customer.domain' have been received in the last 60 seconds.
[2016-01-13 11:42:59 +0100] warning/JsonRpcConnection: API client disconnected for identity 'mon-icingamaster-02.lxprod.obsfucated.customer.domain'
[2016-01-13 11:42:59 +0100] warning/ApiListener: Removing API client for endpoint 'mon-icingamaster-02.lxprod.obsfucated.customer.domain'. 105 API clients left.
[2016-01-13 11:42:59 +0100] information/JsonRpcConnection: No messages for identity 'mon-icingamaster-02.lxprod.obsfucated.customer.domain' have been received in the last 60 seconds.
[2016-01-13 11:42:59 +0100] warning/JsonRpcConnection: API client disconnected for identity 'mon-icingamaster-02.lxprod.obsfucated.customer.domain'
[2016-01-13 11:42:59 +0100] warning/ApiListener: Removing API client for endpoint 'mon-icingamaster-02.lxprod.obsfucated.customer.domain'. 104 API clients left.
[2016-01-13 11:42:59 +0100] information/JsonRpcConnection: No messages for identity 'mon-icingamaster-02.lxprod.obsfucated.customer.domain' have been received in the last 60 seconds.
[2016-01-13 11:42:59 +0100] warning/JsonRpcConnection: API client disconnected for identity 'mon-icingamaster-02.lxprod.obsfucated.customer.domain'
[2016-01-13 11:42:59 +0100] warning/ApiListener: Removing API client for endpoint 'mon-icingamaster-02.lxprod.obsfucated.customer.domain'. 103 API clients left.
[2016-01-13 11:42:59 +0100] information/JsonRpcConnection: No messages for identity 'mon-icingamaster-02.lxprod.obsfucated.customer.domain' have been received in the last 60 seconds.
[2016-01-13 11:42:59 +0100] warning/JsonRpcConnection: API client disconnected for identity 'mon-icingamaster-02.lxprod.obsfucated.customer.domain'
[2016-01-13 11:42:59 +0100] warning/ApiListener: Removing API client for endpoint 'mon-icingamaster-02.lxprod.obsfucated.customer.domain'. 102 API clients left.
[2016-01-13 11:42:59 +0100] information/JsonRpcConnection: No messages for identity 'mon-icingamaster-02.lxprod.obsfucated.customer.domain' have been received in the last 60 seconds.
[2016-01-13 11:42:59 +0100] warning/JsonRpcConnection: API client disconnected for identity 'mon-icingamaster-02.lxprod.obsfucated.customer.domain'
[2016-01-13 11:42:59 +0100] warning/ApiListener: Removing API client for endpoint 'mon-icingamaster-02.lxprod.obsfucated.customer.domain'. 101 API clients left.
[2016-01-13 11:42:59 +0100] information/JsonRpcConnection: No messages for identity 'mon-icingamaster-02.lxprod.obsfucated.customer.domain' have been received in the last 60 seconds.
[2016-01-13 11:42:59 +0100] warning/JsonRpcConnection: API client disconnected for identity 'mon-icingamaster-02.lxprod.obsfucated.customer.domain'
[2016-01-13 11:42:59 +0100] warning/ApiListener: Removing API client for endpoint 'mon-icingamaster-02.lxprod.obsfucated.customer.domain'. 100 API clients left.
[2016-01-13 11:42:59 +0100] information/JsonRpcConnection: No messages for identity 'mon-icingamaster-02.lxprod.obsfucated.customer.domain' have been received in the last 60 seconds.
[2016-01-13 11:42:59 +0100] warning/JsonRpcConnection: API client disconnected for identity 'mon-icingamaster-02.lxprod.obsfucated.customer.domain'
[2016-01-13 11:42:59 +0100] warning/ApiListener: Removing API client for endpoint 'mon-icingamaster-02.lxprod.obsfucated.customer.domain'. 99 API clients left.
[2016-01-13 11:42:59 +0100] information/JsonRpcConnection: No messages for identity 'mon-icingamaster-02.lxprod.obsfucated.customer.domain' have been received in the last 60 seconds.
[2016-01-13 11:42:59 +0100] warning/JsonRpcConnection: API client disconnected for identity 'mon-icingamaster-02.lxprod.obsfucated.customer.domain'
[2016-01-13 11:42:59 +0100] warning/ApiListener: Removing API client for endpoint 'mon-icingamaster-02.lxprod.obsfucated.customer.domain'. 98 API clients left.
[2016-01-13 11:43:00 +0100] information/ApiListener: New client connection for identity 'mon-icingamaster-02.lxprod.obsfucated.customer.domain'
[2016-01-13 11:43:00 +0100] information/ApiListener: New client connection for identity 'mon-icingamaster-02.lxprod.obsfucated.customer.domain'
[2016-01-13 11:43:00 +0100] information/ApiListener: New client connection for identity 'mon-icingamaster-02.lxprod.obsfucated.customer.domain'
[2016-01-13 11:43:00 +0100] information/ApiListener: New client connection for identity 'mon-icingamaster-02.lxprod.obsfucated.customer.domain'
[2016-01-13 11:43:00 +0100] information/ApiListener: New client connection for identity 'mon-icingamaster-02.lxprod.obsfucated.customer.domain'
[2016-01-13 11:43:00 +0100] information/ApiListener: New client connection for identity 'mon-icingamaster-02.lxprod.obsfucated.customer.domain'
[2016-01-13 11:43:00 +0100] information/ApiListener: New client connection for identity 'mon-icingamaster-02.lxprod.obsfucated.customer.domain'
[2016-01-13 11:43:03 +0100] information/ApiListener: New client connection for identity 'mon-icingamaster-01.lxprod.obsfucated.customer.domain'
[2016-01-13 11:43:03 +0100] information/ApiListener: New client connection for identity 'mon-icingamaster-01.lxprod.obsfucated.customer.domain'
[2016-01-13 11:43:03 +0100] information/ApiListener: New client connection for identity 'mon-icingamaster-01.lxprod.obsfucated.customer.domain'

Does anyone have an idea what's going on here?

Best
Enrico

Changesets

2016-01-15 09:11:52 +00:00 by jflach cb70d97

Plug two memory leaks

refs #10963

2016-01-19 14:24:17 +00:00 by (unknown) d50c8e1

Improve debug support for analyzing memory leaks

refs #10963

2016-01-19 15:24:07 +00:00 by (unknown) b1aa6cc

Decrease memory usage for the Object class

refs #10963

2016-01-19 15:24:12 +00:00 by (unknown) e4b7111

Check the certificate name when reconnecting to an instance

refs #10963

2016-01-19 15:43:46 +00:00 by (unknown) db0c6ef

Only build leak detection code when I2_LEAK_DEBUG is set

refs #10963

2016-01-19 16:25:28 +00:00 by (unknown) 55f0c58

Skip log replay for endpoints with log_duration = 0

refs #10963

2016-01-20 13:07:07 +00:00 by (unknown) e48ed33

Add missing SetSyncing() call

refs #10963

2016-01-21 09:37:47 +00:00 by (unknown) 72c3b6d

Make sure we're not running command_endpoint-based checks more than once

refs #10963

2016-01-21 12:02:53 +00:00 by (unknown) 6d88d90

Remove redundant log messages

refs #10963

2016-01-21 15:37:52 +00:00 by (unknown) 6ca054e

Ensure that checks are not scheduled for command_endpoint fake hosts

refs #10963

2016-02-12 13:15:24 +00:00 by mfriedrich 04a4049

Increase query queue size for testing

refs #10963

2016-02-16 12:08:21 +00:00 by (unknown) 9e9298f

Add -pthread to build flags

refs #10963

2016-02-23 08:57:40 +00:00 by jflach e80b335

Plug two memory leaks

refs #10963

2016-02-23 08:57:49 +00:00 by (unknown) abfacd9

Improve debug support for analyzing memory leaks

refs #10963

2016-02-23 09:46:13 +00:00 by (unknown) badeea7

Decrease memory usage for the Object class

refs #10963

2016-02-23 09:46:17 +00:00 by (unknown) b227dc7

Check the certificate name when reconnecting to an instance

refs #10963

2016-02-23 09:46:17 +00:00 by (unknown) 087ad3f

Only build leak detection code when I2_LEAK_DEBUG is set

refs #10963

2016-02-23 09:46:17 +00:00 by (unknown) 3cfa871

Skip log replay for endpoints with log_duration = 0

refs #10963

2016-02-23 09:46:18 +00:00 by (unknown) 80fdccc

Add missing SetSyncing() call

refs #10963

2016-02-23 09:46:18 +00:00 by (unknown) 7985e93

Make sure we're not running command_endpoint-based checks more than once

refs #10963

2016-02-23 09:46:18 +00:00 by (unknown) c415dd3

Remove redundant log messages

refs #10963

2016-02-23 09:46:18 +00:00 by (unknown) fc90265

Ensure that checks are not scheduled for command_endpoint fake hosts

refs #10963

2016-02-23 09:46:19 +00:00 by mfriedrich f6378c9

Increase query queue size for testing

refs #10963

2016-02-23 09:46:19 +00:00 by (unknown) c998665

Add -pthread to build flags

refs #10963


Updated by mfriedrich on 2016-01-13 15:56:59 +00:00

  • Priority changed from Normal to Urgent
  • Target Version set to 2.4.2
  • Estimated Hours set to 48


Updated by elabedzki on 2016-01-14 09:07:43 +00:00

Our customer told me that the memory leak now also happens on the masters.


Updated by tgelf on 2016-01-14 18:25:57 +00:00

Blind guess: 957cf31

Line 54:

char *outbuf = new char[input.GetLength()];

Cheers,
Thomas


Updated by elabedzki on 2016-01-14 19:07:54 +00:00

Hi Tom,

tgelf wrote:

Blind guess: 957cf31

Line 54:

char *outbuf = new char[input.GetLength()];

Yes, I can't find a delete on that char buffer.

You found it.

Best
Enrico
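
To make the pattern concrete, here is a minimal, self-contained sketch of this kind of leak and the usual RAII fix. It is a simplified stand-in, not the actual Icinga 2 base64 code; std::string is used in place of Icinga's String class:

#include <string>
#include <vector>

// Simplified stand-in for the leaky pattern: the buffer allocated
// with new[] is never passed to delete[], so every call leaks
// input.length() bytes.
std::string EncodeLeaky(const std::string& input)
{
    char *outbuf = new char[input.length()];
    // ... encode into outbuf ...
    std::string result(outbuf, input.length());
    return result; // outbuf is leaked here
}

// RAII fix: let std::vector own the buffer so it is released on
// every exit path, including exceptions.
std::string EncodeFixed(const std::string& input)
{
    std::vector<char> outbuf(input.length());
    // ... encode into outbuf.data() ...
    return std::string(outbuf.data(), outbuf.size());
}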


Updated by mfriedrich on 2016-01-14 20:00:09 +00:00

Keep in mind that 1) the leak exists in 2.4.1 stable (that git commit is from master) and 2) Base64 is only used for REST API auth, which isn't enabled on clients.

I guess there are more possible leaks; Valgrind will hopefully unveil them.


Updated by elabedzki on 2016-01-14 20:02:41 +00:00

I am confident


Updated by tgelf on 2016-01-14 23:13:17 +00:00

Didn't know that we cultivate more of them ;) What about lib/base/tlsutility.cpp? RandomString seems to be missing a "delete [] bytes" if RAND_bytes succeeds. Not sure whether this happens often enough to result in a serious leak...
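
For illustration, a sketch of the shape of that bug, assuming the structure tgelf describes (a hypothetical simplification, not the actual lib/base/tlsutility.cpp): the buffer is freed on the error path but not on the success path.

#include <openssl/rand.h>
#include <iomanip>
#include <sstream>
#include <stdexcept>
#include <string>

std::string RandomStringSketch(int length)
{
    unsigned char *bytes = new unsigned char[length];

    if (!RAND_bytes(bytes, length)) {
        delete [] bytes; // error path releases the buffer...
        throw std::runtime_error("RAND_bytes() failed");
    }

    std::ostringstream msgbuf;
    for (int i = 0; i < length; i++)
        msgbuf << std::hex << std::setfill('0') << std::setw(2)
               << static_cast<int>(bytes[i]);

    delete [] bytes; // ...and this is the delete[] the success path was missing
    return msgbuf.str();
}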


Updated by tgelf on 2016-01-15 10:20:41 +00:00

Created a pull request for those mentioned above: #61


Updated by tgelf on 2016-01-15 10:30:05 +00:00

Race condition :D


Updated by jflach on 2016-01-15 11:42:32 +00:00

tgelf wrote:

Race condition :D

I won :D

We still have to test whether this fixes the issue.


Updated by tobiasvdk on 2016-01-15 15:56:46 +00:00

I think there is still a memory leak. Here is a diff between two pmap $(pidof icinga2):

$ diff pmap_icinga2_1452869816 pmap_icinga2_1452871984 -y | grep '>'
                                  > 00007fce2ff52000   1724K rw---   [ anon ]
                                  > 00007fce303b3000   1808K rw---   [ anon ]
                                  > 00007fce305bb000   2396K rw---   [ anon ]
                                  > 00007fce30812000   2928K rw---   [ anon ]
                                  > 00007fce30d1d000   1800K rw---   [ anon ]
                                  > 00007fce30f5a000   1456K rw---   [ anon ]
                                  > 00007fce31121000   1636K rw---   [ anon ]
                                  > 00007fce3134b000   3188K rw---   [ anon ]
                                  > 00007fce3173c000   1972K rw---   [ anon ]
                                  > 00007fce31b21000   2164K rw---   [ anon ]
                                  > 00007fce31f06000   2136K rw---   [ anon ]
                                  > 00007fce321fd000   2132K rw---   [ anon ]
                                  > 00007fce32479000   2496K rw---   [ anon ]
                                  > 00007fce326e9000   1268K rw---   [ anon ]
                                  > 00007fce3284c000   1520K rw---   [ anon ]
                                  > 00007fce32ade000   2008K rw---   [ anon ]
                                  > 00007fce32ec5000   2272K rw---   [ anon ]
                                  > 00007fce33136000   1736K rw---   [ anon ]
                                  > 00007fce332e8000   2384K rw---   [ anon ]
                                  > 00007fce33701000   2108K rw---   [ anon ]
                                  > 00007fce3392b000   1884K rw---   [ anon ]
                                  > 00007fce33b02000   2532K rw---   [ anon ]
                                  > 00007fce33f03000   1244K rw---   [ anon ]
                                  > 00007fce34050000   1316K rw---   [ anon ]
                                  > 00007fce3423a000   1760K rw---   [ anon ]
                                  > 00007fce3452d000   1660K rw---   [ anon ]
                                  > 00007fce34a60000   1920K rw---   [ anon ]
                                  > 00007fce34c64000   1364K rw---   [ anon ]
                                  > 00007fce3c69a000   1144K rw---   [ anon ]

$ sudo icinga2 --version
icinga2 - The Icinga 2 network monitoring daemon (version: r2.4.1-107-gcb70d97)
[...]


Updated by gbeutner on 2016-01-18 07:11:41 +00:00

While these leaks are definitely bugs, RandomString isn't used in any code paths that are reachable via the 'daemon' CLI command. Also, the changes for the base64 functions weren't introduced until after 2.4.1 was released.


Updated by itbess on 2016-01-18 20:18:14 +00:00

We are running into the same problem on 2.3.11.
We have a cluster setup with 5 machines:
  • 1 database
  • 1 Icinga2 master with all the config << this one has the memory leak
  • 2 Icinga2 checkers
  • 1 Icingaweb2 server

In pmap, over time it generates more and more of these:

00007f85b4000000 65536K 60512K 60512K 60512K 5024K rw-p [anon]
00007f85b8000000 65536K 62440K 62440K 62440K 3096K rw-p [anon]
00007f85bc000000 131072K 61412K 61412K 61412K 4124K rw-p [anon]
00007f85c4000000 131072K 123256K 123256K 123256K 7816K rw-p [anon]
00007f85cc000000 131072K 123548K 123548K 123548K 7524K rw-p [anon]
00007f85d4000000 65536K 122936K 122936K 122936K 8136K rw-p [anon]

fitl08v232:~ # icinga2 --version
icinga2 - The Icinga 2 network monitoring daemon (version: v2.3.11)


Updated by gbeutner on 2016-01-19 15:26:13 +00:00

  • Status changed from New to Assigned
  • Assigned to set to gbeutner


Updated by tgelf on 2016-01-20 12:45:28 +00:00

Hi Tobias!

tobiasvdk wrote:

I think there is still a memory leak. Here is a diff between two pmap $(pidof icinga2):

Could you please give the latest snapshot (version: v2.4.1-116-g55f0c58, commit: 55f0c58) a try? We are not sure why, but that one mitigated the problems for us - no more memory leak, much less CPU load.

Cheers,
Thomas


Updated by tgelf on 2016-01-20 13:08:15 +00:00

STOP :) In case you are using the Icinga agent, you'd better wait a little bit - the current master introduced another issue, which will be fixed immediately...


Updated by tobiasvdk on 2016-01-21 15:24:53 +00:00

Still leaking with r2.4.1-123-g72c3b6d:

tvonderkrone@fkb-icinga2:~$ diff pmap_icinga2_1453389328 pmap_icinga2_1453389795 -y | grep '>'
                                  > 00007fc249eba000   6064K rw---   [ anon ]
                                  > 00007fc24a721000   6292K rw---   [ anon ]
                                  > 00007fc24b955000   6828K rw---   [ anon ]
                                  > 00007fc26027b000   6880K rw---   [ anon ]
                                  > 00007fc260d66000   5828K rw---   [ anon ]
                                  > 00007fc2613de000   1372K rw---   [ anon ]
                                  > 00007fc2615b5000   4136K rw---   [ anon ]


Updated by tobiasvdk on 2016-01-22 08:53:20 +00:00

  • File added icinga2_memory_r2.4.1-123-g72c3b6d.png

Maybe I was too fast yesterday because it was just after the restart. Here is a graph:
icinga2_memory_r2.4.1-123-g72c3b6d.png

$ free -m
             total       used       free     shared    buffers     cached
Mem:          3012       2934         77        261         47        432
-/+ buffers/cache:       2454        558
Swap:         2107        464       1643


Updated by mfriedrich on 2016-01-22 14:57:56 +00:00

  • Relates set to 10758


Updated by mfriedrich on 2016-01-22 16:31:21 +00:00

  • Relates set to 7287


Updated by gbeutner on 2016-02-04 12:00:35 +00:00

@tobiasvdk: Can you retest this with the latest snapshot?


Updated by tobiasvdk on 2016-02-04 12:21:56 +00:00

  • File added icinga2_10963.png

Only my master is leaking; the satellites are OK. In our config, only the satellites connect to the master.
icinga2_10963.png

$ sudo icinga2 --version
icinga2 - The Icinga 2 network monitoring daemon (version: r2.4.1-167-gcaf3380)

$ diff -y icinga2_pmap_1454573034 icinga2_pmap_1454586995 | grep '|'
00007ff634000000  52872K rw---   [ anon ]             | 00007ff634000000  65536K rw---   [ anon ]
00007ff644000000  18092K rw---   [ anon ]             | 00007ff644000000  65268K rw---   [ anon ]
00007ff6451ab000  47444K -----   [ anon ]             | 00007ff647fbd000    268K -----   [ anon ]
00007ff67c000000  54200K rw---   [ anon ]             | 00007ff67c000000  54212K rw---   [ anon ]
00007ff67f4ee000  11336K -----   [ anon ]             | 00007ff67f4f1000  11324K -----   [ anon ]
00007ff68c000000   1772K rw---   [ anon ]             | 00007ff68c000000  65536K rw---   [ anon ]
00007ff694000000   3672K rw---   [ anon ]             | 00007ff694000000  20260K rw---   [ anon ]
00007ff694396000  61864K -----   [ anon ]             | 00007ff6953c9000  45276K -----   [ anon ]
00007ff6a8000000    568K rw---   [ anon ]             | 00007ff6a8000000  65536K rw---   [ anon ]
00007ff6a808e000  64968K -----   [ anon ]             | 00007ff6ac000000   2368K rw---   [ anon ]
00007ff6ac000000   2332K rw---   [ anon ]             | 00007ff6ac250000  63168K -----   [ anon ]
00007ff6b8000000  11108K rw---   [ anon ]             | 00007ff6b8000000  65536K rw---   [ anon ]
00007ff6b8ad9000  54428K -----   [ anon ]             | 00007ff6bc000000  65536K rw---   [ anon ]
00007ff6c8000000  43724K rw---   [ anon ]             | 00007ff6c8000000  65536K rw---   [ anon ]
00007ff6caab3000  21812K -----   [ anon ]             | 00007ff6cc000000   2352K rw---   [ anon ]
00007ff6cc000000    392K rw---   [ anon ]             | 00007ff6cc24c000  63184K -----   [ anon ]
00007ff6d4000000  48516K rw---   [ anon ]             | 00007ff6d4000000  65536K rw---   [ anon ]
00007ff708000000   1988K rw---   [ anon ]             | 00007ff708000000   2288K rw---   [ anon ]
00007ff7081f1000  63548K -----   [ anon ]             | 00007ff70823c000  63248K -----   [ anon ]
00007ff714000000   1876K rw---   [ anon ]             | 00007ff714000000   2344K rw---   [ anon ]
00007ff7141d5000  63660K -----   [ anon ]             | 00007ff71424a000  63192K -----   [ anon ]
00007ff73c000000   1996K rw---   [ anon ]             | 00007ff73c000000   2316K rw---   [ anon ]
00007ff73c1f3000  63540K -----   [ anon ]             | 00007ff73c243000  63220K -----   [ anon ]
 total          4731524K                      |  total          7818132K

@Shroud: should I run some gdb commands?


Updated by tobiasvdk on 2016-02-04 13:04:43 +00:00

Maybe it's because the database currently cannot handle the load:

Query queue items: 500000, query rate: 1814.53/s (108872/min 542683/5min 1633198/15min); empty in infinite time, your database isn't able to keep up

I will deactivate the ido feature and test again.


Updated by tobiasvdk on 2016-02-04 15:02:46 +00:00

  • File added icinga2_10963_noido.png

Ok looks good:
icinga2_10963_noido.png

So the memory "leak" is maybe the check results that are queuing up because the database cannot write fast enough. Are the check results queued in memory?


Updated by mfriedrich on 2016-02-04 15:20:35 +00:00

The query queue holds all the remaining updates; that would explain your problem.


Updated by tobiasvdk on 2016-02-04 20:29:16 +00:00

dnsmichi wrote:

The query queue holds all the remaining updates, that would explain your problem.

But the queue has a length of 500000, which was already reached. Where are the other results being held? I need to have a look at the code.


Updated by mfriedrich on 2016-02-04 22:08:05 +00:00

  • Relates set to 11096


Updated by gbeutner on 2016-02-10 07:10:33 +00:00

tobiasvdk: Once the WorkQueue's size limit is reached, the Enqueue() method blocks, which generally means other parts of Icinga become unresponsive. I'm not really happy with this behavior, but there really are only a few options (see the sketch after this list):

  1. Block Enqueue() calls until there is more capacity in the queue. That's what we're doing right now.
  2. Remove the queue size limit. Assuming your database isn't able to catch up with the queries you'll end up with OOM at some point.
  3. Throw away (some?) of the tasks that are being enqueued. This would really only be possible for certain types of tasks like status updates.
  4. De-duplicate queries (e.g. status updates for the same host/service).
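
To illustrate option 1 (the current behaviour), a minimal sketch of a bounded queue whose Enqueue() blocks the producer once the limit is reached; the names are hypothetical, this is not the actual WorkQueue implementation:

#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>

// Hypothetical sketch of a bounded work queue: Enqueue() blocks
// once maxItems is reached, so a slow consumer (e.g. the IDO
// database thread) stalls every producer that feeds it.
class BoundedQueueSketch
{
public:
    explicit BoundedQueueSketch(std::size_t maxItems) : m_MaxItems(maxItems) { }

    void Enqueue(std::function<void()> task)
    {
        std::unique_lock<std::mutex> lock(m_Mutex);
        // The producer waits here until the consumer catches up -
        // this is what makes other parts of the daemon unresponsive.
        m_NotFull.wait(lock, [this]() { return m_Items.size() < m_MaxItems; });
        m_Items.push_back(std::move(task));
        m_NotEmpty.notify_one();
    }

    std::function<void()> Dequeue()
    {
        std::unique_lock<std::mutex> lock(m_Mutex);
        m_NotEmpty.wait(lock, [this]() { return !m_Items.empty(); });
        std::function<void()> task = std::move(m_Items.front());
        m_Items.pop_front();
        m_NotFull.notify_one();
        return task;
    }

private:
    std::size_t m_MaxItems;
    std::mutex m_Mutex;
    std::condition_variable m_NotFull;
    std::condition_variable m_NotEmpty;
    std::deque<std::function<void()>> m_Items;
};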


Updated by tobiasvdk on 2016-02-10 15:23:37 +00:00

gunnarbeutner wrote:

tobiasvdk: Once the WorkQueue's size limit is reached the Enqueue() method blocks - which generally means other parts of Icinga become unresponsive. I'm not really happy with this behavior but there really are only a few options:

  1. Block Enqueue() calls until there is more capacity in the queue. That's what we're doing right now.
    In git master or the last (stable) release? Because this would explain why I saw the "no messages for 60 seconds" cluster log messages. But now (with the version from git master) I don't see them anymore.
  2. Remove the queue size limit. Assuming your database isn't able to catch up with the queries you'll end up with OOM at some point.
    That's what I have now (with the git master version).
  3. Throw away (some?) of the tasks that are being enqueued. This would really only be possible for certain types of tasks like status updates.
    Yepp, couldn't find a ticket here.
  4. De-duplicate queries (e.g. status updates for the same host/service).
    Yepp, #10822 (see the sketch below)

Allowing multiple database connections (#10953) would also be good ;)
Another idea would be to store the enqueued tasks on disk, although this would move the problem from RAM to disk. It would also be nice if the "ido" check command returned a critical state when the database cannot keep up, so the user is (more) aware of the problem.
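
As a rough illustration of the de-duplication idea (option 4 / #10822), a hypothetical sketch that coalesces queued status updates so only the most recent update per host/service survives while the database is behind - not the actual IDO code:

#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical sketch: status updates are keyed by object name, so a
// newer update for the same host/service overwrites the queued older
// one instead of piling up behind a slow database.
class DedupStatusQueueSketch
{
public:
    void EnqueueStatusUpdate(const std::string& objectName,
                             std::function<void()> query)
    {
        std::lock_guard<std::mutex> lock(m_Mutex);
        m_Pending[objectName] = std::move(query); // replaces any stale update
    }

    std::size_t GetPendingCount() const
    {
        std::lock_guard<std::mutex> lock(m_Mutex);
        return m_Pending.size();
    }

private:
    mutable std::mutex m_Mutex;
    std::unordered_map<std::string, std::function<void()>> m_Pending;
};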


Updated by vytenis on 2016-02-17 14:50:29 +00:00

We also noticed very bad behaviour with the IDO queue and had to bump its size to make our setup work, as 500k was not nearly enough (SSDs + MySQL tuning alone are not sufficient for 100k+ object setups) - see #10731. While blocking on Enqueue() does not lead to a hard freeze as it used to back in 2.4.0, it will still happen eventually if the DB cannot keep up beyond the initial query load. Naturally, the queries take up a LOT more RAM than Icinga itself requires. :) TBH, the IDO could be a lot more efficient - there are something like ~10 queries per monitored object that have to be executed - though the recent changes in git master really reduced the runtime load, especially if you do not care about history.
However, even something as small as a missing index would make Icinga2 startup impossible, as it would eventually lock up, as described in #11133.


Updated by gbeutner on 2016-02-23 09:59:37 +00:00

  • Status changed from Assigned to Resolved


Updated by gbeutner on 2016-02-23 09:59:53 +00:00

  • Backport? changed from Not yet backported to Already backported

@icinga-migration icinga-migration added Urgent bug Something isn't working area/distributed Distributed monitoring (master, satellites, clients) labels Jan 17, 2017
@icinga-migration icinga-migration added this to the 2.4.2 milestone Jan 17, 2017