[dev.icinga.com #8670] Reloads causing high load #2725

Closed
icinga-migration opened this issue Mar 10, 2015 · 24 comments
Labels
bug Something isn't working

Comments

This issue has been migrated from Redmine: https://dev.icinga.com/issues/8670

Created by gtorre on 2015-03-10 15:15:02 +00:00

Assignee: (none)
Status: Closed (closed on 2016-08-08 15:49:54 +00:00)
Target Version: (none)
Last Update: 2016-11-09 14:59:54 +00:00 (in Redmine)

Icinga Version: 2.2.4
Backport?: Not yet backported
Include in Changelog: 1

We recently upgraded to 2.2.4, and shortly after we started seeing extremely high load (load average: 527.16, 277.33, 173.62) when Icinga2 reloads. There are about 4,200 nodes in production, with about 94,000 checks running. We have the following Icinga2 features enabled: checker, command, compatlog, icingastatus, levitates, and notification.
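
For reference (not part of the original report): the exact set of enabled features on a node can be confirmed with the Icinga 2 CLI that ships with 2.2 and later, and the configuration can be validated without touching the running daemon. A minimal sketch:

# Show which features are currently enabled and which are disabled
icinga2 feature list

# Validate the configuration without reloading the running daemon
icinga2 daemon -C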

Updated by mfriedrich on 2015-03-10 15:25:05 +00:00

  • Status changed from New to Feedback
  • Assigned to set to gtorre

What's "levitates"?

Apart from that, please attach logs showing what your monitoring core is doing at that time, plus any additional information that would help someone outside your environment understand your issue. With so little information provided, I would rather start a discussion in the community support channels.

Furthermore, please re-test this against 2.3.0.
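
A minimal sketch (not part of this comment) of gathering that information, assuming the stock debuglog feature and the default log paths under /var/log/icinga2:

# Enable the high-volume debug log, then reload so it takes effect
icinga2 feature enable debuglog
/etc/init.d/icinga2 reload

# After the next reload spike, collect the relevant log windows for the report
tail -n 5000 /var/log/icinga2/icinga2.log /var/log/icinga2/debug.log > /tmp/icinga2-reload-logs.txt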

Updated by TheSerapher on 2015-03-11 08:13:38 +00:00

Hey *,

We are also seeing this on our current test setup. We have had 2.2.4 running for a while and have been seeing these incredible load spikes with reloads of the core. Now with 2.3 this is still happening.

What we discovered so far:

  • Nodes that reload start running all available checks, causing over 1,000 running processes (24,000 service checks total on 1,900 hosts)
  • Nodes run these checks for a while, keeping the load very high
  • Node load then goes down and the node resumes running as a regular checker

If you need more details, please give some ideas on which logs to attach and whether you need the debug log enabled (warning: very high volume!). We'd like to see this fixed, because more often than not a high-load node does not recover properly and does not run any checks at all. Restarting Icinga2 can fix this if the high-load phase passes properly.

EDIT: Here is an overview of a partially recovered cluster of checkers that have been reloaded. Some nodes show the expected load, but one (icinga2-checker-1.dc1) has no load at all and does not run any checks:

$ knife ssh "roles:icinga2_checker_preview" "uptime"
icinga2-checker-2.dc2 09:26:14 up 4 days, 19:01,  1 user,  load average: 1.47, 1.88, 1.77
icinga2-checker-1.dc1 09:26:14 up 6 days, 22:41,  1 user,  load average: 0.01, 0.03, 0.05
icinga2-checker-1.dc2 09:26:14 up 6 days, 22:41,  2 users,  load average: 1.73, 1.48, 1.61
icinga2-checker-2.dc1 09:26:14 up 4 days, 19:12,  1 user,  load average: 3.09, 3.05, 3.27

The log of that node also shows no more output since the process got stuck. Restarting the node causes high load again, with a lot of checks running at the same time (see attached proc_ps.txt).
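
Not from the thread itself: for a stuck process like this, a full thread backtrace is usually the most useful artifact (presumably how the gdb.txt attached below was produced, though that is an assumption). A minimal sketch, assuming gdb and the icinga2 debug symbols are installed; replace <pid> with the PID of the hung icinga2 process from ps afx:

# Attach once, dump all thread backtraces, and detach again
gdb -batch -p <pid> -ex "thread apply all bt full" > /tmp/icinga2-backtrace.txt 2>&1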

Updated by TheSerapher on 2015-03-11 08:36:38 +00:00

  • File added proc_ps.txt

File attachment for process list.

Updated by gtorre on 2015-03-11 14:39:08 +00:00

@dnsmichi that was an auto-correct, I meant "livestatus".

As TheSerapher mentioned, our logs are very verbose and I've analyzed them carefully... I didn't see anything that stood out. I will say that during the periods of high load, the node is still very responsive and there doesn't seem to be any negative impact on the checks, etc. I'm hesitant to upgrade to 2.3.0 since TheSerapher hasn't had any luck with that version.

Updated by mfriedrich on 2015-03-11 14:42:52 +00:00

Hm, an idea I had during sleep: the reload spawns a new child that does the configuration validation and then takes over. If that reload period takes longer (seconds to minutes in large environments) while the "normal" checks and other events are still being fired by the parent, I could imagine that this causes load peaks. Furthermore, the configuration validation has been improved in 2.3 and is now faster (much more parallel, but also using more resources).

It would probably help to see graphs where load peaks and reload times are put together.
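
A minimal sketch of collecting exactly that correlation without a graphing stack (my suggestion, not dnsmichi's; the output path and the process-name patterns are assumptions): sample the load and the number of check processes every few seconds, then line the timestamps up with the reload entries in icinga2.log afterwards.

#!/bin/sh
# Sample load average and check-process counts every 5 seconds.
while true; do
    printf '%s load=%s checks=%s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" \
        "$(cut -d ' ' -f 1-3 /proc/loadavg)" \
        "$(ps ax | grep -c '[p]erl\|[p]ing\|[i]cmp')"
    sleep 5
done >> /var/log/icinga2-load-samples.log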

Updated by TheSerapher on 2015-03-11 14:55:03 +00:00

  • File added gdb.txt

dnsmichi wrote:

Hm, an idea I had during sleep: the reload spawns a new child that does the configuration validation and then takes over. If that reload period takes longer (seconds to minutes in large environments) while the "normal" checks and other events are still being fired by the parent, I could imagine that this causes load peaks. Furthermore, the configuration validation has been improved in 2.3 and is now faster (much more parallel, but also using more resources).

It would probably help to see graphs where load peaks and reload times are put together.

That is interesting. So from what I read, the main process keeps running as before and executes the regular checks while a new child is doing the validation of the config and then takes over from the parent?

I have yet another checker that just seems to have stopped working (no load and no checks fired). Its log only shows:

[2015-03-11 14:41:10 +0100] information/Checkable: Notifications are disabled for service 'elastic-proxy-4.search.fra1!chef-client-status'.

And nothing after that. I attached the gdb.txt for this node.

Now let's check the load figures:

# uptime
 15:49:07 up 7 days,  5:04,  1 user,  load average: 0.11, 0.08, 2.79

No load at all, as expected, since this node isn't doing anything anymore. According to your idea, this node should continue to have no load: even though we are spawning a new process, the checks have stopped in the parent:

# /etc/init.d/icinga2 reload
 * checking Icinga2 configuration                                                                                                                                             [ OK ]
 * Reloading icinga2 monitoring daemon icinga2
# ps afx
[...]
63145 ?        SNsl   1:05 /usr/sbin/icinga2 --no-stack-rlimit daemon -d -e /var/log/icinga2/icinga2.err --reload-internal 63055
62897 ?        SNl    0:14  \_ /usr/sbin/icinga2 --no-stack-rlimit daemon -d -e /var/log/icinga2/icinga2.err --reload-internal 63145

The reload indeed took a few seconds, with the new process sitting at 100% CPU usage for a few seconds. But since the parent was not running any checks, we did not see any change in load. Good!

Now to restart this node. We kill the process and start it again:

# /etc/init.d/icinga2 restart
 * checking Icinga2 configuration                                                                                                                                             [ OK ]
[...]
# uptime
 15:52:43 up 7 days,  5:08,  1 user,  load average: 565.44, 213.32, 79.40
# ps afx | grep -c perl
435
# ps afx | grep -c icmp
8
# ps afx | grep -c ping
569

Instant spike in load! Not only that, but we can also see a lot of checks being run simultaneously again. I am not sure if this is intended (flush the state cache and re-load service and host states by checking them all?), but I'd think that a checker should only run checks when triggered by a master via the API? Or is our (single) master sending all these requests to run checks?

If you need any more details let me know.

EDIT: Load is now back to normal and checks are running:

# uptime
 16:03:27 up 7 days,  5:18,  1 user,  load average: 2.30, 61.39, 78.90
# ps afx | grep -c perl
16
# ps afx | grep -c icmp
1
# ps afx | grep -c ping
84

Updated by gtorre on 2015-03-11 14:55:37 +00:00

  • File added reload_icinga.log
  • File added Screenshot 2015-03-11 10.46.31.png

We run the Icinga reload from a cron job every 10 minutes. I've attached logs showing when the reload script kicked off. Note that I only provided the timestamps when Icinga reloaded and there were configuration changes in our repo.
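
A minimal sketch of what such a cron-driven reload can look like when it only fires on an actual configuration change and a successful validation (my own sketch, not your script; the paths and the checksum file are assumptions):

#!/bin/sh
# Hypothetical /usr/local/bin/icinga2-safe-reload, run from cron every 10 minutes.
# Reload only if the configuration changed since the last run and still validates.
STAMP=/var/run/icinga2-config.sha1

NEW=$(find /etc/icinga2 -type f -name '*.conf' -print0 | sort -z | xargs -0 sha1sum | sha1sum)
OLD=$(cat "$STAMP" 2>/dev/null)

if [ "$NEW" != "$OLD" ]; then
    if icinga2 daemon -C > /var/log/icinga2/validate.log 2>&1; then
        /etc/init.d/icinga2 reload
        echo "$NEW" > "$STAMP"
    fi
fi

Scheduled via something like "*/10 * * * * root /usr/local/bin/icinga2-safe-reload" in /etc/cron.d/, this avoids reload spikes on runs where nothing changed.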

Updated by TheSerapher on 2015-03-11 15:10:09 +00:00

Small update: all nodes seem to be running normally right now:

± knife ssh "role:icinga2_checker_preview" -x root "uptime"
icinga2-checker-1.dc2  16:08:42 up 7 days,  5:24,  1 user,  load average: 2.41, 22.84, 56.85
icinga2-checker-2.dc2  16:08:42 up 5 days,  1:44,  1 user,  load average: 1.62, 9.20, 24.53
icinga2-checker-2.dc1  16:08:42 up 5 days,  1:55,  1 user,  load average: 3.73, 47.44, 63.14
icinga2-checker-1.dc1  16:08:42 up 7 days,  5:23,  2 users,  load average: 3.95, 9.82, 16.41
± knife ssh "role:icinga2_checker_preview" -x root "ps afx | grep -c perl"
icinga2-checker-2.dc2 16
icinga2-checker-1.dc1 39
icinga2-checker-1.dc2 24
icinga2-checker-2.dc1 27
± knife ssh "role:icinga2_checker_preview" -x root "ps afx | grep -c icmp"
icinga2-checker-2.dc1 2
icinga2-checker-1.dc1 2
icinga2-checker-2.dc2 2
icinga2-checker-1.dc2 2
± knife ssh "role:icinga2_checker_preview" -x root "ps afx | grep -c ping"
icinga2-checker-1.dc1 153
icinga2-checker-2.dc1 167
icinga2-checker-2.dc2 66
icinga2-checker-1.dc2 93

Updated by TheSerapher on 2015-03-12 14:42:08 +00:00

We have just deployed one checker to a beefy hardware machine (dual hexa-core with Hyper-Threading, 96 GB RAM) to see how a more powerful machine behaves. We are not seeing the load spikes as much. Baseline load is about 0.3 when running regular checks; when restarting Icinga2, the load goes up to 4.0. I would say that a lot of cores help here, but it's not the solution.

Scratch my previous comment, it happened on the hardware box too. We stopped the Icinga2 core and started it again, and the system went to 100% CPU usage on all 24 cores. Load spiked to 70 but was, obviously, dealt with much faster than on a 4-core node. Then I tried deleting the icinga2.state file by stopping the service, removing the file and starting it again. Same behaviour (which I rather expected). Then I stopped the service, without touching the state file, and started it again. Same thing.

For starters, it would be good to know why Icinga spawns that many processes when starting up, even though there seems to be no reason to do that on a checker, which should wait for the master to tell it what to check.

Edit: Still seen on 2.3.1/2.3.2
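
Not an answer from this thread, but one knob that directly caps how many checks a node may execute at once is the checker's concurrency setting. A minimal sketch for the 2.x versions discussed here, where the CheckerComponent object still accepted a concurrent_checks attribute (newer releases moved this to a global MaxConcurrentChecks constant); the value is purely illustrative:

# Overwrite the checker feature config to cap parallel check execution,
# then validate and reload
cat > /etc/icinga2/features-available/checker.conf <<'EOF'
object CheckerComponent "checker" {
  concurrent_checks = 128   /* illustrative value, tune to the node's capacity */
}
EOF
icinga2 daemon -C && /etc/init.d/icinga2 reload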

Updated by gtorre on 2015-03-12 15:50:26 +00:00

So there might have been something external causing this high load... we have no idea. Someone might have been running something against the box which caused this. We haven't seen the crazy spikes since my last post. If it happens again, I'll update the thread.

Updated by TheSerapher on 2015-03-13 08:56:18 +00:00

It's nothing external for us; those loads happen with pretty much every reload and restart.

Updated by TheSerapher on 2015-03-16 12:24:15 +00:00

I have checked whether the checker nodes are sending any of these results to the master, but it doesn't look like they do. Once startup is done and the regular checks start running, we can see results coming in to the master and being stored in the IDO.

I think it would be good if there were a way for a checker node not to produce these excessive checks during startup. If you need any further help working on this, let me know; I can run snapshots with live data.
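
A minimal sketch (my own, assuming the MySQL IDO backend with its default database name "icinga" and credentials available via ~/.my.cnf) of cross-checking on the master whether check results actually arrive during a given window:

# How many service states were updated in the IDO during the last 5 minutes?
mysql icinga -e "SELECT COUNT(*) FROM icinga_servicestatus WHERE status_update_time > NOW() - INTERVAL 5 MINUTE;"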

Updated by mfriedrich on 2015-03-19 10:01:04 +00:00

  • Status changed from Feedback to Assigned
  • Assigned to changed from gtorre to mfriedrich

Updated by mfriedrich on 2015-06-23 13:26:17 +00:00

  • Target Version set to Backlog

Updated by mfriedrich on 2015-08-06 20:34:37 +00:00

  • Category set to libbase

Updated by gtorre on 2016-01-21 15:12:58 +00:00

We've been experiencing the same issue for the last few months. Is there anything we can do to help with the investigation process?

Updated by mfriedrich on 2016-02-25 00:23:15 +00:00

Re-test with 2.4.3.

Updated by mnardin on 2016-02-29 17:27:09 +00:00

I can confirm this behavior as well, running 2.4.3.

Updated by mfriedrich on 2016-03-04 15:30:34 +00:00

  • Parent Id set to 11310

Updated by kunsjef on 2016-04-20 10:37:56 +00:00

We have the same issue in version r2.4.4-1: one master, 4 checkers. All checkers have 8 CPU cores and 16 GB RAM. We have 2,500 hosts and 8,000 services. All checkers spike when we reload, and the load is in the thousands before the checkers start crashing. We can't even start Icinga2 with all our hosts monitored. If we reduce the number to 600-700 hosts, the spikes last just a short while and then the load normalizes.

Updated by gbeutner on 2016-05-11 06:46:33 +00:00

  • Status changed from Assigned to Feedback
  • Assigned to changed from mfriedrich to gtorre

This might have been solved by fixing #8137. Please re-test with the current master.

Updated by mfriedrich on 2016-08-08 15:49:54 +00:00

  • Status changed from Feedback to Closed
  • Assigned to deleted gtorre
  • Target Version deleted Backlog

Please re-test with the current snapshot packages and/or the release packages once 2.5 is finally released, although I do believe that #8137 should already tackle your issue.

Updated by saurabh_hirani on 2016-08-10 09:14:27 +00:00

Thanks for the productive discussion on this issue, everyone. I am seeing similar scenarios: when we do an icinga2 reload, the validation processes cause a major load spike (load going from 2 to 11 on a 12-core machine). I will keep this thread posted with observations and steps taken.
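
A minimal sketch (my suggestion, not something from this issue) for quantifying the validation cost in isolation, so the observations can include CPU time and peak memory; it assumes GNU time is installed as /usr/bin/time:

# Run the validation alone, at low priority, and record its resource usage
/usr/bin/time -v nice -n 19 icinga2 daemon -C > /dev/null

Note that a reload still re-runs the validation in the spawned child, so this only measures (and softens) the external pre-check, not the spike caused by the reload itself.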

Updated by mfriedrich on 2016-11-09 14:59:54 +00:00

  • Parent Id deleted 11310

icinga-migration added the "bug" and "libbase" labels on Jan 17, 2017