[dev.icinga.com #2617] status.cgi time out when displaying hostgroups in large environments #976
Comments
Updated by mzac on 2012-05-16 16:14:34 +00:00 We just downgraded back to 1.6.1 and the problem has gone away, so this looks like a bug in 1.7.0. |
Updated by mfriedrich on 2012-05-16 16:27:51 +00:00
i can't reproduce that on my rhel 5.8 rpm build testbox. i tried both
status.cgi?hostgroup=all&style=hostdetail&hoststatustypes=12&hostprops=0
status.cgi?host=all&type=detail&servicestatustypes=16&hoststatustypes=3&serviceprops=2097162&nostatusheader
but the logs are clean. can you please debug the cgis on the shell, as described here? |
Updated by mzac on 2012-05-16 16:46:07 +00:00
Gets stuck there and then after about a minute spits out the rest of the page correctly
|
Updated by mfriedrich on 2012-05-16 16:55:29 +00:00 cgi.cfg settings? |
Updated by mfriedrich on 2012-05-16 16:56:03 +00:00
|
Updated by mzac on 2012-05-16 17:00:46 +00:00 Here is my cgi.cfg. When I run status.cgi from 1.6.1 it runs right away, but 1.7.0 hangs. Are there any gdb commands you can give me to see what's going on when it hangs?
|
Updated by mfriedrich on 2012-05-16 17:07:50 +00:00 i was hoping to see a sigsegv, which could be a cause for "premature end of script headers". but timeouts cannot be traced with gdb. strace might work better, or valgrind. maybe the cgi leaks memory in a loop, which causes everything else to suffer. |
Updated by mfriedrich on 2012-05-16 17:09:40 +00:00 what os/distribution is that and which package source for icinga? |
Updated by mzac on 2012-05-16 17:10:46 +00:00 Here is the strace from where it starts to hang until it starts up again. It hangs during all those stat calls on /etc/localtime.
|
Updated by mzac on 2012-05-16 17:20:07 +00:00 Compared to strace on 1.6.1:
|
Updated by mfriedrich on 2012-05-16 17:24:54 +00:00 hm, a shot in the dark: the javascript comparison that decides when to actually trigger the refresh on the page may be causing a huge number of (possibly unsafe) localtime calls. could you try changing the refresh method back to http headers with the new cgi.cfg setting?
so refresh_type=0 with 1.7 then. maybe this is the root cause? |
Updated by mzac on 2012-05-16 17:28:37 +00:00 No difference when setting refresh_type=0, same thing :(
|
Updated by mfriedrich on 2012-05-16 17:29:20 +00:00
|
Updated by mzac on 2012-05-16 17:34:59 +00:00 Sorry didn't see this one: Red Hat Enterprise Linux Server release 6.2 (Santiago)
We built the RPM from source.
|
Updated by mfriedrich on 2012-05-16 17:37:13 +00:00 pfuh. it would be interesting if we could get a static copy of your status.dat and objects.cache and test our cgis against that "real" data. if possible, mail them gzipped to michael.friedrich (at) univie.ac.at. i might forward them to ricardo as well, confidential as usual. tomorrow is a holiday here, but one of us might have time to debug further. |
Updated by mzac on 2012-05-16 17:45:36 +00:00 I'll check with my manager and if it's ok I'll send it off to you. I might have to zap out some lines in them of course.
|
Updated by mfriedrich on 2012-05-16 20:13:16 +00:00 steps to reproduce:
$ cd path/to/icinga/icinga-core
put the received status.dat and objects.cache in there. then add 2 files
now tell the cgi the changed cgi.cfg location plus the other env vars, and run it.
|
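The reproduction steps above (point the CGI at a changed cgi.cfg, set the other environment variables, run it) can be sketched in Python as follows. This is only an illustration under assumptions: `REQUEST_METHOD`, `QUERY_STRING` and `REMOTE_USER` are standard CGI variables, but `ICINGA_CGI_CONFIG` is my assumption for the variable that overrides the cgi.cfg location, and the `icingaadmin` user and both helper names are hypothetical.

```python
import os
import subprocess
import time


def build_cgi_env(query_string, cgi_cfg, remote_user="icingaadmin"):
    """Build the environment a web server would normally hand to the CGI."""
    env = dict(os.environ)
    env.update({
        "REQUEST_METHOD": "GET",       # standard CGI variable
        "QUERY_STRING": query_string,  # everything after the '?' in the URL
        "REMOTE_USER": remote_user,    # hypothetical authenticated user
        "ICINGA_CGI_CONFIG": cgi_cfg,  # assumed name of the cgi.cfg override
    })
    return env


def time_cgi(binary, env):
    """Run the CGI once and return (elapsed seconds, raw stdout bytes)."""
    start = time.perf_counter()
    result = subprocess.run([binary], env=env, stdout=subprocess.PIPE, check=False)
    return time.perf_counter() - start, result.stdout
```

Calling `time_cgi("./status.cgi", build_cgi_env("hostgroup=all&style=hostdetail&hoststatustypes=12&hostprops=0", "./cgi.cfg"))` with both paths adjusted to the local tree would then give a rough wall-clock figure to compare between 1.6.1 and 1.7.0.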
Updated by mfriedrich on 2012-05-16 20:16:57 +00:00 so indeed, it hangs when run under strace
then it hogs /etc/localtime after a while, in 5 to 30 sec intervals.
after ~2min, it's done with the privately provided status.dat and objects.cache. |
Updated by mfriedrich on 2012-05-16 20:26:17 +00:00 some strace stats with -CT
plus timing -CTttt
performance-wise, status.cgi fully consumes one of my four cores. start: 1337199809, 99 seconds (!) |
Updated by ricardo on 2012-05-16 21:57:53 +00:00 Hi, can you switch "show_partial_hostgroups" to 1 and try again? Thank you. |
Updated by ricardo on 2012-05-16 23:09:11 +00:00 hi, just to let you know. found the problem. mzac it's your insane amount of hosts in hostgroups. Just delete some hosts and you are fine! Just kidding! The problem is in checking if a certain host belongs to a hostgroup to display. Have to change the processing as I already did for servicegroups. Will provide a patch tomorrow night. This occurs only in large environments. And I will keep in mind to bug you to do some more tests after I changed/added something to Classic UI ;-) Cheers Ricardo |
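To illustrate why a per-host membership check like the one described above gets expensive with thousands of hosts, here is a minimal Python sketch. It is not the Classic UI code (which is C) and not the actual patch; the function names and sample data are hypothetical. It contrasts scanning every hostgroup's member list for every host with building the set of grouped hosts once.

```python
def hosts_in_groups_naive(hosts, hostgroups):
    """For every host, scan every hostgroup's member list (linear list lookups)."""
    shown = []
    for host in hosts:
        for members in hostgroups.values():
            if host in members:  # linear scan of a list
                shown.append(host)
                break
    return shown


def hosts_in_groups_indexed(hosts, hostgroups):
    """Collect all grouped hosts into a set once, then do constant-time lookups."""
    grouped = set()
    for members in hostgroups.values():
        grouped.update(members)
    return [host for host in hosts if host in grouped]


if __name__ == "__main__":
    # Hypothetical sample data: nested building/closet groups, one ungrouped host.
    hostgroups = {
        "building-a": ["sw-a-1", "sw-a-2"],
        "closet-a-1": ["sw-a-1"],
    }
    hosts = ["sw-a-1", "sw-a-2", "orphan-1"]
    print(hosts_in_groups_naive(hosts, hostgroups))
    print(hosts_in_groups_indexed(hosts, hostgroups))
```

Both functions return the same hosts; the first does work proportional to hosts times total group memberships, the second to hosts plus total group memberships.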
Updated by mzac on 2012-05-17 00:49:34 +00:00 Believe it or not, we have about half the number of hostgroups we had before, so I was sure that wasn't the problem! :) We have set it up to be quite complex, but it helps us drill down to a specific building and closet, since we have a lot of buildings and closets spread around the McGill campus.
In terms of testing, I'd be able to help for sure. The server we're currently running our Icinga instance on is an IBM x3650 M3. When I did the upgrade today I had about 20 people hitting the server with a combination of the web GUI and Nagstamon, and all 16 cores went to 100% and the load average went up to 20!
I'm very happy you were able to find the problem. As for the patch, will you just release it as a patch or do you think it warrants a 1.7.1 release? Thanks again!
|
Updated by mfriedrich on 2012-05-17 10:31:12 +00:00 at least one week of tests on git and nightly builds. plus we need to collect any further bugs that get reported and need to be resolved (already got one). either way, if you can share resources for testing, your input is very welcome - please apply as icinga padawan for testing at info@icinga.org so things get organized a bit better
Updated by mfriedrich on 2012-05-17 11:12:40 +00:00
|
Updated by ricardo on 2012-05-17 21:06:04 +00:00
Hi, can you try this patch? It works for me, but I wanted to check if it works with your environment too. Thanks a lot. Cheers Ricardo |
Updated by ricardo on 2012-05-18 09:08:20 +00:00
|
Updated by mfriedrich on 2012-05-18 11:56:31 +00:00
better :-D i've prepared a merge base for you in dev/cgis with the latest stuff from the 1.7 release (next). please put that patch into your tree after the merge, on top of the recent ones. and make sure to collect the 1.7.1 patches now, and then focus on the 1.8 tree. |
Updated by mfriedrich on 2012-05-18 12:01:31 +00:00
|
Updated by mzac on 2012-05-18 18:39:16 +00:00 Hi, I just tested it out and it works perfectly with 'show_partial_hostgroups=0'. I saw it mentioned that 1 might be the better setting, so I will change this in my config, since I am limiting users to specific host groups. I'm just wondering, though, why the default for this option is 0. Would it not make more sense to have the default be 1? Or is it 0 so that we see everything? Thanks, Zachary
|
Updated by ricardo on 2012-05-18 19:31:14 +00:00 Hi, I'm glad it works. I'm really sorry that this slipped through. This is what you get if you aren't consistent all the way. But keep in mind that "status.cgi?hostgroup=all" will only show hosts which are in a hostgroup. If a host is without any hostgroup it won't show up in this view. This was broken in Nagios for a looong time. Already thinking about adding a compatibility cgi option. Hope you can now enjoy the features of 1.7; this fix will hit 1.7.1.
When I switched "show_partial_hostgroups" to "1" it worked better for me. That's why I asked you to test it.
show_partial_hostgroups=0: the user has to be authorized for the whole hostgroup to see any host. If the user is authorized for one or more hosts in the group, but not for the group itself, the user won't see any hosts in this group.
show_partial_hostgroups=1: if the user is authorized for one host in the group, the user will see just this one and no other host in the group.
If you think we should set it to "1" by default, then we should open an issue to discuss this. Cheers Ricardo |
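The two settings described above can be written down as a small Python sketch. This models only the behaviour as stated in the comment, not the CGI's actual code; the function name and parameters are hypothetical, and treating "authorized for the group" as a separate boolean that grants all members is an assumption.

```python
def visible_hosts(group_members, authorized_hosts, authorized_for_group,
                  show_partial_hostgroups):
    """Return the hosts of one hostgroup that a user would get to see."""
    if authorized_for_group:
        # Assumption: group-level authorization shows every member.
        return list(group_members)
    if show_partial_hostgroups:
        # =1: show only the members the user is individually authorized for.
        return [host for host in group_members if host in authorized_hosts]
    # =0: without group-level authorization, nothing in this group is shown.
    return []
```

For a group `["h1", "h2", "h3"]` and a user authorized only for `h2`, this yields an empty list with `show_partial_hostgroups=0` and `["h2"]` with `show_partial_hostgroups=1`.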
Updated by mfriedrich on 2012-06-12 16:12:03 +00:00
|
Updated by mfriedrich on 2014-12-08 09:42:45 +00:00
|
This issue has been migrated from Redmine: https://dev.icinga.com/issues/2617
Created by mzac on 2012-05-16 16:14:03 +00:00
Assignee: ricardo
Status: Resolved (closed on 2012-06-12 16:12:03 +00:00)
Target Version: 1.7.1
Last Update: 2014-12-08 09:42:45 +00:00 (in Redmine)
We just upgraded our icinga installation to 1.7.0 and status.cgi is timing out when users are using nagstamon.
Nagstamon fetches this url:
http://server/cgi-bin/status.cgi?hostgroup=all&style=hostdetail&hoststatustypes=12&hostprops=0
Getting this in my http logs:
It seems that on previous versions the URL would only return the problems; I'm not sure what it's doing now.
Note we are monitoring 7000 hosts and 16100 services so we have a high load.
Attachments
Changesets
2012-05-18 19:44:19 +00:00 by ricardo 0f0722e