Version 1.5.2 bad erformance in large environment
|Category:||Database & Queries|
hostgroup_name Hosts Services
hostgroup1 1202 1300
hostgroup2 89 1021
hostgroup3 171 1434
hostgroup4 72 913
the members of group 2 and 4 mostly are also members of group 3
We configured 3 user groups in icinga-web
usergroup1 has principals configured to have access to hostgroup3 hosts
usergroup2 has principals configured to have access to hostgroups hostgroup2 and hostgroup4
usergroup3 has principals configured to have access to hostgroup1
The issue now is:
If one of the users that are member in one of the usergroups the webinterface gets extremely slow (response time of the icinga-web interface increases to more than a minute) and the load
on the MySQL DB Server gets to 100% CPU
What we changed besides the upgrade of icinga-core from 1.4.1 to 1.5.1 and icinga-web from 1.4.1 to 1.5.2 is that we re-created the icinga core database with MyISAM engine.
The reason was a very large ib_data1 file (15 gigabytes)
So the questions are:
Why is it so slow when using principals?
Which database engine is better for icinga-core DB InnoDB or MyISAM - and if it is innodb how to avoid the huge ibdata file?
Why are the filters (hosts not up) now much slower than in 1.4.1? (this has nothing to do with the groups issue)
Thanks in advace for your help
#1 Updated by jmosshammer about 5 years ago
- Priority changed from High to Normal
First: I'm not a mysql performance specialiast. But afaik, transaction management in innodb is done much more efficent in time (but not in space) and in situations where you have many concurrent inserts and selects it should therefore perform better than MyISAM (and you have lots of inserts in your enviroment). It's hard to track down whether it's 1.5 or MyISAM, but i could imagine that MyISAM hat a big factor on this, as it has limited/no support of row locking.
Can you use compression in innodb?
Though 1.5. got a new database layer, the queries are almost the same at this time, so it shouldn't have such a big impact.
#2 Updated by tuxifier about 5 years ago
So we switched back to innodb thoroughly without any performance improvements. The performance of "normal views (standard cronks)" with a user without any principals is acceptable but if we use a user with principals (allowed to see only specific hostgroups) and add a filter like "services not ok" icinga-web doesn't respond for minutes.
Do the SQL queries get that complex in this case? Is there a sensible way to trace them for further analysis?
Our next step is to try innodb compression to see if that'll make an improvement. How are the experiences with postgres on high workloads on icinga? Might it perform better? Would it make sense to give it a try?
I also found out that before switching to icinga and icinga-web 1.5.x the principals in icinga-web weren't ever tried. So I doubt for now the reason has anything to do with the differences between 1.4 and 1.5
#3 Updated by tuxifier about 5 years ago
Got some pretty new results from research:
The thing that causes the freeze of the icinga-web interface is an SQL Query that (I assume) is executed by or for the <div class="float-container clearfix icinga-monitor-performance"> (which is the icinga performance statistics information on the top right)
If I log into icinga-web with a user without principals and sees everything the query looks like this:
- Query_time: 10 Lock_time: 0 Rows_sent: 1 Rows_examined: 93962
SELECT DISTINCT MIN AS i4__0, MIN AS i4__1, MAX AS i4__2, MAX AS i4__3, AVG AS i4__4, AVG AS i4__5, MIN AS i4__6, MIN AS i4__6, MAX AS i4__7, MAX AS i4__7, AVG AS i4__8, AVG AS i4__8 FROM icinga_objects i LEFT JOIN icinga_objects i2 ON i.object_id = i2.object_id AND (i.objecttype_id = 2) LEFT JOIN icinga_services i3 ON i2.object_id = i3.service_object_id LEFT JOIN icinga_servicestatus i4 ON i3.service_object_id = i4.service_object_id INNER JOIN icinga_objects i5 ON (((i5.object_id = i.object_id AND i.objecttype_id = 1) OR i5.object_id = i3.host_object_id)) INNER JOIN icinga_hosts i6 ON i5.object_id = i6.host_object_id LEFT JOIN icinga_hoststatus i7 ON i6.host_object_id = i7.host_object_id WHERE (i6.config_type = '1');
If I login to icinga-web with a user that has attached principals (2 hostgroups in this case it looks like this
- Query_time: 195 Lock_time: 0 Rows_sent: 1 Rows_examined: 42593249
SELECT DISTINCT MIN AS i4__0, MIN AS i4__1, MAX AS i4__2, MAX AS i4__3, AVG AS i4__4, AVG AS i4__5, MIN AS i4__6, MIN AS i4__6, MAX AS i4__7, MAX AS i4__7, AVG AS i4__8, AVG AS i4__8 FROM icinga_objects i LEFT JOIN icinga_objects i2 ON i.object_id = i2.object_id AND (i.objecttype_id = 2) LEFT JOIN icinga_services i3 ON i2.object_id = i3.service_object_id LEFT JOIN icinga_servicestatus i4 ON i3.service_object_id = i4.service_object_id INNER JOIN icinga_objects i5 ON (((i5.object_id = i.object_id AND i.objecttype_id = 1) OR i5.object_id = i3.host_object_id)) INNER JOIN icinga_hosts i6 ON i5.object_id = i6.host_object_id LEFT JOIN icinga_hoststatus i7 ON i6.host_object_id = i7.host_object_id LEFT JOIN icinga_hostgroup_members i9 ON (i6.host_object_id = i9.host_object_id) LEFT JOIN icinga_hostgroups i8 ON i8.hostgroup_id = i9.hostgroup_id LEFT JOIN icinga_objects i10 ON i8.hostgroup_object_id = i10.object_id WHERE (i6.config_type = '1' AND (i10.name1 LIKE 'Fileserver-Node' OR i10.name1 LIKE 'Compute-Node'));
The Query time information is from the mysql slow queries log. The point is:
If I add principals to a user the examined rows to calculate the internal performance stats of icinga (MIN,MAX,AVG) increase from 100.000 to 42.000.000 which results in an increasing query time from 10s to 195s.
The div "icinga-monitor-performance" seems to be deployed by icinga-web/app/modules/Cronks/templates/System/ MonitorPerformanceSuccess.php and by default is automagically refreshed every 60s if I read the code correctly. Which might result in an endless query loop if the query doesn't finish within 60s?
So is this a "works as designed" behaviour or may there be an issue with the SQL Statement itself?
Next Question is, how could I avoid that this query is executed on login? Can I completely disable the data query for the icinga-monitor-performance div to do further testing?
The database itself has absolutely no performance probs - no I/O, no mem but the query eats 100% of one cpu - core as long as it runs (which seems to be ok for me)
#4 Updated by tgelf about 5 years ago
I can confirm this. And I'm pretty sure that these queries could also be done without all those JOINs (or at least not with so many of them). The main issue in the shown queries is this part:
INNER JOIN icinga_objects i5 ON ((
(i5.object_id = i.object_id AND i.objecttype_id = 1)
i5.object_id = i3.host_object_id
MySQL seems to be unable to make use of it's indexes as of this construct. This leads to harmful copying-to-tmp-table operations. You can find a similar construct in the notification view, causing unacceptably bad performance for that grid. Indexes added in the current master make this better, but the notification view still remains unusable. I helped myself with throwing away the IF-construct there, notifications are still slow and show NULL for host notifications - but that way they are no longer killing my DB servers :p
What also makes 1.5.2 seem to performance bad is the fact that it does never ever execute two requests for a single session in parallel. The root cause of this is the complete lack of session file management: it's left up to PHP to close them, and PHP doesn't do so before the end of a single request. Switching to DB-based sessions will improve this - at the cost of slightly more DB traffic. Garbage collection will kick your session from time to time as of another bug - you'll find a possible fix in the related issue.
There are also some grids whose rows are not filtered once you are using a custom filter in your grid view. This can lead to hundreds and thousands of grid lines your browser has to paginate by itself. FF will inform you that a script is running crazy, IE will probably completely lock up.
Installing APC is also a good idea to make Icinga-Web faster. Provide MySQL with a large InnoDB buffer pool, innodb_file_per_table will help avoiding ever-growing ibdata files. Using that option you can track down problematic tables, search the root cause, try to fix it and recreate just that single table. Choosing MyISAM for Icinga is a bad idea.
Well, fixed all those issues Icinga-Web 1.5.2 runs with acceptable performance also in large environments. Good luck ;-)
NB: I didn't fix the statistics query yet. It is causing harm here too, but you'll not notice it anymore once you worked around the session locking issue. However I did some testing: it is possible to rewrite those queries in a way to make them run in far less than a single second. But right now i have no idea how to do so in such a hyper-abstracted way Icinga-Web is asking for.