[dev.icinga.com #2972] Proposal - new SLA reporting strategy #27
Comments
Updated by tgelf on 2012-08-12 22:55:00 +00:00
I attached a sample report package making use of this function and a sample PDF generated by this package. There is only one single "Availability" report to be found in there. It allows you to choose a single or multiple hostgroups and to either show the availability for the last week/month/year or to use a custom timerange. Cheers, |
Updated by tgelf on 2012-08-13 07:18:35 +00:00
NB: New version of the sample reporting package. The former one had two references to ressources in the current Icinga reports. |
Updated by berk on 2012-08-13 07:36:47 +00:00
|
Updated by croft on 2012-08-14 10:54:14 +00:00
Added output from show_single_events.sql |
Updated by tgelf on 2012-08-14 11:01:38 +00:00 Hi Lara, how did you call this query? Through STDIN like ... mysql -u whatever -p icinga < show_single_events.sql ...or interactively by starting the MySQL-CLI and pasting the queries: mysql -u whatever -p icinga -A The output looks like you used the first option... is my guess correct? Do things change when using the second variant? Do the change when running the query multiple times? Right now the start_time column is completely wrong in your result... Cheers, NB: For all those wondering what's going on here -> we moved troubleshooting from #2793 to this ticket. |
Updated by tgelf on 2012-08-14 11:06:51 +00:00
@lara: I attached a new report-package. Could you try to find out whether this fixes your "Host from hostgroup X are missing" problem? There have been small leftovers from a Customvar-Voodoo-customization I did for one of our customers. Cheers, |
Updated by croft on 2012-08-14 11:46:42 +00:00
tgelf wrote:
Yes, sorry my fault...
Yes, looks better :-) |
Updated by croft on 2012-08-14 12:03:43 +00:00 tgelf wrote:
Problem is fixed :-) |
Updated by tgelf on 2012-08-14 12:45:53 +00:00 croft wrote:
Well, that's true. But unfortunately data remained as is, the start_time column content is rubbish. But slowly we are going to find out what's causing this wrong value: your object shows only soft state changes. If you have a look at the last event on fifth August, at least the last event (OK with statehistory_id = 359214) MUST have been a hard state (state_type = 1). What Icinga and especially what IDO version are you using? |
Updated by croft on 2012-08-14 13:21:36 +00:00 It's icinga-1.6.1 and idoutils 1.6.0. |
Updated by tgelf on 2012-08-14 13:40:28 +00:00 croft wrote:
Well, that's not that new... but also not sooo old. Does anyone know whether there have been known issues with wrong state_type values (SOFT instead of HARD) in 1.6.0? A workaround for this one is tricky. The only option coming to my mind is interpreting every OK state as a HARD state, even soft ones. I did so a little while ago, but then I found a better way to ship around most issues I've met with wrong data in the IDO database. Seems not being enough so far. Do you have other hosts or services showing similar issues and completely wrong SLA values? It would be "great" to have more such aliens :p Cheers, |
Updated by berk on 2012-08-15 10:12:24 +00:00
|
Updated by bmalynovytch on 2012-08-20 08:19:40 +00:00 Sorry for the noobish question : how / where are we supposed to install the package ? Is it a core module, a classic UI ou a new UI one ? Cheers, |
Updated by tgelf on 2012-08-20 16:30:10 +00:00 Hi Benjamin! bmalynovytch wrote:
The ZIP file is a sample package suitable for your JasperReport Server. Installation works as shown here: http://docs.icinga.org/latest/en/reporting\_1.6.html If that works fine, you can install the ZIP package attached to this issue. The Makefile in the current Reporting-tar.gz shows how this can be done. The Icinga-Web frontend will currently not be able to show those reports, you have to run them directly on your JasperServer or triggered by iReport. And then you'll also need the MySQL function as attached to this issue. However, please don't do so as long as you don't feel self-confident enough. There is a good chance that you'll find this package in the next Icinga-Release - installation will definitively be easier once it is packaged and documented. Cheers, |
Updated by bmalynovytch on 2012-08-21 07:48:09 +00:00 tgelf wrote:
Dear Thomas, Thank you for your answer. I'll wait for the release to see if setting up reporting could be interesting for my needs. Regards, |
Updated by berk on 2012-08-23 13:16:17 +00:00
|
Updated by berk on 2012-08-31 16:39:00 +00:00
|
Updated by mfriedrich on 2012-08-31 16:42:32 +00:00 the link for the documentation - typo? |
Updated by Anonymous on 2012-08-31 16:51:56 +00:00
Applied in changeset 333083a. |
Updated by croft on 2012-09-10 10:45:37 +00:00
Hi, sorry, but the new report- package doesn't work as expected... Kind regards, |
Updated by tgelf on 2012-09-10 11:13:18 +00:00 Hi Lara, croft wrote:
I didn't try out the new reporting package yet, however your report looks like the results still needs to be multiplied with 100. Could you try to do so in your reports query? Cheers, |
Updated by croft on 2012-09-10 12:09:06 +00:00 Hi Thomas, |
Updated by mfriedrich on 2012-09-26 13:51:57 +00:00 I've updated the Wiki Documentation accordingly for importing the sql, as well as possible error on missing EXECUTE grants. https://wiki.icinga.org/display/howtos/Setting+up+Icinga+with+Reporting keep in mind - if this is done for postgresql too, we should possibly add
instead. |
Updated by berk on 2012-10-12 07:20:30 +00:00
i think we can resolve this one, right? Still another issue for postgresql function |
Updated by mfriedrich on 2012-10-12 10:07:55 +00:00
yep |
Updated by mfriedrich on 2013-02-01 13:55:54 +00:00
|
Updated by mfriedrich on 2013-02-01 13:56:00 +00:00
|
This issue has been migrated from Redmine: https://dev.icinga.com/issues/2972
Created by tgelf on 2012-08-10 06:32:13 +00:00
Assignee: berk
Status: Resolved (closed on 2012-10-12 10:07:55 +00:00)
Target Version: 1.8
Last Update: 2012-10-12 10:07:54 +00:00 (in Redmine)
Hi @ll,
SLA reporting is currently an annoying topic, at least when it comes to SLA reporting for Icinga. First of all, there are multiple options to choose from:
None of those options felt suitable for daily use in large envirionments. That's why I've spent a lot of time to throw in a new variant: an easy to use function for MySQL. Of course it could also be ported to other databases. But let me first explain how it works with a short example:
SELECT
name1 AS host,
COALESCE (name2, 'Host availability') AS service,
icinga_availability(object_id, '2012-01-01 00:00:00', NOW ()) AS sla
FROM icinga_objects
WHERE is_active = 1
AND objecttype_id IN (1, 2)
ORDER BY host, objecttype_id, service;
The query shall be self-explaining, the result could look as follows:
I did many, many different attempts until I got to this point. First I didn't like the idea to be forced to use a function and I also managed it to put everything in a single query. However, such query forced me to make intensive use of subselects once it goes to examine more than one single object_id at once. Throwing in millions of statehistory-rows I haven't been able to design them in a way allowing MySQL to execute them fast enough.
That's why I've build this function. It performed very, very well in all tests I did. It allows you to do crazy things such as asking for "all services who have performed better in July than they did in June". That's what you need to prove your boss that things are going to be better. At least those you have chosen to show him ;-)
The function is also fast enough for showing lot's of SLA-values in your web application, such as Icinga-Web. In my tests I have created a table showing the SLA for multiple months and years per line and service with many services on a single page. This still allowed me to browse through them pretty fast.
Right now the function isn't perfect yet and will require additional tuning. I've seen older MySQL versions behaving strange when it goes to NULL value handling in variables. I tried to ship around most of the issues I've met, but there may be others.
Icinga/IDO also isn't perfect. During my tests I discovered that for example the last_hard_state column is pretty wrong each time a hard state is reached before reaching max_check_attempts. This is what happens for example to service checks once Icinga knows that their host is down.
Such issues are the root cause for some strange hacks in the query. Before asking why the hell I did the one or the other not-so-elegant thing, please read the inline comments. They try to explain what is going on - and why.
I also know that splitting the main query in the function into multiple parts could make things easier. Nonetheless I'm trying hard to keep it as-is, one big fat query. The reason for this is simple: I want to be able to copy parts of it to a web application or reporting query. Especially the part putting together all the single events could be pretty useful.
SLA logic is hard-coded. A host is fine with an UP state. UNREACHABLE is a problem. A service is fine as long as it is OK or WARNING. UNKNOWN is a problem. A downtime means that everything is fine. Multiple downtimes for overlapping timeperiods are handled. An acknowledgement is not relevant for SLA reporting. Telling that you are working on a problem means that the problem still persists.
Special care is required once there are too few or no events to be found in the database. I'll add a few improvements for those border cases, some things already work fine. My intention is:
You will have a far better SLA experience by adding the following indexes:
CREATE INDEX sla_index ON icinga_statehistory
(object_id, state_time DESC);
CREATE INDEX sla_index ON icinga_downtimehistory
(object_id, actual_start_time, actual_end_time);
CREATE INDEX sla_index ON icinga_objects
(objecttype_id, is_active, name1);
The DESC is not useful for the SLA part, but it could help elsewhere too. There are other details, but I think I managed it to mention most of them. The function is attached to this issue, please try it out and have fun. And please let me know what you think of it!
Regards,
Thomas Gelf
Attachments
Changesets
2012-08-31 16:51:24 +00:00 by (unknown) 333083a
Subtasks:
Relations:
The text was updated successfully, but these errors were encountered: