Skip to content
This repository has been archived by the owner on Jan 15, 2019. It is now read-only.

[dev.icinga.com #2972] Proposal - new SLA reporting strategy #27

Closed
icinga-migration opened this issue Aug 10, 2012 · 29 comments
Closed

Comments

@icinga-migration
Copy link

This issue has been migrated from Redmine: https://dev.icinga.com/issues/2972

Created by tgelf on 2012-08-10 06:32:13 +00:00

Assignee: berk
Status: Resolved (closed on 2012-10-12 10:07:55 +00:00)
Target Version: 1.8
Last Update: 2012-10-12 10:07:54 +00:00 (in Redmine)


Hi @ll,

SLA reporting is currently an annoying topic, at least when it comes to SLA reporting for Icinga. First of all, there are multiple options to choose from:

  • Reports from Icinga-Classic. This method currently gives you the most precise results. They are pretty slow when it comes to large datasets. And you still need something like Excel to achieve a useful report.
  • Icinga comes with Reports for the Jasper Server - as long as you are using IDO2DB in combination with MySQL. Customizing them is tricky and requires a lot of SQL knowledge. Unfortunately they do not support downtimes and will often give you no or wrong results.
  • The new kid on the block is the SLA support directly in the ido2db daemon. The obvious idea behind this attempt is to store report-friendly data as soon as SLA-relevant events occur. I haven't seen any tool or report making use of this data so far. Different people have been reporting strange side-effects once they activated this module.

None of those options felt suitable for daily use in large envirionments. That's why I've spent a lot of time to throw in a new variant: an easy to use function for MySQL. Of course it could also be ported to other databases. But let me first explain how it works with a short example:

SELECT
name1 AS host,
COALESCE (name2, 'Host availability') AS service,
icinga_availability(object_id, '2012-01-01 00:00:00', NOW ()) AS sla
FROM icinga_objects
WHERE is_active = 1
AND objecttype_id IN (1, 2)
ORDER BY host, objecttype_id, service;

The query shall be self-explaining, the result could look as follows:

host service sla
c1-db1 Host availability 91.5496
c1-db1 MySQL 91.5700
c1-db1 PING 91.5696
c1-db2 Host availability 91.5569
c1-db2 MySQL 100.0000
c1-db2 PING 91.5689

I did many, many different attempts until I got to this point. First I didn't like the idea to be forced to use a function and I also managed it to put everything in a single query. However, such query forced me to make intensive use of subselects once it goes to examine more than one single object_id at once. Throwing in millions of statehistory-rows I haven't been able to design them in a way allowing MySQL to execute them fast enough.

That's why I've build this function. It performed very, very well in all tests I did. It allows you to do crazy things such as asking for "all services who have performed better in July than they did in June". That's what you need to prove your boss that things are going to be better. At least those you have chosen to show him ;-)

The function is also fast enough for showing lot's of SLA-values in your web application, such as Icinga-Web. In my tests I have created a table showing the SLA for multiple months and years per line and service with many services on a single page. This still allowed me to browse through them pretty fast.

Right now the function isn't perfect yet and will require additional tuning. I've seen older MySQL versions behaving strange when it goes to NULL value handling in variables. I tried to ship around most of the issues I've met, but there may be others.

Icinga/IDO also isn't perfect. During my tests I discovered that for example the last_hard_state column is pretty wrong each time a hard state is reached before reaching max_check_attempts. This is what happens for example to service checks once Icinga knows that their host is down.

Such issues are the root cause for some strange hacks in the query. Before asking why the hell I did the one or the other not-so-elegant thing, please read the inline comments. They try to explain what is going on - and why.

I also know that splitting the main query in the function into multiple parts could make things easier. Nonetheless I'm trying hard to keep it as-is, one big fat query. The reason for this is simple: I want to be able to copy parts of it to a web application or reporting query. Especially the part putting together all the single events could be pretty useful.

SLA logic is hard-coded. A host is fine with an UP state. UNREACHABLE is a problem. A service is fine as long as it is OK or WARNING. UNKNOWN is a problem. A downtime means that everything is fine. Multiple downtimes for overlapping timeperiods are handled. An acknowledgement is not relevant for SLA reporting. Telling that you are working on a problem means that the problem still persists.

Special care is required once there are too few or no events to be found in the database. I'll add a few improvements for those border cases, some things already work fine. My intention is:

  • If a host is currently UP since the mid of July and there is not even one single event in the statehistory table, the function shall be intelligent enough to tell you that the availability for July was 50%. The function also respects current host or service state, that's why this should work fine. This part is not complete yet, but already partially working.
  • A host currently being DOWN with a last_state_change in June and no single event or downtime entry will show an availability of 0% for July.
  • A host definition created in August will show a NULL availability for July. If it was created in the mid of July it shall show 50%. Currently it doesn't, it would use it's initial state for the whole July - this will be fixed.

You will have a far better SLA experience by adding the following indexes:

CREATE INDEX sla_index ON icinga_statehistory
(object_id, state_time DESC);
CREATE INDEX sla_index ON icinga_downtimehistory
(object_id, actual_start_time, actual_end_time);
CREATE INDEX sla_index ON icinga_objects
(objecttype_id, is_active, name1);

The DESC is not useful for the SLA part, but it could help elsewhere too. There are other details, but I think I managed it to mention most of them. The function is attached to this issue, please try it out and have fun. And please let me know what you think of it!

Regards,
Thomas Gelf

Attachments

Changesets

2012-08-31 16:51:24 +00:00 by (unknown) 333083a

added new package with sla extension fixes #2972

Subtasks:

Relations:

@icinga-migration
Copy link
Author

Updated by tgelf on 2012-08-12 22:55:00 +00:00

  • File added reporting-icingadev.zip
  • File added reporting-icingadev-sample.pdf

I attached a sample report package making use of this function and a sample PDF generated by this package. There is only one single "Availability" report to be found in there. It allows you to choose a single or multiple hostgroups and to either show the availability for the last week/month/year or to use a custom timerange.

Cheers,
Thomas

@icinga-migration
Copy link
Author

Updated by tgelf on 2012-08-13 07:18:35 +00:00

  • File added reporting-icingadev.zip

NB: New version of the sample reporting package. The former one had two references to ressources in the current Icinga reports.

@icinga-migration
Copy link
Author

Updated by berk on 2012-08-13 07:36:47 +00:00

  • Status changed from New to Assigned
  • Assigned to set to berk
  • Priority changed from Normal to High
  • Target Version set to 1.8

@icinga-migration
Copy link
Author

Updated by croft on 2012-08-14 10:54:14 +00:00

  • File added show_single_events.txt

Added output from show_single_events.sql

@icinga-migration
Copy link
Author

Updated by tgelf on 2012-08-14 11:01:38 +00:00

Hi Lara,

how did you call this query? Through STDIN like ...

mysql -u whatever -p icinga < show_single_events.sql

...or interactively by starting the MySQL-CLI and pasting the queries:

mysql -u whatever -p icinga -A

The output looks like you used the first option... is my guess correct? Do things change when using the second variant? Do the change when running the query multiple times? Right now the start_time column is completely wrong in your result...

Cheers,
Thomas

NB: For all those wondering what's going on here -> we moved troubleshooting from #2793 to this ticket.

@icinga-migration
Copy link
Author

Updated by tgelf on 2012-08-14 11:06:51 +00:00

  • File added reporting-icingadev.zip

@lara: I attached a new report-package. Could you try to find out whether this fixes your "Host from hostgroup X are missing" problem? There have been small leftovers from a Customvar-Voodoo-customization I did for one of our customers.

Cheers,
Thomas

@icinga-migration
Copy link
Author

Updated by croft on 2012-08-14 11:46:42 +00:00

  • File added show_single_events2.txt

tgelf wrote:

Hi Lara,

how did you call this query? Through STDIN like ...

mysql -u whatever -p icinga < show_single_events.sql

...or interactively by starting the MySQL-CLI and pasting the queries:

mysql -u whatever -p icinga -A

The output looks like you used the first option... is my guess correct?

Yes, sorry my fault...

Do things change when using the second variant?

Yes, looks better :-)

@icinga-migration
Copy link
Author

Updated by croft on 2012-08-14 12:03:43 +00:00

tgelf wrote:

@lara: I attached a new report-package. Could you try to find out whether this fixes your "Host from hostgroup X are missing" problem? There have been small leftovers from a Customvar-Voodoo-customization I did for one of our customers.

Cheers,
Thomas

Problem is fixed :-)

@icinga-migration
Copy link
Author

Updated by tgelf on 2012-08-14 12:45:53 +00:00

croft wrote:

> Do things change when using the second variant?

Yes, looks better :-)

Well, that's true. But unfortunately data remained as is, the start_time column content is rubbish. But slowly we are going to find out what's causing this wrong value: your object shows only soft state changes. If you have a look at the last event on fifth August, at least the last event (OK with statehistory_id = 359214) MUST have been a hard state (state_type = 1).

What Icinga and especially what IDO version are you using?

@icinga-migration
Copy link
Author

Updated by croft on 2012-08-14 13:21:36 +00:00

It's icinga-1.6.1 and idoutils 1.6.0.
I think there was no 1.6.1- version for the idoutils..

@icinga-migration
Copy link
Author

Updated by tgelf on 2012-08-14 13:40:28 +00:00

croft wrote:

It's icinga-1.6.1 and idoutils 1.6.0.
I think there was no 1.6.1- version for the idoutils..

Well, that's not that new... but also not sooo old. Does anyone know whether there have been known issues with wrong state_type values (SOFT instead of HARD) in 1.6.0?

A workaround for this one is tricky. The only option coming to my mind is interpreting every OK state as a HARD state, even soft ones. I did so a little while ago, but then I found a better way to ship around most issues I've met with wrong data in the IDO database. Seems not being enough so far.

Do you have other hosts or services showing similar issues and completely wrong SLA values? It would be "great" to have more such aliens :p

Cheers,
Thomas

@icinga-migration
Copy link
Author

Updated by berk on 2012-08-15 10:12:24 +00:00

  • Category set to Queries

@icinga-migration
Copy link
Author

Updated by bmalynovytch on 2012-08-20 08:19:40 +00:00

Sorry for the noobish question : how / where are we supposed to install the package ? Is it a core module, a classic UI ou a new UI one ?

Cheers,
Benjamin

@icinga-migration
Copy link
Author

Updated by tgelf on 2012-08-20 16:30:10 +00:00

Hi Benjamin!

bmalynovytch wrote:

Sorry for the noobish question : how / where are we supposed to install the package ? Is it a core module, a classic UI ou a new UI one ?

The ZIP file is a sample package suitable for your JasperReport Server. Installation works as shown here:

http://docs.icinga.org/latest/en/reporting\_1.6.html

If that works fine, you can install the ZIP package attached to this issue. The Makefile in the current Reporting-tar.gz shows how this can be done. The Icinga-Web frontend will currently not be able to show those reports, you have to run them directly on your JasperServer or triggered by iReport. And then you'll also need the MySQL function as attached to this issue.

However, please don't do so as long as you don't feel self-confident enough. There is a good chance that you'll find this package in the next Icinga-Release - installation will definitively be easier once it is packaged and documented.

Cheers,
Thomas

@icinga-migration
Copy link
Author

Updated by bmalynovytch on 2012-08-21 07:48:09 +00:00

tgelf wrote:

The ZIP file is a sample package suitable for your JasperReport Server. Installation works as shown here:
http://docs.icinga.org/latest/en/reporting\_1.6.html

Dear Thomas,

Thank you for your answer.
I'm not using JasperReport Server. That's why the package format didn't seem to fit.

I'll wait for the release to see if setting up reporting could be interesting for my needs.

Regards,
Benjamin

@icinga-migration
Copy link
Author

Updated by berk on 2012-08-23 13:16:17 +00:00

  • spa-queries updated and implemented in top10 and sla-report
  • docu still missing

@icinga-migration
Copy link
Author

Updated by berk on 2012-08-31 16:39:00 +00:00

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2012-08-31 16:42:32 +00:00

the link for the documentation - typo?

@icinga-migration
Copy link
Author

Updated by Anonymous on 2012-08-31 16:51:56 +00:00

  • Status changed from Assigned to Resolved

Applied in changeset 333083a.

@icinga-migration
Copy link
Author

Updated by croft on 2012-09-10 10:45:37 +00:00

  • File added HostAvailabilityReportLastMonth.pdf
  • File added localhost_avail_report.doc
  • Status changed from Resolved to Assigned

Hi,

sorry, but the new report- package doesn't work as expected...
I think the availability is calculated incorrectly. The "Availability report about a selected Host in a given time" says, that most of my servers had an uptime of 1% or less.
Example:
localhost, timeperiod: last month
See attachements please.

Kind regards,
Lara

@icinga-migration
Copy link
Author

Updated by tgelf on 2012-09-10 11:13:18 +00:00

Hi Lara,

croft wrote:

...most of my servers had an uptime of 1% or less.

I didn't try out the new reporting package yet, however your report looks like the results still needs to be multiplied with 100. Could you try to do so in your reports query?

Cheers,
Thomas

@icinga-migration
Copy link
Author

Updated by croft on 2012-09-10 12:09:06 +00:00

Hi Thomas,
yes looks like. I think it has to be changed in the subreport host/Availability in given time, but where exactly? Could you please help me?

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2012-09-26 13:51:57 +00:00

I've updated the Wiki Documentation accordingly for importing the sql, as well as possible error on missing EXECUTE grants.

https://wiki.icinga.org/display/howtos/Setting+up+Icinga+with+Reporting

keep in mind - if this is done for postgresql too, we should possibly add

sql/$rdbms/icinga_availibility.sql

instead.

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2012-09-26 13:57:10 +00:00

  • File added icinga_reporting_1.8_host_availibilityreport.png

icinga_reporting_1.8_host_availibilityreport.png

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2012-09-26 13:57:47 +00:00

  • File added icinga_reporting_1.8_morning_report.png

icinga_reporting_1.8_morning_report.png

@icinga-migration
Copy link
Author

Updated by berk on 2012-10-12 07:20:30 +00:00

  • Status changed from Assigned to Feedback

@dnsmichi:

i think we can resolve this one, right? Still another issue for postgresql function

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2012-10-12 10:07:55 +00:00

  • Status changed from Feedback to Resolved

yep

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2013-02-01 13:55:54 +00:00

  • File deleted reporting-icingadev.zip

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2013-02-01 13:56:00 +00:00

  • File deleted reporting-icingadev.zip

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

1 participant