
[dev.icinga.com #10758] Double running after icinga2 monitoring server restarts #3733

Closed
icinga-migration opened this issue Dec 1, 2015 · 21 comments
Labels
area/distributed Distributed monitoring (master, satellites, clients) blocker Blocks a release or needs immediate attention bug Something isn't working

Comments

@icinga-migration

This issue has been migrated from Redmine: https://dev.icinga.com/issues/10758

Created by hostedpower on 2015-12-01 14:23:48 +00:00

Assignee: hostedpower
Status: Closed (closed on 2016-02-04 12:15:02 +00:00)
Target Version: (none)
Last Update: 2016-02-04 12:15:02 +00:00 (in Redmine)

Icinga Version: r2.4.1-1
Backport?: Not yet backported
Include in Changelog: 1

Please look at https://dev.icinga.org/issues/10656

I'm opening this ticket because I assume a closed ticket isn't watched anymore; however, this seems to be a huge problem.

Checks are executed twice after the connection to the monitoring server was lost. This corrupts checks which depend on state. I also had issues with other plugins, but it was most visible with check_mysql_health.

Attachments

  • debug.zip hostedpower - 2015-12-17 10:31:21 +00:00
  • icinga2.zip hostedpower - 2015-12-17 10:43:14 +00:00

Relations:

@icinga-migration

Updated by hostedpower on 2015-12-05 12:35:55 +00:00

Hi,

I don't understand why nobody else has issues with this. I have lots of them :|

You just don't notice the bug very easily if you only have some simple checks: they are simply executed twice and that's it. With check_yum I sometimes seem to have similar issues caused by the double run. I added logging in the plugins themselves and I see double execution each time I have the issues (two different PIDs, run at the same moment).

Recently I noticed I also have the issue sometimes when rebooting the monitored servers (where icinga2 runs as a client for the monitoring server).

The resolution is the same each time: restarting the icinga2 client.

Almost all my checks are of the remote execution type, so with

command_endpoint = host.vars.remote_client
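For reference, the checks look roughly like this (a minimal sketch with a placeholder service name; the actual apply rules differ):

apply Service "mysql-health" {
  check_command = "mysql_health"
  // execute the plugin on the remote client instead of the master
  command_endpoint = host.vars.remote_client
  assign where host.vars.remote_client
}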

Maybe this gives an extra clue?

I hope someone looks into this, because it almost makes me want to ditch icinga2 because of all the warnings and errors it causes with so many servers :(

Jo

@icinga-migration

Updated by hostedpower on 2015-12-17 08:50:13 +00:00

Any update on this? Any pointers to where in the code this could be? It's really driving me nuts and it is getting worse and worse the more hosts I add :(

@icinga-migration

Updated by mfriedrich on 2015-12-17 08:56:31 +00:00

  • Category set to libmethods
  • Status changed from New to Feedback
  • Assigned to set to hostedpower

Please extract the exact command execution from the logs, attach ps aux output, etc.

@icinga-migration

Updated by hostedpower on 2015-12-17 10:05:01 +00:00

Hi,

I can add ps aux, but I'm not sure what you want to see with that? The double run won't be caught that way because it takes only a second and wouldn't show up in ps aux.

The command execution log is also not so easy. It needs the debug log enabled, but I cannot enable the debug log on all production servers. Enabling it once I have the problem won't work, since restarting icinga2 fixes it.

In the other thread I showed you that the plugin was executed twice. Please look there. You see two different PIDs executed at the same time for the same plugin command :|

I'll see what I can do to provide more info, but to be honest I have already spent hours and hours on this.

Jo

@icinga-migration

Updated by hostedpower on 2015-12-17 10:32:40 +00:00

  • File added debug.zip

Hi,

I have a debug.log with the problem now!! I don't think you need more info, please check it out! :)

Just one example out of the log:

[2015-12-17 11:20:32 +0100] notice/JsonRpcConnection: Received 'event::ExecuteCommand' message from 'monitoring.hosted-power.com'
[2015-12-17 11:20:32 +0100] notice/Process: Running command '/usr/lib/nagios/plugins/check_mysql_health' '--hostname' '10.10.81.220' '--mode' 'slave-io-running' '--password' '*' '--username' 'icinga': PID 3328
[2015-12-17 11:20:32 +0100] notice/JsonRpcConnection: Received 'event::ExecuteCommand' message from 'monitoring.hosted-power.com'
[2015-12-17 11:20:32 +0100] notice/Process: Running command '/usr/lib/nagios/plugins/check_mysql_health' '--hostname' '10.10.81.220' '--mode' 'slave-io-running' '--password' '*' '--username' 'icinga': PID 3329
[2015-12-17 11:20:32 +0100] notice/JsonRpcConnection: Received 'event::ExecuteCommand' message from 'monitoring.hosted-power.com'

It seems that it receives the command twice somehow...

If I now restart icinga2 on this client, it goes away and the command is only executed once again...

Jo

PS:

It happens for most commands, but you normally don't notice it, for example:

[2015-12-17 11:21:04 +0100] notice/JsonRpcConnection: Received 'event::ExecuteCommand' message from 'monitoring.hosted-power.com'
[2015-12-17 11:21:04 +0100] notice/Process: Running command '/usr/lib/nagios/plugins/check_dns' '-H' 'web.hosted-power.com' '-a' '178.18.81.220' '-s' '178.18.81.220' '-t' '10': PID 4273
[2015-12-17 11:21:04 +0100] notice/JsonRpcConnection: Received 'event::ExecuteCommand' message from 'monitoring.hosted-power.com'
[2015-12-17 11:21:04 +0100] notice/Process: Running command '/usr/lib/nagios/plugins/check_dns' '-H' 'web.hosted-power.com' '-a' '178.18.81.220' '-s' '178.18.81.220' '-t' '10': PID 4274

And some commands are already running 4 times. Yes, 4 times :|

@icinga-migration

Updated by hostedpower on 2015-12-17 10:43:25 +00:00

  • File added icinga2.zip

For the sake of completeness I have now gathered all the logs from the instance having issues and from an instance not having issues.

@icinga-migration

Updated by hostedpower on 2016-01-13 18:53:08 +00:00

Hi,

I would really like this fixed and I don't understand why a core developer doesn't look into it. Double running of all kinds of checks is quite crazy if you ask me.

What will be done about this? If anyone has a clue where to look (in which directory or files), I could maybe take a look, but this is more than some simple extension of the program.

Jo

@icinga-migration

Updated by mfriedrich on 2016-01-13 19:47:41 +00:00

  • Status changed from Feedback to New
  • Assigned to deleted hostedpower

I normally tend to ignore users whining about why no one is looking into their issues after a few weeks, but I think you deserve an honest answer, even if I have been repeating it for seven years now.

Open-source projects like Icinga are generously sponsored by companies investing manpower, or allowing their employees to work on the project some hours a week. Other people also invest their spare time for the fun and pride. In the end it is a matter of time, customer demand, personal demand and, not to forget, motivation.

When someone opens a new issue, that requires a first amount of time - extract all required information, wait for feedback, and get an idea about the problem. Does it sound familiar? Yes/no, should we raise the priority, do we need to ask our boss for additional resources to reproduce the problem, work on a time estimate and then fix the problem. Does it affect the local system, does it affect customer sites, is it relevant for team members, or are there multiple users suffering from the same issue. Priority and importance, in that order.

All in all, that takes time to review, estimate and plan for the next development sprints. Not only for one issue, but for every new and old issue sitting on the bug tracker waiting for its resolution. It is more than obvious that some issues won't get updates soon enough, and users tend to disagree with the developers' decision to look into other problems first (those aren't important in the user's eyes, only their own are).

Things users also don't like to hear - even if we are trying our very best to answer each new issue, and keep working on existing issues next to the ones our NMS guys and customers have just thrown in - it is not enough in their opinion. That may sound frustrated, but it is just how it is - if you want your issue fixed, either work on it and send a patch, or find someone who can and have him/her do the patch for you, allowing upstream developers to review and apply it. It also helps your own good mood to have contributed to an open-source project which you have been using for free all along.

The next step for this issue is to change it to "new", wait for someone to reproduce it, and get an idea about priorities. And always hope for a patch, or a sponsor.

I'm going back to watching Mr. Robot, enjoy your evening.

-Michael

@icinga-migration

Updated by hostedpower on 2016-01-15 13:17:35 +00:00

Hi dnsmichi,

Thank you for your honest answer. I'm aware of this, more or less. However, I was involved with other open-source projects and there this seemed less the case. People were more helpful, and core devs would surely have looked into this already.

And of course I understand priority. I just don't understand that no one else has this problem, and I assume this is because they have double (or triple or more) running without knowing it (you don't easily see it unless state is also important).

So I think this is still quite an important bug even when you don't notice it. Sometimes the same checks are running up to 5 times over here, and maybe even more!!

If you could give me any clue as to which part of the code to look at, I might take a look and try to fix it myself. If it's a problem of funding, please let me know and I will see what I can do to help financially.

Just a final conclusion which you already know: in my case this bug bites me every day and I'm really starting to hate the software, because I need to log on to several servers almost every day now to "reset" the icinga2 daemon by restarting it. It's really a disaster from my point of view and should have a higher priority.

Regards
Jo

@icinga-migration

Updated by mfriedrich on 2016-01-22 14:57:57 +00:00

  • Relates set to 10963

@icinga-migration

Updated by mfriedrich on 2016-01-22 15:00:13 +00:00

  • Status changed from New to Feedback
  • Assigned to set to hostedpower
  • Target Version set to Backlog

My colleagues were experiencing a similar behavior while debugging on-site at a customer. A possible workaround: in case you are using command_endpoint without a local configuration on the client, make sure to disable the 'checker' feature (see the commands below). It is not needed in that case. If you are eager to test stuff, install the current snapshot packages on both the master and the clients, which tackle the problem of locally scheduling a check that is already executed via command_endpoint.
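On the client, that would be something along these lines (a minimal sketch; assumes the standard systemd service name):

# disable local check scheduling/execution on the agent
icinga2 feature disable checker
# restart so the feature change takes effect
systemctl restart icinga2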

@icinga-migration

Updated by hostedpower on 2016-01-22 20:15:05 +00:00

Hi,

That is great news! I could not believe this bug was being more or less ignored. If you suffer from it, it is really severe!!!

So you say that disabling the checker feature already fixes it? Where do I disable it, on the monitoring server or on the clients?

I assume it will be in one of the next releases soon?

Thanks for updating this ticket!!

@icinga-migration

Updated by tgelf on 2016-01-22 20:22:49 +00:00

  • Relates set to 11014

@icinga-migration

Updated by tgelf on 2016-01-22 20:48:55 +00:00

@hostedpower: you're right, it's a severe problem. Checks are executed twice, sometimes even shipped four times, and things go completely crazy when not only agents but also multiple masters are involved. I had a chance to investigate during the last days. I had a host executing/shipping its swap check 325 times in 5-6 ms :D You can consider the cluster stack in 2.4.x pretty broken right now, but as far as I know this should be fixed very soon.

When you respect a couple of rules, it already works fine:

  • Give the latest snapshot package a try; ideally it carries a timestamp newer than 201601221800. Ask for a dedicated build if no such version is available for your OS - we can trigger them on demand
  • Disabling the checker feature didn't really help me; it looked as if it would, but all we saw was a side effect based on the restart order of your nodes. As soon as we restarted our master, it started going mad again. You should still disable it to be on the safe side; it's a feature you usually don't need on an agent (this should become part of the documentation, I guess)
  • In case you are running multiple masters in an HA scenario, you currently must disable all but one of them. There is no way to get such a setup running fine when agents are involved - at least I've not been able to achieve it.
  • Make sure that there is only one single communication path for every agent. This means that you either remove the host property from the master's endpoint object on your agent, or the other way round: remove it from all the agent endpoints on the master (see the sketch after this list)
  • Set log_duration = 0 for all your agent endpoints on your master. This has been best practice for a while, but it didn't have the desired effect until the latest snapshot builds
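A rough sketch of what the last two points can look like (hypothetical host names and address; adjust to your own zones.conf):

// on the master: the master connects to the agent and keeps no replay log for it
object Endpoint "agent1.example.com" {
  host = "192.0.2.10"
  log_duration = 0
}

// on the agent: the master endpoint carries no 'host' attribute,
// so only one side ever establishes the connection
object Endpoint "master.example.com" {
}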

Checks should run fine afterwards. I opened a couple of other related issues, but most of the communication took place in a customer support ticket @NETWAYS. I provided A LOT of details, collected millions of single events, did a bunch of calculations, estimations and statistics, flame graphs and more - everything extracted from a mid-sized setup with about 30,000 checks. The good news is: it's running fine right now; we were basically able to track down many of our current issues perfectly. It will run even faster very soon - we discovered a couple of knobs that could be tweaked.

Believe me, you are absolutely not alone. There is a lot of pressure right now, and developers were granted enough resources to get those problems fixed very soon. Work already started this week, and to be on the safe side the next two weeks are reserved for this and similar issues.

Cheers,
Thomas

@icinga-migration

Updated by hostedpower on 2016-01-23 09:41:40 +00:00

Hi Thomas,

Thanks a lot for your very extensive reply!! It gives a lot of insight.

I will wait a bit longer then; I've been suffering from this for months now, so 1-2 more weeks won't harm, I suppose. Luckily I only have one master, or, like you said, it would be even worse :|

Jo

@icinga-migration

Updated by mfriedrich on 2016-01-26 18:04:05 +00:00

  • Status changed from Feedback to New
  • Assigned to deleted hostedpower
  • Target Version changed from Backlog to 2.4.2

@icinga-migration

Updated by mfriedrich on 2016-01-27 15:39:27 +00:00

  • Category changed from libmethods to Cluster
  • Status changed from New to Feedback
  • Assigned to set to hostedpower
  • Priority changed from Normal to High

This should be fixed in the current snapshot packages. Can you please test that? It requires both the master and the agents to be patched, so a test setup might come in handy.

@icinga-migration

Updated by hostedpower on 2016-02-02 19:17:27 +00:00

Hi,

Thanks for the fix and sorry for the late reply, been very busy here!

I would like to test it, but unfortunately I only have it installed via CentOS and Debian packages (in production), so I don't think there is an easy way for me to test it?

@icinga-migration

Updated by mfriedrich on 2016-02-04 12:12:26 +00:00

  • Relates deleted 11014

@icinga-migration

Updated by mfriedrich on 2016-02-04 12:13:04 +00:00

  • Duplicates set to 11014

@icinga-migration

Updated by mfriedrich on 2016-02-04 12:15:02 +00:00

  • Status changed from Feedback to Closed
  • Target Version deleted 2.4.2

We consider this fixed with #11014, so I'm closing this issue. In terms of testing: you should ensure that you have a staging environment for testing such problems before putting things into production :)

@icinga-migration icinga-migration added blocker Blocks a release or needs immediate attention bug Something isn't working area/distributed Distributed monitoring (master, satellites, clients) labels Jan 17, 2017