[dev.icinga.com #3099] only reap results when checkresult_list is not overloaded (max_check_result_list_items) #1104
Comments
Updated by mfriedrich on 2012-09-10 08:25:35 +00:00 we discussed that last week in nuremberg - what if one of the neb modules (like mod_gearman, dnx) fiddles with the check_result_list in memory? then your counter might be invalid, as the check_result_list is not threadsafe. the possibility to hit the while loop is rather small, but it still might be possible. |
Updated by tgelf on 2012-09-10 09:13:59 +00:00 Hi Michi, that's what I've written in my comment, isn't it? However, the patch as shown shall work fine with such NEBs, as it counts all elements at each single run. Doing so doesn't hurt as there are usually zero elements in the list at this point. Cheers, |
Updated by mfriedrich on 2012-09-10 09:41:37 +00:00 it was meant the time frame before the while loop and afterwards - question is if the neb modules could fiddle with the list in between that windows. if they are not (only doing that on special core events), it should be fine. though, i do not really remember how they do it, so before applying this, it will be mandatory to check that. besides, how about a config option for the max items in queue value? hardcoding 1000 might be too much/little? |
Updated by mfriedrich on 2012-09-10 09:41:53 +00:00
|
Updated by tgelf on 2012-09-10 10:09:26 +00:00 dnsmichi wrote:
Afaik they don't, they (should) act only on callbacks. I only checked mod_gearman, looks fine - but I didn't try it out. If a NEB modifies the list while process_check_result_queue is adding elements to the list far worse things could happen, it would probably have trouble with add_check_result_to_list :p
I agree on this, something like max_checkresults_list_length would be great. Usually there shall be 0 elements, I have chosen 1000 to speed things up when being stressed - otherwise it first has to reap everything before reading new data. process_check_result_queue is still able to write tens of thousands of results to memory (everything in che checkresults folder), but it will stop doing so unless all of them have been reaped afterwards. |
Updated by mfriedrich on 2012-09-10 13:55:11 +00:00 something like that.
|
Updated by tgelf on 2012-09-10 13:57:16 +00:00 Looks great. No setting or default setting (0) means former behaviour, no action. |
Updated by mfriedrich on 2012-09-10 14:02:11 +00:00 good idea. i was not sure how to handle that with existing setups. setting the default to 0, and checking that.
|
Updated by tgelf on 2012-09-10 14:14:58 +00:00 I'd opt for putting everything in the if block, doing so completely avoids counting list items for those not using this feature:
|
Updated by mfriedrich on 2012-09-10 14:50:44 +00:00 yep good. i do not really have a "check result generator" right on, do you have such? |
Updated by mfriedrich on 2012-09-10 14:53:47 +00:00
|
Updated by tgelf on 2012-09-10 14:57:12 +00:00 dnsmichi wrote:
Not really. "All" I have are live environments hammering check result data from distrubuted slaves to the master. |
Updated by mfriedrich on 2012-09-10 15:03:01 +00:00 ok, so something you could test for me then :) i will push a revision to git soon. |
Updated by mfriedrich on 2012-09-10 15:39:24 +00:00 mod_gearman merges the lists here which is not put within a lock, but only calls that when the core signals that the checkresult reaper would be called. https://github.com/sni/mod\_gearman/blob/master/neb\_module/mod\_gearman.c#L255
so this is safe, as it's a serialized event, waiting for the callback itsself, not fiddling within memory. for dnx i cannot tell, but i guess it's quite the same mechanism, as some would break the core entirely otherwise. |
Updated by mfriedrich on 2012-09-23 09:21:25 +00:00
it's tagged experimental, so everyone needs to take care themselves, and they might provide feedback then. |
This issue has been migrated from Redmine: https://dev.icinga.com/issues/3099
Created by tgelf on 2012-09-10 08:11:22 +00:00
Assignee: mfriedrich
Status: Resolved (closed on 2012-09-23 09:21:25 +00:00)
Target Version: 1.8
Last Update: 2012-09-23 09:21:25 +00:00 (in Redmine)
We often stumbled over an annoying problem in distributed setups when IDO is involved. If the reaper runs into it's own timeout, the checkresults linked-list will start growing. The core will get slower and slower and process less and lesser events until timeout happens.
If you reach the point where events come in faster than they are processed, game is over: the core will be unable to ever catch up. The only chance you have is killing the daemon, but doing so you loose all in-memory results. And there will be MANY of them.
Here is a quick fix for this issue (fixes process_check_result_queue in utils.c):
Well... of course this could be improved, but at least it works. Instead of locking up at 100% CPU usage Icinga will manage it to reap a grown queue and continue it's work. Please note that adding a counter to avoid counting single elements will not work as there are NEBs out there directly modifying the checkresults list.
Regards,
Thomas Gelf
Changesets
2012-09-10 15:12:59 +00:00 by mfriedrich 3f3840b
2012-09-10 15:46:48 +00:00 by mfriedrich 46cfbb3
2012-09-10 15:56:51 +00:00 by mfriedrich a3f2994
Relations:
The text was updated successfully, but these errors were encountered: