[dev.icinga.com #8712] API Client not reconnecting after reload #2745
Comments
Updated by TheSerapher on 2015-03-12 11:29:29 +00:00 I have seen this issue: https://dev.icinga.org/issues/8672 I don't think the fix applied there addresses our persistently disconnected clients. According to the 2.3.0 update notes, Icinga 2 API clients should always reconnect. We are not seeing any reconnects, though. |
Updated by TheSerapher on 2015-03-12 11:34:11 +00:00 And here is the report from the cluster check plugin. The DC2 master can be ignored; we have turned that one off due to MySQL IDO deadlock and lock timeout issues.
|
Updated by TheSerapher on 2015-03-12 11:46:50 +00:00 After the restart (12:39 in the log), the checker in question (which is also running into this high-load issue: https://dev.icinga.org/issues/8670) reports being connected:
No more log messages after that on the checker. The master reports this:
|
Updated by rhillmann on 2015-03-18 22:05:42 +00:00 I have seen something similar, but I sometimes also get the following behavior:
Another bad side effect: the built-in Icinga check "cluster" shows the reconnecting server as not connected. This is correct, because it takes too long to connect again. Why does the reload command disconnect every API client? Wouldn't it be better to keep the connections online and only reload an API connection when something affecting it has changed? I would think the reload command should not interrupt API connections unless it is really necessary (such as changes to the endpoint or zone config).
|
Updated by macfergus on 2015-03-26 22:43:54 +00:00 I think we are seeing the same bug, or possibly just something very similar. We get the same "replaying the log for it" messages. It seems like what happens is:
I have a plausible guess at how to fix it. During ReplayLog, if there is an exception, the GetSyncing bit is never cleared. Many other operations check the GetSyncing bit before proceeding, so they are stuck forever. The exception occurs here: icinga2/lib/remote/apilistener.cpp Line 741 in 8573636
The syncing bit is cleared here: icinga2/lib/remote/apilistener.cpp Line 770 in 8573636
So when there's an exception, the latter never happens. I'd guess that ensuring the bit gets cleared despite an exception would fix the bug. Or maybe some other cleanup step is more appropriate. Here's a snippet from our log:
|
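The fix macfergus suggests above (make sure the syncing bit is cleared even when ReplayLog throws) could be sketched with an RAII guard. This is only an illustration under stated assumptions, not the change that actually went into Icinga 2: `Endpoint` here is a hypothetical stand-in class, `SetSyncing` is assumed to exist alongside the `GetSyncing` mentioned in the comment, and `SyncingGuard`, `ReplayLogSketch` and `SyncingClearedAfterFailure` are names invented for this example.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for an endpoint with a "syncing" flag.
// This is NOT Icinga 2's real Endpoint class.
class Endpoint {
public:
    bool GetSyncing() const { return syncing_; }
    void SetSyncing(bool value) { syncing_ = value; }  // assumed setter

private:
    bool syncing_ = false;
};

// RAII guard (invented for this sketch): sets the flag on construction
// and clears it in the destructor, which also runs during stack
// unwinding when an exception propagates out of the guarded scope.
class SyncingGuard {
public:
    explicit SyncingGuard(Endpoint& endpoint) : endpoint_(endpoint) {
        endpoint_.SetSyncing(true);
    }
    ~SyncingGuard() { endpoint_.SetSyncing(false); }

    SyncingGuard(const SyncingGuard&) = delete;
    SyncingGuard& operator=(const SyncingGuard&) = delete;

private:
    Endpoint& endpoint_;
};

// Simplified replay routine: the real one reads log files and sends
// messages; here a failure is only simulated with a thrown exception.
void ReplayLogSketch(Endpoint& endpoint, bool simulateFailure) {
    SyncingGuard guard(endpoint);
    assert(endpoint.GetSyncing());  // flag is set while replaying

    if (simulateFailure)
        throw std::runtime_error("simulated replay failure");
    // Normal completion: the guard clears the flag on scope exit.
}

// Helper showing that the flag does not stay set after a failed replay.
bool SyncingClearedAfterFailure() {
    Endpoint endpoint;
    try {
        ReplayLogSketch(endpoint, true);
    } catch (const std::runtime_error&) {
        // The caller decides how to handle the error; the flag is
        // already cleared by the guard at this point.
    }
    return !endpoint.GetSyncing();
}
```

Calling `SyncingClearedAfterFailure()` from a small `main` should return `true`. A `try`/`catch` block that clears the flag and rethrows would have the same effect; the guard just keeps the cleanup in one place if more early exits are added later.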
Updated by TheSerapher on 2015-04-16 13:01:41 +00:00 We are also seeing API disconnects that result in a core exception due to a missing file:
This happens every so often when reloading new configurations. The cluster node then seems to be running (process-wise) but is not doing anything. Restarting the instance once again fixes this. |
Updated by mfriedrich on 2015-06-18 08:54:37 +00:00
Fixed in 2.3.5 - please re-test. |
Updated by mfriedrich on 2015-06-18 08:54:45 +00:00
|
Updated by TheSerapher on 2015-06-18 10:05:01 +00:00 Looks good so far. We just installed the latest package from the development tree and restarted the entire cluster at the same time. One node (a low power VM) took a bit longer but eventually also reconnected. Great work! We will keep an eye on this but this looks way better than before. |
Updated by mwaldmueller on 2015-08-12 15:31:28 +00:00
I have to reopen the issue; the problem still exists. Icinga 2 log of the checker: |
Updated by mwaldmueller on 2015-08-12 15:44:30 +00:00 mwaldmueller wrote:
Sorry, I forgot to mention the Icinga 2 version: 2.3.8 on all nodes |
Updated by mwaldmueller on 2015-08-24 09:04:04 +00:00 I opened a new ticket with my issue: #9976 |
This issue has been migrated from Redmine: https://dev.icinga.com/issues/8712
Created by TheSerapher on 2015-03-12 11:25:35 +00:00
Assignee: (none)
Status: Closed (closed on 2015-06-18 08:54:37 +00:00)
Target Version: (none)
Last Update: 2015-08-24 09:04:04 +00:00 (in Redmine)
We are having issues with about 1500 hosts and 24000 services during Icinga 2 reloads. More often than not, clients are disconnected from the master and never reconnect, so they stop executing checks. I have attached the log files. The timestamp for the triggered API reload is 11:15 in those logs; the disconnect happens at 11:36. I have also attached the output of:
If any other data is required, please let me know.
Attachments
Relations: