[dev.icinga.com #8712] API Client not reconnecting after reload #2745

icinga-migration · 2015-03-12T11:25:35Z

This issue has been migrated from Redmine: https://dev.icinga.com/issues/8712

Created by TheSerapher on 2015-03-12 11:25:35 +00:00

Assignee: (none)
Status: Closed (closed on 2015-06-18 08:54:37 +00:00)
Target Version: (none)
Last Update: 2015-08-24 09:04:04 +00:00 (in Redmine)

Icinga Version: 2.3.0
Backport?: Not yet backported
Include in Changelog: 1

We are having issues with about 1500 hosts and 24000 services during Icinga2 reloads. More often than not, clients are disconnected from the master and never reconnect. Hence they are not executing any checks. I have attached the log files. The timestamp for the triggered API reload is 11:15 in those logs, the disconnect happens at 11:36. I have also attached the out put of:

gdb -p $(pidof icinga2) -ex 'thread apply all bt full' --batch

If any other data is required please let me know.

Attachments

gdb.master.txt TheSerapher - 2015-03-12 11:25:05 +00:00 - Master GDB Output
gdb.checker.txt TheSerapher - 2015-03-12 11:25:05 +00:00 - Checker GDB Output
icinga2.log.checker TheSerapher - 2015-03-12 11:25:05 +00:00 - Checker Logfile
icinga2.log.master TheSerapher - 2015-03-12 11:25:05 +00:00 - Master Logfile
checker.txt mwaldmueller - 2015-08-12 15:18:31 +00:00 - GDB checker
master.txt mwaldmueller - 2015-08-12 15:18:41 +00:00 - GDB master

Relations:

relates #9406

The text was updated successfully, but these errors were encountered:

icinga-migration · 2015-03-12T11:29:29Z

Updated by TheSerapher on 2015-03-12 11:29:29 +00:00

I have seen this issue: https://dev.icinga.org/issues/8672

I don't think the fix applied there is addressing our continued disconnected clients. According to the update notes on 2.3.0 Icinga2 API clients should re-connect at all times. We are not seeing any reconnects though.

icinga-migration · 2015-03-12T11:34:11Z

Updated by TheSerapher on 2015-03-12 11:34:11 +00:00

And here the report by the cluster check plugin. The DC2 master can be ignored, we have turned that one off due to MySQL IDO deadlock and lock timeout issues.

Icinga 2 Cluster Problem: 2 Endpoints (icinga2-checker-2.monitor.dc2, icinga2-master-1.monitor.dc2) not connected.

Performance data    
Label   Value   Min Max Warning Critical
api_num_conn_endpoints  3.00    -   -   -   -
api_num_endpoints   5.00    -   -   -   -
api_num_not_conn_endpoints  2.00    -   -   -   -
checkercomponent_checker_idle   8,977.00    -   -   -   -
checkercomponent_checker_pending    33.00   -   -   -   -

icinga-migration · 2015-03-12T11:46:50Z

Updated by TheSerapher on 2015-03-12 11:46:50 +00:00

After restart (12:39 in the log), the checker in question (and running in this high load issue: https://dev.icinga.org/issues/8670) reports being connected:

[2015-03-12 12:39:48 +0100] information/ApiListener: New client connection for identity 'icinga2-master-1.monitor.dc1'

No more log messages after that on the checker. The master reports this:

[2015-03-12 12:39:00 +0100] information/ApiClient: Reconnecting to API endpoint 'icinga2-checker-1.monitor.dc2' via host 'icinga2-checker-1.monitor.dc2' and port '5665'
[...]
[2015-03-12 12:39:49 +0100] information/ApiListener: New client connection for identity 'icinga2-checker-1.monitor.dc2'
[2015-03-12 12:39:49 +0100] information/ApiListener: Syncing global zone 'global-templates'.
[...]
[2015-03-12 12:40:54 +0100] information/ApiClient: No messages for identity 'icinga2-checker-1.monitor.dc2' have been received in the last 60 seconds.
[2015-03-12 12:40:54 +0100] warning/ApiClient: API client disconnected for identity 'icinga2-checker-1.monitor.dc2'
[2015-03-12 12:40:54 +0100] warning/ApiListener: Removing API client for endpoint 'icinga2-checker-1.monitor.dc2'. 0 API clients left.

icinga-migration · 2015-03-18T22:05:42Z

Updated by rhillmann on 2015-03-18 22:05:42 +00:00

I have seen something similar, but i am getting sometimes the following behavior as well:

[2015-03-18 22:53:39 +0100] information/ApiClient: Not sending heartbeat for endpoint 'dresdenguest6.muc.domain.com' because we're replaying the log for it.
[2015-03-18 22:53:39 +0100] information/ApiClient: Not sending heartbeat for endpoint 'hydrogenguest6.a2.rz1.domain.com' because we're replaying the log for it.
[2015-03-18 22:53:49 +0100] information/ApiClient: Not sending heartbeat for endpoint 'dresdenguest6.muc.domain.com' because we're replaying the log for it.
[2015-03-18 22:53:49 +0100] information/ApiClient: Not sending heartbeat for endpoint 'dresdenguest6.muc.domain.com' because we're replaying the log for it.
[2015-03-18 22:53:49 +0100] information/ApiClient: Not sending heartbeat for endpoint 'hydrogenguest6.a2.rz1.domain.com' because we're replaying the log for it.
[2015-03-18 22:53:59 +0100] information/ApiClient: Not sending heartbeat for endpoint 'dresdenguest6.muc.domain.com' because we're replaying the log for it.
[2015-03-18 22:53:59 +0100] information/ApiClient: Not sending heartbeat for endpoint 'dresdenguest6.muc.domain.com' because we're replaying the log for it.
[2015-03-18 22:53:59 +0100] information/ApiClient: Not sending heartbeat for endpoint 'hydrogenguest6.a2.rz1.domain.com' because we're replaying the log for it.
[2015-03-18 22:54:09 +0100] information/ApiClient: Not sending heartbeat for endpoint 'dresdenguest6.muc.domain.com' because we're replaying the log for it.
[2015-03-18 22:54:09 +0100] information/ApiClient: Not sending heartbeat for endpoint 'dresdenguest6.muc.domain.com' because we're replaying the log for it.
[2015-03-18 22:54:09 +0100] information/ApiClient: Not sending heartbeat for endpoint 'hydrogenguest6.a2.rz1.domain.com' because we're replaying the log for it.

Another bad side effect, the built in icinga check "cluster" shows the reconnecting server as not connected. This is correct, because they need to long to connect again. Why does the reload command disconnects every api client? Isn´t it better to leave the connection online and just reload api connection when something has been changed on that? I would mind the reload command should not interrupt api connection, if its not really neccessary (such like changes on endpoints or zone config).

Icinga 2 Cluster Problem: 6 Endpoints (chlorineguest10.a2.rz1.domain.com, dresdenguest6.muc.domain.com, dresdenguest7.muc.domain.com, hydrogenguest3.a1.rz0.domain.com, hydrogenguest4.a1.rz0.domain.com, hydrogenguest9.b2.rz1.domain.com) not connected.

icinga-migration · 2015-03-26T22:43:54Z

Updated by macfergus on 2015-03-26 22:43:54 +00:00

I think we are seeing the same bug, or possibly just something very similar. We get the same "replaying the log for it" messages.

It seems like what happens is:

The remote restarts just slightly before the master
The remote reconnects to the master.
Log replay begins.
The master restarts during log replay.
The remote never recovers.

I have a plausible guess at how to fix it. During ReplayLog, if there is an exception, it never clears the GetSyncing bit. Many other operations check the GetSyncing bit before proceeding, so they are stuck forever.

The exception occurs here:

icinga2/lib/remote/apilistener.cpp

Line 741 in 8573636

NetString::WriteStringToStream(client->GetStream(), pmessage->Get("message"));

It clears the syncing bit here:

icinga2/lib/remote/apilistener.cpp

Line 770 in 8573636

endpoint->SetSyncing(false);

So when there's an exception, the latter never happens. I'd guess that if you ensure that the bit gets cleared despite an exception, it would fix the bug. Or maybe some other cleanup step is more appropriate.

Here's a snippet from our log:

[2015-03-26 20:21:47 +0000] information/ApiClient: Reconnecting to API endpoint 'monitoring' via host '(foo)' and port '5665'
[2015-03-26 20:21:47 +0000] information/ApiListener: New client connection for identity 'monitoring'
[2015-03-26 20:21:47 +0000] notice/ApiListener: Replaying log: /var/lib/icinga2/api/log/1427401300
[2015-03-26 20:21:47 +0000] notice/ApiClient: Received 'config::Update' message from 'monitoring'
[2015-03-26 20:21:48 +0000] notice/ApiClient: Received 'log::SetLogPosition' message from 'monitoring'
[2015-03-26 20:21:48 +0000] notice/ApiListener: Replaying log: /var/lib/icinga2/api/log/1427401308
[2015-03-26 20:21:48 +0000] critical/ApiListener: Cannot connect to host '(foo)' on port '5665'
[2015-03-26 20:21:48 +0000] debug/ApiListener: Cannot connect to host '(foo)' on port '5665'
Error: Tried to read/write from a closed socket.
    (0) libbase.so: void boost::throw_exception >(boost::exception_detail::error_info_injector const&) (+0x71) [0x7fde477428c1]
    (1)  libbase.so:  void  boost::exception_detail::throw_exception_(std::runtime_error  const&,  char  const*,  char  const*,  int)  (+0x59)  [0x7fde47742969]
    (2) libbase.so: icinga::SocketEvents::ChangeEvents(int) (+0xc7) [0x7fde476ebc77]
    (3) libbase.so: icinga::TlsStream::Write(void const*, unsigned long) (+0x52) [0x7fde476ebd12]
    (4) libbase.so: icinga::NetString::WriteStringToStream(boost::intrusive_ptr const&, icinga::String const&) (+0xb4) [0x7fde476e49b4]
    (5) libremote.so: icinga::ApiListener::ReplayLog(boost::intrusive_ptr const&) (+0x884) [0x7fde44ced174]
    (6) libremote.so: icinga::ApiListener::NewClientHandler(boost::intrusive_ptr const&, icinga::String const&, icinga::ConnectionRole) (+0x2fb) [0x7fde44cf23fb]
    (7) libremote.so: icinga::ApiListener::AddConnection(boost::intrusive_ptr const&) (+0x23e) [0x7fde44cf2bbe]
    (8) libboost_thread.so.1.54.0:  (+0xba4a) [0x7fde4806fa4a]
    (9) libpthread.so.0:  (+0x8182) [0x7fde479dc182]
    (10) libc.so.6: clone (+0x6d) [0x7fde468a847d]
[2015-03-26 20:21:49 +0000] notice/ApiListener: Relaying 'event::CheckResult' message
[2015-03-26 20:21:52 +0000] debug/ApiListener: Not connecting to Zone 'master' because we're already connected to it.
[2015-03-26 20:21:57 +0000] information/ApiClient: Not sending heartbeat for endpoint 'monitoring' because we're replaying the log for it.

icinga-migration · 2015-04-16T13:01:41Z

Updated by TheSerapher on 2015-04-16 13:01:41 +00:00

We are also seeing API disconnects resulting in core exception due to a missing file:

 Caught unhandled exception.
Current time: 2015-04-16 13:54:59 +0200

Application information:
  Application version: v2.3.0-124-g3d4be0e
  Installation root: /usr
  Sysconf directory: /etc
  Run directory: /run
  Local state directory: /var
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid

System information:
  Operating system: Linux
  Operating system version: 3.13.0-46-generic
  Architecture: x86_64
  Distribution: Ubuntu 14.04.2 LTS

Error: Function call 'rename' for file '/var/lib/icinga2/icinga2.state.tmp' failed with error code 2, 'No such file or directory'
***
* This would indicate a runtime problem or configuration error. If you believe this is a bug in Icinga 2
* please submit a bug report at https://dev.icinga.org/ and include this stack trace as well as any other
* information that might be useful in order to reproduce this problem.
***
Failed to launch GDB: No such file or directory

This happens ever so often when reloading new configurations. The Cluster Node then seems to be running (process wise) but is not doing anything. Restarting the instance once again fixes this.

icinga-migration · 2015-06-18T08:54:37Z

Updated by mfriedrich on 2015-06-18 08:54:37 +00:00

Category set to Cluster
Status changed from New to Closed

Fixed in 2.3.5 - please re-test.

icinga-migration · 2015-06-18T08:54:45Z

Updated by mfriedrich on 2015-06-18 08:54:45 +00:00

Relates set to 9406

icinga-migration · 2015-06-18T10:05:01Z

Updated by TheSerapher on 2015-06-18 10:05:01 +00:00

Looks good so far. We just installed the latest package from the development tree and restarted the entire cluster at the same time. One node (a low power VM) took a bit longer but eventually also reconnected.

Great work! We will keep an eye on this but this looks way better than before.

icinga-migration · 2015-08-12T15:31:28Z

Updated by mwaldmueller on 2015-08-12 15:31:28 +00:00

File added checker.txt
File added master.txt

I have to reopen the issue, the problem still exists.

Icinga 2 log of checker:
[2015-08-12 17:14:30 +0200] information/ApiClient: Not sending heartbeat for endpoint 'checker.localdomain' because we're replaying the log for it.
[2015-08-12 17:14:40 +0200] information/ApiClient: Not sending heartbeat for endpoint 'checker.localdomain' because we're replaying the log for it.
[2015-08-12 17:14:50 +0200] information/ApiClient: Not sending heartbeat for endpoint 'checker.localdomain' because we're replaying the log for it.

icinga-migration · 2015-08-12T15:44:30Z

Updated by mwaldmueller on 2015-08-12 15:44:30 +00:00

mwaldmueller wrote:

I have to reopen the issue, the problem still exists.

Icinga 2 log of checker:
[2015-08-12 17:14:30 +0200] information/ApiClient: Not sending heartbeat for endpoint 'checker.localdomain' because we're replaying the log for it.
[2015-08-12 17:14:40 +0200] information/ApiClient: Not sending heartbeat for endpoint 'checker.localdomain' because we're replaying the log for it.
[2015-08-12 17:14:50 +0200] information/ApiClient: Not sending heartbeat for endpoint 'checker.localdomain' because we're replaying the log for it.

Sorry, I forgot to mention the Icinga 2 version: 2.3.8 on all nodes

icinga-migration · 2015-08-24T09:04:04Z

Updated by mwaldmueller on 2015-08-24 09:04:04 +00:00

I opened a new ticket with my issue: #9976

icinga-migration closed this as completed Jun 18, 2015

icinga-migration added bug Something isn't working area/distributed Distributed monitoring (master, satellites, clients) labels Jan 17, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[dev.icinga.com #8712] API Client not reconnecting after reload #2745

[dev.icinga.com #8712] API Client not reconnecting after reload #2745

icinga-migration commented Mar 12, 2015

icinga-migration commented Mar 12, 2015

icinga-migration commented Mar 12, 2015

icinga-migration commented Mar 12, 2015

icinga-migration commented Mar 18, 2015

icinga-migration commented Mar 26, 2015

icinga-migration commented Apr 16, 2015

icinga-migration commented Jun 18, 2015

icinga-migration commented Jun 18, 2015

icinga-migration commented Jun 18, 2015

icinga-migration commented Aug 12, 2015

icinga-migration commented Aug 12, 2015

icinga-migration commented Aug 24, 2015

[dev.icinga.com #8712] API Client not reconnecting after reload #2745

[dev.icinga.com #8712] API Client not reconnecting after reload #2745

Comments

icinga-migration commented Mar 12, 2015

icinga-migration commented Mar 12, 2015

icinga-migration commented Mar 12, 2015

icinga-migration commented Mar 12, 2015

icinga-migration commented Mar 18, 2015

icinga-migration commented Mar 26, 2015

icinga-migration commented Apr 16, 2015

icinga-migration commented Jun 18, 2015

icinga-migration commented Jun 18, 2015

icinga-migration commented Jun 18, 2015

icinga-migration commented Aug 12, 2015

icinga-migration commented Aug 12, 2015

icinga-migration commented Aug 24, 2015