Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[dev.icinga.com #8712] API Client not reconnecting after reload #2745

Closed
icinga-migration opened this issue Mar 12, 2015 · 12 comments
Closed
Labels
area/distributed Distributed monitoring (master, satellites, clients) bug Something isn't working

Comments

@icinga-migration
Copy link

This issue has been migrated from Redmine: https://dev.icinga.com/issues/8712

Created by TheSerapher on 2015-03-12 11:25:35 +00:00

Assignee: (none)
Status: Closed (closed on 2015-06-18 08:54:37 +00:00)
Target Version: (none)
Last Update: 2015-08-24 09:04:04 +00:00 (in Redmine)

Icinga Version: 2.3.0
Backport?: Not yet backported
Include in Changelog: 1

We are having issues with about 1500 hosts and 24000 services during Icinga2 reloads. More often than not, clients are disconnected from the master and never reconnect. Hence they are not executing any checks. I have attached the log files. The timestamp for the triggered API reload is 11:15 in those logs, the disconnect happens at 11:36. I have also attached the out put of:

gdb -p $(pidof icinga2) -ex 'thread apply all bt full' --batch

If any other data is required please let me know.

Attachments

  • gdb.master.txt TheSerapher - 2015-03-12 11:25:05 +00:00 - Master GDB Output
  • gdb.checker.txt TheSerapher - 2015-03-12 11:25:05 +00:00 - Checker GDB Output
  • icinga2.log.checker TheSerapher - 2015-03-12 11:25:05 +00:00 - Checker Logfile
  • icinga2.log.master TheSerapher - 2015-03-12 11:25:05 +00:00 - Master Logfile
  • checker.txt mwaldmueller - 2015-08-12 15:18:31 +00:00 - GDB checker
  • master.txt mwaldmueller - 2015-08-12 15:18:41 +00:00 - GDB master

Relations:

@icinga-migration
Copy link
Author

Updated by TheSerapher on 2015-03-12 11:29:29 +00:00

I have seen this issue: https://dev.icinga.org/issues/8672

I don't think the fix applied there is addressing our continued disconnected clients. According to the update notes on 2.3.0 Icinga2 API clients should re-connect at all times. We are not seeing any reconnects though.

@icinga-migration
Copy link
Author

Updated by TheSerapher on 2015-03-12 11:34:11 +00:00

And here the report by the cluster check plugin. The DC2 master can be ignored, we have turned that one off due to MySQL IDO deadlock and lock timeout issues.

Icinga 2 Cluster Problem: 2 Endpoints (icinga2-checker-2.monitor.dc2, icinga2-master-1.monitor.dc2) not connected.

Performance data    
Label   Value   Min Max Warning Critical
api_num_conn_endpoints  3.00    -   -   -   -
api_num_endpoints   5.00    -   -   -   -
api_num_not_conn_endpoints  2.00    -   -   -   -
checkercomponent_checker_idle   8,977.00    -   -   -   -
checkercomponent_checker_pending    33.00   -   -   -   -

@icinga-migration
Copy link
Author

Updated by TheSerapher on 2015-03-12 11:46:50 +00:00

After restart (12:39 in the log), the checker in question (and running in this high load issue: https://dev.icinga.org/issues/8670) reports being connected:

[2015-03-12 12:39:48 +0100] information/ApiListener: New client connection for identity 'icinga2-master-1.monitor.dc1'

No more log messages after that on the checker. The master reports this:

[2015-03-12 12:39:00 +0100] information/ApiClient: Reconnecting to API endpoint 'icinga2-checker-1.monitor.dc2' via host 'icinga2-checker-1.monitor.dc2' and port '5665'
[...]
[2015-03-12 12:39:49 +0100] information/ApiListener: New client connection for identity 'icinga2-checker-1.monitor.dc2'
[2015-03-12 12:39:49 +0100] information/ApiListener: Syncing global zone 'global-templates'.
[...]
[2015-03-12 12:40:54 +0100] information/ApiClient: No messages for identity 'icinga2-checker-1.monitor.dc2' have been received in the last 60 seconds.
[2015-03-12 12:40:54 +0100] warning/ApiClient: API client disconnected for identity 'icinga2-checker-1.monitor.dc2'
[2015-03-12 12:40:54 +0100] warning/ApiListener: Removing API client for endpoint 'icinga2-checker-1.monitor.dc2'. 0 API clients left.

@icinga-migration
Copy link
Author

Updated by rhillmann on 2015-03-18 22:05:42 +00:00

I have seen something similar, but i am getting sometimes the following behavior as well:

[2015-03-18 22:53:39 +0100] information/ApiClient: Not sending heartbeat for endpoint 'dresdenguest6.muc.domain.com' because we're replaying the log for it.
[2015-03-18 22:53:39 +0100] information/ApiClient: Not sending heartbeat for endpoint 'hydrogenguest6.a2.rz1.domain.com' because we're replaying the log for it.
[2015-03-18 22:53:49 +0100] information/ApiClient: Not sending heartbeat for endpoint 'dresdenguest6.muc.domain.com' because we're replaying the log for it.
[2015-03-18 22:53:49 +0100] information/ApiClient: Not sending heartbeat for endpoint 'dresdenguest6.muc.domain.com' because we're replaying the log for it.
[2015-03-18 22:53:49 +0100] information/ApiClient: Not sending heartbeat for endpoint 'hydrogenguest6.a2.rz1.domain.com' because we're replaying the log for it.
[2015-03-18 22:53:59 +0100] information/ApiClient: Not sending heartbeat for endpoint 'dresdenguest6.muc.domain.com' because we're replaying the log for it.
[2015-03-18 22:53:59 +0100] information/ApiClient: Not sending heartbeat for endpoint 'dresdenguest6.muc.domain.com' because we're replaying the log for it.
[2015-03-18 22:53:59 +0100] information/ApiClient: Not sending heartbeat for endpoint 'hydrogenguest6.a2.rz1.domain.com' because we're replaying the log for it.
[2015-03-18 22:54:09 +0100] information/ApiClient: Not sending heartbeat for endpoint 'dresdenguest6.muc.domain.com' because we're replaying the log for it.
[2015-03-18 22:54:09 +0100] information/ApiClient: Not sending heartbeat for endpoint 'dresdenguest6.muc.domain.com' because we're replaying the log for it.
[2015-03-18 22:54:09 +0100] information/ApiClient: Not sending heartbeat for endpoint 'hydrogenguest6.a2.rz1.domain.com' because we're replaying the log for it.

Another bad side effect, the built in icinga check "cluster" shows the reconnecting server as not connected. This is correct, because they need to long to connect again. Why does the reload command disconnects every api client? Isn´t it better to leave the connection online and just reload api connection when something has been changed on that? I would mind the reload command should not interrupt api connection, if its not really neccessary (such like changes on endpoints or zone config).

Icinga 2 Cluster Problem: 6 Endpoints (chlorineguest10.a2.rz1.domain.com, dresdenguest6.muc.domain.com, dresdenguest7.muc.domain.com, hydrogenguest3.a1.rz0.domain.com, hydrogenguest4.a1.rz0.domain.com, hydrogenguest9.b2.rz1.domain.com) not connected.

@icinga-migration
Copy link
Author

Updated by macfergus on 2015-03-26 22:43:54 +00:00

I think we are seeing the same bug, or possibly just something very similar. We get the same "replaying the log for it" messages.

It seems like what happens is:

  • The remote restarts just slightly before the master
  • The remote reconnects to the master.
  • Log replay begins.
  • The master restarts during log replay.
  • The remote never recovers.

I have a plausible guess at how to fix it. During ReplayLog, if there is an exception, it never clears the GetSyncing bit. Many other operations check the GetSyncing bit before proceeding, so they are stuck forever.

The exception occurs here:

NetString::WriteStringToStream(client->GetStream(), pmessage->Get("message"));

It clears the syncing bit here:

endpoint->SetSyncing(false);

So when there's an exception, the latter never happens. I'd guess that if you ensure that the bit gets cleared despite an exception, it would fix the bug. Or maybe some other cleanup step is more appropriate.

Here's a snippet from our log:

[2015-03-26 20:21:47 +0000] information/ApiClient: Reconnecting to API endpoint 'monitoring' via host '(foo)' and port '5665'
[2015-03-26 20:21:47 +0000] information/ApiListener: New client connection for identity 'monitoring'
[2015-03-26 20:21:47 +0000] notice/ApiListener: Replaying log: /var/lib/icinga2/api/log/1427401300
[2015-03-26 20:21:47 +0000] notice/ApiClient: Received 'config::Update' message from 'monitoring'
[2015-03-26 20:21:48 +0000] notice/ApiClient: Received 'log::SetLogPosition' message from 'monitoring'
[2015-03-26 20:21:48 +0000] notice/ApiListener: Replaying log: /var/lib/icinga2/api/log/1427401308
[2015-03-26 20:21:48 +0000] critical/ApiListener: Cannot connect to host '(foo)' on port '5665'
[2015-03-26 20:21:48 +0000] debug/ApiListener: Cannot connect to host '(foo)' on port '5665'
Error: Tried to read/write from a closed socket.
    (0) libbase.so: void boost::throw_exception >(boost::exception_detail::error_info_injector const&) (+0x71) [0x7fde477428c1]
    (1)  libbase.so:  void  boost::exception_detail::throw_exception_(std::runtime_error  const&,  char  const*,  char  const*,  int)  (+0x59)  [0x7fde47742969]
    (2) libbase.so: icinga::SocketEvents::ChangeEvents(int) (+0xc7) [0x7fde476ebc77]
    (3) libbase.so: icinga::TlsStream::Write(void const*, unsigned long) (+0x52) [0x7fde476ebd12]
    (4) libbase.so: icinga::NetString::WriteStringToStream(boost::intrusive_ptr const&, icinga::String const&) (+0xb4) [0x7fde476e49b4]
    (5) libremote.so: icinga::ApiListener::ReplayLog(boost::intrusive_ptr const&) (+0x884) [0x7fde44ced174]
    (6) libremote.so: icinga::ApiListener::NewClientHandler(boost::intrusive_ptr const&, icinga::String const&, icinga::ConnectionRole) (+0x2fb) [0x7fde44cf23fb]
    (7) libremote.so: icinga::ApiListener::AddConnection(boost::intrusive_ptr const&) (+0x23e) [0x7fde44cf2bbe]
    (8) libboost_thread.so.1.54.0:  (+0xba4a) [0x7fde4806fa4a]
    (9) libpthread.so.0:  (+0x8182) [0x7fde479dc182]
    (10) libc.so.6: clone (+0x6d) [0x7fde468a847d]
[2015-03-26 20:21:49 +0000] notice/ApiListener: Relaying 'event::CheckResult' message
[2015-03-26 20:21:52 +0000] debug/ApiListener: Not connecting to Zone 'master' because we're already connected to it.
[2015-03-26 20:21:57 +0000] information/ApiClient: Not sending heartbeat for endpoint 'monitoring' because we're replaying the log for it.

@icinga-migration
Copy link
Author

Updated by TheSerapher on 2015-04-16 13:01:41 +00:00

We are also seeing API disconnects resulting in core exception due to a missing file:

 Caught unhandled exception.
Current time: 2015-04-16 13:54:59 +0200

Application information:
  Application version: v2.3.0-124-g3d4be0e
  Installation root: /usr
  Sysconf directory: /etc
  Run directory: /run
  Local state directory: /var
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid

System information:
  Operating system: Linux
  Operating system version: 3.13.0-46-generic
  Architecture: x86_64
  Distribution: Ubuntu 14.04.2 LTS

Error: Function call 'rename' for file '/var/lib/icinga2/icinga2.state.tmp' failed with error code 2, 'No such file or directory'
***
* This would indicate a runtime problem or configuration error. If you believe this is a bug in Icinga 2
* please submit a bug report at https://dev.icinga.org/ and include this stack trace as well as any other
* information that might be useful in order to reproduce this problem.
***
Failed to launch GDB: No such file or directory

This happens ever so often when reloading new configurations. The Cluster Node then seems to be running (process wise) but is not doing anything. Restarting the instance once again fixes this.

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2015-06-18 08:54:37 +00:00

  • Category set to Cluster
  • Status changed from New to Closed

Fixed in 2.3.5 - please re-test.

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2015-06-18 08:54:45 +00:00

  • Relates set to 9406

@icinga-migration
Copy link
Author

Updated by TheSerapher on 2015-06-18 10:05:01 +00:00

Looks good so far. We just installed the latest package from the development tree and restarted the entire cluster at the same time. One node (a low power VM) took a bit longer but eventually also reconnected.

Great work! We will keep an eye on this but this looks way better than before.

@icinga-migration
Copy link
Author

Updated by mwaldmueller on 2015-08-12 15:31:28 +00:00

  • File added checker.txt
  • File added master.txt

I have to reopen the issue, the problem still exists.

Icinga 2 log of checker:
[2015-08-12 17:14:30 +0200] information/ApiClient: Not sending heartbeat for endpoint 'checker.localdomain' because we're replaying the log for it.
[2015-08-12 17:14:40 +0200] information/ApiClient: Not sending heartbeat for endpoint 'checker.localdomain' because we're replaying the log for it.
[2015-08-12 17:14:50 +0200] information/ApiClient: Not sending heartbeat for endpoint 'checker.localdomain' because we're replaying the log for it.

@icinga-migration
Copy link
Author

Updated by mwaldmueller on 2015-08-12 15:44:30 +00:00

mwaldmueller wrote:

I have to reopen the issue, the problem still exists.

Icinga 2 log of checker:
[2015-08-12 17:14:30 +0200] information/ApiClient: Not sending heartbeat for endpoint 'checker.localdomain' because we're replaying the log for it.
[2015-08-12 17:14:40 +0200] information/ApiClient: Not sending heartbeat for endpoint 'checker.localdomain' because we're replaying the log for it.
[2015-08-12 17:14:50 +0200] information/ApiClient: Not sending heartbeat for endpoint 'checker.localdomain' because we're replaying the log for it.

Sorry, I forgot to mention the Icinga 2 version: 2.3.8 on all nodes

@icinga-migration
Copy link
Author

Updated by mwaldmueller on 2015-08-24 09:04:04 +00:00

I opened a new ticket with my issue: #9976

@icinga-migration icinga-migration added bug Something isn't working area/distributed Distributed monitoring (master, satellites, clients) labels Jan 17, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/distributed Distributed monitoring (master, satellites, clients) bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant