[dev.icinga.com #9976] API Client on checkers not reconnecting after reload/restart #3307

icinga-migration · 2015-08-21T09:39:10Z

This issue has been migrated from Redmine: https://dev.icinga.com/issues/9976

Created by mwaldmueller on 2015-08-21 09:39:10 +00:00

Assignee: (none)
Status: Closed (closed on 2015-10-16 12:35:00 +00:00)
Target Version: (none)
Last Update: 2015-10-16 12:35:00 +00:00 (in Redmine)

Icinga Version: 2.3.8
Backport?: Not yet backported
Include in Changelog: 1

This is related to #8712; the problem still exists.

My setup:

checker zone with 3 nodes
master zone with 1 node as parent zone

Icinga 2 log of checker:
[2015-08-12 17:14:30 +0200] information/ApiClient: Not sending heartbeat for endpoint 'checker.localdomain' because we're replaying the log for it.
[2015-08-12 17:14:40 +0200] information/ApiClient: Not sending heartbeat for endpoint 'checker.localdomain' because we're replaying the log for it.
[2015-08-12 17:14:50 +0200] information/ApiClient: Not sending heartbeat for endpoint 'checker.localdomain' because we're replaying the log for it.

Only restarting the Icinga 2 daemon resolves the problem. The GDB traces are attached to the related issue.
Furthermore, I think the integrated cluster check should be able to detect such "hanging" cluster nodes.

Attachments

tcpdump.txt.gz mwaldmueller - 2015-09-07 05:56:44 +00:00

Changesets

2015-09-29 14:03:38 +00:00 by mfriedrich 905de04

Fix deadlock in ApiClient::~ApiClient()

refs #9976

2015-09-30 14:39:36 +00:00 by (unknown) c1892a2

Remove JsonRpcConnection::m_WriteQueue

refs #9976

Relations:

relates #9976
relates #9730
relates #9798


icinga-migration · 2015-08-24T08:03:57Z

Updated by mfrosch on 2015-08-24 08:03:57 +00:00

Relates set to 9983

icinga-migration · 2015-08-24T08:49:13Z

Updated by mfrosch on 2015-08-24 08:49:13 +00:00

Relates set to 9986

icinga-migration · 2015-08-24T08:50:21Z

Updated by mfrosch on 2015-08-24 08:50:21 +00:00

We are trying to fix this with #9986.

icinga-migration · 2015-08-24T09:12:15Z

Updated by rhillmann on 2015-08-24 09:12:15 +00:00

This is probably related to #9798. I fixed the connection problems by setting net.ipv4.tcp_orphan_retries to 5.

icinga-migration · 2015-08-25T12:17:30Z

Updated by mfrosch on 2015-08-25 12:17:30 +00:00

Status changed from New to Feedback

Please try setting log_duration to "0" on all endpoints that are only an agent.

This should prevent any massive replay-log reads on the master, so that only agent -> master messages are spooled in a log (on the agent side).

A better solution would be something like #9730.

(That way we can test whether this is really a TCP or other connection problem.)
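
As a rough illustration of the setting mentioned above, the Endpoint object for a checker might look like the sketch below. The endpoint name is taken from the log excerpt in the report; where the object is defined (for example zones.conf) and any other attributes depend on the actual setup and are assumptions here.

```
// Sketch only: endpoint name copied from the log excerpt in this issue.
object Endpoint "checker.localdomain" {
  // 0 disables the replay log for this endpoint
  // (value is in seconds; the default keeps logs for one day)
  log_duration = 0
}
```

The daemon needs a reload or restart afterwards for the change to take effect.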

icinga-migration · 2015-08-25T12:17:49Z

Updated by mfrosch on 2015-08-25 12:17:49 +00:00

Relates set to 9730

icinga-migration · 2015-08-31T11:24:07Z

Updated by mfrosch on 2015-08-31 11:24:07 +00:00

Relates set to 10002

icinga-migration · 2015-08-31T13:44:47Z

Updated by mfriedrich on 2015-08-31 13:44:47 +00:00

Category set to Cluster
Status changed from Feedback to New
Target Version set to Backlog

icinga-migration · 2015-08-31T14:28:08Z

Updated by mfrosch on 2015-08-31 14:28:08 +00:00

Relates set to 9798

icinga-migration · 2015-09-07T05:57:06Z

Updated by mwaldmueller on 2015-09-07 05:57:06 +00:00

File added tcpdump.txt.gz

I've tried the current snapshot and set net.ipv4.tcp_orphan_retries to 5, but the problem still occurs. I have attached a tcpdump.

Now I've set log_duration to "0" and will update the ticket soon...

icinga-migration · 2015-09-09T12:10:24Z

Updated by mwaldmueller on 2015-09-09 12:10:24 +00:00

Unfortunately, setting "log_duration" to "0" doesn't solve the problem.

icinga-migration · 2015-09-12T09:10:09Z

Updated by mfriedrich on 2015-09-12 09:10:09 +00:00

Can you test the snapshot packages including a fix for #10002?

icinga-migration · 2015-09-12T09:10:23Z

Updated by mfriedrich on 2015-09-12 09:10:23 +00:00

Target Version deleted ~~Backlog~~

icinga-migration · 2015-09-21T13:41:30Z

Updated by mwaldmueller on 2015-09-21 13:41:30 +00:00

I've installed the snapshot packages on the master and on the checkers, but without success. The "heartbeat" problem still occurs.

icinga-migration · 2015-09-29T15:21:44Z

Updated by mfriedrich on 2015-09-29 15:21:44 +00:00

Thread 11 (Thread 0x7fa9ad6c4700 (LWP 12587)):

#0  0x00007fa9b8e80344 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0