[dev.icinga.com #9976] API Client on checkers not reconnecting after reload/restart #3307

Closed
icinga-migration opened this issue Aug 21, 2015 · 17 comments
Labels
area/distributed Distributed monitoring (master, satellites, clients) bug Something isn't working

Comments

@icinga-migration

This issue has been migrated from Redmine: https://dev.icinga.com/issues/9976

Created by mwaldmueller on 2015-08-21 09:39:10 +00:00

Assignee: (none)
Status: Closed (closed on 2015-10-16 12:35:00 +00:00)
Target Version: (none)
Last Update: 2015-10-16 12:35:00 +00:00 (in Redmine)

Icinga Version: 2.3.8
Backport?: Not yet backported
Include in Changelog: 1

This is related to #8712; the problem still exists.

My setup:

  • checker zone with 3 nodes
  • master zone with 1 node as parent zone

Icinga 2 log of checker:
[2015-08-12 17:14:30 +0200] information/ApiClient: Not sending heartbeat for endpoint 'checker.localdomain' because we're replaying the log for it.
[2015-08-12 17:14:40 +0200] information/ApiClient: Not sending heartbeat for endpoint 'checker.localdomain' because we're replaying the log for it.
[2015-08-12 17:14:50 +0200] information/ApiClient: Not sending heartbeat for endpoint 'checker.localdomain' because we're replaying the log for it.

Only a restart of the Icinga 2 daemon solves the problem. The GDB traces are attached to the related issue.
Furthermore, I think the integrated cluster check should be able to detect such "hanging" cluster nodes.

Attachments

Changesets

2015-09-29 14:03:38 +00:00 by mfriedrich 905de04

Fix deadlock in ApiClient::~ApiClient()

refs #9976

2015-09-30 14:39:36 +00:00 by (unknown) c1892a2

Remove JsonRpcConnection::m_WriteQueue

refs #9976

Relations:

@icinga-migration

Updated by mfrosch on 2015-08-24 08:03:57 +00:00

  • Relates set to 9983

@icinga-migration

Updated by mfrosch on 2015-08-24 08:49:13 +00:00

  • Relates set to 9986

@icinga-migration

Updated by mfrosch on 2015-08-24 08:50:21 +00:00

We are trying to fix this in #9986.

@icinga-migration

Updated by rhillmann on 2015-08-24 09:12:15 +00:00

This is probably related to #9798. I fixed the connection problems by setting net.ipv4.tcp_orphan_retries to 5.

@icinga-migration

Updated by mfrosch on 2015-08-25 12:17:30 +00:00

  • Status changed from New to Feedback

Please try setting log_duration to "0" on all Endpoints that are only agents.

This should prevent any massive log read on the master, so that only Agent -> Master messages are spooled in a log (on the agent side).

A better solution would be something like #9730.

(This lets us check whether the cause is something other than a TCP or other connection problem.)

@icinga-migration

Updated by mfrosch on 2015-08-25 12:17:49 +00:00

  • Relates set to 9730

@icinga-migration

Updated by mfrosch on 2015-08-31 11:24:07 +00:00

  • Relates set to 10002

@icinga-migration

Updated by mfriedrich on 2015-08-31 13:44:47 +00:00

  • Category set to Cluster
  • Status changed from Feedback to New
  • Target Version set to Backlog

@icinga-migration

Updated by mfrosch on 2015-08-31 14:28:08 +00:00

  • Relates set to 9798

@icinga-migration

Updated by mwaldmueller on 2015-09-07 05:57:06 +00:00

  • File added tcpdump.txt.gz

I've tried the current snapshot and set net.ipv4.tcp_orphan_retries to 5, but the problem still occurs. I've attached a tcpdump.

Now I've set log_duration to "0" and will update the ticket soon...

@icinga-migration

Updated by mwaldmueller on 2015-09-09 12:10:24 +00:00

Unfortunately, setting "log_duration" to "0" doesn't solve the problem.

@icinga-migration

Updated by mfriedrich on 2015-09-12 09:10:09 +00:00

Can you test the snapshot packages including a fix for #10002?

@icinga-migration

Updated by mfriedrich on 2015-09-12 09:10:23 +00:00

  • Target Version deleted Backlog

@icinga-migration

Updated by mwaldmueller on 2015-09-21 13:41:30 +00:00

I've installed the snapshot packages on the master and on the checkers, but without success. The "heartbeat"-problem still occurs.

@icinga-migration

Updated by mfriedrich on 2015-09-29 15:21:44 +00:00

Thread 11 (Thread 0x7fa9ad6c4700 (LWP 12587)):

#0  0x00007fa9b8e80344 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0

No symbol table info available.

#1  0x00007fa9b8be867b in boost::condition_variable::wait(boost::unique_lock&) () from /usr/lib/x86_64-linux-gnu/icinga2/libbase.so

No symbol table info available.

#2  0x00007fa9b8b8dc5b in icinga::WorkQueue::Join(bool) () from /usr/lib/x86_64-linux-gnu/icinga2/libbase.so

No symbol table info available.

#3  0x00007fa9b8ba9cd7 in icinga::WorkQueue::~WorkQueue() () from /usr/lib/x86_64-linux-gnu/icinga2/libbase.so

No symbol table info available.

#4  0x00007fa9b823d873 in icinga::ApiClient::~ApiClient() () from /usr/lib/x86_64-linux-gnu/icinga2/libremote.so

No symbol table info available.

#5  0x00007fa9b823d939 in icinga::ApiClient::~ApiClient() () from /usr/lib/x86_64-linux-gnu/icinga2/libremote.so

No symbol table info available.

#6  0x00007fa9b82490d0 in boost::detail::function::functor_manager const&>, boost::_bi::list2 >, boost::_bi::value > > > >::manager(boost::detail::function::function_buffer const&, boost::detail::function::function_buffer&, boost::detail::function::functor_manager_operation_type, mpl_::bool_) () from /usr/lib/x86_64-linux-gnu/icinga2/libremote.so

No symbol table info available.

#7  0x00007fa9b8249142 in boost::detail::function::functor_manager const&>, boost::_bi::list2 >, boost::_bi::value > > > >::manage(boost::detail::function::function_buffer const&, boost::detail::function::function_buffer&, boost::detail::function::functor_manager_operation_type) () from /usr/lib/x86_64-linux-gnu/icinga2/libremote.so

No symbol table info available.

#8  0x00007fa9b8ba946f in icinga::WorkQueue::WorkerThreadProc() () from /usr/lib/x86_64-linux-gnu/icinga2/libbase.so

No symbol table info available.

#9  0x00007fa9b9298629 in ?? () from /usr/lib/libboost_thread.so.1.49.0

No symbol table info available.

#10 0x00007fa9b8e7bb50 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0

No symbol table info available.

#11 0x00007fa9b66e095d in clone () from /lib/x86_64-linux-gnu/libc.so.6

No symbol table info available.

#12 0x0000000000000000 in ?? ()

No symbol table info available.
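
Reading frames #8 down to #2: a WorkQueue worker thread destroys a bound function object, which appears to release the last reference to an ApiClient, whose destructor then destroys a WorkQueue and blocks in Join(). If that WorkQueue is the one the worker thread belongs to (an assumption; the trace alone does not prove it), the thread ends up waiting for itself. Below is a minimal C++ sketch of that pattern with one possible guard. `SketchQueue` and `DemoJoin` are hypothetical names for illustration only; this is not Icinga 2 code and not necessarily what 905de04 changed.

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical, simplified single-worker queue (not icinga::WorkQueue).
class SketchQueue {
public:
    SketchQueue() : worker_(&SketchQueue::Run, this) {}

    ~SketchQueue() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_cv_.notify_all();
        worker_.join();  // safe here only because the owner is not the worker
    }

    void Enqueue(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push_back(std::move(task));
        }
        wake_cv_.notify_all();
    }

    // Waits until every queued task has finished.
    // Returns false without waiting when called from the worker thread:
    // the current task can never finish while it waits for itself.
    bool Join() {
        std::unique_lock<std::mutex> lock(mutex_);
        if (std::this_thread::get_id() == worker_id_)
            return false;
        idle_cv_.wait(lock, [this] { return tasks_.empty() && !busy_; });
        return true;
    }

private:
    void Run() {
        std::unique_lock<std::mutex> lock(mutex_);
        worker_id_ = std::this_thread::get_id();
        for (;;) {
            wake_cv_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
            if (tasks_.empty())
                return;  // stopping and nothing left to run
            std::function<void()> task = std::move(tasks_.front());
            tasks_.pop_front();
            busy_ = true;
            lock.unlock();
            task();          // runs without holding the mutex
            task = nullptr;  // destroy the functor on the worker, as in frames #6/#7
            lock.lock();
            busy_ = false;
            idle_cv_.notify_all();
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_cv_;
    std::condition_variable idle_cv_;
    std::deque<std::function<void()>> tasks_;
    std::thread::id worker_id_;
    bool busy_ = false;
    bool stopping_ = false;
    std::thread worker_;  // declared last so the other members exist before it starts
};

// Returns {result of Join() inside a task, result of Join() from the caller}.
std::pair<bool, bool> DemoJoin() {
    bool from_worker = true;
    bool from_caller = false;
    {
        SketchQueue queue;
        queue.Enqueue([&] { from_worker = queue.Join(); });
        from_caller = queue.Join();
    }
    return {from_worker, from_caller};
}
```

Calling `DemoJoin()` from a `main` function exercises both paths: the Join() issued inside the task is refused instead of blocking, while the Join() from the calling thread waits for the task and then returns. Without the thread-id check, the first call would block forever in the condition wait, which matches the shape of frames #0 to #2.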

@icinga-migration

Updated by mfriedrich on 2015-09-29 15:21:58 +00:00

  • Status changed from New to Assigned
  • Assigned to set to gbeutner

@icinga-migration

Updated by gbeutner on 2015-10-16 12:35:00 +00:00

  • Status changed from Assigned to Closed
  • Assigned to deleted gbeutner

I'm fairly certain this is fixed in the master branch.

@icinga-migration icinga-migration added bug Something isn't working area/distributed Distributed monitoring (master, satellites, clients) labels Jan 17, 2017