
[dev.icinga.com #11865] Icinga2 agent deadlock #4246

Closed
icinga-migration opened this issue May 31, 2016 · 10 comments
Labels
area/distributed Distributed monitoring (master, satellites, clients) bug Something isn't working core/crash Shouldn't happen, requires attention

Comments


This issue has been migrated from Redmine: https://dev.icinga.com/issues/11865

Created by ziaunys on 2016-05-31 19:27:48 +00:00

Assignee: ziaunys
Status: Feedback
Target Version: (none)
Last Update: 2016-12-07 18:31:03 +00:00 (in Redmine)

Icinga Version: 2.4.10
Backport?: Not yet backported
Include in Changelog: 1

It looks like #11046 is still an issue. We managed to run the Icinga2 master for 4 days, through many reloads, before the problem came up again. It's possible that the particular instance of the issue was fixed, but we're seeing the same symptoms. Here is the timeline of events.

  1. The Puppet agent runs and reloads the Icinga2 master node.
  2. It attempts to reconnect to all 323 agent endpoints.
  3. 94 hosts using the cluster-zone check enter the DOWN state.
  4. The hosts don't recover until the master reloads again.
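
For context, a cluster-zone check of the kind described in step 3 typically looks roughly like this (host and zone names are placeholders, not taken from this report):

```
// Hypothetical example of a cluster-zone host check. The check turns
// CRITICAL/DOWN when the master loses its connection to the agent's zone,
// which is why a stuck reconnect shows up as DOWN hosts.
object Host "agent-01.example.com" {
  check_command = "cluster-zone"
  vars.cluster_zone = "agent-01.example.com"   // zone name matches the host name
}
```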

I'll attempt to gather more debugging material and add it to this ticket. We've had this issue for 4 months now and it's difficult to reproduce. The fix introduced in 2.4.10 does seem to have addressed one of the issues producing this symptom, because we didn't see it for 4 days, which is a big difference; before, it would trigger around once a day. I'm going to do more debugging in our production environment because it's just too difficult to provide decent feedback from our test environment.

Do you guys have any kind of scale testing set up for the Icinga2 agent with multiple endpoints? I'm just curious because I haven't seen this issue reported (other than #11730), and it seems like it would have come up before.



Updated by mfriedrich on 2016-06-06 12:44:24 +00:00

Are there any (OpenSSL) errors on the master being logged when those connection attempts happen?


Updated by ziaunys on 2016-06-06 18:18:08 +00:00

  • File added icinga_agent_disconnect.txt

dnsmichi wrote:

Are there any (OpenSSL) errors on the master being logged when those connection attempts happen?

I don't see any of those log messages in reference to the affected agents. I'm attaching all log data from the first time the master reloaded to when it was manually reloaded. The attached log messages start at the time 22:53. The agent reloaded at 22:37 and the notifications actually went out at around 22:40. If I track any of the affected agent endpoints using the logs, I see the endpoint disconnect, the host notification triggers, and then no additional messages are logged for that endpoint until the master is reloaded. There are some "invalid socket" and "tls stream" messages, but I believe they are in reference to nodes that the master cannot directly connect to. Those nodes have to make active TCP connections to the master.


Updated by mfriedrich on 2016-06-22 09:42:49 +00:00

  • Relates set to 12003


Updated by mfriedrich on 2016-06-22 10:10:11 +00:00

  • Status changed from New to Feedback
  • Assigned to set to ziaunys

This should be fixed with #12003. Any chance you can test the snapshot packages in your staging environment?


Updated by mfriedrich on 2016-08-08 15:50:34 +00:00

Any updates?


Updated by ziaunys on 2016-09-15 20:50:33 +00:00

dnsmichi wrote:

Any updates?

Sorry for the late update. We upgraded all our agents to 2.5.4 and still see this issue. I discovered that we were doing a hard restart on our master node which I thought might have been an issue, so I switched over to using reloads. However, it didn't seem to fix the problem. Recently 10 of the agents permanently disconnected from the master after a reload. I left the master in this state so I could do a little more troubleshooting. I took the following steps:

  1. Scheduled a forced check for one of the disconnected satellites and saw this in the master's logs:
    [2016-09-15 13:08:39 -0700] information/JsonRpcConnection: Reconnecting to API endpoint 'icinga-satellite' via host 'icinga-satellite' and port '5665'
    The 'cluster-zone' check did not succeed and the endpoint failed to reconnect.
  2. The 'icinga-satellite' didn't log anything related to the master attempting to reconnect.
  3. I then ran netstat on the Icinga2 master and found that it did attempt to establish a connection to the satellite and was stuck in the "SYN_SENT" state. Here is the output from netstat:
    tcp 0 1 (icinga-master):40761 (icinga-satellite):5665 SYN_SENT on (11.04/5/0)
  4. After running the same netstat command on the satellite I found this:
    tcp 0 0 (icinga-satellite):34157 (icinga-master):5665 ESTABLISHED keepalive (3059.37/0/0)

The 'icinga-satellite' has an established TCP connection, but it seems like the keepalive timer is pretty high and no keepalive probes were sent to the master. The agent seems to be stuck in this state until I reload/restart either the satellite or the master.

I'll try to provide more detail as I keep troubleshooting, but it looks like the issue might be related to refusing new connections because agent communication is bi-directional?
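
The netstat pattern above - SYN_SENT on one side while the peer still reports ESTABLISHED - is the classic signature of a half-open connection. A small sketch of how to spot it, using canned sample lines standing in for real netstat output (on a live master, something like `ss -tan state syn-sent` would be used instead):

```shell
# Canned netstat-style sample reproducing the two states seen in this report.
sample='tcp 0 1 icinga-master:40761 icinga-satellite:5665 SYN_SENT
tcp 0 0 icinga-satellite:34157 icinga-master:5665 ESTABLISHED'

# Count sockets stuck mid-handshake (the TCP state is the last field).
stuck=$(printf '%s\n' "$sample" | grep -c 'SYN_SENT$')
echo "sockets stuck in SYN_SENT: $stuck"   # prints: sockets stuck in SYN_SENT: 1
```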


Updated by ziaunys on 2016-09-15 21:12:04 +00:00

Also, months ago you asked me if I had seen any OpenSSL errors. At the time I hadn't, because I could only reproduce the issue in our production environment. However, since 2.5.4 the default logging has captured this OpenSSL error, which is logged on the agents that get stuck in the disconnected state.

[2016-09-15 12:26:57 -0700] information/JsonRpcConnection: Reconnecting to API endpoint 'icinga-master' via host 'icinga-master' and port '5665'
[2016-09-15 12:26:57 -0700] warning/TlsStream: OpenSSL error: error:140943FC:SSL routines:SSL3_READ_BYTES:sslv3 alert bad record mac
[2016-09-15 12:26:57 -0700] critical/ApiListener: Client TLS handshake failed (to [icinga-master]:5665)

The odd thing is that the 'icinga-satellite' did not log this error message at any point when it got stuck in this state. I verified that it is running the same version of the agent with the same log settings. I didn't see this message logged on the 'icinga-master' for the 'icinga-satellite', but it was logged for other agents in the disconnected state.
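
As a side note, the hex code in that TlsStream line can be decoded offline with openssl errstr (the exact wording varies between OpenSSL versions, but the code itself is echoed back in the output):

```shell
# Decode error code 140943FC from the warning above. On OpenSSL 1.x this
# prints the full "sslv3 alert bad record mac" reason string; newer versions
# may word it differently, but the hex code still appears in the output.
openssl errstr 140943FC
```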


Updated by mfriedrich on 2016-09-16 07:19:03 +00:00

Phew, never seen that error before. Google led me to this thread: http://security.stackexchange.com/questions/39844/getting-ssl-alert-write-fatal-bad-record-mac-during-openssl-handshake

Can you verify that the SSL handshake works as described there (e.g. by using tcpdump/wireshark)? In addition, could you share some output from connecting from the agent to the master using openssl s_client?

A much more general question - are your endpoints configured to try to connect from both ends (i.e. the master tries to connect to the agent because the agent's Endpoint object has the host attribute set, and the agent likewise tries to connect to the master because the master's Endpoint object has the host attribute set)? It might be worth trying just one direction, such as the master actively connecting to the agents, or vice versa.
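
A sketch of the one-direction setup suggested here, assuming the master dials out to the agents (endpoint names are placeholders): only the side whose Endpoint object carries the host attribute initiates connections.

```
// On the master (zones.conf): host is set, so the master dials the agent.
object Endpoint "agent-01.example.com" {
  host = "agent-01.example.com"
}

// On the agent (zones.conf): no host attribute for the master, so the agent
// never dials out and simply accepts the master's inbound connection.
object Endpoint "icinga-master.example.com" { }
```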

Last but not least - are your agents connected using top-down config sync or command endpoint? It might be worth experimenting with the log_duration parameter on the master/agents to reduce the replay log size that is transferred on reconnect.
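
For reference, log_duration is set per Endpoint object; a minimal sketch (the endpoint name and value are illustrative, not recommendations):

```
// Keep only one hour of replay log for this endpoint instead of the default
// of one day; for command_endpoint-only agents it can be disabled entirely
// with log_duration = 0.
object Endpoint "agent-01.example.com" {
  host = "agent-01.example.com"
  log_duration = 1h
}
```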


Updated by ziaunys on 2016-09-29 23:15:04 +00:00

I've recently been less involved with managing our Icinga2 infrastructure, but I'll make sure to produce this feedback one way or another.



Updated by mfriedrich on 2016-12-07 18:31:03 +00:00

Any pointers or updates? Otherwise I'll close it here.

@icinga-migration icinga-migration added needs feedback We'll only proceed once we hear from you again bug Something isn't working area/distributed Distributed monitoring (master, satellites, clients) labels Jan 17, 2017
@gunnarbeutner gunnarbeutner added the core/crash Shouldn't happen, requires attention label Feb 7, 2017
@dnsmichi dnsmichi removed the needs feedback We'll only proceed once we hear from you again label May 9, 2018