
[dev.icinga.com #10002] Deadlock in WorkQueue::Enqueue #3324

Status: Closed
icinga-migration opened this issue Aug 26, 2015 · 11 comments

Labels: area/distributed (Distributed monitoring: master, satellites, clients), blocker (Blocks a release or needs immediate attention), bug (Something isn't working)
Milestone: 2.3.11

Comments

This issue has been migrated from Redmine: https://dev.icinga.com/issues/10002

Created by aledermueller on 2015-08-26 13:34:46 +00:00

Assignee: gbeutner
Status: Resolved (closed on 2015-10-15 13:19:22 +00:00)
Target Version: 2.3.11
Last Update: 2015-10-15 13:19:22 +00:00 (in Redmine)

Icinga Version: 2.3.8
Backport?: Already backported
Include in Changelog: 1

Hey,

Agents (zones): approx. 400 (mixed versions with 2.3.8 and 2.3.9)
Masters: 2 (Version 2.3.8)

After a while, Icinga 2 on one master hangs without using any resources (no CPU or I/O activity). netstat shows full Recv-Qs (data from the agents) and empty Send-Qs. About two thirds of the connections are in CLOSE_WAIT; the remaining third are ESTABLISHED.

A stacktrace is attached; it was captured with: gdb -p xxx -ex 'thread apply all bt full' -ex deta -ex q -batch > debug

The debug log mainly contains the following entries, and the counter for pending tasks keeps growing:

[2015-08-26 14:05:18 +0200] notice/ThreadPool: Pool #1: Pending tasks: 13; Average latency: 0ms; Threads: 4; Pool utilization: 14.7925%
[2015-08-26 14:05:33 +0200] notice/ThreadPool: Pool #1: Pending tasks: 86; Average latency: 34ms; Threads: 5; Pool utilization: 24.8859%
[2015-08-26 14:05:48 +0200] notice/ThreadPool: Pool #1: Pending tasks: 5; Average latency: 0ms; Threads: 4; Pool utilization: 20.9834%
[2015-08-26 14:06:03 +0200] notice/ThreadPool: Pool #1: Pending tasks: 372; Average latency: 0ms; Threads: 8; Pool utilization: 71.0408%
[2015-08-26 14:06:18 +0200] notice/ThreadPool: Pool #1: Pending tasks: 584; Average latency: 0ms; Threads: 36; Pool utilization: 75.625%
[2015-08-26 14:06:33 +0200] notice/ThreadPool: Pool #1: Pending tasks: 858; Average latency: 0ms; Threads: 64; Pool utilization: 92.9821%
[2015-08-26 14:06:48 +0200] notice/ThreadPool: Pool #1: Pending tasks: 1091; Average latency: 0ms; Threads: 64; Pool utilization: 99.7029%
[2015-08-26 14:07:03 +0200] notice/ThreadPool: Pool #1: Pending tasks: 1257; Average latency: 0ms; Threads: 64; Pool utilization: 99.9874%
[2015-08-26 14:07:18 +0200] notice/ThreadPool: Pool #1: Pending tasks: 1371; Average latency: 0ms; Threads: 64; Pool utilization: 99.9995%
[2015-08-26 14:07:33 +0200] notice/ThreadPool: Pool #1: Pending tasks: 1401; Average latency: 0ms; Threads: 64; Pool utilization: 100%
[2015-08-26 14:07:48 +0200] notice/ThreadPool: Pool #1: Pending tasks: 1485; Average latency: 0ms; Threads: 64; Pool utilization: 100%
[2015-08-26 14:08:03 +0200] notice/ThreadPool: Pool #1: Pending tasks: 1545; Average latency: 0ms; Threads: 64; Pool utilization: 100%
...
[2015-08-26 15:27:45 +0200] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-08-26 15:27:45 +0200] notice/ThreadPool: Pool #1: Pending tasks: 2453; Average latency: 0ms; Threads: 64; Pool utilization: 100%
[2015-08-26 15:27:45 +0200] notice/ThreadPool: Thread pool; current: 16; adjustment: 2

Thanks, Achim
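
For context on the log pattern above: pending tasks climbing without bound while all 64 pool threads sit at 100% utilization is the classic signature of a bounded work queue whose consumers are also producers. A minimal C++ sketch of that failure mode (hypothetical names, not the actual Icinga 2 WorkQueue code):

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>

// Minimal bounded work queue. If Enqueue() is called from a worker
// thread while the queue is full, that worker blocks waiting for
// space that only workers can create: a self-deadlock.
class BoundedWorkQueue {
public:
    explicit BoundedWorkQueue(std::size_t maxItems) : m_MaxItems(maxItems) {}

    void Enqueue(std::function<void()> task) {
        std::unique_lock<std::mutex> lock(m_Mutex);
        // Blocks until the queue has room. Deadlocks if every thread
        // that could drain the queue is itself parked here.
        m_NotFull.wait(lock, [this] { return m_Tasks.size() < m_MaxItems; });
        m_Tasks.push_back(std::move(task));
        m_NotEmpty.notify_one();
    }

    std::function<void()> Dequeue() {
        std::unique_lock<std::mutex> lock(m_Mutex);
        m_NotEmpty.wait(lock, [this] { return !m_Tasks.empty(); });
        auto task = std::move(m_Tasks.front());
        m_Tasks.pop_front();
        m_NotFull.notify_one();
        return task;
    }

private:
    std::mutex m_Mutex;
    std::condition_variable m_NotFull, m_NotEmpty;
    std::deque<std::function<void()>> m_Tasks;
    std::size_t m_MaxItems;
};
```

If a task executed by a worker calls Enqueue() on the same queue while it is full, that worker parks in Enqueue() waiting for space that only workers can free; once every worker is parked, the pool looks fully "busy" forever while the pending-task counter simply grows, exactly as in the log above.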


Changesets

2015-09-02 05:46:30 +00:00 by (unknown) 5c77e6e

Fix deadlock in ApiListener::RelayMessage

fixes #10002

2015-09-02 07:16:20 +00:00 by (unknown) 35acba7

Remove default WQ limits

refs #10002

2015-10-15 13:16:51 +00:00 by (unknown) e480af3

Remove default WQ limits

refs #10002

2015-10-15 13:18:02 +00:00 by (unknown) c8d24b6

Fix deadlock in ApiListener::RelayMessage

fixes #10002
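
Taken together, the two fixes attack the problem from both ends: ApiListener::RelayMessage avoids blocking on a queue it may itself be responsible for draining, and the default WorkQueue limits are removed so Enqueue() never has to wait for space. In terms of the sketch above, the unbounded variant would look roughly like this (again a sketch, not the actual patch):

```cpp
// Unbounded variant: Enqueue() only appends and signals, so it can
// never block, and calling it from a worker thread is safe. The
// trade-off is that queue growth is now limited only by memory.
void Enqueue(std::function<void()> task) {
    std::unique_lock<std::mutex> lock(m_Mutex);
    m_Tasks.push_back(std::move(task));
    m_NotEmpty.notify_one();
}
```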

Relations: relates to 9983, 9976, 9798


Updated by aledermueller on 2015-08-27 07:10:51 +00:00

  • File added debug-100-master1-idomaster
  • File added debug-100-master2

The same thing happened again; now the second master shows the same behavior and logs. Stacktraces of both masters are attached; master1 is the host writing to the ido-master.

Thanks, Achim


Updated by mfriedrich on 2015-08-27 14:52:46 +00:00

  • Relates set to 9983


Updated by mfrosch on 2015-08-31 11:23:54 +00:00

Maybe this is also connected to #9976?


Updated by mfrosch on 2015-08-31 11:24:00 +00:00

  • Relates set to 9976


Updated by mfrosch on 2015-08-31 14:24:42 +00:00

  • Relates set to 9798


Updated by gbeutner on 2015-09-02 05:46:59 +00:00

There's an experimental patch in the master branch which needs further testing.


Updated by Anonymous on 2015-09-02 05:47:02 +00:00

  • Status changed from New to Resolved
  • Done % changed from 0 to 100

Applied in changeset 5c77e6e.


Updated by gbeutner on 2015-09-02 05:47:19 +00:00

  • Category set to Cluster
  • Status changed from Resolved to Feedback
  • Assigned to set to aledermueller
  • Target Version set to 2.4.0


Updated by mfriedrich on 2015-09-14 08:22:08 +00:00

According to Achim and Blerim, the fixes made it work again (2.3.10 without the fixes causes trouble, while the snapshot packages have been running fine for nearly a week now). I'd say we'll test this a little more and may backport it into 2.3.11 next week.


Updated by mfriedrich on 2015-09-14 08:23:04 +00:00

  • Priority changed from Normal to High


Updated by mfriedrich on 2015-10-15 13:19:22 +00:00

  • Subject changed from Pool utilization: 100% to Deadlock in WorkQueue::Enqueue
  • Status changed from Feedback to Resolved
  • Assigned to changed from aledermueller to gbeutner
  • Target Version changed from 2.4.0 to 2.3.11
  • Backport? changed from TBD to Yes

@icinga-migration icinga-migration added blocker Blocks a release or needs immediate attention bug Something isn't working area/distributed Distributed monitoring (master, satellites, clients) labels Jan 17, 2017
@icinga-migration icinga-migration added this to the 2.3.11 milestone Jan 17, 2017