[dev.icinga.com #13971] cgroup: fork rejected by pids controller in /system.slice/icinga2.service #4918

Closed
icinga-migration opened this issue Jan 12, 2017 · 13 comments · Fixed by #5477
Labels: area/setup (Installation, systemd, sample files), backported (Fix was included in a bugfix release), bug (Something isn't working)

Comments

@icinga-migration

This issue has been migrated from Redmine: https://dev.icinga.com/issues/13971

Created by Skap1981 on 2017-01-12 13:11:08 +00:00

Assignee: mfriedrich
Status: Assigned
Target Version: (none)
Last Update: 2017-01-13 08:19:29 +00:00 (in Redmine)

Icinga Version: 2.5.4
Backport?: Not yet backported
Include in Changelog: 1

After updating from SLES12 SP1 to SLES12 SP2, Icinga2 crashes after a short time.

In messages I see the following entries:
2017-01-12T11:55:40.742685+01:00 mgtmon035 kernel: [65567.582895] cgroup: fork rejected by pids controller in /system.slice/icinga2.service
2017-01-12T11:55:43.246611+01:00 mgtmon035 kernel: [65570.086553] icinga2[124779]: segfault at 7fff0001a2df ip 00007ffff43f9d44 sp 00007ffff7ebcab0 error 4 in libc-2.22.so[7ffff433f000+19a000]
2017-01-12T11:55:45.129162+01:00 mgtmon035 systemd[1]: icinga2.service: Main process exited, code=killed, status=6/ABRT
2017-01-12T11:55:45.583138+01:00 mgtmon035 systemd[1]: icinga2.service: Unit entered failed state.
2017-01-12T11:55:45.583354+01:00 mgtmon035 systemd[1]: icinga2.service: Failed with result 'signal'.

icinga2.log shows many critical entries:
[2017-01-12 11:55:41 +0100] critical/checker: Exception occured while checking 'HOSTxxx!Service_xxx': Error: Function call 'fork' failed with error code 11, 'Resource temporarily unavailable'
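
To confirm that both symptoms point at the same cause, the two log excerpts above can be scanned for their signatures. The following is a minimal sketch, not part of Icinga 2: the function name `summarize_fork_failures` and the regular expressions are assumptions derived only from the lines quoted in this report. Error code 11 is `EAGAIN` on Linux, which is what `fork` returns when a task limit is hit.

```python
import re

# Patterns derived from the log lines quoted in this report (assumption:
# other kernel or Icinga 2 versions may word these messages differently).
KERNEL_RE = re.compile(r"cgroup: fork rejected by pids controller in (?P<cgroup>\S+)")
ICINGA_RE = re.compile(r"Function call 'fork' failed with error code (?P<errno>\d+)")


def summarize_fork_failures(lines):
    """Return (sorted cgroup paths that rejected forks, number of Icinga 2 fork errors)."""
    cgroups = set()
    fork_errors = 0
    for line in lines:
        match = KERNEL_RE.search(line)
        if match:
            cgroups.add(match.group("cgroup"))
        if ICINGA_RE.search(line):
            fork_errors += 1
    return sorted(cgroups), fork_errors


# Sample input copied from the excerpts above.
SAMPLE = [
    "2017-01-12T11:55:40.742685+01:00 mgtmon035 kernel: [65567.582895] "
    "cgroup: fork rejected by pids controller in /system.slice/icinga2.service",
    "[2017-01-12 11:55:41 +0100] critical/checker: Exception occured while checking "
    "'HOSTxxx!Service_xxx': Error: Function call 'fork' failed with error code 11, "
    "'Resource temporarily unavailable'",
]

if __name__ == "__main__":
    print(summarize_fork_failures(SAMPLE))
```

If the kernel message names the icinga2.service cgroup and the Icinga 2 log shows fork errors at the same time, the pids controller limit is the likely culprit rather than a general process limit.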

Stack trace attached.

Attachments

@icinga-migration
Author

Updated by Skap1981 on 2017-01-12 14:12:54 +00:00

Additional Info:

package libboost_system1_54_0:
SLES12 SP1: 1.54.0-13.1
SLES12 SP2: 1.54.0-22.1

If I try to update Icinga2 to 2.6, I get a dependency error:
Problem: nothing provides libboost_chrono.so.1.54.0()(64bit) needed by icinga2-bin-2.6.0-1.x86_64

@icinga-migration
Author

Updated by mfriedrich on 2017-01-12 14:17:18 +00:00

  • Status changed from New to Feedback
  • Assigned to set to Skap1981

This sounds like a resource limit introduced by systemd or the cgroups.

https://lwn.net/Articles/663873/

Can you verify that SLES12SP2 does not accidentally enable this feature?

The chrono dependency was introduced by SP2. See #13671 for details.

@icinga-migration
Author

Updated by Skap1981 on 2017-01-12 14:42:35 +00:00

mfriedrich wrote:

This sounds like a resource limit introduced by systemd or the cgroups.

https://lwn.net/Articles/663873/

Can you verify that SLES12SP2 does not accidentally enable this feature?

Not accidentally, I guess. Now I have to figure out how to handle cgroups.

The chrono dependency was introduced by SP2. See #13671 for details.

OK, SLES 12 SDK. Thanks, I'm able to update. First I have to resolve the cgroup problem.

@icinga-migration
Author

Updated by Skap1981 on 2017-01-12 21:45:15 +00:00

Problem solved.

I set DefaultTasksMax to infinity.

Not the best way, but until I get a better solution, this works.

Reschedule of 1500 services works fine without any crash.

@icinga-migration
Author

Updated by mfriedrich on 2017-01-13 07:37:05 +00:00

Hmm, so yet again it seems systemd related. Do you think that this should be added to the troubleshooting section in the docs?

@icinga-migration
Author

Updated by Skap1981 on 2017-01-13 07:49:15 +00:00

mfriedrich wrote:

Hmm, so yet again it seems systemd related. Do you think that this should be added to the troubleshooting section in the docs?

Yes, it is systemd related.

The parameter DefaultTasksMax is set to 512 by default.
We use PBis for LDAP integration. The lwsmd.service uses ~180 tasks in SP2. In SP1, only 7 tasks are in use by lwsmd.service. So I think this is a special constellation on our systems.

But in large Icinga2 environments, 512 tasks could be a limiting factor. A hint in the troubleshooting section could be helpful.
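
The task counts mentioned above can be compared against the limit programmatically. This is a rough sketch under stated assumptions: it parses the `Key=Value` output of `systemctl show <unit> -p TasksCurrent -p TasksMax`, the helper name `task_headroom` is hypothetical, and it assumes an unlimited `TasksMax` is printed either as `infinity` or as the 64-bit maximum, depending on the systemd version.

```python
# Value that some systemd versions print for an unlimited TasksMax
# (assumption: other versions print the literal string "infinity").
UNLIMITED = 2**64 - 1


def task_headroom(show_output):
    """Parse `systemctl show` output and return (current, limit, headroom).

    limit and headroom are None when the unit has no task limit.
    """
    props = dict(
        line.split("=", 1) for line in show_output.splitlines() if "=" in line
    )
    current = int(props["TasksCurrent"])
    raw_limit = props["TasksMax"]
    if raw_limit == "infinity" or int(raw_limit) == UNLIMITED:
        return current, None, None
    limit = int(raw_limit)
    return current, limit, limit - current


if __name__ == "__main__":
    # Figures from this comment: ~180 tasks against the default limit of 512.
    print(task_headroom("TasksCurrent=180\nTasksMax=512\n"))
```

A small headroom on a busy monitoring host suggests raising the limit for that unit before fork errors appear.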

@icinga-migration
Author

Updated by mfriedrich on 2017-01-13 08:19:29 +00:00

  • Category changed from Checker to Documentation
  • Status changed from Feedback to Assigned
  • Assigned to changed from Skap1981 to mfriedrich

Ok, thanks, I'll have a look.

@icinga-migration icinga-migration added bug Something isn't working area/documentation End-user or developer help labels Jan 17, 2017
@gunnarbeutner gunnarbeutner added area/setup Installation, systemd, sample files and removed area/documentation End-user or developer help labels Feb 7, 2017
@gunnarbeutner
Contributor

Should we perhaps update our default systemd unit file to set that parameter to a reasonable value?

@lipixx

lipixx commented May 19, 2017

I am having the same problem, and yes, it seems that your unit file must be modified to set TasksMax= to a reasonable value. Also, bear in mind that this feature was introduced in version 226. Please read my bug report for another piece of software with the same problem: https://bugs.schedmd.com/show_bug.cgi?id=3526

@dnsmichi
Contributor

Thanks for the insights, much appreciated. So we would need some sort of distribution-specific unit files (and let external packagers know about it). Or we'll enhance CMake to generate the service file, like your proposed patch for autotools in the linked ticket.

I'll add a note to troubleshooting docs meanwhile to help others with a quick workaround until we resolve this issue.

dnsmichi pushed a commit that referenced this issue May 19, 2017
dnsmichi pushed a commit that referenced this issue May 19, 2017
Add troubleshooting hints for cgroup fork errors

refs #4918
@dnsmichi
Contributor

dnsmichi commented Aug 8, 2017

Aug  8 13:43:15 icinga2 systemd: Reloading.
Aug  8 13:43:15 icinga2 systemd: [/usr/lib/systemd/system/icinga2.service:13] Unknown lvalue 'TasksMax' in section 'Service'
Aug  8 13:43:20 icinga2 systemd: Stopping Icinga host/service/network monitoring system...
Aug  8 13:43:21 icinga2 systemd: Starting Icinga host/service/network monitoring system...
Aug  8 13:43:22 icinga2 systemd: Started Icinga host/service/network monitoring system.

Users might open issues, but hopefully will find this one first.
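
The "Unknown lvalue" warning above presumably comes from a systemd that predates the `TasksMax=` setting. A minimal sketch of a version check follows, assuming the first line of `systemctl --version` has the form `systemd <number>`. The threshold of 226 is the figure quoted in this thread and has not been verified independently here; the function names are hypothetical.

```python
import re

# Minimum version as stated in this thread (assumption: check the systemd
# NEWS file for your distribution before relying on this number).
THREAD_MIN_VERSION = 226


def systemd_major_version(version_output):
    """Extract the version number from the first line of `systemctl --version`."""
    match = re.match(r"systemd (\d+)", version_output.strip())
    if match is None:
        raise ValueError("unrecognized version output")
    return int(match.group(1))


def tasks_max_supported(version_output, minimum=THREAD_MIN_VERSION):
    """Return True if this systemd is expected to understand TasksMax=."""
    return systemd_major_version(version_output) >= minimum


if __name__ == "__main__":
    print(tasks_max_supported("systemd 219\n+PAM +AUDIT"))
    print(tasks_max_supported("systemd 228\n+PAM +AUDIT"))
```

On versions below the threshold the setting is ignored with the warning shown above, and the service still starts.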

@dnsmichi
Contributor

dnsmichi commented Aug 8, 2017

systemd/systemd/issues/3211

Or to say this differently: the default of 512 is really just a default. It's not supposed to cover all services, and services really should override this individually if there's the need to.

There is a PR which raises the limit: systemd/systemd#3753 but I doubt that this has hit SLES already.

https://github.com/systemd/systemd/blob/dd050decb6ad131ebdeabb71c4f9ecb4733269c0/NEWS#L60

        * There's a new system.conf setting DefaultTasksMax= to
          control the default TasksMax= setting for services and
          scopes running on the system. (TasksMax= is the primary
          setting that exposes the "pids" cgroup controller on systemd
          and was introduced in the previous systemd release.) The
          setting now defaults to 512, which means services that are
          not explicitly configured otherwise will only be able to
          create 512 processes or threads at maximum, from this
          version on. Note that this means that thread- or
          process-heavy services might need to be reconfigured to set
          TasksMax= to a higher value. It is sufficient to set
          TasksMax= in these specific unit files to a higher value, or
          even "infinity". Similar, there's now a logind.conf setting
          UserTasksMax= that defaults to 4096 and limits the total
          number of processes or tasks each user may own
          concurrently. nspawn containers also have the TasksMax=
          value set by default now, to 8192. Note that all of this
          only has an effect if the "pids" cgroup controller is
          enabled in the kernel. The general benefit of these changes
          should be a more robust and safer system, that provides a
          certain amount of per-service fork() bomb protection.
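
As the NEWS entry says, it is sufficient to set `TasksMax=` for the specific unit. One way to do that without editing the packaged unit file is a drop-in. The sketch below only renders the drop-in path and content; the file name `limits.conf` and the helper name are hypothetical, and `infinity` is just the value mentioned in the NEWS text, not a recommendation from the maintainers.

```python
from pathlib import PurePosixPath


def render_tasks_max_dropin(unit="icinga2.service", tasks_max="infinity",
                            name="limits.conf"):
    """Return (path, content) for a systemd drop-in that overrides TasksMax=.

    The drop-in directory follows the standard /etc/systemd/system/<unit>.d/
    convention; the file name is an arbitrary choice for illustration.
    """
    path = PurePosixPath("/etc/systemd/system") / f"{unit}.d" / name
    content = f"[Service]\nTasksMax={tasks_max}\n"
    return str(path), content


if __name__ == "__main__":
    path, content = render_tasks_max_dropin()
    print(path)
    print(content, end="")
```

After writing such a file, `systemctl daemon-reload` followed by a restart of the service is needed for the new limit to apply.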

dnsmichi pushed a commit that referenced this issue Aug 8, 2017
This solves the problem with Systemd >= 226 and fork errors with
Icinga 2. Seen on SLES 11 SP2.

fixes #4918
@dnsmichi dnsmichi added this to the 2.7.1 milestone Aug 8, 2017
@dnsmichi
Contributor

dnsmichi commented Aug 8, 2017

I'm setting this to 2.7.1 as this affects many users/customers.

dnsmichi pushed a commit that referenced this issue Aug 8, 2017
This solves the problem with Systemd >= 226 and fork errors with
Icinga 2. Seen on SLES 11 SP2.

fixes #4918

refs #5477
@dnsmichi dnsmichi added the backported Fix was included in a bugfix release label Aug 11, 2017