[dev.icinga.com #13655] Crash - Error: parse error: premature EOF #4882
Comments
Updated by erikandersonmx on 2016-12-16 21:02:14 +00:00 Full debug log https://gist.github.com/erikanderson/4cb950e2ee6888ea98a49e47ba02ca6a |
Updated by erikandersonmx on 2016-12-16 21:59:38 +00:00 from kern.log Dec 16 14:55:36 sd-prod-icinga1 kernel: [16255.139654] icinga2[31719]: segfault at 7f3800005587 ip 00007f38e50a8814 sp 00007f38c9ea8340 error 4 in libc-2.19.so[7f38e4fe7000+1ba000] |
Updated by erikandersonmx on 2016-12-16 23:01:05 +00:00 This seems to be the same issue: https://dev.icinga.com/issues/13173 |
Updated by erikandersonmx on 2016-12-19 23:35:32 +00:00 I think the root cause of this was the check_rabbitmq plugin from here: https://github.com/CaptPhunkosis/check_rabbitmq We have a large number of queues, so my theory is that when the plugin reported a recovery it sent a large list of queues that were OK and overloaded something within icinga. We haven't had this issue in the past, before 2.6.0, but I don't have any hard data on that. I think this can be closed. |
Updated by gvde on 2016-12-21 08:01:59 +00:00 We see the exact same problem on our server since the update last week: 5 crashes within 7 days, two of them this morning within two hours. erikandersonmx wrote:
We don't use this plugin. I doubt it's related to that.
I don't think so. IMHO it has nothing to do with that plugin. And even if it were related to the plugin: no plugin should ever be able to break the icinga2 server, regardless of what it returns. This needs to be fixed. |
Updated by gvde on 2016-12-21 08:14:44 +00:00 Here is my full crash report:
Caught unhandled exception.
Application information:
System information:
Build information:
Error: parse error: premature EOF
(0) libbase.so.2.6.0: (+0xa4a25) [0x7f22a980ca25] |
Updated by fallback on 2016-12-22 09:56:41 +00:00 I got a similar problem. Unfortunately there is no other error/warning/suspicious message before the crash. Everything went smooth till then.
After this, the host (satellite) loses the connection to its master.
|
Updated by n0braist on 2016-12-22 11:05:58 +00:00 Hi, same for me, but at the moment only on SLES in an HA cluster environment: permanent crashes after a few minutes. example1:
example2:
|
Updated by n0braist on 2016-12-22 11:22:56 +00:00 Little correction: also on non-clustered single-node machines. |
Updated by erikandersonmx on 2016-12-22 18:50:25 +00:00 Thanks for the confirmation that it is most likely not that one specific plugin. I just had a similar crash on a node that had an updated version of the check, so there is something else going on here. I'm also going to see if I can get the previous version of icinga2 somewhere (it's been nuked from the PPA, of course) so I can revert for the holidays. |
Updated by erikandersonmx on 2016-12-22 20:14:15 +00:00 Based on the error I am looking at this change to see how it might impact this: https://dev.icinga.com/projects/i2/repository/revisions/14ea2596c5cc9a2622f3ade2819cff86cc2eec71/diff/lib/base/json.cpp |
Updated by erikandersonmx on 2016-12-22 20:18:42 +00:00 This is the issue for that change that is referenced in the changelog: https://dev.icinga.com/issues/12538 |
Updated by erikandersonmx on 2016-12-22 20:31:12 +00:00 Nm, that is most likely not the issue; I dug into the for-loop change. From what I have seen so far, it seems to be happening right after a notification is sent, so I am going to zero in on changes made to notifications. |
Updated by erikandersonmx on 2016-12-22 20:35:52 +00:00 These are the changes made for notifications, and they seem like prime candidates for the source of this: https://dev.icinga.com/issues/12718 |
Updated by n0braist on 2016-12-22 21:06:47 +00:00 In my case it seems to be a problem during notification, or more precisely when different notifications for one problem are defined at the same time (multiple notification definitions). On the other hand, a hot candidate is hipsaint (HipChat), which is called in a notification, but I cannot confirm that. Still investigating, but only later, once everything is running again.... |
Updated by fallback on 2016-12-23 06:29:36 +00:00 I'm currently running with the default mail notification scripts and one custom SMS notification script written in bash. I think it has nothing to do with hipsaint/hipchat. More likely it has to do, as @erikandersonmx said, with the fact that multiple different notifications are defined. |
Updated by gvde on 2016-12-23 07:35:32 +00:00 n0braist wrote:
I don't know how it's organized on SLES, but are your notification scripts (i.e. mail-host-notification.sh, etc.) in /usr/lib/nagios/ or are they in /etc/icinga2/scripts like on CentOS? |
Updated by dominicpratt on 2016-12-23 07:48:57 +00:00 We're experiencing the same error on a Debian Jessie system without HipChat or anything like that; we're only using the standard notification scripts in /etc/icinga2/scripts. |
Updated by n0braist on 2016-12-23 08:08:44 +00:00 Hi, my scripts are in /usr/lib/nagios/ |
Updated by greatexpectations on 2016-12-24 09:47:56 +00:00 Hi, as a temporary workaround on RHEL/CentOS 7 (or any other distribution that uses systemd), you may want to modify the systemd unit file for icinga2 so that the service is automatically restarted in case of a crash.
Then add the Restart=always line in the [Service] section, e.g. as in the sketch below. Kind regards |
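A minimal sketch of that [Service] addition, assuming the stock icinga2.service unit; only the Restart line is new, everything else in the unit stays as shipped:

[Service]
# keep the existing directives from the icinga2 unit as they are;
# the added line makes systemd restart the daemon whenever it exits abnormally
Restart=always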
Updated by gvde on 2016-12-24 10:04:00 +00:00 greatexpectations wrote:
For all the recent crashes, that won't work on our CentOS 7: even though there is a crash report for icinga2, the icinga2 process itself is still running and the status of icinga2.service is still active. icinga2 doesn't check anything anymore, but the process is still running. Thus, systemd won't notice that icinga2 isn't operating correctly anymore, and there won't be any restart... |
Updated by gvde on 2016-12-24 10:12:12 +00:00 greatexpectations wrote:
And on a side note: you don't want to modify the original icinga2.service file on CentOS 7, i.e. the one in /usr/lib/systemd/system/, ever. All files in that directory come from the RPM packages and are not to be modified; any change will be overwritten during the next update. Modifications of systemd services go into /etc/systemd/system, e.g. you create a directory icinga2.service.d and put the configuration into a conf file inside that directory (see the sketch below)... |
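A sketch of that drop-in approach, assuming the stock CentOS 7 package layout (the file name restart.conf is illustrative, not from the packages):

# /etc/systemd/system/icinga2.service.d/restart.conf
[Service]
Restart=always

# afterwards, make systemd pick up the drop-in:
systemctl daemon-reload
systemctl restart icinga2.service

This keeps /usr/lib/systemd/system/icinga2.service untouched, so package updates won't overwrite the change.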
Updated by fallback on 2016-12-29 09:04:20 +00:00 gvde wrote:
This is exactly the same behaviour on my "Ubuntu 16.04.1 LTS". |
Updated by Marax on 2016-12-30 08:12:31 +00:00 I noticed here that the crashes occur when a plugin with extensive output changes its state. Icinga2 tries to send a notification and crashes. The mail contains the extended output of the plugin under "Additional Info". For example...
***** Icinga *****
Notification Type: PROBLEM
Service: Interface Usage
Date/Time: 2016-12-28 09:37:33 +0100
Additional Info: CRITICAL - interface GigabitEthernet0/24 (alias Gi0/24) usage is in:95.51% (955074379.75bit/s) out:0.61% (6077712.50bit/s), interface Vlan1 (alias Vl1) usage is in:0.00% (5821.50bit/s) out:0.00% (14859.00bit/s), interface GigabitEthernet0/1 (alias Gi0/1) usage is in:0.00% (0.00bit/s) out:0.00% (0.00bit/s) (down), interface GigabitEthernet0/2 (alias Gi0/2) usage is in:0.00% (0.00bit/s) out:0.00% (0.00bit/s) (down), interface GigabitEthernet0/3 (alias Gi0/3) usage is in:0.00% (0.00bit/s) out:0.00% (0.00bit/s) (down), interface GigabitEthernet0/4 (alias Gi0/4) usage is in:0.00% (374.25bit/s) out:0.00% (2084.00bit/s), interface GigabitEthernet0/5 (alias Gi0/5) usage is in:0.00% (0.00bit/s) out:0.00% (0.00bit/s) (down), interface GigabitEthernet0/6 (alias Gi0/6) usage is in:0.00% (0.00bit/s) out:0.00% (0.00bit/s) (down), interface GigabitEthernet0/7 (alias Gi0/7) usage is in:0.00% (0.00bit/s) out:0.00% (0.00bit/s) (down), interface GigabitEthernet0/8 (alias Gi0/8) usage is in:0.00% (0.00bit/s) out:0.00% (1517.50bit/s), interface GigabitEthernet0/9 (alias Gi0/9) usage is in:0.00% (0.00bit/s) out:0.00% (0.00bit/s) (down), interface GigabitEthernet0/10 (alias Gi0/10) usage is in:0.00% (0.00bit/s) out:0.00% (0.00bit/s) (down), interface GigabitEthernet0/11 (alias Gi0/11) usage is in:0.00% (0.00bit/s) out:0.00% (0.00bit/s) (down), interface GigabitEthernet0/12 (alias Gi0/12) usage is in:0.00% (0.00bit/s) out:0.00% (0.00bit/s) (down), interface GigabitEthernet0/13 (alias Gi0/13) usage is in:0.00% (0.00bit/s) out:0.00% (0.00bit/s) (down), interface GigabitEthernet0/14 (alias Gi0/14) usage is in:0.00% (0.00bit/s) out:0.00% (0.00bit/s) (down), interface GigabitEthernet0/15 (alias Gi0/15) usage is in:0.00% (0.00bit/s) out:0.00% (0.00bit/s) (down), interface GigabitEthernet0/16 (alias Gi0/16) usage is in:0.00% (0.00bit/s) out:0.00% (0.00bit/s) (down), interface GigabitEthernet0/17 (alias Gi0/17) usage is in:0.00% (0.00bit/s) out:0.00% (0.00bit/s) (down), interface GigabitEthernet0/18 (alias Gi0/18) usage is in:0.00% (0.00bit/s) out:0.00% (0.00bit/s) (down), interface GigabitEthernet0/19 (alias Gi0/19) usage is in:0.39% (3866826.75bit/s) out:62.95% (629520783.75bit/s), interface GigabitEthernet0/20 (alias Gi0/20) usage is in:0.22% (2195255.50bit/s) out:32.56% (325550413.50bit/s), interface GigabitEthernet0/21 (alias Gi0/21) usage is in:0.00% (0.00bit/s) out:0.00% (0.00bit/s) (down), interface GigabitEthernet0/22 (alias Gi0/22) usage is in:0.00% (0.00bit/s) out:0.00% (0.00bit/s) (down), interface GigabitEthernet0/23 (alias Gi0/23) usage is in:0.00% (0.00bit/s) out:0.00% (0.00bit/s) (down), interface Null0 usage is in:0.00% (0.00bit/s) out:0.00% (0.00bit/s)
Comment: []
Can someone confirm this from their experience? |
Updated by fallback on 2016-12-30 08:55:43 +00:00 In my experience it's not related to the plugin's output. It even keeps crashing although no service or host notification was sent at the time. |
Updated by Marax on 2016-12-30 10:15:46 +00:00 Maybe it's not necessarily the plugin output. We don't run Icinga2 in cluster mode here, but perhaps the bug also affects the communication between cluster nodes. However, for testing purposes I have switched off notifications for the affected services. Let's see if it runs until next year ;-) |
Updated by flex on 2017-01-04 13:06:07 +00:00 Marax wrote:
I can confirm this: if we remove a check whose output is too long, everything is OK. |
Updated by fugstrolch on 2017-01-05 13:04:05 +00:00 (Sorry for my English :) ) We have an HA cluster with two Icinga2 masters. Since the installation of v2.6.0 they have crashed every 2-3 days. MAYBE: the "plugin-notification-command" (which isn't necessary since v2.6.0?) has problems with extended plugin outputs? Can someone confirm this? The solution would then be: don't use "import plugin-notification-command" anymore (?) |
Updated by Marax on 2017-01-05 18:36:16 +00:00 flex wrote:
Since switching off notifications for services with extensive output, there have been no crashes anymore since last year ;-) It has something to do with passing a lot of data to external scripts, perhaps the resulting runtime. It does not have to be a notification; I think that all external scripts are affected. |
Updated by lucasmopdx on 2017-01-05 22:02:34 +00:00 Is there any resolution available here? I've tried removing the import plugin-notification-command. I can't make my notification messages any shorter. This is a pretty serious bug and the packages for 2.5.4 appear to have been removed from the repository (at least for Ubuntu Trusty), so downgrading is also a difficult option. |
Updated by lucasmopdx on 2017-01-05 22:30:33 +00:00 Here's a backtrace, if useful:
pvtinfo=pvtinfo@entry=0x7f1bee7297c0 <typeinfo for boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<std::invalid_argument> >>,
current_function=current_function@entry=0x7f1bee4ccc40 <icinga::JsonDecode(icinga::String const&)::__PRETTY_FUNCTION__> "icinga::Value icinga::JsonDecode(const icinga::String&)",
|
Updated by lucasmopdx on 2017-01-05 22:42:16 +00:00 The issue appears to be here: https://github.com/Icinga/icinga2/blob/v2.6.0/lib/base/process.cpp#L278 Basically, it allocates a buffer that's a maximum of 4096 bytes for a message. If the message exceeds 4096 bytes, it's just not read past that length? (I didn't look at the code that sends this) |
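A small standalone sketch (plain C++ with POSIX calls, not Icinga 2 code) of the failure mode described above: a single read() into a fixed 4096-byte buffer drops everything past that size, so a JSON message longer than the buffer loses its closing brace and a later decode reports a premature EOF. All names below are illustrative; a read loop that continues until EOF would avoid the truncation.

#include <cstdio>
#include <string>
#include <unistd.h>

int main() {
    int fds[2];
    if (pipe(fds) != 0)
        return 1;

    // Simulate a status message larger than one buffer (~8 KB of JSON).
    std::string json = "{\"output\":\"" + std::string(8000, 'x') + "\"}";
    if (write(fds[1], json.data(), json.size()) != static_cast<ssize_t>(json.size()))
        return 1;
    close(fds[1]);

    // A single read into a 4096-byte buffer: everything past 4096 bytes is lost.
    char buf[4096];
    ssize_t n = read(fds[0], buf, sizeof(buf));
    close(fds[0]);

    std::string received(buf, n > 0 ? static_cast<size_t>(n) : 0);
    std::printf("sent %zu bytes, received %zd bytes\n", json.size(), n);
    std::printf("received message %s with '}'\n",
                !received.empty() && received.back() == '}' ? "ends" : "does NOT end");
    // Handing the truncated string to a JSON parser reports a premature EOF,
    // because the document is cut off in the middle of a string value.
    return 0;
}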
Updated by Marax on 2017-01-06 07:37:28 +00:00 lucasmopdx wrote:
I have no idea about programming, but is there possibly a relationship with issue #13567? Or are there two bugs in the same file? |
Updated by lucasmopdx on 2017-01-06 21:30:49 +00:00 Marax wrote:
These are two separate bugs in the same file. |
Updated by lucasmopdx on 2017-01-06 21:35:15 +00:00 Added a PR on GitHub to fix this issue: Note that I was not able to test it, as we worked around the problem by (painfully) downgrading to icinga2 2.5.4. |
Updated by mfriedrich on 2017-01-09 15:16:48 +00:00
|
Updated by mfriedrich on 2017-01-09 15:18:26 +00:00
|
Updated by n0braist on 2017-01-09 15:33:21 +00:00 One piece of information on the length: I saw the behavior when the output length is longer than 5120 bytes. |
Updated by mfriedrich on 2017-01-09 15:40:37 +00:00
|
Updated by erikandersonmx on 2017-01-09 16:52:52 +00:00 lucasmopdx wrote:
Awesome find! |
Updated by mfriedrich on 2017-01-09 16:59:11 +00:00
|
Updated by mfriedrich on 2017-01-11 13:08:26 +00:00
|
Updated by snallygaster on 2017-01-12 16:09:51 +00:00 Just want to check: is everyone experiencing this issue running SELinux in enforcing mode? It's not a good test for a production environment, but since running my test server with SELinux in permissive mode I haven't had a recurrence of the issue. |
Updated by gvde on 2017-01-12 16:18:38 +00:00 snallygaster wrote:
We are running the production server in permissive mode only, as there are a lot of AVC hits, and we still see the crash from time to time... So at least for us it's not related to SELinux AVC denials... |
Updated by gvde on 2017-01-13 09:18:50 +00:00 n0braist wrote:
I can confirm that it seems to be related to the length of the plugin output: it crashes when the output is longer than 4223 bytes and does not crash when it's less than 2933 bytes. |
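A hypothetical way to reproduce output above those sizes (the object names and the shell command are made up for illustration, not taken from anyone's configuration): a check that always returns CRITICAL with roughly 6 KB of output, so a state change and a notification get triggered. A matching Host object and a Notification object (not shown) still have to apply to the service.

object CheckCommand "long-output" {
  // emit about 6000 bytes of plugin output, then exit 2 (CRITICAL)
  command = [ "/bin/sh", "-c", "yes | head -c 6000; exit 2" ]
}

object Service "long-output-test" {
  host_name = "localhost"        // assumes a Host object named "localhost" exists
  check_command = "long-output"
}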
Updated by gbeutner on 2017-01-16 07:18:33 +00:00
The only thing I'm not quite happy with in regard to the PR is the fact that the ProcessHandler function is throwing exceptions. There's nobody to catch those exceptions, which would definitely cause the child process to crash. I'll see if I can get that updated myself and then merge the PR into the master branch. |
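To illustrate the concern in a standalone way (this is not the Icinga 2 ProcessHandler, just a sketch with made-up names): an exception that escapes a worker with no handler above it reaches std::terminate() and aborts the whole process, whereas a top-level catch lets the process report the bad message and exit, or carry on, in a controlled way.

#include <cstdio>
#include <cstdlib>
#include <stdexcept>

// Stand-in for a handler that may throw on malformed input (illustrative only).
static void HandleMessage(bool malformed) {
    if (malformed)
        throw std::invalid_argument("parse error: premature EOF");
    std::puts("message handled");
}

int main() {
    // Without this try/catch the exception would propagate out of main,
    // reach std::terminate() and abort the (child) process.
    try {
        HandleMessage(true);
    } catch (const std::exception& ex) {
        std::fprintf(stderr, "discarding malformed message: %s\n", ex.what());
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}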
Updated by mfriedrich on 2017-01-16 10:29:08 +00:00 Test case for many arguments creating a crash:
|
Updated by Anonymous on 2017-01-16 13:10:03 +00:00
Applied in changeset 06064e7. |
This issue has been migrated from Redmine: https://dev.icinga.com/issues/13655
Created by erikandersonmx on 2016-12-16 20:40:28 +00:00
Assignee: gbeutner
Status: Resolved (closed on 2017-01-16 13:10:03 +00:00)
Target Version: 2.6.1
Last Update: 2017-01-16 13:10:03 +00:00 (in Redmine)
Not sure where to look from here:
error: parse error: premature EOF
{arguments
(right here) ------^
(0) libbase.so.2.6.0: (+0xc9148) [0x7f60bab36148]
(1) libbase.so.2.6.0: (+0xc91f9) [0x7f60bab361f9]
(2) libbase.so.2.6.0: icinga::JsonDecode(icinga::String const&) (+0x3ce) [0x7f60baad26ae]
(3) libbase.so.2.6.0: (+0x85f82) [0x7f60baaf2f82]
(4) libbase.so.2.6.0: (+0x87640) [0x7f60baaf4640]
(5) /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2() [0x409088]
(6) /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2() [0x406fa9]
(7) libc.so.6: __libc_start_main (+0xf5) [0x7f60b9c43f45]
(8) /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2() [0x4070e0]
Changesets
2017-01-16 13:04:45 +00:00 by (unknown) 06064e7
2017-01-16 13:05:18 +00:00 by gbeutner 82bab6b
2017-01-16 13:05:36 +00:00 by gbeutner c31d024
2017-01-16 13:15:39 +00:00 by (unknown) 9fa3f3b
2017-01-16 13:15:42 +00:00 by gbeutner 060e20f
2017-01-16 13:15:42 +00:00 by gbeutner ff07cee
Relations: