[dev.icinga.com #2458] ido2db forks endless if database is not available, race condition on unclosed socket #920
Comments
Updated by Tommi on 2012-03-23 18:03:21 +00:00 Today I found >2000 ido2db processes again
Updated by Tommi on 2012-04-06 12:25:58 +00:00 ido2db process tree
Now a second ido2db process, 24077, is started. I try to attach there (not that easy!!) and trace until this process dies.
process tree after terminating
ido2db debug log
Updated by mfriedrich on 2012-04-20 12:47:24 +00:00 Is this still the case?
Updated by Tommi on 2012-04-27 12:01:07 +00:00
My last entry was closed too early. Today it happened again, after I changed the database credentials to an invalid entry. If the database is not reachable, the reconnect tries again and again to connect, leaving the process open each time. After 5 minutes I have >10 zombies.
Looks like the reaper is not in place here. But I have no clue how to implement it.
Updated by mfriedrich on 2012-04-27 12:09:31 +00:00 The problem is rather easy to miss when you have 5 things to do at the same time: the functionality check on an established connection within handle_client_connection() returns IDO_ERROR on connection failure. Now guess which if condition handles that return value.
Right, none.
Updated by mfriedrich on 2012-04-27 14:38:53 +00:00
Well, the thing is ... I cannot reproduce that over here, so my guess is the following. ido2db_wait_for_connections() is polling the socket for new connections. When the child is forked, it calls ido2db_handle_client_connection() to process the received data and to first try the initial handshake with the db connection. The db connection fails, memory is cleaned up, and ido2db_handle_client_connection() returns (being the child). The socket is closed when that function returns, but not cleaned up, meaning the socket is never shut down. I see the immediate return after bailing out, plus the socket close, as a possible error for the child, which still has not shut down the socket.
You could try to remove the line with "return IDO_OK", because then the code will run further into the ido2db_cleanup_socket() call itself, quitting cleanly and possibly not leaving the children waiting on the socket. But I doubt that this really solves the problem, as the child should not clean up the socket itself, but leave that to the parent. For some strange reason your forked children do not exit cleanly and still hold resources. Can you provide some truss output, plus possibly some output showing which child is doing what? Especially when run through a debugger: follow the child forks, set a breakpoint at the failed db connection, and run it step by step, checking the values of the watched variables.
Updated by Tommi on 2012-05-02 18:55:13 +00:00 Well, it looks like it will be necessary to set up a bigger debug session. But this is too important to skip. I have to see when I can do this, and how to debug threads with dbx. But it will never fit into the 1.7 release.
Updated by mfriedrich on 2012-05-14 17:38:37 +00:00
I ran into it today myself, while final-testing the pgsql backend. Once I had my postgresql running, but tried to kill the schema, stopped ido2db, re-created the schema and then started postgresql, it was like 'fire-and-forget'.
The sole reason is the funny "check if the db is connected and if not, reconnect" feature which was suggested a while back. This feature was implemented with the "ido2db_db_hello" function in mind, in order to reliably detect an established and working database. The problem at this stage: ido2db_db_hello was designed to be called only once (or twice, to be exact): on the first handshake after the idomod api, plus on reconnect during ido2db_db_query, but not within its own function, namely ido2db_db_reconnect, which is called quite often, even before issuing a query.
The main problem is and was the implementation of ido2db_db_hello, which just wiped all the threads, sockets and memory and exited the child itself, triggering the parent's sighandler. At this stage of the call flow this was incorrect, as it did not return fully to allow a proper socket close, which actually means a client disconnect for idomod. So in order to fix this, return values must be checked for sanity, and the final exit must happen after handling a client connection, with sane return values, and not via hard calls to _exit(0), whether there or at a different location when hitting queries.
Updated by mfriedrich on 2012-05-14 17:44:02 +00:00
A different, not so clean fix would be an enforced socket close, disconnecting idomod the hard way. But as wiping everything from within db_hello (which is not authorized to do so) is not the intended way for this to work, the bigger diff will (hopefully) clear the issue.
Updated by mfriedrich on 2012-05-14 18:10:19 +00:00
steps to reproduce
I have seen this on Oracle plus RHEL as well, when the Oracle connection became unstable. With the fix applied:
Updated by crfriend on 2012-05-14 20:13:49 +00:00 I have also seen this, although it might be more of a corner case, with MySQL, and can reproduce it trivially by using the "Classic" UI to restart the Icinga process. The Icinga process stops, disconnects from the database, restarts, connects again to the database, performs its updates, and enters the event loop, leaving behind a "defunct" ("zombie") ido2db process that's owned by the parent. Investigation of the matter using Solaris' "preap" command, in the case above, yields a "process exited with status 0" (or some very similar verbiage). This indicates, to me, that the parent ido2db process is not looking for the signal that the child has exited, or is ignoring the exit status once it receives the signal.
Updated by mfriedrich on 2012-05-14 20:16:17 +00:00 I'm pretty sure that my attempt fixes this - tested OK. So please be so kind as to fetch the 'next' tree and compare it to the current master.
Updated by mfriedrich on 2012-05-14 20:16:49 +00:00
Updated by ricardo on 2012-05-14 20:36:56 +00:00 Tried, but couldn't reproduce this behavior with mysql.
Updated by crfriend on 2012-05-14 23:20:20 +00:00 ricardo wrote:
I may have a different branch than you, but I can use that tactic to get "defunct" processes at will. I've been playing with this problem a little bit, and I notice that the "grandparent" sets up the signal handlers with a call to signal(). I've got a few beers on board at the moment, but from the Solaris doco I have in front of me it seems that the sig-handler reverts to the default behaviour after the signal fires. I changed the local code to use sigset(), but that didn't seem to have the desired effect. Looking at the startup in the verbose debug log, I notice that the sig-handlers are getting defined very early, in the parent, and it's the grandchild that's the ultimate arbiter of things once they really get going. At this point I'm wondering if holding off on the calls to signal() until the grandchild is spun up, or re-issuing them in the grandchild, might be the way to go.
Updated by mfriedrich on 2012-05-15 06:20:35 +00:00 Hello? Can you please just pull next and test the fix? I already know where the problem is - but without proper test feedback this can't reach 1.7!
Updated by crfriend on 2012-05-15 10:50:11 +00:00 I'm still fighting with git and cannot for the life of me figure out how to switch to the "next" branch. More to the point, I can get git to say I'm on "next", but the ido2db.c file is identical to the one from 1.7.0beta1. Is there a concise "cheat sheet" on how to get to that point quickly? Or, if the proposed patch isn't too complex, just attach it here so I can integrate it into my system for testing? Cheers!
Updated by mfriedrich on 2012-05-15 10:57:43 +00:00
Updated by crfriend on 2012-05-15 12:29:51 +00:00 dnsmichi wrote: OK, I deserved that. It looks like the defunct process problem exists. I built the new code, stopped my test Icinga instance, installed the new code, and then started it. Here's the process tree after the start:
Then, using the classic web interface, I restarted it. Here's the process tree a few minutes after the restart:
Note that I was able to reap the zombie with the "preap" command. My gut feeling is that this is not a gating issue for the release of 1.7.0; it needs to be documented, however, for the platforms that are affected. I'll hammer on it a bit more when I get home (being at work right now).
Updated by jmosshammer on 2012-05-15 12:33:04 +00:00 Just tested, no problems here (Ubuntu 10.04 LTS, postgresql 8.4).
Updated by mfriedrich on 2012-05-15 13:05:23 +00:00
@carl @Jannis Gunnar reported OK as well, so for me it's working.
Updated by crfriend on 2012-05-15 13:31:12 +00:00 dnsmichi wrote:
It's a matter of the ido2db child exiting, and I can provoke the problem by restarting the Icinga instance via the web interface (by way of the IDO module normally disconnecting from the ido2db child). That it happens on database failures is only because those children exit and aren't reaped; any child exit will cause the symptom. It's broader than just a database connection issue.
Updated by Tommi on 2012-05-18 15:08:46 +00:00
For me it's not solved on Solaris. Based on the v1.7.0 release tarball, the same symptoms are there again.
Updated by mfriedrich on 2012-05-18 15:10:53 +00:00
Then it's up to you to solve it further. My issues are resolved with the given patch.
Updated by Tommi on 2012-05-18 18:33:23 +00:00 Thx :(
Updated by crfriend on 2012-05-18 23:10:27 +00:00 Tommi wrote:
Oh, look at it as a challenge. ;-) I am seeing the same problem on Solaris as you, and I'm exerting as much mental force on the matter as I can in a roomful of chickens (it's a very long story). I am also seeing it in vastly more general terms than database disconnects, so it's likely an "easy" fix. Unfortunately, if that's the case, it's also a whopping big portability issue. Careful reads of the signal() routines on Solaris and Linux reveal subtle differences, and I believe the devil may lie in those details, specifically the ones dealing with when the sig-handlers are enabled and when they're not. I've tried thrice this evening to get decent truss traces, but seem to be fighting both rampant distractions (see above) and smf gleefully restarting things even though I have Icinga explicitly disabled in smf. (Next step is an svccfg delete operation.) There's also the strange-looking code in ido2db_parent_sighandler() that, at the moment, doesn't make a whole lot of sense to me (although I admit I'm more used to dealing with interrupts in hardware).
Updated by mfriedrich on 2012-08-20 12:41:00 +00:00
Updated by mfriedrich on 2014-12-08 14:37:37 +00:00
This issue has been migrated from Redmine: https://dev.icinga.com/issues/2458
Created by Tommi on 2012-03-21 07:39:33 +00:00
Assignee: Tommi
Status: Closed (closed on 2012-08-20 12:41:00 +00:00)
Target Version: (none)
Last Update: 2014-12-08 14:37:37 +00:00 (in Redmine)
ido2db permanently forks new processes if the database is not available. Previously forked instances are not terminated; they remain as zombies on the system. After some hours of database outage I counted nearly 1000 ido2db processes. Seen on Solaris 10 with Oracle backend, ido2db 1.6.1. Can't remember having similar behavior on Linux; maybe it's OS-specific.
Changesets
2012-05-14 17:44:45 +00:00 by mfriedrich 4c9b080
Relations: