Skip to content
This repository has been archived by the owner on Jan 15, 2019. It is now read-only.

[dev.icinga.com #2458] ido2db forks endless if database is not available, race condition on unclosed socket #920

Closed
icinga-migration opened this issue Mar 21, 2012 · 30 comments

Comments

@icinga-migration
Copy link

This issue has been migrated from Redmine: https://dev.icinga.com/issues/2458

Created by Tommi on 2012-03-21 07:39:33 +00:00

Assignee: Tommi
Status: Closed (closed on 2012-08-20 12:41:00 +00:00)
Target Version: (none)
Last Update: 2014-12-08 14:37:37 +00:00 (in Redmine)

Icinga Version: 1.10.0
OS Version: any

ido2db forks permanent new processes if database not available. previous forked instances are not terminated, they remain as zombies on the system. After some hours of database outage i counted nearly 1000 ido2db processes. Seen on solaris 10 with oracle backend, ido2db 1.6.1. Cant remember to have similar behavior on linux, maybe OS specific

Changesets

2012-05-14 17:44:45 +00:00 by mfriedrich 4c9b080

modify db connection error handling in the ido2db child to prevent endless forks #2458

when the handle_client_connection function is being run, reading from
the socket, and then passing that to a dynamic buffer, the
check_for_client_input call will happen, and then checking if the
idodmod client should be disconnected due to the gotten data.
within handle_client_input, it will be made clear that the initial
db_hello needs to take place. problem at this stage is, that a
non-existing database connection will cause this function to fail,
calling its own wipe threads/memory/connection functionality plus a hard
call to _exit(0) which will leave the socket unclosed and open - and
therefore leading to a new handshake with idomod on the next loop run.

within db_hello, the db_version_check takes place as well, where errors
are ignored (cause return value not taken into account) plus the
instance name check depends on a valid db connection, which will lead
into an error on db_query call (where a call to db_reconnect happens,
which will fail too). the handle_db_error function will now set
disconnect_client=IDO_TRUE, but after not fetching the correct values,
the else condition will match, trying to hard-exit the child in process.
this exit will lead into the mentioned leftopen socket, leading to
multiple forks on new connections - while the childs are left in defunct
or socket_wait, as their file descriptor is still open on the socket -
which is still in use by the first and following n forks. so this is a
valid race condition amongst the first false exit leaving the socket
open, as well as the others not getting rid of it after their exit calls
as well.

the proper fix for this brainfuck is merely to check the return values
of each function call in this row, and bailing early without wiping
anything nor exiting the child. this is being left alone to the
handle_client_connection function, which will normally decide upon
disconnect_client==IDO_TRUE? if it is time to disconnect idomod AND
close the socket. plus terminate the threads and clean the memory.

same goes for the housekeeping thread as well, this is not even allowed
to call _exit(0) on its main process, and therefore will return an error
on db connection failure too, whereas this thread connection polls in
5sec interval, if a valid connection and/or instance name is available.

overall, this is not the best fix, but it is more clean than closing the
socket before invoking an uncontrolled call to _exit(0) - which probably
would have saved the hassle as well - did not try though.

refs #2458

Relations:

@icinga-migration
Copy link
Author

Updated by Tommi on 2012-03-23 18:03:21 +00:00

today i found >2000 ido2db processes again
The problem here is not the failed database connection. The fork is not cleaned from the system if the database connection unsuccessfully and a new try is started. A full stop of ido2db using init script cleanes all processes properly, including the zombies

[1332497487.012159] [001.2] [pid=2540] [tid=1] ido2db_handle_client_connection() start
[1332497487.013560] [001.2] [pid=2540] [tid=1] ido2db_open_debug_log() end
[1332497487.014670] [001.2] [pid=2540] [tid=1] ido2db_idi_init() start
[1332497487.014700] [001.2] [pid=2540] [tid=1] ido2db_idi_init() end
[1332497487.014722] [001.2] [pid=2540] [tid=1] ido2db_db_init() start
[1332497487.025790] [001.2] [pid=2540] [tid=1] OCI_Initialize() ok
[1332497487.025830] [001.2] [pid=2540] [tid=1] ido2db_db_init() end
[1332497487.025853] [001.2] [pid=2540] [tid=1] ido2db_db_connect() start
[1332497487.092821] [001.2] [pid=2540] [tid=1] OCIERROR - MSG ORA-12154: TNS:could not resolve the connect identifier specified

[1332497487.093042] [001.2] [pid=2540] [tid=1] Error: Could not connect to oracle database
[1332497487.093131] [001.2] [pid=2540] [tid=1] ido2db_handle_client_connection() idi.dbinfo.connected is '0'
[1332497487.093180] [001.2] [pid=2540] [tid=1] ido2db_db_disconnect() start
[1332497487.093205] [001.2] [pid=2540] [tid=1] ido2db_is_connected() Warning:Null Connection handle 
[1332497487.093228] [001.2] [pid=2540] [tid=1] ido2db_db_disconnect() already disconnected
[1332497487.093310] [001.2] [pid=2540] [tid=1] ido2db_db_deinit() start
[1332497487.093375] [001.2] [pid=2540] [tid=1] ido2db_free_cached_object_ids() start
[1332497487.093401] [001.2] [pid=2540] [tid=1] ido2db_free_cached_object_ids() end
[1332497487.093422] [001.2] [pid=2540] [tid=1] ido2db_db_deinit() end
[1332497487.093850] [001.2] [pid=2540] [tid=1] ido2db_db_deinit() start
[1332497487.093875] [001.2] [pid=2540] [tid=1] ido2db_free_cached_object_ids() start
[1332497487.093897] [001.2] [pid=2540] [tid=1] ido2db_free_cached_object_ids() end
[1332497487.093917] [001.2] [pid=2540] [tid=1] ido2db_db_deinit() end
[1332497487.093938] [001.2] [pid=2540] [tid=1] ido2db_free_input_memory() start
[1332497487.093958] [001.2] [pid=2540] [tid=1] ido2db_free_input_memory() end
[1332497487.093978] [001.2] [pid=2540] [tid=1] ido2db_free_connection_memory() start
[1332497487.093999] [001.2] [pid=2540] [tid=1] ido2db_free_connection_memory() end
[1332497505.269327] [001.2] [pid=2705] [tid=1] ido2db_handle_client_connection() start
[1332497505.270671] [001.2] [pid=2705] [tid=1] ido2db_open_debug_log() end
[1332497505.271738] [001.2] [pid=2705] [tid=1] ido2db_idi_init() start
[1332497505.271769] [001.2] [pid=2705] [tid=1] ido2db_idi_init() end
[1332497505.271792] [001.2] [pid=2705] [tid=1] ido2db_db_init() start
[1332497505.285868] [001.2] [pid=2705] [tid=1] OCI_Initialize() ok
[1332497505.285906] [001.2] [pid=2705] [tid=1] ido2db_db_init() end
[1332497505.285930] [001.2] [pid=2705] [tid=1] ido2db_db_connect() start
[1332497505.353285] [001.2] [pid=2705] [tid=1] OCIERROR - MSG ORA-12154: TNS:could not resolve the connect identifier specified

[1332497505.353500] [001.2] [pid=2705] [tid=1] Error: Could not connect to oracle database
[1332497505.353588] [001.2] [pid=2705] [tid=1] ido2db_handle_client_connection() idi.dbinfo.connected is '0'
[1332497505.353634] [001.2] [pid=2705] [tid=1] ido2db_db_disconnect() start
[1332497505.353658] [001.2] [pid=2705] [tid=1] ido2db_is_connected() Warning:Null Connection handle 
[1332497505.353681] [001.2] [pid=2705] [tid=1] ido2db_db_disconnect() already disconnected
[1332497505.353763] [001.2] [pid=2705] [tid=1] ido2db_db_deinit() start
[1332497505.353826] [001.2] [pid=2705] [tid=1] ido2db_free_cached_object_ids() start
[1332497505.353852] [001.2] [pid=2705] [tid=1] ido2db_free_cached_object_ids() end
[1332497505.353874] [001.2] [pid=2705] [tid=1] ido2db_db_deinit() end
[1332497505.354297] [001.2] [pid=2705] [tid=1] ido2db_db_deinit() start
[1332497505.354322] [001.2] [pid=2705] [tid=1] ido2db_free_cached_object_ids() start
[1332497505.354344] [001.2] [pid=2705] [tid=1] ido2db_free_cached_object_ids() end
[1332497505.354365] [001.2] [pid=2705] [tid=1] ido2db_db_deinit() end
[1332497505.354434] [001.2] [pid=2705] [tid=1] ido2db_free_input_memory() start
[1332497505.354456] [001.2] [pid=2705] [tid=1] ido2db_free_input_memory() end
[1332497505.354476] [001.2] [pid=2705] [tid=1] ido2db_free_connection_memory() start
[1332497505.354496] [001.2] [pid=2705] [tid=1] ido2db_free_connection_memory() end
[1332497523.411876] [001.2] [pid=2785] [tid=1] ido2db_handle_client_connection() start
[1332497523.413221] [001.2] [pid=2785] [tid=1] ido2db_open_debug_log() end
[1332497523.414348] [001.2] [pid=2785] [tid=1] ido2db_idi_init() start
[1332497523.414380] [001.2] [pid=2785] [tid=1] ido2db_idi_init() end
[1332497523.414402] [001.2] [pid=2785] [tid=1] ido2db_db_init() start
[1332497523.421905] [001.2] [pid=2785] [tid=1] OCI_Initialize() ok
[1332497523.421943] [001.2] [pid=2785] [tid=1] ido2db_db_init() end
[1332497523.421966] [001.2] [pid=2785] [tid=1] ido2db_db_connect() start
[1332497523.487357] [001.2] [pid=2785] [tid=1] OCIERROR - MSG ORA-12154: TNS:could not resolve the connect identifier specified

[1332497523.487643] [001.2] [pid=2785] [tid=1] Error: Could not connect to oracle database
[1332497523.487737] [001.2] [pid=2785] [tid=1] ido2db_handle_client_connection() idi.dbinfo.connected is '0'
[1332497523.487786] [001.2] [pid=2785] [tid=1] ido2db_db_disconnect() start
[1332497523.487810] [001.2] [pid=2785] [tid=1] ido2db_is_connected() Warning:Null Connection handle 
[1332497523.487834] [001.2] [pid=2785] [tid=1] ido2db_db_disconnect() already disconnected
[1332497523.487974] [001.2] [pid=2785] [tid=1] ido2db_db_deinit() start
[1332497523.488056] [001.2] [pid=2785] [tid=1] ido2db_free_cached_object_ids() start
[1332497523.488084] [001.2] [pid=2785] [tid=1] ido2db_free_cached_object_ids() end
[1332497523.488106] [001.2] [pid=2785] [tid=1] ido2db_db_deinit() end
[1332497523.488502] [001.2] [pid=2785] [tid=1] ido2db_db_deinit() start
[1332497523.488526] [001.2] [pid=2785] [tid=1] ido2db_free_cached_object_ids() start
[1332497523.488548] [001.2] [pid=2785] [tid=1] ido2db_free_cached_object_ids() end
[1332497523.488568] [001.2] [pid=2785] [tid=1] ido2db_db_deinit() end
[1332497523.488588] [001.2] [pid=2785] [tid=1] ido2db_free_input_memory() start
[1332497523.488610] [001.2] [pid=2785] [tid=1] ido2db_free_input_memory() end
[1332497523.488630] [001.2] [pid=2785] [tid=1] ido2db_free_connection_memory() start
[1332497523.488651] [001.2] [pid=2785] [tid=1] ido2db_free_connection_memory() end
[1332497547.208404] [001.2] [pid=3061] [tid=1] ido2db_handle_client_connection() start
[1332497547.209790] [001.2] [pid=3061] [tid=1] ido2db_open_debug_log() end
[1332497547.210897] [001.2] [pid=3061] [tid=1] ido2db_idi_init() start
[1332497547.210928] [001.2] [pid=3061] [tid=1] ido2db_idi_init() end
[1332497547.210950] [001.2] [pid=3061] [tid=1] ido2db_db_init() start
[1332497547.222834] [001.2] [pid=3061] [tid=1] OCI_Initialize() ok
[1332497547.222876] [001.2] [pid=3061] [tid=1] ido2db_db_init() end

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2012-04-04 11:43:48 +00:00

could be related to the change introduced in #2271 and debugged / fixed in #2500

rather than the debug log, ps output plus strace -p information would be more interesting in that case.

@icinga-migration
Copy link
Author

Updated by Tommi on 2012-04-06 12:25:58 +00:00

ido2db process tree

> ptree 23937
23937 icinga/bin/ido2db -c
  23991 
  23962 
  24077 icinga/bin/ido2db -c

Now second ido2db process 24077 is started. try to attach there (not that easy!!) and trace until this process dies

> truss -p 24077
/1:     read(9, 0x100320E36, 8208)      (sleeping...)
/2:     nanosleep(0xFFFFFFFF79FFBF28, 0x00000000) (sleeping...)
/1:     read(9, "\0\v\0\0\f\0\0\001\001", 8208)         = 11
/1:     write(9, "\0\v\0\0\f\0\0\001\002", 11)          = 11
/1:     read(9, "\0\v\0\0\f\0\0\001\002", 8208)         = 11
/1:     read(9, "\0C6\0\006\0\0\0\0\004\0".., 8208)     = 198
/1:     time()                                          = 1333709082
/1:     getpid()                                        = 24077 [23937]
/1:     write(5, " [ 1 3 3 3 7 0 9 0 8 2 .".., 115)     = 115
/1:     lseek(5, 0, SEEK_CUR)                           = 8380
/1:     write(9, "\0\n\0\006\0\0\0\0 @", 10)            = 10
/1:     close(9)                                        = 0
/1:     lwp_sigmask(SIG_SETMASK, 0x00000002, 0x00000000) = 0xFFBFFEFF
[0x0000FFFF]
/1:     sigaction(SIGINT, 0xFFFFFFFF7FFFD790, 0xFFFFFFFF7FFFD8A0) = 0
/1:     lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF
[0x0000FFFF]
/1:     fstat(3, 0xFFFFFFFF7FFFE490)                    = 0
/1:     time()                                          = 1333709082
/1:     getpid()                                        = 24077 [23937]
/1:     putmsg(3, 0xFFFFFFFF7FFFDB40, 0xFFFFFFFF7FFFDB30, 0) = 0
/1:     open("/var/run/syslog_door", O_RDONLY)          = 9
/1:     door_info(9, 0xFFFFFFFF7FFFDA28)                = 0
/1:     getpid()                                        = 24077 [23937]
/1:     door_call(9, 0xFFFFFFFF7FFFD9F8)                = 0
/1:     close(9)                                        = 0
/1:     time()                                          = 1333709082
/1:     getpid()                                        = 24077 [23937]
/1:     write(5, " [ 1 3 3 3 7 0 9 0 8 2 .".., 92)      = 92
/1:     lseek(5, 0, SEEK_CUR)                           = 8472
/1:     fstat(3, 0xFFFFFFFF7FFFE580)                    = 0
/1:     time()                                          = 1333709082
/1:     getpid()                                        = 24077 [23937]
/1:     putmsg(3, 0xFFFFFFFF7FFFDC30, 0xFFFFFFFF7FFFDC20, 0) = 0
/1:     open("/var/run/syslog_door", O_RDONLY)          = 9
/1:     door_info(9, 0xFFFFFFFF7FFFDB18)                = 0
/1:     getpid()                                        = 24077 [23937]
/1:     door_call(9, 0xFFFFFFFF7FFFDAE8)                = 0
/1:     close(9)                                        = 0
/1:     time()                                          = 1333709082
/1:     getpid()                                        = 24077 [23937]
/1:     write(5, " [ 1 3 3 3 7 0 9 0 8 2 .".., 110)     = 110
/1:     lseek(5, 0, SEEK_CUR)                           = 8582
/1:     time()                                          = 1333709082
/1:     getpid()                                        = 24077 [23937]
/1:     write(5, " [ 1 3 3 3 7 0 9 0 8 2 .".., 77)      = 77
/1:     lseek(5, 0, SEEK_CUR)                           = 8659
/1:     time()                                          = 1333709082
/1:     getpid()                                        = 24077 [23937]
/1:     write(5, " [ 1 3 3 3 7 0 9 0 8 2 .".., 102)     = 102
/1:     lseek(5, 0, SEEK_CUR)                           = 8761
/1:     time()                                          = 1333709082
/1:     getpid()                                        = 24077 [23937]
/1:     write(5, " [ 1 3 3 3 7 0 9 0 8 2 .".., 92)      = 92
/1:     lseek(5, 0, SEEK_CUR)                           = 8853
/1:     time()                                          = 1333709082
/1:     getpid()                                        = 24077 [23937]
/1:     write(5, " [ 1 3 3 3 7 0 9 0 8 2 .".., 73)      = 73
/1:     lseek(5, 0, SEEK_CUR)                           = 8926
/1:     time()                                          = 1333709082
/1:     getpid()                                        = 24077 [23937]
/1:     write(5, " [ 1 3 3 3 7 0 9 0 8 2 .".., 86)      = 86
/1:     lseek(5, 0, SEEK_CUR)                           = 9012
/1:     time()                                          = 1333709082
/1:     getpid()                                        = 24077 [23937]
/1:     write(5, " [ 1 3 3 3 7 0 9 0 8 2 .".., 84)      = 84
/1:     lseek(5, 0, SEEK_CUR)                           = 9096
/1:     time()                                          = 1333709082
/1:     getpid()                                        = 24077 [23937]
/1:     write(5, " [ 1 3 3 3 7 0 9 0 8 2 .".., 71)      = 71
/1:     lseek(5, 0, SEEK_CUR)                           = 9167
/1:     lwp_kill(2, SIGCANCEL)                          = 0
/2:         Received signal #36, SIGCANCEL, in nanosleep() [caught]
/2:           siginfo: SIGCANCEL pid=24077 uid=44684 code=-1
/2:     nanosleep(0xFFFFFFFF79FFBF28, 0x00000000)       Err#4 EINTR
/2:     schedctl()                                      = 0xFFFFFFFF79D32010
/2:     lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000008) = 0xFFBFFEFF
[0x0000FFFF]
/2:     lwp_sigmask(SIG_SETMASK, 0xFFBFFEFF, 0x0000FFF7) = 0xFFBFFEFF
[0x0000FFFF]
/2:     lwp_exit()
/1:     lwp_wait(2, 0xFFFFFFFF7FFFE6EC)                 = 0
/1:     time()                                          = 1333709082
/1:     getpid()                                        = 24077 [23937]
write(5, " [ 1 3 3 3 7 0 9 0 8 2 .".., 73)      = 73
lseek(5, 0, SEEK_CUR)                           = 9240
time()                                          = 1333709082
getpid()                                        = 24077 [23937]
write(5, " [ 1 3 3 3 7 0 9 0 8 2 .".., 86)      = 86
lseek(5, 0, SEEK_CUR)                           = 9326
time()                                          = 1333709082
getpid()                                        = 24077 [23937]
write(5, " [ 1 3 3 3 7 0 9 0 8 2 .".., 84)      = 84
lseek(5, 0, SEEK_CUR)                           = 9410
time()                                          = 1333709082
getpid()                                        = 24077 [23937]
write(5, " [ 1 3 3 3 7 0 9 0 8 2 .".., 71)      = 71
lseek(5, 0, SEEK_CUR)                           = 9481
time()                                          = 1333709082
getpid()                                        = 24077 [23937]
write(5, " [ 1 3 3 3 7 0 9 0 8 2 .".., 81)      = 81
lseek(5, 0, SEEK_CUR)                           = 9562
time()                                          = 1333709082
getpid()                                        = 24077 [23937]
write(5, " [ 1 3 3 3 7 0 9 0 8 2 .".., 79)      = 79
lseek(5, 0, SEEK_CUR)                           = 9641
time()                                          = 1333709082
getpid()                                        = 24077 [23937]
write(5, " [ 1 3 3 3 7 0 9 0 8 2 .".., 86)      = 86
lseek(5, 0, SEEK_CUR)                           = 9727
time()                                          = 1333709082
getpid()                                        = 24077 [23937]
write(5, " [ 1 3 3 3 7 0 9 0 8 2 .".., 84)      = 84
lseek(5, 0, SEEK_CUR)                           = 9811
close(8)                                        = 0
getpid()                                        = 24077 [23937]
fstat(3, 0xFFFFFFFF7FFFEF80)                    = 0
time()                                          = 1333709082
getpid()                                        = 24077 [23937]
putmsg(3, 0xFFFFFFFF7FFFE630, 0xFFFFFFFF7FFFE620, 0) = 0
open("/var/run/syslog_door", O_RDONLY)          = 8
door_info(8, 0xFFFFFFFF7FFFE518)                = 0
getpid()                                        = 24077 [23937]
door_call(8, 0xFFFFFFFF7FFFE4E8)                = 0
close(8)                                        = 0
lseek(5, 0, SEEK_CUR)                           = 9811
close(5)                                        = 0
_exit(0)

process tree after terminating

> ptree 23937
23937 icinga/bin/ido2db -c
  23991 
  23962 
  24077 

ido2db debug log

[1333709072.246391] [001.2] [pid=24077] [tid=1]
ido2db_handle_client_connection() start
[1333709072.247540] [001.2] [pid=24077] [tid=1] ido2db_open_debug_log() end
[1333709072.248633] [001.2] [pid=24077] [tid=1] ido2db_idi_init() start
[1333709072.248661] [001.2] [pid=24077] [tid=1] ido2db_idi_init() end
[1333709072.248707] [001.2] [pid=24077] [tid=1] ido2db_db_init() start
[1333709072.261618] [001.2] [pid=24077] [tid=1] OCI_Initialize() ok
[1333709072.261653] [001.2] [pid=24077] [tid=1] ido2db_db_init() end
[1333709072.261672] [001.2] [pid=24077] [tid=1] ido2db_db_connect() start
[1333709082.435050] [001.2] [pid=24077] [tid=1] OCIERROR - MSG ORA-01017:
invalid username/password; logon denied
[1333709082.437209] [001.2] [pid=24077] [tid=1] Error: Could not connect to
oracle database
[1333709082.438084] [001.2] [pid=24077] [tid=1]
ido2db_handle_client_connection() idi.dbinfo.connected is '0'
[1333709082.438352] [001.2] [pid=24077] [tid=1] ido2db_db_disconnect() start
[1333709082.438618] [001.2] [pid=24077] [tid=1] ido2db_is_connected()
Warning:Null Connection handle
[1333709082.438884] [001.2] [pid=24077] [tid=1] ido2db_db_disconnect()
already disconnected
[1333709082.439265] [001.2] [pid=24077] [tid=1] ido2db_db_deinit() start
[1333709082.439601] [001.2] [pid=24077] [tid=1]
ido2db_free_cached_object_ids() start
[1333709082.439872] [001.2] [pid=24077] [tid=1]
ido2db_free_cached_object_ids() end
[1333709082.440137] [001.2] [pid=24077] [tid=1] ido2db_db_deinit() end
[1333709082.441440] [001.2] [pid=24077] [tid=1] ido2db_db_deinit() start
[1333709082.441784] [001.2] [pid=24077] [tid=1]
ido2db_free_cached_object_ids() start
[1333709082.442046] [001.2] [pid=24077] [tid=1]
ido2db_free_cached_object_ids() end
[1333709082.442309] [001.2] [pid=24077] [tid=1] ido2db_db_deinit() end
[1333709082.442570] [001.2] [pid=24077] [tid=1] ido2db_free_input_memory()
start
[1333709082.442833] [001.2] [pid=24077] [tid=1] ido2db_free_input_memory()
end
[1333709082.443094] [001.2] [pid=24077] [tid=1]
ido2db_free_connection_memory() start
[1333709082.443526] [001.2] [pid=24077] [tid=1]
ido2db_free_connection_memory() end

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2012-04-20 12:47:24 +00:00

is this still the case?

@icinga-migration
Copy link
Author

Updated by Tommi on 2012-04-26 19:04:07 +00:00

  • Status changed from New to Closed
  • Assigned to set to Tommi

The fixes for #2503 and #2271 also solved this issue for me

@icinga-migration
Copy link
Author

Updated by Tommi on 2012-04-27 12:01:07 +00:00

  • Status changed from Closed to Assigned
  • Assigned to changed from Tommi to mfriedrich

last entry was too early closed. Today it happens again, after i changed the dabase credentials to an invalid entry. If the database is not reachable, the reconnect tries ever and ever again to connect leaving each time the process open After 5min i have >10Zombies

>ptree 1730
1730  ../bin/ido2db -c ido2db.cfg
  2183  
  2870  
  3712  
  2040  
  3315  
  2326  
  2211  
  4037  
  3882  
  2271  
  3924  
  2243  
  2294  
  2002  
  3984  
  3745  
  2697  

[1335527600.398495] [001.2] [pid=4152] [tid=1] ido2db_handle_client_connection()
 start
[1335527600.399592] [001.2] [pid=4152] [tid=1] ido2db_open_debug_log() end
[1335527600.400558] [001.2] [pid=4152] [tid=1] ido2db_idi_init() start
[1335527600.400584] [001.2] [pid=4152] [tid=1] ido2db_idi_init() end
[1335527600.400603] [001.2] [pid=4152] [tid=1] ido2db_db_init() start
[1335527600.409211] [001.2] [pid=4152] [tid=1] OCI_Initialize() ok
[1335527600.409245] [001.2] [pid=4152] [tid=1] ido2db_db_init() end
[1335527600.409264] [001.2] [pid=4152] [tid=1] ido2db_db_connect() start
[1335527600.852262] [001.2] [pid=4152] [tid=1] OCIERROR - MSG ORA-12154: TNS:could not resolve the connect identifier specified

[1335527600.852590] [001.2] [pid=4152] [tid=1] Error: Could not connect to oracle database
[1335527600.852673] [001.2] [pid=4152] [tid=1] ido2db_handle_client_connection()
 idi.dbinfo.connected is '0'
[1335527600.852694] [001.2] [pid=4152] [tid=1] ido2db_db_disconnect() start
[1335527600.852713] [001.2] [pid=4152] [tid=1] ido2db_is_connected() Warning:Null Connection handle
[1335527600.852730] [001.2] [pid=4152] [tid=1] ido2db_db_disconnect() already disconnected
[1335527600.852966] [001.2] [pid=4152] [tid=1] ido2db_db_deinit() start
[1335527600.853033] [001.2] [pid=4152] [tid=1] ido2db_free_cached_object_ids() start
[1335527600.853052] [001.2] [pid=4152] [tid=1] ido2db_free_cached_object_ids() end
[1335527600.853069] [001.2] [pid=4152] [tid=1] ido2db_db_deinit() end
[1335527600.853525] [001.2] [pid=4152] [tid=1] ido2db_db_deinit() start
[1335527600.853579] [001.2] [pid=4152] [tid=1] ido2db_free_cached_object_ids() start
[1335527600.853600] [001.2] [pid=4152] [tid=1] ido2db_free_cached_object_ids() end
[1335527600.853617] [001.2] [pid=4152] [tid=1] ido2db_db_deinit() end
[1335527600.853634] [001.2] [pid=4152] [tid=1] ido2db_free_input_memory() start
[1335527600.853652] [001.2] [pid=4152] [tid=1] ido2db_free_input_memory() end
[1335527600.853669] [001.2] [pid=4152] [tid=1] ido2db_free_connection_memory() start
[1335527600.853686] [001.2] [pid=4152] [tid=1] ido2db_free_connection_memory() end

Looks like the reaper is not in place here. But i have no glue, how to implement

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2012-04-27 12:09:31 +00:00

the problem is rather easy, when you got 5 things to do the same time.

the functionality check on an established connection within handle_client_connection() returns IDO_ERROR on connection failure. now guess which if condition handles that return value.

~/coding/icinga/icinga-core/module/idoutils/src $ grep -r ido2db_handle_client_connection *
Binary file ido2db matches
ido2db.c:               ido2db_handle_client_connection(0);
ido2db.c:                               ido2db_handle_client_connection(new_sd);
ido2db.c:                       ido2db_handle_client_connection(new_sd);
ido2db.c:int ido2db_handle_client_connection(int sd) {
ido2db.c:       ido2db_log_debug_info(IDO2DB_DEBUGL_PROCESSINFO, 2, "ido2db_handle_client_connection() start\n");
ido2db.c:               ido2db_log_debug_info(IDO2DB_DEBUGL_PROCESSINFO, 2, "ido2db_handle_client_connection() pthread_attr_setstacksize with %lu error\n", IDO2DB_DEFAULT_THREAD_STACK_SIZE);
ido2db.c:                       ido2db_log_debug_info(IDO2DB_DEBUGL_PROCESSINFO, 2, "ido2db_handle_client_connection() idi.dbinfo.connected is '%d'\n", idi.dbinfo.connected);
ido2db.c:       ido2db_log_debug_info(IDO2DB_DEBUGL_PROCESSINFO, 2, "ido2db_handle_client_connection() end\n");

right, none.

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2012-04-27 14:38:53 +00:00

  • Assigned to changed from mfriedrich to Tommi

well thing is ... i cannot reproduce that over here. so my guess is the following.

ido2db_wait_for_connections() is polling the socket for new connections.

when the child is forked, it calls ido2db_handle_client_connection() to process the gotten data and to first try the initial handshake with the db connection.

db connection fails, memory is cleaned up, ido2db_handle_client_connection() returns (being the child)

socket is closed when that function returns, but not cleaned up - meaning the socket to shutdown.

i see the immediate return after bailing out and socket close as possible error for the child - still now having shutdown the socket.

                        case 0:
                                /* child processes data... */
                                ido2db_handle_client_connection(new_sd);

                                /* close socket when we're done */
                                close(new_sd);
                                return IDO_OK;
                                break;

you could try to remove the line with "return IDO_OK" because then the code will run further into the ido2db_cleanup_socket() call itsself, quitting cleanly and possibly not leaving the childs in wait state on the socket.

but i doubt that this is really solving the problem, as the child should not really cleanup the socket, but leave that to the parent.

for some strange reason your forked childs do not cleanly exit, still holding ressources. can you provide some truss output plus possible which child is doing what output? especially when run through a debugger, follow the child forks, set a breakpoint to the failed db connection and run it step by step, controlling the value of watched variables.

@icinga-migration
Copy link
Author

Updated by Tommi on 2012-05-02 18:55:13 +00:00

well, this looks like its nessesary to raise a bigger debug session. But its to important to skip this. I must see,when i can do this and how to debug threads with dbx. But it will never fit for the 1.7 release.

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2012-05-14 17:38:37 +00:00

  • Assigned to changed from Tommi to mfriedrich
  • Target Version set to 1.7

i ran into it today myself, when final testing the pgsql. once i had my postgresql running, but tried to kill the schema, stopped ido2db, re-created the schema and then started postgresql, it was like 'fire-and-forget'.

the sole reason being it was the funny "check if db is connected and if not reconnect" feature which was suggested to be added a while back.

this feature has been implemented with the "ido2db_db_hello" function in mind, in order to successfully detect an established and working database. problem at this stage - ido2db_db_hello was designed to be called only once (or twice, to be exact): on the first handshake after the idomod api, plus on reconnect during ido2db_db_query, but not within its own function namely ido2db_db_reconnect called quite often before issueing a query even.

the main problem is and was the implementation of ido2db_db_hello which just wiped all the threads, sockets, memory and exited the child itsself - triggering the parents sighandler. at this stage of the call flow, this was incorrect, as it did not return fully to allow proper socket close, which actually means a client disconnect for idomod.

so in order to fix this, return values must be checked for sanity as well as final exit must happen after handling a client connection, with sane return values, and not hard calls to _exit(0);

  • ido2db_handle_client_connection
    • ido2db_db_connect
  • ido2db_check_for_client_input
  • ido2db_handle_client_input
  • ido2db_db_hello
  • !!! do not EXIT here !!!
    • check_db_version
      • return IDO_ERROR in case version check fails
    • check instance_name in db
      • return IDO_ERROR if instance query fails (but not if not found, then an insert will happen later on)
    • ido2db_db_query
    • ido2db_handle_db_error
      • set idi.disconnect_client = IDO_TRUE to trigger that in ido2db_handle_client_connection

or on a different location, when hitting queries.

  • ido2db_db_query / oracle prepared statement connection checks
    • check if db connection is alive with db_reconnect, and return the error in case
  • ido2db_db_reconnect
    • if reconnect was successful, send db_hello, and return its return value
  • ido2db_db_hello
  • !!! do not EXIT here !!!
    • check_db_version
      • return IDO_ERROR in case version check fails
    • check instance_name in db
      • return IDO_ERROR if instance query fails (but not if not found, then an insert will happen later on)

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2012-05-14 17:44:02 +00:00

  • Category set to 25

a different, not so clean fix would be an enforced socket close, disconnecting idomod hard. but as this is not the intended way to work, wiping everything from within db_hello (which is not authorized to do so), the bigger diff will (hopefully) clear the issue.

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2012-05-14 18:10:19 +00:00

  • Priority changed from Normal to Urgent

steps to reproduce

  • stop ido2db
  • stop postgresql
  • start ido2db
  • start postgresql
  • try to tail your syslog and ps aux | grep ido2db | wc -l
# ps aux | grep ido2db | wc -l
27
  • try to stop ido2db
# service ido2db stop
Stopping ido2db: ido2db not running. PID file /var/icinga/ido2db.lock not found

# killall -9 ido2db

i have seen this on oracle plus rhel as well, when the oracle connection became unstable.

with applied fix

May 14 20:08:58 icinga-dev icinga: idomod: Successfully reconnected to data sink!  2362 items lost, 65535 queued items to flush.
May 14 20:08:58 icinga-dev ido2db: Client connected, data available.
May 14 20:08:58 icinga-dev ido2db: Handling client connection...
May 14 20:08:58 icinga-dev ido2db: Error: Could not connect to pgsql database: could not connect to server: Connection refused#012#011Is the server running on host "localhost" (127.0.0.1) and accepting#012#011TCP/IP connections on port 5432?
May 14 20:08:58 icinga-dev ido2db: Error: database connection failed, forced client disconnect...
May 14 20:08:58 icinga-dev icinga: idomod: Error writing to data sink!  Some output may get lost.  65451 queued items to flush.
May 14 20:08:58 icinga-dev ido2db: Successfully shutdown... (PID=6784)

@icinga-migration
Copy link
Author

Updated by crfriend on 2012-05-14 20:13:49 +00:00

I have also seen this, although it might be more of a corner-case, with MySQL and can reproduce it trivially by using the "Classic" UI to restart the Icinga process. The Icinga process stops, disconnects from the database, restarts, connects again to the database, performs its updates, and enters the event-loop -- leaving behind a "" ("zombie") ido2db process that's owned by the parent.

Investigation of the matter using Solaris' "preap" command, in the case above, yields a "process exited with status 0" (or some very similar verbiage). This indicates, to me, that the parent ido2db process is not looking for the signal that the child has exited, or is ignoring the exit status once it receives the signal.

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2012-05-14 20:16:17 +00:00

i'm pretty sure that my attempt fixes this - tested it ok. so please be so kind to fetch the 'next' tree and compare it to current master.

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2012-05-14 20:16:49 +00:00

  • Subject changed from ido2db on solaris forks endless if database is not available to ido2db forks endless if database is not available, race condition on unclosed socket

@icinga-migration
Copy link
Author

Updated by ricardo on 2012-05-14 20:36:56 +00:00

tried, but couldn't reproduce this behavior with mysql.

@icinga-migration
Copy link
Author

Updated by crfriend on 2012-05-14 23:20:20 +00:00

ricardo wrote:

tried, but couldn't reproduce this behavior with mysql.

I may have a different branch than you, but I can use the tactic to get "" processes at will.

I've been playing with this problem a little bit, and I notice that the "grandparent" sets up the signal-handlers with a call to signal(). I've got a few beers on board at the moment, but from the Solaris doco I have in front of me it seems that the sig-handler reverts to the default behaviour after the signal fires. I changed the local code to use sigset(), but that didn't seem to have the desired effect.

In looking at the startup in the verbose debug log, I notice that the sig-handlers are getting defined very early -- in the parent, and it's the grandchild that's the ultimate arbiter of things once they really get going. At this point, I'm wondering if holding off on the calls to signal() until the grandchild is spun up, or re-issuing them in the grandchild might be the way to go.

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2012-05-15 06:20:35 +00:00

hello? can you please just pull next and test the fix? i already know where the problem is - but without proper test feedback this can't reach 1.7!

@icinga-migration
Copy link
Author

Updated by crfriend on 2012-05-15 10:50:11 +00:00

I'm still fighting with git and cannot for the life of me figure out how to switch to the "next" branch. More to the point, I can get it so git says I'm on "next" but the ido2db.c file is identical to the one from 1.7.0beta1.

Is there a concise "cheat-sheet" on how to get to that point quickly, or, if the proposed patch isn't too complex, just attach it here so I can integrate it into my system for testing?

Cheers!

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2012-05-15 10:57:43 +00:00

https://wiki.icinga.org/display/Dev/Riding+the+Git

@icinga-migration
Copy link
Author

Updated by crfriend on 2012-05-15 12:29:51 +00:00

dnsmichi wrote:

https://wiki.icinga.org/display/Dev/Riding+the+Git

OK, I deserved that.

It looks like the defunct process problem exists. I built the new code, stopped my test Icinga instance, installed the new code, and then started it. Here's the process tree after the start:

t1a:src >. ps -ef | grep icinga
  icinga 13073  9074   0 08:12:34 pts/2       0:00 -bash
  icinga  9074  9055   0 07:40:13 pts/2       0:00 -bash
  icinga  9055  9054   0 07:40:05 ?           0:03 /usr/lib/ssh/sshd
  icinga 13005 14284   1 08:10:53 ?           0:00 /usr/local/icinga/bin/icinga -d /usr/local/icinga/etc/icinga.cfg
  icinga 13072  9074   0 08:12:34 pts/2       0:00 ps -ef
  icinga 13004 12993   1 08:10:53 ?           0:01 /usr/local/icinga/bin/ido2db -c /usr/local/icinga/etc/ido2db.cfg
  icinga 12993 14284   0 08:10:43 ?           0:00 /usr/local/icinga/bin/ido2db -c /usr/local/icinga/etc/ido2db.cfg
t1a:src >.

Then, using the classic web interface, I restarted it. Here's the process tree a few minutes after the restart:

t1a:src >. ps -ef | grep icinga
  icinga 13265  9074   0 08:14:54 pts/2       0:00 grep icinga
  icinga  9074  9055   0 07:40:13 pts/2       0:00 -bash
  icinga  9055  9054   0 07:40:05 ?           0:03 /usr/lib/ssh/sshd
  icinga 13005 14284   1 08:10:53 ?           0:01 /usr/local/icinga/bin/icinga -d /usr/local/icinga/etc/icinga.cfg
  icinga 13264  9074   0 08:14:54 pts/2       0:00 ps -ef
  icinga 13236 12993   2 08:14:14 ?           0:01 /usr/local/icinga/bin/ido2db -c /usr/local/icinga/etc/ido2db.cfg
  icinga 13240 12993   0        - ?           0:00 
  icinga 12993 14284   0 08:10:43 ?           0:00 /usr/local/icinga/bin/ido2db -c /usr/local/icinga/etc/ido2db.cfg
t1a:src >. preap 13240
13240: exited with status 0
t1a:src >. ps -ef | grep icinga
  icinga  9074  9055   0 07:40:13 pts/2       0:00 -bash
  icinga  9055  9054   0 07:40:05 ?           0:03 /usr/lib/ssh/sshd
  icinga 13005 14284   1 08:10:53 ?           0:03 /usr/local/icinga/bin/icinga -d /usr/local/icinga/etc/icinga.cfg
  icinga 13236 12993   1 08:14:14 ?           0:02 /usr/local/icinga/bin/ido2db -c /usr/local/icinga/etc/ido2db.cfg
  icinga 13615  9074   0 08:20:00 pts/2       0:00 grep icinga
  icinga 13614  9074   0 08:20:00 pts/2       0:00 ps -ef
  icinga 12993 14284   0 08:10:43 ?           0:00 /usr/local/icinga/bin/ido2db -c /usr/local/icinga/etc/ido2db.cfg
t1a:src >.

Note that I was able to reap the zombie with the "preap" command.

My gut feel is that this is not a gating issue for the release of 1.7.0; it needs to be docuemnted, however, for the platforms that are affected.

I'll hammer on it a bit more when I get home (being at work right now).

@icinga-migration
Copy link
Author

Updated by jmosshammer on 2012-05-15 12:33:04 +00:00

Just tested, no problems here (Ubuntu 10.04LTS, postgresql 8.4)

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2012-05-15 13:05:23 +00:00

  • Status changed from Assigned to Resolved
  • Done % changed from 0 to 100

@carl
i thought this was an ido2db when database gone away problem? why do you mix that with icinga restart for now?

@Jannis
thanks for testing!

Gunnar reported ok either, so for me it's working.

@icinga-migration
Copy link
Author

Updated by crfriend on 2012-05-15 13:31:12 +00:00

dnsmichi wrote:

@carl
i thought this was an ido2db when database gone away problem? why do you mix that with icinga restart for now?

It's a matter of the ido2db child exiting, and I can perturb the problem by restarting the Icinga instance via the web interface (by way of the IDO module normally disconnecting from the ido2db child). That the thing happens on database-failures is only because the children exit and aren't reaped; any child exit will cause the symptom. It's broader than just a database connection issue.

@icinga-migration
Copy link
Author

Updated by Tommi on 2012-05-18 15:08:46 +00:00

  • Status changed from Resolved to Assigned

for me its not solved on solaris. based on the v1.7.0 release tarball the same symtoms are there again

13285 ido2db -c 
  14941 
  14100 
  14062 
  14210 
  13830 
  14332 
  13620 
  13950 
  13347 
  14992 
  14011 
  13745 
  15347 
  13886 
  15118 
  15545 
  14284 
...
[1337353550.425871] [001.2] [pid=16210] [tid=1] ido2db_handle_client_connection(
) start
[1337353550.426900] [001.2] [pid=16210] [tid=1] ido2db_open_debug_log() end
[1337353550.427861] [001.2] [pid=16210] [tid=1] ido2db_idi_init() start
[1337353550.427886] [001.2] [pid=16210] [tid=1] ido2db_idi_init() end
[1337353550.427905] [001.2] [pid=16210] [tid=1] ido2db_db_init() start
[1337353550.438367] [001.2] [pid=16210] [tid=1] OCI_Initialize() ok
[1337353550.438616] [001.2] [pid=16210] [tid=1] ido2db_db_init() end
[1337353550.438638] [001.2] [pid=16210] [tid=1] ido2db_db_connect() start
[1337353550.922076] [001.2] [pid=16210] [tid=1] OCIERROR - MSG ORA-12154: TNS:co
uld not resolve the connect identifier specified

[1337353550.922478] [001.2] [pid=16210] [tid=1] Error: Could not connect to orac
le database
[1337353550.922575] [001.2] [pid=16210] [tid=1] ido2db_handle_client_connection(
) idi.dbinfo.connected is '0'
[1337353550.922597] [001.2] [pid=16210] [tid=1] ido2db_db_disconnect() start
[1337353550.922616] [001.2] [pid=16210] [tid=1] ido2db_is_connected() Warning:Null Connection handle
[1337353550.922635] [001.2] [pid=16210] [tid=1] ido2db_db_disconnect() already disconnected
[1337353550.922903] [001.2] [pid=16210] [tid=1] ido2db_db_deinit() start
[1337353550.922987] [001.2] [pid=16210] [tid=1] ido2db_free_cached_object_ids() start
[1337353550.923020] [001.2] [pid=16210] [tid=1] ido2db_free_cached_object_ids() end
[1337353550.923038] [001.2] [pid=16210] [tid=1] ido2db_db_deinit() end
[1337353550.923492] [001.2] [pid=16210] [tid=1] ido2db_db_deinit() start

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2012-05-18 15:10:53 +00:00

  • Assigned to changed from mfriedrich to Tommi
  • Priority changed from Urgent to Normal
  • Target Version deleted 1.7
  • Done % changed from 100 to 80

then it's up to you solving it further. my issues are resolved with the given patch.

@icinga-migration
Copy link
Author

Updated by Tommi on 2012-05-18 18:33:23 +00:00

Thx :(

@icinga-migration
Copy link
Author

Updated by crfriend on 2012-05-18 23:10:27 +00:00

Tommi wrote:

Thx :(

Oh, look at it as a challenge. ;-)

I am seeing the same problem on Solaris as you, and I'm exerting as much mental force on the matter as I can in a roomful of chickens (it's a very long story). I am also seeing it in vastly more general terms than database-disconnects, so it's likely an "easy" fix. Unfortunately, if that's the case, it's also a whopping big portability issue.

Careful reads of the signal() routines in Solaris and Linux reveal subtle differences, and I believe the devil may lie in those details -- specifically the ones dealing with when the sig-handlers are enabled and when they're not. I've tried thrice this evening to get decent truss traces, but seem to be fighting with both rampant distractions (see above) and smf gleefully restarting things even though I have Icinga explicitly disabled in smf. (Next step is an svccfg delete operation.)

There's also the strange-looking code in ido2db_parent_sighandler() that, at the moment, doesn't make a whole lot of sense to me (although I admit that I'm more used to dealing with interrupts in hardware).

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2012-08-20 12:41:00 +00:00

  • Status changed from Assigned to Closed
  • Icinga Version set to 1
  • (unknown custom field) set to 1
  • OS Version set to Solaris
  • (unknown custom field) set to Oracle
  • (unknown custom field) set to 1

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2014-12-08 14:37:37 +00:00

  • Project changed from 18 to Core, Classic UI, IDOUtils
  • Category changed from 25 to IDOUtils
  • Icinga Version changed from 1 to 1
  • OS Version set to any

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

1 participant