[dev.icinga.com #9406] Selective cluster reconnecting breaks client communication #3061

Closed
icinga-migration opened this issue Jun 11, 2015 · 10 comments
Labels
area/distributed Distributed monitoring (master, satellites, clients) bug Something isn't working

@icinga-migration

This issue has been migrated from Redmine: https://dev.icinga.com/issues/9406

Created by mfrosch on 2015-06-11 20:53:56 +00:00

Assignee: mfrosch
Status: Resolved (closed on 2015-06-15 12:50:03 +00:00)
Target Version: 2.3.7
Last Update: 2015-07-10 08:35:20 +00:00 (in Redmine)

Icinga Version: 2.3.4
Backport?: Not yet backported
Include in Changelog: 1

I took some time to analyze a reconnect problem I'm experiencing.

The setup is as follows:

  • 2 masters
  • multiple agents

Every "remote" check is distributed via command_endpoint to locally scheduled checks on the agents.

Now, when I boot everything up:

  • Start both masters, let's say icingaA and icingaB
  • Start the agents, icingaX

Everything is running fine, but when I reload one of the agents, ALL command_endpoint checks that icingaB tries to run on icingaX fail.

The reason reported is "is not connected to".

By investigating the code I found out that the selective reconnecting of Icinga 2 connections is causing us trouble!

[master zone problem] icingaB thinks icingaA is the master (because of the internal master determination), so it believes it does not have to reconnect to any other endpoint.

[agent problem] The agent, on the other hand, thinks: I'm already connected to icingaA and that's fine, so I don't have to connect to icingaB (Not connecting to Zone 'master' because we're already connected to it.)

That's where this reconnect problem comes from.
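
To make the two cases concrete, here is a minimal toy model of the decision described above. It is not Icinga 2 code: `Zone`, `EndpointsToConnect`, `selfIsMaster` and `selective` are hypothetical names invented for this sketch, and the real logic lives in `ApiListener::ApiTimerHandler`. With `selective` set, the function applies both gates from the issue ([master zone problem] and [agent problem]); without it, every endpoint tries every endpoint of the other zone that it is not yet connected to.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Toy model (assumption): a zone is just a name plus its endpoint names.
struct Zone {
    std::string name;
    std::vector<std::string> endpoints;
};

// Returns the endpoints of `target` that `self` would try to connect to
// during one timer tick. `connected` holds the endpoints `self` already
// has a connection to. `selective` switches the two gates on or off.
std::vector<std::string> EndpointsToConnect(const std::string& self,
                                            bool selfIsMaster,
                                            const Zone& target,
                                            const std::set<std::string>& connected,
                                            bool selective)
{
    std::vector<std::string> result;

    if (selective) {
        // [master zone problem] only the endpoint that considers itself
        // the zone master attempts any connection at all.
        if (!selfIsMaster)
            return result;

        // [agent problem] one existing connection into the zone is enough
        // to skip every other endpoint of that zone.
        for (const std::string& ep : target.endpoints) {
            if (connected.count(ep) > 0)
                return result;
        }
    }

    for (const std::string& ep : target.endpoints) {
        if (ep == self)
            continue; // don't connect to ourselves
        if (connected.count(ep) > 0)
            continue; // this link already exists
        result.push_back(ep);
    }

    return result;
}

// Usage sketch for the setup in this issue, after icingaX was reloaded
// and has reconnected to icingaA only.
void Demo()
{
    const Zone master{"master", {"icingaA", "icingaB"}};
    const Zone agent{"icingaX", {"icingaX"}};

    // Agent side: already connected to icingaA, so icingaB is never tried.
    assert(EndpointsToConnect("icingaX", true, master, {"icingaA"}, true).empty());
    // Master side: icingaB is not the zone master, so it never tries icingaX.
    assert(EndpointsToConnect("icingaB", false, agent, {"icingaA"}, true).empty());

    // Without the selective behavior both sides attempt the missing link.
    assert(EndpointsToConnect("icingaX", true, master, {"icingaA"}, false)
           == std::vector<std::string>{"icingaB"});
    assert(EndpointsToConnect("icingaB", false, agent, {"icingaA"}, false)
           == std::vector<std::string>{"icingaX"});
}
```

In this model neither icingaB nor icingaX ever initiates the missing connection while the selective gates are active, which matches the "is not connected to" failures on icingaB.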

I'll fiddle around with the code and push a branch for clarification.

We should discuss this in detail on Monday!

Changesets

2015-06-11 21:02:13 +00:00 by mfrosch 7ce9de0

FIDDLE: Selective reconnecting breaks command_endpoint needs

DO NOT MERGE THIS!

This commit diff shows the code parts mentioned in issue #9406

refs #9406

2015-06-11 21:06:16 +00:00 by mfrosch dd1b5c0

FIDDLE: Selective reconnecting breaks command_endpoint needs

DO NOT MERGE THIS!

This commit diff shows the code parts mentioned in issue #9406

refs #9406

2015-06-15 08:20:21 +00:00 by mfrosch ac0db02

WIP: Remove selective reconnecting behavior

We want to remove the partial reconnecting behavior, so that all endpoints of
a zone try to connect to a lower or higher zone in hierarchy.

refs #9406

2015-06-15 12:47:04 +00:00 by mfrosch cfbe82d

Remove selective reconnecting behavior

We want to remove the partial reconnecting behavior, so that all endpoints of
a zone try to connect to a lower or higher zone in hierarchy.

fixes #9406

Signed-off-by: Michael Friedrich <michael.friedrich@netways.de>

2015-07-10 08:32:28 +00:00 by mfrosch 97f4875

Remove selective reconnecting behavior

We want to remove the partial reconnecting behavior, so that all endpoints of
a zone try to connect to a lower or higher zone in hierarchy.

fixes #9406

Signed-off-by: Michael Friedrich <michael.friedrich@netways.de>

Relations:

@icinga-migration

Updated by mfrosch on 2015-06-11 20:54:29 +00:00

  • Description updated

@icinga-migration

Updated by mfrosch on 2015-06-11 21:08:35 +00:00

Please check branch support/selective-reconnect-9406 https://git.icinga.org/?p=icinga2.git;a=commit;hb=support/selective-reconnect-9406;js=1

For documentation:

commit dd1b5c051596df0988c3faee55626e838f6e8794
Author: Markus Frosch 
Date:   Thu Jun 11 23:02:13 2015 +0200

    FIDDLE: Selective reconnecting breaks command_endpoint needs

    DO NOT MERGE THIS!

    This commit diff shows the code parts mentioned in issue #9406

    refs #9406

diff --git a/lib/remote/apilistener.cpp b/lib/remote/apilistener.cpp
index de96f97..c05a739 100644
--- a/lib/remote/apilistener.cpp
+++ b/lib/remote/apilistener.cpp
@@ -368,7 +368,9 @@ void ApiListener::ApiTimerHandler(void)
        }
    }

+   /* [master problem]
    if (IsMaster()) {
+   */
        Zone::Ptr my_zone = Zone::GetLocalZone();

        BOOST_FOREACH(const Zone::Ptr& zone, DynamicType::GetObjectsByType<Zone>()) {
@@ -379,6 +381,7 @@ void ApiListener::ApiTimerHandler(void)
                continue;
            }

+           /* [slave problem]
            bool connected = false;

            BOOST_FOREACH(const Endpoint::Ptr& endpoint, zone->GetEndpoints()) {
@@ -388,12 +391,13 @@ void ApiListener::ApiTimerHandler(void)
                }
            }

-           /* don't connect to an endpoint if we already have a connection to the zone */
+           / * don't connect to an endpoint if we already have a connection to the zone * /
            if (connected) {
                Log(LogDebug, "ApiListener")
                    << "Not connecting to Zone '" << zone->GetName() << "' because we're already connected to it.";
                continue;
            }
+           [slave problem] */

            BOOST_FOREACH(const Endpoint::Ptr& endpoint, zone->GetEndpoints()) {
                /* don't connect to ourselves */
@@ -421,7 +425,9 @@ void ApiListener::ApiTimerHandler(void)
                thread.detach();
            }
        }
+   /* [master problem]
    }
+   */

    BOOST_FOREACH(const Endpoint::Ptr& endpoint, DynamicType::GetObjectsByType<Endpoint>()) {
        if (!endpoint->IsConnected())

@icinga-migration

Updated by mfriedrich on 2015-06-12 16:35:33 +00:00

  • Category set to Cluster
  • Priority changed from High to Normal

That diff isn't really readable without additional comments; better to explain it with a picture on Monday. Wrong git branch, by the way.

@icinga-migration

Updated by mfrosch on 2015-06-15 12:50:03 +00:00

  • Status changed from New to Resolved
  • Done % changed from 0 to 100

Applied in changeset cfbe82d.

@icinga-migration

Updated by mfriedrich on 2015-06-15 13:11:22 +00:00

  • Assigned to set to mfrosch
  • Target Version set to 2.3.5
  • Estimated Hours set to 2

@icinga-migration

Updated by mfriedrich on 2015-06-17 08:06:04 +00:00

  • Subject changed from Selective reconnecting breaks command_endpoint needs to Selective cluster reconnecting breaks client communication

@icinga-migration

Updated by mfriedrich on 2015-06-18 08:54:45 +00:00

  • Relates set to 8712

@icinga-migration

Updated by mfriedrich on 2015-06-18 08:57:48 +00:00

  • Relates set to 8920

@icinga-migration

Updated by mfriedrich on 2015-07-02 18:25:20 +00:00

  • Relates set to 9528

@icinga-migration

Updated by mfriedrich on 2015-07-10 08:35:21 +00:00

  • Target Version changed from 2.3.5 to 2.3.7

I forgot to cherry-pick that into support/2.3, mea culpa. It also seems that 2.3.5 was not tested after release to confirm that it resolved the issue.

@icinga-migration icinga-migration added bug Something isn't working area/distributed Distributed monitoring (master, satellites, clients) labels Jan 17, 2017
@icinga-migration icinga-migration added this to the 2.3.7 milestone Jan 17, 2017