[dev.icinga.com #10819] Cluster config sync ignores zones.d from API packages #3766
Comments
Updated by mfriedrich on 2015-12-10 15:20:34 +00:00

Updated by mfriedrich on 2015-12-10 15:24:37 +00:00 The problem originates from IsConfigMaster().

Updated by mfriedrich on 2015-12-10 15:25:09 +00:00

Updated by mfriedrich on 2015-12-10 21:23:59 +00:00
Fixed

Updated by mfriedrich on 2015-12-10 21:25:00 +00:00
Applied in changeset c5b13ff.
Updated by tgelf on 2015-12-10 21:34:18 +00:00 Cool, thanks a lot! The patch mentions zone dirs, but I didn't completely understand the meaning of IsConfigMaster in this context. Is there anything one has to pay attention to when using this, like avoiding the creation of specific directories in /etc? How does ApiListener::ConfigUpdateHandler behave in case former stages of the same package have been deployed to another master? Simple use case: "config tool deploys config to the first available master". This must work regardless of whether it was the one that got the last deployment or not. Cheers,

Updated by tgelf on 2015-12-10 21:34:35 +00:00
Updated by mfriedrich on 2015-12-10 21:43:23 +00:00
Please test that fix and do not re-open the issue. I've tested it as working with the recent changes, and you should do the very same. IsConfigMaster selects the authoritative node holding this zone's configuration. Previously only /etc/icinga2/zones.d was populated (cluster v3, two years ago); nowadays we have a zone dirs registry where multiple sources may register their zone directories. That happens automatically; it is nothing the user would see or have to care about. The fix takes care of checking that accordingly.
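To make that registry idea concrete, here is a minimal standalone C++ sketch — not the actual Icinga 2 code. Several local config sources register the zones they provide, and an IsConfigMaster-style lookup then decides whether this node is authoritative for a zone. ZoneDirRegistry and its methods are simplified stand-ins for the real zone dirs registry.

```cpp
#include <iostream>
#include <set>
#include <string>

class ZoneDirRegistry {
public:
    // Called by every local config source that ships files for a zone,
    // e.g. /etc/icinga2/zones.d or an API config package's zones.d.
    static void RegisterZoneDir(const std::string& zoneName) {
        GetRegistry().insert(zoneName);
    }

    // True if any local source registered config for this zone,
    // i.e. this node is the "config master" for it.
    static bool IsConfigMaster(const std::string& zoneName) {
        return GetRegistry().count(zoneName) > 0;
    }

private:
    static std::set<std::string>& GetRegistry() {
        static std::set<std::string> registry;
        return registry;
    }
};

int main() {
    ZoneDirRegistry::RegisterZoneDir("master");          // from /etc/icinga2/zones.d/master
    ZoneDirRegistry::RegisterZoneDir("director-global"); // from an API package's zones.d

    std::cout << std::boolalpha
              << ZoneDirRegistry::IsConfigMaster("master") << "\n"  // true
              << ZoneDirRegistry::IsConfigMaster("agent1") << "\n"; // false
}
```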
Updated by mfriedrich on 2015-12-10 21:43:38 +00:00
Updated by tgelf on 2015-12-10 22:24:48 +00:00 dnsmichi wrote:
I didn't "re-open" this, just politely asked for feedback. Be sure that we will intensively test this, but first I need to have just a very few more details about WHAT behaviour I'm going to test.
Ok, so please allow me to dig deeper. Our history of handling polluted zone dirs (especially in /var/lib/icinga2/api) hasn't been free of issues, so one of the intentions of my question above was to find out how a simple two-node cluster behaves when an API user continues to deploy the very same zone to the very same package to both nodes, one at a time. Be assured that we'll test it as it makes part of normal operation. Deployments will usually point to the very same node, but may be directed to the second one in case the other one is down or not reachable. Before testing this I'd like to figure out whether the current design foresees this. To me it seems that ApiListener::SyncZoneDirs is fine, but ApiListener::ConfigUpdateHandler looks suspicious. However, I'm neither a C** nor an Icinga2 core developer, so I'm just guessing - and that's why I'm asking for feedback. Please understand that to me this is an essential part of the whole replication system. Imagine a scenarios where node A got the latest deployment, cluster is in sync. For whatever reason A is not reachable by the config tool, so the next deployment goes to node B. B reloads, A reconnects and... what happens next? I would expect A to receive the newer zone from B. But is this really what has been implemented? A short yes (or no) regarding this question would be great and allow me to test whether everything works as designed.
That's where my questions start, as the expected behaviour isn't documented at all and didn't work until now I guess. But I mentioned that before. The only additional question that arises is how Icinga2 figures out which node owns a "newer" zone in case it consists of data from multiple sources, packages whatever. That's not so important for me, but it would help to better understand what's going on in case I encounter farther problems in my tests. And what means "holding" a zone configuration? Does this only refer to etc-zones.d (fine) or also api-zones.d as your comment suggests (error-prone, IMO).
If "that scenario" refers to having a local version of a zone in api-zones.d, the fact that ways too many people learned how to clean up that directory tells me that such scenarios haven't been as rare as you might think ;) Please forgive me if I set this issue to "feedback" once again. A (very short) explanation of the expected behaviour in addition to that single-line fix should IMO make part of fixing this issue. In my believes this component should be well understood, as it broke our legs too many times in the past. I'm not asking for extensive details. The most essential question I'd like to see answered is "Should a real-world work-flow as mentioned above work as expected?". If the answer is YES, I have no farther questions and will happily test your patch. Otherwise (or in any case, just to satisfy my curiosity) a few more details about where I should be prepared to expect eventual issues would be very helpful. Thanks a lot for your time, |
Updated by tgelf on 2015-12-10 22:25:08 +00:00
Updated by mfriedrich on 2015-12-11 08:59:35 +00:00
Please join us on Monday and we'll talk about that in real life. Thank you.
Updated by tgelf on 2015-12-11 09:14:27 +00:00
dnsmichi wrote:
What's so hard about writing down just a very few words explaining how this feature was designed to work? I have to reopen this completely, not just for feedback. We tried it out, and it doesn't work. Nothing happens at all if a zone is deployed through the API: [2015-12-11 09:38:53 +0100] notice/ApiListener: Ignoring sync for 'master'. Zone directory '/var/lib/icinga2/api/zones/master' does not exist. If we manually create /etc/icinga2/zones.d/master, the first part of the game works. But that makes this feature pretty useless, so I consider this issue not fixed. In the meantime we'll use this ugly workaround and try to figure out more details related to the scenario mentioned before. I still have no clue why you refuse to explain what the designed/expected behaviour should look like. IMO there is no need for a personal meeting for such a simple question. Regards,
Updated by mfrosch on 2015-12-11 09:27:25 +00:00 Usually config packages should only be deployed to one master, so that the internal replication can take care of it. It might cause weird issues when you deploy to both. Is the sync working when you deploy a config package with "zones.d/master" to the first master? Maybe we should validate and talk about cleaning up stuff in another issue; currently /var/lib/icinga2/api/zones is mainly used for stuff from /etc/icinga2/zones.d - and syncs from there.
Updated by mfrosch on 2015-12-11 09:41:06 +00:00 @tgelf: Did you try the latest snapshot? To correct my last comment: /var/lib/icinga2/api/zones is also used by packages, so that zone-related stuff in packages is synced there.
Updated by mfriedrich on 2015-12-11 10:02:11 +00:00
Please post the output of
Updated by tgelf on 2015-12-11 10:03:11 +00:00 lazyfrosch wrote:
Sure, wouldn't have made much sense otherwise. 2.4.1 - 1.snapshot201512102013.el7.centos
Thank you, we know. The current snapshot fills /var/lib/icinga2/api/zones, but as mentioned before only if the corresponding zones exist in /etc/icinga2/zones.d. Zones deployed in packages but not locally configured on the master are ignored. Icinga 2 forces me to create 1000 endpoint zones for 1000 agents, so manually creating their directories makes this... well, a little bit "less exciting".
Updated by tgelf on 2015-12-11 10:03:46 +00:00 dnsmichi wrote:
v2.4.1-63-g15ca998
Updated by mfriedrich on 2015-12-11 10:09:41 +00:00 tgelf wrote:
That's the version without the proposed fix.
The fixed version looks like this:
Updated by tgelf on 2015-12-11 10:14:03 +00:00 dnsmichi wrote:
Installed RPM:
Latest package on packages.icinga.org:
I'll happily test a newer one.
Updated by tgelf on 2015-12-11 10:24:12 +00:00 lazyfrosch wrote:
That's what we do, we always deploy to the very same master. But of course an outage of that master must not stop me from continuing my deployments to the next master.
That's what I expected and that's why I ask for clarification. It's hard to accept that it seems to be impossible to get an explanation of how such a vital component was designed to behave.
Only under special preconditions. I collected all variants with matching log lines, but will wait for the next snapshot build right now.
That's not what I was told yesterday, but honestly - as a user I should not care. If someone manages to come up with an explanation of how it was designed to work, we could figure out which parts work, which don't and where tweaking the architecture might make sense.
Updated by mfriedrich on 2015-12-11 10:46:29 +00:00 tgelf wrote:
There's an issue on the build server with cmake & ccache (similar to #10823). I've triggered a manual snapshot build which is available at http://packages.icinga.org/epel/7/snapshot/x86_64/
Updated by mfriedrich on 2015-12-11 11:04:05 +00:00
Updated by tgelf on 2015-12-11 11:17:04 +00:00 dnsmichi wrote:
Thank you, I'm running v2.4.1-64-gc5b13ff right now. Master 1 log after the config-dump-related reload:
Same for the second master (the log order seems a little bit strange, so try not to confuse agent/master information):
/var/lib/icinga2/api/packages/director/master1-1449831415-0/zones.d/ on master1:
Please do not try to understand what I have in agent1/endpoints.conf - this is one of the many attempts we made to get this working. Prio 1 for me right now is getting files shipped in a reasonable way. /var/lib/icinga2/zones on master1:
/var/lib/icinga2/zones on master2:
Both masters are members of the master zone; they didn't know anything about director-global (a global zone) and agent1 (a non-global zone). They work on master1, but are not synchronized. As the logs show, they are sent by master1 but not accepted on master2 - exactly as expected yesterday evening when trying to understand ApiListener::ConfigUpdateHandler. I guess it fails here:
Cheers,
Updated by mfriedrich on 2015-12-11 11:36:45 +00:00 There's only one configuration master allowed inside a zone. Your log indicates that the second master has its own local configuration for the "master" zone:
It ignores the update sent from master1. Remove the zone configuration from inside zones.d (be it etc or api config packages) from this node and it will start to work. Having multiple configuration masters inside a zone is not supported.
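Purely as illustration of that rule, a hedged, self-contained C++ sketch of the guard being described: a node holding its own local configuration for a zone discards peer updates for that zone. HandleConfigUpdate() and HasLocalZoneConfig() are invented stand-ins, not the real ApiListener::ConfigUpdateHandler API.

```cpp
#include <iostream>
#include <map>
#include <set>
#include <string>

struct ZoneConfigUpdate {
    std::string zoneName;
    std::map<std::string, std::string> files; // path -> content
};

// Zones this node has local (authoritative) config for.
static std::set<std::string> g_localZones;

static bool HasLocalZoneConfig(const std::string& zoneName) {
    return g_localZones.count(zoneName) > 0;
}

static void HandleConfigUpdate(const ZoneConfigUpdate& update) {
    if (HasLocalZoneConfig(update.zoneName)) {
        // This node is a config master for the zone: drop the peer's copy.
        std::cout << "Ignoring config update for zone '" << update.zoneName
                  << "': we are authoritative for it.\n";
        return;
    }
    std::cout << "Accepting " << update.files.size() << " file(s) for zone '"
              << update.zoneName << "'.\n";
    // ... write the files into the local zone dir and trigger a reload ...
}

int main() {
    // e.g. master2 has its own copy of the "master" zone config somewhere
    g_localZones.insert("master");

    HandleConfigUpdate({"master", {{"hosts.conf", "..."}}}); // ignored
    HandleConfigUpdate({"agent1", {{"hosts.conf", "..."}}}); // accepted
}
```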
Updated by tgelf on 2015-12-11 12:01:33 +00:00 dnsmichi wrote:
Well, it has its own zone definition itself, sure - otherwise it would hardly work. But none of the nodes has /etc/icinga2/zones.d/master.
My major problem is that it completely ignores all the zones it DOESN'T know about. The "master" zone seems to have worked at least once, but you're right, the logs are telling a different story. In my belief, that part (deploying elements to the master zone) somehow MUST work in such a scenario, too. How else should one deploy hosts that the master zone should check? Cheers,
Updated by mfriedrich on 2015-12-11 12:22:15 +00:00 tgelf wrote:
Both have /var/lib/icinga2/api/packages/****/zones.d/master somewhere. Otherwise the config update wouldn't be ignored.
The Endpoint and Zone object configuration must be in place before pushing configuration packages for zones. That's a matter of trust relationships and of knowing which zone is capable of the configuration sync and which is not allowed (e.g. "master" is configured inside zones.conf, but "foo" is not - deploy a config package for the zone "foo" and it does not get synced, for obvious reasons). Though I think this problem is far away from the original issue, which obviously has been fixed. Please open a new issue, or join a real-life conversation about the problems you see with the current cluster trust model vs api packages. Kind regards,
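A rough sketch of that trust rule, with illustrative names rather than the actual core implementation: only zones already known through configured Zone objects (e.g. from zones.conf) are considered when syncing a deployed package's zones.d.

```cpp
#include <iostream>
#include <set>
#include <string>
#include <vector>

// Zones assumed to be defined as Zone objects, e.g. in zones.conf.
static const std::set<std::string> g_configuredZones = {
    "master", "director-global"
};

static std::vector<std::string>
FilterSyncableZones(const std::vector<std::string>& packageZones) {
    std::vector<std::string> syncable;
    for (const auto& zone : packageZones) {
        if (g_configuredZones.count(zone) > 0)
            syncable.push_back(zone);
        else
            std::cout << "Skipping zones.d/" << zone
                      << ": no Zone object configured for it.\n";
    }
    return syncable;
}

int main() {
    // A deployed package ships three zone directories; only zones known
    // to this node are eligible for the cluster config sync.
    for (const auto& zone :
         FilterSyncableZones({"master", "director-global", "agent1"}))
        std::cout << "Will sync zone '" << zone << "'.\n";
}
```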
Updated by tgelf on 2015-12-11 13:04:51 +00:00 dnsmichi wrote:
No, sorry. Master2 didn't have it and still doesn't. Any other ideas, something else I could check?
It's vital for the whole construct that the API allows me to deploy additional zones and endpoints. However, thank you for your comment - it points me to one more possible way of solving this part of the issue. I'll give it a try and report back.
Not really, please see above. Thanks,
Updated by tgelf on 2015-12-11 13:58:32 +00:00 Ok, I guess I reproducibly figured out one part of the problem. Starting with an empty master2:
Some related logs:
Now it ignores the director-global zone:
Unfortunately we do not see that it accepted the master zone - but we see that it updated some files:
It ignores the agent1 zone and reloads:
Right now it accepted the master zone, but refused the other zones defined in master/director/zones.conf. After the restart, it goes on:
This is my major problem: it now believes it is authoritative for master and will refuse all future updates:
However, it now accepted our agent1 zone:
And it happily connects to agent1 and ships it all the global zones:
Things get worse after that - but it doesn't look that bad. What we need to fix first of all is the fact that it accepts only the very first sync for a specific zone. Cheers,
Updated by tgelf on 2015-12-11 14:04:22 +00:00 One more hint, please have another look at how /var/lib/icinga2/api/zones looked on both nodes in comment #24 - is it correct that master2 places the files it got via config sync from master1 in _etc/director? It also created the .authoritative file - it didn't do so for the other zones. They are written to api/zones/director and not to api/zones/_etc/director.
Updated by mfriedrich on 2015-12-11 14:17:47 +00:00 Please show the tree on both master1 and master2:
Updated by tgelf on 2015-12-11 14:34:56 +00:00 I already mentioned that it was empty (and never existed during these tests) on master2:
And it is filled on master1:
Updated by tgelf on 2015-12-11 14:37:36 +00:00 Btw, that was one of the inactive stages of course, we are testing various config variants. The active one looks slightly different:
But I guess your question targeted only master2, you didn't believe it was empty ;)
Updated by mfriedrich on 2015-12-11 16:01:05 +00:00 I've tested this for a while now in my 2-instance cluster, and it works as expected.
Tests
Fresh start, clearing all data:
Start node1, icinga2a
Start node2, icinga2b
Create a package and push config
icinga2a
icinga2b
Notes
That's debug output I've added for better testing. I'll change that into debug messages later.
Possible Problems
Either you'll have zone directories (they can be empty too) in
If one of these matches on the second node, it will disallow any updates sent by the first master. Furthermore, cluster config updates won't happen if the config files did not change on disk. So it might be the case that you don't see any config sync updates on restart of the nodes, just because there is nothing to sync. Better clear the api/zones directory on both nodes beforehand.
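As a small sketch of the "nothing to sync" short-circuit mentioned above — the real code compares files on disk; this is only an in-memory approximation with made-up helper names:

```cpp
#include <iostream>
#include <map>
#include <string>

// path -> file content for one zone's configuration
using ZoneFiles = std::map<std::string, std::string>;

// True if the incoming zone config differs from what is already on disk.
static bool ZoneConfigChanged(const ZoneFiles& onDisk, const ZoneFiles& incoming) {
    return onDisk != incoming; // std::map compares keys and values
}

int main() {
    ZoneFiles onDisk   = {{"hosts.conf", "object Host \"a\" {}"}};
    ZoneFiles incoming = {{"hosts.conf", "object Host \"a\" {}"}};

    if (!ZoneConfigChanged(onDisk, incoming))
        std::cout << "Nothing to sync for this zone.\n"; // printed here
    else
        std::cout << "Pushing updated zone config to the cluster.\n";
}
```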
Updated by tgelf on 2015-12-11 17:22:40 +00:00 dnsmichi wrote:
Glad to hear that.
Does not look like a default config, but still looks like a reasonable test setup to me. Here is what I did for cleanup:
Started both nodes after cleaning up both of them.
Messages differ between current Git and the snapshots (using v2.4.1-68-gd781c39), here you go: Master 1:
Master 2:
The 'master' zone got replicated from master1 (which got it through the API) to master2; master2 therefore reloaded between 17:14:27 and 17:47:29. And while the initial replication was fine, it started to refuse that config afterwards:
As you can see, master 2 does not feel responsible for 'catalog' and 'master', but two seconds later it does. This perfectly reflects what I have been explaining all along. More details:
No, never had.
Always on master1 after config dump, never on master2.
That's not the case here. While IMO Icinga 2 MUST be able to cope with an outdated zone config in a stage, this was never the issue here. I mentioned a couple of times in this ticket that we can exclude those errors.
As all my logs showed, the problem is that master1 HAS something to sync, but master2 doesn't accept it as it feels authoritative for the zone. We are talking about the initial config dump here, so there definitely ARE changes. Subsequent changes are always accepted on master1 but no longer on master2.
Updated by mfriedrich on 2015-12-11 18:18:33 +00:00
That's impossible unless you are manually using "include_zones" yourself inside your configuration (and that's considered dangerous, as it will result in duplicated includes). Please show the recursive grep output from inside your configuration.
Updated by mfriedrich on 2015-12-11 18:25:37 +00:00 Ok, discard my last comment, I think I found something. |
Updated by tgelf on 2015-12-11 21:26:34 +00:00 For the sake of completeness, the only occurrence of include_zones in this setup is the autogenerated
in /var/lib/icinga2/api/packages/director/master1-1449852446-0/include.conf on master1. It does not exist on master2. Grepped through /etc/icinga2 and /var/lib/icinga2 on both of them.
Updated by mfriedrich on 2015-12-12 11:10:45 +00:00
Noted the idea, slept well and then wrote it down to finally implement and test a fix (cannot get this out of my head anyway). Changing the config authority checks unveiled a different, hidden problem caused by a change we made when implementing the API config packages in early September. I will not open additional issues, as this would complicate the backport process of issues and git commits, but will mark these bugs/problems below. During startup, additional includes are collected next to the user-defined ones:
Unfortunately the /var/lib/icinga2/api/zones directories, which are included inside IncludeNonLocalZone(), call IncludeZoneDirRecursive(), which registers a zone tag for "_etc". This is bug 1, in addition to this issue. It will make Icinga 2 always copy /var/lib... to /var/lib..., but normally you won't recognize that during the initial "copy-all-zones-config to /var/lib" process in ApiListener::SyncZoneDirs() on startup. Fixing this bug by removing the RegisterZoneDir() call would obviously fix the issue you've found. Note: This issue does not happen on the first config sync; it will happen on restart(s), when /var/lib/icinga2/api/zones is already populated and SyncZoneDir() will think, through IsConfigMaster(), that this is an authoritative copy (which it is not).
Second problem: There are 2 config authority checks in 2 locations. One lives in apilistener-filesync.cpp and ensures that
Both checks use the same functionality; previously it was just a check for "/etc/icinga2/zones.d", now it is a check against all registered local zone config authorities (that's already fixed). The second config authority check lives inside daemonutility.cpp in IncludeNonLocalZone(). It ensures that
Third problem: The check in IncludeNonLocalZone() for authoritative copies does not take packages with zones.d into account. So even if the problems above are fixed, it will certainly crash on a duplicate config include on the master (probably if you accidentally remove .authoritative inside the directory, which is a safety hook, but who knows what users will do). I did not test that in detail, but rest assured that changing the authority check to cover both etc and api packages will certainly help unveil user-defined problems. So the third problem can be fixed by introducing a general function called HasZoneConfigAuthority(), previously called IsConfigMaster(). In order to use it in libcli and libremote, we'll add it to libconfig, where the registered zone dir functionality already lives.
Fourth problem: The order of including etc, packages and cluster-synced config was wrong. etc and api packages call RegisterZoneDir(), which gets evaluated inside HasZoneConfigAuthority(). Previously the order was: etc, cluster-synced config, api packages. This would leave api packages with zones.d not taken into account for config inclusion and probably also break the initial config compilation. I did not test such a failure scenario in detail, I'm just fixing the code the way it's supposed to be.
Fifth problem: SyncZoneDir() doesn't need an additional authoritative config check, as it already calls ConfigCompiler::GetZoneDirs() and only syncs registered zone directories. Furthermore, it needs to check whether there is new config available, since it currently always logs that it will sync something, even if there is nothing to sync from within newConfig.
I've pushed a fix to git master and will wait for your feedback from tests. In order to properly test it with daemons running in the foreground in my not-so-regular-but-still-working-for-years cluster test setup: killing icinga2a, editing the api package config so it is modified, and starting icinga2a again will make icinga2b accept a new configuration, even if it has already loaded one from inside /var/lib/icinga2/api/zones.
Off-topic: Obviously no-one, including myself, ever tested the api packages with zones.d configuration. That's one for QA & feature requestors, to prevent such problems in the first place.
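To condense the fix described above into code form, here is a standalone sketch under simplifying assumptions: one shared HasZoneConfigAuthority() consults the zone dirs registered from both etc and api packages, and an .authoritative marker file tags a local authoritative copy so a restart doesn't mistake it for cluster-synced config. Paths, ordering and signatures here are stand-ins, not the actual commit.

```cpp
#include <filesystem>
#include <fstream>
#include <iostream>
#include <set>
#include <string>

namespace fs = std::filesystem;

// Zone dirs registered by local config sources (etc and api packages).
static std::set<std::string> g_registeredZoneDirs;

static void RegisterZoneDir(const std::string& zoneName) {
    g_registeredZoneDirs.insert(zoneName);
}

// Replaces the old IsConfigMaster(): do we hold a local, authoritative
// copy of this zone's configuration from any registered source?
static bool HasZoneConfigAuthority(const std::string& zoneName) {
    return g_registeredZoneDirs.count(zoneName) > 0;
}

// Mark an on-disk copy as authoritative so a later startup doesn't
// mistake it for configuration synced from a remote master.
static void MarkAuthoritative(const fs::path& zoneDir) {
    fs::create_directories(zoneDir);
    std::ofstream(zoneDir / ".authoritative").put('\n');
}

int main() {
    // Order matters (fourth problem above): register the etc and api
    // package zone dirs first, then evaluate cluster-synced config.
    RegisterZoneDir("master");          // from /etc/icinga2/zones.d
    RegisterZoneDir("director-global"); // from an API package's zones.d

    MarkAuthoritative("/tmp/api-zones-demo/master");

    std::cout << std::boolalpha
              << HasZoneConfigAuthority("master") << "\n"  // true: keep local copy
              << HasZoneConfigAuthority("agent1") << "\n"; // false: accept synced copy
}
```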
Updated by tgelf on 2015-12-12 13:58:20 +00:00
I deployed it to the very same environment, works as it should right now. Thanks a lot for your effort, Michael! I'll leave this issue closed right now ;) Some related topics we should discuss on Monday:
Thanks again, enjoy your weekend!
Updated by mfriedrich on 2015-12-12 14:52:45 +00:00
Thanks for testing this so quickly :)
Updated by gbeutner on 2016-02-23 09:58:24 +00:00
This issue has been migrated from Redmine: https://dev.icinga.com/issues/10819
Created by tgelf on 2015-12-10 15:16:27 +00:00
Assignee: mfriedrich
Status: Resolved (closed on 2015-12-12 13:58:20 +00:00)
Target Version: 2.4.2
Last Update: 2016-02-23 09:58:24 +00:00 (in Redmine)
I'm using the API to deploy config packages. They ship a few files in conf.d and also a few zone directories in zones.d: a global one, the master zone and a few endpoints. When deploying the configuration to a single master, config replication seems to completely ignore those zones.
It might be a config error, but I doubt it: we tried this in multiple environments, with v2.4.0 and v2.4.1. dnsmichi was also able to reproduce this problem. It would be great if this could be fixed, as it makes the whole feature hard to use in clustered environments.
Thanks,
Thomas
Changesets
2015-12-10 21:21:29 +00:00 by mfriedrich c5b13ff
2015-12-11 16:03:07 +00:00 by mfriedrich d781c39
2015-12-12 11:16:00 +00:00 by mfriedrich 8055f05
2015-12-18 10:53:56 +00:00 by mfriedrich 79899d7
2016-02-23 08:09:55 +00:00 by mfriedrich 1f5a216
2016-02-23 08:09:55 +00:00 by mfriedrich f0a1872
2016-02-23 08:09:55 +00:00 by mfriedrich 48fe703
2016-02-23 08:09:55 +00:00 by mfriedrich 2f8d416