
[dev.icinga.com #11684] Cluster resync problem with API created objects #4169

Closed

icinga-migration opened this issue Apr 26, 2016 · 17 comments

Labels: area/distributed (Distributed monitoring: master, satellites, clients), bug (Something isn't working)
Milestone: 2.6.0

Comments


This issue has been migrated from Redmine: https://dev.icinga.com/issues/11684

Created by geds on 2016-04-26 12:43:08 +00:00

Assignee: mfriedrich
Status: Resolved (closed on 2016-11-14 13:56:43 +00:00)
Target Version: 2.6.0
Last Update: 2016-11-24 15:32:02 +00:00 (in Redmine)

Icinga Version: 2.4.7
Backport?: Not yet backported
Include in Changelog: 1

Hello,

there is a problem with Icinga2 in cluster mode: API-created objects are not resynced.

To reproduce this issue:

  1. Spin up vagrant with icinga2x-cluster

  2. Stop Icinga2 instances on both cluster servers

  3. Edit /etc/icinga2/zones.conf on both 'icinga2a' and 'icinga2b' to enable config replication:

object Endpoint "icinga2a" {
  host = "192.168.33.10"
}

object Endpoint "icinga2b" {
  host = "192.168.33.20"
}

object Zone "master" {
  endpoints = [ "icinga2a", "icinga2b" ]
}

object Zone "checker" {
}

object Zone "global-templates" {
  global = true
}

  4. Start Icinga2 on 'icinga2a'

  5. Create a host/service on 'icinga2a' via the API:

curl -k -s -u "aws:$PASSWORD" -H 'Accept: application/json' -X PUT "https://192.168.33.10:5665/v1/objects/hosts/host1" -d "{ \"templates\": [ \"generic-host\" ], \"attrs\": { \"address\": \"127.0.0.1\", \"check_command\": \"hostalive\" } }"
curl -k -s -u "aws:$PASSWORD" -H 'Accept: application/json' -X PUT "https://192.168.33.10:5665/v1/objects/services/host1!service1" -d "{ \"attrs\": { \"check_command\": \"passive\", \"enable_active_checks\": false } }"

  6. Start Icinga2 on 'icinga2b' and make sure the API objects were synced:

ls -1 /var/lib/icinga2/api/packages/_api/icinga2b*/conf.d/hosts
host1.conf

ls -1 /var/lib/icinga2/api/packages/_api/icinga2b*/conf.d/services
host1!service1.conf

  7. Stop Icinga2 on 'icinga2b' and delete the API objects:

rm -f /var/lib/icinga2/api/packages/_api/icinga2b*/conf.d/*/*

Repeat step 6. However, this time the configs are not there.
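
As a cross-check (a minimal sketch, reusing the credentials and addresses from the steps above), the REST API can be queried on both nodes; after the restart, the request against 'icinga2b' should come back with an error such as "No objects found":

curl -k -s -u "aws:$PASSWORD" -H 'Accept: application/json' "https://192.168.33.10:5665/v1/objects/hosts/host1"
curl -k -s -u "aws:$PASSWORD" -H 'Accept: application/json' "https://192.168.33.20:5665/v1/objects/hosts/host1"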

Changesets

2016-10-11 08:55:13 +00:00 by gbeutner 0145a32

Fix object resync issues

refs #11684

2016-10-14 13:54:34 +00:00 by gbeutner 759aba8

Fix object resync issues

refs #11684

2016-10-24 06:40:12 +00:00 by gbeutner d70d779

Add missing call for the base class' Stop() method

refs #11684

2016-11-10 16:15:06 +00:00 by mfriedrich 5dd4898

Ensure that UpdateConfigObject sets the target zone

refs #11684

2016-11-10 16:16:08 +00:00 by mfriedrich 72bf538

API: Set zone attribute for local zone if not specified

This allows to sync the object to other nodes in the same
zone on reconnect.

refs #11684

2016-11-10 16:44:05 +00:00 by mfriedrich 2e2de7c

Enhance log messages for cluster config sync

refs #11684

2016-11-11 15:29:37 +00:00 by mfriedrich 4b86f69

Ensure that runtime created objects are synced on (re)connect

refs #11684

2016-11-17 12:51:04 +00:00 by mfriedrich e5a6bdc

Ensure that UpdateConfigObject sets the target zone

refs #11684

2016-11-17 12:51:04 +00:00 by mfriedrich 099fc76

API: Set zone attribute for local zone if not specified

This allows to sync the object to other nodes in the same
zone on reconnect.

refs #11684

2016-11-17 12:51:04 +00:00 by mfriedrich bef53ac

Enhance log messages for cluster config sync

refs #11684

2016-11-17 12:51:04 +00:00 by mfriedrich 46d7145

Ensure that runtime created objects are synced on (re)connect

refs #11684

2016-12-05 15:37:31 +00:00 by mfriedrich 338f5c0

Fix crash in CreateObjectHandler::HandleRequest()

fixes #13409
refs #11684



Updated by mfriedrich on 2016-05-09 15:53:47 +00:00

  • Parent Id set to 11415

Workaround discussed elsewhere: set the "zone" attribute explicitly for PUT requests, as in the sketch below.
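
A minimal sketch of that workaround, reusing the host creation request from the description (the zone name "master" is an assumption here; it should be the zone the object belongs to):

curl -k -s -u "aws:$PASSWORD" -H 'Accept: application/json' -X PUT "https://192.168.33.10:5665/v1/objects/hosts/host1" -d "{ \"templates\": [ \"generic-host\" ], \"attrs\": { \"address\": \"127.0.0.1\", \"check_command\": \"hostalive\", \"zone\": \"master\" } }"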


Updated by gbeutner on 2016-05-11 13:43:02 +00:00

  • Status changed from New to Rejected


Updated by gbeutner on 2016-05-11 13:50:25 +00:00

  • Status changed from Rejected to New


Updated by mfriedrich on 2016-05-11 13:55:38 +00:00

  • Status changed from New to Assigned
  • Assigned to set to mfriedrich


Updated by mfriedrich on 2016-05-21 13:52:25 +00:00

Unfortunately, setting the zone attribute from GetLocalZone() won't fix the issue itself. The files generated by CreateObject() are still located in "conf.d", which is not taken into account for runtime syncs.

Testing a separate patch which takes the zoneName into account and puts those files underneath zones.d/<zone>/<type>/<name>.conf.


Updated by mfriedrich on 2016-05-21 14:23:54 +00:00

  • Relates set to 11541


Updated by mfriedrich on 2016-06-23 13:38:41 +00:00

  • Relates deleted 11541


Updated by mfriedrich on 2016-09-28 13:41:44 +00:00

  • Target Version set to 2.6.0
  • Parent Id deleted 11415


Updated by mfriedrich on 2016-11-10 16:14:32 +00:00

The fix works for the scenario where master A starts and sends UpdateObject messages to master B (which involves setting the target_zone for RelayMessage).

The other way around (master B being shut down and then reconnecting), the config sync does not work. This is probably due to the missing implicit zone attribute inside the same zone.

I'll investigate further.


Updated by mfriedrich on 2016-11-10 16:43:30 +00:00

Forget the statement above: re-syncing objects created at runtime through the API doesn't work, neither when the zone attribute is explicitly specified nor when it is left empty.

One observation: if I specify a child zone ("satellite") instead of the current "master" (and also have code which explicitly sets that if omitted), the configuration gets synced.
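
For reference, a request of roughly this shape (hypothetical values; credentials and port borrowed from the downtime test further down) produces the synced case:

curl -k -s -u root:icinga -H 'Accept: application/json' -X PUT "https://localhost:7000/v1/objects/hosts/google.com11" -d '{ "attrs": { "address": "8.8.8.8", "check_command": "hostalive", "zone": "satellite" } }'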

mbmif /usr/local/tests/icinga2/master-slave (master) # cat icinga2a/lib/icinga2/api/packages/_api/mbmif.int.netways.de-1442332428-0/conf.d/hosts/google.com11.conf
object Host "google.com11" {
    address = "8.8.8.8"
    check_command = "hostalive"
    version = 1443624650.253672
    zone = "satellite"
}

mbmif /usr/local/tests/icinga2/master-slave (master) # cat icinga2b/lib/icinga2/api/packages/_api/mbmif.local-1463837930-0/conf.d/hosts/google.com11.conf
object Host "google.com11" {
    address = "8.8.8.8"
    check_command = "hostalive"
    version = 1443624650.253672
    zone = "satellite"
}

The other way around, it does not work with the same zone.

mbmif /usr/local/tests/icinga2/master-slave (master) # cat icinga2a/lib/icinga2/api/packages/_api/mbmif.int.netways.de-1442332428-0/conf.d/hosts/google.com6.conf
object Host "google.com6" {
    address = "8.8.8.8"
    check_command = "hostalive"
    zone = "master"
}

mbmif /usr/local/tests/icinga2/master-slave (master) # cat icinga2b/lib/icinga2/api/packages/_api/mbmif.local-1463837930-0/conf.d/hosts/google.com6.conf
cat: icinga2b/lib/icinga2/api/packages/_api/mbmif.local-1463837930-0/conf.d/hosts/google.com6.conf: No such file or directory

I'm assuming that something within CanAccessObject() is somehow broken here.


Updated by mfriedrich on 2016-11-10 18:27:16 +00:00

Funny. It is the check between the object version and the endpoint log position, which means that a synced endpoint (no more replay logs) is not able to fetch objects at any time. This would probably also explain why Comments/Downtimes are not synced.

Debug patched output:

[2016-11-10 18:01:41 +0100] critical/ApiListener: Object 'google1' version '1.44492e+09' is older than endpoint log position '1.4788e+09'
[2016-11-10 18:01:41 +0100] critical/ApiListener: Object 'sync1' version '1.47879e+09' is older than endpoint log position '1.4788e+09'

I'm not entirely sure why we added that check in the past; most likely it was meant to prevent unwanted object syncs. There is, however, no direct relation between the replay log and the runtime object configs. I've played around with an offset, but there is no exact value for that, so I removed the check.

Different test with downtimes.

michi@mbmif ~ $ curl -k -s -u root:icinga -H 'Accept: application/json' -X POST https://localhost:7000/v1/actions/schedule-downtime -d '{ "filter": true, "type": "Service", "author": "michi", "comment": "sync test", "start_time": 1478798977, "end_time": 1478799077, "fixed": true }'

mbmif /usr/local/tests/icinga2/master-slave (master) # ls -la icinga2a/lib/icinga2/api/packages/_api/mbmif.int.netways.de-1442332428-0/conf.d/downtimes/ | wc -l
    1022

mbmif /usr/local/tests/icinga2/master-slave (master) # ls -la icinga2b/lib/icinga2/api/packages/_api/mbmif.local-1463837930-0/conf.d/downtimes/ | wc -l
    1022


Updated by mfriedrich on 2016-11-10 19:05:19 +00:00

  • Relates set to 11541


Updated by mfriedrich on 2016-11-10 19:14:05 +00:00

  • Status changed from Assigned to Feedback

Pushed a fix to git master, please test the snapshot packages.
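
One way to pull in the snapshot build for testing on CentOS 7 (a sketch; the repo id "ICINGA-snapshot" is taken from the tester's report below, and package sources differ per platform):

yum --enablerepo=ICINGA-snapshot update icinga2
systemctl restart icinga2
icinga2 --version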


Updated by mfriedrich on 2016-11-14 13:56:43 +00:00

  • Status changed from Feedback to Resolved
  • Done % changed from 0 to 100


Updated by geds on 2016-11-24 15:11:10 +00:00

I have tested this fix with version v2.5.4-206-gb028ff2 (icinga2.x86_64 2.5.4-1.snapshot201611221649.el7.centos @ICINGA-snapshot). The cluster is resyncing API objects now, but there is still one problem: the cluster resyncs everything, even objects that were deleted on one node while the other was down.

How to reproduce:
Follow steps 1 to 7 in the description and make sure everything is synced and you have the same API-created objects on both servers.

  1. Stop Icinga2 on 'icinga2b'
  2. Delete an object from 'icinga2a' using API:

curl -k -s -u "aws:$PASSWORD" -H 'Accept: application/json' -X DELETE "https://192.168.33.10:5665/v1/objects/services/host1!service1"

  3. Start Icinga2 on 'icinga2b'

This time 'service1' gets synced back to 'icinga2a' from 'icinga2b', because it was still present on 'icinga2b' while that node was down.
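
Until that is addressed, a possible stopgap (my assumption, not an official recommendation) is to re-issue the DELETE once both nodes are connected again, so the deletion is replicated:

curl -k -s -u "aws:$PASSWORD" -H 'Accept: application/json' -X DELETE "https://192.168.33.10:5665/v1/objects/services/host1!service1"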

Not sure if I should create a separate ticket for this.

BTW thanks a lot for your hard work!


Updated by mfriedrich on 2016-11-24 15:23:13 +00:00

Hm, imho that's a problem which wasn't introduced by this fix but existed already. Can you please open a new issue?


Updated by geds on 2016-11-24 15:32:02 +00:00

Will do. Thanks for your prompt answer.

@icinga-migration icinga-migration added bug Something isn't working area/distributed Distributed monitoring (master, satellites, clients) labels Jan 17, 2017
@icinga-migration icinga-migration added this to the 2.6.0 milestone Jan 17, 2017