Multisite vCloud Director With Global Load Balancing

Using Global Load Balancing to access Multisite vCloud Director (vCD) has been possible since vCD 9.0, but it only worked if the tenants using it had services in each site. If a Provider had, say, five sites, but a tenant only had a presence in two of them, connecting to a site in which the tenant did not have service would result in a failed login. All this has changed in vCloud Director 10.0.

A couple of years ago, in the Architecting Multisite vCloud Director white paper I wrote about the way we allowed vCloud Director sites to be federated (we call it “associated”) with each other. Back in vCD 9.0 this didn’t do much, but we had big plans for it. How things have moved on across two years and four major releases.

The back-story

TL;DR - Just take me to the interesting new stuff!

In vCloud Director 9.0 we introduced the idea of “Multisite”, but its apparent usefulness was limited. At least on the surface, but more on that in a moment. It allowed Cloud Providers to associate their vCD instances together. This allowed a level of secured communications between different sites. With this in place, the Provider or Tenant could then associate their Organizations (Orgs) across those sites. Once this was complete, magic happened! Well, some magic. Behind the scenes lots of cool stuff was in motion, but at the GUI level, the users who logged into an associated Org (we call it a “member of the association”) would see a menu item with an icon of some “Datacenters” which allowed them to easily switch their view between each of their member sites.

Here’s what it looked like in my two-site lab. We’re currently connected to my test Org in Site-A, so the only option is to “switch to” my TestOrg in Site-B.

The hero of the piece

The clever part which enabled this was, and still is, hidden within those association steps. Associating sites allows vCD instances to securely communicate with each other at the System level. Associating Organizations allows vCD instances to securely communicate with each other at the Org / User level. We’ve used this over the versions since it was introduced to bring more and more functionality. Here’s an example Tenant Admin view from vCD 9.7.

Here you can see that the Tenant UI creates a consolidated view of the resources from a number of sites (although just one in this lab), showing both Org VDCs (T1-OVDC and T1-K8S) and dedicated vCenters through a feature called “Central Point of Management”, or CPoM for short.

The image above was grabbed from the brilliant "vCloud Director for Service Providers" Hands on Lab. If you would like to explore the features of vCD or any of the other VMware Cloud Provider Platform software features, check out HOL-2083-01-HBD and the other VCPP labs, for free, over at labs.hol.vmware.com

In the early Multisite releases, other than pre-authenticating between sites, neither vCD nor the client GUI needed to do much heavy lifting, so we didn’t anticipate a limit to the number of vCD instances which could be associated. As the GUI has grown to provide much richer multisite information, this is no longer the case. When the GUI is launched in a user’s browser, there’s a flurry of traffic across the associated sites to collect this information for display. The more sites the user has services in, the longer this takes. Unfortunately, this has led to a limit of ten associated Sites / Orgs.

If you are a Partner looking to design a much larger number of vCD instances, or a customer looking to associate a larger number of Org VDC locations, please get in touch and we'll help you out. See the Contact page above for details, or, speak to your local VMware representatives.

One of the ways in which the Multisite capability was added to vCloud Director without breaking lots of existing stuff was in the area of user authentication. Clearly this is an important part of any web service, so the inter-site authentication relied upon the existing user authentication mechanisms already present. When your connection arrived on a vCD instance, other than an Accept header which included version=29.0;multisite=global, everything else was pretty much the same, which was good. v29.0 was the new multisite-capable API version when we released vCD 9.0. It’s just been deprecated in vCD 10.0, which introduced API v33.0, and you should definitely check the vCloud API Programming Guide for Service Providers for the currently supported version(s) after reading this.
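
To make the header a little more concrete, here is a minimal Python sketch that builds and then picks apart an Accept header of this shape. The `application/*+xml` media type is my assumption (the text above only quotes the `version=...;multisite=global` parameters), and both helper names are hypothetical.

```python
def build_accept_header(api_version: str, multisite: bool = True) -> str:
    """Build an Accept header value carrying the API version and multisite flag."""
    # Assumption: 'application/*+xml' is the base media type. Only the
    # parameters that follow it are quoted in the post.
    parts = ["application/*+xml", f"version={api_version}"]
    if multisite:
        parts.append("multisite=global")
    return ";".join(parts)


def parse_accept_params(header: str) -> dict:
    """Return the key=value parameters found after the media type."""
    params = {}
    for piece in header.split(";")[1:]:
        key, _, value = piece.strip().partition("=")
        params[key] = value
    return params


header = build_accept_header("29.0")
print(header)
print(parse_accept_params(header))
```

Swapping `"29.0"` for a currently supported API version is the only change a newer release should need in this sketch.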

Now a user can authenticate at any site they connect to, and (originally switch to, but now), view content from all their locations, a Provider can offer a more resilient service. Imagine a Provider with three locations and vCD in each. Using a simple Round Robin, or more complex monitor/healthcheck based intelligent DNS service, the Provider can give their customers a single URL to connect to, rather than three separate ones. Here’s the diagram from the Multisite white paper to illustrate.

The stages from the diagram which the connection process goes through are:

1. Even though a user connects to https://portal.cloud.example.com/tenant/<org-name> the DNS lookup only cares about the Fully Qualified Domain Name (FQDN) part ‘portal.cloud.example.com’.
2. The intelligent DNS system, often called a Global Load Balancer (GLB) or Global Traffic Manager (GTM), looks up the requested FQDN in its database, checks which answers (DNS results) are valid for the request, and then uses the load balancing algorithm specified for that FQDN to choose the best answer.
3. This is returned to the client in the form of the site-specific FQDN (or actual IP address) of one of the (in this case) three possible sites.
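
The stages above can be sketched in a few lines of Python. This is a toy model rather than a real GLB: the site FQDNs, the `RoundRobinGLB` class and the health-check callable are all hypothetical, and a simple round robin stands in for whatever algorithm a Provider actually configures.

```python
from itertools import cycle
from urllib.parse import urlsplit

# Hypothetical site FQDNs behind the single portal name.
SITES = {
    "portal.cloud.example.com": [
        "site-a.cloud.example.com",
        "site-b.cloud.example.com",
        "site-c.cloud.example.com",
    ]
}


class RoundRobinGLB:
    """Toy GLB: round robin over the answers that pass a health check."""

    def __init__(self, records, is_healthy):
        self._records = records
        self._is_healthy = is_healthy
        self._cursors = {fqdn: cycle(answers) for fqdn, answers in records.items()}

    def resolve(self, url):
        fqdn = urlsplit(url).hostname  # only the FQDN matters; the path is ignored
        answers = self._records.get(fqdn)
        if not answers:
            return None
        cursor = self._cursors[fqdn]
        # Try each answer at most once per lookup, skipping unhealthy sites.
        for _ in range(len(answers)):
            candidate = next(cursor)
            if self._is_healthy(candidate):
                return candidate
        return None


down = {"site-b.cloud.example.com"}  # pretend Site-B fails its health check
glb = RoundRobinGLB(SITES, lambda site: site not in down)
url = "https://portal.cloud.example.com/tenant/tenant-1"
print([glb.resolve(url) for _ in range(4)])
```

Note that the Org name in the path plays no part in the choice, which is exactly why this model on its own cannot know where a tenant has service.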

You can read more about this in the white paper here

The villain of the story

So, where’s the catch? Well, as long as the Customer had an Organization in each of the sites, everything worked well. There were/are a few caveats around the type of account (see the white paper here for more details) but basically a user needed the same account type (local, LDAP or External IDP) in all sites, with the same username and password. The key there was the “all sites” part. If the site selection mechanism deposited a user at a site where they did not have a presence, the login would simply fail. That’s a less than stellar user experience I’m sure you’ll agree.

Could we work around the limitation? Well, yes, we could. It wasn’t simple, but there’s a generalized version of the solution described in the white paper here. You can see how it would work in the diagram (again, stolen from the WP) below.

This is similar to the sequence of events we followed in the example above. This time however, we have to get a bit more clever. The DNS system doesn’t care about the full URI path, just the FQDN, but in this case, we do. We need to terminate the user’s connection request in such a way that we can check the Org name at the end of the URI ("/tenant-1" in the diagram). Here’s the sequence of events in this connection process.

1. When a user connects to https://portal.cloud.example.com/tenant/<org-name> the connection is passed to a ‘service’ which can examine the full path to the tenant Org.
2. The service uses the Org name to query a data source for valid vCD site locations where this tenant has service.
3. The initial connection is then returned with an HTTP 3xx status code identifying the vCD location the user should be redirected to.
4. The user’s browser makes a connection to the vCD site specified in the earlier redirect response and, as we know the Org is present in this location, the login process should complete successfully.
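
As a rough illustration of that sequence, here is a minimal Python sketch of the lookup-and-redirect logic. The `ORG_LOCATIONS` mapping, the site FQDNs and the `redirect_for` function are all hypothetical, and I have picked 302 as the 3xx code purely for the example.

```python
from urllib.parse import urlsplit

# Hypothetical data source: Org name -> vCD sites where that Org has service.
ORG_LOCATIONS = {
    "tenant-1": ["site-a.cloud.example.com", "site-c.cloud.example.com"],
}


def redirect_for(url, org_locations=ORG_LOCATIONS):
    """Return (status, headers) for a request to the single portal URL."""
    path_parts = [p for p in urlsplit(url).path.split("/") if p]
    # Expect a path of the form /tenant/<org-name>.
    if len(path_parts) < 2 or path_parts[0] != "tenant":
        return 404, {}
    org = path_parts[1]
    sites = org_locations.get(org)
    if not sites:
        return 404, {}
    # Pick the first listed site; a real service would also health-check it.
    return 302, {"Location": f"https://{sites[0]}/tenant/{org}"}


print(redirect_for("https://portal.cloud.example.com/tenant/tenant-1"))
```

Even in this tiny form you can see the operational cost: somebody has to keep `ORG_LOCATIONS` accurate everywhere the service runs.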

While this works (although this example would need to be coded into a real-world application), even at a generic level the design is what an old colleague of mine would describe as “sub-optimal”. For one thing, you have to have hardware in some (ideally, multiple) locations to run the service. Again, ideally, you really need a Global Load Balancer of some sort to direct traffic intelligently to one or other of those locations. You then need to bounce the connection back to DNS again to finally resolve the site connection with, ideally, some monitor or health-check to make sure the site selected is operational.

Oh, and we’d better not forget the operational overhead involved in keeping the site-selection database up to date in all the locations where the service runs!

If, as a Provider, you only offer this single URL, Load-Balanced login across a small group of vCD sites, let’s call that a region, and your customers always get service delivered with an Organization / OrgVDC in every site in that region, this works fine. Clearly, there would be a different URL for each region, but within each of those, this model could simply be repeated. Unfortunately, if a customer wants geographic separation for resilience, or more often as a BCDR requirement, this approach forces them to log into each region separately. Again, a sub-optimal user experience.

We’re fixing a problem that, in an ideal world, we wouldn’t actually have. If the vCloud Director development team cared about us, we’d just be able to throw the connection at any vCD site and, if the tenant didn’t exist there, vCD would magically sort things out! Oh, if only there was a new version of vCD with this magic built in…

What’s new in vCD 10.0

My friend Daniel Paluszek wrote a series of great posts here covering many of the new features in the 10.0 release of vCloud Director, and there were a lot of them! However, one small addition made it into the new release without any fanfare either in Daniel’s posts or the product release notes.

Over the last few releases, as the multisite functionality has grown, the Engineering team have developed a library of tools to find, get and set information across vCD instances. Here’s the really cool part… As of version 10.0, using these, vCD will now step in to help if a user tries to connect to a vCD instance in which they do not have an Organization. Instead of a failed login, vCD will use its list of other association member sites to find one where the Org name requested by the user does exist. As long as vCD can find a site where the requested Org exists, the connection will be redirected to that site. As if by magic!

Behind the scenes, if a connection arrives for an Organization <org-name> which isn’t present in that site, vCD will, in parallel, make a call to each of the other associated sites using the vCD /query API with a parameter of type=organization&name=<org-name> to see if that Org exists on another site. If it gets a positive response back, vCD will first check that the URL it’s going to redirect to is valid, before sending the user over to that site. We’re fans of Darwin around here, so whichever site responds first gets the prize! If we get other, slower responses, too bad, you snooze, you lose. Although, to be fair, the fastest response is probably the most able site to handle the request at that point in time.
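
The “first positive answer wins” behavior can be sketched with Python’s standard thread pool. This is my own illustration, not vCD’s implementation: `has_org` is a hypothetical stand-in for the real inter-site call, the site names and latencies are invented, and the query string is simply assembled from the parameters quoted above.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.parse import urlencode


def build_query(org_name):
    # Query string assembled from the parameters described in the text.
    return "/query?" + urlencode({"type": "organization", "name": org_name})


def find_org_site(org_name, sites, has_org):
    """Ask every associated site in parallel; return the first that says yes.

    has_org(site, query) is a hypothetical callable that returns True when
    the Org exists on that site.
    """
    if not sites:
        return None
    query = build_query(org_name)
    with ThreadPoolExecutor(max_workers=len(sites)) as pool:
        futures = {pool.submit(has_org, site, query): site for site in sites}
        for future in as_completed(futures):
            try:
                if future.result():
                    # Fastest positive answer wins. Leaving the with-block
                    # still waits for the slower calls to finish; a production
                    # version would want to cancel or time them out.
                    return futures[future]
            except Exception:
                continue  # an unreachable site is treated as a "no"
    return None


# Simulated sites: both host tenant-1, but site-a answers faster.
LATENCY = {"site-a": 0.05, "site-c": 0.2}


def fake_has_org(site, query):
    time.sleep(LATENCY[site])
    return query.endswith("name=tenant-1")


print(find_org_site("tenant-1", ["site-a", "site-c"], fake_has_org))
```

The URL validity check mentioned above would sit between getting the winning site back and actually issuing the redirect.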

Here’s what happens when a connection lands on a vCloud Director 10.0 site which is associated with other members.

	State	Result
1.	Org is present on this vCD instance	Connection accepted and GUI returned
2.	Org is not present on this vCD, but is on another associated site	Connection is redirected to the first site to respond
3.	Org is not present on this or any associated sites	Connection is not redirected
4.	Org is only present on a remote vCD instance but redirect URL is invalid	Connection is not redirected

As you can see, we have most bases covered, but will only forward the connection to a valid alternate site.
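
For anyone who prefers code to tables, the four states can be written as one small decision function. This is only a sketch of the logic in the table: the `Outcome` names, the `decide` function and the `url_is_valid` check are hypothetical, and the HTTPS test is a placeholder for whatever validation vCD really performs.

```python
from enum import Enum


class Outcome(Enum):
    SERVE_LOCALLY = "connection accepted and GUI returned"
    REDIRECT = "connection redirected to the first site to respond"
    NO_REDIRECT = "connection not redirected"


def decide(org_is_local, first_remote_url, url_is_valid):
    """Map the four states in the table to an outcome.

    first_remote_url is the redirect URL for the first associated site that
    reported the Org, or None if no site did.
    """
    if org_is_local:                          # state 1
        return Outcome.SERVE_LOCALLY
    if first_remote_url is None:              # state 3
        return Outcome.NO_REDIRECT
    if not url_is_valid(first_remote_url):    # state 4
        return Outcome.NO_REDIRECT
    return Outcome.REDIRECT                   # state 2


def looks_valid(url):
    # Placeholder validity check, assumed for this example only.
    return url.startswith("https://")


print(decide(False, "https://site-a.cloud.example.com/tenant/tenant-1", looks_valid))
```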

There is a configuration element which can be used to control this behavior using the manage-config sub-command of the vCD cell-management-tool (CMT). The element is com.vmware.multisite.ui.redirect which is a boolean field and defaults to true. Setting it to false disables this behavior and may be useful if a Provider has implemented their own alternative mechanism, or simply wants to stick to direct, per-site, URLs.
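
Here is a small Python helper that assembles the command line for this setting without running it. Treat it as a sketch: I am assuming `manage-config` takes the same `-n` (name) and `-v` (value) options used with webapp.allowed.origins later in this post, and that the value is written as `true` or `false`; check the CMT help on your own cell before relying on it.

```python
CMT = "/opt/vmware/vcloud-director/bin/cell-management-tool"
REDIRECT_KEY = "com.vmware.multisite.ui.redirect"


def redirect_config_command(enabled: bool) -> list:
    """Build (but do not run) the manage-config command line."""
    # Assumption: -n names the property and -v sets its value, matching the
    # webapp.allowed.origins commands shown elsewhere in this post.
    value = "true" if enabled else "false"
    return [CMT, "manage-config", "-n", REDIRECT_KEY, "-v", value]


# Print the command a Provider might run to disable the redirect.
print(" ".join(redirect_config_command(False)))
```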

So what does this mean?

What seems an eternity ago, (see Global Access High Level Overview and The villain of the story if you missed it earlier) we looked at a simple Global Load Balancing model which took a single URI and let users connect through that to any one of the associated vCD sites beneath it. It looked ideal. We had DNS based load balancing, it’s well understood, resilient and doesn’t need any special client-side support. Except, then we saw how it had a flaw. If the user attempted to log in to a vCD site where they didn’t have an Organization, the login would fail, even if they had a presence at another associated site. In earlier versions of vCD, this led to all sorts of complex malarkey to protect the user experience.

Now, of course, with vCD 10.0 that’s no longer the case. Thanks to the caring folks in the vCD development team, Providers can use the simple Global Load Balancing model, knowing that if a user is “balanced” to a site which doesn’t host one of their Orgs, vCD will spin around in a nearby phone-box and step out to save the day. If you like graphical explanations, here’s a picture to illustrate the heroism at work.

You probably figured all this out by now, but for completeness, here’s the sequence of events depicted above.

1. The user tries to connect to their Org using the Provider’s “single” portal URI.
2. Their DNS client strips the /tenant/<tenant-1> part off and requests a lookup for the FQDN only.
3. The Provider’s GLB solution returns one of the three sites using its load balancing algorithm.
4. The user is directed to Site-B, which happens to be a site which doesn’t host either of the user’s Organizations / OrgVDCs.
5. Site-B queries sites [A] and [C] to see if either hosts <tenant-1>.
6. Site-A is quickest to respond with a positive answer.
7. Site-B redirects the user to Site-A where their Org does exist.

Updated: Nov 5th 2020

Important - If you want to use the hierarchical DNS model outlined above, there’s one key piece that I should have mentioned in the white paper, and here, that I didn’t (mea culpa - again). VCD uses a number of techniques to make itself more robust for life on the Internet and one of those gets in the way of my nice neat picture. The VCD cells will only accept requests which are made “to” a list of origins they know about.

In the SSL/TLS Certificate, we have to add the various FQDNs which the client might request as Subject Alternative Names so that their browser knows it’s in the right place whichever URI was requested. In reverse, we have to tell the VCD cells which URIs they should respond to requests “as”. On install this list is the IP address of the cell, and you may have seen the “Failed Start: An error occurred during initialization.” message when you try to log in to a new VCD install using the FQDN but before you’ve updated the Public URL field. Our hierarchical DNS model is the same issue but on steroids…

Fortunately, this part of the problem is not too difficult to fix. You can read about it in KB 75305. As we update the Web Portal Base URLs or add more cells to the site, there is a database table which collects these allowed origins so that the various sources of requests will work properly. Fortunately, as the KB explains, we can update the contents of that table without resorting to open-~~heart~~DB surgery. By connecting to one of the cells and running a couple of Cell-Management-Tool commands, we can ensure all our FQDNs are approved. To list the currently permitted sources, use this:

/opt/vmware/vcloud-director/bin/cell-management-tool manage-config -n webapp.allowed.origins -l

From which you’ll get back something like this, from the KB:

Property "webapp.allowed.origins" has value "https://example.com,10.160.2.27,http://10.160.2.27,http://example.com,https://10.160.2.21,https://10.160.2.27,http://10.160.2.21,10.160.2.21,https://example1.com,https://example2.com"

To add any FQDNs which you want to use in your own hierarchical DNS model to this list, we use the same command, but this time we include the list of additional “origins” we want to add to the table. Here’s the format:

/opt/vmware/vcloud-director/bin/cell-management-tool manage-config -n webapp.allowed.origins -v <comma_separated_list_without_spaces>
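
Building that comma-separated list by hand is error-prone, so here is a minimal Python sketch that parses the output of the list command and merges in the new origins. It only builds the command; it does not run it. The function names and the portal FQDN are hypothetical, and I have assumed that supplying the complete merged list (existing plus new) is acceptable, since this post does not say whether `-v` replaces or appends. KB 75305 is the place to confirm that.

```python
import re

CMT = "/opt/vmware/vcloud-director/bin/cell-management-tool"


def parse_origins(list_output):
    """Extract the origins from the output of the '-l' form of the command."""
    match = re.search(r'has value "([^"]*)"', list_output)
    if not match:
        return []
    return [origin for origin in match.group(1).split(",") if origin]


def merged_origins_command(list_output, extra_origins):
    """Build the -v command with existing and new origins, de-duplicated."""
    # dict.fromkeys keeps the first occurrence of each origin, in order.
    merged = list(dict.fromkeys(parse_origins(list_output) + list(extra_origins)))
    # Comma-separated with no spaces, as the format above requires.
    return [CMT, "manage-config", "-n", "webapp.allowed.origins", "-v", ",".join(merged)]


sample = 'Property "webapp.allowed.origins" has value "https://example.com,10.160.2.27"'
extra = ["https://portal.cloud.example.com", "https://example.com"]
print(" ".join(merged_origins_command(sample, extra)))
```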

Apologies to anyone who read the white paper and the earlier version of this post and got caught out by this. Thanks to Sean and Shady for their help with this and Tomas for, as always, knowing the answer, which collectively led to this lengthy but important update.

Hopefully, if you made it this far (firstly thank you!), you now understand how vCD 10.0 simplifies resilient access to a multisite association, even if customers don’t have Orgs at all the sites. If you have questions or comments about this post, please feel free to leave a comment below. If you have a specific question about your own vCD solution, or that of your Provider, please get in touch via the contact page or via your own local VMware representative and we’ll do our best to help.

In what you could be mistaken for thinking was great planning between us, just as this post was published, Daniel released a post on the association process for vCD 10.0 over on his excellent blog Clouds, etc. Check out vCD 10.0 Multi-Site Pairing Guide with Postman, which updates some of the sequences I described in the white paper.

Closing Credits

Special thanks to (Abdullah)^2 whose original question on Slack prompted this post, and my (caring) colleagues in the vCD development team for help in keeping my mistakes in this post to an absolute minimum!
