An experience in the magical realm of Google VPC Networks

This is a simple story about an unexpectedly complex situation. The situation itself is bog-standard and trivial: developers wish to use a Memory Store tool, called Redis, as a cache and transient database storage layer behind an API. Simple concept; we ask the developers about ten questions on what their intended use-case and workflow will look like. Once we have those answers, sizing the cluster is normally a simple application of logic, tempered by experience and a sensitivity to cost.

I won't bore you with the questions, as they vary, but all have a common theme: if one understands the basics of quantity, quality, durability, and usage patterns early on, and plans for them properly, the end experience, from dev to consumer, is a positive one. An example of why this is important paints the picture in one go: I once encountered a very expensive Redis installation which was over-provisioned by about two-thirds. While active, this 3-node cluster was several years out of date. I was told to push the button to upgrade it. The devops in me checked a few things. First, was it backed up? Second, did it matter whether it was backed up? Third, had any alarms gone off? I was admonished for asking.

I got lucky and mentioned the lack of backups at a standup with developers, in a meeting completely unrelated to my task, conditions, and standards, and the result was a full search-and-discovery adventure. My caution saved us from an end-of-sprint disaster. If any of us devops had clicked the little upgrade button on the poorly documented and poorly designed cluster, all production data would have been lost.

No backups existed. Three years of client data vanishing on a Friday (when I originally got the go-ahead) would most likely have vaporized a project with over 40 developers and support staff. The developers had no idea their data was transient. Since they hadn't asked for backups, none had ever been taken.

It was a near-miss.


An alternate story, of similar head-sweat, was a ticket to clone an application ecosystem, Redis nodes included, for a small project which was in maintenance mode and being transitioned to a new team. Rather than pay developer time to add a client, the business chose to literally clone the project in its entirety and launch it under a new API endpoint for said new client. In this case, I followed the same checks and investigation of the unmanaged, long-neglected assets.

My initial shock was that this project was paying around $68k per year for Redis. The Lambda functions which served the application and wrote to Redis cost about $150 per year in comparison. Assessment of the load indicated the application was touched a few times a day.

I could have blindly followed the request. Instead, I asked those same questions. Backups? Persistence? Usage needs? Health? What I found was that those Redis nodes, in high availability, with multiple large read replicas in four different environments, were indeed in use. They were each written to exactly once every six hours, in every environment. The write was of a single key with an expiration of six hours. There were no reads. There had been no reads going back to the last commit and deployment of the application to prod, two years and seven months prior. No developer had ever implemented a read feature in the code. Ever.

In the end, I was able to gently talk the business out of wasting a full $132k the following year, and save a little more by cutting out the single write: a 6-hour auth token from an external vendor, which can be pulled 25k times an hour in less than 200ms. Okay, now back to the primary story: Devs Want Redis. What does one expect to do?

In a classical server room model, which many of us come from, there is a switch. If you plug into that switch, you get a port which lets you DHCP or assign an IP on, say, a private /20 if you are lucky. Or, as in my case, all your kit had public IP's (I started out in education, when every building got a /24 and you were on your own). Either way, you think in terms of: Cool, New Appliance. Has ETH port. Grab Cat-5, plug in. Get IP. Route or secure in firewall as necessary!

I learned early to have a network in my office with NAT, and any new device or server got set up there first, then moved from the private zone to the server room. After setting up the firewall rules, of course! For a Redis black box, this is simple. It's one port and one IP address. And if you have good DNS hygiene, one DNS entry, so you can change IP's as needed.

Take that plug-and-chug expectation and move it to the AWS cloud. ElastiCache has a Redis implementation which has been quite good to me over the years. My server room model translates right onto the network. I have a VPC with a dedicated /20 CIDR for each environment. I have a subnet already set up, by habit, for things like RDS instances and Redis, or any other Data Store Thing. I create a security group for any application which needs Redis, or for a specific application if we have many Redis clusters and need isolation. Clicking through a few buttons, or launching 2-3 stanzas of Terraform code, gets us a fully minted Redis cluster with any trimmings we plan for, accessible inside our VPC with common routing rules, and accessible outside of our VPC with minimal fuss if needed, and if the proper IPSEC/VPN/peering wizardry is invoked.
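As a rough illustration, those "2-3 stanzas" might look something like the sketch below. The variable names, node type, and CIDR assumptions are mine, not from any real project; it assumes the VPC and data subnets already exist.

```hcl
# Sketch only: names, sizes, and variables are illustrative assumptions.
resource "aws_elasticache_subnet_group" "redis" {
  name       = "app-redis-subnets"
  subnet_ids = var.data_subnet_ids # the pre-existing "Data Store Thing" subnets
}

resource "aws_security_group" "redis" {
  name   = "app-redis"
  vpc_id = var.vpc_id

  ingress {
    from_port       = 6379
    to_port         = 6379
    protocol        = "tcp"
    security_groups = [var.app_security_group_id] # only the application may connect
  }
}

resource "aws_elasticache_replication_group" "redis" {
  replication_group_id       = "app-redis"
  description                = "Cache / transient store behind the API"
  engine                     = "redis"
  node_type                  = "cache.t4g.small"
  num_cache_clusters         = 2
  automatic_failover_enabled = true
  subnet_group_name          = aws_elasticache_subnet_group.redis.name
  security_group_ids         = [aws_security_group.redis.id]
  snapshot_retention_limit   = 7 # the backups the first story was missing
}
```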

The process is identical, really. Ensure ingress and egress are to condition and standard, by subnet or, better, by security group association. Create the cluster/instances. (And don't forget to back them up! And put alarms on them for swap usage! And monitor the CPU!)
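One of those alarms, as a hedged sketch: the metric and namespace are real CloudWatch/ElastiCache names, but the threshold, cluster ID, and SNS topic variable are placeholders of my own choosing.

```hcl
# Illustrative swap-usage alarm; threshold and topic ARN are assumptions.
resource "aws_cloudwatch_metric_alarm" "redis_swap" {
  alarm_name          = "app-redis-swap-usage"
  namespace           = "AWS/ElastiCache"
  metric_name         = "SwapUsage"
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 3
  comparison_operator = "GreaterThanThreshold"
  threshold           = 52428800 # ~50 MB of swap is already a bad sign for Redis

  dimensions = {
    CacheClusterId = "app-redis-001"
  }

  alarm_actions = [var.ops_sns_topic_arn]
}
```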


Now, transition that experience, with the same expectations, to GCP. On first read, GCP's Memorystore for Redis has some very strong pluses. The backup options and maintenance scheduling are better. The encryption options are easier to provision. The versions are fast to update and are almost one-to-one with the latest offerings from the Redis community. Okay! This is good.

Now... where do I put this thing?


Turns out, there are two options: inside a network owned by the project, or inside a network shared with the project. I'm not going to go through the first (a network owned by the project) because, while that is the simpler case, it is not the common model prescribed for large corporate projects where, just like in AWS, you may have networks 100% owned by a core networking team and you are just given slices or access rights to use them. Instead, this story follows the second case: a network which is owned by a host project and shared with you.

Since the documentation says this is possible, and it is possible, let's walk through the ether together. To start the process, assume we know and can view the VPC name in the Host Project, and that it's already shared with our Service Project Which Needs Redis.

Let's think this through and connect it to our server room model of operations. We have a switch in our next-door neighbor's house. [ Host Project ] We are in our house. [ Service Project ]

In order to use that switch in the house next door, we have to do some things before we can plug OUR Redis appliance into OUR/THEIR network. The classical ops in me would assume this is a three-step dance. First, run a cable; second, adjust the firewall; third, verify or adjust routes.

In GCP... it's not. Before we can begin building anything, we must enable APIs (via google_project_service) for all the components involved, some in both projects, some in just one. Doing both for safety is at least eight operations. Now we can get started!
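A minimal sketch of what those eight operations might look like in Terraform. The exact service list is an assumption; what you actually need depends on what the projects already have enabled.

```hcl
# Sketch only: enable the needed APIs in both the host and the service project.
locals {
  services = [
    "compute.googleapis.com",
    "servicenetworking.googleapis.com",
    "redis.googleapis.com",
    "networkmanagement.googleapis.com",
  ]
}

resource "google_project_service" "host" {
  for_each           = toset(local.services)
  project            = var.host_project_id
  service            = each.value
  disable_on_destroy = false
}

resource "google_project_service" "service" {
  for_each           = toset(local.services)
  project            = var.service_project_id
  service            = each.value
  disable_on_destroy = false
}
```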

First, we must ask for a subnet to be created in the network on that switch, which we may not actually ever use directly. It's sort of there as a beach-head. Second, we must request that a virtual range of IP addresses be carved out, outside of anything which will ever be classically used by anything physically associated with the primary switch in the neighbor's house. It can't overlap with anything else. The suggestion is to make this virtual extension zone a /16; a /24 is the minimum.

We will never EVER have direct access to this zone, as it is External, For Services. No subnet can ever be created in future inside the network which overlaps it. All of this extended space is essentially A Ghost Switch Above Our Neighbor's House.
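That carved-out virtual extension zone corresponds to an allocated range for private service access in the host project's shared VPC. A hedged sketch, with the range name, prefix length, and variables as assumptions (the beach-head subnet itself is not shown):

```hcl
# The "Ghost Switch": an allocated range for private service access,
# reserved inside the host project's shared VPC. Names/sizes are illustrative.
resource "google_compute_global_address" "private_service_range" {
  project       = var.host_project_id
  name          = "memorystore-psa-range"
  purpose       = "VPC_PEERING"
  address_type  = "INTERNAL"
  prefix_length = 24 # /24 is the minimum; /16 is the suggested default
  network       = var.host_vpc_self_link
}
```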

Okay, we've now done that. But can we create a Redis instance yet? And start using it?

Alas, no. We must also request that a demonologist, a druid, and a farmer step in to anoint a connection down from the Ether, from that Ghost Switch In The Clouds, back into the switch fabric of our neighbor.
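In Terraform terms, that anointed connection is the service networking peering between the host VPC and Google's service producer network, using the range allocated above. Again a sketch, assuming the same hypothetical variable names:

```hcl
# Peer the host project's shared VPC with Google's service producer network,
# handing it the allocated "Ghost Switch" range.
resource "google_service_networking_connection" "private_service_access" {
  network                 = var.host_vpc_self_link
  service                 = "servicenetworking.googleapis.com"
  reserved_peering_ranges = [google_compute_global_address.private_service_range.name]
}
```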


Then we can create a Memorystore Redis cluster, which will live in a /28 carved out of that Virtual Extension Zone, up in the Ghost Switch. Questions remain: How do we update firewall rules against resources which can change within that range? How many IP's do we need in that external range for Google services? Is a /24 enough, or should we use the default of /16 to be safe? Is there even a way of creating an alarm on Externally Allocated Range IP Exhaustion? (Which IS a thing!)
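For completeness, a sketch of that final instance, created from the service project but attached to the host project's shared VPC via the private service access connection. Tier, size, region, and names are all assumptions:

```hcl
# The Memorystore instance itself, landing in a /28 inside the Ghost Switch.
resource "google_redis_instance" "cache" {
  project            = var.service_project_id
  name               = "app-cache"
  tier               = "STANDARD_HA"
  memory_size_gb     = 5
  region             = "us-central1"
  authorized_network = var.host_vpc_self_link
  connect_mode       = "PRIVATE_SERVICE_ACCESS"

  depends_on = [google_service_networking_connection.private_service_access]
}
```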

I can only ask why. Why on earth is it this hard to launch Memorystore Redis nodes inside a network space in a classical way? Why must they be carved out extra-network like this? I have been administering Memcache(d) and Redis cache services for quite a while. In that time, I've almost completely moved away from self-managed VM's running Redis. For the first time in the last six years, I'm actually considering going with self-hosted.

In the end, we hope that adding a rare, anointed Network Admin Management role will grant us the necessary power to create that Final Connection between the Ghost Switch and the Virtual Neighbor Switch.
