Record Locators Considered Useful

You know what record locators are, right? Those little 6-8 character random strings that you get with your plane ticket? Have you ever thought about how they're generated and what they mean?

Record Locators Considered Useful
Photo by CardMapr.nl / Unsplash

Whenever you're providing content someone else - either directly or over an API - it's useful to hand out an ID so that when they inevitably return, you can look them up quickly and reliably. That ID should be unique and not leak too many details about your business or your infrastructure.

One common technique is to use a serial number - this remains common on invoices, for some reason.  The advantage here is that unque serial numbers are readily generated by all relational databases, they're small, and they're really fast to use as indexed keys.  The disadvantage is that they make it very obvious to your users how many records have passed between interactions.  This may be good if you're a high volume retailer, but possibly bad if you're a startup trying to act bigger than you really are.

You can try to generate random strings, an approach that's surprisingly common, but they're slightly harder to look up and you have to check them after creation to make sure that they're really unique.  This can be mitigated by making them longer, or using UUIDs, but now the disadvantage is that they can't reliably be transcribed over the phone or even re-keyed without using cut and paste.

This brings us to record locators.  You may have even used them before when doing self-check-in at an airport.  With only 6-8 characters it might seem that they're cycle through them pretty quickly, but that's where exponential math comes in.  At their longest, an 8 character case-insensitive locator has about 2,821,109,907,456 possible values.  Probably more than enough.

That's such a large namespace in fact that we can afford to waste most of it.  Let's start by getting rid of the letters I and O (easily confused with ones and zeros).  That brings us down to only 1,785,793,904,896 options.  Dropping down from eight to six characters gets us to 1,544,804,416.  That's a lot less, but its still one and a half billion - getting close to 2,147,483,647 which is the maximum size of a 32 bit integer.  That means that some very simple math can turn this base 34 number into a normal boring base 10 number that can be generated by - and referenced by - a normal database system.

Assuming that 31 bits of data should be enough for anyone, we still have the problem that a naïve implementation will make sequencing very obvious.  We'd start with 000001, 000002, 000003, ... 00000Y, 00000Z, 000010 and so on.  What we really want is to figure out a way to get that to visually appear random without actually being random, and fortunately there's an easy - and reversible - way to do that.

We can take advantage of the fact that since we're going between bases, large but obivous changes in one base become large, non-obvious changes in the other.  If we were to jump from 100,000,000 to 200,000,000 - a large and obivous change in a base 10 number, we'd actually be going from 26S96G to 4DMICW in straight base 34.  And the easist way to make our input number jump in large, reversible ways is, as you may already have guessed, to pad it and reverse it.

Now with a maximum input of 1,544,804,415, we obivously can't use all the space, because reversing 2 would get us to 2,000,000,000 which would overflow.  The easiest way to guarantee a fit will be to set an upper bound of 999,999,999 - still very reasonable for most systems.  This allows us to simply pad the input decimal out to 9 places, reverse it, convert it to base 34, left pad it back out to 6 places with an initial Y, remap zero to Y and one to Z, and call it good.

Moving to seven characters comfortably covers the integer space, and 13 characters is enough to safely hold a long should you really need to.  Allowing all characters and lower case the way that Youtube and Bitly do would only bring those numbers down to six characters for a 32 bit int, and only 11 for a long - which to my mind doesn't justify the potential increase in confusion.

Either way, your APIs will be both clean and opaque, and when the backend system gets one of these locators they can very efficiently turn them into a numberic database identifier.

java-utilities/KeyUtil.java at main · rjstanford/java-utilities
Contribute to rjstanford/java-utilities development by creating an account on GitHub.