An argument against defaulting to auto-increment primary keys

Shannon Griswold · Databases, Java

The de facto “easy” primary key has long been the auto-incrementing integer. Upon insertion of a new record, the database simply increments a per-table counter and uses the new value as the primary key of the inserted row. It’s simple to set up, requires no maintenance, and guarantees a unique, predictable key.

For my own purposes, I was under the impression that I should always use auto-incrementing primary keys. They were an obvious choice for many of my databases, and I never needed to consider alternatives.

Then the cloud came along, with all of its distributed fanciness.

Specifically, I started a project on top of Google App Engine (GAE), an auto-scaling application runtime hosted by everybody’s favorite search monster. App Engine provides a distributed cloud database following a NoSQL paradigm called the Datastore. You store an object (POJO in my case) using a Long or String identifier. The Datastore is a powerful facility that couples well with App Engine’s scalable application runtime architecture.

I started My Gear Pack, a cloud-backed Android application and Google Web Toolkit (GWT) client used for helping outdoor savvy folks organize their gear. I immediately, without even thinking about it, constructed all of my Entities (that’s what GAE calls them) using auto-generated Long IDs. Upon insertion of a new Entity, the Datastore would automatically assign a Long that would be guaranteed to be unique.

This was rather convenient, as I wouldn’t have to manage ID generation code and handle any conflicts.

Then a user requested a new feature. They wanted to be able to use the Android application to manage their gear lists offline, without a data connection.

No problem, I thought to myself! I’ll just sever my layers and create a replicated database on their mobile device. I wrote a lot of code and spent hours designing my grand scheme to store the local Entities (in Android) with an empty “remoteID” field that would be populated when they were eventually synchronized with the Datastore, which would create the real Long ID.
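For illustration, a rough sketch of what that local replicated entity might have looked like; every name here is assumed for the example, not taken from the actual app:

```java
// Rough sketch (names assumed) of the replicated local row: the entity
// lives on the device under its own local key, and remoteId stays null
// until a sync round-trip lets the Datastore assign the real Long ID.
class LocalGearList {
    private final long localId;   // e.g. the SQLite rowid on the device
    private Long remoteId;        // null until first successful sync
    private final String name;

    LocalGearList(long localId, String name) {
        this.localId = localId;
        this.name = name;
    }

    boolean isSynced() { return remoteId != null; }

    // Called after the Datastore insert returns the generated ID.
    void markSynced(long datastoreId) { this.remoteId = datastoreId; }

    long getLocalId() { return localId; }
    Long getRemoteId() { return remoteId; }
    String getName() { return name; }
}
```

Every read of the local row then has to branch on whether the remote ID exists yet, which is exactly where the complexity piles up.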

It was complicated, and I had to walk away from coding several times out of frustration. I would glare at my dogs as if they had wronged me terribly. Ultimately dropping to the floor and wrestling around with them made things better. It always makes things better.

Solution: Universally Unique Identifier

After significant research and some experiments with various possible solutions, I landed on one of the most underrated features of modern programming languages, the Universally Unique Identifier (UUID).

A UUID is a pseudo-randomly generated 128-bit number, pure and simple. A random (version 4) UUID carries 122 random bits, so the odds of two independently generated values colliding are so vanishingly small that it is generally accepted as unique for all practical purposes; our meager Earthling brains can’t fathom a way to generate a duplicate.

UUIDs are frequently represented as hexadecimal Strings. So instead of using the Datastore-generated Long IDs, I built a prototype version of my entire project stack (GAE Java backend, GWT client, and Android client) using Strings for Entity identifiers. This enabled my offline clients to generate IDs when a user wanted to create a new gear list or gear item without being connected to the Datastore.
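In Java terms, the switch is tiny. Here is a minimal sketch (class and field names are mine for illustration, not from the actual app) of an entity that generates its own String identifier at construction time, with no round-trip to the Datastore:

```java
import java.util.UUID;

// Sketch of a client-side entity that assigns its own String ID,
// instead of waiting for a Datastore-generated Long.
class GearItem {
    private final String id;
    private final String name;

    GearItem(String name) {
        // UUID.randomUUID() produces a version 4 (random) UUID; its
        // canonical form is a 36-character hex-and-dash String, e.g.
        // "f47ac10b-58cc-4372-a567-0e02b2c3d479".
        this.id = UUID.randomUUID().toString();
        this.name = name;
    }

    String getId() { return id; }
    String getName() { return name; }
}
```

Because the ID is minted on the client, an offline device can create entities freely and sync them later without any key conflicts.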

The solution turned out to be better than I could have anticipated. I could still use my replicated database design to synchronize back and forth between the clients and the cloud backend, and I could now uniquely identify any Entity of any type in the Datastore.

Life was perfect! I couldn’t have crafted a better architecture if I had chugged a couple more Mountain Dews…

Then I considered deployment.

How was I going to migrate all of my Datastore Entities (roughly 10,000) to use String identifiers instead of Long?

Stay tuned for Part 2 for a simple technique to mass-migrate Datastore schemas and retain Entity relationships.

— Shannon Griswold


Comments

  1. The title is not really correct: “An argument against defaulting to auto-increment primary keys”. The default should still be auto-increment primary keys because 99.9 percent of tables benefit from their use.

    New title: “When auto-increment primary keys just won’t do”.

    And the price you pay is performance. I hope the next article tells how much slower the keys are when doing joins and stuff.

    1. Steve, you are absolutely correct! I was just having this discussion with Zach Gardner (of Keyhole) and we ascertained that in highly data-intensive applications, especially scientific computing, there may be an overall application performance benefit by retaining numerical primary keys.

      The appeal of UUID String keys in this instance is specifically for distributed cloud applications that require offline management, in which case there is no discernible performance hit when using String keys because much of the user’s interactions are taking place disconnected from the database.

      The title is intended to convey that a developer should not automatically utilize auto-incrementing keys without first considering the problem space. In this case, I would have been better off having considered my particular problem before assuming that I should use numerical keys.

      Thank you for your feedback, Steve!

  2. When using a UUID, could it be used as a Long primary key by skipping the hex-String conversion (or by converting the hex back to a long number)? I figure if you allow manual input of the primary key instead of auto-increment, then you could still keep the offline structure, and the UUID implementation would make synchronizing offline-created data smooth, with no duplicate-key problems.


    1. That’s a good thought; unfortunately, the long primitive in Java (64 bits) and the numerical keys in the GAE Datastore are too small to hold the 128 bits of a UUID in its raw form.
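To make the size mismatch concrete: `java.util.UUID` exposes its 128 bits as two 64-bit halves, so a single long can only ever hold half of one. A small sketch:

```java
import java.util.UUID;

// Demonstrates why a UUID cannot fit in one 64-bit long:
// the JDK itself models it as two longs.
class UuidBits {
    // Returns the two 64-bit halves of the 128-bit UUID.
    static long[] halves(UUID id) {
        return new long[] {
            id.getMostSignificantBits(),
            id.getLeastSignificantBits()
        };
    }
}
```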

      1. But there is no need to use the standard verbose UUID string representation; something shorter would work, like


        You could probably find something even shorter
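One possible shorter form (my own assumption, not necessarily what the commenter had in mind): URL-safe Base64 over the 16 raw bytes yields a 22-character key instead of the 36-character canonical string.

```java
import java.nio.ByteBuffer;
import java.util.Base64;
import java.util.UUID;

// Encodes a 128-bit UUID as a 22-character URL-safe Base64 string,
// shorter than the 36-character canonical hex-and-dash form.
class ShortUuid {
    static String encode(UUID id) {
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.putLong(id.getMostSignificantBits());
        buf.putLong(id.getLeastSignificantBits());
        // withoutPadding() drops the trailing "==", leaving 22 chars.
        return Base64.getUrlEncoder().withoutPadding()
                     .encodeToString(buf.array());
    }
}
```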

  3. Hi,

    Good points, but you are covering a lot of ground which has already been covered. See these links:



    Performance in MSSQL:

    And my preferred solution:

    The primary key should be a natural key from the identifying attributes of the tuple. For example, in a table of employee names, the employeeid is a candidate for being the PK (but only if it is guaranteed to never change). If your tuples do not have a natural candidate key for selection as the primary key, then most likely you need to perform further normalisation.

    Now, in many cases, further normalising your data is not an attractive option, usually because it is just too onerous. In these cases, people tend to add a uniquely identifying attribute to the tuple (an incrementing integer, UUID, IDENTITY, etc.). However, this attribute has zero, nada, zilch, nothing whatsoever to do with the tuple. It is entirely external, technical, and irrelevant to the meaning of the data. It therefore has no value beyond fingerprinting a specific tuple, which otherwise would not be uniquely identifiable from n other tuples. Ask yourself: “Why can I not identify this row from the rest of the rows; why are there multiple identical tuples?” Solving that will solve your PK woes, young Padawan.

    This is relational theory hell, and you shouldn’t go there.

    1. Good point about natural keys! I love them and use them heavily, unfortunately my problem space is primarily user-entered data which is designed to allow duplicates at all levels. For many of my Entity classes, there is no natural key.

      I’m specifically discussing the App Engine Datastore, a distributed noSQL repository which does not allow composite keys and has the limitation that an altered key creates a new entity. If one were using a traditional relational database, one would most certainly have more control.

  4. Hi Shannon Griswold,

    Just another thought on the primary keys:
    When it comes to primary keys, you can keep an auto-increment primary key for improved performance on primary-key queries (which is the usual case) and add another unique key that pretends to be the primary: an alphanumeric “Library of Congress”-style identifier (LOC), which the user sees as the primary key. This LOC does much of the same work as a UUID for us, but it can be displayed to the user as the primary key (unlike a boring set of numbers or alphanumeric values). An LOC is usually built from a set of the row’s own values; for an Employee, it could be Employee Number, DOB, Date of Joining, and so on. So the LOC would be something like EMP1_0512_20121201.

    Compared to a UUID (128 bits), an LOC will always be unique because what goes into it depends on the data; it differs from table to table, design to design, structure to structure. One more advantage of an LOC is that it’s clean and easily readable: by just looking at it, we know what this row of data corresponds to (unlike with a UUID).
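For illustration, the LOC construction described above might be sketched like this; the field names and date formats are assumed to match the EMP1_0512_20121201 example:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Sketch of an LOC-style human-readable key derived from the row's
// own attributes: employee number, DOB (month-day), and joining date.
class EmployeeLoc {
    static String loc(String empNo, LocalDate dob, LocalDate joined) {
        String dobPart = dob.format(DateTimeFormatter.ofPattern("MMdd"));
        String joinPart = joined.format(DateTimeFormatter.ofPattern("yyyyMMdd"));
        return empNo + "_" + dobPart + "_" + joinPart;
    }
}
```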

    What’s your thought on this?


    1. Triguna, that sounds a lot like natural keys that Richard Quinn mentioned above. However, instead of creating a multi-column composite key for the uniquely identifying tuple, you are suggesting to combine them into a single string that is then stored as an indexed secondary key, is that right?

      There is value in the human-readable aspect of what you’re proposing. An identifier that can be used by the user for the purposes of search and data entry should be as easy to consume as possible just like you have defined.

      If the human-readable identifier is truly unique, the auto-increment key is redundant.

      I don’t think that we should fear string primary keys. I was an architect with a company that used string primary keys on both SQL Server and Oracle with billion-record tables and once the tables were indexed correctly, the database was not a bottleneck. There are certainly exceptions, but I think that most user-input-bound applications can safely use properly-indexed string primary keys with negligible performance considerations.

      Thanks for the suggestions, a human-readable identifier can be quite useful!

    2. Hi Triguna,

      Your suggestion is intriguing, but raises a question…

      If one component of your artificial PK changes (and these things do change because of incorrect accession and subsequent correction) then the PK you created will no longer be truthful. The meaning that people have read into it, and continue to associate with it, is wrong.

      Of course you could change the PK and cascade the update, but for any external system which had created an immutable reference (compliance, financials) they would now be in a very bad place, the references would be broken.

      For a further discussion, I refer you to the C2:
