Base64-encoded strings are something that we as software developers encounter constantly in our day-to-day work. Yet we usually pay little mind to how the encoding operates and what it is actually doing for us.
In this post, we will help shed some light on Base64. We will discuss what it is, how it works, and why we choose to use it.
In June 1992, the MIME specification (RFC 1341) defined a group of methods for representing binary and other non-ASCII data as ASCII text, selected through the Content-Transfer-Encoding header. One of the methods described in the RFC is Base64.
Base64 is a group of binary-to-text encoding schemes that represent binary data in an ASCII string format by translating it into a radix-64 representation.
To put it in slightly more practical, simpler terms, Base64 is an encoding scheme used when binary data needs to be stored in or transferred over media that only reliably deal with textual data. One prevalent use case today is embedding images or other binary assets inside textual HTML and CSS assets.
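As a rough illustration of the embedding use case, the sketch below builds a `data:` URI from a few bytes of image data using Python's standard `base64` module. The tiny SVG document and the variable names are placeholders chosen for this example, not something from a real project.

```python
import base64

# Hypothetical asset: a minimal 1x1 SVG image, used only for illustration.
svg_bytes = b'<svg xmlns="http://www.w3.org/2000/svg" width="1" height="1"/>'

# Base64 turns the raw bytes into ASCII text that is safe to place in markup.
encoded = base64.b64encode(svg_bytes).decode("ascii")

# A data URI carries the media type, the "base64" marker, then the payload.
data_uri = "data:image/svg+xml;base64," + encoded
img_tag = f'<img src="{data_uri}" alt="">'

print(img_tag)
```

Decoding the part after the comma with `base64.b64decode` gives back the original bytes unchanged.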
How It Works
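As the name implies, the Base64 encoding scheme encodes binary data into an ASCII string using a 64-character alphabet: the uppercase letters A–Z (values 0–25), the lowercase letters a–z (26–51), the digits 0–9 (52–61), and the symbols + and / (62 and 63).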
The characters chosen are common to most encodings and are also printable, leaving the data unlikely to be modified in transit through systems that were not always traditionally 8-bit clean.
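For reference, the standard alphabet (as specified in RFC 4648) can be written out in a few lines of Python. This is a minimal sketch; the names `ALPHABET` and `is_printable_ascii` are chosen here for illustration.

```python
import string

# Standard Base64 alphabet: A-Z (0-25), a-z (26-51), 0-9 (52-61), '+' (62), '/' (63).
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def is_printable_ascii(ch):
    """True if ch falls in the printable ASCII range (space through tilde)."""
    return 0x20 <= ord(ch) <= 0x7E

print(len(ALPHABET))                                # 64
print(all(is_printable_ascii(c) for c in ALPHABET)) # True
```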
With 64 characters, we can encode 6 bits of data per character (2^6 = 64), meaning that Base64 can encode 3 bytes (24 bits) of data per 4 characters. As a result, the length of a padded Base64-encoded string will always be a multiple of 4. Even if you encode just a single byte, you will still have an encoded string with a length of 4.
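The length arithmetic can be checked with a small sketch. The helper name `encoded_length` is made up for this example; the comparison uses Python's standard `base64` module, which emits padded output.

```python
import base64

def encoded_length(n_bytes):
    """Length of padded Base64 output: 4 characters per group of up to 3 bytes."""
    return 4 * ((n_bytes + 2) // 3)

# Compare the formula against the standard library for a few input sizes.
for n in range(0, 10):
    actual = len(base64.b64encode(bytes(n)))  # bytes(n) is n zero bytes
    assert actual == encoded_length(n)

print(encoded_length(1))  # 4
print(encoded_length(3))  # 4
print(encoded_length(4))  # 8
```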
Learn By Example
To better understand the encoding process, let’s walk through a simple example of encoding the ASCII string Key. Note that normally there would be no need to Base64 encode this string: it is already plain ASCII, so it can be safely transferred across any system that can handle Base64 output. It is used here purely for demonstrative purposes.
To start, each character is converted to its corresponding octet value, giving us eight bits of data for each character to encode. From here, groups of six bits are defined starting from the left and working right. With these sextets defined, you can then map each to a Base64 encoded character defined in the alphabet. So for this example, Key becomes S2V5 when Base64 encoded.
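The steps above can be traced in code. This is a minimal sketch that only handles inputs whose length is a multiple of 3 bytes, so no padding is involved; `to_sextets` and `ALPHABET` are names picked for this example.

```python
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def to_sextets(data):
    """Concatenate each byte as 8 bits, then cut the bit string into 6-bit groups."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return [bits[i:i + 6] for i in range(0, len(bits), 6)]

sextets = to_sextets(b"Key")
encoded = "".join(ALPHABET[int(s, 2)] for s in sextets)

print(sextets)  # ['010010', '110110', '010101', '111001']
print(encoded)  # S2V5
```

The four sextets have the values 18, 54, 21, and 57, which index the characters S, 2, V, and 5 in the alphabet.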
This is a fairly straightforward conversion, but what happens when the math doesn’t work out perfectly and you have some bits left over? Well, you pad, of course! Base64 defines ‘=’ as a special padding character and uses it to ensure that the length of the encoded string is a multiple of four characters.
Let’s see what that looks like in action by just encoding
As you can see, the encoding process begins as normal until we reach the third sextet. Since there is not enough data to complete the encoded character, the two least significant bits of the last content-bearing 6-bit group will be zeroed out and subsequently discarded upon decoding. Finally, the padding character is added to finish out the four-character group, and it will also be discarded upon decoding.
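To make the padding step concrete, here is a sketch of an encoder that zero-fills the final sextet and then appends `=` characters. The two-byte input `Ke` is an assumption made for this example, chosen because it leaves an incomplete third sextet; the function name `encode_with_padding` is likewise illustrative.

```python
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def encode_with_padding(data):
    """Base64-encode bytes by hand, zero-filling the last sextet and adding '='."""
    bits = "".join(f"{byte:08b}" for byte in data)
    # Zero-fill on the right so the bit string splits evenly into sextets.
    bits += "0" * (-len(bits) % 6)
    chars = "".join(ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6))
    # Pad with '=' until the output length is a multiple of 4.
    return chars + "=" * (-len(chars) % 4)

print(encode_with_padding(b"Ke"))  # S2U=
print(encode_with_padding(b"K"))   # Sw==
```

For `Ke`, the 16 input bits fill two sextets and leave 4 bits over; two zero bits complete the third sextet, and a single `=` completes the four-character group.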
With padding, there are 3 scenarios that could be encountered when encoding data.
- There are 0 bits leftover – No padding is needed as there is nothing left to encode
- There are 8 bits leftover – 2 padding characters are needed as the 8 bits will be encoded into 2 characters, leaving 2 needed to fill the 4 character group.
- There are 16 bits leftover – 1 padding character is needed as the 16 bits will be encoded into 3 characters, leaving 1 needed to fill the 4 character group.
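The three scenarios can be verified against Python's standard `base64` module. The helper `padding_count` is a name invented for this sketch; it derives the number of padding characters from the input length alone.

```python
import base64

def padding_count(n_bytes):
    """Number of '=' characters: 0, 2, or 1 for 0, 1, or 2 leftover bytes."""
    return (3 - n_bytes % 3) % 3

# 3 bytes -> 0 bits left over, 1 byte -> 8 bits, 2 bytes -> 16 bits.
results = {}
for data in (b"Key", b"K", b"Ke"):
    encoded = base64.b64encode(data).decode("ascii")
    results[data] = encoded
    assert encoded.count("=") == padding_count(len(data))

print(results)
```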
So at this point, you may be asking yourself why we use Base64 and not Base32 or even Base128. Base32 does exist and is in use, but Base64 is the most popular binary-to-ASCII encoding because 64 is the largest power of 2 that fits within the 95 printable ASCII characters. A Base128 alphabet would need more printable characters than ASCII provides. Using a power of 2 lets each character carry a whole number of bits, and choosing the largest one that fits keeps the size overhead as small as possible. With Base64, you encode 3 bytes into 4 characters, so the output is 4/3 the size of the input, an overhead of roughly 33%.
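The size trade-off can be sketched numerically. The helper `expansion_factor` is an illustrative name, and the 30-byte sample is arbitrary, chosen so that neither encoding needs padding; `b32encode` and `b64encode` come from Python's standard `base64` module.

```python
import base64
import math

def expansion_factor(alphabet_size):
    """Output characters per input byte for a power-of-2 alphabet, ignoring padding."""
    bits_per_char = math.log2(alphabet_size)
    return 8 / bits_per_char

sample = bytes(range(30))  # 30 bytes: divisible by both 3 and 5, so no padding

b32_len = len(base64.b32encode(sample))
b64_len = len(base64.b64encode(sample))

print(b32_len, expansion_factor(32))  # 48 1.6
print(b64_len)                        # 40
```

Base32 carries 5 bits per character and grows the data by 60%, while Base64 carries 6 bits per character and grows it by about 33%.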
Hopefully, after this short exercise, you have gained a deeper understanding of the fundamentals of Base64 encoding and an appreciation for its simple yet powerful inner workings.