Encoding in Web Development. Why? How? URL, JSON, Base64 & beyond

A 10-minute story written in Oct 2017 by Adrian B.G.

Encode: verb, gerund or present participle: encoding. Convert into a coded form. Convert (information or an instruction) into a particular form.

You are already using encoding and decoding functions, but you may not know (yet) why, or what the difference between Encoding, Encryption and Hashing is. This post will guide you through the differences and the most popular code formats: JSON, URL, Base64, HTML, CSV and XML.

In communications and information processing, code is a system of rules to convert information — such as a letter, word, sound, image, or gesture — into another form or representation, sometimes shortened or secret, for communication through a channel or storage in a medium. — Wikipedia

TL;DR 🚩

  • Encoding is the transformation of data into a specific code
  • A code is just a format; unlike encryption or hashing, it doesn’t try to conceal the data

The most common use of an encoding function is to convert data into a format suitable for transport or storage. What is a code, you may ask? Common examples are the formats covered in this post: JSON, URL encoding, Base64, HTML, CSV and XML.

In web development “codes” are usually text or binary formats; you can compare them with languages. In order for 2 parties to communicate (e.g. a server and a client) they need to agree on the language (format) they want to speak (e.g. JSON).

Needless to say, “decoding” is the same process as “encoding” run in reverse: it converts the coded form back into the original data.

Encoding vs Encryption vs Hashing vs Obfuscation vs Minification

Encryption is a process that deliberately alters the data to conceal its content. If you “find” a string that was encoded, you can decode it and see what’s in there. If the text is encrypted you need the decryption key to see the content.

Encryption algorithms: RSA, DES, AES, Blowfish.

A hash function is a one-way mapping of data; you can see it as a non-reversible transformation. It is mostly used to verify data without storing the original. Example: you take a text (a password), hash it (the algorithm is designed so that it cannot be reversed to find out the password), store the hash, and later compare it with the hash of a login attempt to see if the passwords match.

Hashing algorithms: SHA-256, CRC, MD5

In software development, obfuscation is the deliberate act of creating source code that is difficult for humans to understand. The source code of a web application (HTML, JavaScript, CSS) is public, because the browser has to download it in order to run it. Why would you want to obfuscate your code? To protect it from reverse engineering, tampering and stealing. Unfortunately it is often easy to make obfuscated code readable again; for example, Chrome DevTools has a built-in pretty-print function.

You can also obfuscate your data; the process is called data masking.

In web development we also have a technique called minification, which is used, like compression, to optimize a web application.

Minification (also minimisation or minimization) is the process of removing all unnecessary characters from source code without changing its functionality. These unnecessary characters usually include white space characters, new line characters, comments, and sometimes block delimiters, which are used to add readability to the code but are not required for it to execute. — Wikipedia.

Here are a few examples:
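To make the difference between encoding and hashing concrete, here is a minimal sketch in Python that uses only the standard library. The sample text is made up, and encryption is left out because the standard library does not ship a cipher such as AES.

```python
import base64
import hashlib

text = "s3cret password"  # hypothetical sample value

# Encoding: a public, reversible change of format. No key is involved.
encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")
decoded = base64.b64decode(encoded).decode("utf-8")

# Hashing: a one-way mapping. The same input always gives the same digest,
# but no function turns the digest back into the text.
digest = hashlib.sha256(text.encode("utf-8")).hexdigest()

print("encoded:", encoded)
print("decoded back:", decoded == text)
print("sha256 digest:", digest)
```

Anyone who finds the encoded string can run the decode step, while the digest can only be compared with another digest.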

You may get confused because in the real world you will use all of these concepts, multiple times and in any order, for the same operation. For example:

Hash a password, attach it to a data structure, format it as a UTF-8 string, encode it in Base64, encrypt the result with the user’s key, store the result in a JSON document, compress the packet with GZIP and send it through SSL (which is an encrypted tunnel) to a server as an HTTP request, using the HTTP protocol formatting rules.

Related to encoding processes you will find other concepts like:

  • Serialization: copy (encode) the data into a primitive form, such as a byte stream, in order to transport or store it.
  • Marshaling: the term is used in RPC to describe the encoding process before sending the object through the network. Most of the time it is used as a synonym for “encoding” or “serialization”.

Rules 👮🏽

Each code has its own rules and special characters, like a human language needs to have grammatical rules and an alphabet.

If you have a space “ “ in a URL, most probably it will not be recognized by the browser; all the special characters must be encoded (transformed):

https://coder.today?name=Adrian BG
#if you click the URL it will open the URL without " BG" part
#correct URL using the URL encoding function:
https://coder.today?name=Adrian%20BG

Another example is XML, where you have to escape the character < as &lt; because the format uses that character to declare elements.
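Both rules can be tried in a few lines. This is a small sketch using Python's standard library; the XML sample text is made up for illustration.

```python
from urllib.parse import quote
from xml.sax.saxutils import escape

# URL rule: a space is not allowed in a URL, so it becomes %20.
url = "https://coder.today?name=" + quote("Adrian BG")
print(url)  # https://coder.today?name=Adrian%20BG

# XML rule: "<" opens an element, so literal text must use &lt; instead.
xml_text = escape("1 < 2")
print(xml_text)  # 1 &lt; 2
```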

Extensions, encoding & headers 🛩

If you store the encoded data in a file, it is a common practice that the file should have the format name as extension, example: “log-data.json” or “avatar.jpg”. This is just a convention, you can store any data in any file, even multiple data in the same file, it’s up to you, the developer of the system.

On the web we do NOT trust file extensions; instead we rely on HTTP headers: Content-Type declares the format, and Content-Encoding declares any compression applied on top of it.

Response Headers
content-encoding:gzip
content-type:application/json; charset=utf-8
date:Fri, 20 Oct 2017 16:45:42 GMT

Request Headers
:method:GET
accept:*/*
accept-encoding:gzip, deflate, br
accept-language:en-GB,en;q=0.8,en-US;q=0.6,ro;q=0.4

When building clients for APIs, you often have to ask the server “what language do you speak?” (content negotiation) or “do you support gzip?” (the Accept-Encoding header) before initiating a communication channel.
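On the server side, negotiation boils down to reading those weighted lists and picking the best match. The sketch below is a simplified illustration, not a complete implementation of the HTTP rules (it ignores wildcards such as `*`); the function names `parse_weighted` and `choose_encoding` are made up for this example.

```python
def parse_weighted(header):
    """Split 'en-GB,en;q=0.8' into (value, weight) pairs, best first."""
    items = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        if not fields[0]:
            continue
        weight = 1.0  # a value without a q parameter has the default weight
        for param in fields[1:]:
            key, _, raw = param.partition("=")
            if key.strip() == "q":
                try:
                    weight = float(raw)
                except ValueError:
                    weight = 0.0  # treat a malformed weight as "not wanted"
        items.append((fields[0], weight))
    # sorted() is stable, so equal weights keep the order sent by the client
    return sorted(items, key=lambda item: item[1], reverse=True)


def choose_encoding(accept_encoding, supported):
    """Return the client's preferred encoding that the server supports, or None."""
    for value, weight in parse_weighted(accept_encoding):
        if weight > 0 and value in supported:
            return value
    return None


print(parse_weighted("en-GB,en;q=0.8,en-US;q=0.6,ro;q=0.4"))
print(choose_encoding("gzip, deflate, br", {"gzip", "br"}))
```

The two calls at the end reuse the accept-language and accept-encoding values from the request headers above.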

Resilience 🚧

When I talk about HTML, XML, CSS, WWW & TCP/IP I often recommend watching the following video. It is a good history lesson, and you will learn why we have HTML5 instead of XHTML 2.

the talk

Text and data structures 🎨

In web development we mostly work with characters, strings (text) and data file formats. Enough theory, let’s talk about the most common codes.

URL encode 🔗

Also known as percent-encoding (because it uses the % character), it is used to encode information in a Uniform Resource Locator (URL). The same encoding applies to the more general Uniform Resource Identifier (URI).

The reserved characters are !*'();:@&=+$,/?#[], here are some examples:

"h?*$&#!" => %22h%3F%2A%24%26%23%21%22

In JavaScript it is a common mistake, when building dynamic URLs, to forget to call the encodeURI() or encodeURIComponent() functions.
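The same pitfall exists in every language. Here is a minimal sketch in Python, where `urllib.parse.quote` and `urlencode` play a role roughly similar to encodeURIComponent() (the exact set of escaped characters differs slightly between the two languages). The company name is made-up user input.

```python
from urllib.parse import quote, urlencode

# safe="" asks quote() to escape every reserved character, "/" included.
print(quote('"h?*$&#!"', safe=""))  # %22h%3F%2A%24%26%23%21%22

company = "Tom & Jerry"  # hypothetical user input

# Broken: the "&" inside the value would start a second parameter.
broken = "https://coder.today?name=" + company

# Safer: let urlencode() escape each key and value.
fixed = "https://coder.today?" + urlencode({"name": company}, quote_via=quote)
print(fixed)  # https://coder.today?name=Tom%20%26%20Jerry
```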

Encode to Base64 🚛

Base64 is a group of similar binary-to-text encoding schemes that represent binary data in an ASCII string format by translating it into a radix-64 representation. Each base64 digit represents exactly 6 bits of data. — Wikipedia

Although Base64 is designed to send binary data, in web development we use it to send strings and data structures too.

It is a common practice to send Base64-encoded JSON data structures in URLs. In this case some characters used by Base64 must be percent-encoded ('+' becomes '%2B', '/' becomes '%2F' and '=' becomes '%3D'), which makes the string unnecessarily longer. To avoid this, send the data in the request body (payload).
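The growth is easy to see in a short sketch. It also shows an alternative that the text above does not cover: the URL-safe Base64 alphabet, which swaps the two problematic characters so that nothing needs percent-encoding. The input bytes are chosen on purpose to trigger all three special characters.

```python
import base64
from urllib.parse import quote

# Two bytes chosen on purpose so the Base64 output contains "+", "/" and "="
raw = bytes([0xFB, 0xFF])

standard = base64.b64encode(raw).decode("ascii")
escaped = quote(standard, safe="")  # percent-encode it for use in a URL

# URL-safe alphabet: "-" replaces "+" and "_" replaces "/"
url_safe = base64.urlsafe_b64encode(raw).decode("ascii")

print(standard, escaped, url_safe)  # +/8= %2B%2F8%3D -_8=

# The "=" padding can be dropped in a URL and restored before decoding
trimmed = url_safe.rstrip("=")
padded = trimmed + "=" * (-len(trimmed) % 4)
restored = base64.urlsafe_b64decode(padded)
print(restored == raw)  # True
```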

Another common usage of the Base64 format is to store 😎 small images in the JavaScript or CSS source code for performance reasons (you don’t need to make HTTP requests for each image or icon).

Google Images has been using this technique for a long time
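Embedding an image this way means wrapping its Base64 form in a data: URI. A minimal sketch, assuming a tiny made-up SVG icon in place of a real image file; the helper name `to_data_uri` is hypothetical.

```python
import base64


def to_data_uri(data, mime_type):
    """Wrap raw bytes in a data: URI that HTML, CSS or JavaScript can embed."""
    encoded = base64.b64encode(data).decode("ascii")
    return "data:" + mime_type + ";base64," + encoded


# A tiny made-up SVG icon stands in for a real image file
icon = b'<svg xmlns="http://www.w3.org/2000/svg" width="8" height="8"><rect width="8" height="8"/></svg>'
uri = to_data_uri(icon, "image/svg+xml")

# The browser reads the image from the stylesheet itself, with no extra request
css_rule = '.icon { background-image: url("' + uri + '"); }'
print(css_rule)
```

Keep in mind that Base64 text is about a third larger than the bytes it represents, so this trade-off only pays off for small files.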

Encode to JSON 🤖

JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language, Standard ECMA-262 3rd Edition — December 1999. JSON is a text format that is completely language independent but uses conventions that are familiar to programmers of the C-family of languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others. These properties make JSON an ideal data-interchange language. — Wikipedia.

JSON is also used as a storage document format and as a query language for some NoSQL databases.

It is the most common data format in the web ecosystem. You already knew this, so let’s continue.
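For completeness, this is what the encode and decode steps look like in Python's standard `json` module. The record is made up for the example.

```python
import json

# A made-up record mixing the basic JSON types
user = {
    "id": 1,
    "name": "Adrian B.G.",
    "active": True,
    "tags": ["go", "web"],
    "manager": None,
}

text = json.dumps(user)      # encode: Python structure -> JSON text
restored = json.loads(text)  # decode: JSON text -> Python structure

print(text)
print(restored == user)
```

Note how the Python values True and None are written as true and null in the JSON text, and come back unchanged after decoding.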

Data as a text on a letter in a mailbox 📬

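Now that we have learned about URL, Base64 and JSON, let’s see how they fit together in a real example: sending some user data to an API. The data is encoded as JSON, the JSON as Base64, and the Base64 string is URL-encoded into a request parameter.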
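A minimal sketch of such a request in Python could look like this. The endpoint `https://api.example.com/users` and the helper names are hypothetical; only the parameter name myData comes from this post.

```python
import base64
import json
from urllib.parse import parse_qs, quote, urlsplit

API_URL = "https://api.example.com/users"  # hypothetical endpoint


def build_url(user):
    """Client side: data structure -> JSON -> Base64 -> URL parameter."""
    as_json = json.dumps(user)
    as_b64 = base64.b64encode(as_json.encode("utf-8")).decode("ascii")
    return API_URL + "?myData=" + quote(as_b64, safe="")


def read_url(url):
    """Server side: the same steps in reverse order."""
    query = parse_qs(urlsplit(url).query)  # parse_qs undoes the percent-encoding
    as_json = base64.b64decode(query["myData"][0]).decode("utf-8")
    return json.loads(as_json)


user = {"name": 'Adrian "B&G" ?', "age": 30}  # made-up data with awkward characters
url = build_url(user)
print(url)
print(read_url(url) == user)
```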

Using this method we simplify the communication process: we do not need to worry about special characters inside the user name, for example, or about JSON special characters (e.g. the double quote) in the URL, and so forth.

The format supports multiple data types (numbers, booleans …) and there are derived formats like BSON (Binary JSON) that support even more of them.

There is no Encryption or Hashing in the example; anyone who intercepts the message can find out what is stored in “myData”. For security see this article: Securing your client-server or multi-tier application.

HTML 🌐

I don’t think HTML needs an introduction; you wouldn’t be here if you didn’t know what HTML is. Its escaping is similar to the URL Encode format, but instead of the % character it uses & to start an encoded character (and ; to end it). Here are some encoded characters:

I saw a sign that said <write here>
I saw a sign that said &lt;write here&gt;
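The same transformation can be reproduced with Python's standard `html` module, shown here as a short sketch.

```python
from html import escape, unescape

raw = "I saw a sign that said <write here>"

encoded = escape(raw)
print(encoded)  # I saw a sign that said &lt;write here&gt;

# Decoding restores the original text
print(unescape(encoded) == raw)  # True

# The ampersand and quotes have their own codes too
print(escape('Tom & "Jerry"'))  # Tom &amp; &quot;Jerry&quot;
```

Escaping user-provided text before placing it in a page is what keeps it from being interpreted as markup.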

CSV 🦎

In computing, a comma-separated values (CSV) file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format. — Wikipedia

CSV is an ancestor of databases and JSON; it is a living dinosaur. It’s a very flexible format (you can specify the delimiter character) and it is still widely used in fields like data science and office software. Converting an Excel file to CSV means decoding the Excel format and encoding the data into the CSV format.
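CSV has escaping rules as well: a field that contains the delimiter must be wrapped in quotes. A small sketch with Python's standard `csv` module, using a sample record whose company name contains commas:

```python
import csv
import io

rows = [
    ["id", "name", "amount", "Remark"],
    [1, "Johnson, Smith, and Jones Co.", 345.33, "Pays on time"],
]

# Encode: the writer quotes the name because it contains the delimiter
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
text = buffer.getvalue()
print(text)

# Decode: every value comes back as a string, CSV carries no data types
decoded = list(csv.reader(io.StringIO(text, newline="")))
print(decoded[1])
```

Passing `delimiter=";"` to both the writer and the reader switches to a different separator.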

XML 📑

In computing, Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. Although the design of XML focuses on documents, the language is widely used for the representation of arbitrary data structures such as those used in web services. — Wikipedia

Probably the reason we have JSON is that XML was too strict and verbose. It is still used at a large scale, especially by Microsoft, Android and microservice systems, but its glory era is long gone.

The most common operations when working with APIs are encoding your data records to, and decoding them from, the JSON and XML formats.

#json
[
  {
    "id":1,
    "name":"Johnson, Smith, and Jones Co.",
    "amount":345.33,
    "Remark":"Pays on time"
  }
]
#xml
<?xml version="1.0" encoding="UTF-8" ?>
<root>
  <row>
    <id>1</id>
    <name>Johnson, Smith, and Jones Co.</name>
    <amount>345.33</amount>
    <Remark>Pays on time</Remark>
  </row>
</root>

Besides transferring the actual data values between 2 parties (servers, programming languages, clients etc.) we also need to communicate other metadata, like the data structure, the relationship between 2 records or their data types. This is where XML and JSON come to the rescue.
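As a rough illustration of the metadata point, the sketch below encodes the sample record into both formats with Python's standard library and decodes it again. It is a simplified comparison: plain XML without a schema treats every value as text, while JSON carries the basic types itself.

```python
import json
import xml.etree.ElementTree as ET

record = {
    "id": 1,
    "name": "Johnson, Smith, and Jones Co.",
    "amount": 345.33,
    "Remark": "Pays on time",
}

# Encode the record as JSON
as_json = json.dumps([record])

# Encode the same record as XML: <root><row>...</row></root>
root = ET.Element("root")
row = ET.SubElement(root, "row")
for key, value in record.items():
    ET.SubElement(row, key).text = str(value)
as_xml = ET.tostring(root, encoding="unicode")

# Decode both and compare what survived
from_json = json.loads(as_json)[0]
from_xml = {child.tag: child.text for child in ET.fromstring(as_xml).find("row")}

print(type(from_json["amount"]).__name__, from_json["amount"])
print(type(from_xml["amount"]).__name__, from_xml["amount"])
```

The structure (a list of rows with named fields) survives in both formats; the number comes back as a float from JSON and as a string from XML.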

Compressing is encoding 🗜

The output of a compression algorithm can also be seen as a code. GZIP and Deflate are the most popular algorithms in the web development world because they handle text very well. They are mostly used to compress the source code of websites (JS, CSS, HTML) in order to minimize the amount of data transferred between servers and browsers.

Compressing is the process of encoding to a format that uses fewer bits to represent the data. See: Optimizing Encoding and Transfer Size of Text-Based Assets (Web Fundamentals, Google Developers).

For example, comparing a CSS library of 10,000 lines of code with its gzipped version, the compressed file is 9x smaller.
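You can try the encode and decode steps yourself with Python's standard `gzip` module. The stylesheet in this sketch is made up and extremely repetitive, so it compresses far better than real CSS would; treat the printed numbers as an illustration only.

```python
import gzip

# A made-up, very repetitive stylesheet; real CSS compresses less dramatically
css = ".button { color: red; margin: 0 auto; }\n" * 1000
raw = css.encode("utf-8")

packed = gzip.compress(raw)         # encode into the GZIP format
unpacked = gzip.decompress(packed)  # decode back to the original bytes

print(len(raw), "bytes before,", len(packed), "bytes after")
print("round trip ok:", unpacked == raw)
```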

Mentions 📢

A few other worthy and popular encoding formats (for web development) are:

  • WebSocket: full-duplex socket communication between client and server
  • SSL: a security layer for HTTP requests
  • SCSS, SVG and the list goes on

Character encoding 🔡

This topic (ASCII, UTF-8, Unicode) is too large to fit in my post, so I recommend reading dedicated articles about it.

Thanks! 🤝

Please share the article, subscribe or send me your feedback so I can improve the following posts!

I curate a list of articles, talks and papers once or twice per month. They are mostly related to computer science, distributed systems, databases, Go, containers and Cloud solutions.
