Identification patterns for your resources: URI/UUID/AWS ARN...

A 14-minute story written in Dec 2022 by Adrian B.G.

When dealing with abstract entities or physical resources, the first thing we need to consider is how we can uniquely identify each of them. In this article, we’ll go through a set of common standards and practices in technology, with a focus on Web/Internet applications.

Resources

Resource: a stock or supply of money, materials, staff, and other assets that can be drawn on by a person or organization in order to function effectively.

Let’s go through a list of possible resources. It can be anything, really:

  • an abstract concept: an ISO standard, a cooking recipe
  • a physical resource: a person, a pet, a city
  • an instance of a data model in your database/API, e.g. users, images, orders
  • a sub-system/component, e.g. microservice, database, logic layer in your app
  • a blob of data or metadata, e.g. files, folders, objects, a slice of bytes
  • a relationship between resources

In computer programs we usually group resources in collections. Because they share the same properties, we can share the features and logic we build across the resources in the same collection. Most of the time, we need to uniquely define and address a resource, and this leads us to the following term: Identifier.

Identifiers

In my opinion, an Identifier is a unique attribute in the bounded scope of a collection (that defines its type and schema) that belongs to a specific system (provider or authority). The Identifier is composed of one or more properties of the resource (composite key).

Multiple systems can refer to the same resource using different or same identifiers in a different format, but a system (provider) has to choose at least one property and a set of defined rules (schema) to uniquely reference it. Systems that use resources from a different provider can reuse the same identifier or create their own. Examples:

  • A country identifies a person by their Social Security Number, but in the scope of a WordPress installation the same individual is represented by a separate ID, generated during the registration process.
  • An address (city, street …) may represent the place where you live, but from a GPS system perspective it is a series of coordinates (latitude,longitude).

The most popular standard for identifiers is the URI, defined in RFC 3986, which updates the original URL specification in RFC 1738 (published in 1994). We will go into more detail about URIs, UUIDs, and other common standards further on.

Representations

Now that we have seen what Identifiers are, we realize that this information alone is not really helpful without context. Most of the time, having only a specific ID is useless if we do not know how to get more information about that resource. To solve this problem we need to “interpret” the identifier in a context.

While Identifiers are abstract, Representations are the actual manifestation of a resource in a specific context. For example, a customer can be represented in the billing system by their passport and company details, but in the marketing database the same customer is represented by their email address. Even in the same context, a resource can have multiple representations:

  • a document can have different encodings, formats, and/or compressions (eg: json, xml, gzip text)
  • an article can be translated into multiple languages
  • a company account may have multiple users

Context

To define, create, or access a specific resource we need its identifier and the context of the existing representations, and most importantly an address, or at the least its authority. A number is useless on its own, but knowing what that number represents is crucial, for example, if you know that the number is a Romanian social security ID you have context. Knowing its authority and the resource type, you have the mechanisms of interpreting its representation, by reading their specs.

Let’s define a simple example to put all these concepts in perspective.

curl -d @request.json -H "Content-Type: application/json" \
 -H "Accept: application/json" http://localhost:8082/users/Bob

By executing this HTTP request:

  • We ask for more details about the user identified as “Bob”
  • We already know (in advance) that the authority of these users is the web server at “localhost” address (and the company behind it).
  • We communicate the fact that we want/support the encoding format JSON and if the server does not know how to return this representation of the resource, it will return an error.
  • We send and receive other context details that are hidden in the HTTP protocol and our tools. For example, the response may also be compressed (gzip), and its compression/decompression is done behind the scenes (which can be seen as another representation of the resource).

You can also see this request from a different perspective, our initial intent being full of presumptions:

  • We presume the user Bob exists, it may not.
  • We presume that this authority/provider (web server) knows about Bob. The user may exist but maybe we have the wrong address/authority.
  • We presume the web server has the knowledge and ability to generate a JSON representation.
  • We presume that we are allowed to access the resource “Bob”.
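Each of these presumptions maps fairly naturally onto an HTTP status code when it turns out to be false. A minimal sketch in Python, assuming the server follows common HTTP conventions (the article does not say how this particular server responds); the name `explain_status` and the wording of the messages are hypothetical:

```python
# Hypothetical mapping: which presumption most likely failed for a given
# HTTP status code, assuming a server that follows common conventions.
PRESUMPTION_FAILURES = {
    404: "the user does not exist, or we asked the wrong authority",
    406: "the server cannot produce the requested representation",
    401: "we did not prove who we are",
    403: "we are not allowed to access the resource",
}


def explain_status(status: int) -> str:
    """Translate a response status code into the presumption that failed."""
    if 200 <= status < 300:
        return "all presumptions held"
    return PRESUMPTION_FAILURES.get(status, "unexpected failure")


print(explain_status(200))  # all presumptions held
print(explain_status(406))  # the server cannot produce the requested representation
```

Note that a 404 alone cannot tell us whether Bob is missing or whether we are simply talking to the wrong authority, which is why the two cases share a message.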

Authentication and Authorization (AuthN/Z) are important aspects and strongly related to Resources, but they are beyond the scope of this article.


URI, URN, URL, and a lot of confusion

The following concepts were developed as early as the 1990s, when the “internet” was being defined, and they later became official standards published by the IETF. These identifiers are no more than a series of characters forming a human-friendly text that contains one Identifier and some context.

The URI (Uniform Resource Identifier), as defined in RFC 3986, is a generic way of defining identifiers, which can also contain the location or other contextual information.

The URI syntax defines a grammar that is a superset of all valid URIs, allowing an implementation to parse the common components of a URI reference without knowing the scheme-specific requirements of every possible identifier.

Examples of valid and popular URIs:

  • https://john.doe@www.example.com:123/forum/questions/?tag=networking&order=newest#top
  • mailto:John.Doe@example.com
  • urn:oasis:names:specification:docbook:dtd:xml:4.1.2
  • books: urn:isbn:0-486-27557-4
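Because the generic grammar is shared by all schemes, a single parser can split any URI into its common components. A small sketch using Python's standard `urllib.parse.urlsplit`; the `www.example.com` URI is an illustrative assumption, while the ISBN URN is the one listed above:

```python
from urllib.parse import urlsplit

# A hierarchical URI (illustrative, on the reserved example.com domain).
web = urlsplit("https://www.example.com:123/forum/questions/?tag=networking&order=newest#top")
print(web.scheme)    # https
print(web.hostname)  # www.example.com
print(web.port)      # 123
print(web.path)      # /forum/questions/
print(web.query)     # tag=networking&order=newest
print(web.fragment)  # top

# A URN has no authority component: everything after the scheme is the path.
book = urlsplit("urn:isbn:0-486-27557-4")
print(book.scheme)   # urn
print(book.path)     # isbn:0-486-27557-4
```

The parser does not need to know anything about ISBNs to extract the scheme; interpreting the rest is left to whoever understands the `urn` scheme.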

A Uniform Resource Name (URN) is a URI that identifies a resource by name in a particular namespace without implying its location or how to access it. They are typically “emitted” by the authority which is responsible for that namespace. Having a global central authority makes it easier to enforce other URN properties like persistence and global uniqueness.

Examples of URNs:

  • urn:isbn:0451450523 The Last Unicorn book from 1968, identified by its book number.
  • urn:isan:0000-0000-2CEA-0000-1-0000-0000-Y The 2002 film Spider-Man, identified by its audiovisual number.
  • urn:ISSN:0167-6423 The Science of Computer Programming scientific journal, identified by its serial number.
  • urn:ietf:rfc:2648 The IETF’s RFC 2648.
  • urn:mpeg:mpeg7:schema:2001 The default namespace rules for MPEG-7 video metadata.
  • urn:oid:2.16.840 The OID for the United States.
  • urn:uuid:6e8bc430-9c3a-11d9-9669-0800200c9a66 A version 1 UUID.
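All of the examples above share the same shape: the `urn` scheme, a namespace identifier (NID), and a namespace-specific string (NSS). A simplified sketch of splitting them apart; `parse_urn` and `Urn` are hypothetical names, and the function deliberately ignores the stricter character rules and optional components of the URN specification:

```python
from typing import NamedTuple


class Urn(NamedTuple):
    nid: str  # namespace identifier, e.g. "isbn"
    nss: str  # namespace-specific string, e.g. "0451450523"


def parse_urn(text: str) -> Urn:
    """Split 'urn:<nid>:<nss>' into its two parts (simplified, no validation)."""
    parts = text.split(":", 2)
    if len(parts) != 3 or parts[0].lower() != "urn" or not parts[1] or not parts[2]:
        raise ValueError(f"not a URN: {text!r}")
    # The namespace identifier is case-insensitive, so normalize it.
    return Urn(parts[1].lower(), parts[2])


print(parse_urn("urn:isbn:0451450523"))  # Urn(nid='isbn', nss='0451450523')
print(parse_urn("urn:ISSN:0167-6423"))   # Urn(nid='issn', nss='0167-6423')
```

The NID tells us which authority's rules apply; the NSS is only meaningful once we know those rules.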

A Uniform Resource Locator (URL) as defined in RFC 3986 can be used to identify resources by specifying their locations in the context of a particular access protocol. This way you also clarify its access mechanism and its location.

Examples of URLs:

  • https://www.google.com/search?q=examples+of+urls
  • https://stackoverflow.com/questions/176264/what-is-the-difference-between-a-uri-a-url-and-a-urn

Many URLs used in web applications do not refer to a unique resource but rather a logical command used to access dynamic content. In the following example "https://www.dummy.rog/users?query=*alpha" the command consists of a string search query in the user’s catalog. Other commands are supported at protocol level, for example POST/GET verbs in HTTP.

Multiaddr

Multiaddr is an interesting project that aims to create a standard for network addresses. It supports most protocols and by design should be compatible with network addresses that are not yet invented. It relies on a URI-like path format. A few examples:

  • /tls/ws - websocket over TLS
  • /ip4/127.0.0.1

The addresses can be more specific, for example a specific port and protocol on a specific machine: /ip4/127.0.0.1/udp/1234. Components that represent protocols or implementation details can be chained in the same multiaddr: /ip4/1.2.3.4/tcp/1234/tls/p2p/QmFoo refers to the resource with the ID QmFoo that can be accessed via p2p, secured by the TLS protocol, and exposed on TCP port 1234 on the machine with the respective IP.
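To make the chaining concrete, here is a toy parser that splits a multiaddr into (protocol, value) pairs. It is only a sketch: the two protocol sets are a tiny hand-picked assumption covering the examples above, not the real multiaddr protocol table, and `parse_multiaddr` is a hypothetical name rather than the API of any multiaddr library.

```python
from typing import List, Optional, Tuple

# Assumption: a minimal subset of protocols, enough for the examples above.
PROTOCOLS_WITH_VALUE = {"ip4", "ip6", "tcp", "udp", "p2p"}
PROTOCOLS_WITHOUT_VALUE = {"tls", "ws"}


def parse_multiaddr(addr: str) -> List[Tuple[str, Optional[str]]]:
    """Split a multiaddr string into ordered (protocol, value) components."""
    if not addr.startswith("/"):
        raise ValueError("a multiaddr must start with '/'")
    parts = addr[1:].split("/")
    components = []
    i = 0
    while i < len(parts):
        proto = parts[i]
        if proto in PROTOCOLS_WITH_VALUE:
            if i + 1 >= len(parts):
                raise ValueError(f"protocol {proto!r} needs a value")
            components.append((proto, parts[i + 1]))
            i += 2
        elif proto in PROTOCOLS_WITHOUT_VALUE:
            components.append((proto, None))
            i += 1
        else:
            raise ValueError(f"unknown protocol {proto!r}")
    return components


print(parse_multiaddr("/ip4/1.2.3.4/tcp/1234/tls/p2p/QmFoo"))
# [('ip4', '1.2.3.4'), ('tcp', '1234'), ('tls', None), ('p2p', 'QmFoo')]
```

The order of the components matters: each one wraps the one before it, from the machine address outwards to the resource ID.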

Auto Increment SQL and UUIDs

These are generically referred to as IDs: simple Identifiers with a narrow scope/bounded context. In contrast to URIs, which are usually global, IDs mostly identify a local or temporary resource.

Because of the popularity of SQL databases, I think that positive integers are now one of the most prevalent unique resource identifiers. Developers can leverage a native feature of SQL databases called auto-increment, which assigns a unique, monotonically increasing number to each resource (row) in a table. Examples of URLs that refer to integer IDs:

Another popular identifier format is the UUID/GUID (RFC 4122, since superseded by RFC 9562), which can be generated in a distributed manner, either randomly (version 4) or from a timestamp combined with node or random bits (versions 1, 6, and 7), or deterministically from a name and its namespace (versions 3 and 5). Version 8 is reserved for custom, vendor-specific layouts.

Example of a UUID in its canonical representation: e7676de8-7a3c-11ed-a1eb-0242ac120002, which is actually a very large (128-bit) integer value: 307588702804154946639077556132305305602.
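Python's standard `uuid` module can be used to check these properties. The sketch below reads the example UUID above, converts it to and from its integer form, and contrasts a random UUID with a name-based one; the name "example.com" is an arbitrary assumption used only for illustration.

```python
import uuid

example = uuid.UUID("e7676de8-7a3c-11ed-a1eb-0242ac120002")
print(example.version)  # 1 (time-based)

# The canonical string is just one representation of a 128-bit integer.
as_int = example.int
print(uuid.UUID(int=as_int) == example)  # True

# Version 4: random, so two calls practically never return the same value.
random_id = uuid.uuid4()

# Version 5: deterministic, the same namespace and name always give the same UUID.
first = uuid.uuid5(uuid.NAMESPACE_DNS, "example.com")
second = uuid.uuid5(uuid.NAMESPACE_DNS, "example.com")
print(first == second)  # True
```

Deterministic versions are handy when several systems must derive the same identifier independently, without talking to each other.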

For a better comparison between the formats, and for when you should not use auto-increment IDs, please read Software engineer — from monolith to cloud: Auto Increment to UUID.

Making sense of IDs with URIs

Simple IDs in the context of one sub-system, an application, or a group of microservices solve most problems. However, there are cases in which a system needs to address/access resources from different namespaces or applications. The most common business requirements that drive such a need are:

  • Identify or define relationships between different resources e.g. all users that bought a product
  • Analyze and monitor resources across applications, e.g. resource usage across microservices

In practice, the context of “user” or “product” is not explicit, but rather inferred from the storage location (e.g. the SQL table “users”). In more complex systems (analytical databases, data lakes) or document databases, however, we may need to store this contextual information (provider, context) in the ID itself.

An example could be a Cassandra table in which multiple types of entities are stored or an Audit Log SQL Table.

Keep in mind that each individual provider/system does not (and should not have to) guarantee the global uniqueness of its IDs. There are also different types of resources, different authorities, and possibly multiple namespaces to be considered. Putting multiple IDs in the same namespace can lead to collisions and other problems. To avoid this, instead of storing the original resource Identifier, we can leverage more advanced identifier patterns like the URI, for example:

  • uri:user:2222 bought 5 pieces of uri:product:28
  • uri:marketing_campaign:summer_sale referred to uri:product:28
  • uri:server:42 used 45% of the uri:network:volume:main_disk2
  • uri:provider:aws reports the costs for uri:network:volume:main_disk2

These new URIs can also serve as unique Identifiers in our new system (e.g. an analytical database), as a string ID/primary key that can be deterministically generated and decomposed based on the provider’s identifiers and context.
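A minimal sketch of such deterministic generation and decomposition, following the uri:type:id pattern from the examples above. The function names are hypothetical, and the sketch assumes that the provider's ID never contains a colon (otherwise it would need escaping).

```python
from typing import Tuple


def make_uri(resource_type: str, resource_id: object) -> str:
    """Build a global identifier from a resource type and a provider-local ID."""
    return f"uri:{resource_type}:{resource_id}"


def split_uri(uri: str) -> Tuple[str, str]:
    """Recover (resource_type, local_id); the type itself may contain colons."""
    prefix, _, rest = uri.partition(":")
    resource_type, _, resource_id = rest.rpartition(":")
    if prefix != "uri" or not resource_type or not resource_id:
        raise ValueError(f"unexpected identifier: {uri!r}")
    return resource_type, resource_id


print(make_uri("user", 2222))                      # uri:user:2222
print(split_uri("uri:network:volume:main_disk2"))  # ('network:volume', 'main_disk2')
```

Because the mapping works in both directions, the analytical system can always trace a record back to the original provider and its local ID.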


AWS use-case

Let’s dive into a practical, complex, real-world example of grouping IDs: how the AWS public cloud provider handles the global identification of resources in a multi-geographical and multi-tenant distributed system.

A resource can be a server, a user (in AWS IAM or Amazon Cognito), an object (in S3), a volume (in AWS Network storage), a load balancer, and so on. At creation, each resource instance gets assigned an identifier, for example i-1234567890abcdef0.

AWS has the following concepts, used as namespaces in their semantics:

  • partition - a set of regions
  • region - a geographical set of logical data centers
  • availability zone - a logical data-center
  • account - a customer namespace

Most resource IDs are randomly generated and scoped to a single region, with the exception of S3 bucket names, which are user-defined and scoped to a partition (group of regions). This means a resource ID could collide with the ID of a resource in another region, and to access a resource we need more context. So let’s see how Amazon designed its identifier representations and contexts:

Amazon Resource Name (ARN)

The AWS ARN follows the URI format arn:partition:service:region:account-id:resource-type:resource-id and it solves multiple problems:

  • It has the role of a global identifier because it contains all three dimensions (partition, region, account) and the resource ID.
  • It provides context because it contains the physical and logical location (partition, region)
  • It often defines the representation through the AWS service and can optionally define the resource type explicitly, e.g. you know the resource is a user account if the service is Amazon Cognito.
  • It provides the context of ownership and authority through the account ID.
  • By supporting paths and wildcards, AWS ARNs can be used as “commands” that represent a subset/group of resources. For example, applying an IAM policy to all users whose names start with the prefix "adm_": "Resource":"arn:aws:iam::123456789012:user/adm_*"

Example of AWS ARN: arn:aws:ec2:us-east-1:123456789012:vpc/vpc-0e9801d129.
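Since the ARN is a colon-separated string, its dimensions can be recovered with a simple split. The sketch below is an illustration only: `parse_arn` and `Arn` are hypothetical names (not part of any AWS SDK), and the wildcard check uses Python's `fnmatch` as a rough approximation of how a policy pattern selects a group of resources, not as a faithful model of IAM evaluation.

```python
from fnmatch import fnmatchcase
from typing import NamedTuple


class Arn(NamedTuple):
    partition: str
    service: str
    region: str
    account_id: str
    resource: str  # may itself contain ':' or '/', e.g. "vpc/vpc-0e9801d129"


def parse_arn(text: str) -> Arn:
    """Split an ARN into its five components after the 'arn' prefix."""
    parts = text.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {text!r}")
    return Arn(*parts[1:])


vpc = parse_arn("arn:aws:ec2:us-east-1:123456789012:vpc/vpc-0e9801d129")
print(vpc.service, vpc.region, vpc.account_id)  # ec2 us-east-1 123456789012
print(vpc.resource)                             # vpc/vpc-0e9801d129

# Approximate wildcard matching for the "adm_" prefix (hypothetical user names).
pattern = "arn:aws:iam::123456789012:user/adm_*"
print(fnmatchcase("arn:aws:iam::123456789012:user/adm_alice", pattern))  # True
print(fnmatchcase("arn:aws:iam::123456789012:user/bob", pattern))        # False
```

Note the empty region in the IAM ARNs: IAM users are not tied to a region, so that dimension is simply left blank.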

Internet use-case, DNS and IPs

Let’s switch our attention to a simpler but more popular example. Given the URL http://www.google.ro, we can say that it identifies the “google” web server resource, and the URI (URLs are a subset of URIs) also provides the location and the mechanism to access it (the HTTP protocol). The tools used to access such resources are called HTTP clients (browsers, CLI tools, ...), and they rely on the DNS protocol, which defines how the host name is translated to an IP address. With the IP address, the client now has the unique address of the web server that acts as an authority for the resource we are looking for.

For illustration purposes, DNS and IP protocols are oversimplified in this example, as they consist of multiple intermediate schemas, technologies, and providers.

Public IPv4 (Internet Protocol version 4) addresses are also global identifiers, governed by a central authority called IANA. An IP address is just a number, but it has a human-friendly canonical representation such as “192.0.2.1”, and it serves the purpose of identifying an actor in a network via a network interface.
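The "just a number" claim is easy to verify with Python's standard `ipaddress` module, which converts between the dotted representation and the underlying 32-bit integer:

```python
import ipaddress

addr = ipaddress.ip_address("192.0.2.1")
print(addr.version)  # 4

# Dotted notation -> integer: 192 * 2**24 + 0 * 2**16 + 2 * 2**8 + 1
as_number = int(addr)
print(as_number)  # 3221225985

# Integer -> dotted notation
print(ipaddress.ip_address(as_number))  # 192.0.2.1
```

The dotted form exists purely for humans; routers and operating systems work with the integer.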

Geo URIs

So far we’ve seen examples of URIs and Identifiers formed mostly of letters, but that is not always the case. The Geographical URI scheme (geo:), defined in RFC 5870 in 2010, is worth mentioning not only because it works with digits, but also because it is designed to handle multiple coordinate reference systems. As mentioned earlier, global coordinates (latitude, longitude, and altitude) can be represented in more than 10 standards, and the Geo URI “supports” all of them.

A spatial reference system (SRS) or coordinate reference system (CRS) is a framework used to precisely measure locations on the surface of the Earth as coordinates.

There are many reference systems, each with its own URN that defines which standard and version is used in a particular URI and that can be traced back to its documentation/set of specifications. In the geo:323482,4306480;crs=EPSG:32618;u=20 example we see that besides an Identifier (the coordinates) the URI also contains:

  • Representation details, i.e. how to interpret the identifier. We use the crs value to form another URI, urn:ogc:def:crs:EPSG::32618, which uniquely represents one model of coordinates (WGS 84 / UTM zone 18N). By following this URI and knowing the authority (OGC), we can reach the WGS 84 specs.

WGS 84 references 2D coordinates on an ellipsoid model that leverages fixed points on Earth to measure relative distances.

  • Context: the accuracy of the identifier is measured in meters and defined as parameter u.
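A small sketch of pulling these pieces out of the example geo URI. The function names are hypothetical, the parser handles only the simple form used in this article (no percent-decoding or validation), and the URN is built by following the urn:ogc:def:crs pattern for an authority:code pair such as the crs value in the example.

```python
from typing import Dict, Tuple


def parse_geo_uri(uri: str) -> Tuple[Tuple[float, ...], Dict[str, str]]:
    """Return (coordinates, parameters) for a 'geo:' URI (simplified)."""
    scheme, _, rest = uri.partition(":")
    if scheme.lower() != "geo" or not rest:
        raise ValueError(f"not a geo URI: {uri!r}")
    coords_part, *param_parts = rest.split(";")
    coords = tuple(float(c) for c in coords_part.split(","))
    params = {}
    for part in param_parts:
        name, _, value = part.partition("=")
        params[name.lower()] = value
    return coords, params


def crs_urn(crs: str) -> str:
    """Turn an 'AUTHORITY:CODE' crs value into an OGC-style URN."""
    authority, _, code = crs.partition(":")
    return f"urn:ogc:def:crs:{authority}::{code}"


coords, params = parse_geo_uri("geo:323482,4306480;crs=EPSG:32618;u=20")
print(coords)                  # (323482.0, 4306480.0)
print(params["u"])             # 20
print(crs_urn(params["crs"]))  # urn:ogc:def:crs:EPSG::32618
```

Without the crs parameter the two numbers would be ambiguous; with it, any consumer can look up exactly how to interpret them.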

RESTful APIs

I think that another good URI use-case, and one that is very popular among web applications, can be found in Resource Relationships as defined in a RESTful API.

An HTTP web server can implement this pattern to expose resources, and most of the time the server is also the authority (or a delegate, like an API) that is responsible for the resource. The response to GET http://web.com/users/Adrian, in a JSON representation for example, may contain relationships to other resources, represented as links as follows:

HTTP/1.1 200 OK
Content-Type: application/vnd.api+json

{
 "user_id": "Adrian",
 "user_metadata": {...},
 "services_purchased": [
   "/service/host_plan_premium",
   "/service/daily_backup"
 ]
}

The representations of the referred resources (e.g. "services_purchased") serve multiple purposes and can be seen as special types of URLs (which are URIs):

  • Contain the identifier of the service, eg: "daily_backup".
  • The representation format is found in the HTTP Response headers “Content-Type”. This information helps the request author to identify the read mechanism of the result, in this case, it needs a JSON decoder.
  • Provides context, meaning the location where the resource can be accessed. By convention, the address is relative to the origin address, so the client can access the resource by initiating another request to the address "http://web.com/service/daily_backup".
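The resolution of a relative link against the origin address follows the standard URI rules, which Python exposes as `urllib.parse.urljoin`. A short sketch using the URLs from the example response above:

```python
from urllib.parse import urljoin

# The address of the resource we requested.
origin = "http://web.com/users/Adrian"

# Relative links found in the "services_purchased" field of the response.
links = ["/service/host_plan_premium", "/service/daily_backup"]

# Resolve each link relative to the origin to get an absolute, fetchable URL.
absolute = [urljoin(origin, link) for link in links]
print(absolute)
# ['http://web.com/service/host_plan_premium', 'http://web.com/service/daily_backup']
```

Keeping links relative means the same representation stays valid if the API is served from a different host name.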

Conclusion

When choosing an identification schema for a resource you should first ask yourself the following:

  • will humans interact with them?
  • are they internal or publicly exposed?
  • how many unique resources could exist at a time?
  • is the authority that assigns the IDs central or distributed?
  • how about security? (often deterministic simple IDs like auto increment numbers are treated as a vulnerability/weakness)
  • can the identifiers be reused or not?
  • how can collisions be detected and avoided?
  • what are the access patterns? (do we need IDs ranges?)
  • do the identifiers need to be comparable, and if so, by what criteria?
  • are there any business or technical requirements? (e.g. storage considerations)
  • do they need a canonical form or should they be used as such?

As you can imagine, this article only scratches the surface of the identification problem. My hope is that it cleared up some confusion regarding the terminology and, most importantly, that your next contributions and design decisions will take into consideration all the aspects of Resource Identification.

In a future article, I will dive deep into the Identifier part: different types and generation techniques, from a single authority to distributed identifiers such as Decentralized Identifiers (DIDs), with practical examples ranging from SQL and documents up to the new kids on the block, blockchains and IPFS. I will also cover the pros and cons of different ID generation techniques: deterministic (hashes, multi-hash, blockchains), random, and hybrid algorithms (snowflake, social security numbers).

Thanks! 🤝