Globally unique identifiers (GUIDs) are an integral part of the global biodiversity information network as it is currently envisioned. In order for users of a global network to know unambiguously what resource they are encountering, the resource must have an identifier that is unique from all other identifiers in the world. But GUIDs are expected to be more than just globally unique. There is now an expectation that they should also be actionable and persistent.
The current consensus in the biodiversity community is that the best way to ensure that an identifier is globally unique is to include an Internet domain (or subdomain) name in the identifier, since the rules of the Internet ensure that no two entities can be assigned the same domain name. In the case of Bioimages, the subdomain bioimages.vanderbilt.edu is used in all GUIDs to ensure uniqueness.
It is the domain owner's responsibility to make sure that the local identifiers added to the domain name are unique within the organization. This ensures that the composite identifier differs from all other identifiers used by the organization. In the biodiversity community, there is a history of using a collection code and catalog number to differentiate among resources in the institution. As an example, the collection code ind-baskauf is used to group individual plants photographed by Steve Baskauf, and identifiers (functionally "catalog numbers") are assigned to the individual plants (e.g. 66920).
Thus the GUID http://bioimages.vanderbilt.edu/ind-baskauf/66920 uniquely identifies a particular valley oak (Quercus lobata) tree located near the parking lot of the Mitchell Canyon visitor center of Mt. Diablo State Park in California.
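The way the composite identifier is assembled can be sketched in a few lines of Python. This is only an illustration: the function name make_guid and its parameters are made up for this example, and only the domain, collection code, and catalog number come from the text above.

```python
from urllib.parse import quote

DOMAIN = "bioimages.vanderbilt.edu"  # the subdomain named in the text


def make_guid(collection_code: str, catalog_number: str, domain: str = DOMAIN) -> str:
    """Join a domain name, a collection code, and a local identifier into an HTTP URI."""
    # quote() percent-encodes any character that is unsafe in a URI path segment;
    # letters, digits, and hyphens pass through unchanged.
    collection = quote(collection_code, safe="")
    local_id = quote(catalog_number, safe="")
    return f"http://{domain}/{collection}/{local_id}"


guid = make_guid("ind-baskauf", "66920")
print(guid)
```

The domain name supplies global uniqueness, while the collection code and catalog number only have to be unique within the organization that controls the domain.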
There is now a consensus that GUIDs should not only be globally unique, but that they should also be actionable, i.e. it should be possible to use the Internet to find out about the object that the GUID identifies. When "http://" is a part of the GUID, it is an indication that it is actionable. There is an expectation that if you put that GUID into a web browser, something should happen. But what?
When you put the URL (uniform resource locator) of a web page into a browser, you send a request for a webserver to send you via the Internet the html document (web page) that is identified by the URL. However, if you put the GUID for that valley oak in a web browser, the server will not send you the oak tree via the Internet! In this case, the GUID is a URI (Uniform Resource Identifier), a broader category that includes URLs of web pages, but also identifiers for non-information resources (e.g. physical objects). Because the URI of the oak tree begins with "http://", it falls into a category of GUIDs called "HTTP URIs".
So what does happen if you put the HTTP URI
http://bioimages.vanderbilt.edu/ind-baskauf/66920 into a web browser?
Open a new browser tab, then copy and paste the HTTP URI into the address bar and
see what happens, or click on this link:
http://bioimages.vanderbilt.edu/ind-baskauf/66920
The Bioimages web server is configured to do what is known as content
negotiation. When the web browser asks the server to resolve the HTTP
URI, it requests an HTML document (the kind of file that it knows how to
handle). When the server receives the request, it knows that it doesn't
have any file with that URL (i.e. the tree isn't a computer file), so according to the content negotiation rules it
follows, its response is "see this other representation" and it sends the file having the URL
http://bioimages.vanderbilt.edu/ind-baskauf/66920.htm (the same as the HTTP
URI for the tree, but with ".htm" added to the end). That HTML
file (66920.htm) contains some JavaScript that causes the web browser to load a
different page with a more complicated URL (at the moment:
http://bioimages.vanderbilt.edu/metadata.htm?ind-baskauf/66920/metadata/ind/Quercus/lobata/19370
although that could change in the future).
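The decision the server makes can be sketched roughly as follows. This is not the Bioimages server's actual configuration: the function name, the table of media types, and the fallback rule are assumptions made for illustration, and the ".rdf" suffix is borrowed from the RDF file URL given elsewhere on this page. The sketch also ignores the quality values ("q=") that a real Accept header may carry, and many servers answer with an HTTP redirect rather than sending the file directly.

```python
# Hypothetical mapping from requested media type to the suffix of the representation.
SUFFIX_FOR_MEDIA_TYPE = {
    "text/html": ".htm",
    "application/xhtml+xml": ".htm",
    "application/rdf+xml": ".rdf",
}


def choose_representation(resource_uri: str, accept_header: str) -> str:
    """Return the URL of the representation that matches the client's Accept header."""
    # Take the media types in the order listed, dropping parameters such as ";q=0.9".
    requested = [part.split(";")[0].strip().lower() for part in accept_header.split(",")]
    for media_type in requested:
        suffix = SUFFIX_FOR_MEDIA_TYPE.get(media_type)
        if suffix is not None:
            return resource_uri + suffix
    # Assumed fallback: send the web page when nothing recognizable was requested.
    return resource_uri + ".htm"


tree = "http://bioimages.vanderbilt.edu/ind-baskauf/66920"
print(choose_representation(tree, "text/html,application/xhtml+xml;q=0.9"))
print(choose_representation(tree, "application/rdf+xml"))
```

The point of the sketch is that the URI of the tree itself is never the URL of a file; every request ends up at a separate URL for one of the tree's representations.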
This last example brings up an important philosophical point about HTTP URIs. An HTTP URI for a non-information resource (like a tree) cannot be the same as the HTTP URI of a representation of the tree (e.g. a web page about the tree). The tree and its webpage are two different things and they must have two different HTTP URIs:
| tree: | http://bioimages.vanderbilt.edu/ind-baskauf/66920 |
| web page: | http://bioimages.vanderbilt.edu/ind-baskauf/66920.htm |
After the success of the World Wide Web was demonstrated, it was felt that
the next step in the evolution of the Web would be to make it more friendly to
computers. For example, if I put an image on my web page, Google's web
crawling computer will find it. However, depending on how I tag the image,
Google may or may not actually be able to figure out what the picture is.
To a human, the picture may obviously be of a dandelion in fruit, but if the
title of the image is "Blowin' in the Wind", Google might think that the picture
had something to do with Bob Dylan. To enable computers to understand what
Internet resources are and how they are related to other resources, a data model
called Resource Description Framework (RDF), which is often written as XML, was developed. A computer
can "understand" an RDF file much more easily than an HTML page designed to be
examined by a human. The RDF file for the valley oak tree is:
http://bioimages.vanderbilt.edu/ind-baskauf/66920.rdf
If you click on the link above using Firefox or Internet Explorer, you will
see the XML file for the RDF displayed reasonably well.
For this reason, there is an expectation that HTTP URIs that are used as GUIDs for biodiversity resources provide a representation in RDF format when a computer presents the HTTP URI and asks for an RDF file (i.e. content negotiation requesting RDF). You can try this out using an "RDF browser". There are several RDF browsers available that can be used for testing HTTP URIs:
Zitgist
http://dataviewer.zitgist.com/
OpenLink
http://demo.openlinksw.com/rdfbrowser2/
Marbles
http://www5.wiwiss.fu-berlin.de/marbles/
Dipper
http://api.talis.com/stores/iand-dev1/items/dipper.html
Disco
http://www4.wiwiss.fu-berlin.de/rdf_browser/
These browsers have the bad habit of not working some of the time, but usually at least two of them are operating at any given time. Paste in the HTTP URI of the oak tree and see what happens. You see the same information that was in the XML file, but it is easier to read and contains active hyperlinks to other URIs that have RDF. By clicking on them, you can surf the web the way a computer would.
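What an RDF browser does when it "asks for an RDF file" can be approximated in Python by building a request whose Accept header names the RDF/XML media type. The sketch below only constructs the request and does not send it; the helper name rdf_request is made up for this example, and whether the server still answers as described is not something the code can guarantee.

```python
from urllib.request import Request

TREE_URI = "http://bioimages.vanderbilt.edu/ind-baskauf/66920"


def rdf_request(uri: str) -> Request:
    """Build (but do not send) a GET request that asks for an RDF/XML representation."""
    return Request(uri, headers={"Accept": "application/rdf+xml"})


req = rdf_request(TREE_URI)
print(req.get_method(), req.full_url)
print("Accept:", req.get_header("Accept"))
# Sending the request would be done with urllib.request.urlopen(req),
# which needs network access and a server that is still online.
```

A web browser sends the same kind of request for the same URI, but with an Accept header that prefers text/html, which is why the two clients can receive different files.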
So what kind of computers are running around the Internet now looking for RDF files? Not many, because there aren't a lot of people creating RDF files. However, if the recommendation of creating RDF for all GUID-identified biodiversity resources is followed, it would be possible for a "biodiversity web-crawler" to locate and compile biodiversity information without requiring a human to submit it. Then the promise of "linked data" might be realized. See http://linkeddata.org/
Tim Berners-Lee, the inventor of the World Wide Web, suggested that you
"give yourself a URI".
I have done this, and you can see the result at this URI:
http://people.vanderbilt.edu/~steve.baskauf/foaf.rdf#me
If you click on the link to my URI, I will NOT be delivered to you via
the Internet. Instead, a rather boring RDF file about me will show up on
your screen. (I made the page using the "FOAF-a-Matic";
FOAF stands for "Friend of a Friend", the standard for encoding information
about people in RDF.) This example demonstrates another approach to
creating HTTP URIs that differentiate between a physical object and its
representation:
| Steve Baskauf: | http://people.vanderbilt.edu/~steve.baskauf/foaf.rdf#me |
| RDF file about Steve Baskauf: | http://people.vanderbilt.edu/~steve.baskauf/foaf.rdf |
The difference between my URI and the URI of the RDF file about me is the presence of "#me" at the end of my URI. This is called a "fragment identifier". When the HTTP URI is dereferenced, the web browser removes the "#" and everything after it before sending the request, so the server receives only the first part of the URI (the URL of the RDF file) and sends the file it identifies.
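The split between the file URL and the fragment identifier can be seen with the urldefrag function from Python's standard library. This is only a small illustration; the variable names are chosen for this example.

```python
from urllib.parse import urldefrag

person_uri = "http://people.vanderbilt.edu/~steve.baskauf/foaf.rdf#me"

# urldefrag separates the part used to retrieve the file from the fragment identifier.
document_url, fragment = urldefrag(person_uri)

print(document_url)  # the URL of the RDF file
print(fragment)      # the fragment identifier, without the "#"
```

Only document_url is used to retrieve anything; the fragment serves to distinguish the person from the file that describes the person.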
This method of differentiating between an object and its representation is a sort of "poor man's content negotiation" because it doesn't require any technical knowledge to use, nor any special server settings to implement. All you have to do is to tack a "#" and some characters onto the end of a file URL and you have an HTTP URI of a physical object. However, it is probably not as good as using content negotiation as described above because the method only differentiates between a non-information resource (e.g. a physical object) and one kind of representation. It can't handle a physical object and both its web page and its RDF representation.
Providing access to RDF without a server set to do content negotiation is possible if the server will provide access to HTML from URIs that don't end in .htm or .html. Some servers are set to do this automatically. Others are not, and would have to be configured to rewrite the URL with ".htm" added to the end. See the Content Negotiation Tests page for more on this.
The final requirement of a GUID is that it should persist. That means that the same HTTP URI should be associated with an object forever and that a serious effort should be made to ensure that the URI will continue indefinitely to be resolved when somebody puts it into a web browser. However, forever is a very long time and considering that the Internet has only been around for a few decades, it might be more realistic to plan for an HTTP URI to remain stable and resolvable for a very long time (years rather than months as is often the case for many URLs).
The first implication of the requirement of persistence is that HTTP URIs should only be constructed using domain (or subdomain) names that are stable. An institutional domain name is good. A domain name that you bought for $8 at godaddy.com for one year is NOT good. If you stopped paying for the name and someone else bought it, your HTTP URIs would stop working.
The second implication of the persistence requirement is that you should only
assign HTTP URIs that will not change if you decide to use different software or
a different web model. For example, this would not be a good HTTP URI for
me:
https://medschool.mc.vanderbilt.edu/biosci/bio_fac.php?id3=13257#me
If Vanderbilt decided to stop using PHP scripts or changed my id number, the URL
could change at any time.
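One way to spot URIs of this kind is a simple heuristic check, sketched below. The function name, the list of script extensions, and the wording of the warnings are all assumptions made for illustration; passing the check does not make a URI persistent, it only means the URI does not visibly depend on a particular piece of software.

```python
from urllib.parse import urlsplit

# Illustrative, incomplete list of extensions that reveal the server-side technology.
SCRIPT_EXTENSIONS = (".php", ".asp", ".aspx", ".jsp", ".cgi")


def fragility_warnings(uri: str) -> list[str]:
    """Return reasons why a URI might change if the software behind it changes."""
    parts = urlsplit(uri)
    warnings = []
    if parts.path.lower().endswith(SCRIPT_EXTENSIONS):
        warnings.append("path names a server-side script")
    if parts.query:
        warnings.append("identity depends on query parameters")
    return warnings


print(fragility_warnings("https://medschool.mc.vanderbilt.edu/biosci/bio_fac.php?id3=13257#me"))
print(fragility_warnings("http://bioimages.vanderbilt.edu/ind-baskauf/66920"))
```

The first URI triggers both warnings, while the URI of the valley oak triggers none.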
The third implication is that it is best if there is some mechanism for someone else to maintain the website that services the HTTP URIs if the current webmaster leaves the institution, dies, or just gets bored with GUIDs. Creating GUIDs is a long-term commitment.
It is possible to create HTTP URIs that meet the three criteria listed above
(globally unique, actionable, and persistent), but which are hard to use and
remember. For example, the HTTP URI
http://bioimages.vanderbilt.edu/3n555j-65il1I-_o0O/1/3/a/querty:rrn/5js_SbBp#ab09kklw
would be perfectly legal, but who could remember it or possibly type it into a
web browser? See Tim Berners-Lee's article "Cool URIs don't change",
which has the very reasonable and stable URI
http://www.w3.org/Provider/Style/URI
For some time, Life Science Identifiers (LSIDs) were hailed as the GUID of the future. They were globally unique and they certainly could be made to be persistent. However, as they were originally described, they were not actionable unless a user's web browser had a special plug-in. A work-around was described by creating a "proxy form" of the LSID which had http:// and a domain name tacked onto the front of it. This essentially transformed the LSID into an HTTP URI, so why not just use an HTTP URI from the start? Another problem with using LSIDs is that to properly implement them, an "LSID server" had to be set up. Without special communication between LSID servers, one LSID server would not be able to resolve LSIDs hosted on a different server. In contrast, the existing Internet infrastructure was already set up to handle HTTP URIs. With a few notable exceptions (e.g. Biodiversity Collections Index), few people seem to be able to actually get LSIDs to work, and little effort seems to be exerted toward changing that.
Guid-O-Matic consists of a computer program and an XSLT stylesheet that together allow you to create actionable GUIDs from comma-separated values (CSV) output from a database or Excel. It uses a simple method called RDF And XSLT (RAX) to create human-readable webpages from RDF metadata. Click here for more information.