Next Generation URLs (Part One Of Two)
For many years we have heard about the impending death of URLs that are difficult to type, remember and preserve. The use of URLs has actually improved little thus far, but changes are afoot in both development practices and Web server technology that should help advance URLs to the next generation.
Dirty URLs
Complex, hard-to-read URLs are often dubbed “dirty URLs” because they tend to be littered with punctuation and identifiers that are at best irrelevant to the ordinary user. URLs such as “http://www.example.com/cgi-bin/gen.pl?id=4&view=basic” are commonplace in today’s dynamic Web. Unfortunately, dirty URLs have a variety of troubling aspects, including:
1. Dirty URLs are difficult to type.
The length, use of punctuation, and complexity of these URLs make typos commonplace.
2. Dirty URLs do not promote usability.
Because dirty URLs are long and complex, they are difficult to repeat or remember and provide few clues for average users as to what a particular resource actually contains or the function it performs.
3. Dirty URLs are a security risk.
The query string, which follows the question mark (?) in a dirty URL, is often modified by hackers in an attempt to mount a front-door attack on a Web application. The file extensions used in complex URLs, such as .asp, .jsp, .pl, and so on, also give away valuable information about the implementation of a dynamic Web site that a potential hacker may exploit.
4. Dirty URLs impede abstraction and maintainability.
Because dirty URLs generally expose the technology used (via the file extension) and the parameters used (via the query string), they do not promote abstraction. Instead of hiding such implementation details, dirty URLs expose the underlying “wiring” of a site. As a result, changing from one technology to another is a difficult and painful process filled with the potential for broken links and numerous required redirects.
Why Use Dirty URLs?
Given the numerous problems with dirty URLs, one might wonder why they are used at all. The most obvious reason is simply convention – using them has been, and so far still is, an accepted practice in Web development. This fact aside, dirty URLs do have a few real benefits, including:
1. They are portable.
A dirty URL generally contains all the information necessary to reconstruct a particular dynamic query. For example, consider how a query for “web server software” appears in Google – http://www.google.com/search?hl=en&ie=UTF-8&oe=UTF-8&q=Web+server+software. Given this URL, you can rerun the query at any time in the future. Though difficult to type, it is easily bookmarked.
2. They can discourage unwanted reuse.
The negative aspects of a dirty URL can be regarded as positive when the intent is to discourage the user from typing a URL, remembering it, or saving it as a bookmark. The intimidating look and length of a dirty URL can be a signal to both user and search engine to stay away from a page that is bound to change. This is often simply a welcome side effect, rather than a conscious access control policy – frequently nothing is done to prevent actual use of the URL by means of session variables or referring URL checks.
Cleaning URLs
The disadvantages of dirty URLs far outweigh their advantages in most situations. If the last 30 or 40 years of software development history are any indication of where development for the Web is headed, abstraction and data hiding will inevitably increase as Web sites and applications continue to grow in complexity. Thus, Web developers should work toward cleaner URLs by using the following techniques:
1. Keep them short and sweet.
The first path to better URLs is to design them properly from the start. Try to make the site directories and file names short but meaningful. Obviously, /products is better than /p, but resist the urge to get too descriptive. Having www.xyz.com/productcatalog doesn’t add much meaning (if a user looks for a product catalog, they might well expect to find it at or near the top-level products page), but it does needlessly restrict what the page can reasonably contain in the future. It’s also harder to remember or guess at. Shoot for the shortest identifiers consistent with a general description of the page’s (or directory’s) contents or function.
2. Avoid punctuation in file names.
Designers often use names like product_spec_sheet.html or product-spec-sheet.html. The underscore is difficult to notice and type, and such connectors are usually a sign of a carelessly designed site structure: they are needed only when the previous rule wasn't followed.
3. Use lower case and try to address case sensitivity issues.
Given the previous tip, you might instead name a file ProductSpecSheet.html. However, casing in URLs is troublesome because, depending on the Web server's operating system, file names and directories may or may not be case sensitive. For example, http://www.xyz.com/Products.html and http://www.xyz.com/products.html are two different files on a UNIX system but the same file on a Windows system. Add to this the fact that www.xyz.com and WWW.XYZ.COM are always the same domain, and the potential for confusion becomes apparent. The best solution is to make all file and directory names lowercase by default and, in a case-sensitive server operating environment, to ensure that URLs will be correctly processed no matter what casing is used. This is not easy to do under Apache on Unix/Linux systems, although URL rewriting and spellchecking can help.
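As a minimal sketch of both techniques on Apache, assuming the mod_speling and mod_rewrite modules are available and the directives are placed in the main server (or virtual host) configuration:

    # mod_speling: let Apache correct miscapitalized (and mildly
    # misspelled) URLs against the actual file system names
    CheckSpelling On

    # mod_rewrite: redirect any request path containing uppercase
    # letters to its all-lowercase equivalent
    RewriteEngine On
    RewriteMap lc int:tolower
    RewriteCond %{REQUEST_URI} [A-Z]
    RewriteRule (.*) ${lc:$1} [R=301,L]

Note that RewriteMap is only permitted in the server or virtual host context, not in per-directory .htaccess files.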
4. Do not expose technology via directory names.
Directory names commonly or easily associated with a given server-side technology unnecessarily disclose implementation details and discourage permanent URLs. More generic paths should be used. For example, instead of /cgi-bin or /javascript, use a generic /scripts directory; instead of /css, use /styles; and so on.
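Part Two will look at server tools in more depth, but as a hedged illustration, a single Apache mod_rewrite rule can map a clean, extension-free path onto the underlying script. The path and parameters here are hypothetical, echoing the dirty URL from the opening example:

    RewriteEngine On
    # Publish /products/4 to the world while internally invoking the
    # CGI script; [PT] lets the usual /cgi-bin ScriptAlias still apply
    RewriteRule ^/products/([0-9]+)$ /cgi-bin/gen.pl?id=$1&view=basic [PT]

Switching the back end to another technology then means changing this one rule rather than every published link.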
5. Plan for host name typos.
The reality of end user navigation is that around half of all site traffic comes from directly typed or bookmarked URLs. If users want to go to Amazon's Web site, they know to type in www.amazon.com. However, accidentally typing ww.amazon.com or wwww.amazon.com is easy to do in a hurry. Adding a few entries to a site's DNS configuration to map w, ww, and wwww to the main site, as well as the common www.site.com and site.com, is well worth the few minutes required to set them up.
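Once those DNS entries exist, the typo hosts still need to land somewhere sensible. A minimal Apache sketch, assuming name-based virtual hosting and a separate virtual host serving the canonical www.xyz.com (the host names here are hypothetical):

    <VirtualHost *:80>
        ServerName xyz.com
        ServerAlias w.xyz.com ww.xyz.com wwww.xyz.com
        # Send the bare domain and all typo hosts to the real site
        Redirect permanent / http://www.xyz.com/
    </VirtualHost>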
6. Plan for domain name typos.
If possible, secure common “fat finger” typos of domain names. Given the proximity of the “z” and “x” keys on a standard computer QWERTY keyboard, it is no wonder Amazon also has contingency domains like amaxon.com. Google allows for such variations as gooogle.com and gogle.com. Unfortunately, many Web traffic aggregators will purchase the typo domains for common sites, but most organizations should find some of their typo domains readily available. Organizations with names that are difficult to spell, like “Ximed,” might want to have related domains like “Zimed” or “Zymed” for users who know the name of the organization but not the correct spelling. The particular domains needed for a company should reveal themselves during the course of regular offline correspondence with customers.
7. Support multiple domain forms.
If an organization has many forms to its name, such as International Business Machines and IBM, it is wise to register both forms. Some companies will register their legal form as well, so XYZ, LLC or ABC, Inc. might register xyzllc.com and abcinc.com alongside their primary domains. While this may seem like a significant investment, with one of the new breed of low-cost registrars (like itsyourdomain.com), the yearly price for numerous domains is quite reasonable. Given alternate domain extensions like .net, .org, .biz and so on, the question arises: where to stop? Anecdotally, the benefits diminish significantly with the newer alternate domain forms (like .biz, .cc, and so on), so it is better to stick with the common domain form (.com) and any regional domains that are appropriate (e.g. co.uk).
8. Add guessable entry point URLs.
Since users guess domain names, it is not a stretch for users — particularly power users — to guess directory paths in URLs. For example, a user trying to find information about Microsoft Word might type http://www.microsoft.com/word. Mapping multiple URLs to common guessable site entry points is fairly easy to do. Many sites have already begun to create a variety of synonym URLs for sections. For example, to access the careers section of the site, the canonical URL might be http://www.xyz.com/careers. However, adding in URLs like http://www.xyz.com/career, http://www.xyz.com/jobs, or http://www.xyz.com/hr is easy and vastly improves the chances that the user will hit the target. You could even go so far as to add hostname remapping so that http://investor.xyz.com, http://ir.xyz.com, http://investors.xyz.com, and so on all go to http://www.xyz.com/investor. The effort made to think about URLs in this fashion not only improves their usability, but should also promote long-term maintainability by encouraging the modularization of site information.
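As a hedged sketch of both techniques on Apache (all paths and host names are hypothetical), path synonyms can be handled with anchored RedirectMatch patterns so that, for instance, the /career rule does not also swallow the canonical /careers, and host synonyms can be handled with a small virtual host:

    # Path synonyms for the careers section
    RedirectMatch permanent ^/career$ http://www.xyz.com/careers
    RedirectMatch permanent ^/jobs$   http://www.xyz.com/careers
    RedirectMatch permanent ^/hr$     http://www.xyz.com/careers

    # Host synonyms for investor relations
    <VirtualHost *:80>
        ServerName investor.xyz.com
        ServerAlias ir.xyz.com investors.xyz.com
        Redirect permanent / http://www.xyz.com/investor
    </VirtualHost>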
(Stay tuned next week for Part Two of this article, as well as a list of related articles on “clean” URLs and several links to Apache and Microsoft IIS tools!)
Thomas Powell is founder of PINT, Inc. and a lecturer in the Computer Science department at University of California San Diego. His articles have appeared in several magazines and sites, including Network World, Internet Week and ZDNet. He has also published numerous books on Web technology and design, including the best-selling Web Design: The Complete Reference. Visit pint.com.
Joe Lima is the Director of Product Development for Port80 Software. He has worked for a variety of Internet, wireless and software development companies, specializing in research and development for server-centric technologies. Visit port80software.com. Additional inquiries can be sent via email to Chris Neppes at cneppes@port80software.com.