Writing by Peter Hilton

Web site Architecture

Design guidelines intended to help create a viable web site design in which Uniform Resource Locators (URLs) are never changed.

1. Introduction

1.1 Purpose

These design guidelines provide practical details to help implement a web site. Whilst a lot of people have written about how to make web pages look nice, there is little advice about how to design a web site for simplicity, scalability and usability. These guidelines represent an effort to fill the gap.

These guidelines are not definitive; they merely represent one approach that avoids many common pitfalls of web site design and implementation.

1.2 Scope

These guidelines concern a web site's physical implementation. As such, the guidelines are relevant to whoever will assemble a collection of web pages into a web site, and to the 'webmaster' who subsequently administers and maintains the site.

Here a 'web site' could be a public Internet site, an intranet site or a web-based front-end to a server application.

These guidelines do not deal with information architecture, which deals with how to organise a web site's content as a set of pages, or web page design, which only deals with single pages. Most importantly these guidelines are not concerned with the content of any page on the web site.

1.3 Summary

These design guidelines are intended to help create a viable web site design in which Uniform Resource Locators (URLs) are never changed. With most designs, never breaking URLs involves either an increasingly disorganised file system, as new sections are added, or a complex design, even for very simple sites. These guidelines should reduce, or even remove, these problems.

1.4 Abbreviations & glossary

sub-site
A self-contained section of a web site, with its own look and subject area. For example, on a magazine site there might be a sub-site for one writer's column; Jakob Nielsen's site has the Alertbox sub-site, and http://photo.net/ has the Web Tools Review sub-site.
URL
Uniform Resource Locator - the address of a web page.
URL file
A file that is the target of one of a site's valid URLs.

1.5 Conventions used in this document

  • Book and web page titles are set in italics.
  • HTML and code fragments are set in the courier typeface.
  • References and external examples are given in footnotes, preceded by 'See also'.
  • Additional explanations and justifications are given in footnotes, preceded by 'Why?'.
  • Examples are shown on a grey background. Several of the examples will refer to the design of Hilton Harbour, which illustrates several of the design guidelines.

1.6 Background

The following two web sites have contributed to, or strongly influenced, the design goals. These sites are good background reading.

1.7 URL and file system structures

The guidelines concern URL structure and file system structure, which are introduced in this section. The URL structure is the scheme that results in the set of a site's URLs, where related pages have similarly structured URLs.

The file system structure is the organisation of the web site's files on a file system. Here, the structure is a hierarchical directory structure.

HH file system structure example

/about_hh.html
/peter/
 about_hh_text.html
/part/
 hh_page_template.html
 hh.css

Example 2: The page http://hilton.org.uk/about_hh.html is built from the five files listed above. about_hh_text.html contains the content; the part directory contains the page template and style sheet. about_hh_text.html can be moved without breaking the URL, which corresponds to about_hh.html.

The URL structure and the file system structure are the same when each file is associated with a valid URL. Many of the following design goals rely on the premise that there is not necessarily any fixed relationship between the file system, navigation model and the URLs.

For most web sites, the site navigation, URL and file system structures are not independent. However, unless they are separated, some of the design goals are mutually exclusive.

2. Design Goals

This section describes web site design goals for URL structure and file system structure. They are based on the experience of what is annoying about other people's web sites, and what is annoying about trying to maintain my own. Ideally a web site should meet all of these goals.

This section does not explain how to implement a web site that meets these goals; that is the purpose of the guidelines in section 3.

2.1 Uniform Resource Locators (URLs)

Goal. Every page should exist for ever at the same URL

Purpose. This is the most important of all the requirements because you have no control over who links to your site, with the dubious exception of search engines. This goal is most important on public Internet web sites. On the assumption that you want people to be able to read your pages, you should leave each page where each reader first found it. They will appreciate it.

Notes. You also need to have your own domain name so that your URL does not change if (i.e. when) you change service provider (public Internet) or web server machine (intranet).

Goal. URLs should 'read well'

Purpose. If a URL reads well then it will be more recognisable out of context, in an e-mail or on a piece of paper. The URL will also be easier to transcribe or dictate correctly. URLs that read well help the site maintainer in a similar way. In particular, the site's internal links are less likely to be incorrect (broken).

Notes. A URL 'reads well' if it is meaningful to humans: something like /reviews/restaurants/cambridge.html helps the user more than /file/all/03/rr9801219.html. Unfortunately, this goal often conflicts with the goal 'have a well-organised file structure', especially when URLs are associated with dates, as for an online news service, say.

It is difficult to keep URLs for a search engine's results pages readable. For example, kl=XX&pg=q&text=yes&act=search&q=one is the 'query string', appended to the URL that specifies an Altavista search. If all of the name=value pairs used whole words then the URL would become too long very quickly, so some compromise is required.

Goal. There should be some limit to the length of URLs

Purpose. A web site design should ensure that URLs don't become arbitrarily long, as the site grows and more sections are added. If your site structure strategy involves a deep hierarchical structure then each new sub-section gives longer URLs. If this happens the URLs become too long to remember, to be visible in the browser's location text box, or to write down.

Notes. A good maximum length is 80 characters, as URLs that are longer than this will not typically fit on one line of a paragraph.
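A minimal sketch of how a maintainer might check this limit, written in Python. The function name and the sample URLs are hypothetical; only the 80-character figure comes from the note above.

```python
MAX_URL_LENGTH = 80  # the maximum suggested in the note above


def too_long(urls, limit=MAX_URL_LENGTH):
    """Return the URLs whose total length exceeds the limit."""
    return [url for url in urls if len(url) > limit]


# Hypothetical URLs, for illustration only.
urls = [
    "http://your.site.com/cambridge-restaurants.html",
    "http://your.site.com/resources/europe/united-kingdom/cambridge/restaurants/index.html",
]
for url in too_long(urls):
    print(len(url), url)
```

Only the second, deeply nested URL is reported, which is the kind of growth this goal is meant to prevent.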

Goal. URLs should hide the implementation technology

Purpose. Your web site's implementation might suggest exotic URL file extensions, such as tcl, pl, shtml, php3, asp, etc. Hiding the implementation allows you to change technology without changing URLs.

Notes. It should always be possible to configure the web server to process html files as if they had some other file extension, or to replace the file extension internally using something like Apache's URL rewriting ability.

Goal. URLs should be 'choppable', so that the result is a valid URL

Purpose. Given a URL like reviews/restaurants/cambridge.html many users will look in the parent directories, hoping to find more restaurant reviews at reviews/restaurants/, perhaps. This should be a valid URL.

Users, such as myself, only do this after giving up on the site's navigation interface. This goal is relevant because in practice, you can never guarantee that the web site interface design is perfect.

Notes. Assuming that the restaurants directory really does contain more restaurant reviews, which it probably should, then you have at least three choices. You could let the user see a directory listing (naff), show the user a 'no such page' or 'directory listing not allowed' error (unfriendly) or show a restaurant reviews index page (nicest). However, unless you resort to trickery, choosing the last option results in the need to maintain a whole raft of identically named index files, which violates Goal 0.
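To make 'choppable' concrete, the following Python sketch lists the parent URLs that a user might try after chopping segments off the end of a URL. The function name and the host name are hypothetical. For this goal to be met, each URL it returns would have to be valid on the site.

```python
from urllib.parse import urlsplit, urlunsplit


def chopped_urls(url):
    """Return the parent-directory URLs a user might try, nearest first."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    parents = []
    # Drop one trailing segment at a time, ending with the site root.
    for depth in range(len(segments) - 1, -1, -1):
        path = "/" + "".join(segment + "/" for segment in segments[:depth])
        parents.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return parents


# Hypothetical host name, with the example path used above.
for parent in chopped_urls("http://your.site.com/reviews/restaurants/cambridge.html"):
    print(parent)
```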

2.2 File system structure

Goal. Small sites should have a simple structure, but be able to evolve

Purpose. It should be easy to start small, without making it hard to grow. A web site with a 'simple structure' consists of only a few files and directories, whereas some designs involve a complex directory hierarchy from the start or a complex middleware solution such as Cold Fusion.

On the other hand, it should be possible to add 1000 pages to the site without breaking any of the other guidelines for any of the resulting 1004 pages.

Notes. This is hard! This goal is essentially a requirement that a design scheme should be simple for small sites of a few pages, and allow the site to scale up to a large site of thousands of pages without conflicting with the other goals, such as the requirement not to break URLs.

HH file system structure example

The first version of Hilton Harbour started off with only a couple of HTML files for restaurant reviews, and images in another directory. I could have used the structure

restaurants/
 paris.html
 cambridge.html
images/
 ...

This is a simple enough structure, but I didn't write about any more restaurants, so the new pages on other topics would need new directories for each topic.

Goal. The file structure should be well-organised

Purpose. There must be a coherent scheme for locating files on the web server, so that the webmaster knows both where to find each file and where to put a file found out of context. For a large site with many maintainers a coherent strategy is essential so that everyone involved saves or looks for each file in the same place.

Notes. The webmaster would need to know where to put a file found out of context if the file has been edited elsewhere, or gets put in the wrong place. This means knowing exactly where about.html belongs, for example.

HH URL structure example

The first version of Hilton Harbour used a combined directory/URL structure like

resources/
 cambridge/
  cafes.html
  restaurants.html
 paris/
words/
 opinion/
 poetry/

This structure was well organised in the sense that there was a fairly coherent hierarchy of categories. My first problem was that each directory needed an index.html file to act as a navigation page or content page. This meant that I had lots of files with the same file name; I put an index.html file in the wrong place more than once.

The second problem was that I inevitably had chosen the categories badly; for example, I left Paris in May 1997 and never wrote another Paris page. If I had used few directories - a simpler structure - this problem would have been worse.

3. Design Guidelines

This section contains the web site design guidelines for the URL and file system structures. Several of the guidelines are mutually dependent: some are only practicable or worthwhile if certain other guidelines are followed.

The intention is that a web site design that follows these guidelines will meet the design goals of the previous section.

3.1 URL structure

Guideline. Never change a page's URL

Purpose. This is just the first of the goals, and is the most important guideline. It is easy to state by itself, and hard to implement in practice. The purpose of many of the remaining guidelines is to make this guideline practicable.

Notes. 'Never' really does mean never, not merely 'hardly ever'. However, sometimes a page may be withdrawn, in which case the URL should still work, but point to either a withdrawal notice, or some navigation page.

Guideline. Put all URLs at the web site root

Purpose. If all of the URLs are at the site's root level then you automatically satisfy the following design goals.

  • Goal. There should be some limit to the length of URLs.
  • Goal. URLs should be 'choppable', so that the result is a valid URL.
  • Goal. Small sites should have a simple structure, but be able to evolve.

Notes. This gives you http://your.site.com/some-page.html rather than http://your.site.com/lots/of/directories/page.html. As your site grows, you get lots of files in the root directory, but that doesn't turn out to be a problem. You might want another scheme once you have more than a thousand or so pages, though.

Putting all of the URL files in the same directory means that their names must be unique, which creates a problem if more than one person is adding pages to the site. The webmaster could control URL file names, which would help give a coherent scheme, but might not be practicable if too many pages are involved. Alternatively, if different people add to different sections of the site, then a unique prefix could be used for each section, e.g. review-, press-release- or product-.

HH URL file name example

As of July 1999 Hilton Harbour consists of about 80 pages. It would probably be impossible to neatly classify these pages into categories and sub-categories, so any URL structure other than a flat one would be flawed. Instead they have URL file names that stand alone like

cambridge_restaurants.html
cambridge_cafes.html
international_assignment.html
gourmet_dish.html
questions.html

Guideline. Use relative URLs for internal hyperlinks

Purpose. This makes your pages more portable, because you use relative URLs, and your links are less likely to be broken, because they are simpler. The previous guideline means that all of the HTML anchor tags in your pages contain something like href="just-a-file-name.html".

Notes. This also applies to links to images. If they are all in an images sub-directory of the web site root then all of your HTML image tags contain something like src="images/sofa.jpeg".

Guideline. Only use lower-case letters, numbers and hyphens in URLs

Purpose. This is to satisfy the goal that URLs should read well. Restricting the character set makes the URLs simpler by reducing the number of possible variations, which makes them less likely to be broken. The path portion of a URL is case-sensitive, in general, so only using lower-case makes it easy to get the case right. Use hyphens to separate words, which makes the URL easier to read.

Notes. 1. This guideline does not apply to the query string that can appear at the end of a URL, after a question mark. 2. Underscores are a good alternative to hyphens. Choose whichever the webmaster can type fastest.
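One way to check names against this guideline is a regular expression. The Python sketch below is illustrative only: the pattern and function name are hypothetical, it allows an optional .html extension, and it is slightly stricter than the guideline in that it also rejects leading, trailing and doubled hyphens.

```python
import re

# Lower-case letters, digits and single hyphens, with an optional .html
# extension. Allowing the extension is an assumption made for this sketch.
URL_FILE_NAME = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*(?:\.html)?")


def is_valid_name(name):
    """True if a URL file name uses only the permitted characters."""
    return URL_FILE_NAME.fullmatch(name) is not None


print(is_valid_name("cambridge-restaurants.html"))
print(is_valid_name("Cambridge_Restaurants.html"))
```

A site that prefers underscores would swap the hyphen in the pattern for an underscore.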

Guideline. Only use unabbreviated words or common acronyms in URLs

Purpose. This is also to satisfy the goal that URLs should read well. This guideline makes URLs easier to remember and use correctly, which helps the webmaster as much as the user.

Notes. Words should be spelled correctly; 'common acronyms' are those found in a dictionary or understood by the page's audience, rather than abbreviations the webmaster made up in the same way that Unix shell commands were named. Presumably the acronym will be expanded the first time it is used in the text of the page anyway, as it should be.

This is not practicable for search results pages, where the search is specified in the URL. A good idea is, for each name=value pair, to restrict the name to one or two letters.

Guideline. Avoid digits in URLs, where the digits are not essential

Purpose. This is just a sensible naming convention - you should use meaningful names. Numbers are meaningful in dates, when they are part of a name, and when they represent a series of similar pages.

Notes. For example, numbers occur in names like 68000_processor.html; serial numbers are essential for press releases, for example, where there may be many hundred issued per month. These examples are the exception though: most URLs do not need numbers. For example, the digits in review1.html, review2.html and review3.html do not distinguish the different reviews in any way; longer names could do this by saying what each one is a review of.

Guideline. Create URL file names in a consistent way

Purpose. Since (unfortunately) URLs must be human-readable they must be meaningful. A consistent naming scheme is the only way to have a large number of meaningful and related URLs.

Notes. For example, to construct a URL file name that follows these guidelines,

  1. take the page title
  2. convert the title to lower case
  3. separate words with hyphens
  4. remove all punctuation, and other non-alphanumeric characters
  5. add the file extension .html
  6. if the name is too long - more than three or four words - then remove any words that are not essential to the meaning of the title and are not required to make the file name unique
  7. move important words to the front of the URL, so that similar pages have similar names that appear next to each other in alphabetic lists.

This scheme should not give URLs that are too long if you follow the guideline to 'put URLs in the web site root', because there will be no directory names.
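The first six steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name, the four-word threshold and the list of non-essential words are hypothetical, and characters outside a-z and 0-9 are simply dropped. Step 7, and checking that the result is unique, are left to the person naming the page.

```python
import re

# Words treated as non-essential when shortening; a hypothetical list.
FILLER_WORDS = {"a", "an", "and", "of", "the", "to"}


def url_file_name(title, max_words=4):
    """Build a URL file name from a page title (steps 1 to 6)."""
    # Steps 1 and 2: take the title and convert it to lower case.
    text = title.lower()
    # Step 4: remove punctuation, keeping letters, digits and spaces.
    text = re.sub(r"[^a-z0-9\s]", "", text)
    words = text.split()
    # Step 6: if the name is too long, drop non-essential words.
    if len(words) > max_words:
        words = [w for w in words if w not in FILLER_WORDS]
    # Steps 3 and 5: join the words with hyphens and add the extension.
    return "-".join(words) + ".html"


print(url_file_name("European Phrase Book"))
print(url_file_name("The Restaurants and Cafes of Cambridge"))
```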

HH URL file name example

Hilton Harbour's URL file names are constructed from the page titles, and do not use abbreviations. For example:

european_phrase_book.html
photos_luxembourg_city.html
luxembourgish_verb_table.html
why_i_hate_computers.html

Because these file names are constructed from the page titles and do not use abbreviations, I only have to remember the title to know the URL.

Guideline. Do not have URL file extensions, or use .html

Purpose. This is to satisfy the goal that the implementation should be hidden. Also, non-standard extensions make URLs harder to memorise or transcribe.

Notes. The best idea is for URLs not to have a file extension at all, as there is no need to indicate the content type in the name of the file: this is already handled in the HTTP headers.

Ideally the web server will serve the correct page back for any URL ending with .html, whatever the real extension. The Apache web server can be configured to correct such 'spelling errors' automatically.
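The idea of serving the right file whatever its real extension can be sketched as a lookup that tries each extension in turn. This Python example is not how Apache does it. The function name and the list of extensions are hypothetical, and a real server would also have to reject paths containing '..'.

```python
from pathlib import Path

# Extensions the site might really use; a hypothetical list.
REAL_EXTENSIONS = [".html", ".shtml", ".php"]


def find_real_file(root, url_path):
    """Map a URL path, with or without .html, to the real file on disk."""
    name = url_path.lstrip("/")
    if name.endswith(".html"):
        name = name[: -len(".html")]
    # Try each real extension in turn and return the first file that exists.
    for extension in REAL_EXTENSIONS:
        candidate = Path(root) / (name + extension)
        if candidate.is_file():
            return candidate
    return None  # no such page
```

With this kind of lookup, a request for /questions.html or /questions would find questions.shtml on disk, so the published URL would survive a change of technology.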

HH URL file name example

The first pages on Hilton Harbour were just static HTML files, so the files had .html file extensions. Then I used Server-Side Includes, so I had to change the file extensions to .shtml, breaking all of the URLs. Again, when I changed to PHP for server-side scripts I had to change the extensions to .html. If I had the option, I would have enabled PHP for all files, regardless of extension, in order to meet the guideline, and not ever break the URLs.

3.2 File system structure

Guideline. File names should be unique

Purpose. This is necessary to satisfy the goal that the file system be well-organised. Otherwise, when files are in the wrong place, or have been edited in another location, you will not know where some file belongs.

Notes. If you are editing a lot of index.html files in different directories then it can be difficult to remember which is which: copying one to the wrong place and overwriting some other pages is an easy mistake to make.

In the case of index.html directory index files it is probably easy to keep file names unique by automatically redirecting requests for restaurants/ or restaurants/index.html, say, to some canonical name such as restaurants/restaurants-index.html.
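A webmaster could check this guideline with a short script that reports any file name used in more than one directory. The following Python sketch assumes the whole site lives under one root directory; the function name is hypothetical.

```python
import os
from collections import defaultdict


def duplicate_file_names(root):
    """Map each file name used more than once to the directories holding it."""
    locations = defaultdict(list)
    for directory, _subdirs, files in os.walk(root):
        for name in files:
            locations[name].append(directory)
    return {name: sorted(dirs) for name, dirs in locations.items() if len(dirs) > 1}


if __name__ == "__main__":
    # Check the current directory; an empty report means every name is unique.
    for name, directories in duplicate_file_names(".").items():
        print(name, "appears in", ", ".join(directories))
```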

Guideline. For small sites, put all of the files in the site's root directory.

Purpose. This is to satisfy the goal that small sites be simple.

Notes. If your site only consists of four pages then you probably only want to have four HTML files together in the same directory, with URLs that point directly at the files. This could include image files, but you are probably better off putting all of the images in one subdirectory.

HH file system structure example

I should have started Hilton Harbour with just three files in the root directory:

index.html (the home page)
peters_peachy_paris_page.html
cambridge_restaurants.html

Then, when the site grew and I implemented page templating, I could have created a hierarchical directory structure to organise everything, leaving the original three files in place along with all of the new URL files.

Guideline. For large sites, put the files in a hierarchical directory structure.

Purpose. This is for the goal that the file structure be well-organised.

Notes. If you want to follow this as well as all of the other guidelines then you need to use some kind of page templating mechanism. In particular, you need to separate page content locations from URLs. See the next section.

HH file system structure example

Hilton Harbour is currently organised with one sub-directory for each author:

peter
robert
marion

marion has no sub-directories because it only contains two HTML files that contain the page content for her two pages. If Marion writes another hundred pages then I can create a subject hierarchy for the page content files, and move the content files for the two original pages without breaking their URLs, which are fixed to files in the root directory. Originally, peter had half a dozen subdirectories, but after getting annoyed with this for long enough, I just put everything in the same directory. It turns out that 137 files in the same directory is not too many at all, and now I don't have to try to remember which subdirectory a given file is in.

4. Notes On Implementation

4.1 Make URLs independent of content location

The crucial step in being able to implement all of the guidelines is to separate the URLs from the page content. In the simplest case, each page is a static HTML file, such as /reviews/restaurants/cambridge.html. Do not use this as the URL; instead have a URL file called /cambridge-restaurants.html that refers to the content file.

You can use one file to refer to another in many ways, such as

  • using Server Side Includes - the URL file contains the single line
    <!--#include virtual="/reviews/restaurants/cambridge.html" -->
  • using a server-side script to do the same thing as the Server Side Includes
  • redirecting all requests to a single server-side script that parses the URL and 'includes' the appropriate file
  • using Unix symbolic links, or their equivalent.
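The third option, a single script that maps each URL to its content file, might look something like the Python sketch below. It is an illustration rather than a complete server-side script: the table entries, the function name and the content_root parameter are all hypothetical.

```python
from pathlib import Path

# URL file name -> content file location. Both entries are hypothetical.
CONTENT_LOCATIONS = {
    "cambridge-restaurants.html": "reviews/restaurants/cambridge.html",
    "questions.html": "peter/questions.html",
}


def content_for(url_path, content_root):
    """Return the content behind a URL, or None if the URL is unknown."""
    location = CONTENT_LOCATIONS.get(url_path.lstrip("/"))
    if location is None:
        return None
    return (Path(content_root) / location).read_text(encoding="utf-8")
```

Moving a content file then means changing one entry in the table; the URL that readers see stays the same.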

4.2 Using page templates

Page templating generally works using a server-side script that performs the following steps.

  • Use the URL to locate the page's meta-data.
  • Get the page's meta-data.
  • Get the page content.
  • Combine the meta-data and page content with the page template.

The meta-data includes all of the information that will go in the HTML page's HEAD section. The page content is what goes inside the HTML page's BODY section. This information could be retrieved from a database or read from one or more files.
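As a rough illustration of these steps, the Python sketch below combines meta-data and content with a template using the standard library's string.Template. The template text, the placeholder names and the meta-data values are all hypothetical. In practice the meta-data and content would be read from files or a database, and values would need HTML escaping.

```python
from string import Template

# A hypothetical page template; $title, $description and $content are
# placeholder names chosen for this sketch.
PAGE_TEMPLATE = Template("""<html>
<head>
<title>$title</title>
<meta name="description" content="$description">
</head>
<body>
$content
</body>
</html>
""")

# Hypothetical meta-data, keyed by URL file name.
META_DATA = {
    "about_hh.html": {
        "title": "About Hilton Harbour",
        "description": "What this site is for",
    },
}


def build_page(url_file_name, content):
    """Combine a page's meta-data and content with the page template."""
    # Use the URL file name to locate and get the page's meta-data.
    meta = META_DATA[url_file_name]
    # The content is passed in by the caller; combine everything.
    return PAGE_TEMPLATE.substitute(meta, content=content)


print(build_page("about_hh.html", "<p>Page content goes here.</p>"))
```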

There are a variety of technologies that could be used to implement page templating, and therefore a web site that follows all of the guidelines. These technologies include

  • Perl scripts that use the Text::Template module, run offline, or as server-side scripts
  • PHP, an Open Source server-side scripting language
  • Active Server Pages (and presumably GNU Server pages and JavaServer Pages and GNU Java Server Pages)
  • Zope, an Open Source web application server package.

Descriptions of these technologies, and of how to use them for page templating, are outside the scope of this document.

4.3 Very large sites

I have not tried the guidelines on a very large site of many thousands of pages. Such large sites are always divided into sub-sites, and there are two obvious ways to approach these sub-sites.

The first would be to have one sub-directory for each sub-site. Then, simply apply the guidelines to the sub-site as if it were a whole web site, with its directory as the 'web site root'.

The second approach is to have a different domain name for each sub-site and, for the purposes of these guidelines, treat it as a separate web site.

In both cases, the problem is in evolving a group of pages from the main web site into its own sub-site. Introducing a new URL sub-directory or domain name means new URLs, which is exactly what my guidelines have tried to avoid.