Web site architecture requirements

This article for web site managers lists, explains and attempts to justify my choice of requirements for a web site architecture.

This page has now been superseded by my newer page on Web site Architecture.

summary

This article is not about web page design. It is about web site design, and it attempts to justify my choice of some requirements for a web site architecture. My follow-up article describes my Web site design guidelines, which show how to meet these requirements. I am writing for web site managers or maintainers, webmasters, web site gardeners or whatever you call yourself - you know who you are.

I wrote this article because, although a lot of people have written about how to make web pages look nice, there is little advice about how to manage a web site. These requirements are about scalability and being nice to users and maintainers, and are of varying importance. The requirements on this page are that:

every page should exist for ever at the same URL (important)
URLs should 'read well' (helpful)
there should be some limit to the length of URLs (helpful)
file names should be unique across the web site (not essential)
URLs should be 'choppable', so that the result is a valid URL (helpful)
in URLs, HTML file names should have a .html extension (helpful)
small sites should have a simple structure (helpful)
users should still be able to find pages, however big the site (important)
the site maintainers should be able to find files, however big the site (important).

the file system, navigation model and URLs

Three aspects of a web site's implementation are the organisation of the files on the web server, the form that the site's URLs take and the navigation model presented to the user on the pages. Many of the requirements above rely on the premise that there is not necessarily any fixed relationship between the file system, the navigation model and the URLs. For most web sites these three aspects of the site are not independent. However, unless they are separated, some of the requirements described in this article are mutually exclusive.

For example, if the URLs always refer to .html files on the web server directly, then the URL is determined by the file's location. This means that files at the 'bottom' of a deep hierarchy where directory names are whole words must have long URLs. Also, if a site's structure is initially simple, then a coherent hierarchical structure cannot be maintained without creating new categories as the site grows. This in turn requires that files are moved into the new directories, which cannot be done without changing a page's URL.
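One way to picture the separation is a lookup table that sits between URLs and files. The following is a minimal sketch in Python, not a description of any particular web server. The table URL_TO_FILE, the functions resolve and move_file, and the file locations are all hypothetical names made up for illustration.

```python
from typing import Dict, Optional

# Hypothetical table: public URL path -> where the file really lives on the server.
URL_TO_FILE: Dict[str, str] = {
    "/reviews/restaurants/cambridge.html": "files/cambridge.html",
    "/about.html": "files/about.html",
}


def resolve(url_path: str) -> Optional[str]:
    """Return the file that serves this URL, or None if the URL is unknown."""
    return URL_TO_FILE.get(url_path)


def move_file(url_path: str, new_location: str) -> None:
    """Record a new file location for a page; its public URL stays the same."""
    if url_path not in URL_TO_FILE:
        raise KeyError(f"unknown URL: {url_path}")
    URL_TO_FILE[url_path] = new_location
```

With a table like this, reorganising the files on the server only means updating the table, and the URL that a reader bookmarked keeps resolving.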

'niceness' requirements

DG1. every page should exist for ever at the same URL

This is the most important of all the requirements because you have no control over who links to your site, with the dubious exception of search engines. On the assumption that you want people to be able to read your pages, you should leave each page where each reader first found it. They will appreciate it.

Note: if you are really serious about this then you need to have your own domain name so that your URL does not change if (i.e. when) you change service provider.
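If a page really must move, the old URL can at least keep working by redirecting to the new one. The sketch below assumes that you keep a record of every move. The table MOVED, the function current_url and the example URLs are hypothetical, and a real site would normally express the same thing in its web server's redirect configuration.

```python
# Hypothetical record of moves: old URL -> the URL that replaced it.
MOVED = {
    "/restaurants.html": "/reviews/restaurants.html",
    "/reviews/restaurants.html": "/reviews/restaurants/cambridge.html",
}


def current_url(requested, moved=MOVED):
    """Follow the chain of recorded moves to the URL that serves the page today."""
    seen = {requested}
    url = requested
    while url in moved:
        url = moved[url]
        if url in seen:
            raise ValueError(f"redirect loop at {url}")
        seen.add(url)
    return url
```

Following the whole chain means that a page that has moved twice still needs only one redirect from its oldest URL.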

DG2. URLs should 'read well'

It is better to have URLs like /reviews/restaurants/cambridge.html than /file/all/03/rr9801219.html. That is, URLs should be composed of whole words, with no abbreviations other than common acronyms, such that the URL 'makes sense' when the words are read, in order.

If a URL reads well then the user is more likely to be able to copy it onto a piece of paper without getting it wrong, and then to remember what the URL refers to a week later. URLs that read well help the site maintainer in a similar way. In particular, the site's internal links are less likely to be incorrect - i.e. broken.
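Whether a URL 'reads well' is a matter of judgement, but a rough automatic check can catch the worst cases. The sketch below uses a heuristic of my own choosing, not a rule from this article: every part of the path must be made of lowercase words of three or more letters, optionally joined by hyphens or underscores. The function name reads_well is hypothetical, and the heuristic would wrongly reject two-letter acronyms.

```python
import re

# Heuristic assumption: a "word" is three or more lowercase letters.
WORD = re.compile(r"[a-z]{3,}")


def reads_well(url_path):
    """Return True if every part of the path looks like whole words."""
    path = url_path.strip("/")
    if path.endswith(".html"):
        path = path[: -len(".html")]
    if not path:
        return True  # the site root has nothing to read
    for segment in path.split("/"):
        for word in re.split(r"[-_]", segment):
            if not WORD.fullmatch(word):
                return False
    return True
```

Under this heuristic /reviews/restaurants/cambridge.html passes and /file/all/03/rr9801219.html fails, because 03 and rr9801219 are not words.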

DG3. there should be some limit to the length of URLs

If your site structure strategy involves a deep hierarchical structure then you can end up with very long URLs after the site has grown, especially if you don't use abbreviations. The disadvantage here is that the URLs become too long to remember, to fit in the browser's location text box, or to write down.
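A length limit is easy to check mechanically once you have chosen one. In the sketch below the limit of 60 characters is an arbitrary assumption, picked only for illustration, and the names MAX_URL_LENGTH and too_long are hypothetical.

```python
MAX_URL_LENGTH = 60  # assumed limit; choose whatever suits your site


def too_long(url_paths, limit=MAX_URL_LENGTH):
    """Return the URLs whose length exceeds the limit, in their original order."""
    return [url for url in url_paths if len(url) > limit]
```

Running a check like this over the whole site from time to time shows which parts of the hierarchy have grown too deep.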

DG4. file names should be unique across the web site

Although not essential, this helps the site maintainer. When files are in the wrong place, or have been edited in another location, then unique file names help the webmaster know where a file is supposed to go. If you are editing a lot of index.html files in different directories then it can be difficult to remember which is which. I have edited an index.html file and then put it back in the wrong place myself on more than one occasion.
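Finding out which file names are already duplicated is a simple grouping exercise. The sketch below assumes that you can produce a list of the site's file paths, for example from a directory listing. The function name duplicate_names is hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def duplicate_names(paths):
    """Group paths by file name and return only the names used more than once."""
    by_name = defaultdict(list)
    for path in paths:
        by_name[PurePosixPath(path).name].append(path)
    return {name: found for name, found in by_name.items() if len(found) > 1}
```

A site with an index.html in every directory would show up here as one long list under the name index.html.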

In the case of index.html directory index files, it is probably easy to keep file names unique by automatically redirecting requests for restaurants/ or restaurants/index.html, say, to some canonical name such as restaurants/restaurants_index.html.
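The rule for choosing the canonical name can be written down in a few lines. This is a sketch of the naming rule only, under the assumption that the canonical name is the directory's own name followed by _index.html, as in the example above. The function name canonical_index is hypothetical, and the site root is left alone because it has no directory name to use.

```python
def canonical_index(url_path):
    """Map a directory request to its uniquely named index file."""
    if url_path.endswith("/index.html"):
        url_path = url_path[: -len("index.html")]
    if not url_path.endswith("/"):
        return url_path  # not a directory request, so leave it unchanged
    directory = url_path.rstrip("/")
    if not directory:
        return url_path  # the site root has no directory name to reuse
    name = directory.rsplit("/", 1)[-1]
    return f"{directory}/{name}_index.html"
```

Both restaurants/ and restaurants/index.html then lead to restaurants/restaurants_index.html, while ordinary page URLs pass through untouched.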

DG5. URLs should be 'choppable', so that the result is a valid URL

If you are going to use URLs such as reviews/restaurants/cambridge.html then you should expect users to look in the parent directories, hoping to find more restaurant reviews at reviews/restaurants/, say.

Assuming that the restaurants directory really does contain more restaurant reviews, which it probably should, then you have at least three choices. You could let the user see a directory listing (naff), show the user a 'no such page' error (unfriendly) or have a restaurant reviews index page in index.html (nicest). Choosing the last option, though, results in the need to maintain a whole raft of index files, which probably have the same name.

Presumably users, such as myself, chop URLs like this after giving up on the site's navigation interface.
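To check whether a site's URLs are choppable, you can generate every chopped form of a URL and see which ones the site does not serve. The sketch below assumes that you already have a set of the URLs that work. The function names chopped_urls and broken_chops are hypothetical.

```python
def chopped_urls(url_path):
    """List every URL produced by chopping trailing parts off url_path."""
    parts = url_path.strip("/").split("/")
    if parts == [""]:
        return []  # the site root cannot be chopped any further
    chopped = ["/" + "/".join(parts[:i]) + "/" for i in range(len(parts) - 1, 0, -1)]
    return chopped + ["/"]


def broken_chops(url_path, valid_urls):
    """Return the chopped URLs that the site does not serve."""
    return [url for url in chopped_urls(url_path) if url not in valid_urls]
```

For /reviews/restaurants/cambridge.html the chopped forms are /reviews/restaurants/, /reviews/ and /, so each of those needs an index page of some kind.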

DG6. in URLs, HTML file names should have a .html extension

It is nicer for the user if the server always gives the right page back when he or she uses a URL ending with .html, whatever the real extension. The files on your server might have exotic extensions, such as tcl, pl, shtml, php3. They might even have a primitive extension, such as htm.
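The lookup that this requires is small: given a URL ending in .html, try each real extension in turn. The sketch below assumes that you have a set of the files that exist on the server. The list REAL_EXTENSIONS uses the extensions mentioned above, and the function name real_file is hypothetical. In practice this would usually be done with the web server's own rewriting rules.

```python
# Extensions to try, in order of preference (taken from the examples above).
REAL_EXTENSIONS = [".html", ".htm", ".shtml", ".php3", ".pl", ".tcl"]


def real_file(url_path, existing_files):
    """Find the file behind a .html URL, whatever its real extension is."""
    if not url_path.endswith(".html"):
        return None  # this sketch only handles URLs that end in .html
    stem = url_path[: -len(".html")]
    for extension in REAL_EXTENSIONS:
        candidate = stem + extension
        if candidate in existing_files:
            return candidate
    return None
```

The user only ever sees the .html form, so the technology behind a page can change without changing its URL.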

scalability requirements

DG7. small sites should have a simple structure

It should be easy to start small, without making it hard to grow. For example, if a site only has four pages then these should only have to be in four files in the root directory. It would not make sense to use a complex directory structure for just a few pages. After all, the site might never grow.

DG8. users should still be able to find pages, however big the site

A site's URLs and its directory structure both have a 'top', but it does not necessarily make sense for the web site itself to have one, in the way that an FTP site does. It might well make sense for the navigational model to be independent of the URLs and the file system directory structure.

When using most FTP clients, the starting point at an FTP site is probably the /pub directory. From there you navigate 'down' the directory structure to find the files you want. Similarly, a web site maintainer usually accesses files on the web server by starting at the top of the directory structure on the server. However, users of your web site will start at an arbitrary page. This is because other people and search engines will link to your content rather than to your main page. How to help the user navigate your site is outside the scope of this article, and is extensively discussed elsewhere.

DG9. the site maintainers should be able to find files, however big the site

This goal essentially says that there should be a coherent scheme for locating files on the web server. For a large site with many maintainers a coherent strategy is essential so that everyone involved saves or looks for each file in the same place.

This is related to, but may be independent of, all of the other web site architecture requirements. Ultimately, though, if your files on your server are a mess then your site is unlikely to stay neat and tidy for ever.

Writing by Peter Hilton