|
To understand what a device is good for, it is often helpful to take it apart and examine how it works. Here's an introduction to how Arch does what it does. |
|
Gnu Arch
Support Services
Products
Alternative/Regional Providers |
This page contains a brief dissection of Arch -- a high-level description of how Arch works. The intention is to give technically savvy readers deeper insight into Arch by helping them to develop some intuitions about how Arch is implemented. There are two parts and, depending on your interests, you might want to skip the first. These are: Performance PrinciplesBefore examining the details of the Arch implementation directly, it will be helpful to name a few performance principles that shape the overall design: Minimize Server-side ComputationArch allows multiple users to access a single archive over a network. Consequently, although it also supports serverless, local operation, Arch must support a client-server mode of operation. We wish Arch to scale in number of concurrent users accessing an archive as far as possible. For example, if a popular project commits an important bug-fix, there may be thousands or tens of thousands of users who want to update their local sources after that commit. Consequently, the design of Arch aims to
The primary strategy for doing so is to
Disk Space is CheapMost revision control systems in widespread use today were conceived or designed at least a decade and in some cases two decades ago. Even some contemporary designs follow design patterns that were established that long ago. Systems designed with that older perspective place a very high premium on disk space. Although they work hard to be fast, nevertheless they are apt to make trade-offs that consume I/O bandwidth and CPU cycles in favor of consuming disk space. Even worse, these trade-offs often add to the implementation complexity of these systems. These days, such trade-offs are already somewhat anachronistic and, within just a few years, such trade-offs will be completely absurd. Multiple gigabyte disks can be had for under $200USD and, before long, we'll have terabyte disks in such price ranges. Storage at those scales is more than ample for revision control purposes. The lesson?
I/O Bandwidth is ExpensiveUnfortunately, there is for the forseeable future still a sharp constraint about just how much disk space we can touch for a given arch operation: I/O bandwidth remains expensive, especially relative to disk capacity. I/O Bandwidth costs are imposed at more than one layer. On-device cache space limits, seek times, transfer times, OS cache space limits, and CPU cache space limits, and main-memory speeds all constrain the utility of passing lots of data back and forth to disk.
Networks are SlowA fast, local network connection can easily have lower latency and higher bandwidth than a local disk. Nevertheless, the expected case for most Arch operations over a network is that they pass over the Internet or a comparable network. Minimizing roundtrips and the quantity of data transfered is critical.
A Layered ArchitectureAnd now we turn to the implementation structure of Arch itself. Arch is a layered architecture and it's easiest to understand by starting at the lowest level and working upwards.Tree InventoriesThe lowest level concept in Arch is that of a tree inventory. A tree inventory is a listing of all of the files and directories in a tree, with two fields per file or directory. These fields are:
In other words, if a file is renamed, its physical name changes, but its inventory id remains the same. This allows Arch to recognize that a file has been renamed by comparing two versions of the physical name for a given inventory id. Here is a fragment of a directory listing showing physical names and inventory ids:
physical name: inventory id:
links/Makefile.in i_x9034AKLJn3
links/make-links.sh.in i_lk3n8908nNx
links/remove-links.sh.in i_lk3nn90lkjs
How are inventory ids associated with files? How does arch know what the inventory id of a given file is? Three mechanisms are provided:
ChangesetsTraditional Traditional The next layer of Arch takes For example, as we'll see below, when you Namespace and ArchiveThe combination of The next layer of arch, the namespace and archive layer, essentially just automates that by hand process. It provides a mechanism for naming (with globally unique, human-readable names) the changesets programmers create, and a mechanism for publishing changesets and retrieving published changesets. Global Revision NamesThe arch namespace is organized around the concept of a development line -- a sequence of revisions ordinarilly beginning with a complete tree called a base revision and proceding with a number of successive changeset revisions, each defined relative to its predecessort by a changeset. For example, a programmer might initially import the source for
an implementation of tar--mainline--1.0--base-0 After fixing a bug, the programmer can commit the changeset describing the bugfix. The changeset and the tree resulting from applying it are known as: tar--mainline--1.0--patch-1 Additional work and more commits would yield:
tar--mainline--1.0--patch-1
tar--mainline--1.0--patch-2
tar--mainline--1.0--patch-3
... etc.
This development line is situated within a particular arch archive which itself has a name. Prefixing the archive name to a revision name yields a globally unique name for the revision: lord@emf.net--gnu-archive-2004/tar--mainline--1.0--patch-3 Tags are named in the same way and function comparably to base revisions. For example, instead of importing a new source tree to create tar--mainline--1.0--base-0the programmer might have defined that revision to be a tag revision -- more or less an an alias within the namespace. We often use notation like: tar--mainline--1.0--base-0 => tar--experimental--1.0--patch-4to mean "the base-0 revision in the mainline branch of tar is a tag revision, tagging the patch-4 revision in the experimental branch". Archive StorageThe most basic role of an arch archive, therefore, is to store base revisions, commit revisions, and tag revisions in a way that (1) associates them with their name; (2) allows programmers to query and retrieve revisions and the changesets and other data associated with them. Internally to arch, the storage mechanism is isolated from the rest of the implementation by an abstraction barrier. Entirely new storage managers are easy to install. In practice, only a very simple storage manager is used and
experience has shown this stone-simple approach to work well. In
particular, archives are stored as ordinary files: imported base
revisions are stored as a compressed tar file of the entire source
tree; changeset revisions created by The stone-simple approach has some enourmous advantages and is one are where the performance principles enumerated above come into play. Some example advantages:
Archive Caching, Revision Libraries and Other SpeedupsThe archive storage layer has many advantages but it also creates a performance problem that needs to be solved: Simple Archive Formats Can be SlowConsider how, given just the archive storage mechanism described above, a particular revision can be retrieved. Let's suppose that we want to get a copy of the tree for:
tar--mainline--1.0--patch-4
Absent any other mechanism, we would build that tree with the following steps:
That's conceptually simple, and it's a direct consequence of
our choice of "dumb server" storage management, but it is
problematic. For example, suppose that instead of wanting the
Worse, in the case of wanting the Arch solves these problems with three mechanisms: archive cached revisions, archive mirrors, and client-side revision libraries. Archive Cached RevisionsOne way to reduce the amount of changeset application and network bandwidth associated with building a revision is to apply our performance principle that disk space is very cheap. Specifically, we augment the archive format to permit
(user-directed) caching of pre-built revisions, stored as
compressed tar files. Continuing the earlier example, a user may
have asked that the revision Now, if a user asks for a copy of the In keeping with our goals of client-side computation and dumb servers, when an archive cached revision is created it is constructed on some client, bundled as a tar file, and stored on the server. Which revisions should be archive cached? One rational, approximately time-and-bandwidth optimal policy would be to space cached revisions such that the sum of the sizes of changesets between any two cached revisions is about the same as the size as the tar file containing the cached revision. Spacing them further apart would lead to a situation where sometimes the total size of changesets fetched to build a revision exceeds what could have been fetched with denser caching. Spacing them closer together will cause arch to sometimes fetch a cached revision when it would have been faster to fetch several changesets instead. Note that applying this policy approximately doubles the size of an archive -- which is quite accessible since archives are relatively compact to begin with and, after all, disk space is very cheap. Other caching policies make good sense in some situations, though. Therefore, we leave it to users to explicitly cache revisions -- a process that can trivially be automated to implement whatever policy a user chooses. Archive MirrorsA second technique for saving on network costs is to use archive mirrors. Arch contains a facility for constructing and incrementally updating local, read-only copies of remote archives. When such a mirror is used, each changeset or base revision from the remote archive is fetched exactly once and, thereafter, access to it is entirely local. In our example, this would still (by default) mean applying 140 changesets -- but at least those changesets would reside on a local disk. Archive caching and archive mirroring interact in a favorable way. Although mirrors are generally read-only, users may archive-cache a revision in a mirror. For example, one reasonable set-up is to publish a heavily read archive on the Internet, but archive-cache nothing there. Instead, encourage users to mirror from that public archive rather than use it directly, and to archive-cache according to their needs in their local copies. Revision LibrariesThe final technique for speeding up arch operations is entirely a client-side optimization. As with archive mirrors and archive caching, it trades-off disk space for speed. The technique provides for something called a revisision library. A revision library is a read-only collection of pre-built whole-tree copies of revisions, stored more or less as ordinary trees. A special feature of revision libraries is that unmodified
files are shared between revisions via hard-links. For example,
suppose that a revision library contains This is a reasonably space-efficient way to store many, many revisions in space which is a fraction of the sum of their individual sizes. The revision library mechanism is designed to be configurable as a cache or memo. Revisions can be automatically added to a library, on demand, or they can be explicitly added. It is a trivial matter for users to implement a "pruning" policy to discard revisions that are no longer needed. When a revision library is present, many arch operations become
very, very fast. For example, suppose again that we want a
Often the tools that a user uses to work on their checked-out trees are well-behaved with respect to hard-links. Emacs and vi, for example, do not modify the files you edit --- they write fresh copies of those files containing your modifications and thus they break hard-links to those files. Arch can take advantage of the hard-link-safety of common tools. When checking-out a tree, you can ask arch to not recursively copy the tree form a revision library but, instead, to make your working copy by hard-linking to the revision library. This turns out to be a blindingly fast way to check out a tree. (And, arch contains integrity checking features for revision libraries. Should your tools fail to be hard-link-safe after all, arch will notice and protect you from any resulting revision library corruption.) Speedups SummarizedThe basic archive storage model of arch is simple, space-efficient, and often but not always network- and time-efficient. In some cases, however, without additional mechanisms, it would lead to gross network- and time-inefficiencies. We eliminate those gross inefficiences with three mechanisms, archive caching, archive mirroring, and revision libraries, which have in common that that they trade disk-space and client-side computation for better performance overall. A pleasant property of this approach is that it requires _no_ server-side computation and thus, excellent speeds can be achieved without sacrificing the scalability or arch in the slightest. Patch Logs and MergingWe've seen so far how arch allows users to commit import, changeset, and tag revisions. We've seen that these are assigned globally unique names and examined how they are stored in archives. We've seen how revisions are retrieved from archives and looked at some of the speed-up mechanisms that exist to make retrieval fast. Now we turn to the question of merging. Suppose, for example, that two users are each working on their own development line of a single project. Naturally, they may want to share work: each user will want to combine his changes with changes made by the other. How does arch handle this? To understand the answer, we must first understand patch logs and then we can turn to history sensitive merging. Patch LogsWhenever a user creates any new revision he must provide a short description of the revision. The description, formatted similarly to an email message, contains a header that provides a brief summary of the revision, and a body that can contain details. Arch maintains a special directory, called For example, when a tree is first imported, the user writes a
patch log entry and that is stored in a file under
Similarly, when a user This implies, for one thing, that a given revision contains
patch log files for its ancestors in its development line. The
tree for When patch log files are added to a revision, arch adds additional, automatically computed headers. For example, one header names all newly added files in the revision. Another header reports what files have been deleted. Conveniently, patch log files can be processed and summarized
to produce a conventional History Sensitive MergingConsider, then, what happens if a user wants to add to his tree the changes assocated with a revision made in a different development line. For example, suppose that I have checked out a copy of tar--mainline--1.0--patch-140and I note that a revision in a different branch, namely tar--bugfixes--1.0--patch-2is a changeset revision and that the changes it makes fix a bug I'd like fixed in mainline. Arch can accomplish this quite easily simply by fetching the
changeset for the bugfix and applying it to my
tar--mainline--1.0--patch-140will contain a patch log entry for tar--bugfixes--1.0--patch-2 That's handy in and of itself. For example, if I don't remember whether the bug-fix changes have been applied to my tree, I can use arch to query the patch log files and look for a record of it having been applied. It's also handy for smart merging. For example, in my mainline tree, I can ask arch: % tla whats-missing tla--bugfixes--1.0and it will respond with something like:
patch-1
patch-3
patch-4
indicating that I have not applied bugfix patches 1, 3, and
4 -- but have applied bugfix patch 2.
Using the summary lines from log messages, arch can report:
% tla --summary whats-missing tla--bugfixes--1.0
patch-1
fix the help message
patch-3
fix buffer overflow (bug #238)
patch-4
correct exit status (bug #14)
and a command such as: % tla whats-missing tla--bugfixes--1.0 | tla replay --list -will apply just the missing patches but none that I've already applied. Merging isn't limited to just the changesets created by
And the Rest is EasyThe above summarizes the essense of what could be dubbed core arch. Although many details have been omitted or glossed over, hopefully you now have some intuition about how arch works. Many commands are built on top of the basic facilities outlined above. In all cases, they boil down to simple combinations of the above ideas: changeset computation and application, building revisions based on archive and revision library data, and so forth. |
Copyright 2004, Thomas Lord (lord@emf.net)