The "mail sifter" is a project to design a highly usable system for filtering, sorting, and routing e-mail. Here is a description of its planned implementation architecture as part of the Pine mail system. One of the key issues in designing the architecture is whether to do the sifting as the message is delivered or when the message is read by the end user. It is hoped that this architecture can accommodate both, including the possibility of having the mail sifter execute on a remote mail server. This is one of the primary reasons TCL [Ousterhout 94] was chosen as the scripting language for the sifters.
This design architecture focuses more on Internet-based e-mail than LAN-based e-mail, though many of the same issues apply. Internet-based e-mail usually has the messages delivered to a host computer on which the mail user agent (the mail reading program) runs. Lately, Internet e-mail has retained the delivery architecture but relies on mail access protocols that connect a personal computer to a server that has the mail store. LAN-based e-mail (like Microsoft Mail) has host computers that serve as post offices. Each post office serves some number of end users who read their mail on their personal computer. The end user software retrieves mail from the post office via a network file system. Mail is transferred between the post offices via a mail transfer protocol.
This architecture also focuses on sifting that can be done on the receiving end without any way to control how messages are composed. The Internet connects dozens of different mail systems and most users probably receive mail from many different systems on a regular basis. Building a system that requires sender and receiver to use the same software will have much more limited use than a filtering system that has useful functionality regardless of the origin of the message. For a discussion of an end-to-end system see the Information Lens [Malone 87].
Work is underway separately to refine a list of sifting tasks and functions [Lundblade 95]. The list below is derived from that work specifically for this analysis of the basic sifting architecture. It is neither a general-purpose low-level functional specification nor a user-oriented task analysis.
The list below refers to a general set of sifting criteria. These criteria include pattern-matching rules on the message header and body, probabilistic information retrieval applied to the message, tests of the message size, and criteria for message expiration.
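The criteria named above can be pictured as a small record that is tested against each message. The following is a minimal sketch in Python rather than TCL, chosen only for illustration. The names Criteria and matches, and every field name, are assumptions and not part of the planned design. Probabilistic retrieval is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from email import message_from_string
from email.message import Message
from email.utils import parsedate_to_datetime
import re
from typing import Optional


@dataclass
class Criteria:
    """One hypothetical sifting rule; a field left as None is not tested."""
    header: Optional[str] = None          # header name to test, e.g. "Subject"
    header_pattern: Optional[str] = None  # regular expression for that header
    body_pattern: Optional[str] = None    # regular expression for the body
    larger_than: Optional[int] = None     # match messages above this many bytes
    older_than: Optional[timedelta] = None  # match messages past this age


def matches(msg: Message, raw_size: int, crit: Criteria, now: datetime) -> bool:
    """Return True when the message satisfies every criterion that is set."""
    if crit.header and crit.header_pattern:
        value = msg.get(crit.header, "")
        if not re.search(crit.header_pattern, value, re.IGNORECASE):
            return False
    if crit.body_pattern:
        body = msg.get_payload()
        # Multipart bodies are not handled in this sketch.
        if not isinstance(body, str):
            return False
        if not re.search(crit.body_pattern, body, re.IGNORECASE):
            return False
    if crit.larger_than is not None and raw_size <= crit.larger_than:
        return False
    if crit.older_than is not None:
        try:
            sent = parsedate_to_datetime(msg.get("Date"))
        except (TypeError, ValueError):
            return False  # no usable Date header: treat as not expired
        if sent.tzinfo is None:
            sent = sent.replace(tzinfo=timezone.utc)
        if now - sent <= crit.older_than:
            return False
    return True


# Usage: test one message against a header rule.
raw = ("From: list@example.org\nSubject: Pine release\n"
       "Date: Mon, 02 Jan 1995 10:00:00 +0000\n\nNew version available.\n")
msg = message_from_string(raw)
rule = Criteria(header="Subject", header_pattern="pine")
print(matches(msg, len(raw), rule, datetime(1995, 2, 1, tzinfo=timezone.utc)))
```

Treating an unset field as "not tested" lets one record express a single criterion or a conjunction of several.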
The CCITT X.400 standard introduced the concepts of the mail user agent (MUA) and mail transfer agent (MTA). The MUA interacts with the end user and implements the mail reading and composing functions. It is only active when the user is present. MTAs see to the delivery of mail messages across networks and gateways, ultimately to a repository where the MUA accesses them. The repository is sometimes termed the "inbox" or "delivery slot". See [Bornstein 94] for a related and more detailed description.
When evaluating the sifting tasks above it becomes clear that sifting must be available at the last MTA which delivers the message into the inbox and in the MUA in order to implement all listed tasks. Vacation responses must be sent out when the MTA delivers to the inbox. They are silly if they are not sent out until the user returns from vacation to invoke the MUA. On the other hand it makes no sense to zoom in to a view of a group of messages if one is not using the MUA. Doing a one-time sort of a mail folder (e.g. separating a collection of messages about e-mail systems into two subject groups) is also something that is best done in the MUA. Delivering messages at a specific hour, the reminder service, requires further facilities and is not discussed here.
There are several more subjective reasons for filtering both in the MTA and the MUA. It is more efficient to discard any unwanted mail at the MTA than to wait for the MUA to discard it. This saves disk space and will allow the MUA to start up faster. On the other hand, implementing filtering in the MUA allows for tighter coupling between the sifter and the MUA itself. This will result in an easier-to-use interface for the sifter.
A further complication that interposes itself between the MTA and the MUA is the use of mail access protocols like POP [Rose 93] and IMAP [Crispin 90]. These allow the MUA to retrieve mail in a client-server fashion from the inbox. They form a bridge between non-multi-tasking personal computers that cannot easily accept incoming mail and time sharing hosts which cannot run the latest GUI mail reading software. In this scheme incoming mail is delivered to the mail server or mail store. The client software running on a PC (but not necessarily a PC) retrieves the mail using the mail access protocol. In some cases (POP) the mail is transferred completely to the MUA host computer where all permanent mail is stored. In other cases (IMAP) the mail may be left on the server where it was deposited by the MTA. It may even be the case that all mail is stored on the server and none on the client host running the MUA. A third possibility involves storing mail both at the server and at the client and having a protocol that resynchronizes them when the client contacts the server.
Revisiting the issue of where to do the sifting, there are several advantages to being able to do the sifting on the server. In a situation where one retrieves their e-mail from the server over a slow or expensive link it is advantageous to delete any unwanted mail before it is transferred. It is also advantageous to execute a search over a large collection of mail and only send the results, rather than transferring the whole collection and then executing the search on the client.
Pine and several other MUAs use the "c-client" library for access to the mail store. The c-client provides a generic API for access to a mail store and a number of data structures and functions for manipulating e-mail messages. There are also delivery agents and POP and IMAP servers based on the c-client. The c-client incorporates drivers for mail stored in a number of different formats, as well as drivers for IMAP, POP, USENET news and some other data formats.
Building the mail sifter on top of the c-client is advantageous in the first place because Pine uses it and it provides a lot of functionality that is needed. The mail sifter will work on mail in many formats, USENET news, and other data formats for which a c-client driver is written. A further advantage comes from the fact that several mail access protocol servers are built on the c-client, as well as a mail delivery agent. This means that essentially the same mail sifter code can be used in the MUA, the mail access protocol server, or the delivery agent (which is invoked by the last MTA to place the message in the inbox).
The next issue that arises is how to describe the sifting rules, in particular if they are potentially going to be created by an MUA and uploaded to a mail access protocol server. The obvious solution is to have a simple configuration file that lists the rules. This can easily enough be uploaded to a server and executed by code written in C that is common to the server, the MUA and the final MTA delivery program.
However, a design inspired by safe-tcl [Bornstein 93] provides some additional advantages. Safe-tcl already defines TCL syntax and semantics for a large amount of the functionality needed to describe sifting rules. It includes access to all parts of the mail message and primitives for creating and sending messages. TCL also provides the full functionality of a programming language and will easily fulfill the need for things like string matching and boolean logic operations. Safe-tcl does not provide access to mail store functions such as storing a message in a mail folder. These can be provided with TCL extensions that reflect some of the c-client's functionality. It is also most likely best that probabilistic retrieval be implemented in C and made available through a TCL interface, rather than trying to implement it in TCL only.
The result is shown in figure 2. The sifter rules configuration that is actually executed is expressed in an extended TCL scripting language. The TCL script is invoked for each message to be filtered. Return codes can signal what is to be done (highlighting, deleting, etc.) with the message in the current folder. Other TCL primitives can be used to generate and send a response to the message, to file the message in another folder, or to take other action.
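The per-message invocation described above can be sketched as a driver loop that calls the script once per message and collects its return code. The sketch is in Python rather than TCL, for illustration only. The return code values, the file_into primitive, and the function names are assumptions. Filing is modelled as adding the message to a named folder, which the design does not specify.

```python
from email import message_from_string
from email.message import Message
from typing import Callable, Dict, List, Tuple

# Hypothetical return codes; the text names highlighting and deleting only.
KEEP, HIGHLIGHT, DELETE = "keep", "highlight", "delete"

Script = Callable[[Message, Callable[[str], None]], str]


def example_script(msg: Message, file_into: Callable[[str], None]) -> str:
    """Stand-in for a sifter script: called once per message, returns a code."""
    if "make money fast" in msg.get("Subject", "").lower():
        return DELETE
    if "pine-info" in msg.get("To", "").lower():
        file_into("pine-info")  # side effect through a primitive
        return KEEP
    if msg.get("From", "").lower().startswith("boss@"):
        return HIGHLIGHT
    return KEEP


def run_sifter(script: Script,
               messages: List[Message]) -> Tuple[List[str], Dict[str, List[Message]]]:
    """Invoke the script for each message; collect codes and filed messages."""
    folders: Dict[str, List[Message]] = {}
    codes: List[str] = []
    for msg in messages:
        # Bind the current message so the primitive files the right one.
        def file_into(folder: str, _msg: Message = msg) -> None:
            folders.setdefault(folder, []).append(_msg)
        codes.append(script(msg, file_into))
    return codes, folders


# Usage: sift three small messages.
texts = (
    "From: a@example.com\nTo: me@example.com\nSubject: MAKE MONEY FAST\n\nhi\n",
    "From: b@example.com\nTo: pine-info@example.org\nSubject: question\n\nhi\n",
    "From: boss@example.com\nTo: me@example.com\nSubject: status\n\nhi\n",
)
codes, folders = run_sifter(example_script, [message_from_string(t) for t in texts])
print(codes, sorted(folders))
```

Keeping the return code separate from the primitives means the caller (MUA, server, or delivery agent) decides how to act on "highlight" or "delete" in its own folder.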
The TCL-sifter scripts may be created manually, but this can't be expected from most users. A configuration program that generates the TCL scripts will be needed. In most cases the scripts will be very simple, and the generator will be little more than a series of print statements with a few variable substitutions. The ability to create scripts manually will be a big advantage to power users. They will be able to create very powerful and complicated scripts without having to write C, have a copy of the source code, or recompile.
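A generator of this kind might look roughly like the following Python sketch, which emits TCL text from a list of rules. The TCL commands sift_header and sift_file are hypothetical stand-ins for the c-client extensions, which are not yet defined. Only "string match" and "string tolower" are standard TCL. Values containing braces or backslashes are rejected rather than escaped, to keep the sketch short.

```python
from typing import List, Tuple

Rule = Tuple[str, str, str]  # (header name, glob pattern, destination folder)


def tcl_word(value: str) -> str:
    """Brace-quote a value so TCL performs no substitution on it."""
    if any(ch in value for ch in "{}\\"):
        raise ValueError(f"unsupported character in {value!r}")
    return "{" + value + "}"


def generate_script(rules: List[Rule]) -> str:
    """Emit one TCL 'if' per rule: match a header, then file the message."""
    lines = ["# Generated sifter script; sift_header and sift_file are hypothetical."]
    for header, pattern, folder in rules:
        # Lower-case both sides so the glob match ignores case.
        lines.append(
            f"if {{[string match {tcl_word(pattern.lower())} "
            f"[string tolower [sift_header {tcl_word(header)}]]]}} {{"
        )
        lines.append(f"    sift_file {tcl_word(folder)}")
        lines.append("}")
    return "\n".join(lines) + "\n"


# Usage: one rule that files Pine list traffic.
print(generate_script([("Subject", "*Pine*", "pine-list")]))
```

Because the generator controls every construct it emits, scripts produced this way contain no loops.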
Perhaps the most substantial advantage to using a scripting language is that it can be uploaded to a mail server (e.g. POP or IMAP) to implement mail sifting as the message is delivered by the MTA.
A further advantage is that of great extensibility and flexibility in the sifting functionality at the server without having to change the server software, especially when the client user has no control of the server software. When a new sifter feature is conceived it can simply be coded in TCL and uploaded to the server. Two different kinds of mail client software that share access to a mail access server can also both make use of the sifting features on it. In fact the sifting features would be more tied to the client than the server.
There are some downsides to this though. If a script is coded by hand and accidentally has an infinite loop in it there will be problems. The larger problem is at the server because the user has no way to abort it. Allowing the uploading of code for a general purpose programming language to a server changes the character of the server from one where the resources consumed by a client were fairly fixed to one where they are potentially unchecked. It is possible to place limits on the amounts a server instance is allowed to consume, but it will be difficult to gracefully terminate a script in progress. Another way to avoid some of these adverse conditions on the server is to have all scripts uploaded to the server be generated by a program, that is, to discourage or even prevent users from manually writing scripts to be uploaded.
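One way to picture such a limit is a wall-clock budget enforced from outside the interpreter. The Python sketch below is an assumption-laden illustration: a child Python interpreter stands in for a TCL interpreter instance, and run_with_limit is a hypothetical name. It also shows the difficulty noted above, since the runaway script is killed outright rather than terminated gracefully.

```python
import subprocess
import sys


def run_with_limit(script_source: str, seconds: float) -> str:
    """Run untrusted script text in a child interpreter under a time limit."""
    try:
        done = subprocess.run(
            [sys.executable, "-c", script_source],
            capture_output=True,
            text=True,
            timeout=seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child: abrupt, not graceful.
        return "aborted"
    return "ok" if done.returncode == 0 else "failed"


# Usage: a well-behaved script and one with an infinite loop.
print(run_with_limit("x = 1 + 1", 10))
print(run_with_limit("while True: pass", 1))
```

A wall-clock limit bounds only time. Memory and disk consumption would need separate limits.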
In summary, using a TCL-based scripting language has some significant advantages in its generality and open-endedness. It does not restrict a priori the functionality of a mail sifter. It also has the potential to be a common scheme for filtering among several mail clients. It has some downsides in that an interpreter is required, and implementing the interpreter and script generator is probably more work than implementing a simple rules file. It also may allow a user to consume a disproportionate amount of resources on the server that executes the sifter. Last, using an existing language like TCL leverages the expertise of its designers and, because it is in common use, takes advantage of a growing community of TCL programmers.
As of this writing none of this is implemented. The first step will be creating TCL interpreter extensions for the c-client in order to build the most basic sifter that matches patterns in message headers and files messages based on those patterns. Since a large part of the goal of the project is to study usability of the end system the priority will be on adding sifter features and advancing the sifter configuration manager, rather than on creating server and MTA sifters.