Introduction¶
Currently, the only caches available to Plan 9 systems are the in-kernel image cache and CFS. Both are very loose caches and do little to reduce the latency of small operations (they help mostly for large operations, where they can avoid streaming data back from the server).
Here is a proposal for a way for cache controllers or servers to notify clients or caches of invalidation events. It allows caches to know (up to latency) that they are up to date with the server state. The proposal does not alter the 9P protocol at all (in contrast to Op); instead it relies on additional anames to provide a control channel. We refer to the protocol run over the cache control aname as a side-protocol to emphasize that it is spoken beside an ordinary 9P stream.
This protocol has been designed to allow caches to hide latency as well as save bandwidth. We hope to allow caching of directories and to reduce or eliminate altogether the need to stat() files on the server to check cache validity. Caches are purely event-driven; the protocol does not call for timeouts of any sort. Caches here remain write-through for simplicity.
The protocol is engineered to be small and simple. Some optimizations have been considered, but above all the emphasis has been on getting the fundamental approach right and allowing some extensibility through the namespace exported by the cache controller.
Existing Caches: CFS¶
CFS is currently a single-threaded, synchronous cache. When cfs starts up, it makes no assumptions about the validity of elements in its cache. Its behavior is simple: directories are not cached and cache status is validated only on Topen requests:
- For directories (files with QTDIR), pass the operation to the server.
- For directory-specific operations (Twalk, Tcreate, Tremove), pass the operation to the server.
- For Twrite and Twstat messages, pass the operation to the server and update the cache.
- For Topen, pass the operation and mark the in-memory version with the resulting QID.
- If the request is a Tread:
  - Find the QID in cache; if the version reported by Topen mismatches, throw out the cached data.
  - If the data are not present, forward the request to the server, collect the results, cache them, and respond to the client.
  - Otherwise, answer from cache.

There is also some special handling of QTAPPEND files.
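As a rough illustration of the validation flow above, here is a minimal single-client sketch in Python. All class and method names here are ours, not cfs's; this is a model of the described behavior, not the actual implementation.

```python
class CacheEntry:
    def __init__(self, version, data):
        self.version = version  # qid.vers observed at last Topen
        self.data = data

class CfsLikeCache:
    def __init__(self, server):
        self.server = server          # fake server: read(path), qid_version(path)
        self.entries = {}             # qid.path -> CacheEntry
        self.opened_version = None

    def topen(self, path):
        # On open, record the QID version the server reports.
        self.opened_version = self.server.qid_version(path)

    def tread(self, path):
        entry = self.entries.get(path)
        # Version mismatch: throw out the stale cached data.
        if entry is not None and entry.version != self.opened_version:
            del self.entries[path]
            entry = None
        if entry is None:
            # Miss: forward to the server, cache the result.
            data = self.server.read(path)
            self.entries[path] = CacheEntry(self.opened_version, data)
            return data
        # Hit: answer from cache without contacting the server.
        return entry.data
```

Note that, as in cfs, validation happens only via the version collected at open time; between opens, the cache is trusted blindly.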
The Callback Scheme¶
If the server (or a cache controller) published an append-only list of all changes (let’s call it “the journal”) to the filesystem, then:
The state of the filesystem could be described as the size of this stream.
A cache’s contents may be described as a subset of the state given by an offset into this stream. (Several offsets may work if the intervening changes are orthogonal to the contents of the cache; however, the caches themselves can track their offset and so always have a dividing line between the “seen” and “unseen” changes.)
The notion of callbacks comes about by allowing reads to the journal to block. If a client is “behind” the leading edge of the log, then reads should return immediately. When (if) a client catches up to the leading edge, the first read should return immediately with zero bytes available, informing the client of its caught up state. Subsequent reads from this client at the edge should block until the journal is appended.
In this way we have added asynchronous callbacks to 9P without altering the protocol, and in particular without altering its inherently client-driven nature. There is precedent within Plan 9 for files with quasi-blocking semantics as above: the usb audio driver’s mixer file behaves similarly. See usb(4).
While the cache is not caught up, it may block client requests or may revert to a Tstat-based validation behavior.
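The quasi-blocking read semantics above can be sketched as a single-threaded simulation. All names here are illustrative, and a real journal read would block rather than return a would-block marker:

```python
class Journal:
    """Append-only journal with the read semantics described above."""

    def __init__(self):
        self.entries = []       # append-only list of change records
        self.caught_up = {}     # per-reader: has the zero-byte read happened?

    def append(self, rec):
        self.entries.append(rec)

    def read(self, reader, offset):
        """Return (records, new_offset); "WOULDBLOCK" stands in for blocking."""
        if offset < len(self.entries):
            # Behind the leading edge: return immediately with the backlog.
            recs = self.entries[offset:]
            self.caught_up[reader] = False
            return recs, len(self.entries)
        if not self.caught_up.get(reader, False):
            # First read at the edge: zero bytes, telling the reader it
            # has caught up with the server state.
            self.caught_up[reader] = True
            return [], offset
        # Subsequent reads at the edge block until the journal grows.
        return "WOULDBLOCK", offset
```

The zero-byte read is the moment a cache learns it may stop Tstat-based validation and trust its contents.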
Contents of the Callback Journal¶
A sufficient, though much larger than necessary, journal would consist of all operations given to the filesystem. This has several unpleasant features, but it would, in theory, suffice.
9P already offers us a convenient, guaranteed-unique small tag on every entity served by a file server: the qid. If the journal held only the qid of every altered (written, wstat-ed, or removed) file, caches could know to invalidate their cached copies of content from that qid. There is no information exposure except possibly the rate of file operations on the server; not even user information is given in this channel, making it very unlikely to be a leak.
Directories may also be safely cached: they have qids, of course. They are modified only by Tcreate and Tremove messages. Note that Tremove should journal an update to both the directory itself and the file removed, as caches may have discarded one but held the other.
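A qid-only journal consumer reduces to a very small rule, sketched below with the Tremove case journaling both qids (names and the integer qids are ours):

```python
def apply_journal(cache, journal_qids):
    """Invalidate cached content for every journaled qid.

    cache: dict mapping qid.path -> cached bytes (a stand-in for real
    cache storage).
    """
    for q in journal_qids:
        cache.pop(q, None)   # forget the entry if we held it
    return cache

# A Tremove of file qid 7 in directory qid 3 journals *both* qids,
# since a cache may have discarded one but held the other.
cache = {3: b"dir contents", 7: b"file contents", 9: b"unrelated"}
apply_journal(cache, [7, 3])
```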
ReQID operations¶
Simply enumerating mutated qids would allow caches to invalidate their contents, but if a file is open and becomes invalid, the cache must revert to being pass-through and will have to fetch the new version of the file anew. This happens because the cache is not informed of the new qid and so cannot relabel the contents of the open file that it holds or may subsequently receive (if we go with typed journals; see below). We would therefore like the journal to contain pairs of qids, (old, new), to allow a cache client to re-read an entire file (or its suffix, if QTAPPEND) and bring itself back up to date, so that it does not have to re-download the contents on a subsequent Topen.
Todo
I’m not really sure what to do about this! The cache controller can re-walk fids to find out the new QIDs, but that’s going to be annoying. This suggests against interposition, or perhaps an extension to 9P whereby mutation events return the new QID.
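The relabelling hoped for above might be sketched as follows, assuming a hypothetical server handle with a read-by-qid method (both names are our inventions for illustration):

```python
def apply_requid(cache, server, old, new):
    """Handle a journal entry carrying a (old, new) qid pair.

    Rather than dropping the entry and reverting to pass-through, the
    cache re-reads the file under its new qid and relabels its state,
    so a later Topen need not re-download the contents.
    """
    if old in cache:
        cache.pop(old)
        # Re-read the whole file (or just the suffix, for QTAPPEND
        # files) and store it under the new qid.
        cache[new] = server.read(new)
    return cache
```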
Filtered Callbacks¶
However, even the journal of all qids changing on a server might be rather large and mostly uninteresting to every cache. We could imagine that the server/cache controller could be selective and only advance the logs of caches that had ever read from that qid.
If the cache identifies itself (quasi-)uniquely to the cache controller, the cache controller may keep per-cache lists of qids and provide many filtered journals containing only those qids. Note that these journals should continue to be updated even after a cache has disconnected, under the assumption that it will reconnect soon. This implies that the cache controller will be storing some duplicated state in the journals, but we can easily guarantee that this state will not increase without bound.
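Server-side filtering could be sketched like this (illustrative names; a real controller would key its filters by journal file rather than an id string):

```python
class Controller:
    """Per-cache filters: a mutation is journaled only for caches that
    have ever read the mutated qid."""

    def __init__(self):
        self.filters = {}    # cache id -> set of qids it has read
        self.journals = {}   # cache id -> list of journaled qids

    def register(self, cache_id):
        self.filters[cache_id] = set()
        self.journals[cache_id] = []

    def note_read(self, cache_id, qid):
        # Called when a cache reads content labelled with qid.
        self.filters[cache_id].add(qid)

    def mutate(self, qid):
        # Journals advance even for disconnected caches, on the
        # assumption that they will reconnect soon.
        for cid, interested in self.filters.items():
            if qid in interested:
                self.journals[cid].append(qid)
```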
Server-side Journal Size Management¶
Holding all journal data forever is unwise. The cache controller is free to forget (and return Rerror messages for) sufficiently old journal entries. Caches must interpret this as a signal that they should revert to Tstat behavior. (Optionally, they may seek forward to the tail of their journal and preemptively invalidate entries in the cache; this may be useful to reclaim space, or it may not be useful at all.)
The correct recovery procedure, upon an Rerror from the per-cache journal file, is to mark the cache as stale, to seek to the end of the file, and to dispatch a read, as if there had been no interruption. All entries in the cache are now stale, but the journal is caught up and all subsequently revalidated files will be shot down as appropriate. At this point, the cache may revert to cfs-like behavior, validating the cache contents with Topen (if the client requests a Topen) or synthesized Tstat messages (on other client requests).
The controller may delete the filtered journal file, as well as backing store material. In this case, the correct procedure for the cache is to re-open the file and resume reads from offset zero, marking the cache invalid as above. Deletions may happen even while the cache is connected if the controller becomes sufficiently resource-starved to merit dropping the client’s filter.
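The recovery procedure might be sketched as follows; the names are ours, and the JournalError exception is a stand-in for the 9P Rerror:

```python
class JournalError(Exception):
    """Stand-in for an Rerror: the requested entries have been forgotten."""

def pump_journal(cache, journal):
    """Consume journal entries; on expiry, mark everything stale and
    seek to the leading edge as if there had been no interruption.

    cache.stale is a set of qids needing revalidation on next use.
    """
    try:
        for qid in journal.read_from(cache.offset):
            cache.stale.add(qid)
            cache.offset += 1
    except JournalError:
        # Entries we needed were forgotten: every cached item is suspect.
        cache.stale.update(cache.contents)
        cache.offset = journal.end()   # seek to the leading edge
```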
Journal Rollover¶
File sizes in 9P are rather large, but it should be possible to roll journal files over (back to offset 0). One option is simple deletion of the journal file, but that is somewhat kludgy and results in total invalidation of the cache.
Todo
Perhaps sequence-number arithmetic can save us?
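One hedged answer to the question above: RFC 1982-style serial number arithmetic (as used for DNS zone serials) would let journal offsets wrap unambiguously, so long as no cache falls more than half the number space behind the leading edge:

```python
# Serial number arithmetic over a fixed-width space (RFC 1982 style).
BITS = 32
HALF = 1 << (BITS - 1)
MOD = 1 << BITS

def serial_lt(a, b):
    """True if offset a precedes offset b in serial-number order.

    Valid only when the true distance between a and b is less than
    half the number space; a laggard cache beyond that must be treated
    as expired (Rerror, as in the section above).
    """
    return (a != b) and ((b - a) % MOD) < HALF
```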
Alternate Uses As A Notification Channel¶
As pointed out by Uriel at IWP9 2007, this protocol may have uses other than cache control: in particular, it naturally provides an arm/notify/rearm API similar to Linux’s inotify. Even for servers supporting live queries (à la BeOS), this protocol may still have a place as the back channel notifying clients that their queries have been updated.
The Journal Side Protocol¶
9P servers may export multiple file systems. We propose a synthetic file system to be exported alongside the others, possibly with the suffix /cc (for cache control). Within this file system, we will expose a directory called “journals” (segregated in case we later wish to add additional functionality to the cache control stream, e.g. locks and leases. Later, later.). The files inside serve to name caches and identify their connections.
A cache wishing to play the filtered callback game first looks up or generates its probably-unique identifier (sufficiently large random numbers will suffice). It then carries out authentication as usual and Tattach-es to the /cc system. It then opens the file bearing its name in /journals on the /cc export, creating it if it does not exist. Now it Tattach-es to the corresponding normal file system. (Because authentication has taken place, the permissions on this file may be locked to the credentials that were used to authenticate, meaning that cross-user journal snooping is not possible.)
While it is believed that filtering will be useful on large installations and will close the minimal information leak of server mutation rate, for initial implementations the file name /journal in /cc is reserved for an unfiltered journal. If exposed, this file indicates that the controller will provide an unfiltered journal, possibly in addition to filtered ones. It is then the cache’s choice which to use.
The cache controller uses the open file descriptor to the journal file to identify the operations of this cache and maintain the server-side qid filter. Entries will be taken off the filter list as they are shot down (placed into the journal) and of course the filter may be entirely discarded if the cache controller removes the journal file.
Journals are marked QTAPPEND and may only be opened for reading by the clients. A cache may delete its filtered journal file to indicate that the controller may discard its filter; this may be a polite action if a cache is reformatted.
Note that because the cache control protocol is simply a file system, it should be straightforward to mount it (without caching, of course) and watch the journal changes happen in parallel to the cache’s operations. This should facilitate debugging.
Experimental Followons¶
Typed Journals¶
It may make sense to indicate in the journal which kind of update triggered the insertion, as well as some other metadata. For example, a Tremove should cause the cache to dump all of its data for a file, but a Twstat may merely mean that the cache needs to Tstat the file again. If Twrite journal entries include the new qid and the region of the file updated, that may also prove useful for large files, avoiding flushing cache state for regions not touched.
This may dramatically increase the record size of journal entries. Note that some of these things mean that we lose the arm/disarm/rearm behavior of the caching protocol: as described above, a Twrite entry would implicitly rearm the journal on the new qid.
The cache controller should always have the option of reverting to a general flush notification, to deal with, for example, a large number of writes all over the file. It is also possible that the cache controller may “fudge” the journal a bit: if a large number of updates to a qid (or qid chain, à la Twrite) are pending and not yet read by the cache, they could be replaced in their entirety by a simple flush message, freeing up controller state.
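The fudging above amounts to a simple coalescing rule over the unread tail of a typed journal; a sketch (entry kinds and names are ours):

```python
def coalesce(pending, qid):
    """Replace all unread typed entries for qid by one flush record.

    pending: list of (qid, kind) entries not yet read by the cache.
    Relative order of other qids' entries is preserved.
    """
    kept = [e for e in pending if e[0] != qid]
    kept.append((qid, "flush"))
    return kept
```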
Musings about volume-like bundles¶
Since the server/cache controller knows the full path taken to reach a given file, it should be possible to emulate something like AFS volume callbacks by informing the caches that a given qid will be invalidated by an alias in the journal. Under this scheme, an arbitrary collection of qids can be given a single qid handle for invalidation.
Explicitly, consider a fid that has been walked into /bin/386. Since this directory contains lots of files and is updated only rarely, we would like the cache controller not to have to keep a large amount of state here. Upon entering /, the fid’s server-side metadata would be labeled with the bulk-invalidate qid for /; upon entering /bin this would be replaced with that of /bin, and so on. If an open is made for reading, the server must return the qid of the actual file, but must notify the cache of the aliasing effect (the server is required to notify the client of the alias before it may notify the client of invalidations of the bulk address). If any update opcodes (write/unlink/wstat/create/…) take place based on this fid, the bulk invalidation qid is written to the journals, and the world proceeds from there.
I propose that aliasing records be handed back not in the journal file itself, but perhaps in a file named /journal-alias/${CACHENAME} in the /cc side-protocol file system. All records in this file are two qids wide. Let’s (arbitrarily) mandate that the order is the qid returned from Topen/Twalk, followed by the aliasing qid. That is, the second one is what will end up in the journal and is the handle to the whole “volume”.
Note that having this be a separate file means that it may be optional for caches and servers to support it. If the protocol is to open the aliases file first, servers may enforce that caches partake in this protocol by rejecting their attempts to create journal files (this kind of mandatory support for aliasing may help reduce server overhead).
Each server-side accessed file should point at its invalidation qid and be hashed by invalidation qid. This means that the server can easily flush the right qid and can also flush invalidation records for all files on a volume once a flush has taken place.
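On the cache side, bulk-invalidation aliasing might be sketched as follows (illustrative names; qids are plain integers here):

```python
class AliasingCache:
    """Cache where each file may be aliased to a volume-like bulk qid;
    journaling the bulk qid shoots down every member at once."""

    def __init__(self):
        self.data = {}    # file qid -> cached bytes
        self.alias = {}   # file qid -> bulk invalidation qid

    def learn_alias(self, file_qid, bulk_qid):
        # The server must announce the alias before it may use the
        # bulk qid in the journal.
        self.alias[file_qid] = bulk_qid

    def invalidate(self, qid):
        # A journaled qid kills both direct matches and aliased members.
        for f in list(self.data):
            if f == qid or self.alias.get(f) == qid:
                del self.data[f]
```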
Musings about cache flush notifications¶
It may be advantageous to a given client to notify the server when it has flushed a file from cache, to keep the server from (falsely) flushing files as part of its memory reclaim procedure. This also helps the server as it avoids entering the more aggressive phases of memory reclaim in the first place.
I propose a write-only, QTAPPEND file evict in the root of the /cc directory, into which a cache may write a qid it no longer cares about. This file does not need to be per-client, as it is a client-to-server communication pathway.
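On the controller side, processing an evict write reduces to dropping the qid from the writer's filter, so future changes to that qid need not be journaled for it; a trivial sketch (names are ours):

```python
def evict(filters, cache_id, qid):
    """Handle a qid written to the evict file by cache_id.

    filters: cache id -> set of qids the controller tracks for it.
    """
    filters[cache_id].discard(qid)
```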
Other Questions¶
Coherency? Current thoughts are to just avoid the problem by having clients mount filesystems without caching to get coherent access.
Write-back, not write-through?
Distributed cache coherency?
Recommendations for P92010¶
As much as we try to make the protocol entirely orthogonal to 9P2000, there are some small changes which would help our lives immensely but which, at the same time, do not force people to play this (or any other) game.
Interaction with QTCTL / QTDECENT¶
There is a proposal to add a new qid type flag, QTDECENT, to denote that a file is cachable (the contrast, of course, is with files synthesized by device drivers). The original proposal was for a QTCTL flag, but the sense has been flipped so as to have “failsafe” defaults. I believe that this would be a beneficial addition to 9P2010.
QTDECENT would let the generic cache controller shim discussed above know which files it could reasonably attempt to cache. It could then pass on QTDECENT to clients or caches that did not speak the cache control protocol, such as the current cfs. Prior to QTDECENT, it would have been necessary to carefully offer cache control only on true file servers or emulations of true file servers. With QTDECENT, the cache controller shim may be placed on any exported file system, again subject to the constraint that it sees all traffic to the files or is itself speaking the cache control side protocol with the exporter.
The journal files should not be marked QTDECENT, but this does not seem to be a correctness issue so much as one of taste.
Note
For completeness, synthetic/”indecent” objects are currently typically marked by qid.version == 0, but this is undocumented and nowhere enforced. Having an explicit flag seems better.
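To illustrate, a flag test might look like the following. QTDIR and QTAPPEND carry their standard Plan 9 values, but the QTDECENT bit is entirely made up here, since no value has been allocated:

```python
# Standard Plan 9 qid.type bits.
QTDIR    = 0x80
QTAPPEND = 0x40
# Assumed value for illustration only; QTDECENT is not standardized.
QTDECENT = 0x02

def cachable(qid_type):
    """Would a generic cache controller shim attempt to cache this file?"""
    return bool(qid_type & QTDECENT)
```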
Mutation operations should return the new QID¶
Certain mutation operations do not return the new QID when they are carried out:

- Rcreate does carry the QID of the file created, but not that of the directory containing it.
- Rremove does not carry the new QID of the directory containing the removed file.
- Rwstat does not return the new QID of the file being modified, nor that of the directory containing it.
- Rwrite does not carry the new QID of the file being modified.

We suggest that 9P2K10 responses carry these five additional QID fields so that our intermediating controller does not have to re-walk fids to discover them.
Interaction with propagation of QIDs to root¶
There has been a proposal that all mutations propagate QID version changes up to the root. This has many advantages, including very fast exact delta computation. There are a few possible mechanisms to support intermediators such as ours:

- Don’t. That is, require the cache controller and server to be the same body of code.
- The operations described above could return lists of QIDs.
- Servers playing the propagate-to-root game could publish, in another journal-like file, a list of all reQIDs that could not be described in 9P2K (or 9P2K10). The intermediator(s) can subscribe to this. Clients may (but need not) be allowed to subscribe, but will probably prefer to get their updates filtered by the intermediator(s).
Of the three, the last is the most attractive to us because there is no danger that the R-messages would overflow the maximal message size, and it avoids mingling the already complicated server code with the complicated cache controller code.
Acknowledgements¶
Much of this is not entirely original thought but was developed with the help and gentle prodding of David Eckhardt (de0u@andrew.cmu.edu). This document originally appeared on the Carnegie Mellon University 15-412 class wiki.
Publications¶
IWP9 2007 WiP Presentation¶
I (nwf) gave a brief presentation at the WiP session of IWP9 2007. It is available in dvi form here. This is mostly for archival purposes; the slides are basically a subset of this page.
IWP9 2009¶
We (Venkatesh Srinivas and nwf) wrote a paper for IWP9 2009 talking about a summer 2009 research implementation of jccfs. The code base for this project is officially housed in a Mercurial tree at http://www.grex.org/~vsrinivas/src/cfs.
A version of the paper with the below errata applied is available in pdf form here. The submitted version is available in pdf form here.
Errata¶
The versions of this paper hosted on this site have had all (known) errata corrected.
The characterization of cfs(4) given in section 2.1 is incorrect. cfs(4) merely forwards Topen messages and collects the file’s QID from the Ropen. The first paragraph should read: “cfs(4) is an on-disk cache intended for use by Plan 9 terminals. It copies data from read messages into an on-disk cache. For subsequent read operations, if the data are already present, cfs responds with cached data. On each Open, cfs implicitly collects stat information from the server to check the validity of cached contents. cfs does not cache directory contents nor, by extension, does it attempt to hide latency for walk operations. Every Walk, Stat, and Read on a directory is simply passed through to the server.”
The printed paper does not have footnote markers.
The printed paper, in section 4, suggests that a table is in a section titled “Execution Characteristics”. No such section exists; instead the table follows immediately below, on the facing page.