git » libjio » commit 5b4bcd6

When we create a transaction file, we need to make sure not only the data hit the disk, but the directory metadata too, so we can be absolutely sure we will be able to access it.

author Alberto Bertogli
2004-07-13 21:16:38 UTC
committer Alberto Bertogli
2007-07-15 13:09:18 UTC
parent b60413ad4ff5761764f5503f45b04c1f636530af

When we create a transaction file, we need to make sure not only the data hit
the disk, but the directory metadata too, so we can be absolutely sure we will
be able to access it.

Flushing directory metadata is quite messy because it's not clearly
standardized, so it depends a lot on the OS/filesystem combination.

On some systems, fsync() on a file is guaranteed to also flush the metadata
needed to access the file (Linux/ext3, all BSDs), so nothing else is needed.

On other systems, fsync() on the directory holding the file is needed
(Linux/ext2). This is the proper Linux way to do things.

This gets even weirder, because it is also possible that neither works and
you need a sync() to do it, but the standard allows sync() to return before
the data has really hit the disk (although nobody sane does that these days,
some old systems work this way, e.g. Linux < 1.3.20). Luckily, all current
systems seem to fall within the previous two categories.

God knows what happens over NFS on different client-server combinations. It
will probably work on most, though (at least from reading the source, it seems
that the Linux client and server do the right thing).

What this patch does is try to cope with all those cases by always calling
fsync() on the parent directory and, if that fails with EINVAL or EBADF,
falling back to sync(). It's the best I can do.

Linux, FreeBSD, NetBSD, Solaris and Mac OS X return OK when doing a directory
fsync(), so this should not cause unnecessary sync()s these days.
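
As a standalone illustration of the strategy described above, here is a
minimal sketch of the same sequence outside libjio. The helper name
flush_file_and_dir() and its dirpath parameter are hypothetical and exist only
for this example; the patch itself keeps the directory open in fs->jdirfd
instead of reopening it on every call.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper, not part of libjio: flush a file's data and then
 * the metadata of the directory that holds it.
 * Returns 0 on success, -1 on failure. */
static int flush_file_and_dir(int fd, const char *dirpath)
{
	int dirfd;
	int rv = 0;

	/* first make sure the file's own data is on disk */
	if (fsync(fd) != 0)
		return -1;

	/* then flush the directory, so the entry naming the file is
	 * durable too */
	dirfd = open(dirpath, O_RDONLY);
	if (dirfd < 0)
		return -1;

	if (fsync(dirfd) != 0) {
		if (errno == EINVAL || errno == EBADF) {
			/* directory fsync() not supported here: fall back
			 * to a global sync(), which may still return early */
			sync();
		} else {
			rv = -1;
		}
	}

	close(dirfd);
	return rv;
}
```

A caller would write the transaction file, call
flush_file_and_dir(fd, "/path/to/journal/dir") and only start applying the
transaction if it returns 0. Reopening the directory each time keeps the
sketch short, at the cost of an extra open()/close() per commit.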

For reference, look at the huge number of posts on the subject on lkml, and
read fsync()'s SUSv3 reference.

libjio.h +1 -0
trans.c +21 -2

diff --git a/libjio.h b/libjio.h
index 8204f46..83a95fe 100644
--- a/libjio.h
+++ b/libjio.h
@@ -24,6 +24,7 @@ extern "C" {
 struct jfs {
 	int fd;			/* main file descriptor */
 	char *name;		/* and its name */
+	int jdirfd;		/* journal directory file descriptor */
 	int jfd;		/* journal's lock file descriptor */
 	int flags;		/* journal flags */
 	pthread_mutex_t lock;	/* a soft lock used in some operations */
diff --git a/trans.c b/trans.c
index c121bb2..d2a7a2d 100644
--- a/trans.c
+++ b/trans.c
@@ -324,8 +324,21 @@ int jtrans_commit(struct jtrans *ts)
 	 * everything O_SYNC, we sync at this point only, this way we avoid
 	 * doing a lot of very small writes; in case of a crash the
 	 * transaction file is only useful if it's complete (ie. after this
-	 * point) so we only flush here */
-	fsync(fd);
+	 * point) so we only flush here (both data and metadata) */
+	if (fsync(fd) != 0)
+		goto exit;
+	if (fsync(ts->fs->jdirfd) != 0) {
+		/* it seems to be legal that fsync() on directories is not
+		 * implemented, so if this fails with EINVAL or EBADF, just
+		 * call a global sync(); which is awful (and might still
+		 * return before metadata is done) but it seems to be the
+		 * saner choice; otherwise we just fail */
+		if (errno == EINVAL || errno == EBADF) {
+			sync();
+		} else {
+			goto exit;
+		}
+	}
 
 	/* now that we have a safe transaction file, let's apply it */
 	written = 0;
@@ -473,6 +486,12 @@ int jopen(struct jfs *fs, const char *name, int flags, int mode, int jflags)
 	if (rv < 0 || !S_ISDIR(sinfo.st_mode))
 		return -1;
 
+	/* open the directory, we will use it to flush transaction files'
+	 * metadata in jtrans_commit() */
+	fs->jdirfd = open(jdir, O_RDONLY);
+	if (fs->jdirfd < 0)
+		return -1;
+
 	snprintf(jlockfile, PATH_MAX, "%s/%s", jdir, "lock");
 	jfd = open(jlockfile, O_RDWR | O_CREAT, 0600);
 	if (jfd < 0)