Tuesday, August 30, 2005

gnomefs

GnomeVFS is a filesystem abstraction used in the GNOME desktop environment. In addition to traditional FS operations, it has several extra features: better metadata support, clever handling of file types, async operations, the ability to operate on any URI (not just files), and extensibility through modules. Unfortunately, GnomeVFS is used only by GNOME - the rest of the system is unaware of it. This is both confusing for new users (who must distinguish between paths in GNOME-aware and non-GNOME-aware programs) and annoying for everyone (imho ;-).

Enter Filesystem in Userspace (FUSE). This incredibly cool thing allows anyone to write a filesystem handler in one afternoon - which is exactly what I did. Combine GnomeVFS and FUSE and you bring previously GNOME-only features to the whole system! The hack is just over 200 lines long and supports read-only operation. It's basically a mapping between the GnomeVFS and FUSE APIs.

Obligatory "screenshot":

senko@rei:~/src/gnomefs$ sudo ./gnomefs sftp://senko@marvin.kset.org/ -o allow_other /mnt/gnomefs
senko@rei:~/src/gnomefs$ cat /mnt/gnomefs/etc/hostname
marvin

The code is just a proof of concept - it's really ugly, unoptimized and probably contains a couple of bugs. Also, currently only one scheme/host pair is mounted - I plan to extend gnomefs to have just one mountpoint, with translation: /scheme/host/path -> scheme://host/path. I started doing that and then ran away in terror from C's string handling ;-)
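To show what the planned translation amounts to, here is a minimal sketch in Python 3 (not the C that gnomefs itself is written in). The function name path_to_uri and the decision to reject paths without both a scheme and a host are my assumptions for illustration, not part of gnomefs.

```python
def path_to_uri(path):
    """Map '/scheme/host/rest/of/path' to 'scheme://host/rest/of/path'."""
    # Split into at most three parts: scheme, host, and everything else.
    parts = path.lstrip("/").split("/", 2)
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError("need at least /scheme/host: %r" % path)
    scheme, host = parts[0], parts[1]
    rest = parts[2] if len(parts) == 3 else ""
    return "%s://%s/%s" % (scheme, host, rest)


if __name__ == "__main__":
    print(path_to_uri("/sftp/marvin.kset.org/etc/hostname"))
```

A real single-mountpoint version would also have to list the known schemes and hosts as directories, which this sketch does not attempt.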

Grab the code here and play with it! You'll need FUSE 2.2.1 and GnomeVFS 2.10 (these are the versions available for Ubuntu Hoary, which I'm running).

Update: With the Internet being as vast as it is, I should've suspected that someone had already figured out this cool idea and done something about it. Well, at least it was fun.

Friday, August 12, 2005

Py/Invoke

One of the great things about Python is that it's relatively easy to extend. Whether you dig into Python internals manually or use SWIG to do the dirty work for you, you can create brand new native modules or bindings to libraries in a few hours.

But what if you just happen to need this one function, once, for a quick hack, and there's no binding available? Would you want to go through the (however small) trouble of defining, testing and compiling the extension module? I wouldn't. So, what's there to do? I really like the .NET solution to this problem (Platform Invoke) - to invoke the native function you just add enough metadata to describe how to marshal types from .NET to the native function, specify the library and function name, and you're set.

For Python, there's this standard module (for Unix platforms) called dl. It is an interface to the dynamic link loader, and allows loading a library, selecting a function by its name and calling it. But there's a catch - only integers and constant strings may be supplied to the function, and the function must return an integer value. Not very practical, considering that much data-moving in C is done using pointers, which makes good ol' dl almost worthless.

So, inspired by P/Invoke, I decided to extend dl to support passing arbitrary data and mutable strings. I've added a new method, call_mutable, which is largely a copy-paste of call with a few tweaks to allow mutable (call by reference) parameters. Here it is.

Extension to the dl module

The call_mutable method is an extended version of the standard call method available in the dl module. The method allows integers, strings and None as arguments to the native function. Integers are passed by value, and None is passed as a NULL pointer. For each string argument, a separate data buffer is allocated, initialized with the string data, and passed to the function.

The method returns a tuple containing the return value of the native function, followed by strings holding the data found in the string buffers when the native function exits.

Example:

>>> import dl
>>> m = dl.open('libc.so.6')
>>> m.call_mutable('time')
(1123796886,)
>>> m.call_mutable('read', 0, '\0' * 15, 15)
Hello World
(12, 'Hello World\n\x00\x00\x00')

Convenience wrapper: native

Since dealing with the raw data buffers understood by call_mutable is cumbersome, I've created a convenience wrapper which combines the functionality of the struct and dl modules. The module provides just one function, native, which loads the shared library, performs marshalling and unmarshalling of arguments and calls the native C function.

The data marshalling is done according to a format string similar to the one used by the struct module. Its format is:

        '<type>:<type>:..:<type>'
where 'type' is one of:
  • '' - specifies an integer to be passed by value
  • 's' - specifies a data buffer with size identical to the corresponding string argument + one byte for the NUL-terminator
  • any other - used exactly as in the struct module

Upon return from the external C function, the native function unmarshalls the arguments and returns a tuple containing the return value (an integer) as its first element, followed by the unmarshalled values for mutable arguments (that is, all arguments except integers passed by value).

Note that this module caches the dl objects used, so the external library won't be reopened across multiple function invocations. To close all open libraries, you can use the close_all function provided by the module.
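The marshalling rules can be illustrated with a small, self-contained sketch. This is not the code from the tarball: the helper names marshal and unmarshal are hypothetical, it is written for Python 3 (so string buffers are bytes), and it only prepares and decodes the buffers without calling any native function.

```python
import struct


def marshal(fmt, args):
    """Turn Python arguments into ints (by value) or raw byte buffers."""
    types = fmt.split(":")
    if len(types) != len(args):
        raise ValueError("format describes %d arguments, got %d"
                         % (len(types), len(args)))
    out = []
    for t, arg in zip(types, args):
        if t == "":
            out.append(int(arg))              # integer, passed by value
        elif t == "s":
            out.append(arg + b"\0")           # buffer plus NUL terminator
        else:
            out.append(struct.pack(t, arg))   # any other struct format code
    return out


def unmarshal(fmt, buffers):
    """Decode the mutable (non-integer) arguments after the call."""
    result = []
    for t, buf in zip(fmt.split(":"), buffers):
        if t == "":
            continue                          # by-value integers are not returned
        elif t == "s":
            result.append(buf[:-1])           # drop the terminator byte
        else:
            result.append(struct.unpack(t, buf)[0])
    return tuple(result)
```

With the format ':s:' and arguments (0, b'\0' * 15, 15), marshal yields an integer, a 16-byte buffer and another integer, which matches the shape of the read example.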

Example:

>>> import native
>>> native.native('libc.so.6', 'time', 'i', (1,))
(1123797259, 1123797259)
>>> native.native('libc.so.6', 'read', ':s:', (0, '\0' * 15, 15))
Hello World
(12, 'Hello World\n\x00\x00\x00')

The code

I've packed my version of the dl module, along with a setup.py script and the native wrapper module, into a tarball which you can get from my software repository. The tarball also contains a patch against dlmodule.c from Python CVS.
So, get it, play with it and feel free to flame me about it ;-)

Thursday, August 11, 2005

Using extended attributes for file type detection

In a recent IRC discussion the subject of detecting file types was brought up. At that point I argued that file type detection based on extensions was maybe a nice hack at the time when there were no fancy methods for storing file metadata, the computers were real computers, and the universe was young. Today, it's totally obsolete, limited, arbitrary and brain-damaged in general.

In modern Unix-based systems (with Mac OS X as a notable exception, more on that later), file extensions are rarely used by applications, and then only as a convenience (Microsoft keeps thinking that three-letter strings are a good way to describe a file type). From a command-line shell you can pass any file name to the program you start, and most programs are extension-agnostic. In environments which do need file type information, such as the KDE or GNOME GUIs, it is guessed from the file extension and by parsing the data format.

The problem here is that these environments are adding a new layer to the system infrastructure, reimplementing functionality which belongs to the lower levels. This also creates inconsistencies - users must be aware of the difference between e.g. GNOME file URIs and filenames in the system. The inconsistency is also due to the fact that the GNOME/KDE VFS layers support data which isn't stored on the filesystem at all (http://..., smb://..., info://... URLs, etc.), but that's another issue.

So, what can be done about this? Many modern filesystems (ext3, xfs, reiserfs - I admit I'm Linux-centric here) support extended attributes, that is, pairs of (name, value) strings which can store metadata. They're ideal for storing file type information. That's the easy part.
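As a rough illustration of the easy part, the sketch below stores and reads a type string using the extended-attribute calls in Python 3's os module, which exist only on Linux. The attribute name user.mime_type and the helper names are choices made for this sketch, and the code assumes the filesystem is mounted with user attributes enabled; where that is not the case it simply reports failure.

```python
import os

ATTR = "user.mime_type"  # attribute name chosen for this sketch


def set_file_type(path, mime_type):
    """Store the type in an extended attribute; return False if unsupported."""
    if not hasattr(os, "setxattr"):      # non-Linux platforms
        return False
    try:
        os.setxattr(path, ATTR, mime_type.encode("utf-8"))
        return True
    except OSError:                      # e.g. filesystem without user xattrs
        return False


def get_file_type(path):
    """Return the stored type, or None if it is missing or unsupported."""
    if not hasattr(os, "getxattr"):
        return None
    try:
        return os.getxattr(path, ATTR).decode("utf-8")
    except OSError:
        return None
```

Setting the attribute requires write permission on the file, which lines up with the idea that whoever may change the file may also change its type.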

The hard part is, what to store there? Where exactly do we get file type information for some file, and how to represent it? For the representation, MIME types come to mind - they're widely used (mail, news, web, GNOME and KDE desktops), they're a standard, and people know how to handle them (more or less). Alas, they do have their problems - only 2 levels of hierarchy, hard to extend (must use the application/x-foo kludge) and the extensions have no guarantee of uniqueness. The other option is Apple's Uniform Type Identifiers, which are an attempt to address precisely the problems with MIME. UTIs are guaranteed to be unique, easy to extend, support namespaces and (multiple) type inheritance, and many popular types are already standardized. The only "small" problem with UTIs is that they're nonstandard, and only present in Mac OS X (actually, they appeared only recently in Tiger).

This brings us to the following problem - if only the local system uses this filetyping system, how will we communicate with the outside world? How to properly assign types to files coming from outside, and, for that matter, to files generated by legacy applications not aware of this new shiny typing system? Well, we can use the same technique that is employed today for mime-type detection: heuristics based on file extensions (here they creep up again) and content inspection (parsing). The difference is that we do it only once, not every time the file is accessed. But (as was repeatedly pointed out to me, tnx kre & zvrba), that creates a new problem: does the system allow the user to specify file types, and do the applications blindly trust the user?

My proposal is this: The file type information is deduced from the file extension or content upon first creation (or from available mime type info if downloading from the web, for example). The file type information may be changed by the user (given the appropriate write privileges on the file). The applications don't ever blindly trust the user - upon loading the data they always perform (or should perform) validation, and usually report an error to the user if the data is inconsistent. So, if the user tries to spoof the data type, let them - the application in question should gracefully report an error (such spoofing is possible today, and it isn't abused: the only security issue here is automatic loading and executing of content, but automatic execution is a security problem anywhere, and isn't a problem pertaining to filetype detection).

The question here is, what new features does this bring? After all, the applications are required to validate the input, and the user cannot trust the type information provided by external sources (because it might be spoofed by an attacker). I believe that it is beneficial because the environments that do rely on file types have a much easier job (and don't need to reinvent the wheel at various levels), and the security argument becomes irrelevant if we admit that the user can never trust outside information, be it file metadata or the file content.
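The one-time deduction could look roughly like the Python 3 sketch below. The order of the checks (declared type first, then extension, then content), the tiny table of magic numbers and the function name guess_type are all assumptions made for illustration; a real detector would know far more formats.

```python
import mimetypes

# A few magic-number prefixes; deliberately incomplete.
MAGIC = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"%PDF-", "application/pdf"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
]


def guess_type(name, head, declared=None):
    """Pick a type once, when the file is first created or downloaded.

    name     -- file name, used for the extension heuristic
    head     -- the first bytes of the file, used for content inspection
    declared -- type supplied by the source (e.g. a web server), if any
    """
    if declared:
        return declared
    by_ext, _encoding = mimetypes.guess_type(name)
    if by_ext:
        return by_ext
    for prefix, mime in MAGIC:
        if head.startswith(prefix):
            return mime
    return "application/octet-stream"   # unknown: generic binary data
```

The result would be written to the file's metadata once; after that, only the user (or an application acting for the user) changes it.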

It is also possible to make filetyping mandatory, handle it using a trusted source, and let applications trust it. This involves having some sort of daemon running with superuser privileges, inspecting the files periodically (using the heuristics I mentioned above) and assigning the types. Newly created files would have an empty/unknown type, and upon any file modification the type would be cleared again (to prevent users from creating a file with some type and then changing the contents to something else). This approach has its own problem: the user can attempt a DoS attack on the system by changing the files rapidly. This can be circumvented by building a list of new/changed files and updating the metadata in a batch, delayed for a few seconds or minutes. Although this scheme guarantees that known file types are correctly identified, I believe that it is too restrictive for the user, and that, ultimately, applications wouldn't want to rely on it because they would still want to validate the input.

What's your view on this? Is the current vfs-layer-on-top-of-traditional-fs approach good (enough), do we really need automatic type detection and handling, and what do you think is the Right Way to do this? Don't hesitate to comment ;-)
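The delayed batch update can be sketched as a small queue that remembers only the last modification time per file. The class name RetypeQueue and its methods are hypothetical, and timestamps are passed in explicitly so the sketch needs no real clock or file watcher.

```python
class RetypeQueue:
    """Collect changed paths and release them only after a quiet period."""

    def __init__(self, delay):
        self.delay = delay      # seconds a file must stay unchanged
        self.pending = {}       # path -> time of last modification

    def file_changed(self, path, now):
        # Repeated changes just overwrite the timestamp, so a rapidly
        # rewritten file costs one inspection, not one per write.
        self.pending[path] = now

    def due(self, now):
        """Return (and forget) paths that have been quiet for `delay` seconds."""
        ready = sorted(p for p, t in self.pending.items()
                       if now - t >= self.delay)
        for p in ready:
            del self.pending[p]
        return ready
```

The daemon would call due() periodically and run the type heuristics only on the returned paths, so hammering one file with writes keeps postponing its inspection instead of triggering more of them.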

PS. This is my first blog posting! Wheee! ;-)