In a recent IRC discussion the subject of file type detection came up. I argued that detecting file types by extension was maybe a nice hack back when there were no fancy methods for storing file metadata, the computers were real computers, and the universe was young. Today it's totally obsolete, limited, arbitrary and brain-damaged in general.
In modern Unix-based systems (with Mac OS X as a notable exception, more on that later), applications rarely use file extensions, and then only as a convenience (Microsoft, meanwhile, keeps thinking that three-letter strings are a good way to describe a file type). From a command-line shell you can pass any file name to the program you start, and most programs are extension-agnostic, except where an extension is a convenience. Environments that do need file type information, such as the KDE or GNOME GUIs, guess it from the file extension and by parsing the data format.
The problem here is that these environments add a new layer to the system infrastructure, reimplementing functionality which belongs to the lower levels. This also creates inconsistencies - users must be aware of the difference between, e.g., GNOME file URIs and filenames in the system. The inconsistency is also due to the fact that the GNOME/KDE VFS layers support data which isn't stored on the filesystem at all (http://..., smb://..., info://... URLs, etc.), but that's another issue.
So, what can be done about this? Many modern filesystems (ext3, xfs, reiserfs - I admit I'm Linux-centric here) support extended attributes, that is, pairs of (name, value) strings which can store metadata. They're ideal for storing file type information. That's the easy part.
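As a rough sketch of the easy part, here is how a type could be written to and read from an extended attribute on Linux using Python's `os.setxattr` and `os.getxattr`. The attribute name `user.mime_type` and the helper names are my assumptions for illustration; the `user.` prefix is the namespace ordinary users are allowed to write to.

```python
import os

# Attribute name chosen for illustration (an assumption, not a requirement).
TYPE_ATTR = "user.mime_type"


def set_file_type(path, mime_type):
    """Store the type as an extended attribute on the file (Linux only)."""
    os.setxattr(path, TYPE_ATTR, mime_type.encode("utf-8"))


def get_file_type(path):
    """Return the stored type, or None if the file carries no type."""
    try:
        return os.getxattr(path, TYPE_ATTR).decode("utf-8")
    except OSError:
        # The attribute is missing, or the filesystem has no xattr support.
        return None
```

Whether this works depends on the filesystem and its mount options; some filesystems reject `user.` attributes, in which case `set_file_type` raises `OSError`.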
The hard part is what to store there. Where exactly do we get the type information for a file, and how do we represent it? For the representation, MIME types come to mind - they're widely used (mail, news, web, GNOME and KDE desktops), they're a standard, and people know how to handle them (more or less). Alas, they do have their problems - only 2 levels of hierarchy, hard to extend (you must use the application/x-foo kludge), and the extensions have no guarantee of uniqueness. The other option is Apple's Uniform Type Identifiers, which are an attempt to address precisely these problems with MIME. UTIs are guaranteed to be unique, are easy to extend, support namespaces and (multiple) type inheritance, and many popular types are already standardized. The only "small" problem with UTIs is that they're nonstandard and only present in Mac OS X (actually, they appeared only recently, in Tiger).
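To show what multiple inheritance buys over MIME's two fixed levels, here is a small sketch of a conformance check. The identifiers follow Apple's naming, but the table is a tiny hand-written subset and `conforms_to` is a hypothetical helper, not Apple's API.

```python
# Hand-written subset of a type hierarchy: each type lists its direct parents.
# A type may have more than one parent (multiple inheritance).
PARENTS = {
    "public.png": ["public.image"],
    "public.jpeg": ["public.image"],
    "public.image": ["public.data", "public.content"],
    "public.plain-text": ["public.text"],
    "public.text": ["public.data", "public.content"],
}


def conforms_to(type_id, ancestor):
    """Return True if type_id is ancestor or inherits from it."""
    seen = set()
    stack = [type_id]
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current in seen:
            continue
        seen.add(current)
        # Types without an entry are treated as roots with no parents.
        stack.extend(PARENTS.get(current, []))
    return False
```

With a table like this, an image viewer can register for `public.image` once and accept every type that conforms to it, which a flat `image/png` versus `image/jpeg` comparison cannot express without a wildcard convention.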
This brings us to the following problem - if only the local system uses this file typing scheme, how will we communicate with the outside world? How do we properly assign types to files coming from outside and, for that matter, to files generated by legacy applications not aware of this shiny new typing system? Well, we can use the same technique that is employed today for MIME type detection: heuristics based on file extensions (here they creep up again) and content inspection (parsing). The difference is that we do it only once, not every time the file is accessed. But (as was repeatedly pointed out to me, thanks kre & zvrba) that creates a new problem: does the system allow the user to specify file types, and do the applications blindly trust the user?
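To make the detect-once idea concrete before getting to the trust question, here is a minimal sketch of such a heuristic. It checks a few well-known file signatures first and falls back to the extension via Python's `mimetypes` module. The signature table, the function names, and the choice to prefer content over extension are all my assumptions.

```python
import mimetypes

# A few well-known file signatures ("magic numbers"); a real detector
# would know many more.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"%PDF-", "application/pdf"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
]


def sniff_content(head):
    """Return a MIME type if the leading bytes match a known signature."""
    for magic, mime in SIGNATURES:
        if head.startswith(magic):
            return mime
    return None


def detect_type(path):
    """Guess a type from content first, then from the extension."""
    with open(path, "rb") as f:
        head = f.read(16)
    by_content = sniff_content(head)
    if by_content is not None:
        return by_content
    # guess_type returns a (type, encoding) pair; the type may be None.
    by_extension, _encoding = mimetypes.guess_type(path)
    return by_extension
```

The result would be stored in the file's metadata once, at creation or import time, so later accesses read the stored value instead of running the heuristic again.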
My proposal is this: the file type is deduced from the file extension or content when the file is first created (or from the available MIME type, if the file is downloaded from the web, for example). The user may change the file type, given the appropriate write privileges on the file. Applications never blindly trust the user - upon loading the data they always perform (or should perform) validation, and usually report an error to the user if the data is inconsistent. So, if the user tries to spoof the data type, let them - the application in question should gracefully report an error. Such spoofing is possible today, and it isn't abused: the only security issue here is automatic loading and execution of content, but automatic execution is a security problem anywhere and isn't specific to file type detection.
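The "don't blindly trust the label" part can be sketched as follows. The stored type only selects which check to run; the application still inspects the data and reports a mismatch. A signature check is just a stand-in for real format validation, and `VALIDATORS` and `check_label` are hypothetical names.

```python
# Per-type plausibility checks; real applications would parse the format.
VALIDATORS = {
    "image/png": lambda data: data.startswith(b"\x89PNG\r\n\x1a\n"),
    "application/pdf": lambda data: data.startswith(b"%PDF-"),
}


def check_label(data, label):
    """Return an error message, or None if data is plausible for label."""
    validator = VALIDATORS.get(label)
    if validator is None:
        return "no handler for type %r" % label
    if not validator(data):
        return "file is labelled %r but its content does not match" % label
    return None


# A user relabelled a PDF as a PNG: the application notices and complains.
error = check_label(b"%PDF-1.4 ...", "image/png")
if error is not None:
    print("cannot open file:", error)
```

A wrong label then costs the user an error message, the same outcome as renaming `report.pdf` to `report.png` today.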
The question here is: what new features does this bring? After all, applications are required to validate their input, and the user cannot trust type information provided by external sources (because it might be spoofed by an attacker). I believe it is beneficial because the environments that do rely on file types have a much easier job (and don't need to reinvent the wheel at various levels), and the security argument becomes irrelevant once we admit that the user can never trust outside information, be it file metadata or file content.
It is also possible to make file typing mandatory, handle it through a trusted source, and let applications trust it. This involves some sort of daemon running with superuser privileges, inspecting files periodically (using the heuristics I mentioned above) and assigning the types. Newly created files would have an empty/unknown type, and any modification of a file would clear its type again (to prevent users from creating a file with some type and then changing the contents to something else). This approach has its own problem: the user can attempt a DoS attack on the system by changing files rapidly. This can be circumvented by building a list of new/changed files and updating the metadata in a batch, delayed by a few seconds or minutes. Although this scheme guarantees that known file types are correctly identified, I believe that it is too restrictive for the user and that, ultimately, applications wouldn't want to rely on it, because they would still want to validate their input.
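The delayed batch update could be sketched like this. It is one possible reading of the idea: every change only records the path and a timestamp, and the daemon periodically collects the paths that have been quiet for at least `delay` seconds. `RetypeQueue` and its methods are hypothetical names, and timestamps are passed in explicitly to keep the sketch self-contained.

```python
class RetypeQueue:
    """Collects changed files and releases them in delayed batches."""

    def __init__(self, delay):
        self.delay = delay      # quiet period, in seconds
        self._changed = {}      # path -> time of the most recent change

    def mark_changed(self, path, now):
        """Record a change; repeated changes to one file share one entry."""
        self._changed[path] = now

    def take_due(self, now):
        """Remove and return the paths unchanged for at least `delay`."""
        due = sorted(
            path for path, changed_at in self._changed.items()
            if now - changed_at >= self.delay
        )
        for path in due:
            del self._changed[path]
        return due
```

A file rewritten a thousand times in a second costs the daemon one dictionary entry and, eventually, one inspection. A file that never stops changing is never inspected and simply keeps its cleared type.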
What's your view on this? Is the current VFS-layer-on-top-of-a-traditional-filesystem approach good (enough), do we really need automatic type detection and handling, and what do you think is the Right Way to do this? Don't hesitate to comment ;-)
PS. This is my first blog posting! Wheee! ;-)