Thursday, August 11, 2005

Using extended attributes for file type detection

In a recent IRC discussion, the subject of detecting file types came up. I argued that file type detection based on extensions was maybe a nice hack back when there were no fancy methods for storing file metadata, the computers were real computers, and the universe was young. Today it's totally obsolete, limited, arbitrary and brain-damaged in general.

In modern Unix-based systems (with Mac OS X as a notable exception, more on that later), applications rarely use file extensions, and then only as a convenience (Microsoft keeps thinking that three-letter strings are a good way to describe a file type). From a command-line shell you can pass any file name to the program you start, and most programs are extension-agnostic, except sometimes for convenience. Environments which do need file type information, such as the KDE or GNOME GUIs, guess it from the file extension and by parsing the data format.

The problem here is that these environments add a new layer to the system infrastructure, reimplementing functionality which belongs to the lower levels. This also creates inconsistencies - users must be aware of the difference between, e.g., GNOME file URIs and filenames in the system. The inconsistency is also due to the fact that the GNOME/KDE VFS layers support data which isn't stored on the filesystem at all (http://..., smb://..., info://... URLs, etc.), but that's another issue.

So, what can be done about this? Many modern filesystems (ext3, xfs, reiserfs - I admit I'm Linux-centric here) support extended attributes, that is, pairs of (name, value) strings which can store metadata. They're ideal for storing file type information. That's the easy part.
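As a minimal sketch of what storing a type in an extended attribute could look like, here is a Python version built on os.setxattr and os.getxattr, which are only available on Linux. The attribute name user.mime_type and the helper names are assumptions made for illustration; the "user." prefix is the namespace that ordinary users are allowed to write to.

```python
import os

# Attribute name is an assumption chosen for this sketch. The "user." prefix
# is the Linux xattr namespace that unprivileged users may write to.
TYPE_ATTR = "user.mime_type"


def set_file_type(path, mime_type):
    """Store the MIME type as an extended attribute on the file."""
    os.setxattr(path, TYPE_ATTR, mime_type.encode("utf-8"))


def get_file_type(path):
    """Return the stored MIME type, or None if the attribute is not set."""
    try:
        return os.getxattr(path, TYPE_ATTR).decode("utf-8")
    except OSError:
        # Attribute missing, or the filesystem does not support xattrs.
        return None
```

Both calls fail with OSError on filesystems mounted without extended attribute support, so a real tool would have to decide what to do in that case.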

The hard part is what to store there. Where exactly do we get the file type information for a given file, and how do we represent it? For the representation, MIME types come to mind - they're widely used (mail, news, web, GNOME and KDE desktops), they're a standard, and people know how to handle them (more or less). Alas, they do have their problems - only 2 levels of hierarchy, hard to extend (you must use the application/x-foo kludge), and the extensions have no guarantee of uniqueness. The other option is Apple's Uniform Type Identifiers, which are an attempt to address precisely these problems with MIME. UTIs are guaranteed to be unique, are easy to extend, support namespaces and (multiple) type inheritance, and many popular types are already standardized. The only "small" problem with UTIs is that they're nonstandard and only present in Mac OS X (actually, they appeared only recently, in Tiger).
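To make the difference concrete, here is a small sketch of a type table with multiple inheritance, in the spirit of UTIs. The table is an illustrative subset written from memory, not Apple's full declarations, and conforms_to is a hypothetical helper, not an Apple API.

```python
# Illustrative subset of a type hierarchy; each type lists its direct parents.
# The entries are an assumption for this sketch, not a complete UTI registry.
PARENTS = {
    "public.item": [],
    "public.content": [],
    "public.data": ["public.item"],
    "public.text": ["public.data", "public.content"],
    "public.plain-text": ["public.text"],
    "public.image": ["public.data", "public.content"],
    "public.jpeg": ["public.image"],
}


def conforms_to(type_id, ancestor):
    """Return True if type_id is ancestor or inherits from it, directly or not."""
    if type_id == ancestor:
        return True
    return any(conforms_to(parent, ancestor) for parent in PARENTS.get(type_id, []))
```

With this table, conforms_to("public.jpeg", "public.content") holds through two separate parent chains, which a two-level name like image/jpeg cannot express.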

This brings us to the following problem - if only the local system uses this file typing scheme, how will we communicate with the outside world? How do we properly assign types to files coming from outside and, for that matter, to files generated by legacy applications not aware of this new shiny typing system? Well, we can use the same technique that is employed today for MIME type detection: heuristics based on file extensions (here they creep in again) and content inspection (parsing). Only, we do it just once, not every time the file is accessed. But (as was repeatedly pointed out to me, thanks kre & zvrba) that creates a new problem: does the system allow the user to specify file types, and do the applications blindly trust the user?
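The one-time detection step could be sketched roughly like this. The signature table is a tiny sample, the function names are made up for this example, and checking content before the extension is a design choice of the sketch, not something the text above prescribes.

```python
import mimetypes

# A few well-known file signatures (magic numbers); a real table is far larger.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"%PDF-", "application/pdf"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
]


def sniff_content(head):
    """Return a MIME type if the leading bytes match a known signature."""
    for magic, mime in SIGNATURES:
        if head.startswith(magic):
            return mime
    return None


def detect_type(path):
    """Guess a MIME type: content signature first, then file extension."""
    with open(path, "rb") as f:
        head = f.read(16)
    by_content = sniff_content(head)
    if by_content:
        return by_content
    by_extension, _encoding = mimetypes.guess_type(path)
    return by_extension or "application/octet-stream"
```

The result would be stored once as file metadata, so later accesses can read the stored type and skip the guessing.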

My proposal is this: the file type information is deduced from the file extension or content upon first creation (or from available MIME type info when downloading from the web, for example). The file type information may be changed by the user (given the appropriate write privileges on the file). The applications don't ever blindly trust the user - upon loading the data they always perform (or should perform) validation, and usually report an error to the user if the data is inconsistent. So, if the user tries to spoof the data type, let them - the application in question should gracefully report an error. Such spoofing is possible today, and it isn't abused: the only security issue here is automatic loading and executing of content, but automatic execution is a security problem anywhere, and isn't a problem specific to file type detection.
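Here is a rough sketch of the "never blindly trust the declared type" rule from an application's point of view. The loader, the exception class and the declared_type parameter are all hypothetical, and the check is reduced to the PNG signature; a real application would validate much more than the first eight bytes.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


class InvalidFileError(Exception):
    """Raised when file content does not match what the application expects."""


def load_png(path, declared_type="image/png"):
    """Load a file whose metadata claims PNG, validating the content first."""
    with open(path, "rb") as f:
        data = f.read()
    if not data.startswith(PNG_SIGNATURE):
        # The declared type was wrong or spoofed: report it, don't crash.
        raise InvalidFileError(
            f"{path}: declared as {declared_type}, but content is not PNG"
        )
    return data  # a real application would go on to decode the image
```

A spoofed type then costs the user nothing more than an error message, which is the same outcome as renaming a text file to foo.png today.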

The question here is, what new features does this bring? After all, the applications are required to validate the input, and the user cannot trust the type information provided by external sources (because it might be spoofed by an attacker). I believe that it is beneficial because the environments that do rely on file types have a much easier job (and don't need to reinvent the wheel at various levels), and the security argument becomes irrelevant once we admit that the user can never trust outside information, be it file metadata or file content.

It is also possible to make file typing mandatory, handle it using a trusted source, and let applications trust it. This involves having some sort of daemon running with superuser privileges, inspecting the files periodically (using the heuristics I mentioned above) and assigning the types. Newly created files would have an empty/unknown type, and upon any file modification the type would be cleared again (to prevent users from creating a file with some type and then changing the contents to something else). This approach has its own problem: the user can attempt a DoS attack on the system by changing files rapidly. This can be circumvented by building a list of new/changed files and updating the metadata in a batch, delayed by a few seconds or minutes. Although this scheme guarantees that known file types are correctly identified, I believe that it is too restrictive for the user, and that, ultimately, applications wouldn't want to rely on it because they would still want to validate the input.
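The delayed batch idea can be sketched as a small queue. The class name, the five-second default and the injectable clock are assumptions of this example; how the daemon learns about changes (polling, kernel notifications) is left out entirely.

```python
import time


class RetypeQueue:
    """Collect changed paths and release them in delayed batches."""

    def __init__(self, delay=5.0, clock=time.monotonic):
        self.delay = delay
        self.clock = clock
        self.pending = {}  # path -> time of the first unprocessed change

    def file_changed(self, path):
        # Repeated changes to one file collapse into a single pending entry,
        # so rapid rewrites cost at most one inspection per delay window.
        self.pending.setdefault(path, self.clock())

    def due(self):
        """Remove and return the paths whose delay has elapsed."""
        now = self.clock()
        ready = [p for p, t in self.pending.items() if now - t >= self.delay]
        for p in ready:
            del self.pending[p]
        return ready
```

The daemon would call due() periodically and run type detection only on the returned paths. Keying on the first change rather than the last means a constantly rewritten file still gets inspected once per window instead of never.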

What's your view on this? Is the current vfs-layer-on-top-of-traditional-fs approach good (enough), do we really need automatic type detection and handling, and what do you think is the Right Way to do this? Don't hesitate to comment ;-)

PS. This is my first blog posting! Wheee! ;-)

15 Comments:

Astronouth7303 said...

I think that's a good idea. Subversion already does the same thing, and I take liberty with MIME types anyway. (ie, if it is a plain text format, call it text/*. So I have text/x-php, text/x-python, etc running about.)

Freedesktop.org already has a description of using xattr to store such data. http://freedesktop.org/wiki/CommonExtendedAttributes

8:06 AM  
