This is a revised version of the 20100103 post. Tom rightly commented on 2010110 that there was some confusion between data and information, and he warned me that if I ended up talking of insight, I would certainly get it wrong. I am lucky he did not mention knowledge and wisdom… Just to clarify things and avoid using big words, I have adapted this post: what I want to find is “files”, and what the search is based on is “text” contained in the files… where text could be numbers, e.g. years (1739). More generally, “text” is thus any alphanumeric string contained in the file, in the file name, or in the information fields. Some text may be “keywords”, but I shall cover this later.
This being said, how do you find files on your computer? If you have few files, and an easy and logical directory structure (mypictures, mymusic, myreports…), this should be a relatively easy task… but when you have many files, say 10000 or a hundred thousand, things get a bit tougher, particularly if you want to locate files by their names and by their contents.
This post is about software that was designed to help users retrieve files, such as Google Desktop, Beagle, Tracker, Doodle, Strigi and Pinot. They are not equivalent, which is why they are difficult to compare; none of them actually offers what I am looking for. As I think I know what I would like to have, I decided to write this article. Maybe some day some people will start an open source project to develop the ideal search engine… Let me know, because I’ll join them!
This post is subdivided into three parts.
- The first is a somewhat detailed inventory of the types of files I have on my hard disk, as it is useful to know a bit better what you have on your hard disk if you want to optimize searches…
- The second provides a comparison of search tools, in terms of what they look for, how much memory they need, how big their index files can be and particularly what they find!
- The third is my wishlist, i.e. a description of the search engine I would like to have.
Before starting, please note that, although my computer is a Dell Latitude D830 laptop running Ubuntu 9.10, much of the discussion below should be relevant and of interest to people using other operating systems as well!
1. Inventory of my hard disk
1.1. Directory structure
I have about 130000 user files and directories for a total of 40 GB on my computer, all kept in a directory called myfiles which, for security reasons, corresponds to a separate partition on the hard disk. For other files, see endnote (1). Since I shall discuss only myfiles, I will refer to it as “the” disk, with directories, sub-directories, sub-sub-directories etc.
myfiles is subdivided into several directories with the following number of files and sub-directories:
• data, 16000 files in 150 sub-directories. This is where I keep documentation on various subjects I am interested in. There is, for instance, sub-directory 1739-40 with info about the severe winter that occurred during those years; sub-directory risk with documents on risk management (countries at risk, globalisation of risk); and sub-directory volcanoes with sub-sub-directories pinatubo, nyiragongo, agung, etc.
• documents, 27000 files and 6 sub-directories; documents contains… documents which I have written (publications, powerpoint presentations…) together with all the supporting material. There is some overlapping with folder data and I may one day reorganise myfiles and merge them. This folder is where the deepest sub-sub-sub… directories (up to 8 levels below documents) are to be found.
• genealogy, 10000 files in 31 sub-directories, from ardennenschlacht to website.old; genealogy has documentation about my family, and related families, together with background about the families, i.e. sources (scanned archives), historical information about localities where the families lived, gedcom files etc.
• help, 11000 files in three subdirectories called hardware, linux and software
• pictures, 9000 files
• programmes, 7000; these are the programmes which I have written over the last 20 years or so
• storage, 41000: mainly linux and windows installation programmes
• website, 4000: the backup of my websites, and other material used on websites
• others, about 5000 files.
1.2. File types
The detailed inventory that follows covers 61419 files in the directories data, documents, genealogy and help only. Based on the extensions (suffixes of the file names), the disk contains 392 file types, plus a type without extension. If you think file extensions and file types are not the same thing, please see endnote (2). Some of the rarest extensions still seem familiar: loc, ltd, lu1, lu2, lu3, m3u, memory, memory~, 2f, rot, sdw, snm, sys, tdm, tpl, uev, url, me~, idx, info~, stat, text, this~, val, var!
The 30 most common extensions are given in the table below.
The table reads as follows: the first column is the rank, considering that the files are sorted according to the disk space they occupy, or their volume (KBytes of 1024 bytes). The largest volume is occupied by pdf files (3.87 GBytes), which is no surprise. The second column (Count rank) ranks the file types by number of files. For instance, the most numerous files are csv files, followed by html. pdfs are only fifth.
“Average KB” is the average size of the file type. pdf files on average occupy just under 1.5 MByte (1.486), and they rank only 48th in terms of their size. The largest files are avi (only 82nd by number of files).
The median size usually markedly differs from the average: it is 2 to 100 times smaller, indicating a positive skew in the size distributions. What this means in practice is that there is a very large number of small files.
The next column (Max KB) is the size of the largest file. For instance, the largest pdf peaks at 137 MBytes, but it is only the third largest (see col. Max Rank), the largest in absolute terms being an avi file of 571 MB, a BBC transmission on the “global warming swindle”. The last column is the 9th decile, i.e. the value of the file size that is exceeded by only 10% of the files. For instance, 90% of jpg images are smaller than 398 KBytes. The fact is that my disk, and possibly many others, is full of many small files.
Note that I have included only the files in the directories where I actually look for files. For instance, directory pictures contains about 9000 files, all but a couple of hundred of them jpg and jpeg. This directory is not searched for files.
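For readers who want to build a similar table for their own disk, here is a minimal Python sketch. It is not the script I used for the table above; the function names `inventory` and `summarise` are made up for this illustration, the extension is simply whatever follows the last dot of the file name, and the 9th decile is computed with the plain "nearest rank" rule, which is only one of several possible definitions.

```python
import math
import os
import statistics
from collections import defaultdict


def inventory(root):
    """Collect file sizes (KBytes of 1024 bytes) per extension below root."""
    sizes = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size_kb = os.path.getsize(path) / 1024
            except OSError:
                continue  # unreadable file or broken link: skip it
            # Assumption: the "type" is the lower-cased suffix after the last dot.
            ext = os.path.splitext(name)[1].lstrip(".").lower() or "(none)"
            sizes[ext].append(size_kb)
    return sizes


def summarise(sizes):
    """Return one row per extension, sorted by total volume (largest first)."""
    rows = []
    for ext, values in sizes.items():
        values = sorted(values)
        # 9th decile, nearest-rank rule: 90% of the files are at or below it.
        idx = max(0, math.ceil(0.9 * len(values)) - 1)
        rows.append({
            "ext": ext,
            "count": len(values),
            "total_kb": sum(values),
            "mean_kb": statistics.mean(values),
            "median_kb": statistics.median(values),
            "max_kb": values[-1],
            "decile9_kb": values[idx],
        })
    rows.sort(key=lambda row: row["total_kb"], reverse=True)
    return rows


if __name__ == "__main__":
    # "myfiles" is a placeholder: replace it with the directory to inspect.
    for row in summarise(inventory("myfiles"))[:30]:
        print(row)
```

Comparing `mean_kb` with `median_kb` in the output is a quick way to see the skew discussed above: when the mean is many times the median, a few large files dominate a crowd of small ones.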
The files I actually want to search for text are essentially documents (1: odt, doc, docx), pdf (2), saved webpages (3: htm, html) and, to some extent, presentations (4: odp, ppt, pptx) and, but rarely, spreadsheets (5: ods, gnumeric, xls, xlsx) or tarballs and compressed files (6: tar, tar.gz, tar.bz, rar, zip). Whatever system I will eventually select to find files must be able to search “office” documents, webpages and pdf. The first three types represent 25% of the volume of the files and 27% of their numbers. All 6 types make up 44% of the volume and 29% of the number of files.
2. Comparison of search engines
I have done a more or less systematic comparison of several search engines over a couple of months, including the following: recoll, tracker, google desktop, pinot, strigi, doodle, namazu. I also used beagle some time ago, but then dropped it to use tracker instead. I never managed to make strigi work correctly, either through the gnome deskbar applet or with catfish, a generic interface for search engines. Same thing with namazu. Doodle is a specific tool to search the info fields of all kinds of files, but since many of the other search programmes can do that as well, I did not keep doodle on my computer. In the end, I was thus left with recoll, google desktop, pinot and tracker. The table below provides a general overview of the search engines.
Somewhat irrationally, I eliminated pinot because of its confused user interface. I say “irrationally” because other people may like the interface, but I don’t understand its logic, and I grew allergic to it. Tracker knows neither wild cards nor regular expressions, and its user interface is ugly and “basic”. Some people may like the simplicity; I don’t. My preference goes to google desktop and to recoll. Google desktop occupies very little disk space, and it puts little strain on system resources (contrary to tracker, for instance, which can macroscopically slow down my computer). But google desktop is a typical “for dummies” product. For instance, it can only search a limited number of file types (not by chance, they happen to be the most frequent types on my computer, i.e. html, openoffice and MS office, and pdf!). Unfortunately, it does not scan info fields! One of the nice features is that search results can be sorted by “relevance”, but the concept is explained nowhere. It seems to be a mix of frequency of the searched words and a preference for MS Office files. For instance, a search for “food security” will first list the files where the words occur together, then separated, then food only and security only… I dropped a line to Google to find out how to interpret “relevance”, but they were too busy to answer.
My absolute preference goes to recoll, for several reasons:
- it does not use a daemon; instead, the user must update the indices every now and then (the updates are fast). The index can also be updated automatically at regular intervals through cron (insert 0 12 * * * recollindex in crontab to update the index every day at 12!) or by compiling recoll with the --with-fam or --with-inotify parameters;
- search capability is the most flexible of the four engines;
- it can do “proximity searches”, i.e. it can find “food security”, i.e. “security” immediately following “food”, or look for “food” and “security” separated by a certain number of words.
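To make the idea of a proximity search concrete, here is a small Python sketch. It is not how recoll does it (recoll works from an index, not by rescanning text); the function `near` and its parameters `max_gap` and `ordered` are invented for this illustration. It simply checks whether two words occur within a given number of intervening words of each other.

```python
import re


def tokenize(text):
    """Split text into lower-case words (letters, digits, underscore)."""
    return re.findall(r"\w+", text.lower())


def near(text, first, second, max_gap=0, ordered=True):
    """Return True if `second` is found close to `first` in `text`.

    max_gap is the number of words allowed between the two terms:
    max_gap=0 means `second` immediately follows `first`.
    With ordered=False, `second` may also come before `first`.
    """
    words = tokenize(text)
    first, second = first.lower(), second.lower()
    positions_first = [i for i, w in enumerate(words) if w == first]
    positions_second = [i for i, w in enumerate(words) if w == second]
    for i in positions_first:
        for j in positions_second:
            distance = j - i
            if not ordered:
                distance = abs(distance)
            if 1 <= distance <= max_gap + 1:
                return True
    return False


if __name__ == "__main__":
    print(near("Food security is fragile", "food", "security"))
    print(near("food and job security", "food", "security", max_gap=2))
    print(near("security of food", "food", "security", max_gap=1, ordered=False))
```

A real engine would store word positions in the index so that this check does not require reading the file again; the sketch only shows the logic of the comparison.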
I am now going to uninstall all the search engines from my computer. I am sure I will notice a jump in performance!
3. TOR for an ideal search engine
Recoll is still far from my “ideal search engine” (ISE). Here is a list of things that I would like the ISE to be able to do…
- make a distinction between “index words” (IW) – those identified in the text body and kept in the huge index files mentioned in the table above (> 2 GBytes) – and “keywords” (KW) that are assigned by the user and usually stored in the info fields;
- for both KW and IW, understand synonyms (war=conflict, capacitación=training, conflict=krieg, krieg=guerre, etc.);
- recognise KW and IW hierarchies (e.g. Africa=North Africa, West Africa, East Africa, etc.; North Africa=Algeria, Tunisia, Morocco, etc.);
- record authors (through a special KW) and offer the option to restrict all analyses to specific authors;
- understand regular expressions;
- assign densities to IWs (i.e. know how many times “buckwheat” occurs in a document, and use this to define a “relevance” indicator);
- recognise the structure of KWs and IWs, e.g. tell the user that “war” is correlated with “conflict”, “climate” with “atmosphere” and with “weather”, see that “insecurity” comes as “food insecurity”, “environmental insecurity” and “job insecurity”;
- automatically (optionally) assign KWs to documents based on the most frequent IWs in the document;
- help the user to rationally structure directories (maybe build a parallel “virtual directory structure”);
- have a GUI that lets users manage KWs and IWs (add some that do not exist in documents, cancel others, use a “stop list”, i.e. a list of words to be ignored). The GUI should be able to analyse the KWs and IWs, show correlations, densities, number of documents etc.
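A few items of this wishlist can be sketched in a handful of lines of Python, just to show what is meant. Everything below is hypothetical: the stop list, the synonym pairs and the hierarchy are tiny hand-made samples, and the function names (`index_words`, `density`, `suggest_keywords`, `expand`) are invented for the illustration. "Density" is taken here as the share of a word among all non-stop words of a document, which is only one possible definition.

```python
import re
from collections import Counter

# Hypothetical sample data; a real ISE would let the user edit these.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "is", "to"}
SYNONYM_PAIRS = [("war", "conflict"), ("conflict", "krieg"), ("krieg", "guerre")]
HIERARCHY = {
    "africa": ["north africa", "west africa", "east africa"],
    "north africa": ["algeria", "tunisia", "morocco"],
}


def symmetric(pairs):
    """Turn (a, b) pairs into a lookup that works in both directions."""
    table = {}
    for a, b in pairs:
        table.setdefault(a, set()).add(b)
        table.setdefault(b, set()).add(a)
    return table


def index_words(text, stop_words=STOP_WORDS):
    """Count the index words (IW) of a text, ignoring the stop list."""
    words = re.findall(r"\w+", text.lower())
    return Counter(w for w in words if w not in stop_words)


def density(text, word, stop_words=STOP_WORDS):
    """Share of `word` among all index words of the text (0.0 if empty)."""
    counts = index_words(text, stop_words)
    total = sum(counts.values())
    return counts[word.lower()] / total if total else 0.0


def suggest_keywords(text, n=3, stop_words=STOP_WORDS):
    """Propose the n most frequent index words as keywords (KW)."""
    return [word for word, _ in index_words(text, stop_words).most_common(n)]


def expand(term, synonyms, hierarchy):
    """Return the term with its synonyms and all narrower terms."""
    found, todo = set(), [term.lower()]
    while todo:
        current = todo.pop()
        if current in found:
            continue
        found.add(current)
        todo.extend(synonyms.get(current, ()))
        todo.extend(hierarchy.get(current, ()))
    return found


if __name__ == "__main__":
    text = "Buckwheat and buckwheat flour: the buckwheat harvest and the flour price"
    print(density(text, "buckwheat"))
    print(suggest_keywords(text, n=2))
    print(sorted(expand("war", symmetric(SYNONYM_PAIRS), HIERARCHY)))
    print(sorted(expand("africa", symmetric(SYNONYM_PAIRS), HIERARCHY)))
```

The hierarchy is followed downwards only, so a search for “Africa” would also cover Algeria, while a search for “Algeria” would not drag in the whole continent. Whether that is the right behaviour is exactly the kind of choice the ISE should leave to the user.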
(1) I am user ergosum. In addition to the myfiles directory, I have the standard /home/ergosum directory with the configuration files for my programmes (10 GB, about 10000 files), while system files (all the files outside the /home directory, including my programmes and the linux/ubuntu system) account for 370000 files (6 GB).
(2) It is difficult to say how many actual types this corresponds to, as linux programmes (except some of them with a windows ancestry or bias) do not recognise file types based on the suffix, but based on contents and permissions. In addition, several types are equivalent (jpg, jpeg; htm, html), and others are identical in coding but their names differ (bna, bnb; ida, img, af, dvi…). Therefore, “file types” and suffixes are indeed not the same thing, but for the purpose of this little inventory, we will assume they are!