- 1 About
- 2 Features
- 3 Compiling and Installing
- 4 Download
ioda is a full-text database: a word indexing and retrieval engine. It stores the unique words from a file or database source in a B-tree and their repeated occurrences in a flexible, highly space-optimized list structure. Each stored word "knows" its source, its position in the source, and some optional info bytes.
We use the term "database" for the set of all files that make up an ioda data collection. For example, if you have indexed your web server's HTML files in an ioda database called "myserver", the database consists of at least these files: myserver.config, myserver.btf, myserver.ocl, and possibly myserver.ref. The only file you have to edit manually is the config file, in which you describe the properties of the database. There may be some additional helper files, which are described below.
Master or Slave
ioda can be used standalone ("master mode") for archiving files. In this case it stores full file names and can archive whole directory trees, e.g. the entire content of a web server, in a single call. Alternatively, ioda can be used as an add-on to an existing (e.g. SQL) database in "slave mode". In that case it stores the unique key of each database record as the reference for its words.
For retrieving information, ioda handles logical operators (AND, OR, NOT, NEAR), parentheses, and optional word distance values (e.g. AND.4). NEAR is an operator equivalent to AND.50. The query parser of ioda is able to optimize the search path for complex queries like "((Albert or Alfred) and.1 Einstein) and Quant* not Physik*".
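To make the word distance operators concrete, here is a minimal sketch of how a check such as AND.4 could be evaluated on two lists of word positions. It is not ioda's implementation: the function names `and_within` and `near` are hypothetical, and the sketch assumes that the distance is symmetric (the second word may come before or after the first), which the text above does not state.

```python
from bisect import bisect_left

def and_within(positions_a, positions_b, max_distance):
    """Return True if any position of word A lies within max_distance
    words of any position of word B (assumed symmetric)."""
    b_sorted = sorted(positions_b)
    for pos in positions_a:
        # First B position that is not too far to the left of pos.
        i = bisect_left(b_sorted, pos - max_distance)
        # It matches if it is also not too far to the right.
        if i < len(b_sorted) and b_sorted[i] <= pos + max_distance:
            return True
    return False

def near(positions_a, positions_b):
    """NEAR is described above as AND.50."""
    return and_within(positions_a, positions_b, 50)
```

With this sketch, `and_within([3], [4], 1)` holds for two adjacent words, while `and_within([3], [10], 1)` does not.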
Wildcards and Regular Expressions
As of release 1.3, ioda can retrieve data with wildcards or regular expressions. For example, the word "barfooter" will be found by the query /foo/. This is similar to the wildcard notation *foo*. Internally, ioda converts most wildcards into regular expressions.
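The relationship between the two notations can be illustrated with a short sketch. It only shows the general idea of turning a `*` wildcard into an anchored regular expression; the helper name `wildcard_to_regex` is hypothetical, and the rules ioda actually applies are not documented here.

```python
import re

def wildcard_to_regex(pattern):
    """Translate a '*' wildcard pattern into an anchored regular expression.
    Every other character is escaped and matched literally."""
    parts = [re.escape(part) for part in pattern.split("*")]
    return "^" + ".*".join(parts) + "$"

# The wildcard *foo* and the unanchored regex /foo/ both find "barfooter".
wildcard_hit = re.search(wildcard_to_regex("*foo*"), "barfooter") is not None
regex_hit = re.search("foo", "barfooter") is not None
```

A pattern with only a trailing wildcard, such as Quant*, becomes `^Quant.*$` in this sketch and therefore matches only words that start with "Quant".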
Delete and Update Functions
ioda can delete entries, and it can update them by deleting the old version and inserting the new one. (An entry is the list of words from an article, a file, etc.) ioda also offers a merge function, either for merging two databases into one or for optimization. In the latter case, an existing database is rebuilt with contiguous word lists (which cannot be created in the original archiving run without wasting a lot of disk space).
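The "update as delete plus insert" idea can be sketched with a tiny in-memory inverted index. This is an illustration only: the class `TinyIndex` and its methods are hypothetical and unrelated to ioda's on-disk structures, and the uppercase normalization is simply an assumption made for the example.

```python
class TinyIndex:
    """Toy inverted index: word -> list of (entry_id, position)."""

    def __init__(self):
        self.postings = {}

    def insert(self, entry_id, text):
        # Record every word of the entry together with its word position.
        for position, word in enumerate(text.split()):
            self.postings.setdefault(word.upper(), []).append((entry_id, position))

    def delete(self, entry_id):
        # Drop all occurrences that belong to the entry; remove empty words.
        for word in list(self.postings):
            kept = [occ for occ in self.postings[word] if occ[0] != entry_id]
            if kept:
                self.postings[word] = kept
            else:
                del self.postings[word]

    def update(self, entry_id, text):
        # An update is a delete of the old version followed by an insert.
        self.delete(entry_id)
        self.insert(entry_id, text)
```

After `insert(1, "albert einstein")` and `update(1, "alfred einstein")`, the word ALBERT is gone from the index and ALFRED points to entry 1.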
Sorting by Relevance
There are some more features: ioda can sort hits by time (of the file or database entry) or by weight. In the latter case, words (or word combinations when the AND operator is used) are weighted by their position in the text. ioda can optionally detect duplicate texts by their MD5 checksums and can either ignore them or store them in a space-optimized way.
ioda can handle all ISO-8859-XX charsets and UTF-8. For ISO charsets, ioda can handle case folding itself (an optional automatic uppercase function). With UTF-8, the calling application has to handle all case folding.
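Duplicate detection by checksum can be sketched as follows. The function name `find_duplicates` and the input format (pairs of ID and text) are assumptions for illustration; how ioda stores or skips duplicates internally is not shown here.

```python
import hashlib

def find_duplicates(entries):
    """Given (entry_id, text) pairs, return (duplicate_id, first_id) pairs
    for every text whose MD5 checksum was already seen."""
    seen = {}          # MD5 hex digest -> id of the first entry with that text
    duplicates = []
    for entry_id, text in entries:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen:
            duplicates.append((entry_id, seen[digest]))
        else:
            seen[digest] = entry_id
    return duplicates
```

An indexer could use such a list either to skip the duplicate entries or to store only a reference to the first copy.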
Flexible Indexing through external Filters
To archive whole directory trees, ioda needs the support of an external program. This can be written in any language and may work as a pipeline or generate temporary files. ioda can store additional information with each word. Besides the mandatory fields (source ID, source position, and a 16-bit value for flags and other information), each word can optionally carry a timestamp and a 32-bit value (instead of the 16-bit one).
Tailor-made Data Structures
The database structure of ioda consists of two or three parts, all of which were designed by the author (they are non-standard):
- The Bayer tree, or B-tree (*.btf): it stores all unique words, each pointing to...
- The word occurrence list (*.ocl): it stores information about the words, at least the file or database ID (e.g. the unique key) as a doubleword, the position (counted in words) as a word, the weight, and an optional info byte. The info byte can store information such as "word is in title". ioda offers larger data models for the occurrence list, e.g. for storing a timestamp or source information with each word. These larger structures are mainly used when ioda runs standalone.
- The file reference list (*.ref) is used in standalone operation only. In this case, ioda manages the IDs itself ("master mode"), and the IDs point to entries in the file reference list (instead of coming from a master database). The file reference list stores a full path name for each entry. When the ioda database is created, a base path can be defined: a leading part of the full path (e.g. a web server root path) that is stripped from each entry to avoid redundant information.
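The parts listed above can be pictured with a small sketch of an occurrence record and of base path stripping. The layout is an assumption for illustration only: a 32-bit ID, a 16-bit position, and one byte each for weight and info, packed little-endian. The real on-disk format of the *.ocl and *.ref files is not specified here, and all names in the code are hypothetical.

```python
import struct

# Assumed layout: doubleword id, word position, weight byte, info byte.
OCCURRENCE = struct.Struct("<IHBB")

def pack_occurrence(source_id, position, weight, info=0):
    """Pack one occurrence into a fixed-size binary record."""
    return OCCURRENCE.pack(source_id, position, weight, info)

def unpack_occurrence(data):
    """Unpack a record into (source_id, position, weight, info)."""
    return OCCURRENCE.unpack(data)

def strip_base_path(full_path, base_path):
    """Remove the agreed base path from a full path, if it is a prefix."""
    if full_path.startswith(base_path):
        return full_path[len(base_path):]
    return full_path
```

In this sketch each occurrence takes 8 bytes, and a file under a web server root such as `/var/www/html/` would be stored with only its relative path.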
Four binaries can be built from the source:
- ioda as a command line program (joda)
- ioda as a server for client/server communication over TCP sockets (jodad)
- ioda as a linkable library (libjodafulltext.so). Interfaces to C, Perl, Python and PHP are included in the source package
- ioda as a CGI program. This is only a bare skeleton that does no HTML formatting
ioda is used in production, e.g. as the full-text index of a Wikipedia mirror: http://lexikon.rhein-zeitung.de. Try a query with a wildcard (*) to force a search, or use this link as an example:
(((Albert or Alfred) and.1 Einstein) and /^Quant.+sprung/) not Schrödinger
Compiling and Installing
You can use the binaries from the bin package immediately under Linux. For compiling the sources, a Makefile is available in the source package. If you want to use the Perl, Python or PHP import modules, please install the source or the binary package first! To install everything, extract the source package into one subdirectory, then call "make" followed by "make install" from the master Makefile to do it all in one go. The Free Pascal Compiler ≥ 1.9.3 is needed (the recent version is 2.0). Important: switch on Delphi mode in the fpc config file (-S2)! No other libraries are required for the binaries. At the moment, ioda is only guaranteed to run under Linux. Under Windows, only read-only operation has been tested so far. In theory, porting ioda to any other operating system supported by Free Pascal should take little or no work.