Saving Time and Money with a Rule Based Approach to Automatic Indexing

Written by Marjorie M.K. Hlava
President, Access Innovations, Inc.

Getting a Higher Return on Your Investment

There are two major types of automatic categorization systems. They go by many different names, but the information science theory behind them boils down to two schools of thought: rule based and statistics based.

Companies advocating the statistics based approach hold that editorially maintained rule bases require a large up-front investment and cost more overall. They also claim that statistics based systems are more accurate. On the other hand, statistics based systems require a training set up front and are not designed to allow editorial refinement for greater accuracy.

A case study may be the best way to see what the true story is. We conducted such a study to answer the following questions:

  • What are the real up-front costs of rule based and training set based systems?
  • Which approach takes more up-front investment?
  • Which is faster to implement?
  • Which has a higher accuracy level?
  • What is the true cost of the system with the additional cost of creating the rule base or collecting the training set?

To answer these questions, we’ll look at how each system works and then compare, side by side, the costs of implementing each in a real project.

First, a few assumptions and guidelines:

  1. There is an existing thesaurus or controlled vocabulary of about 6000 terms. If no such vocabulary exists, the cost of thesaurus creation must be added.
  2. Hourly rates and units per hour are based on field experience and industry rules of thumb.
  3. 85% accuracy is the baseline needed for implementation to save personnel time.
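As a rough illustration of how the 85% baseline might be checked, the sketch below scores automatic indexing against a human indexer's choices. It assumes accuracy means the share of the indexer's terms that the system also assigned, pooled over all documents. That definition, the function name indexing_accuracy, and the sample documents are illustrative assumptions, not part of the study.

```python
BASELINE = 0.85  # accuracy threshold from guideline 3

def indexing_accuracy(system_terms, indexer_terms):
    """Share of the indexer's terms that the system also assigned (assumed metric)."""
    matched = 0
    expected = 0
    for doc_id, human in indexer_terms.items():
        machine = set(system_terms.get(doc_id, []))
        matched += len(set(human) & machine)   # terms both agree on
        expected += len(set(human))            # terms the indexer chose
    return matched / expected if expected else 0.0

# Made-up sample data: document id -> assigned thesaurus terms.
indexer = {"doc1": ["Thesauri", "Automatic indexing"],
           "doc2": ["Taxonomies", "Rule bases"]}
system = {"doc1": ["Thesauri", "Automatic indexing"],
          "doc2": ["Taxonomies", "Metadata"]}

acc = indexing_accuracy(system, indexer)
print(f"{acc:.0%}", "meets baseline" if acc >= BASELINE else "below baseline")
```

With this sample data the system matches three of the indexer's four terms, so it scores 75% and falls below the baseline.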

The Rule Based Approach

A simple rule base (a set of rules matching vocabulary terms and their synonyms) is created automatically for each term in the controlled vocabulary (thesaurus, taxonomy, etc.). With an existing, well-formed thesaurus or authority file, this is a two-hour process. Rules for both synonyms and preferred terms are generated automatically.
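The idea of a simple rule base can be sketched in a few lines. This is a minimal illustration, not the actual software: it assumes a thesaurus held as a mapping from preferred term to synonyms, and it treats each rule as a whole-word, case-insensitive text match that assigns the preferred term. The sample thesaurus and the function names build_simple_rules and index_text are hypothetical.

```python
import re

# Hypothetical thesaurus: preferred term -> synonyms (illustrative data only).
thesaurus = {
    "Automatic indexing": ["machine indexing", "automated indexing"],
    "Thesauri": ["controlled vocabularies", "thesaurus"],
}

def build_simple_rules(thesaurus):
    """Generate one match rule per preferred term and per synonym."""
    rules = []
    for preferred, synonyms in thesaurus.items():
        for text in [preferred, *synonyms]:
            # Whole-word, case-insensitive match on the term text.
            pattern = re.compile(r"\b" + re.escape(text) + r"\b", re.IGNORECASE)
            rules.append((pattern, preferred))
    return rules

def index_text(text, rules):
    """Return the sorted preferred terms whose rules fire on the text."""
    return sorted({preferred for pattern, preferred in rules if pattern.search(text)})

rules = build_simple_rules(thesaurus)
print(index_text("A thesaurus makes machine indexing easier.", rules))
# prints ['Automatic indexing', 'Thesauri']
```

Because every synonym rule points back to its preferred term, a document that uses only the synonym is still indexed under the controlled vocabulary term.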

On average, about 10% to 20% of the terms in the vocabulary need complex rules. Rules are created at a rate of 4 – 6 per hour. So for a 6000 term thesaurus, creating 600 complex rules (10% of the terms) at 6 per hour takes 100 hours, or 2.5 man-weeks. Some people begin indexing with the software immediately to get some baseline statistics and then do the rule building. Accuracy (measured against the indexing terms a skilled indexer would select) is usually 60% with just the simple rule base and 85 – 92% with the complex rule base.
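The effort estimate above is simple arithmetic and can be reproduced as follows. The sketch assumes a 40-hour working week, which the 2.5-week figure implies; the function name is illustrative. The second call uses the other ends of the stated ranges (20% of terms, 4 rules per hour) and is a derived worst case, not a figure reported in the study.

```python
def complex_rule_effort(terms, complex_share, rules_per_hour, hours_per_week=40):
    """Return (number of complex rules, hours, weeks) for building the complex rule base."""
    rules = round(terms * complex_share)
    hours = rules / rules_per_hour
    weeks = hours / hours_per_week   # assumes a 40-hour week by default
    return rules, hours, weeks

# Figures used in the text: 10% of 6000 terms at 6 rules per hour.
print(complex_rule_effort(6000, 0.10, 6))   # (600, 100.0, 2.5)

# Other end of the stated ranges: 20% of terms at 4 rules per hour.
print(complex_rule_effort(6000, 0.20, 4))   # (1200, 300.0, 7.5)
```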

The rule based approach places no limit on the number of users, the number of terms used in the taxonomy created, or the number of taxonomies held on a server.

The software is completed and shipped via CD-ROM the day the purchase is made. For the client’s convenience, if the data is available in one of three standard formats (tab- or comma-delimited, XML, or left-tagged ASCII), it can be preloaded into the system. Otherwise a short conversion script is applied.

On average, customers are up and running one month after the contract is complete.

The client for whom we prepared the rule base reported 92% accuracy in totally automated indexing and a four-fold increase in productivity.

The up-front time and dollar investment, based on the workflow for full implementation, is as follows:

Table 1

The Statistical Approach – Training Set Solution

To analyze the statistical approach, which requires use of a training set up front, we used the same pre-existing 6000 term thesaurus.

The cost of the software usually ranges from about $75,000 to $250,000. (Costs can run much higher, but we will use this lower estimate. Some systems limit the number of terms allowed in a taxonomy, requiring an extra license or secondary file building.) Training and support are an additional expense of about $2000 per day. Usually one week of training is required ($10,000), and travel expenses may be added.
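The license and training figures above can be combined into a partial cost floor, sketched below. It assumes a five-day training week, which matches the $10,000 figure, and covers only software and training. Travel, training set collection, and staff time are left out, so these sums are not the totals of the study's cost table. The names are illustrative.

```python
SOFTWARE_LOW = 75_000      # low end of the quoted software cost
SOFTWARE_HIGH = 250_000    # high end of the quoted software cost
TRAINING_PER_DAY = 2_000   # training and support, per day
TRAINING_DAYS = 5          # assumption: one working week of training

def license_and_training(software_cost, days=TRAINING_DAYS, day_rate=TRAINING_PER_DAY):
    """Software plus training only; excludes travel and training set collection."""
    return software_cost + days * day_rate

print(license_and_training(SOFTWARE_LOW))    # 85000
print(license_and_training(SOFTWARE_HIGH))   # 260000
```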

The up-front time and dollar investment, based on the workflow for implementation for the statistical (Bayesian, co-occurrence, etc.) systems, is as follows:

Table 2

A two-fold productivity increase was noted using the statistical system. Accuracy has not exceeded 72% to date.


The table that follows compares the return on investment for the rule based system and the statistics based system in terms of total cost and time to implementation.

Table 3

Under the assumptions outlined above, a rule based system offers considerable savings in both time and money over a statistics based system, by a factor of almost seven.

About Access Innovations

Access Innovations, Inc. is a software and services company founded in 1978. It operates under the stewardship of the firm’s principals, Marjorie M.K. Hlava, President, and Jay Ven Eman, CEO.

Closely held and financed by organic growth and retained earnings, the company has three main components: a robust services division, the Data Harmony software line, and the National Information Center for Educational Media (NICEM).

