Saving Time and Money with a Rule Based Approach to Automatic Indexing

Written by Marjorie M.K. Hlava
President, Access Innovations, Inc.

Getting a Higher Return on Your Investment

Automatic categorization systems go by many different names. However, the information science theory behind them boils down to two major schools of thought: rule based and statistics based.

Companies advocating the statistics system hold that editorially maintained rule bases take a lot of up-front investment and higher costs overall. They also claim that statistics based systems are more accurate. On the other hand, statistics based systems require a training set up front and are not designed to allow editorial refinement for greater accuracy.

A case study may be the best way to see what the true story is. We conducted such a study to answer the following questions:

  • What are the real up-front costs of rule based and training set based systems?
  • Which approach takes more up-front investment?
  • Which is faster to implement?
  • Which has a higher accuracy level?
  • What is the true cost of the system with the additional cost of creating the rule base or collecting the training set?

To answer these questions, we’ll look at how each system works and then the costs of actual implementation in a real project for side-by-side comparison.

First, a few assumptions and guidelines:

  1. There is an existing thesaurus or controlled vocabulary of about 6000 terms. If not, then the cost of thesaurus creation needs to be added.
  2. Hourly rates and units per hour are based on field experience and industry rules of thumb.
  3. 85% accuracy is the baseline needed for implementation to save personnel time.

The Rule Based Approach

A simple rule base (a set of rules matching vocabulary terms and their synonyms) is created automatically for each term in the controlled vocabulary (thesaurus, taxonomy, etc.). With an existing, well-formed thesaurus or authority file, this is a two-hour process. Rules for both synonym and preferred terms are generated automatically.
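The automatic generation of simple rules can be pictured with a minimal Python sketch. It assumes a toy thesaurus stored as a dictionary; the names THESAURUS, build_simple_rules, and index_text, as well as the sample terms, are hypothetical illustrations and not the format or API of any actual product.

```python
import re

# Hypothetical mini-thesaurus: preferred term -> synonyms (illustrative only).
THESAURUS = {
    "Automatic indexing": ["machine indexing", "auto-indexing"],
    "Controlled vocabularies": ["thesauri", "authority files"],
}

def build_simple_rules(thesaurus):
    """Create one match rule per preferred term and per synonym."""
    rules = {}
    for preferred, synonyms in thesaurus.items():
        for text in [preferred, *synonyms]:
            # Each rule maps the text to look for onto the preferred term.
            rules[text.lower()] = preferred
    return rules

def index_text(text, rules):
    """Return the sorted preferred terms whose rule text appears in `text`."""
    found = set()
    lowered = text.lower()
    for phrase, preferred in rules.items():
        # Whole-word match so a phrase does not fire inside a longer word.
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            found.add(preferred)
    return sorted(found)

RULES = build_simple_rules(THESAURUS)
SAMPLE_TERMS = index_text("Machine indexing relies on well-built thesauri.", RULES)
print(SAMPLE_TERMS)
```

Because every synonym points back to its preferred term, a document that uses only non-preferred wording is still indexed with the standard vocabulary.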

Complex rules are needed for an average of about 10% to 20% of the terms in the vocabulary. Rules are created at a rate of 4 to 6 per hour. So for a 6000-term thesaurus, creating 600 complex rules (10% of the terms) at 6 per hour takes 100 hours, or 2.5 man-weeks. Some people begin indexing with the software immediately to get some baseline statistics and then do the rule building. Accuracy (as compared with the indexing terms a skilled indexer would select) is usually 60% with just the simple rule base and 85% to 92% with the complex rule base.
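The effort estimate can be reproduced with a short calculation. The sketch below assumes a 40-hour work week, which the text implies but does not state; the function name complex_rule_effort is hypothetical.

```python
def complex_rule_effort(vocabulary_size, complex_share, rules_per_hour,
                        hours_per_week=40):
    """Estimate rule count, hours, and weeks for complex rule building.

    hours_per_week=40 is an assumption; the source gives only the totals.
    """
    complex_rules = round(vocabulary_size * complex_share)
    hours = complex_rules / rules_per_hour
    return complex_rules, hours, hours / hours_per_week

# Figures used in the text: 6000 terms, 10% complex, 6 rules per hour.
rules, hours, weeks = complex_rule_effort(6000, 0.10, 6)
print(rules, hours, weeks)

# Upper end of the stated ranges: 20% complex at 4 rules per hour.
print(complex_rule_effort(6000, 0.20, 4))
```

At the other end of the stated ranges (20% of terms at 4 rules per hour), the same calculation gives 1200 rules and 300 hours, so the 2.5-week figure is the optimistic case.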

The rule based approach places no limit on the number of users, the number of terms used in the taxonomy created, or the number of taxonomies held on a server.

The software is completed and shipped via CD-ROM the day the purchase is made. For the client’s convenience, if the data is available in one of three standard formats (tab- or comma-delimited, XML, or left-tagged ASCII), it can be preloaded into the system. Otherwise a short conversion script is applied.

On average, customers are up and running one month after the contract is complete.

The client for whom we prepared the rule base reported 92% accuracy in totally automated indexing and a four-fold increase in productivity.

The up-front time and dollar investment, based on the workflow for full implementation, is as follows:

Table 1

The Statistical Approach – Training Set Solution

To analyze the statistical approach, which requires use of a training set up front, we used the same pre-existing 6000-term thesaurus.

The cost of the software usually ranges from about $75,000 to $250,000. (Though costs can run much higher, we will use this lower estimate. Some systems limit the number of terms allowed in a taxonomy, requiring an extra license or secondary file building.) Training and support are an additional expense of about $2000 per day. Usually one week (five days) of training is required, for a total of $10,000. Travel expenses may be added.

The up-front time and dollar investment, based on the workflow for implementation for the statistical (Bayesian, co-occurrence, etc.) systems, is as follows:

Table 2

A two-fold productivity increase was noted using the system. Accuracy has not gone above 72% at present.


The table that follows compares the return on investment for the rule based system and the statistics based system in terms of total cost and time to implementation.

Table 3

Considerable savings in both time and money can be gained by using a rule based system instead of a statistics based system: a factor of almost seven, based on the assumptions outlined above.

About Access Innovations

Access Innovations, Inc. is a software and services company founded in 1978. It operates under the stewardship of the firm’s principals, Marjorie M.K. Hlava, President, and Jay Ven Eman, CEO.

Closely held and financed by organic growth and retained earnings, the company has three main components: a robust services division, the Data Harmony software line, and the National Information Center for Educational Media (NICEM).

Improving Enterprise Search Using Auto-Categorization: Making the Business Case to Senior Executives

By Marjorie M.K. Hlava and Jay Ven Eman
of Access Innovations, Inc.

This white paper addresses the significance of using a business case approach to improve corporate search through auto-categorization and taxonomy. These solutions are understood by corporate librarians and knowledge management leaders, but their value is often poorly understood by the executives responsible for the budget and approval process.

This paper differentiates between solely presenting a technical resource to the business vs. using a well thought-out business case when attempting to procure enterprise or department funding. Search is on the radar of senior management due to the appearance of Google and other search systems. There is a vast proliferation of knowledge workers, and efficiencies in information throughput are in strong demand. Workers spend more than 25% of their time searching for information (IDC Research, 2008). The average corporation has four search systems with none of them delivering productivity to the work force. This issue has emerged as a significant concern in helping to drive higher business productivity and profits.

This paper outlines how the development of a cohesive taxonomy strategy, well aligned with corporate business needs, becomes a strategic investment supporting staff productivity and overall knowledge worker output quality. It is a tactical purchase to strengthen the company’s competitive edge.

There is now a 92% accuracy rating on accounting and regulatory document search based on hit, miss and noise or relevance, precision and recall statistics [using] Access Innovations. –USGAO

Obstacles in optimizing search

The problem with search is that it usually depends on statistics and immense data processing and storage to process answers, without paying attention to the language of the user. Corporate intranets, pharmaceutical firms, large database publishers, and magazine and content publishers suffer without well-formed information to clearly indicate conceptual links, provide replicable results, and support intuitive semantic search. This directly impacts the knowledge worker’s patience and productivity, with many spending one fourth of their time looking for information rather than using it in creative and strategic ways. Individual lost time multiplied by tens to hundreds in a large corporation significantly undermines the bottom line. By not readily allowing the user his or her own terminology, the system creates small hurdles which, multiplied by many failed searches, become large barriers. The result is a loss of efficiency and flexibility across the entire enterprise.

Agile enterprises must provide a mechanism for the user to automatically translate their terms, dialect, or language into well-formed, standard terms. This provides for consistent, deep searching, the most effective means to obtain information with comprehensive recall and accuracy. It prevents trial-and-error searching that wastes workers’ time. Factor in the direct and burden costs of each knowledge worker; the cost savings rapidly become significant.

Research has shown that most classification systems touted as automatic actually require rules to reach productive levels for production or search. The rules differentiate among meanings of words to correctly interpret a document. To create and maintain these rules, one needs to build a rich semantic layer and then place a rule-based application over the classification function. Traditional search does not provide this functionality. To facilitate information capture and retrieval that runs at 6, 8, even 10 times greater productivity, a good taxonomy must provide the search backbone.
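How a rule can differentiate among meanings of a word is easiest to see in a small example. The Python sketch below is illustrative only: the rule structure (an ambiguous trigger word plus context words), the name apply_complex_rules, and the sample taxonomy terms are all assumptions, not the rule syntax of any particular product.

```python
import re

# Hypothetical complex rules: an ambiguous trigger word plus context words,
# one of which must appear in the text for the taxonomy term to be assigned.
COMPLEX_RULES = {
    "mercury": [
        ("Mercury (planet)", {"orbit", "planet", "solar"}),
        ("Mercury (element)", {"toxic", "thermometer", "metal"}),
    ],
}

def apply_complex_rules(text, rules):
    """Assign taxonomy terms only when the trigger word has supporting context."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    assigned = []
    for trigger, candidates in rules.items():
        if trigger not in words:
            continue
        for term, context in candidates:
            # The rule fires if any context word occurs in the document.
            if words & context:
                assigned.append(term)
    return assigned

PLANET_DOC = apply_complex_rules(
    "Mercury has the shortest orbit of any planet.", COMPLEX_RULES)
METAL_DOC = apply_complex_rules(
    "Mercury is a toxic metal once used in thermometers.", COMPLEX_RULES)
print(PLANET_DOC, METAL_DOC)
```

A simple synonym match would tag both sample sentences identically; the context conditions are what separate the two meanings, and an editor can refine them as new cases appear.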

IT departments, charged with safeguarding valuable corporate information, require a simple and safe way for users to manage the categorization tools, to avert increasing IT costs and burden. The current move to Web 2.0 empowers users and lessens the load on IT departments. Collaborative taxonomy management supports Web 2.0 initiatives.

We have moved from a fielded Boolean search to a faceted search GUI, but the fundamentals of search still hold. The 1960s gave us the Arpanet and ReCon systems, which gave rise to the Internet and present search technologies. Metadata elements rose from fielded data. The missing piece in today’s search is the taxonomy application. The market challenge is to produce solutions that enhance search through taxonomy and automatic categorization.

IEEE had their system up and running in three days, in full production in less than two weeks. –Institute of Electrical and Electronics Engineers

The American Economic Association said its editors think using it is fun and makes time fly! –American Economic Association (AEA)

The business of auto-categorization and taxonomies

Well-formed data, with clear indication of conceptual semantic links, provides replicable results and intuitive, semantic search. Users search with their own words, removing obstacles to search success and increasing productivity. The system translates non-standard word choices to consistent taxonomy terms, resulting in consistent, deep searching and, ultimately, greater knowledge access and use.

To produce the highest level of productivity at the most cost-effective TCO (total cost of ownership), a system must provide both semantic interpretation and governing rules linked to a taxonomy. This ensures fast, accurate search regardless of the skill or number of users.

Good corporate compliance systems need to ensure conformity with accepted taxonomy standards. These include ANSI/NISO Z39.19 and those from the ISO, W3C, British Standards Institution, and other standards-setting organizations.

To minimize costs, the categorization system should work at both ends of the information management process to provide seamless performance: the content creation, content management, and digital depository end, and the search end.

Dangers in the industry that inhibit seamless performance include out-of-date data schemas in which critical data is stored in extinct formats and media. Strategic planning for search must consider migration of this data as technical platforms evolve. Most enterprises handle terabytes of data with an average lifespan of 3 years. With often inadequate and over-capacity contingency plans (all of which further exacerbate search inefficiencies), these huge information stores must be configured to ensure that the data is platform-independent and accommodates new technologies.

Value drivers for your project

Business issues and value drivers supporting projected returns are shown here.

The need for a supportive business case

A business case is vital in helping executives rationalize decisions, especially ones of a technical nature. It facilitates their ability to analyze the technology’s impact compared with other corporate opportunities, particularly with limited budgets.

Having financial metrics along with technical recommendations fuels the ability to communicate expected upstream value. Several industry-leading vendors are extending themselves by drawing up contracts where payment is conditioned on proving delivered value. Accenture, Triology, and IBM have established value-based selling as a best practice; soon, it will be an industry standard.

Research shows that, of over 400 software vendors, close to 75% fail to prove their solution’s tangible value. These vendors sell solutions that challenge the client to build business value. But that business value must be clearly described in the business case.

Building a supportive business case also needs to address technical issues such as enabling semantic search, interlinking data, and using rules.

Many firms use a “discovery” process, where technical and business parties join forces in discovering value in a proposed solution. This collaborative process demonstrates how departmental needs are aligned with business value and IT impact and strengthens your business case.

The following elements are key in assembling a software or services business case:

  1. Value proposition – summarizes the position
  2. Executive summary – brief and bottom line
  3. Risk, impact, and strategic benefit
  4. ROI validation – clear and concise is best
  5. Competitive TCO – for competing vendors
  6. IT impact and support – to build bridges

ProQuest CSA has achieved a 7-fold increase in productivity. –ProQuest CSA

Weather Channel finds things 50% faster using Data Harmony. A significant saving in time. –The Weather Channel

Supporting the Metrics

The baseline for integrating automated or assisted metatagging into your workflow should be 85% accuracy, or 15-20% irrelevant returns (noise). When this level is reached, you can potentially see seven-fold increases in productivity and cut search time in half. Achieving these levels demonstrated notable credibility for CSA’s implementation.
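Hit, miss, and noise counts translate into precision and recall in the standard way. The Python sketch below compares system-assigned terms with the terms a human indexer would select; the function name indexing_metrics and the sample terms are hypothetical, and the paper does not specify exactly how its own accuracy figures were computed.

```python
def indexing_metrics(system_terms, indexer_terms):
    """Compare system-assigned terms with a human indexer's choices."""
    system, human = set(system_terms), set(indexer_terms)
    hits = len(system & human)     # terms both the system and indexer chose
    misses = len(human - system)   # indexer terms the system failed to assign
    noise = len(system - human)    # system terms the indexer would not select
    precision = hits / (hits + noise) if system else 0.0
    recall = hits / (hits + misses) if human else 0.0
    return {"hits": hits, "misses": misses, "noise": noise,
            "precision": precision, "recall": recall}

# Hypothetical single-document comparison (illustrative terms only).
METRICS = indexing_metrics(
    ["Taxonomies", "Search engines", "Metadata", "Libraries"],
    ["Taxonomies", "Search engines", "Metadata", "Thesauri"],
)
print(METRICS)
```

Averaged over a representative sample of documents, figures like these give a concrete way to check whether an implementation has reached the 85% baseline.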

Though the benefits of an ROI measure depend on size of audience, audience level, complexity of content, and complexity of search, there are reliable data points that can be used. This table serves as a guideline when building cost-justification efforts to buy auto-classification and taxonomy solutions.

A guideline when building cost-justification efforts.

The Value Produced

Building your case will be invaluable when presenting it to management or a budgeting committee. It helps your department be viewed as in step with management and supportive of corporate strategic goals. To the owner of the case, the benefits are clear:

  • Projects are better received.
  • Projects are well justified.
  • Projects are viewed as more than “tools”.
  • Projects receive better funding.


This paper seeks to illuminate the importance of a well thought-out business case. Whether using outside vendors or an internal committee, following the steps to build each aspect of a persuasive business case for a solution’s implementation is ultimately the most successful way to identify your needs and promote your project.
