Automatic Indexing: A Matter of Degree

Marjorie M.K. Hlava

Picture yourself standing at the base of that metaphorical range, the Information Mountains, trailhead signs pointing this way and that: Taxonomy, Automatic Classification, Categorization, Content Management, Portal Management. The e-buzz of e-biz has promised easy access to any destination along one or more of these trails, but which ones? The map in your hand seems to bear little relationship to the paths or the choices before you. Who made those signs?

In general, it’s been those venture-funded systems and their followers, the knowledge management people and the taxonomy people. Knowledge management people are not using the outlines of knowledge that already exist. Taxonomy people think you need only a three-level, uncontrolled term list to manage a corporate intranet, and they generally ignore the available body of knowledge that encompasses thesaurus construction. Metadata followers are unaware of the standards and corpus of information surrounding indexing protocols, including back-of-the-book, online and traditional library cataloging. The bodies of literature are distinct with very little crossover. Librarians and information scientists are only beginning to be discovered by these groups. Frustrating? Yes. But if we want to get beyond that, we need to learn — and perhaps painfully, embrace — the new lingo. More importantly, it is imperative for each group to become aware of the other’s disciplines, standards and needs.

We failed to keep up. It would be interesting to try to determine why and where we were left behind. The marketing hype of Silicon Valley, the advent of the Internet, the push of the dot com era and the entry of computational linguists and artificial intelligence to the realm of information and library science have all played a role. But that is another article.

Definitions

The current challenge is to understand, in your own terms, what automatic indexing systems really do and whether you can use them with your own information collection. How should they be applied? What are the strengths and weaknesses? How do you know if they really work? How expensive will they be to implement? We’ll respond to these questions later on, but first, let’s start with a few terms and definitions that are related to the indexing systems that you might hear or read about.

These definitions are patterned after the forthcoming revision of the British National Standard for Thesauri, but do not exactly replicate that work. (Apologies to the formal definition creators; their list is more complete and excellent.)

Document — Any item, printed or otherwise, that is amenable to cataloging and indexing, sometimes known as the target text, even when the target is non-print.
Content Management System (CMS) — Typically, a combination management and delivery application for handling creation, modification and removal of information resources from an organized repository; includes tools for publishing, format management, revision control, indexing, search and retrieval.
Knowledge Domain — A specially linked data-structuring paradigm based on a concept of separating structure and content; a discrete body of related concepts structured hierarchically.
Categorization — The process of indexing to the top levels of a hierarchical or taxonomic view of a thesaurus.
Classification — The grouping of like things and the separation of unlike things, and the arrangement of groups in a logical and helpful sequence.
Facet — A grouping of concepts of the same inherent type, e.g., activities, disciplines, people, natural objects, materials, places, times, etc.
Sub Facet — A group of sibling terms (and their narrower terms) within a facet having mutually exclusive values of some named characteristics.
Node — A sub-facet indicator.
Indexing — The intellectual analysis of the subject matter of a document to identify the concepts represented in the document and the allocation of descriptors to allow these concepts to be retrieved.
Descriptor — A term used consistently when indexing to represent a given concept, preferably in the form of a noun or noun phrase, sometimes known as the preferred term, the keyword or index term. This may (or may not) imply a “controlled vocabulary.”
Keyword — A synonym for descriptor or index term.
Ontology — A view of a domain hierarchy, the similarity of relationships and their interaction among concepts. An ontology does not define the vocabulary or the way in which it is to be assigned. It illustrates the concepts and their relationships so that the user more easily understands its coverage. According to Stanford’s Tom Gruber, “In the context of knowledge sharing…the term ontology…mean(s) a specification of a conceptualization. That is, an ontology is a description (like a formal specification of a program) of the concepts and relationships that can exist for an agent or a community of agents.”
Taxonomy — Generally, the hierarchical view of a set of controlled vocabulary terms. Classically, taxonomy (from Greek taxis meaning arrangement or division and nomos meaning law) is the science of classification according to a pre-determined system, with the resulting catalog used to provide a conceptual framework for discussion, analysis or information retrieval. In Web portal design, taxonomies are often created to describe categories and subcategories of topics found on a website.
Thesaurus — A controlled vocabulary wherein concepts are represented by descriptors, formally organized so that paradigmatic relationships between the concepts are made explicit, and the descriptors are accompanied by lead-in entries. The purpose of a thesaurus is to guide both the indexer and the searcher to select the same descriptor or combination of descriptors to represent a given subject. A thesaurus usually allows both an alphabetic and a hierarchical (taxonomic) view of its contents. ISO 2788 gives us two definitions for thesaurus: (1) “The vocabulary of a controlled indexing language, formally organized so that the a priori relationships between concepts (for example as ‘broader’ and ‘narrower’) are made explicit” and (2) “A controlled set of terms selected from natural language and used to represent, in abstract form, the subjects of documents.”

Are these old words with clearly defined meanings? No. They are old words dressed in new definitions and with new applications. They mean very different things to different groups. People using the same words but with different understandings of their meanings have some very interesting conversations in which no real knowledge is transferred. Each party believes communication is taking place when, in actuality, they are discussing and understanding different things. Recalling Abbott and Costello’s Who’s on First? routine, a conversation of this type could be the basis for a great comedy routine (SIG/CON perhaps), if it weren’t so frustrating — and so important. We need a translator.

For example, consider the word index. To a librarian, an index is a compilation of references grouped by topic, available in print or online. To a computer science person (that would be IT today), it would refer to the inverted index used to do quick look-ups in a computer software program. To an online searcher, the word would refer to the index terms applied to the individual documents in a database that make it easy to retrieve by subject area. To a publisher, it means the access tool in the back of the book listed by subject and sub-subject area with a page reference to the main book text. Who is right? All of them are correct within their own communities.

Returning to the degrees of application for these systems and when to use one, we need to address each question separately.

What Systems Are There?

What are the differences among the systems for automatic classification, indexing and categorization? The primary theories behind the systems are:

  • Boolean rule base variations including keyword or matching rules
  • Probability of application statistics (Bayesian statistics)
  • Co-occurrence models
  • Natural language systems

New dissertations will bring forth new theories that may or may not fit within this grouping.

How Should They Be Applied?

Application is achieved in two steps. First, the system is trained in the specific subject or vertical area. In rule-based systems this is accomplished by (1) selecting the approved list of keywords to be used and, through matching and synonyms, building simple rules and (2) employing phraseological, grammatical, syntactic, semantic, usage, proximity, location, capitalization and other algorithms, depending on the system, to build complex rules. This means that, frequently, the rules match keywords to synonyms or to word combinations using Boolean statements in order to capture the appropriate indexing from the target text.
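To make the rule-based approach concrete, here is a minimal sketch in Python under a tiny hypothetical vocabulary: simple rules map synonyms directly to descriptors, while a complex rule applies a descriptor only when supporting context words co-occur, in the spirit of the Boolean statements described above. The terms and rules are invented for illustration and are not drawn from any particular product.

```python
import re

# Hypothetical simple rules: a matching phrase or synonym maps directly to a descriptor.
SIMPLE_RULES = {
    "heart attack": "Myocardial Infarction",
    "myocardial infarction": "Myocardial Infarction",
    "cardiac arrest": "Cardiac Arrest",
}

def complex_rule_bank(text: str) -> list[str]:
    """Complex rule: 'bank' alone is ambiguous, so assign a descriptor only when
    supporting context words co-occur (a simple Boolean condition)."""
    found = []
    lowered = text.lower()
    if "bank" in lowered:
        if re.search(r"\b(loan|deposit|interest|mortgage)\b", lowered):
            found.append("Banking")
        if re.search(r"\b(river|flood|erosion)\b", lowered):
            found.append("River Banks")
    return found

def index_document(text: str) -> set[str]:
    """Apply the simple keyword/synonym rules, then the complex rules."""
    lowered = text.lower()
    descriptors = {d for phrase, d in SIMPLE_RULES.items() if phrase in lowered}
    descriptors.update(complex_rule_bank(text))
    return descriptors

print(index_document("The bank approved the loan after the patient's heart attack."))
# -> {'Myocardial Infarction', 'Banking'} (set order may vary)
```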

In Bayesian engines the system first selects the approved list of keywords to be used for training. The system is trained using the approved keywords against a set of documents, usually about 50 to 60 documents (records, stories). This creates scenarios for word occurrence based on the words in the training documents and how often they occur in conjunction with the approved words for that item. Some systems use a combination of Boolean and Bayesian to achieve the final indexing results.
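A minimal sketch of the Bayesian idea follows, assuming two hypothetical keyword terms and toy training sets (real systems use on the order of 50 to 60 vetted documents per term). It counts how often words occur in conjunction with each approved term and then scores new text with a naive Bayes calculation; it illustrates the principle, not any vendor's engine.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training sets: real systems collect roughly 50 to 60 vetted documents per term.
TRAINING = {
    "Solar Energy": [
        "photovoltaic panels convert sunlight",
        "solar cells and photovoltaic arrays",
    ],
    "Wind Energy": [
        "wind turbines generate power",
        "turbine blades and wind farms",
    ],
}

def train(training):
    """Count how often each word occurs in conjunction with each approved keyword term."""
    word_counts, totals = defaultdict(Counter), Counter()
    for term, docs in training.items():
        for doc in docs:
            words = doc.lower().split()
            word_counts[term].update(words)
            totals[term] += len(words)
    return word_counts, totals

def most_likely_term(text, word_counts, totals):
    """Score each term with a naive Bayes estimate (add-one smoothing) and return the best."""
    vocab = {w for counts in word_counts.values() for w in counts}
    scores = {}
    for term in word_counts:
        score = 0.0
        for word in text.lower().split():
            score += math.log((word_counts[term][word] + 1) / (totals[term] + len(vocab)))
        scores[term] = score
    return max(scores, key=scores.get)

word_counts, totals = train(TRAINING)
print(most_likely_term("photovoltaic panels on the roof", word_counts, totals))  # Solar Energy
```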

Natural language systems base their application on the parts of speech and the nature of language usage. Language is used differently in different applications. Think of the word plasma. It has very different meanings in medicine and in physics, although the word has the same spelling and pronunciation, not to mention etymology. Therefore, the contextual usage is what informs the application.
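As a small illustration of how contextual usage can inform the application, the sketch below chooses between two senses of plasma by overlapping the surrounding words with a short gloss for each sense, a simplified Lesk-style approach. The glossaries are hypothetical and far smaller than anything a real natural language system would use.

```python
# Hypothetical sense glossaries; a real system would draw on far richer lexical resources.
SENSES = {
    "plasma (medicine)": {"blood", "serum", "donor", "transfusion", "protein"},
    "plasma (physics)": {"ion", "ions", "fusion", "magnetic", "ionized", "gas"},
}

def choose_sense(context: str) -> str:
    """Pick the sense whose gloss overlaps most with the words surrounding the term."""
    words = set(context.lower().split())
    return max(SENSES, key=lambda sense: len(SENSES[sense] & words))

print(choose_sense("the donor gave blood and the plasma was frozen"))    # plasma (medicine)
print(choose_sense("ionized gas in the fusion reactor forms a plasma"))  # plasma (physics)
```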

In all cases it is clear that a taxonomy or thesaurus or classification system needs to be chosen before work can begin. The resulting keyword metadata sets depend on a strong word list to start with — regardless of the name and format that may be given to that word list.

What Are the Strengths and Weaknesses?

The main weakness of these systems compared to human indexing is the frequency of what are called false drops; that is, keywords selected because they fit the computer model but do not make sense in actual use. These terms are considered noise, both in the system and in application. Systems work to reduce the level of noise.

The measure of the accuracy of a system is based on

  • Hits — exact matches to the terms a human indexer would have applied to the document
  • Misses — the keywords a human would have selected that a computerized system did not
  • Noise — keywords selected by the computer that a human would not have selected

The statistical ratios of Hits, Misses and Noise are the measure of how good the system is. The cut-off should be at least 85% Hits when measured against fully accurate (human) indexing, which means that Noise and Misses combined need to be less than 15%.
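The ratios are straightforward to compute once you have the human-assigned and machine-assigned descriptor sets for the same document. The sketch below takes one reasonable reading of the measure, expressing Hits, Misses and Noise as fractions of all distinct terms assigned by either party so that the three figures sum to 100%; the example term sets are invented.

```python
# One reasonable reading of the Hit/Miss/Noise measure: each count is expressed as a
# fraction of all terms assigned by either the human or the machine, so the ratios sum to 1.
def hit_miss_noise(human: set[str], machine: set[str]) -> dict[str, float]:
    hits = len(human & machine)    # terms both the human and the machine assigned
    misses = len(human - machine)  # terms the human chose that the machine did not
    noise = len(machine - human)   # terms the machine chose that the human would not
    total = hits + misses + noise
    return {
        "hit_rate": hits / total,
        "miss_rate": misses / total,
        "noise_rate": noise / total,
    }

human = {"Thesauri", "Indexing", "Controlled Vocabularies", "Taxonomies"}
machine = {"Thesauri", "Indexing", "Controlled Vocabularies", "Ontologies"}
print(hit_miss_noise(human, machine))
# {'hit_rate': 0.6, 'miss_rate': 0.2, 'noise_rate': 0.2} -- well below the 85% cut-off
```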

A good system will provide an accuracy rate of 60% initially from a good foundation keyword list and 85% or better with training or rule building. This means that there is still a margin of error expected and that the system needs — and improves with — human review.

Perceived economic or workflow impacts often make that level of human involvement unacceptable, which leads to attempts to provide some form of automated indexing. Reducing the need for human indexers is addressed in a couple of ways. On the one hand, suppose the keyword list is hierarchical (the taxonomy view) and goes to very deep levels in some subject areas, perhaps 13 levels. A term can be analyzed and applied only at the final level, so its use is precise and confined to a narrow application.

On the other hand, it may also be “rolled up” to ever-broader terms until only the first three levels of the hierarchy are used. This second approach is preferred in the web-click environment, where popular thinking (and some mouse-behavior research) indicates that users get bored at three clicks and will not go deeper into the hierarchy anyway.
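The roll-up approach amounts to walking a term up its broader-term chain until it sits within the first few levels of the hierarchy. Here is a minimal sketch, assuming a hypothetical child-to-parent map standing in for the taxonomy.

```python
# Hypothetical taxonomy fragment expressed as a child -> parent map.
PARENT = {
    "Thermonuclear Fusion": "Plasma Physics",
    "Plasma Physics": "Physics",
    "Physics": "Physical Sciences",
    "Physical Sciences": None,  # top of the hierarchy
}

def path_to_root(term: str) -> list[str]:
    """Return the chain from the broadest ancestor down to the given term."""
    chain = []
    while term is not None:
        chain.append(term)
        term = PARENT[term]
    return list(reversed(chain))

def roll_up(term: str, max_depth: int = 3) -> str:
    """Replace a deep term with its ancestor at max_depth (1 = broadest level)."""
    chain = path_to_root(term)
    return chain[min(max_depth, len(chain)) - 1]

print(roll_up("Thermonuclear Fusion"))  # Plasma Physics (the level-3 ancestor)
print(roll_up("Physics"))               # Physics (already within three levels)
```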

These two options make it possible to use any of the three types of systems for very quick and fully automatic bucketing or filtering of target data for general placement on the website or on an intranet. Achieving deeper indexing and precise application of keywords still requires human intervention, at least by review, in all systems. The decision then becomes how precisely and deeply you will develop the indexing for the system application and the user group you have in mind.

How Do We Know If They Really Work?

You can talk with people who have tried to implement these systems, but you might find that (1) many are understandably reluctant to admit failure of their chosen system and (2) many are cautiously quiet around issues of liability, because of internal politics or for other reasons. You can review articles, white papers and analyst reports, but keep in mind that these may be biased toward the person or company who paid for the work. A better method is to contact users on the vendor’s customer list and speak to them without the vendor present. Another excellent method is to visit a couple of working implementations so that you can see them in action and ask questions about the system’s pluses and minuses.

The best method of all is to arrange for a paid pilot. In this situation you pay to have a small section of your taxonomy and text processed through the system. This permits you to analyze the quality and quantity of real output against real and representative input.

How Expensive Will They Be to Implement?

We have looked at three types of systems. Each starts with a controlled vocabulary, which could be a taxonomy or thesaurus, with or without accompanying authority files. Obviously you must already have, or be ready to acquire or build, one of these lists to start the process. You cannot measure the output if you don’t have a measure of quality. That measure should be the application of the selected keywords to the target text.

Once you have chosen the vocabulary, the road divides. In a rule-based, or keyword, system the simple rules are built automatically from the list as match and synonym rules, that is, "See XYZ, Use XYZ." The complex rules are partially programmatic and partially written by human editors/indexers. The building process averages 4 to 10 complex rules per hour. The process of deciding which rules should be built is based on running the simple rule base against the target text. If that text is a vetted set of records — already indexed and reviewed to assure good indexing — statistics can be calculated automatically. With the Hit, Miss and Noise statistics in hand, the rule builders use the statistics as a continual learning tool for further building and refinement of the complex rule base. Generally 10 to 20% of terms need a complex rule. If the taxonomy has 1,000 keyword terms, the simple rules are made programmatically and the complex rules — 100 to 200 of them — would be built in 10 to 50 hours. The result is a rule base (also called a knowledge extractor or concept extractor) to run against target text.
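Under the workflow just described, the decision about which terms need a complex rule can itself be driven by the statistics. The sketch below, using hypothetical data structures, tallies per-term Hits, Misses and Noise from a vetted record set and flags any term whose hit rate falls below the 85% cut-off as a candidate for a complex rule.

```python
from collections import Counter

def terms_needing_complex_rules(vetted, machine, threshold=0.85):
    """vetted and machine map record IDs to the descriptor sets assigned by the human
    indexers and by the simple rule base, respectively (both hypothetical inputs)."""
    hits, misses, noise = Counter(), Counter(), Counter()
    for rec_id, human_terms in vetted.items():
        auto_terms = machine.get(rec_id, set())
        for t in human_terms & auto_terms:
            hits[t] += 1
        for t in human_terms - auto_terms:
            misses[t] += 1
        for t in auto_terms - human_terms:
            noise[t] += 1
    flagged = []
    for term in set(hits) | set(misses) | set(noise):
        total = hits[term] + misses[term] + noise[term]
        if hits[term] / total < threshold:
            flagged.append(term)
    return sorted(flagged)

vetted = {"rec1": {"Thesauri", "Indexing"}, "rec2": {"Taxonomies"}}
machine = {"rec1": {"Thesauri"}, "rec2": {"Taxonomies", "Ontologies"}}
print(terms_needing_complex_rules(vetted, machine))
# ['Indexing', 'Ontologies'] -- candidates for complex rules or synonym clean-up
```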

Bayesian, inference and co-occurrence categorization systems depend on the gathering of training-set documents. These are documents, collected for each node (keyword term) in the taxonomy, that are representative of that term. The usual number of documents to collect for training is 50; some systems require more, some less. Collecting the documents for training may take an hour or more per term: to gather them, to review them as actually representing the term and to convert them to the input format of the categorization system. Once all the training sets are collected, a large processing run is performed to find the logical connections between terms within a document and within a set of documents. This returns a probability that a set of terms is relevant to a particular keyword term. The term is then assigned to other similar documents based on the statistical likelihood that it is the correct one (according to the system's findings on the training set). The result is a probability engine ready to run against a new set of target text.

A natural language system is trained on parts of speech and the nature of term usage, building a domain for the specific area of knowledge to be covered. Generally, each term is analyzed via seven methods:

  • Morphological (term form — number, tense, etc.)
  • Lexical analysis (part of speech tagging)
  • Syntactic (noun phrase identification, proper name boundaries)
  • Numerical conceptual boundaries
  • Phraseological (discourse analysis, text structure identification)
  • Semantic analysis (proper name concept categorization, numeric concept categorization, semantic relation extraction)
  • Pragmatic (common sense reasoning for the usage of the term, such as cause and effect relationships, e.g., nurse and nursing)

This is quite a lot of work, and it may take up to four hours to define a single term fully with all its aspects. Here again some programmatic options exist, as well as base semantic nets, which are available either as part of the system or from other sources. WordNet is a large lexical database heavily used by this community for the creation of natural language systems. And, for a domain containing 3,000,000 rules of thumb and 300,000 concepts (based on a calculus of common sense), visit the CYC Knowledge Base. These will supply a domain ready to run against your target text. For standards evolving in this area, take a look at the Rosetta site on the Internet.
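For readers who want to experiment with the kind of lexical and semantic lookups these systems rely on, WordNet is accessible from Python through NLTK. The short sketch below tokenizes and part-of-speech tags a sentence, then lists the senses and broader (hypernym) concepts WordNet records for plasma; it assumes NLTK and its data packages are installed.

```python
import nltk
from nltk.corpus import wordnet as wn

# First run: call nltk.download() to fetch the tokenizer, POS tagger and WordNet data packages.

# Lexical/syntactic analysis: tokenize the sentence and tag parts of speech.
tokens = nltk.word_tokenize("The plasma was ionized inside the reactor.")
print(nltk.pos_tag(tokens))

# Semantic lookup: the senses of "plasma" and their broader (hypernym) concepts.
for synset in wn.synsets("plasma"):
    print(synset.name(), "-", synset.definition())
    print("  broader:", [h.name() for h in synset.hypernyms()])
```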

Summary

There are real and reasonable differences in deciding how a literal world of data, knowledge or content should be organized. In simple terms, it’s about how to shorten the distance between questions from humans and answers from systems. Purveyors of various systems maneuver to occupy or invent the standards high ground and to capture the attention of the marketplace, often bringing ambiguity to the discussion of process and confusion to the debate over performance. The processes are complex and performance claims require scrutiny against an equal standard. Part of the grand mission of rendering order out of chaos is to bring clarity and precision to the language of our deliberations. Failure to keep up is failure to engage, and such failure is not an option.

We have investigated three major methodologies used in the automatic and semi-automatic classification of text. In practice, many systems use a mixture of the methods to achieve the desired result. Most systems require a taxonomy in order to start, and most tag text with keyword terms from the taxonomy, storing the result as metadata in the keyword element or in other elements.

All rights reserved. Copyright © 2006 Access Innovations, Inc.


Iris L. Hanney Announces Launch of Unlimited Priorities Corporation

Cape Coral, FL (July 2006) – Iris L. Hanney, well-known executive in the information industry, has announced the formation of a new company, Unlimited Priorities. The organization will focus primarily on those evolving small and mid-size firms in need of senior managerial support and direction at all levels.

Ms. Hanney was employed most recently at Techbooks in Falls Church, VA where she served as President, Information Publishing Group, and Senior Vice-President, Sales. In these roles she took responsibility for both domestic and international aspects of the group’s operation including project management, production, invoicing and collections, and sales and marketing. During her tenure the Information Publishing Group grew from a small operation into a highly respected major player in the content transformation world.

In announcing the creation of Unlimited Priorities Ms. Hanney stated: “I truly have enjoyed my time with Techbooks. Now after 30 years of building companies for others I want to try building one for myself. It is in order to make full use of my executive skills and industry experience that I have established Unlimited Priorities. We look forward to working with the executive teams at firms that are in a position to grow but require management assistance in areas they currently are unable to support fully.”

Ms. Hanney previously was General Manager, U.S. Operations, for Pacific Data Conversion Corp. (PDCC). She joined PDCC during its founding in 1993, remained through its acquisition by SPI Technologies in 1998, then moved to TechBooks in 2002. During an extensive career in information publishing Ms. Hanney has worked for a number of industry leaders including The H.W. Wilson Company, The R. R. Bowker Company, BRS Information Services and Saztec International.

About Unlimited Priorities

Unlimited Priorities is attuned to the management requirements of companies in the information industry. We provide executive level support services by utilizing a highly skilled group of professionals with abundant experience in sales, marketing, finance, operations, production, IT and training. Unlimited Priorities believes that sales and marketing effectiveness combined with operational excellence are the keys to successful market penetration and growth. By coordinating our talents with your team’s knowledge and experience, we can help you build your business and attain your goals in an efficient and cost effective manner.

Media Contact:
Iris L. Hanney
239-549-2384
Unlimited Priorities
iris.hanney@unlimitedpriorities.com
www.unlimitedpriorities.com


The Kano Model: Critical to Quality Characteristics and VOC

Origin of the Kano Model

Dr. Noriaki Kano, a very astute student of Dr. Ishikawa, developed an interesting model to address the various ways in which Six Sigma practitioners could prioritize customer needs. This becomes particularly important when trying to rank the customer’s wants and desires in a logical fashion.

The Practical Side to the Kano Model

The Kano model is a tool that can be used to prioritize the Critical to Quality characteristics, as defined by the Voice of the Customer, which I will explain in greater detail below. The three categories identified by the Kano model are:

  • Must Be: The quality characteristics that must be present or the customer will go elsewhere.
  • Performance: The better we are at meeting these needs, the happier the customer is.
  • Delighter: Those qualities that the customer was not expecting but received as a bonus.

The First Step for Creating the Kano Model: Identifying the Voice of the Customer

The first step for creating the Kano model is to identify the quality characteristics that are typically fuzzy, vague and nebulous. These quality characteristics are referred to as the Voice of the Customer (VOC). Once the Voice of the Customer is understood, we can attempt to translate it into quantitative terms known as critical to quality (CTQ) characteristics. This should not be a new concept for those familiar with the Six Sigma methodology. What happens from here, though, can sometimes go astray if we are not careful and try to put our own spin on the needs of the customer. This may be the result of trying to make things more easily obtainable for us—a formula for failure.

Use the Kano Model to Prioritize the Critical to Quality Characteristics

So, now that we have identified what is important to the customer in workable terms, we can go to the second step. Always keeping the customer in mind, we can apply the concepts outlined in the Kano model diagram.

A Few Words About Kano

The Kano model is drawn on an (x, y) graph: the x-axis represents how good we are at achieving the customer's outcome(s), or CTQs, and the y-axis records the customer's level of satisfaction as a result of our level of achievement.

The red line on the Kano model represents the Must Bes. That is, whatever the quality characteristic is, it must be present; if it is not met, the customer will go elsewhere. The customer does not care if the product is wrapped in 24-carat gold, only that it is present and functionally does what it was designed to do. An example would be a client who checks into a hotel room expecting to find a bed, curtains and a bathroom. These items are not called out by the customer, but the absence of any of these "characteristics" would definitely cause the customer to go elsewhere.

The blue line on the Kano model represents the Performance. This line reflects the Voice of the Customer: the better we are at meeting these needs, the happier the customer is. It is here that the trade-offs take place. Someone wanting good gas mileage would not likely expect a vehicle with great acceleration from a standing start.

By far, the most interesting evaluation point of the Kano model is the Delighter (the green line). This represents those qualities that the customer was not expecting, but received as a bonus. A few years ago, it was customary that when a car was taken back to the dealer for a warranty oil change, the vehicle was returned to the owner with its body washed, mirrors polished, carpets vacuumed, etc. After a few trips to the dealer, this Delighter became a Must Be characteristic. Thus, a characteristic that once was exciting was now a basic need, and a part of the customer’s expectations. Another example of this is the amenities platter that some hotels provide their platinum customers upon checking in. I am one of those clients entitled to such a treat. This practice was certainly a delight. It has, however, become an expected part of my check-in, such that if there is no platter waiting in my room, I’m on the phone with the front desk.

Once the critical to quality characteristics have been prioritized, the last step of the Kano model is an assessment of just how well we can satisfy each of Dr. Noriaki Kano's classifications.

Kano Model Case Study

Being a trainer and consultant, I spend a lot of time on the road. In doing so, I have a tendency to check into hotels on a regular basis, as mentioned earlier. I once asked the manager of a hotel where I spend a lot of time how he had established practices to entice the business client. He related the following scenario to me.

The first thing he did was identify a list of qualities the client would be interested in. He came upon his list by listening to complaints, handing out surveys, holding focus groups and conducting interviews. The information below is a partial list from the Voice of the Customer. Knowing that I was involved in something that dealt with customer satisfaction, he asked me to assist him in ranking the characteristics. I explained the concepts behind the Kano model, and together we developed the list in the column labeled Business Client, as shown in Table 1. This was all fine and dandy, as far as the business customer was concerned.

Table 1

For my own interest, I asked him to look at these same characteristics from the point of view of a vacationing family. As a final task, I asked him to assess how strong or weak he felt the hotel was when trying to meet those quality characteristics identified in Table 1.

The results are shown in Table 2.

Table 2

The conclusions from this effort can be summarized by looking at the rows that have a characteristic in the Must Be category. With respect to the business client, this yielded express checkout, a comfortable bed, continental breakfast, internet hook-up and a newspaper. The vacationer, on the other hand, had Must Bes that included price, a comfortable bed, cable/HBO and a swimming pool.

Of these quality characteristics, the manager realized that the hotel was weak in the check-in and express checkout process, and internet hook-up. This Kano model exercise allowed the manager to better address the needs of the customer, based on their Critical to Quality characteristics. Now the work begins to minimize the gap of where the hotel is with respect to where the hotel wants to be.

One final thought: If a characteristic isn’t on the list, does that mean it can be ignored?

©2006 E. George Woodley. All rights reserved.
Published with permission from the author.
