
Saving Time and Money with a Rule Based Approach to Automatic Indexing

Written by Marjorie M.K. Hlava
President, Access Innovations, Inc.

Getting a Higher Return on Your Investment

There are two major types of automatic categorization systems. They go by many different names, but the information science theory behind them boils down to two major schools of thought: rule based and statistics based.

Companies advocating the statistical approach hold that editorially maintained rule bases require a large up-front investment and carry higher costs overall, and that statistics based systems are more accurate. On the other hand, statistics based systems require a training set up front and are not designed to allow editorial refinement for greater accuracy.

A case study may be the best way to see what the true story is. We did such a study, to answer the following questions:

  • What are the real up-front costs of rule based and training set based systems?
  • Which approach takes more up-front investment?
  • Which is faster to implement?
  • Which has a higher accuracy level?
  • What is the true cost of the system with the additional cost of creating the rule base or collecting the training set?

To answer these questions, we’ll look at how each system works and then the costs of actual implementation in a real project for side-by-side comparison.

First a couple of assumptions and guidelines:

  1. There is an existing thesaurus or controlled vocabulary of about 6000 terms. If not, then the cost of thesaurus creation needs to be added.
  2. Hourly rates and units per hour are based on field experience and industry rules of thumb.
  3. 85% accuracy is the baseline needed for implementation to save personnel time.

The Rule Based Approach

A simple rule base (a set of rules matching vocabulary terms and their synonyms) is created automatically for each term in the controlled vocabulary (thesaurus, taxonomy, etc.). With an existing, well-formed thesaurus or authority file, this is a two-hour process. Rules for both synonyms and preferred terms are generated automatically.

Complex rules are needed for an average of about 10% to 20% of the terms in the vocabulary, and they are created at a rate of 4 to 6 per hour. So for a 6000 term thesaurus, creating 600 complex rules at 6 per hour takes about 100 hours, or 2.5 person-weeks. Some people begin indexing with the software immediately to get some baseline statistics and then do the rule building. Accuracy (as compared with the indexing terms a skilled indexer would select) is usually 60% with just the simple rule base and 85% to 92% with the complex rule base.
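As a quick illustration of that arithmetic, the sketch below computes the complex-rule workload from the assumptions stated above (6000 terms, roughly 10% needing complex rules, 6 rules per hour, 40-hour weeks). The function and variable names are ours, for illustration only, and do not come from any vendor tool.

# Back-of-the-envelope estimate of complex-rule-building effort,
# using the assumptions stated in the article (illustrative only).

def rule_building_effort(vocab_terms, complex_fraction=0.10,
                         rules_per_hour=6.0, hours_per_week=40.0):
    """Return the number of complex rules, hours, and work weeks required."""
    complex_rules = vocab_terms * complex_fraction
    hours = complex_rules / rules_per_hour
    return {"complex_rules": complex_rules,
            "hours": hours,
            "work_weeks": hours / hours_per_week}

# A 6000-term thesaurus at 10% complex rules and 6 rules/hour works out to
# 600 rules, 100 hours, and 2.5 work weeks, matching the figures in the text.
print(rule_building_effort(6000))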

The rule based approach places no limit on the number of users, the number of terms used in the taxonomy created, or the number of taxonomies held on a server.

The software is completed and shipped via CD-ROM the day the purchase is made. For the client’s convenience, if the data is available in one of three standard formats (tab- or comma-delimited, XML, or left-tagged ASCII), it can be preloaded into the system. Otherwise a short conversion script is applied.
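For readers wondering what such a conversion script might involve, here is a minimal sketch that turns a comma-delimited term list into a left-tagged ASCII layout. The tag labels (TERM, BT, UF) and the input column names are hypothetical; the article does not specify the exact load format, so this only illustrates the general shape of the task.

# Minimal sketch: convert a comma-delimited vocabulary export into a
# left-tagged ASCII layout. Tag labels and input column names are
# hypothetical examples, not a documented load format.
import csv

def csv_to_left_tagged(in_path, out_path):
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            dst.write(f"TERM {row['term']}\n")
            if row.get("broader_term"):
                dst.write(f"BT   {row['broader_term']}\n")
            for synonym in filter(None, row.get("synonyms", "").split(";")):
                dst.write(f"UF   {synonym.strip()}\n")
            dst.write("\n")  # a blank line separates term records

# csv_to_left_tagged("thesaurus.csv", "thesaurus.txt")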

On the average, customers are up and running one month after the contract is complete.

The client for whom we prepared the rule base reported 92% accuracy in totally automated indexing and a four-fold increase in productivity.

The up-front time and dollar investment, based on the workflow for full implementation, is as follows:

Table 1

The Statistical Approach – Training Set Solution

To analyze the statistical approach, which requires use of a training set up front, we used the same pre-existing 6000 word thesaurus.

The cost of the software usually runs from about $75,000 to $250,000. (Though costs can run much higher, we will use this lower range. Some systems also limit the number of terms allowed in a taxonomy, requiring an extra license or secondary file building.) Training and support are an additional expense of about $2000 per day; usually one week of training is required ($10,000), and travel expenses may be added.

The up-front time and dollar investment, based on the workflow for implementation for the statistical (Bayesian, co-occurrence, etc.) systems, is as follows:

Table 2

A two-fold productivity increase was noted using the system. Accuracy has not gone above 72% at present.

Summary

The table that follows compares the return on investment for the rule based system and the statistics based system in terms of total cost and time to implementation.

Table 3

It is apparent that considerable savings in both time and money can be gained by using a rule based system instead of a statistics based system — by a factor of almost seven, based on the assumptions outlined above.

About Access Innovations

Access Innovations, Inc. is a software and services company founded in 1978. It operates under the stewardship of the firm’s principals, Marjorie M.K. Hlava, President and Jay Ven Eman, CEO.

Closely held and financed by organic growth and retained earnings, the company has three main components: a robust services division, the Data Harmony software line, and the National Information Center for Educational Media (NICEM).


Explaining the Total Degrees of Freedom for Six Sigma Practitioners

George Woodley

We, as statisticians, Six Sigma Belts, and quality practitioners, have used the term degrees of freedom as part of our hypothesis testing, such as the t-test for comparison of two means and ANOVA (analysis of variance), as well as in confidence intervals, to mention a few applications. I can recall from the many classes I have taught, from Six Sigma Green Belts to Six Sigma Master Black Belts, that students have had a bit of a problem grasping the whole idea of the degrees of freedom, especially when we describe the concept of the standard deviation as “…the average distance of the data from the MEAN…”1 By now, Six Sigma practitioners should have a comfort level with concepts like the MEAN, which is calculated by taking the sum of all the observations and dividing by the number of observations (n). The total degrees of freedom are then represented as (n-1).

Defining Degrees of Freedom

One method for describing the degrees of freedom, as per William Gosset, has been stated as, “The general idea: given that n pieces of data x1, x2, … xn, you use one ‘degree of freedom’ when you compute μ, leaving n-1 independent pieces of information.”2

This was reflected in the approach summarized by one of my former professors. He stated that the degrees of freedom represented the total number of observations minus the number of population parameters that are being estimated by a specific statistical test. Since we assume populations are infinite and cannot be easily accessed to generate parameters, we rely on samples to generate statistical inferences that provide estimates of the original population, provided the sampling techniques are both random and representative; another discussion for later.

This may seem very elementary, but from my own experiences, degrees of freedom have not been the easiest of concepts to comprehend—especially for the novice Six Sigma belt. A definition that can also be representative of the concept of the degrees of freedom can be summarized as “equal to the number of independent pieces of information concerning the variance.”3 For a random sample from a population, the number of degrees of freedom is equal to the sample size minus 1.

A Degrees of Freedom Numerical Example

A numerical example may help illustrate this. The example is for illustration only; in practice the values would be the actual observations of a given data set. Suppose we have eight data points that sum to 60. We can randomly and independently assign values to seven of them; for instance, we may record them as 8, 9, 7, 6, 7, 10, and 7. Those seven values are free to be any numbers, but the eighth value is then fixed by the requirement that the total be 60 (in this case it must be 6). Hence, the degrees of freedom are (n-1), or (8-1) = 7. Seven numbers can take on any value, but only one value of the eighth number will make the equation (the sum of the values, in this case) hold true.
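A few lines of code make the same point: once the total is fixed, only n-1 of the values are free to vary. This is simply a restatement of the example above, using the same numbers.

# The eight values must sum to 60; seven are chosen freely, the eighth is forced.
free_values = [8, 9, 7, 6, 7, 10, 7]    # the seven freely chosen values
total = 60

forced_value = total - sum(free_values)  # the eighth value has no freedom left
degrees_of_freedom = len(free_values)    # n - 1 = 8 - 1 = 7

print(forced_value)        # 6
print(degrees_of_freedom)  # 7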

One may argue that although this seems to be a simplistic illustration, the data collected for the original seven readings are not really independent, in that they are representative of an existing process and depend on what the observations are in reality. Furthermore, we would have to know from the beginning what the final value was—in this case 60. So even though this illustration attempts to explain the theory behind the degrees of freedom, it can be more confusing than obvious.

My “Easy” Way of Describing Degrees of Freedom

What really inspired me to write this article about the impact of the degrees of freedom was a conversation I had with my wife. She was heading to her class, and she called me and asked if I had an “easy” way of explaining the degrees of freedom. I gave her the description of the degrees of freedom that I use in my classes:

Since statistics deals with making decisions in the world of uncertainty, we, as statisticians, need to provide ourselves with a cushion or padding to deal with this uncertainty. The greater the sample size, the more confident we can be in our decisions. For example, when we estimate the variance of a normal distribution, we divide the sum of the squared deviations by (n-1). Hence, if we have a sample size of 5, we are dividing by 4 rather than 5, a cushion of 20 percent in the divisor; the effect is to overstate the variance by about 25 percent. If, however, our sample size is 100, we divide by 99, overstating the variance by only about 1 percent.
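The effect of that cushion is easy to see numerically. The sketch below compares dividing the sum of squared deviations by n and by n-1 for a small sample; the data values are made up purely for illustration.

# Compare the variance computed with divisor n versus divisor n-1
# for a small, made-up sample of five observations.
data = [4.0, 7.0, 6.0, 5.0, 8.0]
n = len(data)
mean = sum(data) / n
ss = sum((x - mean) ** 2 for x in data)  # sum of squared deviations

var_n = ss / n                # divide by n (no cushion)
var_n_minus_1 = ss / (n - 1)  # divide by n - 1 (the cushion)

print(var_n, var_n_minus_1)   # 2.0 2.5
print(var_n_minus_1 / var_n)  # 1.25 for n = 5: the estimate is 25% larger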

This explanation places the emphasis on a common statistical concept that the larger the sample size, the more confident we can be of our estimates and decisions. To summarize this idea in a slightly different way—as long as our sampling technique is random and representative, the likelihood that we have a good estimator of a parameter increases as the sample sizes increase.

I have attempted to address the various approaches to the degrees of freedom, and I hope my simple explanation of the rationale behind what we are trying to accomplish can shed some light on future explanations of such a vital part of statistical analysis.

References

1 Gonick, L. and Smith, W. (1993), The Cartoon Guide to Statistics, Harper Collins Publishers, pg. 22
2 Breyfogle, Forrest W. III (2003), Implementing Six Sigma, John Wiley & Sons, pg. 1105
3 Upton, Graham and Cook, Ian (2002), Dictionary of Statistics, Oxford University Press, pg. 100

©2009 E. George Woodley. All rights reserved.
Published with permission from the author.


Improving Enterprise Search Using Auto-Categorization: Making the Business Case to Senior Executives

By Marjorie M.K. Hlava and Jay Ven Eman
of Access Innovations, Inc

This white paper addresses the significance of using a business case approach to improve corporate search with auto-categorization and taxonomy. These solutions are understood by corporate librarians and knowledge management leaders, but their value is often poorly comprehended by the executives responsible for the budget and approval process.

This paper differentiates between presenting a solution to the business merely as a technical resource and presenting it with a well thought-out business case when attempting to procure enterprise or departmental funding. Search is on the radar of senior management due to the appearance of Google and other search systems. Knowledge workers have proliferated, and efficiencies in information throughput are in strong demand: workers spend more than 25% of their time searching for information (IDC Research, 2008). The average corporation has four search systems, with none of them delivering real productivity gains to the work force. This issue has emerged as a significant concern in driving higher business productivity and profits.

This paper outlines how the development of a cohesive taxonomy strategy, well aligned with corporate business needs, becomes a strategic investment supporting staff productivity and overall knowledge worker output quality. It is a tactical purchase to strengthen the company’s competitive edge.

There is now a 92% accuracy rating on accounting and regulatory document search based on hit, miss and noise or relevance, precision and recall statistics [using] Access Innovations. –USGAO

Obstacles in optimizing search

The problem with search is that it usually depends on statistics and immense data processing and storage to process answers, without paying attention to the language of the user. Corporate intranets, pharmaceutical firms, large database publishers, and magazine and content publishers suffer without well-formed information to clearly indicate conceptual links, provide replicable results, and support intuitive semantic search. This directly impacts the knowledge worker’s patience and productivity, with many spending one fourth of their time looking for information rather than using it in creative and strategic ways. Individual lost time multiplied by tens to hundreds in a large corporation significantly undermines the bottom line. By not readily allowing the user his or her own terminology, the system creates small hurdles which, multiplied by many failed searches, become large barriers. The result is a loss of efficiency and flexibility across the entire enterprise.

Agile enterprises must provide a mechanism for the user to automatically translate their terms, dialect, or language into well-formed, standard terms. This provides for consistent, deep searching, the most effective means to obtain information with comprehensive recall and accuracy. It prevents trial-and-error searching that wastes workers’ time. Factor in the direct and burden costs of each knowledge worker; the cost savings rapidly become significant.

Research has shown that most classification systems touted as automatic actually require rules to reach productive levels for production or search. The rules differentiate among meanings of words to correctly interpret a document. To create and maintain these rules, one needs to build a rich semantic layer and then place a rule-based application over the classification function. Traditional search does not provide this functionality. To facilitate information capture and retrieval that runs at 6, 8, even 10 times greater productivity, a good taxonomy must provide the search backbone.

IT departments, charged with safeguarding valuable corporate information, require a simple and safe way for users to manage the categorization tools, to avert increasing IT costs and burden. The current move to Web 2.0 empowers users and lessens the load on IT departments. Collaborative taxonomy management supports Web 2.0 initiatives.

We have moved from fielded Boolean search to a faceted search GUI, but the fundamentals of search still hold. The 1960s gave us the ARPANET and RECON systems, which gave rise to the Internet and present search technologies. Metadata elements rose from fielded data. The missing piece in today’s search is the taxonomy application. The market challenge is to produce solutions that enhance search through taxonomy and automatic categorization.

IEEE had their system up and running in three days, in full production in less than two weeks. –Institute of Electrical and Electronics Engineers

The American Economic Association said its editors think using it is fun and makes time fly! –American Economic Association (AEA)

The business of auto-categorization and taxonomies

Well-formed data, with clear indication of conceptual semantic links, provides replicable results and intuitive, semantic search. Users search with their own words, removing obstacles to search success and increasing productivity. The system translates non-standard word choices to consistent taxonomy terms, resulting in consistent, deep searching and, ultimately, greater knowledge access and use.

To produce the highest level of productivity at the most cost-effective TCO (total cost of ownership), a system must provide both semantic interpretation and governing rules linked to a taxonomy. This ensures fast, accurate search regardless of the skill or number of users.

Good corporate compliance systems need to ensure conformity with accepted taxonomy standards. These include ANSI/NISO Z39.19 and those from ISO, the W3C, the British Standards Institution, and other standards-setting organizations.

To minimize costs, the categorization system should work both at the content creation, content management, and digital repository end of the information management process and at the search end, to provide seamless performance.

Dangers in the industry that inhibit seamless performance include out-of-date data schemas in which critical data is stored in extinct formats and media. Strategic planning for search must consider migration of this data as technical platforms evolve. Most enterprises handle terabytes of data with an average lifespan of 3 years. With often inadequate and over-capacity contingency plans (all of which further exacerbate search inefficiencies), these huge information stores must be configured to ensure that the data is platform-independent and accommodates new technologies.

Value drivers for your project

Business issues and value drivers supporting projected returns are shown here.

The need for a supportive business case

A business case is vital in helping executives rationalize decisions, especially ones of a technical nature. It facilitates their ability to analyze the technology’s impact compared with other corporate opportunities, particularly with limited budgets.

Having financial metrics along with technical recommendations fuels the ability to communicate expected upstream value. Several industry-leading vendors are extending themselves by drawing up contracts where payment is conditioned on proving delivered value. Accenture, Trilogy, and IBM have established value-based selling as a best practice; soon, it will be an industry standard.

Research shows that, of over 400 software vendors, close to 75% fail to prove their solution’s tangible value. These vendors sell solutions that challenge the client to build business value. But that business value must be clearly described in the business case.

Building a supportive business case also needs to address technical issues such as enabling semantic search, interlinking data, and using rules.

Many firms use a “discovery” process, where technical and business parties join forces in discovering value in a proposed solution. This collaborative process demonstrates how departmental needs are aligned with business value and IT impact and strengthens your business case.

The following elements are key in assembling a software or services business case:

  1. Value proposition – summarizes the position
  2. Executive summary – brief and bottom line
  3. Risk, impact, and strategic benefit
  4. ROI validation – clear and concise is best
  5. Competitive TCO – for competing vendors
  6. IT impact and support – to build bridges

ProQuest CSA has achieved a 7-fold increase in productivity. –ProQuest CSA

Weather Channel finds things 50% faster using Data Harmony. A significant saving in time. –The Weather Channel

Supporting the Metrics

The baseline for automated or assisted metatagging integrated into your workflow should be 85% accuracy, with no more than 15-20% irrelevant returns (noise). When this level is reached, you can potentially see seven-fold increases in productivity and cut search time in half. Achieving these levels lent notable credibility to CSA’s implementation.
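As a rough illustration of how these figures translate into savings, the sketch below combines the estimate cited earlier that workers spend 25% of their time searching with the claim that search time can be cut in half. The head count and loaded salary are placeholders to be replaced with your own numbers; they are not figures from this paper.

# Back-of-the-envelope savings from halving search time, assuming knowledge
# workers spend 25% of their time searching. Head count and loaded salary
# are illustrative placeholders, not figures from the paper.

def annual_search_savings(workers, loaded_salary,
                          search_share=0.25, reduction=0.50):
    """Dollar value of search time recovered per year."""
    return workers * loaded_salary * search_share * reduction

# e.g. 200 knowledge workers at a $100,000 loaded cost each
print(annual_search_savings(200, 100_000))  # 2500000.0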

Though the benefits of an ROI measure depend on size of audience, audience level, complexity of content, and complexity of search, there are reliable data points that can be used. This table serves as a guideline when building cost-justification efforts to buy auto-classification and taxonomy solutions.

A guideline when building cost-justification efforts.

The Value Produced

Building your case will be invaluable when presenting it to management or a budgeting committee. It helps your department be viewed as in step with management and as supporting corporate strategic goals. To the owner of the case, the benefits are clear:

  • Projects are better received.
  • Projects are well justified.
  • Projects are viewed as more than just “tools”.
  • Projects receive better funding.

Summary

This paper seeks to illuminate the importance of a well thought-out business case. Whether using outside vendors or an internal committee, following the steps to build each aspect of a persuasive business case for a solution’s implementation is ultimately the most successful way to identify your needs and promote your project.

About Access Innovations

Access Innovations, Inc. is a software and services company founded in 1978. It operates under the stewardship of the firm’s principals, Marjorie M.K. Hlava, President and Jay Ven Eman, CEO.

Closely held and financed by organic growth and retained earnings, the company has three main components: a robust services division, the Data Harmony software line, and the National Information Center for Educational Media (NICEM).


Why Does It Seem Like 1989?

John Ganly

John Ganly

Déjà vu is a forever buzzword and this year it keeps buzzing for me. The public library is a cross section of the community, and better than most pundits the librarians can spot trends before they happen. The sad trend that I began to spot in January is the increase in the number of men and women in business dress sitting in the reading room in mid-afternoon. It was 1989 when I last saw this phenomenon.

Recession murmurs were in the air in 1989, and the murmurs became a roar which lasted until 1993. By 1990 the number of business persons investigating job opportunities, career changes or doing pick-up consulting work had grown to the point where numbers had to be assigned for seating. We are not at that point yet, but getting there at SIBL.

Public libraries assume their role as havens in tough times. On September 12, 2001, when I arrived at work at 8:00 AM, there were 300 or so folks lined up outside the building. Grateful for a safe place with an e-mail connection to let their folks know they were OK, the crowd kept repeating “Thank God for the library!”

The growing legions of the laid-off are finding comfort and help here at SIBL. In anticipation of the needs of mid-career job seekers, I have added the Vault Career Library and Universum’s New Career databases. Resume guides and cover letter models are represented in both the circulating collection and the research resources. E-books describing career change options have been acquired. The laptop docking area has been expanded, and new computers have been put in place in the Electronic Information Center to serve the folks doing pick-up consulting.

Recession is a sad time, and it is difficult to think of good coming from it. However, during the 1989-1993 trouble many library users became aware of CD-ROM technology for the first time, and now an awareness of new technologies also is being noticed. When we hear the mantra that libraries are no longer relevant we can be happy that we are here when needed.

About the Author

John Ganly is Assistant Director for Collections at The New York Public Library’s Science, Industry and Business Library.


What Can You Do With XML Today?

Jay Ven Eman, Ph.D.

Interest in Extensible Markup Language (XML) rivals the press coverage the World Wide Web received at the turn of the millennium. Is it justified? Rather than answer directly, let us take a brief survey of what XML is and why it was developed, and highlight some current XML initiatives. Whether or not the hype is justified, ASIST members will inevitably become involved in XML initiatives of their own.

An alternative question to the title is, “What can’t you do with XML?” I use it to brew my coffee in the morning. Doesn’t everyone? To prove my point, the following is a “well-formed” XML document. (“Well-formed” will be defined later in the article.)

<?xml version="1.0" standalone="yes" encoding="ISO-8859-1"?>
<StartDay Attitude="Iffy">
<Sunrise Time="6:22" AM="Yes"/>
<Coffee Prepare="Or die!" Brand="Starbuck’s" Type="Colombian"
Roast="Dark"/>
<Water Volume="24" UnitMeasure="Ounces">Add cold water.</Water>
<Beans Grind="perc" Type="Java">Grind coffee beans.
Faster, please!!</Beans>
<Grounds>Dump grounds into coffee machine.</Grounds>
<Heat Temperature="152 F">Turn on burner</Heat>
<Brew>Wait, impatiently!!</Brew>
<Dispense MugSize="24" UnitMeasure="Ounces">Pour, finally.</Dispense>
</StartDay>

This XML document instance contains sufficient information to drive the coffee making process. Given the intrinsic nature of XML, our coffee-making document instance could be used by the butler (should we be so lucky) or by a Java applet or perl script to send processing instructions to a wired or wireless coffeepot. If XML can brew coffee, it can drive commerce; it can drive R & D; it can drive the information industry; it can drive information research; it can drive the uninitiated out of business.

What is XML? To understand XML, you must understand meta data. Meta data is “data about data.” It is an abstraction, layered above the abstraction that is language. Meta data can be characterized as natural or added. To illustrate, consider the following text string, “Is MLB a sport, entertainment, or business?” You, the reader, can guess with some degree of accuracy that this is the title of an article about Major League Baseball (MLB). Presented out of context, even people are only guessing. Computers have no clue, in or out of context. There are no software systems that can reliably recognize it in a meaningful way.

For this example, it is a newspaper article title. To it we will add subject terms from a controlled vocabulary, identify the author and the date, and add an abstract. As a “well-formed” XML document instance, it is rendered:

<?xml version="1.0" standalone="yes" encoding="ISO-8859-1"?>
<DOC Date=5/21/02 Doctype="Newspaper">
<TI> "Is MLB a sport, entertainment, or business?" </TI>
<Byline> Smith </Byline>
<ST> Sports </ST>
<ST> Entertainment </ST>
<ST> Business </ST>
<AB> Text of abstract...</AB>
<Text> Start of article ...</Text>
</DOC>

In this example, what are the meta data? What is natural and what is added? Natural meta data is information that enhances our understanding of the document and parts thereof, and can be found in the source information. The date, the author’s name, and the title are “natural” meta data. They are an abstraction layer apart from the “data” and add significantly to our understanding of the “data.”

The subject terms, document type, and abstract are “added” meta data. This information also adds to our understanding, but it had to be added by an editor and/or software. The tags are “added” and are meta data. Meta data can be the element tag, the attribute tag, or the values labeled by element and attribute tags. It is the collection of meta data that allows computer software to reliably deal with the data. Meta data facilitates networking across computers, platforms, software, companies, cities, countries.

What is the “data” in this example? The text, tables, charts, figures, and graphs that are contained within the open <Text> and close </Text> tags.
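To make concrete the point that meta data is what lets software deal with the content reliably, here is a short sketch that parses the example document above and pulls out its natural and added meta data. It assumes the corrected, well-formed version of the instance shown earlier, with the XML declaration omitted for brevity.

# Parse the "MLB" example document and extract its meta data.
import xml.etree.ElementTree as ET

doc = """<DOC Date="5/21/02" Doctype="Newspaper">
<TI> "Is MLB a sport, entertainment, or business?" </TI>
<Byline> Smith </Byline>
<ST> Sports </ST>
<ST> Entertainment </ST>
<ST> Business </ST>
<AB> Text of abstract...</AB>
<Text> Start of article ...</Text>
</DOC>"""

root = ET.fromstring(doc)
subject_terms = [st.text.strip() for st in root.findall("ST")]  # added meta data
title = root.findtext("TI").strip()                             # natural meta data

print(root.get("Date"), root.get("Doctype"))  # 5/21/02 Newspaper
print(subject_terms)   # ['Sports', 'Entertainment', 'Business']
print(title)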


OWL Exports From a Full Thesaurus

Jay Ven Eman, Ph.D.

What do you make of “198”? You could assume it is a number. Computer applications can make no reliable assumption: it could be an integer or a decimal (but not octal, since 9 and 8 are not octal digits), or it could be something else entirely. Neither you nor the computer could do anything useful with it. What if we added a period, so that “198” becomes “1.98”? Maybe it represents the value of something, such as its price. If we found it embedded with additional information, we would know more. “It cost 1.98.” The reader now knows that it is a price, but software applications still are unable to figure it out. And there is much the reader still doesn’t know. “It cost ¥1.98.” “It cost £1.98.” “It cost $1.98.” There is even more information you would want. Wholesale? Retail? Discounted? Sale price? $1.98 for what?

Basic interpretation is something humans do very well, but software applications do not. Now imagine a software application trying to find the nearest gasoline station to your present location that has gas for $1.98 or less. Per gallon? Per liter? Diesel or regular? Using your location from your car’s GPS and a wireless Internet connection, such a request is theoretically possible, but it is beyond the most sophisticated software applications using Web resources. They cannot do the reasoning based upon the current state of information on the Web.

Finding Meaning

Trying to search the Web based upon conceptual search statements adds more complications. Looking for information about “lead” using just that term returns a mountain of unwanted information about leadership, your water, and conditions at the Arctic Ocean. Refining the query to indicate that you are interested in “lead based soldering compounds” helps. Software applications still cannot reason or draw inferences from keywords found in context. At present, only humans are adept at interpreting within context.

Semantic Web

The “Semantic Web” is a series of initiatives to help make more of the vast resources found via the Web available to software applications and agents, so that these programs can perform at least rudimentary analysis and processing to help you find that cheaper gasoline. The Web Ontology Language (OWL) is one such initiative and will be described herein in relation to thesauri and taxonomies.

At the heart of the Semantic Web are words and phrases that represent concepts that can be used for describing Web resources. Basic organizing principles for “concepts” exist in the present thesaurus standards (ANSI/NISO Z39.19 found at www.niso.org and ISO 2788 and ISO 5964 found at www.iso.org). They are being expanded and revised. Drafts of the revisions are available for review.

The reader is directed to the standards’ Web sites referenced above and to www.accessinn.com, www.dataharmony.com, and www.willpowerinfo.co.uk/thesprin.htm for basic information on thesaurus and taxonomy concepts. It is assumed here that the reader will have a basic understanding of what a thesaurus is, what a taxonomy is, and related concepts. Also, a basic understanding of the Web Ontology Language (OWL) is required. OWL is a W3C recommendation and is maintained at the W3C Web site. For an initial investigation of OWL, the best place to start is the Guide found at W3C.

OWL

From the OWL Guide, “OWL is intended to provide a language that can be used to describe the classes and relations between them that are inherent in Web documents and applications.” OWL formalizes a domain by defining classes and properties of those classes; defining individuals and asserting properties about them; and reasoning about these classes and individuals.

The term ontology is borrowed from philosophy, where it means the science of describing the kinds of entities in the world and how they are related.

An OWL ontology may include classes, properties, and instances. Unlike an ontology in philosophy, an OWL ontology includes instances, or members, of classes. Classes and members, or instances, can have properties, and those properties have values. A class can also be a member of another class. OWL ontologies are meant to be distributed across the Web and to be related to one another as needed. The normative OWL exchange syntax is RDF/XML (www.w3.org/RDF/).

Thesaurus

A thesaurus is not an ontology. It does not describe kinds of entities and how they are related in a way that a software agent could use. A person could draw useful inferences about the domain of medicine by studying a medical thesaurus, but software cannot. You would discover important terms in the field, how terms are related, which terms represent broader concepts and which encompass narrower concepts. An inference, or reasoning, engine would be unable to draw any inferences beyond a basic “broader term/narrower term” pairing like “nervous system/central nervous system,” unless the relationship is specifically articulated. Is it a whole/part, instance, parent/child, or other kind of relationship?

Using OWL, more information can be described about the classes represented by thesaurus terms and about the relationships between classes, subclasses, and members. In a typical thesaurus, the terms “nervous system” and “central nervous system” would have the labels BT and NT, respectively. A software agent would not be able to make use of these labels and the relationship they describe unless the agent is custom coded. The purpose of OWL is to provide descriptive information, using RDF/XML syntax, that would allow OWL parsers and inference engines, particularly those not within the control of the owners of the target thesaurus, to use the incredible intellectual value contained in a well-developed thesaurus.

The levels of abstraction should be apparent at this point. At one level there are terms. At another level, the relationships between groups of terms are described within a thesaurus structure. The thesaurus standards do not dictate how to label thesaurus relationships: a term could be USE Agriculture or Preferred Term Agriculture or PT Agriculture. Hard coding software agents with all of the possible variations of thesaurus labels is impractical.

OWL, then, is used to describe labels such as BT, NT, NPT, RT,1 etc., and to describe additional properties of classes and members, such as the type of BT/NT relationship between two terms. Additional power can be derived when two or more thesaurus OWL ontologies are mapped to one another. This would allow Web software agents to determine the meaning of subject terms (keywords) found in the meta data elements of Web pages, to determine whether other Web pages containing the same terms have the same meaning, and to make additional inferences about those Web resources.

An OWL output from a full thesaurus provides semantic meaning to the basic classes and properties of a thesaurus. Such an output becomes a true Web resource and can be used more effectively by automated processes. Another layer of OWL wrapped around subject terms from an OWL level thesaurus and the resources (such as Web pages) these subject terms are describing would be an order of magnitude more powerful, but also more complicated and difficult to implement.

OWL Thesaurus Output

An OWL thesaurus output contains two major parts. The first part articulates the basic definition of the structure of the thesaurus. It is an XML/RDF schema. As such, a software agent can use the resolving properties in the schema to locate resources that provide the necessary logic needed to use the thesaurus.

FIGURE 1 – XML/RDF/OWL DECLARATIONS

<!DOCTYPE rdf:RDF [
<!ENTITY rdf "http://www.w3.org/1999/02/22-rdf-syntax-ns#" >
<!ENTITY owl "http://www.w3.org/2002/07/owl#" >
<!ENTITY xsd "http://www.w3.org/2001/XMLSchema#" > ]>
<rdf:RDF
xmlns    ="http://localhost/owlfiles/DHProject#"
xmlns:DHProject ="http://localhost/owlfiles/DHProject#"
xmlns:base ="http://localhost/owlfiles/DHProject#"
xmlns:owl ="http://www.w3.org/2002/07/owl#"
xmlns:rdf ="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns:xsd ="http://www.w3.org/2001/XMLSchema#">
<owl:Ontology rdf:about="">
<rdfs:comment>OWL export from MAIstro</rdfs:comment>
<rdfs:label>DHProject Ontology</rdfs:label>
</owl:Ontology>

Without agonizing over the details, Figure 1 provides the necessary declarations in the form of URLs so that software agents can locate additional resources related to this thesaurus. The software agent would not need to have any of the W3C recommendations (XML, RDF, OWL) hard coded into its internal logic. It would only need ‘resolving’ logic such as, “if you encounter a URL, then do the following…”

FIGURE 2 – SAMPLE TERM RECORD OUTPUT IN XML

<TermInfo>
<T>Agrotechnology</T>
<BT>Biotechnology</BT>
<NT>Animal management technologies</NT>
<NT>Controlled environment agriculture</NT>
<NT>Genetically modified crops</NT>
<RT>Agricultural science</RT>
<RT>Food technology</RT>
<UF>Plant engineering</UF>
<Scope_Note></Scope_Note>
<Editorial_Note></Editorial_Note>
<Facet></Facet>
<History></History>
</TermInfo>

Figure 2 shows a sample thesaurus term record output in XML for the term “Agrotechnology”. This term has BT, NT, RT, Status, UF, Scope_Note, Editorial_Note, Facet, and History fields, forming a complex combination of classes, members, and properties. Anyone familiar with thesauri can determine what abbreviations such as BT, NT, and RT mean, and thus can infer the relationships among all of the terms in the term record. An OWL thesaurus output provides additional intelligence that helps software make the same inferences.

After the declarations portion shown in Figure 1, the remainder of the first part of an OWL thesaurus output is the schema describing the classes, subclasses, and members that comprise a thesaurus, and all of the properties of each. Each of the XML elements in Figure 2 (e.g., <RT>) is defined in the schema, along with its properties and relationships. These definitions conform to the OWL W3C recommendation.

The first part of an OWL thesaurus output contains declarations and classes, subclasses, and their properties. It contains all of the logic needed by a specialized agent to make sense of your thesaurus and other OWL thesaurus resources on the Web.

FIGURE 3 – OWL OUTPUT OF TERM RECORD “AGROTECHNOLOGY”

<PreferredTerm rdf:ID="T131">
<rdfs:label xml:lang="en">Agrotechnology</rdfs:label>
<BroaderTerm rdf:resource="#T603"
  newsindexer:alpha="Biotechnology"/>
<NarrowerTerm rdf:resource="#T252"
  newsindexer:alpha="Animal management technologies"/>
<NarrowerTerm rdf:resource="#T1221"
  newsindexer:alpha="Controlled environment agriculture"/>
<NarrowerTerm rdf:resource="#T2166"
  newsindexer:alpha="Genetically modified crops"/>
<Related_Term rdf:resource="#T127"
  newsindexer:alpha="Agricultural science"/>
<Related_Term rdf:resource="#T2020"
  newsindexer:alpha="Food technology"/>
<Non-Preferred_Term rdf:resource="#T3898"
  newsindexer:alpha="Plant engineering"/>
</PreferredTerm>

The second part of an OWL thesaurus output contains the terms of your thesaurus marked up according to the OWL recommendation. Figure 3 shows an OWL output for our sample term, “Agrotechnology”. (Note that since there are no values in Figure 2 for Scope_Note, Editorial_Note, Facet, and History, these elements are not present in Figure 3.)

Now our infamous software agent could infer that “Agrotechnology” is a ‘NarrowerTerm’ of “Biotechnology”. “Agrotechnology” has three ‘NarrowerTerms’, two ‘RelatedTerms’, and one ‘NonPreferredTerm’. From the OWL output, the software agent can resolve the meaning and use of ‘BroaderTerm’, ‘NarrowerTerm’, ‘RelatedTerm’, and ‘NonPreferredTerm’ by navigating to the various URLs. The agent can determine from the schema that, if a term has the property ‘NarrowerTerm’, then it must also have the property ‘BroaderTerm’; a term can’t be a narrower term if it doesn’t have a broader term. A term that is a ‘BroaderTerm’ must also be a ‘PreferredTerm’, and so on.

FIGURE 4 – OWL OUTPUT OF TERM RECORD “PLANT ENGINEERING”

<NonPreferredTerm rdf:ID="T3898">
<rdfs:label xml:lang="en">Plant engineering</rdfs:label>
<USE rdf:resource="T131" newsindexer:alpha="Agrotechnology"/>
</NonPreferredTerm>

Our thesaurus software agent can infer from Figure 3 that the thesaurus it is evaluating uses “Agrotechnology” for “Plant engineering”. Figure 4 identifies “Plant engineering” as a ‘NonPreferredTerm’ and identifies “Agrotechnology” as the ‘PreferredTerm’. (The logic in the schema dictates that if you have a “NonPreferredTerm”, then it must have a “PreferredTerm”.)

Suppose our software agent encounters “Plant engineering” at another Web site and uses it to locate resources there. Now the agent locates your Web site. The agent would first use “Plant engineering”. From your OWL thesaurus output it would infer that at your site it should use your preferred term, “Agrotechnology”, to locate similar resources.
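A toy version of that behavior is sketched below: the agent holds a small mapping derived from the OWL output (non-preferred term to preferred term) and translates an incoming query term before searching your site. The mapping is hard-coded here purely for illustration; a real agent would build it by parsing the OWL/RDF output rather than relying on a fixed table.

# Toy illustration of an agent resolving a non-preferred term to the
# preferred term before searching a site. In practice the mapping would be
# built by parsing the site's OWL thesaurus output, not hard-coded.

use_for = {  # NonPreferredTerm -> PreferredTerm, as asserted by the USE relation
    "Plant engineering": "Agrotechnology",
}

def resolve_term(query_term):
    """Return the preferred term for this site, falling back to the query term."""
    return use_for.get(query_term, query_term)

print(resolve_term("Plant engineering"))  # Agrotechnology
print(resolve_term("Biotechnology"))      # Biotechnology (already preferred)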

All the terms and term relationships in your thesaurus or taxonomy would be defined in part two of the OWL thesaurus output. It is now a Web resource that can be used by software agents. Designed to be distributed and referenced, a given base OWL thesaurus can grow as other thesaurus ontologies reference it.

More Meaning Needed

Even a thesaurus wrapped in OWL falls short of the full potential of the Semantic Web. This ‘first order’ output allows other thesaurus applications to make inferences about the classes, subclasses, and members of a thesaurus. By “reading” the OWL wrappings, any thesaurus OWL software agent can make useful inferences. By using classes, subclasses, and members and their properties, Web software agents would be able to reproduce the hierarchical structure of a thesaurus outside of the application used to construct it.

However, a lot is still missing. For example, knowing a term’s parent, children, other terms it is related to, and terms it is used for, does not tell you what the term means and what it might be trying to describe. Additional classes, subclasses, and members all with properties are needed. How a term is supposed to be used and why this ‘term’ is preferred over that ‘term’ would be enormously useful properties for improving the performance of software agents.

A more difficult layer of semantic meaning is the relationship between a thesaurus term and the entity, or object, it describes. An assignable thesaurus term is a member of class “PreferredTerm”. When it is assigned to an object, for example a research report or Web page, that term becomes a property of that object. For a Web page, descriptive terms become attributes of the ‘Meta’ element:

<META NAME="KEYWORDS" CONTENT="content management software,
xml thesaurus, concept extraction, information retrieval,
knowledge extraction, machine aided indexing,
taxonomy management system, text management, xml">

None of the intelligence found in an OWL thesaurus output is found in the Meta element. Having that intelligence improves the likelihood that our software agent can make useful inferences about this Web resource.

This intelligence is not currently available because HTML does not allow for OWL markup of keywords in the Meta element. There are major challenges to doing this. To illustrate, the single keyword, “machine aided indexing”, is rendered in Figure 5 as an OWL thesaurus output. This is very heavy overhead.

FIGURE 5 – OWL OUTPUT OF TERM RECORD MACHINE AIDED INDEXING

<PreferredTerm rdf:ID="T131">
<rdfs:label xml:lang="en">Machine aided indexing</rdfs:label>
<BroaderTerm rdf:resource="#T603"
 newsindexer:alpha="Information technology"/>
<NarrowerTerm rdf:resource="#T1221"
 newsindexer:alpha="Concept extraction"/>
<NarrowerTerm rdf:resource="#T2166"
 newsindexer:alpha="Rule base techniques"/>
<Related_Term rdf:resource="#T127"
 newsindexer:alpha="Categorization systems"/>
<Related_Term rdf:resource="#T2020"
 newsindexer:alpha="Classification systems"/>
<Non-Preferred_Term rdf:resource="#T3898"
 newsindexer:alpha="MAI"/>
</PreferredTerm>

The entire rendering depicted in Figure 5 would not be necessary for each keyword assigned to the Meta element of a Web page. A shorthand version could be designed that would direct software agents to the OWL thesaurus output, but such a shorthand method is not available.

Even if HTML incorporates a shorthand OWL markup for Meta keywords, the intelligence required to apply the right keywords automatically, for example, making the determination, “Web page x is about Machine aided indexing”, is not in the current OWL output. Automatic or semiautomatic indexing is the only way to handle volume and variety, especially dealing with Web pages.

Commercial applications such as Data Harmony’s M.A.I.™ Concept Extractor© and similar products provide machine automated indexing solutions. Theoretically, the knowledge representation systems that drive machine automated indexing and classification systems could incorporate OWL markup. When a machine indexing system assigned a preferred term to a Web page, it would write it into the Meta element along with its OWL markup.

However, to truly achieve the objectives of the Semantic Web, the OWL W3C recommendation should be extended to include the decision algorithms used in the machine automated indexing process, or alternative W3C recommendations regarding the Semantic Web should be used in conjunction with OWL. If this could be accomplished, then software agents could determine the logic used in assigning terms. Next, an agent could compare the logic used at other Web sites and would then be able to make comparisons and draw conclusions about various Web resources; conclusions like: of the eighteen Web sites your software agent reviewed that discussed selling gasoline, only eight were actual gas stations and only four of the eight had data the agent could determine was the retail price for unleaded.

We have moved closer to locating the least expensive gasoline within a five-mile radius of our current location. What has been described herein is actually being done, but so far only in closed environments where all of the variables are controlled. For example, there are Web sites that specialize in price comparison shopping.

Beyond these special cases, for the open Web the challenges are great. The sheer size of the Web and its speed of growth are obvious. More challenging is capturing meaning in knowledge representation systems like OWL (and other Semantic Web initiatives at W3C like SKOS, Topic Maps, etc.). How many OWL thesauri will there be? How many are needed? How much horsepower will be needed for an agent to resolve meaning when OWL thesauri are cross-referencing each other in potentially endless loops?

For these and other reasons, the Semantic Web may not live up to its full promise. The complexity and the magnitude of the effort may prove to be insurmountable. That said, there will be a Semantic Web and OWL will play an important role, but it will probably be a more simplified semantic architecture and more isolated, for example, to vertical markets or specific fields and disciplines.

For the reader, before you launch your own initiatives, assess your internal resources and measure the level of internal commitment, particularly at the upper levels of your organization. Know what is happening in your industry or field. If Semantic initiatives are happening in your industry, then the effort needed to deploy a taxonomic strategy (OWL being one piece of the solution) should be seriously considered. If you don’t make the effort, your Web resources and your vast internal, private resources risk being lost in the ‘sea of meaninglessness’, putting you at a tremendous competitive disadvantage.

1. BT – broader term, NT – narrower term, NPT – non-preferred term, RT – related term


Automatic Indexing: A Matter of Degree

Marjorie M.K. Hlava

Marjorie M.K. Hlava

Picture yourself standing at the base of that metaphorical range, the Information Mountains, trailhead signs pointing this way and that: Taxonomy, Automatic Classification, Categorization, Content Management, Portal Management. The e-buzz of e-biz has promised easy access to any destination along one or more of these trails, but which ones? The map in your hand seems to bear little relationship to the paths or the choices before you. Who made those signs?

In general, it’s been those venture-funded systems and their followers, the knowledge management people and the taxonomy people. Knowledge management people are not using the outlines of knowledge that already exist. Taxonomy people think you need only a three-level, uncontrolled term list to manage a corporate intranet, and they generally ignore the available body of knowledge that encompasses thesaurus construction. Metadata followers are unaware of the standards and corpus of information surrounding indexing protocols, including back-of-the-book, online and traditional library cataloging. The bodies of literature are distinct with very little crossover. Librarians and information scientists are only beginning to be discovered by these groups. Frustrating? Yes. But if we want to get beyond that, we need to learn — and perhaps painfully, embrace — the new lingo. More importantly, it is imperative for each group to become aware of the other’s disciplines, standards and needs.

We failed to keep up. It would be interesting to try to determine why and where we were left behind. The marketing hype of Silicon Valley, the advent of the Internet, the push of the dot com era and the entry of computational linguists and artificial intelligence to the realm of information and library science have all played a role. But that is another article.

Definitions

The current challenge is to understand, in your own terms, what automatic indexing systems really do and whether you can use them with your own information collection. How should they be applied? What are the strengths and weaknesses? How do you know if they really work? How expensive will they be to implement? We’ll respond to these questions later on, but first, let’s start with a few terms and definitions that are related to the indexing systems that you might hear or read about.

These definitions are patterned after the forthcoming revision of the British National Standard for Thesauri, but do not exactly replicate that work. (Apologies to the formal definition creators; their list is more complete and excellent.)

Document — Any item, printed or otherwise, that is amenable to cataloging and indexing, sometimes known as the target text, even when the target is non-print.
Content Management System (CMS) — Typically, a combination management and delivery application for handling creation, modification and removal of information resources from an organized repository; includes tools for publishing, format management, revision control, indexing, search and retrieval.
Knowledge Domain — A specially linked data-structuring paradigm based on a concept of separating structure and content; a discrete body of related concepts structured hierarchically.
Categorization — The process of indexing to the top levels of a hierarchical or taxonomic view of a thesaurus.
Classification — The grouping of like things and the separation of unlike things, and the arrangement of groups in a logical and helpful sequence.
Facet — A grouping of concepts of the same inherent type, e.g., activities, disciplines, people, natural objects, materials, places, times, etc.
Sub Facet — A group of sibling terms (and their narrower terms) within a facet having mutually exclusive values of some named characteristics.
Node — A sub-facet indicator.
Indexing — The intellectual analysis of the subject matter of a document to identify the concepts represented in the document and the allocation of descriptors to allow these concepts to be retrieved.
Descriptor — A term used consistently when indexing to represent a given concept, preferably in the form of a noun or noun phrase, sometimes known as the preferred term, the keyword or index term. This may (or may not) imply a “controlled vocabulary.”
Keyword — A synonym for descriptor or index term.
Ontology — A view of a domain hierarchy, the similarity of relationships and their interaction among concepts. An ontology does not define the vocabulary or the way in which it is to be assigned. It illustrates the concepts and their relationships so that the user more easily understands its coverage. According to Stanford’s Tom Gruber, “In the context of knowledge sharing…the term ontology…mean(s) a specification of a conceptualization. That is, an ontology is a description (like a formal specification of a program) of the concepts and relationships that can exist for an agent or a community of agents.”
Taxonomy — Generally, the hierarchical view of a set of controlled vocabulary terms. Classically, taxonomy (from Greek taxis meaning arrangement or division and nomos meaning law) is the science of classification according to a pre-determined system, with the resulting catalog used to provide a conceptual framework for discussion, analysis or information retrieval. In Web portal design, taxonomies are often created to describe categories and subcategories of topics found on a website.
Thesaurus — A controlled vocabulary wherein concepts are represented by descriptors, formally organized so that paradigmatic relationships between the concepts are made explicit, and the descriptors are accompanied by lead-in entries. The purpose of a thesaurus is to guide both the indexer and the searcher to select the same descriptor or combination of descriptors to represent a given subject. A thesaurus usually allows both an alphabetic and a hierarchical (taxonomic) view of its contents. ISO 2788 gives us two definitions for thesaurus: (1) “The vocabulary of a controlled indexing language, formally organized so that the a priori relationships between concepts (for example as ‘broader’ and ‘narrower’) are made explicit” and (2) “A controlled set of terms selected from natural language and used to represent, in abstract form, the subjects of documents.”

Are these old words with clearly defined meanings? No. They are old words dressed in new definitions and with new applications. They mean very different things to different groups. People using the same words but with different understandings of their meanings have some very interesting conversations in which no real knowledge is transferred. Each party believes communication is taking place when, in actuality, they are discussing and understanding different things. Recalling Abbott and Costello’s Who’s on First? routine, a conversation of this type could be the basis for a great comedy routine (SIG/CON perhaps), if it weren’t so frustrating — and so important. We need a translator.

For example, consider the word index. To a librarian, an index is a compilation of references grouped by topic, available in print or online. To a computer science person (that would be IT today), it would refer to the inverted index used to do quick look-ups in a computer software program. To an online searcher, the word would refer to the index terms applied to the individual documents in a database that make it easy to retrieve by subject area. To a publisher, it means the access tool in the back of the book listed by subject and sub-subject area with a page reference to the main book text. Who is right? All of them are correct within their own communities.

Returning to the degrees of application for these systems and when to use one, we need to address each question separately.

What Systems Are There?

What are the differences among the systems for automatic classification, indexing and categorization? The primary theories behind the systems are:

  • Boolean rule base variations including keyword or matching rules
  • Probability of application statistics (Bayesian statistics)
  • Co-occurrence models
  • Natural language systems

New dissertations will bring forth new theories that may or may not fit in this lumping.

How Should They Be Applied?

Application is achieved in two steps. First, the system is trained in the specific subject or vertical area. In rule-based systems this is accomplished by (1) selecting the approved list of keywords to be used and, through matching and synonyms, building simple rules and (2) employing phraseological, grammatical, syntactic, semantic, usage, proximity, location, capitalization, and other algorithms — depending on the system — for building complex rules. This means that, frequently, the rules match keywords to synonyms or to word combinations using Boolean statements in order to capture the appropriate indexing from the target text.
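A stripped-down sketch of the two kinds of rules might look like the following: simple rules are plain synonym matches, while a complex rule adds a Boolean condition on the surrounding text. The rule contents are invented for illustration and are not drawn from any particular vendor's rule base.

# Minimal sketch of rule based indexing: simple synonym-match rules plus a
# complex rule with a Boolean condition on the target text. The example
# rules are invented for illustration, not taken from any vendor rule base.

simple_rules = {                       # surface form -> descriptor
    "maize": "Corn",
    "corn": "Corn",
    "gm crops": "Genetically modified crops",
}

def complex_rule_lead(text):
    """Complex rule: assign 'Lead (metal)' only when the context supports it."""
    t = text.lower()
    if "lead" in t and ("solder" in t or "paint" in t or "pipe" in t):
        return "Lead (metal)"
    return None

def index(text):
    t = text.lower()
    terms = {descriptor for form, descriptor in simple_rules.items() if form in t}
    hit = complex_rule_lead(t)
    if hit:
        terms.add(hit)
    return terms

print(index("New rules restrict lead in solder used for water pipes."))
# {'Lead (metal)'}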

In Bayesian engines, the system first selects the approved list of keywords to be used for training. The system is then trained by running the approved keywords against a set of documents, usually about 50 to 60 documents (records, stories). This creates scenarios for word occurrence based on the words in the training documents and how often they occur in conjunction with the approved words for that item. Some systems use a combination of Boolean and Bayesian methods to achieve the final indexing results.
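For contrast, a statistics based categorizer can be sketched with an off-the-shelf naive Bayes text classifier. The tiny training set below stands in for the 50 to 60 training documents described above, and the category labels and texts are invented for illustration; no particular vendor's engine works exactly this way.

# Sketch of a statistics based (Bayesian) categorizer using scikit-learn.
# The training texts and category labels are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "blood plasma donors transfusion hospital",
    "plasma physics ionized gas tokamak confinement",
    "crop yields fertilizer harvest irrigation",
]
train_labels = ["Blood plasma", "Plasma physics", "Agriculture"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)   # learn word-occurrence statistics

print(model.predict(["the tokamak confines a hot ionized gas"]))
# ['Plasma physics']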

Natural language systems base their application on the parts of speech and the nature of language usage. Language is used differently in different applications. Think of the word plasma. It has very different meanings in medicine and in physics, although the word has the same spelling and pronunciation, not to mention etymology. Therefore, the contextual usage is what informs the application.
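The sketch below shows the kind of part-of-speech evidence a natural language system starts from, using the NLTK toolkit as one readily available stand-in; the article does not prescribe a particular toolkit, and the required NLTK data must be downloaded first.

    import nltk

    # Requires the NLTK tokenizer and tagger models, e.g.:
    #   nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
    sentence = "The plasma donation center collects blood plasma from volunteer donors."
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)
    print(tagged[:5])
    # A natural language system uses tags and surrounding noun phrases such as
    # "plasma donation center" to decide that this "plasma" is the medical sense.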

In all cases it is clear that a taxonomy or thesaurus or classification system needs to be chosen before work can begin. The resulting keyword metadata sets depend on a strong word list to start with — regardless of the name and format that may be given to that word list.

What Are the Strengths and Weaknesses?

The main weakness of these systems, compared to human indexing, is the frequency of what are called false drops. That is, the keywords selected fit the computer model but do not make sense in actual use. These terms are considered noise in the system and in application. Systems work to reduce the level of noise.

The measure of the accuracy of a system is based on

  • Hits — keywords that exactly match what a human indexer would have applied to the document
  • Misses — the keywords a human would have selected that a computerized system did not
  • Noise — keywords selected by the computer that a human would not have selected

The statistical ratios of Hits, Misses and Noise are the measure of how good the system is. The cut-off should be 85% Hits, measured against fully accurate human indexing as the 100% baseline. That means that Noise and Misses combined need to be less than 15%.
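These ratios are straightforward to compute once a human-indexed baseline exists for a record. The Python sketch below uses made-up term sets simply to show the arithmetic.

    def hit_miss_noise(machine_terms, human_terms):
        """Compare machine-assigned keywords against the human baseline."""
        machine, human = set(machine_terms), set(human_terms)
        hits = machine & human            # both selected the term
        noise = machine - human           # machine-only terms
        misses = human - machine          # human-only terms
        hit_rate = len(hits) / len(human) if human else 0.0
        return hits, misses, noise, hit_rate

    human = {"Thesauri", "Automatic indexing", "Taxonomies", "Metadata"}
    machine = {"Thesauri", "Automatic indexing", "Taxonomies", "Ontologies"}
    hits, misses, noise, rate = hit_miss_noise(machine, human)
    print(f"Hits {len(hits)}, Misses {len(misses)}, Noise {len(noise)}, hit rate {rate:.0%}")
    # With 3 hits out of 4 human terms this record scores 75%, below the 85% cut-off.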

A good system will provide an accuracy rate of 60% initially from a good foundation keyword list and 85% or better with training or rule building. This means that there is still a margin of error expected and that the system needs — and improves with — human review.

Perceived economic or workflow impacts often make that level of human review unacceptable, which leads to the attempt to provide some form of fully automated indexing. Mitigating the results so that human indexers are not needed is addressed in a couple of ways. On the one hand, suppose that the keyword list is hierarchical (the taxonomy view) and goes to very deep levels in some subject areas, maybe 13 levels to the hierarchy. A term can be analyzed and applied only at the final, most specific level, and therefore its use is precise and confined to a narrow application.

On the other hand, it may also be “rolled up” to ever-broader terms until only the first three levels of the hierarchy are used. This second approach is preferred in the web-click environment, where popular thinking (and some mouse-behavior research) indicates that users get bored at three clicks and will not go deeper into the hierarchy anyway.
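This second, roll-up option can be done programmatically if broader-term (BT) links are available in the vocabulary. The following sketch, with an invented four-level slice of a hierarchy, walks a deep term up to the ancestor that sits within the first three levels.

    # Illustrative broader-term (BT) links for a small slice of a hierarchy.
    broader = {
        "Tokamak reactors": "Magnetic confinement fusion",
        "Magnetic confinement fusion": "Plasma physics",
        "Plasma physics": "Physics",
        "Physics": "Physical sciences",
    }

    def roll_up(term, broader, max_depth=3):
        """Walk up the BT chain, then keep only the first max_depth levels."""
        chain = [term]
        while chain[-1] in broader:
            chain.append(broader[chain[-1]])
        top_down = list(reversed(chain))   # top term first (level 1)
        return top_down[:max_depth][-1]

    print(roll_up("Tokamak reactors", broader))   # -> "Plasma physics" (level 3)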

These two options make it possible to use any of the three types of systems for very quick and fully automatic bucketing or filtering of target data for general placement on the website or on an intranet. Achieving deeper indexing and precise application of keywords still requires human intervention, at least by review, in all systems. The decision then becomes how precisely and deeply you will develop the indexing for the system application and the user group you have in mind.

How Do We Know If They Really Work?

You can talk with people who have tried to implement these systems, but you might find that (1) many are understandably reluctant to admit failure of their chosen system and (2) many are cautiously quiet around issues of liability, because of internal politics or for other reasons. You can review articles, white papers and analyst reports, but keep in mind that these may be biased toward the person or company who paid for the work. A better method is to contact users on the vendor’s customer list and speak to them without the vendor present. Another excellent method is to visit a couple of working implementations so that you can see them in action and ask questions about the system’s pluses and minuses.

The best method of all is to arrange for a paid pilot. In this situation you pay to have a small section of your taxonomy and text processed through the system. This permits you to analyze the quality and quantity of real output against real and representative input.

How Expensive Will They Be to Implement?

We have looked at three types of systems. Each starts with a controlled vocabulary, which could be a taxonomy or thesaurus, with or without accompanying authority files. Obviously you must already have, or be ready to acquire or build, one of these lists to start the process. You cannot measure the output if you don’t have a measure of quality. That measure should be the application of the selected keywords to the target text.

Once you have chosen the vocabulary, the road divides. In a rule base, or keyword, system the simple rules are built automatically from the list for match and synonym rules, that is, “See XYZ, Use XYZ.” The complex rules are partially programmatic and partially written by human editors/indexers. The building process averages 4 to 10 complex rules per hour. The process of deciding what rules should be built is based on running the simple rule base against the target text. If that text is a vetted set of records — already indexed and reviewed to assure good indexing — statistics can be automatically calculated. With the Hit, Miss and Noise statistics in hand, the rule builders use the statistics as a continual learning tool for further building and refinement of the complex rule base. Generally, 10 to 20 percent of terms need a complex rule. If the taxonomy has 1000 keyword terms, then the simple rules are made programmatically and the complex rules — 100 to 200 of them — would be built in 10 to 50 hours. The result is a rule base or knowledge extractor or concept extractor to run against target text.
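A minimal sketch of the programmatic portion of this workflow appears below: simple match/synonym rules are generated directly from illustrative thesaurus records, and made-up per-term Hit/Miss/Noise statistics from a vetted test set flag the terms that fall below the 85% threshold and therefore need a complex rule. The record format and the numbers are inventions for the example, not output from any particular product.

    # Illustrative thesaurus records: preferred term plus "used for" (UF) synonyms.
    thesaurus = {
        "Automatic indexing": ["machine aided indexing", "auto-categorization"],
        "Thesauri": ["thesaurus", "controlled vocabularies"],
    }

    # Simple rules are generated programmatically: the preferred term and each
    # UF synonym become match phrases ("See XYZ, Use XYZ").
    simple_rules = {preferred: [preferred.lower()] + [uf.lower() for uf in ufs]
                    for preferred, ufs in thesaurus.items()}
    print(simple_rules["Thesauri"])

    # After running the simple rules over a vetted, already-indexed test set,
    # per-term statistics show which terms behave badly (illustrative numbers).
    stats = {
        "Automatic indexing": {"hits": 40, "misses": 2, "noise": 1},
        "Thesauri": {"hits": 12, "misses": 9, "noise": 14},
    }

    needs_complex_rule = [term for term, s in stats.items()
                          if s["hits"] / (s["hits"] + s["misses"] + s["noise"]) < 0.85]
    print(needs_complex_rule)   # -> ['Thesauri']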

Bayesian, inference, and co-occurrence categorization systems depend on the gathering of training set documents. These are documents collected for each node (keyword term) in the taxonomy that are representative of that term. The usual number of documents to collect for training is 50. Some require more, some less. Collection of the documents for training may take an hour or more per term to gather, to review as actually representing the term and to convert to the input format of the categorization system. Once all the training sets are collected, a large processing run is performed to find the logical connections between terms within a document and within a set of documents. This returns a probability of a set of terms being relevant to a particular keyword term. Then the term is assigned to other similar documents based on the statistical likelihood that a particular term is the correct one (according to the system’s findings on the training set). The result is a probability engine ready to run against a new set of target text.
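As an illustration of what such a probability engine looks like in practice, the sketch below trains a naive Bayes classifier with scikit-learn (an assumed toolkit choice; the article names no particular software) on a deliberately tiny labeled set and then scores a new piece of text. Real training sets would hold roughly 50 documents per taxonomy node.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Illustrative training documents labeled with their taxonomy node.
    docs = [
        "ionized plasma confined in a tokamak reactor",
        "magnetic confinement and fusion plasma experiments",
        "blood plasma donation at the clinic",
        "separating plasma from whole blood for transfusion",
    ]
    labels = ["Plasma physics", "Plasma physics", "Blood plasma", "Blood plasma"]

    engine = make_pipeline(CountVectorizer(), MultinomialNB())
    engine.fit(docs, labels)

    new_text = ["the tokamak plasma reached fusion temperatures"]
    print(engine.predict(new_text))         # most probable node
    print(engine.predict_proba(new_text))   # per-node probabilities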

A natural language system is trained on the parts of speech and term usage and builds a domain for the specific area of knowledge to be covered. Generally, each term is analyzed via seven methods:

  • Morphological (term form — number, tense, etc.)
  • Lexical analysis (part of speech tagging)
  • Syntactic (noun phrase identification, proper name boundaries)
  • Numerical conceptual boundaries
  • Phraseological (discourse analysis, text structure identification)
  • Semantic analysis (proper name concept categorization, numeric concept categorization, semantic relation extraction)
  • Pragmatic (common sense reasoning for the usage of the term, such as cause and effect relationships, i.e., nurse and nursing)

This is quite a lot of work, and it may take up to four hours to define a single term fully with all its aspects. Here again, some programmatic options exist, as well as base semantic nets, which are available either as part of the system or from other sources. WordNet is a large lexical database heavily used by this community for creating natural language systems. And, for a domain containing 3,000,000 rules of thumb and 300,000 concepts (based on a calculus of common sense), visit the CYC Knowledge Base. These resources will supply a domain ready to run against your target text. For standards evolving in this area, take a look at the Rosetta site on the Internet.
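For example, WordNet's sense inventory can be queried through NLTK's corpus reader; the access path is an assumption here, since the article only names WordNet itself, and the WordNet data must be downloaded first.

    from nltk.corpus import wordnet as wn

    # Requires the WordNet corpus, e.g. nltk.download("wordnet")
    for synset in wn.synsets("plasma"):
        print(synset.name(), "-", synset.definition())
    # The separate synsets for the blood component and the ionized-gas sense
    # show the kind of sense inventory a natural language system can build on.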

Summary

There are real and reasonable differences in deciding how a literal world of data, knowledge or content should be organized. In simple terms, it’s about how to shorten the distance between questions from humans and answers from systems. Purveyors of various systems maneuver to occupy or invent the standards high ground and to capture the attention of the marketplace, often bringing ambiguity to the discussion of process and confusion to the debate over performance. The processes are complex and performance claims require scrutiny against an equal standard. Part of the grand mission of rendering order out of chaos is to bring clarity and precision to the language of our deliberations. Failure to keep up is failure to engage, and such failure is not an option.

We have investigated three major methodologies used in the automatic and semi-automatic classification of text. In practice, many of the systems use a mixture of the methods to achieve the desired result. Most systems require a taxonomy in order to start, and most tag the text with keyword terms from the taxonomy, storing the resulting terms as metadata in a keyword element or in other elements of the record.

All rights reserved. Copyright © 2006 Access Innovations, Inc.


The Kano Model: Critical to Quality Characteristics and VOC

Origin of the Kano Model

Dr. Noriaki Kano, a very astute student of Dr. Ishikawa, developed an interesting model to address the various ways in which Six Sigma practitioners could prioritize customer needs. This becomes particularly important when trying to rank the customer’s wants and desires in a logical fashion.

The Practical Side to the Kano Model

The Kano model is a tool that can be used to prioritize the Critical to Quality characteristics, as defined by the Voice of the Customer, which I will explain in greater detail below. The three categories identified by the Kano model are:

  • Must Be: The quality characteristics that must be present or the customer will go elsewhere.
  • Performance: The better we are at meeting these needs, the happier the customer is.
  • Delighter: Those qualities that the customer was not expecting but received as a bonus.

The First Step for Creating the Kano Model: Identifying the Voice of the Customer

The first step for creating the Kano model is to identify the quality characteristics that are typically fuzzy, vague and nebulous. These quality characteristics are referred to as the Voice of the Customer (VOC). Once the Voice of the Customer is understood, we can attempt to translate it into quantitative terms known as critical to quality (CTQ) characteristics. This should not be a new concept for those familiar with the Six Sigma methodology. What happens from here, though, can sometimes go astray if we are not careful and try to put our own spin on the needs of the customer. This may be the result of trying to make things more easily obtainable for us—a formula for failure.

Use the Kano Model to Prioritize the Critical to Quality Characteristics

So, now that we have identified what is important to the customer in workable terms, we can go to the second step. Always keeping the customer in mind, we can apply the concepts outlined in the Kano model diagram.

A Few Words About Kano

The Kano model is drawn on an (x, y) graph, where the x-axis represents how good we are at achieving the customer's outcome(s), or CTQs. The y-axis records the customer's level of satisfaction as a result of our level of achievement.

The red line on the Kano model represents the Must Bes. That is, whatever the quality characteristic is, it must be present; if the quality characteristic is not met, the customer will go elsewhere. The customer does not care if the product is wrapped in 24-carat gold, only that it is present and is functionally doing what it was designed to do. An example of this would be a client who checks into a hotel room expecting to find a bed, curtains and a bathroom. These items are not explicitly called out by the customer, but the absence of any of these “characteristics” would definitely cause the customer to go elsewhere.

The blue line on the Kano model represents the Performance. This line reflects the Voice of the Customer. The better we are at meeting these needs, the happier the customer is. It is here where the trade-offs take place. Someone wanting good gas mileage would not likely expect a vehicle with great acceleration from a standing start.

By far, the most interesting evaluation point of the Kano model is the Delighter (the green line). This represents those qualities that the customer was not expecting, but received as a bonus. A few years ago, it was customary that when a car was taken back to the dealer for a warranty oil change, the vehicle was returned to the owner with its body washed, mirrors polished, carpets vacuumed, etc. After a few trips to the dealer, this Delighter became a Must Be characteristic. Thus, a characteristic that once was exciting was now a basic need, and a part of the customer’s expectations. Another example of this is the amenities platter that some hotels provide their platinum customers upon checking in. I am one of those clients entitled to such a treat. This practice was certainly a delight. It has, however, become an expected part of my check-in, such that if there is no platter waiting in my room, I’m on the phone with the front desk.
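For readers who want to reproduce the diagram, the short Python sketch below draws the three curves with conventional illustrative shapes. The exponential and linear formulas are not taken from the article; only the red/blue/green labeling and the axis meanings are.

    import numpy as np
    import matplotlib.pyplot as plt

    # Achievement of the CTQ (x-axis) from "not done" to "fully achieved".
    x = np.linspace(0, 1, 100)

    # Conventional illustrative shapes: Must Be saturates at neutral satisfaction,
    # Performance rises linearly, Delighters rise steeply once present at all.
    must_be = -np.exp(-5 * x)          # never rises above neutral
    performance = 2 * x - 1            # more achievement, more satisfaction
    delighter = np.exp(3 * (x - 1))    # absent is neutral, present delights

    plt.plot(x, must_be, "r", label="Must Be")
    plt.plot(x, performance, "b", label="Performance")
    plt.plot(x, delighter, "g", label="Delighter")
    plt.xlabel("How well the CTQ is achieved")
    plt.ylabel("Customer satisfaction")
    plt.legend()
    plt.show()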

Once the critical to quality characteristics have been prioritized, the last step of the Kano model involves assessing just how well we can satisfy each of Dr. Noriaki Kano's classifications.

Kano Model Case Study

Being a trainer and consultant, I spend a lot of time on the road. In doing so, I have a tendency to check into hotels on a regular basis, as mentioned earlier. I once asked the manager of a hotel where I spend a lot of time how he established practices to entice the business client. He related the following scenario to me.

The first thing he did was identify a list of qualities the client would be interested in. He came upon his list by listening to complaints, handing out surveys, holding focus groups and conducting interviews. The information below is a partial list from the Voice of the Customer. Knowing that I was involved in something that dealt with customer satisfaction, he asked me to assist him in ranking the characteristics. I explained the concepts behind the Kano model, and together we developed the list in the column labeled Business Client, as shown in Table 1. This was all fine and dandy, as far as the business customer was concerned.

Table 1

For my own interest, I asked him to look at these same characteristics from the point of view of a vacationing family. As a final task, I asked him to assess how strong or weak he felt the hotel was in meeting the quality characteristics identified in Table 1.

The results are shown in Table 2.

Table 2

The conclusions from this effort can be summarized by looking at the rows that have a characteristic in the Must Be category. With respect to the business client, this yielded express checkout, a comfortable bed, continental breakfast, internet hook-up and a newspaper. The vacationer, on the other hand, had Must Bes that included price, a comfortable bed, cable/HBO and a swimming pool.

Of these quality characteristics, the manager realized that the hotel was weak in the check-in and express checkout process, and internet hook-up. This Kano model exercise allowed the manager to better address the needs of the customer, based on their Critical to Quality characteristics. Now the work begins to minimize the gap of where the hotel is with respect to where the hotel wants to be.

One final thought: If a characteristic isn’t on the list, does that mean it can be ignored?

©2006 E. George Woodley. All rights reserved.
Published with permission from the author.
