
Unlimited Priorities and NCSU Libraries Partner to Create Model Data Mining Agreement

Cape Coral, FL (March 24, 2015) – Unlimited Priorities LLC, a firm specializing in support for small and medium-size companies in the information and publishing industries, and the North Carolina State University Libraries (NCSU Libraries) have collaborated to open up Accessible Archives’ databases for text and data mining by client libraries.

Text and data mining (TDM) encompasses dozens of computationally-intensive techniques and procedures used to examine and transform data and metadata. At its core, TDM uses high-speed computing technology to examine large data sets in order to recognize and model meaningful patterns and rules.
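As a minimal illustration of the pattern recognition TDM performs (the corpus and terms below are invented for illustration; real TDM pipelines run at vastly larger scale), one of the simplest techniques is counting which terms co-occur in the same document:

```python
from collections import Counter
from itertools import combinations

# Toy corpus standing in for a large historical archive.
docs = [
    "abolition petition presented to congress",
    "congress debates the abolition petition",
    "railroad expansion debated in congress",
]

# Count how often pairs of terms appear in the same document --
# the simplest form of the pattern mining TDM performs at scale.
pairs = Counter()
for doc in docs:
    terms = sorted(set(doc.split()))
    pairs.update(combinations(terms, 2))

# Frequent pairs hint at recurring associations in the collection.
print(pairs.most_common(3))
```

Scaled up to millions of documents, with better tokenization and statistical weighting, this kind of co-occurrence analysis is what lets researchers surface patterns no human reader could spot.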

Unlimited Priorities orchestrated this initiative at the request of Darby Orcutt, Assistant Head, Collection Management, at The NCSU Libraries. Mr. Orcutt explained: “Through this model agreement, Unlimited Priorities and Accessible Archives have become even stronger partners with libraries in supporting the current and emerging needs of researchers. They quickly and positively responded to the opportunity for a win-win relationship in this area. Not only does this agreement open up large and high-quality historical datasets for mining by our users, but as scholars come to understand this content in ways that only such computational research makes possible, the value of these resources for academia correspondingly increases.”


A Web by Any Other Name

Written for Unlimited Priorities and DCLnews Blog.

Why We Need to Know About the Semantic Web

Richard Oppenheim

Some say, “Look out — the semantic web is coming.” Some say it is already here. Others ask, “What exactly is semantic about the web?”

Whether or not you have ever heard of the “semantic web,” you need to know more about it. Probably the first step for all of us is to get past the hype of yet another marketing term for technology. We know that technology will continuously create new phrases for new features that enable us to do more than yesterday. This includes terms like personal computer, smartphone, internet, world wide web, telecommuting, cloud computing and a lot more. Twenty-five years ago, only a few folks were even using computers, let alone smartphones, e-mail, social networks, and search engines.

Tim Berners-Lee, the person credited with developing the world wide web, said, “the semantic web is not a separate web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation.”


By dictionary definition, “semantics” is the study of meaning, a range of ideas with no firm limits. In written language, such things as paragraph structure and punctuation carry semantic content. In spoken language, semantics covers the signs and symbols within a set of circumstances and contexts, including sounds, tones, facial expressions, body language, foot tapping, and hand waving.

There are lots of ways to refer to the huge storehouse of data outside our control, such as the internet, the web, cyberspace, geek heaven, or some other term. Whatever term you choose, know that the semantic storehouse is a repository for words, images, and applications that is way too big to measure. It is like trying to count the number of stars in the universe; only rough estimates are available.

To deliver or receive communication, we combine individual elements in small or large quantities to create spoken language, articles, books, web sites, blogs, tweets, photo albums, videos, songs, song albums, audio books, podcasts and more. Words can be from any one or multiple languages. Images can be still or moving, personal or commercial. Each element in our storehouse is always available to be used in any sequence and any quantity.

The semantic web invokes Tim Berners-Lee’s vision of every piece of data immediately accessible by anyone to use in any way they want. His vision expands the use of “linked data” to connect all web-based elements with every other web-based element. Wikipedia provides a peek into how, with its linking of terms from one posting to other entries in other posts. The resounding slogan shouted by Berners-Lee is “Raw Data Now.”

The purpose of the semantic web is to enable words and phrases to provide links to resources, like Wikipedia, that reach across the universe of web-accessible data. A current example is how CNN is expanding its resources: for on-air broadcasting, CNN summarizes its news feeds, but you can log in to the CNN website to access the “raw” feeds and watch and listen without an analyst’s intervention.

Growing up, my primary reference material was a paper-based encyclopedia, dictionary, or thesaurus. Books in my local town library were available, but it was hard to get at books the town library did not have. Today, all of those resources are accessible at any time without a weight-training exercise. One fledgling example of this use of raw data is DBpedia, a community effort to extract structured information from Wikipedia and so expand the linkage of data. As of April 2010, the DBpedia knowledge base contained over one billion pieces of information describing more than 3.4 million things.
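The linked-data model behind DBpedia can be sketched in a few lines. The facts and names below are invented for illustration (real DBpedia uses full URIs and the RDF standard), but the structure is the same: everything is a (subject, predicate, object) triple, and following links means chasing objects that become subjects of other triples.

```python
# Toy linked-data store: facts as (subject, predicate, object) triples,
# the model DBpedia extracts from Wikipedia. All names are invented.
triples = {
    ("Rome", "capitalOf", "Italy"),
    ("Italy", "partOf", "Europe"),
    ("Rome", "population", "2800000"),
}

def objects(subject, predicate):
    """Follow a link: every object related to subject by predicate."""
    return {o for s, p, o in triples if s == subject and p == predicate}

# Traverse two links, the way semantic-web agents chain across data sets.
for country in objects("Rome", "capitalOf"):
    print(country, "is part of", *objects(country, "partOf"))
```

The power of the idea comes when the second triple lives in a different silo than the first: the traversal works the same way whether the data sits in one store or across the whole web.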

In 2010, you will not abandon currently used resources and adopt only the semantic web. But every forecast tells us that the next few years will bring many new and rich resources. The semantic web will let us collect data elements, assemble them, disassemble them, and start anew or continue by adding more. It will be one of the 21st century’s functional erector sets, useful for business support, personal search, and even customizable games.

The semantic web, however, is not a game. It is under construction today and will likely remain under construction for the rest of this century. Skeptics say the goals are too lofty and unrealistic. But a quick look at very recent history reveals:

  • The internet was first used by a few universities in the 1960s. Thirty years later, the world wide web started its revolutionary integration with our lives.
  • Bar code scanning was first tested on a pack of chewing gum in 1974. It was another ten years before grocery stores started to adopt the thick and thin bars. Today bar coding has grown way beyond grocery store checkout lines.
  • In 1983, Motorola released the first cellular phone for $3,000. Ten years later, the cell phone industry took its first leap. Today cellular and wireless technologies are essential tools for countless enterprises and individuals of every age.
  • In the past decade alone: LinkedIn was founded in 2002, Facebook in 2004, and Twitter in 2006.


Technology will always ride the sea changes as new capabilities build on what has been tested and used before. Search engines enable us to ask questions and retrieve answers that are some combination of data, some precise, some tangential to the subject, and some totally unrelated to the topic. Today’s data libraries are single location silos, such as Wikipedia, that hold information beyond the capacity of my local library. With the semantic web, these silos will lose their standalone status. The new linking capabilities will deliver infinitely expanded ways to link data in any one silo with data in almost any other silo. Everything, including national security, will still require protection from criminals, hackers, and assorted bad guys.

Raw data without borders will enable, for example, each of us to create our very own Dewey Decimal filing system, including card catalogue, rolodex, and other customized information. Similar to the smartphone app world, we will have a large selection of end-user applications that integrate, combine and deduce information needed to assist us in performing tasks. Of course, we may choose to perform information construction ourselves. This would be like answering our own phone or typing our own correspondence or driving our own car. We can choose to adapt, adopt, or discard any feature that becomes available. The semantic web is real and it is growing. It has the potential to expand beyond any estimate.

About the Author

Richard Oppenheim, CPA, blends business, technology, and writing competence with a passion for helping individuals and businesses get unstuck from the obstacles preventing them from moving ahead. He is a member of the Unlimited Priorities team. Follow him on Twitter at twitter.com/richinsight.


Interview with Deep Web Technologies’ Abe Lederman

Written for Unlimited Priorities and DCLnews Blog by Barbara Quint

Abe Lederman

Abe Lederman is President and CEO of Deep Web Technologies, a software company that specializes in mining the deep web.

Barbara Quint: So let me ask a basic question. What is your background with federated search and Deep Web Technologies?

Abe Lederman: I started in information retrieval way back in 1987. I’d been working at Verity for 6 years or so, through the end of 1993. Then I moved to Los Alamos National Laboratory, one of Verity’s largest customers. For them, I built a Web-based application on top of the Verity search engine that powered a dozen applications. Then, in 1997, I started consulting to the Department of Energy’s Office of Science and Technology Information. The DOE’s Office of Environmental Management wanted to build something to search multiple databases. Then, we called it distributed search, not federated search.

The first application I built is now called the Environmental Science Network. It’s still in operation almost 12 years later. The first version I built with my own fingers on top of a technology devoted to searching collections of Verity documents. I expanded it to search on the Web. We used that for 5 to 6 years. I started Deep Web Technologies in 2002 and around 2004 or 2005, we launched a new version of federated search technology written in Java. I’m not involved in writing any of that any more. The technology in operation now has had several iterations and enhancements and now we’re working on yet another generation.

BQ: How do you make sure that you retain all the human intelligence that has gone into building the original data source when you design your federated searching?

AL: One of the things we do that some other federated search services are not quite as good at is to try to take advantage of all the abilities of our sources. We don’t ignore metadata on document type, author, date ranges, etc. In many cases, a lot of the databases we search — like PubMed, Agricola, etc. — are very structured.

BQ: How important is it for the content to be well structured? To have more tags and more handles?

AL: The more metadata that exists, the better results you’re going to get. In the library world, a lot of data being federated does have all of that metadata. We spend a lot of effort to do normalization and mapping. So if the user wants to search a keyword field labeled differently in different databases, we do all that mapping. We also do normalization of author names in different databases — and that takes work! Probably the best example of our author normalization is in Scitopia.
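The mapping and normalization Lederman describes can be sketched schematically. The field labels and name-folding rule below are invented for illustration (real connectors for PubMed, Agricola, etc. are far more involved): one logical field is translated into each source's own label on the way out, and author names are folded to a common key on the way back.

```python
# Hypothetical per-source field maps: the user's one "author" field
# carries a different label in each database.
FIELD_MAP = {
    "pubmed_like": {"author": "AU", "date": "DP"},
    "agricola_like": {"author": "100a", "date": "260c"},
}

def translate(source, field, value):
    """Rewrite a user query clause into the source's own field label."""
    return f'{FIELD_MAP[source][field]}:"{value}"'

def normalize_author(name):
    """Fold 'Last, First M.' and 'First M. Last' into one crude key
    (last name plus first initial) so results merge across sources."""
    if "," in name:
        last, first = [p.strip() for p in name.split(",", 1)]
    else:
        parts = name.split()
        last, first = parts[-1], " ".join(parts[:-1])
    return f"{last.lower()}, {first[:1].lower()}"

print(translate("pubmed_like", "author", "Smith"))
print(normalize_author("Smith, John A.") == normalize_author("John A. Smith"))
```

Even this toy version hints at why Lederman says normalization "takes work": real author data adds initials-only forms, diacritics, and transliteration variants that a one-line key cannot resolve.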

BQ: How do you work with clients? Describe the perfect client or partner.

AL: I’m very excited about a new partnership with Swets, a large global company. We have already started reselling our federated search solutions through them. Places we’re working with include the European Space Agency and soon the European Union Parliament, as well as some universities.

We pride ourselves on supplying very good customer support. A lot of our customers talk to me directly. We belong to a small minority of federated search providers that can sell a product to a customer for internal deployment and still work with them to monitor or fix any issues with connectors to the sources we’re federating, even without direct access to the installation. A growing part of our business uses the SaaS model; we’re seeing a lot more of that. There’s also the hybrid approach, such as that used by DOE’s OSTI: our software runs on servers in Oak Ridge, Tennessee, but we maintain all their federated search applications. Stanford University is another example. In September we launched a new app that federates 28 different sources for their schools of science and engineering.

BQ: How are you handling new types of data, like multimedia or video?

AL: We haven’t done that so far. We did make one attempt to build a federated search for art image databases, but, unfortunately for the pilot project, the databases had poor metadata and search interfaces, so that particular pilot was not terribly successful. We want to go back to richer databases, including video.

BQ: How do you gauge user expectations and build that into your work to keep it user-friendly?

AL: We do track queries submitted to whatever federated search applications we are running. We could do more. We do provide Help pages, but probably nobody looks at them. Again, we could do more to educate customers. We do tend to be one level removed from end-users. For example, Stanford’s people have probably done a better job than most customers in creating some quick guides and other material to help students and faculty make better use of the service.

BQ: How do you warn (or educate) users that they need to do something better than they have, that they may have made a mistake? Or that you don’t have all the needed coverage in your databases?

AL: At the level of feedback we are providing today, we’re not there yet. It’s a good idea, but it would require pretty sophisticated feedback mechanisms. One of the things we have to deal with is that when you’re searching lots of databases, they behave differently from each other. Just look at dates (and it’s not just dates): a user may want to search 2000-2010, but some databases may display the date without letting you search on it, and some won’t do either. Where a database displays the date but won’t let you search on a range, you may get results outside the range, displayed among the unranked results. How to make clear to the user what is going on is a big thing for the future.
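One pragmatic workaround for the mismatch Lederman describes (sketched here with invented records, not Deep Web Technologies' actual code) is to post-filter by the displayed date when a source cannot search on dates, flagging results whose dates are unknown:

```python
def filter_by_date(results, start, end):
    """Keep results inside [start, end]. Results with no date are kept
    but flagged, since we cannot tell whether they qualify."""
    kept, flagged = [], []
    for r in results:
        year = r.get("year")
        if year is None:
            flagged.append(r)       # undated: surface, but warn the user
        elif start <= year <= end:
            kept.append(r)          # displayed date is inside the range
    return kept, flagged

# Hypothetical hits from a source that displays but cannot search dates.
hits = [{"title": "A", "year": 2005}, {"title": "B", "year": 1998},
        {"title": "C"}]
kept, flagged = filter_by_date(hits, 2000, 2010)
print([r["title"] for r in kept], [r["title"] for r in flagged])
```

The flagged list is exactly the "make it clear to the user what is going on" problem: the system knows these results are unverified, and the open question is how to present that honestly.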

BQ: What about new techniques for reaching “legacy” databases, like the Sitemap Protocol used by Google and other search engines?

AL: That’s used for harvesting information the way Google indexes web sites. The Sitemap Protocol is used to index information and doesn’t apply to us; search engines like Google don’t go into the databases in real time the way federated search does. Some content owners do want to expose all or some of the content behind their search forms to search engines like Google; that could include DOE OSTI’s Information Bridge and PubMed for some content, and they expose that content to Google through sitemaps. A couple of years ago, there was a lot of talk about Google statistically filling out forms to get at content behind databases. In my opinion, they’re doing this in a very haphazard manner, and that approach won’t really work.

BQ: Throughout the history of federated search, under all its different names, there have been questions and complaints about the speed of retrieving results and about incomplete results caused by a lack of rationalizing or normalizing across data sources. Comments?

AL: We’re hearing a fair amount of negative comment on federated search these days, and there have been a lot of poor implementations. For example, federated search gets blamed for being really slow, but that usually happens because most federated search systems wait until every search is complete before displaying any results to the user. We’ve pioneered incremental search results: in our version, results appear within 3-4 seconds. We display whatever results have returned while, in the background, our server is still processing and ranking the rest. At any time, the user can ask for a merge of the results they haven’t yet seen. So the user gets a good experience.
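The incremental pattern Lederman describes can be sketched with stand-in connectors (sources and timings invented; a real system queries remote databases and ranks in the background): show each source's hits as soon as that source returns, rather than waiting for the slowest.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def search(source, delay):
    """Stand-in connector: a real one would query a remote database."""
    time.sleep(delay)  # simulate network latency
    return source, [f"{source}-hit-{i}" for i in range(2)]

# Invented sources with different (simulated) response times.
sources = {"fast_db": 0.01, "medium_db": 0.05, "slow_db": 0.1}

shown = []
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(search, s, d) for s, d in sources.items()]
    # Display each source's hits in completion order, while the
    # slower sources keep working in the background.
    for future in as_completed(futures):
        source, hits = future.result()
        shown.extend(hits)
        print(f"{source} returned {len(hits)} results")
```

The user sees the fast sources' results almost immediately; merging and re-ranking the stragglers later is the price of that responsiveness, which is why the "merge what I haven't seen yet" button matters.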

BQ: If the quality of the search experience differs so much among different federated search systems, when should a client change systems?

AL: We’ve had a few successes with customers moving from one federated search service to ours. The challenge is getting customers to switch. We realize there’s a fairly significant cost in switching, but, of course, we love to see new customers. For customers getting federated search as a service, it costs less than if the product were installed on site. So that makes it more feasible to change.

BQ: In my last article about federated searching, I mentioned the new discovery services in passing. I got objections from some people about my descriptions or, indeed, equating them with federated search at all. People from ProQuest’s Serials Solutions told me that their Summon was different because they build a single giant index. Comments?

AL: There has certainly been a lot of talk about Summon. On a superficial look, Summon has a lot of positive things: it’s fast and maybe does (or has the potential to do) a better job of relevance ranking. It bothers me that it is non-transparent about a lot of things; maybe customers can learn more about what’s in it. Dartmouth did a fairly extensive report on Summon after over a year of working with it. The review was fairly mixed: lots of positives and comments that it looks really nice, with lots of bells and whistles for limiting searches to peer-reviewed, full-text-available, or library-owned and licensed content. But beneath the surface, a lot is missing. It’s lagging behind on indexing; we can do things quicker than Summon. I’ve heard about long implementation times for libraries trying to get their own content into Summon. In federated searching, it only takes us a day or two to add someone’s catalog into the mix, and if they have other internal databases, we can add them much quicker.

BQ: Thanks, Abe. And lots of luck in the future.

Related Links

Deep Web Technologies – www.deepwebtech.com
Scitopia – www.scitopia.org
Environmental Science Network (ESNetwork) – www.osti.gov/esn

About the Author

Barbara Quint of Unlimited Priorities is editor-in-chief of Searcher: The Magazine for Database Professionals. She also writes the “Up Front with bq” column in Information Today, as well as frequent NewsBreaks on Infotoday.com.


Delores Meglio and the Information Generations

Interviewed by Marydee Ojala in ONLINE: Exploring Technology & Resources for Information Professionals

Delores Meglio is a survivor, an online information industry survivor. She was profiled on the pages of ONLINE two decades ago (“ONLINE Interviews Delores Meglio of Information Access Company,” by Jeffery K. Pemberton, July 1987, pp. 17-24). Characterized then as an online pioneer, she hasn’t ceased her pioneering activities in the intervening years. Given all the twists and turns in the information industry, she’s seen a lot of changes. “I’ve been through generations in the development of the information industry,” she told me in May 2009, when we sat down to reminisce and look forward during the Enterprise Search Summit East conference.

Meglio is now Vice President, Publisher Relations for Knovel. But she can trace her career back to 1963, when she was hired as a serials assistant in Bell Labs’ technical library, checking in physical copies of magazines. She then moved to a records management position at the New York Port Authority, subject classifying corporate documents. From there it was records management at NBC.

Where Meglio really got in on the ground floor of the then-nascent “computerized information business” (as she phrased it in 1987) was when she joined The New York Times Company as an abstractor in 1969. Progressing through the ranks, she became managing editor of the New York Times Information Bank—which still exists in the LexisNexis NEWS Library as the INFOBK File. That was before full text was widely available; the Information Bank held abstracts of articles that appeared in The New York Times.

Queen of Full Text

In 1983, Meglio moved to California to join Information Access Company (IAC), where its president, Morris (Morry) Goldstein, dubbed her “the queen of full text.” It was the early 1980s when the move from abstracted and indexed electronic sources to full text intensified. That was a technological generation shift. Bandwidth continued to expand, allowing for greater storage capability. Hence, information industry companies added more and more full text.

Authority control was on Meglio’s radar at that time. IAC’s proprietary system enabled automated corrections: if an indexer entered an incorrect subject descriptor, the system would either automatically replace it with the correct descriptor or toss it back to the indexer if there was no obvious thesaurus term. Many of today’s systems use similar automation, although on more modern computers. In fact, during Meglio’s tenure in California, she migrated production from proprietary systems to client-server technology.

IAC was acquired by Ziff Davis in 1980, and then sold to Thomson in 1994. In 1998, Thomson merged IAC, Gale Research, and Primary Media to form Gale Group, headquartered in Ann Arbor. That entity is now owned by Cengage.

Mobile Culture

Meglio decided to head back East in the late 1990s, leaving IAC with the title Senior Vice President, Content Development Division. She devoted her time après IAC to smaller, privately held companies that were creating technologically advanced electronic products aimed not at the library market but at a more general audience. One, for a health website, resulted in a joint venture with Henry Schein, the largest distributor of medical products in the U.S.

She still delights in a cultural database she built that covered all forms of the arts—museums, dance, opera, symphonies, theatre—as events that could be put on travelers’ mobile devices and sold into hotels to inform people as to what they could do when they weren’t in business meetings. Not only did Meglio license data, create a web-based production system, arrange for museum updates and cultural feeds, and identify data extraction software, she negotiated with National Geographic to put thumbnails of its photos with the database. It was an example of an early adoption of mobile technology and the marriage of traditional databases with new markets. Yet another generation of the evolving information industry.

Next Generation Full Text

In 2003, a former IAC executive approached Meglio on behalf of Knovel, a producer of full text technical reference materials for applied science and engineering. Knovel needed someone who understood electronic information, databases, and licensing. Hired by CEO Chris Forbes, Meglio quickly moved into a new area for her—science and technology—and another generation of full text information, one that enabled searching across both textual and numeric information. Numbers are critical to the research done by engineers, scientists, and others with a technical bent.

The data Meglio licenses is both from major publishers of reference books and from associations. At the moment, Knovel provides access to almost 2,000 reference works from over 40 international publishers. “We look for publishers with specialty content, some of whom have never before licensed their content for electronic distribution. It takes time to explain the benefits to them.”

Upon joining Knovel, Meglio found it a challenge to explain what the company did, particularly to the publishers whose data she wanted to license. Knovel isn’t a static publisher. Subscribers can not only search for scientific and technical data, they can manipulate that data, change property values, perform calculations, display it the way they want to see it, and alter their suppositions. Knovel isn’t a typical ebook publisher. It’s a different model. It has interactive tables where searchers can show or hide rows and columns, move them around, and download to an Excel spreadsheet. Knovel’s equation plotter lets searchers pick the values that interest them and export that to Excel as well. The interactive nature of Knovel “makes data come alive,” explained Meglio.

Knovel Novelty

Early on, the novelty of Knovel became a selling point. Meglio recalls visiting an association publisher for her first licensing assignment. Settling down to demonstrate the system, she admitted she was no expert in engineering, the association’s major focus, but did understand information. As she showed the interactive tools, the publisher became very intrigued. “Can I sit there?” he asked and took over the keyboard from Meglio. Completely engaged, he was delighted to find answers to his questions. “The system just sold itself,” grinned Meglio. “It’s the right data coupled with the right software tools.”

She also found that Knovel’s customers are very detail-oriented. They need precision, and they don’t want to be referred to a source; they want the answer delivered to them. That’s a very different setting from what she experienced at the Information Bank or IAC. “Abstract and index databases indicate where to find information,” she said. “With Knovel, we give you the information you need.” Harking back to her A&I days, Meglio acknowledges that abstracts are an excellent avenue to an overview, particularly when the topic is a new area for the researcher, and that they suit executives who lack the time for an in-depth review. Generally speaking, that’s not Knovel’s core audience.

Meglio is struck by how professionals in different scientific disciplines and companies take diverse approaches to research. Knovel’s customers come mainly from the corporate world—chemical, oil and gas, aerospace, pharmaceutical, civil engineering, construction, manufacturing, and food science companies—though about 20% are academic and 10% government. Knovel spans some 20 different scientific subject areas, leading to some interesting cross-disciplinary discoveries. She cites the example of a mechanical engineer who found the answer to his problem in a food science text, not the first place he would have thought to look. Although open access has received much attention, she’s finding no pushback against Knovel’s information, probably because open access concentrates on journal articles while Knovel supplies reference data from handbooks, encyclopedias, manuals, and other reference books, as well as databases.

Full Text Czarina

If Meglio was considered the queen of full text in the 1980s, I think she must now be a czarina. As the information industry shifted to web delivery, the possibilities of full text expanded. No longer limited by bandwidth or storage constraints, full text not only expanded in quantity but in definition. Full text is no longer merely text. It’s numbers, images, maps, charts, graphs, tables, even formulae, equations, and computer source code.

Some things remain the same. Meglio says, “It’s no joke what happens behind the scenes.” The data she licenses does not simply appear on the screen the day after the contract is signed. It must be massaged to fit the Knovel software and be searchable in aggregations with the other sources Knovel offers. And then there’s people. “Creating products, whatever generation we’re talking about, still requires people. Knovel has a robust taxonomy, but human beings are needed to oversee it.”

Meglio also reminds me that some questions regarding full text haven’t changed. “What is the real cost? Who are you trying to reach with your data? Who are you reaching?” When talking with publishers, she needs to reinforce the value of aggregation, that they benefit from being associated with other publishers.

Still at the Infancy Stage

With four decades in the information industry and more than four generations of information products, Delores Meglio has seen business models come and go. She’s watched technology improvements and seen some technologies that never gained traction. Innovation, in her opinion, isn’t ending. The information industry is still in its infancy.

The two trends she’s looking at? “I’m really excited about the integration of technology and content. It’s no longer about mounting a source; it’s embedding that information with software that allows searchers to perform interactive activities.” Looking further ahead, it’s the new technologies, particularly 3-dimensional ones, that let searchers visualize answers in 3D and create things that couldn’t have been done in the past. Having survived the information industry’s first infancy, Meglio is looking forward to many more generations of information.

About the Author

Marydee Ojala (Marydee@xmission.com) edits ONLINE, a journal that spans almost as many information generations as Delores Meglio (dmeglio@knovel.com).


Saving Time and Money with a Rule Based Approach to Automatic Indexing

Written by Marjorie M.K. Hlava
President, Access Innovations, Inc.

Getting a Higher Return on Your Investment

There are two major types of automatic categorization systems, known by many different names. The information science theory behind them, however, boils down to two major schools of thought: rule based and statistics based.

Companies advocating the statistical approach hold that editorially maintained rule bases take a lot of up-front investment and cost more overall. They also claim that statistics based systems are more accurate. On the other hand, statistics based systems require a training set up front and are not designed to allow editorial refinement for greater accuracy.

A case study may be the best way to see what the true story is. We did such a study, to answer the following questions:

  • What are the real up-front costs of rule based and training set based systems?
  • Which approach takes more up-front investment?
  • Which is faster to implement?
  • Which has a higher accuracy level?
  • What is the true cost of the system with the additional cost of creating the rule base or collecting the training set?

To answer these questions, we’ll look at how each system works and then the costs of actual implementation in a real project for side-by-side comparison.

First a couple of assumptions and guidelines:

  1. There is an existing thesaurus or controlled vocabulary of about 6000 terms. If not, then the cost of thesaurus creation needs to be added.
  2. Hourly rates and units per hour are based on field experience and industry rules of thumb.
  3. 85% accuracy is the baseline needed for implementation to save personnel time.

The Rule Based Approach

A simple rule base (a set of rules matching vocabulary terms and their synonyms) is created automatically for each term in the controlled vocabulary (thesaurus, taxonomy, etc.). With an existing, well-formed thesaurus or authority file, this is a two-hour process. Rules for both synonym and preferred terms are generated automatically.

Complex rules are needed for an average of about 10% to 20% of the terms in the vocabulary, and they are created at a rate of 4 – 6 per hour. So for a 6000 term thesaurus, creating 600 complex rules at 6 per hour takes about 100 hours, or 2.5 person-weeks. Some people begin indexing with the software immediately to get baseline statistics and then do the rule building. Accuracy (as compared with the indexing terms a skilled indexer would select) is usually 60% with just the simple rule base and 85 – 92% with the complex rule base.
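The distinction between simple and complex rules can be sketched in miniature. The vocabulary, synonyms, and disambiguation conditions below are invented for illustration, not taken from any actual rule base: simple rules map a term or synonym straight to its preferred descriptor, while a complex rule adds a condition, here a required co-occurring word that resolves an ambiguous term.

```python
# Simple rules: surface form -> preferred descriptor, generated
# automatically from a thesaurus's preferred and synonym terms.
simple_rules = {
    "automobile": "Automobiles",
    "car": "Automobiles",
    "indexing": "Indexing",
}

# Complex rules: surface form -> (required context word, descriptor),
# the hand-built 10-20% that resolves ambiguous terms.
complex_rules = {
    "mercury": [("planet", "Mercury (planet)"),
                ("thermometer", "Mercury (metal)")],
}

def index(text):
    """Return the sorted set of descriptors the rules assign to text."""
    words = set(text.lower().split())
    terms = {simple_rules[w] for w in words if w in simple_rules}
    for word, conditions in complex_rules.items():
        if word in words:
            terms.update(d for ctx, d in conditions if ctx in words)
    return sorted(terms)

print(index("The car passed a thermometer full of mercury"))
```

The editorial appeal is visible even at this scale: when the system mis-indexes a document, an editor can read the offending rule and fix it directly, which is exactly the refinement path the statistical approach lacks.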

The rule-based approach places no limit on the number of users, the number of terms in the taxonomy, or the number of taxonomies held on a server.

The software is completed and shipped via CD-ROM the day the purchase is made. For the client’s convenience, if the data is available in one of three standard formats (tab- or comma-delimited, XML, or left-tagged ASCII), it can be preloaded into the system. Otherwise a short conversion script is applied.

On the average, customers are up and running one month after the contract is complete.

The client for whom we prepared the rule base reported 92% accuracy in totally automated indexing and a four-fold increase in productivity.

The up-front time and dollar investment, based on the workflow for the full implementation, is as follows:

Table 1


The Statistical Approach – Training Set Solution

To analyze the statistical approach, which requires use of a training set up front, we used the same pre-existing 6000-term thesaurus.

The cost of the software usually starts at about $75,000 and ranges up to $250,000 or more; we will use the lower figure. (Some systems limit the number of terms allowed in a taxonomy, requiring an extra license or secondary file building.) Training and support are an additional expense of about $2,000 per day, and one week of training is usually required ($10,000). Travel expenses may be added.

The up-front time and dollar investment, based on the workflow for implementing a statistical (Bayesian, co-occurrence, etc.) system, is as follows:

Table 2


A two-fold productivity increase was noted using the system. Accuracy has not exceeded 72% to date.


The table that follows compares the return on investment for the rule based system and the statistics based system in terms of total cost and time to implementation.

Table 3


It is apparent that considerable savings in both time and money can be gained by using a rule-based system instead of a statistics-based system: a factor of almost seven, based on the assumptions outlined above.

About Access Innovations

Access Innovations, Inc. is a software and services company founded in 1978. It operates under the stewardship of the firm’s principals, Marjorie M.K. Hlava, President and Jay Ven Eman, CEO.

Closely held and financed by organic growth and retained earnings, the company has three main components: a robust services division, the Data Harmony software line, and the National Information Center for Educational Media (NICEM).

Comments { 0 }

Automatic Indexing: A Matter of Degree

Marjorie M.K. Hlava


Picture yourself standing at the base of that metaphorical range, the Information Mountains, trailhead signs pointing this way and that: Taxonomy, Automatic Classification, Categorization, Content Management, Portal Management. The e-buzz of e-biz has promised easy access to any destination along one or more of these trails, but which ones? The map in your hand seems to bear little relationship to the paths or the choices before you. Who made those signs?

In general, it’s been those venture-funded systems and their followers, the knowledge management people and the taxonomy people. Knowledge management people are not using the outlines of knowledge that already exist. Taxonomy people think you need only a three-level, uncontrolled term list to manage a corporate intranet, and they generally ignore the available body of knowledge that encompasses thesaurus construction. Metadata followers are unaware of the standards and corpus of information surrounding indexing protocols, including back-of-the-book, online and traditional library cataloging. The bodies of literature are distinct with very little crossover. Librarians and information scientists are only beginning to be discovered by these groups. Frustrating? Yes. But if we want to get beyond that, we need to learn — and perhaps painfully, embrace — the new lingo. More importantly, it is imperative for each group to become aware of the other’s disciplines, standards and needs.

We failed to keep up. It would be interesting to try to determine why and where we were left behind. The marketing hype of Silicon Valley, the advent of the Internet, the push of the dot com era and the entry of computational linguists and artificial intelligence to the realm of information and library science have all played a role. But that is another article.


The current challenge is to understand, in your own terms, what automatic indexing systems really do and whether you can use them with your own information collection. How should they be applied? What are the strengths and weaknesses? How do you know if they really work? How expensive will they be to implement? We’ll respond to these questions later on, but first, let’s start with a few terms and definitions that are related to the indexing systems that you might hear or read about.

These definitions are patterned after the forthcoming revision of the British National Standard for Thesauri, but do not exactly replicate that work. (Apologies to the formal definition creators; their list is more complete and excellent.)

Document — Any item, printed or otherwise, that is amenable to cataloging and indexing, sometimes known as the target text, even when the target is non-print.
Content Management System (CMS) — Typically, a combination management and delivery application for handling creation, modification and removal of information resources from an organized repository; includes tools for publishing, format management, revision control, indexing, search and retrieval.
Knowledge Domain — A specially linked data-structuring paradigm based on a concept of separating structure and content; a discrete body of related concepts structured hierarchically.
Categorization — The process of indexing to the top levels of a hierarchical or taxonomic view of a thesaurus.
Classification — The grouping of like things and the separation of unlike things, and the arrangement of groups in a logical and helpful sequence.
Facet — A grouping of concepts of the same inherent type, e.g., activities, disciplines, people, natural objects, materials, places, times, etc.
Sub Facet — A group of sibling terms (and their narrower terms) within a facet having mutually exclusive values of some named characteristic.
Node — A sub-facet indicator.
Indexing — The intellectual analysis of the subject matter of a document to identify the concepts represented in the document and the allocation of descriptors to allow these concepts to be retrieved.
Descriptor — A term used consistently when indexing to represent a given concept, preferably in the form of a noun or noun phrase, sometimes known as the preferred term, the keyword or index term. This may (or may not) imply a “controlled vocabulary.”
Keyword — A synonym for descriptor or index term.
Ontology — A view of a domain hierarchy, the similarity of relationships and their interaction among concepts. An ontology does not define the vocabulary or the way in which it is to be assigned. It illustrates the concepts and their relationships so that the user more easily understands its coverage. According to Stanford’s Tom Gruber, “In the context of knowledge sharing…the term ontology…mean(s) a specification of a conceptualization. That is, an ontology is a description (like a formal specification of a program) of the concepts and relationships that can exist for an agent or a community of agents.”
Taxonomy — Generally, the hierarchical view of a set of controlled vocabulary terms. Classically, taxonomy (from Greek taxis meaning arrangement or division and nomos meaning law) is the science of classification according to a pre-determined system, with the resulting catalog used to provide a conceptual framework for discussion, analysis or information retrieval. In Web portal design, taxonomies are often created to describe categories and subcategories of topics found on a website.
Thesaurus — A controlled vocabulary wherein concepts are represented by descriptors, formally organized so that paradigmatic relationships between the concepts are made explicit, and the descriptors are accompanied by lead-in entries. The purpose of a thesaurus is to guide both the indexer and the searcher to select the same descriptor or combination of descriptors to represent a given subject. A thesaurus usually allows both an alphabetic and a hierarchical (taxonomic) view of its contents. ISO 2788 gives us two definitions for thesaurus: (1) “The vocabulary of a controlled indexing language, formally organized so that the a priori relationships between concepts (for example as ‘broader’ and ‘narrower’) are made explicit” and (2) “A controlled set of terms selected from natural language and used to represent, in abstract form, the subjects of documents.”

Are these old words with clearly defined meanings? No. They are old words dressed in new definitions and with new applications. They mean very different things to different groups. People using the same words but with different understandings of their meanings have some very interesting conversations in which no real knowledge is transferred. Each party believes communication is taking place when, in actuality, they are discussing and understanding different things. Recalling Abbott and Costello’s Who’s on First? routine, a conversation of this type could be the basis for a great comedy routine (SIG/CON perhaps), if it weren’t so frustrating — and so important. We need a translator.

For example, consider the word index. To a librarian, an index is a compilation of references grouped by topic, available in print or online. To a computer science person (that would be IT today), it would refer to the inverted index used to do quick look-ups in a computer software program. To an online searcher, the word would refer to the index terms applied to the individual documents in a database that make it easy to retrieve by subject area. To a publisher, it means the access tool in the back of the book listed by subject and sub-subject area with a page reference to the main book text. Who is right? All of them are correct within their own communities.
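The IT sense of the word is easy to make concrete. Below is a minimal, generic inverted index, an illustrative sketch only, not tied to any product discussed here:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text} -> {word: sorted list of doc_ids containing it}"""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

docs = {1: "plasma physics research", 2: "blood plasma donation"}
index = build_inverted_index(docs)
print(index["plasma"])  # -> [1, 2]: both documents mention the word
```

A search engine looks words up in this structure rather than scanning every document, which is why the IT community and the library community mean such different things by "index."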

Returning to the degrees of application for these systems and when to use one, we need to address each question separately.

What Systems Are There?

What are the differences among the systems for automatic classification, indexing and categorization? The primary theories behind the systems are:

  • Boolean rule base variations including keyword or matching rules
  • Probability of application statistics (Bayesian statistics)
  • Co-occurrence models
  • Natural language systems

New dissertations will bring forth new theories that may or may not fit into this grouping.

How Should They Be Applied?

Application is achieved in two steps. First, the system is trained in the specific subject or vertical area. In rule-based systems this is accomplished by (1) selecting the approved list of keywords to be used and, through matching and synonyms, building simple rules and (2) employing phraseological, grammatical, syntactic, semantic, usage, proximity, location, capitalization and other algorithms, depending on the system, to build complex rules. This means that, frequently, the rules match keywords to synonyms or to word combinations using Boolean statements in order to capture the appropriate indexing from the target text.

In Bayesian engines the system first selects the approved list of keywords to be used for training. The system is trained using the approved keywords against a set of documents, usually about 50 to 60 documents (records, stories). This creates scenarios for word occurrence based on the words in the training documents and how often they occur in conjunction with the approved words for that item. Some systems use a combination of Boolean and Bayesian to achieve the final indexing results.
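A toy version of the training-set approach can be sketched as follows. This is a bare co-occurrence scorer, far simpler than any commercial Bayesian engine; the training data and function names are invented for illustration:

```python
from collections import Counter, defaultdict

def train(training_docs):
    """training_docs: list of (text, approved keyword) pairs.
    Returns, per keyword, counts of the words that co-occurred with it."""
    model = defaultdict(Counter)
    for text, keyword in training_docs:
        model[keyword].update(text.lower().split())
    return model

def best_keyword(text, model):
    """Rank keywords by how often the document's words co-occurred with
    each keyword in the training set; return the top-scoring keyword."""
    words = set(text.lower().split())
    totals = {kw: sum(counts[w] for w in words) for kw, counts in model.items()}
    return max(totals, key=totals.get)

training = [
    ("tokamak confinement of hot plasma", "Plasma physics"),
    ("magnetic fields in fusion plasma", "Plasma physics"),
    ("blood plasma and red cells", "Hematology"),
    ("donating blood at the clinic", "Hematology"),
]
model = train(training)
print(best_keyword("fusion experiments with plasma", model))  # -> Plasma physics
```

Real engines weight these counts probabilistically and use training sets of 50 or more documents per term, but the principle is the same: words seen alongside an approved keyword during training vote for that keyword at indexing time.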

Natural language systems base their application on the parts of speech and the nature of language usage. Language is used differently in different applications. Think of the word plasma. It has very different meanings in medicine and in physics, although the word has the same spelling and pronunciation, not to mention etymology. Therefore, the contextual usage is what informs the application.

In all cases it is clear that a taxonomy or thesaurus or classification system needs to be chosen before work can begin. The resulting keyword metadata sets depend on a strong word list to start with — regardless of the name and format that may be given to that word list.

What Are the Strengths and Weaknesses?

The main weakness of these systems, compared to human indexing, is the frequency of what are called false drops: keywords that fit the computer model but do not make sense in actual use. These terms are considered noise in the system and in application, and systems work to reduce the level of noise.

The measure of the accuracy of a system is based on

  • Hits — exact matches to the terms a human indexer would have applied
  • Misses — the keywords a human would have selected that a computerized system did not
  • Noise — keywords selected by the computer that a human would not have selected

The statistical ratios of Hits, Misses and Noise are the measure of how good the system is. The cut-off should be at 85% Hits relative to fully accurate human indexing, which means that Misses and Noise combined need to be less than 15%.
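These three measures are straightforward to compute once you have a human-indexed benchmark set. A minimal sketch follows; the exact ratio definitions vary by vendor, and here the hit rate is taken against the human term set:

```python
def accuracy_stats(human_terms, machine_terms):
    """Compare machine-assigned keywords against a human indexer's choices."""
    human, machine = set(human_terms), set(machine_terms)
    hits = human & machine       # both selected
    misses = human - machine     # human selected, machine did not
    noise = machine - human      # machine selected, human would not have
    hit_rate = len(hits) / len(human) if human else 0.0
    return {"hits": len(hits), "misses": len(misses),
            "noise": len(noise), "hit_rate": hit_rate}

stats = accuracy_stats(["plasma", "fusion", "tokamak"],   # human indexing
                       ["plasma", "fusion", "magnet"])    # machine indexing
print(stats)  # 2 hits, 1 miss, 1 noise term; hit rate about 0.67
```

Run over a vetted set of records, these counts give exactly the feedback loop described later for refining a rule base.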

A good system will provide an accuracy rate of 60% initially from a good foundation keyword list and 85% or better with training or rule building. This means that there is still a margin of error expected and that the system needs — and improves with — human review.

Perceived economic or workflow impacts often render full human review unacceptable, leading to the attempt to provide some form of automated indexing. Reducing the need for human indexers is addressed in a couple of ways. On the one hand, suppose that the keyword list is hierarchical (the taxonomy view) and goes to very deep levels in some subject areas, perhaps 13 levels. A term can be analyzed and applied only at the final level, so its use is precise and plugged into a narrow application.

On the other hand, it may also be “rolled up” to ever-broader terms until only the first three levels of the hierarchy are used. This second approach is preferred in the web-click environment, where popular thinking (and some mouse-behavior research) indicates that users get bored at three clicks and will not go deeper into the hierarchy anyway.
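The choice between deep application and roll-up amounts to deciding how much of a term's hierarchical path to keep. A minimal sketch, with an invented taxonomy path:

```python
def roll_up(term_path, max_depth=3):
    """Truncate a deep taxonomy path to its top levels and return the
    narrowest term still allowed at that depth."""
    return term_path[:max_depth][-1]

path = ["Science", "Physics", "Plasma physics",
        "Magnetic confinement", "Tokamaks"]
print(roll_up(path))      # three-click web view -> Plasma physics
print(roll_up(path, 5))   # full depth, precise indexing -> Tokamaks
```

The same automatic term assignment thus serves both the shallow bucketing favored on websites and the deep, precise indexing a research database needs; only the depth parameter changes.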

These two options make it possible to use any of the three types of systems for very quick and fully automatic bucketing or filtering of target data for general placement on the website or on an intranet. Achieving deeper indexing and precise application of keywords still requires human intervention, at least by review, in all systems. The decision then becomes how precisely and deeply you will develop the indexing for the system application and the user group you have in mind.

How Do We Know If They Really Work?

You can talk with people who have tried to implement these systems, but you might find that (1) many are understandably reluctant to admit failure of their chosen system and (2) many are cautiously quiet around issues of liability, because of internal politics or for other reasons. You can review articles, white papers and analyst reports, but keep in mind that these may be biased toward the person or company who paid for the work. A better method is to contact users on the vendor’s customer list and speak to them without the vendor present. Another excellent method is to visit a couple of working implementations so that you can see them in action and ask questions about the system’s pluses and minuses.

The best method of all is to arrange for a paid pilot. In this situation you pay to have a small section of your taxonomy and text processed through the system. This permits you to analyze the quality and quantity of real output against real and representative input.

How Expensive Will They Be to Implement?

We have looked at three types of systems. Each starts with a controlled vocabulary, which could be a taxonomy or thesaurus, with or without accompanying authority files. Obviously you must already have, or be ready to acquire or build, one of these lists to start the process. You cannot measure the output if you don’t have a measure of quality. That measure should be the application of the selected keywords to the target text.

Once you have chosen the vocabulary, the road divides. In a rule-based, or keyword, system the simple rules are built automatically from the list as match and synonym rules, that is, "See XYZ, Use XYZ." The complex rules are partially programmatic and partially written by human editors/indexers. The building process averages 4 to 10 complex rules per hour. Deciding which rules to build is based on running the simple rule base against the target text. If that text is a vetted set of records — already indexed and reviewed to assure good indexing — statistics can be calculated automatically. With the Hit, Miss and Noise statistics in hand, the rule builders use them as a continual learning tool for further building and refinement of the complex rule base. Generally 10% to 20% of terms need a complex rule. If the taxonomy has 1000 keyword terms, the simple rules are made programmatically and the complex rules — 100 to 200 of them — would be built in 10 to 50 hours. The result is a rule base (also called a knowledge extractor or concept extractor) to run against target text.

Bayesian, inference and co-occurrence categorization systems depend on the gathering of training-set documents. For each node (keyword term) in the taxonomy, documents that represent that term are collected. The usual number of documents to collect for training is 50; some systems require more, some less. Collecting the documents may take an hour or more per term: gathering them, reviewing them to confirm they actually represent the term, and converting them to the input format of the categorization system. Once all the training sets are collected, a large processing run finds the logical connections between terms within a document and within a set of documents. This returns the probability that a set of words is relevant to a particular keyword term. The term is then assigned to other similar documents based on the statistical likelihood that it is the correct one (according to the system's findings on the training set). The result is a probability engine ready to run against a new set of target text.

A natural language system is trained on the parts of speech and term usage and builds a domain for the specific area of knowledge to be covered. Generally, each term is analyzed via seven methods:

  • Morphological (term form — number, tense, etc.)
  • Lexical analysis (part of speech tagging)
  • Syntactic (noun phrase identification, proper name boundaries)
  • Numerical conceptual boundaries
  • Phraseological (discourse analysis, text structure identification)
  • Semantic analysis (proper name concept categorization, numeric concept categorization, semantic relation extraction)
  • Pragmatic (common sense reasoning for the usage of the term, such as cause and effect relationships, i.e., nurse and nursing)

This is quite a lot of work, and it may take up to four hours to define a single term fully in all its aspects. Here again some programmatic options exist, as well as base semantic nets, which are available either as part of the system or from other sources. WordNet is a large lexical database heavily used by this community for the creation of natural language systems. And, for a domain containing 3,000,000 rules of thumb and 300,000 concepts (based on a calculus of common sense), visit the CYC Knowledge Base. These will supply a domain ready to run against your target text. For standards evolving in this area, take a look at the Rosetta site on the Internet.
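Two of the seven analysis steps, morphological normalization and syntactic noun-phrase identification, can be illustrated with toy code. Real systems rely on full lexicons such as WordNet; the suffix list and tag set below are invented stand-ins:

```python
# Toy illustrations of two natural-language analysis steps.

SUFFIXES = ("ing", "es", "s", "ed")

def morphological_root(word):
    """Strip one common English inflectional suffix (a crude stemmer)."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def noun_phrases(tagged):
    """tagged: list of (word, pos) pairs; collect runs of adjectives + nouns."""
    phrases, current = [], []
    for word, pos in tagged:
        if pos in ("ADJ", "NOUN"):
            current.append(word)
        elif current:
            phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases

print(morphological_root("indexes"))    # -> index
print(morphological_root("indexing"))   # -> index
print(noun_phrases([("hot", "ADJ"), ("plasma", "NOUN"), ("is", "VERB"),
                    ("confined", "VERB")]))  # -> ['hot plasma']
```

Even these crude versions show why the full analysis is labor-intensive: each of the seven methods needs term-by-term linguistic knowledge that simple string rules cannot supply.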


There are real and reasonable differences in deciding how a literal world of data, knowledge or content should be organized. In simple terms, it’s about how to shorten the distance between questions from humans and answers from systems. Purveyors of various systems maneuver to occupy or invent the standards high ground and to capture the attention of the marketplace, often bringing ambiguity to the discussion of process and confusion to the debate over performance. The processes are complex and performance claims require scrutiny against an equal standard. Part of the grand mission of rendering order out of chaos is to bring clarity and precision to the language of our deliberations. Failure to keep up is failure to engage, and such failure is not an option.

We have investigated three major methodologies used in the automatic and semi-automatic classification of text. In practice, many systems use a mixture of these methods to achieve the desired result. Most systems require a taxonomy in order to start, and most tag the target text with keyword terms from the taxonomy, storing them as metadata in a keyword element or in other fields.

Access Innovations for Document abstracting and indexing • Document conversion • Business Taxonomies • Machine Aided Indexing
All rights reserved. Copyright © 2006 Access Innovations, Inc.

Comments { 0 }