
Conference Buzz: Discovery Systems at Internet Librarian

Written for Unlimited Priorities and DCLnews Blog.

A major topic at current conferences is “discovery,” and this was certainly true at the recent Internet Librarian (IL) conference in Monterey, CA on October 25-27. So what is discovery and how does it differ from other forms of search?

Information users are trying to access more and more types of information, both in locally stored databases such as library catalogs and in external commercial systems, and increasingly they want to do it with a single Google-like search box. The problem is that content is often siloed in many different databases and comes in a wide variety of formats. Users therefore become frustrated in finding the information they need because they do not know where the information resides, what the database is called, or how to access it.

The first attempts to help users took the form of “federated search,” in which the user’s search query was presented to several databases in turn, and then the results were aggregated and presented in a single set. The problem with federated search was that the user had to select the databases to be searched, and then had to wait for the sequential searches to be performed and processed, which could result in long response times.

In current discovery systems, a unified index of terms from a wide variety of databases is constructed by the system, and the queries are processed against it. Several discovery systems can aggregate the results and remove duplicate items. This approach has significant advantages over federated search systems:

  • The user does not need to know anything about the databases being searched,
  • Because only one search is performed, searches are faster, and
  • The same interface can be used to search for information from all databases.
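The contrast described above can be sketched in a few lines of code. This is a toy illustration only, not any vendor's implementation: the "databases," record IDs, and text are all invented. A federated search queries each silo in turn, while a discovery system pre-harvests everything into one unified inverted index and answers queries with a single lookup.

```python
# Toy contrast between federated search and a discovery-style unified index.
# The "databases" and records below are invented for illustration only.

catalog = {"b1": "digital library discovery systems"}
journals = {"j1": "federated search performance", "j2": "discovery systems review"}

def federated_search(term, databases):
    """Query each silo in turn, then merge -- response time grows with each source."""
    results = []
    for db in databases:
        results.extend(rid for rid, text in db.items() if term in text)
    return results

def build_unified_index(databases):
    """Pre-harvest every record into one inverted index: word -> set of record ids."""
    index = {}
    for db in databases:
        for rid, text in db.items():
            for word in text.split():
                index.setdefault(word, set()).add(rid)
    return index

index = build_unified_index([catalog, journals])

print(federated_search("discovery", [catalog, journals]))  # walks every silo
print(sorted(index.get("discovery", set())))               # one lookup: ['b1', 'j2']
```

The index is built once, ahead of time, which is why a discovery search is fast and silo-agnostic at query time; the trade-off is the up-front harvesting and customization effort described below.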

A major issue with discovery systems is that considerable effort (and time!) must be expended in installing a system in an organization and customizing it to access only those systems to which the organization has access.

Activity in discovery systems is currently intense, and several competitors are vying for market share. Examples are Summon from ProQuest, EBSCO Discovery Service, WorldCat Local from OCLC, and Primo from Ex Libris. One might ask whether discovery systems are better than Google Scholar. One panel at IL looked at this question, and the answer is that they appear to be, though further investigation is needed. And because these systems are newly developed, problems with installation and the relevance of results have not all been solved yet. But it is clear that discovery is an exciting new advance in searching, and we can expect to see further advances coming rapidly in the near future.


The 2010 Digital Forest: Any Data, Any Time, Any Place

Written for Unlimited Priorities and DCLnews Blog

Richard Oppenheim

As 2010 fades away, let’s take a short look at the technology products and services that will continue to impact our business and personal activities. With marketing, both professional and viral, in full work mode, it is likely that you have already heard about most of these 2010 developments. Every development is built on what has been engineered over the past decade. This does not mean that there is nothing new; rather, it means we are using this stuff a whole lot more.

A lot of what happened this year is an evolution of developments from prior years: 3D movies are hot, 3D television is beginning, smartphones are getting smarter, the Apps flood continues, e-books are outselling paper books, e-book readers and software are growing, YouTube and Facebook are having substantial user growth, Wi-Fi hotspots are everywhere and new gadgets for everyone from babies to seniors are in stores and online.

It started with sortable punched cards, moved to mainframe computers, and transformed electronics with the Bowmar calculator and the digital watch. This year, the growth of everything digital can be seen, heard and transmitted from any place to any place. Pictures from the Cassini spacecraft orbiting Saturn, 2 billion miles into its journey, are fascinating, a very long-distance photo op. Mars is a little closer: the rovers were supposed to work for three months, and they are in their sixth year. Digital photography, with its instant access, is changing the landscape of entertainment, news reporting and information sharing.

It is the new age of the candid camera, as scientists, explorers, e-reporters, movie makers, amateur videographers, family photographers, musicians, students, teachers and lots of others share digital pictures from around the globe and far out into space. Digital is also impacting what we do with what we say: instant messaging, text messaging, and email are replacing a lot of voice-to-voice communication.

In the second quarter of 2010, the Nielsen Company analyzed mobile usage data for teens in the United States. American teens may not be texting all the time; however, this survey discovered that, on average, they send or receive 3,339 texts a month, more than six per waking hour, an 8% jump from last year.

The survey also showed that a few years ago, the major reason for getting a cell phone was security. Today, 43% claim texting is their primary reason for getting one. For young users, texting is a lot faster than voice calls. Voice use rises and peaks at age 24, and only adults over 55 talk less than teens. Teen females, who are more social with their phones, average about 753 voice minutes per month, while males use around 525 minutes.

The volume of images, videos, audio and documents is growing at an ever-increasing rate.

  • People are watching 2 billion videos a day on YouTube and uploading hundreds of thousands of videos daily. In fact, every minute, 24 hours of video is uploaded to YouTube.
  • Nielsen reports that YouTube is pushing out about 1.2 billion streams every day!
  • Family and business pictures are uploaded to web based albums for viewing and printing.
  • PDF files attached to emails are replacing fax transmissions.
  • On-line storage is used for sharing information from calendars to reports.
  • Cloud/web based applications support business applications.
  • E-commerce means that companies can display their inventory on-line and sell direct.
  • Maps and directions display roads and traffic up to the minute.
  • Music libraries have replaced turntables for vinyl records and CDs.
  • On October 6, 2010, Twitter announced that it was processing more than 86 million tweets each day:
      • over 1,000 tweets per second (TPS)
      • 12,000 queries per second (QPS)
      • over 1 billion queries per day


The volume of content is growing exponentially. According to UC San Diego, every day each of us consciously or unconsciously processes more than 34 GB of information that passes before our eyes. Whether it’s in the clouds, in your office, on your home server or in your portable whatever-device, the sound of data traffic grows louder. The need for ever larger data file cabinets is obvious.

Data Storage Disk

The old file storage supplies – letter and legal folders, albums, 3-ring binders, red ties, and file folders – were fine for paper. The new digital data volumes need new data storage. This table shows the terms for measuring data volume.

Currently, new computers have internal hard drives with a minimum capacity of 160 GB. External hard drives with a capacity of 1 terabyte are now under $100.

1 Bit=Binary Digit
8 Bits=1 Byte
1000 Bytes=1 Kilobyte
1000 Kilobytes=1 Megabyte
1000 Megabytes=1 Gigabyte
1000 Gigabytes=1 Terabyte
1000 Terabytes=1 Petabyte
1000 Petabytes=1 Exabyte
1000 Exabytes=1 Zettabyte
1000 Zettabytes=1 Yottabyte
1000 Yottabytes=1 Brontobyte
1000 Brontobytes=1 Geopbyte
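The table's decimal (power-of-1000) steps can be applied mechanically. The short sketch below, a hypothetical helper rather than any standard library routine, renders a raw byte count in the units from the table (stopping at yottabytes, since the last two terms are rarely used in software):

```python
# Decimal (power-of-1000) units, matching the table above.
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_size(n_bytes):
    """Render a byte count using the decimal units from the table."""
    value, unit = float(n_bytes), UNITS[0]
    for u in UNITS[1:]:
        if value < 1000:
            break
        value /= 1000.0
        unit = u
    return f"{value:g} {unit}"

print(human_size(160_000_000_000))    # the 160 GB internal drive above
print(human_size(1_000_000_000_000))  # the 1 TB external drive above
```

Note that operating systems sometimes report sizes in binary (power-of-1024) units instead, which is why a "1 TB" drive can appear smaller on screen.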

Mobile as a Lifestyle

The integration of wireless capabilities with brick-and-mortar commercial locations took large leaps forward this year. In the beginning, Wi-Fi was available only with a paid subscription or a pay-per-hour fee. While there are exceptions, most commercial locations – hospitals, hotels, airports, etc. – now have free Wi-Fi. This year, Starbucks changed its policy and made Wi-Fi connections free and accessible. In October, the Starbucks Digital Network was added, with exclusive content. The network is a partnership with Yahoo, which means more exclusive web services will be added.

According to a Pew Research survey, lots of mobile gadgets are being purchased:

  • 85% of all Americans own a cell phone.
  • 96% of 18-to-29 year olds own a cell phone.
  • 76% of Americans own either a desktop or a laptop computer.
  • 47% of American adults own an MP3 player such as an iPod.
  • 42% of Americans own a home gaming device.
  • 62% of online users watched video through a sharing site in April.
  • 19% of users also download podcasts.

With video becoming the major growth area of the web, devices and communication lines will have to be able to handle this increased traffic volume. With easy access to a network, Apps will drive the next set of computing use. New services that have in-app commerce transactions will become the driver for new Web sites.

While phone calls remain the single most common use (41%, according to Accenture), everyone loves the apps on their handset. Total apps continue to expand: as of this writing, Apple has about 280,000 and Android about 100,000. This is a growth market; over $17.5 billion has been spent on app downloads.

Smarter Smartphones

Major equipment announcements are always led by a single product that creates the demand for copycat development. This happened with microcomputers (Apple, IBM, Compaq, Osborne, Radio Shack, et al.) and cellular phones (Motorola, Nokia, Siemens, Ericsson, et al.). The growing world of smartphones is still led by Apple’s iPhone, but there is a growing list of vendors.

Apple’s iPhone went on sale on June 29, 2007, with long lines of people waiting to be among the first; many camped out overnight. This year, on June 24, Apple delivered the next upgrade to this line, the iPhone 4. For the third quarter of 2010, AT&T reported activating 5.2 million iPhones, the largest number of iPhone activations in any quarter to date.

Google’s Android operating system and Microsoft’s Windows Phone 7 are chasing Apple’s iOS 4. The race is skewed to Apple, which still holds the mindshare for innovative products. BlackBerry, and Palm (now owned by HP), are playing catch-up. Motorola and Verizon released the Droid X in July 2010, and Verizon and Samsung have announced the release of the Samsung Fascinate for the fall of 2010.

The Tablet

The major consumer product announcement in 2010 was Apple’s iPad. Other companies had developed touchscreen technology, such as Microsoft’s Surface; CNN, other news programs, and every forensic crime show have them. But the iPad created a whole new consumer category and a lot of public interest. Currently, 25-30 other tablets are in the pipeline: Samsung and BlackBerry models have been announced, and others are on the way. iPad distribution will not be limited to the Apple store and website this Christmas, as new outlets will include Wal-Mart, Target, AT&T, and Verizon.

Tablets are used for games, for entertainment and, especially for the younger set, as a laptop replacement. The growth of tablets, with expanded wireless networks and digital content, is having a huge impact. The iPad is the fastest-selling tech device in history: according to Bernstein Research, it has sold an estimated 8.5 million units and is having a measurable effect on PC sales. An October report by Gartner predicts that sales of tablet devices used to access media will reach 19.5 million units in 2010, and a staggering 150 million units by 2013.

Eye Test

The range of screen sizes may require an eye test to determine which size is right for what we want to do. Smartphones typically have 3-3.5 inch screens, while the iPad’s LED-backlit display is 9.7 inches. Research In Motion announced that its new BlackBerry PlayBook tablet screen will be only 7 inches, Samsung’s Galaxy tablet also touts a 7-inch screen, and the HP Slate, which is geared toward business professionals, will have an 8.9-inch touchscreen.

Netbooks have 7-inch screens, and laptops range from 13 to 17 inches. External monitors are typically 20 to 27 inches. Attaching one extra monitor is standard today; adding two would provide three screens to display information from multiple applications.

For the home entertainment center, flat-panel displays are getting better. 60-inch screens are in the $2,000 range, and there are a few 100-inch screens for $100,000. High definition and 3D will drive this market, with prices ranging from just under $1,000 to bundles in the $7,000-8,000 range. Several vendors, such as Sony, LG and Panasonic, are delivering full systems complete with internet access to movie sites.

Computer vendors are delivering television-enabled boxes, either stand-alone (Apple, Logitech) or built into the television (Google, Netflix), that provide internet and movie access.

PBS announced the beta launch of a new website featuring local content from member stations. The launch includes the release of PBS for iPad and the PBS App for the iPhone and iPod Touch. PBS states that it wants to become a multi-platform media leader, delivering programs through television, mobile devices, the Web, and other platforms such as classroom interactive whiteboards.

The Social World

Social media is not new this year, but the growth of the big and little players needs to be recognized. Facebook, LinkedIn, and Twitter have millions of members, with many consumers using all three sites. These sites are a growing resource for one-to-one and one-to-many communications, and they are changing the way mailing lists and marketing messages are used.

Other Items of Note

  • Microsoft: the Windows 7 OS celebrated birthday #1 (10/22) with the announcement that it has sold over 240 million copies. The full Office 2010 product was released, receiving lots of praise.
  • Cloud computing has everyone’s attention – IBM, HP, Google, Oracle, Amazon, Microsoft, AT&T, others.
  • Nanotechnology is a very hot area of science and engineering, for many applications, most notably medical. Learn about the tiny world of Microbots and Nanobots.
  • While electric cars are still a small percentage of car sales, they are on everyone’s radar. Tesla is delivering its sports car, Chevrolet Volt has been demonstrated, and Toyota Prius hybrids draw customer praise. Even the diminutive Smart Car has announced an electric model.
  • Alternative energy developers will get a boost from Google’s $5 billion commitment to an offshore wind farm.

The flood of digital data will deliver more to watch, more to read, more to store and file. We have choices to make to avoid being strangled by data overload. We can all join hands, virtually, and seek wisdom as to what works best for us this month. There must be an App for that.

About the Author

Richard Oppenheim, CPA, blends business, technology and writing competence with a passion for helping individuals and businesses get unstuck from the obstacles preventing them from moving ahead. He is a member of the Unlimited Priorities team. Follow him on Twitter.


The Importance of Standards in Our Lives

Written for Unlimited Priorities and DCLnews Blog.

Ebook Readers

Photo Credit: Cloned Milkmen

Everywhere you look, travel, and shop, our world is driven by standards developed by organizations responsible for their spheres of influence. In our homes, the lighting, air-conditioning, heating, plumbing, and appliances are all built to industry standards. Years of professional contributions by engineers, users and manufacturers have made our lives safer and more comfortable.

In the information world, standards are just as important. I take for granted that when I turn on my computer, my system will boot up and take me to the location I have requested or follow my instructions. I access the world wide web and follow a URL to my favorite web site without ever considering all the standards that were developed to make this process work.

So much of my work is facilitated by the standards that have come before me, operating in the background without any effort on my part. So, given that we have all been pampered by our information world, consider the shock to the system when our world is turned upside down by either the lack of standards or lack of agreement on the future direction. Let’s take a look at the e-content world as an example. The shift from print formats to the electronic book, journal or magazine has taken the public by storm. Sales of e-books are skyrocketing, e-readers are hitting the market, and publishers are moving to publish the print and e-book at the same time.

In the scientific, technical, and medical professional education markets, the shift from the printed journal to electronic formats has been rapid and highly successful. In less than five years, the STM market publishers have developed site license agreements for their total journal output, which now represents over 60% of publishers’ revenue. The STM market is now supplying over 85% of their journal content in electronic form. Printed journal subscriptions have radically declined and many publishers are considering giving up print or shifting to print on demand.

The e-book world is following a similar pattern. Amazon’s introduction of the Kindle served as a wake-up call for the entire publishing industry. Amazon stands to generate over $1 billion from e-book sales by the end of its fiscal year. Amazon hit the market with the first e-reader and a large catalog of digital books at prices much cheaper than print. Other book retailers, such as Barnes & Noble with its Nook e-book reader and Borders, which supports a group of e-readers, have joined the digital revolution with e-book services. In April 2010, Apple hit the market with the release of the iPad and its iBookstore, selling over 3,000,000 iPads in three months, with over 5,000,000 e-books downloaded in the same short period.

The fact that the market has responded so well to this new technology should make everyone happy: publishers, book retailers, and readers all have something to celebrate. Sales of printed books at Wal-Mart, Barnes & Noble, and Borders stores have been in serious decline for several years, and the old brick-and-mortar retail outlets are losing printed book sales. The e-book is the first bright light in a declining market. Publishers are looking to the mobile market as a possible salvation for book, magazine, and newspaper sales. Mobile internet access now exceeds desktop access; mobile computing is here to stay, and developers are making sure that all their applications work in the mobile environment.

With all of this positive news, why are some still unhappy about where we are going? From my perspective as a user, publishers are not providing the market with the best choices in their e-book products. While we have the EPUB standard for books, we also have Mobipocket books, PDF support, and a host of DRM software, including Amazon’s proprietary format (AZW). What I want is a universal e-book format, where every book bought at any e-book store could be read on every e-book reader. In simple terms, I want to take any book that I have downloaded to my iPad and share it with my wife, who reads her books on her Kindle. It is common practice for members of the same household to share books. Why not build the e-book market on a universal format that offers interoperability between e-readers and e-book stores?

Amazon has developed apps for the iPad and a host of smartphones, so it is making progress in the right direction. Sadly, though, e-books bought in the iBookstore with Apple’s FairPlay DRM cannot be read on the Kindle. Perhaps it is not the lack of standards that is at the heart of this issue: we have a range of standards, but the industry lacks the will to cooperate and select one universal standard.

This interoperability problem is perhaps more a political issue than a technical one. Amazon owns Mobipocket and the Kindle AZW file format, which carries its DRM. Both Amazon and Apple have built products on their preferred sets of standards. Each company works with the same group of publishers and offers a software developer kit, and in the end software developers do follow standards to build their products. So what we have is not a lack of standards; it is a lack of agreement on which standard to use. Eventually the market will shake out and one universal e-book format will prevail, but not before consumers waste significant money.

While we are waiting for this confusion to clear up, the standards organizations are continuing to refine and improve the existing standards in the e-publishing world.

Many developers working on iPad book applications are working with EPUB, and the EPUB standard is undergoing a major revision. The EPUB 2.1 working group has identified fourteen main problems that it intends to fix in the next release. High on the list is enhanced global language support for Chinese, Japanese, Korean, and Middle Eastern languages; support for right-to-left reading is a must if EPUB is to become the universal e-book format. Other functions to be supported include rich media, interactivity, post-publication annotation support, and advertising. Interactive digital textbooks and rich-media magazines are going to be commonplace, and EPUB 2.1 must support these functions.

At the same time as EPUB 2.1 is under development, other groups are working on HTML5. HTML5 is a standard for structuring and presenting content on the Web, incorporating features like video playback and drag-and-drop that have previously depended on third-party browser plugins. Many iPad applications already support HTML5 features, as many parts of the standard are complete even though the standard as a whole has not been finalized. Amazon’s Kindle has also embraced the next generation of web programming: Amazon has an upcoming release, the Kindle Previewer for HTML5, which industry sources say offers complex layouts, embedded audio and video, and enhanced user interactivity.

Another related standard that impacts the user interface is CSS3. Cascading Style Sheets (CSS) is a style sheet language used to describe the look and formatting of a document written in a markup language. CSS allows the same markup page to be presented in different styles for different rendering methods, such as on screen, in print, in voice, and on Braille-based tactile devices. Developers need to pay attention to CSS for any products built for the US government and for the college and university market, where Section 508 compliance is required.

Amazon is currently being sued by various universities because the Kindle does not support Section 508 of the Rehabilitation Act. This federal law protects people with disabilities and requires all products bought at the federal level to be compliant, and colleges and universities have a mandate to follow it. Both e-readers and mobile devices such as the iPad must support Section 508: disabled employees and users must have public access to information that is comparable to the access available to others. The bottom line is simple and straightforward: all applications sold to colleges and universities, as well as to the Federal government, must be accessible to people with disabilities.

What is clear to me is the unique urgency at this time for our industry to create, endorse and implement standards that take advantage of advances in technology. The sales of our products and services depend on standards, and standards groups touch nearly every aspect of our daily lives. In the publishing world, industry standards for e-readers, platforms, software tools and even the web are often produced by groups outside our markets, but publishers and libraries also have standards groups working on their behalf. In the United States, ANSI accredits some 400 organizations as national standards developers; NISO is one of the 400, working on standards for libraries, publishers and information service providers.

There are two efforts underway that impact our community, and developers should be aware of both. The first is the treatment of supplemental journal article materials.

How do publishers and editors deal with supplemental material in the e-journal world? That is a question that needs answering and a standard resolution. There are questions about readability, usability, preservation and reuse. Authors follow guidelines in the manuscript submission systems of publishers for the primary text of the article.

But in today’s technologically rich research community, the text is often insufficient to describe or facilitate a researcher’s results. There are data sets, background information, methodological details, and additional content that simply do not fit into the printed journal. NISO is working with a number of publishers and community leaders to sort out this important problem; we need a recommended practice or best-practice guideline for how to handle supplemental materials.

Another problem for publishers and end users is journal article version control. When you find a scientific article on the web, how do you know which version you are reading? Certainly in medical research the version is a crucial factor. Authors often keep their original submitted manuscript, and if Google or another search engine locates an article on the web, it might be the page-proof version, the published version, the corrected version of record, or an enhanced version of record. What is important is to have a way to identify which version of an author’s work you are reading, so that there is no confusion. This is another area where standards are important, and NISO is working on this problem as well.

Standards are an important part of the quality of our products and services, and they evolve as technology changes. Without standards, new markets would be limited and opportunity reduced. Standards are influenced by the corporations that often have the most to gain, but we should try to ensure that they are developed for the common good and do not benefit only one group. E-book sales growth has just begun to gain traction. We still have a number of hurdles to overcome, including interoperability, making sense of DRM, and developing rich formats, but we have come a long way in a very short time, even with the limitations that we see in the standards field.

We are entering a transformational time, as the shift from print to e-formats impacts all aspects of our society and our educational systems. I cannot help but imagine what type of world my grandchildren will find when they begin college. Much of the printed book and journal collection will have been replaced. One thing is for sure: the implementation of standards will make this new world possible.


About the Author

Dan Tonkery is president of Content Strategies as well as a contributor to Unlimited Priorities. He has served as founder and president of a number of library services companies and has worked nearly forty years building information products.


A Web by Any Other Name

Written for Unlimited Priorities and DCLnews Blog.

Why We Need to Know About the Semantic Web

Richard Oppenheim

Some say, “Look out — the semantic web is coming.” Some say it is already here. Others ask, “What exactly is semantic about the web?”

Whether or not you have ever heard of the “semantic web,” you need to know more about it. Probably the first step for all of us is to get past the hype of yet another marketing term for technology. We know that technology will continuously create new phrases for new features that enable us to do more than yesterday. This includes terms like personal computer, smartphone, internet, world wide web, telecommuting, cloud computing and a lot more. Twenty-five years ago, only a few folks were even using computers, let alone smartphones, e-mail, social networks, and search engines.

Tim Berners-Lee, the person credited with developing the world wide web, said, “the semantic web is not a separate web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation.”


Semantics, in dictionary terms, is the study of meaning, a range of ideas with no defined limits. In written language, such things as paragraph structure and punctuation carry semantic content. In spoken language, semantics covers signs and symbols within a set of circumstances and contexts, including sounds, tones, facial expressions, body language, foot tapping and hand waving.

There are lots of ways to refer to the huge storehouse of data outside our control, such as the internet, the web, cyberspace, geek heaven, or some other term. Whatever term you choose, know that the semantic storehouse is a repository for words, images, and applications that is way too big to measure. It is like trying to count the number of stars in the universe; only rough estimates are available.

To deliver or receive communication, we combine individual elements in small or large quantities to create spoken language, articles, books, web sites, blogs, tweets, photo albums, videos, songs, song albums, audio books, podcasts and more. Words can be from any one or multiple languages. Images can be still or moving, personal or commercial. Each element in our storehouse is always available to be used in any sequence and any quantity.

The semantic web invokes Tim Berners-Lee’s vision of every piece of data being immediately accessible to anyone, to use in any way they want. His vision expands the use of “linked data” to connect every web-based element with every other web-based element. Wikipedia provides a peek at how this works, with its linking of terms in one entry to other entries. The resounding slogan shouted by Berners-Lee is “Raw Data Now.”

The purpose of the semantic web is to enable words and phrases to provide links to resources, like Wikipedia, that reach across the universe of web-accessible data. A current example is how CNN is expanding its resources: for on-air broadcasting, CNN summarizes its news feeds, but you can log in to the CNN website to access the “raw” feeds and watch and listen without an analyst’s intervention.

Growing up, my primary reference material was a paper-based encyclopedia, dictionary or thesaurus. Books in my local town library were available, but it was hard to get at books the town library did not have. Today, all of those resources are accessible at any time, without a weight-training exercise. One fledgling example of this use of raw data is DBpedia, a community effort to extract structured information from Wikipedia and expand the linkage of data. As of April 2010, the DBpedia knowledge base contains over one billion pieces of information describing more than 3.4 million things.
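The "linked data" idea behind DBpedia can be illustrated with a tiny, hand-made sketch. Linked data is stored as subject-predicate-object triples; following the object of one triple as the subject of another is what lets a query "reach across" silos. The entities and facts below are invented for illustration and are not DBpedia's actual data or API:

```python
# A hand-made set of subject-predicate-object triples, in the spirit of
# DBpedia's linked data. Entities and facts are illustrative only.
triples = [
    ("Saturn", "orbitedBy", "Cassini"),
    ("Cassini", "launchedBy", "NASA"),
    ("Saturn", "type", "Planet"),
]

def follow(subject, depth=2):
    """Walk outward from one entity, link by link, collecting connected facts."""
    found, frontier = [], [subject]
    for _ in range(depth):
        nxt = []
        for s, p, o in triples:
            if s in frontier:
                found.append((s, p, o))
                nxt.append(o)  # the object becomes a new starting point
        frontier = nxt
    return found

for fact in follow("Saturn"):
    print(fact)
```

Starting from "Saturn," the walk reaches "Cassini" and then "NASA" in two hops, which is the essence of linking data in one silo to data in another.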

In 2010, the semantic web will not change your life or replace the resources you currently use. But every forecast tells us that in the few years ahead there will be lots of new and rich resources available. The semantic web will enable us to collect data elements, assemble them, disassemble them, and start anew or continue by adding more elements. It will be one of the 21st century’s functional erector sets, useful for business support, personal search, and even customizable games.

The semantic web, however, is not a game. It is, of course, under construction today and will likely remain under construction for the rest of this century. Skeptics say the goals are too lofty to be realistic. But a quick look at very recent history reveals:

  • The internet was first used by a few universities in the 1960s. Thirty years later, the world wide web started its revolutionary integration with our lives.
  • Bar code scanning was first tested on a pack of chewing gum in 1974. It was another ten years before grocery stores started to adopt the thick and thin bars. Today bar coding has grown way beyond grocery store checkout lines.
  • In 1983, Motorola released the first cellular phone for $3,000. Ten years later, the cell phone industry took its first leap. Today cellular and wireless technologies are essential tools for countless enterprises and individuals of every age.
  • In the past ten years, the social networks arrived: LinkedIn was founded in 2002, Facebook in 2004, and Twitter in 2006.


Technology will always ride the sea changes, as new capabilities build on what has been tested and used before. Search engines enable us to ask questions and retrieve answers that are some combination of data: some precise, some tangential to the subject, and some totally unrelated to the topic. Today’s data libraries are single-location silos, such as Wikipedia, that hold information beyond the capacity of my local library. With the semantic web, these silos will lose their standalone status. The new linking capabilities will deliver infinitely expanded ways to link data in any one silo with data in almost any other silo. Everything, including national security, will still require protection from criminals, hackers, and assorted bad guys.

Raw data without borders will enable, for example, each of us to create our very own Dewey Decimal filing system, including card catalogue, rolodex, and other customized information. Similar to the smartphone app world, we will have a large selection of end-user applications that integrate, combine and deduce information needed to assist us in performing tasks. Of course, we may choose to perform information construction ourselves. This would be like answering our own phone or typing our own correspondence or driving our own car. We can choose to adapt, adopt, or discard any feature that becomes available. The semantic web is real and it is growing. It has the potential to expand beyond any estimate.

About the Author

Richard Oppenheim, CPA, blends business, technology, and writing competence with a passion for helping individuals and businesses get unstuck from the obstacles keeping them from moving ahead. He is a member of the Unlimited Priorities team. Follow him on Twitter at


Interview with Deep Web Technologies’ Abe Lederman

Written for Unlimited Priorities and DCLnews Blog by Barbara Quint

Abe Lederman


Abe Lederman is President and CEO of Deep Web Technologies, a software company that specializes in mining the deep web.

Barbara Quint: So let me ask a basic question. What is your background with federated search and Deep Web Technologies?

Abe Lederman: I started in information retrieval way back in 1987. I’d been working at Verity for six years or so, through the end of 1993. Then I moved to Los Alamos National Laboratory, one of Verity’s largest customers. For them, I built a Web-based application on top of the Verity search engine that powered a dozen applications. Then, in 1997, I started consulting to the Department of Energy’s Office of Scientific and Technical Information (OSTI). The DOE’s Office of Environmental Management wanted to build something to search multiple databases. Back then we called it distributed search, not federated search.

The first application I built is now called the Environmental Science Network, and it’s still in operation almost 12 years later. I built the first version with my own fingers, on top of a technology devoted to searching collections of Verity documents, and then expanded it to search the Web. We used that for five or six years. I started Deep Web Technologies in 2002, and around 2004 or 2005 we launched a new version of the federated search technology, written in Java. I’m not involved in writing any of that any more. The technology in operation now has been through several iterations and enhancements, and we’re working on yet another generation.

BQ: How do you make sure that you retain all the human intelligence that has gone into building the original data source when you design your federated searching?

AL: One of the things we do that some other federated search services are not quite as good at is taking advantage of all the capabilities of our sources. We don’t ignore metadata on document type, author, date ranges, etc. In many cases, a lot of the databases we search — like PubMed, Agricola, etc. — are very structured.

BQ: How important is it for the content to be well structured? To have more tags and more handles?

AL: The more metadata that exists, the better results you’re going to get. In the library world, a lot of data being federated does have all of that metadata. We spend a lot of effort to do normalization and mapping. So if the user wants to search a keyword field labeled differently in different databases, we do all that mapping. We also do normalization of author names in different databases — and that takes work! Probably the best example of our author normalization is in Scitopia.

BQ: How do you work with clients? Describe the perfect client or partner.

AL: I’m very excited about a new partnership with Swets, a large global company. We have already started reselling our federated search solutions through them. Places we’re working with include the European Space Agency and soon the European Union Parliament, as well as some universities.

We pride ourselves on supplying very good customer support. A lot of our customers talk to me directly. We belong to a small minority of federated search providers that can sell a product to a customer for internal deployment and still work with them to monitor or fix any issues with connectors to the sources being federated, even though we have no direct access. A growing part of our business uses the SaaS model, and we’re seeing a lot more of that. There’s also the hybrid approach, such as that used by DOE’s OSTI: our software runs on servers in Oak Ridge, Tennessee, but we maintain all their federated search applications. Stanford University is another example. In September we launched a new app that federates 28 different sources for their schools of science and engineering.

BQ: How are you handling new types of data, like multimedia or video?

AL: We haven’t done that so far. We did make one attempt to build a federated search for art image databases, but unfortunately the databases in that pilot had poor metadata and search interfaces, so it was not terribly successful. We want to go back and tackle richer databases, including video.

BQ: How do you gauge user expectations and build that into your work to keep it user-friendly?

AL: We do track queries submitted to whatever federated search applications we are running. We could do more. We do provide Help pages, but probably nobody looks at them. Again, we could do more to educate customers. We do tend to be one level removed from end-users. For example, Stanford’s people have probably done a better job than most customers in creating some quick guides and other material to help students and faculty make better use of the service.

BQ: How do you warn (or educate) users that they need to do something better than they have, that they may have made a mistake? Or that you don’t have all the needed coverage in your databases?

AL: At the level of feedback we are providing today, we’re not there yet. It’s a good idea, but it would require pretty sophisticated feedback mechanisms. One of the things we have to deal with is that when you’re searching lots of databases, they all behave differently. Just look at dates (and it’s not just dates): a user may want to search 2000-2010, but some databases display the date without letting you search on it, and some do neither. Where a database displays a date but doesn’t let you search on a range, you may get results outside the requested range, displayed with the unranked results. How to make clear to the user what is going on is a big thing for the future.
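The date-range workaround Lederman alludes to can be sketched as a client-side filter: fetch broadly, keep what falls inside the requested years, and set aside records the source displays but cannot search on. The record layout here is hypothetical:

```python
# Sketch of client-side date filtering, for sources that display a
# publication year but cannot search on a date range. The record
# layout is hypothetical.
def split_by_date(records, start, end):
    """Separate records inside the requested year range from those
    outside it (or with no parseable year at all)."""
    in_range, out_of_range = [], []
    for rec in records:
        year = rec.get("year")
        if isinstance(year, int) and start <= year <= end:
            in_range.append(rec)
        else:
            out_of_range.append(rec)  # shown separately, unranked
    return in_range, out_of_range

hits = [
    {"title": "A", "year": 2005},
    {"title": "B", "year": 1997},
    {"title": "C", "year": None},
]
good, rest = split_by_date(hits, 2000, 2010)
print([r["title"] for r in good])   # ['A']
print([r["title"] for r in rest])   # ['B', 'C']
```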

BQ: What about new techniques for reaching “legacy” databases, like the Sitemap Protocol used by Google and other search engines?

AL: That’s used for harvesting information the way Google indexes web sites. The Sitemap Protocol is for indexing and doesn’t apply to us: search engines like Google are not going into the databases in real time the way federated search does. Some content owners do want to expose all or part of the content behind their search forms to engines like Google; that could include DOE OSTI’s Information Bridge and some PubMed content, and they expose it through sitemaps. A couple of years ago there was lots of talk about Google statistically filling out forms to reach content behind databases. In my opinion they’re doing this in a very haphazard manner, and that approach won’t really work.
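For readers unfamiliar with it, the Sitemap Protocol is just an XML list of URLs with optional metadata. The sketch below parses a minimal sitemap (with a made-up example.org host) the way a harvester would:

```python
import xml.etree.ElementTree as ET

# A minimal sitemap in the sitemaps.org format, the protocol Google
# and other crawlers use to harvest otherwise "hidden" database
# content. The host and record URLs are invented for illustration.
SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.org/records/1</loc>
    <lastmod>2010-01-15</lastmod>
  </url>
  <url>
    <loc>https://example.org/records/2</loc>
    <lastmod>2010-02-20</lastmod>
  </url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def list_urls(xml_text):
    """Return the (loc, lastmod) pairs a crawler would schedule."""
    root = ET.fromstring(xml_text)
    return [(u.find("sm:loc", NS).text, u.find("sm:lastmod", NS).text)
            for u in root.findall("sm:url", NS)]

print(list_urls(SITEMAP))
```

A crawler fetches each `loc` on its own schedule, which is exactly the batch-harvesting model Lederman contrasts with real-time federated search.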

BQ: Throughout the history of federated search — with all its different names, there have been some questions and complaints about the speed of retrieving results and the completeness of those results from lack of rationalizing or normalizing alternative data sources. Comments?

AL: We’re hearing a fair amount of negative comment on federated search these days, and there have been a lot of poor implementations. For example, federated search gets blamed for being really slow, but that usually happens because most federated search systems wait until every search is complete before displaying any results to the user. We’ve pioneered incremental search results: in our version, the first results appear within 3 to 4 seconds. We display whatever results have been returned while, in the background, our server is still processing and ranking the rest. At any time, the user can ask for a merge of the results they haven’t yet seen. So the user gets a good experience.
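The incremental approach Lederman describes can be simulated in a few lines: each source answers on its own schedule, and each batch is displayed as it arrives rather than after the slowest source finishes. The source names and delays here are invented:

```python
import queue
import threading
import time

# Toy simulation of incremental federated search: each "source"
# answers on its own schedule, and results are displayed as they
# arrive instead of waiting for the slowest source.
def search_source(name, delay, results):
    time.sleep(delay)                      # simulate network latency
    results.put((name, [f"{name}-hit-{i}" for i in range(2)]))

def federated_search(sources):
    results = queue.Queue()
    for name, delay in sources:
        threading.Thread(target=search_source,
                         args=(name, delay, results), daemon=True).start()
    arrived = []
    for _ in sources:                      # display each batch on arrival
        name, hits = results.get()
        arrived.append(name)
        print(f"{name}: {hits}")
    return arrived

order = federated_search([("slow_db", 0.2), ("fast_db", 0.01)])
print(order)  # fast_db first, even though slow_db was queried first
```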

BQ: If the quality of the search experience differs so much among different federated search systems, when should a client change systems?

AL: We’ve had a few successes with customers moving from one federated search service to ours. The challenge is getting customers to switch. We realize there’s a fairly significant cost in switching, but, of course, we love to see new customers. For customers getting federated search as a service, it costs less than if the product were installed on site. So that makes it more feasible to change.

BQ: In my last article about federated searching, I mentioned the new discovery services in passing. I got objections from some people about my descriptions or, indeed, equating them with federated search at all. People from ProQuest’s Serials Solutions told me that their Summon was different because they build a single giant index. Comments?

AL: There has certainly been a lot of talk about Summon. On a superficial look, Summon has a lot of positives: it’s fast and maybe does (or has the potential to do) a better job of relevance ranking. It bothers me that it is non-transparent about a lot of things; maybe customers can learn more about what’s in it. Dartmouth did a fairly extensive report on Summon after more than a year of working with it. The review was fairly mixed: lots of positives, comments that it looks really nice, lots of bells and whistles in terms of limiting searches to peer-reviewed, full-text-available, or library-owned and licensed content. But beneath the surface a lot is missing, and it’s lagging behind on indexing. We can do things quicker than Summon. I’ve heard about long implementation times for libraries trying to get their own content into Summon. In federated searching, it takes us only a day or two to add someone’s catalog into the mix, and if they have other internal databases, we can add those quickly too.

BQ: Thanks, Abe. And lots of luck in the future.

Related Links

Deep Web Technologies –
Scitopia –
Environmental Science Network (ESNetwork) –

About the Author

Barbara Quint of Unlimited Priorities is editor-in-chief of Searcher: The Magazine for Database Professionals. She also writes the “Up Front with bq” column in Information Today, as well as frequent NewsBreaks on


Dan Tonkery on the iPad and the Future of Technical Publications

Written for Unlimited Priorities and DCLnews Blog.

Dan Tonkery


The much-anticipated iPad has arrived! Consumers, publishers, and video game developers are on cloud nine. What can the technical publications industry expect?

The iPad has finally arrived. This magical and revolutionary next-gen tablet computer, with its online library of books and magazines, music store, and theater from Apple, is now available. Industry pundits claim this device will change the future of publishing and may be a transformational product for our society, equivalent to the introduction of television or of print itself.

The iPad is a very hot-looking product with uses in multiple fields such as entertainment, business, and education. The device is a 9.7-inch touch-screen tablet computer that functions as a multimedia e-reader and mobile Web surfer. Since releasing the iPhone in 2007, Apple has been the leader in mobile devices, and the release of the iPad will continue that leadership.

Pre-orders for the iPad exceed 200,000 units, and first-year sales are expected to top five million in 2010; Apple insiders project eight to ten million. While the initial sales are in the United States, sales in the UK and Canada are projected to start by the end of April.

The iPhone app business created in 2007 is already a billion-dollar business: developers have produced over 150,000 applications for the iPhone, those apps have been downloaded some two billion times, and most of them will work on the iPad. The app business has created an entire industry, and there is currently a gold rush among developers to create applications for the new platform. Apple has been secretive about the iPad, offering developers only emulation software.

So who are the early adopters and developers for the iPad? A large group of publishers is working on new apps, seeing a golden opportunity to reinvent the newspaper, magazine, and e-book with a multimedia touch screen: video-mixed media, 360-degree product walk-throughs, and new products with embedded audio, video, and streaming. Now is the perfect time for publishers of books, magazines, and newspapers such as the New York Times, Macmillan, and Penguin to develop exciting new products that offer an opportunity to monetize their web applications and reverse the assumption that everything on the web is free.

Publishers are not the only community excited about the new platform. Video game developers, retailers, and educators are building apps for it as well. What is for sure is that multimedia products will offer a combination of text, audio, still images, animation, video, and interactive content. The range of tools available now gives developers the most creative landscape ever and an opportunity to push the edge of technology forward.

The first version of the iPad will not be perfect as many users want features like a camera, multitasking, a GPS, and support for Adobe’s Flash software, but there are sufficient features to make consumers buy the iPad. The early adopters and market leaders have their orders placed. Many corporations that wanted to buy in bulk were turned away. Expect major corporations to spend millions on this new device as product catalogs, documentation, training and other enterprise-wide applications are developed.

Apple is a big winner with the iPad. The applications from publishers, video game developers, and major corporations marketing their favorite foods, cars, and beverages are going to give users sufficient content to justify buying an iPad. Seton Hill University has already announced that it will supply all incoming freshmen this fall with an iPad, and other universities are considering similar action.

So what will the iPad add to the technical publishing and documentation industry? What effect will its introduction have on the community that produces and develops technical documentation for business and industry, including military applications? First and foremost, the introduction of the iPad will raise the bar on the look and feel of technical documentation. Users are no longer going to accept the status quo. The tools now available to modernize, upgrade, and reinvent technical documentation are abundant, powerful, and ready to be implemented.

Most technical documentation applications build on some variation of XML, and most data conversion shops are expert at building applications and products in the XML world. If your organization is working with DocBook, the path to EPUB is straightforward. Ideally, your organization is working in a presentation-neutral form that captures the logical structure of the content, so that it can be published in a variety of formats.
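To illustrate the “presentation-neutral source, many outputs” idea, the sketch below renders one simplified, DocBook-like section as both HTML and plain text. The element names are illustrative, not a real DocBook schema:

```python
import xml.etree.ElementTree as ET

# One presentation-neutral source, two publication targets. The
# element names are DocBook-like but simplified for illustration.
SOURCE = """<section>
  <title>Replacing the Filter</title>
  <para>Power down the unit before opening the panel.</para>
  <para>Remove the two retaining screws.</para>
</section>"""

def to_html(xml_text):
    """Render the logical structure as HTML."""
    sec = ET.fromstring(xml_text)
    title = sec.findtext("title")
    paras = "".join(f"<p>{p.text}</p>" for p in sec.findall("para"))
    return f"<h1>{title}</h1>{paras}"

def to_plain_text(xml_text):
    """Render the same structure as plain text."""
    sec = ET.fromstring(xml_text)
    lines = [sec.findtext("title").upper()]
    lines += [p.text for p in sec.findall("para")]
    return "\n".join(lines)

print(to_html(SOURCE))
print(to_plain_text(SOURCE))
```

Because the source records only logical structure, adding a third target, such as EPUB XHTML, means writing one more renderer rather than re-authoring the content.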

What is clear for the technical documentation industry is that future applications are going to demand embedded audio, video, and streaming, along with features such as 360-degree product walk-throughs with video-mixed media. Think for a moment about the documentation for a tank repair manual: how much more instructive would it be with video and audio embedded in its pages? An exciting new world is opening up, and I think many future technical documentation projects will use a range of these multimedia features.

The iPad and software tools provide an exciting opportunity to open up the power of art and creativity. The introduction of multimedia eBooks is an example where we have multiple formats, hardware, and software. Developers have many choices now to build exciting products.

Even Amazon, with its best-selling Kindle, is working on an iPad app, so Kindle users will be able to read their e-books on the iPad. The Kindle is the best-selling e-book reader to date, but the iPad may make it obsolete. It was a great tool in its day, but Apple may have leapfrogged it.

The benchmark for the look and feel of technical publications has been raised, and users will be looking and expecting to find image and sound support, interactivity, embedded annotation support and affordability of both the device and the product. Publishers are seeking a new platform that offers Digital Rights Management and an opportunity to shift the user from a free Internet mentality to a paid environment, and it is my belief that users will pay for bells and whistles.

The iPad is the first in a new field of tablet computers. The market is going to explode with new apps. Look for a flood of creativity. The convergence of technology that has brought us to this point has been a long time in coming. With the introduction of the iPad, we have crossed over to a new world that has great opportunity for users, developers, and corporations. The biggest impact on society is going to come from the millions of creative and artistic individuals that are going to build applications for all of these new tablet computers.

The new tablet computers will radically impact technical documentation and the data conversion industry. Since the community is already XML-savvy, working with the standards organizations like the International Digital Publishing Forum to build new expanded standards is a must. Whatever markup language you are using will be modified to include multimedia tools and features. Just think of all the potential upgrades to the various technical documentation projects that have already been completed. There should be work for many, many years. You have an opportunity to bring life into your works. Let the fun begin…


About the Author

Dan Tonkery is president of Content Strategies as well as a contributor to Unlimited Priorities. He has served as founder and president of a number of library services companies and has worked nearly forty years building information products.


Federated Searching: Good Ideas Never Die, They Just Change Their Names

Written by Barbara Quint for Unlimited Priorities and DCLnews Blog.

“I don’t want to search! I want to find!!” “Just give me the answer, but make sure it’s right and that I’m not missing anything.” In a world of end-user searchers, that’s what everyone wants, a goal that can explain baldness among information industry professionals and search software engineers. Tearing your hair out isn’t good for the scalp.

And, for once, Google can’t solve the problem. Well, at least, not all the problems. The Invisible or Dark or Deep Web, whatever you call the areas of the Web where legacy databases reside with interfaces old when the Internet was young, where paywalls and firewalls block the paths to high-quality content, where user authentication precedes any form of access — here lie the sources that end-users may need desperately and that information professionals, whether librarians or IT department staff, work to provide their clients.


The challenge of enabling an end-user searcher community to extract good, complete results from numerous, disparate sources with varying data content, designs, and protocols is nothing new. Even back in the days when only professional searchers accessed online databases, searchers wanted some way to find answers in multiple files without having to slog through each database one at a time. In those days, the solution was called multi-file or cross-file searching, e.g. Dialog OneSearch or files linked via Z39.50 (the ANSI/NISO standard for data exchange). As the Internet and its Web took over the online terrain, different names emerged, such as portal searching and — the winner in recent years — federated searching.

So what does federated searching offer? It takes a single, simple (these days, usually Google-like) search query and transforms it into whatever format is needed to tap into each file in a grouping of databases. It then extracts the records, manipulates them to improve the user experience (removing duplicates, merging by date or relevance, clustering by topic, etc.), and returns the results to the user for further action. The databases tapped may include both external databases, e.g. bibliographic/abstract databases, full-text collections, web search engines, etc., and internal or institutional databases, e.g. library catalogs or corporate digital records.
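The flow just described, one query transformed per source, results merged and de-duplicated, can be sketched in a few lines. The two “sources” here are stand-in functions, not real services, and the de-duplication key is deliberately crude:

```python
# Minimal sketch of the federated-search flow described above. The
# source "APIs" are stand-in functions, not any real service.
def search_catalog(query):
    return [{"id": "rec1", "title": "Deep Web Mining"},
            {"id": "rec2", "title": "Search Basics"}]

def search_fulltext(query):
    # the same document surfaced under a different local id
    return [{"id": "ft-9", "title": "Deep Web Mining"}]

def federate(query, sources):
    """Run one sub-search per source, merge, and remove duplicates."""
    merged, seen = [], set()
    for source in sources:
        for rec in source(query):
            key = rec["title"].lower()     # crude de-duplication key
            if key not in seen:
                seen.add(key)
                merged.append(rec)
    return merged

hits = federate("deep web", [search_catalog, search_fulltext])
print([h["title"] for h in hits])  # ['Deep Web Mining', 'Search Basics']
```

Real systems also translate the query into each source’s own syntax and merge by relevance or date; those steps are omitted here to keep the shape of the pipeline visible.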


In a sense, all databases that merge multiple sources, whether Google tracking the Open Web or ProQuest or Gale/Cengage aggregating digital collections of journals and newspapers or Factiva or LexisNexis building search services collections from aggregators and publishers, offer a uniform search experience for searching multiple sources. Even accessing legacy systems that use rigid interfaces is no longer unique to federated services as Google, Microsoft, and other services have begun to apply the open source Sitemap Protocol to pry open the treasures in government and other institutional databases. The key difference in federated searching is that it usually involves separate journeys by systems to access collections located in different sites. This can mean problems in scalability and turnaround speed, if a system gets bogged down by a slow data source.


More important, however, are the problems of truly making the systems perform effectively for end-users. Basically, a lot of human intelligence and expertise, not to mention sweat and persistent effort, has to go into these systems to make them “simple” and effective for users. For example, most of the databases have field structures where key metadata resides. A good federated system has to know just how each field in each database is structured and how to transform a search query to extract the needed data. Author or name searching alone involves layers of questions. Do the names appear firstname-lastname or lastname-comma-firstname? Are there middle names or middle initials? What separates the components of the names — periods, periods and spaces, just spaces? The list goes on and on — and that’s just for one component.
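As a rough illustration of those author-name questions, here is a toy normalizer that folds a few common citation formats into one canonical key. Real systems must handle far more variation; this is a heuristic sketch only:

```python
# Sketch of author-name normalization across databases, one of the
# "layers of questions" described above. Heuristic and illustrative
# only; production systems need many more cases.
def normalize_author(raw):
    """Reduce several common citation formats to 'lastname, f.'."""
    raw = raw.strip()
    if "," in raw:                         # "Lederman, Abe" / "Lederman, A."
        last, first = [part.strip() for part in raw.split(",", 1)]
    else:                                  # "Abe Lederman" / "A. Lederman"
        parts = raw.split()
        last, first = parts[-1], " ".join(parts[:-1])
    initial = first[0].lower() if first else ""
    return f"{last.lower()}, {initial}."

variants = ["Lederman, Abe", "Abe Lederman", "A. Lederman", "Lederman, A."]
print({normalize_author(v) for v in variants})  # {'lederman, a.'}
```

All four variants collapse to one key, which is what lets a federated system treat them as the same author when merging results.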

So how do federated search services handle these problems? In an article by Miriam Drake in the July-August 2008 issue of Searcher, entitled “Federated Search: One Simple Query or Simply Wishful Thinking,” a leading executive of a federated service selling to library vendors was quoted as saying, “We simply search for a text string in the metadata that is provided by the content providers – if the patron’s entry doesn’t match that of the content provider, they may not find that result.” Ah, the tough-luck approach! In contrast, Abe Lederman, founder and president of Deep Web Technologies, a leading supplier of federated search technology, said this about his company’s work with Scitopia, a federated service for scientific scholarly society publishers: “We spend a significant amount of effort to get it as close to being right as possible for Scitopia, where we had much better access to the scientific societies that are content providers. It is not perfect and is still a challenge. The best we can do is transformation.”


Bottom line, technology is great, but really good federated services depend on human character as much as or more than technological brilliance. The people behind a federated service have to be willing and able to track user experience, analyze user needs, find and connect the right sources, build multiple layers of interfaces to satisfy user preferences and abilities, and then tweak, tweak, tweak until it works right for the user and keeps working right despite changes in database policies and procedures. A good federated system places a tremendous burden on its builders so that the search process feels effortless to the user.

By the way, the name changes are apparently not over. A new phrase has emerged for something that looks a lot like same old/same old: discovery services. EBSCO Discovery Service, ProQuest’s Serials Solutions’ Summon, ExLibris’ Primo, etc. These products focus on the library market and all build on a federated search approach. The main difference that I can distinguish – beyond different content types and sources – lies in the customization features they offer. Librarians licensing the services can do a lot of tweaking on their own. Some of the services even support a social networking function. That could help a lot, since, in this observer’s humble opinion, the most critical element in success for these services, no matter what you call them, lies in the application of human intelligence and a commitment to quality.

About the Author

Barbara Quint of Unlimited Priorities is editor-in-chief of Searcher: The Magazine for Database Professionals. She also writes the “Up Front with bq” column in Information Today, as well as frequent NewsBreaks on


Preparing Your Documents for Disaster

Written for Unlimited Priorities and DCLnews Blog.

DCL guest columnist Marydee Ojala describes how digitizing your documents can pay off in the event of a natural disaster or other catastrophe. She also gives some useful tips on developing a disaster preparedness plan that will keep your documents and other critical data safe—even in extreme situations.

You Can’t Hope for the Best If You Don’t Plan for the Worst

On August 4, 2009, a flash flood hit the Louisville Free Public Library in Kentucky, doing some $5 million in damage. Of course I heard of this via Twitter—a sign of the times. My friend Greg Schwartz, the library’s Systems Manager, tweeted about the flood as it was happening. His tweets about six feet of water in the server room were rapidly picked up by influential bloggers, and word spread quickly throughout Twitter land and the blogosphere. The flood destroyed books, processing equipment, bookmobiles, and some personal automobiles, along with the servers, among other things. Greg’s photo of the devastation (seen below) is at

Damage photo shared on Twitter by @gregschwartz


Do you believe disasters only occur to others? When you read about content being destroyed or severely damaged do you think: “Sure, but it can’t happen here, not to me! My organization has never experienced floods, fires, tornadoes, hurricanes, earthquakes, or other disasters. Therefore, we never will.”

Obviously, digitizing old and rare materials is an excellent preservation tool, an antidote to destruction—but taking precautions to keep the original materials safe is important as well.

Sometimes people believe that lightning doesn’t strike twice; in the case of Louisville, there had been one other flash flood, but it was in 1937. Who thought it could happen again?

They may be irrational, but these thoughts are more common than most of us care to admit.

On the other hand, if you haven’t yet embarked on a digitization project, you need to ensure that the materials you intend to convert remain viable and in a condition that allows digitization to proceed. Water-damaged paper may be possible to digitize, but it’s probably going to cost more and the quality of the resulting digital document may not be as high as you’d like.

By their very nature, disasters are unpredictable. When flash floods hit the University of Hawaii, Manoa, campus in October, 2004, the library’s basement—home to Hawaiian government documents and rare maps, along with some faculty offices and the library file servers—quickly filled with water. Initially thought to be destroyed, many of the unique documents and maps were salvaged through a combination of freezing and dehumidifying. However, professors who had not taken care to back up their data on their hard drives lost, in some cases, decades of research.

The David Sarnoff Library in Princeton, New Jersey, was flooded when a heavy storm hit in 2007. The library stored archival materials, including 600 cubic feet of lab notebooks, technical reports, manuals, and manuscripts, in boxes in the basement. Executive Director Alexander B. Magoun admitted he never expected to see 20 inches of water soak the basement materials.

There are success stories, however. When the National Archives flooded in June 2006, the building sustained water damage, but no records were harmed. This may have had something to do with the fact that the National Archives, in conjunction with the Smithsonian Institution, Library of Congress, and National Park Service, publishes “A Primer on Disaster Preparedness, Management and Response: Paper-Based Materials” on its web site. Presumably, they paid attention to their own advice—and it seems to have paid off.

Of course, natural disasters are not limited to floods and storms. In Cologne, Germany, the archives building collapsed in March 2009 because of underground construction on the local subway tunnel. Unfortunately, the archives had no disaster plan in place, and documents dating back 1,000 years were not only mangled and torn but also waterlogged from ground water at the site. Though the collection was first feared to be a complete loss, later estimates are that 85% of it has been recovered, albeit in far-from-pristine condition. It will take years to piece together the document fragments, using technology initially developed to recreate East Germany’s Stasi files.

Fire poses threats to archives as well. In 2007, the Georgetown branch of the Washington D.C. public library burned, destroying or severely damaging the cultural heritage stored there, which included oil paintings, maps, photos, and documents. The library had no sprinkler system because management thought there would be more damage from the water than from a fire. In light of what happened, whether or not that was a correct decision is debatable.

When the Los Angeles Central Library burned in April 1986, I remember staff telling me that the building’s architecture helped fan the flames and that the structure was in violation of several parts of the city’s fire code. Ruled arson, the fire destroyed about 20% of the collection. Some of the rare books and manuscripts, such as a 1757 manuscript about California and a 1695 Shakespeare folio, were kept in a fireproof vault in the basement that was untouched. But the room housing a rare collection of U.S. patents dating back to 1790, which took 15 years and over $250,000 to compile, was gutted, as was the card file that provided a unique subject index to the library’s fiction collection. This struck home; before it disappeared in the fire, I used those cards to research opinions in fiction toward bankers and business executives.

Disaster prevention, preparedness, response, and recovery are hardly new topics. But I am struck by how frequently information professionals look at their colleagues’ predicaments and comment on the lack of preparedness within their own organizations. Often, even those with a plan haven’t updated it to accommodate the necessities of modern disaster preparedness.

A common thread in disaster preparedness is having a written plan. No need to reinvent the wheel: Several professional organizations put their sample disaster preparedness and recovery plans on the Web. The Society of American Archivists has a list of ideas to raise awareness of disaster readiness. Heritage Preservation: The National Institute for Conservation presents a comprehensive set of links to disaster preparedness resources on its web site. At the Texas State Library and Archives site, you’ll find a five-page template for disaster recovery, the South Central Kansas Library System put its comprehensive manual online, and the Lyrasis consortium provides links to sample plans.

Looking at these resources, several commonalities emerge:

  • Most important is not only to have a plan but also to remind people it’s there. Writing the plan is the beginning, not the end. Run some drills. Test the plan. Update it on a regular basis—even the National Archives’ plan is dated 1993, and emerging technologies can affect disaster plans.
  • Keep a list of individuals and agencies to be notified in case of a disaster. It should include names, phone numbers (landline and mobile), and email addresses. Given today’s social networking environment, some people on your contact list may be best reached by direct message via Twitter, Facebook, MySpace, and the like. Understand that word about your situation will surface on social networking sites, particularly Twitter.
  • Don’t store the list only on your organization’s computers—remember those drowned servers in basements? The same holds true for your master database of preserved materials.
  • People come first. Regardless of the historical importance or monetary value of your collection, risking people’s lives to protect it should not be required.
  • Think about the implications of where you store your collection. Basements are prone to flooding, even when there hasn’t been a flood in decades. If you must store items in basements, at least don’t put them on the floor. Think about the implications of sprinkler systems. Ask yourself which might do the most damage: smoke and scorching from fire or water from the sprinklers?
  • Back up computer data and store the backups in a different geographic area from the originals. Digitize as many unique and rare items as possible.
  • What about insurance? This may be a decision made by a parent organization over which you have little control. But consider this: Will you need insurance money to restore damaged materials? Think about the implications of total loss. Would insurance money compensate for not having valuable historic or business mission-critical items?
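The “back up and verify” advice above can be sketched in a few lines of Python. This is a minimal illustration only, not a preservation-grade tool; the paths and function names are hypothetical, and a real program would also push the dated copy to offsite storage:

```python
import hashlib
import shutil
from datetime import date
from pathlib import Path

def sha256(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def backup_and_verify(source: Path, dest_root: Path) -> Path:
    """Copy `source` into a dated folder under `dest_root`, then
    confirm every file arrived intact by comparing checksums."""
    dest = dest_root / f"backup-{date.today():%Y-%m-%d}"
    shutil.copytree(source, dest)
    for original in source.rglob("*"):
        if original.is_file():
            copy = dest / original.relative_to(source)
            if sha256(original) != sha256(copy):
                raise RuntimeError(f"Checksum mismatch: {copy}")
    return dest
```

The checksum pass matters: a backup that was never verified is only a hope, and verifying at copy time is far cheaper than discovering corruption after a disaster.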

Perhaps the most important lesson to be learned is to remove the “it can’t happen here” mentality from your organization. No one expects to lose historical collections to natural disasters. Preserving materials through digitization is a laudable endeavor, but it shouldn’t become a substitute for valuable source materials needlessly destroyed in natural disasters.


Everyone’s an Expert: The Crowdsourcing of History

Written for Unlimited Priorities and DCLnews Blog.

Marydee Ojala

Imagine yelling a question at a very large crowd of people. If the crowd is big enough, chances are someone in there is going to have the answer—as long as you’re prepared to sift through a lot of junk responses in order to find it. And so it is with crowdsourcing, the phenomenon in which tasks or questions are outsourced to the masses in an open call.

The phenomenon (if not the word itself) predates the internet era, but as you may imagine the internet has played an integral role in precipitating crowdsourcing’s recent coming of age. With sites like Wikipedia relying exclusively on expertise provided by a vast community of internet users, crowdsourced knowledge has grown into something many people rely on every day.

Indeed, crowdsourcing is changing the way we think about knowledge. Expertise is no longer the exclusive domain of experts; now John and Jane Q. Public have a say, too. The internet gives a platform to anyone who can type—and plenty of those who can’t—so anyone with an opinion can post a comment and be satisfied that they will be heard. It may be easy to forget, but it wasn’t always like this; if you wanted to say your piece, you had to be published or land an interview in a newspaper or on TV or radio. So until recently, most reactions to media and other forms of information occurred in a vacuum. The only people to hear your opinions were those in the room with you.

But news on the Internet has an immediacy that changes the dynamic of “talking back.” You can read a story on a newspaper website and immediately express your opinion in a comment box. You can also use social media to correct an article’s facts and give your view of the situation. You might write a blog post for a lengthy response, or limit yourself to the 140 characters of a Twitter tweet. You could call attention to the story’s deficiencies in a Facebook status update. Not only are you getting your ideas out in public, you’re starting a conversation.

When we look at digitized historical materials, we don’t always have the ability to respond. However, as more collections are digitized, expectations of those viewing and reading the documents will change. One driving force for expanded expectations is the Flickr Commons project.

Knowledge Sharing on Flickr Commons

Flickr, of course, is a photo sharing website where anyone can upload their photos. The Commons is an extension of this idea, but with a historical bent; it focuses on photo collections owned by museums, libraries, and archives. Thirty institutions participate, including the Library of Congress, National Library of Wales, British Library, National Library of New Zealand, Field Museum, Brooklyn Museum, Imperial War Museum (UK), Powerhouse Museum (Australia), Getty Research Institute, and Jewish Women’s Archive. Each uploads photographs from its collections that have no known copyright restrictions, along with whatever metadata and information it has about each photo. Then they throw open the floor to comments from registered users. The results can be extremely interesting, showing the value of crowdsourcing the expertise of ordinary people.

One example is a photograph of J.M. Schoonmaker that the Library of Congress (LOC) added to Flickr Commons. The LOC included the phrase “unverified data provided by the Bain News Service on the negatives or caption cards” and noted it came from the George Grantham Bain Collection. The first comment pointed to an online biography of James Schoonmaker; the second pointed out that the photo did have a date (6/25/1913) and that the date coincided with the publication of a book about Schoonmaker. The third comment said, “I started a Wikipedia entry.”

Not all the comments are as helpful as these. Frequently, all they say is “nice photo” or words to that effect. Sometimes the photo sparks a question. This happened with the Jewish Women’s Archive’s photo titled “Sarah Brody in Germany, 1945,” which shows three women in front of an airplane. A posted comment asked, “Anyone know why there is holes in the windows? Air pressure issue maybe?” Ignoring the grammatical error in the question, another participant quickly answered, “The holes were for ventilation. C-47s usually flew below 10,000 feet, eliminating the need for cabin pressurization. The aircraft flew so slow that some pilots would actually open the cockpit window to get a breeze in on a hot summer day.”

The Commons allows other forms of community sharing besides comments. You can tag the photos, just as with other Flickr photos outside the Commons, and you can add notes. The notes pop up when you move your cursor over relevant parts of the photographs. One example is the photograph of Gunner Thomas Harold Burton, 178 Brigade, Royal Field Artillery, from the Imperial War Museum’s collection on Flickr. Move your cursor over the box positioned on the insignia on his cap and you’ll see this question: “does anyone know what this means?” In the note just below that one is the answer: “Royal Artillery, British Army” and above it is a reference to the Wikipedia entry on Royal Artillery cap badges.

The idea of talking back regarding historic artifacts demonstrates the power of self-organized online communities. Jessica Johnston blogged about the George Eastman House’s experiences after one year on Flickr Commons. The statistics are impressive—almost 2 million views, 3,961 comments, 9,885 tags, 655 notes, and 26,008 favorites.

Talk Back to History (and History Talks Back)

What do talking back to your television set, commenting on a blog post or news article, and placing a note on a photograph at Flickr Commons have in common? Passion. When you feel passionately enough about something to add your voice, your opinions, your knowledge to an ongoing conversation, you’ve moved the discourse further along. You’ve added your expertise to the community.

On the flip side, you should recognize that you could be wrong. So could the other people commenting. A news item asserts, for example, that company A bought company B in 1963. You’re convinced it was 1972, and you say so in public. Then you research the company history and realize the 1963 date is correct and that 1972 is when company A bought company C. Although I’d like to believe that the wisdom of crowds is worth considering, I’m not about to accept that every single correction of a fact is valid. I can get just as annoyed with incorrect ‘corrections’ as with incorrect ‘facts.’

Validating the individual opinions, comments, notes, and supplemental posts on social media sites is a necessity. When you talk back to your television, local newspaper, or computer screen, it’s between you and that device. When you join a community formed around social media, it’s much more public. Those who monitor the Flickr Commons site for their participating institutions have learned that checking the veracity of the comments is time consuming. However, it’s a worthwhile endeavor, since most of the contributed expertise adds to our understanding of our shared past.

About the Author

Marydee Ojala edits ONLINE: Exploring Technology & Resources for Information Professionals and writes its business research column. She speaks frequently at information industry meetings and conferences.

Delores Meglio and the Information Generations

Interviewed by Marydee Ojala in ONLINE: Exploring
Technology & Resources for Information Professionals

Delores Meglio is a survivor, an online information industry survivor. She was profiled on the pages of ONLINE two decades ago (“ONLINE Interviews Delores Meglio of Information Access Company,” by Jeffery K. Pemberton, July 1987, pp. 17-24). Characterized then as an online pioneer, she hasn’t ceased her pioneering activities in the intervening years. Given all the twists and turns in the information industry, she’s seen a lot of changes. “I’ve been through generations in the development of the information industry,” she told me in May 2009, when we sat down to reminisce and look forward during the Enterprise Search Summit East conference.

Meglio is now Vice President, Publisher Relations for Knovel. But she can trace her career back to 1963, when she was hired as a serials assistant in Bell Labs’ technical library, checking in physical copies of magazines. She then moved to a records management position at the New York Port Authority, subject classifying corporate documents. From there it was records management at NBC.

Where Meglio really got in on the ground floor of the then nascent “computerized information business” (as she phrased it in 1987) was when she joined The New York Times Company as an abstractor in 1969. Progressing through the ranks, she became managing editor of the New York Times Information Bank—which still exists in the LexisNexis NEWS Library as the INFOBK File. That was before full text was widely available; the Information Bank held abstracts of articles that appeared in The New York Times.

Queen of Full Text

In 1983, Meglio moved to California to join Information Access Company (IAC), where its president, Morris (Morry) Goldstein, dubbed her “the queen of full text.” It was the early 1980s when the move from abstracted and indexed electronic sources to full text intensified. That was a technological generation shift. Bandwidth continued to expand, as did storage capacity. Hence, information industry companies added more and more full text.

Authority control was on the radar for Meglio at that time. IAC’s proprietary system enabled automated corrections. If an indexer entered an incorrect subject descriptor, the system would either automatically replace it with the correct descriptor or toss it back to the indexer if there was no obvious thesaurus term. Come to think of it, many of today’s systems use similar automated systems, although on more modern computers. In fact, during Meglio’s tenure in California, she migrated the production system from proprietary systems to client server technology.

IAC was acquired by Ziff Davis in 1980, and then sold to Thomson in 1994. In 1998, Thomson merged IAC, Gale Research, and Primary Media to form Gale Group, headquartered in Ann Arbor. That entity is now owned by Cengage.

Mobile Culture

Meglio decided to head back East in the late 1990s, leaving IAC with the title Senior Vice President, Content Development Division. She devoted her time après IAC to smaller, privately held companies that were creating technologically advanced electronic products aimed not at the library market but at a more general audience. One, for a health website, resulted in a joint venture with Henry Schein, the largest distributor of medical products in the U.S.

She still delights in a cultural database she built that covered all forms of the arts—museums, dance, opera, symphonies, theatre—as events that could be put on travelers’ mobile devices and sold into hotels to inform people as to what they could do when they weren’t in business meetings. Not only did Meglio license data, create a web-based production system, arrange for museum updates and cultural feeds, and identify data extraction software, she negotiated with National Geographic to put thumbnails of its photos with the database. It was an example of an early adoption of mobile technology and the marriage of traditional databases with new markets. Yet another generation of the evolving information industry.

Next Generation Full Text

In 2003, a former IAC executive approached Meglio on behalf of Knovel, a producer of full text technical reference materials for applied science and engineering. Knovel needed someone who understood electronic information, databases, and licensing. Hired by CEO Chris Forbes, Meglio quickly moved into a new area for her—science and technology—and another generation of full text information, one that enabled searching across both textual and numeric information. Numbers are critical to the research done by engineers, scientists, and others with a technical bent.

The data Meglio licenses is both from major publishers of reference books and from associations. At the moment, Knovel provides access to almost 2,000 reference works from over 40 international publishers. “We look for publishers with specialty content, some of whom have never before licensed their content for electronic distribution. It takes time to explain the benefits to them.”

Upon joining Knovel, Meglio found it a challenge to explain what the company did, particularly to the publishers whose data she wanted to license. Knovel isn’t a static publisher. Subscribers can not only search for scientific and technical data but also manipulate that data, change property values, perform calculations, display it the way they want to see it, and alter their suppositions. Knovel isn’t a typical ebook publisher; it’s a different model. It has interactive tables where searchers can show or hide rows and columns, move them around, and download to an Excel spreadsheet. Knovel’s equation plotter lets searchers pick the values that interest them and export those to Excel as well. The interactive nature of Knovel “makes data come alive,” explained Meglio.

Knovel Novelty

Early on, the novelty of Knovel became a selling point. Meglio recalls visiting an association publisher for her first licensing assignment. Settling down to demonstrate the system, she admitted she was no expert in engineering, the association’s major focus, but did understand information. As she showed the interactive tools, the publisher became very intrigued. “Can I sit there?” he asked and took over the keyboard from Meglio. Completely engaged, he was delighted to find answers to his questions. “The system just sold itself,” grinned Meglio. “It’s the right data coupled with the right software tools.”

She also found that Knovel’s customers are very detail oriented. They need precision, and they don’t want to be referred to a source; they want the answer delivered to them. That’s a very different setting from what she experienced at the Information Bank or IAC. “Abstract and index databases indicate where to find information,” she said. “With Knovel, we give you the information you need.” Harking back to her “A&I” days, Meglio acknowledges that abstracts are an excellent avenue to an overview, particularly when the topic is a new area for the researcher. Abstracts also suit executives who lack the time for an in-depth review. Generally speaking, though, that’s not Knovel’s core audience.

Meglio is struck by how professionals in different scientific disciplines and companies have diverse approaches to research. Knovel’s customers come mainly from the corporate world—chemical, oil and gas, aerospace, pharmaceutical, civil engineering, construction, manufacturing, and food science companies. However, about 20% are academic and 10% government. Knovel spans some 20 different scientific subject areas, leading to some interesting cross-disciplinary discoveries. She cites the example of a mechanical engineer who found the answer to his problem in a food science text, not the first place he would have thought to look. Although open access has received much attention, she’s finding no pushback to Knovel’s information, probably because open access concentrates on journal articles while Knovel supplies reference data from handbooks, encyclopedias, manuals, and other reference books, as well as databases.

Full Text Czarina

If Meglio was considered the queen of full text in the 1980s, I think she must now be a czarina. As the information industry shifted to web delivery, the possibilities of full text expanded. No longer limited by bandwidth or storage constraints, full text not only expanded in quantity but in definition. Full text is no longer merely text. It’s numbers, images, maps, charts, graphs, tables, even formulae, equations, and computer source code.

Some things remain the same. Meglio says, “It’s no joke what happens behind the scenes.” The data she licenses does not simply appear on the screen the day after the contract is signed. It must be massaged to fit the Knovel software and be searchable in aggregations with the other sources Knovel offers. And then there are people. “Creating products, whatever generation we’re talking about, still requires people. Knovel has a robust taxonomy, but human beings are needed to oversee it.”

Meglio also reminds me that some questions regarding full text haven’t changed. “What is the real cost? Who are you trying to reach with your data? Who are you reaching?” When talking with publishers, she needs to reinforce the value of aggregation, that they benefit from being associated with other publishers.

Still at the Infancy Stage

With four decades in the information industry and more than four generations of information products, Delores Meglio has seen business models come and go. She’s watched technology improvements and seen some technologies that never gained traction. Innovation, in her opinion, isn’t ending. The information industry is still in its infancy.

The two trends she’s looking at? “I’m really excited about the integration of technology and content. It’s no longer about mounting a source; it’s embedding that information with software that allows searchers to perform interactive activities.” Looking further ahead, it’s the new technologies, particularly 3-dimensional ones that let searchers visualize answers in 3D and create things that couldn’t have been done in the past. Having survived the information industry’s first infancy, Meglio is looking forward to many more generations of information.

About the Author

Marydee Ojala edits ONLINE, a journal that spans almost as many information generations as Delores Meglio.
