Written for Unlimited Priorities and DCLnews Blog by Barbara Quint
Abe Lederman is President and CEO of Deep Web Technologies, a software company that specializes in mining the deep web.
Barbara Quint: So let me ask a basic question. What is your background with federated search and Deep Web Technologies?
Abe Lederman: I started in information retrieval way back in 1987. I’d been working at Verity for 6 years or so, through the end of 1993. Then I moved to Los Alamos National Laboratory, one of Verity’s largest customers. For them, I built a Web-based application on top of the Verity search engine that powered a dozen applications. Then, in 1997, I started consulting for the Department of Energy’s Office of Scientific and Technical Information (OSTI). The DOE’s Office of Environmental Management wanted to build something to search multiple databases. Back then, we called it distributed search, not federated search.
The first application I built is now called the Environmental Science Network. It’s still in operation almost 12 years later. The first version I built with my own fingers on top of a technology devoted to searching collections of Verity documents. I expanded it to search on the Web. We used that for 5 to 6 years. I started Deep Web Technologies in 2002 and around 2004 or 2005, we launched a new version of federated search technology written in Java. I’m not involved in writing any of that any more. The technology in operation now has had several iterations and enhancements and now we’re working on yet another generation.
BQ: How do you make sure that you retain all the human intelligence that has gone into building the original data source when you design your federated searching?
AL: One of the things we do that some other federated search services are not quite as good at is to try to take advantage of all the abilities of our sources. We don’t ignore metadata on document type, author, date ranges, etc. In many cases, a lot of the databases we search — like PubMed, Agricola, etc. — are very structured.
BQ: How important is it for the content to be well structured? To have more tags and more handles?
AL: The more metadata that exists, the better results you’re going to get. In the library world, a lot of data being federated does have all of that metadata. We spend a lot of effort to do normalization and mapping. So if the user wants to search a keyword field labeled differently in different databases, we do all that mapping. We also do normalization of author names in different databases — and that takes work! Probably the best example of our author normalization is in Scitopia.
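The mapping and normalization described above can be illustrated with a minimal sketch. It is not Deep Web Technologies' implementation; the source names (`source_a`, `source_b`), the field labels, and the helper functions are all hypothetical, and the author normalization is deliberately crude (surname plus first initial).

```python
import re

# Hypothetical per-source field labels. A real connector would hold the
# actual field names each database expects.
FIELD_MAP = {
    "source_a": {"keyword": "kw", "author": "au", "title": "ti"},
    "source_b": {"keyword": "subject", "author": "creator", "title": "title"},
}


def translate_field(source, field, value):
    """Map a common field name to the source's own label.

    Returns (native_field, value), or None if the source has no such field.
    """
    native = FIELD_MAP.get(source, {}).get(field)
    if native is None:
        return None
    return (native, value)


def normalize_author(name):
    """Reduce 'Lederman, Abe', 'Abe Lederman' and 'A. Lederman' to one key.

    Assumption: surname plus first initial is enough for this illustration.
    """
    name = name.strip()
    if not name:
        return ""
    if "," in name:
        # "Last, First" form
        last, _, first = name.partition(",")
    else:
        # "First Last" form: treat the final token as the surname
        parts = name.split()
        last, first = parts[-1], " ".join(parts[:-1])
    last = re.sub(r"[^a-z\- ]", "", last.lower()).strip()
    first = first.strip().lower()
    initial = first[0] if first else ""
    return f"{last} {initial}".strip()
```

With this sketch, `translate_field("source_b", "keyword", "groundwater")` yields the label that `source_b` expects, and the three spellings of an author name collapse to the same key so that records from different databases can be grouped.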
BQ: How do you work with clients? Describe the perfect client or partner.
AL: I’m very excited about a new partnership with Swets, a large global company. We have already started reselling our federated search solutions through them. Places we’re working with include the European Space Agency and soon the European Union Parliament, as well as some universities.
We pride ourselves on supplying very good customer support. A lot of our customers talk to me directly. We belong to a small minority of federated search providers that can both sell a product to a customer for deployment internally and still work with them to monitor or fix any issues with connectors to what we’re federating, even though we get no direct access. A growing part of our business uses the SaaS model. We’re seeing a lot more of that. There’s also the hybrid approach, such as that used by DOE’s OSTI. At OSTI our software runs on servers in Oak Ridge, Tennessee, but we maintain all their federated search applications. Stanford University is another example. In September we launched a new app that federates 28 different sources for their schools of science and engineering.
BQ: How are you handling new types of data, like multimedia or video?
AL: We haven’t done that so far. We did make one attempt to build a federated search for art image databases, but, unfortunately for the pilot project, the databases had poor metadata and search interfaces. So that particular pilot was not terribly successful. We want to go back to reach richer databases, including video.
BQ: How do you gauge user expectations and build that into your work to keep it user-friendly?
AL: We do track queries submitted to whatever federated search applications we are running. We could do more. We do provide Help pages, but probably nobody looks at them. Again, we could do more to educate customers. We do tend to be one level removed from end-users. For example, Stanford’s people have probably done a better job than most customers in creating some quick guides and other material to help students and faculty make better use of the service.
BQ: How do you warn (or educate) users that they need to do something better than they have, that they may have made a mistake? Or that you don’t have all the needed coverage in your databases?
AL: At the level of feedback we are providing today, we’re not there yet. It’s a good idea, but it would require pretty sophisticated feedback mechanisms. One of the things we have to deal with is that when you’re searching lots of databases, they behave differently from each other. Just look at dates (and it’s not just dates). A user may want to search 2000-2010. Some databases may display the date but not let you search on it; some won’t do either. Where the database displays the date but doesn’t let you search on a date range, you may get results outside of the date range, and we display them with the unranked results. How to make it clear to the user what is going on is a big thing for the future.
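The date-range cases described above can be sketched as a small classification step. This is an illustration only, not how Deep Web Technologies handles it; the `Source` and `Result` types, the capability flags, and the label strings are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Source:
    name: str
    can_search_dates: bool  # source accepts a date range in the query
    shows_dates: bool       # source returns a date with each record


@dataclass
class Result:
    title: str
    year: Optional[int]     # None when the source gives no date


def classify(source, result, start, end):
    """Label one result against the requested year range [start, end]."""
    if source.can_search_dates:
        # The source applied the range itself, so the result is trusted.
        return "filtered_by_source"
    if source.shows_dates and result.year is not None:
        # The source cannot filter, but the date can be checked afterwards.
        return "in_range" if start <= result.year <= end else "out_of_range"
    # The source neither filters on dates nor shows them.
    return "date_unknown"
```

A front end could use these labels to tell the user why some results fall outside 2000-2010 or carry no date, which is the kind of feedback the answer above describes as future work.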
BQ: What about new techniques for reaching “legacy” databases, like the Sitemap Protocol used by Google and other search engines?
AL: That’s used for harvesting information the way that Google indexes web sites. The Sitemap Protocol is used to index information and doesn’t apply to us. Search engines like Google are not going into the databases, not like real-time federated search. Some content owners want to expose all or some of their content existing behind the search forms to search engines like Google. That could include DOE OSTI’s Information Bridge and PubMed for some content. They do expose that content to Google through sitemaps. A couple of years ago, there was lots of talk about Google statistically filling out forms for content behind databases. In my opinion, they’re doing this in a very haphazard manner. That approach won’t really work.
BQ: Throughout the history of federated search, with all its different names, there have been some questions and complaints about the speed of retrieving results and about the completeness of those results, which suffers from a lack of rationalizing or normalizing alternative data sources. Comments?
AL: These days we’re hearing a fair number of negative comments on federated search, and there have been a lot of poor implementations. For example, federated search gets blamed for being really slow, but that probably happens because most federated search systems wait until each search is complete before displaying any results to the user. We’ve pioneered incremental search results. In our version, results appear within 3-4 seconds. We display whatever results have returned, while, in the background, our server is still processing and ranking results. At any time, the user can ask for a merging of results they’ve not gotten. So the user gets a good experience.
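The idea of incremental results can be sketched with Python's standard `concurrent.futures` module. This is a simplified illustration, not Deep Web Technologies' system: `fake_source` is a hypothetical stand-in for a real connector, the scores are invented, and the sketch re-merges automatically each time a source responds rather than waiting for the user to request a merge.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_source(name, delay, hits):
    """Hypothetical connector: sleeps, then returns (score, title) pairs."""
    def search(query):
        time.sleep(delay)  # simulate a slow remote database
        return [(score, f"{name}: {title}") for score, title in hits]
    return search


def incremental_search(query, sources):
    """Yield a merged, re-ranked snapshot each time a source responds."""
    merged = []
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        futures = [pool.submit(search, query) for search in sources]
        # as_completed hands back futures in the order they finish,
        # so fast sources are shown without waiting for slow ones.
        for future in as_completed(futures):
            merged.extend(future.result())
            merged.sort(key=lambda hit: hit[0], reverse=True)
            yield list(merged)
```

A caller would render each snapshot as it arrives, for example `for snapshot in incremental_search("groundwater", sources): render(snapshot)`, where `render` is whatever display function the application provides. The user sees results from the fastest source first instead of waiting for the slowest one.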
BQ: If the quality of the search experience differs so much among different federated search systems, when should a client change systems?
AL: We’ve had a few successes with customers moving from one federated search service to ours. The challenge is getting customers to switch. We realize there’s a fairly significant cost in switching, but, of course, we love to see new customers. For customers getting federated search as a service, it costs less than if the product were installed on site. So that makes it more feasible to change.
BQ: In my last article about federated searching, I mentioned the new discovery services in passing. I got objections from some people about my descriptions or, indeed, equating them with federated search at all. People from ProQuest’s Serials Solutions told me that their Summon was different because they build a single giant index. Comments?
AL: There has certainly been a lot of talk about Summon. If someone starts off with a superficial look at Summon, it has a lot of positive things. It’s fast and maybe does a better job of relevance ranking (or has the potential to). It bothers me that it is non-transparent on a lot of things. Maybe customers can learn more about what’s in it. Dartmouth did a fairly extensive report on Summon after over a year of working with it. The review was fairly mixed: lots of positives and comments that it looks really nice, with lots of bells and whistles in terms of limiting searches to peer-reviewed, full-text-available, or library-owned and licensed content. But beneath the surface, a lot is missing. It’s lagging behind on indexing. We can do things quicker than Summon. I’ve heard about long implementation times for libraries trying to get their own content into Summon. In federated searching, it only takes us a day or two to add someone’s catalog into the mix. If they have other internal databases, we can add them much more quickly.
BQ: Thanks, Abe. And lots of luck in the future.
About the Author
Barbara Quint of Unlimited Priorities is editor-in-chief of Searcher: The Magazine for Database Professionals. She also writes the “Up Front with bq” column in Information Today, as well as frequent NewsBreaks on Infotoday.com.