Copyright Clearance Center Launches RightFind™ XML for Mining

XML for Mining is built on the RightFind™ platform, CCC’s unique suite of cloud-based workflow solutions that offer immediate, easy access to a full range of STM peer-reviewed journal content.

Linguamatics I2E text mining software is the first third-party text mining platform integrated with RightFind XML for Mining; integrations with other third-party solutions are planned. Publishers participating in the offering include Springer Science+Business Media, Wiley, BMJ, the Royal Society of Chemistry, Taylor & Francis, SAGE, Cambridge University Press, American Diabetes Association, American Society for Nutrition, Future Medicine and more. The module is available to businesses through the sales teams of CCC and RightsDirect, CCC’s European subsidiary.

Using RightFind XML for Mining, researchers will be able to identify articles associated with their research from publications to which they subscribe and from those that fall outside their subscriptions.

Learn more at: Copyright Clearance Center Launches Text Mining Solution

Code of Best Practices in Fair Use for Academic and Research Libraries released by ARL

Last week, the Association of Research Libraries (ARL) announced the release of the Code of Best Practices in Fair Use for Academic and Research Libraries. It was developed in partnership with the Center for Social Media and the Washington College of Law at American University.

This is a welcome document. The United States Copyright Office site provides only very general guidelines. For example, it states:

The distinction between fair use and infringement may be unclear and not easily defined. There is no specific number of words, lines, or notes that may safely be taken without permission. Acknowledging the source of the copyrighted material does not substitute for obtaining permission.

The ARL document defines “fair use” as “the right to use copyrighted material without permission or payment under some circumstances, especially when the cultural or social benefits of the use are predominant.” This has been a controversial subject for many years and the controversy only continues to grow more intense as published material is rapidly re-purposed for use on the Internet — a Google search of the phrase “fair use” turns up 33,600,000 pages, most of which add little real guidance.

The document points out:

Fair use is the right to use copyrighted material without permission or payment under some circumstances, especially when the cultural or social benefits of the use are predominant. It is a general right that applies even — and especially — in situations where the law provides no specific statutory authorization for the use in question. Consequently, the fair use doctrine is described only generally in the law, and it is not tailored to the mission of any particular community. Ultimately, determining whether any use is likely to be considered “fair” requires a thoughtful evaluation of the facts, the law, and the norms of the relevant community.

While the Code is written specifically for academic librarians, it offers guidance that can be applied in other situations. It was developed from interviews with experienced research librarians, followed by small group discussions held with library policymakers around the country to reach a consensus on applying fair use.

The Code deals with such questions as: when and how much copyrighted material can be digitized for student use; whether video should be treated the same way as print; how libraries’ special collections can be made available online; and whether libraries can archive websites for the use of future students and scholars.

The Code identifies the relevance of fair use in eight recurrent situations for librarians:

  • Supporting teaching and learning with access to library materials via digital technologies
  • Using selections from collection materials to publicize a library’s activities, or to create physical and virtual exhibitions
  • Digitizing to preserve at-risk items
  • Creating digital collections of archival and special collections materials
  • Reproducing material for use by disabled students, faculty, staff, and other appropriate users
  • Maintaining the integrity of works deposited in institutional repositories
  • Creating databases to facilitate non-consumptive research uses (including search)
  • Collecting material posted on the web and making it available

Each situation is described in detail and followed by a fair-use statement, which is then qualified with limitations and enhancements. For example, for the situation “Supporting teaching and learning with access to library materials via digital technologies,” the Code states that “It is fair use to make appropriately tailored course-related content available to enrolled students via digital networks,” but qualifies that with limitations including “Use of more than a brief excerpt from such works on digital networks is unlikely to be transformative and therefore unlikely to be a fair use” and “Only eligible students and other qualified persons (e.g., professors’ graduate assistants) should have access to material.” The enhancements to the statement include “The case for fair use is enhanced when libraries prompt instructors, who are most likely to understand the educational purpose and transformative nature of the use, to indicate briefly in writing why particular material is requested, and why the amount requested is appropriate to that pedagogical purpose.”

The takeaway? It’s said well by Nancy Sims, the Copyright Program Librarian at the University of Minnesota Libraries, writing on The Copyright Librarian Blog:

The specific facts are of course still the real determinants of whether a particular use is fair, and of whether and how an institution chooses to tolerate the uncertainty that is necessarily concomitant with a fair use justification for any activities. But the Best Practices document gives the library community a great jumping-off point for deeper examinations of many of our common copyright use situations, and are a great contribution to the toolbox of anyone dealing with copyright issues, in libraries and beyond.

Interview with Deep Web Technologies’ Abe Lederman

Written for Unlimited Priorities and DCLnews Blog by Barbara Quint

Abe Lederman is President and CEO of Deep Web Technologies, a software company that specializes in mining the deep web.

Barbara Quint: So let me ask a basic question. What is your background with federated search and Deep Web Technologies?

Abe Lederman: I started in information retrieval way back in 1987. I’d been working at Verity for 6 years or so, through the end of 1993. Then I moved to Los Alamos National Laboratory, one of Verity’s largest customers. For them, I built a Web-based application on top of the Verity search engine that powered a dozen applications. Then, in 1997, I started consulting for the Department of Energy’s Office of Scientific and Technical Information (OSTI). The DOE’s Office of Environmental Management wanted to build something to search multiple databases. Back then, we called it distributed search, not federated search.

The first application I built is now called the Environmental Science Network. It’s still in operation almost 12 years later. The first version I built with my own fingers on top of a technology devoted to searching collections of Verity documents. I expanded it to search on the Web. We used that for 5 to 6 years. I started Deep Web Technologies in 2002 and around 2004 or 2005, we launched a new version of federated search technology written in Java. I’m not involved in writing any of that any more. The technology in operation now has had several iterations and enhancements and now we’re working on yet another generation.

BQ: How do you make sure that you retain all the human intelligence that has gone into building the original data source when you design your federated searching?

AL: One of the things we do that some other federated search services are not quite as good at is to try to take advantage of all the abilities of our sources. We don’t ignore metadata on document type, author, date ranges, etc. In many cases, a lot of the databases we search — like PubMed, Agricola, etc. — are very structured.

BQ: How important is it for the content to be well structured? To have more tags and more handles?

AL: The more metadata that exists, the better results you’re going to get. In the library world, a lot of data being federated does have all of that metadata. We spend a lot of effort to do normalization and mapping. So if the user wants to search a keyword field labeled differently in different databases, we do all that mapping. We also do normalization of author names in different databases — and that takes work! Probably the best example of our author normalization is in Scitopia.
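
To make that mapping idea concrete, here is a minimal Python sketch of per-source field translation. The source names and field labels below are illustrative assumptions, not Deep Web Technologies’ actual configuration.

    # Illustrative sketch of per-source field mapping for one federated query.
    # Source names and field labels are hypothetical stand-ins.
    FIELD_MAP = {
        "pubmed":   {"author": "AU",      "keyword": "MH",            "year": "DP"},
        "agricola": {"author": "author",  "keyword": "subject",       "year": "pub_year"},
        "catalog":  {"author": "creator", "keyword": "subject_terms", "year": "date"},
    }

    def translate(source, field, value):
        """Rewrite a generic field query into the label a given source expects."""
        native_field = FIELD_MAP[source][field]
        return f"{native_field}:{value}"

    # One user query becomes three source-specific queries:
    for src in FIELD_MAP:
        print(src, "->", translate(src, "author", "lederman"))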

BQ: How do you work with clients? Describe the perfect client or partner.

AL: I’m very excited about a new partnership with Swets, a large global company. We have already started reselling our federated search solutions through them. Places we’re working with include the European Space Agency and soon the European Union Parliament, as well as some universities.

We pride ourselves on supplying very good customer support. A lot of our customers talk to me directly. We belong to a small minority of federated search providers that can sell a product to a customer for internal deployment and still work with them to monitor or fix any issues with the connectors to the sources we’re federating, even though we get no direct access to their installation. A growing part of our business uses the SaaS model. We’re seeing a lot more of that. There’s also the hybrid approach, such as that used by DOE’s OSTI. At OSTI our software runs on servers in Oak Ridge, Tennessee, but we maintain all their federated search applications. Stanford University is another example. In September we launched a new app that federates 28 different sources for their schools of science and engineering.

BQ: How are you handling new types of data, like multimedia or video?

AL: We haven’t done that so far. We did make one attempt to build a federated search for art image databases, but, unfortunately for the pilot project, the databases had poor metadata and search interfaces. So that particular pilot was not terribly successful. We want to go back to richer databases, including video.

BQ: How do you gauge user expectations and build that into your work to keep it user-friendly?

AL: We do track queries submitted to whatever federated search applications we are running. We could do more. We do provide Help pages, but probably nobody looks at them. Again, we could do more to educate customers. We do tend to be one level removed from end-users. For example, Stanford’s people have probably done a better job than most customers in creating some quick guides and other material to help students and faculty make better use of the service.

BQ: How do you warn (or educate) users that they need to do something better than they have, that they may have made a mistake? Or that you don’t have all the needed coverage in your databases?

AL: At the level of feedback we are providing today, we’re not there yet. It’s a good idea, but it would require pretty sophisticated feedback mechanisms. One of the things we have to deal with is that when you’re searching lots of databases, they behave differently from each other. Just look at dates (and it’s not just dates): some may not let you search on a date range. A user may want to search 2000-2010, and some databases may display the date but not let you search on it; some won’t do either. Where a database doesn’t let you search on a date range but displays it, you may get results outside the range, which are shown with the unranked results. How to make it clear to the user what is going on is a big thing for the future.
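
As a rough illustration of the date problem, here is a minimal Python sketch that post-filters results when a source displays a date but cannot search on a range, flagging out-of-range records rather than hiding them. The field names and the flagging approach are assumptions for illustration, not DWT’s implementation.

    # Sketch: when a source cannot search a date range but does return a date,
    # apply the range after the fact and flag anything outside it.
    def within_range(record, start, end):
        year = record.get("year")
        return year is not None and start <= year <= end

    def apply_date_range(records, start, end, source_supports_range):
        if source_supports_range:
            return records                       # the source already applied the range
        kept, flagged = [], []
        for r in records:
            (kept if within_range(r, start, end) else flagged).append(r)
        # keep the out-of-range records, but mark them so the interface can
        # show them with the unranked results instead of silently mixing them in
        return kept + [dict(r, out_of_range=True) for r in flagged]

    hits = [{"title": "A", "year": 2005}, {"title": "B", "year": 1998}]
    print(apply_date_range(hits, 2000, 2010, source_supports_range=False))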

BQ: What about new techniques for reaching “legacy” databases, like the Sitemap Protocol used by Google and other search engines?

AL: That’s used for harvesting information the way that Google indexes web sites. The Sitemap Protocol is used to index information and doesn’t apply to us. Search engines like Google are not going into the databases the way real-time federated search does. Some content owners want to expose all or some of the content behind their search forms to search engines like Google; that could include DOE OSTI’s Information Bridge and PubMed for some content. They do expose that content to Google through sitemaps. A couple of years ago, there was lots of talk about Google statistically filling out forms to get at content behind databases. In my opinion, they’re doing this in a very haphazard manner. That approach won’t really work.
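
For readers unfamiliar with the protocol, here is a minimal Python sketch of the harvesting side: a crawler reads a site’s sitemap.xml and gets back a flat list of URLs to fetch and index, which is a very different operation from querying a live database at search time. The sitemap URL below is a placeholder, not a real endpoint.

    # Sketch: reading a sitemap.xml the way a harvesting crawler would.
    import urllib.request
    import xml.etree.ElementTree as ET

    SITEMAP_URL = "https://example.org/sitemap.xml"   # hypothetical sitemap location
    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    def list_urls(sitemap_url):
        with urllib.request.urlopen(sitemap_url) as resp:
            tree = ET.parse(resp)
        # every <loc> element is a page the crawler is invited to fetch and index
        return [loc.text for loc in tree.findall(".//sm:loc", NS)]

    # A harvester would then fetch each URL and add it to a central index;
    # nothing here queries a database in real time.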

BQ: Throughout the history of federated search, with all its different names, there have been some questions and complaints about the speed of retrieving results and about the completeness of those results, given the lack of rationalizing or normalizing of alternative data sources. Comments?

AL: We’re hearing a fair amount of negative comment these days on federated search, and there have been a lot of poor implementations. For example, federated search gets blamed for being really slow, but that mostly happens when federated search systems wait until every search is complete before displaying any results to the user. We’ve pioneered incremental search results. In our version, results appear within 3-4 seconds. We display whatever results have returned while, in the background, our server is still processing and ranking results. At any time, the user can ask for a merge of the results they haven’t gotten yet. So the user gets a good experience.
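
A minimal Python sketch of the incremental idea: each source’s hits are shown as soon as that source returns, while slower sources are still being processed. The source names and timings are stand-ins, not DWT’s actual architecture.

    # Sketch: display results from each source as it finishes, rather than
    # waiting for the slowest source before showing anything.
    from concurrent.futures import ThreadPoolExecutor, as_completed
    import random
    import time

    SOURCES = ["pubmed", "agricola", "catalog"]        # illustrative source names

    def search_source(name, query):
        time.sleep(random.uniform(0.5, 3.0))           # stand-in for a remote search
        return name, [f"{name} hit for '{query}'"]

    def federated_search(query):
        with ThreadPoolExecutor() as pool:
            futures = [pool.submit(search_source, s, query) for s in SOURCES]
            for fut in as_completed(futures):          # yields each source as it finishes
                name, hits = fut.result()
                print(f"[{name}] {len(hits)} result(s) shown while others still run")

    federated_search("deep web")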

BQ: If the quality of the search experience differs so much among different federated search systems, when should a client change systems?

AL: We’ve had a few successes with customers moving from one federated search service to ours. The challenge is getting customers to switch. We realize there’s a fairly significant cost in switching, but, of course, we love to see new customers. For customers getting federated search as a service, it costs less than if the product were installed on site. So that makes it more feasible to change.

BQ: In my last article about federated searching, I mentioned the new discovery services in passing. I got objections from some people about my descriptions or, indeed, equating them with federated search at all. People from ProQuest’s Serials Solutions told me that their Summon was different because they build a single giant index. Comments?

AL: There has certainly been a lot of talk about Summon. On a superficial look, Summon has a lot of positive things: it’s fast and maybe does a better job (or has the potential to do a better job) at relevance ranking. It bothers me that it is non-transparent about a lot of things. Maybe customers can learn more about what’s in it. Dartmouth did a fairly extensive report on Summon after over a year of working with it. The review was fairly mixed: lots of positives and comments that it looks really nice, lots of bells and whistles in terms of limiting searches to peer-reviewed, full-text-available, or library-owned and licensed content. But beneath the surface, a lot is missing. It’s lagging behind on indexing. We can do things quicker than Summon. I’ve heard about long implementation times for libraries trying to get their own content into Summon. In federated searching, it only takes us a day or two to add someone’s catalog into the mix. If they have other internal databases, we can add them much quicker.

BQ: Thanks, Abe. And lots of luck in the future.

Related Links

Deep Web Technologies – www.deepwebtech.com
Scitopia – www.scitopia.org
Environmental Science Network (ESNetwork) – www.osti.gov/esn

About the Author

Barbara Quint of Unlimited Priorities is editor-in-chief of Searcher: The Magazine for Database Professionals. She also writes the “Up Front with bq” column in Information Today, as well as frequent NewsBreaks on Infotoday.com.

Federated Searching: Good Ideas Never Die, They Just Change Their Names

Written by Barbara Quint for Unlimited Priorities and DCLnews Blog.

“I don’t want to search! I want to find!!” “Just give me the answer, but make sure it’s right and that I’m not missing anything.” In a world of end-user searchers, that’s what everyone wants, a goal that can explain baldness among information industry professionals and search software engineers. Tearing your hair out isn’t good for the scalp.

And, for once, Google can’t solve the problem. Well, at least, not all the problems. The Invisible or Dark or Deep Web, whatever you call the areas of the Web where legacy databases reside with interfaces old when the Internet was young, where paywalls and firewalls block the paths to high-quality content, where user authentication precedes any form of access — here lie the sources that end-users may need desperately and that information professionals, whether librarians or IT department staff, work to provide their clients.

The challenge of enabling an end-user searcher community to extract good, complete results from numerous, disparate sources with varying data content, designs, and protocols is nothing new. Even back in the days when only professional searchers accessed online databases, searchers wanted some way to find answers in multiple files without having to slog through each database one at a time. In those days, the solution was called multi-file or cross-file searching, e.g. Dialog OneSearch or files linked via Z39.50 (the ANSI/NISO standard for data exchange). As the Internet and its Web took over the online terrain, different names emerged, such as portal searching and — the winner in recent years — federated searching.

So what does federated searching offer? It takes a single, simple (these days, usually Google-like) search query and transforms it into whatever format is needed to tap into each file in a grouping of databases. It then extracts the records, manipulates them to improve the user experience (removing duplicates, merging by date or relevance, clustering by topic, etc.), and returns the results to the user for further action. The databases tapped may include both external databases, e.g. bibliographic/abstract databases, full-text collections, web search engines, etc., and internal or institutional databases, e.g. library catalogs or corporate digital records.
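
A minimal Python sketch of that pipeline, with stub connectors standing in for real source APIs; the de-duplication and ranking rules here are deliberately simplistic and purely illustrative.

    # Sketch: one query, per-source retrieval, de-duplication, and a merged,
    # date-ordered result list. Real connectors would call each source's API.
    def connector_a(query):
        return [{"title": "Deep web mining", "year": 2010}]

    def connector_b(query):
        return [{"title": "Deep web mining", "year": 2010},
                {"title": "Federated search", "year": 2009}]

    CONNECTORS = [connector_a, connector_b]

    def federate(query):
        results = []
        for search in CONNECTORS:                      # fetch from every source
            results.extend(search(query))
        seen, merged = set(), []
        for r in results:                              # remove duplicates by title
            key = r["title"].lower()
            if key not in seen:
                seen.add(key)
                merged.append(r)
        return sorted(merged, key=lambda r: r["year"], reverse=True)   # merge by date

    print(federate("deep web"))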

In a sense, all databases that merge multiple sources, whether Google tracking the Open Web or ProQuest or Gale/Cengage aggregating digital collections of journals and newspapers or Factiva or LexisNexis building search services collections from aggregators and publishers, offer a uniform search experience for searching multiple sources. Even accessing legacy systems that use rigid interfaces is no longer unique to federated services as Google, Microsoft, and other services have begun to apply the open source Sitemap Protocol to pry open the treasures in government and other institutional databases. The key difference in federated searching is that it usually involves separate journeys by systems to access collections located in different sites. This can mean problems in scalability and turnaround speed, if a system gets bogged down by a slow data source.
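
One common mitigation, sketched below in Python, is a per-source timeout so that a slow source is reported as incomplete instead of holding up everyone else’s results; this is a generic illustration, not any particular vendor’s approach.

    # Sketch: give each source a deadline; sources that miss it are reported
    # as incomplete rather than stalling the whole response.
    from concurrent.futures import ThreadPoolExecutor, TimeoutError
    import time

    def fast_source(query):
        return ["quick result"]

    def slow_source(query):
        time.sleep(10)                                 # stand-in for a bogged-down source
        return ["late result"]

    def search_with_timeout(query, timeout=2.0):
        pool = ThreadPoolExecutor()
        futures = {"fast": pool.submit(fast_source, query),
                   "slow": pool.submit(slow_source, query)}
        results, incomplete = {}, []
        for name, fut in futures.items():
            try:
                results[name] = fut.result(timeout=timeout)
            except TimeoutError:
                incomplete.append(name)                # report this source as unfinished
        pool.shutdown(wait=False)                      # don't hold the response for stragglers
        return results, incomplete

    print(search_with_timeout("federated search"))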

More important, however, are the problems of truly making the systems perform effectively for end-users. Basically, a lot of human intelligence and expertise, not to mention sweat and persistent effort, has to go into these systems to make them “simple” and effective for users. For example, most of the databases have field structures where key metadata resides. A good federated system has to know just how each field in each database is structured and how to transform a search query to extract the needed data. Author or name searching alone involves layers of questions. Do the names appear firstname-lastname or last name-comma-firstname? Are there middle names or middle initials? What separates the components of the names — periods, periods and spaces, just spaces? The list goes on and on — and that’s just for one component.
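
To show how much variation even one field can hide, here is a minimal Python sketch that reduces a few common author-name layouts to a single form. Real author normalization, such as the work done for Scitopia, is far more involved; this only shows the idea.

    # Sketch: fold a few author-name layouts into one "lastname, initial" form.
    import re

    def normalize(name):
        name = re.sub(r"\.", " ", name).strip()        # drop periods
        if "," in name:                                 # "Lederman, Abe" style
            last, first = [part.strip() for part in name.split(",", 1)]
        else:                                           # "Abe Lederman" style
            parts = name.split()
            first, last = " ".join(parts[:-1]), parts[-1]
        initial = first[0].lower() if first else ""
        return f"{last.lower()}, {initial}"

    for variant in ["Abe Lederman", "Lederman, Abe", "A. Lederman"]:
        print(variant, "->", normalize(variant))       # all map to "lederman, a"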

So how do federated search services handle these problems? In an article written by Miriam Drake that appeared in the July-August 2008 issue of Searcher entitled “Federated Search: One Simple Query or Simply Wishful Thinking,” a leading executive of a federated service selling to library vendors was quoted as saying, “We simply search for a text string in the metadata that is provided by the content providers – if the patron’s entry doesn’t match that of the content provider, they may not find that result.” Ah, the tough luck approach! In contrast, Abe Lederman, founder and president of Deep Web Technologies (www.deepwebtech.com), a leading supplier of federated search technology, responded about his company’s work with Scitopia, a federated service for scientific scholarly society publishers, “We spend a significant amount of effort to get it as close to being right as possible for Scitopia where we had much better access to the scientific societies that are content providers. It is not perfect and is still a challenge. The best we can do is transformation.”

Bottom line, technology is great, but really good federated services depend on human character as much as or more than technological brilliance. The people behind the federated service have to be willing and able to track user experience, analyze user needs, find and connect up the right sources, build multiple layers of interfaces to satisfy user preferences and abilities, and then tweak, tweak, tweak until it works right for the user and keeps on working right despite changes in database policies and procedures. A good federated system places a tremendous burden on its builders so that the search process feels effortless to users.

By the way, the name changes are apparently not over. A new phrase has emerged for something that looks a lot like same old/same old: discovery services. EBSCO Discovery Service, ProQuest’s Serials Solutions’ Summon, ExLibris’ Primo, etc. These products focus on the library market and all build on a federated search approach. The main difference that I can distinguish – beyond different content types and sources – lies in the customization features they offer. Librarians licensing the services can do a lot of tweaking on their own. Some of the services even support a social networking function. That could help a lot, since, in this observer’s humble opinion, the most critical element in success for these services, no matter what you call them, lies in the application of human intelligence and a commitment to quality.

About the Author

Barbara Quint of Unlimited Priorities is editor-in-chief of Searcher: The Magazine for Database Professionals. She also writes the “Up Front with bq” column in Information Today, as well as frequent NewsBreaks on Infotoday.com.
