Peeling Back the Semantic Web Onion

Written for Unlimited Priorities and DCLnews Blog.

An Interview with Intellidimension’s Chris Pooley & Geoff Chappell

Chris Pooley is the CEO and co-founder of Intellidimension. His role is to lead its corporate development and business development efforts. Previously, Chris was Vice-President of Business Development at Thomson Scientific and Healthcare where he was responsible for acquisitions and strategic partnerships.

Richard Oppenheim

As stated in the first article, “the Semantic Web is growing and coming to a neighborhood near you.” (Read Richard Oppenheim’s first article here) Since that article, I had a conversation with Chris Pooley, CEO and co-founder of Intellidimension. Chris understands how the web and the Semantic Web work today. So let’s peel back some of the layers surrounding the semantic web onion and bring the hype down to earth.

Chris has spent years working with and developing applications specifically for the semantic web. Along with Geoff Chappell, Intellidimension president, our conversation ranged around the semantics of the Semantic Web and, more importantly, the impact it will have for access to information resources.

The vision of the founding fathers of the World Wide Web Consortium was for information to be accessible easily and in large volume with a process enabling the same information to be used for infinite purposes. For example, a weather forecast may determine whether your family picnic will be in sunshine or needs to be rescheduled. For the farmer the weather forecast is a key to what needs to be done for the planting and harvesting of crops. The retail store owner decides whether to have a special promotion for umbrellas or sunscreen lotion. The same information is used for different questions and actions.

Data publishers of all sizes and categories have information available. These publishers range from newspapers to retail stores to photo albums to travel sites, and a lot more; get the breaking news story, buy a book, connect with family albums, or book a flight. The web provides access to these benefits in endless combinations. The sites are holders of large volumes of data waiting for you to ask a question, or search. The applications are designed for human consumption so that people can find things when they choose to look.

The Semantic Web is a modified information agent in that there are one or more underlying software applications designed to aggregate information and create a unique pipeline of data for each specific user.

One of the key attributes of the web is that we can link any number of individual pages together. You can be on one page and click to go to another page of your choice. You can send an email that has an embedded link to allow the reader one click access to a specific page.

Chris emphasizes that the Semantic Web is not about links between web pages. “The foundation of the Semantic Web is all about relationships between any two items,” says Chris. Tuesday’s weather has a relationship to a 2pm Frontier flight leaving from Denver. Mary’s booking on that flight means that her ticket and seat assignment also have a relationship to it. In the semantic web sense, there is therefore a relationship between Tuesday’s weather and Mary.

The growth of the Semantic Web will expand the properties of things to include lots of elements, such as price, age, meals, destination, and so on. The language for describing this information and associated resources on the web is the Resource Description Framework (RDF). Putting information into RDF files makes it possible for computer programs (“web spiders”) to search, discover, pick up, collect, analyze and process information from the web. The Semantic Web uses RDF to describe web resources.
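To make the idea of relationships between any two items concrete, here is a minimal Python sketch of subject, predicate, object statements, which is the shape of data that RDF describes. The item and relationship names are hypothetical labels invented for this illustration; they do not come from any real RDF vocabulary.

```python
# A minimal sketch of RDF-style statements as (subject, predicate, object).
# All names below are hypothetical labels chosen for illustration only.
from collections import defaultdict

triples = [
    ("TuesdayWeather", "affects", "Flight_2pm_Denver"),
    ("Mary", "bookedOn", "Flight_2pm_Denver"),
    ("Mary", "hasSeat", "Seat_12A"),
]


def related_through(statements, a, b):
    """Return the items that both a and b have a direct relationship to."""
    targets = defaultdict(set)
    for subject, _predicate, obj in statements:
        targets[subject].add(obj)
    return targets[a] & targets[b]


# Tuesday's weather and Mary are connected through the flight they share.
shared = related_through(triples, "TuesdayWeather", "Mary")
print(shared)
```

The point of the sketch is only that the weather and the passenger are never linked directly; the connection emerges from two separate statements about the same flight.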

Whether you seek guidance from the Guru on the mountain top or the Oracle at Delphi, information will range from numbers to statistical charts, from words to books, from images to photo albums, from medication risks to medical procedure analysis to doctor ratings.

Chris Pooley states, “For end users, the continued adoption of the Semantic Web technologies will mean that when they search for product comparisons they will find more features in the comparisons which should make the process easier, faster, and provide better results. For a business user or enterprise the benefits will be huge. By building Semantic Web enabled content, businesses will be able to leverage their former content silos; and the cost of making changes or adding new data elements (maintaining their content) will be reduced while flexibility will be improved, by using the rules-based approach for Semantic Web projects.”

With this vast increase in data volume, users should remember to be certain they trust the data that is retrieved. As part of the guidelines for proper use of the semantic web, we need to establish base levels of reliability for the sources being accessed. This requires some learning and practice to determine what maps appropriately to the level of accuracy needed. The weather forecast can be off a few degrees. Sending a space vehicle to Mars requires far greater accuracy since being off even one degree will cause the vehicle to miss its intended target.

Both end users and enterprise users will learn new ways to pay attention to the data validity. Trusting the source may require a series of steps that includes tracking the information over an extended time period. This learning process will also include a clear explanation of why that information is out there. For example, a company’s historical financial information is not the same as the company’s two year marketing forecast.

There is a chicken and egg aspect to approaching growing accessibility to more data. More data means more opportunity to collect valuable information. It also means that more care needs to be exercised to identify and separate meaningful relevant data from data noise. For example, the retailer Best Buy has started down this path by collecting 60% more bits of information from user clicks on their web site. This enriched data delivers added value to the retailer for more accurate and timely business decisions about products and selling techniques.

One of the intoxicating things about the web is that the vast majority of data, entertainment and resources are all free to anyone with an internet connection. While Chris acknowledges the current state of free resources, he also anticipates that in the future, there will likely be a need for some fee structure for the aggregator of content. With data demand growing exponentially, there will be a corresponding demand for huge increases in both storage capacity and internet bandwidth. The Semantic Web will require more big data mines and faster communications.

There is a significant difference between infrastructure and the applications that ride on that structure. Bridges are constructed to enable cars to use the span to get from one side to the other. The structure of a bridge must first carry its own weight; the weight of all the cars on it at any one moment is insignificant compared to the bridge’s total weight.

Chris Pooley’s company, Intellidimension, builds infrastructure products delivering a useful and usable bridge for enterprise users. These users then create aggregating and solution oriented applications that travel along the appropriately named information super highway. Chris says, “The evolving Semantic Web technologies will offer benefits for the information producer and the information user that will enrich and enlarge what we see and how we see it.”

About the Author

Richard Oppenheim, CPA, blends business, technology and writing competence with a passion to help individuals and businesses get unstuck from the obstacles preventing their moving ahead. He is a member of the Unlimited Priorities team. Contact him by e-mail or follow him on Twitter at twitter.com/richinsight.

OWL Exports From a Full Thesaurus

Jay Ven Eman, Ph.D.

What do you make of “198”? You could assume it is a number. A computer application can make no reliable assumption: it could be an integer or a decimal (though not octal, since 9 and 8 are not octal digits), or it could be something else entirely. Neither you nor the computer could do anything useful with it. What if we added a period, so “198” becomes “1.98”? Maybe it represents the value of something such as its price. If we found it embedded with additional information, we would know more. “It cost 1.98.” The reader now knows that it is a price, but software applications still are unable to figure it out. There is much the reader still doesn’t know. “It cost ¥1.98.” “It cost £1.98.” “It cost $1.98.” There is even more information you would want. Wholesale? Retail? Discounted? Sale price? $1.98 for what?

Basic interpretation is something humans do very well, but software applications do not. Now imagine a software application trying to find the nearest gasoline station to your present location that has gas for $1.98 or less. Per gallon? Per liter? Diesel or regular? Using your location from your car’s GPS and a wireless Internet connection such a request is theoretically possible, but beyond the most sophisticated software applications using Web resources. They cannot do the reasoning based upon the current state of information on the Web.
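A short Python sketch shows how much context has to travel with “1.98” before software can answer the gasoline question. The `FuelPrice` type, its field names, and the sample offers are assumptions made up for this illustration, not data from any real service.

```python
# A sketch of the context a bare "1.98" needs before software can compare it.
# FuelPrice and the sample offers are hypothetical, for illustration only.
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class FuelPrice:
    amount: Decimal
    currency: str  # e.g. "USD", "GBP"
    unit: str      # e.g. "gallon", "liter"
    grade: str     # e.g. "regular", "diesel"


def matches(offer, max_amount, currency, unit, grade):
    """True only when every piece of context agrees and the price is low enough."""
    return (
        offer.currency == currency
        and offer.unit == unit
        and offer.grade == grade
        and offer.amount <= max_amount
    )


offers = [
    FuelPrice(Decimal("1.98"), "USD", "gallon", "regular"),
    FuelPrice(Decimal("1.98"), "GBP", "liter", "regular"),
    FuelPrice(Decimal("1.98"), "USD", "gallon", "diesel"),
]

cheap = [o for o in offers if matches(o, Decimal("1.98"), "USD", "gallon", "regular")]
print(cheap)
```

All three offers carry the same number, yet only one of them answers the question that was actually asked.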

Finding Meaning

Trying to search the Web based upon conceptual search statements adds more complications. Looking for information about “lead” using just that term returns a mountain of unwanted information about leadership, your water, and conditions at the Arctic Ocean. Refining the query to indicate you are interested in “lead based soldering compounds” helps. Software applications still cannot reason or draw inferences from keywords found in context. At present, only humans are adept at interpreting within context.

Semantic Web

The “Semantic Web” is a series of initiatives to help make more of the vast resources found via the Web, available to software applications and agents, so that these programs can perform at least rudimentary analysis and processing to help you find that cheaper gasoline. The Web Ontology Language (OWL) is one such initiative and will be described herein in relation to thesauri and taxonomies.

At the heart of the Semantic Web are words and phrases that represent concepts that can be used for describing Web resources. Basic organizing principles for “concepts” exist in the present thesaurus standards (ANSI/NISO Z39.19 found at www.niso.org and ISO 2788 and ISO 5964 found at www.iso.org). They are being expanded and revised. Drafts of the revisions are available for review.

The reader is directed to the standards’ Web sites referenced above and to www.accessinn.com, www.dataharmony.com, and www.willpowerinfo.co.uk/thesprin.htm for basic information on thesaurus and taxonomy concepts. It is assumed here that the reader will have a basic understanding of what a thesaurus is, what a taxonomy is, and related concepts. Also, a basic understanding of the Web Ontology Language (OWL) is required. OWL is a W3C recommendation and is maintained at the W3C Web site. For an initial investigation of OWL, the best place to start is the Guide found at W3C.

OWL

From the OWL Guide, “OWL is intended to provide a language that can be used to describe the classes and relations between them that are inherent in Web documents and applications.” OWL formalizes a domain by defining classes and properties of those classes; defining individuals and asserting properties about them; and reasoning about these classes and individuals.

Ontology is borrowed from philosophy. In philosophy, Ontology is the science of describing the kinds of entities in the world and how they relate.
An OWL ontology may include classes, properties, and instances. Unlike ontology from philosophy, an OWL ontology includes instances, or members, of classes. Classes and members, or instances, can have properties and those properties have values. A class can also be a member of another class. OWL ontologies are meant to be distributed across the Web and to be related as needed. The normative OWL exchange syntax is RDF/XML (www.w3.org/RDF/).

Thesaurus

A thesaurus is not an ontology. It does not describe kinds of entities and how they are related in a way that a software agent could use. One could draw useful inferences about the domain of medicine by studying a medical thesaurus, but software cannot. You would discover important terms in the field, how terms are related, what terms have broader concepts and what terms encompass narrower concepts. An inference, or reasoning engine, would be unable to draw any inferences beyond a basic “broader term/narrower term” pairing like “nervous system/central nervous system,” unless specifically articulated. Is it a whole/part, instance, parent/child, or other kind of relationship?

Using OWL, more information about the classes represented by thesauri terms, the relationship between classes, subclasses, and members can be described. In a typical thesaurus, the terms “nervous system” and “central nervous system” would have the labels BT and NT, respectively. A software agent would not be able to make use of these labels and the relationship they describe unless the agent is custom coded. The purpose of OWL is to provide descriptive information using RDF/XML syntax that would allow OWL parsers and inference engines, particularly those not within the control of the owners of the target thesaurus, to use the incredible intellectual value contained in a well developed thesaurus.

The levels of abstraction should be apparent at this point. At one level there are terms. At another level the relationships between groups of terms are described within a thesaurus structure. The thesauri standards do not dictate how to label thesaurus relationships. A term could be USE Agriculture or Preferred Term Agriculture or PT Agriculture. Hard coding of software agents with all of the possible variations of thesaurus labels is impractical.
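The hard-coded approach described above might look like the following Python sketch. The lookup table and the `parse_line` helper are hypothetical; the table covers only the label variants mentioned in this article, which is exactly the weakness, since every thesaurus with different labels would need new entries.

```python
# A sketch of hard-coding thesaurus label variants into an agent.
# LABEL_VARIANTS and parse_line are hypothetical; the table is deliberately
# incomplete, which is why this approach does not scale.
LABEL_VARIANTS = {
    "USE": "preferred_term",
    "PT": "preferred_term",
    "PREFERRED TERM": "preferred_term",
    "BT": "broader_term",
    "NT": "narrower_term",
    "RT": "related_term",
}


def parse_line(line):
    """Split a line such as 'PT Agriculture' into (relationship, term)."""
    upper = line.upper()
    # Try longer labels first so 'PREFERRED TERM' is matched as a whole.
    for label in sorted(LABEL_VARIANTS, key=len, reverse=True):
        if upper.startswith(label + " "):
            return LABEL_VARIANTS[label], line[len(label) + 1:]
    raise ValueError(f"unknown thesaurus label in: {line!r}")


for sample in ("USE Agriculture", "Preferred Term Agriculture", "PT Agriculture"):
    print(parse_line(sample))
```

Any label the table does not anticipate raises an error, so an agent built this way breaks on the first thesaurus that uses a different convention.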

OWL then is used to describe labels such as BT, NT, NPT, and RT [1], etc., and to describe additional properties about classes and members such as the type of BT/NT relationship between two terms. Additional power can be derived when two or more thesauri OWL ontologies are mapped. This would allow Web software agents to determine the meaning of subject terms (key words) found in the meta-data element of Web pages, to determine if other Web pages containing the same terms have the same meaning, and to make additional inferences about those Web resources.

An OWL output from a full thesaurus provides semantic meaning to the basic classes and properties of a thesaurus. Such an output becomes a true Web resource and can be used more effectively by automated processes. Another layer of OWL wrapped around subject terms from an OWL level thesaurus and the resources (such as Web pages) these subject terms are describing would be an order of magnitude more powerful, but also more complicated and difficult to implement.

OWL Thesaurus Output

An OWL thesaurus output contains two major parts. The first part articulates the basic definition of the structure of the thesaurus. It is an XML/RDF schema. As such, a software agent can use the resolving properties in the schema to locate resources that provide the necessary logic needed to use the thesaurus.

FIGURE 1 – XML/RDF/OWL DECLARATIONS

<!DOCTYPE rdf:RDF [
<!ENTITY rdf "http://www.w3.org/1999/02/22-rdf-syntax-ns#" >
<!ENTITY owl "http://www.w3.org/2002/07/owl#" >
<!ENTITY xsd "http://www.w3.org/2001/XMLSchema#" > ]>
<rdf:RDF
xmlns    ="http://localhost/owlfiles/DHProject#"
xmlns:DHProject ="http://localhost/owlfiles/DHProject#"
xmlns:base ="http://localhost/owlfiles/DHProject#"
xmlns:owl ="http://www.w3.org/2002/07/owl#"
xmlns:rdf ="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns:xsd ="http://www.w3.org/2001/XMLSchema#">
<owl:Ontology rdf:about="">
<rdfs:comment>OWL export from MAIstro</rdfs:comment>
<rdfs:label>DHProject Ontology</rdfs:label>
</owl:Ontology>

Without agonizing over the details, Figure 1 provides the necessary declarations in the form of URLs so that software agents can locate additional resources related to this thesaurus. The software agent would not have to have any of the W3C recommendations (XML, RDF, OWL) hard coded into its internal logic. It would have to have ‘resolving’ logic such as, “if you encountered a URL, then do the following…”
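As a rough illustration of the first step of such ‘resolving’ logic, the Python sketch below collects the namespace URLs declared in Figure 1. The `declared_namespaces` helper and its regular expression are assumptions made for this example; what an agent then does with each URL is outside the sketch.

```python
# A sketch of collecting the namespace URLs declared in Figure 1.
# declared_namespaces is a hypothetical helper, not part of any OWL toolkit.
import re

DECLARATIONS = '''<rdf:RDF
xmlns    ="http://localhost/owlfiles/DHProject#"
xmlns:DHProject ="http://localhost/owlfiles/DHProject#"
xmlns:owl ="http://www.w3.org/2002/07/owl#"
xmlns:rdf ="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns:xsd ="http://www.w3.org/2001/XMLSchema#">'''

# Matches xmlns="..." and xmlns:prefix="...", tolerating spaces around "=".
NAMESPACE_PATTERN = re.compile(r'xmlns(?::(\w+))?\s*=\s*"([^"]+)"')


def declared_namespaces(text):
    """Map each declared prefix ('' for the default namespace) to its URL."""
    return {prefix: url for prefix, url in NAMESPACE_PATTERN.findall(text)}


namespaces = declared_namespaces(DECLARATIONS)
for prefix, url in namespaces.items():
    print(prefix or "(default)", url)
```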

FIGURE 2 – SAMPLE TERM RECORD OUTPUT IN XML

<TermInfo>
<T>Agrotechnology</T>
<BT>Biotechnology</BT>
<NT>Animal management technologies</NT>
<NT>Controlled environment agriculture</NT>
<NT>Genetically modified crops</NT>
<RT>Agricultural science</RT>
<RT>Food technology</RT>
<UF>Plant engineering</UF>
<Scope_Note></Scope_Note>
<Editorial_Note></Editorial_Note>
<Facet></Facet>
<History></History>
</TermInfo>

Figure 2 shows a sample thesaurus term record output in XML for the term, “Agrotechnology”. This term has BT, NT, RT, Status, UF, Scope_Note, Editorial_Note, Facet, and History as a complex combination of classes, members, and properties. Anyone familiar with thesauri can determine what the abbreviations mean such as BT, NT, and RT and, thus, they can infer the relationships between all of the terms in the term record. An OWL thesaurus output provides additional intelligence that helps software make the same inferences.
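For comparison, here is a minimal Python sketch that reads the Figure 2 record with the standard library's `xml.etree.ElementTree`. The `read_term_record` helper is hypothetical. Note what the program gets back: tag names and strings, with nothing that says what BT or NT mean.

```python
# A sketch of reading the Figure 2 term record with the standard library.
# read_term_record is a hypothetical helper written for this illustration.
import xml.etree.ElementTree as ET
from collections import defaultdict

TERM_RECORD = """<TermInfo>
<T>Agrotechnology</T>
<BT>Biotechnology</BT>
<NT>Animal management technologies</NT>
<NT>Controlled environment agriculture</NT>
<NT>Genetically modified crops</NT>
<RT>Agricultural science</RT>
<RT>Food technology</RT>
<UF>Plant engineering</UF>
<Scope_Note></Scope_Note>
<Editorial_Note></Editorial_Note>
<Facet></Facet>
<History></History>
</TermInfo>"""


def read_term_record(xml_text):
    """Group the non-empty child elements of a term record by tag name."""
    root = ET.fromstring(xml_text)
    record = defaultdict(list)
    for child in root:
        text = (child.text or "").strip()
        if text:  # skip empty elements such as <Scope_Note></Scope_Note>
            record[child.tag].append(text)
    return dict(record)


record = read_term_record(TERM_RECORD)
print(record)
```

The parser recovers the structure perfectly well, but the meaning of each tag still lives only in the head of the person who knows thesaurus conventions.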

After the declarations portion, shown in Figure 1, the remaining portion of the first part of an OWL thesaurus output is the schema describing the classes, subclasses, and members that comprise a thesaurus and all of the properties of each. Each of the XML elements (e.g. <RT>) in Figure 2 is defined in the schema as well as their properties and relationships. These definitions conform to the OWL W3C recommendation.

The first part of an OWL thesaurus output contains declarations and classes, subclasses, and their properties. It contains all of the logic needed by a specialized agent to make sense of your thesaurus and other OWL thesaurus resources on the Web.

FIGURE 3 – OWL OUTPUT OF TERM RECORD “AGROTECHNOLOGY”

<PreferredTerm rdf:ID="T131">
<rdfs:label xml:lang="en">Agrotechnology</rdfs:label>
<BroaderTerm rdf:resource="#T603"
  newsindexer:alpha="Biotechnology"/>
<NarrowerTerm rdf:resource="#T252"
  newsindexer:alpha="Animal management technologies"/>
<NarrowerTerm rdf:resource="#T1221"
  newsindexer:alpha="Controlled environment agriculture"/>
<NarrowerTerm rdf:resource="#T2166"
  newsindexer:alpha="Genetically modified crops"/>
<Related_Term rdf:resource="#T127"
  newsindexer:alpha="Agricultural science"/>
<Related_Term rdf:resource="#T2020"
  newsindexer:alpha="Food technology"/>
<Non-Preferred_Term rdf:resource="#T3898"
  newsindexer:alpha="Plant engineering"/>
</PreferredTerm>

The second part of an OWL thesaurus contains the terms of your thesaurus marked up according to the OWL recommendation. Figure 3 shows an OWL output for our sample term, “Agrotechnology”. (Note, since there are no values found in Figure 2 for Scope_Note, Editorial_Note, Facet, and History, these elements are not present in Figure 3.)

Now our infamous software agent could infer that “Agrotechnology” is a ‘NarrowerTerm’ of “Biotechnology”. “Agrotechnology” has three ‘NarrowerTerms’, two ‘RelatedTerms’, and one ‘NonPreferredTerm’. From the OWL output, the software agent can resolve the meaning and use of ‘BroaderTerm’, ‘NarrowerTerm’, ‘RelatedTerm’, and ‘NonPreferredTerm’ by navigating to the various URLs. The agent can determine from the schema that, if a term has the property value ‘NarrowerTerm’, then it must also have a ‘BroaderTerm’ property value. A term can’t be a narrower term if it doesn’t have a broader term. A term that is a ‘BroaderTerm’ must also be a ‘PreferredTerm’, and so on.
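One way an agent could act on schema rules of this kind is to derive the inverse of each stated relationship. The Python sketch below assumes the schema pairs ‘BroaderTerm’ with ‘NarrowerTerm’ and treats ‘Related_Term’ as symmetric; OWL can express such a pairing with `owl:inverseOf`, but whether this particular export declares it that way is not shown here, so the `INVERSE` table is an assumption.

```python
# A sketch of inferring inverse relationships from stated ones.
# ASSUMPTION: the schema pairs BroaderTerm with NarrowerTerm and treats
# Related_Term as symmetric; the INVERSE table is not taken from the export.
INVERSE = {
    "BroaderTerm": "NarrowerTerm",
    "NarrowerTerm": "BroaderTerm",
    "Related_Term": "Related_Term",
}

stated = {
    ("Agrotechnology", "BroaderTerm", "Biotechnology"),
    ("Agrotechnology", "NarrowerTerm", "Genetically modified crops"),
    ("Agrotechnology", "Related_Term", "Food technology"),
}


def add_inverses(statements):
    """Return the statements plus the inverse of each one."""
    inferred = set(statements)
    for subject, prop, obj in statements:
        inverse = INVERSE.get(prop)
        if inverse is not None:
            inferred.add((obj, inverse, subject))
    return inferred


for statement in sorted(add_inverses(stated)):
    print(statement)
```

From three stated facts about “Agrotechnology” the agent gains three more, including the fact that “Biotechnology” has “Agrotechnology” as a narrower term.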

FIGURE 4 – OWL OUTPUT OF TERM RECORD “PLANT ENGINEERING”

<NonPreferredTerm rdf:ID="T3898">
<rdfs:label xml:lang="en">Plant engineering</rdfs:label>
<USE rdf:resource="#T131" newsindexer:alpha="Agrotechnology"/>
</NonPreferredTerm>

Our thesaurus software agent can infer from Figure 3 that the thesaurus it is evaluating uses “Agrotechnology” for “Plant engineering”. Figure 4 identifies “Plant engineering” as a ‘NonPreferredTerm’ and identifies “Agrotechnology” as the ‘PreferredTerm’. (The logic in the schema dictates that if you have a “NonPreferredTerm”, then it must have a “PreferredTerm”.)

Suppose our software agent encounters “Plant engineering” at another Web site and uses it to locate resources there. Now the agent locates your Web site. The agent would first use “Plant engineering”. From your OWL thesaurus output it would infer that at your site it should use your preferred term, “Agrotechnology”, to locate similar resources.
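The lookup just described can be sketched in Python using trimmed versions of Figures 3 and 4. Two things are assumptions: the fragments are wrapped in an `rdf:RDF` root so they parse as a standalone document, and the `newsindexer` namespace URL is a placeholder, since the figures never declare it. The reference to the preferred term is written as `#T131`, matching the style of Figure 3.

```python
# A sketch of mapping a non-preferred term to its preferred term.
# ASSUMPTIONS: the rdf:RDF wrapper and the NEWSINDEXER URL are placeholders
# added so the fragments from Figures 3 and 4 parse on their own.
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
BASE = "http://localhost/owlfiles/DHProject#"    # default namespace, Figure 1
NEWSINDEXER = "http://example.org/newsindexer#"  # hypothetical placeholder

DOC = f"""<rdf:RDF xmlns="{BASE}" xmlns:rdf="{RDF}" xmlns:rdfs="{RDFS}"
         xmlns:newsindexer="{NEWSINDEXER}">
<PreferredTerm rdf:ID="T131">
<rdfs:label xml:lang="en">Agrotechnology</rdfs:label>
</PreferredTerm>
<NonPreferredTerm rdf:ID="T3898">
<rdfs:label xml:lang="en">Plant engineering</rdfs:label>
<USE rdf:resource="#T131" newsindexer:alpha="Agrotechnology"/>
</NonPreferredTerm>
</rdf:RDF>"""


def build_use_map(xml_text):
    """Map each non-preferred label to the label of the term it points to."""
    root = ET.fromstring(xml_text)
    labels = {}
    for node in root:
        labels[node.get(f"{{{RDF}}}ID")] = node.find(f"{{{RDFS}}}label").text
    use_map = {}
    for node in root.iter(f"{{{BASE}}}NonPreferredTerm"):
        use = node.find(f"{{{BASE}}}USE")
        target_id = use.get(f"{{{RDF}}}resource").lstrip("#")
        use_map[labels[node.get(f"{{{RDF}}}ID")]] = labels[target_id]
    return use_map


def to_preferred(term, use_map):
    """Return the preferred term, or the term itself if no mapping exists."""
    return use_map.get(term, term)


use_map = build_use_map(DOC)
print(to_preferred("Plant engineering", use_map))
```

The sketch follows the identifier in `rdf:resource` rather than the `newsindexer:alpha` attribute, so the mapping stays correct even if a label is later reworded.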

All the terms and term relationships in your thesaurus or taxonomy would be defined in part two of the OWL thesaurus output. It is now a Web resource that can be used by software agents. Designed to be distributed and referenced, a given base OWL thesaurus can grow as other thesaurus ontologies reference it.

More Meaning Needed

Even a thesaurus wrapped in OWL falls short of the full potential of the Semantic Web. This ‘first order’ output allows other thesaurus applications to make inferences about classes, subclasses, and members of a thesaurus. By “reading” the OWL wrappings, any thesaurus OWL software agent can make useful inferences. By using classes, subclasses, and members and their properties, Web software agents would be able to reproduce the hierarchical structure of a thesaurus outside of the application used to construct it.

However, a lot is still missing. For example, knowing a term’s parent, children, other terms it is related to, and terms it is used for, does not tell you what the term means and what it might be trying to describe. Additional classes, subclasses, and members all with properties are needed. How a term is supposed to be used and why this ‘term’ is preferred over that ‘term’ would be enormously useful properties for improving the performance of software agents.

A more difficult layer of semantic meaning is the relationship between a thesaurus term and the entity, or object, it describes. An assignable thesaurus term is a member of class “PreferredTerm”. When it is assigned to an object, for example a research report or Web page, that term becomes a property of that object. For a Web page, descriptive terms become attributes of the ‘Meta’ element:

<META NAME="KEYWORDS" CONTENT="content management software,
xml thesaurus, concept extraction, information retrieval,
knowledge extraction, machine aided indexing,
taxonomy management system, text management, xml">

None of the intelligence found in an OWL thesaurus output is found in the Meta element. Having that intelligence improves the likelihood that our software agent can make useful inferences about this Web resource.

This intelligence is not currently available because HTML does not allow for OWL markup of keywords in the Meta element. There are major challenges to doing this. To illustrate, the single keyword, “machine aided indexing”, is rendered in Figure 5 as an OWL thesaurus output. This is very heavy overhead.

FIGURE 5 – OWL OUTPUT OF TERM RECORD MACHINE AIDED INDEXING

<PreferredTerm rdf:ID="T131">
<rdfs:label xml:lang="en">Machine aided indexing</rdfs:label>
<BroaderTerm rdf:resource="#T603"
 newsindexer:alpha="Information technology"/>
<NarrowerTerm rdf:resource="#T1221"
 newsindexer:alpha="Concept extraction"/>
<NarrowerTerm rdf:resource="#T2166"
 newsindexer:alpha="Rule base techniques"/>
<Related_Term rdf:resource="#T127"
 newsindexer:alpha="Categorization systems"/>
<Related_Term rdf:resource="#T2020"
 newsindexer:alpha="Classification systems"/>
<Non-Preferred_Term rdf:resource="#T3898"
 newsindexer:alpha="MAI"/>
</PreferredTerm>

The entire rendering depicted in Figure 5 would not be necessary for each keyword assigned to the Meta element of a Web page. A shorthand version could be designed that would direct software agents to the OWL thesaurus output, but such a shorthand method is not available.

Even if HTML incorporates a shorthand OWL markup for Meta keywords, the intelligence required to apply the right keywords automatically, for example, making the determination, “Web page x is about Machine aided indexing”, is not in the current OWL output. Automatic or semiautomatic indexing is the only way to handle volume and variety, especially dealing with Web pages.

Commercial applications such as Data Harmony’s M.A.I.™ Concept Extractor© and similar products provide machine automated indexing solutions. Theoretically, the knowledge representation systems that drive machine automated indexing and classification systems could incorporate OWL markup. When a machine indexing system assigned a preferred term to a Web page, it would write it into the Meta element along with its OWL markup.

However, to truly achieve the objectives of the Semantic Web the OWL W3C recommendation should be extended to include the decision algorithms used in the machine automated indexing process. Or an alternative W3C recommendation regarding the Semantic Web should be used in conjunction with OWL. If this could be accomplished, then software agents could determine the logic used in assigning terms. Next, the agent could compare the logic used at other Web sites and would then be able to make comparisons and to draw conclusions about various Web resources; conclusions like, of the eighteen Web sites your software agent reviewed that discussed selling gasoline, only eight were actual gas stations and only four of the eight had data the agent could determine was the retail price for unleaded.

We have moved closer to locating the least expensive gasoline within a five-mile radius of our current location. What has been described herein is actually being done, but so far only in closed environments where all of the variables are controlled. For example, there are Web sites that specialize in price comparison shopping.

Beyond these special cases, for the open Web the challenges are great. The sheer size of the Web and its speed of growth are obvious. More challenging is capturing meaning in knowledge representation systems like OWL (and other Semantic Web initiatives at W3C like SKOS, Topic Maps, etc.). How many OWL thesauri will there be? How many are needed? How much horsepower will be needed for an agent to resolve meaning when OWL thesauri are cross-referencing each other in potentially endless loops?
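The endless-loop concern at least has a standard defensive answer: an agent can remember which thesauri it has already visited. The Python sketch below uses hypothetical thesaurus names and a made-up reference table; it shows only that a circular chain of cross-references can be walked in a finite number of steps, not how expensive resolving meaning would be at Web scale.

```python
# A sketch of following cross-references between thesauri without looping.
# The thesaurus names and the references table are hypothetical.
references = {
    "thesaurus_a": ["thesaurus_b"],
    "thesaurus_b": ["thesaurus_c"],
    "thesaurus_c": ["thesaurus_a"],  # closes the loop back to the start
}


def reachable(start, refs):
    """Return every thesaurus reachable from start, visiting each only once."""
    seen = set()
    stack = [start]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already resolved; do not follow the loop again
        seen.add(current)
        stack.extend(refs.get(current, ()))
    return seen


print(sorted(reachable("thesaurus_a", references)))
```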

For these and other reasons, the Semantic Web may not live up to its full promise. The complexity and the magnitude of the effort may prove to be insurmountable. That said, there will be a Semantic Web and OWL will play an important role, but it will probably be a more simplified semantic architecture and more isolated, for example, to vertical markets or specific fields and disciplines.

For the reader, before you launch your own initiatives, assess your internal resources and measure the level of internal commitment, particularly at the upper levels of your organization. Know what is happening in your industry or field. If Semantic initiatives are happening in your industry, then the effort needed to deploy a taxonomic strategy (OWL being one piece of the solution) should be seriously considered. If you don’t make the effort, your Web resources and your vast internal, private resources risk being lost in the ‘sea of meaninglessness’, putting you at a tremendous competitive disadvantage.

1. BT – broader term, NT – narrower term, NPT – non-preferred term, RT – related term
