Meet the Author - Thomas B. Passin

Meet the Author
Thomas B. Passin
by Dirk Schreckmann

JavaRanch: Thomas, you recently published a book titled "Explorer's Guide to the Semantic Web". It's likely no surprise that my first question is, "What is the Semantic Web?"

Thomas: It's hard to give a concise answer because there are so many different ideas about what the Semantic Web is all about. The first chapter in my book talks about these views in some detail. It turns out that they do have some things in common. Here's how I put it on JavaRanch a few days ago

"The idea is that there are a lot of things that we do on the web today that could be automated if only ... if only computers could understand the data better (HTML has to be read by people for much of its meaning to be available)... if only a software agent could act for you... if only searching could be smarter... and so on.

"The Semantic web is really intended to increase the amount of help that computers can give us, to relieve us of some of the more time-consuming and mundane tasks, to handle details that require information (on the Web) that is currently implicit but not really available to software as things are now."

The key element here is to have computers be able to make better use of data on the Web that is out there but not very accessible to computers today. Then we ought to be able to get the computers to do good things for us with much less need for us to do all the steps ourselves.

The shorthand way to express this thought is to say that "Computers will understand the meaning of data on the Web." But remember that the sense of the words "understand" and "meaning" may be pretty loose.

JavaRanch: Could you expand on what you mean by, "...the sense of the words "understand" and "meaning" may be pretty loose?"

Thomas: I'd rather not! Seriously, these words are very hard to define, and usually lead to arguments, heavy philosophical discussion, or both. What I meant by "loose" is that a system can respond to some situation in an appropriate way without "understanding" it in any normal human sense of the word you might pick.

For example, a statistical spam filter does not have any notion of the sentence structure in an email, nor of the "meanings" of the words, yet it may do a good job of filtering out the spam.

On the other hand, another kind of filter might in fact have some knowledge of the appropriateness of words and phrases given the context of the rest of the email. We might say that this second system has a higher level of "understanding" of the email and of spam than the first. But strictly speaking (at the level of theories of meaning, say), this kind of understanding would still be pretty far from what we imagine that humans do.

So I say "loosely". I think we should leave it at that, especially since we really don't know much about what human-style "understanding" or "meaning" really is or how it works in our heads.

JavaRanch: How intelligent or sophisticated will these agents need to be?

Thomas: For some of the advanced scenarios, very sophisticated. I think that mid-range progress will be more likely to come from trying to learn how to make as much mileage as possible from less sophisticated or intelligent agents.

JavaRanch: What's a classic example scenario of using this Semantic Web in the future?

Thomas: Classic scenarios often involve some kind of an "agent" that makes arrangements for you. For example, you are traveling abroad and your plans change. The agent finds flights to the new destination, makes sure the connections will work, books the tickets, makes new hotel reservations for you, and offers up a list of restaurant suggestions for tomorrow's dinner. It knows enough to look for vegetarian restaurants because it knows your food preferences, and it can check the menus in several languages by means of multi-lingual ontologies.

JavaRanch: Do you have an idea for a use of the Semantic Web that perhaps strays a bit from the classic examples? What will we be doing with it that we perhaps don't currently expect to be doing with it?

Thomas: Here's a Semantic Web style capability that I would really like to have, but I haven't seen written about. It involves a system for annotating on-line material. Imagine that you are reading some work in your (semantically advanced) browser. You highlight a passage and write an annotation that is, perhaps, several sentences long.

Behind the scenes, the system analyzes both the surrounding passage and your annotation to arrive at an understanding of the context for and content of your note. Then, over a period of minutes, hours, or maybe even days, it goes out on the web and performs semantically-enhanced searches for material relevant to your note. It communicates with other annotation agents, monitors relevant URLs for changes, and so forth.

Eventually, when the system has integrated enough data, it presents you with a selection of information relating to your note and possibly to the surrounding text as well. This might well be information you had known nothing about. A really advanced agent might even offer suggestions about the validity of your annotation! Ultimately, your note along with a selection of the enhanced material, would become available for others to see as well.

JavaRanch: Who is behind the Semantic Web? Is this the brain child of an individual, or small group of individuals? What larger organizations are developing it today? Are any particular interests (such as scientific groups, some part of the government, library organizations, big business) behind it?

Thomas: Well, Tim Berners-Lee first articulated the term and vision of a "Semantic Web". He seems to have seen it as a natural extension of the current Web. The W3C (World Wide Web Consortium), which of course he heads, eventually established a Semantic Web activity. There has been a fair amount of academic work in the area, and DARPA, the U.S. Defense Advanced Research Projects Agency, has funded related work in ontologies (e.g., DAML, which later morphed into the W3C's OWL) and in autonomous agents.

Generally speaking, "big business" has not shown a lot of interest, but there are some exceptions. HP has funded some important work in RDF tools, for example. In terms of governments, there is no government-wide push, but individual agencies are searching for help with problems that Semantic Web technologies could be helpful with. So, as usual, there is government funded or supported work but you can't really say that various governments are "behind it".

The obvious areas for near-term payoff are in specifying ontologies and adapting several ontologies to work together, and in integrating disparate data sources. Governments and large businesses are starting to see how important these areas are becoming.

One thing to bear in mind is that many Semantic Web technologies will be useful in themselves even if the Semantic Web itself never materializes in the form people are currently thinking of.

JavaRanch: Was the WWW initially designed to evolve into this Semantic Web, or has it been more of an emerging concept, come to life as people have seen what the WWW can do and have dreamed about what the WWW could be doing in the future?

Thomas: I would say mainly the latter, although Berners-Lee has written that he always had Semantic Web capability in mind, right from the start. But I am sure that his ideas evolved over time just like everyone else's.

JavaRanch: How far are we from experiencing the Semantic Web more in our day to day lives? How much older will I be before I should expect my cell phone to make arrangements for my birthday party?

Thomas: It's starting to sneak in, mostly in invisible ways. RDF, for example, is starting to get used a little more at a time, but behind the scenes where you cannot see it. There is a new (still in beta) application for organizing and sharing your photographs, called Tidepool, which eventually will let you tie the photos into stories you write about them. Sun Microsystems has had a substantial project that uses RDF to help capture and organize a lot of their distributed company knowledge. I'm told that in the course of doing this, they were able to discover "islands" of knowledge that were related but not previously strongly connected.

But having capabilities for making smart arrangements, arrangements that require common sense, that will take a lot of doing, for sure. So that's not right around the corner, not by a long shot. In addition to the technical challenges, there is the whole issue of trust and security on the Web. We know that is hard even when people are closely involved, let alone when "smart" agents will be out there interacting on their own.

JavaRanch: Could you expand on the issues of trust and security? Does current Semantic Web thought suggest that any particular security mechanisms will dominate in the agent arena?

Thomas: Most of what I read has to do with either:

Using logical reasoners on digitally signed documents (or fragments) and their provenance to prove their bona-fides. This process may (and probably would) extend to the certifiers of the signatures and keys.
Using social aspects to attempt to establish levels of trust and confidence in the sender or creator of information, in an effort to assess whether the information is reliable.

JavaRanch: What could we be doing at JavaRanch, with our web site, to be moving towards this Semantic Web? Could we be annotating content to better provide machines with a mechanism to understand things? Who might benefit from this, today? How so?

Thomas: It's a bit hard to say without taking into account what you would like to do, what directions you want to go in. Generally speaking, there are things you can do that would set you up to use Semantic Web technologies are they become more capable, things that would cost very little and would have some value on their own.

For example, you could promote more effective searching of material - say, of questions and answers in the forums. You see more and more of this in the better blogs - say, Jon Udel's blog and Ongoing, just to name a few. Generally, more use is being made of keywords. The del.icio.us site (http://del.icio.us) lets users add their own keywords and apply them to bookmarks. This provides for the group to share bookmarks on common subjects. But almost no one talks about the difficulty of using and maintaining keywords as their numbers get large. This can be hard.

I have tackled this problem for my own collection of bookmarks, using a topic map application I wrote (available on the Sourceforge project "TM4JScript"), so I know that one can do a lot better without a huge effort.

I'd have to say that we don't know the best ways to annotate sites to improve search yet. Jon Udel has been exploring this subject in his columns on InfoWorld, and they are worth reading.

Generally speaking, RDF and Topic Maps are good for piecing together bits of information from different sources (among other things). You can also arrange an overlay - RDF or Topic Maps - onto relational data, meaning you can make it look like it is, say, RDF. This approach provides a way to integrate various data sources, although I don't know how JavaRanch is organized so I can't say if this would fit in.

I have started to advocate a few strategies that I think will help position people to take advantage of some Semantic Web capabilities in the near future. First, the next time you decide to create a new XML format for data, bear in mind that it is often easy to make slight modifications and turn it into legal RDF. The XML syntax for RDF is frequently disparaged (think "rant"), but you don't have to use all the features by any means. If you take this approach, you get XML data that looks "normal", that can be handled by regular XML processing methods, and that can also be processed by RDF processors. For practically no cost, except for a bit to time getting up to speed on the technique, you can have the benefits of both XML and RDF. In addition, the thought needed to make sure you have proper RDF tends to help you understand your requirements better, anyway.

In a similar way, the next time you decide you need to create a new vocabulary, you can do it in OWL. OWL can be written in RDF, and just as above, you can make your OWL files easy to process with regular XML techniques. Again the cost is low, although I do have to say that your mind will probably need to get a little twisted around to get used to the OWL way of doing things. But that was true when HTML came in, too, and we all got used to that. With OWL, you can express your vocabulary with quite a bit of what is called "expressiveness", you can convert it into other forms, such as HTML documentation without much effort, and you are using a standard.

Now I won't claim that RDF and OWL processing - today - will necessarily add that much, but in the near future, as the RDF databases and query languages and logical reasoners evolve (which is happening rapidly), the value will start to emerge. In the meantime, if you are going to create file formats and vocabularies anyway, no harm done if you do it this way.

So potentially, sites can be better organized and especially navigated (navigation links can often be provided right out of a topic map if there is one for the data a site is built from), keywords and categories can be better managed, searching can be improved (even though we don't know the best ways to accomplish this yet), and it can be easier to aggregate information. For example, if we know that Python is a computer language, then a search for computer languages would turn up hits about Python even if we did not explicitly ask for it. Yes, that can be done today without RDF, etc., if you want to do it enough to take the trouble. But technologies like RDF, Topic Maps, and OWL give you a leg up because they are standardized, and so you only have come up with the data model and not all the rest of the machinery as well.

JavaRanch:What needs to happen in order for my smart PDA to arrange my travel plans? Will the future Semantic Web agents require any data they use to be described in RDF and OWL?

Thomas: That's unclear at this point. Certainly many very visible people think (or at least talk as if they think) that mostly it will be RDF and OWL. However, I see more and more people writing about the importance of natural language techniques and other text processing to handle the vast amount of data that will never be marked up or be in RDF/OWL.

Of course, the output of such processing may be RDF.

JavaRanch: Thanks for the interview, Thomas!

Thomas Passin is Principal Systems Engineer with Mitretek Systems, a non-profit systems and information engineering company. He has been involved in data modeling and created several complex database-backed web sites and also became engaged in a range of conceptual modeling approaches and graphical modeling technologies. He was a key member of a team that developed several demonstration XML-based web service applications, and worked on creating XML versions of draft standards originally written in ASN.1.

He graduated with a B. S. in physics from the Massachusetts Institute of Technology, then studied graduate-level physics at the University of Chicago. He became involved with XML-related work in 1998, with Topic Maps in 1999 and developed the open-source TM4JScript Javascript topic map engine.

Thomas Passin is the author of the book, "Explorer's Guide to the Semantic Web", Manning Publications, March 2004.