The future of the semantic web
dr. Albert Benschop
University of Amsterdam
translation: Connie Menting
A web of meaningful data
"The Web was designed as an information space, with the goal that it should
be useful not only for human-human communication, but also that machines would
be able to participate and help. One of the major obstacles to this has been
the fact that most information on the Web is designed for human consumption,
and even if it was derived from a database with well defined meanings (in at
least some terms) for its columns, that the structure of the data is not
evident to a robot browsing the web. Leaving aside the artificial intelligence
problem of training machines to behave like people, the Semantic Web approach
instead develops languages for expressing information in a machine processable
form" [Tim Berners-Lee, Semantic Web Road Map, sept. 1998]
The World Wide Web was designed for human consumption. Although computers can read everything on the WWW, they cannot understand the information itself. Given the size and growth rate of the WWW, the development of automated software and intelligent search and retrieval tools has high priority for many scientific and commercial institutions. Several new technologies are being developed to decode information from a range of data resources and to transform it into meaningful results. In particular, this involves exchange languages, which are used to provide web documents and messages with extra information (special markings with metadata) that has an unambiguous meaning for both humans and computers.
A broad vision
"The Semantic Web is a vision: the idea of having data on the web defined and linked in a way that it can be used by machines not just for display purposes, but for automation, integration and reuse of data across various applications. It promises to radically improve our ability to find, sort, and classify information, tasks that consume a majority of the time spent on and off-line" [W3C].
The first generation of the web was characterized by HTML pages and human processing of information. We are now on the threshold of a second web generation, the 'semantic web', aimed at information that can be processed by computers. The semantic web is not a completely new or separate web, but an extension of the present web. It is a web in which information gets a clearly defined meaning, enabling computers and human beings to cooperate. The semantic web is a network of meaningful data, structured in such a way that a computer programme has sufficient information about the data to be able to process them.
The fundamental condition for the semantic web can be summarized in the slogan: 'make content understandable for computers'. Not until the formal meaning of content is connected to a formal description of itself (i.e. with metadata) does content become understandable for computers. The major part of the content of the WWW, as we know it today, has been designed for humans to read, not for computer programmes to manipulate meaningfully [Berners-Lee/Hendler/Lassila 2001]. The semantic web should structure the meaningful content of web pages. This creates an environment where software agents can roam from page to page in order to carry out sophisticated tasks for users.
HTML is the most successful electronic publishing language ever invented. Yet, this language is rather superficial. HTML only describes how a web browser should arrange the text, pictures and buttons on a page. The present mark-up language HTML is capable of connecting documents, but incapable of connecting conceptual data. Because the web deals with large quantities of semi-structured information this causes great problems [Harmelen/Fensel 1999].
- The search for information
Present search engines that search on keywords return irrelevant information that uses a certain word in another meaning, or they miss information in which different words are used for the desired content. Anyone who has ever used a search engine such as Google, AltaVista or HotBot knows that entering a few keywords and receiving thousands of hits isn't always very useful. Afterwards, the relevant information still has to be picked out by hand. We long for search engines that can find pages containing syntactically different but semantically similar words, by making use of ontologies (formal terminologies).
- Extraction of information
At present, if one wants to extract relevant information from information resources, it can only be done by way of human browsing and reading. After all, the automatic agents lack any common sense knowledge required to extract information from textual representations, and they are incapable of integrating information that is distributed over different sources. The HTML tags are mainly aimed at the layout and provide relatively little information on the content of a web page; that's why that information is difficult to re-use in a different context.
- Maintenance of information
The maintenance of weakly structured resources is difficult and time-consuming when such resources grow too large. To keep such collections consistent, correct and up-to-date, something else is required: mechanized representation of semantics and constraints that help trace anomalies.
- Automatic document generation
We want adaptive websites that tune their dynamic reconfiguration to user profiles or other relevant aspects [Perkowitz/Etzioni 1997]. For this purpose, semi-structured presentations have to be generated from semi-structured data. This requires a machine-accessible representation of the semantics of these information sources.
Declarative and Procedural Strategies
Two alternative strategies are possible to solve these problems. First of all, information resources can be declaratively enriched with annotations, making their semantics accessible to machines and processable by intelligent software. Secondly, programmes (filters, wrappers, extraction programmes) can be written that procedurally extract such semantics of web resources [Muslea 1998]. The procedural and declarative approaches are complementary. The procedural approach can be used to generate annotations for web sources and existing annotations make procedural access to information much easier [Harmelen/Fensel 1999].
Whereas HTML enables visualization of information on the web, it is insufficiently capable of describing this information in such a way that one can use software to find or interpret this information. HTML doesn't allow for inclusion of contextual or conceptual data. Therefore a computer programme cannot see, for example, whether the figures in a web document represent an amount, age or date. For this the computer needs metadata in the shape of encoding that tells what one can expect in a document. With this encoding computer programmes can perform certain actions, such as ordering addresses and telephone numbers. In theory the solution is very simple: use marking that indicates what the information is, not what it should look like.
New exchange languages: XML, RDF and DAML
Exchange languages are languages that provide documents and messages with extra information, making their meaning unambiguous for both humans and computers. Or, more precisely:
"Exchange languages are formal languages that add markings to documents for the purpose of interpretation. The documents are exchanged between people and/or computers. Also the languages that describe interpretations of and operations on thus marked documents belong to the exchange languages" [Van der Steen 2001].
In order to introduce metadata to documents the W3C has proposed a number of new standards. The first of these was PICS (Platform for Internet Content Selection). PICS is a mechanism that communicates rating labels for web pages from a server to clients. These rating labels contain information on the content of web pages: for example, whether a certain page contains a peer-reviewed research article, whether it has been written by an acknowledged researcher, or whether it contains sex, nudity, violence or abusive language. PICS doesn't establish criteria with regard to content, but is a general mechanism for the creation of rating label systems. Several organizations can provide rating labels with regard to content based on their own objectives and values; users can set their browser in such a way that each web page that doesn't comply with their own criteria is filtered out. The development of PICS was motivated in particular by the restrictions the American government wanted to impose on the internet by means of the 'Communications Decency Act' (whose indecency provisions were later struck down by the US Supreme Court).
PICS offers a limited framework for metadata. Following on this the W3C has taken several initiatives to develop a more general framework for metadata. There are at least two important technologies for the development of the semantic web: eXtensible Markup Language (XML) and the Resource Description Framework (RDF).
HTML was designed to display information; XML was designed to describe information. XML is a way to structure, store and send information. Both the mark-ups that are used to design HTML documents and the structure of HTML documents are fixed. Writers of HTML documents can only use mark-ups that are defined in the HTML standard. XML, on the other hand, enables authors to define their own tags and their own document structure. XML is not a replacement for HTML. In the future XML will be used to describe the information, whereas HTML will be used to format and present the same information.
XML enables site builders to mark their information semantically. With XML anyone can make their own tags that annotate web pages or parts of a document on a page. Scripts or programmes can make use of these tags, but the scriptwriter has to know what the page writer is using each tag for. So, by means of XML users can give their documents a structure of their own.
XML is not semantics or a tag set, but just a meta-language specifically for the description of markup languages. XML is a grammatical system for the construction of special descriptive languages. It is, for example, possible to make a special descriptive language with XML for mathematical, psychological or sociological information. These descriptive languages, made with XML, are called XML applications.
XML documents closely resemble HTML documents. XML works with the same building blocks as HTML: elements, attributes and values. An element has an opening tag with a name, enclosed between a less-than sign (<) and a greater-than sign (>). Where necessary, the element name is followed by attributes with values. The markings appear in pairs, as <starttags> and </endtags>. The markings surround a document part that can itself contain document parts. This embedding gives the document a tree structure. A 'document scheme' describes which parts a document can contain and in which order.
A document scheme also records where hypermedia information can be included. Designers of document schemes can decide for themselves which names they choose for tags and attributes, and which orderings and embeddings of document parts they apply. So, XML is an open standard.
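The tree structure that results from nested tags can be made visible with a few lines of Python. This is only a sketch: the element names (article, title, section, heading, para) and the lang attribute are invented for illustration and do not belong to any particular document scheme.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the tag and attribute names are illustrative only.
DOC = """<article lang="en">
  <title>The future of the semantic web</title>
  <section>
    <heading>A broad vision</heading>
    <para>First paragraph.</para>
  </section>
</article>"""

def outline(element, depth=0):
    """Return (depth, tag) pairs in document order."""
    rows = [(depth, element.tag)]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows

root = ET.fromstring(DOC)
tree_rows = outline(root)

# Print each element indented by its nesting depth.
for depth, tag in tree_rows:
    print("  " * depth + tag)

# Attributes with values are available per element.
print(root.attrib)
```

Because the author chose the names, a programme reading this document still has to be told what 'heading' or 'para' means; the syntax alone only guarantees the tree.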
Grand Unification Theory
"RDF has always had the appeal of a Grand Unification Theory of the Internet, promising to create an information backbone into which many diverse information sources can be connected. With every source representing information in the same way, the prospect is that structured queries over the whole Web become possible.
That's the promise, anyway. The reality has been somewhat more frustrating. RDF has been ostracized by many for a complex and confusing syntax, which more often than not obscures the real value of the platform" [Dumbill 2000]
The W3C has designed a new logical language, allowing computers to represent and exchange data. That language is called RDF, short for Resource Description Framework. It has been designed to facilitate interoperability between applications that generate and process machine-understandable representations of data about resources on the web. RDF integrates a variety of web-based metadata activities: sitemaps, content rating, data collection by search engines ('web crawling'), digital library collections and distributed authoring.
RDF offers a framework for the construction of logical languages for collaboration in the semantic web. It is an XML-based language in which computers can represent and exchange data. RDF provides information on the meaning of information. In RDF, a document makes assertions that specific things have properties, which in turn have values.
RDF is a foundation for processing metadata: applications can exchange information on the web ('interoperability') that can be understood by machines. One of the aims of RDF is to specify a standardized (and thus interchangeable) semantics for data based on XML. So, in a certain way it is a general mechanism for knowledge representation. The definition of the mechanism is domain-neutral: the semantics of specific domains is not fixed, but the mechanism can be used to describe information from any domain.
RDF is a scheme language. An RDF document has a pointer to its RDF scheme at the top. This is a list of the terms of data that are used in the document. Anybody can make a new scheme document.
RDF metadata can be used in a variety of application domains. It can be used:
- in resource discovery to provide better search engine capabilities.
- in cataloguing, for describing the content and content relationships available at a particular web site, page or digital library.
- by intelligent software agents to facilitate knowledge sharing and exchange.
- in content rating.
- in describing collections of pages that represent a single logical 'document'.
- for describing intellectual property rights of web pages.
- for expressing the privacy preferences of a user as well as the privacy policies of a web site.
Basic RDF model
The basic data model consists of three object types: resources, properties and statements.
Resources are identified by a resource identifier. A resource identifier is a URI plus an optional anchor ID. Consider, for example, the following sentence:
Albert Benschop is the editor of the resource http://www.sociosite.net.
All things described by RDF expressions are called resources. A resource may be an entire web page, such as the HTML document "http://www.w3c.org". A resource may be a part of a web page, such as a specific HTML or XML element within the document source. A resource may also be a whole collection of pages, such as an entire website. A resource can also be an object that is not directly accessible via the web, such as a printed book. Resources are always named by URIs plus optional anchor IDs. Anything can have a URI.
A property is a specific aspect, characteristic, attribute or relation that is used to describe a resource. Each property has a specific meaning, defines its permitted values, the types of resources it can describe, and its relation with other properties.
A specific resource, together with a named property and the value of that property for that resource, is an RDF statement. These three individual parts of a statement are called, respectively, the subject, the predicate and the object. The object of a statement (i.e. the property value) can be another resource (specified by a URI) or a literal (a simple string or other primitive datatype defined by XML).
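The example sentence about http://www.sociosite.net has the following parts:
- Subject (resource): http://www.sociosite.net
- Predicate (property): editor
- Object (literal): "Albert Benschop"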
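The subject, predicate and object of the example statement can also be written down as a simple data structure. The sketch below is not RDF syntax; it only mirrors the three-part model. The bare property name "editor" is an assumption made for readability, since the text does not say which schema URI would name that property.

```python
from typing import NamedTuple

class Statement(NamedTuple):
    subject: str    # the resource being described (a URI)
    predicate: str  # the named property
    obj: str        # the property value: another resource or a literal

# "editor" is a bare word here; real RDF would name the property
# with a URI taken from some schema (not given in the text).
stmt = Statement(
    subject="http://www.sociosite.net",
    predicate="editor",
    obj="Albert Benschop",
)

def is_resource(value):
    """Crude check used only in this sketch: treat URIs as resources."""
    return value.startswith(("http://", "https://"))

print(stmt)
print("object is a literal:", not is_resource(stmt.obj))
```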
The DARPA Agent Markup Language (DAML) is a standard language for the construction of ontologies and ontology-based knowledge representations. It contains instruments for ontology development, consistency control and mediation between ontologies. The aim of DAML is to develop a language and instruments that facilitate the semantic web. In the future DAML should become the 'lingua franca' for artificial intelligence. Just like RDF, DAML is based on XML, which makes it easier to integrate with other web technologies.
A useful graphic survey of exchange languages is given in The XML Family of Specifications: The Big Picture by Ken Sall.
Typology of links
In linear texts, such as books or articles in conventional journals, every information unit is connected with at most two other units: the preceding information unit (page, paragraph, word) and the following one. To mitigate the restrictions of linearly structured documents, several instruments were developed over the years: footnotes, references, quotations, indexes, keywords and name registers. Since the birth of the World Wide Web we have been able to step outside the borders of the linear structuring of information.
The web is a rapidly growing collection of information resources that are interrelated by links. The strength of a hypertextual link is 'that anything can be connected with anything'. Links are cross-references between or within documents. This enables writing and publishing in a non-linear order. Researchers can finally escape from the restrictions of the linear organization of all texts and bibliographies. They can now follow a reference path of their own choice, matching their interests [Berners-Lee 1999:38].
A link is a reference in a document to another document (external link) or within the same document (internal link).
What is a link?
A link is an explicit relation between resources or parts of resources. A resource is any addressable unit of information or service. Examples of resources are files, pictures, documents, programmes and search results. The means that is used to address a resource is a URI (Uniform Resource Identifier) reference. It is also possible to address a part of a resource.
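How a URI reference can address either a whole resource or a part of it can be illustrated with Python's standard library. In this sketch the path index.html and the anchor name links are made up for the example; only the host name comes from the text.

```python
from urllib.parse import urldefrag

# Hypothetical reference: the path and the anchor name are invented.
reference = "http://www.sociosite.net/index.html#links"

# Split the reference into the resource and the optional fragment.
resource, fragment = urldefrag(reference)

print(resource)  # the resource as a whole
print(fragment)  # the part of the resource that is addressed
```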
- Internal link: works in two directions.
  - Table of contents
  - See ch., par. or page
- External link: works in one direction.
  - Notes and footnotes
In hypertext the external (also called 'normal') links are those between a hypertext document and an external document. Internal (also called 'embedded') links indicate that something is going to happen to the document: one jumps to another part of the document, an image appears in a web page, a programme is activated or a simulation is shown.
Internal and external links are merely technical, and therefore meaningless, links. A semantic link is a reference from one concept (= meaning) to another concept. Semantic links are therefore meaning-carrying hyperlinks.
Two of the more recent languages enlarging the flexibility of XML are XML Linking Language (XLink) and XML Pointer Language (XPointer).
XLink adds a number of new functionalities to the hyperlinks on the web:
- Links leading to multiple destinations.
- Bi-directional links: links that can be followed in both directions, regardless of where you started.
- Links that annotate read-only documents: you can make links that are visible when people look at a document, even if you aren't the owner of the document.
- Links with special behaviours, such as expand-in-place, new window vs. replacement, automatic traversal, etc.
- Link databases, with all possibilities of filtering, sorting, analysing, and processing of link collections.
With XLink, XML authors can construct sophisticated hyperlinks. Apart from an extensive link semantics, XLink supports annotation services and, when XPointer is used, precise addressing of sub-resources. XLink describes how both simple uni-directional links (as in HTML) and more complex multi-directional links can be added to XML documents. By means of a link set a whole series of files, separate positions in a file, or both at the same time can be connected with each other.
XPointer specifies a mechanism for pointing at arbitrary fragments ('chunks') of a target document, even when the original author of the target document hasn't introduced identifiers for fragments (for example with the tag "this_fragment#section_4"). XLink and XPointer are based on two mature standards from the publishing world: the Text Encoding Initiative (TEI) and the Hypermedia/Time-based Structuring Language (HyTime).
See also the (provisional) typology of semantic links of Harmsze and Kircz/Harmsze.
Ontologies are the building blocks for the semantic web [Klein/Fensel 2001]. Ontologies play an important part on the web as they allow the processing, sharing and re-use of knowledge between programmes. An ontology is a classification system for concepts and their underlying connections within a specific domain of knowledge. It is a kind of proto-theory, indicating which elements exist within a specific domain and how these elements can be related to each other. An ontology is a representation of a shared conceptualisation of a specific domain [Decker e.a. 2000]. Ontologies support the integration of heterogeneous and distributed information resources [Fensel 2001a].
Ontologies in electronic commerce
For e-commerce as well the development of the semantic web is of great importance. One of the most crucial problems is the integration of heterogeneous and distributed product descriptions. Suppliers and salesmen of online products and services usually don't reach a consensus on the products and services belonging to a domain, on how they can be described and how a product catalogue should be structured. So, an ontology for the domain of electronic commerce is primarily established by the construction of a common product catalogue that can be used for all search actions and transactions.
Dieter Fensel provides a survey of the initiatives that have been taken to construct ontologies to mediate and order electronic commerce. These ontologies allow for a semantics of data that can be processed by machines. Completely new kinds of automated services can be grafted onto such an infrastructure of meaningful data. Intelligent software agents can search the whole internet relatively independently for products and services the user is interested in; they can compare prices and make suggestions, form coalitions of buyers and sellers, negotiate about products and prices, or help configure products and services in such a way that they meet the specified demands of the users.
A detailed ontology for the HRM domain can be found at the HR-XML Consortium. The consortium specializes in the development and promotion of a standard suite of XML specifications to enable e-business and the automation of human resources-related data exchanges.
The ontologies that make the semantic web possible are formal conceptualisations of specific domains that are shared by people and/or application systems. Ontologies also form the backbone of the management of metadata because they pave the way for the semi-automatic semantic annotation of web pages as well as for retrieval of information from web resources.
An ontology usually contains a hierarchic description of important concepts in a domain, and describes crucial qualities of each concept by means of a property-value mechanism. Moreover, the relations between concepts can be described by additional logical sentences. Finally, individuals in a specific domain are assigned one or more concepts to give them their proper type.
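The ingredients named here (a hierarchy of concepts, a property-value mechanism, and individuals that are assigned a concept) can be sketched in a few lines of Python. The concepts, properties and the individual my_volvo are invented for illustration; real ontology languages offer far richer constructs, including the additional logical sentences mentioned above, which this sketch leaves out.

```python
# Hypothetical mini-ontology; all concept and property names are invented.
SUBCLASS_OF = {"car": "vehicle", "bicycle": "vehicle", "vehicle": "thing"}
PROPERTIES = {"vehicle": {"has_wheels": True}, "car": {"wheel_count": 4}}
INSTANCE_OF = {"my_volvo": "car"}

def ancestors(concept):
    """Return the concept followed by all its superconcepts."""
    chain = [concept]
    while chain[-1] in SUBCLASS_OF:
        chain.append(SUBCLASS_OF[chain[-1]])
    return chain

def properties_of(individual):
    """Collect property values, letting specific concepts override general ones."""
    merged = {}
    for concept in reversed(ancestors(INSTANCE_OF[individual])):
        merged.update(PROPERTIES.get(concept, {}))
    return merged

print(ancestors("car"))
print(properties_of("my_volvo"))
```

The individual inherits has_wheels from the general concept and wheel_count from the specific one, which is the point of describing concepts hierarchically.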
Ontologies offer a shared and common understanding of some domain that can be communicated between people and application systems. The naming within document schemes is of great importance, especially in the exchange of structured information. Therefore there is an urgent need for ontologies for names.
Due to the global character of the web there is a need to provide search agents with a universal reference framework. Such a domain specific ontology also plays an important part in data mining and extraction of knowledge from documents.
A number of tools for building ontologies are now available, such as Protégé 2000, OILed and OntoEdit. There are also libraries with ontologies that can be re-used. Examples of these are Ontolingua and the DAML ontology library.
Common and shared understanding
It is the mission of scientific institutions to produce new knowledge, and to distribute it in such a way that this knowledge can be shared and a common understanding evolves that serves as the foundation for ongoing collaboration. The question is how the semantic web can contribute to the development of new knowledge and the reinforcement of common understanding within the scientific domains.
In order to understand something new we have to make a link with other things we already understand well [Berners-Lee 1999:193]. Thus, new knowledge is defined in terms of what we already know. All scientific definitions and concepts are relational, i.e. they are related to other definitions and concepts, just like in a dictionary. By following the links on the semantic web, a computer can convert any term it doesn't understand into a term it does understand. So this relative form of 'meaning' can be processed by a machine. In order to link terms we need inference languages. With inference languages computers can convert data from one classification into another. The level of inference enables computers to link definitions and concepts. In this way the semantic web can contribute to the production of new knowledge.
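Converting an unknown term into a known one by following links can be sketched as a simple lookup chain. This is a toy illustration, not an inference language: the equivalence links and term names below are invented, and a real system would follow typed links published in schemas or ontologies.

```python
# Hypothetical equivalence links between vocabularies; all terms are invented.
SAME_AS = {
    "postcode": "zip_code",
    "zip_code": "postal_code",
    "surname": "family_name",
}
# Terms this particular programme already knows how to handle.
KNOWN_TERMS = {"postal_code", "family_name"}

def resolve(term):
    """Follow links until a known term is reached; return None if stuck."""
    seen = set()
    while term not in KNOWN_TERMS:
        if term in seen or term not in SAME_AS:
            return None  # a loop, or no further link to follow
        seen.add(term)
        term = SAME_AS[term]
    return term

print(resolve("postcode"))
print(resolve("colour"))
```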
Traditional systems of knowledge representation have typically been centralized. Domain-specific, controlled keyword systems and thesauri were developed and distributed, usually by library institutions. In such indexes and thesauri a definition of common concepts (such as 'fruit', 'parent', 'conflict', 'inequality' or 'car') is established. This central control makes such systems inflexible. Their increasing size and scope rapidly make such systems unmanageable. Moreover, centralized systems limit the questions that the computer can reliably answer [Berners-Lee/Hendler/Lassila 2001].
Higher-level computer languages are required to interpret the data on a website in such a manner that they understand that a word in one document has the same meaning as a word in another document, even if it is written differently. There is no need to know the meaning of the word itself, as long as it is indicated that the word can be considered part of a certain concept. This is a much better way of knowledge representation than when computers have to rely on descriptions of individual words such as 'a wheel is a thing and a car has four wheels'. By making connections between conceptual information the semantic web can gradually collect knowledge about its own domain. When smart programmes ('intelligent agents') can collect all the information the user wishes, surfing by the user is actually no longer necessary.
Apart from the production of new knowledge, the question within any scientific knowledge domain is how, within the borders of controversy and diversity, we can still reach a common understanding of basic concepts and methodologies. Such an understanding should make it possible, within a domain and preferably also between disciplinary domains, to conduct a self-reflexively controlled exchange of arguments (including well-founded rules and standards for argumentation, methodologies, empirical verification, and statistical relevance and validity criteria). The constitution of such common understanding is, however, at the same time a complex (multi-layered) and dynamic (with several temporalities) process. People reach a common understanding by making a series of consistent associations between words that sufficiently resemble each other. Common understanding is a condition for collaboration [Berners-Lee 1999:192].
The semantic web is a utopia that is coming closer and closer. More and more organizations and networks, such as the W3C, are working towards a standardization of the exchange languages of the web. These organizations specify the semantics using formal languages and inference mechanisms. The real challenge, however, is to link these formal semantics with the deeper meaning reflected by the consensus discovered among users on the semantic web [Beherens/Kashyap 2001]. The issue is not establishing worldwide norms (authoritatively prescribing such norms is simply impossible) but working towards partial understanding [Berners-Lee 1999:197]. The semantic web is therefore much more an infrastructure that paves the way for the creation of common understanding. Just like the internet, the semantic web will be as decentralized as possible.
Laborious searching
Many scientific resources are very hard to find. They are missed by search engines, or you have to struggle through tens of pages with hits, hoping that somewhere there is a relevant link to the document you are looking for. The restrictions of the present generation search engines have far greater consequences for scientists than for regular users.
Most search engines work with crawler programmes that index a web page, jump to another page that is referred to, index this page, etc. Due to the enormous growth of the web these search engines will soon become unusable. Even the most extensive search engines cover hardly half of the total number of web pages, and then only half of the static pages of the surface web, not the information that is stored in the deep web of databases.
New search technologies promise to increase the precision of queries on the web substantially. The introduction of XML, for example, makes it possible to restrict a query to scientific documents, or to documents that belong to a highly specialized scientific area. It is expected that within a few years keyword searching across the whole web will be a thing of the past for most researchers. Personal queries will increasingly start from specialized scientific search portals.
Some search engines are already setting out on a new course. An important innovation in search technology has been inspired by the citation analyses that are applied to scientific literature. Conventional search engines use algorithms and simple rules of thumb to order pages by the frequency of the keywords specified in a query. New types of search engines use the crisscross of links between web pages. Pages that are referred to from many other sites are regarded as 'authorities' and are ranked highest in the search results. Owing to this, in less than a year Google (developed by two American students, Sergey Brin and Lawrence Page) has become the most popular search engine, because it yields more precise results for most queries than the conventional engines. This ranking looks not only at the number of links, but also at where they come from. A link from a reputable scientific journal has greater weight than a link from a random homepage. 'Some links are more equal than others'.
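The idea that a link counts for more when it comes from a page that is itself well linked can be illustrated with a small iterative calculation. The sketch below is a simplified, textbook-style link ranking, not the algorithm any particular search engine actually runs; the four-page link graph, the damping value of 0.85 and the number of iterations are all assumptions chosen for the example.

```python
# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "journal": ["lab"],
    "lab": ["journal"],
    "homepage": ["lab"],
    "blog": ["lab", "homepage"],
}

def rank(links, damping=0.85, iterations=50):
    """Iteratively share each page's score among the pages it links to."""
    pages = list(links)
    score = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small base score.
        new = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, targets in links.items():
            for target in targets:
                # A link passes on a share of the linking page's own score,
                # so links from high-scoring pages weigh more.
                new[target] += damping * score[page] / len(targets)
        score = new
    return score

scores = rank(LINKS)
ordering = sorted(scores, key=scores.get, reverse=True)
for page in ordering:
    print(page, round(scores[page], 3))
```

In this toy graph 'journal' receives only one link, but it comes from the best-linked page, so it ends up well above 'homepage', which also receives exactly one link.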
New algorithms are being developed that analyse documents not only for keywords, but also for concepts. Generally they make use of extensive thesauri that can recognize thousands of concepts. In this case search engines look for defined patterns of terms and analyse their contextual relations. Users can submit a scientific document to the search engine. This document is then automatically analysed, identifying the most important concepts and building a profile that is used to search for similar texts. Users can refine their queries by adapting the weight that is attributed to each separate concept. An example of this new search technology is the Dutch Collexis project. Besides the standard data and information retrieval capabilities, Collexis technology is able to discover relationships between different information items (via clustering and/or aggregation) and thus uncover important implicit knowledge.
The position of search engines will increasingly be taken over by 'intelligent' programmes that search using their experience of the needs, interests and preferences of their users. They learn from earlier search sessions. As search technologies improve, publishers of journals and administrators of electronic archives will also be able to make searching for scientific documents on the web easier. In journals, the references in published articles are increasingly linked to the resources.
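Concept-based matching with adjustable weights can be sketched as follows. This is a generic illustration under simple assumptions, not a description of how Collexis or any other product works: the thesaurus entries are invented, texts are split on whitespace only, and similarity is a plain weighted overlap of concept counts.

```python
from collections import Counter

# Hypothetical thesaurus mapping surface terms to concepts.
THESAURUS = {
    "car": "vehicle", "automobile": "vehicle",
    "inequality": "stratification", "class": "stratification",
}

def profile(text):
    """Count the concepts that the thesaurus recognises in a text."""
    words = text.lower().split()
    return Counter(THESAURUS[w] for w in words if w in THESAURUS)

def similarity(query_profile, doc_profile, weights=None):
    """Weighted overlap between two concept profiles (default weight 1.0)."""
    weights = weights or {}
    return sum(
        weights.get(concept, 1.0) * min(count, doc_profile[concept])
        for concept, count in query_profile.items()
    )

query = profile("class inequality and the automobile")
doc = profile("car ownership and inequality")

plain = similarity(query, doc)
boosted = similarity(query, doc, weights={"stratification": 3.0})
print(plain, boosted)
```

The two texts share no keyword except 'inequality', yet they match on both concepts; raising the weight of one concept is the kind of refinement the user is said to control.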
An increasing quantity of information is available via networks and databases. The present search engines support users only to a certain extent in localizing the relevant information. With 'intelligent agents' passive search engines can be converted into active, personal assistants.
The semantic web structures the meaningful content of web pages and creates an environment in which software agents, wandering from page to page, can carry out complex tasks for users in a relatively independent way. Intelligent agents are small software programmes that scour the internet to find information that meets the instructions of the users. So, agents are semi-autonomous computer programmes that assist the user in dealing with computer applications. Agents do not only make use of the semantic infrastructure, but can also contribute to the creation and maintenance of that infrastructure. Good agents will enable people to spend less time on searching for information, and more time on using or analysing 'relevant' information that is automatically opened up. A good internet agent should be communicative, capable, autonomous and adaptive [Hendler 1999].
- In the first place, you have to be able to communicate with a good agent. This is only possible when an agent speaks the same language as its user. An agent that does not understand where you want to go or what you want to do there is not very helpful. The biggest problem with current search engines is that, although they are based on language, they have no knowledge of the domains in question. The solutions to this problem involve ontologies and inference rules.
- Ontologies are formal definitions of knowledge domains. An example of this is: "If X is a car, then X has 4 tyres".
- An example of an inference rule is: "If one thing is part of something else, and that latter thing is itself a component of an assembly, then the first item is a part of the assembly".
- An agent should both make suggestions and be able to act. A good agent not only provides advice, but also provides a service. Thus, good agents are capable of doing things on the web of which you need not know the details. Users have to be able to delegate certain tasks to their intelligent assistants. Not only tasks such as searching, sorting and filing information are at stake, but also reading electronic mail, making appointments, keeping a calendar, and setting up a travel programme.
- An agent should be able to do things with as little supervision as possible, or with none at all. Users do not only want a 'humble slave' who dances to the master's piping, but also and especially a 'smart slave' who makes every effort on its own initiative to take over activities from its master. An intelligent assistant can operate relatively autonomously within the parameters set by its user.
- Finally, a good agent should use its experience in order to help the user. An agent should be able to adapt its behaviour on the basis of a combination of feedback from the user and environmental factors. This also implies that intelligent agents take the expertise of the user into account and help to lower the barriers for new users. Being adaptive presupposes a learning capacity on the part of the agent. A good agent has some domain knowledge, but learns what the user would like to do from the user's actions.
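The inference rule about parts and assemblies quoted in the list above can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: the facts are stored as simple (part, whole) pairs, and the function name and the example facts are hypothetical. Real ontology languages express such rules declaratively rather than in program code.

```python
def infer_part_of(facts):
    """Apply the rule 'a part of a part is a part of the whole' until nothing new follows.

    facts is a set of (part, whole) pairs; the result also contains the derived pairs.
    """
    closure = set(facts)
    changed = True
    while changed:
        changed = False
        for part, whole in list(closure):
            for inner_part, assembly in list(closure):
                # If 'whole' is itself a part of 'assembly', then so is 'part'.
                if whole == inner_part and (part, assembly) not in closure:
                    closure.add((part, assembly))
                    changed = True
    return closure


# Hypothetical facts, in the spirit of the car example above.
facts = {("tyre", "wheel"), ("wheel", "car")}
inferred = infer_part_of(facts)
print(sorted(inferred))
```

An agent equipped with this rule can answer that a tyre is part of a car, although nobody stated that fact explicitly.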
Such internet agents still have to be developed for the most part. The present experimental agents are by no means robust enough for large-scale use. The key problem here is building and maintaining ontologies for web use. In order to interact with an agent we need a new language in which we can communicate.
Electronic Publishing Environment (EPE)
Structuring scientific information
We have seen why the present search engines, which rely on keywords, are unfit for the detailed type of searching the scientific community needs. With the help of XML and other advanced web languages, the structuring of scientific material can be organized on a higher level. This makes it easier for web agents to find core aspects of scientific documents.
Modularisation of information units
A module is "a uniquely characterised, self-contained representation of a conceptual information unit aimed at communicating that information. Not its length, but the coherence and completeness of the information it contains make it a module. This definition leaves open that modules are textual or, e.g., pictorial. Modules can be located, retrieved and consulted separately as well as in combination with related modules. Elementary modules can be assembled into higher-level, complex modules. We define a complex module as a module that consists of a coherent collection of (elementary or complex) modules and the links between them. Using a metaphor, elementary modules are 'atomic' entities that can be bound into a 'molecular' entity: a complex module." [Kircz/Harmsze 2000].
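The distinction between elementary and complex modules lends itself to a simple data structure. The sketch below is one possible reading of the definition above, not the model of Kircz and Harmsze themselves: the class names, the field names and the example modules are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class ElementaryModule:
    """An 'atomic', self-contained unit of information."""
    identifier: str
    content: str


@dataclass
class ComplexModule:
    """A 'molecular' unit: a coherent collection of modules plus the links between them."""
    identifier: str
    parts: List[Union[ElementaryModule, "ComplexModule"]] = field(default_factory=list)
    # Each link is (source identifier, relation, target identifier).
    links: List[Tuple[str, str, str]] = field(default_factory=list)

    def elementary_ids(self):
        """Collect the identifiers of all elementary modules, however deeply nested."""
        ids = []
        for part in self.parts:
            if isinstance(part, ComplexModule):
                ids.extend(part.elementary_ids())
            else:
                ids.append(part.identifier)
        return ids


# Hypothetical article assembled from three elementary modules.
methods = ElementaryModule("m1", "Methods")
results = ElementaryModule("m2", "Results")
experiment = ComplexModule("c1", [methods, results], [("m2", "is-based-on", "m1")])
article = ComplexModule("c2", [ElementaryModule("m0", "Problem statement"), experiment])

print(article.elementary_ids())
```

Because every module carries its own identifier, each one can be located and consulted separately, while the complex module records how the pieces belong together.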
In the paper era the book and the article were the crucial units of scientific information. Books and articles in journals referred to other books and journals, if possible with a specification of the chapter or the page. In the digital era this classical unit of scientific information has been broken up. The structure of digitally presented scientific information is much more intricate ('granularity of information'). This has made the digital information units much more coherent and therefore more autonomous. The information modules or molecules are interrelated by means of simple (internal and external) hyperlinks and by semantic (= meaningful) hyperlinks. The new exchange languages (XML, RDF and DAML) not only make it possible to define and exchange meaningful information units; they also considerably extend the repertoire of more advanced hyperlinks.
The introduction of higher-level exchange languages leads to a clear separation between meaningful content and display. Owing to this, a formerly unattainable goal is rapidly coming closer: the cognitive structures of academic publications become transparent by means of connections between scientific concepts: ideas, hypotheses, argumentations, refutations, interpretations, comments, etc.
An electronic publishing environment is a clear-cut virtual space in which scientific documents are published and discussed. Scientific documents are digital resources with their own identity that is composed of four elements: authenticity, traceability, quotability and dynamics.
The foundation of the social sciences is the interpretation and reinterpretation of primary and secondary resources. Constructing a convincing argument depends on the recognition of the authenticity of the resource material. Authenticity is mainly based on judgements on the originality, comprehensiveness and internal integrity of a document. However, there are still few reliable methods to establish the authenticity of digital resources. That is not surprising in a situation in which the amount of resources has increased enormously, in which it is extremely easy to make changes in digital information, and in which there is a great chance that many digital copies of the same work exist, with small differences. Digital technology has made fraud and the counterfeiting of resources not only easier but also more tempting. To prevent this, several technical and social strategies have been developed with which the authenticity of digital resources can be established. First of all, there are a number of public methods to determine and guarantee the authenticity of resources. The best known of these are depositing copyrights (intellectual property and/or right of reproduction), certifying original resources, registering unique identifications of documents, and defining metadata in which the authenticity of documents is recorded. Secondly, there are a number of secret methods with which data that identify the source of an object are hidden in the object itself. Examples are digital watermarks, steganography and digital signatures. Finally, there are a number of functionally independent methods with which specific technologies are linked to the resource. Well-known examples are encryption and 'object encapsulation'.
None of these methods is a cure-all on its own. Authenticity plays a part in each stage of the scientific research process. In finding, retrieving and using resources, a different combination of methods will have to be used each time to establish the authenticity of these resources.
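The problem of copies with small differences, and the idea of attaching a verifiable mark to a document, can be illustrated with Python's standard library. The sketch below makes two assumptions that should be stated plainly. First, a cryptographic fingerprint (SHA-256) is used to detect changes. Second, a keyed code (HMAC) stands in for a digital signature: a real digital signature relies on public-key cryptography, which the standard library does not provide. The function names, the key and the sample text are hypothetical.

```python
import hashlib
import hmac


def fingerprint(document: bytes) -> str:
    """Return a SHA-256 fingerprint; any change in the document changes it."""
    return hashlib.sha256(document).hexdigest()


def sign(document: bytes, secret_key: bytes) -> str:
    """Return a keyed authentication code (a stand-in for a digital signature)."""
    return hmac.new(secret_key, document, hashlib.sha256).hexdigest()


def verify(document: bytes, secret_key: bytes, signature: str) -> bool:
    """Check that the document still matches the code issued for it."""
    return hmac.compare_digest(sign(document, secret_key), signature)


# Hypothetical document and a copy with one small difference.
original = b"Authenticity is based on originality and internal integrity."
altered = b"Authenticity is based on originality and external integrity."
key = b"example-key-held-by-the-publisher"

signature = sign(original, key)
print(fingerprint(original) == fingerprint(altered))
print(verify(original, key, signature), verify(altered, key, signature))
```

Such a check only shows that a copy has not changed since the mark was issued. It says nothing about who issued the mark or whether the original was trustworthy, which is why the technical methods have to be combined with the social ones mentioned above.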
Digital resources have to be traceable. By linking each document to a unique web address (a URL or URI) they become universally traceable. By making consistent use of domain-specific descriptive languages (ontologies) it will also be possible to make scientific concepts and data traceable (even when they are stored in the databases of the deep web).
To quote a resource on the web it is insufficient to merely refer to the URL. Quoting an online resource requires at least the title, the date of first publication, the electronic journal in which it was published, the institution that published the document and the location of that institution. Such data will increasingly be stored in the documents themselves, in metatags that can also be processed by computer programmes. The quotability of electronic documents is qualitatively extended when not only documents as a whole can be univocally quoted, but also meaningful parts of each document (specific data, concepts or methodologies).
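Storing citation data in machine-readable metatags can be sketched as follows. The example assumes the common convention of writing Dublin Core elements as HTML meta tags with names such as DC.title; the function name is hypothetical, and the web address is a placeholder, not the real address of any document.

```python
from html import escape


def citation_meta_tags(record):
    """Render a {name: value} citation record as HTML meta tags, one per line."""
    lines = []
    for name, value in record.items():
        lines.append(
            '<meta name="%s" content="%s">'
            % (escape(name, quote=True), escape(value, quote=True))
        )
    return "\n".join(lines)


# Illustrative record; the identifier is a placeholder address.
record = {
    "DC.title": "The future of the semantic web",
    "DC.creator": "Albert Benschop",
    "DC.date": "2004-04",
    "DC.publisher": "University of Amsterdam",
    "DC.identifier": "http://example.org/semantic-web",
}

tags = citation_meta_tags(record)
print(tags)
```

A programme that harvests these tags can assemble a complete quotation without a human having to read the page.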
Articles in conventional scientific journals are, by the nature of the data carrier, by definition static. Once printed, nothing can be changed in a publication. Possible corrections to the article can at best be included in the following issue of the journal. Correcting articles and books and bringing them up to date therefore always requires a new publication. Elementary mistakes in content, spelling and grammar remain visible in the original document. Updates and new insights cannot be added to such a document. Documents in electronic publishing environments are by definition dynamic in composition. If desired, they can be changed, completed and updated at any moment. There are already electronic journals whose editorial policy requires authors to update their document at least once a year. Dynamic documents, however, also require transparent version management.
The peculiarities of electronic publishing are extensively dealt with in Interactief publiceren in elektronische tijdschriften [Dutch only]. There, separate attention is also paid to the possibilities of reproducing and improving the mechanism of quality control by colleagues ('peer reviewing').
An electronic publishing environment is a programme with which one can publish digital documents on the internet, facilitate and moderate discussions on these publications, stimulate criticism from colleagues, and institutionalise quality control ('peer reviewing'). An electronic publishing environment is hypertextual, multimedial, interactive and flexible.
The information transferred in printed articles is by definition presented as a linear discourse. Each information unit is preceded and followed by exactly one other information unit. Readers are directed through the text via one and the same route, i.e. from beginning to end.
Digitally distributed scientific documents can already make use of hypertextual links between and within documents. In hypertext any information unit can be linked to several other information units. Readers can choose for themselves which of these links to follow. The structure of a hypertext only establishes the boundaries within which reading behaviour can vary, and within these boundaries the chance that specific reading routes are actually followed. In the network of mutually linked information units readers choose a route that meets their own interests and preferences.
Hypertext enables a better presentation of complexly structured theoretical discussions. After all, complex theories usually operate simultaneously on several levels of abstraction, which are often linked as nested hierarchies. Apart from these advantages in presentation, hypertext also enables direct access to the complete bibliographical quotation and to the quoted documents themselves. Moreover, before long the semantic revolution will make it possible to follow really meaningful hyperlinks, so that we can search specifically for several branches of scientific concepts and cross-connections in data structures.
The conventional model of academic publication consists almost entirely of text, usually black on white. Detailed arrangements are usually made about the number and colour scheme of illustrations, mainly because of the costs. EPEs offer the possibility to combine text, pictures and sound (including animations, simulations and demonstrations) in a relatively cheap way. Multimedial also means that scientists are enabled to make use of the most diverse internet technologies: from plain-text email via direct sound connections to complex video conferences.
Conventional publications in paper journals are hardly interactive. It is mainly one-way traffic from authors to readers, even if nearly always some opportunity is offered for discussion and critical reactions. However, due to the inherent slowness of the paper medium these reactions are not published until months later. It is not unusual for at least a year to pass between the submission of a manuscript for peer review and the publication of a reaction from a colleague.
EPEs allow a permanent interaction between authors, editorial staff, reviewers and readers. Already from the first stage of the research process a free rein can be offered for communication with colleagues. EPEs do not only offer room for interaction between producers of domain specific knowledge (and thus for discussion and criticism from colleagues), but also between editors and producers of scientific knowledge (and thus for quality control in 'peer-to-peer' relations among colleagues).
Advanced EPEs offer their readers/users the chance to comment immediately on scientific documents or specific parts thereof. Not only the authors concerned profit from this, but also the readers. While working through a scientific document the reader always has the chance to read the comments of colleagues on the document or on certain parts of it.
As said before, conventional scientific journals are, due to the specific nature of the information carrier ('the patient paper'), extremely static. Once something has been published, it remains as it is until the acid paper pulverizes. Mistakes can hardly be corrected (unless one makes a second edition); additions, updates and rewrites require a new publication. Once something has been printed it is difficult to change. In this respect EPEs are much more flexible.
In principle it is always possible and relatively simple in electronic publishing environments to make changes to already published digital documents. Smaller and bigger mistakes can be corrected immediately, and updates, changes and additions can be carried out quickly at all times. For 'publications in the making' this can be done without much ado, stating the date on which the document was last modified (version control). With 'arrived publications' precise metadata have to ensure that users are able to identify the different versions of the document. In the archives of the EPEs the earlier versions of the documents concerned can always be stored.
EPEs are not only flexible in the sense that changes can easily be made in documents. They are also more flexible as regards content. As it happens, well organized EPEs enable multiple use of information units that have been published by different authors.
An EPE is a permanent virtual space for scientific knowledge development and discussion on a disciplinarily or thematically defined domain. It is a space where connections can be made between conceptual information, allowing for a gradual accumulation of knowledge within its own domain.
- Arciniega, Fabio A. 
What is XLink?
An excellent introduction to the functioning of XLink.
- Automate or Die
- Bearman, David / Trant, Jennifer 
Authenticity of Digital Resources
In: D-Lib Magazine, June 1998.
- Baeza-Yates, Ricardo / Ribeiro-Neto
Modern Information Retrieval
- Behrens, Clifford / Kashyap, Vipul 
The "Emergent" Semantic Web: A Consensus Approach for Deriving Semantic Knowledge on the Web [pdf]
Paper presented at the 'Semantic Web Working Symposium' (SWWS), July 30 - August 1, 2001, Stanford University, California, USA.
- Berners-Lee, Tim 
Semantic Web Road Map
An overall plan for the architecture of the Semantic WWW.
- Berners-Lee, Tim 
De wereld van het WWW.
Amsterdam: Uitgeverij Nieuwezijds.
- Berners-Lee, Tim / Hendler, James / Lassila, Ora 
The Semantic Web
In: Scientific American, May 2001.
- Bots and Intelligent Agents
- Bradley, Neil 
The XML Companion.
- Brin, Sergey / Page, Lawrence
The Anatomy of a Large-Scale Hypertextual Web Search Engine
- Cargill, C. 
Information Technology Standardization: Theory, Process, and Organizations.
Bedford, MA: Digital Press.
- Castro, Elizabeth 
XML voor het World Wide Web.
Amsterdam: Peachpit Press.
A European Union project directed at the stimulation of a next generation scientific research that requires intensive calculation and analysis of shared large-scale databases and that is distributed over several scientific communities.
- Davenport, T.H. / Prusak, L. 
Working Knowledge: How Organizations Manage What They Know.
Boston: Harvard Business School Press.
- Davies, John / Fensel, Dieter / Harmelen, Frank van 
Towards the Semantic Web: Ontology-Driven Knowledge Management
Specialized in designing and developing solutions for data engineering and electronic publishing with XML technologies.
- Decker, Stefan / Fensel, Dieter / Harmelen, Frank van e.a. 
Knowledge Representation on the Web [pdf]
In: International Workshop on Description Logics.
- Description Logics
Editor: Carsten Lutz.
- Dublin Core Metadata Initiative (DCMI)
- Dumbill, Edd 
Putting RDF to Work
- Electronic Publishing Initiative at Columbia (EPIC)
EPIC aims at a new way of scientific and educational publication by making use of new media technologies in an integrated research and production environment.
- Erdmann, Michael / Studer, Rudi 
Ontologies as Conceptual Models for XML Documents
- Fensel, D. [2001a]
Ontologies: Silver Bullets for Knowledge Management and Electronic Commerce.
Berlin: Springer Verlag.
- Fensel, D. [2001b]
Understanding is based on Consensus.
Panel on semantics on the web. 10th International WWW Conference, Hong Kong.
- GCA: Graphic Communication Association
- Gilliland-Swetland, Anne J. / Eppard, Philip, B. 
Preserving the Authenticity of Contingent Digital Objects
In: D-Lib Magazine, July/August 2000.
- Golbeck, Jennifer / Grove, Michael / Parsia, Bijan / Kalyanpur, Aditya / Hendler, James 
New tools for the semantic web
In: Proceedings of 13th International Conference on Knowledge Engineering and Knowledge Management. Sigüenza, Spain.
- Golbeck, Jennifer / Parsia, Bijan / Hendler, James 
Trust networks on the semantic web
In: Proceedings of Cooperative Intelligent Agents.
- Gruber, T.R. 
A Translation Approach to Portable Ontology Specification.
In: Knowledge Acquisition 5: 199-220.
- Harmelen, Frank van / Fensel, Dieter 
Practical Knowledge Representation for the Web
In: Proceedings of the IJCAI'99 Workshop on Intelligent Information Integration.
- Harmsze, Frédérique-Anne P. 
Modular structure for scientific articles in an electronic environment. (PhD thesis)
See for other articles.
- Hendler, James [DARPA]
Agent Based Computing [.ppt]
Professor Hendler is one of the founders of DAML, the exchange language developed at the Defense Advanced Research Projects Agency [DARPA].
- Hendler, James 
Is There an Intelligent Agent in Your Future?
In: Nature, 11 March 1999.
- Hendler, James 
Agents and the Semantic Web.
In: IEEE Intelligent Systems 16(2).
- Hendler, James / Berners-Lee, T. / Miller, E. 
Integrating applications on the semantic web
In: Journal IEE Japan 122(10): 676-80.
- Hendler, J. / McGuinness, D.L. 
The DARPA Agent Markup Language.
In: IEEE Intelligent Systems, 15(6): 72-3.
- Hendler, James / Parsia, Bijan 
XML and the semantic web
In: XML Journal.
- Horn, Robert (MacroVU)
Can Computers Think?
In 1950 the English mathematician and computer pioneer Alan Turing wrote: "I believe that at the end of the century the use of words and general educated opinion will have altered so much that one will be able to speak of machines thinking without expecting to be contradicted". In the subsequent discussion leading minds from several scientific disciplines have tackled the question whether computers can think. Here the debate has been recorded in full.
- Information Society Technologies (IST) 
Workshop Semantic Web Technologies
- Klein, Michel / Fensel, Dieter 
Ontology versioning on the Semantic Web [pdf]
Paper presented at the "Semantic Web Working Symposium" (SWWS), July 30 - August 1, 2001, Stanford University, California, USA.
- Knowledge Representation
Editor: Enrico Franconi.
- Lambrix, Patrick
The most extensive collection of online information on description logics. Description logics are languages for knowledge representation that are tailored to the expression of knowledge on concepts and concept hierarchies.
- McIlraith, Sheila A. / Cao Son, T. / Zeng, Honglei 
Mobilizing the Semantic Web with DAML-Enabled Web Services
Semantic Web Workshop 2001, Hong Kong, China.
- MusicXML: Music Markup Language
An XML vocabulary designed to represent musical scores, in particular common western musical notation from the 17th century onwards. It has been designed as an interchange format for notation, analysis, retrieval and presentation applications.
- Northern Light Special Edition 
- Noy, Natalya F. / McGuinness, Deborah L. 
A Guide to Creating Your First Ontology
A project of the Information Society Technologies (IST) programme for Research, Technology Development & Demonstration. The programme attempts to make use of the full power of the ontological approach to facilitate efficient knowledge management. The technical backbone of the project is the use of ontologies for the different tasks of information integration and mediation. One of the results of the project is the Ontology Inference Layer (OIL). OIL is a standard for the specification and exchange of ontologies that provide for shared and common understanding of a domain that can be communicated between people and computer and web applications.
System for electronic access and use of mathematical information.
- Perkowitz, M. / Etzioni, O. 
Adaptive Web Sites: an AI challenge.
- RecipeML: Recipe Markup Language
An XML vocabulary for the presentation of recipes. RecipeML has been designed to facilitate the transfer of recipes via the web. Individuals, restaurants, food producers and publishers can make use of this language to file substantive recipes, to exchange and publish them. The ultimate goal is to facilitate the transfer of ingredients of recipes straight to the shopping list of the consumer.
- Rijsbergen, C.J. van 
A site completely dedicated to the collection of current information on the semantic web. It functions as a discussion forum for people interested in the semantic web.
- Semantic Web Workshop
Texts of and on the 2nd international workshop on the semantic web, Hong Kong, 1 May 2001 / Stanford, 30-31 July 2001.
- SGML/XML Users Group Holland
A knowledge platform with members aiming at the creation, sharing and propagation of hard core knowledge on the application possibilities of SGML/XML.
- Steen, G.J. van der 
Naar de menselijke maat. Het perspectief van Uitwisselingstalen
- UMBC AgentWeb
Information, newsletters and discussion forums on intelligent information agents, intentional agents, software agents, softbots, knowbots, inforbots, etc., published by the Laboratory for Advanced Information Technology of the University of Maryland.
- Uschold, M. / Grüninger, M. 
Ontologies: Principles, methods and applications.
In: Knowledge Engineering Review, 11(2), 1996.
- Vickery, B.C. 
In: Journal of Information Science 23(4): 277-86.
- Vliet, Eric van der
Building a Semantic Web Site
- WDVL: XML
Excellent documentation on XML, presented by the Web Developer's Virtual Library.
- Weibel, Stuart / Miller, Eric 
An Introduction to Dublin Core
A mailing list of W3C for those interested in RDF.
A reliable source with accurate and current information on the application of XML in industrial and commercial settings. It also serves as a reference point for specific XML standards such as vocabularies, DTDs, schemas and namespaces.
- XML Cover Pages
One of the best online guides on XML and SGML. It contains, among other things, an excellent SGML/XML Bibliography. Editor: Robin Cover.
Portal site with XML and SGML links, divided into categories and application areas.
dr. Albert Benschop
Social & Behavioral Sciences
Sociology & Anthropology
University of Amsterdam
Published: April, 2004
20th September, 2013