Posted to dev@cocoon.apache.org by Stefano Mazzocchi <st...@apache.org> on 2001/12/11 12:42:06 UTC

Semantic Relevance Rating

<context>
In this message I talk about indexing, searching and retrieving large
quantities of semi-structured XML data (mostly documents).
</context>

XML documents can be treated as plain text and indexed as such.
Unfortunately, this destroys their usefulness, since <b>Hello</b> would
not be indexed as 'hello' inside a 'b' element but simply as one raw
textual token.

Such a model would result in a very poor searching experience since the
user must know in advance how the markup is used. Add namespaces to the
picture, and this becomes clearly useless.

                                     - o -

An improved model indexes the results of the parsing stage, separating
the 'textual content' (element contents and attribute values) from the
'contextual content' (element names and attribute names).

This is, of course, a better way to index, but it still has a few defects:

 1) markup information is lost
 2) summary information is hard to recreate because of #1

One solution is to avoid indexing attributes, hoping that element
content is more structured in itself, thus the summary gives better
results. Unfortunately, attributes convey content as much as elements
do. One cannot assume or force the opposite.
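
The difference between the two models above can be sketched in a few
lines of Python. This is only an illustration under my own assumptions:
the sample markup and the extract() helper are hypothetical, and the
tokenization (lowercase, split on whitespace) is deliberately naive.

```python
import xml.etree.ElementTree as ET

def extract(xml_text):
    """Split a parsed document into textual and contextual content."""
    textual, contextual = [], []
    for elem in ET.fromstring(xml_text).iter():
        contextual.append(elem.tag)              # element name
        for name, value in elem.attrib.items():
            contextual.append("@" + name)        # attribute name
            textual.extend(value.lower().split())
        # elem.text precedes the first child, elem.tail follows the element
        for chunk in (elem.text, elem.tail):
            if chunk:
                textual.extend(chunk.lower().split())
    return textual, contextual

sample = '<p class="greeting">Say <b>Hello</b> world</p>'

# Model 1: index the raw text, markup and all.
raw_tokens = sample.split()

# Model 2: index the result of the parsing stage.
textual, contextual = extract(sample)

print(raw_tokens)
print(textual)
print(contextual)
```

In the raw model '<b>Hello</b>' survives as a single token, while the
parsed model yields 'hello' as text and 'b' as context. Note that the
link between the two is already gone, which is defect #1 above.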

                                     - o -

I propose a solution which I call 'Semantic Relevance Rating' (SRR for
short, which you can pronounce as 'SoRRy' :)

SRR adds the ability to merge semantic information from the markup
structure into the index, thus avoiding the loss of the contextual
information embedded in it.

Let me show you an example of what this is. Suppose we have to index
this document:

 <document>
  <metadata>
   <title>How <keyword>SRR</keyword> works</title>
   <authors>
    <author name="Stefano Mazzocchi" email="stefano@apache.org"/>
   </authors>
   <legal>Copyright &copy; Stefano Mazzocchi. All rights reserved.</legal>
   <abstract>
    <para>
     This article explains how
     <keyword>semantic relevance rating</keyword>
     works and outlines possible usage cases of the technology
    </para>
   </abstract>
   <related-concepts>
    <concept>semantic search</concept> 
    <concept>indexing</concept>
    <concept>data mining</concept>
   </related-concepts>
  </metadata>
  <content>
   <section title="Introduction">
    <para>Here I explain how
     <strong>Semantic Relevance Rating</strong> works.</para>
   </section>
   ...
   <section title="A usage case">
    <para>This is a usage case:</para>
    <figure href="schema.jpg" caption="Cocoon-based indexing system"/>
   </section>
   ...
  </content>
 </document>

Just as a browser cannot present the above document without style
information, an indexer cannot 'understand' or interpret the semantic
relevance of each piece of text without some information about the
context it appears in.

Following the stylesheet concept, I propose the creation of a
'relevance-sheet' which gives the indexer this information and allows
it to index the structured content in the way the document authors
intended.

Here is an example relevance-sheet for the above document:

document.srr:
<srr:relevances xmlns:srr="..." extends="metadata.srr">
 <srr:context xpath="section/@title" relevance="2"/>
 <srr:context xpath="para" relevance="1"/>
 <srr:context xpath="strong" relevance="1.5"/>
 <srr:context xpath="figure/@href" relevance="0"/>
 <srr:context xpath="figure/@caption" relevance="1.5"/>
</srr:relevances>

metadata.srr:
<srr:relevances xmlns:srr="...">
 <srr:context xpath="title" relevance="2"/>
 <srr:context xpath="author/@name" relevance="2"/>
 <srr:context xpath="author/@email" relevance="0"/>
 <srr:context xpath="keyword" relevance="5"/>
 <srr:context xpath="abstract" relevance="2"/>
 <srr:context xpath="related-concepts/concept" relevance="3.5"/>
</srr:relevances>
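
To make the idea concrete, here is a minimal Python sketch of how an
indexer might look up the relevance of each piece of text. Everything
in it is an assumption for illustration: the rules are copied from the
two sheets above into a plain mapping, and the matching is a simplified
"path ends with these steps" test rather than real XPath evaluation.
The names RULES, rule_for and walk are hypothetical.

```python
import xml.etree.ElementTree as ET

# Rules from document.srr and metadata.srr, merged into one mapping.
RULES = {
    "section/@title": 2.0, "para": 1.0, "strong": 1.5,
    "figure/@href": 0.0, "figure/@caption": 1.5,
    "title": 2.0, "author/@name": 2.0, "author/@email": 0.0,
    "keyword": 5.0, "abstract": 2.0, "related-concepts/concept": 3.5,
}

def rule_for(path):
    """Relevance of the longest rule matching the end of `path`, or None."""
    best = None
    for pattern, relevance in RULES.items():
        steps = pattern.split("/")
        if path[-len(steps):] == steps:
            if best is None or len(steps) > best[0]:
                best = (len(steps), relevance)
    return None if best is None else best[1]

def walk(elem, path, out):
    """Collect (text, relevance) pairs for attributes and element text."""
    here = path + [elem.tag]
    if elem.text and elem.text.strip():
        out.append((elem.text.strip(), rule_for(here)))
    for name, value in elem.attrib.items():
        out.append((value, rule_for(here + ["@" + name])))
    for child in elem:
        walk(child, here, out)
        if child.tail and child.tail.strip():
            out.append((child.tail.strip(), rule_for(here)))
    return out

doc = ET.fromstring(
    '<section title="A usage case">'
    '<para>This is a usage case:</para>'
    '<figure href="schema.jpg" caption="Cocoon-based indexing system"/>'
    '</section>'
)
rated = walk(doc, [], [])
for text, relevance in rated:
    print(relevance, text)
```

The section title comes out with relevance 2, the paragraph with 1, the
figure caption with 1.5 and the figure href with 0.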

How an SRR sheet is associated with a document is not defined here,
since it is a separate concern.

                                     - o -

The SRR solution yields a few interesting results:

1) the cost of 'semantizing' the information is proportional to the
number of schemas included in the data corpus to index, unlike RDF-like
solutions, whose cost is proportional to the total amount of
information included in the corpus.

For example, in a system with 10 different schemas and a million
documents, the cost lies in creating SRR relevance-sheets for those 10
schemas, compared to the cost of adding semantic RDF information to
each and every file.

This is the same concept as the separation of concerns (SoC) between
content and style, applied here to the separation between content and
the interpretation of its semantic relevance.

2) the user experience is no different from the one users are already
used to: they don't need to know the schema of the documents, nor
anything about metadata or metadata fields, in order to obtain the
information.

3) the SRR drives the indexing behavior and indicates whether or not
some information should be indexed at all.

Since the indexer multiplies the relevance factors, text that is
associated with a context of 'zero' semantic relevance is skipped,
which avoids polluting the index with unwanted information that might
'get in the way'.
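
A small Python sketch of the multiplication, under a loud assumption:
I take "multiplied" to mean that the factors of all contexts enclosing
a piece of text are multiplied together, and that the product is the
weight the term receives in the index. The exact combination rule is
not fixed by this proposal, and the helper names are hypothetical.

```python
from math import prod

def effective_relevance(factors):
    """Multiply the relevance factors of the contexts enclosing a text."""
    return prod(factors)

def index_terms(pieces):
    """Build term -> accumulated weight, skipping zero-relevance text."""
    index = {}
    for text, factors in pieces:
        weight = effective_relevance(factors)
        if weight == 0:
            continue  # zero relevance: the text never reaches the index
        for term in text.lower().split():
            index[term] = index.get(term, 0.0) + weight
    return index

pieces = [
    ("How works", [2.0]),           # text directly inside title
    ("SRR", [2.0, 5.0]),            # keyword nested inside title
    ("stefano@apache.org", [0.0]),  # author/@email
    ("schema.jpg", [0.0]),          # figure/@href
]
index = index_terms(pieces)
print(index)
```

With these numbers 'srr' ends up with weight 10, 'how' and 'works' with
weight 2, and neither the email address nor the image file name appears
in the index at all.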

                                     - o -

I'm currently working with the Lucene guys to add the ability to 'rate'
textual input in the indexer. When this is done, adding SRR capabilities
to Cocoon will be a matter of writing the relevance sheets for our DTDs.

Comments?

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------



---------------------------------------------------------------------
To unsubscribe, e-mail: cocoon-dev-unsubscribe@xml.apache.org
For additional commands, email: cocoon-dev-help@xml.apache.org

Re: Semantic Relevance Rating

Posted by Stefano Mazzocchi <st...@apache.org>.

giacomo wrote:

> > document.srr:
> > <srr:relevances xmlns:srr="..." extends="metadata.srr">
> >  <srr:context xpath="section/@title" relevance="2"/>
> >  <srr:context xpath="para" relevance="1"/>
> >  <srr:context xpath="strong" relevance="1.5"/>
> >  <srr:context xpath="figure/@href" relevance="0"/>
> >  <srr:context xpath="figure/@caption" relevance="1.5"/>
> > </srr:relevances>
> >
> > metadata.srr:
> > <srr:relevances xmlns:srr="...">
> >  <srr:context xpath="title" relevance="2"/>
> >  <srr:context xpath="author/@name" relevance="2"/>
> >  <srr:context xpath="author/@email" relevance="0"/>
> >  <srr:context xpath="keyword" relevance="5"/>
> >  <srr:context xpath="abstract" relevance="2"/>
> >  <srr:context xpath="related-concepts/concept" relevance="3.5"/>
> > </srr:relevances>
> 
> I think you missed to tell us that an SRR rates a schema not a
> document (I know this because we've talked about it privately).

Yeah, well, I thought it was evident below.

> How do you think overriding relevance values should be handled? Suppose
> the following:
> 
>  document.srr:
>  <srr:relevances xmlns:srr="..." extends="metadata.srr">
>   <srr:context xpath="section/@title" relevance="2"/>
>   <srr:context xpath="para" relevance="1"/>
>   <srr:context xpath="strong" relevance="1.5"/>
>   <srr:context xpath="figure/@href" relevance="0"/>
>   <srr:context xpath="figure/@caption" relevance="1.5"/>
>   <srr:context xpath="keyword" relevance="15"/>
>  </srr:relevances>
> 
>  metadata.srr:
>  <srr:relevances xmlns:srr="...">
>   <srr:context xpath="title" relevance="2"/>
>   <srr:context xpath="author/@name" relevance="2"/>
>   <srr:context xpath="author/@email" relevance="0"/>
>   <srr:context xpath="keyword" relevance="5"/>
>   <srr:context xpath="abstract" relevance="2"/>
>   <srr:context xpath="related-concepts/concept" relevance="3.5"/>
>  </srr:relevances>
> 
> Note the xpath to "keyword" in both SRRs.

Just as CSS does: replacement (override). That means: 'keyword' has
relevance '5' in the shared metadata markup, but relevance '15' in the
document schema.

Did you have something different in mind?
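
The replacement rule described above can be sketched as follows. This
is a hypothetical Python illustration, not a defined SRR processing
model: sheets are plain dictionaries, only a few of the contexts from
the example are included, and resolve() is a name I made up.

```python
def resolve(sheet_name, sheets):
    """Flatten a sheet and everything it extends into one mapping."""
    sheet = sheets[sheet_name]
    merged = {}
    parent = sheet.get("extends")
    if parent is not None:
        merged.update(resolve(parent, sheets))  # inherited values first
    merged.update(sheet["contexts"])            # the extending sheet wins
    return merged

sheets = {
    "metadata.srr": {
        "contexts": {"title": 2.0, "keyword": 5.0, "abstract": 2.0},
    },
    "document.srr": {
        "extends": "metadata.srr",
        "contexts": {"para": 1.0, "keyword": 15.0},
    },
}

document_rules = resolve("document.srr", sheets)
metadata_rules = resolve("metadata.srr", sheets)
print(document_rules["keyword"], metadata_rules["keyword"])
```

Documents rated with document.srr see 'keyword' at 15, while anything
that uses metadata.srr directly still sees 5.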
 
> > The way the SRR sheet is associated with the document is not defined
> > here since it is another concern.
> 
> Another concern is how SRR are obtained to rate schemas.

That's what I meant.

> And also how SRR (obtained from different sources) for the exact same
> schema are handled. I think all relevance rating values in an SRR should
> sum up to a fixed value to yield equal ratings among different sources.
> 
> But then all SRR for different schemas are equally rated. Are there SRRs
> (read schemas) which are more important than others?
> 
> >
> >                                      - o -
> >
> > The SRR solution yields a few interesting results:
> >
> > 1) the cost of 'semantizing' the information is proportional to the
> > number of schemas included in the data corpus to index, unlike RDF-like
> > solutions which costs are proportional to the entire information
> > included in the corpus.

This is where I said that SRRs rate schemas. Sorry, I thought it was
clear enough.

> > For example, in a system where there are 10 different schemas and a
> > million documents, the cost will be associated in creating SRR
> > relevance-sheets for those 10 schemas, compared to the cost of adding
> > semantic RDF information in each and every file.
> >
> > This is the exact same concept of SoC between content and style, here
> > associated to the separation between content and its semantic relevance
> > interpretation.
> 
> This is an economic and realistic approach because without automated
> RDFizability of documents nobody will pay such a price for semantic
> searching capabilities. And talking about automated RDFizability will
> raise the question how relevant can this be made.

This is the key point.

Many people in the XML world have started to show openly bad feelings
about RDF, and to question the need for a semantic web when something
like Google is so powerful.

I don't like the equation RDF='semantic web' on which the W3C bases all
its work, because there is more than one way of unlocking the
possibilities of a semantically marked-up hypermedia system (see MPEG-7
for another, non-RDF approach to a semantic hypermedia effort).

> > 2) the user experience is no different from the one he/she's used to: he
> > doesn't need to know the schema of the documents nor any information
> > about metadata or metadata fields in order to obtain the information.
> 
> If she/he is guided by some tools which handles the validation according
> to the underlying schema of the document.

??? what do you mean?

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------




Re: Semantic Relevance Rating

Posted by giacomo <gi...@apache.org>.

On Tue, 11 Dec 2001, Stefano Mazzocchi wrote:

> <context>
> In this message I talk about indexing, searching and retrieving large
> quantities of semi-structured XML data (mostly documents).
> </context>
>
> XML documents can be seen as text and indexed as such. Unfortunately,
> this would totally break usefulness since <b>Hello</b> would not be
> indexed as 'hello' between 'b' but simply as that textual token.
>
> Such a model would result in a very poor searching experience since the
> user must know in advance how the markup is used. Add namespaces to the
> picture, and this becomes clearly useless.
>
>                                      - o -
>
> An improved model indexes the results of the parsing stage, separating
> the 'textual content' (element contents and attributes value) from the
> 'contextual content' (element names and attribute names).
>
> This is, of course, a better indexing, but still has a few defects:
>
>  1) markup information is lost
>  2) summary information is hard to recreate because of #1
>
> One solution is to avoid indexing attributes, hoping that element
> content is more structured in itself, thus the summary gives better
> results. Unfortunately, attributes convey content as much as elements
> do. One cannot assume or force the opposite.
>
>                                      - o -
>
> I propose a solution which I call 'Semantic Relevance Rating' (shortly
> SRR, that you can pronounce as 'SoRRy' :)

:)

> SRR adds the ability to merge semantic information from the markup
> structure into the index, thus avoiding the loss of the contextual
> information embedded in it.
>
> Let me show you an example of what this is. Suppose we have to index
> this document:
>
>  <document>
>   <metadata>
>    <title>How <keyword>SRR</keyword> works</title>
>    <authors>
>     <author name="Stefano Mazzocchi" email="stefano@apache.org"/>
>    </authors>
>    <legal>Copyright &copy; Stefano Mazzocchi. All rights
> reserved.</legal>
>    <abstract>
>     <para>
>      This article explains how <keyword>semantic relevance
> rating</keyword>
>      works and outlines possible usage cases of the technology
>     </para>
>    </abstract>
>    <related-concepts>
>     <concept>semantic search</concept>
>     <concept>indexing</concept>
>     <concept>data mining</concept>
>    </related-concepts>
>   </metadata>
>   <content>
>    <section title="Introduction">
>     <para>Here I explain how <strong>Semantic Relevance Rating</strong>
> works.</para>
>    </section>
>    ...
>    <section title="A usage case">
>     <para>This is a usage case:</para>
>     <figure href="schema.jpg" caption="Cocoon-based indexing system"/>
>    </section>
>    ...
>   </content>
>  </document>
>
> just like a browser wouldn't be able to present the above document
> without style information, an indexer is not able to 'understand' or
> interpret the semantic relevance of each textual piece without some
> information that matches their contextual part.
>
> Following the stylesheet concept, I propose the creation of a
> 'relevance-sheet' which contains this information for the indexer that
> allows it to index the structured content in the way it was intended
> from the document authors.
>
> Here is an example relevance-sheet for the above document:
>
> document.srr:
> <srr:relevances xmlns:srr="..." extends="metadata.srr">
>  <srr:context xpath="section/@title" relevance="2"/>
>  <srr:context xpath="para" relevance="1"/>
>  <srr:context xpath="strong" relevance="1.5"/>
>  <srr:context xpath="figure/@href" relevance="0"/>
>  <srr:context xpath="figure/@caption" relevance="1.5"/>
> </srr:relevances>
>
> metadata.srr:
> <srr:relevances xmlns:srr="...">
>  <srr:context xpath="title" relevance="2"/>
>  <srr:context xpath="author/@name" relevance="2"/>
>  <srr:context xpath="author/@email" relevance="0"/>
>  <srr:context xpath="keyword" relevance="5"/>
>  <srr:context xpath="abstract" relevance="2"/>
>  <srr:context xpath="related-concepts/concept" relevance="3.5"/>
> </srr:relevances>

I think you forgot to tell us that an SRR rates a schema, not a
document (I know this because we've talked about it privately).

How do you think overriding relevance values should be handled? Suppose
the following:

 document.srr:
 <srr:relevances xmlns:srr="..." extends="metadata.srr">
  <srr:context xpath="section/@title" relevance="2"/>
  <srr:context xpath="para" relevance="1"/>
  <srr:context xpath="strong" relevance="1.5"/>
  <srr:context xpath="figure/@href" relevance="0"/>
  <srr:context xpath="figure/@caption" relevance="1.5"/>
  <srr:context xpath="keyword" relevance="15"/>
 </srr:relevances>

 metadata.srr:
 <srr:relevances xmlns:srr="...">
  <srr:context xpath="title" relevance="2"/>
  <srr:context xpath="author/@name" relevance="2"/>
  <srr:context xpath="author/@email" relevance="0"/>
  <srr:context xpath="keyword" relevance="5"/>
  <srr:context xpath="abstract" relevance="2"/>
  <srr:context xpath="related-concepts/concept" relevance="3.5"/>
 </srr:relevances>

Note the xpath to "keyword" in both SRRs.

> The way the SRR sheet is associated with the document is not defined
> here since it is another concern.

Another concern is how SRRs are obtained to rate schemas.

And also how SRRs (obtained from different sources) for the exact same
schema are handled. I think all relevance rating values in an SRR should
sum up to a fixed value to yield equal ratings among different sources.

But then all SRRs for different schemas are equally rated. Are there SRRs
(read: schemas) which are more important than others?
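
The normalization suggested above could look roughly like this in
Python. It is a sketch under assumptions: the fixed total of 100 and
the two sample sheets are invented for illustration, and normalize() is
a hypothetical helper.

```python
def normalize(contexts, total=100.0):
    """Rescale relevance values so that they sum to `total`."""
    current = sum(contexts.values())
    if current == 0:
        raise ValueError("a sheet with only zero relevances cannot be scaled")
    return {xpath: value * total / current
            for xpath, value in contexts.items()}

# Two sources rating the same schema on different scales.
sheet_a = {"title": 2.0, "keyword": 5.0, "para": 1.0}
sheet_b = {"title": 4.0, "keyword": 10.0, "para": 2.0}

print(normalize(sheet_a))
print(normalize(sheet_b))
```

Both sheets normalize to the same values, so the source that happened to
use bigger numbers gains no advantage. It also shows the second
question: after scaling, every schema carries the same total weight.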

>
>                                      - o -
>
> The SRR solution yields a few interesting results:
>
> 1) the cost of 'semantizing' the information is proportional to the
> number of schemas included in the data corpus to index, unlike RDF-like
> solutions which costs are proportional to the entire information
> included in the corpus.
> For example, in a system where there are 10 different schemas and a
> million documents, the cost will be associated in creating SRR
> relevance-sheets for those 10 schemas, compared to the cost of adding
> semantic RDF information in each and every file.
>
> This is the exact same concept of SoC between content and style, here
> associated to the separation between content and its semantic relevance
> interpretation.

This is an economical and realistic approach, because without automated
RDFizability of documents nobody will pay such a price for semantic
searching capabilities. And talking about automated RDFizability raises
the question of how relevant it can be made.

> 2) the user experience is no different from the one he/she's used to: he
> doesn't need to know the schema of the documents nor any information
> about metadata or metadata fields in order to obtain the information.

If she/he is guided by some tool which handles the validation according
to the underlying schema of the document.

> 3) the SRR drives the indexing behavior and indicates whether or not
> some information should be indexed or not.
>
> Since the relevance factors are multiplied by the indexer, text that is
> associated to a context of 'zero' semantic relevance is skipped and
> avoids polluting the index with information that might 'get in the way'
> unwanted.
>
>                                      - o -
>
> I'm currently working with the Lucene guys to add the ability to 'rate'
> textual input in the indexer. When this is done, adding SRR capabilities
> with Cocoon is a matter of writing the relevance sheets for our DTDs.
>
> Comments?

You got it now :)

Giacomo

