
Posted to dev@forrest.apache.org by Juan Jose Pablos <ch...@che-che.com> on 2003/09/11 16:49:44 UTC

about lucent and exist

Hi,

I started looking at Ramon Prades' bug. On the todo list I can see:

     - Improve ForrestIndexer: It should work with accented characters
      ("a" and "á" should be the same) and should reduce indexes to their
      roots (i.e. jump, jumper, jumping should all be the same index).
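
To make the two requirements concrete, here is a rough sketch in Python. It is not the ForrestIndexer code and not Lucene's analyzer API; the function names and the suffix list are made up for illustration, and a real stemmer would be far more careful than this.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Strip combining marks so that "á" indexes the same as "a"."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical suffix list, only good enough for the jump/jumper/jumping case.
SUFFIXES = ("ing", "er", "s")


def crude_stem(word: str) -> str:
    """Strip one known suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_term(word: str) -> str:
    """Normalise a word the way the todo item asks: fold accents, then stem."""
    return crude_stem(fold_accents(word.lower()))
```

With this, index_term gives the same key for "jump", "jumper" and "jumping", and fold_accents maps "á" to "a".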

Which makes me realize that Lucene is a *text search engine*.

We can fix the issues that stem from Lucene not being XML-aware, 
and help them with the testing, but I do not feel that it is an ideal 
situation. Does anyone know if Lucene is moving toward more XML awareness?

Should we look at eXist instead? I saw their demo [1] and it is very 
much what "semantic searching" is about, isn't it?

Cheers,
Cheche

[1] http://130.83.186.203/exist/simple/xquery.xsp

RE: about lucent and exist

Posted by Ramon Prades <rp...@porcelanosa.com>.

OK,

That seems to be a good reason to forget eXist.

Thanks

> -----Original Message-----
> From: Steven Noels [mailto:stevenn@outerthought.org] 
> Sent: Friday, 12 September 2003 8:55
> To: forrest-dev@xml.apache.org
> Subject: Re: about lucent and exist
> 
> 
> Ramon Prades wrote:
> 
> > I've been browsing Exist site. It seems that can be used instead of 
> > Lucene, but I'm not sure yet about the advantages. I'll have a closer 
> > look and we can discuss this issue again in a few days to decide what 
> > to use before doing any more work.
> 
> Keep in mind that eXist is LGPL-licensed, so the library jars cannot be 
> stored nor distributed through Apache CVS / websites, and there's a 
> coding policy not to make use of LGPL libs in ASF code.
> 
> Lucene might be the safest bet IMHO.
> 
> </Steven>
> -- 
> Steven Noels                            http://outerthought.org/
> Outerthought - Open Source Java & XML            An Orixo Member
> Read my weblog at            http://blogs.cocoondev.org/stevenn/
> stevenn at outerthought.org                stevenn at apache.org
> 
> 
>

Re: about lucent and exist

Posted by Steven Noels <st...@outerthought.org>.

Ramon Prades wrote:

> I've been browsing Exist site. It seems that can be used instead of Lucene,
> but I'm not sure yet about the  advantages. I'll have a closer look and we
> can discuss this issue again in a few days to decide what to use before
> doing any more work.

Keep in mind that eXist is LGPL-licensed, so the library jars cannot be 
stored nor distributed through Apache CVS / websites, and there's a 
coding policy not to make use of LGPL libs in ASF code.

Lucene might be the safest bet IMHO.

</Steven>
-- 
Steven Noels                            http://outerthought.org/
Outerthought - Open Source Java & XML            An Orixo Member
Read my weblog at            http://blogs.cocoondev.org/stevenn/
stevenn at outerthought.org                stevenn at apache.org

RE: about lucent and exist

Posted by Ramon Prades <rp...@porcelanosa.com>.

Hi

I've been browsing the eXist site. It seems that it can be used instead of
Lucene, but I'm not sure yet about the advantages. I'll have a closer look and
we can discuss this issue again in a few days to decide what to use before
doing any more work.

Regards.

Ramon

> -----Original Message-----
> From: Juan Jose Pablos [mailto:cheche@che-che.com] 
> Sent: Friday, 12 September 2003 2:36
> To: forrest-dev@xml.apache.org
> Subject: Re: about lucent and exist
> 
> 
> Ramon Prades wrote:
> > 
> >>Which make me realize that lucene is a *text search engine*.
> > 
> > 
> > That's the main advantage about lucene: it's language independent. In 
> > fact, Forrest isn't concerned at all about the input documents: you 
> > have to write an indexer for each format you want to use, i.e. if you 
> > want to search in Microsoft Word documents, you have to write a class 
> > to open and process them.
> > 
> 
> I am not worried about fixing just one issue. Being XML-aware means that 
> you can do a:
> 
> (after using forms to create this XPath query) 
> //faqs/part[@id='general']/faq/question[contains(.,'xsl')]
> 
> So you would search for "xsl" within a collection of FAQ XML documents 
> that have a faq part called 'general'.
> 
> I am not sure how difficult it is to get there with Lucene, but eXist 
> seems to have it already.
> 
> > You can do the same with Lucene, it's all down to the Indexer. In 
> > mine, I index forrest documents by mixing all the text. This is 
> > because I don't think queries like "p:lucene" (read: "search all docs 
> > with word "lucene" inside a "p" tag) are a good idea (specially for 
> > non-programmers).
> 
> I do not think that users should deal with that, for them that language 
> is hidden.
> 
> > 
> > Having said that, I think certain tags with a very strong meaning can 
> > be used. For example "authors" and "title" (both working in my code): 
> > this can be useful, specially if we have radio buttons for "search in 
> > authors only" and "search in title only".
> 
> Semantic searching (I thought about something similar before I knew 
> the name) is about using tags to limit the search and get 
> better results.
> 
> > 
> > I wanted to do all this a few weeks ago, but I've been awfully busy 
> > (who isn't?). I plan to start again in 2 or 3 weeks.
> > 
> 
> I will help you as I promised; I got that bug assigned to me. Using 
> Lucene within Forrest and having eXist support are compatible tasks; you 
> have the first one almost done. Spain..go..go..go!
> 
> Cheers,
> Cheche
> 
> 
> 
> 
> 
>

RE: Lucent and Xindice (Re: about lucent and exist)

Posted by Ramon Prades <rp...@porcelanosa.com>.

Hi Cheche

Reading Xindice docs I've seen it requires a server daemon to be started,
and I'm not sure if this is what we want.

At first sight, Xindice will bring a lot of power to Forrest. Apart from the
searching tool, Xindice can be used to alter existing docs. For example (I
don't want to start a discussion to see if this is a good idea or not - this
is just an example), we could have Forrest automatically add check-boxes
next to todo items so an administrator can mark them as completed and
Xindice could move the item from "todo" to "changes" (again, it's just an
example). Another example can be a "What's new" page generated with the help
of Xindice.

But on second thoughts maybe using a service that needs to be started and so
on will go against the simplicity of Forrest, demanding more configuration
and maybe more administration. 

I will keep thinking about these issues, but in my opinion whatever we do has
to be done with the ultimate goal of keeping Forrest as simple and easy to
use as it is now (and that includes static sites).

Any comments?

Ramon



> -----Original Message-----
> From: Juan Jose Pablos [mailto:cheche@che-che.com] 
> Sent: Wednesday, 17 September 2003 8:40
> To: forrest-dev@xml.apache.org
> Subject: Lucent and Xindice (Re: about lucent and exist)
> 
> 
> Ramon Prades wrote:
>  > Hi Juan Jose
>  >
>  > Do you think we should drop Lucene and use Xindice instead?
> 
> I think that we should not drop anything until we get a 
> replacement that 
> improves the actual situation. Lucene works and there is room 
> for Lucene 
> and xindice.
> 
> 
>  > - Populate the database using a crawler and cocoon's xml-views.
> 
> Doing this it will allow to populate your indices from varios 
> sources, 
> not only files. But this implementation is independent on 
> wherever you 
> use Xindice or Lucene.
> 
> 
>  > - Create a search page with a number of options as in "search in 
> content",
>  > "search in title" and so on.
> 
> I have been thinking a bit on this. Not about the search page itself, 
> but about the power of been able to search to any XML format 
> and get a 
> link to the HTML/PDF page makes a big step.
> 
> But on todays forrest's situation we only have a few xml schemas:
> 
> document
> howto
> faq
> changes/todo/contributors??
> book/site
> sdocbook/docbook
> 
> 
> Out of these schema I have not found many use case examples of search:
> 
> Document-v*
> -----------
> Search for an author/person
> Search for an acronym
> Search for a figure.
> Search for fixme notes.
> 
> Howto
> -----------
> Search for an author/person
> Search for an audience (novice... etc)
> 
> FAQ
> -----------
> Search for an author/person
> Search for a question.
> Search for an answer.
> 
> ...
> 
> 
> So The work actually neede to implement in our actual release 
> does not 
> requiere much.
> 
> What do you think?
> 
> Cheers,
> Cheche
> 
> 
> > 
> > This is what I think:
> > 
> > - Use Xindice.
> > - Populate the database using a crawler and cocoon's xml-views.
> > - Create a search page with a number of options as in "search in 
> > content", "search in title" and so on.
> > 
> > Regards.
> > 
> > Ramón
> > 
> > 
> >>-----Original Message-----
> >>From: Juan Jose Pablos [mailto:cheche@che-che.com]
> >>Sent: Saturday, 13 September 2003 17:56
> >>To: forrest-dev@xml.apache.org
> >>Subject: Re: about lucent and exist
> >>
> >>
> >>Stefano Mazzocchi wrote:
> >>
> >>>Lucene is based on algorithms that don't allow the above.
> >>>
> >>
> >>Thanks for backing this up. That was my initial feeling.
> >>
> >>
> >>>For that, you need what is called an "xml database", which
> >>
> >>could be,
> >>
> >>>in
> >>>the most simple case, a collection of files in a file
> >>
> >>system and a very
> >>
> >>>slow incremental collector that opens all files, scans them
> >>
> >>and collects
> >>
> >>>the matching elements and returns the results as a new
> >>
> >>document. In the
> >>
> >>>best case, it's a semi-structured database with multidimensional
> >>>indexing features (exist and xindice are much closer to that).
> >>>
> >>
> >>I am happy to look at xindice.
> >>
> >>
> >>>You are trying to create "virtual documents" out of
> >>
> >>XML-aware queries
> >>
> >>>over a repository of hierarchical content (not necessarely XML, but
> >>>XML-viewable).
> >>
> >>Are you saying that because we are making the request to 
> document-v12
> >>schema? I am not sure about this. I am not thinking about doing the 
> >>request to the document-v12 schema.
> >>
> >>In Forrest we are importing from another schema and on that
> >>process we 
> >>are losing information ( i.e. <author/> becames <p> ). So I 
> >>would like 
> >>to get a search on the source and get the results to where I can 
> >>retrieve that document.
> >>
> >>
> >>>Eh, if it was that easy. You are implying that:
> >>>
> >>> 1) a tag is used to indicate the semantics of the nodes contained 
> >>>therein. Although this is generally the case (and there is
> >>
> >>the ability
> >>
> >>>to have RDF/XML to performm this way) this is not generalizable.
> >>
> >>I would like to see an example on this.
> >>
> >>
> >>> 2) without namespaces, there is a tremendous semantic
> >>
> >>collision. With
> >>
> >>>namespaces, you are assuming that the namespace refers to
> >>
> >>the 'meaning'
> >>
> >>>of the tag, again not generalizable.
> >>>
> >>
> >>ok, I have not mention anything about namespaces, the request
> >>that put 
> >>as an example only deals with faq schema. I had not thought 
> >>about multi 
> >>  namespace documents or other type of XML input.
> >>
> >>
> >>>This said, I agree that having the ability to run XQuery
> >>
> >>queries over a
> >>
> >>>content repository that exposes XML views would be a
> >>
> >>tremendous help.
> >>
> >>>Just don't call it "semantic searching", because that's not
> >>
> >>even close
> >>
> >>>(but very few are able to explain the difference and the
> >>
> >>reason why we
> >>
> >>>need the entire RDF stack in the first place, so don't worry).
> >>>
> >>>--
> >>>Stefano.
> >>
> >>ok, I will not used that name, I will not worry either.
> >>
> >>Cheers,
> >>Cheche
> >>
> >>
> > 
> > 
> > 
> 
> 
> 
>

Lucent and Xindice (Re: about lucent and exist)

Posted by Juan Jose Pablos <ch...@che-che.com>.

Ramon Prades wrote:
 > Hi Juan Jose
 >
 > Do you think we should drop Lucene and use Xindice instead?

I think that we should not drop anything until we get a replacement that 
improves on the current situation. Lucene works, and there is room for 
both Lucene and Xindice.


 > - Populate the database using a crawler and cocoon's xml-views.

Doing this will allow you to populate your indices from various sources, 
not only files. But this implementation is independent of whether you 
use Xindice or Lucene.


 > - Create a search page with a number of options as in "search in content",
 > "search in title" and so on.

I have been thinking a bit about this. Not about the search page itself, 
but about how being able to search any XML format and get a link to the 
HTML/PDF page would be a big step.

But in Forrest's current situation we only have a few XML schemas:

document
howto
faq
changes/todo/contributors??
book/site
sdocbook/docbook


Out of these schemas I have not found many use cases for search:

Document-v*
-----------
Search for an author/person
Search for an acronym
Search for a figure.
Search for fixme notes.

Howto
-----------
Search for an author/person
Search for an audience (novice... etc)

FAQ
-----------
Search for an author/person
Search for a question.
Search for an answer.

...
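
As a rough illustration of how small that list is, the use cases above fit in one lookup table from schema to searchable field. This is only a sketch in Python using the standard ElementTree module, not Xindice or Lucene code; the element names (person, acronym, figure, fixme, audience, question, answer) are my assumption about the DTDs and should be checked against them, and search() is a made-up helper.

```python
import xml.etree.ElementTree as ET

# Assumed element names per schema; check them against the real DTDs.
SEARCH_FIELDS = {
    "document": {"author": ".//person", "acronym": ".//acronym",
                 "figure": ".//figure", "fixme": ".//fixme"},
    "howto": {"author": ".//person", "audience": ".//audience"},
    "faqs": {"author": ".//person", "question": ".//question",
             "answer": ".//answer"},
}


def search(source, field, word):
    """Return the elements of one field whose text or attributes mention word."""
    root = ET.fromstring(source)
    path = SEARCH_FIELDS[root.tag][field]
    hits = []
    for element in root.findall(path):
        haystack = " ".join(list(element.itertext()) + list(element.attrib.values()))
        if word.lower() in haystack.lower():
            hits.append(element)
    return hits


# Made-up sample document, only for trying the helper out.
FAQ = """<faqs title="Sample FAQ">
  <part id="general">
    <faq><question>Which xsl processor is used?</question>
         <answer><p>See the installation notes.</p></answer></faq>
    <faq><question>Where are the skins?</question>
         <answer><p>In the skins directory.</p></answer></faq>
  </part>
</faqs>"""

questions = search(FAQ, "question", "XSL")
```

Here questions holds only the first question element. Getting from a hit back to a link to the HTML/PDF page is the part this sketch leaves out.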


So the work needed to implement this in our current release is not 
much.

What do you think?

Cheers,
Cheche


> 
> This is what I think:
> 
> - Use Xindice.
> - Populate the database using a crawler and cocoon's xml-views.
> - Create a search page with a number of options as in "search in content",
> "search in title" and so on.
> 
> Regards.
> 
> Ramón
> 
> 
>>-----Original Message-----
>>From: Juan Jose Pablos [mailto:cheche@che-che.com] 
>>Sent: Saturday, 13 September 2003 17:56
>>To: forrest-dev@xml.apache.org
>>Subject: Re: about lucent and exist
>>
>>
>>Stefano Mazzocchi wrote:
>>
>>>Lucene is based on algorithms that don't allow the above.
>>>
>>
>>Thanks for backing this up. That was my initial feeling.
>>
>>
>>>For that, you need what is called an "xml database", which 
>>
>>could be, 
>>
>>>in
>>>the most simple case, a collection of files in a file 
>>
>>system and a very 
>>
>>>slow incremental collector that opens all files, scans them 
>>
>>and collects 
>>
>>>the matching elements and returns the results as a new 
>>
>>document. In the 
>>
>>>best case, it's a semi-structured database with multidimensional 
>>>indexing features (exist and xindice are much closer to that).
>>>
>>
>>I am happy to look at xindice.
>>
>>
>>>You are trying to create "virtual documents" out of 
>>
>>XML-aware queries
>>
>>>over a repository of hierarchical content (not necessarely XML, but 
>>>XML-viewable).
>>
>>Are you saying that because we are making the request to document-v12 
>>schema? I am not sure about this. I am not thinking about doing the 
>>request to the document-v12 schema.
>>
>>In Forrest we are importing from another schema and on that 
>>process we 
>>are losing information ( i.e. <author/> becames <p> ). So I 
>>would like 
>>to get a search on the source and get the results to where I can 
>>retrieve that document.
>>
>>
>>>Eh, if it was that easy. You are implying that:
>>>
>>> 1) a tag is used to indicate the semantics of the nodes contained
>>>therein. Although this is generally the case (and there is 
>>
>>the ability 
>>
>>>to have RDF/XML to performm this way) this is not generalizable.
>>
>>I would like to see an example on this.
>>
>>
>>> 2) without namespaces, there is a tremendous semantic 
>>
>>collision. With
>>
>>>namespaces, you are assuming that the namespace refers to 
>>
>>the 'meaning' 
>>
>>>of the tag, again not generalizable.
>>>
>>
>>ok, I have not mention anything about namespaces, the request 
>>that put 
>>as an example only deals with faq schema. I had not thought 
>>about multi 
>>  namespace documents or other type of XML input.
>>
>>
>>>This said, I agree that having the ability to run XQuery 
>>
>>queries over a 
>>
>>>content repository that exposes XML views would be a 
>>
>>tremendous help.
>>
>>>Just don't call it "semantic searching", because that's not 
>>
>>even close 
>>
>>>(but very few are able to explain the difference and the 
>>
>>reason why we 
>>
>>>need the entire RDF stack in the first place, so don't worry).
>>>
>>>-- 
>>>Stefano.
>>
>>ok, I will not used that name, I will not worry either.
>>
>>Cheers,
>>Cheche
>>
>>
> 
> 
>

Re: about lucent and exist

Posted by Juan Jose Pablos <ch...@che-che.com>.

I had not finished my last email. Please ignore it.


Juan Jose Pablos wrote:
> Ramon,
> 
>>
>> Do you think we should drop Lucene and use Xindice instead?
>>
> 
> I think that we should not drop anything until we get a replacement that 
> improves the actual situation. Lucene works and there is room for Lucene 
> and xindice.
> 
> 
>  > - Populate the database using a crawler and cocoon's xml-views.
> 
> On todays forrest situation we have this schemas:
> 
> document
> sdocbook/docbook
> howto
> faq
> changes/todo/contributors??
> book/site
> 
> 
> 
> 
>> This is what I think:
>>
>> - Use Xindice.
>> - Create a search page with a number of options as in "search in 
>> content",
>> "search in title" and so on.
>>
>> Regards.
>>
>> Ramón
>>
>>
>>> -----Original Message-----
>>> From: Juan Jose Pablos [mailto:cheche@che-che.com]
>>> Sent: Saturday, 13 September 2003 17:56
>>> To: forrest-dev@xml.apache.org
>>> Subject: Re: about lucent and exist
>>>
>>>
>>> Stefano Mazzocchi wrote:
>>>
>>>> Lucene is based on algorithms that don't allow the above.
>>>>
>>>
>>> Thanks for backing this up. That was my initial feeling.
>>>
>>>
>>>> For that, you need what is called an "xml database", which 
>>>
>>>
>>> could be,
>>>
>>>> in
>>>> the most simple case, a collection of files in a file 
>>>
>>>
>>> system and a very
>>>
>>>> slow incremental collector that opens all files, scans them 
>>>
>>>
>>> and collects
>>>
>>>> the matching elements and returns the results as a new 
>>>
>>>
>>> document. In the
>>>
>>>> best case, it's a semi-structured database with multidimensional 
>>>> indexing features (exist and xindice are much closer to that).
>>>>
>>>
>>> I am happy to look at xindice.
>>>
>>>
>>>> You are trying to create "virtual documents" out of 
>>>
>>>
>>> XML-aware queries
>>>
>>>> over a repository of hierarchical content (not necessarely XML, but 
>>>> XML-viewable).
>>>
>>>
>>> Are you saying that because we are making the request to document-v12 
>>> schema? I am not sure about this. I am not thinking about doing the 
>>> request to the document-v12 schema.
>>>
>>> In Forrest we are importing from another schema and on that process 
>>> we are losing information ( i.e. <author/> becames <p> ). So I would 
>>> like to get a search on the source and get the results to where I can 
>>> retrieve that document.
>>>
>>>
>>>> Eh, if it was that easy. You are implying that:
>>>>
>>>> 1) a tag is used to indicate the semantics of the nodes contained
>>>> therein. Although this is generally the case (and there is 
>>>
>>>
>>> the ability
>>>
>>>> to have RDF/XML to performm this way) this is not generalizable.
>>>
>>>
>>> I would like to see an example on this.
>>>
>>>
>>>> 2) without namespaces, there is a tremendous semantic 
>>>
>>>
>>> collision. With
>>>
>>>> namespaces, you are assuming that the namespace refers to 
>>>
>>>
>>> the 'meaning'
>>>
>>>> of the tag, again not generalizable.
>>>>
>>>
>>> ok, I have not mention anything about namespaces, the request that 
>>> put as an example only deals with faq schema. I had not thought about 
>>> multi  namespace documents or other type of XML input.
>>>
>>>
>>>> This said, I agree that having the ability to run XQuery 
>>>
>>>
>>> queries over a
>>>
>>>> content repository that exposes XML views would be a 
>>>
>>>
>>> tremendous help.
>>>
>>>> Just don't call it "semantic searching", because that's not 
>>>
>>>
>>> even close
>>>
>>>> (but very few are able to explain the difference and the 
>>>
>>>
>>> reason why we
>>>
>>>> need the entire RDF stack in the first place, so don't worry).
>>>>
>>>> -- 
>>>> Stefano.
>>>
>>>
>>> ok, I will not used that name, I will not worry either.
>>>
>>> Cheers,
>>> Cheche
>>>
>>>
>>
>>
>>
>

Re: about lucent and exist

Posted by Juan Jose Pablos <ch...@che-che.com>.

Ramon,

> 
> Do you think we should drop Lucene and use Xindice instead?
> 

I think that we should not drop anything until we get a replacement that 
improves on the current situation. Lucene works, and there is room for 
both Lucene and Xindice.


 > - Populate the database using a crawler and cocoon's xml-views.

In Forrest's current situation we have these schemas:

document
sdocbook/docbook
howto
faq
changes/todo/contributors??
book/site




> This is what I think:
> 
> - Use Xindice.
> - Create a search page with a number of options as in "search in content",
> "search in title" and so on.
> 
> Regards.
> 
> Ramón
> 
> 
>>-----Original Message-----
>>From: Juan Jose Pablos [mailto:cheche@che-che.com] 
>>Sent: Saturday, 13 September 2003 17:56
>>To: forrest-dev@xml.apache.org
>>Subject: Re: about lucent and exist
>>
>>
>>Stefano Mazzocchi wrote:
>>
>>>Lucene is based on algorithms that don't allow the above.
>>>
>>
>>Thanks for backing this up. That was my initial feeling.
>>
>>
>>>For that, you need what is called an "xml database", which 
>>
>>could be, 
>>
>>>in
>>>the most simple case, a collection of files in a file 
>>
>>system and a very 
>>
>>>slow incremental collector that opens all files, scans them 
>>
>>and collects 
>>
>>>the matching elements and returns the results as a new 
>>
>>document. In the 
>>
>>>best case, it's a semi-structured database with multidimensional 
>>>indexing features (exist and xindice are much closer to that).
>>>
>>
>>I am happy to look at xindice.
>>
>>
>>>You are trying to create "virtual documents" out of 
>>
>>XML-aware queries
>>
>>>over a repository of hierarchical content (not necessarely XML, but 
>>>XML-viewable).
>>
>>Are you saying that because we are making the request to document-v12 
>>schema? I am not sure about this. I am not thinking about doing the 
>>request to the document-v12 schema.
>>
>>In Forrest we are importing from another schema and on that 
>>process we 
>>are losing information ( i.e. <author/> becames <p> ). So I 
>>would like 
>>to get a search on the source and get the results to where I can 
>>retrieve that document.
>>
>>
>>>Eh, if it was that easy. You are implying that:
>>>
>>> 1) a tag is used to indicate the semantics of the nodes contained
>>>therein. Although this is generally the case (and there is 
>>
>>the ability 
>>
>>>to have RDF/XML to performm this way) this is not generalizable.
>>
>>I would like to see an example on this.
>>
>>
>>> 2) without namespaces, there is a tremendous semantic 
>>
>>collision. With
>>
>>>namespaces, you are assuming that the namespace refers to 
>>
>>the 'meaning' 
>>
>>>of the tag, again not generalizable.
>>>
>>
>>ok, I have not mention anything about namespaces, the request 
>>that put 
>>as an example only deals with faq schema. I had not thought 
>>about multi 
>>  namespace documents or other type of XML input.
>>
>>
>>>This said, I agree that having the ability to run XQuery 
>>
>>queries over a 
>>
>>>content repository that exposes XML views would be a 
>>
>>tremendous help.
>>
>>>Just don't call it "semantic searching", because that's not 
>>
>>even close 
>>
>>>(but very few are able to explain the difference and the 
>>
>>reason why we 
>>
>>>need the entire RDF stack in the first place, so don't worry).
>>>
>>>-- 
>>>Stefano.
>>
>>ok, I will not used that name, I will not worry either.
>>
>>Cheers,
>>Cheche
>>
>>
> 
> 
>

RE: about lucent and exist

Posted by Ramon Prades <rp...@porcelanosa.com>.

Hi Juan Jose

Do you think we should drop Lucene and use Xindice instead?

This is what I think:

- Use Xindice.
- Populate the database using a crawler and cocoon's xml-views.
- Create a search page with a number of options as in "search in content",
"search in title" and so on.

Regards.

Ramón

> -----Original Message-----
> From: Juan Jose Pablos [mailto:cheche@che-che.com] 
> Sent: Saturday, 13 September 2003 17:56
> To: forrest-dev@xml.apache.org
> Subject: Re: about lucent and exist
> 
> 
> Stefano Mazzocchi wrote:
> > 
> > Lucene is based on algorithms that don't allow the above.
> > 
> 
> Thanks for backing this up. That was my initial feeling.
> 
> > For that, you need what is called an "xml database", which 
> could be, 
> > in
> > the most simple case, a collection of files in a file 
> system and a very 
> > slow incremental collector that opens all files, scans them 
> and collects 
> > the matching elements and returns the results as a new 
> document. In the 
> > best case, it's a semi-structured database with multidimensional 
> > indexing features (exist and xindice are much closer to that).
> > 
> 
> I am happy to look at xindice.
> 
> > 
> > You are trying to create "virtual documents" out of 
> XML-aware queries
> > over a repository of hierarchical content (not necessarely XML, but 
> > XML-viewable).
> 
> Are you saying that because we are making the request to document-v12 
> schema? I am not sure about this. I am not thinking about doing the 
> request to the document-v12 schema.
> 
> In Forrest we are importing from another schema and on that 
> process we 
> are losing information ( i.e. <author/> becames <p> ). So I 
> would like 
> to get a search on the source and get the results to where I can 
> retrieve that document.
> 
> > Eh, if it was that easy. You are implying that:
> > 
> >  1) a tag is used to indicate the semantics of the nodes contained
> > therein. Although this is generally the case (and there is 
> the ability 
> > to have RDF/XML to performm this way) this is not generalizable.
> 
> I would like to see an example on this.
> 
> > 
> >  2) without namespaces, there is a tremendous semantic 
> collision. With
> > namespaces, you are assuming that the namespace refers to 
> the 'meaning' 
> > of the tag, again not generalizable.
> > 
> 
> ok, I have not mention anything about namespaces, the request 
> that put 
> as an example only deals with faq schema. I had not thought 
> about multi 
>   namespace documents or other type of XML input.
> 
> > This said, I agree that having the ability to run XQuery 
> queries over a 
> > content repository that exposes XML views would be a 
> tremendous help.
> > Just don't call it "semantic searching", because that's not 
> even close 
> > (but very few are able to explain the difference and the 
> reason why we 
> > need the entire RDF stack in the first place, so don't worry).
> > 
> > -- 
> > Stefano.
> 
> ok, I will not used that name, I will not worry either.
> 
> Cheers,
> Cheche
> 
> 
>

Re: about lucent and exist

Posted by Juan Jose Pablos <ch...@che-che.com>.

Stefano Mazzocchi wrote:
> 
> Lucene is based on algorithms that don't allow the above.
> 

Thanks for backing this up. That was my initial feeling.

> For that, you need what is called an "xml database", which could be, in 
> the most simple case, a collection of files in a file system and a very 
> slow incremental collector that opens all files, scans them and collects 
> the matching elements and returns the results as a new document. In the 
> best case, it's a semi-structured database with multidimensional 
> indexing features (exist and xindice are much closer to that).
> 

I am happy to look at xindice.

> 
> You are trying to create "virtual documents" out of XML-aware queries 
> over a repository of hierarchical content (not necessarely XML, but 
> XML-viewable).

Are you saying that because we are making the request to the document-v12 
schema? I am not sure about this. I am not thinking about doing the 
request to the document-v12 schema.

In Forrest we are importing from another schema, and in that process we 
are losing information (e.g. <author/> becomes <p>). So I would like 
to search on the source and have the results point to where I can 
retrieve that document.

> Eh, if it was that easy. You are implying that:
> 
>  1) a tag is used to indicate the semantics of the nodes contained 
> therein. Although this is generally the case (and there is the ability 
> to have RDF/XML to performm this way) this is not generalizable.

I would like to see an example on this.

> 
>  2) without namespaces, there is a tremendous semantic collision. With 
> namespaces, you are assuming that the namespace refers to the 'meaning' 
> of the tag, again not generalizable.
> 

OK, I have not mentioned anything about namespaces; the request that I put 
as an example only deals with the faq schema. I had not thought about 
multi-namespace documents or other types of XML input.

> This said, I agree that having the ability to run XQuery queries over a 
> content repository that exposes XML views would be a tremendous help.
> Just don't call it "semantic searching", because that's not even close 
> (but very few are able to explain the difference and the reason why we 
> need the entire RDF stack in the first place, so don't worry).
> 
> -- 
> Stefano.

OK, I will not use that name, and I will not worry either.

Cheers,
Cheche

Re: about lucent and exist

Posted by Stefano Mazzocchi <st...@apache.org>.

On Friday, Sep 12, 2003, at 02:36 Europe/Rome, Juan Jose Pablos wrote:

> Ramon Prades wrote:
>>> Which make me realize that lucene is a *text search engine*.
>> That's the main advantage about lucene: it's language independent. In 
>> fact, Forrest isn't concerned at all about the input documents: you have to 
>> write an indexer for each format you want to use, i.e. if you want to 
>> search in Microsoft Word documents, you have to write a class to open and 
>> process them.
>
> I am not worried about fixing just one issue. Being XML-aware means that 
> you can do a:
>
> (after using forms to create this XPath query)
> //faqs/part[@id='general']/faq/question[contains(.,'xsl')]
>
> So you would search for "xsl" within a collection of FAQ XML documents 
> that have a faq part called 'general'.
>
> I am not sure how difficult it is to get there with Lucene, but eXist 
> seems to have it already.

Lucene is based on algorithms that don't allow the above.

For that, you need what is called an "xml database", which could be, in 
the most simple case, a collection of files in a file system and a very 
slow incremental collector that opens all files, scans them and 
collects the matching elements and returns the results as a new 
document. In the best case, it's a semi-structured database with 
multidimensional indexing features (exist and xindice are much closer 
to that).
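
To make the "most simple case" concrete, here is a minimal sketch in Python of such a collector. It is only an illustration under assumptions: the two FAQ documents are made up and held in a dict instead of on the file system, collect() is a hypothetical helper, and the element names follow the query quoted above. ElementTree understands only a subset of XPath, so the text match is done in plain Python.

```python
import xml.etree.ElementTree as ET

# Two made-up FAQ documents standing in for files on disk.
DOCS = {
    "faq-general.xml": """<faqs><part id="general">
        <faq><question>How do I change the xsl skin?</question></faq>
        <faq><question>Where is the build file?</question></faq>
    </part></faqs>""",
    "faq-install.xml": """<faqs><part id="install">
        <faq><question>Which xsl processor is needed?</question></faq>
    </part></faqs>""",
}


def collect(docs, part_id, word):
    """Scan every document and gather matching questions into a new document."""
    results = ET.Element("results")
    for name, source in docs.items():
        root = ET.fromstring(source)  # opens and parses each document in turn
        for question in root.findall(f"./part[@id='{part_id}']/faq/question"):
            if word in "".join(question.itertext()):
                hit = ET.SubElement(results, "hit", source=name)
                hit.text = question.text
    return results


results = collect(DOCS, "general", "xsl")
```

Every query re-parses every document, which is exactly why this is the slow end of the spectrum; the databases mentioned above keep indexes so they do not have to.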

Take a look at JSR 170 for another possibility (it includes an SQL-like 
query language for hierarchies of nodes).

>> You can do the same with Lucene, it's all down to the Indexer. In 
>> mine, I index forrest documents by mixing all the text. This is because I 
>> don't think queries like "p:lucene" (read: "search all docs with word 
>> "lucene" inside a "p" tag) are a good idea (specially for non-programmers).
>
> I do not think that users should deal with that, for them that 
> language is hidden.

You are trying to create "virtual documents" out of XML-aware queries 
over a repository of hierarchical content (not necessarily XML, but 
XML-viewable).

Forget Lucene, it's not the right tool and not the right direction.

>> Having said that, I think certain tags with a very strong meaning can 
>> be used. For example "authors" and "title" (both working in my code): 
>> this can be useful, specially if we have radio buttons for "search in 
>> authors only" and "search in title only".
>
> Semantic searching (I thought about something similar before I knew 
> the name) is about using tags to limit the search and get better 
> results.

Eh, if it was that easy. You are implying that:

  1) a tag is used to indicate the semantics of the nodes contained 
therein. Although this is generally the case (and RDF/XML can be made 
to work this way), it is not generalizable.

  2) without namespaces, there is a tremendous semantic collision. With 
namespaces, you are assuming that the namespace refers to the 'meaning' 
of the tag, again not generalizable.
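
To make point 2 concrete, here is a small Python sketch of the collision. The two namespace URIs and the sample record are invented for illustration only. Searching by bare tag name mixes a document title with a person's honorific, while searching by qualified name keeps them apart. Even then, the namespace only identifies the vocabulary; it does not say what the tag means.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespaces, chosen only for this illustration.
DOC_NS = "http://example.org/ns/document"
PERSON_NS = "http://example.org/ns/person"

SAMPLE = f"""<record xmlns:doc="{DOC_NS}" xmlns:person="{PERSON_NS}">
  <doc:title>Indexing with Lucene</doc:title>
  <person:title>Dr.</person:title>
</record>"""


def titles_by_local_name(root):
    """Ignore namespaces: every element whose local name is 'title' matches."""
    return [e.text for e in root.iter() if e.tag.rsplit("}", 1)[-1] == "title"]


def titles_in_namespace(root, namespace):
    """Respect namespaces: only 'title' elements of one vocabulary match."""
    return [e.text for e in root.iter(f"{{{namespace}}}title")]


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print(titles_by_local_name(root))         # both kinds of "title"
    print(titles_in_namespace(root, DOC_NS))  # document titles only
```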

This said, I agree that having the ability to run XQuery queries over a 
content repository that exposes XML views would be a tremendous help. 
Just don't call it "semantic searching", because that's not even close 
(but very few are able to explain the difference and the reason why we 
need the entire RDF stack in the first place, so don't worry).

--
Stefano.

Re: LIST REMOVAL

Posted by David Crossley <cr...@indexgeo.com.au>.

Timothy Fisher trfishermi<AT>yahoo.com wrote:
>         The list removal email address is not working.
>         
>         PLEASE remove me from this mailing list

Are you using the same email address that you used when you
subscribed? When you say "not working" what do you mean?
As a moderator, i just tried to un-subscribe you, so you
should be receiving an automatic confirmation request.

--David

LIST REMOVAL

Posted by Timothy Fisher <tr...@yahoo.com>.

The list removal email address is not working.

PLEASE remove me from this mailing list




Re: about lucent and exist

Posted by Juan Jose Pablos <ch...@che-che.com>.

Ramon Prades wrote:
> 
>>Which makes me realize that Lucene is a *text search engine*.
> 
> 
> That's the main advantage of Lucene: it's language independent. In fact,
> Forrest isn't concerned at all about the input documents: you have to write
> an indexer for each format you want to use, e.g. if you want to search in
> Microsoft Word documents, you have to write a class to open and process
> them.
> 

I am not worried about fixing just one issue. Being XML aware means that 
you can run a query like this:

(after using forms to create this XPath query)
//faqs/part[@id='general']/faq/question[contains(.,'xsl')]

So you would search for "xsl" within a collection of FAQ XML documents 
that have a faq part called 'general'

I am not sure how difficult it is to get there with Lucene, but eXist 
seems to do it already.
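
For illustration, the query above can be approximated with Python's standard library. This is only a sketch: it assumes that `id` is an attribute of `part`, the sample FAQ document is invented, and `xml.etree.ElementTree` has no `contains()` function, so the text test is done in plain Python after the path match.

```python
import xml.etree.ElementTree as ET

# Invented sample data, shaped like a FAQ document.
FAQ = """<faqs>
  <part id="general">
    <faq><question>How do I override an xsl stylesheet?</question></faq>
    <faq><question>Where is the mailing list?</question></faq>
  </part>
  <part id="build">
    <faq><question>Why does the xsl step fail?</question></faq>
  </part>
</faqs>"""


def questions_containing(xml_text, part_id, needle):
    """Return the text of each question in the given part that contains
    `needle` (case-sensitive, like XPath's contains())."""
    root = ET.fromstring(xml_text)  # root is the <faqs> element
    path = f"./part[@id='{part_id}']/faq/question"
    matches = []
    for question in root.findall(path):
        text = "".join(question.itertext())
        if needle in text:
            matches.append(text)
    return matches


if __name__ == "__main__":
    print(questions_containing(FAQ, "general", "xsl"))
```

The question in the "build" part also mentions xsl but is left out, which is the kind of structural restriction a plain full-text index does not give you by default.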

> You can do the same with Lucene, it's all down to the Indexer. In mine, I
> index forrest documents by mixing all the text. This is because I don't
> think queries like "p:lucene" (read: search all docs with the word "lucene"
> inside a "p" tag) are a good idea (especially for non-programmers).

I do not think that users should deal with that, for them that language 
is hidden.

> 
> Having said that, I think certain tags with a very strong meaning can be
> used. For example "authors" and "title" (both working in my code): this can
> be useful, especially if we have radio buttons for "search in authors only"
> and "search in title only".

Semantic searching (I thought about something similar before I knew 
the name) is about using tags to limit the search and get better results.

> 
> I wanted to do all this a few weeks ago, but I've been awfully busy (who
> isn't?). I plan to start again in 2 or 3 weeks.
> 

I will help you as I promised; I got that bug assigned to me. Using 
Lucene within Forrest and having eXist support are compatible tasks, and 
you have the first one almost done. Spain..go..go..go!

Cheers,
Cheche

RE: about lucent and exist

Posted by Ramon Prades <rp...@porcelanosa.com>.

Hi Cheche

Please look at my comments below.

Regards.

Ramon

> -----Mensaje original-----
> De: Juan Jose Pablos [mailto:cheche@che-che.com] 
> Enviado el: jueves, 11 de septiembre de 2003 16:50
> Para: forrest-dev@xml.apache.org
> Asunto: about lucent and exist
> 
> 
> Hi,
> 
> I started looking at Ramon Prades' bug. On the todo list I can see:
> 
>      - Improve ForrestIndexer: It should work with accented characters
>       ("a" and "á" should be the same) and should reduce indexes to their
>       roots (i.e. jump, jumper, jumping should all be the same index).

This is just a matter of improving the existing indexing algorithm. It should be very easy.
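
As a rough idea of what that improvement involves, here is a Python sketch of the two normalization steps from the todo item. It is not the ForrestIndexer code: the function names and the suffix list are made up, and the stemmer is a deliberately naive suffix stripper (a real indexer would more likely use an established algorithm such as Porter's).

```python
import unicodedata

# Toy suffix list for illustration only; longer suffixes are tried first.
SUFFIXES = ("ing", "ers", "er", "ed", "s")


def fold_accents(text):
    """Drop combining marks so that 'á' and 'a' become the same character."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def naive_stem(word):
    """Strip one known suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_term(token):
    """Normalize a token the same way at indexing time and at query time."""
    return naive_stem(fold_accents(token).lower())


if __name__ == "__main__":
    for token in ("jump", "jumper", "jumping", "Búsqueda"):
        print(token, "->", index_term(token))
```

The important design point is that the query terms go through the same `index_term` step as the indexed text; otherwise a search for "jumping" would not find a document indexed under "jump".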

> 
> Which makes me realize that Lucene is a *text search engine*.

That's the main advantage of Lucene: it's language independent. In fact,
Forrest isn't concerned at all about the input documents: you have to write
an indexer for each format you want to use, e.g. if you want to search in
Microsoft Word documents, you have to write a class to open and process
them.

> 
> We can fix issues related to the fact that Lucene is not XML aware,
> and help them with the testing, but I do not feel that it is an ideal
> situation. Does anyone know if Lucene is moving toward more XML
> awareness?

No, Lucene is about searching all sorts of files (even binaries, if you have
an indexer for them).

> 
> Should we look at eXist instead? I saw their demo[1] and it is very
> much what "semantic searching" is about, isn't it?

You can do the same with Lucene, it's all down to the Indexer. In mine, I
index forrest documents by mixing all the text. This is because I don't
think queries like "p:lucene" (read: search all docs with the word "lucene"
inside a "p" tag) are a good idea (especially for non-programmers).

Having said that, I think certain tags with a very strong meaning can be
used. For example "authors" and "title" (both working in my code): this can
be useful, especially if we have radio buttons for "search in authors only"
and "search in title only".
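
To show the idea without depending on Lucene itself, here is a toy fielded index in Python. It is not my indexer code: the class name `FieldedIndex`, the catch-all field name "all" and the sample documents are assumptions made for this sketch. Every term is indexed once under its own field and once under "all", so the default search mixes all the text while the radio buttons only have to pass a field name.

```python
import re
from collections import defaultdict


class FieldedIndex:
    """Toy inverted index mapping (field, term) to a set of document ids."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, doc_id, fields):
        """Index a document given as a {field name: text} mapping."""
        for field, text in fields.items():
            for term in re.findall(r"\w+", text.lower()):
                self._postings[(field, term)].add(doc_id)
                # The catch-all field backs the default "search everything".
                self._postings[("all", term)].add(doc_id)

    def search(self, term, field="all"):
        """Return the sorted ids of documents containing `term` in `field`."""
        return sorted(self._postings.get((field, term.lower()), set()))


if __name__ == "__main__":
    index = FieldedIndex()
    index.add("faq.html", {"title": "Frequently Asked Questions",
                           "authors": "A. Writer",
                           "body": "How to configure Lucene"})
    index.add("todo.html", {"title": "Todo List",
                            "authors": "B. Editor",
                            "body": "Improve the indexer, says A. Writer"})
    print(index.search("writer"))                   # search everything
    print(index.search("writer", field="authors"))  # "search in authors only"
```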

To finish my first version and have Lucene up and running in Forrest I
suggest doing the following:

- Index documents by asking Cocoon for the xml views. This will include
files like "todo" or "changes" in the searching scope.

- Improve the indexer to store a "normalized" version of the content
(replacing accented characters).

- Improve the search page by including radio buttons (to search in authors
and title).

- Add searching to static sites.

With all this, Forrest will have a very good search engine: it's fast
and it's simple (and it's Apache).

I wanted to do all this a few weeks ago, but I've been awfully busy (who
isn't?). I plan to start again in 2 or 3 weeks.


> 
> Cheers,
> Cheche
> 
> [1] http://130.83.186.203/exist/simple/xquery.xsp
> 
> 
>