Posted to general@lucene.apache.org by "Walid \"jo\" Gedeon" <wg...@gmail.com> on 2008/08/02 13:37:09 UTC

RFO- Indexing 'meaningful' xml

Hello!

This is a Request for Opinion aimed at the Lucene experts out there :-)

I'm trying to get to know Lucene a bit better: after playing with the
'getting started' material, I moved on to indexing XML files.

The simple (?) project would be to index chat sessions, each session stored
in a file and containing many entries of the form:

<message type="incoming_privateMessage" timestamp="200808021312"
to="someone%40domain1%2Ecom"
from="someoneelse%40domain2%2Ecom"><body>Hello</body></message>

(it's jabber-client protocol with timestamp)
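
As a rough illustration of how one such entry could be flattened into named
fields before indexing, here is a minimal Python sketch using only the
standard library. The function name message_to_fields and the file name
session-001.xml are made up for the example, and the field names simply
mirror the XML attributes; none of this is Lucene API.

```python
import xml.etree.ElementTree as ET
from urllib.parse import unquote

RAW = ('<message type="incoming_privateMessage" timestamp="200808021312" '
       'to="someone%40domain1%2Ecom" '
       'from="someoneelse%40domain2%2Ecom"><body>Hello</body></message>')


def message_to_fields(xml_text, filename):
    """Flatten one <message> element into a dict of field name -> value."""
    elem = ET.fromstring(xml_text)
    return {
        "filename": filename,               # hypothetical session identifier
        "type": elem.get("type"),
        "timestamp": elem.get("timestamp"),
        "to": unquote(elem.get("to")),      # %40 -> @, %2E -> .
        "from": unquote(elem.get("from")),
        "body": elem.findtext("body", default=""),
    }


fields = message_to_fields(RAW, "session-001.xml")
print(fields["from"], fields["timestamp"], fields["body"])
```

Decoding the percent-encoded addresses up front means a query such as
from:someoneelse@domain2.com can match the value as a user would type it.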

In addition to the full text search, I'd like to be able to perform searches
such as:
 - list sessions from:xxx timestamp:200808*
 - list sessions (from:xxx OR from:yyy)
 - etc

Would it be better to store each message as a separate document with its
fields, adding the 'filename' (session identifier) as an extra field? Or is
there a better way of doing it, such as making each session file a document?

All comments appreciated, thanks! :-)

PS: Of course, the actual goal isn't to index chat history (there are many
chat searches available) but to use this to learn the API ;-)

Re: RFO- Indexing 'meaningful' xml

Posted by "Walid \"jo\" Gedeon" <wg...@gmail.com>.
Hello Hoss,
  Thanks for your reply :-)
I believe I'm in the first case: "to be able to search for 'foo' and get
back a list of all sessions where the word 'foo' was used". However, I want
to be able to separate free text search from field-based search.

I have indexed both the sessions and the messages as documents: the session
documents for free-text search and the message documents for field-based
search. The algorithm that I've ended up using since I posted the initial
message is:
  o execute the search on both the message and the session documents, then,
over all hits,
  o construct a list of the 'filename's that match and show the 10 newest
results.
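
The merge step above can be sketched roughly as follows. This is a plain
Python illustration, not Lucene code: the hit dictionaries, the function
name newest_sessions, and the assumption that session hits also carry a
'timestamp' are all invented for the example. Timestamps in the fixed-width
yyyymmddhhmm form sort chronologically as plain strings.

```python
def newest_sessions(message_hits, session_hits, limit=10):
    """Merge hits from both searches into session filenames, newest first."""
    newest = {}  # filename -> latest timestamp seen for that session
    for hit in list(message_hits) + list(session_hits):
        name, stamp = hit["filename"], hit["timestamp"]
        if name not in newest or stamp > newest[name]:
            newest[name] = stamp
    # Sort the distinct filenames by their latest timestamp, descending.
    ranked = sorted(newest, key=newest.get, reverse=True)
    return ranked[:limit]


# Hypothetical hits, as if returned by the two searches.
message_hits = [
    {"filename": "a.xml", "timestamp": "200808021312"},
    {"filename": "b.xml", "timestamp": "200807150900"},
    {"filename": "a.xml", "timestamp": "200808021320"},
]
session_hits = [{"filename": "c.xml", "timestamp": "200808030800"}]

print(newest_sessions(message_hits, session_hits))
```

Note that this walks every hit before trimming to 10, which is exactly the
cost that grows with the size of the index.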

This works, but I'm afraid it won't perform well once I end up indexing all
sessions. There must be a way to get the right hit set directly from a
search.

In any case, I'm looking at Solr for potential answers, thanks for
mentioning it :-)

Ta.
Jo

On Thu, Aug 7, 2008 at 12:59 AM, Chris Hostetter
<ho...@fucit.org> wrote:

>
> : In addition to the full text search, I'd like to be able to perform
> : searches such as:
> :  - list sessions from:xxx timestamp:200808*
> :  - list sessions (from:xxx OR from:yyy)
> :  - etc
> :
> : Would it be better to store each message as a separate document with its
> : fields, adding the 'filename' (session identifier) as an extra field? or
> : maybe is there a better way of doing it making the session file a
> : document?
>
> As a general rule of thumb, you make 1 document for each result you want
> to get back when you execute a search ... if you want to be able to search
> for "foo" and get back a list of all sessions where the word "foo" was
> used, then each session should be a document.  If you also want to be able
> to search for "foo" and get back a list of each message that contained the
> word "foo", then each message can also be a document -- either in another
> index, or even in the same index (there's no rule that says all documents
> must have the same fields)
>
> BTW: If you are planning on experimenting with the Java API, I would
> suggest sending any specific followup questions to the java-user@lucene
> list.  But you may also want to consider checking out Solr, and the
> solr-user list.  Depends on what level of abstraction you want to deal
> with (Solr provides a config based web service type front end for dealing
> with Lucene indexes, but also has a Java API both for indexing and for
> hooking in custom functionality when executing searches)
>
>
> -Hoss
>
>

Re: RFO- Indexing 'meaningful' xml

Posted by Chris Hostetter <ho...@fucit.org>.
: In addition to the full text search, I'd like to be able to perform searches
: such as:
:  - list sessions from:xxx timestamp:200808*
:  - list sessions (from:xxx OR from:yyy)
:  - etc
: 
: Would it be better to store each message as a separate document with its
: fields, adding the 'filename' (session identifier) as an extra field? or
: maybe is there a better way of doing it making the session file a document?

As a general rule of thumb, you make 1 document for each result you want 
to get back when you execute a search ... if you want to be able to search 
for "foo" and get back a list of all sessions where the word "foo" was 
used, then each session should be a document.  If you also want to be able 
to search for "foo" and get back a list of each message that contained the
word "foo", then each message can also be a document -- either in another
index, or even in the same index (there's no rule that says all documents
must have the same fields)
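
To make the "one document per result, mixed fields in one index" idea
concrete, here is a toy in-memory sketch in Python. It is only an
illustration of the data model, not how Lucene works internally and not its
API; the ToyIndex class and the 'kind' field used to tell session documents
from message documents are assumptions made up for the example.

```python
from collections import defaultdict


class ToyIndex:
    """A tiny in-memory inverted index; documents may have any set of fields."""

    def __init__(self):
        self.docs = []
        self.postings = defaultdict(set)  # (field, term) -> set of doc ids

    def add(self, doc):
        doc_id = len(self.docs)
        self.docs.append(doc)
        for field, value in doc.items():
            # Naive analysis: lowercase and split on whitespace.
            for term in str(value).lower().split():
                self.postings[(field, term)].add(doc_id)
        return doc_id

    def search(self, field, term, kind):
        """Return documents of the given kind whose field contains term."""
        ids = self.postings.get((field, term.lower()), set())
        return [self.docs[i] for i in sorted(ids) if self.docs[i]["kind"] == kind]


index = ToyIndex()
# One document per session: what a "list sessions" search should return.
index.add({"kind": "session", "filename": "a.xml", "text": "hello foo bar"})
# One document per message: what a "list messages" search should return.
index.add({"kind": "message", "filename": "a.xml",
           "from": "x@domain1.com", "body": "hello"})
index.add({"kind": "message", "filename": "a.xml",
           "from": "y@domain2.com", "body": "foo bar"})

print([d["filename"] for d in index.search("text", "foo", kind="session")])
print([d["from"] for d in index.search("body", "foo", kind="message")])
```

The session and message documents share the 'filename' field but otherwise
carry different fields, and each query names the kind of result it wants
back.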

BTW: If you are planning on experimenting with the Java API, I would
suggest sending any specific followup questions to the java-user@lucene
list.  But you may also want to consider checking out Solr, and the
solr-user list.  Depends on what level of abstraction you want to deal
with (Solr provides a config based web service type front end for dealing
with Lucene indexes, but also has a Java API both for indexing and for
hooking in custom functionality when executing searches)


-Hoss