Posted to general@lucene.apache.org by "Walid \"jo\" Gedeon" <wg...@gmail.com> on 2008/08/02 13:37:09 UTC
RFO- Indexing 'meaningfull' xml
Hello!
This is a Request for Opinion targeted for the Lucene experts out there :-)
I'm trying to get to know Lucene a bit better: after having played with the
'getting started', I moved on to trying to index XML files.
The simple (?) project would be to index chat sessions, each session stored
in a file and containing many entries of the form:
<message type="incoming_privateMessage" timestamp="200808021312"
to="someone%40domain1%2Ecom"
from="someoneelse%40domain2%2Ecom"><body>Hello</body></message>
(it's jabber-client protocol with timestamp)
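For illustration, one way such an entry could be turned into named fields before indexing is sketched below. This is plain Java using only the standard library, not Lucene code; the class name MessageFields and the choice of field names are assumptions made for the example. It also assumes the addresses are percent-encoded exactly as in the sample (%40 for '@', %2E for '.') and that every message has a single <body> child.

```java
import java.io.ByteArrayInputStream;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;

// Hypothetical helper (not part of Lucene): flattens one <message> element
// into a field map that could later be copied into an index document.
class MessageFields {
    static Map<String, String> parse(String xml) {
        try {
            Element msg = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();

            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("type", msg.getAttribute("type"));
            fields.put("timestamp", msg.getAttribute("timestamp"));
            // Addresses are percent-encoded in the log, so decode them first.
            // Note: URLDecoder also turns '+' into a space.
            fields.put("to", URLDecoder.decode(msg.getAttribute("to"), "UTF-8"));
            fields.put("from", URLDecoder.decode(msg.getAttribute("from"), "UTF-8"));
            // Assumes exactly one <body> child, as in the sample entry.
            fields.put("body", msg.getElementsByTagName("body").item(0).getTextContent());
            return fields;
        } catch (Exception e) {
            throw new IllegalArgumentException("not a parsable <message> element", e);
        }
    }
}
```

Decoding the addresses up front means a later search on the from or to field can use the readable form of the address rather than the encoded one.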
In addition to the full text search, I'd like to be able to perform searches
such as:
- list sessions from:xxx timestamp:200808*
- list sessions (from:xxx OR from:yyy)
- etc
Would it be better to store each message as a separate document with its
fields, adding the 'filename' (session identifier) as an extra field? Or
is there a better way of doing it, such as making each session file a document?
All comments appreciated, thanks! :-)
PS: Of course, the actual goal isn't to index chat history (there are many
chat searches available) but use this to learn the API ;-)
Re: RFO- Indexing 'meaningfull' xml
Posted by "Walid \"jo\" Gedeon" <wg...@gmail.com>.
Hello Hoss,
Thanks for your reply :-)
I believe I'm in the first case: "to be able to search for 'foo' and get
back a list of all sessions where the word 'foo' was used". However, I want
to be able to separate free text search from field-based search.
I have indexed both the sessions and the messages as documents, the session
documents for free text search and the message documents for field-based search.
The algorithm that I've ended up using since I posted the initial message
is:
o execute the search on both the message and the session documents, then,
over all hits,
o construct a list of the matching 'filename's and show the 10 newest
results.
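The second step above can be sketched roughly as follows, in plain Java with no Lucene classes. The Hit type and the newestSessions method are hypothetical names for this example, and the sketch assumes each hit carries its 'filename' and a timestamp in the yyyyMMddHHmm form shown earlier, so that comparing the strings also compares the times. It further assumes "newest" means the most recent matching message in a session.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical post-processing step: collapse message hits into sessions.
class SessionCollapse {
    // A hit reduced to the two stored fields this step needs.
    static final class Hit {
        final String filename;
        final String timestamp; // yyyyMMddHHmm, so string order is time order

        Hit(String filename, String timestamp) {
            this.filename = filename;
            this.timestamp = timestamp;
        }
    }

    static List<String> newestSessions(List<Hit> hits, int limit) {
        // Keep only the most recent timestamp seen for each session file.
        Map<String, String> newest = new HashMap<>();
        for (Hit h : hits) {
            newest.merge(h.filename, h.timestamp,
                    (a, b) -> a.compareTo(b) >= 0 ? a : b);
        }
        // Sort sessions newest first; ties are broken by filename.
        List<String> sessions = new ArrayList<>(newest.keySet());
        sessions.sort((a, b) -> {
            int byTime = newest.get(b).compareTo(newest.get(a));
            return byTime != 0 ? byTime : a.compareTo(b);
        });
        return new ArrayList<>(sessions.subList(0, Math.min(limit, sessions.size())));
    }
}
```

This still walks every hit, so its cost grows with the number of matching messages rather than the number of sessions shown.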
This works, but I'm afraid it is not going to perform well once I end up
indexing all sessions. There must be a way to get the right hit set directly
from a search.
In any case, I'm looking at Solr for potential answers, thanks for
mentioning it :-)
Ta.
Jo
On Thu, Aug 7, 2008 at 12:59 AM, Chris Hostetter
<ho...@fucit.org> wrote:
>
> : In addition to the full text search, I'd like to be able to perform searches
> : such as:
> : - list sessions from:xxx timestamp:200808*
> : - list sessions (from:xxx OR from:yyy)
> : - etc
> :
> : Would it be better to store each message as a separate document with its
> : fields, adding the 'filename' (session identifier) as an extra field? or
> : maybe is there a better way of doing it making the session file a document?
>
> As a general rule of thumb, you make 1 document for each result you want
> to get back when you execute a search ... if you want to be able to search
> for "foo" and get back a list of all sessions where the word "foo" was
> used, then each session should be a document. If you also want to be able
> to search for "foo" and get back a list of each message that contained the
> word "foo", then each message can also be a document -- either in another
> index, or even in the same index (there's no rule that says all documents
> must have the same fields)
>
> BTW: If you are planning on experimenting with the Java API, I would
> suggest sending any specific followup questions to the java-user@lucene
> list. But you may also want to consider checking out Solr, and the
> solr-user list. Depends on what level of abstraction you want to deal
> with (Solr provides a config based web service type front end for dealing
> with Lucene indexes, but also has a Java API both for indexing and for
> hooking in custom functionality when executing searches)
>
>
> -Hoss
>
>
Re: RFO- Indexing 'meaningfull' xml
Posted by Chris Hostetter <ho...@fucit.org>.
: In addition to the full text search, I'd like to be able to perform searches
: such as:
: - list sessions from:xxx timestamp:200808*
: - list sessions (from:xxx OR from:yyy)
: - etc
:
: Would it be better to store each message as a separate document with its
: fields, adding the 'filename' (session identifier) as an extra field? or
: maybe is there a better way of doing it making the session file a document?
As a general rule of thumb, you make 1 document for each result you want
to get back when you execute a search ... if you want to be able to search
for "foo" and get back a list of all sessions where the word "foo" was
used, then each session should be a document. If you also want to be able
to search for "foo" and get back a list of each message that contained the
word "foo", then each message can also be a document -- either in another
index, or even in the same index (there's no rule that says all documents
must have the same fields)
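To make the rule of thumb concrete, here is a toy in-memory sketch in plain Java. It is not Lucene code and not how a real index works internally; the class MixedIndexSketch, the "kind" field, and the other field names are all assumptions made for the example. It only shows the shape of the idea: session documents and message documents sit side by side with different fields, and the kind of document you search determines the kind of result you get back.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy stand-in for an index that holds two kinds of documents.
class MixedIndexSketch {
    private final List<Map<String, String>> docs = new ArrayList<>();

    // One document per session: the unit returned by "which sessions mention foo?"
    void addSession(String filename, String allText) {
        Map<String, String> doc = new HashMap<>();
        doc.put("kind", "session");
        doc.put("filename", filename);
        doc.put("text", allText);
        docs.add(doc);
    }

    // One document per message, with extra fields a session document lacks.
    void addMessage(String filename, String from, String timestamp, String body) {
        Map<String, String> doc = new HashMap<>();
        doc.put("kind", "message");
        doc.put("filename", filename);
        doc.put("from", from);
        doc.put("timestamp", timestamp);
        doc.put("text", body);
        docs.add(doc);
    }

    // Returns the documents of the requested kind whose text contains the word.
    List<Map<String, String>> search(String kind, String word) {
        String needle = word.toLowerCase(Locale.ROOT);
        List<Map<String, String>> hits = new ArrayList<>();
        for (Map<String, String> doc : docs) {
            if (!kind.equals(doc.get("kind"))) {
                continue;
            }
            List<String> tokens = Arrays.asList(
                    doc.get("text").toLowerCase(Locale.ROOT).split("\\W+"));
            if (tokens.contains(needle)) {
                hits.add(doc);
            }
        }
        return hits;
    }
}
```

Searching with kind "session" yields one hit per session, while the same word with kind "message" yields one hit per message, each carrying its own from and timestamp fields.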
BTW: If you are planning on experimenting with the Java API, I would
suggest sending any specific followup questions to the java-user@lucene
list. But you may also want to consider checking out Solr, and the
solr-user list. Depends on what level of abstraction you want to deal
with (Solr provides a config based web service type front end for dealing
with Lucene indexes, but also has a Java API both for indexing and for
hooking in custom functionality when executing searches)
-Hoss