Posted to java-user@lucene.apache.org by Namit Yadav <na...@gmail.com> on 2006/07/25 04:05:52 UTC

Index Rows as Documents? Help me design a solution

My question is probably very easy for you Lucene experts, but after
going through the Lucene documentation and examples I haven't been able
to figure out how to solve this problem. I'd be really grateful if
someone could give me a starting point here.

Our application tracks SMSes sent from a particular phone number. We
have gigabytes of logs that look (let's say) like this:

SomeUselessData1#SMSID#SomeData1#PhoneNumber
SomeUselessData2#SMSID#SomeData2
SomeUselessData3#SMSID#SomeData3
SomeUselessData4#SMSID#SomeData4
...
...

Now, searches will obviously be done by phone number, so we need an
index that lets us:

1. List the SMSIDs of all the SMSes sent from a given phone number
   (each SMS message has a globally unique ID)
2. List SomeData1, SomeData2, SomeData3 and SomeData4 for a given SMSID.

How can I do this efficiently?

I wrote a sample piece of code in which each row was a Document, and
the PhoneNumber, SMSID and SomeData columns were Fields. Indexing was
taking many minutes for a 1 MB log file, so I realized I must not be
doing it right (you can guess how 'not' comfortable I am with Lucene at
present). I would expect to be able to index at least a GB of logs
within 1 or 2 minutes.

Can someone please point me to the right examples, help me understand
what my Documents / Fields / Analyzers should be, or help me design a
solution?

Thanks in advance

P.S. I just got Lucene in Action. Is there an example (or a similar
concept) explained in the book? From what I see, none of the examples
really helps me much.



Re: Index Rows as Documents? Help me design a solution

Posted by Daniel Naber <lu...@danielnaber.de>.
On Tuesday, 25 July 2006 04:05, Namit Yadav wrote:

> 1. List the SMSIDs of all the SMSes sent from a given phone number
>    (each SMS message has a globally unique ID)
> 2. List SomeData1, SomeData2, SomeData3 and SomeData4 for a given SMSID.
>
> How can I do this efficiently?

Short answer: use a relational database, not Lucene. Why do you want to
use Lucene for this?

Regards
 Daniel

-- 
http://www.danielnaber.de



Re: Index Rows as Documents? Help me design a solution

Posted by Erick Erickson <er...@gmail.com>.
The code looks good, *assuming* that the IndexWriter you pass in isn't
closed/opened between files (that would be a problem if you have lots of
files to index). I've seen the IndexWriter.optimize method take a long
time to complete, so I typically don't call it until I'm entirely done
indexing.

For what it's worth, my application indexes 10,000 documents in about 15
seconds - and those are XML documents that have to be parsed, with a
dozen or so fields indexed for each.

An easy test to narrow down where your problem is would be to just
comment out the writer.addDocument() call and take some timings. Then,
perhaps, comment out the open/close/optimize of the IndexWriter and see
whether *that* makes a measurable difference.
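
For instance, something like this (a rough sketch around the loop in
your indexFile - only the timing lines are the point here):

long start = System.currentTimeMillis();
String line;
while ((line = br.readLine()) != null) {
    String[] columns = line.split("#");
    if (columns.length == 4) {
        Document doc = new Document();
        doc.add(new Field("msisdn", columns[0],
                Field.Store.YES, Field.Index.TOKENIZED));
        // writer.addDocument(doc); // comment this in/out between runs
    }
}
System.out.println("elapsed ms: " + (System.currentTimeMillis() - start));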

But if you're not opening and closing the index writer between files, I'm
stumped. I usually take a "divide and conquer" approach or haul out a
performance analyzer :(.

My experience is that Lucene indexes quite quickly, so I would assume it was
a problem in my code for quite a while before throwing in the towel.

You aren't by chance reading/writing over a network that may be slow,
are you? (Really grasping at straws here...)

Best
Erick

Re: Index Rows as Documents? Help me design a solution

Posted by Doron Cohen <DO...@il.ibm.com>.
A few comments:

> (from the first posting in this thread)
> Indexing was taking many minutes for a 1 MB log file. ...
> I would expect to be able to index at least a GB of logs within 1 or 2
> minutes.

1-2 minutes per GB works out to 30-60 GB/hour, which is a lot for a
single machine/JVM - at least, I have not seen Lucene index that fast.

> doc.add(new Field("msisdn", columns[0], Field.Store.YES,
Field.Index.TOKENIZED));
> doc.add(new Field("messageid", columns[2], Field.Store.YES,
Field.Index.TOKENIZED));

Is it really required to analyze (tokenize) the text for these fields,
"msisdn" and "messageid"?

> doc.add(new Field("line", line, Field.Store.YES, Field.Index.NO));

This is storing the original text of all input lines that are indexed -
quite an overhead.
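
For what it's worth, the loop body could then look something like this
(I'm keeping the field names and column positions from the posted code;
the "data" field and its columns are just a guess at what searches need
to return):

Document doc = new Document();
doc.add(new Field("msisdn", columns[0],
        Field.Store.YES, Field.Index.UN_TOKENIZED)); // one exact term
doc.add(new Field("messageid", columns[2],
        Field.Store.YES, Field.Index.UN_TOKENIZED)); // globally unique ID
// Store only the data a search must return, not the whole line.
doc.add(new Field("data", columns[1] + "#" + columns[3],
        Field.Store.YES, Field.Index.NO));
writer.addDocument(doc);

The untokenized fields are then matched with exact term queries, e.g.
(index path and phoneNumber are placeholders):

IndexSearcher searcher = new IndexSearcher("/path/to/index");
Hits hits = searcher.search(new TermQuery(new Term("msisdn", phoneNumber)));
for (int i = 0; i < hits.length(); i++) {
    System.out.println(hits.doc(i).get("messageid"));
}
searcher.close();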

- Doron




Re: Index Rows as Documents? Help me design a solution

Posted by Namit Yadav <na...@gmail.com>.
Thanks for the suggestion, Erick!

As for why we can't use a relational database: we get all the logs from
an external application, and due to the nature of the business we need
to keep maintaining the logs anyway. Moreover, search requests are very
infrequent, so it doesn't make sense to (almost) replicate the complete
data in a database.

Back to the problem. Erick, here is a sample indexFile method (Is this
how I am supposed to index the file?):

    private static void indexFile(IndexWriter writer, File f) {
        try {
            System.out.println("Indexing " + f.getCanonicalPath());
            BufferedReader br = new BufferedReader(new FileReader(f));
            try {
                String line;
                String[] columns;
                while ((line = br.readLine()) != null) {
                    columns = line.split("#");
                    if (columns.length == 4) { // rows without 4 columns are of no use to us
                        Document doc = new Document();
                        doc.add(new Field("msisdn", columns[0],
                                Field.Store.YES, Field.Index.TOKENIZED));
                        doc.add(new Field("messageid", columns[2],
                                Field.Store.YES, Field.Index.TOKENIZED));
                        doc.add(new Field("line", line,
                                Field.Store.YES, Field.Index.NO));
                        writer.addDocument(doc);
                    }
                }
            } finally {
                br.close(); // don't leak file handles across many log files
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
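
For reference, here is the driver I had in mind - one IndexWriter shared
across all the files, with optimize() and close() only at the very end
(class name and index path are made up):

    import java.io.File;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class LogIndexer {
        public static void main(String[] args) throws Exception {
            // Open the writer once; "true" creates a new index.
            IndexWriter writer = new IndexWriter("/path/to/index",
                    new StandardAnalyzer(), true);
            File[] logs = new File(args[0]).listFiles();
            for (int i = 0; i < logs.length; i++) {
                indexFile(writer, logs[i]); // same writer for every file
            }
            writer.optimize(); // once, after all files are indexed
            writer.close();
        }

        // indexFile(...) as above
    }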

On 7/25/06, Erick Erickson <er...@gmail.com> wrote:
> Indexing 1 MB of logs shouldn't take minutes, so you're probably right.
>
> A problem I've seen is opening/indexing/closing your index writer too
> often. [...]
>
> If none of this is relevant, can you post a bit of (perhaps pseudo) code?



Re: Index Rows as Documents? Help me design a solution

Posted by Erick Erickson <er...@gmail.com>.
Indexing 1 MB of logs shouldn't take minutes, so you're probably right.

A problem I've seen is opening/indexing/closing your index writer too
often. You should do something like this (a rough sketch - indexDir,
analyzer and docs stand in for your own setup):

IndexWriter writer = new IndexWriter(indexDir, analyzer, true);
for (int i = 0; i < docs.length; i++) { // lots and lots of records
    writer.addDocument(docs[i]);
}
writer.optimize(); // only once, when you're completely done
writer.close();


Others have had a problem where they open/write/close the index writer for
EACH document, which is painfully slow.

Also, you might play around with IndexWriter.setMergeFactor and
setMaxBufferedDocs. If you set them too high you'll run out of memory,
but they can make a difference in how fast your index is built.
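
For example, right after opening the writer (the numbers are just
starting points to experiment with, not recommendations):

writer.setMergeFactor(50);        // merge on-disk segments less often
writer.setMaxBufferedDocs(1000);  // buffer more docs in RAM before each flush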


If none of this is relevant, can you post a bit of (perhaps pseudo) code?

Best
Erick