Posted to general@lucene.apache.org by Alex Shneyderman <a....@gmail.com> on 2011/10/14 01:21:43 UTC

Suggestions or best practices for indexing the logs

Hello, everybody!

I am trying to introduce faster searches to our application that sifts
through the logs, and Lucene seems to be the tool to use here. One
peculiarity of the problem is that there seem to be few files, each
containing many log statements. I avoid storing the text in the index
itself. Given all this, I set up indexing as follows:

I iterate over a log file and, for each statement in it, index the
statement's content.
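For illustration, the splitting step could look something like the sketch below. This is purely hypothetical: the actual log format is not shown in the thread, so the timestamp prefix pattern and the treatment of continuation lines (e.g. stack traces) are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: group raw log lines into "statements".
// Assumes each statement starts with a line carrying a timestamp
// prefix like "2011-10-13 19:21:43"; any other line is treated as a
// continuation of the current statement.
public class StatementSplitter {
    private static final Pattern STATEMENT_START =
        Pattern.compile("^\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}.*");

    public static List<String> split(List<String> lines) {
        List<String> statements = new ArrayList<>();
        StringBuilder current = null;
        for (String line : lines) {
            if (STATEMENT_START.matcher(line).matches()) {
                // New statement begins; flush the previous one.
                if (current != null) statements.add(current.toString());
                current = new StringBuilder(line);
            } else if (current != null) {
                // Continuation line (e.g. a stack-trace frame).
                current.append('\n').append(line);
            }
        }
        if (current != null) statements.add(current.toString());
        return statements;
    }
}
```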

Here is the Java code that adds the fields:

            NumericField startOffset = new NumericField("so", Field.Store.YES, false);
            startOffset.setLongValue(statement.getStartOffset());
            doc.add(startOffset);

            NumericField endOffset = new NumericField("eo", Field.Store.YES, false);
            endOffset.setLongValue(statement.getEndOffset());
            doc.add(endOffset);

            NumericField timestampField = new NumericField("ts", Field.Store.YES, true);
            timestampField.setLongValue(statement.getStatementTime().getTime());
            doc.add(timestampField);

            doc.add(new Field("fn", fileTagName, Field.Store.YES, Field.Index.NO));
            doc.add(new Field("ct", statement.getContent(), Field.Store.NO,
                    Field.Index.ANALYZED, Field.TermVector.NO));

I am getting the following results (index size vs. log size) with this scheme:

The size of the logs is 385 MB.
(00:13:08) /var/tmp/logs > du -ms /var/tmp/logs
385     /var/tmp/logs


The size of the index is 143 MB.
(00:41:26) /var/tmp/index > du -ms /var/tmp/index
143     /var/tmp/index

Is this a normal ratio (143 MB / 385 MB)? It seems a bit too high; I
would expect something like 1/5 to 1/7 for the index. Is there anything
I can do to get closer to the desired ratio? A word histogram would of
course help here, so below is the top of the output of the
word-histogram script that I ran on the logs:

Total number of words: 26935271
Number of different words: 551981
The most common words are:
as      3395203
10      797708
13      797662
2011    795595
at      787365
timer   746790
...
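For reference, a word-count script of the kind mentioned above could be sketched as follows in Java. This is only an illustration; Alex's actual script is not shown in the thread, and the tokenization (lower-casing, splitting on non-word characters) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a word-histogram pass: count tokens,
// then list the most frequent ones.
public class WordHistogram {
    // Lower-case the text, split on non-word characters, count tokens.
    public static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    // Return the n entries with the highest counts, descending.
    public static List<Map.Entry<String, Integer>> top(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return entries.subList(0, Math.min(n, entries.size()));
    }
}
```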

Could anyone suggest a better way to organize the index for my logs?
By better I mean more compact. Or is this as good as it gets? I tried
to optimize and got a 2 MB improvement (the index went from 145 MB to
143 MB).

Could anyone point me to an article that deals with indexing logs? Any
help, suggestions, and pointers are greatly appreciated.

Thanks for any and all help and cheers,
Alex.

Re: Suggestions or best practices for indexing the logs

Posted by Alex Shneyderman <a....@gmail.com>.
Otis,

Not sure I understand. Could you elaborate?

Note that the content is not stored in the index itself, hence my
confusion about your suggestion.

Thanks,
Alex.

On Mon, Oct 17, 2011 at 4:12 PM, Otis Gospodnetic
<ot...@yahoo.com> wrote:
> Alex,
>
> You could try compressing the content field - that might help a bit.
>
> Otis
> ----
>
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/

Re: Suggestions or best practices for indexing the logs

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Alex,

You could try compressing the content field - that might help a bit.
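[Editorial sketch of the compression idea: Lucene 3.x ships org.apache.lucene.document.CompressionTools (compressString/decompressString), which wraps java.util.zip's Deflater much as the standalone sketch below does. Note this only affects *stored* fields; since the "ct" field above is Field.Store.NO, it would apply only if the content were actually stored.]

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Standalone deflate round-trip, mirroring what compressing a stored
// field would do to the bytes before they land in the index.
public class ContentCompression {
    public static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    public static byte[] inflate(byte[] input) {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && inflater.needsInput()) break; // truncated input
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt deflate stream", e);
        }
        inflater.end();
        return out.toByteArray();
    }
}
```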

Otis
----

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

