You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by abdul aleem <ja...@yahoo.com> on 2006/12/13 14:10:11 UTC

Indexing clarification , please advice

Hello All,

Apolgies if it is a naive question

a) Indexing large file ( more than 4MB ) 
   Do i need to read the entire file as string using
   java.io and create a Document object ?
   
   The file contains timestamp, if i need to index on
   timestamp is parsing the entire file manually
(tokenizing) and store the timestamp as document
object is the only way ? or is there any alternatives
?


b) I need to search the contents of a log file which
is changing rapidly, from initial testing i see if any
changes in the file is not reflected unless it is
*Indexed* again  
   
   Do we need to index the files always before search
if the content of the file is dynamically changed
 
  ( log file has a pattern and always logs in a
similar fashion, each time i need to index takes lot
of time as the file is large (approach a )  is there
any work arounds for this ? )    

  I would greatly appreciate if any inputs on the
above,

 Many thanks,
 Abdul 


   


 
____________________________________________________________________________________
Need a quick answer? Get one in minutes from people who know.
Ask your question on www.Answers.yahoo.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Indexing clarification , please advice

Posted by Lukas Vlcek <lu...@gmail.com>.
Hi,

May be you can consider using Compass (http://www.opensymphony.com/compass/)
which could help you in your situation. They claim that some actions (like
updating the index very often) are treated in a very efficient way (due to
caching which is not a native part of Lucene library).

Regards,
Lukas

On 12/13/06, Daniel Naber <lu...@danielnaber.de> wrote:
>
> On Wednesday 13 December 2006 14:10, abdul aleem wrote:
>
> > a) Indexing large file ( more than 4MB )
> > Do i need to read the entire file as string using
> > java.io and create a Document object ?
>
> You can also use a reader:
>
>
> http://lucene.apache.org/java/2_0_0/api/org/apache/lucene/document/Field.html#Field(java.lang.String,
> java.io.Reader)
>
> Regards
> Daniel
>
> --
> http://www.danielnaber.de
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: Indexing clarification , please advice

Posted by Daniel Naber <lu...@danielnaber.de>.
On Wednesday 13 December 2006 14:10, abdul aleem wrote:

> a) Indexing large file ( more than 4MB )
>    Do i need to read the entire file as string using
>    java.io and create a Document object ?

You can also use a reader:

http://lucene.apache.org/java/2_0_0/api/org/apache/lucene/document/Field.html#Field(java.lang.String, java.io.Reader)

Regards
 Daniel

-- 
http://www.danielnaber.de

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Indexing clarification , please advice

Posted by abdul aleem <ja...@yahoo.com>.
Many thanks Erick,

Your points are valid, i was thinking entire Log file
as a lucene document, im wrong trying  to chop the log
file might be the way to go

my bad expressions , yes you got that right
timestamp must be added as a "FIELD" that is what i
meant

really appreciate your detailed reply,


regards,
Abdul 

--- Erick Erickson <er...@gmail.com> wrote:

> Let me take a crack at it. See below...
> 
> On 12/13/06, abdul aleem <ja...@yahoo.com>
> wrote:
> >
> > Hello All,
> >
> > Apolgies if it is a naive question
> >
> > a) Indexing large file ( more than 4MB )
> >    Do i need to read the entire file as string
> using
> >    java.io and create a Document object ?
> 
> 
> Essentially yes. IF you must index the whole
> document as a single Lucene
> document. my reply later for why you may not want to
> do this.
> 
> The following are equivalent.
> Document doc = new Document();
> doc.add("textfield", "data1 data2 data3");
> doc.add("textfield", "data4 data5 data6");
> 
> 
> and
> Document doc = new Document();
> doc.add("textfield", "data1 data2 data3 data4 data5
> data6");
> 
> So you could read chunks of your file and index them
> into the *same* field
> before writing the document to the index, or you
> could read the file as a
> single chunk and index it all at once.
> 
> HOWEVER: note that the default in Lucene is to index
> only the first 10,000
> (?) tokens for a field in a single document. See
> IndexWriter.SetMaxFieldLength.
> 
> 
> 
>    The file contains timestamp, if i need to index
> on
> >    timestamp is parsing the entire file manually
> > (tokenizing) and store the timestamp as document
> > object is the only way ? or is there any
> alternatives
> > ?
> 
> 
> Perhaps I don't understand the problem. You can
> store the timestamp as a
> *field* in a document. Is that what you mean by
> storing it as a "document
> object"? But there's no way I know of to have Lucene
> automatically do
> something with arbitrary text in the input stream.
> You could write a custom
> analyzer (see Lucene In Action for the Synonym
> Analyzer for a model). That
> Analyzer would be responsible for recognizing
> timestamps in the input stream
> and doing something special with them.
> 
> 
> b) I need to search the contents of a log file which
> > is changing rapidly, from initial testing i see if
> any
> > changes in the file is not reflected unless it is
> > *Indexed* again
> >
> >    Do we need to index the files always before
> search
> > if the content of the file is dynamically changed
> 
> 
> Yes. That is, you can't search data you haven't
> indexed.
> 
> 
>   ( log file has a pattern and always logs in a
> > similar fashion, each time i need to index takes
> lot
> > of time as the file is large (approach a )  is
> there
> > any work arounds for this ? )
> 
> 
> I think you need to re-think your approach. A
> document in Lucene is whatever
> you want to think of it as. For instance, you could
> index your log file such
> that each Lucene "document" was all the data added
> to the log file over some
> specified time interval. So, say you have a log that
> starts at midnight.
> Each Lucene "document" could be all the data added
> to the index for each
> minute. So you'd have a document for the data added
> to the log between 12:00
> and 12:00:59. Another "document" for all data
> between 12:01:00 and 12:01:59
> etc. That way, you don't have to re-index the entire
> log, just everything
> since the last interval you already indexed.
> 
> I'm not necessarily recommending this approach, but
> using it to illustrate
> that you don't need to think of a Lucene Document as
> your entire log file.
> You may be much better off slicing the data up
> somehow and having a
> one-to-many relationship between your log and the
> Lucene documents......
> 
> Perhaps you could index every message as an
> individual document (which would
> deal with your timestamp issue), or.....
> 
> Hope this helps
> Erick
> 
>   I would greatly appreciate if any inputs on the
> > above,
> >
> > Many thanks,
> > Abdul
> >
> >
> >
> >
> >
> >
> >
> >
>
____________________________________________________________________________________
> > Need a quick answer? Get one in minutes from
> people who know.
> > Ask your question on www.Answers.yahoo.com
> >
> >
>
---------------------------------------------------------------------
> > To unsubscribe, e-mail:
> java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail:
> java-user-help@lucene.apache.org
> >
> >
> 



 
____________________________________________________________________________________
Do you Yahoo!?
Everyone is raving about the all-new Yahoo! Mail beta.
http://new.mail.yahoo.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Indexing clarification , please advice

Posted by Erick Erickson <er...@gmail.com>.
Let me take a crack at it. See below...

On 12/13/06, abdul aleem <ja...@yahoo.com> wrote:
>
> Hello All,
>
> Apolgies if it is a naive question
>
> a) Indexing large file ( more than 4MB )
>    Do i need to read the entire file as string using
>    java.io and create a Document object ?


Essentially yes. IF you must index the whole document as a single Lucene
document. my reply later for why you may not want to do this.

The following are equivalent.
Document doc = new Document();
doc.add("textfield", "data1 data2 data3");
doc.add("textfield", "data4 data5 data6");


and
Document doc = new Document();
doc.add("textfield", "data1 data2 data3 data4 data5 data6");

So you could read chunks of your file and index them into the *same* field
before writing the document to the index, or you could read the file as a
single chunk and index it all at once.

HOWEVER: note that the default in Lucene is to index only the first 10,000
(?) tokens for a field in a single document. See
IndexWriter.SetMaxFieldLength.



   The file contains timestamp, if i need to index on
>    timestamp is parsing the entire file manually
> (tokenizing) and store the timestamp as document
> object is the only way ? or is there any alternatives
> ?


Perhaps I don't understand the problem. You can store the timestamp as a
*field* in a document. Is that what you mean by storing it as a "document
object"? But there's no way I know of to have Lucene automatically do
something with arbitrary text in the input stream. You could write a custom
analyzer (see Lucene In Action for the Synonym Analyzer for a model). That
Analyzer would be responsible for recognizing timestamps in the input stream
and doing something special with them.


b) I need to search the contents of a log file which
> is changing rapidly, from initial testing i see if any
> changes in the file is not reflected unless it is
> *Indexed* again
>
>    Do we need to index the files always before search
> if the content of the file is dynamically changed


Yes. That is, you can't search data you haven't indexed.


  ( log file has a pattern and always logs in a
> similar fashion, each time i need to index takes lot
> of time as the file is large (approach a )  is there
> any work arounds for this ? )


I think you need to re-think your approach. A document in Lucene is whatever
you want to think of it as. For instance, you could index your log file such
that each Lucene "document" was all the data added to the log file over some
specified time interval. So, say you have a log that starts at midnight.
Each Lucene "document" could be all the data added to the index for each
minute. So you'd have a document for the data added to the log between 12:00
and 12:00:59. Another "document" for all data between 12:01:00 and 12:01:59
etc. That way, you don't have to re-index the entire log, just everything
since the last interval you already indexed.

I'm not necessarily recommending this approach, but using it to illustrate
that you don't need to think of a Lucene Document as your entire log file.
You may be much better off slicing the data up somehow and having a
one-to-many relationship between your log and the Lucene documents......

Perhaps you could index every message as an individual document (which would
deal with your timestamp issue), or.....

Hope this helps
Erick

  I would greatly appreciate if any inputs on the
> above,
>
> Many thanks,
> Abdul
>
>
>
>
>
>
>
> ____________________________________________________________________________________
> Need a quick answer? Get one in minutes from people who know.
> Ask your question on www.Answers.yahoo.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>