You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Eyal Post <ey...@gmail.com> on 2006/03/01 07:23:45 UTC
Adding line count to a document
I'd like to add a line count field to my indexed document. The obvious way
is to read my file twice, once to tokenize it and add it's content to a
field in the document and once to count the number of lines in it and add it
to another field.
Any idea how can I optimize this and read the file once?
Regards,
Eyal
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
RE: Adding line count to a document
Posted by Eyal <ey...@gmail.com>.
I think my questions wasn't clear..
Let's say I'm doing something like that (c# code, but that's not the
issue..)
TextReader reader=new StreamReader("C:\FileToIndex.txt");
Int lineCount=CountLines("C:\FileToIndex.txt"); //This ones reads the entire
file and count the number of lines
Document doc=new Document();
doc.add(Field.Text("body"),reader);
doc.add(Field.Keyword("lineCount"),lineCount);
In the above example, I'm reading the entire file twice. This could be a
100Mb file.
Now, Let's say I have a class LineCountingTextReader that counts the lines
as the file is being read. If I do the following
doc.add(Field.Text("body"),lcReader);
Then only after I call IndexWriter.AddDocument I will actually have the line
count (since only then the file will be read entirely).
I don't want to read the entire file into memory and use it for both line
counting and analyzing since it may be a very big file. So I'm wondering
what other are doing?
This is also a problem when you need to get several pieces of information
from 1 file to different fields (i.e. analyze an html file and also get the
links from it and add them to other a different field).
Thanks in advance,
Eyal
> -----Original Message-----
> From: Eyal Post [mailto:eyal.junk@gmail.com]
> Sent: Wednesday, March 01, 2006 8:24 AM
> To: java-user@lucene.apache.org
> Subject: Adding line count to a document
>
> I'd like to add a line count field to my indexed document.
> The obvious way is to read my file twice, once to tokenize it
> and add it's content to a field in the document and once to
> count the number of lines in it and add it to another field.
> Any idea how can I optimize this and read the file once?
>
> Regards,
> Eyal
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org