Posted to dev@lucene.apache.org by Paul Smith <ps...@aconex.com> on 2005/01/30 22:21:45 UTC
Indexing speed
This relates to a previous post of mine regarding Context of 'lines' of
text (log4j events in my case):
http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg11869.html
I'm going through the process of writing quick-and-dirty
test-case/test-bed classes to validate whether my ideas are going to
work or not.
For my first test, I wrote a quick indexer that indexes a traditional
log file line by line, with each line becoming a Document, so that I
can search for matching lines and then do a context search. Yes, this
is exactly what 'grep' does, and does very well, but I figured that if
one is doing a lot of analysis of a log file (typical when mentally
working through one) it might be best to index it once and then search
it quickly many times.
It turns out that even using JUST a RAMDirectory (which surprised me),
writing a Document for every line of text is significantly slower than
I hoped. I played around with the mergeFactor settings etc., but
nothing really made much difference to the indexing speed, other than
NOT adding the Document to the index at all...
I have tried this on my Mac laptop as well as a test Linux server,
with no noticeable difference. (In both scenarios the log file being
read and the new index are on the same physical drive, which I know is
not the _best_ setup, but still.)
This could well be my own stupidity, so here's what I'm doing.
Statistics on the Log File
=================
The log file is 28 MB, consisting of 409,566 lines of the form:
[2004-12-21 00:00:00,935 INFO
][ommand.ProcessFaxCmd][http-80-Processor9][192.168.0.220][] Finished
processing [mail box=stagingfax][MsgCount=0]
[2004-12-21 00:00:00,986 INFO
][ommand.ProcessFaxCmd][http-80-Processor9][192.168.0.220][] Finished
processing [mail box=aconexnz9000][MsgCount=0]
[2004-12-21 00:00:01,126 INFO ][
monitor][http-80-Processor9][192.168.0.220][] Controller duration: 212ms
url=/Fax, fowardDuration=-1, total=212
[2004-12-21 00:00:03,668 ERROR][essFaxDeliveryAction][Thread-157][][]
Could not connect to mail server!
[host=test.aconex.com][username=outboundstagingfax][password=d3vf@x]
javax.mail.AuthenticationFailedException: Login failed: authentication
failure
    at com.sun.mail.imap.IMAPStore.protocolConnect(IMAPStore.java:330)
    at javax.mail.Service.connect(Service.java:233)
    at javax.mail.Service.connect(Service.java:134)
    at com.aconex.fax.action.ProcessFaxDeliveryAction.perform(ProcessFaxDeliveryAction.java:68)
    at com.aconex.scheduler.automatedTasks.FaxOutDeliveryMessageProcessorAT.run(FaxOutDeliveryMessageProcessorAT.java:62)
==================
Source code for test-bed:
==================
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.text.NumberFormat;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

public class TestBed1 {

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            throw new IllegalArgumentException("not enough args");
        }
        File file = new File(args[0]);
        Analyzer a = new SimpleAnalyzer();

        // Disk-based alternative, for comparison:
        //String indexLoc = "/tmp/testbed1/";
        //IndexWriter writer = new IndexWriter(indexLoc, a, true);

        RAMDirectory ramDir = new RAMDirectory();
        IndexWriter ramWriter = new IndexWriter(ramDir, a, true);

        long length = file.length();
        BufferedReader fileReader = new BufferedReader(new FileReader(file));

        NumberFormat nf = NumberFormat.getPercentInstance();
        nf.setMaximumFractionDigits(0);

        String line;
        double processed = 0;
        String lastPercent = "";
        long lines = 0;
        while ((line = fileReader.readLine()) != null) {
            Document doc = new Document();
            doc.add(Field.UnStored("Line", line));
            ramWriter.addDocument(doc);
            processed += line.length();
            lines++;
            String percent = nf.format(processed / length);
            if (!percent.equals(lastPercent)) {
                lastPercent = percent;
                System.out.println(percent + " (lines=" + lines + ")");
            }
        }
        fileReader.close();
        ramWriter.close();
        //writer.optimize();
        //writer.close();
    }
}
=======
I did other simple tests measuring exactly how long it takes Java just
to read the lines of the file, and that is mega quick in comparison.
It's the "ramWriter.addDocument(doc)" line that seems to be doing the
bulk of the work, and probably for good reason. I had originally used
Field.Text(...) to keep the line text with the index for the Context
search later on, but even Field.UnStored doesn't make that much
difference from a stopwatch point of view (Field.Text creates a bigger
index, of course).
I might set up a profiler and work through where the time is going,
but you guys probably already know the answer.
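Before reaching for a profiler, the two costs can be separated with a plain-Java sketch like the one below (the class, method names, and the per-line "work" are made up for illustration; swap the stand-in work for the real addDocument() call to measure the indexer itself):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical micro-benchmark: time a read-only pass over the log,
// then a pass that also does some per-line work, and compare.
public class PhaseTimer {

    // Baseline: how long does Java take just to read every line?
    static long readOnly(BufferedReader in) throws IOException {
        long lines = 0;
        while (in.readLine() != null) {
            lines++;
        }
        return lines;
    }

    // Same pass, plus trivial per-line work standing in for indexing.
    static long readAndProcess(BufferedReader in) throws IOException {
        long tokens = 0;
        String line;
        while ((line = in.readLine()) != null) {
            tokens += line.toLowerCase().split("\\s+").length;
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        // Synthetic stand-in for the 28 MB log file.
        StringBuilder log = new StringBuilder();
        for (int i = 0; i < 100000; i++) {
            log.append("[2004-12-21 00:00:00,935 INFO ] line ").append(i).append('\n');
        }
        long t0 = System.currentTimeMillis();
        readOnly(new BufferedReader(new StringReader(log.toString())));
        long t1 = System.currentTimeMillis();
        readAndProcess(new BufferedReader(new StringReader(log.toString())));
        long t2 = System.currentTimeMillis();
        System.out.println("read-only: " + (t1 - t0) + "ms, read+process: " + (t2 - t1) + "ms");
    }
}
```

The gap between the two timings bounds how much the per-line work, rather than I/O, costs.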
I'm going to need much higher throughput for my utility to be useful.
Maybe that's just not achievable.
Thoughts?
cheers,
Paul Smith
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
Re: Indexing speed
Posted by Paul Smith <ps...@aconex.com>.
Thanks Otis. I tried Field.Keyword, but that didn't seem to make any
appreciable difference.
I'll have a hunt around with a profiler and see what I can find. I
guess my use case is unusual: I need to create a LOT of very small
documents.
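If profiling confirms that per-Document overhead dominates, one possible workaround (a sketch of an idea, not something tested in this thread) is to index a block of N consecutive lines as a single Document, then narrow down to the exact line within a matching block at display time. The grouping helper below is hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Group consecutive log lines into blocks of blockSize lines each;
// every block would then become one Lucene Document instead of one
// Document per line, amortising the per-Document cost.
public class LineBlocks {

    static List<String> group(List<String> lines, int blockSize) {
        List<String> blocks = new ArrayList<>();
        StringBuilder block = new StringBuilder();
        int count = 0;
        for (String line : lines) {
            if (count > 0) {
                block.append('\n');  // keep lines separable for display
            }
            block.append(line);
            count++;
            if (count == blockSize) {
                blocks.add(block.toString());
                block.setLength(0);
                count = 0;
            }
        }
        if (count > 0) {
            blocks.add(block.toString());  // trailing partial block
        }
        return blocks;
    }
}
```

The trade-off is coarser hits (a matching block instead of a matching line), paid back with roughly blockSize-times fewer addDocument() calls.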
cheers,
Paul
Otis Gospodnetic wrote:
>I believe most of the time is being spent in the Analyzer. It should
>be easy to empirically test this claim by using Field.Keyword instead
>of Field.Text (Field.Keyword fields are not analyzed). If that turns
>out to be correct, then you could play with writing a custom and
>optimal Analyzer.
>
>Otis
>
>--- Paul Smith <ps...@aconex.com> wrote:
>
>[original message snipped]
>
Re: Indexing speed
Posted by Otis Gospodnetic <ot...@yahoo.com>.
I believe most of the time is being spent in the Analyzer. It should
be easy to empirically test this claim by using Field.Keyword instead
of Field.Text (Field.Keyword fields are not analyzed). If that turns
out to be correct, then you could play with writing a custom and
optimal Analyzer.
Otis
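The claim above can be made concrete in plain Java rather than actual Lucene internals (the class and method names below are invented for illustration): Field.Keyword keeps the whole value as a single term, while SimpleAnalyzer-style analysis walks every character of every line to cut out lower-cased letter runs, which is far more work per document.

```java
import java.util.ArrayList;
import java.util.List;

// Rough illustration of the relative cost of the two field types.
public class AnalyzerCost {

    // Field.Keyword-style: the value is one term, untouched.
    static List<String> keyword(String line) {
        List<String> terms = new ArrayList<>();
        terms.add(line);
        return terms;
    }

    // SimpleAnalyzer-style: lower-cased maximal runs of letters,
    // produced by inspecting every character of the line.
    static List<String> letterTokens(String line) {
        List<String> terms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                terms.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            terms.add(current.toString());
        }
        return terms;
    }
}
```

Timing both helpers over the 409,566 lines would show how much of the wall-clock cost is analysis rather than index maintenance.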
--- Paul Smith <ps...@aconex.com> wrote:
> [original message snipped]
>