Posted to dev@lucene.apache.org by Paul Smith <ps...@aconex.com> on 2005/01/30 22:21:45 UTC

Indexing speed

This relates to a previous post of mine regarding Context of 'lines' of 
text (log4j events in my case):

http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg11869.html

I'm going through the process of writing quick and dirty 
test-case/test-bed classes to validate whether my ideas are going to 
work or not. 

For my first test, I thought I would write a quick indexer that indexes 
a traditional log file by lines, with each line being a Document, so 
that I could then search for matching lines and then do a context 
search.   Yes, this is exactly what 'grep' does, and does very well, but I 
figured that if one was doing a lot of analysis of a log file (typical when 
mentally analysing log files) it might be best to index it once, and 
then search it quickly many times.

It turns out that even using JUST a RAMDirectory (which surprised me), 
writing a Document for every line of text is taking significantly 
longer than I hoped.  I played around with the mergeFactor settings 
etc., but nothing really made much difference to the indexing speed, 
other than NOT adding the Document to the index at all....  I have 
tried this out on my Mac laptop, as well as a test Linux server, with 
no noticeable difference.  (In both scenarios the log file being read 
and the new index are on the same physical drive, which I know is not 
the _best_ setup, but still.)

This could well be my own stupidity, so here's what I'm doing.

Statistics on the Log File
=================

The log file is 28 MB and consists of 409,566 lines, of the form:

[2004-12-21 00:00:00,935 INFO ][ommand.ProcessFaxCmd][http-80-Processor9][192.168.0.220][]  Finished processing [mail box=stagingfax][MsgCount=0]
[2004-12-21 00:00:00,986 INFO ][ommand.ProcessFaxCmd][http-80-Processor9][192.168.0.220][]  Finished processing [mail box=aconexnz9000][MsgCount=0]
[2004-12-21 00:00:01,126 INFO ][             monitor][http-80-Processor9][192.168.0.220][] Controller duration: 212ms url=/Fax, fowardDuration=-1, total=212
[2004-12-21 00:00:03,668 ERROR][essFaxDeliveryAction][Thread-157][][] Could not connect to mail server! [host=test.aconex.com][username=outboundstagingfax][password=d3vf@x]
javax.mail.AuthenticationFailedException: Login failed: authentication failure
        at com.sun.mail.imap.IMAPStore.protocolConnect(IMAPStore.java:330)
        at javax.mail.Service.connect(Service.java:233)
        at javax.mail.Service.connect(Service.java:134)
        at com.aconex.fax.action.ProcessFaxDeliveryAction.perform(ProcessFaxDeliveryAction.java:68)
        at com.aconex.scheduler.automatedTasks.FaxOutDeliveryMessageProcessorAT.run(FaxOutDeliveryMessageProcessorAT.java:62)
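For the later context work, each matching event line will eventually need to be split back into its parts. Below is a minimal sketch of one way to do that, assuming the bracketed layout [timestamp level][category][thread][ip][user] message inferred from the sample holds for every event line (the class name and regex are mine, not part of the test-bed; continuation lines such as stack traces simply won't match):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogLineParser {

    // Bracketed layout inferred from the sample above:
    // [timestamp level][category][thread][ip][user] message
    private static final Pattern LINE = Pattern.compile(
        "^\\[(\\S+ \\S+) (\\w+)\\s*\\]\\[(.*?)\\]\\[(.*?)\\]\\[(.*?)\\]\\[(.*?)\\]\\s*(.*)$");

    /**
     * Returns {timestamp, level, category, thread, ip, user, message},
     * or null for lines (e.g. stack-trace continuations) that don't match.
     */
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3).trim(),
                              m.group(4), m.group(5), m.group(6), m.group(7) };
    }

    public static void main(String[] args) {
        String[] f = parse("[2004-12-21 00:00:03,668 ERROR][essFaxDeliveryAction][Thread-157][][] Could not connect to mail server!");
        System.out.println(f[1] + " / " + f[2] + " / " + f[6]);
    }
}
```

Splitting the parts out like this would also allow indexing the level, thread, and category as separate fields rather than one big "Line" field.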


==================
Source code for test-bed:
==================

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.text.NumberFormat;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

public class TestBed1 {

    public static void main(String[] args) throws Exception {

        if (args.length < 1) {
            throw new IllegalArgumentException("not enough args");
        }
        File file = new File(args[0]);
        Analyzer a = new SimpleAnalyzer();

        // String indexLoc = "/tmp/testbed1/";
        // IndexWriter writer = new IndexWriter(indexLoc, a, true);

        RAMDirectory ramDir = new RAMDirectory();
        IndexWriter ramWriter = new IndexWriter(ramDir, a, true);

        long length = file.length();
        BufferedReader fileReader = new BufferedReader(new FileReader(file));

        NumberFormat nf = NumberFormat.getPercentInstance();
        nf.setMaximumFractionDigits(0);

        String line;
        double processed = 0;
        String lastPercent = "";
        long lines = 0;
        while ((line = fileReader.readLine()) != null) {
            // One Document per log line, indexed but not stored.
            Document doc = new Document();
            doc.add(Field.UnStored("Line", line));
            ramWriter.addDocument(doc);
            processed += line.length();
            lines++;
            // Print progress once per whole percent.
            String percent = nf.format(processed / length);
            if (!percent.equals(lastPercent)) {
                lastPercent = percent;
                System.out.println(percent + " (lines=" + lines + ")");
            }
        }
        fileReader.close();
        ramWriter.close();

        // writer.optimize();
        // writer.close();
    }
}

=======

I did other simple tests to measure exactly how long it takes Java 
just to read the lines of the file, and that is mega quick in 
comparison.  It's actually the "ramWriter.addDocument(doc)" line that 
seems to have the biggest amount of work to do, and probably for good 
reason.  I had originally tried to use Field.Text(...) to keep the 
line with the index for context later on, but even UnStored doesn't 
really make that much difference from a stopwatch point of view (it 
creates a bigger index, of course).

I might set up a profiler and work through where it's taking the 
time, but you guys probably already know the answer.
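In the meantime, a crude alternative to attaching a full profiler is to accumulate wall-clock time per named phase (read vs. analyze vs. add) by hand. A minimal plain-JDK sketch of that idea; the class and method names are hypothetical, not from any library:

```java
import java.util.HashMap;
import java.util.Map;

// Poor man's profiler: accumulates wall-clock milliseconds per named
// phase. Call enter("phase") at each transition; entering a new phase
// implicitly closes the previous one.
public class PhaseClock {

    private final Map<String, Long> totals = new HashMap<String, Long>();
    private String current;
    private long startedAt;

    void enter(String phase) {
        leave();
        current = phase;
        startedAt = System.currentTimeMillis();
    }

    void leave() {
        if (current != null) {
            totals.merge(current, System.currentTimeMillis() - startedAt, Long::sum);
            current = null;
        }
    }

    Map<String, Long> totals() {
        return totals;
    }

    public static void main(String[] args) throws Exception {
        PhaseClock clock = new PhaseClock();
        clock.enter("read");
        Thread.sleep(20);
        clock.enter("index");   // implicitly closes "read"
        Thread.sleep(40);
        clock.leave();
        System.out.println(clock.totals());
    }
}
```

Wrapping readLine() in one phase and addDocument() in another would show at a glance which of the two dominates, without any profiler setup.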

I'm going to need much higher throughput for my utility to be useful. 
Maybe that's just not achievable.

Thoughts?

cheers,

Paul Smith



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: Indexing speed

Posted by Paul Smith <ps...@aconex.com>.
Thanks Otis. I tried Field.Keyword, but that didn't seem to make any 
appreciable difference.

I'll have a hunt around with a profiler and see what I can find.  I 
guess my use case is unusual; I need to create a LOT of very small 
documents.
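If a Document per line does turn out to be too slow, one possible workaround (not discussed in the thread; chunk size and names below are purely illustrative) is to index a chunk of N consecutive lines as a single Document and record the starting line number, so a hit can still be mapped back to its lines for context. A plain-JDK sketch of just the grouping step:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Groups consecutive lines into fixed-size chunks. Each chunk carries
// its 1-based starting line number so a search hit on the chunk can be
// mapped back to the original lines for context display.
public class LineChunker {

    static List<String[]> chunk(List<String> lines, int size) {
        List<String[]> chunks = new ArrayList<String[]>();
        for (int i = 0; i < lines.size(); i += size) {
            int end = Math.min(i + size, lines.size());
            String text = String.join("\n", lines.subList(i, end));
            // {first line number (1-based), chunk text}
            chunks.add(new String[] { String.valueOf(i + 1), text });
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a", "b", "c", "d", "e");
        for (String[] c : chunk(lines, 2)) {
            System.out.println("startLine=" + c[0] + " text=" + c[1].replace('\n', '|'));
        }
    }
}
```

The trade-off: far fewer addDocument() calls, at the cost of coarser-grained hits that need a second pass to find the exact matching line within the chunk.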

cheers,

Paul

Otis Gospodnetic wrote:

>I believe most of the time is being spent in the Analyzer.  It should
>be easy to empirically test this claim by using Field.Keyword instead
>of Field.Text (Field.Keyword fields are not analyzed).  If that turns
>out to be correct, then you could play with writing a custom and
>optimal Analyzer.
>
>Otis

Re: Indexing speed

Posted by Otis Gospodnetic <ot...@yahoo.com>.
I believe most of the time is being spent in the Analyzer.  It should
be easy to empirically test this claim by using Field.Keyword instead
of Field.Text (Field.Keyword fields are not analyzed).  If that turns
out to be correct, then you could experiment with writing a custom,
more efficient Analyzer.

Otis
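This claim can also be sanity-checked without Lucene at all: the sketch below mimics roughly what a letter-based tokenizer like SimpleAnalyzer's does (lower-cased maximal runs of letters), so timing it in isolation gives a ballpark for the pure analysis cost per line. The class name and the 400,000-iteration loop (roughly the log's line count) are illustrative assumptions, not code from the thread:

```java
import java.util.ArrayList;
import java.util.List;

// Rough stand-in for a SimpleAnalyzer-style tokenizer: emits
// lower-cased maximal runs of letters. Timing this loop by itself
// shows how much of the per-line indexing time is pure analysis.
public class LetterTokenizerSketch {

    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "[2004-12-21 00:00:01,126 INFO ][monitor] Controller duration: 212ms";
        long start = System.currentTimeMillis();
        long total = 0;
        for (int i = 0; i < 400000; i++) {   // roughly the line count of the 28 MB log
            total += tokenize(line).size();
        }
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("tokens=" + total + ", elapsed=" + elapsed + "ms");
    }
}
```

If this loop is much faster than the observed addDocument() throughput, the bottleneck is more likely in Document/field handling and segment merging than in tokenization itself.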

