You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by britske <gb...@gmail.com> on 2010/03/26 16:17:33 UTC

Re: custom low-level indexer (to speed things up) when fields, terms and docids are in order

define fun ;-)

indeed I created a custom indexing chain, plugged it in and all works well.
I'm currently trying to isolate the critical parts to double test that
indeed most of the time goes in the indexing process.
It's kind of hard doing accurate measurements bc. the indexer is designed to
be running asychronously from the process that calculates the maps etc ,so
to make use of concurrent IO and CPU.

however from a birds-eys view: disabling the async custom indexer and only
doing the described calculating/ populating of the ordered maps increases
throughput from 1.8 docs/sec to 5.4 docs/sec, while with or without aysnc
indexer enabled the total application is never cpu-bound. (not even close)

Investigating more, and reporting back.

Glad you liked the case ;-)

Geert-Jan

2010/3/26 Michael McCandless-2 [via Lucene] <
ml-node+676471-1074052327-38570@n3.nabble.com<ml...@n3.nabble.com>
>

> This sounds like fun :)
>
> So you've already created a custom indexing chain, and plugged this
> into DocumentsWriter?  And this chain directly interacts with the low
> level classes for writing a segment (FormatPostingsTerms/DocsConsumer,
> etc.)?
>
> I'm not sure you're gonna do much better than that... these classes
> already expect things "in order" and all they do (pretty much) is
> write the index files.  I think they should be pretty lean...
>
> Also, once flex lands, soon (note that it moves these low level
> interfaces/classes around)... since you're using these classes for
> writing, it'll mean you can freely swap in different codecs.
>
> The only thing you can do further is to conflate your custom code with
> the codec, ie, so that you make a single chain that directly writes
> index files.  But I'm not sure you'll gain much performance by doing
> so... (and then you can't [as easily] swap codecs).
>
> Have you profiled to see where the time is being spent?
>
> Mike
>
> On Thu, Mar 25, 2010 at 7:40 PM, britske <[hidden email]<http://n3.nabble.com/user/SendEmail.jtp?type=node&node=676471&i=0>>
> wrote:
>
> >
> > Hi,
> >
> > perhaps first some background:
> >
> > I need to speed-up indexing for an particular application which has a
> pretty
> > unsual schema: besides the normal stored and indexed fields we have about
>
> > 20.000 fields  per document which are all indexed/ non-stored sInts.
> >
> > Obviously indexing was really slow with such a number of fields. With
> > indexing through Solr we got about 0.3 docs/ sec ( on a ec2 m1.large
> > instance)
> >
> > Since these ~20.000 fields are all build/ calculated analogously, we
> figured
> > it would be possible to possibly build a low-level indexer for these
> fields
> > (of which we had domain knowledge which we could use to possibly speed
> > indexing up) and later merge them with the other fields to construct the
> > entire index. So we did and now achieve around 1.8 docs/ sec (6x
> speedup).
> > Not bad, but still not enough.
> >
> > As part of calculating these fields, we keep track of all fields, terms
> per
> > field, and docids per term ( per field) .
> > all the stuff is then ordered: the fields, the terms available for each
> > field, the docids per term and inserted in that order using lowel-level
> > classes like: FormatPostingsFieldsWriter , FormatPostingsTermsConsumer,
> > FormatPostingsDocsConsumer (pseudo-code below)
> >
> > This constructs the following files: .tis, .tii, .frq.  ( + some default
> > values for the other required files, which don't need actual data bc.
> these
> > fields are not stored..)
> >
> > I should also mention that each call to the indexer writes all available
> > fields, terms, docids to a new fsdirectory. So basically each call
> results
> > in a complete index. (containing about 100 docs each, bc. otherwise we
> run
> > into mem-problems keeping the  ordered maps)
> >
> > Since we already have all fields, terms, docids in order it seems (a lot
> of
> > ) overkill to me to be going through the methods that above-mentioned
> > classes offer, which were meant for more 'non-sequential / non-ordered'
> > inserts (AFAIK).
> >
> > What would be the best way to write .tis, .tii and .frq ni a more
> sequential
> > matter?
> > I'm looking for something that would construct a byte-array for each file
>
> > that conforms to the index-file definition of that particular file (or
> > something) . I could try to do it myself and completely bypass all
> > indexing-classes altogether and just write the files to disk. (Possible
> bc.
> > as mentioned we have all data needed to construct a complete index) .
> >
> > However, perhaps there are classes I'm not aware of that help in getting
> the
> > format right (it seems like a lot of trial-and-error coding otherwise)
> >
> > Thanks for any help, pointers, etc.
> >
> > Geert-Jan
> >
> >
> >
> > PSEUDO-CODE of the current low-level (not-so) sequential indexer:
> >
> >        foreach(String sField: fieldsInOrder){
> >                --> add field to FieldInfos and grab newly created
> fieldInfo
> >                --> add fieldInfo to FormatPostingsFieldsWriter and grab
> > formatPostingsTermsConsumer
> >                List<String> termsInOrder =
> termsInOrderForFieldMap(sField);
> >                foreach(String sTerm: termsInOrder){
> >                        FormatPostingsDocsConsumer frq =
> formatPostingsTermsConsumer.add(sTerm);
> >                        List<Integer> docidsInOrderPerFieldTerm =
> > docidsInOrderPerFieldTermMap(sField+"-"+sTerm);
> >                        for(Integer docid:docidsInOrderPerFieldTerm){
> >                                frq.addDoc(docid);
> >                        }
> >                        //close relevant stuff
> >                }
> >                //close relevant stuff
> >        }
> >        //close relevant stuff
> >
> >
> >
> > --
> > View this message in context:
> http://n3.nabble.com/custom-low-level-indexer-to-speed-things-up-when-fields-terms-and-docids-are-in-order-tp576998p576998.html
> > Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [hidden email]<http://n3.nabble.com/user/SendEmail.jtp?type=node&node=676471&i=1>
> > For additional commands, e-mail: [hidden email]<http://n3.nabble.com/user/SendEmail.jtp?type=node&node=676471&i=2>
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]<http://n3.nabble.com/user/SendEmail.jtp?type=node&node=676471&i=3>
> For additional commands, e-mail: [hidden email]<http://n3.nabble.com/user/SendEmail.jtp?type=node&node=676471&i=4>
>
>
>
> ------------------------------
>  View message @
> http://n3.nabble.com/custom-low-level-indexer-to-speed-things-up-when-fields-terms-and-docids-are-in-order-tp576998p676471.html
> To unsubscribe from custom low-level indexer (to speed things up) when
> fields, terms and docids are in order, click here< (link removed) ==>.
>
>
>

-- 
View this message in context: http://n3.nabble.com/custom-low-level-indexer-to-speed-things-up-when-fields-terms-and-docids-are-in-order-tp576998p676678.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.