Posted to dev@lucene.apache.org by Shai Erera <se...@gmail.com> on 2009/04/11 17:10:19 UTC

(Benchmark) Split DocMaker into DocCollector and DocMaker

Hi

I would like to propose some refactoring to the benchmark package. Today,
DocMaker has two roles: collecting documents from a collection and preparing
a Document object. I think these two roles should be split into DocCollector
and DocMaker, with DocMaker using a DocCollector instance.

DocCollector will take over the collection-side methods of DocMaker, such as
getNextDocData and the tracking of raw size in bytes. This can actually fit
well w/ 1591, by having a basic DocCollector that offers input stream
services and wraps a file (for example) with a bzip or gzip stream.

DocMaker will implement the makeDocument methods, reusing DocState etc.
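
To make the proposed split concrete, here is a minimal sketch of the two
roles. It is only an illustration, not the actual patch: the interface
shapes, LineDocCollector, SimpleDocMaker and the use of a Map in place of a
Lucene Document are all assumptions made so that the example is
self-contained.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SplitSketch {

    /** Raw content of one document, before any indexing decisions are made. */
    static final class DocData {
        String title;
        String body;
    }

    /** Hypothetical collector role: only knows how to pull raw content. */
    interface DocCollector {
        /** Returns the next raw document, or null when the collection is exhausted. */
        DocData getNextDocData();
    }

    /** Hypothetical maker role: turns raw content into an indexable document. */
    interface DocMaker {
        /** Returns the next document, or null when the collector is exhausted. */
        Map<String, String> makeDocument();
    }

    /** Collector over in-memory lines of the form "title<TAB>body" (illustrative only). */
    static final class LineDocCollector implements DocCollector {
        private final Iterator<String> lines;

        LineDocCollector(List<String> lines) {
            this.lines = lines.iterator();
        }

        public DocData getNextDocData() {
            if (!lines.hasNext()) {
                return null;
            }
            String[] parts = lines.next().split("\t", 2);
            DocData dd = new DocData();
            dd.title = parts[0];
            dd.body = parts.length > 1 ? parts[1] : "";
            return dd;
        }
    }

    /** Maker that works with any collector; a Map stands in for a Lucene Document. */
    static final class SimpleDocMaker implements DocMaker {
        private final DocCollector collector;

        SimpleDocMaker(DocCollector collector) {
            this.collector = collector;
        }

        public Map<String, String> makeDocument() {
            DocData dd = collector.getNextDocData();
            if (dd == null) {
                return null;
            }
            Map<String, String> doc = new LinkedHashMap<>();
            doc.put("title", dd.title);
            doc.put("body", dd.body);
            return doc;
        }
    }

    public static void main(String[] args) {
        DocMaker maker = new SimpleDocMaker(
            new LineDocCollector(Arrays.asList("First\thello world", "Second\tmore text")));
        for (Map<String, String> doc = maker.makeDocument(); doc != null; doc = maker.makeDocument()) {
            System.out.println(doc);
        }
    }
}
```

Swapping in a different collector (Enwiki, Reuters, TREC) would leave
SimpleDocMaker untouched, and a different maker could be handed the same
collector.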

The idea is that collecting the Enwiki documents, for example, should be the
same whether I create documents using DocState, add payloads or index
additional metadata. Same goes for Trec and Reuters collections, as well as
LineDocMaker.
In fact, if one inspects EnwikiDocMaker and LineDocMaker closely, they are
99% the same and 99% different. Most of their differences lie in the way
they read the data, while most of the similarity lies in the way they create
documents (using DocState).
That led to a somewhat bizarre extension of LineDocMaker by EnwikiDocMaker
(just for the reuse of DocState). Also, other DocMakers do not use that
DocState today.

So by having an EnwikiDocCollector, ReutersDocCollector and others (TREC,
Line, Simple), I can write several DocMakers, such as DocStateMaker,
ConfigurableDocMaker (one which accepts all kinds of config options) and
custom DocMakers (payload, facets, sorting), passing to them a DocCollector
instance (much like we do today w/ DocMaker). That way I can reuse the same
DocMaking algorithm with many document collections, as well as the same
document collection algorithm with many DocMaker implementations.

This will also give us the opportunity to perf test document collection
alone (i.e., compare bzip, gzip and regular input streams), w/o the overhead
of creating a Document object.
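
As a rough illustration of measuring collection alone, the sketch below
drains the same content through a plain stream and a gzip stream without
building any documents. The class and method names are made up for the
example, the data is synthetic and held in memory, and the timing has no
warm-up, so the numbers are only indicative. bzip is left out because the JDK
has no built-in bzip stream.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CollectionOnlyBench {

    /** Reads every line from the stream and returns the line count; no documents are built. */
    static int drain(InputStream in) throws IOException {
        int count = 0;
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            while (reader.readLine() != null) {
                count++;
            }
        }
        return count;
    }

    /** Compresses the given bytes with gzip, entirely in memory. */
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(raw);
        }
        return bytes.toByteArray();
    }

    /** Builds n synthetic one-line documents of the form "title<TAB>body". */
    static byte[] sampleLines(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append("title ").append(i).append("\tsome body text\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] plain = sampleLines(100_000);
        byte[] zipped = gzip(plain);

        long t0 = System.nanoTime();
        int plainDocs = drain(new ByteArrayInputStream(plain));
        long t1 = System.nanoTime();
        int gzipDocs = drain(new GZIPInputStream(new ByteArrayInputStream(zipped)));
        long t2 = System.nanoTime();

        System.out.printf("plain: %d docs in %d ms%n", plainDocs, (t1 - t0) / 1_000_000);
        System.out.printf("gzip:  %d docs in %d ms%n", gzipDocs, (t2 - t1) / 1_000_000);
    }
}
```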

I've already done so in my code environment (I extend the benchmark package
for my application's purposes) and I like the flexibility I have. I think
this can be a nice contribution to the benchmark package, which can result
in some code cleanup as well.

What do you think? I can open an issue and work out a patch.

Shai

Re: (Benchmark) Split DocMaker into DocCollector and DocMaker

Posted by Shai Erera <se...@gmail.com>.
ContentSource is also a good name. I was thinking that its main API will be
getNextDocData (which already exists today) and will return DocData (as it
does today). Then BasicDocMaker or DocStateDocMaker will translate it into
DocState.
getNextDocData will receive a DocData object to reuse (something which
doesn't happen today), and the DocMaker will decide whether to manage the
DocData instance per thread or not, just like it does today.
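
A small sketch of what that reuse could look like, under my own assumptions:
ListContentSource and ReusingDocMaker are hypothetical names, the String
returned by makeDocument stands in for a real Document, and keeping one
DocData per thread via ThreadLocal is just one possible policy for the maker.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ReuseSketch {

    /** Raw content holder that is filled in place on every call. */
    static final class DocData {
        String title;
        String body;
    }

    /** Hypothetical source API: fills the caller-supplied DocData instead of allocating. */
    interface ContentSource {
        /** Fills and returns reuse, or returns null when there is no more content. */
        DocData getNextDocData(DocData reuse);
    }

    /** Source over an in-memory list of bodies (illustrative only). */
    static final class ListContentSource implements ContentSource {
        private final Iterator<String> bodies;
        private int id = 0;

        ListContentSource(List<String> bodies) {
            this.bodies = bodies.iterator();
        }

        // synchronized so that several indexing threads can share one source
        public synchronized DocData getNextDocData(DocData reuse) {
            if (!bodies.hasNext()) {
                return null;
            }
            reuse.title = "doc" + id++;
            reuse.body = bodies.next();
            return reuse;
        }
    }

    /** The maker owns the reuse policy; here it keeps one DocData per thread. */
    static final class ReusingDocMaker {
        private final ContentSource source;
        private final ThreadLocal<DocData> perThread = ThreadLocal.withInitial(DocData::new);

        ReusingDocMaker(ContentSource source) {
            this.source = source;
        }

        /** Returns "title: body" as a stand-in for a real Document, or null at end of content. */
        String makeDocument() {
            DocData dd = source.getNextDocData(perThread.get());
            return dd == null ? null : dd.title + ": " + dd.body;
        }
    }

    public static void main(String[] args) {
        ReusingDocMaker maker =
            new ReusingDocMaker(new ListContentSource(Arrays.asList("alpha", "beta")));
        for (String doc = maker.makeDocument(); doc != null; doc = maker.makeDocument()) {
            System.out.println(doc);
        }
    }
}
```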

I will open an issue and start to work on a patch. We can then iterate on
it.

On Sat, Apr 11, 2009 at 6:21 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> Sounds great!
>
> As long as LineDocMaker still has very low overhead :)
>
> But how about the name RawContentSource (or maybe ContentSource)
> instead of DocCollector? I.e., it's the thing that pulls raw content
> from somewhere, and then DocMaker creates documents from it?
>
> Mike

Re: (Benchmark) Split DocMaker into DocCollector and DocMaker

Posted by Michael McCandless <lu...@mikemccandless.com>.
Sounds great!

As long as LineDocMaker still has very low overhead :)

But how about the name RawContentSource (or maybe ContentSource)
instead of DocCollector? I.e., it's the thing that pulls raw content
from somewhere, and then DocMaker creates documents from it?

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org