Posted to dev@lucene.apache.org by John Wang <jo...@gmail.com> on 2009/09/17 16:00:24 UTC

custom segment files

Hi guys:

     I am trying to figure out how to add the ability to create custom segment
files. Hopefully it is possible to create a plugin framework where one can
provide some sort of callback to add to a segment given a doc, and provide
some sort of merge logic. This is in light of the flexible indexing effort.

     After digging through the latest trunk code in that area, I see a
Writer/WriterPerThread pattern for different types of segment files, e.g.
stored data, norms, inverted doc, etc.

     Do you think it is a good idea to consolidate them? Are there
intricacies where there are cross dependencies between different types of
writers?

     Merge logic seems to be in the SegmentMerger class. To do this, it
seems it would be good to separate it out per writer type.

     I am still trying to understand the code, so any help is greatly
appreciated.

Thoughts?

Thanks

-John

Re: custom segment files

Posted by John Wang <jo...@gmail.com>.
Thank you very much Michael for the information!

-John

On Fri, Sep 18, 2009 at 6:01 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> > Say you have a type of field with fixed length data per doc, e.g. a
> > 8 bytes.
>
> OK this makes sense -- thanks for the example!  This sounds like
> getting column-stride-fields before that feature is added to Lucene
> "for real".
>
> For flushing, you can plugin your own indexing chain to IndexWriter.
> This (customizing what's indexed per-doc and what's written for the
> new segment) is exactly what the pluggable indexing chain is for.
> BUT: this API is still very experimental and package private.
>
> I suppose, for looser integration we could add a hook that's called in
> IndexWriter giving you a chance to do something at flush.
> Hmm... actually could you use doAfterFlush()?
>
> Merging, however, doesn't yet have hooks / pluggability in place to do
> something custom, and I agree it's sorely needed.  Patches very
> welcome here!
>
> This could enable "loose" customization on what's flushed and how it's
> merged, and you'd have to make your own reader external to Lucene.
>
> LUCENE-1458 is aiming to cover this sort of use case, but in a more
> tightly integrated way.  EG the new enumeration API in LUCENE-1458 (to
> replace TermEnum, TermDocs, TermPositions) is based on AttributeSource
> so that you could add your own attribute at the field, term, doc or
> positions level.  However I haven't explored this at all yet, and eg
> customizable merging is not done.
>
> > It [flush] probably doesn't need to be final Mike?
>
> I agree.  Wanna include un-final'ing it in a patch?
>
> > Is there a wiki or some sort of write up on LUCENE-1458?
>
> Sorry not just yet.  I agree it's badly needed... it's an enormous set
> of changes at this point.  I'll add a wiki page that I'll try to keep
> current as the design iterates.
>
> Mike
>
> On Thu, Sep 17, 2009 at 8:14 PM, John Wang <jo...@gmail.com> wrote:
> > Sure.
> >
> > A simple example:
> >
> > Say you have a type of field with fixed length data per doc, e.g. a 8
> bytes.
> > It might be good to store in a segment:
> > <numdocs><v1><v2>....<vn>
> >
> > so if you have 1000 docs, your seg file is 8k+4 bytes.
> >
> > Merging would be rather trivial as well.
> >
> > Doing this right now involves storing into payload, which pays a cost of
> > parsing byte[] to say a long per doc.
> >
> > I think this problem is orthogonal to 1458.
> >
> > There are other usecases, so I thought it might be a good idea to
> abstract
> > it out, since on a high level it is rather similar:
> >
> > start
> > write per doc
> > end
> > merge
> >
> > Hopefully I am describing it clearly.
> >
> > Thanks
> >
> > -John
> >
> >
> > On Thu, Sep 17, 2009 at 10:35 PM, Michael McCandless
> > <lu...@mikemccandless.com> wrote:
> >>
> >> I'm actively working on LUCENE-1458, to enable different codecs for
> >> reading/writing the terms dict and doc/freq/prox/payload postings.
> >> I'm working now towards getting PforDelta working...
> >>
> >> However, that change doesn't [yet] do anything for norms, stored
> >> fields nor term vectors.
> >>
> >> Can you describe more details about what kinds of customization you're
> >> looking to do?
> >>
> >> Mike
> >>
> >> On Thu, Sep 17, 2009 at 10:00 AM, John Wang <jo...@gmail.com>
> wrote:
> >> > Hi guys:
> >> >
> >> >      I am trying to figure how to add the ability to create custom
> >> > segment
> >> > files. Hopefully it is possible to create a plugin framework where one
> >> > can
> >> > provide some sort of callback to add to a segment given a doc and
> >> > provide
> >> > some sort of merge logic. This is in light of the flexible indexing
> >> > effort.
> >> >
> >> >      After digging thru the latest trunk code in that area, I see a
> >> > Writer/WriterPerThread pattern for different types of segment files,
> >> > e.g.
> >> > Stored data, norms, inverted doc, etc.
> >> >
> >> >      Do you think it is a good idea to consolidate them? Are there
> >> > intricacies where there are cross dependency between different types
> of
> >> > writers?
> >> >
> >> >      Merge logic seems to be in the SegmentMerger class. Seems to do
> >> > this,
> >> > it would be good to separate it out to per writer type.
> >> >
> >> >       I am still trying to understand the code, any help is greatly
> >> > appreciated.
> >> >
> >> > Thoughts?
> >> >
> >> > Thanks
> >> >
> >> > -John
> >> >
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>
> >
> >
>
>
>

Re: custom segment files

Posted by Michael McCandless <lu...@mikemccandless.com>.
> Say you have a type of field with fixed length data per doc, e.g. a
> 8 bytes.

OK this makes sense -- thanks for the example!  This sounds like
getting column-stride-fields before that feature is added to Lucene
"for real".

For flushing, you can plug in your own indexing chain to IndexWriter.
This (customizing what's indexed per-doc and what's written for the
new segment) is exactly what the pluggable indexing chain is for.
BUT: this API is still very experimental and package private.

I suppose, for looser integration we could add a hook that's called in
IndexWriter giving you a chance to do something at flush.
Hmm... actually could you use doAfterFlush()?

Merging, however, doesn't yet have hooks / pluggability in place to do
something custom, and I agree it's sorely needed.  Patches very
welcome here!

This could enable "loose" customization on what's flushed and how it's
merged, and you'd have to make your own reader external to Lucene.

LUCENE-1458 is aiming to cover this sort of use case, but in a more
tightly integrated way.  E.g., the new enumeration API in LUCENE-1458 (to
replace TermEnum, TermDocs, TermPositions) is based on AttributeSource
so that you could add your own attribute at the field, term, doc or
positions level.  However, I haven't explored this at all yet, and e.g.
customizable merging is not done.

> It [flush] probably doesn't need to be final Mike?

I agree.  Wanna include un-final'ing it in a patch?

> Is there a wiki or some sort of write up on LUCENE-1458?

Sorry not just yet.  I agree it's badly needed... it's an enormous set
of changes at this point.  I'll add a wiki page that I'll try to keep
current as the design iterates.

Mike

On Thu, Sep 17, 2009 at 8:14 PM, John Wang <jo...@gmail.com> wrote:
> Sure.
>
> A simple example:
>
> Say you have a type of field with fixed length data per doc, e.g. a 8 bytes.
> It might be good to store in a segment:
> <numdocs><v1><v2>....<vn>
>
> so if you have 1000 docs, your seg file is 8k+4 bytes.
>
> Merging would be rather trivial as well.
>
> Doing this right now involves storing into payload, which pays a cost of
> parsing byte[] to say a long per doc.
>
> I think this problem is orthogonal to 1458.
>
> There are other usecases, so I thought it might be a good idea to abstract
> it out, since on a high level it is rather similar:
>
> start
> write per doc
> end
> merge
>
> Hopefully I am describing it clearly.
>
> Thanks
>
> -John
>
>
> On Thu, Sep 17, 2009 at 10:35 PM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>>
>> I'm actively working on LUCENE-1458, to enable different codecs for
>> reading/writing the terms dict and doc/freq/prox/payload postings.
>> I'm working now towards getting PforDelta working...
>>
>> However, that change doesn't [yet] do anything for norms, stored
>> fields nor term vectors.
>>
>> Can you describe more details about what kinds of customization you're
>> looking to do?
>>
>> Mike
>>
>> On Thu, Sep 17, 2009 at 10:00 AM, John Wang <jo...@gmail.com> wrote:
>> > Hi guys:
>> >
>> >      I am trying to figure how to add the ability to create custom
>> > segment
>> > files. Hopefully it is possible to create a plugin framework where one
>> > can
>> > provide some sort of callback to add to a segment given a doc and
>> > provide
>> > some sort of merge logic. This is in light of the flexible indexing
>> > effort.
>> >
>> >      After digging thru the latest trunk code in that area, I see a
>> > Writer/WriterPerThread pattern for different types of segment files,
>> > e.g.
>> > Stored data, norms, inverted doc, etc.
>> >
>> >      Do you think it is a good idea to consolidate them? Are there
>> > intricacies where there are cross dependency between different types of
>> > writers?
>> >
>> >      Merge logic seems to be in the SegmentMerger class. Seems to do
>> > this,
>> > it would be good to separate it out to per writer type.
>> >
>> >       I am still trying to understand the code, any help is greatly
>> > appreciated.
>> >
>> > Thoughts?
>> >
>> > Thanks
>> >
>> > -John
>> >
>>
>>
>
>


Re: custom segment files

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Fri, Sep 18, 2009 at 08:14:24AM +0800, John Wang wrote:

> Say you have a type of field with fixed length data per doc, e.g. a 8 bytes.
> It might be good to store in a segment:
> <numdocs><v1><v2>....<vn>

Heh.  You've just described this proof of concept class:

    http://www.rectangular.com/kinosearch/docs/devel/KSx/Index/ByteBufDocWriter.html
    http://www.rectangular.com/svn/kinosearch/trunk/perl/lib/KSx/Index/ByteBufDocWriter.pm

> Hopefully I am describing it clearly.

Sure, I understand exactly what you mean.

Marvin Humphrey



Re: custom segment files

Posted by John Wang <jo...@gmail.com>.
Sure.

A simple example:

Say you have a type of field with fixed-length data per doc, e.g. 8 bytes.
It might be good to store in a segment:
<numdocs><v1><v2>....<vn>

So if you have 1000 docs, your segment file is 8k+4 bytes (8 bytes per doc
plus a 4-byte doc count).

Merging would be rather trivial as well.
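
As a rough illustration of the <numdocs><v1>...<vn> layout above, here is a
minimal Java sketch. It is not Lucene code: the class FixedWidthSegmentFile
and its methods are hypothetical, it works on in-memory byte arrays via
java.nio.ByteBuffer rather than Lucene's Directory API, and the merge assumes
there are no deleted docs, so it simply concatenates the raw values in
segment order.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper (not a Lucene class) for the <numdocs><v1>...<vn> layout. */
final class FixedWidthSegmentFile {

    /** Encodes a 4-byte doc count followed by one 8-byte value per doc. */
    static byte[] write(long[] values) {
        ByteBuffer out = ByteBuffer.allocate(Integer.BYTES + Long.BYTES * values.length);
        out.putInt(values.length);
        for (long value : values) {
            out.putLong(value);
        }
        return out.array();
    }

    /** Decodes a segment produced by write(). */
    static long[] read(byte[] segment) {
        ByteBuffer in = ByteBuffer.wrap(segment);
        long[] values = new long[in.getInt()];
        for (int i = 0; i < values.length; i++) {
            values[i] = in.getLong();
        }
        return values;
    }

    /** Merges segments by summing the counts and copying the value bytes as-is. */
    static byte[] merge(byte[]... segments) {
        int total = 0;
        for (byte[] segment : segments) {
            total += ByteBuffer.wrap(segment).getInt();
        }
        ByteBuffer out = ByteBuffer.allocate(Integer.BYTES + Long.BYTES * total);
        out.putInt(total);
        for (byte[] segment : segments) {
            int count = ByteBuffer.wrap(segment).getInt();
            // Skip each segment's 4-byte header and copy its values unparsed.
            out.put(segment, Integer.BYTES, Long.BYTES * count);
        }
        return out.array();
    }
}
```

With 1000 docs, write() returns 4 + 8 * 1000 = 8004 bytes, which matches the
8k+4 estimate. A real merge would also have to drop deleted docs and renumber,
which this sketch does not attempt.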

Doing this right now involves storing into a payload, which pays the cost of
parsing a byte[] into, say, a long per doc.

I think this problem is orthogonal to 1458.

There are other use cases, so I thought it might be a good idea to abstract
it out, since at a high level they are rather similar:

start
write per doc
end
merge
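
One way to picture the start / write per doc / end / merge lifecycle is as a
small interface. The sketch below is only an assumption about what such a
plugin contract could look like: SegmentPlugin and LongColumnPlugin are
made-up names, nothing here exists in Lucene, and segments are plain in-memory
arrays rather than files.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical plugin contract: V is the per-doc value, S is a finished segment. */
interface SegmentPlugin<V, S> {
    void start();               // begin a new segment
    void writeDoc(V value);     // called once per doc, in doc order
    S end();                    // finish the segment and return it
    S merge(List<S> segments);  // combine finished segments into one
}

/** Example implementation that stores one long per doc. */
final class LongColumnPlugin implements SegmentPlugin<Long, long[]> {
    private List<Long> buffer = new ArrayList<>();

    @Override
    public void start() {
        buffer = new ArrayList<>();
    }

    @Override
    public void writeDoc(Long value) {
        buffer.add(value);
    }

    @Override
    public long[] end() {
        long[] segment = new long[buffer.size()];
        for (int i = 0; i < segment.length; i++) {
            segment[i] = buffer.get(i);
        }
        buffer = new ArrayList<>();
        return segment;
    }

    @Override
    public long[] merge(List<long[]> segments) {
        int total = 0;
        for (long[] segment : segments) {
            total += segment.length;
        }
        long[] merged = new long[total];
        int offset = 0;
        // Concatenate in segment order; deleted docs are not handled here.
        for (long[] segment : segments) {
            System.arraycopy(segment, 0, merged, offset, segment.length);
            offset += segment.length;
        }
        return merged;
    }
}
```

Keeping merge on the same interface as the per-doc callbacks means each writer
type owns its own merge logic, which is the separation asked about at the
start of this thread.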

Hopefully I am describing it clearly.

Thanks

-John


On Thu, Sep 17, 2009 at 10:35 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> I'm actively working on LUCENE-1458, to enable different codecs for
> reading/writing the terms dict and doc/freq/prox/payload postings.
> I'm working now towards getting PforDelta working...
>
> However, that change doesn't [yet] do anything for norms, stored
> fields nor term vectors.
>
> Can you describe more details about what kinds of customization you're
> looking to do?
>
> Mike
>
> On Thu, Sep 17, 2009 at 10:00 AM, John Wang <jo...@gmail.com> wrote:
> > Hi guys:
> >
> >      I am trying to figure how to add the ability to create custom
> segment
> > files. Hopefully it is possible to create a plugin framework where one
> can
> > provide some sort of callback to add to a segment given a doc and provide
> > some sort of merge logic. This is in light of the flexible indexing
> effort.
> >
> >      After digging thru the latest trunk code in that area, I see a
> > Writer/WriterPerThread pattern for different types of segment files, e.g.
> > Stored data, norms, inverted doc, etc.
> >
> >      Do you think it is a good idea to consolidate them? Are there
> > intricacies where there are cross dependency between different types of
> > writers?
> >
> >      Merge logic seems to be in the SegmentMerger class. Seems to do
> this,
> > it would be good to separate it out to per writer type.
> >
> >       I am still trying to understand the code, any help is greatly
> > appreciated.
> >
> > Thoughts?
> >
> > Thanks
> >
> > -John
> >
>
>
>

Re: custom segment files

Posted by Earwin Burrfoot <ea...@gmail.com>.
I bet custom per-segment files could very well be used for per-segment
userdata/debuginfo we introduced earlier.
With them it could be stored neatly in a separate file instead of
being grafted onto current ones.

On Thu, Sep 17, 2009 at 18:35, Michael McCandless
<lu...@mikemccandless.com> wrote:
> I'm actively working on LUCENE-1458, to enable different codecs for
> reading/writing the terms dict and doc/freq/prox/payload postings.
> I'm working now towards getting PforDelta working...
>
> However, that change doesn't [yet] do anything for norms, stored
> fields nor term vectors.
>
> Can you describe more details about what kinds of customization you're
> looking to do?
>
> Mike
>
> On Thu, Sep 17, 2009 at 10:00 AM, John Wang <jo...@gmail.com> wrote:
>> Hi guys:
>>
>>      I am trying to figure how to add the ability to create custom segment
>> files. Hopefully it is possible to create a plugin framework where one can
>> provide some sort of callback to add to a segment given a doc and provide
>> some sort of merge logic. This is in light of the flexible indexing effort.
>>
>>      After digging thru the latest trunk code in that area, I see a
>> Writer/WriterPerThread pattern for different types of segment files, e.g.
>> Stored data, norms, inverted doc, etc.
>>
>>      Do you think it is a good idea to consolidate them? Are there
>> intricacies where there are cross dependency between different types of
>> writers?
>>
>>      Merge logic seems to be in the SegmentMerger class. Seems to do this,
>> it would be good to separate it out to per writer type.
>>
>>       I am still trying to understand the code, any help is greatly
>> appreciated.
>>
>> Thoughts?
>>
>> Thanks
>>
>> -John
>>
>
>
>



-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785


Re: custom segment files

Posted by Michael McCandless <lu...@mikemccandless.com>.
I'm actively working on LUCENE-1458, to enable different codecs for
reading/writing the terms dict and doc/freq/prox/payload postings.
I'm working now towards getting PforDelta working...

However, that change doesn't [yet] do anything for norms, stored
fields nor term vectors.

Can you describe more details about what kinds of customization you're
looking to do?

Mike

On Thu, Sep 17, 2009 at 10:00 AM, John Wang <jo...@gmail.com> wrote:
> Hi guys:
>
>      I am trying to figure how to add the ability to create custom segment
> files. Hopefully it is possible to create a plugin framework where one can
> provide some sort of callback to add to a segment given a doc and provide
> some sort of merge logic. This is in light of the flexible indexing effort.
>
>      After digging thru the latest trunk code in that area, I see a
> Writer/WriterPerThread pattern for different types of segment files, e.g.
> Stored data, norms, inverted doc, etc.
>
>      Do you think it is a good idea to consolidate them? Are there
> intricacies where there are cross dependency between different types of
> writers?
>
>      Merge logic seems to be in the SegmentMerger class. Seems to do this,
> it would be good to separate it out to per writer type.
>
>       I am still trying to understand the code, any help is greatly
> appreciated.
>
> Thoughts?
>
> Thanks
>
> -John
>


Re: custom segment files

Posted by Jason Rutherglen <ja...@gmail.com>.
Yes, I guess you could branch the code?  It probably doesn't need to
be final, Mike?

On Thu, Sep 17, 2009 at 7:16 PM, John Wang <jo...@gmail.com> wrote:
> Hi Michael:
>
>      Is there a wiki or some sort of write up on LUCENE-1458? It looks
> extremely cool!
>
> Re: Jason: isn't flush final?
>
> -John
>
> On Fri, Sep 18, 2009 at 9:09 AM, Jason Rutherglen
> <ja...@gmail.com> wrote:
>>
>> I believe you could override the IW.flush and IW.mergeSuccess
>> methods. flush unfortunately doesn't expose the new SegmentInfo,
>> however it could be obtained via
>> IW.getReader().getSequentialSubReaders (by comparing the before
>> and after).
>>
>> Adjacent segment files could then be maintained without hacking into
>> SegmentMerger.
>>
>> On Thu, Sep 17, 2009 at 7:00 AM, John Wang <jo...@gmail.com> wrote:
>> > Hi guys:
>> >
>> >      I am trying to figure how to add the ability to create custom
>> > segment
>> > files. Hopefully it is possible to create a plugin framework where one
>> > can
>> > provide some sort of callback to add to a segment given a doc and
>> > provide
>> > some sort of merge logic. This is in light of the flexible indexing
>> > effort.
>> >
>> >      After digging thru the latest trunk code in that area, I see a
>> > Writer/WriterPerThread pattern for different types of segment files,
>> > e.g.
>> > Stored data, norms, inverted doc, etc.
>> >
>> >      Do you think it is a good idea to consolidate them? Are there
>> > intricacies where there are cross dependency between different types of
>> > writers?
>> >
>> >      Merge logic seems to be in the SegmentMerger class. Seems to do
>> > this,
>> > it would be good to separate it out to per writer type.
>> >
>> >       I am still trying to understand the code, any help is greatly
>> > appreciated.
>> >
>> > Thoughts?
>> >
>> > Thanks
>> >
>> > -John
>> >
>>
>>
>
>


Re: custom segment files

Posted by John Wang <jo...@gmail.com>.
Hi Michael:

     Is there a wiki or some sort of write up on LUCENE-1458? It looks
extremely cool!

Re: Jason: isn't flush final?

-John

On Fri, Sep 18, 2009 at 9:09 AM, Jason Rutherglen <
jason.rutherglen@gmail.com> wrote:

> I believe you could override the IW.flush and IW.mergeSuccess
> methods. flush unfortunately doesn't expose the new SegmentInfo,
> however it could be obtained via
> IW.getReader().getSequentialSubReaders (by comparing the before
> and after).
>
> Adjacent segment files could then be maintained without hacking into
> SegmentMerger.
>
> On Thu, Sep 17, 2009 at 7:00 AM, John Wang <jo...@gmail.com> wrote:
> > Hi guys:
> >
> >      I am trying to figure how to add the ability to create custom
> segment
> > files. Hopefully it is possible to create a plugin framework where one
> can
> > provide some sort of callback to add to a segment given a doc and provide
> > some sort of merge logic. This is in light of the flexible indexing
> effort.
> >
> >      After digging thru the latest trunk code in that area, I see a
> > Writer/WriterPerThread pattern for different types of segment files, e.g.
> > Stored data, norms, inverted doc, etc.
> >
> >      Do you think it is a good idea to consolidate them? Are there
> > intricacies where there are cross dependency between different types of
> > writers?
> >
> >      Merge logic seems to be in the SegmentMerger class. Seems to do
> this,
> > it would be good to separate it out to per writer type.
> >
> >       I am still trying to understand the code, any help is greatly
> > appreciated.
> >
> > Thoughts?
> >
> > Thanks
> >
> > -John
> >
>
>
>

Re: custom segment files

Posted by Jason Rutherglen <ja...@gmail.com>.
I believe you could override the IW.flush and IW.mergeSuccess
methods. Unfortunately, flush doesn't expose the new SegmentInfo;
however, it could be obtained via
IW.getReader().getSequentialSubReaders (by comparing the before
and after).

Adjacent segment files could then be maintained without hacking into
SegmentMerger.
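
The "comparing the before and after" step could be sketched roughly as below.
This is only an illustration: SegmentDiff is a hypothetical helper, segments
are represented by plain name strings, and how those names would be collected
from IW.getReader().getSequentialSubReaders() is deliberately left out.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper (not a Lucene class) that finds segments added by a flush. */
final class SegmentDiff {

    /** Returns the names present after the flush but not before, in order. */
    static Set<String> newSegments(List<String> before, List<String> after) {
        Set<String> added = new LinkedHashSet<>(after);
        added.removeAll(before);
        return added;
    }
}
```

Calling newSegments with the names captured before and after a flush would
yield the segments that need an adjacent custom file written for them.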

On Thu, Sep 17, 2009 at 7:00 AM, John Wang <jo...@gmail.com> wrote:
> Hi guys:
>
>      I am trying to figure how to add the ability to create custom segment
> files. Hopefully it is possible to create a plugin framework where one can
> provide some sort of callback to add to a segment given a doc and provide
> some sort of merge logic. This is in light of the flexible indexing effort.
>
>      After digging thru the latest trunk code in that area, I see a
> Writer/WriterPerThread pattern for different types of segment files, e.g.
> Stored data, norms, inverted doc, etc.
>
>      Do you think it is a good idea to consolidate them? Are there
> intricacies where there are cross dependency between different types of
> writers?
>
>      Merge logic seems to be in the SegmentMerger class. Seems to do this,
> it would be good to separate it out to per writer type.
>
>       I am still trying to understand the code, any help is greatly
> appreciated.
>
> Thoughts?
>
> Thanks
>
> -John
>
