Posted to user@hbase.apache.org by Jason Huang <ja...@icare.com> on 2012/09/20 21:38:30 UTC

Status of HBASE-3529 (Add search to HBase)?

Hello,

I am interested in learning about the possibility of integrating Lucene &
HBase. Google search points me to HBASE-3529 (Add search to HBase).

This project is currently listed as "patch available" but
"unresolved". Does that mean there are reported bugs in the patch that
haven't been resolved yet?  It appears that no one has actively worked
on this project for a while. Does anyone on this mailing list know the
most recent status?

thanks!

Jason

Re: Status of HBASE-3529 (Add search to HBase)?

Posted by Stack <st...@duboce.net>.

On Thu, Sep 20, 2012 at 9:00 PM, Andrew Purtell <ap...@apache.org> wrote:
> Data wouldn't go in ES, just the index. For us a generic indexing service
> may make sense, but hey, for others maybe not.
>

That could be so.  In my experience it starts out that way, and then
you start adding more and more data so the query results are more
useful on return, and by the time you are done, your index is bigger
than the original dataset.

> I'm not arguing against the idea of Lucene indexes in HBase, just pointing
> out the issues with what's been done so far. Maybe the Lucene 4 APIs or the
> blur stuff is a way forward.
>

Jason!

St.Ack

Re: Status of HBASE-3529 (Add search to HBase)?

Posted by Andrew Purtell <ap...@apache.org>.

Data wouldn't go in ES, just the index. For us a generic indexing service
may make sense, but hey, for others maybe not.

I'm not arguing against the idea of Lucene indexes in HBase, just pointing
out the issues with what's been done so far. Maybe the Lucene 4 APIs or the
blur stuff is a way forward.

On Thursday, September 20, 2012, Stack wrote:

> On Thu, Sep 20, 2012 at 6:51 PM, Andrew Purtell <apurtell@apache.org>
> wrote:
> > But, what stopped progress here is a veto of HDFS side changes needed for
> > the implementation to get that performance.
> >
>
> We could have another go and even do it ourselves if enough of us
> thought it worth it.
>
> >> If we're just
> >> rebuilding ElasticSearch, wouldn't a simple Coprocessor connector that
> >> managed communication with ES be simpler and more performant?
> >
>
> Why have your data in two places if you could avoid it?
>
> St.Ack
>

-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: Status of HBASE-3529 (Add search to HBase)?

Posted by Stack <st...@duboce.net>.

On Thu, Sep 20, 2012 at 6:51 PM, Andrew Purtell <ap...@apache.org> wrote:
> But, what stopped progress here is a veto of HDFS side changes needed for
> the implementation to get that performance.
>

We could have another go and even do it ourselves if enough of us
thought it worth it.

>> If we're just
>> rebuilding ElasticSearch, wouldn't a simple Coprocessor connector that
>> managed communication with ES be simpler and more performant?
>

Why have your data in two places if you could avoid it?

St.Ack

Re: Status of HBASE-3529 (Add search to HBase)?

Posted by Andrew Purtell <ap...@apache.org>.

On Thu, Sep 20, 2012 at 7:00 PM, Andrew Purtell <ap...@apache.org> wrote:
> On Thu, Sep 20, 2012 at 6:59 PM, Jacques <wh...@gmail.com> wrote:
>>> Cool. Is it open sourced anywhere do you know?
>>>
>> https://github.com/nearinfinity/blur
>> http://incubator.apache.org/projects/blur.html
>
> Thanks!

Here, I think: https://github.com/nearinfinity/blur/tree/master/src/blur-store/src/main/java/com/nearinfinity/blur/store/hdfs

Also, in the ASF incubator:
https://git-wip-us.apache.org/repos/asf?p=incubator-blur.git;a=summary

Looks to be using a modified version of Lucene 3.6?

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet
Hein (via Tom White)

Re: Status of HBASE-3529 (Add search to HBase)?

Posted by Andrew Purtell <ap...@apache.org>.

On Thu, Sep 20, 2012 at 6:59 PM, Jacques <wh...@gmail.com> wrote:
>> Cool. Is it open sourced anywhere do you know?
>>
> https://github.com/nearinfinity/blur
> http://incubator.apache.org/projects/blur.html

Thanks!

-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet
Hein (via Tom White)

Re: Status of HBASE-3529 (Add search to HBase)?

Posted by Jacques <wh...@gmail.com>.

>
>
>
> Cool. Is it open sourced anywhere do you know?
>


https://github.com/nearinfinity/blur
http://incubator.apache.org/projects/blur.html

Re: Status of HBASE-3529 (Add search to HBase)?

Posted by Andrew Purtell <ap...@apache.org>.

On Thursday, September 20, 2012, Jacques wrote:

> The reason I mentioned Blur.io is I thought they implemented a
> CodecProvider that was built for write-once HDFS.


Cool. Is it open sourced anywhere do you know?


> The whole layer violation problem is all about performance.  That is the
> big question I think people need to seriously ask themselves: does their
> particular use case allow substantially poorer performance than a local
> index for HDFS benefits?


That is a good question.

But, what stopped progress here is a veto of HDFS side changes needed for
the implementation to get that performance.

> If we're just
> rebuilding ElasticSearch, wouldn't a simple Coprocessor connector that
> managed communication with ES be simpler and more performant?


This is exactly what I recommended one of our internal groups pursue: use
of ES as an indexing service, not just for HBase data (hooked up via CPs)
but also for any app that would like to use it directly.

> Photobucket's Solbase is also an option if you front it with caching,
> maintain large stop lists and don't get beyond 50-100 million docs.
>
>
>
> On Thu, Sep 20, 2012 at 5:28 PM, Andrew Purtell <apurtell@apache.org>
> wrote:
>
> > I like the approach of building Lucene indexes for HBase data via a
> > coprocessor. However, the requirement (for good performance) of mmap
> > of HDFS blocks from the local filesystem, presupposing regionserver
> > and datanode colocation, presupposing short circuit local access,
> > presupposing an HDFS API modification (that was vetoed), is at issue
> > here. It seems we have to do something else. How can HBase provide
> > index data to Lucene such that it isn't a massive layering violation?
> > Maybe the Lucene 4 Codec and CodecProvider interfaces? (I'm not all
> > that familiar with Lucene internals, so big caveat there.)
> >
> > Indeed Jason put a lot of work into the HBASE-3529 patch, and it is a
> > shame we couldn't commit the result.
> >
> >
> > On Thu, Sep 20, 2012 at 5:15 PM, Otis Gospodnetic
> > <otis.gospodnetic@gmail.com> wrote:
> > > I agree with Stack.  I liked that whole approach and it's a shame it
> > > didn't get committed after all the work Jason put into it.
> > >
> > > Otis
> > > Search Analytics - http://sematext.com/search-analytics/index.html
> > > Performance Monitoring - http://sematext.com/spm/index.html
> > >
> > >
> > > On Thu, Sep 20, 2012 at 5:32 PM, Stack <stack@duboce.net> wrote:
> > >> On Thu, Sep 20, 2012 at 12:43 PM, Andrew Purtell <apurtell@apache.org> wrote:
> > >>> The issue with the patch on HBASE-3529 is it relies on modifications
> > >>> to HDFS that the author of HBASE-3529 proposed to the HDFS project as
> > >>> https://issues.apache.org/jira/browse/HDFS-2004. The proposal was
> > >>> vetoed. Therefore, further progress on HBASE-3529 as currently
> > >>> implemented is not possible.
> > >>>
> > >>
> > >> Jason's approach had much merit (IMO).  It warrants study at least.
> > >>
> > >> Though the indices were written to HDFS, Jason had it so lucene was
> > >> getting local filesystem access by going via the local read
> > >> short-circuit facility [1].  Being able to do this made it so he got
> > >> close to native speeds querying the "HDFS-based" indices.  When Jason
> > >> left it -- he had to get a real job unfortunately -- he was blocked on
> > >> what to do when a region moved.  He wanted to be able to
> > >> immediately pull the indices local on region reopen.  The HDFS fellas
> > >> who commented in the issue cited by Andrew above thought it a little
> > >> dodgy adding API for this special case.
> > >>
> > >> If you wanted to follow in Jason's footsteps, let's chat.
> > >> St.Ack
> > >>
> > >> 1. http://hbase.apache.org/book.html#perf.hdfs.configs
> >
> >
> >
> > --
> > Best regards,
> >
> >    - Andy
> >
> > Problems worthy of attack prove their worth by hitting back. - Piet
> > Hein (via Tom White)
> >
>


-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: Status of HBASE-3529 (Add search to HBase)?

Posted by Jacques <wh...@gmail.com>.

The reason I mentioned Blur.io is that I thought they implemented a
CodecProvider built for write-once HDFS.

The whole layering violation problem is all about performance.  That is the
big question I think people need to seriously ask themselves: does their
particular use case tolerate substantially poorer performance than a local
index in exchange for the benefits of HDFS?  Would ElasticSearch better
solve their problems?  Many companies I've talked to use SSDs to achieve
their required QPS.  If local disks won't work and HDFS adds another couple
of layers of abstraction, I'm not sure what a Lucene integration will
provide without the localization that Jason originally expected.

Don't get me wrong, I think this will happen.  We've been exploring various
solutions.  I just think that, as with many things, clear target use cases
must be determined so that the feature satisfies someone.  If we're just
rebuilding ElasticSearch, wouldn't a simple Coprocessor connector that
managed communication with ES be simpler and more performant?
Photobucket's Solbase is also an option if you front it with caching,
maintain large stop lists and don't get beyond 50-100 million docs.
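
To make the "connector" idea concrete, here is a minimal sketch in plain
Java. It is an illustration under assumptions, not working HBase or
ElasticSearch code: IndexService, InMemoryIndex and Table are hypothetical
stand-ins (the real thing would be a coprocessor hook talking to an ES
client), and updates and deletes are ignored. The point it shows is only
that the index holds terms and row keys while the data stays in the table.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class IndexConnectorSketch {

    /** Hypothetical stand-in for an external index service such as ES. */
    public interface IndexService {
        void index(String rowKey, String text);
        Set<String> search(String term);
    }

    /** Toy inverted index: term -> row keys. It never stores the row data. */
    public static class InMemoryIndex implements IndexService {
        private final Map<String, Set<String>> postings = new HashMap<>();

        @Override
        public void index(String rowKey, String text) {
            for (String term : text.toLowerCase().split("\\W+")) {
                if (!term.isEmpty()) {
                    postings.computeIfAbsent(term, t -> new TreeSet<>()).add(rowKey);
                }
            }
        }

        @Override
        public Set<String> search(String term) {
            return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
        }
    }

    /** Hypothetical table whose put() calls a hook, loosely like a coprocessor would. */
    public static class Table {
        private final Map<String, String> rows = new TreeMap<>();
        private final IndexService index;

        public Table(IndexService index) {
            this.index = index;
        }

        public void put(String rowKey, String value) {
            rows.put(rowKey, value);      // the data lives only here
            index.index(rowKey, value);   // "post-put" hook: forward to the index
        }

        public String get(String rowKey) {
            return rows.get(rowKey);
        }
    }

    public static void main(String[] args) {
        InMemoryIndex index = new InMemoryIndex();
        Table table = new Table(index);
        table.put("row1", "Lucene indexes on HDFS");
        table.put("row2", "HBase coprocessor hooks");
        table.put("row3", "Lucene and HBase together");

        // A search returns row keys; the values are then fetched from the table.
        for (String rowKey : index.search("lucene")) {
            System.out.println(rowKey + " -> " + table.get(rowKey));
        }
    }
}
```

Every search hit costs a second lookup against the table. That is the
trade-off for not keeping the data in two places.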

On Thu, Sep 20, 2012 at 5:28 PM, Andrew Purtell <ap...@apache.org> wrote:

> I like the approach of building Lucene indexes for HBase data via a
> coprocessor. However, the requirement (for good performance) of mmap
> of HDFS blocks from the local filesystem, presupposing regionserver
> and datanode colocation, presupposing short circuit local access,
> presupposing an HDFS API modification (that was vetoed), is at issue
> here. It seems we have to do something else. How can HBase provide
> index data to Lucene such that it isn't a massive layering violation?
> Maybe the Lucene 4 Codec and CodecProvider interfaces? (I'm not all
> that familiar with Lucene internals, so big caveat there.)
>
> Indeed Jason put a lot of work into the HBASE-3529 patch, and it is a
> shame we couldn't commit the result.
>
>
> On Thu, Sep 20, 2012 at 5:15 PM, Otis Gospodnetic
> <ot...@gmail.com> wrote:
> > I agree with Stack.  I liked that whole approach and it's a shame it
> > didn't get committed after all the work Jason put into it.
> >
> > Otis
> > Search Analytics - http://sematext.com/search-analytics/index.html
> > Performance Monitoring - http://sematext.com/spm/index.html
> >
> >
> > On Thu, Sep 20, 2012 at 5:32 PM, Stack <st...@duboce.net> wrote:
> >> On Thu, Sep 20, 2012 at 12:43 PM, Andrew Purtell <ap...@apache.org> wrote:
> >>> The issue with the patch on HBASE-3529 is it relies on modifications
> >>> to HDFS that the author of HBASE-3529 proposed to the HDFS project as
> >>> https://issues.apache.org/jira/browse/HDFS-2004. The proposal was
> >>> vetoed. Therefore, further progress on HBASE-3529 as currently
> >>> implemented is not possible.
> >>>
> >>
> >> Jason's approach had much merit (IMO).  It warrants study at least.
> >>
> >> Though the indices were written to HDFS, Jason had it so lucene was
> >> getting local filesystem access by going via the local read
> >> short-circuit facility [1].  Being able to do this made it so he got
> >> close to native speeds querying the "HDFS-based" indices.  When Jason
> >> left it -- he had to get a real job unfortunately -- he was blocked on
> >> what to do when a region moved.  He wanted to be able to
> >> immediately pull the indices local on region reopen.  The HDFS fellas
> >> who commented in the issue cited by Andrew above thought it a little
> >> dodgy adding API for this special case.
> >>
> >> If you wanted to follow in Jason's footsteps, let's chat.
> >> St.Ack
> >>
> >> 1. http://hbase.apache.org/book.html#perf.hdfs.configs
>
>
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet
> Hein (via Tom White)
>

Re: Status of HBASE-3529 (Add search to HBase)?

Posted by Andrew Purtell <ap...@apache.org>.

I like the approach of building Lucene indexes for HBase data via a
coprocessor. However, the requirement (for good performance) of mmap
of HDFS blocks from the local filesystem, presupposing regionserver
and datanode colocation, presupposing short circuit local access,
presupposing an HDFS API modification (that was vetoed), is at issue
here. It seems we have to do something else. How can HBase provide
index data to Lucene such that it isn't a massive layering violation?
Maybe the Lucene 4 Codec and CodecProvider interfaces? (I'm not all
that familiar with Lucene internals, so big caveat there.)

Indeed Jason put a lot of work into the HBASE-3529 patch, and it is a
shame we couldn't commit the result.

On Thu, Sep 20, 2012 at 5:15 PM, Otis Gospodnetic
<ot...@gmail.com> wrote:
> I agree with Stack.  I liked that whole approach and it's a shame it
> didn't get committed after all the work Jason put into it.
>
> Otis
> Search Analytics - http://sematext.com/search-analytics/index.html
> Performance Monitoring - http://sematext.com/spm/index.html
>
>
> On Thu, Sep 20, 2012 at 5:32 PM, Stack <st...@duboce.net> wrote:
>> On Thu, Sep 20, 2012 at 12:43 PM, Andrew Purtell <ap...@apache.org> wrote:
>>> The issue with the patch on HBASE-3529 is it relies on modifications
>>> to HDFS that the author of HBASE-3529 proposed to the HDFS project as
>>> https://issues.apache.org/jira/browse/HDFS-2004. The proposal was
>>> vetoed. Therefore, further progress on HBASE-3529 as currently
>>> implemented is not possible.
>>>
>>
>> Jason's approach had much merit (IMO).  It warrants study at least.
>>
>> Though the indices were written to HDFS, Jason had it so lucene was
>> getting local filesystem access by going via the local read
>> short-circuit facility [1].  Being able to do this made it so he got
>> close to native speeds querying the "HDFS-based" indices.  When Jason
>> left it -- he had to get a real job unfortunately -- he was blocked on
>> what to do when a region moved.  He wanted to be able to
>> immediately pull the indices local on region reopen.  The HDFS fellas
>> who commented in the issue cited by Andrew above thought it a little
>> dodgy adding API for this special case.
>>
>> If you wanted to follow in Jason's footsteps, let's chat.
>> St.Ack
>>
>> 1. http://hbase.apache.org/book.html#perf.hdfs.configs

-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet
Hein (via Tom White)

Re: Status of HBASE-3529 (Add search to HBase)?

Posted by Otis Gospodnetic <ot...@gmail.com>.

I agree with Stack.  I liked that whole approach and it's a shame it
didn't get committed after all the work Jason put into it.

Otis
Search Analytics - http://sematext.com/search-analytics/index.html
Performance Monitoring - http://sematext.com/spm/index.html


On Thu, Sep 20, 2012 at 5:32 PM, Stack <st...@duboce.net> wrote:
> On Thu, Sep 20, 2012 at 12:43 PM, Andrew Purtell <ap...@apache.org> wrote:
>> The issue with the patch on HBASE-3529 is it relies on modifications
>> to HDFS that the author of HBASE-3529 proposed to the HDFS project as
>> https://issues.apache.org/jira/browse/HDFS-2004. The proposal was
>> vetoed. Therefore, further progress on HBASE-3529 as currently
>> implemented is not possible.
>>
>
> Jason's approach had much merit (IMO).  It warrants study at least.
>
> Though the indices were written to HDFS, Jason had it so lucene was
> getting local filesystem access by going via the local read
> short-circuit facility [1].  Being able to do this made it so he got
> close to native speeds querying the "HDFS-based" indices.  When Jason
> left it -- he had to get a real job unfortunately -- he was blocked on
> what to do when a region moved.  He wanted to be able to
> immediately pull the indices local on region reopen.  The HDFS fellas
> who commented in the issue cited by Andrew above thought it a little
> dodgy adding API for this special case.
>
> If you wanted to follow in Jason's footsteps, let's chat.
> St.Ack
>
> 1. http://hbase.apache.org/book.html#perf.hdfs.configs

Re: Status of HBASE-3529 (Add search to HBase)?

Posted by Jacques <wh...@gmail.com>.

A few thoughts...

Blur.io may also add some new ideas.

I am exploring secondary index designs, which are related.  Would love to
hear more about your goals.

On Sep 20, 2012 2:32 PM, "Stack" <st...@duboce.net> wrote:

> On Thu, Sep 20, 2012 at 12:43 PM, Andrew Purtell <ap...@apache.org>
> wrote:
> > The issue with the patch on HBASE-3529 is it relies on modifications
> > to HDFS that the author of HBASE-3529 proposed to the HDFS project as
> > https://issues.apache.org/jira/browse/HDFS-2004. The proposal was
> > vetoed. Therefore, further progress on HBASE-3529 as currently
> > implemented is not possible.
> >
>
> Jason's approach had much merit (IMO).  It warrants study at least.
>
> Though the indices were written to HDFS, Jason had it so lucene was
> getting local filesystem access by going via the local read
> short-circuit facility [1].  Being able to do this made it so he got
> close to native speeds querying the "HDFS-based" indices.  When Jason
> left it -- he had to get a real job unfortunately -- he was blocked on
> what to do when a region moved.  He wanted to be able to
> immediately pull the indices local on region reopen.  The HDFS fellas
> who commented in the issue cited by Andrew above thought it a little
> dodgy adding API for this special case.
>
> If you wanted to follow in Jason's footsteps, let's chat.
> St.Ack
>
> 1. http://hbase.apache.org/book.html#perf.hdfs.configs
>

Re: Status of HBASE-3529 (Add search to HBase)?

Posted by Stack <st...@duboce.net>.

On Thu, Sep 20, 2012 at 12:43 PM, Andrew Purtell <ap...@apache.org> wrote:
> The issue with the patch on HBASE-3529 is it relies on modifications
> to HDFS that the author of HBASE-3529 proposed to the HDFS project as
> https://issues.apache.org/jira/browse/HDFS-2004. The proposal was
> vetoed. Therefore, further progress on HBASE-3529 as currently
> implemented is not possible.
>

Jason's approach had much merit (IMO).  It warrants study at least.

Though the indices were written to HDFS, Jason had it so lucene was
getting local filesystem access by going via the local read
short-circuit facility [1].  Being able to do this made it so he got
close to native speeds querying the "HDFS-based" indices.  When Jason
left it -- he had to get a real job unfortunately -- he was blocked on
what to do when a region moved.  He wanted to be able to
immediately pull the indices local on region reopen.  The HDFS fellas
who commented in the issue cited by Andrew above thought it a little
dodgy adding API for this special case.

If you wanted to follow in Jason's footsteps, let's chat.
St.Ack

1. http://hbase.apache.org/book.html#perf.hdfs.configs
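
The region-move problem described above can be illustrated with a small
locality calculation. This is a sketch under assumptions, not code from the
HBASE-3529 patch: the block ids, host names and the locality() helper are
made up for the example. It relies on the usual HDFS behavior of placing
the first replica on the writing node, so the regionserver that wrote the
index files can read every block locally, while a regionserver that
receives the region later typically cannot.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BlockLocalitySketch {

    /**
     * Fraction of blocks that have a replica on the given host.
     * 1.0 means every block could be served by a short-circuit local read.
     */
    public static double locality(Map<String, List<String>> blockReplicas, String host) {
        if (blockReplicas.isEmpty()) {
            return 0.0;
        }
        long local = blockReplicas.values().stream()
                .filter(hosts -> hosts.contains(host))
                .count();
        return (double) local / blockReplicas.size();
    }

    public static void main(String[] args) {
        // Hypothetical index blocks written by a regionserver on host-a:
        // the first replica of each block is on the writer, the others elsewhere.
        Map<String, List<String>> indexBlocks = new LinkedHashMap<>();
        indexBlocks.put("blk_1", Arrays.asList("host-a", "host-b", "host-c"));
        indexBlocks.put("blk_2", Arrays.asList("host-a", "host-c", "host-d"));
        indexBlocks.put("blk_3", Arrays.asList("host-a", "host-b", "host-d"));
        indexBlocks.put("blk_4", Arrays.asList("host-a", "host-c", "host-e"));

        System.out.println("region on host-a: " + locality(indexBlocks, "host-a"));
        System.out.println("region moved to host-b: " + locality(indexBlocks, "host-b"));
    }
}
```

With these made-up replicas, locality is 1.0 on host-a and falls to 0.5
once the region reopens on host-b. Until the missing blocks are made local
again, reads for them go over the network instead of through the local
short-circuit path.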

Re: Status of HBASE-3529 (Add search to HBase)?

Posted by Jason Huang <ja...@icare.com>.

Andrew - thanks for the quick response!

Jason

On Thu, Sep 20, 2012 at 3:43 PM, Andrew Purtell <ap...@apache.org> wrote:
> The issue with the patch on HBASE-3529 is it relies on modifications
> to HDFS that the author of HBASE-3529 proposed to the HDFS project as
> https://issues.apache.org/jira/browse/HDFS-2004. The proposal was
> vetoed. Therefore, further progress on HBASE-3529 as currently
> implemented is not possible.
>
> On Thu, Sep 20, 2012 at 12:38 PM, Jason Huang <ja...@icare.com> wrote:
>> Hello,
>>
>> I am interested in learning about the possibility of integrating Lucene &
>> HBase. Google search points me to HBASE-3529 (Add search to HBase).
>>
>> This project is currently listed as "patch available" but
>> "unresolved". Does that mean there are reported bugs in the patch that
>> haven't been resolved yet?  It appears that no one has actively worked
>> on this project for a while. Does anyone on this mailing list know the
>> most recent status?
>>
>> thanks!
>>
>> Jason
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet
> Hein (via Tom White)

Re: Status of HBASE-3529 (Add search to HBase)?

Posted by Andrew Purtell <ap...@apache.org>.

The issue with the patch on HBASE-3529 is it relies on modifications
to HDFS that the author of HBASE-3529 proposed to the HDFS project as
https://issues.apache.org/jira/browse/HDFS-2004. The proposal was
vetoed. Therefore, further progress on HBASE-3529 as currently
implemented is not possible.

On Thu, Sep 20, 2012 at 12:38 PM, Jason Huang <ja...@icare.com> wrote:
> Hello,
>
> I am interested in learning about the possibility of integrating Lucene &
> HBase. Google search points me to HBASE-3529 (Add search to HBase).
>
> This project is currently listed as "patch available" but
> "unresolved". Does that mean there are reported bugs in the patch that
> haven't been resolved yet?  It appears that no one has actively worked
> on this project for a while. Does anyone on this mailing list know the
> most recent status?
>
> thanks!
>
> Jason

-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet
Hein (via Tom White)