Posted to user@hbase.apache.org by Otis Gospodnetic <ot...@gmail.com> on 2012/10/05 05:34:56 UTC

Lucene instead of HFiles?

Hi,

Has anyone attempted using Lucene instead of HFiles (see
https://twitter.com/otisg/status/254047978174701568 )?

Is that a completely crazy, bad, would-never-work,
don't-bother-trying-this-at-home, it's-too-late-go-to-sleep idea? Or
not?

Thanks,
Otis
--
Search Analytics - http://sematext.com/search-analytics/index.html
Performance Monitoring - http://sematext.com/spm/index.html

Re: Lucene instead of HFiles?

Posted by Michael Segel <mi...@hotmail.com>.

Actually I think you'd want to do the reverse. 
Store your Lucene index in HBase. Which is what we did a while back. 

This could be extended to SOLR, but we never had time to do it. 


On Oct 5, 2012, at 4:11 AM, Lars George <la...@gmail.com> wrote:

> Hi Otis,
> 
> My initial reaction was, "interesting idea". On second thought, though, I do not see how this makes more sense than what we have now. HFiles combined with Bloom filters are fast to look up anyway. Adding Lucene as another "Storage Engine" (getting us close to Voldemort or MySQL with replaceable storage backends) does not seem to add any value and, moreover, might even have a few drawbacks. Range scans especially will suffer, as HFiles with their block-oriented layout plus caching make for really fast I/O. Lucene is for search, not xyzbytes of data transfers. And simply replacing the block index and Blooms with Lucene is, I think, also overkill. Just saying.
> 
> Lars
> 
> On Oct 5, 2012, at 5:34 AM, Otis Gospodnetic <ot...@gmail.com> wrote:
> 
>> Hi,
>> 
>> Has anyone attempted using Lucene instead of HFiles (see
>> https://twitter.com/otisg/status/254047978174701568 )?
>> 
>> Is that a completely crazy, bad, would-never-work,
>> don't-bother-trying-this-at-home, it's-too-late-go-to-sleep idea? Or
>> not?
>> 
>> Thanks,
>> Otis
>> --
>> Search Analytics - http://sematext.com/search-analytics/index.html
>> Performance Monitoring - http://sematext.com/spm/index.html
> 
>

Re: Lucene instead of HFiles?

Posted by Otis Gospodnetic <ot...@gmail.com>.

Hi Lars,

Yeah, maybe.  Somewhere in the back of my head was a completely fuzzy
idea that if one were to sneak in Lucene at that low level one could
get that full-text search over HBase data that comes up periodically.
Also, I was thinking, having Lucene down there could make it possible
to get ad-hoc reports on data in HBase and one wouldn't have to figure
out the key structure ahead of time.

But I think Jacques makes a good point - there are already
ElasticSearch and Solr.  They are full-text search engines, but people
also use them for pure boolean matching, as key value stores, etc.

Otis
--
Search Analytics - http://sematext.com/search-analytics/index.html
Performance Monitoring - http://sematext.com/spm/index.html


On Fri, Oct 5, 2012 at 5:11 AM, Lars George <la...@gmail.com> wrote:
> Hi Otis,
>
> My initial reaction was, "interesting idea". On second thought, though, I do not see how this makes more sense than what we have now. HFiles combined with Bloom filters are fast to look up anyway. Adding Lucene as another "Storage Engine" (getting us close to Voldemort or MySQL with replaceable storage backends) does not seem to add any value and, moreover, might even have a few drawbacks. Range scans especially will suffer, as HFiles with their block-oriented layout plus caching make for really fast I/O. Lucene is for search, not xyzbytes of data transfers. And simply replacing the block index and Blooms with Lucene is, I think, also overkill. Just saying.
>
> Lars
>
> On Oct 5, 2012, at 5:34 AM, Otis Gospodnetic <ot...@gmail.com> wrote:
>
>> Hi,
>>
>> Has anyone attempted using Lucene instead of HFiles (see
>> https://twitter.com/otisg/status/254047978174701568 )?
>>
>> Is that a completely crazy, bad, would-never-work,
>> don't-bother-trying-this-at-home, it's-too-late-go-to-sleep idea? Or
>> not?
>>
>> Thanks,
>> Otis
>> --
>> Search Analytics - http://sematext.com/search-analytics/index.html
>> Performance Monitoring - http://sematext.com/spm/index.html
>

Re: Lucene instead of HFiles?

Posted by Lars George <la...@gmail.com>.

Hi Otis,

My initial reaction was, "interesting idea". On second thought, though, I do not see how this makes more sense than what we have now. HFiles combined with Bloom filters are fast to look up anyway. Adding Lucene as another "Storage Engine" (getting us close to Voldemort or MySQL with replaceable storage backends) does not seem to add any value and, moreover, might even have a few drawbacks. Range scans especially will suffer, as HFiles with their block-oriented layout plus caching make for really fast I/O. Lucene is for search, not xyzbytes of data transfers. And simply replacing the block index and Blooms with Lucene is, I think, also overkill. Just saying.
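
To make the lookup path above concrete, here is a minimal sketch of a sorted, block-oriented file with a Bloom filter and a block index. It is a toy model under stated assumptions, not HBase's actual HFile code: the names BloomFilter, SortedBlockFile, get and scan are hypothetical, and the block size and hash scheme are picked only for illustration.

```python
import bisect
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive several bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


class SortedBlockFile:
    """Toy stand-in for an immutable, sorted, block-oriented file."""

    def __init__(self, items, block_size=4):
        rows = sorted(items.items())
        self.blocks = [rows[i:i + block_size]
                       for i in range(0, len(rows), block_size)]
        # Block index: the first key of every block, in sorted order.
        self.first_keys = [block[0][0] for block in self.blocks]
        self.bloom = BloomFilter()
        for key, _ in rows:
            self.bloom.add(key)

    def get(self, key):
        # Point lookup: the Bloom filter lets us skip the file entirely
        # when the key is definitely absent.
        if not self.bloom.might_contain(key):
            return None
        idx = bisect.bisect_right(self.first_keys, key) - 1
        if idx < 0:
            return None
        for k, v in self.blocks[idx]:
            if k == key:
                return v
        return None

    def scan(self, start, stop):
        # Range scan: seek once via the block index, then read blocks
        # sequentially until the stop key (exclusive) is reached.
        idx = max(bisect.bisect_right(self.first_keys, start) - 1, 0)
        for block in self.blocks[idx:]:
            for k, v in block:
                if k >= stop:
                    return
                if k >= start:
                    yield k, v
```

The point of the sketch is that a range scan needs only one seek followed by sequential block reads, which is the access pattern the block layout and block cache are built around.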

Lars

On Oct 5, 2012, at 5:34 AM, Otis Gospodnetic <ot...@gmail.com> wrote:

> Hi,
> 
> Has anyone attempted using Lucene instead of HFiles (see
> https://twitter.com/otisg/status/254047978174701568 )?
> 
> Is that a completely crazy, bad, would-never-work,
> don't-bother-trying-this-at-home, it's-too-late-go-to-sleep idea? Or
> not?
> 
> Thanks,
> Otis
> --
> Search Analytics - http://sematext.com/search-analytics/index.html
> Performance Monitoring - http://sematext.com/spm/index.html

Re: Lucene instead of HFiles?

Posted by Otis Gospodnetic <ot...@gmail.com>.

Hi Renaud,

On Fri, Oct 5, 2012 at 4:48 AM, Renaud Delbru <re...@deri.org> wrote:
> Hi,
>
> With respect to point 3, I know there is a new codec in Lucene 4.0 for
> append-only filesystems such as HDFS (LUCENE-2373).

Yeah.  Though I think nobody wants to search indices directly in HDFS
for performance reasons.

> It would also depend on the use case. At the moment, for storing data,
> I would expect HFile to be much more efficient in terms of compression than
> the Lucene file system (in fact, there is no real compression, apart from
> compressing the field byte stream yourself before storing it). There is some
> work to try to make Lucene more efficient for small and medium sized fields
> (LUCENE-4226 - block-style compression and storing), but I think HFile is
> far more optimised for this task.

I wouldn't know... though I was under the impression there has been
other work around packing things tightly both on disk and in memory.
Check http://www.slideshare.net/lucenerevolution/willnauer-simon-doc-values-column-stride-fields-in-lucene
... slide 16, etc.

> In fact, another interesting idea would be to investigate the use of HFile
> as a StoredFieldFormat in Lucene. Efficient storage of data in Lucene is
> imho quite a missing feature.

Otis
--
Search Analytics - http://sematext.com/search-analytics/index.html
Performance Monitoring - http://sematext.com/spm/index.html


> On 05/10/12 07:36, Adrien Mogenet wrote:
>>
>> "Don't bother trying this in production" ;-)
>>
>> 1. Are you sure lookups by key are faster?
>> 2. Updating Lucene files in a lock-free manner and ensuring good
>> concurrency can be a bit tricky
>> 3. AFAIK, Lucene files don't fit in HDFS and thus another distributed
>> storage is required. Katta does not look as powerful as Hadoop.
>>
>> On Fri, Oct 5, 2012 at 5:34 AM, Otis Gospodnetic
>> <ot...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> Has anyone attempted using Lucene instead of HFiles (see
>>> https://twitter.com/otisg/status/254047978174701568 )?
>>>
>>> Is that a completely crazy, bad, would-never-work,
>>> don't-bother-trying-this-at-home, it's-too-late-go-to-sleep idea? Or
>>> not?
>>>
>>> Thanks,
>>> Otis
>>> --
>>> Search Analytics - http://sematext.com/search-analytics/index.html
>>> Performance Monitoring - http://sematext.com/spm/index.html
>>
>>
>>
>>
>

Re: Lucene instead of HFiles?

Posted by Renaud Delbru <re...@deri.org>.

Hi,

With respect to point 3, I know there is a new codec in Lucene 4.0 for
append-only filesystems such as HDFS (LUCENE-2373).

It would also depend on the use case. At the moment, for storing
data, I would expect HFile to be much more efficient in terms of
compression than the Lucene file system (in fact, there is no real
compression, apart from compressing the field byte stream yourself
before storing it). There is some work to try to make Lucene more
efficient for small and medium sized fields (LUCENE-4226 - block-style
compression and storing), but I think HFile is far more optimised for
this task.
In fact, another interesting idea would be to investigate the use of
HFile as a StoredFieldFormat in Lucene. Efficient storage of data in
Lucene is imho quite a missing feature.
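
As a rough illustration of compressing the field byte stream yourself before storing it, here is a minimal sketch using Python's standard zlib module. The helper names pack_field and unpack_field are hypothetical, and the step of handing the bytes to Lucene as a binary stored field is left out; this only shows the per-field compress and decompress round trip.

```python
import zlib


def pack_field(value: str, level: int = 6) -> bytes:
    """Compress one field value into bytes suitable for a binary stored field."""
    return zlib.compress(value.encode("utf-8"), level)


def unpack_field(stored: bytes) -> str:
    """Reverse pack_field: decompress and decode the stored bytes."""
    return zlib.decompress(stored).decode("utf-8")


if __name__ == "__main__":
    text = "status=OK;" * 200
    packed = pack_field(text)
    print(len(text.encode("utf-8")), "bytes raw,", len(packed), "bytes packed")
```

Compressing each field on its own tends to do poorly on small values, since there is little redundancy inside a single short field; that is the gap block-style compression is meant to close by compressing many values together.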

my2c
Regards
-- 
Renaud Delbru

On 05/10/12 07:36, Adrien Mogenet wrote:
> "Don't bother trying this in production" ;-)
>
> 1. Are you sure lookups by key are faster?
> 2. Updating Lucene files in a lock-free manner and ensuring good
> concurrency can be a bit tricky
> 3. AFAIK, Lucene files don't fit in HDFS and thus another distributed
> storage is required. Katta does not look as powerful as Hadoop.
>
> On Fri, Oct 5, 2012 at 5:34 AM, Otis Gospodnetic
> <ot...@gmail.com> wrote:
>> Hi,
>>
>> Has anyone attempted using Lucene instead of HFiles (see
>> https://twitter.com/otisg/status/254047978174701568 )?
>>
>> Is that a completely crazy, bad, would-never-work,
>> don't-bother-trying-this-at-home, it's-too-late-go-to-sleep idea? Or
>> not?
>>
>> Thanks,
>> Otis
>> --
>> Search Analytics - http://sematext.com/search-analytics/index.html
>> Performance Monitoring - http://sematext.com/spm/index.html
>
>
>

RE: Lucene instead of HFiles?

Posted by Fuad Efendi <fu...@efendi.ca>.

If you don't like HFiles, and prefer Solr instead, consider Map. It is very
nice... :-)

What about EhCache? Still synchronized?......... use LinkedHashMap......

You just need "inverted table" for a search by secondary index, and you are
comparing Lucene with HTable... wow... everything depends on use case... I
prefer auxiliary tables in HBase with extra fastest FIFO in-memory caches,
and if I don't need transactions - I don't use them...
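
A minimal sketch of that approach, assuming plain in-memory dicts stand in for the HBase main table and the auxiliary "inverted" table. The names FifoCache and IndexedTable are hypothetical, and OrderedDict plays the role LinkedHashMap would play in Java. No transactions are involved: the main table and the index are simply written one after the other.

```python
from collections import OrderedDict, defaultdict


class FifoCache:
    """Bounded cache that evicts the oldest inserted entry first."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        if key not in self._data and len(self._data) >= self.capacity:
            self._data.popitem(last=False)  # drop the oldest insertion
        self._data[key] = value

    def invalidate(self, key):
        self._data.pop(key, None)


class IndexedTable:
    """Main table plus an auxiliary inverted table on one column."""

    def __init__(self, indexed_column, cache_capacity=128):
        self.indexed_column = indexed_column
        self.rows = {}                 # row key -> columns
        self.index = defaultdict(set)  # column value -> row keys
        self.cache = FifoCache(cache_capacity)

    def put(self, row_key, columns):
        old = self.rows.get(row_key)
        if old is not None and self.indexed_column in old:
            old_value = old[self.indexed_column]
            self.index[old_value].discard(row_key)
            self.cache.invalidate(old_value)
        self.rows[row_key] = dict(columns)
        if self.indexed_column in columns:
            new_value = columns[self.indexed_column]
            self.index[new_value].add(row_key)
            self.cache.invalidate(new_value)

    def find(self, value):
        """Return the row keys whose indexed column equals value."""
        cached = self.cache.get(value)
        if cached is not None:
            return cached
        result = tuple(sorted(self.index.get(value, ())))
        self.cache.put(value, result)
        return result
```

This only answers exact-match lookups on the indexed column, which is the "non-tokenized" case; anything needing tokenization or relevance ranking is where a search engine earns its keep.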

-Fuad




-----Original Message-----
From: Fuad Efendi [mailto:fuad@efendi.ca] 
Sent: October-05-12 10:35 PM
To: user@hbase.apache.org
Subject: RE: Lucene instead of HFiles?

Lucene sucks with traditional "secondary indices" for traditional tables...
engineering overhead, too much... and you indeed already have kind of
"secondary indices" with HFile and Bloom Filter structure... just design
"secondary" Bloom filters etc.......

Yes, Lucene/Solr already implement this functionality. But we can improve it
for "non-tokenized" secondary indices.


-Fuad

RE: Lucene instead of HFiles?

Posted by Fuad Efendi <fu...@efendi.ca>.

Lucene sucks with traditional "secondary indices" for traditional tables...
engineering overhead, too much... and you indeed already have kind of
"secondary indices" with HFile and Bloom Filter structure... just design
"secondary" Bloom filters etc.......

Yes, Lucene/Solr already implement this functionality. But we can improve it
for "non-tokenized" secondary indices.


-Fuad

Re: Lucene instead of HFiles?

Posted by Otis Gospodnetic <ot...@gmail.com>.

Hi,

On Fri, Oct 5, 2012 at 2:36 AM, Adrien Mogenet <ad...@gmail.com> wrote:
> "Don't bother trying this in production" ;-)
>
> 1. Are you sure lookups by key are faster?

No clue.  But I also didn't say it's faster, just fast. :)

> 2. Updating Lucene files in a lock-free manner and ensuring good
> concurrency can be a bit tricky

AFAIK Lucene files are immutable.  Updates are delete and add.
Deletes are flags like tombstone markers in HBase.
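
A toy model of that scheme, as a sketch only: segments are frozen tuples that are never modified, a delete just records a marker next to the segment, and an update is a delete followed by an add into a new segment. The class SegmentStore and its methods are hypothetical names, not Lucene or HBase APIs, and the merge step is an assumed analogue of segment merging or compaction.

```python
class SegmentStore:
    """Write-once segments with per-segment delete markers."""

    def __init__(self):
        self.segments = []    # each segment is a tuple of (key, value) pairs
        self.tombstones = []  # per-segment set of deleted positions

    def add_batch(self, docs):
        # A new batch always becomes a new immutable segment.
        self.segments.append(tuple(docs))
        self.tombstones.append(set())

    def delete(self, key):
        # Segments are never rewritten; deletion only flags positions.
        for seg, dead in zip(self.segments, self.tombstones):
            for pos, (k, _) in enumerate(seg):
                if k == key:
                    dead.add(pos)

    def update(self, key, value):
        # An update is a delete followed by an add.
        self.delete(key)
        self.add_batch([(key, value)])

    def get(self, key):
        # Search newest segment first, skipping flagged entries.
        pairs = zip(reversed(self.segments), reversed(self.tombstones))
        for seg, dead in pairs:
            for pos, (k, v) in enumerate(seg):
                if k == key and pos not in dead:
                    return v
        return None

    def merge(self):
        # Rewrite all live entries into one segment, dropping flagged ones.
        live = tuple(
            entry
            for seg, dead in zip(self.segments, self.tombstones)
            for pos, entry in enumerate(seg)
            if pos not in dead
        )
        self.segments = [live]
        self.tombstones = [set()]
```

Because existing segments are only ever read, readers do not need to lock them; the cost moves to the merge step, which has to rewrite data to reclaim the space held by flagged entries.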

> 3. AFAIK, Lucene files don't fit in HDFS and thus another distributed
> storage is required. Katta does not look as powerful as Hadoop.

Katta and Hadoop are two different tools, though.  From what I recall,
Katta simply used HDFS for storing indices, but would push them
elsewhere for searching purposes.

Otis
--
Search Analytics - http://sematext.com/search-analytics/index.html
Performance Monitoring - http://sematext.com/spm/index.html

> On Fri, Oct 5, 2012 at 5:34 AM, Otis Gospodnetic
> <ot...@gmail.com> wrote:
>> Hi,
>>
>> Has anyone attempted using Lucene instead of HFiles (see
>> https://twitter.com/otisg/status/254047978174701568 )?
>>
>> Is that a completely crazy, bad, would-never-work,
>> don't-bother-trying-this-at-home, it's-too-late-go-to-sleep idea? Or
>> not?
>>
>> Thanks,
>> Otis
>> --
>> Search Analytics - http://sematext.com/search-analytics/index.html
>> Performance Monitoring - http://sematext.com/spm/index.html
>
>
>
> --
> Adrien Mogenet
> 06.59.16.64.22
> http://www.mogenet.me

Re: Lucene instead of HFiles?

Posted by Adrien Mogenet <ad...@gmail.com>.

"Don't bother trying this in production" ;-)

1. Are you sure lookups by key are faster?
2. Updating Lucene files in a lock-free manner and ensuring good
concurrency can be a bit tricky
3. AFAIK, Lucene files don't fit in HDFS and thus another distributed
storage is required. Katta does not look as powerful as Hadoop.

On Fri, Oct 5, 2012 at 5:34 AM, Otis Gospodnetic
<ot...@gmail.com> wrote:
> Hi,
>
> Has anyone attempted using Lucene instead of HFiles (see
> https://twitter.com/otisg/status/254047978174701568 )?
>
> Is that a completely crazy, bad, would-never-work,
> don't-bother-trying-this-at-home, it's-too-late-go-to-sleep idea? Or
> not?
>
> Thanks,
> Otis
> --
> Search Analytics - http://sematext.com/search-analytics/index.html
> Performance Monitoring - http://sematext.com/spm/index.html

-- 
Adrien Mogenet
06.59.16.64.22
http://www.mogenet.me

Re: Lucene instead of HFiles?

Posted by Jacques <wh...@gmail.com>.

Abstractly, isn't this what ElasticSearch and Katta already are:
range-sharded data stores built on top of Lucene?

J

On Thu, Oct 4, 2012 at 8:34 PM, Otis Gospodnetic <otis.gospodnetic@gmail.com
> wrote:

> Hi,
>
> Has anyone attempted using Lucene instead of HFiles (see
> https://twitter.com/otisg/status/254047978174701568 )?
>
> Is that a completely crazy, bad, would-never-work,
> don't-bother-trying-this-at-home, it's-too-late-go-to-sleep idea? Or
> not?
>
> Thanks,
> Otis
> --
> Search Analytics - http://sematext.com/search-analytics/index.html
> Performance Monitoring - http://sematext.com/spm/index.html
>