Posted to solr-dev@lucene.apache.org by Thomas Koch <th...@koch.ro> on 2010/03/25 10:42:17 UTC

ported lucandra: lucene index on HBase

Hi,

Lucandra stores a lucene index on cassandra:
http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend

As the author of lucandra writes: "I’m sure something similar could be built 
on hbase."

So here it is:
http://github.com/thkoch2001/lucehbase

This is only a first prototype which has not been tested on anything real yet. 
But if you're interested, please join me to get it production ready!

I propose to keep this thread on hbase-user and java-dev only.
Would it make sense to aim for this project to become an hbase contrib? Or a 
lucene contrib?

Best regards,

Thomas Koch, http://www.koch.ro

Re: ported lucandra: lucene index on HBase

Posted by TuX RaceR <tu...@gmail.com>.
Thank you Karthik, I did not know about your project and have joined the 
project's mailing list ;)
As I started this thread here, on the HBase list, maybe I'll just continue here.

Karthik K wrote:
> The HBase RPC is being modified to append a docid for an already existing
> field/term to the compressed encoding stored in the family/column name, to
> achieve locality of reference and scale with the number of documents.
>   
woaw, that sounds very interesting ;) Is there an HBase JIRA for this, or 
is it only available in your code?

> Once the documents go into the index, for all practical purposes the
> manipulation is done across numbers assigned to the user-specified id space.
> More often than not, the only field that is stored is the "id", which is
> retrieved after all the computation and can then be used to index into
> another store to retrieve the other details of the search schema. Except for
> limited cases (sorting / faceting etc.), using the tf-idf representation
> for storing the 'field's of a document goes against the format being used
> and is advised to be used sparingly.
>   
Looking at http://wiki.github.com/akkumar/hbasene/hbase-tf-idf-index-formats

I see you have:


    Term Frequency

The TF-IDF (Term Frequency / Inverse Document Frequency) representation is 
as follows.

row          | fm.termFrequency columns
<field/term> | <lucene_int_1> -> <termPositions_of_field/term_in_lucene_int_1>
             | <lucene_int_2> -> <termPositions_in_lucene_int_2>
             | <lucene_int_3> -> <termPositions_in_lucene_int_3>


My question is: what happens if the term is a very common one?
E.g. if you have 1 billion (10^9) documents in your database and a term 
that is contained in one out of every 100 documents (i.e. a term 
contained in 10 million = 10^7 documents), then retrieving this row 
gives you a huge network payload. How do you deal with that kind of scenario?
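For a rough sense of scale, a back-of-envelope sketch (hypothetical numbers matching the example above, assuming uncompressed 4-byte doc ids and ignoring the position vectors):

```python
# Back-of-envelope estimate for one very common term's row.
total_docs = 10**9                  # documents in the database
matching_docs = total_docs // 100   # term occurs in 1 of every 100 docs
bytes_per_docid = 4                 # assume uncompressed int-sized doc ids

payload_bytes = matching_docs * bytes_per_docid
print(matching_docs)   # 10000000
print(payload_bytes)   # 40000000, i.e. ~40 MB before position vectors
```

So even without positions, a single hot term's row approaches tens of megabytes, which is the network-payload concern above.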

Thanks
TuX



Re: ported lucandra: lucene index on HBase

Posted by Karthik K <os...@gmail.com>.
Hi TuX -
   There is a different project, HBasene ( http://github.com/akkumar/hbasene ),
that uses HBase as the backing store for a TF-IDF index. It addresses the
same problem, and I am speaking on behalf of that project.


On Mon, Apr 19, 2010 at 1:06 AM, TuX RaceR <tu...@gmail.com> wrote:

> Hi Thomas,
>
> Thanks for sharing your code for lucehbase.
> The schema you used seems the same as the one used in lucandra:
>
> -------------------
> *Document Ids are currently random and autogenerated.
>
> *Term keys and Document Keys are encoded as follows (using a random binary
> delimiter)
>
>     Term Key                     col name         value
>     "index_name/field/term" => { documentId , position vector }
>
>     Document Key
>     "index_name/documentId" => { fieldName , value }
> --------------------
>
> I have two questions:
> 1) for a given term key, the number of columns can potentially get very
> large. Have you tried another schema where the document id is put in the
> key, i.e.:
>
>     Term Key                                  col name    value
>     "index_name/field/term/docid" => { info , position vector }
> That way you get trivial paging in the case where a lot of documents
> contain the term.
>


The documents are encoded using a compressed bitset in order to scale: with
the docid as part of the key, the key space grows as (docids * unique terms),
and that layout does not give the best locality of reference for unions /
intersections / range queries etc.

The HBase RPC is being modified to append a docid for an already existing
field/term to the compressed encoding stored in the family/column name, to
achieve locality of reference and scale with the number of documents.
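To illustrate why keeping all of a term's docids together helps, here is a deliberately uncompressed sketch (not HBasene's actual encoding; terms and ids are invented): if each term's row carries the full set of matching internal doc ids as a bitset, boolean queries reduce to cheap bitwise operations.

```python
# Posting lists as integer bitsets: bit i is set iff internal doc id i matches.
def add_doc(bitset, docid):
    return bitset | (1 << docid)

term_a = term_b = 0
for d in (3, 17, 42):
    term_a = add_doc(term_a, d)
for d in (17, 23, 42):
    term_b = add_doc(term_b, d)

union = term_a | term_b          # OR query across the two terms
intersection = term_a & term_b   # AND query across the two terms

def docids(bitset):
    # Expand a bitset back into the list of matching internal doc ids.
    return [i for i in range(bitset.bit_length()) if bitset >> i & 1]

print(docids(union))         # [3, 17, 23, 42]
print(docids(intersection))  # [17, 42]
```

With the docid in the row key instead, the same union would require touching one row per (term, docid) pair rather than two contiguous reads.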


>
> 2) once you get the list of docids, to get the document details (i.e. the
> pairs { fieldName , value }) you will trigger a lot of random-access
> queries to HBase (whereas in 1), with the alternative schema
> "index_name/field/term/docid", you open a scanner, and with the schema
> "index_name/field/term" you just get one row). I am wondering how you can
> get fast answers that way. If you have few fields, would it be a good idea
> to also store the values in the index (only the alternative schema
> "index_name/field/term/docid" allows this)?
>

Once the documents go into the index, for all practical purposes the
manipulation is done across numbers assigned to the user-specified id space.
More often than not, the only field that is stored is the "id", which is
retrieved after all the computation and can then be used to index into
another store to retrieve the other details of the search schema. Except for
limited cases (sorting / faceting etc.), using the tf-idf representation
for storing the 'field's of a document goes against the format being used
and is advised to be used sparingly.
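A toy sketch of that two-step retrieval (all names and data below are invented for illustration): the index resolves a term to internal numeric ids, only the "id" field is stored in the index, and full documents come from a separate store.

```python
# Inverted index: term -> internal numeric doc ids (the only thing queries touch).
index = {"hbase": [0, 2], "lucene": [1, 2]}

# The single stored field: internal id -> user-visible "id".
id_field = {0: "doc-a", 1: "doc-b", 2: "doc-c"}

# Separate document store, keyed by the user-visible id.
doc_store = {
    "doc-a": {"title": "HBase schema design"},
    "doc-b": {"title": "Lucene scoring"},
    "doc-c": {"title": "HBasene internals"},
}

def search(term):
    # Step 1: resolve internal ids; step 2: fetch details from the other store.
    return [doc_store[id_field[i]] for i in index.get(term, [])]

print(search("hbase"))
```

All scoring happens over the small numeric id space; the wide document rows are only touched once per final hit.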

There is a low-volume mailing list at
http://groups.google.com/group/hbasene-user for discussion; you can hop on
if you are interested.



> Thanks
> TuX
>
>
>
>
> Thomas Koch wrote:
>
>> Hi,
>>
>> Lucandra stores a lucene index on cassandra:
>>
>> http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend
>>
>> As the author of lucandra writes: "I’m sure something similar could be
>> built on hbase."
>>
>> So here it is:
>> http://github.com/thkoch2001/lucehbase
>>
>> This is only a first prototype which has not been tested on anything real
>> yet. But if you're interested, please join me to get it production ready!
>>
>> I propose to keep this thread on hbase-user and java-dev only.
>> Would it make sense to aim this project to become an hbase contrib? Or a
>> lucene contrib?
>>
>> Best regards,
>>
>> Thomas Koch, http://www.koch.ro
>>
>>
>
>

Re: ported lucandra: lucene index on HBase

Posted by TuX RaceR <tu...@gmail.com>.
Hi Thomas,

Thanks for sharing your code for lucehbase.
The schema you used seems the same as the one used in lucandra:

-------------------
*Document Ids are currently random and autogenerated.

*Term keys and Document Keys are encoded as follows (using a random 
binary delimiter)

      Term Key                     col name         value
      "index_name/field/term" => { documentId , position vector }

      Document Key
      "index_name/documentId" => { fieldName , value }
--------------------
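That key encoding can be sketched roughly like this (the \x1f delimiter below is just a stand-in for the random binary delimiter; index, field, and document names are invented):

```python
DELIM = "\x1f"  # stand-in for the random binary delimiter

def term_key(index_name, field, term):
    # "index_name/field/term" row key for the term table.
    return DELIM.join((index_name, field, term))

def document_key(index_name, doc_id):
    # "index_name/documentId" row key for the document table.
    return DELIM.join((index_name, doc_id))

# Term row: one column per matching document, value = position vector.
term_row = {term_key("idx", "body", "hbase"): {"doc42": [1, 7]}}

# Document row: one column per stored field.
doc_row = {document_key("idx", "doc42"): {"title": "HBase notes"}}

print(term_key("idx", "body", "hbase"))
```

A non-printing delimiter avoids ambiguity when index, field, or term names themselves contain separators like "/".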

I have two questions:
1) for a given term key, the number of columns can potentially get very 
large. Have you tried another schema where the document id is put in the 
key, i.e.:

      Term Key                                  col name    value
      "index_name/field/term/docid" => { info , position vector }
That way you get trivial paging in the case where a lot of documents 
contain the term.

2) once you get the list of docids, to get the document details (i.e. the 
pairs { fieldName , value }) you will trigger a lot of random-access 
queries to HBase (whereas in 1), with the alternative schema 
"index_name/field/term/docid", you open a scanner, and with the schema 
"index_name/field/term" you just get one row). I am wondering how you 
can get fast answers that way. If you have few fields, would it be a 
good idea to also store the values in the index (only the alternative 
schema "index_name/field/term/docid" allows this)?
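The paging advantage of the alternative "index_name/field/term/docid" schema can be sketched with a sorted key space (keys below are invented): a scanner reads contiguous keys, and the next page simply resumes after the last key of the previous page.

```python
import bisect

# Sorted key space modelling the "index_name/field/term/docid" layout,
# as an HBase region would keep its row keys sorted.
rows = sorted([
    "idx/body/hbase/doc001",
    "idx/body/hbase/doc007",
    "idx/body/hbase/doc042",
    "idx/body/lucene/doc007",
])

def prefix_scan(rows, prefix, start_after=None, limit=10):
    # Resume after the last key seen, or seek to the start of the prefix.
    if start_after is None:
        i = bisect.bisect_left(rows, prefix)
    else:
        i = bisect.bisect_right(rows, start_after)
    out = []
    while i < len(rows) and rows[i].startswith(prefix) and len(out) < limit:
        out.append(rows[i])
        i += 1
    return out

page1 = prefix_scan(rows, "idx/body/hbase/", limit=2)
print(page1)  # ['idx/body/hbase/doc001', 'idx/body/hbase/doc007']
page2 = prefix_scan(rows, "idx/body/hbase/", start_after=page1[-1], limit=2)
print(page2)  # ['idx/body/hbase/doc042']
```

With the single-row "index_name/field/term" schema there is no such natural resume point: the whole row (or an explicit column-range filter) has to be fetched at once.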

Thanks
TuX



Thomas Koch wrote:
> Hi,
>
> Lucandra stores a lucene index on cassandra:
> http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend
>
> As the author of lucandra writes: "I’m sure something similar could be built 
> on hbase."
>
> So here it is:
> http://github.com/thkoch2001/lucehbase
>
> This is only a first prototype which has not been tested on anything real yet. 
> But if you're interested, please join me to get it production ready!
>
> I propose to keep this thread on hbase-user and java-dev only.
> Would it make sense to aim this project to become an hbase contrib? Or a 
> lucene contrib?
>
> Best regards,
>
> Thomas Koch, http://www.koch.ro
>   


Re: ported lucandra: lucene index on HBase

Posted by Jean-Daniel Cryans <jd...@apache.org>.
That sounds great Thomas! You can start by adding an entry here
http://wiki.apache.org/hadoop/SupportingProjects

WRT becoming an HBase contrib, we have a rule that at least one
committer (or a very active contributor) must be in charge and be
available to fix anything broken in it due to changes in core HBase.
For example, if a contrib doesn't compile before a release, we will
exclude it.

J-D

On Thu, Mar 25, 2010 at 2:42 AM, Thomas Koch <th...@koch.ro> wrote:
> Hi,
>
> Lucandra stores a lucene index on cassandra:
> http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend
>
> As the author of lucandra writes: "I’m sure something similar could be built
> on hbase."
>
> So here it is:
> http://github.com/thkoch2001/lucehbase
>
> This is only a first prototype which has not been tested on anything real yet.
> But if you're interested, please join me to get it production ready!
>
> I propose to keep this thread on hbase-user and java-dev only.
> Would it make sense to aim this project to become an hbase contrib? Or a
> lucene contrib?
>
> Best regards,
>
> Thomas Koch, http://www.koch.ro
>
