Posted to user@cassandra.apache.org by "David G. Boney" <db...@semanticartifacts.com> on 2011/01/27 19:40:23 UTC

Lucandra Limitations

I was reviewing the Lucandra schema presented on the DataStax page below:

http://www.datastax.com/docs/0.7/data_model/lucandra

In the TermInfo Super Column Family, docID is the key for a supercolumn. Does this imply that the maximum number of documents that can be indexed for a term with Lucandra is two billion, the maximum number of columns?

-------------
Sincerely,
David G. Boney
dboney1@semanticartifacts.com
http://www.semanticartifacts.com





Re: Lucandra Limitations

Posted by Jake Luciani <ja...@gmail.com>.
The latest iteration of Lucandra, called Solandra, creates localized
sub-indexes of size N and spreads them around the Cassandra ring. Solr then
searches all the sub-indexes in parallel behind the scenes. This approach
should give you what you need, and it would be great to have such a large
dataset to test the limits of Solandra.
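
The fan-out idea described above (query every sub-index in parallel, then merge the partial results) can be illustrated with a minimal sketch. This is not Solandra's or Solr's actual code: the in-memory `sub_indexes` structure and the function names `search_sub_index` and `search_all` are hypothetical, chosen only to show the shape of the technique.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq


def search_sub_index(sub_index, term, k):
    """Return the top-k (score, doc_id) hits for `term` in one sub-index.

    `sub_index` is a hypothetical in-memory stand-in: {term: {doc_id: score}}.
    """
    postings = sub_index.get(term, {})
    return heapq.nlargest(k, ((score, doc_id) for doc_id, score in postings.items()))


def search_all(sub_indexes, term, k=3):
    """Query every sub-index in parallel and merge the partial top-k lists."""
    with ThreadPoolExecutor(max_workers=max(1, len(sub_indexes))) as pool:
        partials = list(pool.map(lambda s: search_sub_index(s, term, k), sub_indexes))
    # The global top-k is always contained in the union of the per-shard top-k lists.
    return heapq.nlargest(k, (hit for part in partials for hit in part))


# Three tiny sub-indexes standing in for shards spread around the ring.
sub_indexes = [
    {"rdf": {"doc-a": 0.9, "doc-b": 0.4}},
    {"rdf": {"doc-c": 0.7}},
    {"uri": {"doc-d": 0.5}},
]
top_hits = search_all(sub_indexes, "rdf", k=2)
```

Because each sub-index holds at most N documents, no single index has to approach the 32-bit document-number limit; only the merge step sees results from the whole collection.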

Solandra is here: http://github.com/tjake/lucandra

-Jake


Re: Lucandra Limitations

Posted by "David G. Boney" <db...@semanticartifacts.com>.
I am new to Lucene and Lucandra.

My use case is that I have a trillion URIs to index with Lucene. Each URI is either a resource or a literal in an RDF graph. Each URI is a document for Lucene.

If I were using Lucene, my understanding is that it would create a segment and add URIs to it until it hit either the document limit, around 2 billion, or the maximum size of the index. Let's say, for the sake of argument, that I store only 1 billion URIs in a segment; then I would have 1000 segments to index my URIs.

Lucandra does not support segments. How would I index a trillion URIs? Based on the comments below, I could only have around 2 billion URIs, or documents, per index. Would I have to create separate indexes to store all the URIs? If I store only 1 billion URIs in an index, would I have to create 1000 indexes? Since these are indexes and not segments, which Lucene would have handled for me, do I have to run my search against each index? Lucene supports creating multiple IndexSearchers and putting them in a MultiSearcher.
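
The arithmetic behind the 1000-index figure can be checked with a short sketch. It assumes only what the thread states: a trillion documents in total and a per-index ceiling set by a 32-bit signed document number. The helper name `indexes_needed` is hypothetical.

```python
MAX_DOCS_PER_INDEX = 2**31 - 1   # largest 32-bit signed int, about 2.1 billion
TOTAL_URIS = 10**12              # "a trillion URIs" from the message above


def indexes_needed(total_docs, docs_per_index):
    """Number of separate indexes required to hold `total_docs` documents."""
    if not 0 < docs_per_index <= MAX_DOCS_PER_INDEX:
        raise ValueError("docs_per_index must be between 1 and 2**31 - 1")
    return -(-total_docs // docs_per_index)  # ceiling division on integers


indexes_at_one_billion = indexes_needed(TOTAL_URIS, 10**9)            # 1000
minimum_indexes = indexes_needed(TOTAL_URIS, MAX_DOCS_PER_INDEX)      # 466
```

Even if every index were filled to the 32-bit ceiling, a trillion documents would still need 466 indexes, so some form of multi-index search is unavoidable at this scale.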

Is this the right way to view the problem?
-------------
Sincerely,
David G. Boney
dboney1@semanticartifacts.com
http://www.semanticartifacts.com






Re: Lucandra Limitations

Posted by Jake Luciani <ja...@gmail.com>.
Yes, but that's also the Lucene limit:
http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations

"Lucene uses a Java int to refer to document numbers, and the index file
format uses an Int32"
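
To make the "two billion" figure concrete, here is a small sketch of the range of a 32-bit signed integer, which is what a Java int is. It uses Python's standard `struct` module to emulate the 4-byte encoding; the helper name `fits_in_int32` is hypothetical and is not part of Lucene.

```python
import struct

JAVA_INT_MAX = 2**31 - 1  # same value as Integer.MAX_VALUE in Java


def fits_in_int32(doc_number):
    """True if `doc_number` can be stored as a 4-byte signed integer."""
    try:
        struct.pack(">i", doc_number)  # ">i" = big-endian 32-bit signed int
        return True
    except struct.error:
        return False


last_valid = fits_in_int32(JAVA_INT_MAX)        # True
first_invalid = fits_in_int32(JAVA_INT_MAX + 1)  # False
```

Since document numbers are non-negative, the usable range is 0 through 2,147,483,647, which is where the figure of roughly two billion documents per index comes from.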




Re: Lucandra Limitations

Posted by Paul Brown <pa...@gmail.com>.
Lucene uses 32-bit ints internally for document numbers, so I expect you're just seeing a projection of that limitation.
