Posted to user@cassandra.apache.org by ML_Seda <so...@gmail.com> on 2010/01/18 20:29:57 UTC

Lucandra Ingestion

I'm inserting a lot of documents into Cassandra/Lucandra.  The problem is
that ingestion is fairly slow.  Each call to

addDocument(Document doc, Analyzer analyzer)

takes 25-50 milliseconds.

Has any work been done to speed this up?  Maybe a bulk insert?

Thanks
-- 
View this message in context: http://n2.nabble.com/Lucandra-Ingestion-tp4415691p4415691.html
Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.

Re: Lucandra Ingestion

Posted by ML_Seda <so...@gmail.com>.
Sure.  Thanks Jake & Jon!
-- 
View this message in context: http://n2.nabble.com/Lucandra-Ingestion-tp4415691p4457366.html
Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.

Re: Lucandra Ingestion

Posted by Jake Luciani <ja...@gmail.com>.
I'm going to make it multithreaded internally once I get some spare time.

Also, could you raise Lucandra-specific issues against the Lucandra
project on GitHub?

Thx!

On Jan 25, 2010, at 5:01 PM, Jonathan Ellis <jb...@gmail.com> wrote:

> I'm not super familiar with Lucandra but my guess is you want to use
> one indexwriter per thread, there's not much point in using a shared,
> thread-unsafe object across multiple clients and serializing via
> synchronized.

Re: Lucandra Ingestion

Posted by Jonathan Ellis <jb...@gmail.com>.
I'm not super familiar with Lucandra, but my guess is you want to use
one IndexWriter per thread.  There's not much point in sharing a
thread-unsafe object across multiple clients and serializing access to
it via synchronized.
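
For illustration only, here is a minimal sketch of the one-writer-per-thread
idea.  DocWriter is a hypothetical stand-in for a thread-unsafe writer; it is
not Lucandra's IndexWriter, and the names ingest and PerThreadWriterDemo are
made up for this example.  Each pool thread lazily creates its own writer
through a ThreadLocal, so no synchronized block is needed around addDocument.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PerThreadWriterDemo {

    /** Hypothetical thread-unsafe writer (assumption: stands in for an index writer). */
    static final class DocWriter {
        private int written; // plain field: safe only because a single thread owns this writer

        void addDocument(String doc) {
            written++;
        }

        int written() {
            return written;
        }
    }

    /** Indexes docs on a fixed pool and returns the total number of documents written. */
    static int ingest(List<String> docs, int threads) throws Exception {
        // Track every writer that gets created so the totals can be summed at the end.
        List<DocWriter> writers = Collections.synchronizedList(new ArrayList<>());
        ThreadLocal<DocWriter> local = ThreadLocal.withInitial(() -> {
            DocWriter w = new DocWriter();
            writers.add(w);
            return w;
        });

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (String doc : docs) {
                // Each task uses the writer owned by whichever thread runs it.
                pending.add(pool.submit(() -> local.get().addDocument(doc)));
            }
            for (Future<?> f : pending) {
                f.get(); // wait for completion and surface any task failure
            }
        } finally {
            pool.shutdown();
        }

        int total = 0;
        for (DocWriter w : writers) {
            total += w.written();
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(ingest(List.of("a", "b", "c", "d"), 2));
    }
}
```

Whether this helps in practice depends on whether the real writer can be
instantiated once per thread and on where the time is actually spent; the
sketch only shows the threading pattern.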

On Mon, Jan 25, 2010 at 3:59 PM, ML_Seda <so...@gmail.com> wrote:
>
>
> Jonathan Ellis-3 wrote:
>>
>> Are you using multiple threads?
>>
>
> I'm adding in threading now, and getting exceptions at times regarding a
> "broken pipe".
>
> I then added the following :
>        synchronized (this) {
>                indexWriter.addDocument(doc, analyzer);
>        }
>
> Which did get rid of the problem.  I'm currently using Phasers (jsr166) to
> register threads per file found in a given directory.  Although it still
> seems slow.
>
> Has anyone else ingested large # of files, and found ways to optimize
> ingestion?  If I apply a patch for batch operations (from the link in the
> post), will this work with the version of cassandra supported by lucandra?
>
> Thanks again.

Re: Lucandra Ingestion

Posted by ML_Seda <so...@gmail.com>.

Jonathan Ellis-3 wrote:
> 
> Are you using multiple threads?
> 

I'm adding threading now, and at times I get exceptions regarding a
"broken pipe".

I then added the following:
        synchronized (this) {
            indexWriter.addDocument(doc, analyzer);
        }

That did get rid of the problem.  I'm currently using Phasers (jsr166) to
register threads per file found in a given directory, but it still seems
slow.

Has anyone else ingested a large number of files and found ways to optimize
ingestion?  If I apply a patch for batch operations (from the link in the
post), will it work with the version of Cassandra supported by Lucandra?

Thanks again.

-- 
View this message in context: http://n2.nabble.com/Lucandra-Ingestion-tp4415691p4457044.html
Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.

Re: Lucandra Ingestion

Posted by Jonathan Ellis <jb...@gmail.com>.
Are you using multiple threads?

On Mon, Jan 18, 2010 at 1:29 PM, ML_Seda <so...@gmail.com> wrote:
>
> I'm inserting a lot of documents into Cassandra/Lucandra.  The problem is,
> the ingestion is fairly slow:
>
> addDocument(Document doc, Analyzer analyzer)
>
> method takes 25-50 milliseconds
>
> Was there any work done to speed this up?  maybe a bulk insert?
>
> Thanks

Re: Lucandra Ingestion

Posted by ML_Seda <so...@gmail.com>.
This particular document is 8,158 bytes, and I am storing up to six fields,
only one of which is both indexed and stored.

Indexing is taking:
Indexing Took: 14714ms*

This is problematic when I'm trying to ingest millions of documents.
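
To put the per-document timings in this thread in perspective, here is a
rough back-of-the-envelope sketch.  The figure of one million documents and
the assumption of perfectly linear scaling across threads are illustrative
assumptions, not measurements; estimateHours is a made-up helper name.

```java
final class IngestEstimate {

    /**
     * Estimated wall-clock hours to ingest `docs` documents at `msPerDoc`
     * milliseconds each, assuming (optimistically) linear scaling over `threads`.
     */
    static double estimateHours(long docs, double msPerDoc, int threads) {
        double totalMs = docs * msPerDoc / threads;
        return totalMs / 3_600_000.0; // 3,600,000 ms per hour
    }

    public static void main(String[] args) {
        long docs = 1_000_000L; // assumption: "millions" rounded down to one million
        System.out.println("25 ms/doc:    " + estimateHours(docs, 25, 1) + " h");
        System.out.println("50 ms/doc:    " + estimateHours(docs, 50, 1) + " h");
        System.out.println("14714 ms/doc: " + estimateHours(docs, 14714, 1) + " h");
    }
}
```

At 25-50 ms per document a single thread needs roughly 7 to 14 hours per
million documents; at 14714 ms per document the same million would take
on the order of 170 days.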


Jake Luciani wrote:
> 
> How big are the documents?  Each term requires an insert so it's def slow
> on Lucandra's side. Once the bulk insert for many keys is in available this
> should go much faster.
> 
> https://issues.apache.org/jira/browse/CASSANDRA-336
> 
> Looks like it will be in 0.6 release.
> 
> -Jake

-- 
View this message in context: http://n2.nabble.com/Lucandra-Ingestion-tp4415691p4415874.html
Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.

Re: Lucandra Ingestion

Posted by ML_Seda <so...@gmail.com>.
Thanks Jake.  No, I'm currently not using multiple threads.  I will do that
next.

Thanks for the link; hopefully Lucandra will support this as well.


Jake Luciani wrote:
> 
> How big are the documents?  Each term requires an insert so it's def slow
> on Lucandra's side. Once the bulk insert for many keys is in available this
> should go much faster.
> 
> https://issues.apache.org/jira/browse/CASSANDRA-336
> 
> Looks like it will be in 0.6 release.
> 
> -Jake

-- 
View this message in context: http://n2.nabble.com/Lucandra-Ingestion-tp4415691p4415835.html
Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.

Re: Lucandra Ingestion

Posted by Jake Luciani <ja...@gmail.com>.
How big are the documents?  Each term requires its own insert, so it's
definitely slow on Lucandra's side.  Once bulk insert for many keys is
available, this should go much faster.

https://issues.apache.org/jira/browse/CASSANDRA-336

Looks like it will be in the 0.6 release.
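
As a rough sketch of why batching helps (this is not Lucandra's code and not
Cassandra's client API): CountingStore below is a hypothetical store that
only counts round trips, and the batch size of 100 is an arbitrary
assumption.  One insert per term costs one round trip per term, while
grouping terms into batches costs one round trip per batch.

```java
import java.util.ArrayList;
import java.util.List;

final class BatchedTermInserts {

    /** Hypothetical store that counts round trips; it does not talk to Cassandra. */
    static final class CountingStore {
        int roundTrips;
        int termsStored;

        void insert(String term) {
            roundTrips++;
            termsStored++;
        }

        void insertBatch(List<String> terms) {
            roundTrips++; // the whole batch travels in a single call
            termsStored += terms.size();
        }
    }

    /** One call per term: round trips grow linearly with the number of terms. */
    static int indexPerTerm(CountingStore store, List<String> terms) {
        for (String term : terms) {
            store.insert(term);
        }
        return store.roundTrips;
    }

    /** Terms grouped into batches: round trips equal ceil(terms / batchSize). */
    static int indexBatched(CountingStore store, List<String> terms, int batchSize) {
        for (int i = 0; i < terms.size(); i += batchSize) {
            int end = Math.min(i + batchSize, terms.size());
            store.insertBatch(terms.subList(i, end));
        }
        return store.roundTrips;
    }

    public static void main(String[] args) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < 250; i++) {
            terms.add("term" + i);
        }
        System.out.println("per-term round trips: " + indexPerTerm(new CountingStore(), terms));
        System.out.println("batched round trips:  " + indexBatched(new CountingStore(), terms, 100));
    }
}
```

The actual gain depends on how the real bulk-insert call behaves on the
server side, which this sketch does not model.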

-Jake

On Mon, Jan 18, 2010 at 2:29 PM, ML_Seda <so...@gmail.com> wrote:

>
> I'm inserting a lot of documents into Cassandra/Lucandra.  The problem is,
> the ingestion is fairly slow:
>
> addDocument(Document doc, Analyzer analyzer)
>
> method takes 25-50 milliseconds
>
> Was there any work done to speed this up?  maybe a bulk insert?
>
> Thanks