Posted to user@cassandra.apache.org by Benoit Perroud <be...@noisette.ch> on 2011/09/02 10:29:17 UTC

SSTableSimpleUnsortedWriter takes a long time when inserting big rows

Hi All,

I started using SSTableSimpleUnsortedWriter to load data, and my data
has few rows but a large number of columns in each row.

I call SSTableSimpleUnsortedWriter.newRow every 10'000 columns inserted.

But the time taken to insert columns increases as the column
family grows. The problem arises because every time we call
newRow, all the columns of the previous CF are added to the new CF.

Attached is a small patch that checks which CF is the smaller one and
adds the smaller CF to the larger one.
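
To make the cost difference concrete, here is a minimal sketch of the idea, not the attached patch itself. It assumes a plain TreeMap as a stand-in for a column family, and the names MergeSketch, mergeNaive, mergeSmallerIntoLarger and simulate are hypothetical. It counts how many columns get copied when a single row is fed in 5 batches of 10,000 columns.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in: a TreeMap plays the role of a column family
// (sorted column name -> value). This is NOT Cassandra's ColumnFamily class.
class MergeSketch {

    // Current behaviour as described above: every previously buffered column
    // is copied into the fresh batch, however large the buffer already is.
    static TreeMap<Integer, String> mergeNaive(
            TreeMap<Integer, String> previous, TreeMap<Integer, String> fresh, long[] copies) {
        for (Map.Entry<Integer, String> e : previous.entrySet()) {
            fresh.putIfAbsent(e.getKey(), e.getValue()); // fresh value wins on a clash
        }
        copies[0] += previous.size();
        return fresh;
    }

    // Idea of the patch: copy the smaller map into the larger one.
    // Either way, the value from 'fresh' is kept when a name clashes.
    static TreeMap<Integer, String> mergeSmallerIntoLarger(
            TreeMap<Integer, String> previous, TreeMap<Integer, String> fresh, long[] copies) {
        if (previous.size() <= fresh.size()) {
            return mergeNaive(previous, fresh, copies);
        }
        previous.putAll(fresh); // overwrites, so the fresh value wins on a clash
        copies[0] += fresh.size();
        return previous;
    }

    // Feed one row in 'batches' chunks of 'batchSize' distinct columns and
    // return how many column copies the chosen merge strategy performed.
    static long simulate(int batches, int batchSize, boolean smallerIntoLarger) {
        TreeMap<Integer, String> buffered = new TreeMap<>();
        long[] copies = {0};
        for (int b = 0; b < batches; b++) {
            TreeMap<Integer, String> fresh = new TreeMap<>();
            for (int i = 0; i < batchSize; i++) {
                fresh.put(b * batchSize + i, "v");
            }
            buffered = smallerIntoLarger
                    ? mergeSmallerIntoLarger(buffered, fresh, copies)
                    : mergeNaive(buffered, fresh, copies);
        }
        if (buffered.size() != batches * batchSize) {
            throw new IllegalStateException("columns were lost during merge");
        }
        return copies[0];
    }

    public static void main(String[] args) {
        System.out.println("naive copies:               " + simulate(5, 10_000, false));
        System.out.println("smaller-into-larger copies: " + simulate(5, 10_000, true));
    }
}
```

With these numbers the naive merge copies 100,000 columns in total and the smaller-into-larger merge copies 40,000. The gap widens with every additional batch, since the naive total grows quadratically with the number of batches while the other grows linearly.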

Should I open a bug for that?

Thanks in advance,

Benoit

Re: SSTableSimpleUnsortedWriter takes a long time when inserting big rows

Posted by Benoit Perroud <be...@noisette.ch>.
Thanks for your answer.

2011/9/2 Sylvain Lebresne <sy...@datastax.com>:
> On Fri, Sep 2, 2011 at 10:29 AM, Benoit Perroud <be...@noisette.ch> wrote:
>> Hi All,
>>
>> I started using SSTableSimpleUnsortedWriter to load data, and my data
>> has few rows but a large number of columns in each row.
>>
>> I call SSTableSimpleUnsortedWriter.newRow every 10'000 columns inserted.
>>
>> But the time taken to insert columns increases as the column
>> family grows. The problem arises because every time we call
>> newRow, all the columns of the previous CF are added to the new CF.
>
> If I understand correctly, each row has way more than 10 000 columns, but
> you call newRow every 10 000 columns, right?

Yes. I call newRow every 10 000 columns to be sure to flush as soon as possible.

> Note that you have the possibility to decrease the frequency of the calls to
> newRow.
>
> But anyway, I agree that the code shouldn't suck like that.
>
>> Attached is a small patch that checks which CF is the smaller one and
>> adds the smaller CF to the larger one.
>>
>> Should I open a bug for that?
>
> Please do. I'm actually thinking of a slightly different fix: we should not have
> to add all the previous columns to the new column family; we should just
> directly reuse the previous column family when adding the new column.
> But the JIRA ticket will be a better place to discuss this.

Opened : https://issues.apache.org/jira/browse/CASSANDRA-3122
Let's discuss there.

Thanks !

Benoit.

> --
> Sylvain
>

Re: SSTableSimpleUnsortedWriter takes a long time when inserting big rows

Posted by Sylvain Lebresne <sy...@datastax.com>.
On Fri, Sep 2, 2011 at 10:29 AM, Benoit Perroud <be...@noisette.ch> wrote:
> Hi All,
>
> I started using SSTableSimpleUnsortedWriter to load data, and my data
> has few rows but a large number of columns in each row.
>
> I call SSTableSimpleUnsortedWriter.newRow every 10'000 columns inserted.
>
> But the time taken to insert columns increases as the column
> family grows. The problem arises because every time we call
> newRow, all the columns of the previous CF are added to the new CF.

If I understand correctly, each row has way more than 10 000 columns, but
you call newRow every 10 000 columns, right?

Note that you have the possibility to decrease the frequency of the calls to
newRow.

But anyway, I agree that the code shouldn't suck like that.

> Attached is a small patch that checks which CF is the smaller one and
> adds the smaller CF to the larger one.
>
> Should I open a bug for that?

Please do. I'm actually thinking of a slightly different fix: we should not have
to add all the previous columns to the new column family; we should just
directly reuse the previous column family when adding the new column.
But the JIRA ticket will be a better place to discuss this.
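
As a rough illustration of that alternative (a sketch under assumptions, not the eventual fix), the writer could look up the already-buffered column family for a key and keep appending to it, so nothing is copied at all. The class ReuseRowBuffer and its String-based newRow/addColumn methods below are hypothetical stand-ins, with a TreeMap in place of a real column family.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the writer's in-memory buffer. The method names
// mirror the writer's newRow/addColumn for readability only; the signatures
// and types here are simplified assumptions, not Cassandra's API.
class ReuseRowBuffer {
    // Row key -> buffered columns (sorted column name -> value).
    private final Map<String, TreeMap<String, String>> buffer = new HashMap<>();
    private TreeMap<String, String> current;

    // If the key is already buffered, reuse its existing map instead of
    // creating a new one and merging the old columns into it.
    void newRow(String key) {
        current = buffer.computeIfAbsent(key, k -> new TreeMap<>());
    }

    // Append a column to whichever row was selected by the last newRow call.
    void addColumn(String name, String value) {
        if (current == null) {
            throw new IllegalStateException("newRow must be called before addColumn");
        }
        current.put(name, value);
    }

    // Number of columns currently buffered for a key (0 if the key is unknown).
    int columnCount(String key) {
        TreeMap<String, String> columns = buffer.get(key);
        return columns == null ? 0 : columns.size();
    }

    public static void main(String[] args) {
        ReuseRowBuffer writer = new ReuseRowBuffer();
        writer.newRow("row1");
        writer.addColumn("a", "1");
        writer.newRow("row1"); // same key again: the buffered row is reused
        writer.addColumn("b", "2");
        System.out.println(writer.columnCount("row1"));
    }
}
```

In this sketch a repeated newRow call for the same key costs one map lookup, regardless of how many columns the row already holds.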

--
Sylvain