Posted to user@cassandra.apache.org by Даниел Симеонов <ds...@gmail.com> on 2010/04/28 13:56:22 UTC

question about how columns are deserialized in memory

Hi,
   I have a question: if a row in a Column Family has only regular
columns, are all of the columns deserialized into memory when you need
any one of them? As I understood it, that is the case, and if the
Column Family is a super Column Family, then only the (entire) Super
Column is brought into memory? What about the row cache, is it
different from the memtable?
I have another question. Let's say there is only data to be inserted,
and the solution is to keep adding columns to rows in a Column Family.
Is it possible in Cassandra to split a row once a certain threshold is
reached, say 100 columns per row? And what happens with concurrent
inserts?
The original data model and use case is to insert timestamped data and
make range queries. The original keys of the CF rows were of the form
<id>.<timestamp>, each row holding a single column with the data, and
OPP was used. This is not an optimal solution, since some nodes are
hotter than others. I am thinking of changing the model to use keys
like <id>.<year/month/day>, each row holding a list of columns with
timestamps within that range, with the RandomPartitioner; or to keep
OPP but preprocess part of the key with MD5, i.e. the key is
MD5(<id>.<year/month/day>) + "hour of the day", as sketched below. The
remaining problem is how to deal with a large number of columns being
inserted into a particular row.
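For illustration, such a key could be built like this (a minimal
Python sketch; the function name and inputs are only examples):

    import hashlib

    def bucket_key(entity_id, day, hour):
        """Row key of the form MD5(<id>.<year/month/day>) + hour of day.

        The MD5 prefix spreads each entity's day buckets evenly over
        an OPP ring, while the hour suffix keeps that day's rows
        adjacent, so a range query within one day stays cheap.
        """
        digest = hashlib.md5(
            ("%s.%s" % (entity_id, day)).encode("utf-8")).hexdigest()
        return "%s.%02d" % (digest, hour)

    # e.g. bucket_key("sensor42", "2010/04/28", 13)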
Thank you very much!
Best regards, Daniel.

Re: question about how columns are deserialized in memory

Posted by Sylvain Lebresne <sy...@yakaz.com>.
> Hi,
>   What about if the upper bound of columns in a row is only loosely
> defined, i.e. it is ok to have a maximum of around 100, but not
> exactly (maybe 105, 110)?
> What if I make a slice query to return, say, 1/5th of the columns in
> a row? I believe such a query again will not deserialize all columns
> into memory?

But you cannot do a slice that returns some percentage of the row (I
mean, you can ask for x columns of a row, but not for 1/5th of them).
Even with a loose upper bound (which clearly makes it easier), it is
not so easy, as you have to decide when to start using another row
(and, by the way, there is the problem of choosing the name of the new
row when you start inserting into it). There is no way to get even an
approximate count of the columns in a row other than counting them
(which, from the client's perspective, is already an approximation in
the general case, since new columns may have been inserted by other
clients just after the count you see was calculated).
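To make this concrete, here is a rough sketch with a pycassa-style
client (a later API than this thread's era; keyspace and column family
names are made up, and exact calls vary by version):

    import pycassa

    pool = pycassa.ConnectionPool('Keyspace1', ['localhost:9160'])
    events = pycassa.ColumnFamily(pool, 'Events')

    # You can ask for "at most 20 columns of this row"...
    first_20 = events.get('some-row-key', column_count=20)

    # ...but "1/5th of the row" would need the total first, and any
    # total is already stale when it returns: other clients may have
    # inserted more columns while get_count() was running.
    approx_total = events.get_count('some-row-key')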


Re: question about how columns are deserialized in memory

Posted by Даниел Симеонов <ds...@gmail.com>.
Hi,
  What about if the upper bound of columns in a row is only loosely
defined, i.e. it is ok to have a maximum of around 100, but not
exactly (maybe 105, 110)?
What if I make a slice query to return, say, 1/5th of the columns in a
row? I believe such a query again will not deserialize all columns
into memory?
Best regards, Daniel.


Re: question about how columns are deserialized in memory

Posted by Sylvain Lebresne <sy...@yakaz.com>.
2010/4/28 Даниел Симеонов <ds...@gmail.com>:
> Hi Sylvain,
>   Thank you very much! I still have some further questions. I didn't
> find where the row cache is configured?

Provided you don't use trunk but something stable like 0.6.1 (which
you should), it is in storage-conf.xml. It's an option on the
definition of the column families (it is documented in the file).
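For example, in a 0.6-era storage-conf.xml it is an attribute on the
ColumnFamily element, something like this (attribute names from
memory, so double-check against the comments in your own file):

    <Keyspace Name="Keyspace1">
      <!-- RowsCached and KeysCached accept a number or a percentage -->
      <ColumnFamily Name="Events"
                    CompareWith="UTF8Type"
                    RowsCached="10000"
                    KeysCached="200000"/>
    </Keyspace>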

> Regarding the splitting of rows, I understand that it is not really
> necessary, but I am still curious whether it could be implemented in
> client code.

Well, I'm not sure there is any simple way to do it (at least not
efficiently). Counting the number of columns in a row is expensive,
and there is no easy way to implement counters in Cassandra (even
though https://issues.apache.org/jira/browse/CASSANDRA-580 will make
that better someday).
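As a sketch of why a client-side split is racy (pycassa-style, purely
illustrative, with a made-up continuation-row convention):

    import pycassa

    pool = pycassa.ConnectionPool('Keyspace1', ['localhost:9160'])
    events = pycassa.ColumnFamily(pool, 'Events')

    def insert_with_split(row_key, col_name, value, threshold=100):
        # Racy: between get_count() and insert(), other clients can
        # push the row past the threshold, and two clients can both
        # decide to open the continuation row at the same time. Without
        # server-side counters the threshold is only ever approximate.
        if events.get_count(row_key) >= threshold:
            row_key = row_key + '.1'  # hypothetical continuation row
        events.insert(row_key, {col_name: value})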


Re: question about how columns are deserialized in memory

Posted by Даниел Симеонов <ds...@gmail.com>.
Hi Sylvain,
  Thank you very much! I still have some further questions. I didn't
find where the row cache is configured? Regarding the splitting of
rows, I understand that it is not really necessary, but I am still
curious whether it could be implemented in client code.
Best regards, Daniel.


Re: question about how columns are deserialized in memory

Posted by Sylvain Lebresne <sy...@yakaz.com>.
2010/4/28 Даниел Симеонов <ds...@gmail.com>:
> Hi,
>    I have a question: if a row in a Column Family has only regular
> columns, are all of the columns deserialized into memory when you
> need any one of them? As I understood it, that is the case,

No, it's not. Only the columns you request are deserialized into
memory. The only thing is that, as of now, during compaction the
entire row will be deserialized at once, so the row still has to fit
in memory. But depending on the typical size of your columns, you can
easily have millions of columns in a row without it being a problem at
all.
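In other words, a read like the following only touches the slice you
name, no matter how wide the row is (pycassa-style sketch; names and
API version are illustrative):

    import pycassa

    pool = pycassa.ConnectionPool('Keyspace1', ['localhost:9160'])
    events = pycassa.ColumnFamily(pool, 'Events')

    # Only the (at most) 50 columns whose names fall in this range are
    # deserialized; the rest of the row stays on disk.
    chunk = events.get('wide-row-key',
                       column_start='2010/04/28 00:00',
                       column_finish='2010/04/28 23:59',
                       column_count=50)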

> and if the Column Family is a super Column Family, then only the
> (entire) Super Column is brought into memory?

Yes, that part is true. That is the problem with the current
implementation of super columns. While you can have lots of columns in
one row, you probably don't want lots of columns in one super column
(but it's no problem to have lots of super columns in one row).

> What about the row cache, is it different from the memtable?

Be careful with the row cache. If the row cache is enabled, then yes,
any read of a row will read the entire row. So you typically don't
want to use the row cache on a column family whose rows have lots of
columns (unless you always read all the columns in the row each time,
of course).

> I have another question. Let's say there is only data to be
> inserted, and the solution is to keep adding columns to rows in a
> Column Family. Is it possible in Cassandra to split a row once a
> certain threshold is reached, say 100 columns per row? And what
> happens with concurrent inserts?

No, Cassandra can't do that for you. But you should be okay with what
you describe below: if a given row corresponds to an hour of data,
that will bound its size. And again, the number of columns in a row is
not really limited as long as the overall size of the row fits easily
in memory.
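A minimal sketch of the time-bucketed layout you describe
(pycassa-style; all names and formats are only examples):

    import hashlib
    import time
    import pycassa

    pool = pycassa.ConnectionPool('Keyspace1', ['localhost:9160'])
    events = pycassa.ColumnFamily(pool, 'Events')

    def append_event(entity_id, value, ts=None):
        ts = ts if ts is not None else time.time()
        day = time.strftime('%Y/%m/%d', time.gmtime(ts))
        hour = time.gmtime(ts).tm_hour
        # One row per entity per hour bounds how wide a row can grow,
        # and the MD5 prefix spreads rows evenly even under OPP.
        row_key = '%s.%02d' % (
            hashlib.md5(('%s.%s' % (entity_id, day))
                        .encode('utf-8')).hexdigest(),
            hour)
        # Column name = zero-padded timestamp, so a time-range query
        # becomes a contiguous column slice within the row.
        events.insert(row_key, {'%017.6f' % ts: value})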

> The original data model and use case is to insert timestamped data
> and make range queries. The original keys of the CF rows were of the
> form <id>.<timestamp>, each row holding a single column with the
> data, and OPP was used. This is not an optimal solution, since some
> nodes are hotter than others. I am thinking of changing the model to
> use keys like <id>.<year/month/day>, each row holding a list of
> columns with timestamps within that range, with the
> RandomPartitioner; or to keep OPP but preprocess part of the key
> with MD5, i.e. the key is MD5(<id>.<year/month/day>) + "hour of the
> day". The remaining problem is how to deal with a large number of
> columns being inserted into a particular row.
> Thank you very much!
> Best regards, Daniel.