Posted to user@phoenix.apache.org by "Bulvik, Noam" <No...@teoco.com> on 2015/03/10 07:03:45 UTC

CSV bulk loading question

Hi,

We are using CSV bulk loading (MR) to load our data. We have a table with
50 columns, and we did some testing to understand the factors affecting
loading performance. We compared two cases:

A - each column in the data becomes a column in the HBase table
B - all non-key columns are combined into one column in the HBase table

We saw that the second option was 7 times faster than the first one and
consumed less CPU.

Does this make sense? Can we do something to tune the system so option A
will run faster? (We prefer it this way because it enables us to query and
filter over all data columns.)


Regards,

Noam Bulvik


________________________________

PRIVILEGED AND CONFIDENTIAL
PLEASE NOTE: The information contained in this message is privileged and confidential, and is intended only for the use of the individual to whom it is addressed and others who have been specifically authorized to receive it. If you are not the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is strictly prohibited. If you have received this communication in error, or if any problems occur with transmission, please contact sender. Thank you.

Re: CSV bulk loading question

Posted by Gabriel Reid <ga...@gmail.com>.
Hi Noam,

From the perspective of row-based data storage you're writing the same
data, but from the perspective of HBase it's not the same at all. This is
because HBase stores everything as KeyValues, with one KeyValue per column
in Phoenix.

Let's say you've got a table with a single primary key column (K) and three
string fields (S1, S2, S3). We'll imagine we've got a single row of this
table containing the values ('myKey', 'myS1', 'myS2', 'myS3'). The internal
storage for a single row of this table will involve three KeyValues, each
of which contains the row key, column family, column qualifier,
timestamp, and value, i.e.

['myKey', '0:S1', <timestamp>, 'myS1']
['myKey', '0:S2', <timestamp>, 'myS2']
['myKey', '0:S3', <timestamp>, 'myS3']

When you store columns S1, S2, and S3 as a single column (SX), then the
internal encoding of this data in HBase is as follows:

['myKey', '0:SX', <timestamp>, 'myS1myS2myS3']

As you can see, this is much more compact -- it only contains the key,
column family name, column qualifier, and timestamp once for the whole row.
Sorting this single KeyValue (instead of the three KeyValues above) will
also be quicker.
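To make the size difference concrete, here is a rough back-of-the-envelope
sketch in plain Python. The byte counts are simplified (real HBase KeyValues
also carry key/value length fields, a type byte, and so on), so treat the
numbers as illustrative proportions, not actual on-disk sizes:

```python
# Rough, illustrative estimate of per-row bytes written for the two layouts.
# Hypothetical sizes: the point is the repetition, not exact HBase encoding.

def keyvalue_size(row_key, family, qualifier, value, timestamp_bytes=8):
    # Every KeyValue repeats the row key, column family, qualifier,
    # and timestamp alongside the value itself.
    return len(row_key) + len(family) + len(qualifier) + timestamp_bytes + len(value)

row_key = b"myKey"
values = [b"val%02d" % i for i in range(50)]  # 50 non-key columns

# Option A: one KeyValue per column
size_a = sum(keyvalue_size(row_key, b"0", b"C%02d" % i, v)
             for i, v in enumerate(values))

# Option B: all non-key columns packed into one KeyValue
size_b = keyvalue_size(row_key, b"0", b"SX", b"".join(values))

print(size_a, size_b)  # 1100 266
```

With these toy numbers the per-column layout writes roughly four times as
many bytes per row before compression, which is the kind of overhead that
shows up as extra CPU and I/O during a bulk load.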

Obviously storing the data this way has drawbacks (you can't easily access
the individual values), but it will be much more performant. Again, this
isn't really Phoenix-specific, it's more related to how HBase stores data.

- Gabriel



RE: CSV bulk loading question

Posted by "Bulvik, Noam" <No...@teoco.com>.
Thanks James
Is there any explanation for why we see this difference? After all, we are
writing the same data in both cases.


Re: CSV bulk loading question

Posted by James Taylor <ja...@apache.org>.
Hi Noam,
We're tuning CSV bulk load in PHOENIX-1711, but it won't get you a 7x
speedup (maybe 30% at the most if we're lucky). The other thing you'd
lose by writing all values into one column is incremental update speed
which may or may not apply for your use case. To update a single
value, you'd need to read the existing value, stitch in the new value
and write it back out again.

If your use case doesn't require incremental update, you can write
them all in a single column and still query and access them
individually. The easiest way I can think of doing this would be to
store them in a protobuf and serialize it into a VARBINARY column.
Then create a built-in function that allows accessing them
positionally or by name. Once you have that, you could even add functional
indexes over individual fields (obviously with the overhead that
indexes add). PHOENIX-477 is about formalizing this as SQL Structs
which this could become if taken far enough. Even without this, just
having a set of built-in functions that work off of a protobuf would
be a useful first step and a great contribution.
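A minimal sketch of the single-column idea follows. James suggests protobuf;
this stand-in uses a simple length-prefixed encoding instead to stay
dependency-free, and the helper names (pack_fields, get_field, set_field)
are hypothetical illustrations, not a Phoenix API:

```python
import struct

def pack_fields(fields):
    # Store each field as a 4-byte big-endian length prefix plus its bytes.
    return b"".join(struct.pack(">I", len(f)) + f for f in fields)

def unpack_fields(blob):
    # Walk the blob, reading each length prefix and slicing out the field.
    fields, offset = [], 0
    while offset < len(blob):
        (length,) = struct.unpack_from(">I", blob, offset)
        offset += 4
        fields.append(blob[offset:offset + length])
        offset += length
    return fields

def get_field(blob, position):
    # What a positional built-in function could do against the VARBINARY.
    return unpack_fields(blob)[position]

def set_field(blob, position, value):
    # Incremental update is read-modify-write: decode the whole blob,
    # stitch in the new value, and re-encode everything.
    fields = unpack_fields(blob)
    fields[position] = value
    return pack_fields(fields)

blob = pack_fields([b"myS1", b"myS2", b"myS3"])
print(get_field(blob, 1))            # b'myS2'
blob = set_field(blob, 1, b"newS2")
print(get_field(blob, 1))            # b'newS2'
```

Note how set_field has to decode and re-encode the entire row value; that is
the incremental-update cost described above that you give up by packing all
columns into one.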

Thanks,
James

