Posted to user@hbase.apache.org by Bill Q <bi...@gmail.com> on 2014/11/10 15:21:22 UTC

Storing data with long history of versions

Hi,
I am designing a schema to store time series data for each device, and I
have a couple of questions that I am not quite sure about.

1. *Is there any downside to storing the data in the same
columnfamily:column with a long history of custom timestamps? *

For example, I have historical daily data for a device. I would like to use
only one column qualifier to store it, with a custom timestamp set to the
date the data was collected. That way, when I query the data I can easily
pull the whole time series for a particular device in one scan.

2. *After a storefile is finalized and becomes immutable, what happens
when someone updates the row? *

For example, suppose I insert a new column:value with a newer timestamp into
the same row:columnfamily. Where is this new key/value going to sit in
HDFS? Is it close to the previous K/V pairs in the storefile?
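
For question 1, one thing worth checking is how many versions the column family is allowed to keep, because versions beyond that limit are not returned. The following is only a toy Python model of that behaviour, not HBase client code: the `ToyCell` class, its `max_versions` argument, and the sample dates are all made up for illustration.

```python
from datetime import datetime, timezone


def date_to_millis(yyyy_mm_dd: str) -> int:
    """Turn a collection date into an epoch-millisecond timestamp (UTC midnight)."""
    dt = datetime.strptime(yyyy_mm_dd, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)


class ToyCell:
    """Hypothetical stand-in for one row/column that keeps several versions."""

    def __init__(self, max_versions: int):
        self.max_versions = max_versions
        self.versions = {}  # timestamp in ms -> value

    def put(self, timestamp_ms: int, value: str) -> None:
        # Writing the same timestamp twice overwrites that version.
        self.versions[timestamp_ms] = value

    def get_versions(self):
        # Newest first, capped at the configured number of versions.
        newest_first = sorted(self.versions.items(), reverse=True)
        return newest_first[: self.max_versions]


days = ["2014-11-06", "2014-11-07", "2014-11-08", "2014-11-09", "2014-11-10"]

small = ToyCell(max_versions=3)
large = ToyCell(max_versions=1_000_000)
for i, day in enumerate(days):
    small.put(date_to_millis(day), f"reading-{i}")
    large.put(date_to_millis(day), f"reading-{i}")

kept_small = small.get_versions()  # only the 3 newest days survive the cap
kept_large = large.get_versions()  # all 5 days are visible
print(len(kept_small), len(kept_large))
```

In HBase itself the limit is the column family's VERSIONS setting, and a Get or Scan also has to ask for more than the latest version in order to see the history.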


Many thanks.


Bill

Re: Storing data with long history of versions

Posted by Ted Yu <yu...@gmail.com>.
See this recent thread: http://search-hadoop.com/m/DHED4pDVFG1

Before a major compaction, a query may have to fetch data from multiple HFiles,
which is slower than fetching it from a single file. As for how much the
query times differ, you can run the queries against your own data to get a
more concrete idea.
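
To make the multiple-files point concrete, here is a toy Python sketch of a read that has to merge several sorted files, compared with the same read after the files have been merged into one. It is an illustration only: the file layout, the `read_versions` helper, and the sample rows are assumptions, and a real region server can also skip files using time-range metadata and Bloom filters, which this model ignores.

```python
import heapq


def read_versions(files, row, qualifier):
    """Return ([(timestamp, value), ...] newest first, number of files touched).

    Each file is a list of ((row, qualifier, timestamp), value) pairs
    sorted in ascending key order, standing in for an immutable HFile.
    """
    runs = []
    for f in files:
        hits = [(ts, v) for (r, q, ts), v in f if r == row and q == qualifier]
        if hits:
            hits.reverse()  # newest first within this file
            runs.append(hits)
    # Merge the per-file runs, keeping newest-first order overall.
    merged = list(heapq.merge(*runs, key=lambda kv: kv[0], reverse=True))
    return merged, len(runs)


# Three flushes, one per day, each producing its own file (hypothetical data).
day1 = [(("device-1", "reading", 20141108), "a")]
day2 = [(("device-1", "reading", 20141109), "b")]
day3 = [(("device-1", "reading", 20141110), "c")]

before, touched_before = read_versions([day1, day2, day3], "device-1", "reading")

# After a major compaction the same cells live in a single sorted file.
compacted = sorted(day1 + day2 + day3)
after, touched_after = read_versions([compacted], "device-1", "reading")

print(touched_before, touched_after)
```

The result of the read is identical in both cases; what changes is how many files have to be opened and merged to produce it.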

Cheers

On Mon, Nov 10, 2014 at 7:45 AM, Bill Q <bi...@gmail.com> wrote:

> Hi Ted,
> Thanks a lot.
>
> When would it break? Would you please give some details on why the size
> would be a deciding factor?
>
> I will probably have 10 cells that get daily updates. The rest of the cells
> in the column family will only have a handful of versions, so the cells in
> the same column family will be very skewed in terms of version counts.
>
> Before a major compaction, if I try to grab all the versions of a
> cell, will there be any performance issue? I plan to run a batch process over
> hundreds of thousands of devices, pulling out all the versions of those few
> cells.

Re: Storing data with long history of versions

Posted by Bill Q <bi...@gmail.com>.
Hi Ted,
Thanks a lot.

When would it break? Would you please give some details on why the size
would be a deciding factor?

I will probably have 10 cells that get daily updates. The rest of the cells
in the column family will only have a handful of versions, so the cells in
the same column family will be very skewed in terms of version counts.

Before a major compaction, if I try to grab all the versions of a
cell, will there be any performance issue? I plan to run a batch process over
hundreds of thousands of devices, pulling out all the versions of those few
cells.

On Monday, November 10, 2014, Ted Yu <yu...@gmail.com> wrote:

> Half a million timestamps at 20 bytes per cell come to about 10 MB.
> That should be fine for your client.
>
> Cheers


-- 
Many thanks.


Bill

Re: Storing data with long history of versions

Posted by Ted Yu <yu...@gmail.com>.
Half a million timestamps at 20 bytes per cell come to about 10 MB.
That should be fine for your client.
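
The estimate above can be checked with a few lines of arithmetic. This is a rough Python sketch: the 500,000 versions and the 20-byte values come from the thread, while the 30 bytes of per-version key overhead is a made-up illustrative figure, not a measurement. It is included only because each stored version also carries its key (row, family, qualifier, timestamp), not just its value.

```python
# Back-of-the-envelope size of one heavily versioned cell.
VERSIONS = 500_000        # "half a million timestamps", from the thread
VALUE_BYTES = 20          # value size per version, from the thread
KEY_OVERHEAD_BYTES = 30   # ASSUMPTION: illustrative per-version key overhead


def payload_bytes(versions: int, value_bytes: int) -> int:
    """Bytes taken by the values alone."""
    return versions * value_bytes


def total_bytes(versions: int, value_bytes: int, key_overhead_bytes: int) -> int:
    """Bytes taken by the values plus the assumed key overhead per version."""
    return versions * (value_bytes + key_overhead_bytes)


values_only = payload_bytes(VERSIONS, VALUE_BYTES)
with_keys = total_bytes(VERSIONS, VALUE_BYTES, KEY_OVERHEAD_BYTES)

print(f"values only:               {values_only / 1_000_000:.1f} MB")
print(f"with assumed key overhead: {with_keys / 1_000_000:.1f} MB")
```

Swapping in the real row key, family, and qualifier lengths gives a closer figure for a particular table.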

Cheers

On Mon, Nov 10, 2014 at 7:23 AM, Bill Q <bi...@gmail.com> wrote:

> Hi Ted,
> Thanks a lot for the reply.
>
> For #1, the value alone will be around 20 bytes per cell. There will be
> hundreds of thousands of timestamps per cell, but not millions. Any
> suggestions?
>
> Many thanks.
>
>
> Cao

Re: Storing data with long history of versions

Posted by Bill Q <bi...@gmail.com>.
Hi Ted,
Thanks a lot for the reply.

For #1, the value alone will be around 20 bytes per cell. There will be
hundreds of thousands of timestamps per cell, but not millions. Any
suggestions?

Many thanks.


Cao

On Monday, November 10, 2014, Ted Yu <yu...@gmail.com> wrote:

> For #1, what's the expected size of the data you want to store?
>
> For #2, the new data inserted under column:value with a newer timestamp
> would be stored in a different HFile. Old and new data would be
> consolidated after major compaction.
>
> Cheers


-- 
Many thanks.


Bill

Re: Storing data with long history of versions

Posted by Ted Yu <yu...@gmail.com>.
For #1, what's the expected size of the data you want to store?

For #2, the new data inserted under column:value with a newer timestamp
would be stored in a different HFile. Old and new data would be
consolidated after major compaction.
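
The answer to #2 can be pictured with a small toy model in Python. It is a sketch under simplifying assumptions, not HBase's implementation: `ToyStore`, its method names, and the sample data are hypothetical, and details such as the write-ahead log, minor compactions, and deletes are left out.

```python
class ToyStore:
    """Toy model of one column family's store: a memstore plus immutable files."""

    def __init__(self):
        self.memstore = {}  # (row, qualifier, timestamp) -> value
        self.files = []     # each flushed file is an immutable, sorted tuple

    def put(self, row, qualifier, timestamp, value):
        # New writes only ever go to the in-memory buffer, never into old files.
        self.memstore[(row, qualifier, timestamp)] = value

    def flush(self):
        # A flush writes the buffer out as a brand-new file.
        if self.memstore:
            self.files.append(tuple(sorted(self.memstore.items())))
            self.memstore = {}

    def major_compact(self):
        # Rewrite all existing files into a single sorted file.
        merged = {}
        for f in self.files:  # oldest first, so newer files win on duplicate keys
            merged.update(dict(f))
        self.files = [tuple(sorted(merged.items()))] if merged else []


store = ToyStore()
store.put("device-1", "reading", 20141108, "a")
store.put("device-1", "reading", 20141109, "b")
store.flush()                      # first file holds the two older versions

store.put("device-1", "reading", 20141110, "c")
store.flush()                      # the newer version lands in a second file

files_before = len(store.files)
store.major_compact()              # old and new versions end up side by side
files_after = len(store.files)
print(files_before, files_after)
```

Until the compaction runs, the newer version sits in a different file from the older ones, so it is not physically next to them even though it belongs to the same row and column.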

Cheers

On Mon, Nov 10, 2014 at 6:21 AM, Bill Q <bi...@gmail.com> wrote:

> Hi,
> I am designing a schema to store time series data for each device, and I
> have a couple of questions that I am not quite sure about.
>
> 1. *Is there any downside to storing the data in the same
> columnfamily:column with a long history of custom timestamps? *
>
> For example, I have historical daily data for a device. I would like to use
> only one column qualifier to store it, with a custom timestamp set to the
> date the data was collected. That way, when I query the data I can easily
> pull the whole time series for a particular device in one scan.
>
> 2. *After a storefile is finalized and becomes immutable, what happens
> when someone updates the row? *
>
> For example, suppose I insert a new column:value with a newer timestamp into
> the same row:columnfamily. Where is this new key/value going to sit in
> HDFS? Is it close to the previous K/V pairs in the storefile?
>
>
> Many thanks.
>
>
> Bill
>