Posted to user@hbase.apache.org by "Pamecha, Abhishek" <ap...@x.com> on 2012/08/22 01:00:37 UTC

HBase Put

Hi

I had a question on the HBase Put call. In the scenario where data is inserted without any order to column qualifiers, how does HBase maintain sorted order with respect to column qualifiers in its store files/blocks?

I checked the code base and I can see checks<https://github.com/apache/hbase/blob/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV2.java#L319> being made that key-value pairs are inserted in lexicographic order. But I can't seem to find out how the key offset is calculated in the first place.

Also, given that HDFS is append-only by nature, how do randomly ordered keys make their way into sorted order? Is it only during minor/major compactions that this sortedness gets applied, and is there a small window during which data is not sorted?


Thanks,
Abhishek


Re: HTable batch execution order

Posted by Harsh J <ha...@cloudera.com>.
Hi Shagun,

The original ordering index is still maintained.

Yes, you will get them back in order. Don't be confused by that
Javadoc statement. The result list is ordered in the same way as the
actions list, but the order in which the actions are executed depends
on various factors, hence the statement "The Get may not return what
the Put, in the same batch, had put".
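
To illustrate the contract, here is a minimal sketch (the table name
"mytable" and rows r1..r3 are placeholders, using the old HTable
client): whatever order the server executes the actions in, results[i]
always lines up with actions.get(i).

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Row;
  import org.apache.hadoop.hbase.util.Bytes;

  public class BatchOrderSketch {
    public static void main(String[] args) throws Exception {
      HTable table = new HTable(HBaseConfiguration.create(), "mytable");
      List<Row> actions = new ArrayList<Row>();
      actions.add(new Get(Bytes.toBytes("r1")));
      actions.add(new Get(Bytes.toBytes("r2")));
      actions.add(new Get(Bytes.toBytes("r3")));
      Object[] results = new Object[actions.size()];
      // Execution order across region servers is undefined, but the
      // results array is filled in index-aligned with the actions list.
      table.batch(actions, results);
      Result forR1 = (Result) results[0]; // always the result for r1
      Result forR3 = (Result) results[2]; // always the result for r3
      table.close();
    }
  }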

On Thu, Aug 23, 2012 at 2:49 PM, Shagun Agarwal <sh...@yahoo-inc.com> wrote:
> Hi,
>
> I have a question about the HTable.batch(List<? extends Row> actions, Object[] results) API. According to the Javadoc, "The ordering of execution of the actions is not defined. Meaning if you do a Put and a Get in the same batch call, you will not necessarily be guaranteed that the Get returns what the Put had put."
> My question is: if I don't mix up the actions and only provide Get actions, do I get the results back in the same order in which the Gets were provided?
> E.g., if I provide 3 Gets with row keys [r1, r2, r3], will I get [result1, result2, result3]?
>
> Thanks
> Shagun Agarwal



-- 
Harsh J

HTable batch execution order

Posted by Shagun Agarwal <sh...@yahoo-inc.com>.
Hi,

I have a question about the HTable.batch(List<? extends Row> actions, Object[] results) API. According to the Javadoc, "The ordering of execution of the actions is not defined. Meaning if you do a Put and a Get in the same batch call, you will not necessarily be guaranteed that the Get returns what the Put had put."
My question is: if I don't mix up the actions and only provide Get actions, do I get the results back in the same order in which the Gets were provided?
E.g., if I provide 3 Gets with row keys [r1, r2, r3], will I get [result1, result2, result3]?

Thanks
Shagun Agarwal

RE: HBase Put

Posted by "Pamecha, Abhishek" <ap...@x.com>.
Thanks. I should have read there first. :)

Thanks,
Abhishek


-----Original Message-----
From: Jason Frantz [mailto:jfrantz@maprtech.com] 
Sent: Wednesday, August 22, 2012 2:05 PM
To: user@hbase.apache.org
Subject: Re: HBase Put

Abhishek,

Setting your column family's bloom filter to ROWCOL will include qualifiers:

http://hbase.apache.org/book.html#schema.bloom

-Jason

On Wed, Aug 22, 2012 at 1:49 PM, Pamecha, Abhishek <ap...@x.com> wrote:

> Can I enable bloom filters per block at the column qualifier level too?
> That way, with small block sizes, I can selectively load only a few data
> blocks into memory. Then I can do some trade-off between block size and
> bloom filter false positive rate.
>
> I am designing for a wide-table scenario with thousands to millions of
> columns, and thus I don't really want to stress about checks for blocks
> having more than one row key.
>
> Thanks,
> Abhishek
>
>
> -----Original Message-----
> From: Mohit Anchlia [mailto:mohitanchlia@gmail.com]
> Sent: Wednesday, August 22, 2012 11:09 AM
> To: user@hbase.apache.org
> Subject: Re: HBase Put
>
> On Wed, Aug 22, 2012 at 10:20 AM, Pamecha, Abhishek <ap...@x.com>
> wrote:
>
> > So then a GET query means one needs to look in every HFile where the
> > key falls within the min/max range of the file.
> >
> > From another parallel thread, I gather that an HFile comprises
> > blocks, which, I think, are the atomic unit of persisted data in HDFS
> > (please correct me if not).
> >
> > And each block of an HFile covers a range of keys. My key can fall
> > within a block's range and yet not be present, so all the blocks
> > whose ranges match my key will need to be scanned. There is one block
> > index per HFile which sorts blocks by key ranges. This index helps
> > reduce the number of blocks to scan by selecting only those blocks
> > whose ranges satisfy the key.
> >
> > In this case, if puts arrive in random order, each block may cover a
> > similar range, and it may turn out that HBase needs to scan every
> > block in the file. This may not be good for performance.
> >
> > I just want to validate my understanding.
> >
> >
> If you have such a use case, I think the best practice is to use bloom
> filters. I think in general it's a good idea to at least enable bloom
> filters at the row level.
>
> > Thanks,
> > Abhishek
> >
> >
> > -----Original Message-----
> > From: lars hofhansl [mailto:lhofhansl@yahoo.com]
> >  Sent: Tuesday, August 21, 2012 5:55 PM
> > To: user@hbase.apache.org
> > Subject: Re: HBase Put
> >
> > That is correct.
> >
> >
> >
> > ________________________________
> >  From: "Pamecha, Abhishek" <ap...@x.com>
> > To: "user@hbase.apache.org" <us...@hbase.apache.org>; lars hofhansl < 
> > lhofhansl@yahoo.com>
> > Sent: Tuesday, August 21, 2012 4:45 PM
> > Subject: RE: HBase Put
> >
> > Hi Lars,
> >
> > Thanks for the explanation. I still have a little doubt:
> >
> > Based on your description, given gets do a merge sort, the data on 
> > disk is not kept sorted across files, but just sorted within a file.
> >
> > So, basically if on two separate days, say these keys get inserted:
> >
> > Day1: File1:   A B J M
> > Day2: File2:  C D K P
> >
> > Then each file is sorted within itself, but scanning both files will
> > require HBase to use a merge sort to produce a sorted result. Right?
> >
> > Also, File1 and File2 are immutable, and during compactions, File1
> > and File2 are compacted and sorted using merge sort into a bigger
> > File3. Is that correct too?
> >
> > Thanks,
> > Abhishek
> >
> >
> > -----Original Message-----
> > From: lars hofhansl [mailto:lhofhansl@yahoo.com]
> > Sent: Tuesday, August 21, 2012 4:07 PM
> > To: user@hbase.apache.org
> > Subject: Re: HBase Put
> >
> > In a nutshell:
> > - Puts are collected in memory (in a sorted data structure)
> > - When the collected data reaches a certain size it is flushed to a
> > new file (which is sorted)
> > - Gets do a merge sort between the various files that have been
> > created
> > - To contain the number of files, they are periodically compacted
> > into fewer, larger files
> >
> >
> > So the data files (HFiles) are immutable once written; changes are
> > batched in memory first.
> >
> > -- Lars
> >
> >
> >
> > ________________________________
> > From: "Pamecha, Abhishek" <ap...@x.com>
> > To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > Sent: Tuesday, August 21, 2012 4:00 PM
> > Subject: HBase Put
> >
> > Hi
> >
> > I had a question on the HBase Put call. In the scenario where data
> > is inserted without any order to column qualifiers, how does HBase
> > maintain sorted order with respect to column qualifiers in its store
> > files/blocks?
> >
> > I checked the code base and I can see checks<
> > https://github.com/apache/hbase/blob/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV2.java#L319>
> > being made that key-value pairs are inserted in lexicographic order.
> > But I can't seem to find out how the key offset is calculated in the
> > first place.
> >
> > Also, given that HDFS is append-only by nature, how do randomly
> > ordered keys make their way into sorted order? Is it only during
> > minor/major compactions that this sortedness gets applied, and is
> > there a small window during which data is not sorted?
> >
> >
> > Thanks,
> > Abhishek
> >
>

Re: HBase Put

Posted by Jason Frantz <jf...@maprtech.com>.
Abhishek,

Setting your column family's bloom filter to ROWCOL will include qualifiers:

http://hbase.apache.org/book.html#schema.bloom
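
As a rough sketch of how you would set it (the table and family names
are placeholders, and the class names are per the 0.92/0.94-era client,
so check them against your version):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.client.HBaseAdmin;
  import org.apache.hadoop.hbase.regionserver.StoreFile;

  public class RowColBloomSketch {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HBaseAdmin admin = new HBaseAdmin(conf);
      // Family "cf" keeps a bloom filter over (row, qualifier) pairs,
      // so Gets can skip store files by qualifier, not just by row.
      HColumnDescriptor cf = new HColumnDescriptor("cf");
      cf.setBloomFilterType(StoreFile.BloomType.ROWCOL);
      HTableDescriptor desc = new HTableDescriptor("mytable");
      desc.addFamily(cf);
      admin.createTable(desc);
      admin.close();
    }
  }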

-Jason

On Wed, Aug 22, 2012 at 1:49 PM, Pamecha, Abhishek <ap...@x.com> wrote:

> Can I enable bloom filters per block at the column qualifier level too?
> That way, with small block sizes, I can selectively load only a few data
> blocks into memory. Then I can do some trade-off between block size and
> bloom filter false positive rate.
>
> I am designing for a wide-table scenario with thousands to millions of
> columns, and thus I don't really want to stress about checks for blocks
> having more than one row key.
>
> Thanks,
> Abhishek
>
>
> -----Original Message-----
> From: Mohit Anchlia [mailto:mohitanchlia@gmail.com]
> Sent: Wednesday, August 22, 2012 11:09 AM
> To: user@hbase.apache.org
> Subject: Re: HBase Put
>
> On Wed, Aug 22, 2012 at 10:20 AM, Pamecha, Abhishek <ap...@x.com>
> wrote:
>
> > So then a GET query means one needs to look in every HFile where the
> > key falls within the min/max range of the file.
> >
> > From another parallel thread, I gather that an HFile comprises
> > blocks, which, I think, are the atomic unit of persisted data in HDFS
> > (please correct me if not).
> >
> > And each block of an HFile covers a range of keys. My key can fall
> > within a block's range and yet not be present, so all the blocks
> > whose ranges match my key will need to be scanned. There is one block
> > index per HFile which sorts blocks by key ranges. This index helps
> > reduce the number of blocks to scan by selecting only those blocks
> > whose ranges satisfy the key.
> >
> > In this case, if puts arrive in random order, each block may cover a
> > similar range, and it may turn out that HBase needs to scan every
> > block in the file. This may not be good for performance.
> >
> > I just want to validate my understanding.
> >
> >
> If you have such a use case, I think the best practice is to use bloom
> filters. I think in general it's a good idea to at least enable bloom
> filters at the row level.
>
> > Thanks,
> > Abhishek
> >
> >
> > -----Original Message-----
> > From: lars hofhansl [mailto:lhofhansl@yahoo.com]
> >  Sent: Tuesday, August 21, 2012 5:55 PM
> > To: user@hbase.apache.org
> > Subject: Re: HBase Put
> >
> > That is correct.
> >
> >
> >
> > ________________________________
> >  From: "Pamecha, Abhishek" <ap...@x.com>
> > To: "user@hbase.apache.org" <us...@hbase.apache.org>; lars hofhansl <
> > lhofhansl@yahoo.com>
> > Sent: Tuesday, August 21, 2012 4:45 PM
> > Subject: RE: HBase Put
> >
> > Hi Lars,
> >
> > Thanks for the explanation. I still have a little doubt:
> >
> > Based on your description, given gets do a merge sort, the data on
> > disk is not kept sorted across files, but just sorted within a file.
> >
> > So, basically if on two separate days, say these keys get inserted:
> >
> > Day1: File1:   A B J M
> > Day2: File2:  C D K P
> >
> > Then each file is sorted within itself, but scanning both files will
> > require HBase to use a merge sort to produce a sorted result. Right?
> >
> > Also, File1 and File2 are immutable, and during compactions, File1
> > and File2 are compacted and sorted using merge sort into a bigger
> > File3. Is that correct too?
> >
> > Thanks,
> > Abhishek
> >
> >
> > -----Original Message-----
> > From: lars hofhansl [mailto:lhofhansl@yahoo.com]
> > Sent: Tuesday, August 21, 2012 4:07 PM
> > To: user@hbase.apache.org
> > Subject: Re: HBase Put
> >
> > In a nutshell:
> > - Puts are collected in memory (in a sorted data structure)
> > - When the collected data reaches a certain size it is flushed to a
> > new file (which is sorted)
> > - Gets do a merge sort between the various files that have been
> > created
> > - To contain the number of files, they are periodically compacted
> > into fewer, larger files
> >
> >
> > So the data files (HFiles) are immutable once written; changes are
> > batched in memory first.
> >
> > -- Lars
> >
> >
> >
> > ________________________________
> > From: "Pamecha, Abhishek" <ap...@x.com>
> > To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > Sent: Tuesday, August 21, 2012 4:00 PM
> > Subject: HBase Put
> >
> > Hi
> >
> > I had a question on the HBase Put call. In the scenario where data
> > is inserted without any order to column qualifiers, how does HBase
> > maintain sorted order with respect to column qualifiers in its store
> > files/blocks?
> >
> > I checked the code base and I can see checks<
> > https://github.com/apache/hbase/blob/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV2.java#L319>
> > being made that key-value pairs are inserted in lexicographic order.
> > But I can't seem to find out how the key offset is calculated in the
> > first place.
> >
> > Also, given that HDFS is append-only by nature, how do randomly
> > ordered keys make their way into sorted order? Is it only during
> > minor/major compactions that this sortedness gets applied, and is
> > there a small window during which data is not sorted?
> >
> >
> > Thanks,
> > Abhishek
> >
>

RE: HBase Put

Posted by "Pamecha, Abhishek" <ap...@x.com>.
Can I enable bloom filters per block at the column qualifier level too? That way, with small block sizes, I can selectively load only a few data blocks into memory. Then I can do some trade-off between block size and bloom filter false positive rate.

I am designing for a wide-table scenario with thousands to millions of columns, and thus I don't really want to stress about checks for blocks having more than one row key.
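
For concreteness, this is roughly the family configuration I am imagining (a sketch only; the family name and block size are placeholders, and I am assuming the HColumnDescriptor setters behave as below):

  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.regionserver.StoreFile;

  public class WideFamilySketch {
    public static void main(String[] args) {
      // Smaller blocks mean finer-grained index/bloom checks per Get,
      // at the cost of a larger block index; a ROWCOL bloom would also
      // filter by qualifier, which is what I am asking about above.
      HColumnDescriptor cf = new HColumnDescriptor("wide");
      cf.setBlocksize(8 * 1024); // 8 KB instead of the 64 KB default
      cf.setBloomFilterType(StoreFile.BloomType.ROWCOL);
      System.out.println(cf);
    }
  }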

Thanks,
Abhishek


-----Original Message-----
From: Mohit Anchlia [mailto:mohitanchlia@gmail.com] 
Sent: Wednesday, August 22, 2012 11:09 AM
To: user@hbase.apache.org
Subject: Re: HBase Put

On Wed, Aug 22, 2012 at 10:20 AM, Pamecha, Abhishek <ap...@x.com> wrote:

> So then a GET query means one needs to look in every HFile where the key
> falls within the min/max range of the file.
>
> From another parallel thread, I gather that an HFile comprises blocks,
> which, I think, are the atomic unit of persisted data in HDFS (please
> correct me if not).
>
> And each block of an HFile covers a range of keys. My key can fall
> within a block's range and yet not be present, so all the blocks whose
> ranges match my key will need to be scanned. There is one block index
> per HFile which sorts blocks by key ranges. This index helps reduce the
> number of blocks to scan by selecting only those blocks whose ranges
> satisfy the key.
>
> In this case, if puts arrive in random order, each block may cover a
> similar range, and it may turn out that HBase needs to scan every block
> in the file. This may not be good for performance.
>
> I just want to validate my understanding.
>
>
If you have such a use case, I think the best practice is to use bloom filters.
I think in general it's a good idea to at least enable bloom filters at the row
level.

> Thanks,
> Abhishek
>
>
> -----Original Message-----
> From: lars hofhansl [mailto:lhofhansl@yahoo.com]
>  Sent: Tuesday, August 21, 2012 5:55 PM
> To: user@hbase.apache.org
> Subject: Re: HBase Put
>
> That is correct.
>
>
>
> ________________________________
>  From: "Pamecha, Abhishek" <ap...@x.com>
> To: "user@hbase.apache.org" <us...@hbase.apache.org>; lars hofhansl < 
> lhofhansl@yahoo.com>
> Sent: Tuesday, August 21, 2012 4:45 PM
> Subject: RE: HBase Put
>
> Hi Lars,
>
> Thanks for the explanation. I still have a little doubt:
>
> Based on your description, given gets do a merge sort, the data on 
> disk is not kept sorted across files, but just sorted within a file.
>
> So, basically if on two separate days, say these keys get inserted:
>
> Day1: File1:   A B J M
> Day2: File2:  C D K P
>
> Then each file is sorted within itself, but scanning both files will
> require HBase to use a merge sort to produce a sorted result. Right?
>
> Also, File1 and File2 are immutable, and during compactions, File1 and
> File2 are compacted and sorted using merge sort into a bigger File3.
> Is that correct too?
>
> Thanks,
> Abhishek
>
>
> -----Original Message-----
> From: lars hofhansl [mailto:lhofhansl@yahoo.com]
> Sent: Tuesday, August 21, 2012 4:07 PM
> To: user@hbase.apache.org
> Subject: Re: HBase Put
>
> In a nutshell:
> - Puts are collected in memory (in a sorted data structure)
> - When the collected data reaches a certain size it is flushed to a new
> file (which is sorted)
> - Gets do a merge sort between the various files that have been created
> - To contain the number of files, they are periodically compacted into
> fewer, larger files
>
>
> So the data files (HFiles) are immutable once written; changes are
> batched in memory first.
>
> -- Lars
>
>
>
> ________________________________
> From: "Pamecha, Abhishek" <ap...@x.com>
> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> Sent: Tuesday, August 21, 2012 4:00 PM
> Subject: HBase Put
>
> Hi
>
> I had a question on the HBase Put call. In the scenario where data is
> inserted without any order to column qualifiers, how does HBase maintain
> sorted order with respect to column qualifiers in its store
> files/blocks?
>
> I checked the code base and I can see checks<
> https://github.com/apache/hbase/blob/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV2.java#L319>
> being made that key-value pairs are inserted in lexicographic order.
> But I can't seem to find out how the key offset is calculated in the
> first place.
>
> Also, given that HDFS is append-only by nature, how do randomly ordered
> keys make their way into sorted order? Is it only during minor/major
> compactions that this sortedness gets applied, and is there a small
> window during which data is not sorted?
>
>
> Thanks,
> Abhishek
>

Re: HBase Put

Posted by Mohit Anchlia <mo...@gmail.com>.
On Wed, Aug 22, 2012 at 10:20 AM, Pamecha, Abhishek <ap...@x.com> wrote:

> So then a GET query means one needs to look in every HFile where the key
> falls within the min/max range of the file.
>
> From another parallel thread, I gather that an HFile comprises blocks,
> which, I think, are the atomic unit of persisted data in HDFS (please
> correct me if not).
>
> And each block of an HFile covers a range of keys. My key can fall
> within a block's range and yet not be present, so all the blocks whose
> ranges match my key will need to be scanned. There is one block index
> per HFile which sorts blocks by key ranges. This index helps reduce the
> number of blocks to scan by selecting only those blocks whose ranges
> satisfy the key.
>
> In this case, if puts arrive in random order, each block may cover a
> similar range, and it may turn out that HBase needs to scan every block
> in the file. This may not be good for performance.
>
> I just want to validate my understanding.
>
>
If you have such a use case, I think the best practice is to use bloom filters.
I think in general it's a good idea to at least enable bloom filters at the row
level.

> Thanks,
> Abhishek
>
>
> -----Original Message-----
> From: lars hofhansl [mailto:lhofhansl@yahoo.com]
>  Sent: Tuesday, August 21, 2012 5:55 PM
> To: user@hbase.apache.org
> Subject: Re: HBase Put
>
> That is correct.
>
>
>
> ________________________________
>  From: "Pamecha, Abhishek" <ap...@x.com>
> To: "user@hbase.apache.org" <us...@hbase.apache.org>; lars hofhansl <
> lhofhansl@yahoo.com>
> Sent: Tuesday, August 21, 2012 4:45 PM
> Subject: RE: HBase Put
>
> Hi Lars,
>
> Thanks for the explanation. I still have a little doubt:
>
> Based on your description, given gets do a merge sort, the data on disk is
> not kept sorted across files, but just sorted within a file.
>
> So, basically if on two separate days, say these keys get inserted:
>
> Day1: File1:   A B J M
> Day2: File2:  C D K P
>
> Then each file is sorted within itself, but scanning both files will
> require HBase to use a merge sort to produce a sorted result. Right?
>
> Also, File1 and File2 are immutable, and during compactions, File1 and
> File2 are compacted and sorted using merge sort into a bigger File3. Is
> that correct too?
>
> Thanks,
> Abhishek
>
>
> -----Original Message-----
> From: lars hofhansl [mailto:lhofhansl@yahoo.com]
> Sent: Tuesday, August 21, 2012 4:07 PM
> To: user@hbase.apache.org
> Subject: Re: HBase Put
>
> In a nutshell:
> - Puts are collected in memory (in a sorted data structure)
> - When the collected data reaches a certain size it is flushed to a new
> file (which is sorted)
> - Gets do a merge sort between the various files that have been created
> - To contain the number of files, they are periodically compacted into
> fewer, larger files
>
>
> So the data files (HFiles) are immutable once written; changes are
> batched in memory first.
>
> -- Lars
>
>
>
> ________________________________
> From: "Pamecha, Abhishek" <ap...@x.com>
> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> Sent: Tuesday, August 21, 2012 4:00 PM
> Subject: HBase Put
>
> Hi
>
> I had a question on the HBase Put call. In the scenario where data is
> inserted without any order to column qualifiers, how does HBase maintain
> sorted order with respect to column qualifiers in its store
> files/blocks?
>
> I checked the code base and I can see checks<
> https://github.com/apache/hbase/blob/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV2.java#L319>
> being made that key-value pairs are inserted in lexicographic order.
> But I can't seem to find out how the key offset is calculated in the
> first place.
>
> Also, given that HDFS is append-only by nature, how do randomly ordered
> keys make their way into sorted order? Is it only during minor/major
> compactions that this sortedness gets applied, and is there a small
> window during which data is not sorted?
>
>
> Thanks,
> Abhishek
>

RE: HBase Put

Posted by "Pamecha, Abhishek" <ap...@x.com>.
So then a GET query means one needs to look in every HFile where the key falls within the min/max range of the file.

From another parallel thread, I gather that an HFile comprises blocks, which, I think, are the atomic unit of persisted data in HDFS (please correct me if not).

And each block of an HFile covers a range of keys. My key can fall within a block's range and yet not be present, so all the blocks whose ranges match my key will need to be scanned. There is one block index per HFile which sorts blocks by key ranges. This index helps reduce the number of blocks to scan by selecting only those blocks whose ranges satisfy the key.

In this case, if puts arrive in random order, each block may cover a similar range, and it may turn out that HBase needs to scan every block in the file. This may not be good for performance.

I just want to validate my understanding.
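
To make my mental model concrete, here is a toy sketch in plain Java (made-up names and offsets, not HBase internals) of how I understand the block index narrowing a lookup to at most one candidate block per file:

  import java.util.Map;
  import java.util.NavigableMap;
  import java.util.TreeMap;

  public class BlockIndexSketch {
    public static void main(String[] args) {
      // Toy model of one HFile's block index: first key of each block
      // mapped to that block's offset in the file.
      NavigableMap<String, Long> blockIndex = new TreeMap<String, Long>();
      blockIndex.put("A", 0L);     // block 1 starts at key "A", offset 0
      blockIndex.put("J", 65536L); // block 2 starts at key "J"

      // A Get for "C" searches the index for the greatest start key <= "C":
      // at most one block per file can contain the key.
      Map.Entry<String, Long> candidate = blockIndex.floorEntry("C");
      System.out.println(candidate.getKey()); // "A" -- read only that block
      // The key may still be absent from that block, which is where a
      // bloom filter helps by avoiding the block read entirely.
    }
  }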

Thanks,
Abhishek


-----Original Message-----
From: lars hofhansl [mailto:lhofhansl@yahoo.com] 
Sent: Tuesday, August 21, 2012 5:55 PM
To: user@hbase.apache.org
Subject: Re: HBase Put

That is correct.



________________________________
 From: "Pamecha, Abhishek" <ap...@x.com>
To: "user@hbase.apache.org" <us...@hbase.apache.org>; lars hofhansl <lh...@yahoo.com> 
Sent: Tuesday, August 21, 2012 4:45 PM
Subject: RE: HBase Put
 
Hi Lars,

Thanks for the explanation. I still have a little doubt:

Based on your description, given gets do a merge sort, the data on disk is not kept sorted across files, but just sorted within a file.

So, basically if on two separate days, say these keys get inserted: 

Day1: File1:   A B J M
Day2: File2:  C D K P

Then each file is sorted within itself, but scanning both files will require HBase to use a merge sort to produce a sorted result. Right?

Also, File1 and File2 are immutable, and during compactions, File1 and File2 are compacted and sorted using merge sort into a bigger File3. Is that correct too?

Thanks,
Abhishek


-----Original Message-----
From: lars hofhansl [mailto:lhofhansl@yahoo.com] 
Sent: Tuesday, August 21, 2012 4:07 PM
To: user@hbase.apache.org
Subject: Re: HBase Put

In a nutshell:
- Puts are collected in memory (in a sorted data structure)
- When the collected data reaches a certain size it is flushed to a new file (which is sorted)
- Gets do a merge sort between the various files that have been created
- To contain the number of files, they are periodically compacted into fewer, larger files


So the data files (HFiles) are immutable once written; changes are batched in memory first.

-- Lars



________________________________
From: "Pamecha, Abhishek" <ap...@x.com>
To: "user@hbase.apache.org" <us...@hbase.apache.org>
Sent: Tuesday, August 21, 2012 4:00 PM
Subject: HBase Put

Hi

I had a question on the HBase Put call. In the scenario where data is inserted without any order to column qualifiers, how does HBase maintain sorted order with respect to column qualifiers in its store files/blocks?

I checked the code base and I can see checks<https://github.com/apache/hbase/blob/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV2.java#L319> being made that key-value pairs are inserted in lexicographic order. But I can't seem to find out how the key offset is calculated in the first place.

Also, given that HDFS is append-only by nature, how do randomly ordered keys make their way into sorted order? Is it only during minor/major compactions that this sortedness gets applied, and is there a small window during which data is not sorted?


Thanks,
Abhishek

Re: HBase Put

Posted by lars hofhansl <lh...@yahoo.com>.
That is correct.



________________________________
 From: "Pamecha, Abhishek" <ap...@x.com>
To: "user@hbase.apache.org" <us...@hbase.apache.org>; lars hofhansl <lh...@yahoo.com> 
Sent: Tuesday, August 21, 2012 4:45 PM
Subject: RE: HBase Put
 
Hi Lars,

Thanks for the explanation. I still have a little doubt:

Based on your description, given gets do a merge sort, the data on disk is not kept sorted across files, but just sorted within a file.

So, basically if on two separate days, say these keys get inserted: 

Day1: File1:   A B J M
Day2: File2:  C D K P

Then each file is sorted within itself, but scanning both files will require HBase to use a merge sort to produce a sorted result. Right?

Also, File1 and File2 are immutable, and during compactions, File1 and File2 are compacted and sorted using merge sort into a bigger File3. Is that correct too?

Thanks,
Abhishek


-----Original Message-----
From: lars hofhansl [mailto:lhofhansl@yahoo.com] 
Sent: Tuesday, August 21, 2012 4:07 PM
To: user@hbase.apache.org
Subject: Re: HBase Put

In a nutshell:
- Puts are collected in memory (in a sorted data structure)
- When the collected data reaches a certain size it is flushed to a new file (which is sorted)
- Gets do a merge sort between the various files that have been created
- To contain the number of files, they are periodically compacted into fewer, larger files


So the data files (HFiles) are immutable once written; changes are batched in memory first.

-- Lars



________________________________
From: "Pamecha, Abhishek" <ap...@x.com>
To: "user@hbase.apache.org" <us...@hbase.apache.org>
Sent: Tuesday, August 21, 2012 4:00 PM
Subject: HBase Put

Hi

I had a question on the HBase Put call. In the scenario where data is inserted without any order to column qualifiers, how does HBase maintain sorted order with respect to column qualifiers in its store files/blocks?

I checked the code base and I can see checks<https://github.com/apache/hbase/blob/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV2.java#L319> being made that key-value pairs are inserted in lexicographic order. But I can't seem to find out how the key offset is calculated in the first place.

Also, given that HDFS is append-only by nature, how do randomly ordered keys make their way into sorted order? Is it only during minor/major compactions that this sortedness gets applied, and is there a small window during which data is not sorted?


Thanks,
Abhishek

RE: HBase Put

Posted by "Pamecha, Abhishek" <ap...@x.com>.
Hi Lars,

Thanks for the explanation. I still have a little doubt:

Based on your description, given gets do a merge sort, the data on disk is not kept sorted across files, but just sorted within a file.

So, basically if on two separate days, say these keys get inserted: 

Day1: File1:   A B J M
Day2: File2:  C D K P

Then each file is sorted within itself, but scanning both files will require HBase to use a merge sort to produce a sorted result. Right?

Also, File1 and File2 are immutable, and during compactions, File1 and File2 are compacted and sorted using merge sort into a bigger File3. Is that correct too?

Thanks,
Abhishek


-----Original Message-----
From: lars hofhansl [mailto:lhofhansl@yahoo.com] 
Sent: Tuesday, August 21, 2012 4:07 PM
To: user@hbase.apache.org
Subject: Re: HBase Put

In a nutshell:
- Puts are collected in memory (in a sorted data structure)
- When the collected data reaches a certain size it is flushed to a new file (which is sorted)
- Gets do a merge sort between the various files that have been created
- To contain the number of files, they are periodically compacted into fewer, larger files


So the data files (HFiles) are immutable once written; changes are batched in memory first.

-- Lars



________________________________
 From: "Pamecha, Abhishek" <ap...@x.com>
To: "user@hbase.apache.org" <us...@hbase.apache.org>
Sent: Tuesday, August 21, 2012 4:00 PM
Subject: HBase Put
 
Hi

I had a question on the HBase Put call. In the scenario where data is inserted without any order to column qualifiers, how does HBase maintain sorted order with respect to column qualifiers in its store files/blocks?

I checked the code base and I can see checks<https://github.com/apache/hbase/blob/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV2.java#L319> being made that key-value pairs are inserted in lexicographic order. But I can't seem to find out how the key offset is calculated in the first place.

Also, given that HDFS is append-only by nature, how do randomly ordered keys make their way into sorted order? Is it only during minor/major compactions that this sortedness gets applied, and is there a small window during which data is not sorted?


Thanks,
Abhishek

Re: HBase Put

Posted by lars hofhansl <lh...@yahoo.com>.
In a nutshell:
- Puts are collected in memory (in a sorted data structure)
- When the collected data reaches a certain size it is flushed to a new file (which is sorted)
- Gets do a merge sort between the various files that have been created
- To contain the number of files, they are periodically compacted into fewer, larger files


So the data files (HFiles) are immutable once written; changes are batched in memory first.
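
As a toy illustration (plain Java, not actual HBase code), a read across two flushed files merge-sorts their already-sorted contents on the fly:

  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.List;

  public class MergeSortSketch {
    public static void main(String[] args) {
      // Two immutable "files", each sorted when it was flushed.
      List<String> file1 = Arrays.asList("A", "B", "J", "M"); // day 1 flush
      List<String> file2 = Arrays.asList("C", "D", "K", "P"); // day 2 flush
      List<String> merged = new ArrayList<String>();
      int i = 0, j = 0;
      // Repeatedly take the smaller head key; each file is read in order.
      while (i < file1.size() && j < file2.size()) {
        merged.add(file1.get(i).compareTo(file2.get(j)) <= 0
            ? file1.get(i++) : file2.get(j++));
      }
      while (i < file1.size()) merged.add(file1.get(i++));
      while (j < file2.size()) merged.add(file2.get(j++));
      System.out.println(merged); // [A, B, C, D, J, K, M, P]
    }
  }

A compaction does essentially the same merge, writing the result out as one new, larger sorted file.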

-- Lars



________________________________
 From: "Pamecha, Abhishek" <ap...@x.com>
To: "user@hbase.apache.org" <us...@hbase.apache.org> 
Sent: Tuesday, August 21, 2012 4:00 PM
Subject: HBase Put
 
Hi

I had a question on the HBase Put call. In the scenario where data is inserted without any order to column qualifiers, how does HBase maintain sorted order with respect to column qualifiers in its store files/blocks?

I checked the code base and I can see checks<https://github.com/apache/hbase/blob/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV2.java#L319> being made that key-value pairs are inserted in lexicographic order. But I can't seem to find out how the key offset is calculated in the first place.

Also, given that HDFS is append-only by nature, how do randomly ordered keys make their way into sorted order? Is it only during minor/major compactions that this sortedness gets applied, and is there a small window during which data is not sorted?


Thanks,
Abhishek