Posted to user@hbase.apache.org by Taeyun Kim <ta...@gmail.com> on 2015/01/15 06:09:23 UTC

Get addColumn + ColumnRangeFilter

Hi,



I have a situation in which both Get.addColumn() and Get.setFilter(new
ColumnRangeFilter(…)) need to be applied to the same Get.

The source code snippet is as follows:



        Get g = new Get(getRowKey(lfileId));
        g.addColumn(Schema.ColumnFamilyNameBytes, MetaColumnNameBytes);
        g.setFilter(new ColumnRangeFilter(Bytes.toBytes(name), false,
            Bytes.toBytes(name + "~"), false));
        Result r = table.get(g);

        if (r.isEmpty())
            throw new FileNotFoundException(
                String.format("%d:%d:%s", projectId, lfileId, name));



When g.addColumn() is commented out, the Result is not empty, while with
g.addColumn() the Result is empty (FileNotFoundException is thrown).

Is it illegal to use both methods?

BTW, the version of HBase used is 0.98 (Hortonworks HDP 2.1).



Thanks.

Re: Get addColumn + ColumnRangeFilter

Posted by Ted Yu <yu...@gmail.com>.

I received your tests. 

I will debug them and get back to you. 

Cheers



> On Jan 15, 2015, at 6:22 PM, Taeyun Kim <ta...@innowireless.com> wrote:
> 
> (Sorry if this mail is a duplicate)
> 
> Hi Ted,
> 
> I've attached 2 unit test classes.
> 
> Both have one failed test.
> 
> - HBaseAddColumnWithColumnRangeFilterTest1.testAddColumnWithColumnRangeFilter(): Expected: 10, Actual 1
> - HBaseAddColumnWithColumnRangeFilterTest2.testAddColumnWithColumnRangeFilter(): Result is empty
> 
> If the tests have problems, please let me know.
> 
> 
> -----Original Message-----
> From: Ted Yu [mailto:yuzhihong@gmail.com] 
> Sent: Thursday, January 15, 2015 6:59 PM
> To: user@hbase.apache.org
> Subject: Re: Get addColumn + ColumnRangeFilter
> 
> Can you write a unit test which shows this behavior?
> 
> Thanks
> 
> 
> 
> <HBaseAddColumnWithColumnRangeFilterTests.zip>

RE: Get addColumn + ColumnRangeFilter

Posted by Taeyun Kim <ta...@innowireless.com>.

Thank you!

In fact, I had forgotten that FilterList exists.
It works well!

-----Original Message-----
From: Ted Yu [mailto:yuzhihong@gmail.com] 
Sent: Monday, January 19, 2015 9:46 AM
To: user@hbase.apache.org
Subject: Re: Get addColumn + ColumnRangeFilter

If the number of splits is greater than 1, you can use FilterList with two ColumnRangeFilters when needed.

Cheers

On Sun, Jan 18, 2015 at 4:37 PM, Taeyun Kim <ta...@innowireless.com>
wrote:

> Thanks.
>
> But in my case it is unlikely that the FirstColumnName would be 
> included in the range. (If it is included, it would cause a problem.)
>
> Instead, since the number of splits is mostly 1, I will include the
> name of the first split in the first Get with addColumn(). With that,
> most queries can be satisfied with a single Get.
>
> Thanks again.
>
> -----Original Message-----
> From: Ted Yu [mailto:yuzhihong@gmail.com]
> Sent: Saturday, January 17, 2015 6:31 AM
> To: user@hbase.apache.org
> Subject: Re: Get addColumn + ColumnRangeFilter
>
> To clarify what I meant, the test passes with the following change:
>
>       Get g = new Get(RowKey);
>       byte[] minColumn = new byte[]{(byte)0};
>       int cmpMin = Bytes.compareTo(FirstColumnNameBytes, 0, FirstColumnNameBytes.length,
>         minColumn, 0, minColumn.length);
>       byte[] maxColumn = Bytes.toBytes("~");
>       int cmpMax = Bytes.compareTo(FirstColumnNameBytes, 0, FirstColumnNameBytes.length,
>         maxColumn, 0, maxColumn.length);
>       if (cmpMin <= 0 || cmpMax >= 0) {
>         g.addColumn(ColumnFamilyNameBytes, FirstColumnNameBytes);  // should be redundant...
>       }
>       g.setFilter(new ColumnRangeFilter(minColumn, false,
>         maxColumn, false));  // ...since this includes the first column
>
> FYI
>
> On Fri, Jan 16, 2015 at 7:23 AM, Ted Yu <yu...@gmail.com> wrote:
>
> > Thanks for the background information.
> >
> > For your last question, the columns given by addColumn() calls
> > (which ColumnTracker uses) are checked first.
> > So yes.
> >
> > Relaxing this limitation may take some effort. ScanQueryMatcher could
> > take the Filter the user passes into account, but the filter may not be
> > a ColumnRangeFilter. It can be a FilterList involving ColumnRangeFilter.
> > Adding such logic to ScanQueryMatcher#match() would make the code less
> > maintainable.
> >
> > Can you check whether the column in addColumn() is covered by the
> > ColumnRangeFilter and, if so, skip the addColumn() call?
> >
> > Cheers
> >
> > On Thu, Jan 15, 2015 at 11:35 PM, Taeyun Kim 
> > <ta...@innowireless.com>
> > wrote:
> >
> >> It's a somewhat long story.
> >> Maybe I am using HBase in a weird way.
> >>
> >> My use case is as follows:
> >>
> >> I didn't want to put many small files into HDFS, since that is bad
> >> for HDFS in both scalability and performance.
> >>
> >> The small files are grouped by test log, since the files are the
> >> many facets of the analysis result of one test log. So they could
> >> be the members of one SequenceFile.
> >> But I found SequenceFile (and similar formats) unattractive: I
> >> would still end up with many not-so-big (about 20MB, except for rare
> >> cases) SequenceFiles, since the analysis result files are not so
> >> big and the test log files are generated continually.
> >> So some manual file management and merging would be a must.
> >>
> >> So I decided to use an HBase record as a kind of 'directory' to
> >> avoid the manual file management. (directory = file group) This way,
> >> the 'files' are automatically 'merged' into appropriately sized
> >> HFiles, and as a bonus the 'files' can be deleted automatically
> >> when their lifetime is over.
> >>
> >> The 'directory' has the following files:
> >>
> >> - 'm': meta file. (to check the version of the 'directory' format)
> >> - 'Result.csv.0'
> >> - 'Result.csv.1'
> >> - ...
> >> - 'Result.csv.p': parts file. (has the split count and the size of
> >> each split; 'p' is for 'parts')
> >> - 'AnotherResultA.csv.0'
> >> - 'AnotherResultA.csv.1'
> >> - ...
> >> - 'AnotherResultA.csv.p'
> >> - 'TestEnvironment.txt'
> >>
> >> Each 'file' is saved as a column.
> >>
> >> Result files are split for the following reasons:
> >> - To handle the extreme case in which a file is too big to be
> >> processed by one task.
> >> - To save task process memory: the split size is actually smaller
> >> than 64MB (the size for one task), and each split is individually
> >> compressed. This way, a task process holds at most one column
> >> uncompressed. A task is assigned multiple 'splits'.
> >>
> >> For this, I've written an InputFormat class.
> >>
> >> Now, the InputFormat class can first Get both 'm' and a parts file
> >> to get the input split information. This is not a problem: a single
> >> Get with 2 addColumn() calls is sufficient.
> >> But when the whole content of a file must be read (like
> >> Files.readAllBytes()), I must Get 'm' and an unknown number of
> >> splits that fall in a name range (Result.csv.0 ~ Result.csv.7) to
> >> read the whole content with a single Get. (addColumn() +
> >> ColumnRangeFilter) But with HBase as it currently is, it seems
> >> that I have to invoke 2 Gets, or disable the version check. (Maybe
> >> not a big deal?)
> >>
> >> That's all.
> >>
> >> If you think that this record design is not efficient, or that there
> >> is a better solution, please let me know.
> >>
> >> BTW, as things currently stand, when both addColumn() and
> >> ColumnRangeFilter are applied, they are effectively combined by an
> >> 'AND' operator. Right?
> >>
> >> -----Original Message-----
> >> From: Ted Yu [mailto:yuzhihong@gmail.com]
> >> Sent: Friday, January 16, 2015 3:39 PM
> >> To: user@hbase.apache.org
> >> Subject: Re: Get addColumn + ColumnRangeFilter
> >>
> >> I reproduced the failed test (testAddColumnWithColumnRangeFilter)
> >> after modifying your test case to fit master branch.
> >>
> >> The reason for one Cell being returned is that
> >> ExplicitColumnTracker is used by ScanQueryMatcher to first check if
> >> the column is part of the requested columns (f:fc in your case).
> >> The other columns don't pass this check, hence they're not included
> >> in the result.
> >>
> >> Before this part of the code is changed, can I ask why you need to
> >> call g.addColumn() when g has a ColumnRangeFilter associated with it?
> >>
> >> Cheers
>
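
Note added for illustration (not from the thread): the behavior Ted
describes in the quoted messages, where the explicit column check runs
first and the filter is applied afterwards, amounts to an AND of the two
conditions. The following is a toy model in plain Java, not HBase code.
The class ColumnMatchModel, its methods, and the sample qualifiers are
hypothetical, and it assumes Java 9+ for Arrays.compareUnsigned.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

/** Toy model (not HBase code): a column is returned only if it passes BOTH checks. */
final class ColumnMatchModel {

    /** Exclusive range test on raw bytes, compared as unsigned values. */
    static boolean inOpenRange(String qualifier, String min, String max) {
        byte[] q = qualifier.getBytes(StandardCharsets.UTF_8);
        return Arrays.compareUnsigned(q, min.getBytes(StandardCharsets.UTF_8)) > 0
            && Arrays.compareUnsigned(q, max.getBytes(StandardCharsets.UTF_8)) < 0;
    }

    /**
     * Returns the qualifiers that survive both steps.
     * An empty 'explicit' set stands for "addColumn() was never called".
     */
    static List<String> get(List<String> stored, Set<String> explicit,
                            String min, String max) {
        List<String> out = new ArrayList<>();
        for (String q : stored) {
            // Step 1: explicit column check (models the column tracker).
            boolean requested = explicit.isEmpty() || explicit.contains(q);
            // Step 2: the range filter only sees columns that passed step 1.
            if (requested && inOpenRange(q, min, max)) {
                out.add(q);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> row = List.of("m", "Result.csv.0", "Result.csv.1", "Result.csv.p");
        // Filter only: prints [Result.csv.0, Result.csv.1, Result.csv.p]
        System.out.println(get(row, Set.of(), "Result.csv", "Result.csv~"));
        // addColumn("m") plus the same filter: prints []
        System.out.println(get(row, Set.of("m"), "Result.csv", "Result.csv~"));
    }
}
```

In this model, asking for 'm' explicitly while filtering on the
Result.csv range yields nothing, because 'm' lies outside the range and
every other column fails the explicit check. That mirrors the empty
Result reported in the original post.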

Re: Get addColumn + ColumnRangeFilter

Posted by Ted Yu <yu...@gmail.com>.

If the number of splits is greater than 1, you can use FilterList with
two ColumnRangeFilters when needed.

Cheers
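
Note added for illustration (not from the thread): the suggestion reads
as replacing addColumn() with a second column range, so that a column
is returned when it falls in either range. In the HBase client API this
would presumably be a FilterList built with
FilterList.Operator.MUST_PASS_ONE holding two ColumnRangeFilter
instances; that mapping is an assumption, since the thread does not show
the code that was actually used. The sketch below only models the
selection logic in plain Java (Java 9+). RangeUnionModel, Range, and the
sample qualifiers are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Toy model (not HBase code): keep a column if ANY of the ranges accepts it. */
final class RangeUnionModel {

    /** A byte range with per-bound inclusiveness. */
    static final class Range {
        final byte[] min;
        final boolean minInclusive;
        final byte[] max;
        final boolean maxInclusive;

        Range(String min, boolean minInclusive, String max, boolean maxInclusive) {
            this.min = min.getBytes(StandardCharsets.UTF_8);
            this.minInclusive = minInclusive;
            this.max = max.getBytes(StandardCharsets.UTF_8);
            this.maxInclusive = maxInclusive;
        }

        boolean contains(String qualifier) {
            byte[] q = qualifier.getBytes(StandardCharsets.UTF_8);
            int lo = Arrays.compareUnsigned(q, min);
            int hi = Arrays.compareUnsigned(q, max);
            return (lo > 0 || (minInclusive && lo == 0))
                && (hi < 0 || (maxInclusive && hi == 0));
        }
    }

    /** OR semantics: a qualifier is selected if at least one range contains it. */
    static List<String> select(List<String> stored, List<Range> anyOf) {
        return stored.stream()
            .filter(q -> anyOf.stream().anyMatch(r -> r.contains(q)))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> row = List.of("AnotherResultA.csv.0", "Result.csv.0",
            "Result.csv.1", "Result.csv.p", "TestEnvironment.txt", "m");
        List<Range> anyOf = List.of(
            new Range("m", true, "m", true),                      // exactly 'm'
            new Range("Result.csv", false, "Result.csv~", false)  // the splits
        );
        // prints [Result.csv.0, Result.csv.1, Result.csv.p, m]
        System.out.println(select(row, anyOf));
    }
}
```

The first range is closed on both ends and collapses to the single
column 'm', which is how an explicit column can be expressed as a range
instead of through addColumn().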


RE: Get addColumn + ColumnRangeFilter

Posted by Taeyun Kim <ta...@innowireless.com>.

Thanks.

But in my case it is unlikely that the FirstColumnName would be included in the range. (If it is included, it would cause a problem.)

Instead, since the number of splits is mostly 1, I will include the name of the first split in the first Get with addColumn(). With that, most queries can be satisfied with a single Get.

Thanks again.


Re: Get addColumn + ColumnRangeFilter

Posted by Ted Yu <yu...@gmail.com>.

To clarify what I meant, the test passes with the following change:

      Get g = new Get(RowKey);
      byte[] minColumn = new byte[]{(byte)0};
      int cmpMin = Bytes.compareTo(FirstColumnNameBytes, 0, FirstColumnNameBytes.length,
        minColumn, 0, minColumn.length);
      byte[] maxColumn = Bytes.toBytes("~");
      int cmpMax = Bytes.compareTo(FirstColumnNameBytes, 0, FirstColumnNameBytes.length,
        maxColumn, 0, maxColumn.length);
      if (cmpMin <= 0 || cmpMax >= 0) {
        g.addColumn(ColumnFamilyNameBytes, FirstColumnNameBytes);  // should be redundant...
      }
      g.setFilter(new ColumnRangeFilter(minColumn, false,
        maxColumn, false));  // ...since this includes the first column

FYI

>> > >
>> > >        g.addColumn(Schema.ColumnFamilyNameBytes,
>> > > MetaColumnNameBytes);
>> > >
>> > >        g.setFilter(new ColumnRangeFilter(Bytes.toBytes(name), false,
>> > >
>> > >            Bytes.toBytes(name + "~"), false));
>> > >
>> > >        Result r = table.get(g);
>> > >
>> > >
>> > >
>> > >        if (r.isEmpty())
>> > >
>> > >            throw new FileNotFoundException(
>> > >
>> > >                String.format("%d:%d:%s", projectId, lfileId, name));
>> > >
>> > >
>> > >
>> > > When g.addColumn() is commented out, the Result is not empty, while
>> > > with g.addColumn the Result is empty(FileNotFoundException is thrown).
>> > >
>> > > Is it illegal to use both methods?
>> > >
>> > >
>> > >
>> > > BTW, ther version of HBase used is 0.98. (Hortonworks HDP 2.1)
>> > >
>> > >
>> > >
>> > > Thanks.
>> >
>>
>>
>

Re: Get addColumn + ColumnRangeFilter

Posted by Ted Yu <yu...@gmail.com>.

Thanks for the background information.

To answer your last question: the columns given by addColumn() calls (which
the ColumnTracker uses) are checked first, so yes, the two are effectively
combined with AND.

Relaxing this limitation may take some effort. ScanQueryMatcher could take the
Filter the user passes into account, but that filter is not necessarily a
ColumnRangeFilter; it can be a FilterList involving a ColumnRangeFilter.
Adding such logic to ScanQueryMatcher#match() would make the code less
maintainable.

Can you check whether the column in addColumn() is covered by the
ColumnRangeFilter and, if so, skip the addColumn() call?
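
The check suggested above can be sketched in plain Java. This is only an illustration under stated assumptions: ColumnRangeCheck, compareUnsigned, covers and utf8 are hypothetical names, not HBase classes or methods, and the comparison assumes qualifiers are ordered lexicographically as unsigned bytes (the ordering Bytes.compareTo() implements). Open-ended (null) bounds are not handled.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative helper (not part of HBase): is a qualifier inside a column range? */
final class ColumnRangeCheck {

    // Lexicographic comparison that treats each byte as unsigned.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        // Equal up to the shorter length: the shorter array sorts first.
        return a.length - b.length;
    }

    // True if qualifier lies between min and max, honoring the inclusive flags.
    static boolean covers(byte[] min, boolean minInclusive,
                          byte[] max, boolean maxInclusive,
                          byte[] qualifier) {
        int lo = compareUnsigned(qualifier, min);
        int hi = compareUnsigned(qualifier, max);
        boolean aboveMin = minInclusive ? lo >= 0 : lo > 0;
        boolean belowMax = maxInclusive ? hi <= 0 : hi < 0;
        return aboveMin && belowMax;
    }

    // Convenience conversion for the ASCII column names used in this thread.
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}
```

With the range from the first test ({0} to "~", both exclusive) the column "fc" is covered, so the addColumn() call could simply be dropped. With the range from the second test ("oc" to "oc~") it is not covered, so that case still needs a different approach.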

Cheers


RE: Get addColumn + ColumnRangeFilter

Posted by Taeyun Kim <ta...@innowireless.com>.

It's a somewhat long story.
Maybe I'm using HBase in a somewhat unusual way.

My use case is as follows:

I didn't want to put many small files into HDFS, since that is bad for HDFS in terms of both scalability and performance.

The small files are grouped by test log, since they are the various facets of the analysis result of one test log. So they could be members of one SequenceFile.
But SequenceFile (and similar formats) did not look attractive: I would still end up with many not-so-big SequenceFiles (about 20MB each, except for rare cases), since the analysis result files are not very big and test log files are generated continually.
So some manual file management and merging would probably be necessary.

So I decided to use an HBase record as a kind of 'directory' (directory = file group) to avoid the manual file management.
This way, the 'files' are automatically 'merged' into appropriately sized HFiles, and as a bonus, the 'files' can be deleted automatically when their lifetime is over.

The 'directory' has the following files.

- 'm': meta file. (to check the version of the 'directory' format)
- 'Result.csv.0'
- 'Result.csv.1'
- ...
- 'Result.csv.p': parts file. (holds the split count and the size of each split; 'p' is for 'parts')
- 'AnotherResultA.csv.0'
- 'AnotherResultA.csv.1'
- ...
- 'AnotherResultA.csv.p'
- 'TestEnvironment.txt'

Each 'file' is saved as a column.

Result files are split for the following reasons:
- To handle the extreme case where a file is too big to be processed by one task.
- To save task process memory: the split size is actually smaller than 64MB (the size for one task), and each split is individually compressed. This way, a task process has at most one column uncompressed at a time. A task is assigned multiple 'splits'.

For this, I've written an InputFormat class.

Now, the InputFormat class can first Get both 'm' and a parts file to obtain the InputSplit information. This is not a problem: a single Get with two addColumn() calls is sufficient.
But when the whole content of a file must be read (like Files.readAllBytes()), I must Get 'm' plus an unknown number of splits within a name range (Result.csv.0 ~ Result.csv.7), ideally with a single Get (addColumn() + ColumnRangeFilter).
With HBase as it currently stands, it seems that I have to issue two Gets or disable the version check. (Maybe not a big deal?)

That's all.

If you think this record design is inefficient, or there is a better solution, please let me know.

BTW, as things stand, when both addColumn() and ColumnRangeFilter are applied, they are effectively combined with an 'AND' operator. Right?
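
The read described above (the meta column plus every column in a name range) can be modeled without HBase. The following is a sketch under stated assumptions, not real client code: DirectoryRowSketch, sampleRow and readWholeFile are hypothetical names, the row is an in-memory TreeMap, the cell contents are made up, and column names are compared as Java strings, which matches byte order only for ASCII names like these. It performs the two lookups separately and merges them, which is what issuing two Gets would amount to.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** In-memory stand-in for one 'directory' row (column name -> content). */
final class DirectoryRowSketch {

    // A small sample row following the layout listed in this mail.
    static NavigableMap<String, String> sampleRow() {
        NavigableMap<String, String> row = new TreeMap<>();
        row.put("m", "format-version");
        row.put("Result.csv.0", "part0");
        row.put("Result.csv.1", "part1");
        row.put("Result.csv.p", "parts-info");
        row.put("AnotherResultA.csv.0", "other0");
        row.put("AnotherResultA.csv.p", "other-parts");
        row.put("TestEnvironment.txt", "env");
        return row;
    }

    // Returns the column names needed to read one whole 'file'.
    static List<String> readWholeFile(NavigableMap<String, String> row, String name) {
        List<String> columns = new ArrayList<>();
        // Lookup 1: the explicitly named meta column.
        if (row.containsKey("m")) {
            columns.add("m");
        }
        // Lookup 2: every column strictly between name and name + "~".
        columns.addAll(row.subMap(name, false, name + "~", false).keySet());
        return columns;
    }
}
```

For the name "Result.csv" this selects 'm', 'Result.csv.0', 'Result.csv.1' and 'Result.csv.p'. Note that the parts file falls inside the range as well.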


RE: Get addColumn + ColumnRangeFilter

Posted by Taeyun Kim <ta...@innowireless.com>.

Some more details.

The files cannot be physically merged (that is, each file must retain its identity), since there is a requirement that individual file groups can be deleted.
And since the files are post-processed individually, there is no need to scan through all the file groups, so HBase's 'slow' scan speed relative to HDFS sequential reads is not a concern.


Re: Get addColumn + ColumnRangeFilter

Posted by Ted Yu <yu...@gmail.com>.

I reproduced the failed test (testAddColumnWithColumnRangeFilter) after
modifying your test case to fit the master branch.

The reason for one Cell being returned is that ExplicitColumnTracker is
used by ScanQueryMatcher to first check if the column is part of the
requested columns (f:fc in your case). The other columns don't pass this
check, hence they're not included in the result.
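
The effect of that ordering can be modeled in a few lines of plain Java. This is a simplified model under stated assumptions, not the actual ScanQueryMatcher or ExplicitColumnTracker code: AndSemanticsSketch and its methods are hypothetical names, both range bounds are exclusive as in the tests, and column names are compared as Java strings, which matches byte order only for the ASCII names used here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Toy model: explicit columns are checked first, then the range filter. */
final class AndSemanticsSketch {

    static final String FIRST_COLUMN = "fc";

    // Columns stored by Test1: "fc" plus ten single-byte names 0..9.
    static List<String> test1Columns() {
        List<String> columns = new ArrayList<>();
        columns.add(FIRST_COLUMN);
        for (int i = 0; i < 10; i++) {
            columns.add(String.valueOf((char) i));
        }
        return columns;
    }

    // Columns stored by Test2: "fc" plus "oc.0" .. "oc.9".
    static List<String> test2Columns() {
        List<String> columns = new ArrayList<>();
        columns.add(FIRST_COLUMN);
        for (int i = 0; i < 10; i++) {
            columns.add("oc." + i);
        }
        return columns;
    }

    // An empty explicit set means "no addColumn() calls", i.e. every column is requested.
    static List<String> select(List<String> stored, Set<String> explicit,
                               String min, String max) {
        List<String> selected = new ArrayList<>();
        for (String column : stored) {
            boolean requested = explicit.isEmpty() || explicit.contains(column);
            boolean inRange = column.compareTo(min) > 0 && column.compareTo(max) < 0;
            // A column must pass both checks, which is the AND behavior.
            if (requested && inRange) {
                selected.add(column);
            }
        }
        return selected;
    }
}
```

In this model, requesting "fc" together with the Test1 range leaves only "fc" (one cell instead of the expected 10), and requesting "fc" together with the Test2 range leaves nothing, matching the two reported failures.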

Before this part of the code is changed, can I ask why you need to call
g.addColumn() when g already has a ColumnRangeFilter associated with it?

Cheers


RE: Get addColumn + ColumnRangeFilter

Posted by Taeyun Kim <ta...@innowireless.com>.

(Sorry if this mail is a duplicate)

Hi Ted,

I've attached 2 unit test classes.

Both have one failed test.

- HBaseAddColumnWithColumnRangeFilterTest1.testAddColumnWithColumnRangeFilter(): Expected: 10, Actual 1
- HBaseAddColumnWithColumnRangeFilterTest2.testAddColumnWithColumnRangeFilter(): Result is empty

If the tests have problems, please let me know.



RE: Get addColumn + ColumnRangeFilter

Posted by Taeyun Kim <ta...@innowireless.com>.

(This is my third attempt to send this mail. Sorry if it is a duplicate.)

Hi Ted,

I've made 2 JUnit test classes.

Both have one failed test.

- HBaseAddColumnWithColumnRangeFilterTest1.testAddColumnWithColumnRangeFilter(): Expected: 10, Actual 1
- HBaseAddColumnWithColumnRangeFilterTest2.testAddColumnWithColumnRangeFilter(): Result is empty

Since it seems that mail with attachments is being rejected by the mailing list server, and I'm somewhat in a hurry, I'm pasting the unit test code here.

---------------------------------------------
HBaseAddColumnWithColumnRangeFilterTest1.java
---------------------------------------------
package com.test.hbase;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.ColumnRangeFilter;
import org.apache.hadoop.hbase.util.Bytes;
import org.junit.After;
import org.junit.Assert;
import org.junit.Before;
import org.junit.Test;

import java.io.IOException;

public class HBaseAddColumnWithColumnRangeFilterTest1
{
    @Before
    public void setUp() throws Exception
    {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor tableDesc = new HTableDescriptor(TableName.valueOf(TestTableName));
        HColumnDescriptor colDesc = new HColumnDescriptor(ColumnFamilyName);
        tableDesc.addFamily(colDesc);
        admin.createTable(tableDesc);

        try (HTable table = new HTable(conf, TestTableName))
        {
            byte[] content = Bytes.toBytes("content");
            Put p = new Put(RowKey);
            p.add(ColumnFamilyNameBytes, FirstColumnNameBytes, content);
            for (byte i = 0; i < 10; i++)
            {
                byte[] columnNameBytes = new byte[]{i};
                p.add(ColumnFamilyNameBytes, columnNameBytes, content);
            }
            table.put(p);
        }
    }

    @After
    public void tearDown() throws Exception
    {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        admin.disableTable(TestTableName);
        admin.deleteTable(TestTableName);
    }

    @Test
    public void testAddColumn() throws IOException
    {
        try (HTable table = new HTable(HBaseConfiguration.create(), TestTableName))
        {
            Get g = new Get(RowKey);
            g.addColumn(ColumnFamilyNameBytes, FirstColumnNameBytes);
            Result r = table.get(g);

            Assert.assertFalse("Result should not be empty", r.isEmpty());
            Assert.assertEquals("Result cell count should match", 1, r.rawCells().length);
        }
    }

    @Test
    public void testColumnRangeFilter() throws IOException
    {
        try (HTable table = new HTable(HBaseConfiguration.create(), TestTableName))
        {
            Get g = new Get(RowKey);
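            // Both bounds are exclusive, so column {0} is skipped:
            // columns {1}..{9} plus "fc" give the 10 expected cells.
            g.setFilter(new ColumnRangeFilter(new byte[]{(byte)0}, false,
                Bytes.toBytes("~"), false));  // includes the first column ("fc")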
            Result r = table.get(g);

            Assert.assertFalse("Result should not be empty", r.isEmpty());
            Assert.assertEquals("Result cell count should match", 10, r.rawCells().length);
        }
    }

    @Test
    public void testAddColumnWithColumnRangeFilter() throws IOException
    {
        try (HTable table = new HTable(HBaseConfiguration.create(), TestTableName))
        {
            Get g = new Get(RowKey);
            g.addColumn(ColumnFamilyNameBytes, FirstColumnNameBytes);  // should be redundant...
            g.setFilter(new ColumnRangeFilter(new byte[]{(byte)0}, false,
                Bytes.toBytes("~"), false));  // ...since this includes the first column
            Result r = table.get(g);

            Assert.assertFalse("Result should not be empty", r.isEmpty());
            Assert.assertEquals("Result cell count should match", 10, r.rawCells().length);
        }
    }

    static final String TestTableName = "AddColumnWithColumnRangeFilterTest";
    static final String ColumnFamilyName = "f";
    static final byte[] ColumnFamilyNameBytes = Bytes.toBytes(ColumnFamilyName);
    static final byte[] RowKey = Bytes.toBytes("1234");
    static final byte[] FirstColumnNameBytes = Bytes.toBytes("fc");
}
---------------------------------------------


---------------------------------------------
HBaseAddColumnWithColumnRangeFilterTest2.java
---------------------------------------------
package com.test.hbase;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.ColumnRangeFilter;
import org.apache.hadoop.hbase.util.Bytes;
import org.junit.After;
import org.junit.Assert;
import org.junit.Before;
import org.junit.Test;

import java.io.IOException;

public class HBaseAddColumnWithColumnRangeFilterTest2
{
    @Before
    public void setUp() throws Exception
    {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor tableDesc = new HTableDescriptor(TableName.valueOf(TestTableName));
        HColumnDescriptor colDesc = new HColumnDescriptor(ColumnFamilyName);
        tableDesc.addFamily(colDesc);
        admin.createTable(tableDesc);

        try (HTable table = new HTable(conf, TestTableName))
        {
            byte[] content = Bytes.toBytes("content");
            Put p = new Put(RowKey);
            p.add(ColumnFamilyNameBytes, FirstColumnNameBytes, content);
            for (int i = 0; i < 10; i++)
            {
                byte[] columnNameBytes = Bytes.toBytes(OtherColumnNamePrefix +
                    String.format(".%d", i));
                p.add(ColumnFamilyNameBytes, columnNameBytes, content);
            }
            table.put(p);
        }
    }

    @After
    public void tearDown() throws Exception
    {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        admin.disableTable(TestTableName);
        admin.deleteTable(TestTableName);
    }

    @Test
    public void testAddColumn() throws IOException
    {
        try (HTable table = new HTable(HBaseConfiguration.create(), TestTableName))
        {
            Get g = new Get(RowKey);
            g.addColumn(ColumnFamilyNameBytes, FirstColumnNameBytes);
            Result r = table.get(g);

            Assert.assertFalse("Result should not be empty", r.isEmpty());
            Assert.assertEquals("Result cell count should match", 1, r.rawCells().length);
        }
    }

    @Test
    public void testColumnRangeFilter() throws IOException
    {
        try (HTable table = new HTable(HBaseConfiguration.create(), TestTableName))
        {
            Get g = new Get(RowKey);
            // should include only the OtherColumns
            g.setFilter(new ColumnRangeFilter(Bytes.toBytes(OtherColumnNamePrefix), false,
                Bytes.toBytes(OtherColumnNamePrefix + "~"), false));
            Result r = table.get(g);

            Assert.assertFalse("Result should not be empty", r.isEmpty());
            Assert.assertEquals("Result cell count should match", 10, r.rawCells().length);
        }
    }

    @Test
    public void testAddColumnWithColumnRangeFilter() throws IOException
    {
        try (HTable table = new HTable(HBaseConfiguration.create(), TestTableName))
        {
            Get g = new Get(RowKey);
            g.addColumn(ColumnFamilyNameBytes, FirstColumnNameBytes);
            g.setFilter(new ColumnRangeFilter(Bytes.toBytes(OtherColumnNamePrefix), false,
                Bytes.toBytes(OtherColumnNamePrefix + "~"), false));
            Result r = table.get(g);

            Assert.assertFalse("Result should not be empty", r.isEmpty());
            Assert.assertEquals("Result cell count should match", 11, r.rawCells().length);
        }
    }

    static final String TestTableName = "AddColumnWithColumnRangeFilterTest";
    static final String ColumnFamilyName = "f";
    static final byte[] ColumnFamilyNameBytes = Bytes.toBytes(ColumnFamilyName);
    static final byte[] RowKey = Bytes.toBytes("1234");
    static final byte[] FirstColumnNameBytes = Bytes.toBytes("fc");
    static final String OtherColumnNamePrefix = "oc";
}
----------------------------------------
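
One possible reading of the combined test, sketched under an assumption that is not confirmed anywhere in this thread: the filter is evaluated only on the cells that already match the addColumn() selection, so the two conditions act as an intersection rather than a union. The toy model below uses plain Java collections only (no HBase classes). The class name GetSemanticsSketch and its methods are hypothetical and exist purely for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy model only: plain Java collections, no HBase classes involved.
class GetSemanticsSketch
{
    // One row with the same qualifiers as the test: "fc" plus "oc.0" .. "oc.9".
    static SortedMap<String, String> sampleRow()
    {
        SortedMap<String, String> row = new TreeMap<>();
        row.put("fc", "first");
        for (int i = 0; i < 10; i++)
            row.put("oc." + i, "other" + i);
        return row;
    }

    // selected == null models "no addColumn() call".
    // min == null models "no ColumnRangeFilter".
    // Both range bounds are exclusive, as in the test (both flags are false).
    static List<String> get(SortedMap<String, String> row, Set<String> selected,
        String min, String max)
    {
        List<String> out = new ArrayList<>();
        for (String q : row.keySet())
        {
            boolean picked = selected == null || selected.contains(q);
            boolean inRange = min == null
                || (q.compareTo(min) > 0 && q.compareTo(max) < 0);
            // ASSUMPTION: both conditions must hold (intersection, not union).
            if (picked && inRange)
                out.add(q);
        }
        return out;
    }

    public static void main(String[] args)
    {
        SortedMap<String, String> row = sampleRow();
        Set<String> fc = Collections.singleton("fc");
        System.out.println(get(row, fc, null, null).size());    // addColumn only
        System.out.println(get(row, null, "oc", "oc~").size()); // filter only
        System.out.println(get(row, fc, "oc", "oc~").size());   // both combined
    }
}
```

In this model the three calls yield 1, 10 and 0 cells, which would match the empty Result of the combined test instead of the expected 11. If that reading is right, getting all 11 cells would need either two separate reads or no addColumn() at all, for example a FilterList with MUST_PASS_ONE that combines a QualifierFilter for "fc" with the ColumnRangeFilter. That alternative is untested here.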

If the tests have problems, please let me know.

Thanks.


Re: Get addColumn + ColumnRangeFilter

Posted by Ted Yu <yu...@gmail.com>.

Can you write a unit test which shows this behavior?

Thanks




Re: Get addColumn + ColumnRangeFilter

Posted by Taeyun Kim <ta...@gmail.com>.

Hi Ted,

I've attached 2 unit test classes.

Both have one failed test.

- HBaseAddColumnWithColumnRangeFilterTest1.testAddColumnWithColumnRangeFilter(): Expected: 10, Actual 1
- HBaseAddColumnWithColumnRangeFilterTest2.testAddColumnWithColumnRangeFilter(): Result is empty

If the tests have problems, please let me know.

(BTW, sorry for replying to my own mail rather than yours. Gmail somehow
dropped that mail, so I cannot find it...)
