Posted to user@hbase.apache.org by John <jo...@gmail.com> on 2013/10/24 16:52:48 UTC

Add Columnsize Filter for Scan Operation

Hi,

I'm currently writing an HBase Java program which iterates over every row in
a table. I have to modify some rows if the column size (the number of
columns in the row) is bigger than 25000.

Here is my source code: http://pastebin.com/njqG6ry6

Is there any way to add a Filter to the Scan operation and load only rows
where the size is bigger than 25k?

Currently I check the size at the client, but that means I have to load
every row to the client side. It would be better if the unwanted rows were
already filtered out on the "server" side.

thanks

John
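
A minimal sketch of the client-side check described above (the pastebin is
not reproduced here; the table name is a placeholder, the threshold is the
25000 from the post):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class WideRowScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");   // placeholder table name
    Scan scan = new Scan();
    scan.setCaching(1);                           // one (potentially very wide) row per RPC
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result row : scanner) {
        // Result.size() is the number of KeyValues (columns) returned for this row.
        // Note: without Scan.setBatch() a very wide row comes back as one huge Result.
        if (row.size() > 25000) {
          System.out.println(Bytes.toString(row.getRow()) + ": " + row.size() + " columns");
          // ... modify the row here ...
        }
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}

Without a server-side filter, every row still has to travel to the client
before the count can be checked, which is exactly the cost being asked about.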

Re: RE: Add Columnsize Filter for Scan Operation

Posted by John <jo...@gmail.com>.
Ah, that's it - thanks :)


2013/10/26 Dhaval Shah <pr...@yahoo.co.in>

> Mapper.cleanup is always called after all map calls are over
>
> Sent from Yahoo Mail on Android
>
>

Re: RE: Add Columnsize Filter for Scan Operation

Posted by Dhaval Shah <pr...@yahoo.co.in>.
Mapper.cleanup is always called after all map calls are over

Sent from Yahoo Mail on Android


Re: RE: Add Columnsize Filter for Scan Operation

Posted by John <jo...@gmail.com>.
Ah, I see there is one issue left. It's not very likely to happen,
but it could. My map() looks like this:

map() {
  if (row.getColumnSize < batchSize && currentRowName != lastRowName) {
    DROP ROW
    return;
  }

  if (row.getColumnSize < batchSize && currentRowName == lastRowName) {
    STORE FINAL RESULT INTO HBASE
  }

  // do something with the column elements
}

This works fine, BUT there is one special case: if the batch size is 100
and the row has 500 column elements, the final result for that row is never
stored, because storing only happens when the last batch of a row is smaller
than 100, and in this case every map() call gets exactly 100 columns.

So, is there a way to check whether the current result is the last one? Or
maybe a method that is finally called after all map() calls?

kind regards


2013/10/26 Dhaval Shah <pr...@yahoo.co.in>

> Cool
>
> Sent from Yahoo Mail on Android
>
>

Re: RE: Add Columnsize Filter for Scan Operation

Posted by Dhaval Shah <pr...@yahoo.co.in>.
Cool

Sent from Yahoo Mail on Android


Re: RE: Add Columnsize Filter for Scan Operation

Posted by John <jo...@gmail.com>.
@Dhaval: Thanks! I didn't know that. I've now created a field in the Mapper
class which stores information from the previous map() call. That works fine
for me.

regards,
john
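
A rough sketch of that pattern: a TableMapper that carries the previous row
key and a running column count across map() calls and flushes the last row in
cleanup(). Class and method names here are illustrative, and what to do with
a finished row is left open:

import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;

public class ColumnCountMapper extends TableMapper<NullWritable, NullWritable> {

  // state carried across map() calls: the row currently being counted
  private byte[] currentRow;
  private long columnCount;

  @Override
  protected void map(ImmutableBytesWritable key, Result value, Context context)
      throws IOException, InterruptedException {
    byte[] row = value.getRow();
    if (currentRow != null && !Arrays.equals(currentRow, row)) {
      flush(currentRow, columnCount);   // the previous row is complete
      columnCount = 0;
    }
    currentRow = row;
    columnCount += value.size();        // columns seen in this batch
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    if (currentRow != null) {
      flush(currentRow, columnCount);   // don't lose the very last row
    }
  }

  // illustrative helper: decide here what to do with a finished row,
  // e.g. only act when count > 25000
  private void flush(byte[] row, long count) {
  }
}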



Re: RE: Add Columnsize Filter for Scan Operation

Posted by Dhaval Shah <pr...@yahoo.co.in>.
John, an important point to note here is that even though rows will get split over multiple calls to scanner.next(), all batches of a single row will always reach the same mapper. Another important point to note is that these batches will appear in consecutive calls to mapper.map().

What this means is that you don't need to send your data to the reducer (and you are more efficient by not writing to disk, with no shuffle/sort phase and so on). You can just keep the state in memory for the particular row being processed (effectively a running count of the number of columns) and make the final decision when the row ends (effectively when you encounter a different row, or when all rows are exhausted and you reach the cleanup function).

The way I would do it is a map-only MR job which keeps the state in memory as described above and uses the KeyOnlyFilter to reduce the amount of data flowing to the mapper.
 
Regards,
Dhaval
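
For illustration, the driver of such a map-only job could be wired up roughly
like this (table name and batch size are placeholders, and ColumnCountMapper
stands in for a mapper that keeps the running count and flushes it on a row
change or in cleanup(), as described above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.KeyOnlyFilter;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class ColumnCountJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "count-columns-per-row");
    job.setJarByClass(ColumnCountJob.class);

    Scan scan = new Scan();
    scan.setCaching(1);                   // rows can be very wide, keep few per RPC
    scan.setBatch(1000);                  // split wide rows into batches of 1000 columns
    scan.setFilter(new KeyOnlyFilter());  // only keys are needed for counting
    scan.setCacheBlocks(false);           // usual setting for full-table MR scans

    TableMapReduceUtil.initTableMapperJob(
        "mytable",                        // placeholder table name
        scan,
        ColumnCountMapper.class,          // mapper keeping the per-row state
        NullWritable.class,
        NullWritable.class,
        job);

    job.setNumReduceTasks(0);             // map-only: no shuffle/sort, nothing written for a reducer
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The KeyOnlyFilter strips the cell values on the server side, so only the keys
needed for counting cross the wire.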



Re: RE: Add Columnsize Filter for Scan Operation

Posted by John <jo...@gmail.com>.
One thing I could do is to drop every row batch where the column size is
smaller than the batch size, something like if (rowsize < batchsize-1) drop
the batch. The problem with this version is that the last batch of a big row
is also dropped. Here is a little example. There is one row:
row1: 3500 columns

If I set the batch to 1000, the mapper function gets, for this row:

1. iteration: map function gets 1000 columns -> written to disk for the reducer
2. iteration: map function gets 1000 columns -> written to disk for the reducer
3. iteration: map function gets 1000 columns -> written to disk for the reducer
4. iteration: map function gets 500 columns -> dropped, because it's smaller
than the batch size

Is there a way to count the columns across different map() calls?

regards



Re: RE: Add Columnsize Filter for Scan Operation

Posted by John <jo...@gmail.com>.
I tried to build an MR job, but in my case that doesn't work: if I set, for
example, the batch to 1000 and there are 5000 columns in a row, I want to
generate something only for rows where the column size is bigger than 2500.
BUT since the map function is executed for every row batch, I can't tell
whether the row has a size bigger than 2500.

any ideas?



Re: RE: Add Columnsize Filter for Scan Operation

Posted by lars hofhansl <la...@apache.org>.
We need to finish up HBASE-8369




Re: RE: Add Columnsize Filter for Scan Operation

Posted by Dhaval Shah <pr...@yahoo.co.in>.
Well that depends on your use case ;)

There are many nuances/code complexities to keep in mind:
- merging results of various HFiles (each region can have more than one)
- merging results of the WAL
- applying delete markers
- how about data which is only in the memory of region servers and nowhere else
- applying bloom filters for efficiency
- what about HBase filters?

At some point you would basically start rewriting an HBase region server in
your map reduce job, which is not ideal for maintainability.

Do we ever read MySQL data files directly or issue a SQL query? Kind of goes back to the same argument ;)

Sent from Yahoo Mail on Android


RE: Add Columnsize Filter for Scan Operation

Posted by Vladimir Rodionov <vr...@carrieriq.com>.
Using the HBase client API (scanners) for M/R is so oldish :). HFiles have a well-defined format and it is much more efficient to read them directly.

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: vrodionov@carrieriq.com
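
For what reading a store file directly looks like, here is a bare-bones
sketch against the 0.94-era HFile reader API (the path argument is a
placeholder); whether this is a good idea is exactly what the rest of the
thread debates:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.hfile.CacheConfig;
import org.apache.hadoop.hbase.io.hfile.HFile;
import org.apache.hadoop.hbase.io.hfile.HFileScanner;

public class RawHFileRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    Path hfilePath = new Path(args[0]);   // one store file, e.g. under /hbase/<table>/<region>/<family>/
    HFile.Reader reader = HFile.createReader(fs, hfilePath, new CacheConfig(conf));
    try {
      reader.loadFileInfo();
      HFileScanner scanner = reader.getScanner(false, false);  // no block cache, no pread
      if (scanner.seekTo()) {
        do {
          KeyValue kv = scanner.getKeyValue();
          System.out.println(kv);
          // This is a single store file's view only: no memstore contents, no WAL,
          // no merging with the region's other HFiles, and delete markers are not applied.
        } while (scanner.next());
      }
    } finally {
      reader.close();
    }
  }
}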


Re: Add Columnsize Filter for Scan Operation

Posted by Dhaval Shah <pr...@yahoo.co.in>.
Interesting!! Can't wait to see this in action. I am already imagining huge performance gains
 
Regards,
Dhaval



Re: Add Columnsize Filter for Scan Operation

Posted by Ted Yu <yu...@gmail.com>.
For streaming responses, there is this JIRA:

HBASE-8691 High-Throughput Streaming Scan API



Re: Add Columnsize Filter for Scan Operation

Posted by Dhaval Shah <pr...@yahoo.co.in>.
Jean, if we don't add setBatch to the scan, the MR job does cause HBase to crash due to an OOME. We have run into this in the past as well. Basically the problem is: say I have a region server with 12GB of RAM and a row of size 20GB (an extreme example; in practice, HBase runs out of memory way before 20GB). If I query the entire row, HBase does not have enough memory to hold/process it for the response.

In practice, if your setCaching > 1, then the aggregate of all cached rows growing too big can also cause the same issue.

I think one way we can solve this issue is making the HBase server serve responses in a streaming fashion somehow (not exactly sure about the details of how this can work, but if it has to hold the entire row in memory, it's going to be bound by the HBase heap size).
 
Regards,
Dhaval
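
As a rough illustration of how setBatch and setCaching bound what has to be
held at once (the numbers are only examples):

import org.apache.hadoop.hbase.client.Scan;

public class BoundedScan {
  public static Scan boundedScan() {
    Scan scan = new Scan();
    // Without setBatch, one scanner.next() returns ALL columns of a row in a single
    // Result, so a single very wide row has to fit in the region server heap at once.
    scan.setBatch(1000);   // at most 1000 KeyValues of a row per Result
    scan.setCaching(10);   // at most 10 Results shipped per RPC
    // => roughly batch * caching = 10,000 KeyValues in flight per RPC,
    //    regardless of how wide the widest row is.
    return scan;
  }
}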



Re: Add Columnsize Filter for Scan Operation

Posted by John <jo...@gmail.com>.
I already mentioned that here:
https://groups.google.com/forum/#!topic/nosql-databases/ZWyc4zDursg ... .
I'm not sure if it is an issue. After setting the "batch size", everything
worked nicely for me.

Anyway, that was another problem :) If there were such a Filter, my current
code would work fine with the HBase Java API.

John



Re: Add Columnsize Filter for Scan Operation

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
If the MR job crashes because of the number of columns, then we have an issue
that we need to fix ;) Please open a JIRA and provide details if you are
facing that.

Thanks,

JM



Re: Add Columnsize Filter for Scan Operation

Posted by John <jo...@gmail.com>.
@Jean-Marc: Sure, I can do that, but that's a little bit complicated because
the rows sometimes have millions of columns and I have to handle them in
different batches because otherwise HBase crashes. Maybe I will try it later,
but first I want to try the API version. It works okay so far, but I want to
improve it a little bit.

@Ted: I tried to modify it, but I have no idea how exactly to do this. I have
to count the number of columns in that filter (that obviously works with the
count field). But there is no method that is called after iterating over all
elements, so I cannot return the drop ReturnCode in the filterKeyValue method,
because I don't know when the last one has been reached. Any ideas?

regards



Re: Add Columnsize Filter for Scan Operation

Posted by Ted Yu <yu...@gmail.com>.
Please take a look
at src/main/java/org/apache/hadoop/hbase/filter/ColumnCountGetFilter.java :

 * Simple filter that returns first N columns on row only.

You can modify the filter to suit your needs.

Cheers
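
For anyone who does go the custom-filter route, a rough sketch against the
0.94-era Filter API (class name and threshold are illustrative, and the class
also has to be deployed on the region servers' classpath). Two caveats
discussed elsewhere in the thread apply: a filter whose decision depends on
the whole row cannot be combined with Scan.setBatch(), and the region server
still has to materialize the whole row before it can decide:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.filter.FilterBase;

// Counts the columns of each row and excludes rows that have too few of them.
// Sketched against 0.94 (filterKeyValue takes a KeyValue, serialization is Writable);
// 0.96+ uses Cell and protobuf instead.
public class MinColumnCountFilter extends FilterBase {

  private int minColumns;
  private int count;

  public MinColumnCountFilter() {
  }

  public MinColumnCountFilter(int minColumns) {
    this.minColumns = minColumns;
  }

  @Override
  public void reset() {
    count = 0;                    // called at the start of every row
  }

  @Override
  public ReturnCode filterKeyValue(KeyValue kv) {
    count++;
    return ReturnCode.INCLUDE;    // keep the cell; the row-level decision comes later
  }

  @Override
  public boolean filterRow() {
    return count < minColumns;    // true means: exclude this row from the result
  }

  @Override
  public boolean hasFilterRow() {
    return true;                  // tells the scanner that filterRow() must be consulted
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(minColumns);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    minColumns = in.readInt();
  }
}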



Re: Add Columnsize Filter for Scan Operation

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
Hi John,

Sorry, this is not going to answer your question, but if you do a full table
scan, you might want to do it with a MapReduce job so it will be way faster.

For the filter, you might have to implement your own. I'm not sure there is
any filter based on the cell size today :(

JM

