Posted to user@hbase.apache.org by Rakhi Khatwani <ra...@gmail.com> on 2009/04/07 11:26:37 UTC

help with map-reduce

Hi,
     I have a map-reduce program with which I read from an HBase table.
In my map function I check whether the value of a particular column is xxx;
if it is, I continue with the processing, otherwise I skip the row.
However, if my table is really big, most of my time in the map gets wasted
processing unwanted rows.
Is there any way we could send only a subset of rows (based on the
value of a particular column family) to the map?

I have also gone through TableInputFormatBase, but I am not able to figure out
how to set the input format if we are using the TableMapReduceUtil class to
initialize table map jobs. Or is there any other way I could use it?

Thanks in Advance,
Raakhi.
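
A minimal sketch of the map-side check being described, against the 0.19-era
HBase MapReduce API used elsewhere in this thread (the class name, column
name, and value are placeholders, not Rakhi's actual code):

  import java.io.IOException;

  import org.apache.hadoop.hbase.io.Cell;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.io.RowResult;
  import org.apache.hadoop.hbase.mapred.TableMap;
  import org.apache.hadoop.hbase.util.Bytes;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class StatusCheckingMap extends MapReduceBase
      implements TableMap<ImmutableBytesWritable, RowResult> {

    // Every row of the table reaches this method; non-matching rows are
    // still scanned and shipped to the map task before being skipped here,
    // which is exactly the wasted work described above.
    public void map(final ImmutableBytesWritable key, final RowResult row,
        final OutputCollector<ImmutableBytesWritable, RowResult> output,
        final Reporter reporter) throws IOException {
      final Cell cell = row.get(Bytes.toBytes("Status:"));
      if (cell == null
          || !"UNCOLLECTED".equals(Bytes.toString(cell.getValue()))) {
        return; // unwanted row - skipped only after the scan cost is paid
      }
      output.collect(key, row); // only matching rows are processed further
    }
  }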

Re: help with map-reduce

Posted by Rakhi Khatwani <ra...@gmail.com>.
Hi Tim,
       I made a class which extends TableInputFormatBase and set the HTable,
the input columns, and the row filter, but I don't know how to set that class
as the input to my map-reduce program.
Currently I am using TableMapReduceUtil to set my table name and
column families as the input to my map class.
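
For what it's worth, a hedged sketch of how the wiring could look, based on
the job setup code that appears later in this thread:
TableMapReduceUtil.initTableMapJob() fills in the defaults (including the
stock TableInputFormat), and the custom input format is then swapped in on
the JobConf. MyDriver and MyMapper are hypothetical names, the column list is
just an example, and TableInputFilter is the subclass name Rakhi uses later
in the thread:

  JobConf job = new JobConf(MyDriver.class);
  // Let the utility set the table name, input columns, mapper class and
  // map output key/value types - it also sets the stock TableInputFormat...
  TableMapReduceUtil.initTableMapJob("table1", "content: status:",
      MyMapper.class, ImmutableBytesWritable.class, RowResult.class, job);
  // ...then replace the input format with the filtering subclass, so the
  // scanner handed to the map tasks carries the row filter.
  job.setInputFormat(TableInputFilter.class);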


Re: help with map-reduce

Posted by tim robertson <ti...@gmail.com>.
I am a newbie, but...

I think it will boil down to something looking at the column and
applying the filter.  I don't think you would get around this without
reworking the model or adding some kind of index.

Why not set a RowFilter on the TableInputFormat? Then the rows are
filtered before your map - I presume this would be more efficient than
shuffling all the data through the task tracking of Hadoop MR.

Cheers

Tim
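
In code, a sketch of what Tim is suggesting, using the 0.19-era filter API
that comes up later in the thread (the subclass name is hypothetical, and the
regular expression is just an example):

  // Hypothetical TableInputFormat subclass. The RowFilterInterface is
  // evaluated on the RegionServers, so rows failing the filter are never
  // shipped to the map tasks at all.
  public class FilteringTableInputFormat extends TableInputFormat {
    @Override
    public void configure(final JobConf job) {
      super.configure(job); // let the base class pick up table and columns
      setRowFilter(new RegExpRowFilter("ABC.*")); // keep only matching row keys
    }
  }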




Re: help with map-reduce

Posted by Rakhi Khatwani <ra...@gmail.com>.
Hi Lars,
           thanks for your suggestion... I will try this out today :)

Thanks once again,
Rakhi

On Wed, Apr 8, 2009 at 12:49 PM, Rakhi Khatwani <rakhi.khatwani@gmail.com> wrote:

> Hi Lars,
>
> Well, the details are as follows:
>
> table1 has some URL as the rowkey and 2 ColumnFamilies, as described
> below:
>
> one columnFamily called content, and
> one columnFamily called status [which takes the values ANALYSED,
> UNANALYSED] (all in upper case... I checked it; there is no issue with
> the spelling/case).
>
> Hope this helps,
> Thanks.
> Rakhi

On Wed, Apr 8, 2009 at 1:59 PM, Lars George <la...@worldlingo.com> wrote:

> Hi Rakhi,
>
> Wow, same here. I copied your RowFilter line, and when I press the dot
> key and the fly-up opens, Eclipse hangs. Nice... NOT!
>
> Apart from that, you are also saying that the filter is not working as
> expected? Do you use any column qualifiers for the "Status:" column?
> Are the values in the correct casing, i.e. are the values stored in
> uppercase as you have it in your example below? I assume the comparison
> is byte sensitive. Please give us more details, maybe a small sample
> table dump, so that we can test this?
>
> Lars

Rakhi Khatwani wrote:

> Hi,
>
> I did try the filter... but using ColumnValueFilter. I declared a
> ColumnValueFilter as follows:
>
> public class TableInputFilter extends TableInputFormat
>     implements JobConfigurable {
>
>   public void configure(final JobConf jobConf) {
>     setHtable(tablename);
>     setInputColumns(columnName);
>     final RowFilterInterface colFilter =
>         new ColumnValueFilter("Status:".getBytes(),
>             ColumnValueFilter.CompareOp.EQUAL,
>             "UNCOLLECTED".getBytes());
>     setRowFilter(colFilter);
>   }
> }
>
> and then I use my class as the input format for my map function.
>
> In my map function, I set my log to display the value of my Status
> column family.
>
> When I execute my map-reduce job, it displays "Status:: Uncollected"
> for some rows and Status = "Collected" for the rest of the rows, but
> what I want is to send only those records whose Status is uncollected.
>
> I even considered using the method filterRow, described by the API as:
>
>   boolean filterRow(SortedMap<byte[], Cell> columns)
>       Filter on the fully assembled row.
>
> (see
> http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/filter/ColumnValueFilter.html#filterRow%28java.util.SortedMap%29)
>
> but as soon as I type colFilter followed by a '.', my Eclipse hangs.
> It is really weird... I have tried it on 3 different machines (2
> machines on Linux running Eclipse Ganymede 3.4 and one on Windows using
> MyEclipse).
>
> I don't know if I am going wrong somewhere.
>
> Thanks,
> Raakhi

On Tue, Apr 7, 2009 at 7:18 PM, Lars George <la...@worldlingo.com> wrote:

> Hi Rakhi,
>
> The way the filters work is that you either use the supplied filters or
> create your own subclasses - but then you will have to deploy that
> class to all RegionServers while adding it to their respective
> hbase-env.sh (in the "export HBASE_CLASSPATH" variable). We are
> currently discussing whether this could be done dynamically
> (https://issues.apache.org/jira/browse/HBASE-1288).
>
> Once you have done that, or if you use one of the supplied filters, you
> can assign the filter by overriding the TableInputFormat's configure()
> method, like so:
>
>   public void configure(JobConf job) {
>     RegExpRowFilter filter = new RegExpRowFilter("ABC.*");
>     setRowFilter(filter);
>   }
>
> As Tim points out, setting the whole thing up is done in your main M/R
> tool-based application, similar to:
>
>   JobConf job = new JobConf(...);
>   TableMapReduceUtil.initTableMapJob("<table-name>", "<columns>",
>       IdentityTableMap.class,
>       ImmutableBytesWritable.class, RowResult.class, job);
>   job.setReducerClass(MyTableReduce.class);
>   job.setInputFormat(MyTableInputFormat.class);
>   job.setOutputFormat(MyTableOutputFormat.class);
>
> Of course this depends on what classes you want to replace, or whether
> this is a Reduce-oriented job (meaning a default identity + filter map
> with all the work done in the Reduce phase) or the other way around.
> But the principles and the filtering are the same.
>
> HTH,
> Lars

Rakhi Khatwani wrote:

> Thanks Ryan, I will try that.

On Tue, Apr 7, 2009 at 3:05 PM, Ryan Rawson <ry...@gmail.com> wrote:

> There is a server-side mechanism to filter rows; it's found in the
> org.apache.hadoop.hbase.filter package. I'm not sure how this
> interoperates with the TableInputFormat exactly.
>
> Setting a filter to reduce the # of rows returned is pretty much
> exactly what you want.

Re: help with map-reduce

Posted by Lars George <la...@worldlingo.com>.
Hi Rakhi,

The second part was meant to say: "...Setting it to *false* activates
the...", so call it like this:

  final RowFilterInterface colFilter = new ColumnValueFilter(
      "Status:".getBytes(), ColumnValueFilter.CompareOp.EQUAL,
      "UNCOLLECTED".getBytes(), false);

Regards,
Lars

PS: And sorry for my misspelling of your name
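
Folding this correction and the earlier Bytes.toBytes() advice into the
subclass from Rakhi's message gives roughly the following (a sketch against
the 0.19-era API; tablename and columnName are her placeholders from earlier
in the thread):

  public class TableInputFilter extends TableInputFormat
      implements JobConfigurable {

    public void configure(final JobConf jobConf) {
      setHtable(tablename);
      setInputColumns(columnName);
      // filterIfColumnMissing = false, as corrected above; Bytes.toBytes()
      // instead of String.getBytes() so the comparison does not depend on
      // the platform's default encoding.
      final RowFilterInterface colFilter = new ColumnValueFilter(
          Bytes.toBytes("Status:"),
          ColumnValueFilter.CompareOp.EQUAL,
          Bytes.toBytes("UNCOLLECTED"),
          false);
      setRowFilter(colFilter);
    }
  }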



Re: help with map-reduce

Posted by Lars George <la...@worldlingo.com>.
Hi Rahki,

Looking through the code of the ColumnValueFilter again, it seems it
does what you want when you add the extra "filterIfColumnMissing"
parameter to the constructor and set it to "false". The default "true"
does the column filtering and will return all rows that have that
column. Setting it to true activates the "filterRow()" (although I am
not sure yet where that is called - the others I can see in use in the
StoreScanner class) to filter out rows that do not have a column
match - which is what you want. Of course you still need to invert the
check as mentioned in the previous email.

Lars


Re: help with map-reduce

Posted by Rakhi Khatwani <ra...@gmail.com>.
Hi Lars,
                 Hmm... I had a look at the other filters, but I thought
ColumnValueFilter would be the most appropriate because in the constructor
we can specify the column name and the value.
Probably I am going wrong there.

What I want is to filter out all the rows based on some column value. What
do you suggest?

Thanks a ton,
Rakhi


Re: help with map-reduce

Posted by Lars George <la...@worldlingo.com>.
Hi Rakhi,

Sorry, not yet. This is not an easy thing to replicate. I will try
though over the next few days if I find time. A few things to note
first, though. The way filters work is that they do *not* let filtered
rows through but actually filter them out. That means your logic seems
reversed:

  final RowFilterInterface colFilter = new ColumnValueFilter(
      "Status:".getBytes(), ColumnValueFilter.CompareOp.EQUAL,
      "UNCOLLECTED".getBytes());
  setRowFilter(colFilter);

I think you *want* the uncollected columns to be processed? At least
that is what you said below :) So you will have to filter out of the
set all other rows that are NOT EQUAL to "UNCOLLECTED".

Second, be careful with "UNCOLLECTED".getBytes(), as that uses your
system's default encoding. Better to use Bytes.toBytes("UNCOLLECTED") -
but that should of course match the way you store those strings in the
first place. The filters do a byte-level compare, so this is very
sensitive.

This does not yet address why you see both values or have matches at
all. It rather sounds like the filter is not active?

And lastly, using the ColumnValueFilter will always let through all
rows! It is designed to strip out the columns of each row, but not to
filter on the row itself. Is that what you want? If not, you may have
to use a different filter class.

Lars
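
The encoding point is easy to demonstrate in isolation; a small sketch (the
string literal is just an example):

  // String.getBytes() uses the JVM's default charset (the file.encoding
  // property), so the resulting bytes can differ from machine to machine.
  byte[] platformDependent = "UNCOLLECTED".getBytes();
  // Bytes.toBytes() converts via UTF-8, so the writer and the filter agree
  // as long as both sides use it.
  byte[] stable = Bytes.toBytes("UNCOLLECTED");
  // The filters compare raw bytes: any mismatch here means no row matches.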



Re: help with map-reduce

Posted by Rakhi Khatwani <ra...@gmail.com>.
Hi Lars,
              Just wanted to follow up: did you try out the ColumnValueFilter?
Did it work?
I really need it to improve the performance of my map-reduce programs.

Thanks a ton,
Raakhi


Re: help with map-reduce

Posted by Rakhi Khatwani <ra...@gmail.com>.
Hi Lars,

Well the details are as follows:

table1 has some URL as the row key, and two column families as described below:

one column family called content, and
one column family called status [which takes the values ANALYSED and UNANALYSED]
(all in uppercase... I checked it; there is no issue with the
spelling/case).
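
For concreteness, a table like the one described could be created along these
lines (a rough sketch against the 0.19-era admin API, inferred from the
description above rather than taken from actual code):

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.client.HBaseAdmin;

  public class CreateTable1 {
    public static void main(String[] args) throws Exception {
      HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
      HTableDescriptor desc = new HTableDescriptor("table1");
      desc.addFamily(new HColumnDescriptor("content:"));  // the fetched content
      desc.addFamily(new HColumnDescriptor("status:"));   // ANALYSED / UNANALYSED
      admin.createTable(desc);
    }
  }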

Hope this helps,
Thanks.
Rakhi




Re: help with map-reduce

Posted by Lars George <la...@worldlingo.com>.
Hi Rakhi,

Wow, same here. I copied your RowFilter line, and when I press the dot
key and the content-assist fly-out opens, Eclipse hangs. Nice... NOT!

Apart from that, you are also saying that the filter is not working as
expected? Do you use any column qualifiers for the "Status:" column? Are
the values in the correct casing, i.e. are the values stored in
uppercase as you have them in your example below? I assume the comparison
is byte-sensitive. Please give us more details, maybe a small sample
table dump so that we can test this?
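
As a quick illustration of the byte-level comparison (plain Java; the two
casings are the ones quoted in this thread):

  import java.util.Arrays;

  public class CaseCheck {
    public static void main(String[] args) {
      byte[] stored = "Uncollected".getBytes();  // casing the map reportedly sees
      byte[] wanted = "UNCOLLECTED".getBytes();  // casing used in the filter
      // a raw byte comparison, as the filter performs, never matches
      // values that differ only in case
      System.out.println(Arrays.equals(stored, wanted));  // prints false
    }
  }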

Lars


Re: help with map-reduce

Posted by check_writer <tu...@masaitechnologies.com>.
I have this EXACT same problem, and I thought it was just me. For some
reason my Eclipse just hangs as I try to extend the PageRowFilter like the
following:

Scanner scanner = table.getScanner(new String[] { colfam1 + "nodeid" },
    "999-1", 2280278, new PageRowFilter(1) {
      public boolean filterColumn(byte[] rowKey, byte[] colKey, byte[] data) {
        return true;
      }
    });

I had to type it by hand. Also, when I run this small program, HBase
ends up giving me this error:

--------------------------
Exception in thread "main"
org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact
region server 127.0.0.1:49847 for region mytable,,1239130171356, row
'999-1', but failed after 10 attempts.
Exceptions:
java.io.IOException: Call to /127.0.0.1:49847 failed on local exception:
java.io.EOFException
java.io.IOException: Call to /127.0.0.1:49847 failed on local exception:
java.io.EOFException

	at
org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:858)
	at
org.apache.hadoop.hbase.client.HTable$ClientScanner.nextScanner(HTable.java:1594)
	at
org.apache.hadoop.hbase.client.HTable$ClientScanner.initialize(HTable.java:1539)
	at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:862)
	at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:838)
	at mytest.HbastTester.main(HbastTester.java:98)
-----------------------------------
My region server is active on port 60020 and the info server is on 60030, and I'm
running HBase in local mode. But other, simpler non-filter-based scanners
work just fine.

Anyway, clues would be helpful.
thanks!
check_writer.






Re: help with map-reduce

Posted by "Ramesh.Ramasamy" <ra...@gmail.com>.
Lars,

In addition, I tried to make a jar out of the src folder. BTW, my platform
is Ubuntu 8.10, and the kernel is 2.6.27-14:

ramesh@master:/usr/share/man$ uname -a
Linux master 2.6.27-14-generic #1 SMP Tue Jun 30 19:57:39 UTC 2009 i686
GNU/Linux

Ramesh






Re: help with map-reduce

Posted by "Ramesh.Ramasamy" <ra...@gmail.com>.
Lars,

I added the /opt/hbase-0.19.1/src/java folder for the source; the source code
comes with the tar.gz.

Except for the filter package, it is working for most of the classes (as far
as I tested).

Ramesh





Re: help with map-reduce

Posted by Lars George <la...@worldlingo.com>.
Ramesh,

Interesting that you mention this. I have the same issue with the Scan 
object. When I type

   Scan scan = new Scan();
   scan.addCo

and wait for Eclipse's context help to open, it freezes on me. Other
classes are fine. I also wondered what the issue is and started to look
into getting a stack dump out of Eclipse, but have not yet continued on it.

Have you added the source or javadoc to the Eclipse configuration for 
the hbase jar? Just wondering.

Lars


On 7/1/09 2:35 PM, Ramesh.Ramasamy wrote:
> Hi,
>
> I am using Eclipse 3.3, JDK 1.6.0_12 and Hadoop/Hbase 0.19.1.
>
> On coding using some of the filter classes, eclipse hangs, and have no other
> option to continue it unless kill/restart the process. Does any body figured
> it out the problem and have a fix?
>
> TIA,
> Ramesh
>
>
>    


Re: help with map-reduce

Posted by "Ramesh.Ramasamy" <ra...@gmail.com>.
Hi,

I am using Eclipse 3.3, JDK 1.6.0_12 and Hadoop/Hbase 0.19.1.

When coding with some of the filter classes, Eclipse hangs, and I have no
option to continue unless I kill/restart the process. Has anybody figured
out the problem and found a fix?
 
TIA,
Ramesh




Re: help with map-reduce

Posted by Rakhi Khatwani <ra...@gmail.com>.
Hi,
           I did try the filter... but using ColumnValueFilter. I declared a
ColumnValueFilter as follows:

public class TableInputFilter extends TableInputFormat
    implements JobConfigurable {

  public void configure(final JobConf jobConf) {
    setHTable(table);           // an HTable instance opened on the input table
    setInputColumns(columns);   // the input column(s) to scan

    final RowFilterInterface colFilter =
        new ColumnValueFilter("Status:".getBytes(),
            ColumnValueFilter.CompareOp.EQUAL,
            "UNCOLLECTED".getBytes());
    setRowFilter(colFilter);
  }
}

and then I use my class as the input format for my map job.


In my map function, I set my log to display the value of my Status column
family.

When I execute my map-reduce job, it displays "Status:: Uncollected"
for some rows
and Status = "Collected" for the rest of the rows.

But what I want is to send only those records whose 'Status:' is
uncollected.

I even considered using the filterRow method described by the API:

  boolean filterRow(SortedMap<byte[], Cell> columns)
          Filter on the fully assembled row.

But as soon as I type colFilter followed by a '.', my Eclipse hangs.
It's really weird... I have tried it on 3 different machines (2 machines on
Linux running Eclipse Ganymede 3.4 and one on Windows using MyEclipse).


I dunno if I'm going wrong somewhere.

Thanks,
Raakhi
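
One way to act on the filterRow idea above, sketched as a hypothetical
subclass (it assumes the 0.19 filter API quoted above, that returning true
from filterRow excludes the row, and that the column map accepts a fresh
byte[] key):

  import java.util.Arrays;
  import java.util.SortedMap;

  import org.apache.hadoop.hbase.filter.ColumnValueFilter;
  import org.apache.hadoop.hbase.io.Cell;

  public class UncollectedRowFilter extends ColumnValueFilter {

    public UncollectedRowFilter() {
      super("Status:".getBytes(), CompareOp.EQUAL, "UNCOLLECTED".getBytes());
    }

    @Override
    public boolean filterRow(final SortedMap<byte[], Cell> columns) {
      Cell cell = columns.get("Status:".getBytes());
      // true means "filter this row out": drop rows that have no Status
      // value or whose value is not exactly UNCOLLECTED
      return cell == null
          || !Arrays.equals(cell.getValue(), "UNCOLLECTED".getBytes());
    }
  }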



Re: help with map-reduce

Posted by Lars George <la...@worldlingo.com>.
Hi Rakhi,

The way the filters work is that you either use the supplied filters or
create your own subclasses - but then you will have to deploy that class
to all RegionServers while adding it to their respective hbase-env.sh
(in the "export HBASE_CLASSPATH" variable). We are currently discussing
whether this could be done dynamically
(https://issues.apache.org/jira/browse/HBASE-1288).

Once you have done that, or if you use one of the supplied filters, you can
assign the filter by overriding TableInputFormat's configure() method,
like so:

  public void configure(JobConf job) {
      RegExpRowFilter filter = new RegExpRowFilter("ABC.*");
      setRowFilter(filter);
  }

As Tim points out, setting the whole thing up is done in your main M/R
tool-based application, similar to:

  JobConf job = new JobConf(...);
  TableMapReduceUtil.initTableMapJob("<table-name>", "<columns>",
      IdentityTableMap.class, ImmutableBytesWritable.class, RowResult.class, job);
  job.setReducerClass(MyTableReduce.class);
  job.setInputFormat(MyTableInputFormat.class);
  job.setOutputFormat(MyTableOutputFormat.class);

Of course, this depends on which classes you want to replace, and on whether
this is a Reduce-oriented job (meaning a default identity + filter map with
all the work done in the Reduce phase) or the other way around. But the
principles and the filtering are the same.

HTH,
Lars
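
Putting the two snippets together, a complete driver might look roughly like
this (a sketch only: the table and column names are placeholders, and
MyTableInputFormat is the filtering TableInputFormat subclass from above):

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.io.RowResult;
  import org.apache.hadoop.hbase.mapred.IdentityTableMap;
  import org.apache.hadoop.hbase.mapred.TableMapReduceUtil;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  public class FilteredTableScan {
    public static void main(String[] args) throws Exception {
      JobConf job = new JobConf(new HBaseConfiguration(), FilteredTableScan.class);
      job.setJobName("filtered table scan");
      // wires in the table name, input columns, mapper, and map output types
      TableMapReduceUtil.initTableMapJob("table1", "status: content:",
          IdentityTableMap.class, ImmutableBytesWritable.class,
          RowResult.class, job);
      // swap in the subclass whose configure() sets the row filter
      job.setInputFormat(MyTableInputFormat.class);
      job.setNumReduceTasks(0);  // map-only in this sketch
      JobClient.runJob(job);
    }
  }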



Re: help with map-reduce

Posted by tim robertson <ti...@gmail.com>.
Maybe I don't understand, but if you have written the filter and extended
TableInputFormat, you can run an MR job with:

JobConf conf = new JobConf(...);
conf.setInputFormat(YourTableInputFormat.class);

Cheers,

Tim





Re: help with map-reduce

Posted by Rakhi Khatwani <ra...@gmail.com>.
Thanks Ryan, I will try that.


Re: help with map-reduce

Posted by Ryan Rawson <ry...@gmail.com>.
There is a server-side mechanism to filter rows; it's found in the
org.apache.hadoop.hbase.filter package. I'm not sure how this interoperates
with TableInputFormat exactly.

Setting a filter to reduce the # of rows returned is pretty much exactly
what you want.
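
For example, a filter can also be handed straight to a scanner, outside of
MapReduce. A sketch against the 0.19 client API (table and column names are
hypothetical; the getScanner overload is the one used elsewhere in this
thread):

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HConstants;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Scanner;
  import org.apache.hadoop.hbase.filter.ColumnValueFilter;
  import org.apache.hadoop.hbase.filter.RowFilterInterface;
  import org.apache.hadoop.hbase.io.RowResult;

  public class FilteredScanExample {
    public static void main(String[] args) throws Exception {
      HTable table = new HTable(new HBaseConfiguration(), "table1");
      RowFilterInterface filter = new ColumnValueFilter("status:".getBytes(),
          ColumnValueFilter.CompareOp.EQUAL, "UNANALYSED".getBytes());
      // the region server drops non-matching rows before they reach the client
      Scanner scanner = table.getScanner(new String[] { "status:" }, "",
          HConstants.LATEST_TIMESTAMP, filter);
      try {
        for (RowResult row : scanner) {
          System.out.println(new String(row.getRow()));  // matching row keys
        }
      } finally {
        scanner.close();
      }
    }
  }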
