Posted to user@hbase.apache.org by S L <sl...@gmail.com> on 2017/07/14 00:26:34 UTC

Why doesn't my regex work in hbase rowfilter with my scan?

I don't understand why my regex doesn't work when scanning HBase.
Everything looks good to me, but for some reason it's returning all keys
when it should return only the ones I'm requesting.

Scan scan = new Scan();
scan.addColumn(Bytes.toBytes("raw_data"), Bytes.toBytes(fileType));
scan.setCaching(limit);
scan.setCacheBlocks(false);
scan.setTimeRange(start, end);
FilterList filters = new FilterList();
Filter rowFilter = new RowFilter(CompareFilter.CompareOp.EQUAL,
    new RegexStringComparator("100_.*_\\d{10}"));
filters.addFilter(rowFilter);
scan.setFilter(filters);

TableMapReduceUtil.initTableMapperJob(tableName, scan, MTTRMapper.class,
    Text.class, IntWritable.class, job);

The rowkey is stored as a string in hbase.  The rowkey is in the format of
hash_servername_timestamp, e.g.

    0_myserver.mydomain.com_1234567890

The hash can be any number from 0-199.  In the above filter, I just want to
get all elements with hash = 100 but for some reason, the scan job appears
to return other rowkeys in addition to the ones with hash = 100.

I've tried this with jar versions 1.0.1 and 1.2.0-cdh5.7.2.  What am I
doing wrong that's making the regex not work?
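
One way to narrow this down is to check the pattern against sample rowkeys
outside HBase first. The sketch below uses only java.util.regex and compares
the pattern from the question with an anchored variant. It assumes that
RegexStringComparator applies the pattern with find()-style (substring)
matching, which is worth verifying against your HBase version. The class name
RowKeyRegexCheck, the anchored pattern, and all sample keys are hypothetical
and not taken from the original code. It only shows how the regex behaves on
strings; it does not establish why the scan job returns extra rows.

```java
import java.util.regex.Pattern;

public class RowKeyRegexCheck {
    // Pattern exactly as written in the RowFilter from the question.
    static final Pattern UNANCHORED = Pattern.compile("100_.*_\\d{10}");

    // Hypothetical anchored variant: the whole key must have the
    // form 100_<anything>_<exactly ten digits>.
    static final Pattern ANCHORED = Pattern.compile("^100_.*_\\d{10}$");

    // Substring-style matching (assumed to mirror the comparator's behavior).
    static boolean found(Pattern pattern, String rowKey) {
        return pattern.matcher(rowKey).find();
    }

    public static void main(String[] args) {
        // Hypothetical sample keys in the hash_servername_timestamp format.
        String[] samples = {
            "100_myserver.mydomain.com_1234567890",   // intended match
            "0_myserver.mydomain.com_1234567890",     // different hash
            "5_host100_a.mydomain.com_1234567890",    // "100_" inside server name
            "100_myserver.mydomain.com_12345678901"   // eleven-digit timestamp
        };
        for (String key : samples) {
            System.out.printf("%-40s unanchored=%b anchored=%b%n",
                key, found(UNANCHORED, key), found(ANCHORED, key));
        }
    }
}
```

If the unanchored and anchored results differ for keys that resemble the real
data, anchoring the pattern is a cheap thing to try. If both reject the
unwanted keys, the regex itself is probably not the cause, and the rest of the
job setup deserves a closer look.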

Re: Why doesn't my regex work in hbase rowfilter with my scan?

Posted by Ted Yu <yu...@gmail.com>.
I didn't mean that you cannot have only one filter in a filter list.

Please take a look at
hbase-server/src/test/java/org/apache/hadoop/hbase/filter/TestFilter.java,
where RegexStringComparator is used.

It may take you less time if you follow the example there and debug your
regex using sample data in the cited format.

Cheers

On Thu, Jul 13, 2017 at 5:40 PM, S L <sl...@gmail.com> wrote:

> Thanks Ted.  I had other filters in there but wanted to make it simple and
> just have one filter for now and then add them one by one until I get
> everything working.
>
> So I can't have just one filter in a filter list?  Kind of makes it hard to
> debug if I have multiple filters that might be bad (or just one bad and 9
> good but can't figure out which is the bad one).

Re: Why doesn't my regex work in hbase rowfilter with my scan?

Posted by S L <sl...@gmail.com>.
Thanks Ted.  I had other filters in there but wanted to make it simple and
just have one filter for now and then add them one by one until I get
everything working.

So I can't have just one filter in a filter list?  Kind of makes it hard to
debug if I have multiple filters that might be bad (or just one bad and 9
good but can't figure out which is the bad one).

On Thu, Jul 13, 2017 at 5:34 PM, Ted Yu <yu...@gmail.com> wrote:

> rowFilter is added to filter list which doesn't contain other filters.
>
> Maybe the snippet doesn't contain all the code in your class ?

Re: Why doesn't my regex work in hbase rowfilter with my scan?

Posted by Ted Yu <yu...@gmail.com>.
The rowFilter is added to a filter list that doesn't contain any other filters.

Maybe the snippet doesn't contain all the code in your class?

On Thu, Jul 13, 2017 at 5:26 PM, S L <sl...@gmail.com> wrote:

> I don't understand why my regex doesn't work when scanning hbase.
> Everything looks good to  me but for some reason, it's returning all keys
> when it should just return the ones I'm requesting
>
> Scan scan = new Scan();
> scan.addColumn(Bytes.toBytes("raw_data"), Bytes.toBytes(fileType));
> scan.setCaching(limit);
> scan.setCacheBlocks(false);
> scan.setTimeRange(start, end);
> FilterList filters = new FilterList();
>     Filter rowFilter = new RowFilter(CompareFilter.CompareOp.EQUAL, new
> RegexStringComparator("100_.*_\\d{10}"));
>             filters.addFilter(rowFilter);
> scan.setFilter(filters);
>
> TableMapReduceUtil.initTableMapperJob(tableName, scan, MTTRMapper.class,
> Text.class, IntWritable.class, job);
>
> The rowkey is stored as a string in hbase.  The rowkey is in the format of
> hash_servername_timestamp, e.g.
>
>     0_myserver.mydomain.com_1234567890
>
> The hash can be any number from 0-199.  In the above filter, I just want to
> get all elements with hash = 100 but for some reason, the scan job appears
> to return other rowkeys in addition to the ones with hash = 100.
>
> I've tried this with jar versions 1.0.1 and 1.2.0-cdh5.7.2.  What am I
> doing wrong that's making the regex not work?
>