You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hbase.apache.org by Steinmaurer Thomas <Th...@scch.at> on 2011/08/04 12:57:12 UTC

RE: Something like Execution Plan as in the RDBMS world?

Hi Andy and Ted!

Thanks for your reply. Basically, I'm currently trying a range scan and a regex row filter on a very small table (~ 115K rows), just to get used to. Hadoop/HBase ... is running in the available Cloudera VM.

I have the following row key, as already discussed in other threads.

vehicle_id: up to 16 characters
device_id: up to 16 characters
timestamp: YYYYMMDDhhmmss

Pretty much one row every 5 minutes for a particular vehicle and device.

Now I want to get the rows for an entire day for a particular vehicle and device.

The following range scan implementation:

	Scan scan = new Scan();

	String startKey =
		String.format(HBASE_ROWKEY_DATASOURCEID_FORMAT, "57").replace(' ', '0') // Vehicle ID
		+ "-"
		+ String.format(HBASE_ROWKEY_DATASOURCEID_FORMAT, "1").replace(' ', '0') // Device ID
		+ "-"
		+ "20110808000000";
	String endKey =
		String.format(HBASE_ROWKEY_DATASOURCEID_FORMAT, "57").replace(' ', '0') // Vehicle ID
		+ "-"
		+ String.format(HBASE_ROWKEY_DATASOURCEID_FORMAT, "1").replace(' ', '0') // Device ID
		+ "-"
		+ "20110808235959";
	scan.setStartRow(Bytes.toBytes(startKey));
	scan.setStopRow(Bytes.toBytes(endKey));
	scan.addColumn(Bytes.toBytes("data_details"), Bytes.toBytes("temperature1_value"));

Takes < 1 sec.

Whereas the following regex based row filter implementation:

	List<Filter> filters = new ArrayList<Filter>();
	RowFilter rf = new RowFilter(
		CompareFilter.CompareOp.EQUAL
		, new RegexStringComparator(".{14}57\\-.{15}1\\-20110808.{6}")
	);
	filters.add(rf);
	
	QualifierFilter qf = new QualifierFilter(
		CompareFilter.CompareOp.EQUAL
		, new RegexStringComparator("temperature1_value")
	);
	filters.add(qf);
	
	FilterList filterList1 = new FilterList(filters);
	scan.setFilter(filterList1);


Takes around 6 sec on a very small table.


We aren't sure if we need the regex row filter capabilities at all or if range scans are sufficient for our access pattern. But a better understanding on how to optimize regex stuff would be helpful.


Thanks!

Thomas


-----Original Message-----
From: Andrew Purtell [mailto:apurtell@apache.org] 
Sent: Mittwoch, 27. Juli 2011 08:25
To: user@hbase.apache.org
Subject: Re: Something like Execution Plan as in the RDBMS world?

> Or is this a complete different thinking?

Yes.

There isn't an "execution plan" when using HBase, as that term is commonly understood from RDBMS systems. The commands you issue against HBase using the client API are executed in order as you issue them.

> Depending on the access pattern, we might be in a situation to use 
>e.g. RegEx filters on rowkeys. I wonder if there is some kind of an 
>execution plan when running a HBase query to better understand

Exposing filter statistics (hit/skip ratio etc.) and other per-query metrics like number of store files read, how many keys examined, etc. is an interesting idea perhaps along the lines of what you ask, but HBase does not have support for that level of query performance introspection at the moment. 

What people do is measure the application metrics of interest and try different approaches to optimize them.

Best regards,


   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)


>________________________________
>From: Steinmaurer Thomas <Th...@scch.at>
>To: user@hbase.apache.org
>Sent: Tuesday, July 26, 2011 11:10 PM
>Subject: Something like Execution Plan as in the RDBMS world?
>
>Hello,
>
>
>
>we have a three part row-key taking into account that the first part is 
>important for distribution/partitioning when the system grows. 
>Depending on the access pattern, we might be in a situation to use e.g. 
>RegEx filters on rowkeys. I wonder if there is some kind of an 
>execution plan (as known in RDBMS) when running a HBase query to better 
>understand how HBase processes the query and what execution path it 
>takes to generate the result set.
>
>
>
>Or is this a complete different thinking?
>
>
>
>Thanks,
>
>Thomas
>
>
>
>
>
>

Re: Something like Execution Plan as in the RDBMS world?

Posted by Jean-Daniel Cryans <jd...@apache.org>.

Your first scan is set to start with an offset in the table and has a
stop row, whereas the other one is a full table scan with filtering
(if my understanding is correct). Compare the second scan with a
straight up full table scan and you should see where the slowdown
comes from (should be from the fact that it has to read everything).

Hope that helps,

J-D

On Thu, Aug 4, 2011 at 3:57 AM, Steinmaurer Thomas
<Th...@scch.at> wrote:
> Hi Andy and Ted!
>
> Thanks for your reply. Basically, I'm currently trying a range scan and a regex row filter on a very small table (~ 115K rows), just to get used to. Hadoop/HBase ... is running in the available Cloudera VM.
>
> I have the following row key, as already discussed in other threads.
>
> vehicle_id: up to 16 characters
> device_id: up to 16 characters
> timestamp: YYYYMMDDhhmmss
>
> Pretty much one row every 5 minutes for a particular vehicle and device.
>
> Now I want to get the rows for an entire day for a particular vehicle and device.
>
> The following range scan implementation:
>
>        Scan scan = new Scan();
>
>        String startKey =
>                String.format(HBASE_ROWKEY_DATASOURCEID_FORMAT, "57").replace(' ', '0') // Vehicle ID
>                + "-"
>                + String.format(HBASE_ROWKEY_DATASOURCEID_FORMAT, "1").replace(' ', '0') // Device ID
>                + "-"
>                + "20110808000000";
>        String endKey =
>                String.format(HBASE_ROWKEY_DATASOURCEID_FORMAT, "57").replace(' ', '0') // Vehicle ID
>                + "-"
>                + String.format(HBASE_ROWKEY_DATASOURCEID_FORMAT, "1").replace(' ', '0') // Device ID
>                + "-"
>                + "20110808235959";
>        scan.setStartRow(Bytes.toBytes(startKey));
>        scan.setStopRow(Bytes.toBytes(endKey));
>        scan.addColumn(Bytes.toBytes("data_details"), Bytes.toBytes("temperature1_value"));
>
> Takes < 1 sec.
>
> Whereas the following regex based row filter implementation:
>
>        List<Filter> filters = new ArrayList<Filter>();
>        RowFilter rf = new RowFilter(
>                CompareFilter.CompareOp.EQUAL
>                , new RegexStringComparator(".{14}57\\-.{15}1\\-20110808.{6}")
>        );
>        filters.add(rf);
>
>        QualifierFilter qf = new QualifierFilter(
>                CompareFilter.CompareOp.EQUAL
>                , new RegexStringComparator("temperature1_value")
>        );
>        filters.add(qf);
>
>        FilterList filterList1 = new FilterList(filters);
>        scan.setFilter(filterList1);
>
>
> Takes around 6 sec on a very small table.
>
>
> We aren't sure if we need the regex row filter capabilities at all or if range scans are sufficient for our access pattern. But a better understanding on how to optimize regex stuff would be helpful.
>
>
> Thanks!
>
> Thomas
>
>
> -----Original Message-----
> From: Andrew Purtell [mailto:apurtell@apache.org]
> Sent: Mittwoch, 27. Juli 2011 08:25
> To: user@hbase.apache.org
> Subject: Re: Something like Execution Plan as in the RDBMS world?
>
>> Or is this a complete different thinking?
>
> Yes.
>
> There isn't an "execution plan" when using HBase, as that term is commonly understood from RDBMS systems. The commands you issue against HBase using the client API are executed in order as you issue them.
>
>> Depending on the access pattern, we might be in a situation to use
>>e.g. RegEx filters on rowkeys. I wonder if there is some kind of an
>>execution plan when running a HBase query to better understand
>
> Exposing filter statistics (hit/skip ratio etc.) and other per-query metrics like number of store files read, how many keys examined, etc. is an interesting idea perhaps along the lines of what you ask, but HBase does not have support for that level of query performance introspection at the moment.
>
> What people do is measure the application metrics of interest and try different approaches to optimize them.
>
> Best regards,
>
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
>
>
>>________________________________
>>From: Steinmaurer Thomas <Th...@scch.at>
>>To: user@hbase.apache.org
>>Sent: Tuesday, July 26, 2011 11:10 PM
>>Subject: Something like Execution Plan as in the RDBMS world?
>>
>>Hello,
>>
>>
>>
>>we have a three part row-key taking into account that the first part is
>>important for distribution/partitioning when the system grows.
>>Depending on the access pattern, we might be in a situation to use e.g.
>>RegEx filters on rowkeys. I wonder if there is some kind of an
>>execution plan (as known in RDBMS) when running a HBase query to better
>>understand how HBase processes the query and what execution path it
>>takes to generate the result set.
>>
>>
>>
>>Or is this a complete different thinking?
>>
>>
>>
>>Thanks,
>>
>>Thomas
>>
>>
>>
>>
>>
>>
>

RE: Something like Execution Plan as in the RDBMS world?

Posted by Michael Segel <mi...@hotmail.com>.

Tomas,

If I understand you correctly you have a row key of A,B,C and you wan to fetch only the rows on A and C 
You can do a start row of A 
And then do the end row of A1

So that you get the first row for the give vehicle_id, and then stop when the vehicle_id changes.

You would then have to do a server side filter on values for C to get the timestamp for a given day.
(You could do this with a client side filter, but that means pushing all the data over the wire.) 
[Note having said that, you could just do a client side filter since you only have 115K rows and you're going to get a subset of that returned by the range key.]

The idea of doing something like the following:
SELECT * 
FROM TABLE 
WHERE A=x
AND DAY(C) = y [or some variation]
{A and C are part of a composite index}

doesn't work in HBase.

If your key was ACB, meaning that Vehicle_id, timestamp, device_id  was the composite key, then you could do a start/stop range scan using A and C.

Sorry if I'm missing something since I jumped in the middle of a discussion.

-Mike


> Subject: RE: Something like Execution Plan as in the RDBMS world?
> Date: Thu, 4 Aug 2011 12:57:12 +0200
> From: Thomas.Steinmaurer@scch.at
> To: user@hbase.apache.org; apurtell@apache.org
> 
> Hi Andy and Ted!
> 
> Thanks for your reply. Basically, I'm currently trying a range scan and a regex row filter on a very small table (~ 115K rows), just to get used to. Hadoop/HBase ... is running in the available Cloudera VM.
> 
> I have the following row key, as already discussed in other threads.
> 
> vehicle_id: up to 16 characters
> device_id: up to 16 characters
> timestamp: YYYYMMDDhhmmss
> 
> Pretty much one row every 5 minutes for a particular vehicle and device.
> 
> Now I want to get the rows for an entire day for a particular vehicle and device.
> 
> The following range scan implementation:
> 
> 	Scan scan = new Scan();
> 
> 	String startKey =
> 		String.format(HBASE_ROWKEY_DATASOURCEID_FORMAT, "57").replace(' ', '0') // Vehicle ID
> 		+ "-"
> 		+ String.format(HBASE_ROWKEY_DATASOURCEID_FORMAT, "1").replace(' ', '0') // Device ID
> 		+ "-"
> 		+ "20110808000000";
> 	String endKey =
> 		String.format(HBASE_ROWKEY_DATASOURCEID_FORMAT, "57").replace(' ', '0') // Vehicle ID
> 		+ "-"
> 		+ String.format(HBASE_ROWKEY_DATASOURCEID_FORMAT, "1").replace(' ', '0') // Device ID
> 		+ "-"
> 		+ "20110808235959";
> 	scan.setStartRow(Bytes.toBytes(startKey));
> 	scan.setStopRow(Bytes.toBytes(endKey));
> 	scan.addColumn(Bytes.toBytes("data_details"), Bytes.toBytes("temperature1_value"));
> 
> Takes < 1 sec.
> 
> Whereas the following regex based row filter implementation:
> 
> 	List<Filter> filters = new ArrayList<Filter>();
> 	RowFilter rf = new RowFilter(
> 		CompareFilter.CompareOp.EQUAL
> 		, new RegexStringComparator(".{14}57\\-.{15}1\\-20110808.{6}")
> 	);
> 	filters.add(rf);
> 	
> 	QualifierFilter qf = new QualifierFilter(
> 		CompareFilter.CompareOp.EQUAL
> 		, new RegexStringComparator("temperature1_value")
> 	);
> 	filters.add(qf);
> 	
> 	FilterList filterList1 = new FilterList(filters);
> 	scan.setFilter(filterList1);
> 
> 
> Takes around 6 sec on a very small table.
> 
> 
> We aren't sure if we need the regex row filter capabilities at all or if range scans are sufficient for our access pattern. But a better understanding on how to optimize regex stuff would be helpful.
> 
> 
> Thanks!
> 
> Thomas
> 
> 
> -----Original Message-----
> From: Andrew Purtell [mailto:apurtell@apache.org] 
> Sent: Mittwoch, 27. Juli 2011 08:25
> To: user@hbase.apache.org
> Subject: Re: Something like Execution Plan as in the RDBMS world?
> 
> > Or is this a complete different thinking?
> 
> Yes.
> 
> There isn't an "execution plan" when using HBase, as that term is commonly understood from RDBMS systems. The commands you issue against HBase using the client API are executed in order as you issue them.
> 
> > Depending on the access pattern, we might be in a situation to use 
> >e.g. RegEx filters on rowkeys. I wonder if there is some kind of an 
> >execution plan when running a HBase query to better understand
> 
> Exposing filter statistics (hit/skip ratio etc.) and other per-query metrics like number of store files read, how many keys examined, etc. is an interesting idea perhaps along the lines of what you ask, but HBase does not have support for that level of query performance introspection at the moment. 
> 
> What people do is measure the application metrics of interest and try different approaches to optimize them.
> 
> Best regards,
> 
> 
>    - Andy
> 
> Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
> 
> 
> >________________________________
> >From: Steinmaurer Thomas <Th...@scch.at>
> >To: user@hbase.apache.org
> >Sent: Tuesday, July 26, 2011 11:10 PM
> >Subject: Something like Execution Plan as in the RDBMS world?
> >
> >Hello,
> >
> >
> >
> >we have a three part row-key taking into account that the first part is 
> >important for distribution/partitioning when the system grows. 
> >Depending on the access pattern, we might be in a situation to use e.g. 
> >RegEx filters on rowkeys. I wonder if there is some kind of an 
> >execution plan (as known in RDBMS) when running a HBase query to better 
> >understand how HBase processes the query and what execution path it 
> >takes to generate the result set.
> >
> >
> >
> >Or is this a complete different thinking?
> >
> >
> >
> >Thanks,
> >
> >Thomas
> >
> >
> >
> >
> >
> >