Posted to dev@predictionio.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2016/10/14 01:10:00 UTC

clashing hbase queries

The DAG for a template just happens to schedule 2 tasks that do something like this:

val fieldsRDD: RDD[(ItemID, PropertyMap)] = PEventStore.aggregateProperties(
  appName = dsp.appName,
  entityType = "item")(sc)

to execute in parallel

The PEventStore calls from the two separate closures start hitting HBase at the same time and the job fails, no matter how high I set the RPC and scanner timeouts.

This has only come up recently after some restructuring, which I assume caused the two tasks to end up at the same point in the DAG. Is there a way to force one HBase-related task to complete before the other is started? Both calls return RDDs, which are lazily evaluated, like promises, until the data is needed. Can I force the promise to be kept?
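
The closest thing I can think of, just a sketch and assuming it's acceptable to cache the first result, is to persist the first RDD and count() it so its scan runs to completion before the second HBase-backed RDD is even built:

import org.apache.spark.storage.StorageLevel

val fieldsRDD: RDD[(ItemID, PropertyMap)] = PEventStore.aggregateProperties(
  appName = dsp.appName,
  entityType = "item")(sc)
  .persist(StorageLevel.MEMORY_AND_DISK) // keep the rows in Spark after the scan

// count() is an action, so the HBase scan runs right now and blocks here;
// any HBase query built after this point starts against an idle cluster
fieldsRDD.count()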


Re: clashing hbase queries

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Status: clashing HBase queries were indeed the problem. I went back to an older version of the template, which does not schedule the two tasks in parallel, and things work fine. That only makes sense, but the DAG is a mysterious thing, so we need a way to serialize HBase access and keep the DAG from being constructed in a pathological way.
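
The workaround I have in mind is only a sketch, and it assumes a checkpoint directory is available (the path below is made up): cache, checkpoint, and count the first RDD so its lineage is cut and nothing ever recomputes it against HBase later.

import org.apache.spark.storage.StorageLevel

sc.setCheckpointDir("hdfs:///tmp/pio-checkpoints") // hypothetical path

val fieldsRDD = PEventStore.aggregateProperties(
  appName = dsp.appName,
  entityType = "item")(sc)

// cache first so the checkpoint write reads from the cache instead of
// triggering a second HBase scan during the checkpoint job
fieldsRDD.persist(StorageLevel.MEMORY_AND_DISK)
fieldsRDD.checkpoint()
fieldsRDD.count() // the action: scan runs here, checkpoint is written after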


On Oct 15, 2016, at 3:14 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

I may have been on the wrong track with the two-parallel-tasks idea, which is a problem in itself. The typical pattern with Spark is to get all the data out of HBase and work on it as RDDs, but getting it out may cause parallel tasks to hit HBase at the same time. There must be a way to serialize the execution of this kind of task; I just haven't run into it. Maybe access the count of the RDD's items immediately to force the HBase queries to happen right away. But...

Looking a little deeper, the query below should return nothing, since there are no “item” objects. Yet it seems to be causing a full DB scan, which I suspect makes it a PIO bug, not an HBase thing. It is also easy to test and work around once I have access to the big fat cluster it was happening on.

On Oct 13, 2016, at 6:52 PM, Andrew Purtell <an...@gmail.com> wrote:

This sounds like hotspotting. Ideally the workload over the keyspace can be better distributed, which is another avenue of attack: partitioning, keying strategy.


> On Oct 13, 2016, at 6:10 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
> 
> The DAG for a template just happens to schedule 2 tasks that do something like this:
> 
> val fieldsRDD: RDD[(ItemID, PropertyMap)] = PEventStore.aggregateProperties(
>   appName = dsp.appName,
>   entityType = "item")(sc)
> 
> to execute in parallel
> 
> The PEventStore calls from the two separate closures start hitting HBase at the same time and the job fails, no matter how high I set the RPC and scanner timeouts.
> 
> This has only come up recently after some restructuring, which I assume caused the two tasks to end up at the same point in the DAG. Is there a way to force one HBase-related task to complete before the other is started? Both calls return RDDs, which are lazily evaluated, like promises, until the data is needed. Can I force the promise to be kept?
> 



Re: clashing hbase queries

Posted by Pat Ferrel <pa...@occamsmachete.com>.
I may have been on the wrong track with the two-parallel-tasks idea, which is a problem in itself. The typical pattern with Spark is to get all the data out of HBase and work on it as RDDs, but getting it out may cause parallel tasks to hit HBase at the same time. There must be a way to serialize the execution of this kind of task; I just haven't run into it. Maybe access the count of the RDD's items immediately to force the HBase queries to happen right away. But...

Looking a little deeper, the query below should return nothing, since there are no “item” objects. Yet it seems to be causing a full DB scan, which I suspect makes it a PIO bug, not an HBase thing. It is also easy to test and work around once I have access to the big fat cluster it was happening on.
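
The test itself should be trivial to write, something like this (timing is the tell; the names are as in the quoted snippet below):

val t0 = System.currentTimeMillis()
val n = PEventStore.aggregateProperties(
  appName = dsp.appName,
  entityType = "item")(sc).count()
println(s"$n items in ${System.currentTimeMillis() - t0} ms")
// near-instant => only "item" rows were touched;
// full-scan-scale time => the PIO bug suspicion holds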

On Oct 13, 2016, at 6:52 PM, Andrew Purtell <an...@gmail.com> wrote:

This sounds like hotspotting. Ideally the workload over the keyspace can be better distributed, which is another avenue of attack: partitioning, keying strategy.


> On Oct 13, 2016, at 6:10 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
> 
> The DAG for a template just happens to schedule 2 tasks that do something like this:
> 
> val fieldsRDD: RDD[(ItemID, PropertyMap)] = PEventStore.aggregateProperties(
>   appName = dsp.appName,
>   entityType = "item")(sc)
> 
> to execute in parallel
> 
> The PEventStore calls from the two separate closures start hitting HBase at the same time and the job fails, no matter how high I set the RPC and scanner timeouts.
> 
> This has only come up recently after some restructuring, which I assume caused the two tasks to end up at the same point in the DAG. Is there a way to force one HBase-related task to complete before the other is started? Both calls return RDDs, which are lazily evaluated, like promises, until the data is needed. Can I force the promise to be kept?
> 


Re: clashing hbase queries

Posted by Andrew Purtell <an...@gmail.com>.
This sounds like hotspotting. Ideally the workload over the keyspace can be better distributed, which is another avenue of attack: partitioning, keying strategy.
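
For example, and purely as an illustrative sketch (not a prescription for PIO's actual schema), salting the row key with a small hash-bucket prefix spreads otherwise-sequential keys across region servers:

val numBuckets = 16 // illustrative; typically matched to the pre-split regions

// "user-123" -> "07:user-123"; a scan over one logical key range then
// fans out into numBuckets parallel range scans, one per salt prefix
def saltedKey(key: String): String =
  f"${(key.hashCode & 0x7fffffff) % numBuckets}%02d:$key"

Pre-splitting the table on the salt prefixes gives the same effect on the write path.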


> On Oct 13, 2016, at 6:10 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
> 
> The DAG for a template just happens to schedule 2 tasks that do something like this:
> 
> val fieldsRDD: RDD[(ItemID, PropertyMap)] = PEventStore.aggregateProperties(
>  appName = dsp.appName,
>  entityType = "item")(sc)
> 
> to execute in parallel
> 
> The PEventStore calls from the two separate closures start hitting HBase at the same time and the job fails, no matter how high I set the RPC and scanner timeouts.
> 
> This has only come up recently after some restructuring, which I assume caused the two tasks to end up at the same point in the DAG. Is there a way to force one HBase-related task to complete before the other is started? Both calls return RDDs, which are lazily evaluated, like promises, until the data is needed. Can I force the promise to be kept?
>