Posted to user@accumulo.apache.org by "Roberts, Geoffry [USA]" <Ro...@bah.com> on 2021/01/16 16:27:34 UTC

Q: BatchScanner and parallel (i.e. m/r style) execution

All,

Three questions all asking the same thing:

Can an Accumulo scan or batchscan run like a map/reduce job?

I have an Accumulo 2.0 cluster.

In hadoop, I can launch a map/reduce job on the name node and hadoop distributes the job over the nodes of the cluster and the job runs in parallel.

In Accumulo, I am calling the BatchScanner from some non-Java code that is first distributed across the cluster; on each node it then attaches to Accumulo and does the scan. It works on a single-node Accumulo, so far so good. I need to scale up and run it multi-node. I am concerned that I'll wind up running the same scan on each node, which would return an array of identical result sets. Am I correct?

Can I somehow get the Hadoop m/r effect in accumulo?

Thanks

Geoffry Roberts
Lead Technologist
702.290.9098
roberts_geoffry@bah.com

Booz | Allen | Hamilton
BoozAllen.com

Re: [External] Re: Q: BatchScanner and parallel (i.e. m/r style) execution

Posted by "Roberts, Geoffry [USA]" <Ro...@bah.com>.

Sweet

Thanks

Re: [External] Re: Q: BatchScanner and parallel (i.e. m/r style) execution

Posted by Christopher <ct...@apache.org>.

Not to all servers, just to those hosting data in that range. But
otherwise, yes.
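
As a rough illustration of why only some tablet servers are contacted: a table is divided into tablets at its split points, and each tablet covers the rows after the previous split up to and including its own split. The sketch below is a simplified model in plain Java, not Accumulo code; the class TabletOverlap, its methods, and the example splits "g", "n", "t" are all hypothetical. It finds which tablets a row range touches; only the servers hosting those tablets would need to be queried.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model: tablet i covers rows in (splits[i-1], splits[i]],
// and the last tablet is unbounded above.
public class TabletOverlap {

    // Index of the tablet that holds the given row.
    public static int tabletIndex(List<String> sortedSplits, String row) {
        int pos = Collections.binarySearch(sortedSplits, row);
        // Found: the row equals a split point, which is the inclusive end of that tablet.
        // Not found: binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos : -(pos + 1);
    }

    // Indexes of all tablets overlapped by the inclusive row range [startRow, endRow].
    public static List<Integer> overlappingTablets(List<String> sortedSplits,
                                                   String startRow, String endRow) {
        List<Integer> hit = new ArrayList<>();
        int first = tabletIndex(sortedSplits, startRow);
        int last = tabletIndex(sortedSplits, endRow);
        for (int i = first; i <= last; i++) {
            hit.add(i);
        }
        return hit;
    }

    public static void main(String[] args) {
        // Three splits give four tablets: (-inf,g], (g,n], (n,t], (t,+inf)
        List<String> splits = Arrays.asList("g", "n", "t");
        System.out.println(overlappingTablets(splits, "h", "p")); // prints [1, 2]
    }
}
```

In this model a scan of rows "h" through "p" touches tablets 1 and 2 only, so the servers hosting tablets 0 and 3 are never asked.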

On Sat, Jan 16, 2021 at 1:45 PM Roberts, Geoffry [USA] <
Roberts_Geoffry@bah.com> wrote:

> If I have a batch scanner with one large range that spans several
> tservers, will Accumulo distribute it to all of those tservers, process it
> in parallel, and give me back a single result set?

Re: [External] Re: Q: BatchScanner and parallel (i.e. m/r style) execution

Posted by "Roberts, Geoffry [USA]" <Ro...@bah.com>.

If I have a batch scanner with one large range that spans several tservers, will Accumulo distribute it to all of those tservers, process it in parallel, and give me back a single result set?

Re: Q: BatchScanner and parallel (i.e. m/r style) execution

Posted by Christopher <ct...@apache.org>.

A BatchScanner takes multiple ranges, groups them by TServer, and then
queries TServers in parallel for the ranges that are located in each,
returning data in its iterator as it comes back (without regard to order).
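
The group-then-fan-out pattern described above can be modeled in plain Java. This is only a toy sketch of the idea, not how Accumulo implements it: the class BatchScanModel, the range labels, and the server names "ts1" and "ts2" are hypothetical, and the "scan" just tags each range with its server. It groups ranges by server, runs one task per server in parallel, and collects results in whatever order the tasks finish.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Toy model only; all names here are hypothetical.
public class BatchScanModel {

    // Groups each range label under the server assumed to host it.
    public static Map<String, List<String>> groupByServer(Map<String, String> rangeToServer) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> e : rangeToServer.entrySet()) {
            grouped.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        return grouped;
    }

    // Runs one task per server in parallel and gathers results in completion order.
    public static List<String> scanAll(Map<String, List<String>> grouped) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, grouped.size()));
        try {
            CompletionService<List<String>> done = new ExecutorCompletionService<>(pool);
            for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
                final String server = e.getKey();
                final List<String> ranges = e.getValue();
                done.submit(() -> {
                    List<String> out = new ArrayList<>();
                    for (String r : ranges) {
                        out.add(server + ":" + r); // stand-in for data read from that range
                    }
                    return out;
                });
            }
            List<String> results = new ArrayList<>();
            for (int i = 0; i < grouped.size(); i++) {
                results.addAll(done.take().get()); // whichever server finishes first
            }
            return results;
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(ex);
        } catch (ExecutionException ex) {
            throw new IllegalStateException(ex);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Map<String, String> rangeToServer = new LinkedHashMap<>();
        rangeToServer.put("a-f", "ts1");
        rangeToServer.put("g-m", "ts1");
        rangeToServer.put("n-z", "ts2");
        // The order of the printed results may vary from run to run.
        System.out.println(scanAll(groupByServer(rangeToServer)));
    }
}
```

The point of the model is that the caller sees a single stream of results, but must not rely on its order.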

If you run the same scan on multiple nodes, the task won't be
sub-divided in any way... it will just be multiple nodes querying for the
same thing. If you want, you can sub-divide your ranges in your client
code, distribute those ranges to different nodes, and have each node scan
only its designated range. You probably wouldn't use a BatchScanner for
that. A regular Scanner would suffice. This is how AccumuloInputFormat
works, implemented for both Hadoop's "mapred" and "mapreduce" APIs.
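
One way to sub-divide ranges in client code, sketched in plain Java under stated assumptions: cut the table at its split points into contiguous, non-overlapping row ranges, then hand them out round-robin so each node scans only its share. The class RangePartitioner and its RowRange type are hypothetical stand-ins (a real client would build Accumulo Range objects and give each one to a Scanner), and the split points are assumed to be already fetched and sorted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; not part of Accumulo.
public class RangePartitioner {

    // A row range with exclusive start and inclusive end; null means unbounded.
    public static final class RowRange {
        public final String startExclusive;
        public final String endInclusive;

        public RowRange(String startExclusive, String endInclusive) {
            this.startExclusive = startExclusive;
            this.endInclusive = endInclusive;
        }

        @Override
        public String toString() {
            return "(" + (startExclusive == null ? "-inf" : startExclusive)
                + ", " + (endInclusive == null ? "+inf" : endInclusive) + "]";
        }
    }

    // Turns sorted split points into ranges that together cover the whole table.
    public static List<RowRange> rangesFromSplits(List<String> sortedSplits) {
        List<RowRange> ranges = new ArrayList<>();
        String previous = null;
        for (String split : sortedSplits) {
            ranges.add(new RowRange(previous, split));
            previous = split;
        }
        ranges.add(new RowRange(previous, null)); // tail range after the last split
        return ranges;
    }

    // Round-robin assignment: worker i takes ranges i, i + n, i + 2n, ...
    public static List<RowRange> rangesForWorker(List<RowRange> all,
                                                 int workerIndex, int workerCount) {
        List<RowRange> mine = new ArrayList<>();
        for (int i = workerIndex; i < all.size(); i += workerCount) {
            mine.add(all.get(i));
        }
        return mine;
    }

    public static void main(String[] args) {
        List<RowRange> all = rangesFromSplits(Arrays.asList("g", "n", "t"));
        for (int worker = 0; worker < 2; worker++) {
            System.out.println("worker " + worker + " -> " + rangesForWorker(all, worker, 2));
        }
    }
}
```

Because the ranges do not overlap and every range goes to exactly one worker, the nodes together read each row once instead of each repeating the full scan.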

See more in the Javadocs:

https://accumulo.apache.org/docs/2.x/apidocs/org/apache/accumulo/core/client/BatchScanner.html
https://accumulo.apache.org/docs/2.x/apidocs/org/apache/accumulo/hadoop/mapred/AccumuloInputFormat.html
https://accumulo.apache.org/docs/2.x/apidocs/org/apache/accumulo/hadoop/mapreduce/AccumuloInputFormat.html
https://accumulo.apache.org/docs/2.x/apidocs/org/apache/accumulo/core/client/mapred/AccumuloInputFormat.html
https://accumulo.apache.org/docs/2.x/apidocs/org/apache/accumulo/core/client/mapreduce/AccumuloInputFormat.html
