You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hbase.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2016/10/28 15:38:11 UTC

Scanner timeouts

I’m getting data from HBase using a large Spark cluster with parallelism of near 400. The query fails quire often with the message below. Sometimes a retry will work and sometimes the ultimate failure results (below). 

If I reduce parallelism in Spark it slows other parts of the algorithm unacceptably. I have also experimented with very large RPC/Scanner timeouts of many minutes—to no avail.

Any clues about what to look for or what may be setup wrong in my tables?

Job aborted due to stage failure: Task 44 in stage 147.0 failed 4 times, most recent failure: Lost task 44.3 in stage 147.0 (TID 24833, ip-172-16-3-9.eu-central-1.compute.internal): org.apache.hadoop.hbase.DoNotRetryIOException: Failed after retry of OutOfOrderScannerNextException: was there a rpc timeout?+details
Job aborted due to stage failure: Task 44 in stage 147.0 failed 4 times, most recent failure: Lost task 44.3 in stage 147.0 (TID 24833, ip-172-16-3-9.eu-central-1.compute.internal): org.apache.hadoop.hbase.DoNotRetryIOException: Failed after retry of OutOfOrderScannerNextException: was there a rpc timeout? at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:403) at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.nextKeyValue(TableRecordReaderImpl.java:232) at org.apache.hadoop.hbase.mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:138) at

Re: Scanner timeouts

Posted by Ted Yu <yu...@gmail.com>.

Which release of hbase are you using ?

Since the query to hbase comes from Spark, I assume there is no hbase
Filter involved.
So HBASE-13704 wouldn't be applicable in your case.

Can you pastebin region server log(s) around the OutOfOrderScannerNextException
?

Thanks

On Fri, Oct 28, 2016 at 8:38 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> I’m getting data from HBase using a large Spark cluster with parallelism
> of near 400. The query fails quire often with the message below. Sometimes
> a retry will work and sometimes the ultimate failure results (below).
>
> If I reduce parallelism in Spark it slows other parts of the algorithm
> unacceptably. I have also experimented with very large RPC/Scanner timeouts
> of many minutes—to no avail.
>
> Any clues about what to look for or what may be setup wrong in my tables?
>
> Job aborted due to stage failure: Task 44 in stage 147.0 failed 4 times,
> most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
> ip-172-16-3-9.eu-central-1.compute.internal): org.apache.hadoop.hbase.DoNotRetryIOException:
> Failed after retry of OutOfOrderScannerNextException: was there a rpc
> timeout?+details
> Job aborted due to stage failure: Task 44 in stage 147.0 failed 4 times,
> most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
> ip-172-16-3-9.eu-central-1.compute.internal): org.apache.hadoop.hbase.DoNotRetryIOException:
> Failed after retry of OutOfOrderScannerNextException: was there a rpc
> timeout? at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:403)
> at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.nextKeyValue(
> TableRecordReaderImpl.java:232) at org.apache.hadoop.hbase.
> mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:138) at
>

Re: Scanner timeouts

Posted by Mich Talebzadeh <mi...@gmail.com>.

thanks Ted I am aware of that issue of Spark 2.0.1  not handling
connections to Phoenix. For now I use Spark 2.0.1 on Hbase directly or
Spark 2.0.1 on Hbase through Hive external tables.

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 28 October 2016 at 22:58, Ted Yu <yu...@gmail.com> wrote:

> That's another way of using hbase.
>
> Watch out for PHOENIX-3333
> <https://issues.apache.org/jira/browse/PHOENIX-3333> if you're running
> queries with Spark 2.0
>
> On Fri, Oct 28, 2016 at 2:38 PM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com>
> wrote:
>
> > Hbase does not have indexes but Phoenix will allow one to create
> secondary
> > indexes on Hbase. The index structure will be created on Hbase itself and
> > you can maintain it from Phoenix.
> >
> > HTH
> >
> >
> >
> >
> >
> > Dr Mich Talebzadeh
> >
> >
> >
> > LinkedIn * https://www.linkedin.com/profile/view?id=
> > AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> > <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCd
> > OABUrV8Pw>*
> >
> >
> >
> > http://talebzadehmich.wordpress.com
> >
> >
> > *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> > loss, damage or destruction of data or any other property which may arise
> > from relying on this email's technical content is explicitly disclaimed.
> > The author will in no case be liable for any monetary damages arising
> from
> > such loss, damage or destruction.
> >
> >
> >
> > On 28 October 2016 at 19:29, Ted Yu <yu...@gmail.com> wrote:
> >
> > > bq. with 400 threads hitting HBase at the same time
> > >
> > > How many regions are serving the 400 threads ?
> > > How many region servers do you have ?
> > >
> > > If the regions are spread relatively evenly across the cluster, the
> above
> > > may not be big issue.
> > >
> > > On Fri, Oct 28, 2016 at 11:21 AM, Pat Ferrel <pa...@occamsmachete.com>
> > > wrote:
> > >
> > > > Ok, will do.
> > > >
> > > > So the scanner does not indicate of itself that I’ve missed something
> > in
> > > > handling the data. If not index, then made a fast lookup “key”? I ask
> > > > because the timeout change may work but not be the optimal solution.
> > The
> > > > stage that fails is very long compared to other stages. And with 400
> > > > threads hitting HBase at the same time, this seems like something I
> may
> > > > need to restructure and any advice about that would be welcome.
> > > >
> > > > HBase is 1.2.3
> > > >
> > > >
> > > > On Oct 28, 2016, at 10:36 AM, Ted Yu <yu...@gmail.com> wrote:
> > > >
> > > > For your first question, you need to pass hbase-site.xml which has
> > config
> > > > parameters affecting client operations to Spark  executors.
> > > >
> > > > bq. missed indexing some column
> > > >
> > > > hbase doesn't have indexing (as in the sense of traditional RDBMS).
> > > >
> > > > Let's see what happens after hbase-site.xml is passed to executors.
> > > >
> > > > BTW Can you tell us the release of hbase you're using ?
> > > >
> > > >
> > > >
> > > > On Fri, Oct 28, 2016 at 10:22 AM, Pat Ferrel <pa...@occamsmachete.com>
> > > > wrote:
> > > >
> > > > > So to clarify there are some values in hbase/conf/hbase-site.xml
> that
> > > are
> > > > > needed by the calling code in the Spark driver and executors and so
> > > must
> > > > be
> > > > > passed using --files to spark-submit? If so I can do this.
> > > > >
> > > > > But do I have a deeper issue? Is it typical to need a scan like
> this?
> > > > Have
> > > > > I missed indexing some column maybe?
> > > > >
> > > > >
> > > > > On Oct 28, 2016, at 9:59 AM, Ted Yu <yu...@gmail.com> wrote:
> > > > >
> > > > > Mich:
> > > > > bq. on table 'hbase:meta' *at region=hbase:meta,,1.1588230740
> > > > >
> > > > > What you observed was different issue.
> > > > > The above looks like trouble with locating region(s) during scan.
> > > > >
> > > > > On Fri, Oct 28, 2016 at 9:54 AM, Mich Talebzadeh <
> > > > > mich.talebzadeh@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> This is an example I got
> > > > >>
> > > > >> warning: there were two deprecation warnings; re-run with
> > -deprecation
> > > > > for
> > > > >> details
> > > > >> rdd1: org.apache.spark.rdd.RDD[(String, String)] =
> > > MapPartitionsRDD[77]
> > > > > at
> > > > >> map at <console>:151
> > > > >> defined class columns
> > > > >> dfTICKER: org.apache.spark.sql.Dataset[columns] = [KEY: string,
> > > TICKER:
> > > > >> string]
> > > > >> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
> > > after
> > > > >> attempts=36, exceptions:
> > > > >> *Fri Oct 28 13:13:46 BST 2016, null, java.net.
> > SocketTimeoutException:
> > > > >> callTimeout=60000, callDuration=68411: row
> > > > >> 'MARKETDATAHBASE,,00000000000000' on table 'hbase:meta' *at
> > > > >> region=hbase:meta,,1.1588230740, hostname=rhes564,16201,
> > > 1477246132044,
> > > > >> seqNum=0
> > > > >> at
> > > > >> org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadRepli
> > > > >> cas.throwEnrichedException(RpcRetryingCallerWithReadRepli
> > > cas.java:276)
> > > > >> at
> > > > >> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(
> > > > >> ScannerCallableWithReplicas.java:210)
> > > > >> at
> > > > >> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(
> > > > >> ScannerCallableWithReplicas.java:60)
> > > > >> at
> > > > >> org.apache.hadoop.hbase.client.RpcRetryingCaller.
> > callWithoutRetries(
> > > > >> RpcRetryingCaller.java:210)
> > > > >>
> > > > >>
> > > > >>
> > > > >> Dr Mich Talebzadeh
> > > > >>
> > > > >>
> > > > >>
> > > > >> LinkedIn * https://www.linkedin.com/profile/view?id=
> > > > >> AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> > > > >> <https://www.linkedin.com/profile/view?id=
> > > > AAEAAAAWh2gBxianrbJd6zP6AcPCCd
> > > > >> OABUrV8Pw>*
> > > > >>
> > > > >>
> > > > >>
> > > > >> http://talebzadehmich.wordpress.com
> > > > >>
> > > > >>
> > > > >> *Disclaimer:* Use it at your own risk. Any and all responsibility
> > for
> > > > any
> > > > >> loss, damage or destruction of data or any other property which
> may
> > > > arise
> > > > >> from relying on this email's technical content is explicitly
> > > disclaimed.
> > > > >> The author will in no case be liable for any monetary damages
> > arising
> > > > > from
> > > > >> such loss, damage or destruction.
> > > > >>
> > > > >>
> > > > >>
> > > > >> On 28 October 2016 at 17:52, Pat Ferrel <pa...@occamsmachete.com>
> > > wrote:
> > > > >>
> > > > >>> I will check that, but if that is a server startup thing I was
> not
> > > > aware
> > > > >> I
> > > > >>> had to send it to the executors. So it’s like a connection
> timeout
> > > from
> > > > >>> executor code?
> > > > >>>
> > > > >>>
> > > > >>> On Oct 28, 2016, at 9:48 AM, Ted Yu <yu...@gmail.com> wrote:
> > > > >>>
> > > > >>> How did you change the timeout(s) ?
> > > > >>>
> > > > >>> bq. timeout is currently set to 60000
> > > > >>>
> > > > >>> Did you pass hbase-site.xml using --files to Spark job ?
> > > > >>>
> > > > >>> Cheers
> > > > >>>
> > > > >>> On Fri, Oct 28, 2016 at 9:27 AM, Pat Ferrel <
> pat@occamsmachete.com
> > >
> > > > >> wrote:
> > > > >>>
> > > > >>>> Using standalone Spark. I don’t recall seeing connection lost
> > > errors,
> > > > >> but
> > > > >>>> there are lots of logs. I’ve set the scanner and RPC timeouts to
> > > large
> > > > >>>> numbers on the servers.
> > > > >>>>
> > > > >>>> But I also saw in the logs:
> > > > >>>>
> > > > >>>>  org.apache.hadoop.hbase.client.ScannerTimeoutException:
> 381788ms
> > > > >>>> passed since the last invocation, timeout is currently set to
> > 60000
> > > > >>>>
> > > > >>>> Not sure where that is coming from. Does the driver machine
> making
> > > > >>> queries
> > > > >>>> need to have the timeout config also?
> > > > >>>>
> > > > >>>> And why so large, am I doing something wrong?
> > > > >>>>
> > > > >>>>
> > > > >>>> On Oct 28, 2016, at 8:50 AM, Ted Yu <yu...@gmail.com>
> wrote:
> > > > >>>>
> > > > >>>> Mich:
> > > > >>>> The OutOfOrderScannerNextException indicated problem with read
> > from
> > > > >>> hbase.
> > > > >>>>
> > > > >>>> How did you know connection to Spark cluster was lost ?
> > > > >>>>
> > > > >>>> Cheers
> > > > >>>>
> > > > >>>> On Fri, Oct 28, 2016 at 8:47 AM, Mich Talebzadeh <
> > > > >>>> mich.talebzadeh@gmail.com>
> > > > >>>> wrote:
> > > > >>>>
> > > > >>>>> Looks like it lost the connection to Spark cluster.
> > > > >>>>>
> > > > >>>>> What mode you are using with Spark, Standalone, Yarn or others.
> > The
> > > > >>> issue
> > > > >>>>> looks like a resource manager issue.
> > > > >>>>>
> > > > >>>>> I have seen this when running Zeppelin with Spark on Hbase.
> > > > >>>>>
> > > > >>>>> HTH
> > > > >>>>>
> > > > >>>>> Dr Mich Talebzadeh
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> LinkedIn * https://www.linkedin.com/profile/view?id=
> > > > >>>>> AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> > > > >>>>> <https://www.linkedin.com/profile/view?id=
> > > > >>> AAEAAAAWh2gBxianrbJd6zP6AcPCCd
> > > > >>>>> OABUrV8Pw>*
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> http://talebzadehmich.wordpress.com
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> *Disclaimer:* Use it at your own risk. Any and all
> responsibility
> > > for
> > > > >>> any
> > > > >>>>> loss, damage or destruction of data or any other property which
> > may
> > > > >>> arise
> > > > >>>>> from relying on this email's technical content is explicitly
> > > > >> disclaimed.
> > > > >>>>> The author will in no case be liable for any monetary damages
> > > arising
> > > > >>>> from
> > > > >>>>> such loss, damage or destruction.
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> On 28 October 2016 at 16:38, Pat Ferrel <pat@occamsmachete.com
> >
> > > > >> wrote:
> > > > >>>>>
> > > > >>>>>> I’m getting data from HBase using a large Spark cluster with
> > > > >>> parallelism
> > > > >>>>>> of near 400. The query fails quire often with the message
> below.
> > > > >>>>> Sometimes
> > > > >>>>>> a retry will work and sometimes the ultimate failure results
> > > > (below).
> > > > >>>>>>
> > > > >>>>>> If I reduce parallelism in Spark it slows other parts of the
> > > > >> algorithm
> > > > >>>>>> unacceptably. I have also experimented with very large
> > RPC/Scanner
> > > > >>>>> timeouts
> > > > >>>>>> of many minutes—to no avail.
> > > > >>>>>>
> > > > >>>>>> Any clues about what to look for or what may be setup wrong in
> > my
> > > > >>>> tables?
> > > > >>>>>>
> > > > >>>>>> Job aborted due to stage failure: Task 44 in stage 147.0
> failed
> > 4
> > > > >>> times,
> > > > >>>>>> most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
> > > > >>>>>> ip-172-16-3-9.eu-central-1.compute.internal):
> > > > >> org.apache.hadoop.hbase.
> > > > >>>>> DoNotRetryIOException:
> > > > >>>>>> Failed after retry of OutOfOrderScannerNextException: was
> > there a
> > > > >> rpc
> > > > >>>>>> timeout?+details
> > > > >>>>>> Job aborted due to stage failure: Task 44 in stage 147.0
> failed
> > 4
> > > > >>> times,
> > > > >>>>>> most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
> > > > >>>>>> ip-172-16-3-9.eu-central-1.compute.internal):
> > > > >> org.apache.hadoop.hbase.
> > > > >>>>> DoNotRetryIOException:
> > > > >>>>>> Failed after retry of OutOfOrderScannerNextException: was
> > there a
> > > > >> rpc
> > > > >>>>>> timeout? at org.apache.hadoop.hbase.
> client.ClientScanner.next(
> > > > >>>>> ClientScanner.java:403)
> > > > >>>>>> at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.
> > > > >>>> nextKeyValue(
> > > > >>>>>> TableRecordReaderImpl.java:232) at org.apache.hadoop.hbase.
> > > > >>>>>> mapreduce.TableRecordReader.nextKeyValue(
> > > > TableRecordReader.java:138)
> > > > >>> at
> > > > >>>>>>
> > > > >>>>>
> > > > >>>>
> > > > >>>>
> > > > >>>
> > > > >>>
> > > > >>
> > > > >
> > > > >
> > > >
> > > >
> > >
> >
>

Re: Scanner timeouts

Posted by Ted Yu <yu...@gmail.com>.

That's another way of using hbase.

Watch out for PHOENIX-3333
<https://issues.apache.org/jira/browse/PHOENIX-3333> if you're running
queries with Spark 2.0

On Fri, Oct 28, 2016 at 2:38 PM, Mich Talebzadeh <mi...@gmail.com>
wrote:

> Hbase does not have indexes but Phoenix will allow one to create secondary
> indexes on Hbase. The index structure will be created on Hbase itself and
> you can maintain it from Phoenix.
>
> HTH
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=
> AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCd
> OABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 28 October 2016 at 19:29, Ted Yu <yu...@gmail.com> wrote:
>
> > bq. with 400 threads hitting HBase at the same time
> >
> > How many regions are serving the 400 threads ?
> > How many region servers do you have ?
> >
> > If the regions are spread relatively evenly across the cluster, the above
> > may not be big issue.
> >
> > On Fri, Oct 28, 2016 at 11:21 AM, Pat Ferrel <pa...@occamsmachete.com>
> > wrote:
> >
> > > Ok, will do.
> > >
> > > So the scanner does not indicate of itself that I’ve missed something
> in
> > > handling the data. If not index, then made a fast lookup “key”? I ask
> > > because the timeout change may work but not be the optimal solution.
> The
> > > stage that fails is very long compared to other stages. And with 400
> > > threads hitting HBase at the same time, this seems like something I may
> > > need to restructure and any advice about that would be welcome.
> > >
> > > HBase is 1.2.3
> > >
> > >
> > > On Oct 28, 2016, at 10:36 AM, Ted Yu <yu...@gmail.com> wrote:
> > >
> > > For your first question, you need to pass hbase-site.xml which has
> config
> > > parameters affecting client operations to Spark  executors.
> > >
> > > bq. missed indexing some column
> > >
> > > hbase doesn't have indexing (as in the sense of traditional RDBMS).
> > >
> > > Let's see what happens after hbase-site.xml is passed to executors.
> > >
> > > BTW Can you tell us the release of hbase you're using ?
> > >
> > >
> > >
> > > On Fri, Oct 28, 2016 at 10:22 AM, Pat Ferrel <pa...@occamsmachete.com>
> > > wrote:
> > >
> > > > So to clarify there are some values in hbase/conf/hbase-site.xml that
> > are
> > > > needed by the calling code in the Spark driver and executors and so
> > must
> > > be
> > > > passed using --files to spark-submit? If so I can do this.
> > > >
> > > > But do I have a deeper issue? Is it typical to need a scan like this?
> > > Have
> > > > I missed indexing some column maybe?
> > > >
> > > >
> > > > On Oct 28, 2016, at 9:59 AM, Ted Yu <yu...@gmail.com> wrote:
> > > >
> > > > Mich:
> > > > bq. on table 'hbase:meta' *at region=hbase:meta,,1.1588230740
> > > >
> > > > What you observed was different issue.
> > > > The above looks like trouble with locating region(s) during scan.
> > > >
> > > > On Fri, Oct 28, 2016 at 9:54 AM, Mich Talebzadeh <
> > > > mich.talebzadeh@gmail.com>
> > > > wrote:
> > > >
> > > >> This is an example I got
> > > >>
> > > >> warning: there were two deprecation warnings; re-run with
> -deprecation
> > > > for
> > > >> details
> > > >> rdd1: org.apache.spark.rdd.RDD[(String, String)] =
> > MapPartitionsRDD[77]
> > > > at
> > > >> map at <console>:151
> > > >> defined class columns
> > > >> dfTICKER: org.apache.spark.sql.Dataset[columns] = [KEY: string,
> > TICKER:
> > > >> string]
> > > >> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
> > after
> > > >> attempts=36, exceptions:
> > > >> *Fri Oct 28 13:13:46 BST 2016, null, java.net.
> SocketTimeoutException:
> > > >> callTimeout=60000, callDuration=68411: row
> > > >> 'MARKETDATAHBASE,,00000000000000' on table 'hbase:meta' *at
> > > >> region=hbase:meta,,1.1588230740, hostname=rhes564,16201,
> > 1477246132044,
> > > >> seqNum=0
> > > >> at
> > > >> org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadRepli
> > > >> cas.throwEnrichedException(RpcRetryingCallerWithReadRepli
> > cas.java:276)
> > > >> at
> > > >> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(
> > > >> ScannerCallableWithReplicas.java:210)
> > > >> at
> > > >> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(
> > > >> ScannerCallableWithReplicas.java:60)
> > > >> at
> > > >> org.apache.hadoop.hbase.client.RpcRetryingCaller.
> callWithoutRetries(
> > > >> RpcRetryingCaller.java:210)
> > > >>
> > > >>
> > > >>
> > > >> Dr Mich Talebzadeh
> > > >>
> > > >>
> > > >>
> > > >> LinkedIn * https://www.linkedin.com/profile/view?id=
> > > >> AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> > > >> <https://www.linkedin.com/profile/view?id=
> > > AAEAAAAWh2gBxianrbJd6zP6AcPCCd
> > > >> OABUrV8Pw>*
> > > >>
> > > >>
> > > >>
> > > >> http://talebzadehmich.wordpress.com
> > > >>
> > > >>
> > > >> *Disclaimer:* Use it at your own risk. Any and all responsibility
> for
> > > any
> > > >> loss, damage or destruction of data or any other property which may
> > > arise
> > > >> from relying on this email's technical content is explicitly
> > disclaimed.
> > > >> The author will in no case be liable for any monetary damages
> arising
> > > > from
> > > >> such loss, damage or destruction.
> > > >>
> > > >>
> > > >>
> > > >> On 28 October 2016 at 17:52, Pat Ferrel <pa...@occamsmachete.com>
> > wrote:
> > > >>
> > > >>> I will check that, but if that is a server startup thing I was not
> > > aware
> > > >> I
> > > >>> had to send it to the executors. So it’s like a connection timeout
> > from
> > > >>> executor code?
> > > >>>
> > > >>>
> > > >>> On Oct 28, 2016, at 9:48 AM, Ted Yu <yu...@gmail.com> wrote:
> > > >>>
> > > >>> How did you change the timeout(s) ?
> > > >>>
> > > >>> bq. timeout is currently set to 60000
> > > >>>
> > > >>> Did you pass hbase-site.xml using --files to Spark job ?
> > > >>>
> > > >>> Cheers
> > > >>>
> > > >>> On Fri, Oct 28, 2016 at 9:27 AM, Pat Ferrel <pat@occamsmachete.com
> >
> > > >> wrote:
> > > >>>
> > > >>>> Using standalone Spark. I don’t recall seeing connection lost
> > errors,
> > > >> but
> > > >>>> there are lots of logs. I’ve set the scanner and RPC timeouts to
> > large
> > > >>>> numbers on the servers.
> > > >>>>
> > > >>>> But I also saw in the logs:
> > > >>>>
> > > >>>>  org.apache.hadoop.hbase.client.ScannerTimeoutException: 381788ms
> > > >>>> passed since the last invocation, timeout is currently set to
> 60000
> > > >>>>
> > > >>>> Not sure where that is coming from. Does the driver machine making
> > > >>> queries
> > > >>>> need to have the timeout config also?
> > > >>>>
> > > >>>> And why so large, am I doing something wrong?
> > > >>>>
> > > >>>>
> > > >>>> On Oct 28, 2016, at 8:50 AM, Ted Yu <yu...@gmail.com> wrote:
> > > >>>>
> > > >>>> Mich:
> > > >>>> The OutOfOrderScannerNextException indicated problem with read
> from
> > > >>> hbase.
> > > >>>>
> > > >>>> How did you know connection to Spark cluster was lost ?
> > > >>>>
> > > >>>> Cheers
> > > >>>>
> > > >>>> On Fri, Oct 28, 2016 at 8:47 AM, Mich Talebzadeh <
> > > >>>> mich.talebzadeh@gmail.com>
> > > >>>> wrote:
> > > >>>>
> > > >>>>> Looks like it lost the connection to Spark cluster.
> > > >>>>>
> > > >>>>> What mode you are using with Spark, Standalone, Yarn or others.
> The
> > > >>> issue
> > > >>>>> looks like a resource manager issue.
> > > >>>>>
> > > >>>>> I have seen this when running Zeppelin with Spark on Hbase.
> > > >>>>>
> > > >>>>> HTH
> > > >>>>>
> > > >>>>> Dr Mich Talebzadeh
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>> LinkedIn * https://www.linkedin.com/profile/view?id=
> > > >>>>> AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> > > >>>>> <https://www.linkedin.com/profile/view?id=
> > > >>> AAEAAAAWh2gBxianrbJd6zP6AcPCCd
> > > >>>>> OABUrV8Pw>*
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>> http://talebzadehmich.wordpress.com
> > > >>>>>
> > > >>>>>
> > > >>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
> > for
> > > >>> any
> > > >>>>> loss, damage or destruction of data or any other property which
> may
> > > >>> arise
> > > >>>>> from relying on this email's technical content is explicitly
> > > >> disclaimed.
> > > >>>>> The author will in no case be liable for any monetary damages
> > arising
> > > >>>> from
> > > >>>>> such loss, damage or destruction.
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>> On 28 October 2016 at 16:38, Pat Ferrel <pa...@occamsmachete.com>
> > > >> wrote:
> > > >>>>>
> > > >>>>>> I’m getting data from HBase using a large Spark cluster with
> > > >>> parallelism
> > > >>>>>> of near 400. The query fails quire often with the message below.
> > > >>>>> Sometimes
> > > >>>>>> a retry will work and sometimes the ultimate failure results
> > > (below).
> > > >>>>>>
> > > >>>>>> If I reduce parallelism in Spark it slows other parts of the
> > > >> algorithm
> > > >>>>>> unacceptably. I have also experimented with very large
> RPC/Scanner
> > > >>>>> timeouts
> > > >>>>>> of many minutes—to no avail.
> > > >>>>>>
> > > >>>>>> Any clues about what to look for or what may be setup wrong in
> my
> > > >>>> tables?
> > > >>>>>>
> > > >>>>>> Job aborted due to stage failure: Task 44 in stage 147.0 failed
> 4
> > > >>> times,
> > > >>>>>> most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
> > > >>>>>> ip-172-16-3-9.eu-central-1.compute.internal):
> > > >> org.apache.hadoop.hbase.
> > > >>>>> DoNotRetryIOException:
> > > >>>>>> Failed after retry of OutOfOrderScannerNextException: was
> there a
> > > >> rpc
> > > >>>>>> timeout?+details
> > > >>>>>> Job aborted due to stage failure: Task 44 in stage 147.0 failed
> 4
> > > >>> times,
> > > >>>>>> most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
> > > >>>>>> ip-172-16-3-9.eu-central-1.compute.internal):
> > > >> org.apache.hadoop.hbase.
> > > >>>>> DoNotRetryIOException:
> > > >>>>>> Failed after retry of OutOfOrderScannerNextException: was
> there a
> > > >> rpc
> > > >>>>>> timeout? at org.apache.hadoop.hbase.client.ClientScanner.next(
> > > >>>>> ClientScanner.java:403)
> > > >>>>>> at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.
> > > >>>> nextKeyValue(
> > > >>>>>> TableRecordReaderImpl.java:232) at org.apache.hadoop.hbase.
> > > >>>>>> mapreduce.TableRecordReader.nextKeyValue(
> > > TableRecordReader.java:138)
> > > >>> at
> > > >>>>>>
> > > >>>>>
> > > >>>>
> > > >>>>
> > > >>>
> > > >>>
> > > >>
> > > >
> > > >
> > >
> > >
> >
>

Re: Scanner timeouts

Posted by Mich Talebzadeh <mi...@gmail.com>.

Hbase does not have indexes but Phoenix will allow one to create secondary
indexes on Hbase. The index structure will be created on Hbase itself and
you can maintain it from Phoenix.

HTH





Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 28 October 2016 at 19:29, Ted Yu <yu...@gmail.com> wrote:

> bq. with 400 threads hitting HBase at the same time
>
> How many regions are serving the 400 threads ?
> How many region servers do you have ?
>
> If the regions are spread relatively evenly across the cluster, the above
> may not be big issue.
>
> On Fri, Oct 28, 2016 at 11:21 AM, Pat Ferrel <pa...@occamsmachete.com>
> wrote:
>
> > Ok, will do.
> >
> > So the scanner does not indicate of itself that I’ve missed something in
> > handling the data. If not index, then made a fast lookup “key”? I ask
> > because the timeout change may work but not be the optimal solution. The
> > stage that fails is very long compared to other stages. And with 400
> > threads hitting HBase at the same time, this seems like something I may
> > need to restructure and any advice about that would be welcome.
> >
> > HBase is 1.2.3
> >
> >
> > On Oct 28, 2016, at 10:36 AM, Ted Yu <yu...@gmail.com> wrote:
> >
> > For your first question, you need to pass hbase-site.xml which has config
> > parameters affecting client operations to Spark  executors.
> >
> > bq. missed indexing some column
> >
> > hbase doesn't have indexing (as in the sense of traditional RDBMS).
> >
> > Let's see what happens after hbase-site.xml is passed to executors.
> >
> > BTW Can you tell us the release of hbase you're using ?
> >
> >
> >
> > On Fri, Oct 28, 2016 at 10:22 AM, Pat Ferrel <pa...@occamsmachete.com>
> > wrote:
> >
> > > So to clarify there are some values in hbase/conf/hbase-site.xml that
> are
> > > needed by the calling code in the Spark driver and executors and so
> must
> > be
> > > passed using --files to spark-submit? If so I can do this.
> > >
> > > But do I have a deeper issue? Is it typical to need a scan like this?
> > Have
> > > I missed indexing some column maybe?
> > >
> > >
> > > On Oct 28, 2016, at 9:59 AM, Ted Yu <yu...@gmail.com> wrote:
> > >
> > > Mich:
> > > bq. on table 'hbase:meta' *at region=hbase:meta,,1.1588230740
> > >
> > > What you observed was different issue.
> > > The above looks like trouble with locating region(s) during scan.
> > >
> > > On Fri, Oct 28, 2016 at 9:54 AM, Mich Talebzadeh <
> > > mich.talebzadeh@gmail.com>
> > > wrote:
> > >
> > >> This is an example I got
> > >>
> > >> warning: there were two deprecation warnings; re-run with -deprecation
> > > for
> > >> details
> > >> rdd1: org.apache.spark.rdd.RDD[(String, String)] =
> MapPartitionsRDD[77]
> > > at
> > >> map at <console>:151
> > >> defined class columns
> > >> dfTICKER: org.apache.spark.sql.Dataset[columns] = [KEY: string,
> TICKER:
> > >> string]
> > >> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
> after
> > >> attempts=36, exceptions:
> > >> *Fri Oct 28 13:13:46 BST 2016, null, java.net.SocketTimeoutException:
> > >> callTimeout=60000, callDuration=68411: row
> > >> 'MARKETDATAHBASE,,00000000000000' on table 'hbase:meta' *at
> > >> region=hbase:meta,,1.1588230740, hostname=rhes564,16201,
> 1477246132044,
> > >> seqNum=0
> > >> at
> > >> org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadRepli
> > >> cas.throwEnrichedException(RpcRetryingCallerWithReadRepli
> cas.java:276)
> > >> at
> > >> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(
> > >> ScannerCallableWithReplicas.java:210)
> > >> at
> > >> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(
> > >> ScannerCallableWithReplicas.java:60)
> > >> at
> > >> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(
> > >> RpcRetryingCaller.java:210)
> > >>
> > >>
> > >>
> > >> Dr Mich Talebzadeh
> > >>
> > >>
> > >>
> > >> LinkedIn * https://www.linkedin.com/profile/view?id=
> > >> AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> > >> <https://www.linkedin.com/profile/view?id=
> > AAEAAAAWh2gBxianrbJd6zP6AcPCCd
> > >> OABUrV8Pw>*
> > >>
> > >>
> > >>
> > >> http://talebzadehmich.wordpress.com
> > >>
> > >>
> > >> *Disclaimer:* Use it at your own risk. Any and all responsibility for
> > any
> > >> loss, damage or destruction of data or any other property which may
> > arise
> > >> from relying on this email's technical content is explicitly
> disclaimed.
> > >> The author will in no case be liable for any monetary damages arising
> > > from
> > >> such loss, damage or destruction.
> > >>
> > >>
> > >>
> > >> On 28 October 2016 at 17:52, Pat Ferrel <pa...@occamsmachete.com>
> wrote:
> > >>
> > >>> I will check that, but if that is a server startup thing I was not
> > aware
> > >> I
> > >>> had to send it to the executors. So it’s like a connection timeout
> from
> > >>> executor code?
> > >>>
> > >>>
> > >>> On Oct 28, 2016, at 9:48 AM, Ted Yu <yu...@gmail.com> wrote:
> > >>>
> > >>> How did you change the timeout(s) ?
> > >>>
> > >>> bq. timeout is currently set to 60000
> > >>>
> > >>> Did you pass hbase-site.xml using --files to Spark job ?
> > >>>
> > >>> Cheers
> > >>>
> > >>> On Fri, Oct 28, 2016 at 9:27 AM, Pat Ferrel <pa...@occamsmachete.com>
> > >> wrote:
> > >>>
> > >>>> Using standalone Spark. I don’t recall seeing connection lost
> errors,
> > >> but
> > >>>> there are lots of logs. I’ve set the scanner and RPC timeouts to
> large
> > >>>> numbers on the servers.
> > >>>>
> > >>>> But I also saw in the logs:
> > >>>>
> > >>>>  org.apache.hadoop.hbase.client.ScannerTimeoutException: 381788ms
> > >>>> passed since the last invocation, timeout is currently set to 60000
> > >>>>
> > >>>> Not sure where that is coming from. Does the driver machine making
> > >>> queries
> > >>>> need to have the timeout config also?
> > >>>>
> > >>>> And why so large, am I doing something wrong?
> > >>>>
> > >>>>
> > >>>> On Oct 28, 2016, at 8:50 AM, Ted Yu <yu...@gmail.com> wrote:
> > >>>>
> > >>>> Mich:
> > >>>> The OutOfOrderScannerNextException indicated problem with read from
> > >>> hbase.
> > >>>>
> > >>>> How did you know connection to Spark cluster was lost ?
> > >>>>
> > >>>> Cheers
> > >>>>
> > >>>> On Fri, Oct 28, 2016 at 8:47 AM, Mich Talebzadeh <
> > >>>> mich.talebzadeh@gmail.com>
> > >>>> wrote:
> > >>>>
> > >>>>> Looks like it lost the connection to Spark cluster.
> > >>>>>
> > >>>>> What mode you are using with Spark, Standalone, Yarn or others. The
> > >>> issue
> > >>>>> looks like a resource manager issue.
> > >>>>>
> > >>>>> I have seen this when running Zeppelin with Spark on Hbase.
> > >>>>>
> > >>>>> HTH
> > >>>>>
> > >>>>> Dr Mich Talebzadeh
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> LinkedIn * https://www.linkedin.com/profile/view?id=
> > >>>>> AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> > >>>>> <https://www.linkedin.com/profile/view?id=
> > >>> AAEAAAAWh2gBxianrbJd6zP6AcPCCd
> > >>>>> OABUrV8Pw>*
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> http://talebzadehmich.wordpress.com
> > >>>>>
> > >>>>>
> > >>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
> for
> > >>> any
> > >>>>> loss, damage or destruction of data or any other property which may
> > >>> arise
> > >>>>> from relying on this email's technical content is explicitly
> > >> disclaimed.
> > >>>>> The author will in no case be liable for any monetary damages
> arising
> > >>>> from
> > >>>>> such loss, damage or destruction.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> On 28 October 2016 at 16:38, Pat Ferrel <pa...@occamsmachete.com>
> > >> wrote:
> > >>>>>
> > >>>>>> I’m getting data from HBase using a large Spark cluster with
> > >>> parallelism
> > >>>>>> of near 400. The query fails quire often with the message below.
> > >>>>> Sometimes
> > >>>>>> a retry will work and sometimes the ultimate failure results
> > (below).
> > >>>>>>
> > >>>>>> If I reduce parallelism in Spark it slows other parts of the
> > >> algorithm
> > >>>>>> unacceptably. I have also experimented with very large RPC/Scanner
> > >>>>> timeouts
> > >>>>>> of many minutes—to no avail.
> > >>>>>>
> > >>>>>> Any clues about what to look for or what may be setup wrong in my
> > >>>> tables?
> > >>>>>>
> > >>>>>> Job aborted due to stage failure: Task 44 in stage 147.0 failed 4
> > >>> times,
> > >>>>>> most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
> > >>>>>> ip-172-16-3-9.eu-central-1.compute.internal):
> > >> org.apache.hadoop.hbase.
> > >>>>> DoNotRetryIOException:
> > >>>>>> Failed after retry of OutOfOrderScannerNextException: was there a
> > >> rpc
> > >>>>>> timeout?+details
> > >>>>>> Job aborted due to stage failure: Task 44 in stage 147.0 failed 4
> > >>> times,
> > >>>>>> most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
> > >>>>>> ip-172-16-3-9.eu-central-1.compute.internal):
> > >> org.apache.hadoop.hbase.
> > >>>>> DoNotRetryIOException:
> > >>>>>> Failed after retry of OutOfOrderScannerNextException: was there a
> > >> rpc
> > >>>>>> timeout? at org.apache.hadoop.hbase.client.ClientScanner.next(
> > >>>>> ClientScanner.java:403)
> > >>>>>> at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.
> > >>>> nextKeyValue(
> > >>>>>> TableRecordReaderImpl.java:232) at org.apache.hadoop.hbase.
> > >>>>>> mapreduce.TableRecordReader.nextKeyValue(
> > TableRecordReader.java:138)
> > >>> at
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>>
> > >>>
> > >>>
> > >>
> > >
> > >
> >
> >
>

Re: Scanner timeouts

Posted by Mich Talebzadeh <mi...@gmail.com>.

ok what it says that it was discussed before and there is Jira on hbase
side.

it is not a showstopper anyway

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 28 October 2016 at 23:53, Ted Yu <yu...@gmail.com> wrote:

> Mich:
> The image didn't go through.
>
> Consider using third party website.
>
> On Fri, Oct 28, 2016 at 3:52 PM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com>
> wrote:
>
> > Gentle reminder :)
> >
> > [image: Inline images 1]
> >
> > Dr Mich Talebzadeh
> >
> >
> >
> > LinkedIn * https://www.linkedin.com/profile/view?id=
> AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> > <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCd
> OABUrV8Pw>*
> >
> >
> >
> > http://talebzadehmich.wordpress.com
> >
> >
> > *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> > loss, damage or destruction of data or any other property which may arise
> > from relying on this email's technical content is explicitly disclaimed.
> > The author will in no case be liable for any monetary damages arising
> from
> > such loss, damage or destruction.
> >
> >
> >
> > On 28 October 2016 at 23:05, Ted Yu <yu...@gmail.com> wrote:
> >
> >> You should have written to the mailing list earlier :-)
> >>
> >> hbase community is very responsive.
> >>
> >>
> >> On Fri, Oct 28, 2016 at 2:53 PM, Pat Ferrel <pa...@occamsmachete.com>
> >> wrote:
> >>
> >> > After passing in hbase-site.xml with the increased timeout it
> completes
> >> > pretty fast with no errors.
> >> >
> >> > Thanks Ted, we’ve been going crazy trying to figure what was going on.
> >> We
> >> > moved from having Hbase installed on the Spark driver machine (though
> >> not
> >> > used) to containerized installation, where the config was left default
> >> on
> >> > the driver and only existed in the containers. We were passing in the
> >> empty
> >> > config to the spark-submit but it didn’t match the containers and
> fixing
> >> > that has made the system much happier.
> >> >
> >> > Anyway good call, we will be more aware of this with other services
> now.
> >> > Thanks for ending our weeks long fight!  :-)
> >> >
> >> >
> >> > On Oct 28, 2016, at 11:29 AM, Ted Yu <yu...@gmail.com> wrote:
> >> >
> >> > bq. with 400 threads hitting HBase at the same time
> >> >
> >> > How many regions are serving the 400 threads ?
> >> > How many region servers do you have ?
> >> >
> >> > If the regions are spread relatively evenly across the cluster, the
> >> above
> >> > may not be big issue.
> >> >
> >> > On Fri, Oct 28, 2016 at 11:21 AM, Pat Ferrel <pa...@occamsmachete.com>
> >> > wrote:
> >> >
> >> > > Ok, will do.
> >> > >
> >> > > So the scanner does not indicate of itself that I’ve missed
> something
> >> in
> >> > > handling the data. If not index, then made a fast lookup “key”? I
> ask
> >> > > because the timeout change may work but not be the optimal solution.
> >> The
> >> > > stage that fails is very long compared to other stages. And with 400
> >> > > threads hitting HBase at the same time, this seems like something I
> >> may
> >> > > need to restructure and any advice about that would be welcome.
> >> > >
> >> > > HBase is 1.2.3
> >> > >
> >> > >
> >> > > On Oct 28, 2016, at 10:36 AM, Ted Yu <yu...@gmail.com> wrote:
> >> > >
> >> > > For your first question, you need to pass hbase-site.xml which has
> >> config
> >> > > parameters affecting client operations to Spark  executors.
> >> > >
> >> > > bq. missed indexing some column
> >> > >
> >> > > hbase doesn't have indexing (as in the sense of traditional RDBMS).
> >> > >
> >> > > Let's see what happens after hbase-site.xml is passed to executors.
> >> > >
> >> > > BTW Can you tell us the release of hbase you're using ?
> >> > >
> >> > >
> >> > >
> >> > > On Fri, Oct 28, 2016 at 10:22 AM, Pat Ferrel <pat@occamsmachete.com
> >
> >> > > wrote:
> >> > >
> >> > >> So to clarify there are some values in hbase/conf/hbase-site.xml
> that
> >> > are
> >> > >> needed by the calling code in the Spark driver and executors and so
> >> must
> >> > > be
> >> > >> passed using --files to spark-submit? If so I can do this.
> >> > >>
> >> > >> But do I have a deeper issue? Is it typical to need a scan like
> this?
> >> > > Have
> >> > >> I missed indexing some column maybe?
> >> > >>
> >> > >>
> >> > >> On Oct 28, 2016, at 9:59 AM, Ted Yu <yu...@gmail.com> wrote:
> >> > >>
> >> > >> Mich:
> >> > >> bq. on table 'hbase:meta' *at region=hbase:meta,,1.1588230740
> >> > >>
> >> > >> What you observed was different issue.
> >> > >> The above looks like trouble with locating region(s) during scan.
> >> > >>
> >> > >> On Fri, Oct 28, 2016 at 9:54 AM, Mich Talebzadeh <
> >> > >> mich.talebzadeh@gmail.com>
> >> > >> wrote:
> >> > >>
> >> > >>> This is an example I got
> >> > >>>
> >> > >>> warning: there were two deprecation warnings; re-run with
> >> -deprecation
> >> > >> for
> >> > >>> details
> >> > >>> rdd1: org.apache.spark.rdd.RDD[(String, String)] =
> >> > MapPartitionsRDD[77]
> >> > >> at
> >> > >>> map at <console>:151
> >> > >>> defined class columns
> >> > >>> dfTICKER: org.apache.spark.sql.Dataset[columns] = [KEY: string,
> >> > TICKER:
> >> > >>> string]
> >> > >>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
> >> after
> >> > >>> attempts=36, exceptions:
> >> > >>> *Fri Oct 28 13:13:46 BST 2016, null,
> java.net.SocketTimeoutExceptio
> >> n:
> >> > >>> callTimeout=60000, callDuration=68411: row
> >> > >>> 'MARKETDATAHBASE,,00000000000000' on table 'hbase:meta' *at
> >> > >>> region=hbase:meta,,1.1588230740, hostname=rhes564,16201,1477246
> >> 132044,
> >> > >>> seqNum=0
> >> > >>> at
> >> > >>> org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadRepli
> >> > >>> cas.throwEnrichedException(RpcRetryingCallerWithReadReplicas
> >> .java:276)
> >> > >>> at
> >> > >>> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(
> >> > >>> ScannerCallableWithReplicas.java:210)
> >> > >>> at
> >> > >>> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(
> >> > >>> ScannerCallableWithReplicas.java:60)
> >> > >>> at
> >> > >>> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithout
> >> Retries(
> >> > >>> RpcRetryingCaller.java:210)
> >> > >>>
> >> > >>>
> >> > >>>
> >> > >>> Dr Mich Talebzadeh
> >> > >>>
> >> > >>>
> >> > >>>
> >> > >>> LinkedIn * https://www.linkedin.com/profile/view?id=
> >> > >>> AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >> > >>> <https://www.linkedin.com/profile/view?id=
> >> > > AAEAAAAWh2gBxianrbJd6zP6AcPCCd
> >> > >>> OABUrV8Pw>*
> >> > >>>
> >> > >>>
> >> > >>>
> >> > >>> http://talebzadehmich.wordpress.com
> >> > >>>
> >> > >>>
> >> > >>> *Disclaimer:* Use it at your own risk. Any and all responsibility
> >> for
> >> > > any
> >> > >>> loss, damage or destruction of data or any other property which
> may
> >> > > arise
> >> > >>> from relying on this email's technical content is explicitly
> >> > disclaimed.
> >> > >>> The author will in no case be liable for any monetary damages
> >> arising
> >> > >> from
> >> > >>> such loss, damage or destruction.
> >> > >>>
> >> > >>>
> >> > >>>
> >> > >>> On 28 October 2016 at 17:52, Pat Ferrel <pa...@occamsmachete.com>
> >> wrote:
> >> > >>>
> >> > >>>> I will check that, but if that is a server startup thing I was
> not
> >> > > aware
> >> > >>> I
> >> > >>>> had to send it to the executors. So it’s like a connection
> timeout
> >> > from
> >> > >>>> executor code?
> >> > >>>>
> >> > >>>>
> >> > >>>> On Oct 28, 2016, at 9:48 AM, Ted Yu <yu...@gmail.com> wrote:
> >> > >>>>
> >> > >>>> How did you change the timeout(s) ?
> >> > >>>>
> >> > >>>> bq. timeout is currently set to 60000
> >> > >>>>
> >> > >>>> Did you pass hbase-site.xml using --files to Spark job ?
> >> > >>>>
> >> > >>>> Cheers
> >> > >>>>
> >> > >>>> On Fri, Oct 28, 2016 at 9:27 AM, Pat Ferrel <
> pat@occamsmachete.com
> >> >
> >> > >>> wrote:
> >> > >>>>
> >> > >>>>> Using standalone Spark. I don’t recall seeing connection lost
> >> errors,
> >> > >>> but
> >> > >>>>> there are lots of logs. I’ve set the scanner and RPC timeouts to
> >> > large
> >> > >>>>> numbers on the servers.
> >> > >>>>>
> >> > >>>>> But I also saw in the logs:
> >> > >>>>>
> >> > >>>>> org.apache.hadoop.hbase.client.ScannerTimeoutException:
> 381788ms
> >> > >>>>> passed since the last invocation, timeout is currently set to
> >> 60000
> >> > >>>>>
> >> > >>>>> Not sure where that is coming from. Does the driver machine
> making
> >> > >>>> queries
> >> > >>>>> need to have the timeout config also?
> >> > >>>>>
> >> > >>>>> And why so large, am I doing something wrong?
> >> > >>>>>
> >> > >>>>>
> >> > >>>>> On Oct 28, 2016, at 8:50 AM, Ted Yu <yu...@gmail.com>
> wrote:
> >> > >>>>>
> >> > >>>>> Mich:
> >> > >>>>> The OutOfOrderScannerNextException indicated problem with read
> >> from
> >> > >>>> hbase.
> >> > >>>>>
> >> > >>>>> How did you know connection to Spark cluster was lost ?
> >> > >>>>>
> >> > >>>>> Cheers
> >> > >>>>>
> >> > >>>>> On Fri, Oct 28, 2016 at 8:47 AM, Mich Talebzadeh <
> >> > >>>>> mich.talebzadeh@gmail.com>
> >> > >>>>> wrote:
> >> > >>>>>
> >> > >>>>>> Looks like it lost the connection to Spark cluster.
> >> > >>>>>>
> >> > >>>>>> What mode you are using with Spark, Standalone, Yarn or others.
> >> The
> >> > >>>> issue
> >> > >>>>>> looks like a resource manager issue.
> >> > >>>>>>
> >> > >>>>>> I have seen this when running Zeppelin with Spark on Hbase.
> >> > >>>>>>
> >> > >>>>>> HTH
> >> > >>>>>>
> >> > >>>>>> Dr Mich Talebzadeh
> >> > >>>>>>
> >> > >>>>>>
> >> > >>>>>>
> >> > >>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=
> >> > >>>>>> AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >> > >>>>>> <https://www.linkedin.com/profile/view?id=
> >> > >>>> AAEAAAAWh2gBxianrbJd6zP6AcPCCd
> >> > >>>>>> OABUrV8Pw>*
> >> > >>>>>>
> >> > >>>>>>
> >> > >>>>>>
> >> > >>>>>> http://talebzadehmich.wordpress.com
> >> > >>>>>>
> >> > >>>>>>
> >> > >>>>>> *Disclaimer:* Use it at your own risk. Any and all
> responsibility
> >> > for
> >> > >>>> any
> >> > >>>>>> loss, damage or destruction of data or any other property which
> >> may
> >> > >>>> arise
> >> > >>>>>> from relying on this email's technical content is explicitly
> >> > >>> disclaimed.
> >> > >>>>>> The author will in no case be liable for any monetary damages
> >> > arising
> >> > >>>>> from
> >> > >>>>>> such loss, damage or destruction.
> >> > >>>>>>
> >> > >>>>>>
> >> > >>>>>>
> >> > >>>>>> On 28 October 2016 at 16:38, Pat Ferrel <pat@occamsmachete.com
> >
> >> > >>> wrote:
> >> > >>>>>>
> >> > >>>>>>> I’m getting data from HBase using a large Spark cluster with
> >> > >>>> parallelism
> >> > >>>>>>> of near 400. The query fails quire often with the message
> below.
> >> > >>>>>> Sometimes
> >> > >>>>>>> a retry will work and sometimes the ultimate failure results
> >> > > (below).
> >> > >>>>>>>
> >> > >>>>>>> If I reduce parallelism in Spark it slows other parts of the
> >> > >>> algorithm
> >> > >>>>>>> unacceptably. I have also experimented with very large
> >> RPC/Scanner
> >> > >>>>>> timeouts
> >> > >>>>>>> of many minutes—to no avail.
> >> > >>>>>>>
> >> > >>>>>>> Any clues about what to look for or what may be setup wrong in
> >> my
> >> > >>>>> tables?
> >> > >>>>>>>
> >> > >>>>>>> Job aborted due to stage failure: Task 44 in stage 147.0
> failed
> >> 4
> >> > >>>> times,
> >> > >>>>>>> most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
> >> > >>>>>>> ip-172-16-3-9.eu-central-1.compute.internal):
> >> > >>> org.apache.hadoop.hbase.
> >> > >>>>>> DoNotRetryIOException:
> >> > >>>>>>> Failed after retry of OutOfOrderScannerNextException: was
> >> there a
> >> > >>> rpc
> >> > >>>>>>> timeout?+details
> >> > >>>>>>> Job aborted due to stage failure: Task 44 in stage 147.0
> failed
> >> 4
> >> > >>>> times,
> >> > >>>>>>> most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
> >> > >>>>>>> ip-172-16-3-9.eu-central-1.compute.internal):
> >> > >>> org.apache.hadoop.hbase.
> >> > >>>>>> DoNotRetryIOException:
> >> > >>>>>>> Failed after retry of OutOfOrderScannerNextException: was
> >> there a
> >> > >>> rpc
> >> > >>>>>>> timeout? at org.apache.hadoop.hbase.
> client.ClientScanner.next(
> >> > >>>>>> ClientScanner.java:403)
> >> > >>>>>>> at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.
> >> > >>>>> nextKeyValue(
> >> > >>>>>>> TableRecordReaderImpl.java:232) at org.apache.hadoop.hbase.
> >> > >>>>>>> mapreduce.TableRecordReader.nextKeyValue(
> >> > > TableRecordReader.java:138)
> >> > >>>> at
> >> > >>>>>>>
> >> > >>>>>>
> >> > >>>>>
> >> > >>>>>
> >> > >>>>
> >> > >>>>
> >> > >>>
> >> > >>
> >> > >>
> >> > >
> >> > >
> >> >
> >> >
> >>
> >
> >
>

Re: Scanner timeouts

Posted by Ted Yu <yu...@gmail.com>.

Mich:
The image didn't go through.

Consider using third party website.

On Fri, Oct 28, 2016 at 3:52 PM, Mich Talebzadeh <mi...@gmail.com>
wrote:

> Gentle reminder :)
>
> [image: Inline images 1]
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 28 October 2016 at 23:05, Ted Yu <yu...@gmail.com> wrote:
>
>> You should have written to the mailing list earlier :-)
>>
>> hbase community is very responsive.
>>
>>
>> On Fri, Oct 28, 2016 at 2:53 PM, Pat Ferrel <pa...@occamsmachete.com>
>> wrote:
>>
>> > After passing in hbase-site.xml with the increased timeout it completes
>> > pretty fast with no errors.
>> >
>> > Thanks Ted, we’ve been going crazy trying to figure what was going on.
>> We
>> > moved from having Hbase installed on the Spark driver machine (though
>> not
>> > used) to containerized installation, where the config was left default
>> on
>> > the driver and only existed in the containers. We were passing in the
>> empty
>> > config to the spark-submit but it didn’t match the containers and fixing
>> > that has made the system much happier.
>> >
>> > Anyway good call, we will be more aware of this with other services now.
>> > Thanks for ending our weeks long fight!  :-)
>> >
>> >
>> > On Oct 28, 2016, at 11:29 AM, Ted Yu <yu...@gmail.com> wrote:
>> >
>> > bq. with 400 threads hitting HBase at the same time
>> >
>> > How many regions are serving the 400 threads ?
>> > How many region servers do you have ?
>> >
>> > If the regions are spread relatively evenly across the cluster, the
>> above
>> > may not be big issue.
>> >
>> > On Fri, Oct 28, 2016 at 11:21 AM, Pat Ferrel <pa...@occamsmachete.com>
>> > wrote:
>> >
>> > > Ok, will do.
>> > >
>> > > So the scanner does not indicate of itself that I’ve missed something
>> in
>> > > handling the data. If not index, then made a fast lookup “key”? I ask
>> > > because the timeout change may work but not be the optimal solution.
>> The
>> > > stage that fails is very long compared to other stages. And with 400
>> > > threads hitting HBase at the same time, this seems like something I
>> may
>> > > need to restructure and any advice about that would be welcome.
>> > >
>> > > HBase is 1.2.3
>> > >
>> > >
>> > > On Oct 28, 2016, at 10:36 AM, Ted Yu <yu...@gmail.com> wrote:
>> > >
>> > > For your first question, you need to pass hbase-site.xml which has
>> config
>> > > parameters affecting client operations to Spark  executors.
>> > >
>> > > bq. missed indexing some column
>> > >
>> > > hbase doesn't have indexing (as in the sense of traditional RDBMS).
>> > >
>> > > Let's see what happens after hbase-site.xml is passed to executors.
>> > >
>> > > BTW Can you tell us the release of hbase you're using ?
>> > >
>> > >
>> > >
>> > > On Fri, Oct 28, 2016 at 10:22 AM, Pat Ferrel <pa...@occamsmachete.com>
>> > > wrote:
>> > >
>> > >> So to clarify there are some values in hbase/conf/hbase-site.xml that
>> > are
>> > >> needed by the calling code in the Spark driver and executors and so
>> must
>> > > be
>> > >> passed using --files to spark-submit? If so I can do this.
>> > >>
>> > >> But do I have a deeper issue? Is it typical to need a scan like this?
>> > > Have
>> > >> I missed indexing some column maybe?
>> > >>
>> > >>
>> > >> On Oct 28, 2016, at 9:59 AM, Ted Yu <yu...@gmail.com> wrote:
>> > >>
>> > >> Mich:
>> > >> bq. on table 'hbase:meta' *at region=hbase:meta,,1.1588230740
>> > >>
>> > >> What you observed was different issue.
>> > >> The above looks like trouble with locating region(s) during scan.
>> > >>
>> > >> On Fri, Oct 28, 2016 at 9:54 AM, Mich Talebzadeh <
>> > >> mich.talebzadeh@gmail.com>
>> > >> wrote:
>> > >>
>> > >>> This is an example I got
>> > >>>
>> > >>> warning: there were two deprecation warnings; re-run with
>> -deprecation
>> > >> for
>> > >>> details
>> > >>> rdd1: org.apache.spark.rdd.RDD[(String, String)] =
>> > MapPartitionsRDD[77]
>> > >> at
>> > >>> map at <console>:151
>> > >>> defined class columns
>> > >>> dfTICKER: org.apache.spark.sql.Dataset[columns] = [KEY: string,
>> > TICKER:
>> > >>> string]
>> > >>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
>> after
>> > >>> attempts=36, exceptions:
>> > >>> *Fri Oct 28 13:13:46 BST 2016, null, java.net.SocketTimeoutExceptio
>> n:
>> > >>> callTimeout=60000, callDuration=68411: row
>> > >>> 'MARKETDATAHBASE,,00000000000000' on table 'hbase:meta' *at
>> > >>> region=hbase:meta,,1.1588230740, hostname=rhes564,16201,1477246
>> 132044,
>> > >>> seqNum=0
>> > >>> at
>> > >>> org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadRepli
>> > >>> cas.throwEnrichedException(RpcRetryingCallerWithReadReplicas
>> .java:276)
>> > >>> at
>> > >>> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(
>> > >>> ScannerCallableWithReplicas.java:210)
>> > >>> at
>> > >>> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(
>> > >>> ScannerCallableWithReplicas.java:60)
>> > >>> at
>> > >>> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithout
>> Retries(
>> > >>> RpcRetryingCaller.java:210)
>> > >>>
>> > >>>
>> > >>>
>> > >>> Dr Mich Talebzadeh
>> > >>>
>> > >>>
>> > >>>
>> > >>> LinkedIn * https://www.linkedin.com/profile/view?id=
>> > >>> AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> > >>> <https://www.linkedin.com/profile/view?id=
>> > > AAEAAAAWh2gBxianrbJd6zP6AcPCCd
>> > >>> OABUrV8Pw>*
>> > >>>
>> > >>>
>> > >>>
>> > >>> http://talebzadehmich.wordpress.com
>> > >>>
>> > >>>
>> > >>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>> for
>> > > any
>> > >>> loss, damage or destruction of data or any other property which may
>> > > arise
>> > >>> from relying on this email's technical content is explicitly
>> > disclaimed.
>> > >>> The author will in no case be liable for any monetary damages
>> arising
>> > >> from
>> > >>> such loss, damage or destruction.
>> > >>>
>> > >>>
>> > >>>
>> > >>> On 28 October 2016 at 17:52, Pat Ferrel <pa...@occamsmachete.com>
>> wrote:
>> > >>>
>> > >>>> I will check that, but if that is a server startup thing I was not
>> > > aware
>> > >>> I
>> > >>>> had to send it to the executors. So it’s like a connection timeout
>> > from
>> > >>>> executor code?
>> > >>>>
>> > >>>>
>> > >>>> On Oct 28, 2016, at 9:48 AM, Ted Yu <yu...@gmail.com> wrote:
>> > >>>>
>> > >>>> How did you change the timeout(s) ?
>> > >>>>
>> > >>>> bq. timeout is currently set to 60000
>> > >>>>
>> > >>>> Did you pass hbase-site.xml using --files to Spark job ?
>> > >>>>
>> > >>>> Cheers
>> > >>>>
>> > >>>> On Fri, Oct 28, 2016 at 9:27 AM, Pat Ferrel <pat@occamsmachete.com
>> >
>> > >>> wrote:
>> > >>>>
>> > >>>>> Using standalone Spark. I don’t recall seeing connection lost
>> errors,
>> > >>> but
>> > >>>>> there are lots of logs. I’ve set the scanner and RPC timeouts to
>> > large
>> > >>>>> numbers on the servers.
>> > >>>>>
>> > >>>>> But I also saw in the logs:
>> > >>>>>
>> > >>>>> org.apache.hadoop.hbase.client.ScannerTimeoutException: 381788ms
>> > >>>>> passed since the last invocation, timeout is currently set to
>> 60000
>> > >>>>>
>> > >>>>> Not sure where that is coming from. Does the driver machine making
>> > >>>> queries
>> > >>>>> need to have the timeout config also?
>> > >>>>>
>> > >>>>> And why so large, am I doing something wrong?
>> > >>>>>
>> > >>>>>
>> > >>>>> On Oct 28, 2016, at 8:50 AM, Ted Yu <yu...@gmail.com> wrote:
>> > >>>>>
>> > >>>>> Mich:
>> > >>>>> The OutOfOrderScannerNextException indicated problem with read
>> from
>> > >>>> hbase.
>> > >>>>>
>> > >>>>> How did you know connection to Spark cluster was lost ?
>> > >>>>>
>> > >>>>> Cheers
>> > >>>>>
>> > >>>>> On Fri, Oct 28, 2016 at 8:47 AM, Mich Talebzadeh <
>> > >>>>> mich.talebzadeh@gmail.com>
>> > >>>>> wrote:
>> > >>>>>
>> > >>>>>> Looks like it lost the connection to Spark cluster.
>> > >>>>>>
>> > >>>>>> What mode you are using with Spark, Standalone, Yarn or others.
>> The
>> > >>>> issue
>> > >>>>>> looks like a resource manager issue.
>> > >>>>>>
>> > >>>>>> I have seen this when running Zeppelin with Spark on Hbase.
>> > >>>>>>
>> > >>>>>> HTH
>> > >>>>>>
>> > >>>>>> Dr Mich Talebzadeh
>> > >>>>>>
>> > >>>>>>
>> > >>>>>>
>> > >>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=
>> > >>>>>> AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> > >>>>>> <https://www.linkedin.com/profile/view?id=
>> > >>>> AAEAAAAWh2gBxianrbJd6zP6AcPCCd
>> > >>>>>> OABUrV8Pw>*
>> > >>>>>>
>> > >>>>>>
>> > >>>>>>
>> > >>>>>> http://talebzadehmich.wordpress.com
>> > >>>>>>
>> > >>>>>>
>> > >>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>> > for
>> > >>>> any
>> > >>>>>> loss, damage or destruction of data or any other property which
>> may
>> > >>>> arise
>> > >>>>>> from relying on this email's technical content is explicitly
>> > >>> disclaimed.
>> > >>>>>> The author will in no case be liable for any monetary damages
>> > arising
>> > >>>>> from
>> > >>>>>> such loss, damage or destruction.
>> > >>>>>>
>> > >>>>>>
>> > >>>>>>
>> > >>>>>> On 28 October 2016 at 16:38, Pat Ferrel <pa...@occamsmachete.com>
>> > >>> wrote:
>> > >>>>>>
>> > >>>>>>> I’m getting data from HBase using a large Spark cluster with
>> > >>>> parallelism
>> > >>>>>>> of near 400. The query fails quire often with the message below.
>> > >>>>>> Sometimes
>> > >>>>>>> a retry will work and sometimes the ultimate failure results
>> > > (below).
>> > >>>>>>>
>> > >>>>>>> If I reduce parallelism in Spark it slows other parts of the
>> > >>> algorithm
>> > >>>>>>> unacceptably. I have also experimented with very large
>> RPC/Scanner
>> > >>>>>> timeouts
>> > >>>>>>> of many minutes—to no avail.
>> > >>>>>>>
>> > >>>>>>> Any clues about what to look for or what may be setup wrong in
>> my
>> > >>>>> tables?
>> > >>>>>>>
>> > >>>>>>> Job aborted due to stage failure: Task 44 in stage 147.0 failed
>> 4
>> > >>>> times,
>> > >>>>>>> most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
>> > >>>>>>> ip-172-16-3-9.eu-central-1.compute.internal):
>> > >>> org.apache.hadoop.hbase.
>> > >>>>>> DoNotRetryIOException:
>> > >>>>>>> Failed after retry of OutOfOrderScannerNextException: was
>> there a
>> > >>> rpc
>> > >>>>>>> timeout?+details
>> > >>>>>>> Job aborted due to stage failure: Task 44 in stage 147.0 failed
>> 4
>> > >>>> times,
>> > >>>>>>> most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
>> > >>>>>>> ip-172-16-3-9.eu-central-1.compute.internal):
>> > >>> org.apache.hadoop.hbase.
>> > >>>>>> DoNotRetryIOException:
>> > >>>>>>> Failed after retry of OutOfOrderScannerNextException: was
>> there a
>> > >>> rpc
>> > >>>>>>> timeout? at org.apache.hadoop.hbase.client.ClientScanner.next(
>> > >>>>>> ClientScanner.java:403)
>> > >>>>>>> at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.
>> > >>>>> nextKeyValue(
>> > >>>>>>> TableRecordReaderImpl.java:232) at org.apache.hadoop.hbase.
>> > >>>>>>> mapreduce.TableRecordReader.nextKeyValue(
>> > > TableRecordReader.java:138)
>> > >>>> at
>> > >>>>>>>
>> > >>>>>>
>> > >>>>>
>> > >>>>>
>> > >>>>
>> > >>>>
>> > >>>
>> > >>
>> > >>
>> > >
>> > >
>> >
>> >
>>
>
>

Re: Scanner timeouts

Posted by Mich Talebzadeh <mi...@gmail.com>.

Gentle reminder :)

[image: Inline images 1]

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 28 October 2016 at 23:05, Ted Yu <yu...@gmail.com> wrote:

> You should have written to the mailing list earlier :-)
>
> hbase community is very responsive.
>
> On Fri, Oct 28, 2016 at 2:53 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>
> > After passing in hbase-site.xml with the increased timeout it completes
> > pretty fast with no errors.
> >
> > Thanks Ted, we’ve been going crazy trying to figure what was going on. We
> > moved from having Hbase installed on the Spark driver machine (though not
> > used) to containerized installation, where the config was left default on
> > the driver and only existed in the containers. We were passing in the
> empty
> > config to the spark-submit but it didn’t match the containers and fixing
> > that has made the system much happier.
> >
> > Anyway good call, we will be more aware of this with other services now.
> > Thanks for ending our weeks long fight!  :-)
> >
> >
> > On Oct 28, 2016, at 11:29 AM, Ted Yu <yu...@gmail.com> wrote:
> >
> > bq. with 400 threads hitting HBase at the same time
> >
> > How many regions are serving the 400 threads ?
> > How many region servers do you have ?
> >
> > If the regions are spread relatively evenly across the cluster, the above
> > may not be big issue.
> >
> > On Fri, Oct 28, 2016 at 11:21 AM, Pat Ferrel <pa...@occamsmachete.com>
> > wrote:
> >
> > > Ok, will do.
> > >
> > > So the scanner does not indicate of itself that I’ve missed something
> in
> > > handling the data. If not index, then made a fast lookup “key”? I ask
> > > because the timeout change may work but not be the optimal solution.
> The
> > > stage that fails is very long compared to other stages. And with 400
> > > threads hitting HBase at the same time, this seems like something I may
> > > need to restructure and any advice about that would be welcome.
> > >
> > > HBase is 1.2.3
> > >
> > >
> > > On Oct 28, 2016, at 10:36 AM, Ted Yu <yu...@gmail.com> wrote:
> > >
> > > For your first question, you need to pass hbase-site.xml which has
> config
> > > parameters affecting client operations to Spark  executors.
> > >
> > > bq. missed indexing some column
> > >
> > > hbase doesn't have indexing (as in the sense of traditional RDBMS).
> > >
> > > Let's see what happens after hbase-site.xml is passed to executors.
> > >
> > > BTW Can you tell us the release of hbase you're using ?
> > >
> > >
> > >
> > > On Fri, Oct 28, 2016 at 10:22 AM, Pat Ferrel <pa...@occamsmachete.com>
> > > wrote:
> > >
> > >> So to clarify there are some values in hbase/conf/hbase-site.xml that
> > are
> > >> needed by the calling code in the Spark driver and executors and so
> must
> > > be
> > >> passed using --files to spark-submit? If so I can do this.
> > >>
> > >> But do I have a deeper issue? Is it typical to need a scan like this?
> > > Have
> > >> I missed indexing some column maybe?
> > >>
> > >>
> > >> On Oct 28, 2016, at 9:59 AM, Ted Yu <yu...@gmail.com> wrote:
> > >>
> > >> Mich:
> > >> bq. on table 'hbase:meta' *at region=hbase:meta,,1.1588230740
> > >>
> > >> What you observed was different issue.
> > >> The above looks like trouble with locating region(s) during scan.
> > >>
> > >> On Fri, Oct 28, 2016 at 9:54 AM, Mich Talebzadeh <
> > >> mich.talebzadeh@gmail.com>
> > >> wrote:
> > >>
> > >>> This is an example I got
> > >>>
> > >>> warning: there were two deprecation warnings; re-run with
> -deprecation
> > >> for
> > >>> details
> > >>> rdd1: org.apache.spark.rdd.RDD[(String, String)] =
> > MapPartitionsRDD[77]
> > >> at
> > >>> map at <console>:151
> > >>> defined class columns
> > >>> dfTICKER: org.apache.spark.sql.Dataset[columns] = [KEY: string,
> > TICKER:
> > >>> string]
> > >>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
> after
> > >>> attempts=36, exceptions:
> > >>> *Fri Oct 28 13:13:46 BST 2016, null, java.net.
> SocketTimeoutException:
> > >>> callTimeout=60000, callDuration=68411: row
> > >>> 'MARKETDATAHBASE,,00000000000000' on table 'hbase:meta' *at
> > >>> region=hbase:meta,,1.1588230740, hostname=rhes564,16201,
> 1477246132044,
> > >>> seqNum=0
> > >>> at
> > >>> org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadRepli
> > >>> cas.throwEnrichedException(RpcRetryingCallerWithReadRepli
> cas.java:276)
> > >>> at
> > >>> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(
> > >>> ScannerCallableWithReplicas.java:210)
> > >>> at
> > >>> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(
> > >>> ScannerCallableWithReplicas.java:60)
> > >>> at
> > >>> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(
> > >>> RpcRetryingCaller.java:210)
> > >>>
> > >>>
> > >>>
> > >>> Dr Mich Talebzadeh
> > >>>
> > >>>
> > >>>
> > >>> LinkedIn * https://www.linkedin.com/profile/view?id=
> > >>> AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> > >>> <https://www.linkedin.com/profile/view?id=
> > > AAEAAAAWh2gBxianrbJd6zP6AcPCCd
> > >>> OABUrV8Pw>*
> > >>>
> > >>>
> > >>>
> > >>> http://talebzadehmich.wordpress.com
> > >>>
> > >>>
> > >>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
> > > any
> > >>> loss, damage or destruction of data or any other property which may
> > > arise
> > >>> from relying on this email's technical content is explicitly
> > disclaimed.
> > >>> The author will in no case be liable for any monetary damages arising
> > >> from
> > >>> such loss, damage or destruction.
> > >>>
> > >>>
> > >>>
> > >>> On 28 October 2016 at 17:52, Pat Ferrel <pa...@occamsmachete.com>
> wrote:
> > >>>
> > >>>> I will check that, but if that is a server startup thing I was not
> > > aware
> > >>> I
> > >>>> had to send it to the executors. So it’s like a connection timeout
> > from
> > >>>> executor code?
> > >>>>
> > >>>>
> > >>>> On Oct 28, 2016, at 9:48 AM, Ted Yu <yu...@gmail.com> wrote:
> > >>>>
> > >>>> How did you change the timeout(s) ?
> > >>>>
> > >>>> bq. timeout is currently set to 60000
> > >>>>
> > >>>> Did you pass hbase-site.xml using --files to Spark job ?
> > >>>>
> > >>>> Cheers
> > >>>>
> > >>>> On Fri, Oct 28, 2016 at 9:27 AM, Pat Ferrel <pa...@occamsmachete.com>
> > >>> wrote:
> > >>>>
> > >>>>> Using standalone Spark. I don’t recall seeing connection lost
> errors,
> > >>> but
> > >>>>> there are lots of logs. I’ve set the scanner and RPC timeouts to
> > large
> > >>>>> numbers on the servers.
> > >>>>>
> > >>>>> But I also saw in the logs:
> > >>>>>
> > >>>>> org.apache.hadoop.hbase.client.ScannerTimeoutException: 381788ms
> > >>>>> passed since the last invocation, timeout is currently set to 60000
> > >>>>>
> > >>>>> Not sure where that is coming from. Does the driver machine making
> > >>>> queries
> > >>>>> need to have the timeout config also?
> > >>>>>
> > >>>>> And why so large, am I doing something wrong?
> > >>>>>
> > >>>>>
> > >>>>> On Oct 28, 2016, at 8:50 AM, Ted Yu <yu...@gmail.com> wrote:
> > >>>>>
> > >>>>> Mich:
> > >>>>> The OutOfOrderScannerNextException indicated problem with read from
> > >>>> hbase.
> > >>>>>
> > >>>>> How did you know connection to Spark cluster was lost ?
> > >>>>>
> > >>>>> Cheers
> > >>>>>
> > >>>>> On Fri, Oct 28, 2016 at 8:47 AM, Mich Talebzadeh <
> > >>>>> mich.talebzadeh@gmail.com>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> Looks like it lost the connection to Spark cluster.
> > >>>>>>
> > >>>>>> What mode you are using with Spark, Standalone, Yarn or others.
> The
> > >>>> issue
> > >>>>>> looks like a resource manager issue.
> > >>>>>>
> > >>>>>> I have seen this when running Zeppelin with Spark on Hbase.
> > >>>>>>
> > >>>>>> HTH
> > >>>>>>
> > >>>>>> Dr Mich Talebzadeh
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=
> > >>>>>> AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> > >>>>>> <https://www.linkedin.com/profile/view?id=
> > >>>> AAEAAAAWh2gBxianrbJd6zP6AcPCCd
> > >>>>>> OABUrV8Pw>*
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> http://talebzadehmich.wordpress.com
> > >>>>>>
> > >>>>>>
> > >>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
> > for
> > >>>> any
> > >>>>>> loss, damage or destruction of data or any other property which
> may
> > >>>> arise
> > >>>>>> from relying on this email's technical content is explicitly
> > >>> disclaimed.
> > >>>>>> The author will in no case be liable for any monetary damages
> > arising
> > >>>>> from
> > >>>>>> such loss, damage or destruction.
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> On 28 October 2016 at 16:38, Pat Ferrel <pa...@occamsmachete.com>
> > >>> wrote:
> > >>>>>>
> > >>>>>>> I’m getting data from HBase using a large Spark cluster with
> > >>>> parallelism
> > >>>>>>> of near 400. The query fails quire often with the message below.
> > >>>>>> Sometimes
> > >>>>>>> a retry will work and sometimes the ultimate failure results
> > > (below).
> > >>>>>>>
> > >>>>>>> If I reduce parallelism in Spark it slows other parts of the
> > >>> algorithm
> > >>>>>>> unacceptably. I have also experimented with very large
> RPC/Scanner
> > >>>>>> timeouts
> > >>>>>>> of many minutes—to no avail.
> > >>>>>>>
> > >>>>>>> Any clues about what to look for or what may be setup wrong in my
> > >>>>> tables?
> > >>>>>>>
> > >>>>>>> Job aborted due to stage failure: Task 44 in stage 147.0 failed 4
> > >>>> times,
> > >>>>>>> most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
> > >>>>>>> ip-172-16-3-9.eu-central-1.compute.internal):
> > >>> org.apache.hadoop.hbase.
> > >>>>>> DoNotRetryIOException:
> > >>>>>>> Failed after retry of OutOfOrderScannerNextException: was there
> a
> > >>> rpc
> > >>>>>>> timeout?+details
> > >>>>>>> Job aborted due to stage failure: Task 44 in stage 147.0 failed 4
> > >>>> times,
> > >>>>>>> most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
> > >>>>>>> ip-172-16-3-9.eu-central-1.compute.internal):
> > >>> org.apache.hadoop.hbase.
> > >>>>>> DoNotRetryIOException:
> > >>>>>>> Failed after retry of OutOfOrderScannerNextException: was there
> a
> > >>> rpc
> > >>>>>>> timeout? at org.apache.hadoop.hbase.client.ClientScanner.next(
> > >>>>>> ClientScanner.java:403)
> > >>>>>>> at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.
> > >>>>> nextKeyValue(
> > >>>>>>> TableRecordReaderImpl.java:232) at org.apache.hadoop.hbase.
> > >>>>>>> mapreduce.TableRecordReader.nextKeyValue(
> > > TableRecordReader.java:138)
> > >>>> at
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>>
> > >>>>
> > >>>>
> > >>>
> > >>
> > >>
> > >
> > >
> >
> >
>

Re: Scanner timeouts

Posted by Ted Yu <yu...@gmail.com>.

You should have written to the mailing list earlier :-)

hbase community is very responsive.

On Fri, Oct 28, 2016 at 2:53 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> After passing in hbase-site.xml with the increased timeout it completes
> pretty fast with no errors.
>
> Thanks Ted, we’ve been going crazy trying to figure what was going on. We
> moved from having Hbase installed on the Spark driver machine (though not
> used) to containerized installation, where the config was left default on
> the driver and only existed in the containers. We were passing in the empty
> config to the spark-submit but it didn’t match the containers and fixing
> that has made the system much happier.
>
> Anyway good call, we will be more aware of this with other services now.
> Thanks for ending our weeks long fight!  :-)
>
>
> On Oct 28, 2016, at 11:29 AM, Ted Yu <yu...@gmail.com> wrote:
>
> bq. with 400 threads hitting HBase at the same time
>
> How many regions are serving the 400 threads ?
> How many region servers do you have ?
>
> If the regions are spread relatively evenly across the cluster, the above
> may not be big issue.
>
> On Fri, Oct 28, 2016 at 11:21 AM, Pat Ferrel <pa...@occamsmachete.com>
> wrote:
>
> > Ok, will do.
> >
> > So the scanner does not indicate of itself that I’ve missed something in
> > handling the data. If not index, then made a fast lookup “key”? I ask
> > because the timeout change may work but not be the optimal solution. The
> > stage that fails is very long compared to other stages. And with 400
> > threads hitting HBase at the same time, this seems like something I may
> > need to restructure and any advice about that would be welcome.
> >
> > HBase is 1.2.3
> >
> >
> > On Oct 28, 2016, at 10:36 AM, Ted Yu <yu...@gmail.com> wrote:
> >
> > For your first question, you need to pass hbase-site.xml which has config
> > parameters affecting client operations to Spark  executors.
> >
> > bq. missed indexing some column
> >
> > hbase doesn't have indexing (as in the sense of traditional RDBMS).
> >
> > Let's see what happens after hbase-site.xml is passed to executors.
> >
> > BTW Can you tell us the release of hbase you're using ?
> >
> >
> >
> > On Fri, Oct 28, 2016 at 10:22 AM, Pat Ferrel <pa...@occamsmachete.com>
> > wrote:
> >
> >> So to clarify there are some values in hbase/conf/hbase-site.xml that
> are
> >> needed by the calling code in the Spark driver and executors and so must
> > be
> >> passed using --files to spark-submit? If so I can do this.
> >>
> >> But do I have a deeper issue? Is it typical to need a scan like this?
> > Have
> >> I missed indexing some column maybe?
> >>
> >>
> >> On Oct 28, 2016, at 9:59 AM, Ted Yu <yu...@gmail.com> wrote:
> >>
> >> Mich:
> >> bq. on table 'hbase:meta' *at region=hbase:meta,,1.1588230740
> >>
> >> What you observed was different issue.
> >> The above looks like trouble with locating region(s) during scan.
> >>
> >> On Fri, Oct 28, 2016 at 9:54 AM, Mich Talebzadeh <
> >> mich.talebzadeh@gmail.com>
> >> wrote:
> >>
> >>> This is an example I got
> >>>
> >>> warning: there were two deprecation warnings; re-run with -deprecation
> >> for
> >>> details
> >>> rdd1: org.apache.spark.rdd.RDD[(String, String)] =
> MapPartitionsRDD[77]
> >> at
> >>> map at <console>:151
> >>> defined class columns
> >>> dfTICKER: org.apache.spark.sql.Dataset[columns] = [KEY: string,
> TICKER:
> >>> string]
> >>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
> >>> attempts=36, exceptions:
> >>> *Fri Oct 28 13:13:46 BST 2016, null, java.net.SocketTimeoutException:
> >>> callTimeout=60000, callDuration=68411: row
> >>> 'MARKETDATAHBASE,,00000000000000' on table 'hbase:meta' *at
> >>> region=hbase:meta,,1.1588230740, hostname=rhes564,16201,1477246132044,
> >>> seqNum=0
> >>> at
> >>> org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadRepli
> >>> cas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:276)
> >>> at
> >>> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(
> >>> ScannerCallableWithReplicas.java:210)
> >>> at
> >>> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(
> >>> ScannerCallableWithReplicas.java:60)
> >>> at
> >>> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(
> >>> RpcRetryingCaller.java:210)
> >>>
> >>>
> >>>
> >>> Dr Mich Talebzadeh
> >>>
> >>>
> >>>
> >>> LinkedIn * https://www.linkedin.com/profile/view?id=
> >>> AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >>> <https://www.linkedin.com/profile/view?id=
> > AAEAAAAWh2gBxianrbJd6zP6AcPCCd
> >>> OABUrV8Pw>*
> >>>
> >>>
> >>>
> >>> http://talebzadehmich.wordpress.com
> >>>
> >>>
> >>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
> > any
> >>> loss, damage or destruction of data or any other property which may
> > arise
> >>> from relying on this email's technical content is explicitly
> disclaimed.
> >>> The author will in no case be liable for any monetary damages arising
> >> from
> >>> such loss, damage or destruction.
> >>>
> >>>
> >>>
> >>> On 28 October 2016 at 17:52, Pat Ferrel <pa...@occamsmachete.com> wrote:
> >>>
> >>>> I will check that, but if that is a server startup thing I was not
> > aware
> >>> I
> >>>> had to send it to the executors. So it’s like a connection timeout
> from
> >>>> executor code?
> >>>>
> >>>>
> >>>> On Oct 28, 2016, at 9:48 AM, Ted Yu <yu...@gmail.com> wrote:
> >>>>
> >>>> How did you change the timeout(s) ?
> >>>>
> >>>> bq. timeout is currently set to 60000
> >>>>
> >>>> Did you pass hbase-site.xml using --files to Spark job ?
> >>>>
> >>>> Cheers
> >>>>
> >>>> On Fri, Oct 28, 2016 at 9:27 AM, Pat Ferrel <pa...@occamsmachete.com>
> >>> wrote:
> >>>>
> >>>>> Using standalone Spark. I don’t recall seeing connection lost errors,
> >>> but
> >>>>> there are lots of logs. I’ve set the scanner and RPC timeouts to
> large
> >>>>> numbers on the servers.
> >>>>>
> >>>>> But I also saw in the logs:
> >>>>>
> >>>>> org.apache.hadoop.hbase.client.ScannerTimeoutException: 381788ms
> >>>>> passed since the last invocation, timeout is currently set to 60000
> >>>>>
> >>>>> Not sure where that is coming from. Does the driver machine making
> >>>> queries
> >>>>> need to have the timeout config also?
> >>>>>
> >>>>> And why so large, am I doing something wrong?
> >>>>>
> >>>>>
> >>>>> On Oct 28, 2016, at 8:50 AM, Ted Yu <yu...@gmail.com> wrote:
> >>>>>
> >>>>> Mich:
> >>>>> The OutOfOrderScannerNextException indicated problem with read from
> >>>> hbase.
> >>>>>
> >>>>> How did you know connection to Spark cluster was lost ?
> >>>>>
> >>>>> Cheers
> >>>>>
> >>>>> On Fri, Oct 28, 2016 at 8:47 AM, Mich Talebzadeh <
> >>>>> mich.talebzadeh@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Looks like it lost the connection to Spark cluster.
> >>>>>>
> >>>>>> What mode you are using with Spark, Standalone, Yarn or others. The
> >>>> issue
> >>>>>> looks like a resource manager issue.
> >>>>>>
> >>>>>> I have seen this when running Zeppelin with Spark on Hbase.
> >>>>>>
> >>>>>> HTH
> >>>>>>
> >>>>>> Dr Mich Talebzadeh
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=
> >>>>>> AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >>>>>> <https://www.linkedin.com/profile/view?id=
> >>>> AAEAAAAWh2gBxianrbJd6zP6AcPCCd
> >>>>>> OABUrV8Pw>*
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> http://talebzadehmich.wordpress.com
> >>>>>>
> >>>>>>
> >>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
> for
> >>>> any
> >>>>>> loss, damage or destruction of data or any other property which may
> >>>> arise
> >>>>>> from relying on this email's technical content is explicitly
> >>> disclaimed.
> >>>>>> The author will in no case be liable for any monetary damages
> arising
> >>>>> from
> >>>>>> such loss, damage or destruction.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 28 October 2016 at 16:38, Pat Ferrel <pa...@occamsmachete.com>
> >>> wrote:
> >>>>>>
> >>>>>>> I’m getting data from HBase using a large Spark cluster with
> >>>> parallelism
> >>>>>>> of near 400. The query fails quire often with the message below.
> >>>>>> Sometimes
> >>>>>>> a retry will work and sometimes the ultimate failure results
> > (below).
> >>>>>>>
> >>>>>>> If I reduce parallelism in Spark it slows other parts of the
> >>> algorithm
> >>>>>>> unacceptably. I have also experimented with very large RPC/Scanner
> >>>>>> timeouts
> >>>>>>> of many minutes—to no avail.
> >>>>>>>
> >>>>>>> Any clues about what to look for or what may be setup wrong in my
> >>>>> tables?
> >>>>>>>
> >>>>>>> Job aborted due to stage failure: Task 44 in stage 147.0 failed 4
> >>>> times,
> >>>>>>> most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
> >>>>>>> ip-172-16-3-9.eu-central-1.compute.internal):
> >>> org.apache.hadoop.hbase.
> >>>>>> DoNotRetryIOException:
> >>>>>>> Failed after retry of OutOfOrderScannerNextException: was there a
> >>> rpc
> >>>>>>> timeout?+details
> >>>>>>> Job aborted due to stage failure: Task 44 in stage 147.0 failed 4
> >>>> times,
> >>>>>>> most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
> >>>>>>> ip-172-16-3-9.eu-central-1.compute.internal):
> >>> org.apache.hadoop.hbase.
> >>>>>> DoNotRetryIOException:
> >>>>>>> Failed after retry of OutOfOrderScannerNextException: was there a
> >>> rpc
> >>>>>>> timeout? at org.apache.hadoop.hbase.client.ClientScanner.next(
> >>>>>> ClientScanner.java:403)
> >>>>>>> at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.
> >>>>> nextKeyValue(
> >>>>>>> TableRecordReaderImpl.java:232) at org.apache.hadoop.hbase.
> >>>>>>> mapreduce.TableRecordReader.nextKeyValue(
> > TableRecordReader.java:138)
> >>>> at
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>
> >>
> >
> >
>
>

Re: Scanner timeouts

Posted by Pat Ferrel <pa...@occamsmachete.com>.

After passing in hbase-site.xml with the increased timeout it completes pretty fast with no errors.

Thanks Ted, we’ve been going crazy trying to figure what was going on. We moved from having Hbase installed on the Spark driver machine (though not used) to containerized installation, where the config was left default on the driver and only existed in the containers. We were passing in the empty config to the spark-submit but it didn’t match the containers and fixing that has made the system much happier. 

Anyway good call, we will be more aware of this with other services now. Thanks for ending our weeks long fight!  :-)


On Oct 28, 2016, at 11:29 AM, Ted Yu <yu...@gmail.com> wrote:

bq. with 400 threads hitting HBase at the same time

How many regions are serving the 400 threads ?
How many region servers do you have ?

If the regions are spread relatively evenly across the cluster, the above
may not be big issue.

On Fri, Oct 28, 2016 at 11:21 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> Ok, will do.
> 
> So the scanner does not indicate of itself that I’ve missed something in
> handling the data. If not index, then made a fast lookup “key”? I ask
> because the timeout change may work but not be the optimal solution. The
> stage that fails is very long compared to other stages. And with 400
> threads hitting HBase at the same time, this seems like something I may
> need to restructure and any advice about that would be welcome.
> 
> HBase is 1.2.3
> 
> 
> On Oct 28, 2016, at 10:36 AM, Ted Yu <yu...@gmail.com> wrote:
> 
> For your first question, you need to pass hbase-site.xml which has config
> parameters affecting client operations to Spark  executors.
> 
> bq. missed indexing some column
> 
> hbase doesn't have indexing (as in the sense of traditional RDBMS).
> 
> Let's see what happens after hbase-site.xml is passed to executors.
> 
> BTW Can you tell us the release of hbase you're using ?
> 
> 
> 
> On Fri, Oct 28, 2016 at 10:22 AM, Pat Ferrel <pa...@occamsmachete.com>
> wrote:
> 
>> So to clarify there are some values in hbase/conf/hbase-site.xml that are
>> needed by the calling code in the Spark driver and executors and so must
> be
>> passed using --files to spark-submit? If so I can do this.
>> 
>> But do I have a deeper issue? Is it typical to need a scan like this?
> Have
>> I missed indexing some column maybe?
>> 
>> 
>> On Oct 28, 2016, at 9:59 AM, Ted Yu <yu...@gmail.com> wrote:
>> 
>> Mich:
>> bq. on table 'hbase:meta' *at region=hbase:meta,,1.1588230740
>> 
>> What you observed was different issue.
>> The above looks like trouble with locating region(s) during scan.
>> 
>> On Fri, Oct 28, 2016 at 9:54 AM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com>
>> wrote:
>> 
>>> This is an example I got
>>> 
>>> warning: there were two deprecation warnings; re-run with -deprecation
>> for
>>> details
>>> rdd1: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[77]
>> at
>>> map at <console>:151
>>> defined class columns
>>> dfTICKER: org.apache.spark.sql.Dataset[columns] = [KEY: string, TICKER:
>>> string]
>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
>>> attempts=36, exceptions:
>>> *Fri Oct 28 13:13:46 BST 2016, null, java.net.SocketTimeoutException:
>>> callTimeout=60000, callDuration=68411: row
>>> 'MARKETDATAHBASE,,00000000000000' on table 'hbase:meta' *at
>>> region=hbase:meta,,1.1588230740, hostname=rhes564,16201,1477246132044,
>>> seqNum=0
>>> at
>>> org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadRepli
>>> cas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:276)
>>> at
>>> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(
>>> ScannerCallableWithReplicas.java:210)
>>> at
>>> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(
>>> ScannerCallableWithReplicas.java:60)
>>> at
>>> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(
>>> RpcRetryingCaller.java:210)
>>> 
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>> 
>>> 
>>> 
>>> LinkedIn * https://www.linkedin.com/profile/view?id=
>>> AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=
> AAEAAAAWh2gBxianrbJd6zP6AcPCCd
>>> OABUrV8Pw>*
>>> 
>>> 
>>> 
>>> http://talebzadehmich.wordpress.com
>>> 
>>> 
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
> any
>>> loss, damage or destruction of data or any other property which may
> arise
>>> from relying on this email's technical content is explicitly disclaimed.
>>> The author will in no case be liable for any monetary damages arising
>> from
>>> such loss, damage or destruction.
>>> 
>>> 
>>> 
>>> On 28 October 2016 at 17:52, Pat Ferrel <pa...@occamsmachete.com> wrote:
>>> 
>>>> I will check that, but if that is a server startup thing I was not
> aware
>>> I
>>>> had to send it to the executors. So it’s like a connection timeout from
>>>> executor code?
>>>> 
>>>> 
>>>> On Oct 28, 2016, at 9:48 AM, Ted Yu <yu...@gmail.com> wrote:
>>>> 
>>>> How did you change the timeout(s) ?
>>>> 
>>>> bq. timeout is currently set to 60000
>>>> 
>>>> Did you pass hbase-site.xml using --files to Spark job ?
>>>> 
>>>> Cheers
>>>> 
>>>> On Fri, Oct 28, 2016 at 9:27 AM, Pat Ferrel <pa...@occamsmachete.com>
>>> wrote:
>>>> 
>>>>> Using standalone Spark. I don’t recall seeing connection lost errors,
>>> but
>>>>> there are lots of logs. I’ve set the scanner and RPC timeouts to large
>>>>> numbers on the servers.
>>>>> 
>>>>> But I also saw in the logs:
>>>>> 
>>>>> org.apache.hadoop.hbase.client.ScannerTimeoutException: 381788ms
>>>>> passed since the last invocation, timeout is currently set to 60000
>>>>> 
>>>>> Not sure where that is coming from. Does the driver machine making
>>>> queries
>>>>> need to have the timeout config also?
>>>>> 
>>>>> And why so large, am I doing something wrong?
>>>>> 
>>>>> 
>>>>> On Oct 28, 2016, at 8:50 AM, Ted Yu <yu...@gmail.com> wrote:
>>>>> 
>>>>> Mich:
>>>>> The OutOfOrderScannerNextException indicated problem with read from
>>>> hbase.
>>>>> 
>>>>> How did you know connection to Spark cluster was lost ?
>>>>> 
>>>>> Cheers
>>>>> 
>>>>> On Fri, Oct 28, 2016 at 8:47 AM, Mich Talebzadeh <
>>>>> mich.talebzadeh@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> Looks like it lost the connection to Spark cluster.
>>>>>> 
>>>>>> What mode you are using with Spark, Standalone, Yarn or others. The
>>>> issue
>>>>>> looks like a resource manager issue.
>>>>>> 
>>>>>> I have seen this when running Zeppelin with Spark on Hbase.
>>>>>> 
>>>>>> HTH
>>>>>> 
>>>>>> Dr Mich Talebzadeh
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=
>>>>>> AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>> <https://www.linkedin.com/profile/view?id=
>>>> AAEAAAAWh2gBxianrbJd6zP6AcPCCd
>>>>>> OABUrV8Pw>*
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> http://talebzadehmich.wordpress.com
>>>>>> 
>>>>>> 
>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any
>>>>>> loss, damage or destruction of data or any other property which may
>>>> arise
>>>>>> from relying on this email's technical content is explicitly
>>> disclaimed.
>>>>>> The author will in no case be liable for any monetary damages arising
>>>>> from
>>>>>> such loss, damage or destruction.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On 28 October 2016 at 16:38, Pat Ferrel <pa...@occamsmachete.com>
>>> wrote:
>>>>>> 
>>>>>>> I’m getting data from HBase using a large Spark cluster with
>>>> parallelism
>>>>>>> of near 400. The query fails quire often with the message below.
>>>>>> Sometimes
>>>>>>> a retry will work and sometimes the ultimate failure results
> (below).
>>>>>>> 
>>>>>>> If I reduce parallelism in Spark it slows other parts of the
>>> algorithm
>>>>>>> unacceptably. I have also experimented with very large RPC/Scanner
>>>>>> timeouts
>>>>>>> of many minutes—to no avail.
>>>>>>> 
>>>>>>> Any clues about what to look for or what may be setup wrong in my
>>>>> tables?
>>>>>>> 
>>>>>>> Job aborted due to stage failure: Task 44 in stage 147.0 failed 4
>>>> times,
>>>>>>> most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
>>>>>>> ip-172-16-3-9.eu-central-1.compute.internal):
>>> org.apache.hadoop.hbase.
>>>>>> DoNotRetryIOException:
>>>>>>> Failed after retry of OutOfOrderScannerNextException: was there a
>>> rpc
>>>>>>> timeout?+details
>>>>>>> Job aborted due to stage failure: Task 44 in stage 147.0 failed 4
>>>> times,
>>>>>>> most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
>>>>>>> ip-172-16-3-9.eu-central-1.compute.internal):
>>> org.apache.hadoop.hbase.
>>>>>> DoNotRetryIOException:
>>>>>>> Failed after retry of OutOfOrderScannerNextException: was there a
>>> rpc
>>>>>>> timeout? at org.apache.hadoop.hbase.client.ClientScanner.next(
>>>>>> ClientScanner.java:403)
>>>>>>> at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.
>>>>> nextKeyValue(
>>>>>>> TableRecordReaderImpl.java:232) at org.apache.hadoop.hbase.
>>>>>>> mapreduce.TableRecordReader.nextKeyValue(
> TableRecordReader.java:138)
>>>> at
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>> 
>> 
> 
>

Re: Scanner timeouts

Posted by Ted Yu <yu...@gmail.com>.

bq. with 400 threads hitting HBase at the same time

How many regions are serving the 400 threads ?
How many region servers do you have ?

If the regions are spread relatively evenly across the cluster, the above
may not be big issue.

On Fri, Oct 28, 2016 at 11:21 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> Ok, will do.
>
> So the scanner does not indicate of itself that I’ve missed something in
> handling the data. If not index, then made a fast lookup “key”? I ask
> because the timeout change may work but not be the optimal solution. The
> stage that fails is very long compared to other stages. And with 400
> threads hitting HBase at the same time, this seems like something I may
> need to restructure and any advice about that would be welcome.
>
> HBase is 1.2.3
>
>
> On Oct 28, 2016, at 10:36 AM, Ted Yu <yu...@gmail.com> wrote:
>
> For your first question, you need to pass hbase-site.xml which has config
> parameters affecting client operations to Spark  executors.
>
> bq. missed indexing some column
>
> hbase doesn't have indexing (as in the sense of traditional RDBMS).
>
> Let's see what happens after hbase-site.xml is passed to executors.
>
> BTW Can you tell us the release of hbase you're using ?
>
>
>
> On Fri, Oct 28, 2016 at 10:22 AM, Pat Ferrel <pa...@occamsmachete.com>
> wrote:
>
> > So to clarify there are some values in hbase/conf/hbase-site.xml that are
> > needed by the calling code in the Spark driver and executors and so must
> be
> > passed using --files to spark-submit? If so I can do this.
> >
> > But do I have a deeper issue? Is it typical to need a scan like this?
> Have
> > I missed indexing some column maybe?
> >
> >
> > On Oct 28, 2016, at 9:59 AM, Ted Yu <yu...@gmail.com> wrote:
> >
> > Mich:
> > bq. on table 'hbase:meta' *at region=hbase:meta,,1.1588230740
> >
> > What you observed was different issue.
> > The above looks like trouble with locating region(s) during scan.
> >
> > On Fri, Oct 28, 2016 at 9:54 AM, Mich Talebzadeh <
> > mich.talebzadeh@gmail.com>
> > wrote:
> >
> >> This is an example I got
> >>
> >> warning: there were two deprecation warnings; re-run with -deprecation
> > for
> >> details
> >> rdd1: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[77]
> > at
> >> map at <console>:151
> >> defined class columns
> >> dfTICKER: org.apache.spark.sql.Dataset[columns] = [KEY: string, TICKER:
> >> string]
> >> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
> >> attempts=36, exceptions:
> >> *Fri Oct 28 13:13:46 BST 2016, null, java.net.SocketTimeoutException:
> >> callTimeout=60000, callDuration=68411: row
> >> 'MARKETDATAHBASE,,00000000000000' on table 'hbase:meta' *at
> >> region=hbase:meta,,1.1588230740, hostname=rhes564,16201,1477246132044,
> >> seqNum=0
> >> at
> >> org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadRepli
> >> cas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:276)
> >> at
> >> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(
> >> ScannerCallableWithReplicas.java:210)
> >> at
> >> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(
> >> ScannerCallableWithReplicas.java:60)
> >> at
> >> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(
> >> RpcRetryingCaller.java:210)
> >>
> >>
> >>
> >> Dr Mich Talebzadeh
> >>
> >>
> >>
> >> LinkedIn * https://www.linkedin.com/profile/view?id=
> >> AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >> <https://www.linkedin.com/profile/view?id=
> AAEAAAAWh2gBxianrbJd6zP6AcPCCd
> >> OABUrV8Pw>*
> >>
> >>
> >>
> >> http://talebzadehmich.wordpress.com
> >>
> >>
> >> *Disclaimer:* Use it at your own risk. Any and all responsibility for
> any
> >> loss, damage or destruction of data or any other property which may
> arise
> >> from relying on this email's technical content is explicitly disclaimed.
> >> The author will in no case be liable for any monetary damages arising
> > from
> >> such loss, damage or destruction.
> >>
> >>
> >>
> >> On 28 October 2016 at 17:52, Pat Ferrel <pa...@occamsmachete.com> wrote:
> >>
> >>> I will check that, but if that is a server startup thing I was not
> aware
> >> I
> >>> had to send it to the executors. So it’s like a connection timeout from
> >>> executor code?
> >>>
> >>>
> >>> On Oct 28, 2016, at 9:48 AM, Ted Yu <yu...@gmail.com> wrote:
> >>>
> >>> How did you change the timeout(s) ?
> >>>
> >>> bq. timeout is currently set to 60000
> >>>
> >>> Did you pass hbase-site.xml using --files to Spark job ?
> >>>
> >>> Cheers
> >>>
> >>> On Fri, Oct 28, 2016 at 9:27 AM, Pat Ferrel <pa...@occamsmachete.com>
> >> wrote:
> >>>
> >>>> Using standalone Spark. I don’t recall seeing connection lost errors,
> >> but
> >>>> there are lots of logs. I’ve set the scanner and RPC timeouts to large
> >>>> numbers on the servers.
> >>>>
> >>>> But I also saw in the logs:
> >>>>
> >>>>  org.apache.hadoop.hbase.client.ScannerTimeoutException: 381788ms
> >>>> passed since the last invocation, timeout is currently set to 60000
> >>>>
> >>>> Not sure where that is coming from. Does the driver machine making
> >>> queries
> >>>> need to have the timeout config also?
> >>>>
> >>>> And why so large, am I doing something wrong?
> >>>>
> >>>>
> >>>> On Oct 28, 2016, at 8:50 AM, Ted Yu <yu...@gmail.com> wrote:
> >>>>
> >>>> Mich:
> >>>> The OutOfOrderScannerNextException indicated problem with read from
> >>> hbase.
> >>>>
> >>>> How did you know connection to Spark cluster was lost ?
> >>>>
> >>>> Cheers
> >>>>
> >>>> On Fri, Oct 28, 2016 at 8:47 AM, Mich Talebzadeh <
> >>>> mich.talebzadeh@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Looks like it lost the connection to Spark cluster.
> >>>>>
> >>>>> What mode you are using with Spark, Standalone, Yarn or others. The
> >>> issue
> >>>>> looks like a resource manager issue.
> >>>>>
> >>>>> I have seen this when running Zeppelin with Spark on Hbase.
> >>>>>
> >>>>> HTH
> >>>>>
> >>>>> Dr Mich Talebzadeh
> >>>>>
> >>>>>
> >>>>>
> >>>>> LinkedIn * https://www.linkedin.com/profile/view?id=
> >>>>> AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >>>>> <https://www.linkedin.com/profile/view?id=
> >>> AAEAAAAWh2gBxianrbJd6zP6AcPCCd
> >>>>> OABUrV8Pw>*
> >>>>>
> >>>>>
> >>>>>
> >>>>> http://talebzadehmich.wordpress.com
> >>>>>
> >>>>>
> >>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
> >>> any
> >>>>> loss, damage or destruction of data or any other property which may
> >>> arise
> >>>>> from relying on this email's technical content is explicitly
> >> disclaimed.
> >>>>> The author will in no case be liable for any monetary damages arising
> >>>> from
> >>>>> such loss, damage or destruction.
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 28 October 2016 at 16:38, Pat Ferrel <pa...@occamsmachete.com>
> >> wrote:
> >>>>>
> >>>>>> I’m getting data from HBase using a large Spark cluster with
> >>> parallelism
> >>>>>> of near 400. The query fails quire often with the message below.
> >>>>> Sometimes
> >>>>>> a retry will work and sometimes the ultimate failure results
> (below).
> >>>>>>
> >>>>>> If I reduce parallelism in Spark it slows other parts of the
> >> algorithm
> >>>>>> unacceptably. I have also experimented with very large RPC/Scanner
> >>>>> timeouts
> >>>>>> of many minutes—to no avail.
> >>>>>>
> >>>>>> Any clues about what to look for or what may be setup wrong in my
> >>>> tables?
> >>>>>>
> >>>>>> Job aborted due to stage failure: Task 44 in stage 147.0 failed 4
> >>> times,
> >>>>>> most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
> >>>>>> ip-172-16-3-9.eu-central-1.compute.internal):
> >> org.apache.hadoop.hbase.
> >>>>> DoNotRetryIOException:
> >>>>>> Failed after retry of OutOfOrderScannerNextException: was there a
> >> rpc
> >>>>>> timeout?+details
> >>>>>> Job aborted due to stage failure: Task 44 in stage 147.0 failed 4
> >>> times,
> >>>>>> most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
> >>>>>> ip-172-16-3-9.eu-central-1.compute.internal):
> >> org.apache.hadoop.hbase.
> >>>>> DoNotRetryIOException:
> >>>>>> Failed after retry of OutOfOrderScannerNextException: was there a
> >> rpc
> >>>>>> timeout? at org.apache.hadoop.hbase.client.ClientScanner.next(
> >>>>> ClientScanner.java:403)
> >>>>>> at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.
> >>>> nextKeyValue(
> >>>>>> TableRecordReaderImpl.java:232) at org.apache.hadoop.hbase.
> >>>>>> mapreduce.TableRecordReader.nextKeyValue(
> TableRecordReader.java:138)
> >>> at
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>>
> >>
> >
> >
>
>

Re: Scanner timeouts

Posted by Pat Ferrel <pa...@occamsmachete.com>.

Ok, will do.

So the scanner does not indicate of itself that I’ve missed something in handling the data. If not index, then made a fast lookup “key”? I ask because the timeout change may work but not be the optimal solution. The stage that fails is very long compared to other stages. And with 400 threads hitting HBase at the same time, this seems like something I may need to restructure and any advice about that would be welcome.

HBase is 1.2.3


On Oct 28, 2016, at 10:36 AM, Ted Yu <yu...@gmail.com> wrote:

For your first question, you need to pass hbase-site.xml which has config
parameters affecting client operations to Spark  executors.

bq. missed indexing some column

hbase doesn't have indexing (as in the sense of traditional RDBMS).

Let's see what happens after hbase-site.xml is passed to executors.

BTW Can you tell us the release of hbase you're using ?



On Fri, Oct 28, 2016 at 10:22 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> So to clarify there are some values in hbase/conf/hbase-site.xml that are
> needed by the calling code in the Spark driver and executors and so must be
> passed using --files to spark-submit? If so I can do this.
> 
> But do I have a deeper issue? Is it typical to need a scan like this? Have
> I missed indexing some column maybe?
> 
> 
> On Oct 28, 2016, at 9:59 AM, Ted Yu <yu...@gmail.com> wrote:
> 
> Mich:
> bq. on table 'hbase:meta' *at region=hbase:meta,,1.1588230740
> 
> What you observed was different issue.
> The above looks like trouble with locating region(s) during scan.
> 
> On Fri, Oct 28, 2016 at 9:54 AM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com>
> wrote:
> 
>> This is an example I got
>> 
>> warning: there were two deprecation warnings; re-run with -deprecation
> for
>> details
>> rdd1: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[77]
> at
>> map at <console>:151
>> defined class columns
>> dfTICKER: org.apache.spark.sql.Dataset[columns] = [KEY: string, TICKER:
>> string]
>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
>> attempts=36, exceptions:
>> *Fri Oct 28 13:13:46 BST 2016, null, java.net.SocketTimeoutException:
>> callTimeout=60000, callDuration=68411: row
>> 'MARKETDATAHBASE,,00000000000000' on table 'hbase:meta' *at
>> region=hbase:meta,,1.1588230740, hostname=rhes564,16201,1477246132044,
>> seqNum=0
>> at
>> org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadRepli
>> cas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:276)
>> at
>> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(
>> ScannerCallableWithReplicas.java:210)
>> at
>> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(
>> ScannerCallableWithReplicas.java:60)
>> at
>> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(
>> RpcRetryingCaller.java:210)
>> 
>> 
>> 
>> Dr Mich Talebzadeh
>> 
>> 
>> 
>> LinkedIn * https://www.linkedin.com/profile/view?id=
>> AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCd
>> OABUrV8Pw>*
>> 
>> 
>> 
>> http://talebzadehmich.wordpress.com
>> 
>> 
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
>> loss, damage or destruction of data or any other property which may arise
>> from relying on this email's technical content is explicitly disclaimed.
>> The author will in no case be liable for any monetary damages arising
> from
>> such loss, damage or destruction.
>> 
>> 
>> 
>> On 28 October 2016 at 17:52, Pat Ferrel <pa...@occamsmachete.com> wrote:
>> 
>>> I will check that, but if that is a server startup thing I was not aware
>> I
>>> had to send it to the executors. So it’s like a connection timeout from
>>> executor code?
>>> 
>>> 
>>> On Oct 28, 2016, at 9:48 AM, Ted Yu <yu...@gmail.com> wrote:
>>> 
>>> How did you change the timeout(s) ?
>>> 
>>> bq. timeout is currently set to 60000
>>> 
>>> Did you pass hbase-site.xml using --files to Spark job ?
>>> 
>>> Cheers
>>> 
>>> On Fri, Oct 28, 2016 at 9:27 AM, Pat Ferrel <pa...@occamsmachete.com>
>> wrote:
>>> 
>>>> Using standalone Spark. I don’t recall seeing connection lost errors,
>> but
>>>> there are lots of logs. I’ve set the scanner and RPC timeouts to large
>>>> numbers on the servers.
>>>> 
>>>> But I also saw in the logs:
>>>> 
>>>>  org.apache.hadoop.hbase.client.ScannerTimeoutException: 381788ms
>>>> passed since the last invocation, timeout is currently set to 60000
>>>> 
>>>> Not sure where that is coming from. Does the driver machine making
>>> queries
>>>> need to have the timeout config also?
>>>> 
>>>> And why so large, am I doing something wrong?
>>>> 
>>>> 
>>>> On Oct 28, 2016, at 8:50 AM, Ted Yu <yu...@gmail.com> wrote:
>>>> 
>>>> Mich:
>>>> The OutOfOrderScannerNextException indicated problem with read from
>>> hbase.
>>>> 
>>>> How did you know connection to Spark cluster was lost ?
>>>> 
>>>> Cheers
>>>> 
>>>> On Fri, Oct 28, 2016 at 8:47 AM, Mich Talebzadeh <
>>>> mich.talebzadeh@gmail.com>
>>>> wrote:
>>>> 
>>>>> Looks like it lost the connection to Spark cluster.
>>>>> 
>>>>> What mode you are using with Spark, Standalone, Yarn or others. The
>>> issue
>>>>> looks like a resource manager issue.
>>>>> 
>>>>> I have seen this when running Zeppelin with Spark on Hbase.
>>>>> 
>>>>> HTH
>>>>> 
>>>>> Dr Mich Talebzadeh
>>>>> 
>>>>> 
>>>>> 
>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=
>>>>> AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> <https://www.linkedin.com/profile/view?id=
>>> AAEAAAAWh2gBxianrbJd6zP6AcPCCd
>>>>> OABUrV8Pw>*
>>>>> 
>>>>> 
>>>>> 
>>>>> http://talebzadehmich.wordpress.com
>>>>> 
>>>>> 
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any
>>>>> loss, damage or destruction of data or any other property which may
>>> arise
>>>>> from relying on this email's technical content is explicitly
>> disclaimed.
>>>>> The author will in no case be liable for any monetary damages arising
>>>> from
>>>>> such loss, damage or destruction.
>>>>> 
>>>>> 
>>>>> 
>>>>> On 28 October 2016 at 16:38, Pat Ferrel <pa...@occamsmachete.com>
>> wrote:
>>>>> 
>>>>>> I’m getting data from HBase using a large Spark cluster with
>>> parallelism
>>>>>> of near 400. The query fails quire often with the message below.
>>>>> Sometimes
>>>>>> a retry will work and sometimes the ultimate failure results (below).
>>>>>> 
>>>>>> If I reduce parallelism in Spark it slows other parts of the
>> algorithm
>>>>>> unacceptably. I have also experimented with very large RPC/Scanner
>>>>> timeouts
>>>>>> of many minutes—to no avail.
>>>>>> 
>>>>>> Any clues about what to look for or what may be setup wrong in my
>>>> tables?
>>>>>> 
>>>>>> Job aborted due to stage failure: Task 44 in stage 147.0 failed 4
>>> times,
>>>>>> most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
>>>>>> ip-172-16-3-9.eu-central-1.compute.internal):
>> org.apache.hadoop.hbase.
>>>>> DoNotRetryIOException:
>>>>>> Failed after retry of OutOfOrderScannerNextException: was there a
>> rpc
>>>>>> timeout?+details
>>>>>> Job aborted due to stage failure: Task 44 in stage 147.0 failed 4
>>> times,
>>>>>> most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
>>>>>> ip-172-16-3-9.eu-central-1.compute.internal):
>> org.apache.hadoop.hbase.
>>>>> DoNotRetryIOException:
>>>>>> Failed after retry of OutOfOrderScannerNextException: was there a
>> rpc
>>>>>> timeout? at org.apache.hadoop.hbase.client.ClientScanner.next(
>>>>> ClientScanner.java:403)
>>>>>> at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.
>>>> nextKeyValue(
>>>>>> TableRecordReaderImpl.java:232) at org.apache.hadoop.hbase.
>>>>>> mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:138)
>>> at
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
> 
>

Re: Scanner timeouts

Posted by Ted Yu <yu...@gmail.com>.

For your first question, you need to pass hbase-site.xml which has config
parameters affecting client operations to Spark  executors.

bq. missed indexing some column

hbase doesn't have indexing (as in the sense of traditional RDBMS).

Let's see what happens after hbase-site.xml is passed to executors.

BTW Can you tell us the release of hbase you're using ?



On Fri, Oct 28, 2016 at 10:22 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> So to clarify there are some values in hbase/conf/hbase-site.xml that are
> needed by the calling code in the Spark driver and executors and so must be
> passed using --files to spark-submit? If so I can do this.
>
> But do I have a deeper issue? Is it typical to need a scan like this? Have
> I missed indexing some column maybe?
>
>
> On Oct 28, 2016, at 9:59 AM, Ted Yu <yu...@gmail.com> wrote:
>
> Mich:
> bq. on table 'hbase:meta' *at region=hbase:meta,,1.1588230740
>
> What you observed was different issue.
> The above looks like trouble with locating region(s) during scan.
>
> On Fri, Oct 28, 2016 at 9:54 AM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com>
> wrote:
>
> > This is an example I got
> >
> > warning: there were two deprecation warnings; re-run with -deprecation
> for
> > details
> > rdd1: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[77]
> at
> > map at <console>:151
> > defined class columns
> > dfTICKER: org.apache.spark.sql.Dataset[columns] = [KEY: string, TICKER:
> > string]
> > org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
> > attempts=36, exceptions:
> > *Fri Oct 28 13:13:46 BST 2016, null, java.net.SocketTimeoutException:
> > callTimeout=60000, callDuration=68411: row
> > 'MARKETDATAHBASE,,00000000000000' on table 'hbase:meta' *at
> > region=hbase:meta,,1.1588230740, hostname=rhes564,16201,1477246132044,
> > seqNum=0
> >  at
> > org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadRepli
> > cas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:276)
> >  at
> > org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(
> > ScannerCallableWithReplicas.java:210)
> >  at
> > org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(
> > ScannerCallableWithReplicas.java:60)
> >  at
> > org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(
> > RpcRetryingCaller.java:210)
> >
> >
> >
> > Dr Mich Talebzadeh
> >
> >
> >
> > LinkedIn * https://www.linkedin.com/profile/view?id=
> > AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> > <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCd
> > OABUrV8Pw>*
> >
> >
> >
> > http://talebzadehmich.wordpress.com
> >
> >
> > *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> > loss, damage or destruction of data or any other property which may arise
> > from relying on this email's technical content is explicitly disclaimed.
> > The author will in no case be liable for any monetary damages arising
> from
> > such loss, damage or destruction.
> >
> >
> >
> > On 28 October 2016 at 17:52, Pat Ferrel <pa...@occamsmachete.com> wrote:
> >
> >> I will check that, but if that is a server startup thing I was not aware
> > I
> >> had to send it to the executors. So it’s like a connection timeout from
> >> executor code?
> >>
> >>
> >> On Oct 28, 2016, at 9:48 AM, Ted Yu <yu...@gmail.com> wrote:
> >>
> >> How did you change the timeout(s) ?
> >>
> >> bq. timeout is currently set to 60000
> >>
> >> Did you pass hbase-site.xml using --files to Spark job ?
> >>
> >> Cheers
> >>
> >> On Fri, Oct 28, 2016 at 9:27 AM, Pat Ferrel <pa...@occamsmachete.com>
> > wrote:
> >>
> >>> Using standalone Spark. I don’t recall seeing connection lost errors,
> > but
> >>> there are lots of logs. I’ve set the scanner and RPC timeouts to large
> >>> numbers on the servers.
> >>>
> >>> But I also saw in the logs:
> >>>
> >>>   org.apache.hadoop.hbase.client.ScannerTimeoutException: 381788ms
> >>> passed since the last invocation, timeout is currently set to 60000
> >>>
> >>> Not sure where that is coming from. Does the driver machine making
> >> queries
> >>> need to have the timeout config also?
> >>>
> >>> And why so large, am I doing something wrong?
> >>>
> >>>
> >>> On Oct 28, 2016, at 8:50 AM, Ted Yu <yu...@gmail.com> wrote:
> >>>
> >>> Mich:
> >>> The OutOfOrderScannerNextException indicated problem with read from
> >> hbase.
> >>>
> >>> How did you know connection to Spark cluster was lost ?
> >>>
> >>> Cheers
> >>>
> >>> On Fri, Oct 28, 2016 at 8:47 AM, Mich Talebzadeh <
> >>> mich.talebzadeh@gmail.com>
> >>> wrote:
> >>>
> >>>> Looks like it lost the connection to Spark cluster.
> >>>>
> >>>> What mode you are using with Spark, Standalone, Yarn or others. The
> >> issue
> >>>> looks like a resource manager issue.
> >>>>
> >>>> I have seen this when running Zeppelin with Spark on Hbase.
> >>>>
> >>>> HTH
> >>>>
> >>>> Dr Mich Talebzadeh
> >>>>
> >>>>
> >>>>
> >>>> LinkedIn * https://www.linkedin.com/profile/view?id=
> >>>> AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >>>> <https://www.linkedin.com/profile/view?id=
> >> AAEAAAAWh2gBxianrbJd6zP6AcPCCd
> >>>> OABUrV8Pw>*
> >>>>
> >>>>
> >>>>
> >>>> http://talebzadehmich.wordpress.com
> >>>>
> >>>>
> >>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
> >> any
> >>>> loss, damage or destruction of data or any other property which may
> >> arise
> >>>> from relying on this email's technical content is explicitly
> > disclaimed.
> >>>> The author will in no case be liable for any monetary damages arising
> >>> from
> >>>> such loss, damage or destruction.
> >>>>
> >>>>
> >>>>
> >>>> On 28 October 2016 at 16:38, Pat Ferrel <pa...@occamsmachete.com>
> > wrote:
> >>>>
> >>>>> I’m getting data from HBase using a large Spark cluster with
> >> parallelism
> >>>>> of near 400. The query fails quire often with the message below.
> >>>> Sometimes
> >>>>> a retry will work and sometimes the ultimate failure results (below).
> >>>>>
> >>>>> If I reduce parallelism in Spark it slows other parts of the
> > algorithm
> >>>>> unacceptably. I have also experimented with very large RPC/Scanner
> >>>> timeouts
> >>>>> of many minutes—to no avail.
> >>>>>
> >>>>> Any clues about what to look for or what may be setup wrong in my
> >>> tables?
> >>>>>
> >>>>> Job aborted due to stage failure: Task 44 in stage 147.0 failed 4
> >> times,
> >>>>> most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
> >>>>> ip-172-16-3-9.eu-central-1.compute.internal):
> > org.apache.hadoop.hbase.
> >>>> DoNotRetryIOException:
> >>>>> Failed after retry of OutOfOrderScannerNextException: was there a
> > rpc
> >>>>> timeout?+details
> >>>>> Job aborted due to stage failure: Task 44 in stage 147.0 failed 4
> >> times,
> >>>>> most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
> >>>>> ip-172-16-3-9.eu-central-1.compute.internal):
> > org.apache.hadoop.hbase.
> >>>> DoNotRetryIOException:
> >>>>> Failed after retry of OutOfOrderScannerNextException: was there a
> > rpc
> >>>>> timeout? at org.apache.hadoop.hbase.client.ClientScanner.next(
> >>>> ClientScanner.java:403)
> >>>>> at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.
> >>> nextKeyValue(
> >>>>> TableRecordReaderImpl.java:232) at org.apache.hadoop.hbase.
> >>>>> mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:138)
> >> at
> >>>>>
> >>>>
> >>>
> >>>
> >>
> >>
> >
>
>

Re: Scanner timeouts

Posted by Pat Ferrel <pa...@occamsmachete.com>.

So to clarify there are some values in hbase/conf/hbase-site.xml that are needed by the calling code in the Spark driver and executors and so must be passed using --files to spark-submit? If so I can do this.

But do I have a deeper issue? Is it typical to need a scan like this? Have I missed indexing some column maybe?


On Oct 28, 2016, at 9:59 AM, Ted Yu <yu...@gmail.com> wrote:

Mich:
bq. on table 'hbase:meta' *at region=hbase:meta,,1.1588230740

What you observed was different issue.
The above looks like trouble with locating region(s) during scan.

On Fri, Oct 28, 2016 at 9:54 AM, Mich Talebzadeh <mi...@gmail.com>
wrote:

> This is an example I got
> 
> warning: there were two deprecation warnings; re-run with -deprecation for
> details
> rdd1: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[77] at
> map at <console>:151
> defined class columns
> dfTICKER: org.apache.spark.sql.Dataset[columns] = [KEY: string, TICKER:
> string]
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
> attempts=36, exceptions:
> *Fri Oct 28 13:13:46 BST 2016, null, java.net.SocketTimeoutException:
> callTimeout=60000, callDuration=68411: row
> 'MARKETDATAHBASE,,00000000000000' on table 'hbase:meta' *at
> region=hbase:meta,,1.1588230740, hostname=rhes564,16201,1477246132044,
> seqNum=0
>  at
> org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadRepli
> cas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:276)
>  at
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(
> ScannerCallableWithReplicas.java:210)
>  at
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(
> ScannerCallableWithReplicas.java:60)
>  at
> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(
> RpcRetryingCaller.java:210)
> 
> 
> 
> Dr Mich Talebzadeh
> 
> 
> 
> LinkedIn * https://www.linkedin.com/profile/view?id=
> AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCd
> OABUrV8Pw>*
> 
> 
> 
> http://talebzadehmich.wordpress.com
> 
> 
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
> 
> 
> 
> On 28 October 2016 at 17:52, Pat Ferrel <pa...@occamsmachete.com> wrote:
> 
>> I will check that, but if that is a server startup thing I was not aware
> I
>> had to send it to the executors. So it’s like a connection timeout from
>> executor code?
>> 
>> 
>> On Oct 28, 2016, at 9:48 AM, Ted Yu <yu...@gmail.com> wrote:
>> 
>> How did you change the timeout(s) ?
>> 
>> bq. timeout is currently set to 60000
>> 
>> Did you pass hbase-site.xml using --files to Spark job ?
>> 
>> Cheers
>> 
>> On Fri, Oct 28, 2016 at 9:27 AM, Pat Ferrel <pa...@occamsmachete.com>
> wrote:
>> 
>>> Using standalone Spark. I don’t recall seeing connection lost errors,
> but
>>> there are lots of logs. I’ve set the scanner and RPC timeouts to large
>>> numbers on the servers.
>>> 
>>> But I also saw in the logs:
>>> 
>>>   org.apache.hadoop.hbase.client.ScannerTimeoutException: 381788ms
>>> passed since the last invocation, timeout is currently set to 60000
>>> 
>>> Not sure where that is coming from. Does the driver machine making
>> queries
>>> need to have the timeout config also?
>>> 
>>> And why so large, am I doing something wrong?
>>> 
>>> 
>>> On Oct 28, 2016, at 8:50 AM, Ted Yu <yu...@gmail.com> wrote:
>>> 
>>> Mich:
>>> The OutOfOrderScannerNextException indicated problem with read from
>> hbase.
>>> 
>>> How did you know connection to Spark cluster was lost ?
>>> 
>>> Cheers
>>> 
>>> On Fri, Oct 28, 2016 at 8:47 AM, Mich Talebzadeh <
>>> mich.talebzadeh@gmail.com>
>>> wrote:
>>> 
>>>> Looks like it lost the connection to Spark cluster.
>>>> 
>>>> What mode you are using with Spark, Standalone, Yarn or others. The
>> issue
>>>> looks like a resource manager issue.
>>>> 
>>>> I have seen this when running Zeppelin with Spark on Hbase.
>>>> 
>>>> HTH
>>>> 
>>>> Dr Mich Talebzadeh
>>>> 
>>>> 
>>>> 
>>>> LinkedIn * https://www.linkedin.com/profile/view?id=
>>>> AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=
>> AAEAAAAWh2gBxianrbJd6zP6AcPCCd
>>>> OABUrV8Pw>*
>>>> 
>>>> 
>>>> 
>>>> http://talebzadehmich.wordpress.com
>>>> 
>>>> 
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any
>>>> loss, damage or destruction of data or any other property which may
>> arise
>>>> from relying on this email's technical content is explicitly
> disclaimed.
>>>> The author will in no case be liable for any monetary damages arising
>>> from
>>>> such loss, damage or destruction.
>>>> 
>>>> 
>>>> 
>>>> On 28 October 2016 at 16:38, Pat Ferrel <pa...@occamsmachete.com>
> wrote:
>>>> 
>>>>> I’m getting data from HBase using a large Spark cluster with
>> parallelism
>>>>> of near 400. The query fails quire often with the message below.
>>>> Sometimes
>>>>> a retry will work and sometimes the ultimate failure results (below).
>>>>> 
>>>>> If I reduce parallelism in Spark it slows other parts of the
> algorithm
>>>>> unacceptably. I have also experimented with very large RPC/Scanner
>>>> timeouts
>>>>> of many minutes—to no avail.
>>>>> 
>>>>> Any clues about what to look for or what may be setup wrong in my
>>> tables?
>>>>> 
>>>>> Job aborted due to stage failure: Task 44 in stage 147.0 failed 4
>> times,
>>>>> most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
>>>>> ip-172-16-3-9.eu-central-1.compute.internal):
> org.apache.hadoop.hbase.
>>>> DoNotRetryIOException:
>>>>> Failed after retry of OutOfOrderScannerNextException: was there a
> rpc
>>>>> timeout?+details
>>>>> Job aborted due to stage failure: Task 44 in stage 147.0 failed 4
>> times,
>>>>> most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
>>>>> ip-172-16-3-9.eu-central-1.compute.internal):
> org.apache.hadoop.hbase.
>>>> DoNotRetryIOException:
>>>>> Failed after retry of OutOfOrderScannerNextException: was there a
> rpc
>>>>> timeout? at org.apache.hadoop.hbase.client.ClientScanner.next(
>>>> ClientScanner.java:403)
>>>>> at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.
>>> nextKeyValue(
>>>>> TableRecordReaderImpl.java:232) at org.apache.hadoop.hbase.
>>>>> mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:138)
>> at
>>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>

Re: Scanner timeouts

Posted by Pat Ferrel <pa...@occamsmachete.com>.

Can you start a separate thread? 


On Oct 28, 2016, at 10:29 AM, Mich Talebzadeh <mi...@gmail.com> wrote:

sorry do you mean in my error case the issue was locating regions during
scan.

in that case I do not know why it works through spark shell but not
Zeppelin?

Thanks

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 28 October 2016 at 17:59, Ted Yu <yu...@gmail.com> wrote:

> Mich:
> bq. on table 'hbase:meta' *at region=hbase:meta,,1.1588230740
> 
> What you observed was different issue.
> The above looks like trouble with locating region(s) during scan.
> 
> On Fri, Oct 28, 2016 at 9:54 AM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com>
> wrote:
> 
>> This is an example I got
>> 
>> warning: there were two deprecation warnings; re-run with -deprecation
> for
>> details
>> rdd1: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[77]
> at
>> map at <console>:151
>> defined class columns
>> dfTICKER: org.apache.spark.sql.Dataset[columns] = [KEY: string, TICKER:
>> string]
>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
>> attempts=36, exceptions:
>> *Fri Oct 28 13:13:46 BST 2016, null, java.net.SocketTimeoutException:
>> callTimeout=60000, callDuration=68411: row
>> 'MARKETDATAHBASE,,00000000000000' on table 'hbase:meta' *at
>> region=hbase:meta,,1.1588230740, hostname=rhes564,16201,1477246132044,
>> seqNum=0
>>  at
>> org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadRepli
>> cas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:276)
>>  at
>> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(
>> ScannerCallableWithReplicas.java:210)
>>  at
>> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(
>> ScannerCallableWithReplicas.java:60)
>>  at
>> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(
>> RpcRetryingCaller.java:210)
>> 
>> 
>> 
>> Dr Mich Talebzadeh
>> 
>> 
>> 
>> LinkedIn * https://www.linkedin.com/profile/view?id=
>> AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCd
>> OABUrV8Pw>*
>> 
>> 
>> 
>> http://talebzadehmich.wordpress.com
>> 
>> 
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
>> loss, damage or destruction of data or any other property which may arise
>> from relying on this email's technical content is explicitly disclaimed.
>> The author will in no case be liable for any monetary damages arising
> from
>> such loss, damage or destruction.
>> 
>> 
>> 
>> On 28 October 2016 at 17:52, Pat Ferrel <pa...@occamsmachete.com> wrote:
>> 
>>> I will check that, but if that is a server startup thing I was not
> aware
>> I
>>> had to send it to the executors. So it’s like a connection timeout from
>>> executor code?
>>> 
>>> 
>>> On Oct 28, 2016, at 9:48 AM, Ted Yu <yu...@gmail.com> wrote:
>>> 
>>> How did you change the timeout(s) ?
>>> 
>>> bq. timeout is currently set to 60000
>>> 
>>> Did you pass hbase-site.xml using --files to Spark job ?
>>> 
>>> Cheers
>>> 
>>> On Fri, Oct 28, 2016 at 9:27 AM, Pat Ferrel <pa...@occamsmachete.com>
>> wrote:
>>> 
>>>> Using standalone Spark. I don’t recall seeing connection lost errors,
>> but
>>>> there are lots of logs. I’ve set the scanner and RPC timeouts to
> large
>>>> numbers on the servers.
>>>> 
>>>> But I also saw in the logs:
>>>> 
>>>>   org.apache.hadoop.hbase.client.ScannerTimeoutException: 381788ms
>>>> passed since the last invocation, timeout is currently set to 60000
>>>> 
>>>> Not sure where that is coming from. Does the driver machine making
>>> queries
>>>> need to have the timeout config also?
>>>> 
>>>> And why so large, am I doing something wrong?
>>>> 
>>>> 
>>>> On Oct 28, 2016, at 8:50 AM, Ted Yu <yu...@gmail.com> wrote:
>>>> 
>>>> Mich:
>>>> The OutOfOrderScannerNextException indicated problem with read from
>>> hbase.
>>>> 
>>>> How did you know connection to Spark cluster was lost ?
>>>> 
>>>> Cheers
>>>> 
>>>> On Fri, Oct 28, 2016 at 8:47 AM, Mich Talebzadeh <
>>>> mich.talebzadeh@gmail.com>
>>>> wrote:
>>>> 
>>>>> Looks like it lost the connection to Spark cluster.
>>>>> 
>>>>> What mode you are using with Spark, Standalone, Yarn or others. The
>>> issue
>>>>> looks like a resource manager issue.
>>>>> 
>>>>> I have seen this when running Zeppelin with Spark on Hbase.
>>>>> 
>>>>> HTH
>>>>> 
>>>>> Dr Mich Talebzadeh
>>>>> 
>>>>> 
>>>>> 
>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=
>>>>> AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> <https://www.linkedin.com/profile/view?id=
>>> AAEAAAAWh2gBxianrbJd6zP6AcPCCd
>>>>> OABUrV8Pw>*
>>>>> 
>>>>> 
>>>>> 
>>>>> http://talebzadehmich.wordpress.com
>>>>> 
>>>>> 
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
> for
>>> any
>>>>> loss, damage or destruction of data or any other property which may
>>> arise
>>>>> from relying on this email's technical content is explicitly
>> disclaimed.
>>>>> The author will in no case be liable for any monetary damages
> arising
>>>> from
>>>>> such loss, damage or destruction.
>>>>> 
>>>>> 
>>>>> 
>>>>> On 28 October 2016 at 16:38, Pat Ferrel <pa...@occamsmachete.com>
>> wrote:
>>>>> 
>>>>>> I’m getting data from HBase using a large Spark cluster with
>>> parallelism
>>>>>> of near 400. The query fails quire often with the message below.
>>>>> Sometimes
>>>>>> a retry will work and sometimes the ultimate failure results
> (below).
>>>>>> 
>>>>>> If I reduce parallelism in Spark it slows other parts of the
>> algorithm
>>>>>> unacceptably. I have also experimented with very large RPC/Scanner
>>>>> timeouts
>>>>>> of many minutes—to no avail.
>>>>>> 
>>>>>> Any clues about what to look for or what may be setup wrong in my
>>>> tables?
>>>>>> 
>>>>>> Job aborted due to stage failure: Task 44 in stage 147.0 failed 4
>>> times,
>>>>>> most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
>>>>>> ip-172-16-3-9.eu-central-1.compute.internal):
>> org.apache.hadoop.hbase.
>>>>> DoNotRetryIOException:
>>>>>> Failed after retry of OutOfOrderScannerNextException: was there a
>> rpc
>>>>>> timeout?+details
>>>>>> Job aborted due to stage failure: Task 44 in stage 147.0 failed 4
>>> times,
>>>>>> most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
>>>>>> ip-172-16-3-9.eu-central-1.compute.internal):
>> org.apache.hadoop.hbase.
>>>>> DoNotRetryIOException:
>>>>>> Failed after retry of OutOfOrderScannerNextException: was there a
>> rpc
>>>>>> timeout? at org.apache.hadoop.hbase.client.ClientScanner.next(
>>>>> ClientScanner.java:403)
>>>>>> at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.
>>>> nextKeyValue(
>>>>>> TableRecordReaderImpl.java:232) at org.apache.hadoop.hbase.
>>>>>> mapreduce.TableRecordReader.nextKeyValue(
> TableRecordReader.java:138)
>>> at
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>

Re: Scanner timeouts

Posted by Mich Talebzadeh <mi...@gmail.com>.

sorry do you mean in my error case the issue was locating regions during
scan.

in that case I do not know why it works through spark shell but not
Zeppelin?

Thanks

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 28 October 2016 at 17:59, Ted Yu <yu...@gmail.com> wrote:

> Mich:
> bq. on table 'hbase:meta' *at region=hbase:meta,,1.1588230740
>
> What you observed was different issue.
> The above looks like trouble with locating region(s) during scan.
>
> On Fri, Oct 28, 2016 at 9:54 AM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com>
> wrote:
>
> > This is an example I got
> >
> > warning: there were two deprecation warnings; re-run with -deprecation
> for
> > details
> > rdd1: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[77]
> at
> > map at <console>:151
> > defined class columns
> > dfTICKER: org.apache.spark.sql.Dataset[columns] = [KEY: string, TICKER:
> > string]
> > org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
> > attempts=36, exceptions:
> > *Fri Oct 28 13:13:46 BST 2016, null, java.net.SocketTimeoutException:
> > callTimeout=60000, callDuration=68411: row
> > 'MARKETDATAHBASE,,00000000000000' on table 'hbase:meta' *at
> > region=hbase:meta,,1.1588230740, hostname=rhes564,16201,1477246132044,
> > seqNum=0
> >   at
> > org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadRepli
> > cas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:276)
> >   at
> > org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(
> > ScannerCallableWithReplicas.java:210)
> >   at
> > org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(
> > ScannerCallableWithReplicas.java:60)
> >   at
> > org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(
> > RpcRetryingCaller.java:210)
> >
> >
> >
> > Dr Mich Talebzadeh
> >
> >
> >
> > LinkedIn * https://www.linkedin.com/profile/view?id=
> > AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> > <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCd
> > OABUrV8Pw>*
> >
> >
> >
> > http://talebzadehmich.wordpress.com
> >
> >
> > *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> > loss, damage or destruction of data or any other property which may arise
> > from relying on this email's technical content is explicitly disclaimed.
> > The author will in no case be liable for any monetary damages arising
> from
> > such loss, damage or destruction.
> >
> >
> >
> > On 28 October 2016 at 17:52, Pat Ferrel <pa...@occamsmachete.com> wrote:
> >
> > > I will check that, but if that is a server startup thing I was not
> aware
> > I
> > > had to send it to the executors. So it’s like a connection timeout from
> > > executor code?
> > >
> > >
> > > On Oct 28, 2016, at 9:48 AM, Ted Yu <yu...@gmail.com> wrote:
> > >
> > > How did you change the timeout(s) ?
> > >
> > > bq. timeout is currently set to 60000
> > >
> > > Did you pass hbase-site.xml using --files to Spark job ?
> > >
> > > Cheers
> > >
> > > On Fri, Oct 28, 2016 at 9:27 AM, Pat Ferrel <pa...@occamsmachete.com>
> > wrote:
> > >
> > > > Using standalone Spark. I don’t recall seeing connection lost errors,
> > but
> > > > there are lots of logs. I’ve set the scanner and RPC timeouts to
> large
> > > > numbers on the servers.
> > > >
> > > > But I also saw in the logs:
> > > >
> > > >    org.apache.hadoop.hbase.client.ScannerTimeoutException: 381788ms
> > > > passed since the last invocation, timeout is currently set to 60000
> > > >
> > > > Not sure where that is coming from. Does the driver machine making
> > > queries
> > > > need to have the timeout config also?
> > > >
> > > > And why so large, am I doing something wrong?
> > > >
> > > >
> > > > On Oct 28, 2016, at 8:50 AM, Ted Yu <yu...@gmail.com> wrote:
> > > >
> > > > Mich:
> > > > The OutOfOrderScannerNextException indicated problem with read from
> > > hbase.
> > > >
> > > > How did you know connection to Spark cluster was lost ?
> > > >
> > > > Cheers
> > > >
> > > > On Fri, Oct 28, 2016 at 8:47 AM, Mich Talebzadeh <
> > > > mich.talebzadeh@gmail.com>
> > > > wrote:
> > > >
> > > >> Looks like it lost the connection to Spark cluster.
> > > >>
> > > >> What mode you are using with Spark, Standalone, Yarn or others. The
> > > issue
> > > >> looks like a resource manager issue.
> > > >>
> > > >> I have seen this when running Zeppelin with Spark on Hbase.
> > > >>
> > > >> HTH
> > > >>
> > > >> Dr Mich Talebzadeh
> > > >>
> > > >>
> > > >>
> > > >> LinkedIn * https://www.linkedin.com/profile/view?id=
> > > >> AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> > > >> <https://www.linkedin.com/profile/view?id=
> > > AAEAAAAWh2gBxianrbJd6zP6AcPCCd
> > > >> OABUrV8Pw>*
> > > >>
> > > >>
> > > >>
> > > >> http://talebzadehmich.wordpress.com
> > > >>
> > > >>
> > > >> *Disclaimer:* Use it at your own risk. Any and all responsibility
> for
> > > any
> > > >> loss, damage or destruction of data or any other property which may
> > > arise
> > > >> from relying on this email's technical content is explicitly
> > disclaimed.
> > > >> The author will in no case be liable for any monetary damages
> arising
> > > > from
> > > >> such loss, damage or destruction.
> > > >>
> > > >>
> > > >>
> > > >> On 28 October 2016 at 16:38, Pat Ferrel <pa...@occamsmachete.com>
> > wrote:
> > > >>
> > > >>> I’m getting data from HBase using a large Spark cluster with
> > > parallelism
> > > >>> of near 400. The query fails quire often with the message below.
> > > >> Sometimes
> > > >>> a retry will work and sometimes the ultimate failure results
> (below).
> > > >>>
> > > >>> If I reduce parallelism in Spark it slows other parts of the
> > algorithm
> > > >>> unacceptably. I have also experimented with very large RPC/Scanner
> > > >> timeouts
> > > >>> of many minutes—to no avail.
> > > >>>
> > > >>> Any clues about what to look for or what may be setup wrong in my
> > > > tables?
> > > >>>
> > > >>> Job aborted due to stage failure: Task 44 in stage 147.0 failed 4
> > > times,
> > > >>> most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
> > > >>> ip-172-16-3-9.eu-central-1.compute.internal):
> > org.apache.hadoop.hbase.
> > > >> DoNotRetryIOException:
> > > >>> Failed after retry of OutOfOrderScannerNextException: was there a
> > rpc
> > > >>> timeout?+details
> > > >>> Job aborted due to stage failure: Task 44 in stage 147.0 failed 4
> > > times,
> > > >>> most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
> > > >>> ip-172-16-3-9.eu-central-1.compute.internal):
> > org.apache.hadoop.hbase.
> > > >> DoNotRetryIOException:
> > > >>> Failed after retry of OutOfOrderScannerNextException: was there a
> > rpc
> > > >>> timeout? at org.apache.hadoop.hbase.client.ClientScanner.next(
> > > >> ClientScanner.java:403)
> > > >>> at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.
> > > > nextKeyValue(
> > > >>> TableRecordReaderImpl.java:232) at org.apache.hadoop.hbase.
> > > >>> mapreduce.TableRecordReader.nextKeyValue(
> TableRecordReader.java:138)
> > > at
> > > >>>
> > > >>
> > > >
> > > >
> > >
> > >
> >
>

Re: Scanner timeouts

Posted by Ted Yu <yu...@gmail.com>.

Mich:
bq. on table 'hbase:meta' *at region=hbase:meta,,1.1588230740

What you observed was different issue.
The above looks like trouble with locating region(s) during scan.

On Fri, Oct 28, 2016 at 9:54 AM, Mich Talebzadeh <mi...@gmail.com>
wrote:

> This is an example I got
>
> warning: there were two deprecation warnings; re-run with -deprecation for
> details
> rdd1: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[77] at
> map at <console>:151
> defined class columns
> dfTICKER: org.apache.spark.sql.Dataset[columns] = [KEY: string, TICKER:
> string]
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
> attempts=36, exceptions:
> *Fri Oct 28 13:13:46 BST 2016, null, java.net.SocketTimeoutException:
> callTimeout=60000, callDuration=68411: row
> 'MARKETDATAHBASE,,00000000000000' on table 'hbase:meta' *at
> region=hbase:meta,,1.1588230740, hostname=rhes564,16201,1477246132044,
> seqNum=0
>   at
> org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadRepli
> cas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:276)
>   at
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(
> ScannerCallableWithReplicas.java:210)
>   at
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(
> ScannerCallableWithReplicas.java:60)
>   at
> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(
> RpcRetryingCaller.java:210)
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=
> AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCd
> OABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 28 October 2016 at 17:52, Pat Ferrel <pa...@occamsmachete.com> wrote:
>
> > I will check that, but if that is a server startup thing I was not aware
> I
> > had to send it to the executors. So it’s like a connection timeout from
> > executor code?
> >
> >
> > On Oct 28, 2016, at 9:48 AM, Ted Yu <yu...@gmail.com> wrote:
> >
> > How did you change the timeout(s) ?
> >
> > bq. timeout is currently set to 60000
> >
> > Did you pass hbase-site.xml using --files to Spark job ?
> >
> > Cheers
> >
> > On Fri, Oct 28, 2016 at 9:27 AM, Pat Ferrel <pa...@occamsmachete.com>
> wrote:
> >
> > > Using standalone Spark. I don’t recall seeing connection lost errors,
> but
> > > there are lots of logs. I’ve set the scanner and RPC timeouts to large
> > > numbers on the servers.
> > >
> > > But I also saw in the logs:
> > >
> > >    org.apache.hadoop.hbase.client.ScannerTimeoutException: 381788ms
> > > passed since the last invocation, timeout is currently set to 60000
> > >
> > > Not sure where that is coming from. Does the driver machine making
> > queries
> > > need to have the timeout config also?
> > >
> > > And why so large, am I doing something wrong?
> > >
> > >
> > > On Oct 28, 2016, at 8:50 AM, Ted Yu <yu...@gmail.com> wrote:
> > >
> > > Mich:
> > > The OutOfOrderScannerNextException indicated problem with read from
> > hbase.
> > >
> > > How did you know connection to Spark cluster was lost ?
> > >
> > > Cheers
> > >
> > > On Fri, Oct 28, 2016 at 8:47 AM, Mich Talebzadeh <
> > > mich.talebzadeh@gmail.com>
> > > wrote:
> > >
> > >> Looks like it lost the connection to Spark cluster.
> > >>
> > >> What mode you are using with Spark, Standalone, Yarn or others. The
> > issue
> > >> looks like a resource manager issue.
> > >>
> > >> I have seen this when running Zeppelin with Spark on Hbase.
> > >>
> > >> HTH
> > >>
> > >> Dr Mich Talebzadeh
> > >>
> > >>
> > >>
> > >> LinkedIn * https://www.linkedin.com/profile/view?id=
> > >> AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> > >> <https://www.linkedin.com/profile/view?id=
> > AAEAAAAWh2gBxianrbJd6zP6AcPCCd
> > >> OABUrV8Pw>*
> > >>
> > >>
> > >>
> > >> http://talebzadehmich.wordpress.com
> > >>
> > >>
> > >> *Disclaimer:* Use it at your own risk. Any and all responsibility for
> > any
> > >> loss, damage or destruction of data or any other property which may
> > arise
> > >> from relying on this email's technical content is explicitly
> disclaimed.
> > >> The author will in no case be liable for any monetary damages arising
> > > from
> > >> such loss, damage or destruction.
> > >>
> > >>
> > >>
> > >> On 28 October 2016 at 16:38, Pat Ferrel <pa...@occamsmachete.com>
> wrote:
> > >>
> > >>> I’m getting data from HBase using a large Spark cluster with
> > parallelism
> > >>> of near 400. The query fails quire often with the message below.
> > >> Sometimes
> > >>> a retry will work and sometimes the ultimate failure results (below).
> > >>>
> > >>> If I reduce parallelism in Spark it slows other parts of the
> algorithm
> > >>> unacceptably. I have also experimented with very large RPC/Scanner
> > >> timeouts
> > >>> of many minutes—to no avail.
> > >>>
> > >>> Any clues about what to look for or what may be setup wrong in my
> > > tables?
> > >>>
> > >>> Job aborted due to stage failure: Task 44 in stage 147.0 failed 4
> > times,
> > >>> most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
> > >>> ip-172-16-3-9.eu-central-1.compute.internal):
> org.apache.hadoop.hbase.
> > >> DoNotRetryIOException:
> > >>> Failed after retry of OutOfOrderScannerNextException: was there a
> rpc
> > >>> timeout?+details
> > >>> Job aborted due to stage failure: Task 44 in stage 147.0 failed 4
> > times,
> > >>> most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
> > >>> ip-172-16-3-9.eu-central-1.compute.internal):
> org.apache.hadoop.hbase.
> > >> DoNotRetryIOException:
> > >>> Failed after retry of OutOfOrderScannerNextException: was there a
> rpc
> > >>> timeout? at org.apache.hadoop.hbase.client.ClientScanner.next(
> > >> ClientScanner.java:403)
> > >>> at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.
> > > nextKeyValue(
> > >>> TableRecordReaderImpl.java:232) at org.apache.hadoop.hbase.
> > >>> mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:138)
> > at
> > >>>
> > >>
> > >
> > >
> >
> >
>

Re: Scanner timeouts

Posted by Mich Talebzadeh <mi...@gmail.com>.

This is an example I got

warning: there were two deprecation warnings; re-run with -deprecation for
details
rdd1: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[77] at
map at <console>:151
defined class columns
dfTICKER: org.apache.spark.sql.Dataset[columns] = [KEY: string, TICKER:
string]
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
attempts=36, exceptions:
*Fri Oct 28 13:13:46 BST 2016, null, java.net.SocketTimeoutException:
callTimeout=60000, callDuration=68411: row
'MARKETDATAHBASE,,00000000000000' on table 'hbase:meta' *at
region=hbase:meta,,1.1588230740, hostname=rhes564,16201,1477246132044,
seqNum=0
  at
org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:276)
  at
org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:210)
  at
org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:60)
  at
org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:210)



Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 28 October 2016 at 17:52, Pat Ferrel <pa...@occamsmachete.com> wrote:

> I will check that, but if that is a server startup thing I was not aware I
> had to send it to the executors. So it’s like a connection timeout from
> executor code?
>
>
> On Oct 28, 2016, at 9:48 AM, Ted Yu <yu...@gmail.com> wrote:
>
> How did you change the timeout(s) ?
>
> bq. timeout is currently set to 60000
>
> Did you pass hbase-site.xml using --files to Spark job ?
>
> Cheers
>
> On Fri, Oct 28, 2016 at 9:27 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>
> > Using standalone Spark. I don’t recall seeing connection lost errors, but
> > there are lots of logs. I’ve set the scanner and RPC timeouts to large
> > numbers on the servers.
> >
> > But I also saw in the logs:
> >
> >    org.apache.hadoop.hbase.client.ScannerTimeoutException: 381788ms
> > passed since the last invocation, timeout is currently set to 60000
> >
> > Not sure where that is coming from. Does the driver machine making
> queries
> > need to have the timeout config also?
> >
> > And why so large, am I doing something wrong?
> >
> >
> > On Oct 28, 2016, at 8:50 AM, Ted Yu <yu...@gmail.com> wrote:
> >
> > Mich:
> > The OutOfOrderScannerNextException indicated problem with read from
> hbase.
> >
> > How did you know connection to Spark cluster was lost ?
> >
> > Cheers
> >
> > On Fri, Oct 28, 2016 at 8:47 AM, Mich Talebzadeh <
> > mich.talebzadeh@gmail.com>
> > wrote:
> >
> >> Looks like it lost the connection to Spark cluster.
> >>
> >> What mode you are using with Spark, Standalone, Yarn or others. The
> issue
> >> looks like a resource manager issue.
> >>
> >> I have seen this when running Zeppelin with Spark on Hbase.
> >>
> >> HTH
> >>
> >> Dr Mich Talebzadeh
> >>
> >>
> >>
> >> LinkedIn * https://www.linkedin.com/profile/view?id=
> >> AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >> <https://www.linkedin.com/profile/view?id=
> AAEAAAAWh2gBxianrbJd6zP6AcPCCd
> >> OABUrV8Pw>*
> >>
> >>
> >>
> >> http://talebzadehmich.wordpress.com
> >>
> >>
> >> *Disclaimer:* Use it at your own risk. Any and all responsibility for
> any
> >> loss, damage or destruction of data or any other property which may
> arise
> >> from relying on this email's technical content is explicitly disclaimed.
> >> The author will in no case be liable for any monetary damages arising
> > from
> >> such loss, damage or destruction.
> >>
> >>
> >>
> >> On 28 October 2016 at 16:38, Pat Ferrel <pa...@occamsmachete.com> wrote:
> >>
> >>> I’m getting data from HBase using a large Spark cluster with
> parallelism
> >>> of near 400. The query fails quire often with the message below.
> >> Sometimes
> >>> a retry will work and sometimes the ultimate failure results (below).
> >>>
> >>> If I reduce parallelism in Spark it slows other parts of the algorithm
> >>> unacceptably. I have also experimented with very large RPC/Scanner
> >> timeouts
> >>> of many minutes—to no avail.
> >>>
> >>> Any clues about what to look for or what may be setup wrong in my
> > tables?
> >>>
> >>> Job aborted due to stage failure: Task 44 in stage 147.0 failed 4
> times,
> >>> most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
> >>> ip-172-16-3-9.eu-central-1.compute.internal): org.apache.hadoop.hbase.
> >> DoNotRetryIOException:
> >>> Failed after retry of OutOfOrderScannerNextException: was there a rpc
> >>> timeout?+details
> >>> Job aborted due to stage failure: Task 44 in stage 147.0 failed 4
> times,
> >>> most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
> >>> ip-172-16-3-9.eu-central-1.compute.internal): org.apache.hadoop.hbase.
> >> DoNotRetryIOException:
> >>> Failed after retry of OutOfOrderScannerNextException: was there a rpc
> >>> timeout? at org.apache.hadoop.hbase.client.ClientScanner.next(
> >> ClientScanner.java:403)
> >>> at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.
> > nextKeyValue(
> >>> TableRecordReaderImpl.java:232) at org.apache.hadoop.hbase.
> >>> mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:138)
> at
> >>>
> >>
> >
> >
>
>

Re: Scanner timeouts

Posted by Pat Ferrel <pa...@occamsmachete.com>.

I will check that, but if that is a server startup thing I was not aware I had to send it to the executors. So it’s like a connection timeout from executor code?


On Oct 28, 2016, at 9:48 AM, Ted Yu <yu...@gmail.com> wrote:

How did you change the timeout(s) ?

bq. timeout is currently set to 60000

Did you pass hbase-site.xml using --files to Spark job ?

Cheers

On Fri, Oct 28, 2016 at 9:27 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> Using standalone Spark. I don’t recall seeing connection lost errors, but
> there are lots of logs. I’ve set the scanner and RPC timeouts to large
> numbers on the servers.
> 
> But I also saw in the logs:
> 
>    org.apache.hadoop.hbase.client.ScannerTimeoutException: 381788ms
> passed since the last invocation, timeout is currently set to 60000
> 
> Not sure where that is coming from. Does the driver machine making queries
> need to have the timeout config also?
> 
> And why so large, am I doing something wrong?
> 
> 
> On Oct 28, 2016, at 8:50 AM, Ted Yu <yu...@gmail.com> wrote:
> 
> Mich:
> The OutOfOrderScannerNextException indicated problem with read from hbase.
> 
> How did you know connection to Spark cluster was lost ?
> 
> Cheers
> 
> On Fri, Oct 28, 2016 at 8:47 AM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com>
> wrote:
> 
>> Looks like it lost the connection to Spark cluster.
>> 
>> What mode you are using with Spark, Standalone, Yarn or others. The issue
>> looks like a resource manager issue.
>> 
>> I have seen this when running Zeppelin with Spark on Hbase.
>> 
>> HTH
>> 
>> Dr Mich Talebzadeh
>> 
>> 
>> 
>> LinkedIn * https://www.linkedin.com/profile/view?id=
>> AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCd
>> OABUrV8Pw>*
>> 
>> 
>> 
>> http://talebzadehmich.wordpress.com
>> 
>> 
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
>> loss, damage or destruction of data or any other property which may arise
>> from relying on this email's technical content is explicitly disclaimed.
>> The author will in no case be liable for any monetary damages arising
> from
>> such loss, damage or destruction.
>> 
>> 
>> 
>> On 28 October 2016 at 16:38, Pat Ferrel <pa...@occamsmachete.com> wrote:
>> 
>>> I’m getting data from HBase using a large Spark cluster with parallelism
>>> of near 400. The query fails quire often with the message below.
>> Sometimes
>>> a retry will work and sometimes the ultimate failure results (below).
>>> 
>>> If I reduce parallelism in Spark it slows other parts of the algorithm
>>> unacceptably. I have also experimented with very large RPC/Scanner
>> timeouts
>>> of many minutes—to no avail.
>>> 
>>> Any clues about what to look for or what may be setup wrong in my
> tables?
>>> 
>>> Job aborted due to stage failure: Task 44 in stage 147.0 failed 4 times,
>>> most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
>>> ip-172-16-3-9.eu-central-1.compute.internal): org.apache.hadoop.hbase.
>> DoNotRetryIOException:
>>> Failed after retry of OutOfOrderScannerNextException: was there a rpc
>>> timeout?+details
>>> Job aborted due to stage failure: Task 44 in stage 147.0 failed 4 times,
>>> most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
>>> ip-172-16-3-9.eu-central-1.compute.internal): org.apache.hadoop.hbase.
>> DoNotRetryIOException:
>>> Failed after retry of OutOfOrderScannerNextException: was there a rpc
>>> timeout? at org.apache.hadoop.hbase.client.ClientScanner.next(
>> ClientScanner.java:403)
>>> at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.
> nextKeyValue(
>>> TableRecordReaderImpl.java:232) at org.apache.hadoop.hbase.
>>> mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:138) at
>>> 
>> 
> 
>

Re: Scanner timeouts

Posted by Ted Yu <yu...@gmail.com>.

How did you change the timeout(s) ?

bq. timeout is currently set to 60000

Did you pass hbase-site.xml using --files to Spark job ?

Cheers

On Fri, Oct 28, 2016 at 9:27 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> Using standalone Spark. I don’t recall seeing connection lost errors, but
> there are lots of logs. I’ve set the scanner and RPC timeouts to large
> numbers on the servers.
>
> But I also saw in the logs:
>
>     org.apache.hadoop.hbase.client.ScannerTimeoutException: 381788ms
> passed since the last invocation, timeout is currently set to 60000
>
> Not sure where that is coming from. Does the driver machine making queries
> need to have the timeout config also?
>
> And why so large, am I doing something wrong?
>
>
> On Oct 28, 2016, at 8:50 AM, Ted Yu <yu...@gmail.com> wrote:
>
> Mich:
> The OutOfOrderScannerNextException indicated problem with read from hbase.
>
> How did you know connection to Spark cluster was lost ?
>
> Cheers
>
> On Fri, Oct 28, 2016 at 8:47 AM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com>
> wrote:
>
> > Looks like it lost the connection to Spark cluster.
> >
> > What mode you are using with Spark, Standalone, Yarn or others. The issue
> > looks like a resource manager issue.
> >
> > I have seen this when running Zeppelin with Spark on Hbase.
> >
> > HTH
> >
> > Dr Mich Talebzadeh
> >
> >
> >
> > LinkedIn * https://www.linkedin.com/profile/view?id=
> > AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> > <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCd
> > OABUrV8Pw>*
> >
> >
> >
> > http://talebzadehmich.wordpress.com
> >
> >
> > *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> > loss, damage or destruction of data or any other property which may arise
> > from relying on this email's technical content is explicitly disclaimed.
> > The author will in no case be liable for any monetary damages arising
> from
> > such loss, damage or destruction.
> >
> >
> >
> > On 28 October 2016 at 16:38, Pat Ferrel <pa...@occamsmachete.com> wrote:
> >
> >> I’m getting data from HBase using a large Spark cluster with parallelism
> >> of near 400. The query fails quire often with the message below.
> > Sometimes
> >> a retry will work and sometimes the ultimate failure results (below).
> >>
> >> If I reduce parallelism in Spark it slows other parts of the algorithm
> >> unacceptably. I have also experimented with very large RPC/Scanner
> > timeouts
> >> of many minutes—to no avail.
> >>
> >> Any clues about what to look for or what may be setup wrong in my
> tables?
> >>
> >> Job aborted due to stage failure: Task 44 in stage 147.0 failed 4 times,
> >> most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
> >> ip-172-16-3-9.eu-central-1.compute.internal): org.apache.hadoop.hbase.
> > DoNotRetryIOException:
> >> Failed after retry of OutOfOrderScannerNextException: was there a rpc
> >> timeout?+details
> >> Job aborted due to stage failure: Task 44 in stage 147.0 failed 4 times,
> >> most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
> >> ip-172-16-3-9.eu-central-1.compute.internal): org.apache.hadoop.hbase.
> > DoNotRetryIOException:
> >> Failed after retry of OutOfOrderScannerNextException: was there a rpc
> >> timeout? at org.apache.hadoop.hbase.client.ClientScanner.next(
> > ClientScanner.java:403)
> >> at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.
> nextKeyValue(
> >> TableRecordReaderImpl.java:232) at org.apache.hadoop.hbase.
> >> mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:138) at
> >>
> >
>
>

Re: Scanner timeouts

Posted by Pat Ferrel <pa...@occamsmachete.com>.

Using standalone Spark. I don’t recall seeing connection lost errors, but there are lots of logs. I’ve set the scanner and RPC timeouts to large numbers on the servers. 

But I also saw in the logs:

    org.apache.hadoop.hbase.client.ScannerTimeoutException: 381788ms passed since the last invocation, timeout is currently set to 60000

Not sure where that is coming from. Does the driver machine making queries need to have the timeout config also?

And why so large, am I doing something wrong? 


On Oct 28, 2016, at 8:50 AM, Ted Yu <yu...@gmail.com> wrote:

Mich:
The OutOfOrderScannerNextException indicated problem with read from hbase.

How did you know connection to Spark cluster was lost ?

Cheers

On Fri, Oct 28, 2016 at 8:47 AM, Mich Talebzadeh <mi...@gmail.com>
wrote:

> Looks like it lost the connection to Spark cluster.
> 
> What mode you are using with Spark, Standalone, Yarn or others. The issue
> looks like a resource manager issue.
> 
> I have seen this when running Zeppelin with Spark on Hbase.
> 
> HTH
> 
> Dr Mich Talebzadeh
> 
> 
> 
> LinkedIn * https://www.linkedin.com/profile/view?id=
> AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCd
> OABUrV8Pw>*
> 
> 
> 
> http://talebzadehmich.wordpress.com
> 
> 
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
> 
> 
> 
> On 28 October 2016 at 16:38, Pat Ferrel <pa...@occamsmachete.com> wrote:
> 
>> I’m getting data from HBase using a large Spark cluster with parallelism
>> of near 400. The query fails quire often with the message below.
> Sometimes
>> a retry will work and sometimes the ultimate failure results (below).
>> 
>> If I reduce parallelism in Spark it slows other parts of the algorithm
>> unacceptably. I have also experimented with very large RPC/Scanner
> timeouts
>> of many minutes—to no avail.
>> 
>> Any clues about what to look for or what may be setup wrong in my tables?
>> 
>> Job aborted due to stage failure: Task 44 in stage 147.0 failed 4 times,
>> most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
>> ip-172-16-3-9.eu-central-1.compute.internal): org.apache.hadoop.hbase.
> DoNotRetryIOException:
>> Failed after retry of OutOfOrderScannerNextException: was there a rpc
>> timeout?+details
>> Job aborted due to stage failure: Task 44 in stage 147.0 failed 4 times,
>> most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
>> ip-172-16-3-9.eu-central-1.compute.internal): org.apache.hadoop.hbase.
> DoNotRetryIOException:
>> Failed after retry of OutOfOrderScannerNextException: was there a rpc
>> timeout? at org.apache.hadoop.hbase.client.ClientScanner.next(
> ClientScanner.java:403)
>> at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.nextKeyValue(
>> TableRecordReaderImpl.java:232) at org.apache.hadoop.hbase.
>> mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:138) at
>> 
>

Re: Scanner timeouts

Posted by Ted Yu <yu...@gmail.com>.

Mich:
The OutOfOrderScannerNextException indicated problem with read from hbase.

How did you know connection to Spark cluster was lost ?

Cheers

On Fri, Oct 28, 2016 at 8:47 AM, Mich Talebzadeh <mi...@gmail.com>
wrote:

> Looks like it lost the connection to Spark cluster.
>
> What mode you are using with Spark, Standalone, Yarn or others. The issue
> looks like a resource manager issue.
>
> I have seen this when running Zeppelin with Spark on Hbase.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=
> AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCd
> OABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 28 October 2016 at 16:38, Pat Ferrel <pa...@occamsmachete.com> wrote:
>
> > I’m getting data from HBase using a large Spark cluster with parallelism
> > of near 400. The query fails quire often with the message below.
> Sometimes
> > a retry will work and sometimes the ultimate failure results (below).
> >
> > If I reduce parallelism in Spark it slows other parts of the algorithm
> > unacceptably. I have also experimented with very large RPC/Scanner
> timeouts
> > of many minutes—to no avail.
> >
> > Any clues about what to look for or what may be setup wrong in my tables?
> >
> > Job aborted due to stage failure: Task 44 in stage 147.0 failed 4 times,
> > most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
> > ip-172-16-3-9.eu-central-1.compute.internal): org.apache.hadoop.hbase.
> DoNotRetryIOException:
> > Failed after retry of OutOfOrderScannerNextException: was there a rpc
> > timeout?+details
> > Job aborted due to stage failure: Task 44 in stage 147.0 failed 4 times,
> > most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
> > ip-172-16-3-9.eu-central-1.compute.internal): org.apache.hadoop.hbase.
> DoNotRetryIOException:
> > Failed after retry of OutOfOrderScannerNextException: was there a rpc
> > timeout? at org.apache.hadoop.hbase.client.ClientScanner.next(
> ClientScanner.java:403)
> > at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.nextKeyValue(
> > TableRecordReaderImpl.java:232) at org.apache.hadoop.hbase.
> > mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:138) at
> >
>

Re: Scanner timeouts

Posted by Mich Talebzadeh <mi...@gmail.com>.

Looks like it lost the connection to Spark cluster.

What mode you are using with Spark, Standalone, Yarn or others. The issue
looks like a resource manager issue.

I have seen this when running Zeppelin with Spark on Hbase.

HTH

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 28 October 2016 at 16:38, Pat Ferrel <pa...@occamsmachete.com> wrote:

> I’m getting data from HBase using a large Spark cluster with parallelism
> of near 400. The query fails quire often with the message below. Sometimes
> a retry will work and sometimes the ultimate failure results (below).
>
> If I reduce parallelism in Spark it slows other parts of the algorithm
> unacceptably. I have also experimented with very large RPC/Scanner timeouts
> of many minutes—to no avail.
>
> Any clues about what to look for or what may be setup wrong in my tables?
>
> Job aborted due to stage failure: Task 44 in stage 147.0 failed 4 times,
> most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
> ip-172-16-3-9.eu-central-1.compute.internal): org.apache.hadoop.hbase.DoNotRetryIOException:
> Failed after retry of OutOfOrderScannerNextException: was there a rpc
> timeout?+details
> Job aborted due to stage failure: Task 44 in stage 147.0 failed 4 times,
> most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
> ip-172-16-3-9.eu-central-1.compute.internal): org.apache.hadoop.hbase.DoNotRetryIOException:
> Failed after retry of OutOfOrderScannerNextException: was there a rpc
> timeout? at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:403)
> at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.nextKeyValue(
> TableRecordReaderImpl.java:232) at org.apache.hadoop.hbase.
> mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:138) at
>