Posted to user@spark.apache.org by Ruijing Li <li...@gmail.com> on 2020/05/06 07:18:24 UTC

Re: Spark hangs while reading from jdbc - does nothing Removing Guess work from trouble shooting

Wanted to update everyone on this, thanks for all the responses. I was able
to solve this issue after doing a jstack dump - I found out this was the
cause

https://github.com/scala/bug/issues/10436

Lesson learned - I’ll switch to a safer JSON parser such as json4s, which
should hopefully be thread-safe.
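For reference, the pattern after the switch might look like the sketch below - a hypothetical example, assuming the json4s-native artifact (org.json4s:json4s-native) is on the classpath, with made-up document and field names. Unlike scala.util.parsing.json.JSON, which keeps parser state in a shared singleton (the root cause in scala/bug#10436), json4s builds its parser state per parse() call, so concurrent calls don't interfere:

```scala
import org.json4s._
import org.json4s.native.JsonMethods.parse

object Json4sSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical payload; the field names are made up for illustration.
    val doc = """{"table":"ORDERS","rows":42}"""
    // Parse the same document from several threads at once; each parse()
    // call allocates its own parser state, so nothing is shared.
    val results = (1 to 8).par.map { _ =>
      parse(doc) \ "rows" match {
        case JInt(n) => n.toInt
        case _       => -1
      }
    }
    assert(results.forall(_ == 42))
    println("all threads parsed consistently")
  }
}
```

json4s-jackson would work the same way here; only the JsonMethods import changes.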

On Fri, Apr 24, 2020 at 4:34 AM Waleed Fateem <wa...@gmail.com>
wrote:

> Are you running this in local mode? If not, are you even sure that the
> hanging is occurring on the driver's side?
>
> Did you check the Spark UI to see if there is a straggler task or not? If
> you do have a straggler/hanging task, and in case this is not an
> application running in local mode then you need to get the Java thread dump
> of the executor's JVM process. Once you do, you'll want to review the "Executor
> task launch worker for task XYZ" thread, where XYZ is some integer value
> representing the ID of the task that was launched on that executor. If you're
> running in local mode, that thread will be located in the same Java thread
> dump that you have already collected.
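If it helps, the same information jstack prints can also be pulled from inside the JVM with plain JDK APIs - e.g. from a debug hook or scheduled logger you add to the executor yourself. A minimal sketch (the object name is mine; the thread-name prefix is the one mentioned above):

```scala
import scala.collection.JavaConverters._

object ExecutorThreadPeek {
  // Collect the state and top stack frames of every "Executor task launch
  // worker" thread currently alive in this JVM.
  def dumpWorkerThreads(): Seq[String] =
    Thread.getAllStackTraces.asScala.toSeq.collect {
      case (t, frames) if t.getName.startsWith("Executor task launch worker") =>
        val top = frames.take(5).mkString("\n    ", "\n    ", "")
        s"${t.getName} [${t.getState}]$top"
    }

  def main(args: Array[String]): Unit = {
    // In a driver-only or local test there may be no such threads; just
    // report whatever we find.
    val dumps = dumpWorkerThreads()
    println(s"found ${dumps.size} task-launch worker thread(s)")
    dumps.foreach(println)
  }
}
```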
>
>
> On Tue, Apr 21, 2020 at 9:51 PM Ruijing Li <li...@gmail.com> wrote:
>
>> I apologize, but I cannot share it, even if it is just typical spark
>> libraries. I definitely understand that limits debugging help, but wanted
>> to understand if anyone has encountered a similar issue.
>>
>> On Tue, Apr 21, 2020 at 7:12 PM Jungtaek Lim <
>> kabhwan.opensource@gmail.com> wrote:
>>
>>> If there are no third-party libraries in the dump, then why not share the
>>> thread dump? (I mean, the output of jstack.)
>>>
>>> A stack trace would be more helpful for finding which thread acquired a
>>> lock and which others are waiting to acquire it, if we suspect a deadlock.
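When a deadlock is the suspicion, the JVM can also be asked directly: ThreadMXBean exposes the same data jstack prints and can report cycles of threads that each hold a monitor the next one is waiting for. A minimal sketch using only JDK APIs (the object name is mine):

```scala
import java.lang.management.ManagementFactory

object DeadlockCheck {
  def main(args: Array[String]): Unit = {
    val mx = ManagementFactory.getThreadMXBean
    // findDeadlockedThreads returns null when no cycle exists, otherwise
    // the ids of the threads involved in the deadlock.
    val ids = Option(mx.findDeadlockedThreads()).getOrElse(Array.empty[Long])
    if (ids.isEmpty) println("no deadlocked threads detected")
    else mx.getThreadInfo(ids, true, true).foreach { info =>
      println(s"${info.getThreadName} waiting on ${info.getLockName} " +
        s"held by ${info.getLockOwnerName}")
    }
  }
}
```

This only catches true monitor/synchronizer deadlocks; a livelock or a thread blocked on I/O still needs the dump-diffing approach.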
>>>
>>> On Wed, Apr 22, 2020 at 2:38 AM Ruijing Li <li...@gmail.com>
>>> wrote:
>>>
>>>> After refreshing a couple of times, I notice the lock is being swapped
>>>> between these three threads. The other two are blocked by whichever one
>>>> holds the lock, in a cycle: 160 holds the lock -> 161 -> 159 -> 160
>>>>
>>>> On Tue, Apr 21, 2020 at 10:33 AM Ruijing Li <li...@gmail.com>
>>>> wrote:
>>>>
>>>>> In thread dump, I do see this
>>>>> - SparkUI-160-acceptor-id-ServerConnector@id(HTTP/1.1) | RUNNABLE |
>>>>> Monitor
>>>>> - SparkUI-161-acceptor-id-ServerConnector@id(HTTP/1.1) | BLOCKED |
>>>>> Blocked by Thread(Some(160)) Lock
>>>>> - SparkUI-159-acceptor-id-ServerConnector@id(HTTP/1.1) | BLOCKED |
>>>>> Blocked by Thread(Some(160)) Lock
>>>>>
>>>>> Could the fact that 160 has the monitor but is not running be causing
>>>>> a deadlock preventing the job from finishing?
>>>>>
>>>>> I do see that my Finalizer and main threads are waiting. I don’t see
>>>>> any other threads from third-party libraries or my code in the dump. I
>>>>> do see that the Spark ContextCleaner is in TIMED_WAITING.
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>> On Tue, Apr 21, 2020 at 9:58 AM Ruijing Li <li...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Strangely enough, I found an old issue that is exactly the same as
>>>>>> mine
>>>>>>
>>>>>> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-18343
>>>>>>
>>>>>> However, I’m using Spark 2.4.4, so the issue should have been fixed by
>>>>>> now.
>>>>>>
>>>>>> Like the user in the jira issue I am using mesos, but I am reading
>>>>>> from oracle instead of writing to Cassandra and S3.
>>>>>>
>>>>>>
>>>>>> On Thu, Apr 16, 2020 at 1:54 AM ZHANG Wei <we...@outlook.com>
>>>>>> wrote:
>>>>>>
>>>>>>> The thread dump table in the Spark UI can provide some clues for
>>>>>>> finding thread lock issues, such as:
>>>>>>>
>>>>>>>   Thread ID | Thread Name                  | Thread State | Thread Locks
>>>>>>>   13        | NonBlockingInputStreamThread | WAITING      | Blocked by
>>>>>>>               Thread Some(48) Lock(jline.internal.NonBlockingInputStream@103008951)
>>>>>>>   48        | Thread-16                    | RUNNABLE     |
>>>>>>>               Monitor(jline.internal.NonBlockingInputStream@103008951)
>>>>>>>
>>>>>>> And each thread row, when clicked, shows its call stack, so you can
>>>>>>> check the root cause of the held lock, like this (Thread 48 of
>>>>>>> above):
>>>>>>>
>>>>>>>   org.fusesource.jansi.internal.Kernel32.ReadConsoleInputW(Native Method)
>>>>>>>   org.fusesource.jansi.internal.Kernel32.readConsoleInputHelper(Kernel32.java:811)
>>>>>>>   org.fusesource.jansi.internal.Kernel32.readConsoleKeyInput(Kernel32.java:842)
>>>>>>>   org.fusesource.jansi.internal.WindowsSupport.readConsoleInput(WindowsSupport.java:97)
>>>>>>>   jline.WindowsTerminal.readConsoleInput(WindowsTerminal.java:222)
>>>>>>>   <snip...>
>>>>>>>
>>>>>>> Hope it can help you.
>>>>>>>
>>>>>>> --
>>>>>>> Cheers,
>>>>>>> -z
>>>>>>>
>>>>>>> On Thu, 16 Apr 2020 16:36:42 +0900
>>>>>>> Jungtaek Lim <ka...@gmail.com> wrote:
>>>>>>>
>>>>>>> > Do thread dumps continuously, at a specific interval (like 1s), and
>>>>>>> > see how the stack / lock of each thread changes. (This is not easy
>>>>>>> > to do in the UI, so doing it manually may be the only option. I'm
>>>>>>> > not sure whether the Spark UI provides the same; I haven't used it
>>>>>>> > for that at all.)
>>>>>>> >
>>>>>>> > It will tell you which thread is being blocked (even if it's shown
>>>>>>> > as running) and which point to look at.
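The dump-and-diff loop suggested above can be scripted rather than clicked through a UI. A rough sketch that samples all stacks at a fixed interval and flags threads whose top frame never changes between samples (the interval is shortened here for the sketch; the suggestion above is 1s):

```scala
import scala.collection.JavaConverters._

object PeriodicDump {
  // Snapshot: thread name -> top stack frame, for every live thread.
  def topFrames(): Map[String, String] =
    Thread.getAllStackTraces.asScala.collect {
      case (t, frames) if frames.nonEmpty => t.getName -> frames.head.toString
    }.toMap

  def main(args: Array[String]): Unit = {
    val samples = (1 to 3).map { _ =>
      val s = topFrames()
      Thread.sleep(250) // use ~1000 ms in practice
      s
    }
    // A thread present in every sample with an identical top frame is a
    // candidate for being stuck (blocked or parked on the same point).
    val stuck = samples.reduce { (a, b) =>
      a.filter { case (name, frame) => b.get(name).contains(frame) }
    }
    println(s"threads with an unchanged top frame across samples: ${stuck.size}")
    stuck.keys.toSeq.sorted.foreach(println)
  }
}
```

An unchanged top frame is only a hint, not proof - idle pool threads also sit on the same park() frame - so the flagged threads still need a look at their full stacks.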
>>>>>>> >
>>>>>>> > On Thu, Apr 16, 2020 at 4:29 PM Ruijing Li <li...@gmail.com>
>>>>>>> wrote:
>>>>>>> >
>>>>>>> > > Once I do a thread dump, what should I be looking for to tell
>>>>>>> > > where it is hanging? I'm seeing a lot of TIMED_WAITING and WAITING
>>>>>>> > > on the driver. The driver is also being blocked by the Spark UI.
>>>>>>> > > If there are no tasks, is there any point in doing a thread dump
>>>>>>> > > of the executors?
>>>>>>> > >
>>>>>>> > > On Tue, Apr 14, 2020 at 4:49 AM Gabor Somogyi <
>>>>>>> gabor.g.somogyi@gmail.com>
>>>>>>> > > wrote:
>>>>>>> > >
>>>>>>> > >> The simplest way is to take a thread dump, which doesn't require
>>>>>>> > >> any fancy tool (it's available in the Spark UI).
>>>>>>> > >> Without a thread dump it's hard to say anything...
>>>>>>> > >>
>>>>>>> > >>
>>>>>>> > >> On Tue, Apr 14, 2020 at 11:32 AM jane thorpe
>>>>>>> <ja...@aol.com.invalid>
>>>>>>> > >> wrote:
>>>>>>> > >>
>>>>>>> > >>> Here is another tool I use, a logic analyser (7:55):
>>>>>>> > >>> https://youtu.be/LnzuMJLZRdU
>>>>>>> > >>>
>>>>>>> > >>> You could also take some suggestions for improving query
>>>>>>> > >>> performance:
>>>>>>> > >>> https://dzone.com/articles/why-you-should-not-use-select-in-sql-query-1
>>>>>>> > >>>
>>>>>>> > >>>
>>>>>>> > >>> Jane thorpe
>>>>>>> > >>> janethorpe1@aol.com
>>>>>>> > >>>
>>>>>>> > >>>
>>>>>>> > >>> -----Original Message-----
>>>>>>> > >>> From: jane thorpe <ja...@aol.com.INVALID>
>>>>>>> > >>> To: janethorpe1 <ja...@aol.com>; mich.talebzadeh <
>>>>>>> > >>> mich.talebzadeh@gmail.com>; liruijing09 <li...@gmail.com>;
>>>>>>> user <
>>>>>>> > >>> user@spark.apache.org>
>>>>>>> > >>> Sent: Mon, 13 Apr 2020 8:32
>>>>>>> > >>> Subject: Re: Spark hangs while reading from jdbc - does
>>>>>>> nothing Removing
>>>>>>> > >>> Guess work from trouble shooting
>>>>>>> > >>>
>>>>>>> > >>>
>>>>>>> > >>>
>>>>>>> > >>> This tool may be useful for troubleshooting your problems.
>>>>>>> > >>>
>>>>>>> > >>>
>>>>>>> > >>>
>>>>>>> > >>> https://www.javacodegeeks.com/2020/04/simplifying-apm-remove-the-guesswork-from-troubleshooting.html
>>>>>>> > >>>
>>>>>>> > >>>
>>>>>>> > >>> "APM tools typically use a waterfall-type view to show the
>>>>>>> > >>> blocking time of different components cascading through the
>>>>>>> > >>> control flow within an application.
>>>>>>> > >>> These types of visualizations are useful, and AppOptics has
>>>>>>> > >>> them, but they can be difficult to understand for those of us
>>>>>>> > >>> without a PhD."
>>>>>>> > >>>
>>>>>>> > >>> It is especially helpful if you want to understand through
>>>>>>> > >>> visualisation and you do not have a PhD.
>>>>>>> > >>>
>>>>>>> > >>>
>>>>>>> > >>> Jane thorpe
>>>>>>> > >>> janethorpe1@aol.com
>>>>>>> > >>>
>>>>>>> > >>>
>>>>>>> > >>> -----Original Message-----
>>>>>>> > >>> From: jane thorpe <ja...@aol.com.INVALID>
>>>>>>> > >>> To: mich.talebzadeh <mi...@gmail.com>; liruijing09 <
>>>>>>> > >>> liruijing09@gmail.com>; user <us...@spark.apache.org>
>>>>>>> > >>> CC: user <us...@spark.apache.org>
>>>>>>> > >>> Sent: Sun, 12 Apr 2020 4:35
>>>>>>> > >>> Subject: Re: Spark hangs while reading from jdbc - does nothing
>>>>>>> > >>>
>>>>>>> > >>> You seem to be implying the error is intermittent.
>>>>>>> > >>> You seem to be implying data is being ingested via JDBC, so the
>>>>>>> > >>> connection has proven itself to work - unless no data is
>>>>>>> > >>> arriving from the JDBC channel at all. If no data is arriving,
>>>>>>> > >>> then one could say it could be the JDBC side.
>>>>>>> > >>> If the error is intermittent, then it is likely that a resource
>>>>>>> > >>> involved in processing is filling to capacity.
>>>>>>> > >>> Try reducing the data ingestion volume to see if the job
>>>>>>> > >>> completes, then increase the ingested data incrementally.
>>>>>>> > >>> I assume you have run the job on a small amount of data, so you
>>>>>>> > >>> have completed your prototype stage successfully.
>>>>>>> > >>>
>>>>>>> > >>> ------------------------------
>>>>>>> > >>> On Saturday, 11 April 2020 Mich Talebzadeh <
>>>>>>> mich.talebzadeh@gmail.com>
>>>>>>> > >>> wrote:
>>>>>>> > >>> Hi,
>>>>>>> > >>>
>>>>>>> > >>> Have you checked your JDBC connections from Spark to Oracle?
>>>>>>> > >>> What is Oracle saying? Is it doing anything or hanging?
>>>>>>> > >>>
>>>>>>> > >>> set pagesize 9999
>>>>>>> > >>> set linesize 140
>>>>>>> > >>> set heading off
>>>>>>> > >>> select SUBSTR(name,1,8) || ' sessions as on '||TO_CHAR(CURRENT_DATE, 'MON DD YYYY HH:MI AM') from v$database;
>>>>>>> > >>> set heading on
>>>>>>> > >>> column spid heading "OS PID" format a6
>>>>>>> > >>> column process format a13 heading "Client ProcID"
>>>>>>> > >>> column username  format a15
>>>>>>> > >>> column sid       format 999
>>>>>>> > >>> column serial#   format 99999
>>>>>>> > >>> column STATUS    format a3 HEADING 'ACT'
>>>>>>> > >>> column last      format 9,999.99
>>>>>>> > >>> column TotGets   format 999,999,999,999 HEADING 'Logical I/O'
>>>>>>> > >>> column phyRds    format 999,999,999 HEADING 'Physical I/O'
>>>>>>> > >>> column total_memory format 999,999,999 HEADING 'MEM/KB'
>>>>>>> > >>> --
>>>>>>> > >>> SELECT
>>>>>>> > >>>           substr(a.username,1,15) "LOGIN"
>>>>>>> > >>>         , substr(a.sid,1,5) || ','||substr(a.serial#,1,5) AS "SID/serial#"
>>>>>>> > >>>         , TO_CHAR(a.logon_time, 'DD/MM HH:MI') "LOGGED IN SINCE"
>>>>>>> > >>>         , substr(a.machine,1,10) HOST
>>>>>>> > >>>         , substr(p.username,1,8)||'/'||substr(p.spid,1,5) "OS PID"
>>>>>>> > >>>         , substr(a.osuser,1,8)||'/'||substr(a.process,1,5) "Client PID"
>>>>>>> > >>>         , substr(a.program,1,15) PROGRAM
>>>>>>> > >>>         --,ROUND((CURRENT_DATE-a.logon_time)*24) AS "Logged/Hours"
>>>>>>> > >>>         , (
>>>>>>> > >>>                 select round(sum(ss.value)/1024) from v$sesstat ss, v$statname sn
>>>>>>> > >>>                 where ss.sid = a.sid and
>>>>>>> > >>>                         sn.statistic# = ss.statistic# and
>>>>>>> > >>>                         -- sn.name in ('session pga memory')
>>>>>>> > >>>                         sn.name in ('session pga memory','session uga memory')
>>>>>>> > >>>           ) AS total_memory
>>>>>>> > >>>         , (b.block_gets + b.consistent_gets) TotGets
>>>>>>> > >>>         , b.physical_reads phyRds
>>>>>>> > >>>         , decode(a.status, 'ACTIVE', 'Y','INACTIVE', 'N') STATUS
>>>>>>> > >>>         , CASE WHEN a.sid in (select sid from v$mystat where rownum = 1) THEN '<-- YOU' ELSE ' ' END "INFO"
>>>>>>> > >>> FROM
>>>>>>> > >>>          v$process p
>>>>>>> > >>>         ,v$session a
>>>>>>> > >>>         ,v$sess_io b
>>>>>>> > >>> WHERE
>>>>>>> > >>> a.paddr = p.addr
>>>>>>> > >>> AND p.background IS NULL
>>>>>>> > >>> --AND  a.sid NOT IN (select sid from v$mystat where rownum = 1)
>>>>>>> > >>> AND a.sid = b.sid
>>>>>>> > >>> AND a.username is not null
>>>>>>> > >>> --AND (a.last_call_et < 3600 or a.status = 'ACTIVE')
>>>>>>> > >>> --AND CURRENT_DATE - logon_time > 0
>>>>>>> > >>> --AND a.sid NOT IN ( select sid from v$mystat where rownum=1) -- exclude me
>>>>>>> > >>> --AND (b.block_gets + b.consistent_gets) > 0
>>>>>>> > >>> ORDER BY a.username;
>>>>>>> > >>> exit
>>>>>>> > >>>
>>>>>>> > >>> HTH
>>>>>>> > >>>
>>>>>>> > >>> Dr Mich Talebzadeh
>>>>>>> > >>>
>>>>>>> > >>> LinkedIn:
>>>>>>> > >>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>> > >>>
>>>>>>> > >>> http://talebzadehmich.wordpress.com
>>>>>>> > >>>
>>>>>>> > >>> *Disclaimer:* Use it at your own risk. Any and all
>>>>>>> > >>> responsibility for any loss, damage or destruction of data or
>>>>>>> > >>> any other property which may arise from relying on this email's
>>>>>>> > >>> technical content is explicitly disclaimed. The author will in
>>>>>>> > >>> no case be liable for any monetary damages arising from such
>>>>>>> > >>> loss, damage or destruction.
>>>>>>> > >>>
>>>>>>> > >>>
>>>>>>> > >>>
>>>>>>> > >>> On Fri, 10 Apr 2020 at 17:37, Ruijing Li <
>>>>>>> liruijing09@gmail.com> wrote:
>>>>>>> > >>>
>>>>>>> > >>> Hi all,
>>>>>>> > >>>
>>>>>>> > >>> I am on Spark 2.4.4 with Scala 2.11.12, running in cluster mode
>>>>>>> > >>> on Mesos. I am ingesting from an Oracle database using
>>>>>>> > >>> spark.read.jdbc. I am seeing a strange issue where Spark just
>>>>>>> > >>> hangs and does nothing, not starting any new tasks. Normally
>>>>>>> > >>> this job finishes in 30 stages, but sometimes it stops at 29
>>>>>>> > >>> completed stages and doesn't start the last stage. The Spark
>>>>>>> > >>> job is idling and there are no pending or active tasks. What
>>>>>>> > >>> could be the problem? Thanks.
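For reference, a sketch of the kind of partitioned JDBC read being described. The option names below are Spark's documented JDBC data source options, but the URL, table, user, and partition bounds are placeholders invented for illustration, not taken from the thread:

```scala
import org.apache.spark.sql.SparkSession

object OracleIngestSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("oracle-ingest").getOrCreate()
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//db-host:1521/SERVICE") // placeholder
      .option("driver", "oracle.jdbc.OracleDriver")
      .option("dbtable", "SOME_SCHEMA.SOME_TABLE")               // placeholder
      .option("user", "scott")                                   // placeholder
      .option("partitionColumn", "ID") // numeric column to split the read on
      .option("lowerBound", "1")       // placeholder bounds
      .option("upperBound", "1000000")
      .option("numPartitions", "30")   // 30 parallel partition queries
      .option("fetchsize", "10000")    // larger fetch size helps Oracle reads
      .load()
    println(df.count())
    spark.stop()
  }
}
```

Without partitionColumn/lowerBound/upperBound/numPartitions, the whole table comes through a single connection in one task, which is a common reason a JDBC read appears to sit doing nothing.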
>>>>>>> > >>> --
>>>>>>> > >>> Cheers,
>>>>>>> > >>> Ruijing Li
>>>>>>> > >>>
>>>>>>> > >>> --
>>>>>>> > > Cheers,
>>>>>>> > > Ruijing Li
>>>>>>> > >
>>>>>>>
>>>>>> --
>>>>>> Cheers,
>>>>>> Ruijing Li
>>>>>>
>>>>> --
>>>>> Cheers,
>>>>> Ruijing Li
>>>>>
>>>> --
>>>> Cheers,
>>>> Ruijing Li
>>>>
>>> --
>> Cheers,
>> Ruijing Li
>>
> --
Cheers,
Ruijing Li