Posted to user@spark.apache.org by Alexey Romanchuk <al...@gmail.com> on 2014/09/25 09:09:47 UTC

Log hdfs blocks sending

Hello again spark users and developers!

I have a standalone Spark cluster (1.1.0) with Spark SQL running on it. My
cluster consists of 4 datanodes, and the replication factor of the files is 3.

I use the Thrift server to access Spark SQL and have one table with 30+
partitions. When I run a query over the whole table (something simple like
select count(*) from t), Spark produces a lot of network activity, saturating
the available 1 Gb link. It looks like Spark is sending data over the network
instead of reading it locally.

Is there any way to log which blocks were accessed locally and which were not?

Thanks!

Re: Log hdfs blocks sending

Posted by Andrew Ash <an...@andrewash.com>.
Hi Alexey,

You're looking in the right place: the first log, from the driver.
Specifically, the locality level is logged by the TaskSetManager at the INFO
level and looks like this:

14/09/26 16:57:31 INFO TaskSetManager: Starting task 9.0 in stage 1.0
(TID 10, 10.54.255.191, ANY, 1341 bytes)


The ANY there means you're not getting locality.  The big flag for me is
that the host in that line is an IP address.  Do you have
Spark configured to use hostnames instead of IP addresses?  You need to
check the Spark master web UI and the Hadoop NameNode UI to make sure that
hosts appear exactly the same in both.  Most likely, you want both to use
the FQDN of each host.
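If it helps, one quick way to spot the mismatch from the driver log alone is to pull out the host field from those TaskSetManager lines and see whether you get IPs, short names, or FQDNs. This is only a sketch against the log format quoted above; the log path and the second sample line are made up:

```shell
# Hypothetical excerpt in the TaskSetManager format quoted above
cat > /tmp/driver-sample.log <<'EOF'
14/09/26 16:57:31 INFO TaskSetManager: Starting task 9.0 in stage 1.0 (TID 10, 10.54.255.191, ANY, 1341 bytes)
14/09/26 16:57:32 INFO TaskSetManager: Starting task 10.0 in stage 1.0 (TID 11, datanode1.example.com, NODE_LOCAL, 1341 bytes)
EOF

# Extract the executor host field; bare IPs here are the ones to compare
# against the hostnames the NameNode reports for its datanodes
grep 'TaskSetManager: Starting task' /tmp/driver-sample.log \
  | sed -E 's/.*\(TID [0-9]+, ([^,]+),.*/\1/' \
  | sort -u
```

If that list and the NameNode's datanode list don't use the same form of each name, the scheduler can't match them up.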

Cheers!
Andrew

On Fri, Sep 26, 2014 at 3:14 AM, Alexey Romanchuk <
alexey.romanchuk@gmail.com> wrote:

> Hello Andrew!
>
> Thanks for the reply. Which logs, and at what level, should I check? Driver,
> master, or worker?
>
> I found this on the master node, but it only shows the ANY locality level.
> Here is the driver (Spark SQL) log -
> https://gist.github.com/13h3r/c91034307caa33139001 and one of the workers'
> logs - https://gist.github.com/13h3r/6e5053cf0dbe33f2aaaa
>
> Do you have any idea where to look?
>
> Thanks!
>
> On Fri, Sep 26, 2014 at 10:35 AM, Andrew Ash <an...@andrewash.com> wrote:
>
>> Hi Alexey,
>>
>> You should see in the logs a locality measure like NODE_LOCAL,
>> PROCESS_LOCAL, ANY, etc.  If your Spark workers each have an HDFS data node
>> on them and you're reading out of HDFS, then you should be seeing almost
>> all NODE_LOCAL accesses.  One cause I've seen for mismatches is if Spark
>> uses short hostnames and Hadoop uses FQDNs -- in that case Spark doesn't
>> think the data is local and does remote reads which really kills
>> performance.
>>
>> Hope that helps!
>> Andrew
>>
>> On Thu, Sep 25, 2014 at 12:09 AM, Alexey Romanchuk <
>> alexey.romanchuk@gmail.com> wrote:
>>
>>> Hello again spark users and developers!
>>>
>>> I have a standalone Spark cluster (1.1.0) with Spark SQL running on it. My
>>> cluster consists of 4 datanodes, and the replication factor of the files is 3.
>>>
>>> I use the Thrift server to access Spark SQL and have one table with 30+
>>> partitions. When I run a query over the whole table (something simple like
>>> select count(*) from t), Spark produces a lot of network activity, saturating
>>> the available 1 Gb link. It looks like Spark is sending data over the network
>>> instead of reading it locally.
>>>
>>> Is there any way to log which blocks were accessed locally and which were
>>> not?
>>>
>>> Thanks!
>>>
>>
>>
>

Re: Log hdfs blocks sending

Posted by Alexey Romanchuk <al...@gmail.com>.
Hello Andrew!

Thanks for the reply. Which logs, and at what level, should I check? Driver,
master, or worker?

I found this on the master node, but it only shows the ANY locality level.
Here is the driver (Spark SQL) log -
https://gist.github.com/13h3r/c91034307caa33139001 and one of the workers'
logs - https://gist.github.com/13h3r/6e5053cf0dbe33f2aaaa

Do you have any idea where to look?

Thanks!

On Fri, Sep 26, 2014 at 10:35 AM, Andrew Ash <an...@andrewash.com> wrote:

> Hi Alexey,
>
> You should see in the logs a locality measure like NODE_LOCAL,
> PROCESS_LOCAL, ANY, etc.  If your Spark workers each have an HDFS data node
> on them and you're reading out of HDFS, then you should be seeing almost
> all NODE_LOCAL accesses.  One cause I've seen for mismatches is if Spark
> uses short hostnames and Hadoop uses FQDNs -- in that case Spark doesn't
> think the data is local and does remote reads which really kills
> performance.
>
> Hope that helps!
> Andrew
>
> On Thu, Sep 25, 2014 at 12:09 AM, Alexey Romanchuk <
> alexey.romanchuk@gmail.com> wrote:
>
>> Hello again spark users and developers!
>>
>> I have a standalone Spark cluster (1.1.0) with Spark SQL running on it. My
>> cluster consists of 4 datanodes, and the replication factor of the files is 3.
>>
>> I use the Thrift server to access Spark SQL and have one table with 30+
>> partitions. When I run a query over the whole table (something simple like
>> select count(*) from t), Spark produces a lot of network activity, saturating
>> the available 1 Gb link. It looks like Spark is sending data over the network
>> instead of reading it locally.
>>
>> Is there any way to log which blocks were accessed locally and which were not?
>>
>> Thanks!
>>
>
>

Re: Log hdfs blocks sending

Posted by Andrew Ash <an...@andrewash.com>.
Hi Alexey,

You should see in the logs a locality measure like NODE_LOCAL,
PROCESS_LOCAL, ANY, etc.  If your Spark workers each have an HDFS data node
on them and you're reading out of HDFS, then you should be seeing almost
all NODE_LOCAL accesses.  One cause I've seen for mismatches is if Spark
uses short hostnames and Hadoop uses FQDNs -- in that case Spark doesn't
think the data is local and does remote reads which really kills
performance.
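To put a number on it, you can tally the locality levels straight out of the driver log. A minimal sketch — the log path and sample lines below are made up, but the locality levels are the standard Spark ones:

```shell
# Hypothetical driver log with a mix of locality levels
cat > /tmp/locality-sample.log <<'EOF'
INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, dn1, NODE_LOCAL, 1341 bytes)
INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, dn2, NODE_LOCAL, 1341 bytes)
INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, dn3, ANY, 1341 bytes)
EOF

# Count tasks per locality level; an HDFS-backed scan should be mostly NODE_LOCAL
grep -oE 'PROCESS_LOCAL|NODE_LOCAL|RACK_LOCAL|ANY' /tmp/locality-sample.log \
  | sort | uniq -c
```

A large ANY count relative to NODE_LOCAL is the signal that tasks are reading blocks over the network.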

Hope that helps!
Andrew

On Thu, Sep 25, 2014 at 12:09 AM, Alexey Romanchuk <
alexey.romanchuk@gmail.com> wrote:

> Hello again spark users and developers!
>
> I have a standalone Spark cluster (1.1.0) with Spark SQL running on it. My
> cluster consists of 4 datanodes, and the replication factor of the files is 3.
>
> I use the Thrift server to access Spark SQL and have one table with 30+
> partitions. When I run a query over the whole table (something simple like
> select count(*) from t), Spark produces a lot of network activity, saturating
> the available 1 Gb link. It looks like Spark is sending data over the network
> instead of reading it locally.
>
> Is there any way to log which blocks were accessed locally and which were not?
>
> Thanks!
>