Posted to reviews@spark.apache.org by morenn520 <gi...@git.apache.org> on 2017/03/10 07:34:26 UTC

[GitHub] spark pull request #17238: getRackForHost returns None if host is unknown by...

GitHub user morenn520 opened a pull request:

    https://github.com/apache/spark/pull/17238

    getRackForHost returns None if host is unknown by driver

    ## What changes were proposed in this pull request?
    
    https://issues.apache.org/jira/browse/SPARK-19894
    
    ## How was this patch tested?
    
    Tested on our production YARN cluster in yarn-cluster mode; applying this patch resolved the user's rack-local problem.
    Problem:
    In our production YARN cluster, one node (call it the missing-rack-info node) is missing rack information for some other nodes. A Spark Streaming program (data source: Kafka, mode: yarn-cluster) runs its driver on this node.
    The nodes whose racks are unknown to the driver, and the Kafka broker node whose host is also unknown to YARN, are all resolved to "/default-rack" by the YARN scheduler, so tasks end up assigned to those nodes as RACK_LOCAL.
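
    The collision described above can be sketched as follows (a hypothetical, self-contained example; the host names and the in-memory map stand in for a real topology script):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the failure mode: a topology mapping with incomplete data sends
// every unknown host to /default-rack, so an unrelated Kafka broker and
// unmapped executor nodes appear to share a rack.
public class DefaultRackCollision {
    static final String DEFAULT_RACK = "/default-rack";
    static final Map<String, String> topology = new HashMap<>();
    static {
        topology.put("node-a.cluster", "/rack-1"); // correctly configured node
        // node-b.cluster and kafka-broker.cluster are missing from the map
    }

    static String resolve(String host) {
        // Mirrors the fallback of assigning DEFAULT_RACK when the topology
        // mapping has no entry for a host.
        return topology.getOrDefault(host, DEFAULT_RACK);
    }

    public static void main(String[] args) {
        System.out.println(resolve("node-a.cluster"));       // /rack-1
        System.out.println(resolve("node-b.cluster"));       // /default-rack
        System.out.println(resolve("kafka-broker.cluster")); // /default-rack
        // The scheduler now treats node-b as rack-local to the Kafka broker,
        // even though the real racks of both hosts are unknown.
    }
}
```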

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/morenn520/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17238.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17238
    
----
commit 6630e747efa52bff5ca48bb0a5610357c7754c10
Author: Chen Yuechen <ch...@qiyi.com>
Date:   2017-03-10T07:24:48Z

    getRackForHost returns None if host is unknown by driver

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17238: getRackForHost returns None if host is unknown by driver

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the issue:

    https://github.com/apache/spark/pull/17238
  
    @squito I looked at yarn bits about 3-4 years back, so I am sure you and @tgravescs know better :-)
    But yes, you did summarize what I was attempting to get to.
    
    I am not sure what the impact of SPARK-18886 here might be ...
    Another way to look at this change: what do MR/Tez do in this scenario w.r.t. locality computation? My understanding is that they do not fudge the value - but if I am wrong, we could do the same.




[GitHub] spark issue #17238: getRackForHost returns None if host is unknown by driver

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17238
  
    Can one of the admins verify this patch?




[GitHub] spark issue #17238: getRackForHost returns None if host is unknown by driver

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/17238
  
    Please change the title per http://spark.apache.org/contributing.html
    @squito or @vanzin do you know enough to evaluate this?




[GitHub] spark issue #17238: getRackForHost returns None if host is unknown by driver

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the issue:

    https://github.com/apache/spark/pull/17238
  
    This is something to be configured appropriately at the lower level (namely the topology config), and should be handled there. Special-casing this in Spark is not right.
    Two scenarios : 
    - Default rack for cpu nodes, and a special rack for gpu nodes (specifically for rdma comm) - will cause rack locality to be disabled for the cpu nodes (since they are on the default rack).
    - Specializations of NetworkTopology need not honor DEFAULT_RACK - resulting in limited value in those scenarios (IIRC the API does not require an unknown rack to be reported as DEFAULT_RACK) - is that an incorrect assessment @squito ?




[GitHub] spark issue #17238: getRackForHost returns None if host is unknown by driver

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/17238
  
    @mridulm you know the yarn bits involved better than I do.  It sounds like you are saying this is the wrong change, and instead it's just a misconfiguration in yarn -- I'll buy that argument.  Essentially you are saying that "DEFAULT_RACK" may be used legitimately for one of the racks.
    
    Even with this misconfiguration, the incorrect locality assignments don't seem horrible, except that for very short tasks it's exacerbated by https://issues.apache.org/jira/browse/SPARK-18886




[GitHub] spark issue #17238: getRackForHost returns None if host is unknown by driver

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/17238
  
    > Actually, to play devil's advocate, the problem @morenn520 is describing is a little more involved. You have a driver running, which has its own view of what the cluster topology is, and then the cluster topology changes underneath it. Deploying a new configuration on the new nodes being added does not fix the driver, unless your "topology discovery script" is fully dynamic and always goes to a central location to figure out what's the current topology.
    
    Maybe I'm misunderstanding what you are saying, but the only way the AM gets a bad topology is if it's wrong in the first place.  Or are you just saying the app starts with the host in one rack, then the host gets moved to another rack and brought back up? I guess that is possible, but I'm not really sure it applies to this case with default_rack anyway. Any existing executors on that host would have gone away when it was moved, so yarn should re-resolve when it gets a new container anyway.  If your script isn't dynamic enough to handle that, then it's also a configuration issue to update all the other hosts, and you should do that before bringing the host up.  Again, unless you aren't using HDFS, the rack resolution affects more than just Spark on yarn here.  It's going to affect HDFS block placement and other types of apps.




[GitHub] spark issue #17238: getRackForHost returns None if host is unknown by driver

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/17238
  
    I don't think you're missing anything @tgravescs , it sounds like this is just a misconfiguration and we shouldn't be doing anything special for it (since it could hurt correct configurations).  I wanted to see if this was a particularly common / easy misconfiguration in yarn, but you both have convinced me it's not.
    
    My point about SPARK-18886 was to think through how bad the effect of the misconfiguration is.  In the originally described scenario, if a few nodes are accidentally labelled as belonging to their own rack, then tasks will sometimes get preferentially assigned to this false rack at first.  If tasks are long, it's not that big a deal -- after a short delay, you'll assign the tasks to the rest of the cluster.  But if tasks are short, because of SPARK-18886 you may just keep assigning to the nodes in your false rack and leave the rest of the cluster idle.
    
    Nonetheless, I think it's still just a misconfiguration.  Sounds like this issue is a "won't fix", unless @morenn520 makes the case otherwise.




[GitHub] spark issue #17238: getRackForHost returns None if host is unknown by driver

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/17238
  
    OK, I checked tez and mr, and they don't do this.
    Actually, a couple of the input formats add DEFAULT_RACK when there isn't any topology information, so you would end up assuming everything is rack-local in those cases.




[GitHub] spark issue #17238: getRackForHost returns None if host is unknown by driver

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/17238
  
    > If you aren't adding in machines to rack and configuring yarn properly before adding it to your cluster that is a process issue you should fix on your end
    
    Actually, to play devil's advocate, the problem @morenn520 is describing is a little more involved. You have a driver running, which has its own view of what the cluster topology is, and then the cluster topology changes underneath it. Deploying a new configuration on the new nodes being added does not fix the driver, unless your "topology discovery script" is fully dynamic and always goes to a central location to figure out what's the current topology.
    
    So, basically, even if the new nodes know about the updated topology, the existing driver instance might not, and there's no easy way to fix that I can see.
    
    (That being said, I haven't looked at the changes here.)




[GitHub] spark issue #17238: getRackForHost returns None if host is unknown by driver

Posted by morenn520 <gi...@git.apache.org>.
Github user morenn520 commented on the issue:

    https://github.com/apache/spark/pull/17238
  
    Thanks @tgravescs , we have used a dynamic topology discovery script to avoid this problem. Since tez and mr don't do this, maybe Spark need not fix this problem.




[GitHub] spark issue #17238: getRackForHost returns None if host is unknown by driver

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/17238
  
    If you aren't adding machines to a rack and configuring yarn properly before adding them to your cluster, that is a process issue you should fix on your end.  I assume unracking/racking a node means putting in a new node?  If so, you have to install hadoop and the hadoop configuration on that node.  I would expect you to fix the configuration, or to have a generic rack-aware script/java class that can just figure it out, but that is going to be pretty configuration specific.  I would also assume that if you have that configuration wrong, your HDFS is also suboptimal, as it could get the replica placement wrong.
    
    You can specify your own class/script to do the rack resolution so you could change that to handle this case:  see https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/RackAwareness.html
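
    A "generic" resolver of the kind mentioned above might derive the rack from a hostname convention instead of a static map. The sketch below is hypothetical (the "rackNN-hostMM" naming scheme is invented for illustration); a real implementation would implement org.apache.hadoop.net.DNSToSwitchMapping and be named via net.topology.node.switch.mapping.impl in core-site.xml:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical rack resolver: infer the rack from a hostname naming
// convention, falling back to /default-rack for hosts that don't match.
public class ConventionRackMapping {
    static String resolve(String host) {
        // e.g. "rack12-host03.cluster" -> "/rack12"
        int dash = host.indexOf('-');
        if (host.startsWith("rack") && dash > 0) {
            return "/" + host.substring(0, dash);
        }
        return "/default-rack"; // unknown naming scheme: fall back
    }

    public static void main(String[] args) {
        List<String> hosts = List.of("rack12-host03.cluster", "gateway.cluster");
        System.out.println(hosts.stream()
                .map(ConventionRackMapping::resolve)
                .collect(Collectors.toList())); // [/rack12, /default-rack]
    }
}
```

    Because the rack is computed rather than looked up, newly racked nodes resolve correctly without pushing a new map file to every host.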
    
    I'll try to check on tez/mr today and get back to you (was too busy yesterday). I know tez didn't have any explicit references to DEFAULT_RACK in the code, but I want to look a bit more.




[GitHub] spark issue #17238: getRackForHost returns None if host is unknown by driver

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/17238
  
    Sorry if I'm missing something here, but I don't see why this is a problem.  If you have YARN misconfigured or not configured, everything is going to default to DEFAULT_RACK.  If you want nodes to be on different racks, then fix the yarn configuration.
    
    The topology mapping in hadoop is a plugin, so someone could write a plugin that just returns DEFAULT_RACK for everything; if you don't want that behavior, you can write a plugin that does something different.
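
    Such a trivial plugin is perfectly legal, which is why Spark cannot treat "/default-rack" as meaning "rack unknown". The sketch below re-declares the interface locally so it is self-contained; real code would implement org.apache.hadoop.net.DNSToSwitchMapping instead:

```java
import java.util.ArrayList;
import java.util.List;

// A topology plugin that places every host on the default rack.
public class FlatTopologyDemo {
    interface DnsToSwitchMapping { // local stand-in for the Hadoop interface
        List<String> resolve(List<String> names);
    }

    static class FlatMapping implements DnsToSwitchMapping {
        @Override
        public List<String> resolve(List<String> names) {
            List<String> racks = new ArrayList<>();
            for (int i = 0; i < names.size(); i++) {
                racks.add("/default-rack"); // every host on one rack
            }
            return racks;
        }
    }

    public static void main(String[] args) {
        DnsToSwitchMapping mapping = new FlatMapping();
        System.out.println(mapping.resolve(List.of("host-1", "host-2")));
        // [/default-rack, /default-rack]
    }
}
```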
    
    @morenn520 I'm assuming in this case the tasks had some host locality level before looking at the rack level?
    
    I don't believe tez/MR do anything special with respect to DEFAULT_RACK, but I'll double-check.




[GitHub] spark pull request #17238: getRackForHost returns None if host is unknown by...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/17238




[GitHub] spark issue #17238: getRackForHost returns None if host is unknown by driver

Posted by morenn520 <gi...@git.apache.org>.
Github user morenn520 commented on the issue:

    https://github.com/apache/spark/pull/17238
  
    @squito Actually it's a misconfiguration in our yarn. There are thousands of nodes in our production yarn, and nodes are racked/unracked quite frequently, so it is easy for something to go missing. It would be better if Spark did something special to tolerate such a misconfiguration.
    In addition, I think unknown hosts should be treated as being on different racks. But if MR/tez do nothing special for this problem, it is reasonable for it not to be fixed.
    I hope @tgravescs can double-check whether tez/MR do anything special with DEFAULT_RACK.




[GitHub] spark issue #17238: getRackForHost returns None if host is unknown by driver

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/17238
  
    The description of the problem makes sense for how it would affect task locality.  I will need to look more closely at some yarn bits to make sure it's the right change, but it looks reasonable.

