You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by sitalkedia <gi...@git.apache.org> on 2016/05/06 00:55:22 UTC

[GitHub] spark pull request: [SPARK-15074][Shuffle] Cache shuffle index fil...

GitHub user sitalkedia opened a pull request:

    https://github.com/apache/spark/pull/12944

    [SPARK-15074][Shuffle] Cache shuffle index file to speedup shuffle fetch

    ## What changes were proposed in this pull request?
    
    Shuffle fetch on large intermediate dataset is slow because the shuffle service open/close the index file for each shuffle fetch. This change introduces a cache for the index information so that we can avoid accessing the index files for each block fetch
    
    
    ## How was this patch tested?
    
    Tested by running a job on the cluster and the shuffle read time was reduced by 50%.
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sitalkedia/spark shuffle_service

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12944.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12944
    
----
commit 9ca03093bb7d3eef8bdb25fe41172dfce259d496
Author: Sital Kedia <sk...@fb.com>
Date:   2016-05-04T01:34:05Z

    [SPARK-15074][Shuffle] Cache shuffle index file to speedup shuffle fetch

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #12944: [SPARK-15074][Shuffle] Cache shuffle index file t...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12944#discussion_r73612252
  
    --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ShuffleIndexInformation.java ---
    @@ -0,0 +1,63 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.network.shuffle;
    +
    +import com.google.common.cache.LoadingCache;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +import sun.nio.ch.IOUtil;
    --- End diff --
    
    Doh, it looks like this unused import is breaking things. Let me hotfix to remove it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15074][Shuffle] Cache shuffle index fil...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12944#discussion_r62331757
  
    --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ShuffleIndexRecord.java ---
    @@ -0,0 +1,39 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.network.shuffle;
    +
    +/**
    + * Contains offset and length of the shuffle block data.
    + */
    +public class ShuffleIndexRecord {
    +  private final long offset;
    +  private final long length;
    +
    +  public ShuffleIndexRecord(long offset, long length) {
    +    this.offset = offset;
    +    this.length = length;
    +  }
    +
    +  public long getOffset() {
    +    return offset;
    +  }
    +
    +  public long getLength() {
    +    return length;
    +  }
    +}
    --- End diff --
    
    And a newline here maybe :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #12944: [SPARK-15074][Shuffle] Cache shuffle index file t...

Posted by ericl <gi...@git.apache.org>.
Github user ericl commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12944#discussion_r71253134
  
    --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ShuffleIndexCache.java ---
    @@ -0,0 +1,58 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.network.shuffle;
    +
    +import com.google.common.cache.CacheBuilder;
    +import com.google.common.cache.CacheLoader;
    +import com.google.common.cache.LoadingCache;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.io.DataInputStream;
    +import java.io.File;
    +import java.io.FileInputStream;
    +import java.io.IOException;
    +import java.nio.ByteBuffer;
    +import java.nio.LongBuffer;
    +import java.util.concurrent.ExecutionException;
    +
    +/**
    + * Maintains an LRU cache of {@link ShuffleIndexInformation} so that
    + * we can avoid open/close of the index files for each block fetch.
    + */
    +public class ShuffleIndexCache {
    +
    +  LoadingCache<String, ShuffleIndexInformation> indexCache;
    +
    +  public ShuffleIndexCache(long cacheSize) {
    +    CacheLoader<String, ShuffleIndexInformation> loader =
    --- End diff --
    
    nit: does it work to have `File` as a key? That seems cleaner than a raw string.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #12944: [SPARK-15074][Shuffle] Cache shuffle index file t...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12944#discussion_r71602983
  
    --- Diff: core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala ---
    @@ -53,7 +53,7 @@ private[spark] class CoarseGrainedExecutorBackend(
       private[this] val ser: SerializerInstance = env.closureSerializer.newInstance()
     
       override def onStart() {
    -    logInfo("Connecting to driver: " + driverUrl)
    +    logInfo("Connecting to driver, skedia1: " + driverUrl)
    --- End diff --
    
    my bad, removed it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #12944: [SPARK-15074][Shuffle] Cache shuffle index file t...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12944#discussion_r71418404
  
    --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ShuffleIndexInformation.java ---
    @@ -0,0 +1,58 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.network.shuffle;
    +
    +import com.google.common.cache.LoadingCache;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +import sun.nio.ch.IOUtil;
    +
    +import java.io.DataInputStream;
    +import java.io.File;
    +import java.io.FileInputStream;
    +import java.io.IOException;
    +import java.nio.ByteBuffer;
    +import java.nio.LongBuffer;
    +
    +/**
    + * Keeps the index information for a particular map output
    + * as an in-memory LongBuffer.
    + */
    +public class ShuffleIndexInformation {
    +  /** offsets as long buffer */
    +  private final LongBuffer offsets;
    +
    +  public ShuffleIndexInformation(File indexFile) throws IOException{
    --- End diff --
    
    fixed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #12944: [SPARK-15074][Shuffle] Cache shuffle index file t...

Posted by ericl <gi...@git.apache.org>.
Github user ericl commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12944#discussion_r71420918
  
    --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java ---
    @@ -95,6 +109,9 @@ public ExternalShuffleBlockResolver(TransportConf conf, File registeredExecutorF
           Executor directoryCleaner) throws IOException {
         this.conf = conf;
         this.registeredExecutorFile = registeredExecutorFile;
    +    int indexCacheEntries = conf.getInt(SPARK_SHUFFLE_SERVICE_INDEX_CACHE_ENTRIES,
    +                                        DEFAULT_SPARK_SHUFFLE_SERVICE_INDEX_CACHE_ENTRIES);
    +    this.shuffleIndexCache = new ShuffleIndexCache(indexCacheEntries);
    --- End diff --
    
    `= new ShuffleIndexCache(conf.getInt("spark.shuffle.service.index.cache.size", 1024))`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #12944: [SPARK-15074][Shuffle] Cache shuffle index file t...

Posted by ericl <gi...@git.apache.org>.
Github user ericl commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12944#discussion_r71600853
  
    --- Diff: core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala ---
    @@ -53,7 +53,7 @@ private[spark] class CoarseGrainedExecutorBackend(
       private[this] val ser: SerializerInstance = env.closureSerializer.newInstance()
     
       override def onStart() {
    -    logInfo("Connecting to driver: " + driverUrl)
    +    logInfo("Connecting to driver, skedia1: " + driverUrl)
    --- End diff --
    
    ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #12944: [SPARK-15074][Shuffle] Cache shuffle index file t...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12944#discussion_r71418288
  
    --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java ---
    @@ -66,6 +67,16 @@
       @VisibleForTesting
       final ConcurrentMap<AppExecId, ExecutorShuffleInfo> executors;
     
    +  /**
    +   *  Caches index file information so that we can avoid open/close the index files
    +   *  for each block fetch.
    +   */
    +  private final ShuffleIndexCache shuffleIndexCache;
    +
    +  // Max number of entries to keep in the index cache.
    +  private static final String SPARK_SHUFFLE_SERVICE_INDEX_CACHE_ENTRIES = "spark.shuffle.service.index.cache.entries";
    +  private static final int DEFAULT_SPARK_SHUFFLE_SERVICE_INDEX_CACHE_ENTRIES = 1024;
    --- End diff --
    
    I don't find a common place where spark.shuffle.service.* configs are defined. For example if you check configs like spark.shuffle.service.enabled or spark.shuffle.service.port, they are defined all over the code base. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #12944: [SPARK-15074][Shuffle] Cache shuffle index file t...

Posted by ericl <gi...@git.apache.org>.
Github user ericl commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12944#discussion_r71420863
  
    --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java ---
    @@ -66,6 +67,16 @@
       @VisibleForTesting
       final ConcurrentMap<AppExecId, ExecutorShuffleInfo> executors;
     
    +  /**
    +   *  Caches index file information so that we can avoid open/close the index files
    +   *  for each block fetch.
    +   */
    +  private final ShuffleIndexCache shuffleIndexCache;
    +
    +  // Max number of entries to keep in the index cache.
    +  private static final String SPARK_SHUFFLE_SERVICE_INDEX_CACHE_ENTRIES = "spark.shuffle.service.index.cache.entries";
    +  private static final int DEFAULT_SPARK_SHUFFLE_SERVICE_INDEX_CACHE_ENTRIES = 1024;
    --- End diff --
    
    I see.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12944: [SPARK-15074][Shuffle] Cache shuffle index file to speed...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the issue:

    https://github.com/apache/spark/pull/12944
  
    LGTM. I suppose we could also add similar functionality to the non-shuffle-service version of IndexShuffleBlockResolver, but I think that's a much lower priority because I suspect that most folks who want to optimize production shuffle performance will be using the external shuffle service anyways.
    
    I've re-tested this locally and have confirmed that it still compiles and passes relevant tests, so I'm going to merge this to master. Thanks @sitalkedia!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15074][Shuffle] Cache shuffle index fil...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12944#discussion_r62334156
  
    --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ShuffleIndexRecord.java ---
    @@ -0,0 +1,39 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.network.shuffle;
    +
    +/**
    + * Contains offset and length of the shuffle block data.
    + */
    +public class ShuffleIndexRecord {
    +  private final long offset;
    +  private final long length;
    +
    +  public ShuffleIndexRecord(long offset, long length) {
    +    this.offset = offset;
    +    this.length = length;
    +  }
    +
    +  public long getOffset() {
    +    return offset;
    +  }
    +
    +  public long getLength() {
    +    return length;
    +  }
    +}
    --- End diff --
    
    will fix, thanks


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #12944: [SPARK-15074][Shuffle] Cache shuffle index file t...

Posted by ericl <gi...@git.apache.org>.
Github user ericl commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12944#discussion_r71251177
  
    --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ShuffleIndexRecord.java ---
    @@ -0,0 +1,39 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.network.shuffle;
    +
    +/**
    + * Contains offset and length of the shuffle block data.
    + */
    +public class ShuffleIndexRecord {
    +  private final long offset;
    +  private final long length;
    +
    +  public ShuffleIndexRecord(long offset, long length) {
    +    this.offset = offset;
    +    this.length = length;
    +  }
    +
    +  public long getOffset() {
    +    return offset;
    +  }
    +
    +  public long getLength() {
    +    return length;
    +  }
    +}
    --- End diff --
    
    The convention seems to be to not have newlines at the end of files.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #12944: [SPARK-15074][Shuffle] Cache shuffle index file t...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12944#discussion_r71570543
  
    --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ShuffleIndexCache.java ---
    @@ -0,0 +1,58 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.network.shuffle;
    +
    +import com.google.common.cache.CacheBuilder;
    +import com.google.common.cache.CacheLoader;
    +import com.google.common.cache.LoadingCache;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.io.DataInputStream;
    +import java.io.File;
    +import java.io.FileInputStream;
    +import java.io.IOException;
    +import java.nio.ByteBuffer;
    +import java.nio.LongBuffer;
    +import java.util.concurrent.ExecutionException;
    +
    +/**
    + * Maintains an LRU cache of {@link ShuffleIndexInformation} so that
    + * we can avoid open/close of the index files for each block fetch.
    + */
    +public class ShuffleIndexCache {
    --- End diff --
    
    changed accordingly. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12944: [SPARK-15074][Shuffle] Cache shuffle index file to speed...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/12944
  
    @JoshRosen, can you take a look?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12944: [SPARK-15074][Shuffle] Cache shuffle index file to speed...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/12944
  
    Can anyone take a look at it?
    cc - @rxin 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #12944: [SPARK-15074][Shuffle] Cache shuffle index file t...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12944#discussion_r71418310
  
    --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ShuffleIndexCache.java ---
    @@ -0,0 +1,58 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.network.shuffle;
    +
    +import com.google.common.cache.CacheBuilder;
    +import com.google.common.cache.CacheLoader;
    +import com.google.common.cache.LoadingCache;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.io.DataInputStream;
    +import java.io.File;
    +import java.io.FileInputStream;
    +import java.io.IOException;
    +import java.nio.ByteBuffer;
    +import java.nio.LongBuffer;
    +import java.util.concurrent.ExecutionException;
    +
    +/**
    + * Maintains an LRU cache of {@link ShuffleIndexInformation} so that
    + * we can avoid open/close of the index files for each block fetch.
    + */
    +public class ShuffleIndexCache {
    --- End diff --
    
    Okay, done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #12944: [SPARK-15074][Shuffle] Cache shuffle index file t...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12944#discussion_r71254024
  
    --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ShuffleIndexRecord.java ---
    @@ -0,0 +1,39 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.network.shuffle;
    +
    +/**
    + * Contains offset and length of the shuffle block data.
    + */
    +public class ShuffleIndexRecord {
    +  private final long offset;
    +  private final long length;
    +
    +  public ShuffleIndexRecord(long offset, long length) {
    +    this.offset = offset;
    +    this.length = length;
    +  }
    +
    +  public long getOffset() {
    +    return offset;
    +  }
    +
    +  public long getLength() {
    +    return length;
    +  }
    +}
    --- End diff --
    
    It seems it was rebased but I guess I meant below:
    
    ```
    -}
    +}
    \ No newline at end of file
    ```
    
    meaning
    
    ![2016-07-19 9 11 19](https://cloud.githubusercontent.com/assets/6477701/16934391/d5711740-4d90-11e6-9012-f604747ad4d2.png)
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15074][Shuffle] Cache shuffle index fil...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the pull request:

    https://github.com/apache/spark/pull/12944#issuecomment-217450001
  
    @holdenk -  `TransportConf` is not specific to the , it is used to create Transport client in other modules as well. Since number of index cache entry is very specific to the `ShuffleService`, I did not want to expose that as an api in the  `TransportConf`.  Let me know what you think about it. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #12944: [SPARK-15074][Shuffle] Cache shuffle index file t...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12944#discussion_r73609047
  
    --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java ---
    @@ -261,24 +280,17 @@ private ManagedBuffer getSortBasedShuffleBlockData(
         File indexFile = getFile(executor.localDirs, executor.subDirsPerLocalDir,
           "shuffle_" + shuffleId + "_" + mapId + "_0.index");
     
    -    DataInputStream in = null;
         try {
    -      in = new DataInputStream(new FileInputStream(indexFile));
    -      in.skipBytes(reduceId * 8);
    -      long offset = in.readLong();
    -      long nextOffset = in.readLong();
    +      ShuffleIndexInformation shuffleIndexInformation = shuffleIndexCache.get(indexFile);
    +      ShuffleIndexRecord shuffleIndexRecord = shuffleIndexInformation.getIndex(reduceId);
    --- End diff --
    
    It turns out that this call will fail with `ArrayIndexOutOfBoundsException` if the `reduceId` is too large. In the old code, an invalid `reduceId` would lead to an `IOException` because we'd skip past the end of the input stream and try to read.
    
    However, I don't think that this subtle change in behavior is going to necessarily cause problems from the caller's perspective since `ArrayIndexOutOfBoundsException` is also a `RuntimeException` and this code was already throwing `RuntimeException` in the "index file is missing" error case. Therefore, this looks good to me!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15074][Shuffle] Cache shuffle index fil...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the pull request:

    https://github.com/apache/spark/pull/12944#issuecomment-221283774
  
    cc - @srowen 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #12944: [SPARK-15074][Shuffle] Cache shuffle index file t...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12944#discussion_r71418432
  
    --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ShuffleIndexRecord.java ---
    @@ -0,0 +1,39 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.network.shuffle;
    +
    +/**
    + * Contains offset and length of the shuffle block data.
    + */
    +public class ShuffleIndexRecord {
    +  private final long offset;
    +  private final long length;
    +
    +  public ShuffleIndexRecord(long offset, long length) {
    +    this.offset = offset;
    +    this.length = length;
    +  }
    +
    +  public long getOffset() {
    +    return offset;
    +  }
    +
    +  public long getLength() {
    +    return length;
    +  }
    +}
    --- End diff --
    
    fixed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #12944: [SPARK-15074][Shuffle] Cache shuffle index file t...

Posted by ericl <gi...@git.apache.org>.
Github user ericl commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12944#discussion_r71251123
  
    --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ShuffleIndexInformation.java ---
    @@ -0,0 +1,58 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.network.shuffle;
    +
    +import com.google.common.cache.LoadingCache;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +import sun.nio.ch.IOUtil;
    +
    +import java.io.DataInputStream;
    +import java.io.File;
    +import java.io.FileInputStream;
    +import java.io.IOException;
    +import java.nio.ByteBuffer;
    +import java.nio.LongBuffer;
    +
    +/**
    + * Keeps the index information for a particular map output
    + * as an in-memory LongBuffer.
    + */
    +public class ShuffleIndexInformation {
    +  /** offsets as long buffer */
    +  private final LongBuffer offsets;
    +
    +  public ShuffleIndexInformation(File indexFile) throws IOException{
    +    int size = (int)indexFile.length();
    +    ByteBuffer buffer = ByteBuffer.allocate(size);
    +    DataInputStream dis = new DataInputStream(new FileInputStream(indexFile));
    +    dis.readFully(buffer.array());
    +    dis.close();
    --- End diff --
    
    close() in finally block?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #12944: [SPARK-15074][Shuffle] Cache shuffle index file t...

Posted by ericl <gi...@git.apache.org>.
Github user ericl commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12944#discussion_r71251332
  
    --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ShuffleIndexInformation.java ---
    @@ -0,0 +1,58 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.network.shuffle;
    +
    +import com.google.common.cache.LoadingCache;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +import sun.nio.ch.IOUtil;
    +
    +import java.io.DataInputStream;
    +import java.io.File;
    +import java.io.FileInputStream;
    +import java.io.IOException;
    +import java.nio.ByteBuffer;
    +import java.nio.LongBuffer;
    +
    +/**
    + * Keeps the index information for a particular map output
    + * as an in-memory LongBuffer.
    + */
    +public class ShuffleIndexInformation {
    +  /** offsets as long buffer */
    +  private final LongBuffer offsets;
    +
    +  public ShuffleIndexInformation(File indexFile) throws IOException{
    --- End diff --
    
    nit: space before the {


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #12944: [SPARK-15074][Shuffle] Cache shuffle index file t...

Posted by ericl <gi...@git.apache.org>.
Github user ericl commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12944#discussion_r71251280
  
    --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ShuffleIndexCache.java ---
    @@ -0,0 +1,58 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.network.shuffle;
    +
    +import com.google.common.cache.CacheBuilder;
    +import com.google.common.cache.CacheLoader;
    +import com.google.common.cache.LoadingCache;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.io.DataInputStream;
    +import java.io.File;
    +import java.io.FileInputStream;
    +import java.io.IOException;
    +import java.nio.ByteBuffer;
    +import java.nio.LongBuffer;
    +import java.util.concurrent.ExecutionException;
    +
    +/**
    + * Maintains an LRU cache of {@link ShuffleIndexInformation} so that
    + * we can avoid open/close of the index files for each block fetch.
    + */
    +public class ShuffleIndexCache {
    +
    +  LoadingCache<String, ShuffleIndexInformation> indexCache;
    +
    +  public ShuffleIndexCache(long cacheSize) {
    +    CacheLoader<String, ShuffleIndexInformation> loader =
    +      new CacheLoader<String, ShuffleIndexInformation>() {
    +        public ShuffleIndexInformation load(String file) throws IOException {
    +          return new ShuffleIndexInformation(new File(file));
    +        }
    +      };
    +    indexCache = CacheBuilder.newBuilder()
    +        .maximumSize(cacheSize).build(loader);
    +
    --- End diff --
    
    nit: extra newline


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15074][Shuffle] Cache shuffle index fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12944#issuecomment-217321362
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #12944: [SPARK-15074][Shuffle] Cache shuffle index file t...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/12944


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15074][Shuffle] Cache shuffle index fil...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the pull request:

    https://github.com/apache/spark/pull/12944#issuecomment-218248308
  
    cc - @rxin 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #12944: [SPARK-15074][Shuffle] Cache shuffle index file t...

Posted by ericl <gi...@git.apache.org>.
Github user ericl commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12944#discussion_r71251245
  
    --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java ---
    @@ -66,6 +67,16 @@
       @VisibleForTesting
       final ConcurrentMap<AppExecId, ExecutorShuffleInfo> executors;
     
    +  /**
    +   *  Caches index file information so that we can avoid open/close the index files
    +   *  for each block fetch.
    +   */
    +  private final ShuffleIndexCache shuffleIndexCache;
    +
    +  // Max number of entries to keep in the index cache.
    +  private static final String SPARK_SHUFFLE_SERVICE_INDEX_CACHE_ENTRIES = "spark.shuffle.service.index.cache.entries";
    +  private static final int DEFAULT_SPARK_SHUFFLE_SERVICE_INDEX_CACHE_ENTRIES = 1024;
    --- End diff --
    
    This conf seems a little out of place. Can it be co-located with the other spark.shuffle.* configs?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #12944: [SPARK-15074][Shuffle] Cache shuffle index file t...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12944#discussion_r71570503
  
    --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java ---
    @@ -95,6 +109,9 @@ public ExternalShuffleBlockResolver(TransportConf conf, File registeredExecutorF
           Executor directoryCleaner) throws IOException {
         this.conf = conf;
         this.registeredExecutorFile = registeredExecutorFile;
    +    int indexCacheEntries = conf.getInt(SPARK_SHUFFLE_SERVICE_INDEX_CACHE_ENTRIES,
    +                                        DEFAULT_SPARK_SHUFFLE_SERVICE_INDEX_CACHE_ENTRIES);
    +    this.shuffleIndexCache = new ShuffleIndexCache(indexCacheEntries);
    --- End diff --
    
    done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12944: [SPARK-15074][Shuffle] Cache shuffle index file to speed...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/12944
  
    Thanks @JoshRosen !


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #12944: [SPARK-15074][Shuffle] Cache shuffle index file t...

Posted by ericl <gi...@git.apache.org>.
Github user ericl commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12944#discussion_r71251461
  
    --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ShuffleIndexCache.java ---
    @@ -0,0 +1,58 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.network.shuffle;
    +
    +import com.google.common.cache.CacheBuilder;
    +import com.google.common.cache.CacheLoader;
    +import com.google.common.cache.LoadingCache;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.io.DataInputStream;
    +import java.io.File;
    +import java.io.FileInputStream;
    +import java.io.IOException;
    +import java.nio.ByteBuffer;
    +import java.nio.LongBuffer;
    +import java.util.concurrent.ExecutionException;
    +
    +/**
    + * Maintains an LRU cache of {@link ShuffleIndexInformation} so that
    + * we can avoid open/close of the index files for each block fetch.
    + */
    +public class ShuffleIndexCache {
    --- End diff --
    
    Can we just inline this cache into ExternalShuffleBlockResolver? It doesn't seem necessary to add a new class file that just wraps the guava cache.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #12944: [SPARK-15074][Shuffle] Cache shuffle index file t...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12944#discussion_r71418418
  
    --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ShuffleIndexInformation.java ---
    @@ -0,0 +1,58 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.network.shuffle;
    +
    +import com.google.common.cache.LoadingCache;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +import sun.nio.ch.IOUtil;
    +
    +import java.io.DataInputStream;
    +import java.io.File;
    +import java.io.FileInputStream;
    +import java.io.IOException;
    +import java.nio.ByteBuffer;
    +import java.nio.LongBuffer;
    +
    +/**
    + * Keeps the index information for a particular map output
    + * as an in-memory LongBuffer.
    + */
    +public class ShuffleIndexInformation {
    +  /** offsets as long buffer */
    +  private final LongBuffer offsets;
    +
    +  public ShuffleIndexInformation(File indexFile) throws IOException{
    +    int size = (int)indexFile.length();
    +    ByteBuffer buffer = ByteBuffer.allocate(size);
    +    DataInputStream dis = new DataInputStream(new FileInputStream(indexFile));
    +    dis.readFully(buffer.array());
    +    dis.close();
    --- End diff --
    
    done


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #12944: [SPARK-15074][Shuffle] Cache shuffle index file t...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12944#discussion_r71418366
  
    --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ShuffleIndexCache.java ---
    @@ -0,0 +1,58 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.network.shuffle;
    +
    +import com.google.common.cache.CacheBuilder;
    +import com.google.common.cache.CacheLoader;
    +import com.google.common.cache.LoadingCache;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.io.DataInputStream;
    +import java.io.File;
    +import java.io.FileInputStream;
    +import java.io.IOException;
    +import java.nio.ByteBuffer;
    +import java.nio.LongBuffer;
    +import java.util.concurrent.ExecutionException;
    +
    +/**
    + * Maintains an LRU cache of {@link ShuffleIndexInformation} so that
    + * we can avoid open/close of the index files for each block fetch.
    + */
    +public class ShuffleIndexCache {
    +
    +  LoadingCache<String, ShuffleIndexInformation> indexCache;
    +
    +  public ShuffleIndexCache(long cacheSize) {
    +    CacheLoader<String, ShuffleIndexInformation> loader =
    --- End diff --
    
    good point, changed accordingly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #12944: [SPARK-15074][Shuffle] Cache shuffle index file t...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12944#discussion_r71418380
  
    --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ShuffleIndexCache.java ---
    @@ -0,0 +1,58 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.network.shuffle;
    +
    +import com.google.common.cache.CacheBuilder;
    +import com.google.common.cache.CacheLoader;
    +import com.google.common.cache.LoadingCache;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.io.DataInputStream;
    +import java.io.File;
    +import java.io.FileInputStream;
    +import java.io.IOException;
    +import java.nio.ByteBuffer;
    +import java.nio.LongBuffer;
    +import java.util.concurrent.ExecutionException;
    +
    +/**
    + * Maintains an LRU cache of {@link ShuffleIndexInformation} so that
    + * we can avoid open/close of the index files for each block fetch.
    + */
    +public class ShuffleIndexCache {
    +
    +  LoadingCache<String, ShuffleIndexInformation> indexCache;
    +
    +  public ShuffleIndexCache(long cacheSize) {
    +    CacheLoader<String, ShuffleIndexInformation> loader =
    +      new CacheLoader<String, ShuffleIndexInformation>() {
    +        public ShuffleIndexInformation load(String file) throws IOException {
    +          return new ShuffleIndexInformation(new File(file));
    +        }
    +      };
    +    indexCache = CacheBuilder.newBuilder()
    +        .maximumSize(cacheSize).build(loader);
    +
    --- End diff --
    
    fixed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15074][Shuffle] Cache shuffle index fil...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/12944#issuecomment-217358557
  
    So a very minor style thing; it seems like the rest of the configuration values are exposed through accessor methods on TransportConf rather than directly exposing getInt, it might be better to try and expose this in the same was as `serverThreads()` or `numConnectionsPerPeer()`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12944: [SPARK-15074][Shuffle] Cache shuffle index file to speed...

Posted by ericl <gi...@git.apache.org>.
Github user ericl commented on the issue:

    https://github.com/apache/spark/pull/12944
  
    lgtm @JoshRosen 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12944: [SPARK-15074][Shuffle] Cache shuffle index file to speed...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the issue:

    https://github.com/apache/spark/pull/12944
  
    Hotfixing in #14499 to fix the build. My bad.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #12944: [SPARK-15074][Shuffle] Cache shuffle index file t...

Posted by ericl <gi...@git.apache.org>.
Github user ericl commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12944#discussion_r71421014
  
    --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ShuffleIndexCache.java ---
    @@ -0,0 +1,58 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.network.shuffle;
    +
    +import com.google.common.cache.CacheBuilder;
    +import com.google.common.cache.CacheLoader;
    +import com.google.common.cache.LoadingCache;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.io.DataInputStream;
    +import java.io.File;
    +import java.io.FileInputStream;
    +import java.io.IOException;
    +import java.nio.ByteBuffer;
    +import java.nio.LongBuffer;
    +import java.util.concurrent.ExecutionException;
    +
    +/**
    + * Maintains an LRU cache of {@link ShuffleIndexInformation} so that
    + * we can avoid open/close of the index files for each block fetch.
    + */
    +public class ShuffleIndexCache {
    --- End diff --
    
    I mean, it doesn't need to be a class.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org