Posted to reviews@spark.apache.org by zhichao-li <gi...@git.apache.org> on 2015/10/13 11:13:02 UTC

[GitHub] spark pull request: [WIP]Combine splits by size

GitHub user zhichao-li opened a pull request:

    https://github.com/apache/spark/pull/9097

    [WIP]Combine splits by size

    The idea is simple: it tries to solve this problem by combining, by size, the splits generated by the underlying InputFormat, so in theory it supports every InputFormat.
    The combining size is specified by `spark.sql.mapper.splitCombineSize`; the default value is -1, which turns the combining logic off.
    i.e. partition -> splits -> [combineSplit, combineSplit, ...] -> RDD
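
As a rough illustration of the combining step described above, the sketch below groups consecutive split lengths until each group reaches a target size. It is not the code in this pull request: `SplitCombiner` and `combineBySize` are hypothetical names, splits are modelled as plain `long` lengths rather than Hadoop `InputSplit` objects, and the greedy close-when-full policy is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: groups split lengths so that each group reaches a target size. */
final class SplitCombiner {

  static List<List<Long>> combineBySize(List<Long> splitLengths, long targetSize) {
    List<List<Long>> combined = new ArrayList<>();
    if (targetSize <= 0) {
      // A non-positive target (such as the default -1) disables combining:
      // every split stays in its own group.
      for (long len : splitLengths) {
        List<Long> single = new ArrayList<>();
        single.add(len);
        combined.add(single);
      }
      return combined;
    }
    List<Long> current = new ArrayList<>();
    long currentLen = 0;
    for (long len : splitLengths) {
      current.add(len);
      currentLen += len;
      if (currentLen >= targetSize) {
        // The group is large enough; close it and start a new one.
        combined.add(current);
        current = new ArrayList<>();
        currentLen = 0;
      }
    }
    if (!current.isEmpty()) {
      // Leftover splits that never reached the target form the last group.
      combined.add(current);
    }
    return combined;
  }
}
```

With lengths 40, 30, 50, 10 and a target of 64, this sketch yields two groups, [40, 30] and [50, 10]; with a target of -1 each split stays in its own group.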


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zhichao-li/spark newCombine

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9097.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9097
    
----
commit f9392c3438c1a9172692d595b08fbb8b8ad0133d
Author: zhichao.li <zh...@intel.com>
Date:   2015-10-13T06:37:33Z

    combine split by specific size

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [WIP]Combine splits by size

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-147659252
  
    [Test build #43641 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43641/consoleFull) for PR 9097 at commit [`f9392c3`](https://github.com/apache/spark/commit/f9392c3438c1a9172692d595b08fbb8b8ad0133d).


[GitHub] spark pull request: [SPARK-8813][SQL]Combine splits by size

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-188638099
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51929/
    Test PASSed.


[GitHub] spark issue #9097: [SPARK-8813][SQL]Combine splits by size

Posted by KevinZwx <gi...@git.apache.org>.

Github user KevinZwx commented on the issue:

    https://github.com/apache/spark/pull/9097
  
    This issue was marked as fixed in Spark 2.0.0, but `spark.sql.mapper.splitCombineSize` doesn't show up in the list of SQL configurations when I run `spark.sql("SET -v").show(numRows = 200, truncate = false)` in a spark-sql session. Am I doing something wrong?


[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-153961807
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45090/
    Test FAILed.


[GitHub] spark pull request: [SPARK-8813][SQL]Combine splits by size

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-188036191
  
    Merged build finished. Test FAILed.


[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-147898609
  
     Merged build triggered.


[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by zhichao-li <gi...@git.apache.org>.

Github user zhichao-li commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-153973454
  
    retest this please.


[GitHub] spark pull request: [SPARK-8813][SQL]Combine splits by size

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-188603350
  
    **[Test build #51929 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51929/consoleFull)** for PR 9097 at commit [`085ce5f`](https://github.com/apache/spark/commit/085ce5feca2294f81f9ec7a5660635be13c70a4a).


[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-147690432
  
    Merged build finished. Test FAILed.


[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by watermen <gi...@git.apache.org>.

Github user watermen commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-154911936
  
    @zhichao-li Can this patch support all formats (Text/ORC/Parquet)?


[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-147900129
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43691/
    Test FAILed.


[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by chenghao-intel <gi...@git.apache.org>.

Github user chenghao-intel commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-153974261
  
    cc/ @scwf @Sephiroth-Lin 
    
    @zhichao-li has posted the benchmark results that we produced, but they are based on fake data. I know you have a requirement for this improvement too; can you please test it with some real-world cases?


[GitHub] spark issue #9097: [SPARK-8813][SQL]Combine splits by size

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/9097
  
    I believe this has been fixed in Spark SQL in 2.0.0. Going to close this.


[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by chenghao-intel <gi...@git.apache.org>.

Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9097#discussion_r42200588
  
    --- Diff: sql/hive/src/main/java/org/apache/spark/sql/hive/mapred/CombineSplitRecordReader.java ---
    @@ -0,0 +1,128 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.hive.mapred;
    +
    +import java.io.IOException;
    +
    +import org.apache.hadoop.classification.InterfaceAudience;
    +import org.apache.hadoop.classification.InterfaceStability;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.mapred.InputFormat;
    +import org.apache.hadoop.mapred.JobConf;
    +import org.apache.hadoop.mapred.RecordReader;
    +import org.apache.hadoop.mapred.Reporter;
    +
    +/**
    + * A generic RecordReader that can hand out different recordReaders
    + * for each split in a {@link org.apache.spark.sql.hive.mapred.CombineSplit}.
    + */
    +@InterfaceAudience.Public
    +@InterfaceStability.Stable
    +public class CombineSplitRecordReader<K, V> implements RecordReader<K, V> {
    +  protected CombineSplit split;
    +  protected JobConf jc;
    +  protected FileSystem fs;
    +
    +  protected int idx;
    +  protected long progress;
    +  protected RecordReader<K, V> curReader;
    +
    +  @Override
    +  public boolean next(K key, V value) throws IOException {
    +    while ((curReader == null) || !curReader.next(key, value)) {
    +      if (!initNextRecordReader()) {
    +        return false;
    +      }
    +    }
    +    return true;
    +  }
    +
    +  public K createKey() {
    +    return curReader.createKey();
    +  }
    +
    +  public V createValue() {
    +    return curReader.createValue();
    +  }
    +
    +  /**
    +   * return the amount of data processed
    +   */
    +  public long getPos() throws IOException {
    --- End diff --
    
    `progress + curReader.getPos()`?
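
A minimal sketch of what the suggestion above might look like, kept free of Hadoop dependencies so it can run on its own. `PositionedReader` and `CombinedPosition` are hypothetical stand-ins, not classes from this pull request; the real `RecordReader.getPos()` declares `IOException`, which is omitted here for brevity. The sketch also assumes that each child reader reports its position relative to the start of its own split, which may not hold for every file-based reader.

```java
/** Hypothetical stand-in for the one RecordReader method used in this sketch. */
interface PositionedReader {
  long getPos();
}

/** Tracks the overall position across child readers that are consumed one after another. */
final class CombinedPosition {
  // Bytes accounted for by child readers that have already been exhausted.
  private long progress;
  // Reader for the split currently being read, or null if there is none.
  private PositionedReader curReader;

  /** Credits the finished split's length and switches to the next child reader. */
  void advance(long finishedSplitLength, PositionedReader next) {
    progress += finishedSplitLength;
    curReader = next;
  }

  /** Finished splits plus the position inside the current one. */
  long getPos() {
    return curReader == null ? progress : progress + curReader.getPos();
  }
}
```

The null check matters because `getPos()` can be called before the first child reader is opened or after the last one is closed.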


[GitHub] spark pull request: [SPARK-8813][SQL]Combine splits by size

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-188109644
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51845/
    Test FAILed.


[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by chenghao-intel <gi...@git.apache.org>.

Github user chenghao-intel commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-148599433
  
    It looks good in general, and can you also attach the benchmark result?


[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by chenghao-intel <gi...@git.apache.org>.

Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9097#discussion_r42200659
  
    --- Diff: sql/hive/src/main/java/org/apache/spark/sql/hive/mapred/CombineSplitRecordReader.java ---
    @@ -0,0 +1,128 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.hive.mapred;
    +
    +import java.io.IOException;
    +
    +import org.apache.hadoop.classification.InterfaceAudience;
    +import org.apache.hadoop.classification.InterfaceStability;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.mapred.InputFormat;
    +import org.apache.hadoop.mapred.JobConf;
    +import org.apache.hadoop.mapred.RecordReader;
    +import org.apache.hadoop.mapred.Reporter;
    +
    +/**
    + * A generic RecordReader that can hand out different recordReaders
    + * for each split in a {@link org.apache.spark.sql.hive.mapred.CombineSplit}.
    + */
    +@InterfaceAudience.Public
    +@InterfaceStability.Stable
    +public class CombineSplitRecordReader<K, V> implements RecordReader<K, V> {
    +  protected CombineSplit split;
    +  protected JobConf jc;
    +  protected FileSystem fs;
    +
    +  protected int idx;
    +  protected long progress;
    +  protected RecordReader<K, V> curReader;
    +
    +  @Override
    +  public boolean next(K key, V value) throws IOException {
    +    while ((curReader == null) || !curReader.next(key, value)) {
    +      if (!initNextRecordReader()) {
    +        return false;
    +      }
    +    }
    +    return true;
    +  }
    +
    +  public K createKey() {
    +    return curReader.createKey();
    +  }
    +
    +  public V createValue() {
    +    return curReader.createValue();
    +  }
    +
    +  /**
    +   * return the amount of data processed
    +   */
    +  public long getPos() throws IOException {
    +    return progress;
    +  }
    +
    +  public void close() throws IOException {
    +    if (curReader != null) {
    +      curReader.close();
    +      curReader = null;
    +    }
    +  }
    +
    +  /**
    +   * return progress based on the amount of data processed so far.
    +   */
    +  public float getProgress() throws IOException {
    +    return Math.min(1.0f,  progress/(float)(split.getLength()));
    +  }
    +  private InputFormat<K, V> inputFormat;
    --- End diff --
    
    Move this ahead? Put all of the class members together?


[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-153933856
  
     Build triggered.


[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-153936737
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45089/
    Test FAILed.


[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-153934479
  
    Merged build started.


[GitHub] spark pull request: [SPARK-8813][SQL]Combine splits by size

Posted by zhichao-li <gi...@git.apache.org>.

Github user zhichao-li commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-188600562
  
    retest this please


[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by chenghao-intel <gi...@git.apache.org>.

Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9097#discussion_r42201643
  
    --- Diff: sql/hive/src/main/java/org/apache/spark/sql/hive/mapred/CombineSplitInputFormat.java ---
    @@ -0,0 +1,109 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.hive.mapred;
    +
    +import java.io.IOException;
    +import java.util.*;
    +
    +import com.clearspring.analytics.util.Lists;
    +import com.google.common.collect.Maps;
    +import com.google.common.collect.Sets;
    +import org.apache.hadoop.mapred.*;
    +
    +public class CombineSplitInputFormat<K, V> implements InputFormat<K, V> {
    +
    +  private InputFormat<K, V> inputformat;
    +  private long splitSize = 0;
    +
    +  public CombineSplitInputFormat(InputFormat<K, V> inputformat, long splitSize) {
    +    this.inputformat = inputformat;
    +    this.splitSize = splitSize;
    +  }
    +
    +  /**
    +   * Create a single split from the list of blocks specified in validBlocks
    +   * Add this new split into splitList.
    +   */
    +  private void addCreatedSplit(List<CombineSplit> splitList,
    +                               long totalLen,
    +                               Collection<String> locations,
    +                               List<InputSplit> validSplits) {
    +    CombineSplit combineSparkSplit =
    +      new CombineSplit(validSplits.toArray(new InputSplit[0]),
    +        totalLen, locations.toArray(new String[0]));
    --- End diff --
    
    Leave a TODO: we can probably optimize this by providing the second/third optimal locations.
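
One possible reading of this TODO, sketched under stated assumptions: rank hosts by how many bytes of the combined split they hold, then report the top few as preferred locations. `LocationRanker` and `topLocations` are hypothetical names, and the byte-count ranking is an assumption rather than anything specified in the review; host lists and lengths are passed as plain collections instead of Hadoop `InputSplit` objects.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: picks the hosts holding the most bytes of a combined split. */
final class LocationRanker {

  /**
   * hostsPerSplit.get(i) lists the hosts of the i-th child split and
   * lengths.get(i) is that split's length in bytes.
   */
  static List<String> topLocations(List<String[]> hostsPerSplit, List<Long> lengths, int maxHosts) {
    Map<String, Long> bytesPerHost = new HashMap<>();
    for (int i = 0; i < hostsPerSplit.size(); i++) {
      for (String host : hostsPerSplit.get(i)) {
        bytesPerHost.merge(host, lengths.get(i), Long::sum);
      }
    }
    List<Map.Entry<String, Long>> entries = new ArrayList<>(bytesPerHost.entrySet());
    // Most local bytes first; ties are broken by host name so the order is deterministic.
    entries.sort((a, b) -> {
      int byBytes = Long.compare(b.getValue(), a.getValue());
      return byBytes != 0 ? byBytes : a.getKey().compareTo(b.getKey());
    });
    List<String> result = new ArrayList<>();
    for (int i = 0; i < Math.min(maxHosts, entries.size()); i++) {
      result.add(entries.get(i).getKey());
    }
    return result;
  }
}
```

Returning an ordered list lets a caller pass the first entry as the best location and the following entries as fallbacks.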


[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-153961727
  
    **[Test build #45090 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45090/consoleFull)** for PR 9097 at commit [`5793af1`](https://github.com/apache/spark/commit/5793af16c662c648508fd1a41865fe86bfc9b4f9).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
       * `public class CombineSplit implements InputSplit`
       * `public class CombineSplitInputFormat<K, V> implements InputFormat<K, V>`
       * `public class CombineSplitRecordReader<K, V> implements RecordReader<K, V>`
       * `class HadoopCombineRDD[K, V](`

