Posted to reviews@spark.apache.org by zhichao-li <gi...@git.apache.org> on 2015/10/13 11:13:02 UTC

[GitHub] spark pull request: [WIP]Combine splits by size

GitHub user zhichao-li opened a pull request:

    https://github.com/apache/spark/pull/9097

    [WIP]Combine splits by size

    The idea is simple: it tries to solve this problem by combining, by size, the splits generated by the underlying InputFormat, so in theory it would support every InputFormat.
    The combined size can be specified with `spark.sql.mapper.splitCombineSize`; the default value is -1, which turns the combining logic off.
    i.e. partition -> splits -> [combineSplit, combineSplit, ...] -> RDD
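
As a rough illustration of the combining step described above, here is a minimal sketch in Java. It is not the code from this pull request: `SplitCombiner` and `combineBySize` are hypothetical names, splits are reduced to their lengths in bytes, and the greedy "close a group once it reaches the target size" rule is an assumption about how such packing could work. Locality is ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of the pull request.
final class SplitCombiner {

  /**
   * Greedily packs split lengths (in bytes) into groups, closing a group
   * once its total reaches targetSize. A non-positive targetSize (such as
   * the default -1) disables combining: every split stays in its own group.
   */
  static List<List<Long>> combineBySize(List<Long> splitLengths, long targetSize) {
    List<List<Long>> groups = new ArrayList<>();
    List<Long> current = new ArrayList<>();
    long currentLen = 0L;
    for (long len : splitLengths) {
      if (targetSize <= 0) {
        // Combining is turned off: one group per original split.
        List<Long> single = new ArrayList<>();
        single.add(len);
        groups.add(single);
        continue;
      }
      current.add(len);
      currentLen += len;
      if (currentLen >= targetSize) {
        groups.add(current);
        current = new ArrayList<>();
        currentLen = 0L;
      }
    }
    // Keep any leftover splits that did not fill a whole group.
    if (!current.isEmpty()) {
      groups.add(current);
    }
    return groups;
  }

  public static void main(String[] args) {
    List<Long> lengths = Arrays.asList(40L, 30L, 50L, 10L, 20L);
    System.out.println(combineBySize(lengths, 64L));
    System.out.println(combineBySize(lengths, -1L));
  }
}
```

With a target of 64 the five sample splits collapse into two groups; with -1 they are left untouched, which mirrors the "turned off by default" behaviour described in the pull request.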


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zhichao-li/spark newCombine

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9097.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9097
    
----
commit f9392c3438c1a9172692d595b08fbb8b8ad0133d
Author: zhichao.li <zh...@intel.com>
Date:   2015-10-13T06:37:33Z

    combine split by specific size

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [WIP]Combine splits by size

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-147659252
  
    [Test build #43641 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43641/consoleFull) for PR 9097 at commit [`f9392c3`](https://github.com/apache/spark/commit/f9392c3438c1a9172692d595b08fbb8b8ad0133d).




[GitHub] spark pull request: [SPARK-8813][SQL]Combine splits by size

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-188638099
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51929/
    Test PASSed.




[GitHub] spark issue #9097: [SPARK-8813][SQL]Combine splits by size

Posted by KevinZwx <gi...@git.apache.org>.
Github user KevinZwx commented on the issue:

    https://github.com/apache/spark/pull/9097
  
    This issue was marked as fixed in Spark 2.0.0, but `spark.sql.mapper.splitCombineSize` doesn't show up in the list of SQL configurations when I run `spark.sql("SET -v").show(numRows = 200, truncate = false)` in a spark-sql session. Am I doing something wrong?




[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-153961807
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45090/
    Test FAILed.




[GitHub] spark pull request: [SPARK-8813][SQL]Combine splits by size

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-188036191
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-147898609
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by zhichao-li <gi...@git.apache.org>.
Github user zhichao-li commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-153973454
  
    retest this please.




[GitHub] spark pull request: [SPARK-8813][SQL]Combine splits by size

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-188603350
  
    **[Test build #51929 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51929/consoleFull)** for PR 9097 at commit [`085ce5f`](https://github.com/apache/spark/commit/085ce5feca2294f81f9ec7a5660635be13c70a4a).




[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-147690432
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by watermen <gi...@git.apache.org>.
Github user watermen commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-154911936
  
    @zhichao-li Can this patch support all formats (Text/ORC/Parquet)?




[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-147900129
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43691/
    Test FAILed.




[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-153974261
  
    cc/ @scwf @Sephiroth-Lin 
    
    @zhichao-li has posted the benchmark results we produced, but they are based on fake data. I know you also need this improvement; could you please test it with some real-world cases?




[GitHub] spark issue #9097: [SPARK-8813][SQL]Combine splits by size

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/9097
  
    I believe this has been fixed in Spark SQL in 2.0.0. Going to close this.




[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9097#discussion_r42200588
  
    --- Diff: sql/hive/src/main/java/org/apache/spark/sql/hive/mapred/CombineSplitRecordReader.java ---
    @@ -0,0 +1,128 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.hive.mapred;
    +
    +import java.io.IOException;
    +
    +import org.apache.hadoop.classification.InterfaceAudience;
    +import org.apache.hadoop.classification.InterfaceStability;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.mapred.InputFormat;
    +import org.apache.hadoop.mapred.JobConf;
    +import org.apache.hadoop.mapred.RecordReader;
    +import org.apache.hadoop.mapred.Reporter;
    +
    +/**
    + * A generic RecordReader that can hand out different recordReaders
    + * for each split in a {@link org.apache.spark.sql.hive.mapred.CombineSplit}.
    + */
    +@InterfaceAudience.Public
    +@InterfaceStability.Stable
    +public class CombineSplitRecordReader<K, V> implements RecordReader<K, V> {
    +  protected CombineSplit split;
    +  protected JobConf jc;
    +  protected FileSystem fs;
    +
    +  protected int idx;
    +  protected long progress;
    +  protected RecordReader<K, V> curReader;
    +
    +  @Override
    +  public boolean next(K key, V value) throws IOException {
    +    while ((curReader == null) || !curReader.next(key, value)) {
    +      if (!initNextRecordReader()) {
    +        return false;
    +      }
    +    }
    +    return true;
    +  }
    +
    +  public K createKey() {
    +    return curReader.createKey();
    +  }
    +
    +  public V createValue() {
    +    return curReader.createValue();
    +  }
    +
    +  /**
    +   * return the amount of data processed
    +   */
    +  public long getPos() throws IOException {
    --- End diff --
    
    `progress + curReader.getPos()`?
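
The suggestion above can be sketched as follows. This is a simplified, hypothetical model rather than the reader in the diff: `CombinedPosition`, `startSplit`, and `finishSplit` are invented names, and the current reader is reduced to a `LongSupplier` that reports its position. It assumes the underlying reader reports a position relative to the start of its own sub-split; a reader that reports an absolute file offset would need the sub-split's start subtracted first.

```java
import java.util.function.LongSupplier;

// Hypothetical model of position tracking across several sub-splits.
final class CombinedPosition {
  private long progress = 0L;        // bytes from sub-splits already finished
  private LongSupplier curReaderPos; // position inside the current sub-split, or null

  /** Called when a reader for the next sub-split is opened. */
  void startSplit(LongSupplier readerPos) {
    curReaderPos = readerPos;
  }

  /** Called when the current sub-split is exhausted. */
  void finishSplit(long splitLength) {
    progress += splitLength;
    curReaderPos = null;
  }

  /** Finished bytes plus the position inside the sub-split being read. */
  long getPos() {
    return curReaderPos == null ? progress : progress + curReaderPos.getAsLong();
  }

  public static void main(String[] args) {
    CombinedPosition pos = new CombinedPosition();
    pos.startSplit(() -> 25L);
    System.out.println(pos.getPos());
    pos.finishSplit(100L);
    pos.startSplit(() -> 10L);
    System.out.println(pos.getPos());
  }
}
```

Returning only the finished bytes would make the reported position jump in steps of one sub-split; adding the current reader's position gives a smoother value within each sub-split.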




[GitHub] spark pull request: [SPARK-8813][SQL]Combine splits by size

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-188109644
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51845/
    Test FAILed.




[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-148599433
  
    It looks good in general, and can you also attach the benchmark result?




[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9097#discussion_r42200659
  
    --- Diff: sql/hive/src/main/java/org/apache/spark/sql/hive/mapred/CombineSplitRecordReader.java ---
    @@ -0,0 +1,128 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.hive.mapred;
    +
    +import java.io.IOException;
    +
    +import org.apache.hadoop.classification.InterfaceAudience;
    +import org.apache.hadoop.classification.InterfaceStability;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.mapred.InputFormat;
    +import org.apache.hadoop.mapred.JobConf;
    +import org.apache.hadoop.mapred.RecordReader;
    +import org.apache.hadoop.mapred.Reporter;
    +
    +/**
    + * A generic RecordReader that can hand out different recordReaders
    + * for each split in a {@link org.apache.spark.sql.hive.mapred.CombineSplit}.
    + */
    +@InterfaceAudience.Public
    +@InterfaceStability.Stable
    +public class CombineSplitRecordReader<K, V> implements RecordReader<K, V> {
    +  protected CombineSplit split;
    +  protected JobConf jc;
    +  protected FileSystem fs;
    +
    +  protected int idx;
    +  protected long progress;
    +  protected RecordReader<K, V> curReader;
    +
    +  @Override
    +  public boolean next(K key, V value) throws IOException {
    +    while ((curReader == null) || !curReader.next(key, value)) {
    +      if (!initNextRecordReader()) {
    +        return false;
    +      }
    +    }
    +    return true;
    +  }
    +
    +  public K createKey() {
    +    return curReader.createKey();
    +  }
    +
    +  public V createValue() {
    +    return curReader.createValue();
    +  }
    +
    +  /**
    +   * return the amount of data processed
    +   */
    +  public long getPos() throws IOException {
    +    return progress;
    +  }
    +
    +  public void close() throws IOException {
    +    if (curReader != null) {
    +      curReader.close();
    +      curReader = null;
    +    }
    +  }
    +
    +  /**
    +   * return progress based on the amount of data processed so far.
    +   */
    +  public float getProgress() throws IOException {
    +    return Math.min(1.0f,  progress/(float)(split.getLength()));
    +  }
    +  private InputFormat<K, V> inputFormat;
    --- End diff --
    
    Move this ahead? Put all of the class members together?




[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-153933856
  
     Build triggered.




[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-153936737
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45089/
    Test FAILed.




[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-153934479
  
    Merged build started.




[GitHub] spark pull request: [SPARK-8813][SQL]Combine splits by size

Posted by zhichao-li <gi...@git.apache.org>.
Github user zhichao-li commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-188600562
  
    retest this please




[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9097#discussion_r42201643
  
    --- Diff: sql/hive/src/main/java/org/apache/spark/sql/hive/mapred/CombineSplitInputFormat.java ---
    @@ -0,0 +1,109 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.hive.mapred;
    +
    +import java.io.IOException;
    +import java.util.*;
    +
    +import com.clearspring.analytics.util.Lists;
    +import com.google.common.collect.Maps;
    +import com.google.common.collect.Sets;
    +import org.apache.hadoop.mapred.*;
    +
    +public class CombineSplitInputFormat<K, V> implements InputFormat<K, V> {
    +
    +  private InputFormat<K, V> inputformat;
    +  private long splitSize = 0;
    +
    +  public CombineSplitInputFormat(InputFormat<K, V> inputformat, long splitSize) {
    +    this.inputformat = inputformat;
    +    this.splitSize = splitSize;
    +  }
    +
    +  /**
    +   * Create a single split from the list of blocks specified in validBlocks
    +   * Add this new split into splitList.
    +   */
    +  private void addCreatedSplit(List<CombineSplit> splitList,
    +                               long totalLen,
    +                               Collection<String> locations,
    +                               List<InputSplit> validSplits) {
    +    CombineSplit combineSparkSplit =
    +      new CombineSplit(validSplits.toArray(new InputSplit[0]),
    +        totalLen, locations.toArray(new String[0]));
    --- End diff --
    
    Leave a TODO: we can probably optimize this by providing the second and third optimal locations.
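
One possible reading of this TODO is to rank candidate hosts by how many bytes of the combined split they hold and report the top few. The sketch below is an assumption about what that could look like, not code from the pull request: `LocationRanker` and `topLocations` are hypothetical names, and each sub-split is modelled only by its length and its list of host names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the pull request.
final class LocationRanker {

  /**
   * Returns up to n host names ordered by the number of bytes of the
   * combined split stored on each host (largest first, ties broken by name).
   * lengths[i] is the size of sub-split i and hosts[i] lists its locations.
   */
  static List<String> topLocations(long[] lengths, String[][] hosts, int n) {
    Map<String, Long> bytesPerHost = new HashMap<>();
    for (int i = 0; i < lengths.length; i++) {
      for (String host : hosts[i]) {
        bytesPerHost.merge(host, lengths[i], Long::sum);
      }
    }
    List<Map.Entry<String, Long>> entries = new ArrayList<>(bytesPerHost.entrySet());
    entries.sort((a, b) -> {
      int byBytes = Long.compare(b.getValue(), a.getValue());
      return byBytes != 0 ? byBytes : a.getKey().compareTo(b.getKey());
    });
    List<String> result = new ArrayList<>();
    for (int i = 0; i < entries.size() && i < n; i++) {
      result.add(entries.get(i).getKey());
    }
    return result;
  }

  public static void main(String[] args) {
    long[] lengths = {100L, 50L, 30L};
    String[][] hosts = {{"a", "b"}, {"b", "c"}, {"c"}};
    System.out.println(topLocations(lengths, hosts, 3));
  }
}
```

In the sample data host `b` holds 150 bytes, `a` holds 100, and `c` holds 80, so a scheduler given this ordering could fall back to the second or third host when the best one is busy.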




[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-153961727
  
    **[Test build #45090 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45090/consoleFull)** for PR 9097 at commit [`5793af1`](https://github.com/apache/spark/commit/5793af16c662c648508fd1a41865fe86bfc9b4f9).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
       * `public class CombineSplit implements InputSplit`
       * `public class CombineSplitInputFormat<K, V> implements InputFormat<K, V>`
       * `public class CombineSplitRecordReader<K, V> implements RecordReader<K, V>`
       * `class HadoopCombineRDD[K, V](`




[GitHub] spark pull request: [WIP]Combine splits by size

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-147658806
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-8813][SQL]Combine splits by size

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9097#discussion_r51503553
  
    --- Diff: sql/hive/src/main/java/org/apache/spark/sql/hive/mapred/CombineSplit.java ---
    @@ -0,0 +1,97 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.hive.mapred;
    +
    +import java.io.DataInput;
    +import java.io.DataOutput;
    +import java.io.IOException;
    +
    +import org.apache.hadoop.io.Writable;
    +import org.apache.hadoop.io.WritableFactories;
    +import org.apache.hadoop.mapred.InputSplit;
    +
    +public class CombineSplit implements InputSplit {
    --- End diff --
    
    Please add a comment here to point out which version of Hive/Hadoop this implementation is based on.




[GitHub] spark pull request: [SPARK-8813][SQL]Combine splits by size

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-188034361
  
    **[Test build #51839 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51839/consoleFull)** for PR 9097 at commit [`701700b`](https://github.com/apache/spark/commit/701700b10078d48d8e9f0342e623e94243642de3).




[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9097#discussion_r42200104
  
    --- Diff: sql/hive/src/main/java/org/apache/spark/sql/hive/mapred/CombineSplit.java ---
    @@ -0,0 +1,95 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.hive.mapred;
    +
    +import java.io.DataInput;
    +import java.io.DataOutput;
    +import java.io.IOException;
    +
    +import org.apache.hadoop.io.Writable;
    +import org.apache.hadoop.io.WritableFactories;
    +import org.apache.hadoop.mapred.InputSplit;
    +
    +public class CombineSplit implements InputSplit {
    +  private InputSplit[] splits;
    +  private long totalLen;
    +  private String[] locations;
    +
    +  public CombineSplit() {
    +  }
    +
    +  public CombineSplit(InputSplit[] ss, long totalLen, String[] locations) {
    +    splits = ss;
    +    this.totalLen = totalLen;
    +    this.locations = locations;
    +  }
    +
    +  public InputSplit getSplit(int idx) {
    +    return splits[idx];
    +  }
    +
    +  public int getSplitNum() {
    +    return splits.length;
    +  }
    +
    +  @Override
    +  public long getLength() {
    +    return totalLen;
    +  }
    +
    +  @Override
    +  public String[] getLocations() throws IOException {
    +    return locations;
    +  }
    +
    +  @Override
    +  public void write(DataOutput out) throws IOException {
    +    out.writeLong(totalLen);
    +    out.writeInt(locations.length);
    +    for (String location : locations) {
    +      out.writeUTF(location);
    +    }
    +    out.writeInt(splits.length);
    +    out.writeUTF(splits[0].getClass().getCanonicalName());
    --- End diff --
    
    Can you add a comment saying that we only combine splits within a single table partition, so the class names of all the splits should be identical?
    
    Nit: how about writing the split class name at the very beginning, instead of after all of the location info?
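
The ordering suggested in the nit above can be sketched roughly as follows. This is a minimal illustration only, not the PR's code: `CombineSplitHeaderSketch` and `Header` are hypothetical names, plain `java.io` streams stand in for Hadoop's `Writable` plumbing, and the per-split payload is left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class CombineSplitHeaderSketch {
  // Hypothetical stand-in for the fields that CombineSplit serializes.
  static final class Header {
    final String splitClassName;
    final long totalLen;
    final String[] locations;

    Header(String splitClassName, long totalLen, String[] locations) {
      this.splitClassName = splitClassName;
      this.totalLen = totalLen;
      this.locations = locations;
    }
  }

  static byte[] write(Header h) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      // Splits are only combined within one table partition, so they all
      // share one class; the name is written once, before everything else.
      out.writeUTF(h.splitClassName);
      out.writeLong(h.totalLen);
      out.writeInt(h.locations.length);
      for (String location : h.locations) {
        out.writeUTF(location);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  static Header read(byte[] data) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
      // Fields are read back in exactly the order they were written.
      String splitClassName = in.readUTF();
      long totalLen = in.readLong();
      String[] locations = new String[in.readInt()];
      for (int i = 0; i < locations.length; i++) {
        locations[i] = in.readUTF();
      }
      return new Header(splitClassName, totalLen, locations);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Writing the class name first means a reader knows which split type to instantiate before it touches any other field.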




[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-147900128
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-8813][SQL]Combine splits by size

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-188638097
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-8813][SQL]Combine splits by size

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9097#discussion_r51503559
  
    --- Diff: sql/hive/src/main/java/org/apache/spark/sql/hive/mapred/CombineSplitRecordReader.java ---
    @@ -0,0 +1,128 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.hive.mapred;
    +
    +import java.io.IOException;
    +
    +import org.apache.hadoop.classification.InterfaceAudience;
    +import org.apache.hadoop.classification.InterfaceStability;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.mapred.InputFormat;
    +import org.apache.hadoop.mapred.JobConf;
    +import org.apache.hadoop.mapred.RecordReader;
    +import org.apache.hadoop.mapred.Reporter;
    +
    +/**
    + * A generic RecordReader that can hand out different recordReaders
    + * for each split in a {@link org.apache.spark.sql.hive.mapred.CombineSplit}.
    + */
    +@InterfaceAudience.Public
    +@InterfaceStability.Stable
    --- End diff --
    
    Remove these two annotations since they are not true in the scope of Spark.




[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9097#discussion_r42201254
  
    --- Diff: sql/hive/src/main/java/org/apache/spark/sql/hive/mapred/CombineSplitInputFormat.java ---
    @@ -0,0 +1,109 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.hive.mapred;
    +
    +import java.io.IOException;
    +import java.util.*;
    +
    +import com.clearspring.analytics.util.Lists;
    +import com.google.common.collect.Maps;
    +import com.google.common.collect.Sets;
    +import org.apache.hadoop.mapred.*;
    +
    +public class CombineSplitInputFormat<K, V> implements InputFormat<K, V> {
    +
    +  private InputFormat<K, V> inputformat;
    +  private long splitSize = 0;
    +
    +  public CombineSplitInputFormat(InputFormat<K, V> inputformat, long splitSize) {
    +    this.inputformat = inputformat;
    +    this.splitSize = splitSize;
    +  }
    +
    +  /**
    +   * Create a single split from the list of blocks specified in validBlocks
    +   * Add this new split into splitList.
    +   */
    +  private void addCreatedSplit(List<CombineSplit> splitList,
    +                               long totalLen,
    +                               Collection<String> locations,
    +                               List<InputSplit> validSplits) {
    +    CombineSplit combineSparkSplit =
    +      new CombineSplit(validSplits.toArray(new InputSplit[0]),
    +        totalLen, locations.toArray(new String[0]));
    +    splitList.add(combineSparkSplit);
    +  }
    +
    +  @Override
    +  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    +    InputSplit[] splits = inputformat.getSplits(job, numSplits);
    +    // populate nodeToSplits and splitsSet
    +    Map<String, List<InputSplit>> nodeToSplits = Maps.newHashMap();
    +    Set<InputSplit> splitsSet = Sets.newHashSet();
    +    for (InputSplit split: splits) {
    +      for (String node: split.getLocations()) {
    +        if (!nodeToSplits.containsKey(node)) {
    +          nodeToSplits.put(node, new ArrayList<InputSplit>());
    +        }
    +        nodeToSplits.get(node).add(split);
    +      }
    +      splitsSet.add(split);
    +    }
    +    // Iterate the nodes to combine in order to evenly distributing the splits
    +    List<CombineSplit> combineSparkSplits = Lists.newArrayList();
    +    List<InputSplit> oneCombinedSplits = Lists.newArrayList();
    +    long currentSplitSize = 0L;
    +    for (Map.Entry<String, List<InputSplit>> entry: nodeToSplits.entrySet()) {
    +      String node = entry.getKey();
    +      List<InputSplit> splitsPerNode = entry.getValue();
    --- End diff --
    
    Would it help to sort `splitsPerNode` by split length?
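
To make the sorting question concrete, here is a minimal sketch under stated assumptions: plain `long` lengths stand in for `InputSplit` objects, the class and method names are hypothetical, and sorting largest-first is just one possible ordering. The flush condition mirrors the one visible in the diff (a group is closed once its accumulated size exceeds the target); whether sorting actually helps in the PR would need measuring.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SortedCombineSketch {
  /**
   * Sorts the lengths largest-first, then groups them greedily.
   * A group is closed once its accumulated size exceeds targetSize;
   * targetSize == 0 disables closing, so everything lands in one group.
   */
  static List<List<Long>> combine(List<Long> splitLengths, long targetSize) {
    List<Long> sorted = new ArrayList<>(splitLengths);
    sorted.sort(Comparator.reverseOrder());

    List<List<Long>> groups = new ArrayList<>();
    List<Long> current = new ArrayList<>();
    long currentSize = 0L;
    for (long len : sorted) {
      if (targetSize != 0 && currentSize > targetSize) {
        groups.add(current);
        current = new ArrayList<>();
        currentSize = 0L;
      }
      current.add(len);
      currentSize += len;
    }
    // Flush whatever is left after the last split.
    if (!current.isEmpty()) {
      groups.add(current);
    }
    return groups;
  }
}
```

With lengths 10, 80, 10, 30 and a target of 50, the large split ends up alone and the three small ones are grouped together.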




[GitHub] spark pull request: [SPARK-8813][SQL]Combine splits by size

Posted by zhichao-li <gi...@git.apache.org>.
Github user zhichao-li commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9097#discussion_r53276223
  
    --- Diff: sql/hive/src/main/java/org/apache/spark/sql/hive/mapred/CombineSplitInputFormat.java ---
    @@ -0,0 +1,110 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.hive.mapred;
    +
    +import java.io.IOException;
    +import java.util.*;
    +
    +import com.clearspring.analytics.util.Lists;
    +import com.google.common.collect.Maps;
    +import com.google.common.collect.Sets;
    +import org.apache.hadoop.mapred.*;
    +
    +public class CombineSplitInputFormat<K, V> implements InputFormat<K, V> {
    --- End diff --
    
    I don't think it is strictly tied to a specific Hive/Hadoop version; it just borrows some ideas from `CombineFileInputFormat` in Hadoop 2.2.0.
    This PR shares a similar idea with https://github.com/apache/spark/pull/10572, which also tries to create a new proxy input format; only the combination logic differs slightly.
    I will try to clean up this code a bit.




[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by zhichao-li <gi...@git.apache.org>.
Github user zhichao-li commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9097#discussion_r42201471
  
    --- Diff: sql/hive/src/main/java/org/apache/spark/sql/hive/mapred/CombineSplitRecordReader.java ---
    @@ -0,0 +1,128 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.hive.mapred;
    +
    +import java.io.IOException;
    +
    +import org.apache.hadoop.classification.InterfaceAudience;
    +import org.apache.hadoop.classification.InterfaceStability;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.mapred.InputFormat;
    +import org.apache.hadoop.mapred.JobConf;
    +import org.apache.hadoop.mapred.RecordReader;
    +import org.apache.hadoop.mapred.Reporter;
    +
    +/**
    + * A generic RecordReader that can hand out different recordReaders
    + * for each split in a {@link org.apache.spark.sql.hive.mapred.CombineSplit}.
    + */
    +@InterfaceAudience.Public
    +@InterfaceStability.Stable
    +public class CombineSplitRecordReader<K, V> implements RecordReader<K, V> {
    +  protected CombineSplit split;
    +  protected JobConf jc;
    +  protected FileSystem fs;
    +
    +  protected int idx;
    +  protected long progress;
    +  protected RecordReader<K, V> curReader;
    +
    +  @Override
    +  public boolean next(K key, V value) throws IOException {
    +    while ((curReader == null) || !curReader.next(key, value)) {
    +      if (!initNextRecordReader()) {
    +        return false;
    +      }
    +    }
    +    return true;
    +  }
    +
    +  public K createKey() {
    +    return curReader.createKey();
    +  }
    +
    +  public V createValue() {
    +    return curReader.createValue();
    +  }
    +
    +  /**
    +   * return the amount of data processed
    +   */
    +  public long getPos() throws IOException {
    +    return progress;
    +  }
    +
    +  public void close() throws IOException {
    +    if (curReader != null) {
    +      curReader.close();
    +      curReader = null;
    +    }
    +  }
    +
    +  /**
    +   * return progress based on the amount of data processed so far.
    +   */
    +  public float getProgress() throws IOException {
    +    return Math.min(1.0f,  progress/(float)(split.getLength()));
    --- End diff --
    
    I think the current approach gives a more fine-grained metric. For example, with combinedSplit(split1-10M, split2-10M, split3-80M), once split1 and split2 have been consumed, progress calculated by index would be 2/3, whereas calculated by bytes it is 20%. Maybe I should rename `split` to `combinedSplit` to make the name more readable.
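
The two progress calculations compared above can be sketched side by side. This is an illustration only: `CombineProgressSketch` and both method names are hypothetical, and the byte-based variant simply mirrors the `Math.min(1.0f, progress/(float)(split.getLength()))` expression quoted in the diff.

```java
class CombineProgressSketch {
  /** Progress as the fraction of wrapped splits that are fully consumed. */
  static float progressByIndex(int finishedSplits, int totalSplits) {
    return Math.min(1.0f, finishedSplits / (float) totalSplits);
  }

  /** Progress as the fraction of bytes consumed across all wrapped splits. */
  static float progressByBytes(long bytesConsumed, long totalLen) {
    return Math.min(1.0f, bytesConsumed / (float) totalLen);
  }
}
```

For the 10M/10M/80M example, `progressByIndex(2, 3)` is roughly 0.67 while `progressByBytes(20, 100)` is 0.2, which is the gap the comment describes.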




[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-153961805
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9097#discussion_r42201403
  
    --- Diff: sql/hive/src/main/java/org/apache/spark/sql/hive/mapred/CombineSplitInputFormat.java ---
    @@ -0,0 +1,109 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.hive.mapred;
    +
    +import java.io.IOException;
    +import java.util.*;
    +
    +import com.clearspring.analytics.util.Lists;
    +import com.google.common.collect.Maps;
    +import com.google.common.collect.Sets;
    +import org.apache.hadoop.mapred.*;
    +
    +public class CombineSplitInputFormat<K, V> implements InputFormat<K, V> {
    +
    +  private InputFormat<K, V> inputformat;
    +  private long splitSize = 0;
    +
    +  public CombineSplitInputFormat(InputFormat<K, V> inputformat, long splitSize) {
    +    this.inputformat = inputformat;
    +    this.splitSize = splitSize;
    +  }
    +
    +  /**
    +   * Create a single split from the list of blocks specified in validBlocks
    +   * Add this new split into splitList.
    +   */
    +  private void addCreatedSplit(List<CombineSplit> splitList,
    +                               long totalLen,
    +                               Collection<String> locations,
    +                               List<InputSplit> validSplits) {
    +    CombineSplit combineSparkSplit =
    +      new CombineSplit(validSplits.toArray(new InputSplit[0]),
    +        totalLen, locations.toArray(new String[0]));
    +    splitList.add(combineSparkSplit);
    +  }
    +
    +  @Override
    +  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    +    InputSplit[] splits = inputformat.getSplits(job, numSplits);
    +    // populate nodeToSplits and splitsSet
    +    Map<String, List<InputSplit>> nodeToSplits = Maps.newHashMap();
    +    Set<InputSplit> splitsSet = Sets.newHashSet();
    +    for (InputSplit split: splits) {
    +      for (String node: split.getLocations()) {
    +        if (!nodeToSplits.containsKey(node)) {
    +          nodeToSplits.put(node, new ArrayList<InputSplit>());
    +        }
    +        nodeToSplits.get(node).add(split);
    +      }
    +      splitsSet.add(split);
    +    }
    +    // Iterate the nodes to combine in order to evenly distributing the splits
    +    List<CombineSplit> combineSparkSplits = Lists.newArrayList();
    +    List<InputSplit> oneCombinedSplits = Lists.newArrayList();
    +    long currentSplitSize = 0L;
    +    for (Map.Entry<String, List<InputSplit>> entry: nodeToSplits.entrySet()) {
    +      String node = entry.getKey();
    +      List<InputSplit> splitsPerNode = entry.getValue();
    +      for (InputSplit split: splitsPerNode) {
    +        if (splitSize != 0 && currentSplitSize > splitSize) {
    +          addCreatedSplit(combineSparkSplits,
    +            currentSplitSize, Collections.singleton(node), oneCombinedSplits);
    +          currentSplitSize = 0;
    +          oneCombinedSplits.clear();
    +        }
    +        // this split has been combined
    +        if (!splitsSet.contains(split)) {
    --- End diff --
    
    Could this check be moved a little earlier?
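
One way to read this suggestion is to run the "already combined" test before the size check, so that replicas of a split seen on an earlier node are skipped up front. The sketch below assumes that reading. It uses hypothetical names, identifies splits by `String` keys with a separate length map, and ignores locations entirely, so it is not the PR's loop.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class DedupCombineSketch {
  /**
   * nodeToSplits maps a node to the splits stored on it (a split may appear
   * under several nodes); lengths maps each split to its size in bytes.
   */
  static List<List<String>> combine(Map<String, List<String>> nodeToSplits,
                                    Map<String, Long> lengths,
                                    long targetSize) {
    Set<String> remaining = new HashSet<>(lengths.keySet());
    List<List<String>> groups = new ArrayList<>();
    List<String> current = new ArrayList<>();
    long currentSize = 0L;
    for (List<String> splitsPerNode : nodeToSplits.values()) {
      for (String split : splitsPerNode) {
        // Checked first: a split already combined via another node is skipped.
        if (!remaining.remove(split)) {
          continue;
        }
        if (targetSize != 0 && currentSize > targetSize) {
          groups.add(current);
          current = new ArrayList<>();
          currentSize = 0L;
        }
        current.add(split);
        currentSize += lengths.get(split);
      }
    }
    if (!current.isEmpty()) {
      groups.add(current);
    }
    return groups;
  }
}
```

Removing the split from the set at the moment it is claimed keeps the "seen" bookkeeping in one place.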




[GitHub] spark pull request #9097: [SPARK-8813][SQL]Combine splits by size

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/9097




[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-154008845
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45104/
    Test PASSed.




[GitHub] spark pull request: [SPARK-8813][SQL]Combine splits by size

Posted by zhichao-li <gi...@git.apache.org>.
Github user zhichao-li commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-188129892
  
    retest this please. 
    
    The failure seems unrelated to this PR: `java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.JoinedRow cannot be cast to org.apache.spark.sql.catalyst.expressions.UnsafeRow`




[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by zhichao-li <gi...@git.apache.org>.
Github user zhichao-li commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-153934785
  
    @chenghao-intel  I just tested with data consisting of 150,000 (15w) small files across 1000 partitions.
    1) SQL (select count(*) from test): only a slight improvement. I guess task scheduling is not the bottleneck here, so reducing the number of tasks does not have much effect.
    2) SQL (select count(*) from test group by a): performance improves by 3 times. Reducing the number of tasks greatly improves shuffle performance.





[GitHub] spark pull request: [SPARK-8813][SQL]Combine splits by size

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-188063017
  
    **[Test build #51845 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51845/consoleFull)** for PR 9097 at commit [`085ce5f`](https://github.com/apache/spark/commit/085ce5feca2294f81f9ec7a5660635be13c70a4a).




[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-147690436
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43641/
    Test FAILed.




[GitHub] spark pull request: [SPARK-8813][SQL]Combine splits by size

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-188029962
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51836/
    Test FAILed.




[GitHub] spark pull request: [SPARK-8813][SQL]Combine splits by size

Posted by zhichao-li <gi...@git.apache.org>.
Github user zhichao-li commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-188033437
  
    retest this please.




[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by zhichao-li <gi...@git.apache.org>.
Github user zhichao-li commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-147898449
  
    retest this please




[GitHub] spark pull request: [SPARK-8813][SQL]Combine splits by size

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-188036194
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51839/
    Test FAILed.




[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-153937302
  
    **[Test build #45090 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45090/consoleFull)** for PR 9097 at commit [`5793af1`](https://github.com/apache/spark/commit/5793af16c662c648508fd1a41865fe86bfc9b4f9).




[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9097#discussion_r42200523
  
    --- Diff: sql/hive/src/main/java/org/apache/spark/sql/hive/mapred/CombineSplitRecordReader.java ---
    @@ -0,0 +1,128 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.hive.mapred;
    +
    +import java.io.IOException;
    +
    +import org.apache.hadoop.classification.InterfaceAudience;
    +import org.apache.hadoop.classification.InterfaceStability;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.mapred.InputFormat;
    +import org.apache.hadoop.mapred.JobConf;
    +import org.apache.hadoop.mapred.RecordReader;
    +import org.apache.hadoop.mapred.Reporter;
    +
    +/**
    + * A generic RecordReader that can hand out different recordReaders
    + * for each split in a {@link org.apache.spark.sql.hive.mapred.CombineSplit}.
    + */
    +@InterfaceAudience.Public
    +@InterfaceStability.Stable
    +public class CombineSplitRecordReader<K, V> implements RecordReader<K, V> {
    +  protected CombineSplit split;
    +  protected JobConf jc;
    +  protected FileSystem fs;
    +
    +  protected int idx;
    +  protected long progress;
    +  protected RecordReader<K, V> curReader;
    +
    +  @Override
    +  public boolean next(K key, V value) throws IOException {
    +    while ((curReader == null) || !curReader.next(key, value)) {
    +      if (!initNextRecordReader()) {
    +        return false;
    +      }
    +    }
    +    return true;
    +  }
    +
    +  public K createKey() {
    +    return curReader.createKey();
    +  }
    +
    +  public V createValue() {
    +    return curReader.createValue();
    +  }
    +
    +  /**
    +   * return the amount of data processed
    +   */
    +  public long getPos() throws IOException {
    +    return progress;
    +  }
    +
    +  public void close() throws IOException {
    +    if (curReader != null) {
    +      curReader.close();
    +      curReader = null;
    +    }
    +  }
    +
    +  /**
    +   * return progress based on the amount of data processed so far.
    +   */
    +  public float getProgress() throws IOException {
    +    return Math.min(1.0f,  progress/(float)(split.getLength()));
    --- End diff --
    
    Since the progress granularity is split length, can we simply use the idx / split.getSplitNum()?
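
To make the suggestion above concrete, here is a minimal sketch of the two ways progress could be reported. `CombineProgress` and both method names are hypothetical and not part of the patch; the byte-based variant only mirrors the expression shown in the diff, and the count-based variant assumes the reader knows how many underlying splits it has finished.

```java
// Hypothetical helper, not part of the patch: compares two ways of reporting progress.
public final class CombineProgress {
  private CombineProgress() {}

  /** Byte-based progress, mirroring the expression in the diff above. */
  public static float byBytes(long bytesProcessed, long totalBytes) {
    if (totalBytes <= 0) {
      return 1.0f; // assumption: treat an empty combined split as finished
    }
    return Math.min(1.0f, bytesProcessed / (float) totalBytes);
  }

  /** Count-based progress: fraction of the underlying splits fully consumed. */
  public static float bySplitCount(int splitsFinished, int splitCount) {
    if (splitCount <= 0) {
      return 1.0f; // assumption: nothing to read means nothing left to do
    }
    return Math.min(1.0f, splitsFinished / (float) splitCount);
  }
}
```

The two variants agree when the underlying splits have equal sizes and diverge when they do not: with splits of 10, 10 and 180 bytes, finishing the first two gives 0.1 by bytes but roughly 0.67 by count.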




[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-154008840
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-147898629
  
    Merged build started.




[GitHub] spark pull request: [SPARK-8813][SQL]Combine splits by size

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9097#discussion_r51503571
  
    --- Diff: sql/hive/src/main/java/org/apache/spark/sql/hive/mapred/CombineSplitRecordReader.java ---
    @@ -0,0 +1,128 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.hive.mapred;
    +
    +import java.io.IOException;
    +
    +import org.apache.hadoop.classification.InterfaceAudience;
    +import org.apache.hadoop.classification.InterfaceStability;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.mapred.InputFormat;
    +import org.apache.hadoop.mapred.JobConf;
    +import org.apache.hadoop.mapred.RecordReader;
    +import org.apache.hadoop.mapred.Reporter;
    +
    +/**
    + * A generic RecordReader that can hand out different recordReaders
    + * for each split in a {@link org.apache.spark.sql.hive.mapred.CombineSplit}.
    + */
    --- End diff --
    
    Please add a comment here to point out which version of Hive/Hadoop this implementation is based on.




[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9097#discussion_r42207551
  
    --- Diff: sql/hive/src/main/java/org/apache/spark/sql/hive/mapred/CombineSplitRecordReader.java ---
    @@ -0,0 +1,128 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.hive.mapred;
    +
    +import java.io.IOException;
    +
    +import org.apache.hadoop.classification.InterfaceAudience;
    +import org.apache.hadoop.classification.InterfaceStability;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.mapred.InputFormat;
    +import org.apache.hadoop.mapred.JobConf;
    +import org.apache.hadoop.mapred.RecordReader;
    +import org.apache.hadoop.mapred.Reporter;
    +
    +/**
    + * A generic RecordReader that can hand out different recordReaders
    + * for each split in a {@link org.apache.spark.sql.hive.mapred.CombineSplit}.
    + */
    +@InterfaceAudience.Public
    +@InterfaceStability.Stable
    +public class CombineSplitRecordReader<K, V> implements RecordReader<K, V> {
    +  protected CombineSplit split;
    +  protected JobConf jc;
    +  protected FileSystem fs;
    +
    +  protected int idx;
    +  protected long progress;
    +  protected RecordReader<K, V> curReader;
    +
    +  @Override
    +  public boolean next(K key, V value) throws IOException {
    +    while ((curReader == null) || !curReader.next(key, value)) {
    +      if (!initNextRecordReader()) {
    +        return false;
    +      }
    +    }
    +    return true;
    +  }
    +
    +  public K createKey() {
    +    return curReader.createKey();
    +  }
    +
    +  public V createValue() {
    +    return curReader.createValue();
    +  }
    +
    +  /**
    +   * return the amount of data processed
    +   */
    +  public long getPos() throws IOException {
    +    return progress;
    +  }
    +
    +  public void close() throws IOException {
    +    if (curReader != null) {
    +      curReader.close();
    +      curReader = null;
    +    }
    +  }
    +
    +  /**
    +   * return progress based on the amount of data processed so far.
    +   */
    +  public float getProgress() throws IOException {
    +    return Math.min(1.0f,  progress/(float)(split.getLength()));
    --- End diff --
    
    OK, I see, maybe we can rename "progress" to something else as well.




[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-153936734
  
    Build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-147690302
  
    [Test build #43641 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43641/console) for PR 9097 at commit [`f9392c3`](https://github.com/apache/spark/commit/f9392c3438c1a9172692d595b08fbb8b8ad0133d).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `public class CombineSplit implements InputSplit `
      * `public class CombineSplitInputFormat<K, V> implements InputFormat<K, V> `
      * `public class CombineSplitRecordReader<K, V> implements RecordReader<K, V> `
      * `class HadoopCombineRDD[K, V](`





[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-153974277
  
    **[Test build #45104 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45104/consoleFull)** for PR 9097 at commit [`5793af1`](https://github.com/apache/spark/commit/5793af16c662c648508fd1a41865fe86bfc9b4f9).




[GitHub] spark pull request: [SPARK-8813][SQL]Combine splits by size

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-188036182
  
    **[Test build #51839 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51839/consoleFull)** for PR 9097 at commit [`701700b`](https://github.com/apache/spark/commit/701700b10078d48d8e9f0342e623e94243642de3).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by zhichao-li <gi...@git.apache.org>.
Github user zhichao-li commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-154919462
  
    `CombineHiveInputFormat` and `CombineFileInputFormat` are restricted in that they always assume the input format being combined is a subclass of `FileInputFormat`. That restriction does not apply if we combine at the `InputSplit` level.
    
    ``` java
    public class CombineSplit implements InputSplit {
      private InputSplit[] splits;
      private long totalLen;
      private String[] locations;
    ```
    
    VS
    
    ``` java
    public class CombineFileSplit extends InputSplit implements Writable {

      private Path[] paths;
      private long[] startoffset;
      private long[] lengths;
      private String[] locations;
      private long totLength;
    ```
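
The contrast above can be sketched as a wrapper that needs nothing from its children beyond a length and a list of hosts, which is why it is not tied to file paths or offsets. Everything in this block is an assumption made for illustration: `CombinedSplitSketch` and `SimpleSplit` are hypothetical names (the interface stands in for the two `InputSplit` methods the sketch relies on), and taking the union of the child locations is one possible policy, not necessarily what the patch does.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch, not the patch's CombineSplit: wraps arbitrary child splits.
public final class CombinedSplitSketch {
  /** Hypothetical stand-in for the two InputSplit methods this sketch needs. */
  public interface SimpleSplit {
    long getLength();
    String[] getLocations();
  }

  private final SimpleSplit[] splits;
  private final long totalLen;
  private final String[] locations;

  public CombinedSplitSketch(SimpleSplit[] splits) {
    this.splits = splits.clone();
    long sum = 0L;
    // LinkedHashSet keeps first-seen order and drops duplicate hosts.
    Set<String> hosts = new LinkedHashSet<>();
    for (SimpleSplit s : splits) {
      sum += s.getLength();
      hosts.addAll(Arrays.asList(s.getLocations()));
    }
    this.totalLen = sum;
    this.locations = hosts.toArray(new String[0]);
  }

  /** Total length of all child splits. */
  public long getLength() { return totalLen; }

  /** Union of the child locations (an assumed policy). */
  public String[] getLocations() { return locations.clone(); }

  /** Number of child splits wrapped by this combined split. */
  public int getSplitNum() { return splits.length; }
}
```

Because the wrapper never inspects paths or offsets, any child that can report a length and preferred hosts can be combined this way.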




[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by zhichao-li <gi...@git.apache.org>.
Github user zhichao-li commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-147898425
  
    cc @chenghao-intel 




[GitHub] spark pull request: [SPARK-8813][SQL]Combine splits by size

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-188109643
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-153973993
  
    Merged build started.




[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-153973976
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-153933871
  
    Build started.




[GitHub] spark pull request: [WIP]Combine splits by size

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-147658822
  
    Merged build started.




[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by zhichao-li <gi...@git.apache.org>.
Github user zhichao-li commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-154917321
  
    @watermen, yes. It should support all formats in theory, since it combines at the `InputSplit` level, i.e. on the result of `inputformat.getSplits`. In other words, the combining is transparent to the input format. I've tested it with Sequence, ORC and LZO. However, this patch may sometimes not be suitable for Parquet, since Parquet reads do not always go through `TableReader`.
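
As a rough illustration of combining at the split level, the sketch below groups split lengths greedily until each group reaches a target size, which plays the role of `spark.sql.mapper.splitCombineSize`. `SplitCombiner` and `combineBySize` are hypothetical names, and the sketch deliberately ignores data locality, which the actual patch may well take into account.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: groups split lengths greedily until a target size is reached.
public final class SplitCombiner {
  private SplitCombiner() {}

  /**
   * Walks the splits in order, closing a group as soon as its accumulated
   * size reaches targetSize. Any leftover splits form a final, smaller group.
   */
  public static List<List<Long>> combineBySize(List<Long> splitLengths, long targetSize) {
    List<List<Long>> groups = new ArrayList<>();
    List<Long> current = new ArrayList<>();
    long currentSize = 0L;
    for (long len : splitLengths) {
      current.add(len);
      currentSize += len;
      if (currentSize >= targetSize) {
        groups.add(current);
        current = new ArrayList<>();
        currentSize = 0L;
      }
    }
    if (!current.isEmpty()) {
      groups.add(current);
    }
    return groups;
  }
}
```

For example, lengths 10, 20, 40, 5, 100 and 3 with a target of 64 yield three groups, so six tasks become three. Only the lengths returned by the underlying splits are used, so the grouping does not depend on which input format produced them.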




[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-154008603
  
    **[Test build #45104 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45104/consoleFull)** for PR 9097 at commit [`5793af1`](https://github.com/apache/spark/commit/5793af16c662c648508fd1a41865fe86bfc9b4f9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `public class CombineSplit implements InputSplit `
      * `public class CombineSplitInputFormat<K, V> implements InputFormat<K, V> `
      * `public class CombineSplitRecordReader<K, V> implements RecordReader<K, V> `
      * `class HadoopCombineRDD[K, V](`




[GitHub] spark pull request: [SPARK-8813][SQL]Combine splits by size

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9097#discussion_r51503595
  
    --- Diff: sql/hive/src/main/java/org/apache/spark/sql/hive/mapred/CombineSplitInputFormat.java ---
    @@ -0,0 +1,110 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.hive.mapred;
    +
    +import java.io.IOException;
    +import java.util.*;
    +
    +import com.clearspring.analytics.util.Lists;
    +import com.google.common.collect.Maps;
    +import com.google.common.collect.Sets;
    +import org.apache.hadoop.mapred.*;
    +
    +public class CombineSplitInputFormat<K, V> implements InputFormat<K, V> {
    --- End diff --
    
    Please add a comment here to point out which version of Hive/Hadoop this implementation is based on.


[GitHub] spark pull request: [SPARK-8813][SQL][WIP]Combine splits by size

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-153934469
  
    Merged build triggered.


[GitHub] spark pull request: [SPARK-8813][SQL]Combine splits by size

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-188029959
  
    Merged build finished. Test FAILed.


[GitHub] spark pull request: [SPARK-8813][SQL]Combine splits by size

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-188109500
  
    **[Test build #51845 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51845/consoleFull)** for PR 9097 at commit [`085ce5f`](https://github.com/apache/spark/commit/085ce5feca2294f81f9ec7a5660635be13c70a4a).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-8813][SQL]Combine splits by size

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9097#issuecomment-188637662
  
    **[Test build #51929 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51929/consoleFull)** for PR 9097 at commit [`085ce5f`](https://github.com/apache/spark/commit/085ce5feca2294f81f9ec7a5660635be13c70a4a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.

