Posted to reviews@spark.apache.org by yinxusen <gi...@git.apache.org> on 2014/03/27 07:01:02 UTC

[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

GitHub user yinxusen opened a pull request:

    https://github.com/apache/spark/pull/252

    [SPARK-1133] Add whole text files reader in MLlib

    Here is a pointer to the former [PR164](https://github.com/apache/spark/pull/164).
    
    This pull request addresses the JIRA issue [SPARK-1133](https://spark-project.atlassian.net/browse/SPARK-1133), which adds a new whole-text-files reader API to MLlib.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yinxusen/spark whole-files-input

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/252.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #252
    
----
commit 28cb0fe78f38bb2aa794166fe5ae4f82b925b52d
Author: Xusen Yin <yi...@gmail.com>
Date:   2014-03-27T05:58:41Z

    add whole text files reader

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39623475
  
    Thanks @mateiz and @mengxr !
    
    I'll take care of the new issue. 



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-38774014
  
     Merged build triggered. One or more automated tests failed



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39286849
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13662/



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-38779159
  
    One or more automated tests failed
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13505/



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39602678
  
    BTW if you have time, this would also be a good addition to the API: https://issues.apache.org/jira/browse/SPARK-1415



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39045517
  
    @mateiz I don't think such an input format exists in the standard Hadoop libraries. This one is an example from White's Hadoop book. I have seen it used for loading html/xml/json files. If we put it in `SparkContext`, I would still use the name `wholeTextFile` instead of `textFiles`.
    
    For your second question, we override `isSplitable` to return false in `WholeTextFileInputFormat` to prevent an existing file from being split. The doc says:
    
    > Is the given filename splitable? Usually, true, but if the file is stream compressed, it will not be. FileInputFormat implementations can override this and return false to ensure that individual input files are never split-up so that Mappers process entire files.
    
    But I agree it is worth testing to find out the truth. @yinxusen Can you create a test for it?



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-38772250
  
    One or more automated tests failed
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13500/



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39288995
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39290994
  
    @mateiz `SparkContext#textFile` can read multiple files as well. Do you think `wholeTextFile` is more consistent? If we move the method to `SparkContext`, there will be no collision with other PRs. So we can merge this one first.



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39047305
  
    Hi @mateiz , here is my explanation:
    
    * Hadoop has no such input format, but Mahout does: `org.apache.mahout.text.SequenceFilesFromDirectory`. However, it is hard to use, and I do not think it is suitable to call from Spark directly because it would pull in some heavy packages.
    
    * On HDFS, it is not good practice to keep so many small files: they occupy many NameNode entries and leave DataNode blocks mostly empty. So I do not know whether this is commonly used in other programs. But it is useful in machine learning algorithms such as Latent Dirichlet Allocation. Indeed, this is a preparatory pull request for my LDA implementation.
    
    * A single 100MB file will usually be held in two blocks with 2 replicas each. The `CombineFileInputFormat` class in mapred, i.e. the `hadoopFile` API in `SparkContext`, cannot handle the split problem, because it allocates blocks to splits without single-file semantics. But the `CombineFileInputFormat` class in mapreduce does: if we make `isSplitable()` return false, it puts a single file into the same split, whether or not the file exceeds the block size.
    
    * I tested the split problem in the former [test suite](https://github.com/yinxusen/spark/commit/78c0f259a848aadc168edd76f9992ed4404bc510#diff-3f8bae96199c64e746098bd7a6d143e1R72) with `fs.create(new Path(inputDir, fileName), true, 4096, 2, 512, null)`. There I set the block size to 512 and used three different file sizes.
    
    @mengxr Sorry, I forgot to test the split problem when using local disk as the input source. I will add it ASAP. I think there is also a way to adjust the block size when reading from local disk; otherwise I have to write a file larger than 32MB (the default local disk block size in Hadoop).
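
A minimal sketch of the "never split a single file" packing described in the bullets above. This is a simplified model in plain Scala, not Hadoop's `CombineFileInputFormat`: the names `FileEntry`, `pack`, and `maxSplitSize` are hypothetical, and the real class also takes node and rack locality into account.

```scala
object CombinePackingSketch {
  // Hypothetical stand-in for a file and its length in bytes.
  final case class FileEntry(path: String, length: Long)

  // Greedily pack whole files into splits of at most maxSplitSize bytes.
  // A file is never divided: one that is larger than maxSplitSize simply
  // gets a split of its own.
  def pack(files: Seq[FileEntry], maxSplitSize: Long): Vector[Vector[FileEntry]] = {
    require(maxSplitSize > 0, "maxSplitSize must be positive")
    val start = (Vector.empty[Vector[FileEntry]], Vector.empty[FileEntry], 0L)
    val (done, current, _) = files.foldLeft(start) {
      case ((closed, open, size), f) =>
        if (open.nonEmpty && size + f.length > maxSplitSize)
          (closed :+ open, Vector(f), f.length) // close the open split, start a new one
        else
          (closed, open :+ f, size + f.length)  // file still fits in the open split
    }
    if (current.nonEmpty) done :+ current else done
  }
}
```

With a 512-byte limit, files of 10, 100, and 1000 bytes pack into two splits: the two small files share one, and the 1000-byte file stays whole in its own.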



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-38774028
  
    Merged build started. One or more automated tests failed



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39286647
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-38775661
  
    Merged build started. One or more automated tests failed



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39122031
  
    @mengxr I added `sc.hadoopConfiguration.setLong("fs.local.block.size", 32)` to the test code, which limits the block size to 32 bytes, so `fileLengths = Array(10, 100, 1000)` should cover the split scenario, I think.
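
A quick arithmetic check of why those lengths exercise the split scenario, assuming a block simply holds up to 32 bytes. The helper `blocksSpanned` is hypothetical and is not part of the test suite or of Hadoop.

```scala
object BlockCoverage {
  // Number of fixed-size blocks a file of fileLen bytes spans (ceiling division).
  def blocksSpanned(fileLen: Long, blockSize: Long): Long = {
    require(fileLen >= 0 && blockSize > 0, "lengths must be non-negative, block size positive")
    if (fileLen == 0) 0L else (fileLen + blockSize - 1) / blockSize
  }

  val blockSize: Long = 32L
  val fileLengths: Array[Long] = Array(10L, 100L, 1000L)

  // Pairs of (file length, blocks spanned) for the three test files.
  val spans: Array[(Long, Long)] = fileLengths.map(len => len -> blocksSpanned(len, blockSize))
}
```

Under this assumption the 10-byte file fits in one block, while the 100-byte and 1000-byte files span 4 and 32 blocks, so two of the three files are larger than a block.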



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-38871767
  
    I think it is OK, @rxin shall we merge it? :)
    On 2014-3-27 at 4:40 PM, "UCB AMPLab" <no...@github.com> wrote:
    
    > All automated tests passed.
    > Refer to this link for build results:
    > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13506/
    >
    > --
    > Reply to this email directly or view it on GitHub<https://github.com/apache/spark/pull/252#issuecomment-38779161>
    > .
    >



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-38773117
  
    It seems that the test process was suddenly aborted. Can we retest it?



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39286848
  
    Merged build finished. 



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-38774524
  
    Merged build started. One or more automated tests failed



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-38775575
  
    Jenkins, retest this please.



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39395852
  
    Alright, I'm okay keeping it as `wholeTextFiles`. I noticed a few problems with the Java API though, please fix those too.



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39286843
  
    Merged build started. 



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39396075
  
    Yep. Vote for `wholeTextFiles` too. Let me fix these now.



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39286892
  
    Sorry for the mistake just now; I almost deleted the wrong file.



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/252#discussion_r11193752
  
    --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
    @@ -372,6 +373,37 @@ class SparkContext(
       }
     
       /**
    +   * Read a directory of text files from HDFS, a local file system (available on all nodes), or any
    +   * Hadoop-supported file system URI. Each file is read as a single record and returned in a
    +   * key-value pair, where the key is the path of each file, the value is the content of each file.
    --- End diff --
    
    Maybe a warning is better: it can handle both big and small files, but big files will cause bad performance.



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/252



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39293384
  
    `sc.textFileRecords(...)`, `sc.textFileElements()`? Guess that's not great either :P



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-38779156
  
    Merged build finished. One or more automated tests failed



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39158673
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/252#discussion_r11193643
  
    --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
    @@ -372,6 +373,37 @@ class SparkContext(
       }
     
       /**
    +   * Read a directory of text files from HDFS, a local file system (available on all nodes), or any
    +   * Hadoop-supported file system URI. Each file is read as a single record and returned in a
    +   * key-value pair, where the key is the path of each file, the value is the content of each file.
    +   *
    +   * <p> For example, if you have the following files:
    +   * {{{
    +   *   hdfs://a-hdfs-path/part-00000
    +   *   hdfs://a-hdfs-path/part-00001
    +   *   ...
    +   *   hdfs://a-hdfs-path/part-nnnnn
    +   * }}}
    +   *
    +   * Do `val rdd = mlContext.wholeTextFile("hdfs://a-hdfs-path")`,
    +   *
    +   * <p> then `rdd` contains
    +   * {{{
    +   *   (a-hdfs-path/part-00000, its content)
    +   *   (a-hdfs-path/part-00001, its content)
    +   *   ...
    +   *   (a-hdfs-path/part-nnnnn, its content)
    +   * }}}
    +   */
    +  def wholeTextFiles(path: String): RDD[(String, String)] = {
    +    newAPIHadoopFile(
    --- End diff --
    
    Does this need to be wrapped? It looks like you could pull the lines up and still be < 100 characters.
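
For readers who want to see the record shape the Scaladoc in this diff describes (one `(path, content)` pair per file), here is a local sketch in plain Scala. It deliberately does not use Spark or Hadoop; `WholeFilesLocalSketch` and `readWholeFiles` are hypothetical names, and UTF-8 decoding is an assumption.

```scala
import java.io.File
import java.nio.charset.StandardCharsets
import java.nio.file.Files

object WholeFilesLocalSketch {
  // Returns one (path, content) pair per regular file directly under dir,
  // sorted by file name. Each file is read whole, as a single record.
  def readWholeFiles(dir: File): Seq[(String, String)] = {
    // listFiles() returns null when dir is not a readable directory.
    val entries = Option(dir.listFiles()).getOrElse(Array.empty[File])
    entries.toSeq
      .filter(_.isFile)
      .sortBy(_.getName)
      .map { f =>
        val content = new String(Files.readAllBytes(f.toPath), StandardCharsets.UTF_8)
        (f.getPath, content)
      }
  }
}
```

Because each value holds an entire file in memory, this shape suits many small files rather than a few very large ones.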



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-38772185
  
    Merged build started.



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39411308
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13714/



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39130731
  
    Merged build finished. Build is starting -or- tests failed to complete.



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39153392
  
    Merged build started. Build is starting -or- tests failed to complete.



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-38773952
  
    Jenkins, retest this please.



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-38776956
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39208378
  
    Hi @mateiz @mengxr , what do you think about the test? Besides, we could also judge it from the hadoop-common code of [`CombineFileInputFormat`](https://github.com/apache/hadoop-common/blob/release-1.0.4/src/mapred/org/apache/hadoop/mapreduce/lib/input/CombineFileInputFormat.java#L496).



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-38881827
  
    Can one of the admins verify this patch?



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-38772249
  
    Merged build finished.



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39291772
  
    BTW if you are okay with this, Xiangrui, I can merge it -- the code looks good to me. But let me know if you have other ideas on names, I don't 100% like this name either.



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39041796
  
    Also, what happens here if you have a single file that spans multiple HDFS blocks? For instance, say your block size is 64 MB but your file is 100 MB. I thought CombineFileInputFormat would still break that file down into 2 splits, which is not what we want. Have you tested this?
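
The arithmetic behind this concern can be sketched in plain Scala. This is not Hadoop code: `splitsFor` is a hypothetical helper introduced only for illustration, and it assumes an input format that cuts a splittable file at block boundaries. The `splittable = false` case corresponds to the `isSplitable` override in the diff under review.

```scala
// Hypothetical model of block-aligned splitting; NOT the Hadoop implementation.
// Sizes are in MB to match the 64 MB block / 100 MB file example.
def splitsFor(fileSizeMb: Long, blockSizeMb: Long, splittable: Boolean): Long = {
  require(fileSizeMb > 0 && blockSizeMb > 0, "sizes must be positive")
  if (!splittable) 1L                               // the whole file stays in one split
  else (fileSizeMb + blockSizeMb - 1) / blockSizeMb // ceiling division over blocks
}

// A 100 MB file on 64 MB blocks: two splits if splittable, one if not.
val splittableCount = splitsFor(100, 64, splittable = true)
val wholeFileCount = splitsFor(100, 64, splittable = false)
```

Under this model the file only stays in a single record-producing split when splitting is disabled, which is the behavior the question asks to have tested.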



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39153384
  
     Merged build triggered. Build is starting -or- tests failed to complete.



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39130734
  
    Build is starting -or- tests failed to complete.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13599/



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39379540
  
    I also suggested `textFiles` but it's only one character away from `textFile` so it might be confusing. I agree that maybe `wholeTextFiles` is the best. Could also consider `separateTextFiles`.



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39158674
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13611/



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-38774514
  
     Merged build triggered. One or more automated tests failed



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39286838
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39408459
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/252#discussion_r11232991
  
    --- Diff: core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala ---
    @@ -154,6 +154,31 @@ class JavaSparkContext(val sc: SparkContext) extends JavaSparkContextVarargsWork
        */
       def textFile(path: String, minSplits: Int): JavaRDD[String] = sc.textFile(path, minSplits)
     
    +  /**
    +   * Read a directory of text files from HDFS, a local file system (available on all nodes), or any
    +   * Hadoop-supported file system URI. Each file is read as a single record and returned in a
    +   * key-value pair, where the key is the path of each file, the value is the content of each file.
    +   *
    +   * <p> For example, if you have the following files:
    +   * {{{
    +   *   hdfs://a-hdfs-path/part-00000
    +   *   hdfs://a-hdfs-path/part-00001
    +   *   ...
    +   *   hdfs://a-hdfs-path/part-nnnnn
    +   * }}}
    +   *
    +   * Do `val rdd = mlContext.wholeTextFile("hdfs://a-hdfs-path")`,
    --- End diff --
    
    This should say `JavaPairRDD<String, String> rdd = context.wholeTextFiles("hdfs://...")`



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/252#discussion_r11193654
  
    --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
    @@ -372,6 +373,37 @@ class SparkContext(
       }
     
       /**
    +   * Read a directory of text files from HDFS, a local file system (available on all nodes), or any
    +   * Hadoop-supported file system URI. Each file is read as a single record and returned in a
    +   * key-value pair, where the key is the path of each file, the value is the content of each file.
    --- End diff --
    
    Would it make sense to add a warning that says to only use this for small files? (maybe it's obvious? :)
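
A local-filesystem sketch of the same read pattern shows why file size matters: every file's full content becomes a single in-memory string. The helper `readWhole` is hypothetical and uses only `java.nio`, not Spark or Hadoop; it assumes Scala 2.13 or later (for `scala.jdk.CollectionConverters`) and UTF-8 text.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

// Hypothetical stand-in for whole-file reading on a local directory.
// Each regular file directly under `dir` becomes one (path, content) pair,
// so the largest file must fit in memory as a single String.
def readWhole(dir: Path): List[(String, String)] = {
  val stream = Files.list(dir)
  try {
    stream.iterator().asScala
      .filter(p => Files.isRegularFile(p))
      .toList                      // materialize before the stream is closed
      .sortBy(_.toString)          // deterministic order for the example
      .map(p => p.toString -> new String(Files.readAllBytes(p), StandardCharsets.UTF_8))
  } finally stream.close()
}
```

Because nothing here streams or chunks the content, a warning in the API doc about preferring small files seems reasonable for any reader built this way.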



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-38779161
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13506/



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39294300
  
    How about `textFiles()`? @liancheng recommended it just now.
    On 2014-4-2 at 2:40 PM, "Patrick Wendell" <no...@github.com> wrote:
    
    > sc.textFileRecords(...), sc.textFileElements()? Guess that's not great
    > either :P
    >
    > --
    > Reply to this email directly or view it on GitHub<https://github.com/apache/spark/pull/252#issuecomment-39293384>
    > .
    >



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39291407
  
    The difference is that `textFile` can also operate on a single text file, while `wholeTextFiles` really only makes sense on a set of files. That's why I want it to be clear that it returns multiple files.
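
The distinction between the two record shapes can be illustrated with plain Scala collections. This is only a sketch with no Spark involved: `files`, `lineRecords`, and `fileRecords` are hypothetical names, and the in-memory `files` sequence stands in for a directory of text files.

```scala
// Hypothetical in-memory "directory": path -> file content.
val files: Seq[(String, String)] = Seq(
  "a-hdfs-path/part-00000" -> "alpha\nbeta",
  "a-hdfs-path/part-00001" -> "gamma"
)

// textFile-style view: one record per line; file boundaries are lost.
def lineRecords(dir: Seq[(String, String)]): Seq[String] =
  dir.flatMap { case (_, content) => content.split("\n").toSeq }

// wholeTextFiles-style view: one (path, content) record per file.
def fileRecords(dir: Seq[(String, String)]): Seq[(String, String)] = dir

val lines = lineRecords(files)   // three line records across both files
val wholes = fileRecords(files)  // two records, one per file
```

With a single input file the second view yields just one record, which is why a plural name signals the intended use on a set of files.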



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39286651
  
    Merged build started. 



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/252#discussion_r11233008
  
    --- Diff: core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala ---
    @@ -154,6 +154,31 @@ class JavaSparkContext(val sc: SparkContext) extends JavaSparkContextVarargsWork
        */
       def textFile(path: String, minSplits: Int): JavaRDD[String] = sc.textFile(path, minSplits)
     
    +  /**
    +   * Read a directory of text files from HDFS, a local file system (available on all nodes), or any
    +   * Hadoop-supported file system URI. Each file is read as a single record and returned in a
    +   * key-value pair, where the key is the path of each file, the value is the content of each file.
    +   *
    +   * <p> For example, if you have the following files:
    +   * {{{
    +   *   hdfs://a-hdfs-path/part-00000
    +   *   hdfs://a-hdfs-path/part-00001
    +   *   ...
    +   *   hdfs://a-hdfs-path/part-nnnnn
    +   * }}}
    +   *
    +   * Do `val rdd = mlContext.wholeTextFile("hdfs://a-hdfs-path")`,
    +   *
    +   * <p> then `rdd` contains
    +   * {{{
    +   *   (a-hdfs-path/part-00000, its content)
    +   *   (a-hdfs-path/part-00001, its content)
    +   *   ...
    +   *   (a-hdfs-path/part-nnnnn, its content)
    +   * }}}
    +   */
    +  def wholeTextFiles(path: String): JavaRDD[(String, String)] = sc.wholeTextFiles(path)
    --- End diff --
    
    This should return a JavaPairRDD, see how those are created in the rest of the API.



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39393472
  
    `textFiles` is definitely confusing because `textFile` can also read multiple files. I vote for `wholeTextFiles`.



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/252#discussion_r11012011
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/input/WholeTextFileInputFormat.scala ---
    @@ -0,0 +1,47 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.input
    +
    +import org.apache.hadoop.fs.Path
    +import org.apache.hadoop.mapreduce.InputSplit
    +import org.apache.hadoop.mapreduce.JobContext
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat
    +import org.apache.hadoop.mapreduce.RecordReader
    +import org.apache.hadoop.mapreduce.TaskAttemptContext
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit
    +
    +/**
    + * A [[org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat CombineFileInputFormat]] for
    + * reading whole text files. Each file is read as key-value pair, where the key is the file path and
    + * the value is the entire content of file.
    + */
    +
    +private[mllib] class WholeTextFileInputFormat extends CombineFileInputFormat[String, String] {
    +  override protected def isSplitable(context: JobContext, file: Path): Boolean = false
    +
    +  override def createRecordReader(
    +      split: InputSplit,
    +      context: TaskAttemptContext): RecordReader[String, String] = {
    +
    +    new CombineFileRecordReader[String, String](
    +    split.asInstanceOf[CombineFileSplit],
    --- End diff --
    
    extra 2 spaces



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/252#discussion_r11232955
  
    --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
    @@ -372,6 +373,37 @@ class SparkContext(
       }
     
       /**
    +   * Read a directory of text files from HDFS, a local file system (available on all nodes), or any
    +   * Hadoop-supported file system URI. Each file is read as a single record and returned in a
    +   * key-value pair, where the key is the path of each file, the value is the content of each file.
    --- End diff --
    
    Yeah, please add a warning at the end of this, and make sure to also put it in the Java API doc.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39034504
  
    @rxin, it looks good to me. I will address @mateiz's comment about `MLContext` in https://github.com/apache/spark/pull/245 . Shall we merge this first? Thanks, @yinxusen, for your work!



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39288996
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13663/



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-38772125
  
     Merged build triggered.



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/252#discussion_r11232967
  
    --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
    @@ -372,6 +373,37 @@ class SparkContext(
       }
     
       /**
    +   * Read a directory of text files from HDFS, a local file system (available on all nodes), or any
    +   * Hadoop-supported file system URI. Each file is read as a single record and returned in a
    +   * key-value pair, where the key is the path of each file, the value is the content of each file.
    +   *
    +   * <p> For example, if you have the following files:
    +   * {{{
    +   *   hdfs://a-hdfs-path/part-00000
    +   *   hdfs://a-hdfs-path/part-00001
    +   *   ...
    +   *   hdfs://a-hdfs-path/part-nnnnn
    +   * }}}
    +   *
    +   * Do `val rdd = mlContext.wholeTextFile("hdfs://a-hdfs-path")`,
    --- End diff --
    
    Also this should say sparkContext



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-38776959
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13503/



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39411306
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-38779157
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39121747
  
     Merged build triggered. Build is starting -or- tests failed to complete.



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39300760
  
    `swallowFiles`? It seems to me that `wholeTextFiles` is still the best candidate, though it also reminds me of Whole Foods and makes me feel hungry ... This is why I recommended `swallowFiles`.



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39041737
  
    Wait, before we merge this, is there no Hadoop InputFormat that already provides this? Also, if we want it to be a general method, why not add `textFiles` to SparkContext? Non-ML programs will also use this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39281837
  
    Actually looking at this again I do think you should move the `wholeTextFiles` method to SparkContext, including JavaSparkContext. The reason is that this type of input is quite common beyond ML. If you do that, you won't clash with any PRs affecting MLContext either. Just add this InputFormat into the core module, and move the methods and tests to SparkContext.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39293084
  
    @mateiz I agree that neither `wholeTextFile` nor `wholeTextFiles` is a good name. Shall we wait for a day or two and see whether there are better suggestions? This PR is not blocking any others.



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-38775654
  
     Merged build triggered. One or more automated tests failed



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-38775740
  
    It seems to be running and testing in the background: https://travis-ci.org/apache/spark/builds/21653655



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39281777
  
    Cool, thanks for testing it, both manually and with that Hadoop property. This looks fine except that maybe we should call it `wholeTextFiles` since it's expected to return multiple files. Also we were discussing in another PR (https://github.com/apache/spark/pull/245) that MLContext is better named MLUtils or InputUtils or something like that. I'll let Xiangrui decide on what order to merge and rename these in.



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39121758
  
    Merged build started. Build is starting -or- tests failed to complete.



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39408467
  
    Merged build started. 



[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/252#issuecomment-39594639
  
    Thanks Xusen, I've merged this in.

