Posted to reviews@spark.apache.org by yinxusen <gi...@git.apache.org> on 2014/03/18 05:33:08 UTC

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

GitHub user yinxusen opened a pull request:

    https://github.com/apache/spark/pull/164

    [SPARK-1133] add small files input in MLlib

    This pull request addresses the JIRA issue [SPARK-1133](https://spark-project.atlassian.net/browse/SPARK-1133), which adds a new file-reading API to MLlib.
    
    The interface exposed to end users is `smallTextFiles()` in MLUtils.scala, which is similar to `textFile()` in SparkContext.scala. It reads a directory of text files from HDFS or local disk and creates an `RDD[(String, String)]`, where the first string is the file name and the second is the file content.
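    
    For example, a minimal usage sketch (the path and driver setup here are illustrative, not from the PR):
    
    ```
    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.util.MLUtils
    
    val sc = new SparkContext("local", "smallTextFilesExample")
    // RDD[(String, String)] of (file name, file content) pairs.
    val files = MLUtils.smallTextFiles(sc, "hdfs://namenode:9000/user/docs")
    // E.g., list each file together with the length of its content.
    files.map { case (name, content) => (name, content.length) }
      .collect()
      .foreach(println)
    ```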
    
    This interface is useful whenever an application needs to read a large number of small files. Latent Dirichlet Allocation (LDA), for example, depends heavily on this kind of input.
    
    The typical scenario is reading a bunch of files from local disk, processing them, and finally saving them as a `sequenceFile`. The API can also read a directory of files in HDFS, though storing many small files in HDFS is not a good practice.
    
    The similar implementation in Mahout is named `sequenceFileFromDirectory`, but it is not as easy to use. Here `newAPIHadoopFile()` is used instead of `hadoopFile()` because `CombineFileInputFormat` in the new Hadoop API allows setting `isSplitable()` to false; otherwise, a shuffle would be needed to merge file contents after reading blocks from HDFS.
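    
    To illustrate that point, a minimal sketch of a non-splittable combine input format (a simplified rendering that reuses the PR's `BatchFilesRecordReader`; the class name `SmallFilesInputFormat` is made up):
    
    ```
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapreduce.{InputSplit, JobContext, RecordReader, TaskAttemptContext}
    import org.apache.hadoop.mapreduce.lib.input.{CombineFileInputFormat, CombineFileRecordReader, CombineFileSplit}
    import org.apache.spark.mllib.input.BatchFilesRecordReader
    
    class SmallFilesInputFormat extends CombineFileInputFormat[String, Text] {
      // Keeping files unsplittable guarantees each file lands in one split,
      // so no shuffle is needed afterwards to reassemble file contents.
      override protected def isSplitable(context: JobContext, file: Path): Boolean = false
    
      override def createRecordReader(split: InputSplit, context: TaskAttemptContext)
        : RecordReader[String, Text] =
        new CombineFileRecordReader[String, Text](
          split.asInstanceOf[CombineFileSplit], context, classOf[BatchFilesRecordReader])
    }
    ```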
    
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yinxusen/spark small-files-input

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/164.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #164
    
----
commit fd93e5915bf3679c1795881f2079c2db506fe22c
Author: Xusen Yin <yi...@gmail.com>
Date:   2014-03-18T00:19:42Z

    add small text files input API

commit 9bf87d443841c9db4bb9a1c595eecc04fe27d049
Author: Xusen Yin <yi...@gmail.com>
Date:   2014-03-18T00:21:37Z

    Merge branch 'master' into small-files-input

----



[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38756255
  
     Build triggered.



[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38150471
  
    Merged build finished.



[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38756257
  
    Build started.



[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10785884
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/BatchFilesRecordReader.java ---
    @@ -0,0 +1,109 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.input;
    +
    +import java.io.IOException;
    +
    +import com.google.common.io.Closeables;
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.fs.FSDataInputStream;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.io.IOUtils;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +/**
    + * Reads an entire file out in bytes format in <filename, content> format.
    --- End diff --
    
    You said `bytes format` but it returns `Text`.



[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10785949
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/BatchFilesRecordReader.java ---
    @@ -0,0 +1,109 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.input;
    +
    +import java.io.IOException;
    +
    +import com.google.common.io.Closeables;
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.fs.FSDataInputStream;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.io.IOUtils;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +/**
    + * Reads an entire file out in bytes format in <filename, content> format.
    + */
    +
    +public class BatchFilesRecordReader extends RecordReader<String, Text> {
    +    private long startOffset;
    +    private int length;
    +    private Path path;
    +
    +    private String key = null;
    +    private Text value = null;
    +
    +    private boolean processed = false;
    +
    +    private FileSystem fs;
    +
    +    public BatchFilesRecordReader(
    +            CombineFileSplit split,
    +            TaskAttemptContext context,
    +            Integer index)
    +            throws IOException {
    +        path = split.getPath(index);
    +        startOffset = split.getOffset(index);
    +        length = (int) split.getLength(index);
    +        fs = path.getFileSystem(context.getConfiguration());
    +    }
    +
    +    @Override
    +    public void initialize(InputSplit arg0, TaskAttemptContext arg1)
    +            throws IOException, InterruptedException {
    +    }
    +
    +    @Override
    +    public void close() throws IOException {
    +    }
    +
    +    @Override
    +    public float getProgress() throws IOException {
    +        return processed ? 1.0f : 0.0f;
    --- End diff --
    
    This is not correct because there might be multiple files.



[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38128541
  
     Merged build triggered.



[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38025512
  
    If I understand the purpose correctly, this PR is for reading small text files, but most of the code handles the corner case where a file's size is greater than 2GB. You mentioned Mahout hit this problem. What was the use case there? If someone needs to concatenate several 2GB byte buffers to create a single Text record, very likely he/she is not doing the right thing, IMHO.



[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10691823
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/util/BatchFileInputFormat.java ---
    @@ -0,0 +1,52 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.util;
    +
    +import java.io.IOException;
    +
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.JobContext;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +/**
    + * The specific InputFormat reads files in HDFS or local disk. It will be called by
    + * HadoopRDD to generate new BatchFileRecordReader.
    + */
    +public class BatchFileInputFormat
    --- End diff --
    
    What is the value here? Is it a line from a text file or the whole file? If the value is an arbitrary Text converted from bytes, how do you split a file?



[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10873434
  
    --- Diff: project/SparkBuild.scala ---
    @@ -358,7 +358,8 @@ object SparkBuild extends Build {
       def mllibSettings = sharedSettings ++ Seq(
         name := "spark-mllib",
         libraryDependencies ++= Seq(
    -      "org.jblas" % "jblas" % "1.2.3"
    +      "commons-io"               % "commons-io"         % "2.4",
    --- End diff --
    
    We don't need `commons-io` after we use Guava to read bytes.
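    
    For reference, the Guava-based read might look like this (an editor's sketch, not the PR's exact code):
    
    ```
    import com.google.common.io.{ByteStreams, Closeables}
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.io.Text
    
    // Read one whole file into a single Text record using Guava instead of commons-io.
    def readWholeFile(fs: FileSystem, path: Path): Text = {
      val in = fs.open(path)
      try {
        new Text(ByteStreams.toByteArray(in))
      } finally {
        Closeables.close(in, false)
      }
    }
    ```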



[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38667187
  
    Build started.



[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38765336
  
     Merged build triggered.



[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10912943
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala ---
    @@ -120,4 +122,17 @@ object MLUtils {
         }
         sum
       }
    +
    +  /**
    +   * Read a directory of text files from HDFS, a local file system (available on all nodes), or any
    +   * Hadoop-supported file system URI, and return it as an RDD of (filename, content) both in String
    --- End diff --
    
    (path, content)
    
    Also need to mention the entire file is read as a single record. See `WholeTextInputFormat` for reference.
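    
    A possible wording along those lines (an editor's sketch of the suggested fix, not the merged doc):
    
    ```
    /**
     * Read a directory of text files from HDFS, a local file system (available on all nodes), or
     * any Hadoop-supported file system URI, and return an RDD[(String, String)] of (path, content)
     * pairs, where each entire file is read as a single record.
     */
    ```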



[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10829575
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileRecordReader.java ---
    @@ -0,0 +1,103 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.input;
    +
    +import java.io.IOException;
    +
    +import com.google.common.io.Closeables;
    +import org.apache.commons.io.IOUtils;
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.fs.FSDataInputStream;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +/**
    + * Reads an entire file out in <filename, content> format.
    + */
    +
    +public class WholeTextFileRecordReader extends RecordReader<String, Text> {
    +  private Path path;
    +
    +  private String key = null;
    +  private Text value = null;
    +
    +  private boolean processed = false;
    +
    +  private FileSystem fs;
    +
    +  public WholeTextFileRecordReader(
    +      CombineFileSplit split,
    +      TaskAttemptContext context,
    +      Integer index)
    +    throws IOException {
    +    path = split.getPath(index);
    +    fs = path.getFileSystem(context.getConfiguration());
    +  }
    +
    +  @Override
    +  public void initialize(InputSplit arg0, TaskAttemptContext arg1)
    +    throws IOException, InterruptedException {
    +  }
    +
    +  @Override
    +  public void close() throws IOException {
    +  }
    +
    +  @Override
    +  public float getProgress() throws IOException {
    +    return processed ? 1.0f : 0.0f;
    +  }
    +
    +  @Override
    +  public String getCurrentKey() throws IOException, InterruptedException {
    +        return key;
    +    }
    +
    +  @Override
    +  public Text getCurrentValue() throws IOException, InterruptedException{
    +        return value;
    +    }
    +
    +  @Override
    +  public boolean nextKeyValue() throws IOException {
    +    if (!processed) {
    +      if (key == null) {
    +        key = path.getName();
    --- End diff --
    
    Let's use the full path here for easy dedup.



[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38769629
  
    Yea - perhaps apply it to trunk. Feel free to create another PR if needed.



[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10965793
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/MLContext.scala ---
    @@ -0,0 +1,55 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib
    +
    +import org.apache.hadoop.io.Text
    +
    +import org.apache.spark.mllib.input.WholeTextFileInputFormat
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.SparkContext
    +import org.apache.spark.SparkContext._
    +
    +/**
    + * Extra functions available on SparkContext of mllib through an implicit conversion. Import
    + * `org.apache.spark.mllib.MLContext._` at the top of your program to use these functions.
    + */
    +class MLContext(self: SparkContext) {
    +
    +  /**
    +   * Read a directory of text files from HDFS, a local file system (available on all nodes), or any
    --- End diff --
    
    An example here would be great.



[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38436605
  
     Build triggered.



[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38671708
  
    Build finished.



[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38143977
  
    I think it is fixed now.



[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38143992
  
    Jenkins, retest this please.



[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10786345
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/util/SmallTextFilesSuite.scala ---
    @@ -0,0 +1,218 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.util
    +
    +
    +import java.io.{InputStreamReader, BufferedReader, DataOutputStream, FileOutputStream}
    +import java.nio.file.Files
    +import java.nio.file.{Path => JPath}
    +import java.nio.file.{Paths => JPaths}
    +
    +import org.scalatest.BeforeAndAfterAll
    +import org.scalatest.FunSuite
    +
    +import org.apache.hadoop.hdfs.MiniDFSCluster
    +import org.apache.hadoop.fs.FileSystem
    +import org.apache.hadoop.fs.Path
    +import org.apache.hadoop.io.Text
    +
    +import org.apache.spark.mllib.util.MLUtils._
    +import org.apache.spark.SparkContext
    +
    +/**
    + * Tests HDFS IO and local disk IO of [[smallTextFiles]] in MLutils. HDFS tests create a mock DFS in
    + * memory, while local disk test create a temp directory. All these temporal storages are deleted
    + * in the end.
    + */
    +
    +class SmallTextFilesSuite extends FunSuite with BeforeAndAfterAll {
    +  private var sc: SparkContext = _
    +  private var dfs: MiniDFSCluster = _
    +
    +  override def beforeAll() {
    +    sc = new SparkContext("local", "test")
    +    sc.hadoopConfiguration.set("dfs.datanode.data.dir.perm", SmallTextFilesSuite.dirPermission())
    +    dfs = new MiniDFSCluster(sc.hadoopConfiguration, 4, true,
    +                             Array("/rack0", "/rack0", "/rack1", "/rack1"),
    +                             Array("host0", "host1", "host2", "host3"))
    +  }
    +
    +  override def afterAll() {
    +    if (dfs != null) dfs.shutdown()
    +    sc.stop()
    +    System.clearProperty("spark.driver.port")
    +  }
    +
    +
    +  private def createHDFSFile(
    +      fs: FileSystem,
    +      inputDir: Path,
    +      fileName: String,
    +      contents: Array[Byte]) = {
    +    val out: DataOutputStream = fs.create(new Path(inputDir, fileName), true, 4096, 2, 512, null)
    +    out.write(contents, 0, contents.length)
    +    out.close()
    +    System.out.println("Wrote HDFS file")
    +  }
    +
    +  /**
    +   * This code will test the behaviors on HDFS. There are three aspects to test:
    +   *    1) is all files are read.
    +   *    2) is the fileNames are read correctly.
    +   *    3) is the contents must be the same.
    +   */
    +  test("Small file input || HDFS IO") {
    +    val fs: FileSystem = dfs.getFileSystem
    +    val dir = "/foo/"
    +    val inputDir: Path = new Path(dir)
    +
    +    SmallTextFilesSuite.fileNames.zip(SmallTextFilesSuite.filesContents).foreach {
    +      case (fname, contents) =>
    +        createHDFSFile(fs, inputDir, fname, contents)
    +    }
    +
    +    println(s"name node is ${dfs.getNameNode.getNameNodeAddress.getHostName}")
    +    println(s"name node port is ${dfs.getNameNodePort}")
    +
    +    val hdfsAddressDir =
    +      s"hdfs://${dfs.getNameNode.getNameNodeAddress.getHostName}:${dfs.getNameNodePort}$dir"
    +    println(s"HDFS address dir is $hdfsAddressDir")
    +
    +    val res = smallTextFiles(sc, hdfsAddressDir).collect()
    +
    +    assert(res.size === SmallTextFilesSuite.fileNames.size,
    +      "Number of files read out do not fit with the actual value")
    +
    +    for ((fname, contents) <- res) {
    +      assert(SmallTextFilesSuite.fileNames.contains(fname),
    +        s"Missing file name $fname.")
    +      assert(contents.hashCode === SmallTextFilesSuite.hashCodeOfContents(fname),
    +        s"file $fname contents can not match")
    +    }
    +  }
    +
    +  private def createNativeFile(inputDir: JPath, fileName: String, contents: Array[Byte]) = {
    +    val out = new DataOutputStream(new FileOutputStream(s"${inputDir.toString}/$fileName"))
    +    out.write(contents, 0, contents.length)
    +    out.close()
    +    println("Wrote native file")
    +  }
    +
    +  /**
    +   * This code will test the behaviors on native file system. There are three aspects:
    +   *    1) is all files are read.
    +   *    2) is the fileNames are read correctly.
    +   *    3) is the contents must be the same.
    +   */
    +  test("Small file input || native disk IO") {
    +
    +    sc.hadoopConfiguration.clear()
    +
    +    val dir = Files.createTempDirectory("smallfiles")
    +    println(s"native disk address is ${dir.toString}")
    +
    +    SmallTextFilesSuite.fileNames.zip(SmallTextFilesSuite.filesContents).foreach {
    +      case (fname, contents) =>
    +        createNativeFile(dir, fname, contents)
    +    }
    +
    +    val res = smallTextFiles(sc, dir.toString).collect()
    +
    +    assert(res.size === SmallTextFilesSuite.fileNames.size,
    +      "Number of files read out do not fit with the actual value")
    +
    +    for ((fname, contents) <- res) {
    +      assert(SmallTextFilesSuite.fileNames.contains(fname),
    +        s"Missing file name $fname.")
    +      assert(contents.hashCode === SmallTextFilesSuite.hashCodeOfContents(fname),
    +        s"file $fname contents can not match")
    +    }
    +
    +    SmallTextFilesSuite.fileNames.foreach { fname =>
    +      Files.deleteIfExists(JPaths.get(s"${dir.toString}/$fname"))
    +    }
    +    Files.deleteIfExists(dir)
    +  }
    +}
    +
    +/**
    + * Some final values are defined here. chineseWordsSpark is refer to the Chinese character version
    + * of "Spark", we use UTF-8 to encode the bytes together, with a '\n' in the end. fileNames and
    + * fileContents represent the test data that will be used later. hashCodeOfContents is a Map of
    + * fileName to the hashcode of contents, which is used for the comparison of contents, i.e. the
    + * "read in" contents should be same with the "read out" ones.
    + */
    +
    +object SmallTextFilesSuite {
    +  private val chineseWordsSpark = Array(
    +    0xe7.toByte, 0x81.toByte, 0xab.toByte,
    +    0xe8.toByte, 0x8a.toByte, 0xb1.toByte,
    +    '\n'.toByte)
    +
    +  private val fileNames = Array("part-00000", "part-00001", "part-00002")
    +
    +  private val filesContents = Array(7, 70, 700).map { upperBound =>
    +    Stream.continually(chineseWordsSpark.toList.toStream).flatten.take(upperBound).toArray
    +  }
    +
    +  private val hashCodeOfContents = fileNames.zip(filesContents).map { case (fname, contents) =>
    +    fname -> new Text(contents).toString.hashCode
    +  }.toMap
    +
    +  /**
    +   * Functions getUmask and getDirPermission are used for get the default directory permission on
    +   * current system, so as to give the MiniDFSCluster proper permissions. Or the test will be abort
    +   * due to the improper permission.
    +   */
    +  private def umask(): Int = {
    --- End diff --
    
    It is strange to see this function in the test. Does MiniDFSCluster read Hadoop configurations like `dfs.umaskmode`?



[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10786514
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/BatchFilesRecordReader.java ---
    @@ -0,0 +1,109 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.input;
    +
    +import java.io.IOException;
    +
    +import com.google.common.io.Closeables;
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.fs.FSDataInputStream;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.io.IOUtils;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +/**
    + * Reads an entire file out in bytes format in <filename, content> format.
    + */
    +
    +public class BatchFilesRecordReader extends RecordReader<String, Text> {
    +    private long startOffset;
    +    private int length;
    +    private Path path;
    +
    +    private String key = null;
    +    private Text value = null;
    +
    +    private boolean processed = false;
    +
    +    private FileSystem fs;
    +
    +    public BatchFilesRecordReader(
    +            CombineFileSplit split,
    +            TaskAttemptContext context,
    +            Integer index)
    +            throws IOException {
    +        path = split.getPath(index);
    +        startOffset = split.getOffset(index);
    +        length = (int) split.getLength(index);
    +        fs = path.getFileSystem(context.getConfiguration());
    +    }
    +
    +    @Override
    +    public void initialize(InputSplit arg0, TaskAttemptContext arg1)
    +            throws IOException, InterruptedException {
    +    }
    +
    +    @Override
    +    public void close() throws IOException {
    +    }
    +
    +    @Override
    +    public float getProgress() throws IOException {
    +        return processed ? 1.0f : 0.0f;
    +    }
    +
    +    @Override
    +    public String getCurrentKey() throws IOException, InterruptedException {
    +        return key;
    +    }
    +
    +    @Override
    +    public Text getCurrentValue() throws IOException, InterruptedException{
    +        return value;
    +    }
    +
    +    @Override
    +    public boolean nextKeyValue() throws IOException {
    +        if (!processed) {
    --- End diff --
    
    Okay. I misread the type of BatchFilesRecordReader. Its name is a little confusing because it only reads one file per instance. I still recommend using `WholeTextFileRecordReader`.



[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10965782
  
    --- Diff: project/SparkBuild.scala ---
    @@ -358,7 +358,7 @@ object SparkBuild extends Build {
       def mllibSettings = sharedSettings ++ Seq(
         name := "spark-mllib",
         libraryDependencies ++= Seq(
    -      "org.jblas" % "jblas" % "1.2.3"
    +      "org.jblas"                % "jblas"              % "1.2.3"
    --- End diff --
    
    Let's not change this line



[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38769541
  
    @rxin @mengxr It seems that I did something wrong. Could I close this branch and create a new one? I will base it on the latest trunk.



[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10993561
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/MLContext.scala ---
    @@ -0,0 +1,58 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib
    +
    +import org.apache.spark.mllib.input.WholeTextFileInputFormat
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.SparkContext
    +
    +/**
    + * Extra functions available on SparkContext of mllib through an implicit conversion. Import
    + * `org.apache.spark.mllib.MLContext._` at the top of your program to use these functions.
    + */
    +class MLContext(self: SparkContext) {
    +
    +  /**
    +   * Read a directory of text files from HDFS, a local file system (available on all nodes), or any
    +   * Hadoop-supported file system URI. For example,
    +   *   {{{
    +   *   hdfs://a-hdfs-path/part-00000
    --- End diff --
    
    Try this ...
    ```
    /* If you have the following files:
    hdfs://a-hdfs-path/part-00000
    hdfs://a-hdfs-path/part-00001
    ...
    hdfs://a-hdfs-path/part-nnnnn
    */
    
    val rdd = mlContext.wholeTextFile("hdfs://a-hdfs-path")
    // RDD contains
    // ....   <---------------- explain what the content looks like
    ```
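    
    A filled-in version might read (an editor's sketch; the keys assume the full-path convention suggested earlier):
    
    ```
    val rdd = mlContext.wholeTextFile("hdfs://a-hdfs-path")
    // rdd contains (path, content) pairs such as:
    //   ("hdfs://a-hdfs-path/part-00000", "contents of part-00000")
    //   ("hdfs://a-hdfs-path/part-00001", "contents of part-00001")
    ```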



[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10865849
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileRecordReader.java ---
    @@ -0,0 +1,103 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.input;
    +
    +import java.io.IOException;
    +
    +import com.google.common.io.Closeables;
    --- End diff --
    
    Insert an empty line or order imports alphabetically.



[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10691097
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/util/BatchFileInputFormat.java ---
    @@ -0,0 +1,52 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.util;
    +
    +import java.io.IOException;
    +
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.JobContext;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +/**
    + * The specific InputFormat reads files in HDFS or local disk. It will be called by
    + * HadoopRDD to generate new BatchFileRecordReader.
    + */
    +public class BatchFileInputFormat
    --- End diff --
    
    I think not; the meaning is different. `WholeTextFileInputFormat`, which extends `FileInputFormat`, is used to read a single huge file from HDFS or local disk, but I called this one `BatchFileInputFormat` because it reads a bunch of files rather than a single file, and it extends `CombineFileInputFormat`. I think I should rename it `BatchFilesInputFormat`.



[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38441593
  
    Build finished.



[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38410991
  
    Build finished.



[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10912839
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileRecordReader.java ---
    @@ -0,0 +1,104 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.input;
    +
    +import java.io.IOException;
    +
    +import com.google.common.io.Closeables;
    +import org.apache.commons.io.IOUtils;
    --- End diff --
    
    remove unused import



[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38761355
  
    Somehow this is no longer mergeable. Do you mind bringing it up to date? Everything looks good.



[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10865876
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala ---
    @@ -17,12 +17,14 @@
     
     package org.apache.spark.mllib.util
     
    -import org.apache.spark.SparkContext
    -import org.apache.spark.rdd.RDD
    -import org.apache.spark.SparkContext._
    -
    +import org.apache.hadoop.io.Text
    --- End diff --
    
    okay



[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38146292
  
    Merged build started.



[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-37901339
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13224/



[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10786231
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/BatchFilesRecordReader.java ---
    @@ -0,0 +1,109 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.input;
    +
    +import java.io.IOException;
    +
    +import com.google.common.io.Closeables;
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.fs.FSDataInputStream;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.io.IOUtils;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +/**
    + * Reads an entire file out in bytes format in <filename, content> format.
    + */
    +
    +public class BatchFilesRecordReader extends RecordReader<String, Text> {
    +    private long startOffset;
    +    private int length;
    +    private Path path;
    +
    +    private String key = null;
    +    private Text value = null;
    +
    +    private boolean processed = false;
    +
    +    private FileSystem fs;
    +
    +    public BatchFilesRecordReader(
    +            CombineFileSplit split,
    +            TaskAttemptContext context,
    +            Integer index)
    +            throws IOException {
    +        path = split.getPath(index);
    +        startOffset = split.getOffset(index);
    +        length = (int) split.getLength(index);
    +        fs = path.getFileSystem(context.getConfiguration());
    +    }
    +
    +    @Override
    +    public void initialize(InputSplit arg0, TaskAttemptContext arg1)
    +            throws IOException, InterruptedException {
    +    }
    +
    +    @Override
    +    public void close() throws IOException {
    +    }
    +
    +    @Override
    +    public float getProgress() throws IOException {
    +        return processed ? 1.0f : 0.0f;
    +    }
    +
    +    @Override
    +    public String getCurrentKey() throws IOException, InterruptedException {
    +        return key;
    +    }
    +
    +    @Override
    +    public Text getCurrentValue() throws IOException, InterruptedException{
    +        return value;
    +    }
    +
    +    @Override
    +    public boolean nextKeyValue() throws IOException {
    +        if (!processed) {
    --- End diff --
    
    No, it doesn't. One instance of `BatchFilesRecordReader` processes only one file. The constructor takes an `Integer index`, which identifies the current file within the file set. The same reasoning applies to the question above about the `getProgress()` function.
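
    For reference, one reader per file is exactly what Hadoop's `CombineFileRecordReader` produces. A minimal sketch of the `createRecordReader` hook that wires it up (illustrative only, not the exact patch code):

        @Override
        public RecordReader<String, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException {
          // CombineFileRecordReader constructs one BatchFilesRecordReader per
          // file in the split, passing that file's index as the third argument.
          return new CombineFileRecordReader<String, Text>(
              (CombineFileSplit) split, context, BatchFilesRecordReader.class);
        }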


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38146374
  
    Btw, please fix the style issues. We use 2-space indentation in Spark, and there are more guidelines on the Spark Code Style page.


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38520669
  
    @yinxusen I didn't mean putting it under `SparkContext` directly, but something like `MLContext`.


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38441596
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13394/


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38347601
  
    Merged build finished.


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10912844
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileRecordReader.java ---
    @@ -0,0 +1,104 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.input;
    +
    +import java.io.IOException;
    +
    +import com.google.common.io.Closeables;
    +import org.apache.commons.io.IOUtils;
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.fs.FSDataInputStream;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +/**
    + * An <code>org.apache.hadoop.mapreduce.RecordReader</code> for reading whole text file out in
    --- End diff --
    
    change `<code>` to `{@link `


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38146291
  
     Merged build triggered.


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38643353
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13453/


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10861387
  
    --- Diff: project/SparkBuild.scala ---
    @@ -358,7 +358,11 @@ object SparkBuild extends Build {
       def mllibSettings = sharedSettings ++ Seq(
         name := "spark-mllib",
         libraryDependencies ++= Seq(
    -      "org.jblas" % "jblas" % "1.2.3"
    +      "com.sun.jersey"           % "jersey-core"        % "1.8"           % "test",
    --- End diff --
    
    Ah, I forgot to remove these unused dependencies. I'll fix it in the next commit.


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10965749
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileRecordReader.java ---
    @@ -0,0 +1,103 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.input;
    +
    +import java.io.IOException;
    +
    +import com.google.common.io.ByteStreams;
    +import com.google.common.io.Closeables;
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.fs.FSDataInputStream;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +/**
    + * An {@link org.apache.hadoop.mapreduce.RecordReader} for reading a single whole text file out in a
    + * key-value pair, where the key is the file path and the value is the entire content of the file.
    + */
    +public class WholeTextFileRecordReader extends RecordReader<String, Text> {
    +  private Path path;
    +
    +  private String key = null;
    +  private Text value = null;
    +
    +  private boolean processed = false;
    --- End diff --
    
    Can you add an inline comment to explain what "processed" means? You don't want the reader to have to go through nextKeyValue to find out what it does.
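
    Something like this would do (a hypothetical sketch of the kind of comment meant here, not actual patch code):

        // Whether the single file in this split has been emitted yet. A
        // WholeTextFileRecordReader handles exactly one file, so this flag
        // flips to true after the first successful call to nextKeyValue().
        private boolean processed = false;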


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38640445
  
     Build triggered.


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10871469
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileInputFormat.java ---
    @@ -0,0 +1,53 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.input;
    +
    +import java.io.IOException;
    +
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.JobContext;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +/**
    + * The specific InputFormat reads files in HDFS or local disk into pair (filename, content) format.
    + * It will be called by HadoopRDD to generate new WholeTextFileRecordReader.
    --- End diff --
    
    \<code\>org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat\</code\> should be used, I think, since a Java source file cannot use the Scaladoc `[[...]]` syntax.


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38759435
  
    Build finished.


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10954776
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/MLContext.scala ---
    @@ -0,0 +1,56 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib
    +
    +import org.apache.hadoop.io.Text
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.mllib.input.WholeTextFileInputFormat
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.SparkContext
    +import org.apache.spark.SparkContext._
    +
    +/**
    + * Extra functions available on SparkContext of mllib through an implicit conversion. Import
    + * `org.apache.spark.mllib.MLContext._` at the top of your program to use these functions.
    + */
    +class MLContext(self: SparkContext) extends Logging {
    +
    +  /**
    +   * Read a directory of text files from HDFS, a local file system (available on all nodes), or any
    +   * Hadoop-supported file system URI. Each file is read as a single record and returned in a
    +   * key-value pair, where the key is the path of each file, and the value is the content of each
    +   * file.
    +   */
    +  def wholeTextFile(path: String): RDD[(String, String)] = {
    +    self.newAPIHadoopFile(
    +      path,
    +      classOf[WholeTextFileInputFormat],
    +      classOf[String],
    +      classOf[Text]).mapValues(_.toString)
    +  }
    +}
    +
    +
    +/**
    + * The MLContext object contains a number of implicit conversions and parameters for use with
    + * various mllib features.
    + */
    +object MLContext extends Logging {
    +  implicit def mlContextToSparkContext(sc: SparkContext) = new MLContext(sc)
    --- End diff --
    
    sparkContextToMLContext


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38568167
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13432/


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38562620
  
     Build triggered.


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10865895
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/util/WholeTextFileRecordReaderSuite.scala ---
    @@ -0,0 +1,114 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.util
    +
    +
    +import java.io.DataOutputStream
    +import java.io.FileOutputStream
    +import java.nio.file.Files
    +import java.nio.file.Path
    +import java.nio.file.Paths
    +
    +import scala.collection.immutable.IndexedSeq
    +
    +import org.scalatest.BeforeAndAfterAll
    +import org.scalatest.FunSuite
    +
    +import org.apache.hadoop.io.Text
    +
    +import org.apache.spark.mllib.util.MLUtils._
    +import org.apache.spark.SparkContext
    +
    +/**
    + * Tests the correctness of [[org.apache.spark.mllib.input.WholeTextFileRecordReader]]. A temporary
    + * directory is created as fake input. Temporal storage would be deleted in the end.
    + */
    +class WholeTextFileRecordReaderSuite extends FunSuite with BeforeAndAfterAll {
    +  private var sc: SparkContext = _
    +
    +  override def beforeAll() {
    +    sc = new SparkContext("local", "test")
    +  }
    +
    +  override def afterAll() {
    +    sc.stop()
    +  }
    +
    +  private def createNativeFile(inputDir: Path, fileName: String, contents: Array[Byte]) = {
    +    val out = new DataOutputStream(new FileOutputStream(s"${inputDir.toString}/$fileName"))
    +    out.write(contents, 0, contents.length)
    +    out.close()
    +    println("Wrote native file")
    +  }
    +
    +  /**
    +   * This code will test the behaviors of WholeTextFileRecordReader based on local disk. There are
    +   * three aspects to check:
    +   *   1) is all files are read.
    --- End diff --
    
    whether all files are read


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10880354
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileRecordReader.java ---
    @@ -0,0 +1,104 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.input;
    +
    +import java.io.IOException;
    +
    +import com.google.common.io.Closeables;
    +import org.apache.commons.io.IOUtils;
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.fs.FSDataInputStream;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +/**
    + * An <code>org.apache.hadoop.mapreduce.RecordReader</code> for reading whole text file out in
    + * (filename, content) format. Each element in split is an record of a unique, whole file. File name
    + * is the full path name for easy deduplicate.
    + */
    +public class WholeTextFileRecordReader extends RecordReader<String, Text> {
    +  private Path path;
    +
    +  private String key = null;
    +  private Text value = null;
    +
    +  private boolean processed = false;
    +
    +  private FileSystem fs;
    +
    +  public WholeTextFileRecordReader(
    +      CombineFileSplit split,
    +      TaskAttemptContext context,
    +      Integer index)
    +    throws IOException {
    +    path = split.getPath(index);
    +    fs = path.getFileSystem(context.getConfiguration());
    +  }
    +
    +  @Override
    +  public void initialize(InputSplit arg0, TaskAttemptContext arg1)
    +    throws IOException, InterruptedException {
    +  }
    +
    +  @Override
    +  public void close() throws IOException {
    +  }
    +
    +  @Override
    +  public float getProgress() throws IOException {
    +    return processed ? 1.0f : 0.0f;
    +  }
    +
    +  @Override
    +  public String getCurrentKey() throws IOException, InterruptedException {
    +    return key;
    +  }
    +
    +  @Override
    +  public Text getCurrentValue() throws IOException, InterruptedException{
    +    return value;
    +  }
    +
    +  @Override
    +  public boolean nextKeyValue() throws IOException {
    +    if (!processed) {
    +      if (key == null) {
    +        key = path.toString();
    +      }
    +      if (value == null) {
    +        value = new Text();
    +      }
    +
    +      FSDataInputStream fileIn = null;
    +      try {
    +        fileIn = fs.open(path);
    +        byte[] innerBuffer = IOUtils.toByteArray(fileIn);
    --- End diff --
    
    @srowen It would be nice to remove the dependency on commons-io. But I find that only a few places use commons-io, and most of them are test suites.


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-37901337
  
    Merged build finished.


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10873319
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileInputFormat.java ---
    @@ -0,0 +1,53 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.input;
    +
    +import java.io.IOException;
    +
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.JobContext;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +/**
    + * The specific InputFormat reads files in HDFS or local disk into pair (filename, content) format.
    + * It will be called by HadoopRDD to generate new WholeTextFileRecordReader.
    --- End diff --
    
    Sorry I forgot this is Java. Then you should use `@link` instead of `[[]]`.


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10912852
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileRecordReader.java ---
    @@ -0,0 +1,104 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.input;
    +
    +import java.io.IOException;
    +
    +import com.google.common.io.Closeables;
    +import org.apache.commons.io.IOUtils;
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.fs.FSDataInputStream;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +/**
    + * An <code>org.apache.hadoop.mapreduce.RecordReader</code> for reading whole text file out in
    + * (filename, content) format. Each element in split is an record of a unique, whole file. File name
    --- End diff --
    
    (path, content)


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10786181
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala ---
    @@ -120,4 +122,22 @@ object MLUtils {
         }
         sum
       }
    +
    +  /**
    +   * Reads a bunch of small files from HDFS, or a local file system (available on all nodes), or any
    +   * Hadoop-supported file system URI, and return an RDD[(String, String)].
    +   *
    +   * @param path The directory you should specified, such as
    +   *             hdfs://[address]:[port]/[dir]
    +   *
    +   * @return RDD[(fileName: String, content: String)]
    +   *         i.e. the first is the file name of a file, the second one is its content.
    +   */
    +  def smallTextFiles(sc: SparkContext, path: String): RDD[(String, String)] = {
    --- End diff --
    
    The current API uses `textFile` to load text files. I suggest using `wholeTextFile` to load the entire content of each file. `small` is not a good name here because it is not defined.


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10690953
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/util/BatchFileRecordReader.java ---
    @@ -0,0 +1,117 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.util;
    +
    +import java.io.IOException;
    +
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.fs.FSDataInputStream;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +/**
    + * Reads an entire file out in bytes format in <filename, content> format.
    --- End diff --
    
    Are you reading the entire file in bytes?


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10690798
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/util/BatchFileInputFormat.java ---
    @@ -0,0 +1,52 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.util;
    +
    +import java.io.IOException;
    +
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.JobContext;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +/**
    + * The specific InputFormat reads files in HDFS or local disk. It will be called by
    + * HadoopRDD to generate new BatchFileRecordReader.
    + */
    +public class BatchFileInputFormat
    --- End diff --
    
    This is basically the `WholeFileInputFormat` for `Text`:
    
    https://github.com/tomwhite/hadoop-book/blob/master/ch07/src/main/java/WholeFileInputFormat.java
    
    Shall we call it `WholeTextFileInputFormat`?
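
    Roughly, such an input format would take this shape (a sketch assuming splitting stays disabled so each record is one whole file, not the patch itself):

        public class WholeTextFileInputFormat
            extends CombineFileInputFormat<String, Text> {

          @Override
          protected boolean isSplitable(JobContext context, Path file) {
            // Never split a file, so each record covers one whole file.
            return false;
          }
        }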


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10786005
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/BatchFilesRecordReader.java ---
    @@ -0,0 +1,109 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.input;
    +
    +import java.io.IOException;
    +
    +import com.google.common.io.Closeables;
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.fs.FSDataInputStream;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.io.IOUtils;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +/**
    + * Reads an entire file out in bytes format in <filename, content> format.
    + */
    +
    +public class BatchFilesRecordReader extends RecordReader<String, Text> {
    +    private long startOffset;
    +    private int length;
    +    private Path path;
    +
    +    private String key = null;
    +    private Text value = null;
    +
    +    private boolean processed = false;
    +
    +    private FileSystem fs;
    +
    +    public BatchFilesRecordReader(
    +            CombineFileSplit split,
    +            TaskAttemptContext context,
    +            Integer index)
    +            throws IOException {
    +        path = split.getPath(index);
    +        startOffset = split.getOffset(index);
    +        length = (int) split.getLength(index);
    +        fs = path.getFileSystem(context.getConfiguration());
    +    }
    +
    +    @Override
    +    public void initialize(InputSplit arg0, TaskAttemptContext arg1)
    +            throws IOException, InterruptedException {
    +    }
    +
    +    @Override
    +    public void close() throws IOException {
    +    }
    +
    +    @Override
    +    public float getProgress() throws IOException {
    +        return processed ? 1.0f : 0.0f;
    +    }
    +
    +    @Override
    +    public String getCurrentKey() throws IOException, InterruptedException {
    +        return key;
    +    }
    +
    +    @Override
    +    public Text getCurrentValue() throws IOException, InterruptedException{
    +        return value;
    +    }
    +
    +    @Override
    +    public boolean nextKeyValue() throws IOException {
    +        if (!processed) {
    +            if (key == null) {
    +                key = path.getName();
    +            }
    +            if (value == null) {
    +                value = new Text();
    +            }
    +
    +            FSDataInputStream fileIn = null;
    +            try {
    +                fileIn = fs.open(path);
    +                fileIn.seek(startOffset);
    +                byte[] innerBuffer = new byte[length];
    +                IOUtils.readFully(fileIn, innerBuffer, 0, length);
    --- End diff --
    
    use `IOUtils.toByteArray(fileIn)` for simplicity.
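
    That is, the read could shrink to something like this (a sketch assuming `org.apache.commons.io.IOUtils`; since `isSplitable()` is false the split covers the whole file, so reading to EOF is safe):

        FSDataInputStream fileIn = null;
        try {
          fileIn = fs.open(path);
          // toByteArray reads the stream to EOF, so the manual
          // offset/length bookkeeping goes away.
          byte[] innerBuffer = IOUtils.toByteArray(fileIn);
          value = new Text(innerBuffer);
        } finally {
          if (fileIn != null) {
            fileIn.close();
          }
        }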


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38568166
  
    Build finished.


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10691327
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/util/BatchFileRecordReader.java ---
    @@ -0,0 +1,117 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.util;
    +
    +import java.io.IOException;
    +
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.fs.FSDataInputStream;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +/**
    + * Reads an entire file out in bytes format in <filename, content> format.
    + */
    +
    +public class BatchFileRecordReader extends RecordReader<String, Text> {
    +    private long startOffset;
    +    private long end;
    +    private long pos;
    +    private Path path;
    +
    +    private static final int MAX_BYTES_ALLOCATION = 64 * 1024 * 1024;
    +
    +    private String key = null;
    +    private Text value = null;
    +
    +    private FSDataInputStream fileIn;
    +
    +    public BatchFileRecordReader(
    +            CombineFileSplit split,
    +            TaskAttemptContext context,
    +            Integer index)
    +            throws IOException {
    +        path = split.getPath(index);
    +        startOffset = split.getOffset(index);
    +        pos = startOffset;
    +        end = startOffset + split.getLength(index);
    +
    +        FileSystem fs = path.getFileSystem(context.getConfiguration());
    +        fileIn = fs.open(path);
    +        fileIn.seek(startOffset);
    +    }
    +
    +    @Override
    +    public void initialize(InputSplit arg0, TaskAttemptContext arg1)
    +            throws IOException, InterruptedException {}
    +
    +    @Override
    +    public void close() throws IOException {
    +        if (fileIn != null) {
    +            fileIn.close();
    +        }
    +    }
    +
    +    @Override
    +    public float getProgress() throws IOException {
    +        if (startOffset == end) return 0;
    --- End diff --
    
    Yeah, but it should be `pos == end`, not `startOffset == end`; I made a mistake here.
    
    `if (pos == end) return 1.0f;` is correct.
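
    In full, the corrected method might read (a sketch, assuming `pos` advances as bytes are consumed):

        @Override
        public float getProgress() throws IOException {
          if (pos == end) {
            return 1.0f;
          }
          return Math.min(1.0f, (pos - startOffset) / (float) (end - startOffset));
        }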


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38640446
  
    Build started.


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-37910845
  
    Merged build finished.


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10691012
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/util/BatchFileInputFormat.java ---
    @@ -0,0 +1,52 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.util;
    --- End diff --
    
    I think `input` is better, because the input of a machine learning algorithm is usually a dataset, or batch/streaming files in different formats. Meanwhile, the output is usually models or prediction results, which have a different semantic meaning.


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38346619
  
     Merged build triggered.


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38128996
  
    One or more automated tests failed
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13287/


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38150473
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13298/


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10691283
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/util/BatchFileRecordReader.java ---
    @@ -0,0 +1,117 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.util;
    +
    +import java.io.IOException;
    +
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.fs.FSDataInputStream;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +/**
    + * Reads an entire file out in bytes format in <filename, content> format.
    --- End diff --
    
    Within the life cycle of an instance of this class, the answer is yes. The `split(index)` here refers to a single file, because I set `isSplitable()` in `BatchFileInputFormat` to false.
    
    `nextKeyValue()` may be called several times, until no more bytes are left in `split(index)`, by the `nextKeyValue()` function in `CombineFileRecordReader`, and eventually by `compute()` in `HadoopRDD`. It reads the entire file as bytes, and I encapsulate those bytes in a `Text`.
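
    Stripped down, the one-shot control flow looks like this (a sketch; `readWholeFile()` is a hypothetical helper standing in for the byte-reading logic, not a method in the patch):

        @Override
        public boolean nextKeyValue() throws IOException {
          if (processed) {
            // The single file behind this reader has already been emitted.
            return false;
          }
          key = path.toString();
          // readWholeFile() is a hypothetical stand-in for the actual read.
          value = new Text(readWholeFile());
          processed = true;
          return true;
        }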


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-37910134
  
     Merged build triggered.


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-37899279
  
     Merged build triggered.


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10865885
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala ---
    @@ -120,4 +122,21 @@ object MLUtils {
         }
         sum
       }
    +
    +  /**
    +   * Reads a bunch of whole files from HDFS, or a local file system (available on all nodes), or
    +   * any Hadoop-supported file system URI, and returns an RDD[(String, String)].
    +   *
    +   * @param path The directory you should specify, such as
    +   *             hdfs://[address]:[port]/[dir]
    +   *
    +   * @return The first String is a file name, the second one is its content.
    --- End diff --
    
    Need to improve the doc. See my comment on `WholeTextFileInputFormat` and the doc of `textFile` as a reference.
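    
    For example, a textFile-style doc for it could read roughly like this (a sketch of the direction only, not proposed final wording):
    
        /**
         * Reads a directory of text files from HDFS, a local file system (available on all nodes),
         * or any Hadoop-supported file system URI. Each file is read as a single record and
         * returned as a (path, content) pair, so it is intended for many small files rather than
         * a few large ones.
         *
         * @param path directory to read, e.g. "hdfs://host:port/dir" or "file:///dir"
         * @return an RDD where the key is the full path of a file and the value is its entire
         *         content
         */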


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10829013
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileRecordReader.java ---
    @@ -0,0 +1,103 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.input;
    +
    +import java.io.IOException;
    +
    +import com.google.common.io.Closeables;
    +import org.apache.commons.io.IOUtils;
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.fs.FSDataInputStream;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +/**
    + * Reads an entire file out in <filename, content> format.
    + */
    +
    --- End diff --
    
    remove empty line


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10691105
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/util/BatchFileInputFormat.java ---
    @@ -0,0 +1,52 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.util;
    +
    +import java.io.IOException;
    +
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.JobContext;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +/**
    + * The specific InputFormat reads files in HDFS or local disk. It will be called by
    + * HadoopRDD to generate new BatchFileRecordReader.
    + */
    +public class BatchFileInputFormat
    +        extends CombineFileInputFormat<String, Text> {
    --- End diff --
    
    Sorry for my carelessness. I'll fix it.


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38518467
  
    @yinxusen The code looks good to me except for a few docs. I'm not sure whether `MLUtils` is a good place for `wholeTextFile`. We can pimp `SparkContext` and put this method there.
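    
    For instance, the enrichment could look roughly like this (a hypothetical sketch: `WholeTextFileContext` and `WholeTextFileSparkContext` are invented names, while `WholeTextFileInputFormat` is the InputFormat from this PR):
    
        import org.apache.hadoop.io.Text
        import org.apache.spark.SparkContext
        import org.apache.spark.mllib.input.WholeTextFileInputFormat
        import org.apache.spark.rdd.RDD
    
        // "Pimp my library": after importing the implicit, callers can write
        // sc.wholeTextFile(path) as if it were a SparkContext method.
        object WholeTextFileContext {
          implicit class WholeTextFileSparkContext(sc: SparkContext) {
            def wholeTextFile(path: String): RDD[(String, String)] =
              sc.newAPIHadoopFile(
                path,
                classOf[WholeTextFileInputFormat],
                classOf[String],
                classOf[Text],
                sc.hadoopConfiguration
              ).map { case (fileName, content) => (fileName, content.toString) }
          }
        }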


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38253432
  
     Merged build triggered.


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10865841
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileInputFormat.java ---
    @@ -0,0 +1,53 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.input;
    +
    +import java.io.IOException;
    +
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.JobContext;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +/**
    + * The specific InputFormat reads files in HDFS or local disk into pair (filename, content) format.
    + * It will be called by HadoopRDD to generate new WholeTextFileRecordReader.
    --- End diff --
    
    Shall we change the doc to the following?
    
    > An [[org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat]] for reading whole text files. Each file is read as a key-value pair, where the key is the file path and the value is the entire content of the file.


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10690809
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/util/BatchFileInputFormat.java ---
    @@ -0,0 +1,52 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.util;
    +
    +import java.io.IOException;
    +
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.JobContext;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +/**
    + * The specific InputFormat reads files in HDFS or local disk. It will be called by
    + * HadoopRDD to generate new BatchFileRecordReader.
    + */
    +public class BatchFileInputFormat
    +        extends CombineFileInputFormat<String, Text> {
    --- End diff --
    
    The indentation you used is not consistent with the Spark code style. This line may fit on the line above.


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10785965
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/BatchFilesRecordReader.java ---
    @@ -0,0 +1,109 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.input;
    +
    +import java.io.IOException;
    +
    +import com.google.common.io.Closeables;
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.fs.FSDataInputStream;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.io.IOUtils;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +/**
    + * Reads an entire file out as bytes, in <filename, content> format.
    + */
    +
    +public class BatchFilesRecordReader extends RecordReader<String, Text> {
    +    private long startOffset;
    +    private int length;
    +    private Path path;
    +
    +    private String key = null;
    +    private Text value = null;
    +
    +    private boolean processed = false;
    +
    +    private FileSystem fs;
    +
    +    public BatchFilesRecordReader(
    +            CombineFileSplit split,
    +            TaskAttemptContext context,
    +            Integer index)
    +            throws IOException {
    +        path = split.getPath(index);
    +        startOffset = split.getOffset(index);
    +        length = (int) split.getLength(index);
    +        fs = path.getFileSystem(context.getConfiguration());
    +    }
    +
    +    @Override
    +    public void initialize(InputSplit arg0, TaskAttemptContext arg1)
    +            throws IOException, InterruptedException {
    +    }
    +
    +    @Override
    +    public void close() throws IOException {
    +    }
    +
    +    @Override
    +    public float getProgress() throws IOException {
    +        return processed ? 1.0f : 0.0f;
    +    }
    +
    +    @Override
    +    public String getCurrentKey() throws IOException, InterruptedException {
    +        return key;
    +    }
    +
    +    @Override
    +    public Text getCurrentValue() throws IOException, InterruptedException{
    +        return value;
    +    }
    +
    +    @Override
    +    public boolean nextKeyValue() throws IOException {
    +        if (!processed) {
    --- End diff --
    
    So it only reads the first file and throws away the rest?


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10786698
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/util/SmallTextFilesSuite.scala ---
    @@ -0,0 +1,218 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.util
    +
    +
    +import java.io.{InputStreamReader, BufferedReader, DataOutputStream, FileOutputStream}
    +import java.nio.file.Files
    +import java.nio.file.{Path => JPath}
    +import java.nio.file.{Paths => JPaths}
    +
    +import org.scalatest.BeforeAndAfterAll
    +import org.scalatest.FunSuite
    +
    +import org.apache.hadoop.hdfs.MiniDFSCluster
    +import org.apache.hadoop.fs.FileSystem
    +import org.apache.hadoop.fs.Path
    +import org.apache.hadoop.io.Text
    +
    +import org.apache.spark.mllib.util.MLUtils._
    +import org.apache.spark.SparkContext
    +
    +/**
    + * Tests HDFS IO and local disk IO of [[smallTextFiles]] in MLUtils. HDFS tests create a mock DFS
    + * in memory, while the local disk test creates a temp directory. All this temporary storage is
    + * deleted at the end.
    + */
    +
    +class SmallTextFilesSuite extends FunSuite with BeforeAndAfterAll {
    +  private var sc: SparkContext = _
    +  private var dfs: MiniDFSCluster = _
    +
    +  override def beforeAll() {
    +    sc = new SparkContext("local", "test")
    +    sc.hadoopConfiguration.set("dfs.datanode.data.dir.perm", SmallTextFilesSuite.dirPermission())
    +    dfs = new MiniDFSCluster(sc.hadoopConfiguration, 4, true,
    +                             Array("/rack0", "/rack0", "/rack1", "/rack1"),
    +                             Array("host0", "host1", "host2", "host3"))
    +  }
    +
    +  override def afterAll() {
    +    if (dfs != null) dfs.shutdown()
    +    sc.stop()
    +    System.clearProperty("spark.driver.port")
    +  }
    +
    +
    +  private def createHDFSFile(
    +      fs: FileSystem,
    +      inputDir: Path,
    +      fileName: String,
    +      contents: Array[Byte]) = {
    +    val out: DataOutputStream = fs.create(new Path(inputDir, fileName), true, 4096, 2, 512, null)
    +    out.write(contents, 0, contents.length)
    +    out.close()
    +    System.out.println("Wrote HDFS file")
    +  }
    +
    +  /**
    +   * This code tests the behavior on HDFS. There are three aspects to test:
    +   *    1) whether all files are read;
    +   *    2) whether the file names are read correctly;
    +   *    3) whether the contents match the originals.
    +   */
    +  test("Small file input || HDFS IO") {
    +    val fs: FileSystem = dfs.getFileSystem
    +    val dir = "/foo/"
    +    val inputDir: Path = new Path(dir)
    +
    +    SmallTextFilesSuite.fileNames.zip(SmallTextFilesSuite.filesContents).foreach {
    +      case (fname, contents) =>
    +        createHDFSFile(fs, inputDir, fname, contents)
    +    }
    +
    +    println(s"name node is ${dfs.getNameNode.getNameNodeAddress.getHostName}")
    +    println(s"name node port is ${dfs.getNameNodePort}")
    +
    +    val hdfsAddressDir =
    +      s"hdfs://${dfs.getNameNode.getNameNodeAddress.getHostName}:${dfs.getNameNodePort}$dir"
    +    println(s"HDFS address dir is $hdfsAddressDir")
    +
    +    val res = smallTextFiles(sc, hdfsAddressDir).collect()
    +
    +    assert(res.size === SmallTextFilesSuite.fileNames.size,
    +      "Number of files read out do not fit with the actual value")
    +
    +    for ((fname, contents) <- res) {
    +      assert(SmallTextFilesSuite.fileNames.contains(fname),
    +        s"Missing file name $fname.")
    +      assert(contents.hashCode === SmallTextFilesSuite.hashCodeOfContents(fname),
    +        s"file $fname contents can not match")
    +    }
    +  }
    +
    +  private def createNativeFile(inputDir: JPath, fileName: String, contents: Array[Byte]) = {
    +    val out = new DataOutputStream(new FileOutputStream(s"${inputDir.toString}/$fileName"))
    +    out.write(contents, 0, contents.length)
    +    out.close()
    +    println("Wrote native file")
    +  }
    +
    +  /**
    +   * This code tests the behavior on the native file system. There are three aspects to test:
    +   *    1) whether all files are read;
    +   *    2) whether the file names are read correctly;
    +   *    3) whether the contents match the originals.
    +   */
    +  test("Small file input || native disk IO") {
    +
    +    sc.hadoopConfiguration.clear()
    +
    +    val dir = Files.createTempDirectory("smallfiles")
    +    println(s"native disk address is ${dir.toString}")
    +
    +    SmallTextFilesSuite.fileNames.zip(SmallTextFilesSuite.filesContents).foreach {
    +      case (fname, contents) =>
    +        createNativeFile(dir, fname, contents)
    +    }
    +
    +    val res = smallTextFiles(sc, dir.toString).collect()
    +
    +    assert(res.size === SmallTextFilesSuite.fileNames.size,
    +      "Number of files read out do not fit with the actual value")
    +
    +    for ((fname, contents) <- res) {
    +      assert(SmallTextFilesSuite.fileNames.contains(fname),
    +        s"Missing file name $fname.")
    +      assert(contents.hashCode === SmallTextFilesSuite.hashCodeOfContents(fname),
    +        s"file $fname contents can not match")
    +    }
    +
    +    SmallTextFilesSuite.fileNames.foreach { fname =>
    +      Files.deleteIfExists(JPaths.get(s"${dir.toString}/$fname"))
    +    }
    +    Files.deleteIfExists(dir)
    +  }
    +}
    +
    +/**
    + * Some final values are defined here. chineseWordsSpark refers to the Chinese character version
    + * of "Spark", encoded as UTF-8 bytes with a '\n' at the end. fileNames and
    + * filesContents represent the test data that will be used later. hashCodeOfContents is a Map from
    + * fileName to the hash code of the contents, used to compare contents, i.e. the
    + * contents read back should be the same as the ones written.
    + */
    +
    +object SmallTextFilesSuite {
    +  private val chineseWordsSpark = Array(
    +    0xe7.toByte, 0x81.toByte, 0xab.toByte,
    +    0xe8.toByte, 0x8a.toByte, 0xb1.toByte,
    +    '\n'.toByte)
    +
    +  private val fileNames = Array("part-00000", "part-00001", "part-00002")
    +
    +  private val filesContents = Array(7, 70, 700).map { upperBound =>
    +    Stream.continually(chineseWordsSpark.toList.toStream).flatten.take(upperBound).toArray
    +  }
    +
    +  private val hashCodeOfContents = fileNames.zip(filesContents).map { case (fname, contents) =>
    +    fname -> new Text(contents).toString.hashCode
    +  }.toMap
    +
    +  /**
    +   * Functions umask and dirPermission are used to get the default directory permission on the
    +   * current system, so as to give the MiniDFSCluster proper permissions. Otherwise the test
    +   * aborts due to improper permissions.
    +   */
    +  private def umask(): Int = {
    --- End diff --
    
    It is a bug in `MiniDFSCluster`, see the reference [here](http://stackoverflow.com/questions/17625938/hbase-minidfscluster-java-fails-in-certain-environments). MiniDFSCluster creates a fake HDFS on your local disk when running tests, but if it encounters an improper `umask` on a Linux system, it fails with an NPE, which took me a lot of time to track down and fix.
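    
    The workaround boils down to deriving a directory permission from the process umask before the mini cluster starts. A rough Scala sketch of the idea (shelling out for the umask is an assumption here, not necessarily what the suite does):
    
        import java.io.{BufferedReader, InputStreamReader}
    
        // Ask the shell for the current umask and turn it into the octal permission
        // string that MiniDFSCluster expects for dfs.datanode.data.dir.perm
        // (e.g. umask 0022 -> "755").
        def dirPermission(): String = {
          val proc = Runtime.getRuntime.exec(Array("/bin/sh", "-c", "umask"))
          val reader = new BufferedReader(new InputStreamReader(proc.getInputStream))
          val umask = try Integer.parseInt(reader.readLine().trim, 8) finally reader.close()
          (0x1ff & ~umask).toOctalString // 0777 minus the masked bits
        }
    
        // Used as: sc.hadoopConfiguration.set("dfs.datanode.data.dir.perm", dirPermission())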


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38346620
  
    Merged build started.


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-37899280
  
    Merged build started.


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10828993
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileInputFormat.java ---
    @@ -0,0 +1,52 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.input;
    +
    +import java.io.IOException;
    +
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.JobContext;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +/**
    + * The specific InputFormat reads files in HDFS or local disk. It will be called by
    + * HadoopRDD to generate new WholeTextFileRecordReader.
    + */
    +public class WholeTextFileInputFormat
    +  extends CombineFileInputFormat<String, Text> {
    +
    +  @Override
    +  protected boolean isSplitable(JobContext context, Path file) {
    +    return false;
    +  }
    +  @Override
    +  public RecordReader<String, Text> createRecordReader(
    +      InputSplit split,
    +      TaskAttemptContext context) throws IOException {
    +    return new CombineFileRecordReader<String, Text>(
    +      (CombineFileSplit)split,
    --- End diff --
    
    space after ')'


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38248858
  
    Yes. If you're not sure about the right style for something, try to follow the style of the existing codebase.


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-37910847
  
    One or more automated tests failed
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13234/


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10692666
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/util/BatchFileInputFormat.java ---
    @@ -0,0 +1,52 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.util;
    +
    +import java.io.IOException;
    +
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.JobContext;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +/**
    + * The specific InputFormat reads files in HDFS or local disk. It will be called by
    + * HadoopRDD to generate new BatchFileRecordReader.
    + */
    +public class BatchFileInputFormat
    --- End diff --
    
    It depends. If the file length is less than `MAX_BYTES_ALLOCATION`, it reads the whole file out in one go. Otherwise, it reads out several splits of the file, and the fragments of a file are merged together in the `smallTextFiles()` interface.
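    
    The merge itself is just a concatenation per file name. Roughly, in Scala (an illustrative sketch, not the exact code; `mergeFragments` is an invented name):
    
        import org.apache.hadoop.io.Text
        import org.apache.spark.SparkContext._ // for reduceByKey on pair RDDs
        import org.apache.spark.rdd.RDD
    
        // When a large file comes back as several (fileName, chunk) records,
        // concatenating the chunks per key reassembles each file. Note that a
        // shuffle gives no ordering guarantee for the fragments, which is one
        // reason reading whole files directly in the record reader is simpler.
        def mergeFragments(chunks: RDD[(String, Text)]): RDD[(String, String)] =
          chunks.mapValues(_.toString).reduceByKey(_ + _)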


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38128543
  
    Merged build started.


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38253433
  
    Merged build started.


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38643350
  
    Build finished.


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38128624
  
    @mengxr Your advice makes sense. I removed the merge process from `smallTextFiles()` and rewrote the reading logic in the `RecordReader`.


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38759439
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13485/


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10829073
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala ---
    @@ -17,12 +17,14 @@
     
     package org.apache.spark.mllib.util
     
    -import org.apache.spark.SparkContext
    -import org.apache.spark.rdd.RDD
    -import org.apache.spark.SparkContext._
    -
    +import org.apache.hadoop.io.Text
    --- End diff --
    
    organize imports


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10829741
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/util/WholeTextFileSuite.scala ---
    @@ -0,0 +1,218 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.util
    +
    +
    +import java.io.{BufferedReader, DataOutputStream, FileOutputStream, InputStreamReader}
    +import java.nio.file.Files
    +import java.nio.file.{Path => JPath}
    +import java.nio.file.{Paths => JPaths}
    +
    +import scala.collection.immutable.IndexedSeq
    +
    +import org.scalatest.BeforeAndAfterAll
    +import org.scalatest.FunSuite
    +
    +import org.apache.hadoop.fs.FileSystem
    +import org.apache.hadoop.fs.Path
    +import org.apache.hadoop.hdfs.MiniDFSCluster
    +import org.apache.hadoop.io.Text
    +
    +import org.apache.spark.mllib.util.MLUtils._
    +import org.apache.spark.SparkContext
    +
    +/**
    + * Tests HDFS IO and local disk IO of [[wholeTextFile]] in MLUtils. HDFS tests create a mock DFS
    + * in memory, while the local disk test creates a temp directory. All this temporary storage is
    + * deleted at the end.
    + */
    +
    +class WholeTextFileSuite extends FunSuite with BeforeAndAfterAll {
    --- End diff --
    
    Indeed, we only need to write a unit test for `WholeTextFileRecordReader`. It is not necessary to verify that  `wholeTextFile` can load files from HDFS. We are actually testing `newAPIHadoopFile` here. Maybe we can get some suggestions from others.
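    
    A reader-focused test could stay entirely on the local file system, along these lines (a sketch with invented names such as `WholeTextFileReaderSketch`, not the final suite; `WholeTextFileInputFormat` is the class from this PR):
    
        import java.io.{DataOutputStream, File, FileOutputStream}
    
        import org.apache.hadoop.io.Text
        import org.apache.spark.SparkContext
        import org.apache.spark.mllib.input.WholeTextFileInputFormat
    
        // Write a few files into a temp directory, read them back through the new
        // InputFormat, and compare names and contents. No MiniDFSCluster involved,
        // since HDFS itself is not what is under test.
        object WholeTextFileReaderSketch {
          def main(args: Array[String]): Unit = {
            val sc = new SparkContext("local", "test")
            val dir = new File(System.getProperty("java.io.tmpdir"), "wholetext-" + System.nanoTime)
            dir.mkdirs()
            val expected = Map("part-00000" -> "spark\n", "part-00001" -> "spark\nspark\n")
            for ((name, content) <- expected) {
              val out = new DataOutputStream(new FileOutputStream(new File(dir, name)))
              out.write(content.getBytes("UTF-8"))
              out.close()
            }
            val res = sc.newAPIHadoopFile(dir.toString, classOf[WholeTextFileInputFormat],
              classOf[String], classOf[Text], sc.hadoopConfiguration)
              .map { case (path, text) => (path.split("/").last, text.toString) }
              .collect()
            assert(res.toMap == expected, "read-back files do not match what was written")
            sc.stop()
          }
        }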


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38765347
  
    Oh... Is that OK? That's strange...


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38409137
  
    Build started.


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10786403
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/util/SmallTextFilesSuite.scala ---
    @@ -0,0 +1,218 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.util
    +
    +
    +import java.io.{InputStreamReader, BufferedReader, DataOutputStream, FileOutputStream}
    +import java.nio.file.Files
    +import java.nio.file.{Path => JPath}
    +import java.nio.file.{Paths => JPaths}
    +
    +import org.scalatest.BeforeAndAfterAll
    +import org.scalatest.FunSuite
    +
    +import org.apache.hadoop.hdfs.MiniDFSCluster
    +import org.apache.hadoop.fs.FileSystem
    +import org.apache.hadoop.fs.Path
    +import org.apache.hadoop.io.Text
    +
    +import org.apache.spark.mllib.util.MLUtils._
    +import org.apache.spark.SparkContext
    +
    +/**
    + * Tests HDFS IO and local disk IO of [[smallTextFiles]] in MLUtils. HDFS tests create a mock DFS
    + * in memory, while the local disk test creates a temp directory. All this temporary storage is
    + * deleted at the end.
    + */
    +
    +class SmallTextFilesSuite extends FunSuite with BeforeAndAfterAll {
    +  private var sc: SparkContext = _
    +  private var dfs: MiniDFSCluster = _
    +
    +  override def beforeAll() {
    +    sc = new SparkContext("local", "test")
    +    sc.hadoopConfiguration.set("dfs.datanode.data.dir.perm", SmallTextFilesSuite.dirPermission())
    +    dfs = new MiniDFSCluster(sc.hadoopConfiguration, 4, true,
    +                             Array("/rack0", "/rack0", "/rack1", "/rack1"),
    +                             Array("host0", "host1", "host2", "host3"))
    +  }
    +
    +  override def afterAll() {
    +    if (dfs != null) dfs.shutdown()
    +    sc.stop()
    +    System.clearProperty("spark.driver.port")
    +  }
    +
    +
    +  private def createHDFSFile(
    +      fs: FileSystem,
    +      inputDir: Path,
    +      fileName: String,
    +      contents: Array[Byte]) = {
    +    val out: DataOutputStream = fs.create(new Path(inputDir, fileName), true, 4096, 2, 512, null)
    +    out.write(contents, 0, contents.length)
    +    out.close()
    +    System.out.println("Wrote HDFS file")
    +  }
    +
    +  /**
    +   * This code tests the behavior on HDFS. There are three aspects to test:
    +   *    1) whether all files are read;
    +   *    2) whether the file names are read correctly;
    +   *    3) whether the contents match the originals.
    +   */
    +  test("Small file input || HDFS IO") {
    +    val fs: FileSystem = dfs.getFileSystem
    +    val dir = "/foo/"
    +    val inputDir: Path = new Path(dir)
    +
    +    SmallTextFilesSuite.fileNames.zip(SmallTextFilesSuite.filesContents).foreach {
    +      case (fname, contents) =>
    +        createHDFSFile(fs, inputDir, fname, contents)
    +    }
    +
    +    println(s"name node is ${dfs.getNameNode.getNameNodeAddress.getHostName}")
    +    println(s"name node port is ${dfs.getNameNodePort}")
    +
    +    val hdfsAddressDir =
    +      s"hdfs://${dfs.getNameNode.getNameNodeAddress.getHostName}:${dfs.getNameNodePort}$dir"
    +    println(s"HDFS address dir is $hdfsAddressDir")
    +
    +    val res = smallTextFiles(sc, hdfsAddressDir).collect()
    +
    +    assert(res.size === SmallTextFilesSuite.fileNames.size,
    +      "Number of files read out do not fit with the actual value")
    +
    +    for ((fname, contents) <- res) {
    +      assert(SmallTextFilesSuite.fileNames.contains(fname),
    +        s"Missing file name $fname.")
    +      assert(contents.hashCode === SmallTextFilesSuite.hashCodeOfContents(fname),
    +        s"file $fname contents can not match")
    +    }
    +  }
    +
    +  private def createNativeFile(inputDir: JPath, fileName: String, contents: Array[Byte]) = {
    +    val out = new DataOutputStream(new FileOutputStream(s"${inputDir.toString}/$fileName"))
    +    out.write(contents, 0, contents.length)
    +    out.close()
    +    println("Wrote native file")
    +  }
    +
    +  /**
    +   * This code tests the behavior on the native file system. There are three aspects to test:
    +   *    1) whether all files are read;
    +   *    2) whether the file names are read correctly;
    +   *    3) whether the contents match the originals.
    +   */
    +  test("Small file input || native disk IO") {
    +
    +    sc.hadoopConfiguration.clear()
    +
    +    val dir = Files.createTempDirectory("smallfiles")
    +    println(s"native disk address is ${dir.toString}")
    +
    +    SmallTextFilesSuite.fileNames.zip(SmallTextFilesSuite.filesContents).foreach {
    +      case (fname, contents) =>
    +        createNativeFile(dir, fname, contents)
    +    }
    +
    +    val res = smallTextFiles(sc, dir.toString).collect()
    +
    +    assert(res.size === SmallTextFilesSuite.fileNames.size,
    +      "Number of files read out do not fit with the actual value")
    +
    +    for ((fname, contents) <- res) {
    +      assert(SmallTextFilesSuite.fileNames.contains(fname),
    +        s"Missing file name $fname.")
    +      assert(contents.hashCode === SmallTextFilesSuite.hashCodeOfContents(fname),
    +        s"file $fname contents can not match")
    +    }
    +
    +    SmallTextFilesSuite.fileNames.foreach { fname =>
    +      Files.deleteIfExists(JPaths.get(s"${dir.toString}/$fname"))
    +    }
    +    Files.deleteIfExists(dir)
    +  }
    +}
    +
    +/**
    + * Some final values are defined here. chineseWordsSpark refers to the Chinese character version
    + * of "Spark", encoded as UTF-8 bytes with a '\n' at the end. fileNames and
    + * filesContents represent the test data that will be used later. hashCodeOfContents is a Map from
    + * fileName to the hash code of the contents, used to compare contents, i.e. the
    + * contents read back should be the same as the ones written.
    + */
    +
    +object SmallTextFilesSuite {
    +  private val chineseWordsSpark = Array(
    --- End diff --
    
    Is it necessary to use UTF-8 characters to create the test data?


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38727263
  
    lgtm other than the couple minor comments


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10877199
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileRecordReader.java ---
    @@ -0,0 +1,104 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.input;
    +
    +import java.io.IOException;
    +
    +import com.google.common.io.Closeables;
    +import org.apache.commons.io.IOUtils;
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.fs.FSDataInputStream;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +/**
    + * An <code>org.apache.hadoop.mapreduce.RecordReader</code> for reading a whole text file out in
    + * (filename, content) format. Each element in the split is a record of a unique, whole file. The
    + * file name is the full path, for easy deduplication.
    + */
    +public class WholeTextFileRecordReader extends RecordReader<String, Text> {
    +  private Path path;
    +
    +  private String key = null;
    +  private Text value = null;
    +
    +  private boolean processed = false;
    +
    +  private FileSystem fs;
    +
    +  public WholeTextFileRecordReader(
    +      CombineFileSplit split,
    +      TaskAttemptContext context,
    +      Integer index)
    +    throws IOException {
    +    path = split.getPath(index);
    +    fs = path.getFileSystem(context.getConfiguration());
    +  }
    +
    +  @Override
    +  public void initialize(InputSplit arg0, TaskAttemptContext arg1)
    +    throws IOException, InterruptedException {
    +  }
    +
    +  @Override
    +  public void close() throws IOException {
    +  }
    +
    +  @Override
    +  public float getProgress() throws IOException {
    +    return processed ? 1.0f : 0.0f;
    +  }
    +
    +  @Override
    +  public String getCurrentKey() throws IOException, InterruptedException {
    +    return key;
    +  }
    +
    +  @Override
    +  public Text getCurrentValue() throws IOException, InterruptedException{
    +    return value;
    +  }
    +
    +  @Override
    +  public boolean nextKeyValue() throws IOException {
    +    if (!processed) {
    +      if (key == null) {
    +        key = path.toString();
    +      }
    +      if (value == null) {
    +        value = new Text();
    +      }
    +
    +      FSDataInputStream fileIn = null;
    +      try {
    +        fileIn = fs.open(path);
    +        byte[] innerBuffer = IOUtils.toByteArray(fileIn);
    --- End diff --
    
    @mengxr @yinxusen PS I am happy to take on removing the use of commons-io in favor of equivalents in Guava. There is a bit more than this usage, but it's easy stuff. Commons IO is fine, but it's not necessary to use it here, and it is one of those dependencies that could collide with other versions in Hadoop. If anyone nods I'll open a separate PR.
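    
    For the usage quoted above, the swap is one line; roughly (a sketch of the Guava equivalent; `readAll` is an invented name):
    
        import com.google.common.io.ByteStreams
        import org.apache.hadoop.fs.FSDataInputStream
    
        // commons-io: IOUtils.toByteArray(fileIn)
        // Guava equivalent, dropping the commons-io import:
        def readAll(fileIn: FSDataInputStream): Array[Byte] =
          ByteStreams.toByteArray(fileIn)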


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38769750
  
    Merged build finished.


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10865844
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileInputFormat.java ---
    @@ -0,0 +1,53 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.input;
    +
    +import java.io.IOException;
    +
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.JobContext;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +/**
    + * The specific InputFormat reads files in HDFS or local disk into pair (filename, content) format.
    + * It will be called by HadoopRDD to generate new WholeTextFileRecordReader.
    + */
    +public class WholeTextFileInputFormat
    +  extends CombineFileInputFormat<String, Text> {
    --- End diff --
    
    Does this line fit into the one above?


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38347604
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13340/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10965949
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/util/WholeTextFileRecordReaderSuite.scala ---
    @@ -0,0 +1,105 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.util
    +
    +import java.io.DataOutputStream
    +import java.io.FileOutputStream
    +import java.nio.file.Files
    --- End diff --
    
    Does nio.file.Files exist on JDK 6? If not... let's change it. 
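
    (java.nio.file was introduced in JDK 7, so it is not available on JDK 6. A minimal sketch of a JDK 6-friendly alternative for the temp directory, using Guava, which this code already depends on:)

        import java.io.File
        import com.google.common.io.{Files => GuavaFiles}

        // Works on JDK 6: creates a fresh temp directory under java.io.tmpdir.
        val dir: File = GuavaFiles.createTempDir()
        // write the test files into `dir`, then delete it in afterAll()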


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38667186
  
     Build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10690831
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/util/BatchFileInputFormat.java ---
    @@ -0,0 +1,52 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.util;
    +
    +import java.io.IOException;
    +
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.JobContext;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +/**
    + * The specific InputFormat reads files in HDFS or local disk. It will be called by
    + * HadoopRDD to generate new BatchFileRecordReader.
    + */
    +public class BatchFileInputFormat
    +        extends CombineFileInputFormat<String, Text> {
    +
    +    @Override
    +    protected boolean isSplitable(JobContext context, Path file) {
    +        return false;
    +    }
    +    @Override
    +    public RecordReader<String, Text> createRecordReader(
    +            InputSplit split,
    +            TaskAttemptContext context) throws IOException {
    +        return new CombineFileRecordReader<String, Text>(
    +                (CombineFileSplit)split,
    --- End diff --
    
    put a space after `)`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10829857
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/util/WholeTextFileSuite.scala ---
    @@ -0,0 +1,218 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.util
    +
    +
    +import java.io.{BufferedReader, DataOutputStream, FileOutputStream, InputStreamReader}
    +import java.nio.file.Files
    +import java.nio.file.{Path => JPath}
    +import java.nio.file.{Paths => JPaths}
    +
    +import scala.collection.immutable.IndexedSeq
    +
    +import org.scalatest.BeforeAndAfterAll
    +import org.scalatest.FunSuite
    +
    +import org.apache.hadoop.fs.FileSystem
    +import org.apache.hadoop.fs.Path
    +import org.apache.hadoop.hdfs.MiniDFSCluster
    +import org.apache.hadoop.io.Text
    +
    +import org.apache.spark.mllib.util.MLUtils._
    +import org.apache.spark.SparkContext
    +
    +/**
    + * Tests HDFS IO and local disk IO of [[wholeTextFile]] in MLUtils. HDFS tests create a mock DFS in
    + * memory, while the local disk test creates a temp directory. All these temporary storages are
    + * deleted in the end.
    + */
    +
    +class WholeTextFileSuite extends FunSuite with BeforeAndAfterAll {
    --- End diff --
    
    Yep, it will be much easier if we just test `WholeTextFileRecordReader` alone. I can construct a `CombineFileSplit` and test it. The correctness of reading files from HDFS and native disk is guaranteed by other tests, I think.
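
    A minimal sketch of that standalone test, assuming a hand-built split over one hypothetical local file (constructing a `TaskAttemptContext` differs across Hadoop versions, so it is left abstract here):

        import org.apache.hadoop.fs.Path
        import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit

        // Build a one-file split by hand and drive the reader directly.
        val paths = Array(new Path("file:///tmp/wholefiles/part-00000"))  // hypothetical file
        val lengths = Array(42L)                                          // its size in bytes
        val split = new CombineFileSplit(paths, lengths)
        // val reader = new WholeTextFileRecordReader(split, context, 0)
        // while (reader.nextKeyValue()) { /* check getCurrentKey/getCurrentValue */ }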


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10829067
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileRecordReader.java ---
    @@ -0,0 +1,103 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.input;
    +
    +import java.io.IOException;
    +
    +import com.google.common.io.Closeables;
    +import org.apache.commons.io.IOUtils;
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.fs.FSDataInputStream;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +/**
    + * Reads an entire file out in <filename, content> format.
    + */
    +
    +public class WholeTextFileRecordReader extends RecordReader<String, Text> {
    +  private Path path;
    +
    +  private String key = null;
    +  private Text value = null;
    +
    +  private boolean processed = false;
    +
    +  private FileSystem fs;
    +
    +  public WholeTextFileRecordReader(
    --- End diff --
    
    Need doc.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38761534
  
    Sure, let me update it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10955063
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/MLContext.scala ---
    @@ -0,0 +1,56 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib
    +
    +import org.apache.hadoop.io.Text
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.mllib.input.WholeTextFileInputFormat
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.SparkContext
    +import org.apache.spark.SparkContext._
    +
    +/**
    + * Extra functions available on SparkContext of mllib through an implicit conversion. Import
    + * `org.apache.spark.mllib.MLContext._` at the top of your program to use these functions.
    + */
    +class MLContext(self: SparkContext) extends Logging {
    --- End diff --
    
    Logging is not used.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38255475
  
    Merged build finished.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10829200
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala ---
    @@ -120,4 +122,22 @@ object MLUtils {
         }
         sum
       }
    +
    +  /**
    +   * Reads a bunch of whole files from HDFS, or a local file system (available on all nodes), or any
    +   * Hadoop-supported file system URI, and returns an RDD[(String, String)].
    +   *
    +   * @param path The directory you specify, such as
    +   *             hdfs://[address]:[port]/[dir]
    +   *
    +   * @return RDD[(fileName: String, content: String)]
    +   *         i.e. the first is the file name of a file, the second one is its content.
    --- End diff --
    
    do not need to repeat `RDD[(fileName: String, content: String)]`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10965981
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileInputFormat.java ---
    @@ -0,0 +1,53 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    --- End diff --
    
    Sorry to ask for this right now when you are already very much into it, but do you mind rewriting the two java files into scala? It should be easier & shorter ... and then you can mark the classes as private[mllib]
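
    For reference, a minimal sketch of what the Scala rewrite of the input format could look like, following the Java version quoted in this thread and marking it `private[mllib]` as suggested:

        package org.apache.spark.mllib.input

        import org.apache.hadoop.fs.Path
        import org.apache.hadoop.io.Text
        import org.apache.hadoop.mapreduce.{InputSplit, JobContext, RecordReader, TaskAttemptContext}
        import org.apache.hadoop.mapreduce.lib.input.{CombineFileInputFormat, CombineFileRecordReader, CombineFileSplit}

        private[mllib] class WholeTextFileInputFormat extends CombineFileInputFormat[String, Text] {

          // Each file is one record, so it must never be split across readers.
          override protected def isSplitable(context: JobContext, file: Path): Boolean = false

          override def createRecordReader(
              split: InputSplit,
              context: TaskAttemptContext): RecordReader[String, Text] = {
            new CombineFileRecordReader[String, Text](
              split.asInstanceOf[CombineFileSplit],
              context,
              classOf[WholeTextFileRecordReader])
          }
        }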


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10955158
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/MLContext.scala ---
    @@ -0,0 +1,56 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib
    +
    +import org.apache.hadoop.io.Text
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.mllib.input.WholeTextFileInputFormat
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.SparkContext
    +import org.apache.spark.SparkContext._
    +
    +/**
    + * Extra functions available on SparkContext of mllib through an implicit conversion. Import
    + * `org.apache.spark.mllib.MLContext._` at the top of your program to use these functions.
    + */
    +class MLContext(self: SparkContext) extends Logging {
    +
    +  /**
    +   * Read a directory of text files from HDFS, a local file system (available on all nodes), or any
    +   * Hadoop-supported file system URI. Each file is read as a single record and returned in a
    +   * key-value pair, where the key is the path of each file, and the value is the content of each
    +   * file.
    +   */
    +  def wholeTextFile(path: String): RDD[(String, String)] = {
    +    self.newAPIHadoopFile(
    +      path,
    +      classOf[WholeTextFileInputFormat],
    +      classOf[String],
    +      classOf[Text]).mapValues(_.toString)
    +  }
    +}
    +
    +
    +/**
    + * The MLContext object contains a number of implicit conversions and parameters for use with
    + * various mllib features.
    + */
    +object MLContext extends Logging {
    --- End diff --
    
    Remove `Logging`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10690775
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/util/BatchFileInputFormat.java ---
    @@ -0,0 +1,52 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.util;
    --- End diff --
    
    Instead of `util`, how about `io` or `input`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10865907
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/util/WholeTextFileRecordReaderSuite.scala ---
    @@ -0,0 +1,114 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.util
    +
    +
    +import java.io.DataOutputStream
    +import java.io.FileOutputStream
    +import java.nio.file.Files
    +import java.nio.file.Path
    +import java.nio.file.Paths
    +
    +import scala.collection.immutable.IndexedSeq
    +
    +import org.scalatest.BeforeAndAfterAll
    +import org.scalatest.FunSuite
    +
    +import org.apache.hadoop.io.Text
    +
    +import org.apache.spark.mllib.util.MLUtils._
    +import org.apache.spark.SparkContext
    +
    +/**
    + * Tests the correctness of [[org.apache.spark.mllib.input.WholeTextFileRecordReader]]. A temporary
    + * directory is created as fake input. The temporary storage is deleted in the end.
    + */
    +class WholeTextFileRecordReaderSuite extends FunSuite with BeforeAndAfterAll {
    +  private var sc: SparkContext = _
    +
    +  override def beforeAll() {
    +    sc = new SparkContext("local", "test")
    +  }
    +
    +  override def afterAll() {
    +    sc.stop()
    +  }
    +
    +  private def createNativeFile(inputDir: Path, fileName: String, contents: Array[Byte]) = {
    +    val out = new DataOutputStream(new FileOutputStream(s"${inputDir.toString}/$fileName"))
    +    out.write(contents, 0, contents.length)
    +    out.close()
    +    println("Wrote native file")
    +  }
    +
    +  /**
    +   * This code will test the behaviors of WholeTextFileRecordReader based on local disk. There are
    +   * three aspects to check:
    +   *   1) whether all files are read;
    +   *   2) whether the file names are read correctly;
    +   *   3) whether the contents are the same.
    +   */
    +  test("Correctness of WholeTextFileRecordReader.") {
    +
    +    val dir = Files.createTempDirectory("wholefiles")
    +    println(s"native disk address is ${dir.toString}")
    +
    +    WholeTextFileRecordReaderSuite.fileNames
    +      .zip(WholeTextFileRecordReaderSuite.filesContents)
    +      .foreach { case (fname, contents) =>
    +        createNativeFile(dir, fname, contents)
    +    }
    +
    +    val res = wholeTextFile(sc, dir.toString).collect()
    +
    +    assert(res.size === WholeTextFileRecordReaderSuite.fileNames.size,
    +      "Number of files read out do not fit with the actual value")
    --- End diff --
    
    put res.size and fileNames.size into error message
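
    i.e., something along these lines:

        assert(res.size === WholeTextFileRecordReaderSuite.fileNames.size,
          s"Number of files read out does not fit with the actual value: " +
          s"${res.size} vs. ${WholeTextFileRecordReaderSuite.fileNames.size}.")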


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38767554
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-37970513
  
    @yinxusen I'm still confused by the implementation. You want to make the entire content of a text file the value returned by `smallTextFiles`. In `BatchFileRecordReader`, the buffer size is limited by `MAX_BYTES_ALLOCATION`, and you merge the buffers from the same file in `smallTextFiles` to assemble the entire content. Is it necessary to take this approach? How does it differ from a simple `CombineWholeTextFileInputFormat`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38128995
  
    Merged build finished.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen closed the pull request at:

    https://github.com/apache/spark/pull/164


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38524328
  
    Yes, we can put it in MLContext and add an implicit conversion from SparkContext to MLContext, so that the API can be added in a non-intrusive manner.
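
    In sketch form (the `wholeTextFile` body is elided here; the full version appears in the MLContext diff elsewhere in this thread):

        import scala.language.implicitConversions

        import org.apache.spark.SparkContext
        import org.apache.spark.rdd.RDD

        class MLContext(self: SparkContext) {
          def wholeTextFile(path: String): RDD[(String, String)] = ???  // elided
        }

        object MLContext {
          // After `import MLContext._`, sc.wholeTextFile(...) resolves through this
          // conversion, so nothing in core SparkContext has to change.
          implicit def sparkContextToMLContext(sc: SparkContext): MLContext =
            new MLContext(sc)
        }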


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10828986
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileInputFormat.java ---
    @@ -0,0 +1,52 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.input;
    +
    +import java.io.IOException;
    +
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.JobContext;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +/**
    + * The specific InputFormat reads files in HDFS or local disk. It will be called by
    + * HadoopRDD to generate new WholeTextFileRecordReader.
    --- End diff --
    
    Need definition of `key` and `value`.
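
    e.g., a sketch of the missing doc, with wording borrowed from the record reader's doc later in this thread:

        /**
         * The specific InputFormat reads files in HDFS or local disk. Each file is
         * read as a single record: the key is the path of the file and the value is
         * the entire content of the file. It will be called by HadoopRDD to generate
         * new WholeTextFileRecordReader.
         */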


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38129529
  
    Ah... It seems that Jenkins is causing the problem. The tests for the last two commits failed due to this error:
    
    > Fetching upstream changes from https://github.com/apache/spark.git
    > ERROR: Timeout after 10 minutes
    > FATAL: Failed to fetch from https://github.com/apache/spark.git
    
    Can someone help take a look? Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10828991
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileInputFormat.java ---
    @@ -0,0 +1,52 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.input;
    +
    +import java.io.IOException;
    +
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.JobContext;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +/**
    + * The specific InputFormat reads files in HDFS or local disk. It will be called by
    + * HadoopRDD to generate new WholeTextFileRecordReader.
    + */
    +public class WholeTextFileInputFormat
    +  extends CombineFileInputFormat<String, Text> {
    +
    +  @Override
    +  protected boolean isSplitable(JobContext context, Path file) {
    +    return false;
    +  }
    +  @Override
    --- End diff --
    
    empty line between methods.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10829033
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileRecordReader.java ---
    @@ -0,0 +1,103 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.input;
    +
    +import java.io.IOException;
    +
    +import com.google.common.io.Closeables;
    +import org.apache.commons.io.IOUtils;
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.fs.FSDataInputStream;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +/**
    + * Reads an entire file out in <filename, content> format.
    + */
    +
    +public class WholeTextFileRecordReader extends RecordReader<String, Text> {
    +  private Path path;
    +
    +  private String key = null;
    +  private Text value = null;
    +
    +  private boolean processed = false;
    +
    +  private FileSystem fs;
    +
    +  public WholeTextFileRecordReader(
    +      CombineFileSplit split,
    +      TaskAttemptContext context,
    +      Integer index)
    +    throws IOException {
    +    path = split.getPath(index);
    +    fs = path.getFileSystem(context.getConfiguration());
    +  }
    +
    +  @Override
    +  public void initialize(InputSplit arg0, TaskAttemptContext arg1)
    +    throws IOException, InterruptedException {
    +  }
    +
    +  @Override
    +  public void close() throws IOException {
    +  }
    +
    +  @Override
    +  public float getProgress() throws IOException {
    +    return processed ? 1.0f : 0.0f;
    +  }
    +
    +  @Override
    +  public String getCurrentKey() throws IOException, InterruptedException {
    +        return key;
    --- End diff --
    
    fix indentation


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10966132
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileInputFormat.java ---
    @@ -0,0 +1,53 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    --- End diff --
    
    Never mind, I'd like to do that. Let me try to change them.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10912869
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileRecordReader.java ---
    @@ -0,0 +1,104 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.input;
    +
    +import java.io.IOException;
    +
    +import com.google.common.io.Closeables;
    +import org.apache.commons.io.IOUtils;
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.fs.FSDataInputStream;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +/**
    + * An <code>org.apache.hadoop.mapreduce.RecordReader</code> for reading whole text file out in
    + * (filename, content) format. Each element in split is an record of a unique, whole file. File name
    --- End diff --
    
    Use `WholeTextInputFormat`'s doc as a reference and update the doc here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38671709
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13468/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10865893
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/util/WholeTextFileRecordReaderSuite.scala ---
    @@ -0,0 +1,114 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.util
    +
    +
    +import java.io.DataOutputStream
    +import java.io.FileOutputStream
    +import java.nio.file.Files
    +import java.nio.file.Path
    +import java.nio.file.Paths
    +
    +import scala.collection.immutable.IndexedSeq
    +
    +import org.scalatest.BeforeAndAfterAll
    +import org.scalatest.FunSuite
    +
    +import org.apache.hadoop.io.Text
    +
    +import org.apache.spark.mllib.util.MLUtils._
    +import org.apache.spark.SparkContext
    +
    +/**
    + * Tests the correctness of [[org.apache.spark.mllib.input.WholeTextFileRecordReader]]. A temporary
    + * directory is created as fake input. The temporary storage is deleted in the end.
    + */
    +class WholeTextFileRecordReaderSuite extends FunSuite with BeforeAndAfterAll {
    +  private var sc: SparkContext = _
    +
    +  override def beforeAll() {
    +    sc = new SparkContext("local", "test")
    +  }
    +
    +  override def afterAll() {
    +    sc.stop()
    +  }
    +
    +  private def createNativeFile(inputDir: Path, fileName: String, contents: Array[Byte]) = {
    +    val out = new DataOutputStream(new FileOutputStream(s"${inputDir.toString}/$fileName"))
    +    out.write(contents, 0, contents.length)
    +    out.close()
    +    println("Wrote native file")
    --- End diff --
    
    remove println


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38008356
  
    @mengxr There are three reasons on my side:
    
    * The file length could be larger than `Int.MaxValue`, while the buffer size of `FSDataInputStream` is usually smaller than `Int.MaxValue`.
    
    * The `innerBuffer` I use is an array of bytes; we cannot initialize an array with a length larger than `Int.MaxValue`.
    
    * `byte[] innerBuffer = new byte[maxBufferLength];` could cause an OOM if `maxBufferLength` is too large to fit in the currently usable memory.
    
    Indeed, the scenario in which a file is larger than `Int.MaxValue` is rare, but it could happen. Mahout has also encountered the problem, but it just uses `(int) fileLength` to cast the `long` to an `int`.
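
    A tiny sketch of the constraint behind the first two points (JVM arrays are indexed by `Int`, and a careless cast overflows):

        val fileLength: Long = 3L * 1024 * 1024 * 1024  // e.g. a 3 GB file
        // new Array[Byte](fileLength)                  // does not compile: an Int is required
        val n = fileLength.toInt                        // the Mahout-style cast: overflows to a negative value
        // new Array[Byte](n)                           // throws NegativeArraySizeException at runtime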


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38562621
  
    Build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38653629
  
    @yinxusen It looks good to me now. @rxin , could you please make a final scan and merge it if it looks good to you? Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38409136
  
     Build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38769753
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13493/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10965844
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileRecordReader.java ---
    @@ -0,0 +1,103 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.input;
    +
    +import java.io.IOException;
    +
    +import com.google.common.io.ByteStreams;
    +import com.google.common.io.Closeables;
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.fs.FSDataInputStream;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +/**
    + * An {@link org.apache.hadoop.mapreduce.RecordReader} for reading a single whole text file out in a
    + * key-value pair, where the key is the file path and the value is the entire content of the file.
    + */
    +public class WholeTextFileRecordReader extends RecordReader<String, Text> {
    +  private Path path;
    +
    +  private String key = null;
    +  private Text value = null;
    --- End diff --
    
    Can we change the value type to String here directly? Then you don't need the extra mapValues
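
    i.e., the call site sketched both ways (the first matches the MLContext diff in this thread; the second assumes the reader is changed to produce String values):

        // current: Text values force a conversion on every record
        val rdd = sc.newAPIHadoopFile(path, classOf[WholeTextFileInputFormat],
          classOf[String], classOf[Text]).mapValues(_.toString)

        // proposed: with String values, the mapValues step disappears
        // val rdd = sc.newAPIHadoopFile(path, classOf[WholeTextFileInputFormat],
        //   classOf[String], classOf[String])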


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10965746
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileRecordReader.java ---
    @@ -0,0 +1,103 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.input;
    +
    +import java.io.IOException;
    +
    +import com.google.common.io.ByteStreams;
    +import com.google.common.io.Closeables;
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.fs.FSDataInputStream;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +/**
    + * An {@link org.apache.hadoop.mapreduce.RecordReader} for reading a single whole text file out in a
    + * key-value pair, where the key is the file path and the value is the entire content of the file.
    + */
    +public class WholeTextFileRecordReader extends RecordReader<String, Text> {
    +  private Path path;
    +
    +  private String key = null;
    +  private Text value = null;
    +
    +  private boolean processed = false;
    +
    +  private FileSystem fs;
    +
    +  public WholeTextFileRecordReader(
    +      CombineFileSplit split,
    +      TaskAttemptContext context,
    +      Integer index)
    +    throws IOException {
    +    path = split.getPath(index);
    +    fs = path.getFileSystem(context.getConfiguration());
    +  }
    +
    +  @Override
    +  public void initialize(InputSplit arg0, TaskAttemptContext arg1)
    +    throws IOException, InterruptedException {
    +  }
    +
    +  @Override
    +  public void close() throws IOException {
    +  }
    +
    +  @Override
    +  public float getProgress() throws IOException {
    +    return processed ? 1.0f : 0.0f;
    +  }
    +
    +  @Override
    +  public String getCurrentKey() throws IOException, InterruptedException {
    +    return key;
    +  }
    +
    +  @Override
    +  public Text getCurrentValue() throws IOException, InterruptedException{
    --- End diff --
    
    add a space after InterruptedException


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10692570
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/util/BatchFileRecordReader.java ---
    @@ -0,0 +1,117 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.util;
    +
    +import java.io.IOException;
    +
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.fs.FSDataInputStream;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +/**
    + * Reads an entire file out in bytes format in <filename, content> format.
    --- End diff --
    
    This is really a mistake that I overlooked.
    
    I use `MAX_BYTES_ALLOCATION` here to handle files whose length exceeds what an `Int.MaxValue` can represent. The Mahout implementation simply ignores that case and casts the `long` directly to an `int`. Although the semantics of the interface are "small files", we should not limit the input file length for end users.
    
    I treat the content as a byte array encapsulated in a `Text`, and I will join the slices of a single file back together in the `smallTextFiles()` interface. Thanks to the locality of the splits, I can merge them without a shuffle. I forgot to do that in this version; I'll fix it now.
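
    To make the merge step concrete, here is a minimal sketch (illustrative
    only, not the PR's code; `mergeSlices` is a made-up name) of stitching the
    slices back together inside a partition, assuming all slices of a file
    land in the same partition and arrive in order:
    
    ```scala
    import java.io.ByteArrayOutputStream
    import java.nio.charset.Charset
    
    import scala.collection.mutable.LinkedHashMap
    
    // Concatenate the byte slices of each file, then decode once per file.
    def mergeSlices(slices: Iterator[(String, Array[Byte])]): Iterator[(String, String)] = {
      val buffers = LinkedHashMap.empty[String, ByteArrayOutputStream]
      for ((fileName, bytes) <- slices) {
        buffers.getOrElseUpdate(fileName, new ByteArrayOutputStream()).write(bytes)
      }
      buffers.iterator.map { case (fileName, buf) =>
        (fileName, new String(buf.toByteArray, Charset.forName("UTF-8")))
      }
    }
    
    // Applied with mapPartitions, so no shuffle is needed:
    // rdd.mapPartitions(mergeSlices)
    ```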


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10786751
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/util/SmallTextFilesSuite.scala ---
    @@ -0,0 +1,218 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.util
    +
    +
    +import java.io.{InputStreamReader, BufferedReader, DataOutputStream, FileOutputStream}
    +import java.nio.file.Files
    +import java.nio.file.{Path => JPath}
    +import java.nio.file.{Paths => JPaths}
    +
    +import org.scalatest.BeforeAndAfterAll
    +import org.scalatest.FunSuite
    +
    +import org.apache.hadoop.hdfs.MiniDFSCluster
    +import org.apache.hadoop.fs.FileSystem
    +import org.apache.hadoop.fs.Path
    +import org.apache.hadoop.io.Text
    +
    +import org.apache.spark.mllib.util.MLUtils._
    +import org.apache.spark.SparkContext
    +
    +/**
    + * Tests HDFS IO and local disk IO of [[smallTextFiles]] in MLUtils. HDFS tests create a mock DFS
    + * in memory, while the local disk test creates a temp directory. All of this temporary storage is
    + * deleted in the end.
    + */
    +
    +class SmallTextFilesSuite extends FunSuite with BeforeAndAfterAll {
    +  private var sc: SparkContext = _
    +  private var dfs: MiniDFSCluster = _
    +
    +  override def beforeAll() {
    +    sc = new SparkContext("local", "test")
    +    sc.hadoopConfiguration.set("dfs.datanode.data.dir.perm", SmallTextFilesSuite.dirPermission())
    +    dfs = new MiniDFSCluster(sc.hadoopConfiguration, 4, true,
    +                             Array("/rack0", "/rack0", "/rack1", "/rack1"),
    +                             Array("host0", "host1", "host2", "host3"))
    +  }
    +
    +  override def afterAll() {
    +    if (dfs != null) dfs.shutdown()
    +    sc.stop()
    +    System.clearProperty("spark.driver.port")
    +  }
    +
    +
    +  private def createHDFSFile(
    +      fs: FileSystem,
    +      inputDir: Path,
    +      fileName: String,
    +      contents: Array[Byte]) = {
    +    val out: DataOutputStream = fs.create(new Path(inputDir, fileName), true, 4096, 2, 512, null)
    +    out.write(contents, 0, contents.length)
    +    out.close()
    +    System.out.println("Wrote HDFS file")
    +  }
    +
    +  /**
    +   * This code tests the behavior on HDFS. There are three aspects to check:
    +   *    1) whether all files are read.
    +   *    2) whether the file names are read correctly.
    +   *    3) whether the contents match the originals.
    +   */
    +  test("Small file input || HDFS IO") {
    +    val fs: FileSystem = dfs.getFileSystem
    +    val dir = "/foo/"
    +    val inputDir: Path = new Path(dir)
    +
    +    SmallTextFilesSuite.fileNames.zip(SmallTextFilesSuite.filesContents).foreach {
    +      case (fname, contents) =>
    +        createHDFSFile(fs, inputDir, fname, contents)
    +    }
    +
    +    println(s"name node is ${dfs.getNameNode.getNameNodeAddress.getHostName}")
    +    println(s"name node port is ${dfs.getNameNodePort}")
    +
    +    val hdfsAddressDir =
    +      s"hdfs://${dfs.getNameNode.getNameNodeAddress.getHostName}:${dfs.getNameNodePort}$dir"
    +    println(s"HDFS address dir is $hdfsAddressDir")
    +
    +    val res = smallTextFiles(sc, hdfsAddressDir).collect()
    +
    +    assert(res.size === SmallTextFilesSuite.fileNames.size,
    +      "Number of files read out do not fit with the actual value")
    +
    +    for ((fname, contents) <- res) {
    +      assert(SmallTextFilesSuite.fileNames.contains(fname),
    +        s"Missing file name $fname.")
    +      assert(contents.hashCode === SmallTextFilesSuite.hashCodeOfContents(fname),
    +        s"file $fname contents can not match")
    +    }
    +  }
    +
    +  private def createNativeFile(inputDir: JPath, fileName: String, contents: Array[Byte]) = {
    +    val out = new DataOutputStream(new FileOutputStream(s"${inputDir.toString}/$fileName"))
    +    out.write(contents, 0, contents.length)
    +    out.close()
    +    println("Wrote native file")
    +  }
    +
    +  /**
    +   * This code tests the behavior on the native file system. There are three aspects to check:
    +   *    1) whether all files are read.
    +   *    2) whether the file names are read correctly.
    +   *    3) whether the contents match the originals.
    +   */
    +  test("Small file input || native disk IO") {
    +
    +    sc.hadoopConfiguration.clear()
    +
    +    val dir = Files.createTempDirectory("smallfiles")
    +    println(s"native disk address is ${dir.toString}")
    +
    +    SmallTextFilesSuite.fileNames.zip(SmallTextFilesSuite.filesContents).foreach {
    +      case (fname, contents) =>
    +        createNativeFile(dir, fname, contents)
    +    }
    +
    +    val res = smallTextFiles(sc, dir.toString).collect()
    +
    +    assert(res.size === SmallTextFilesSuite.fileNames.size,
    +      "Number of files read out do not fit with the actual value")
    +
    +    for ((fname, contents) <- res) {
    +      assert(SmallTextFilesSuite.fileNames.contains(fname),
    +        s"Missing file name $fname.")
    +      assert(contents.hashCode === SmallTextFilesSuite.hashCodeOfContents(fname),
    +        s"file $fname contents can not match")
    +    }
    +
    +    SmallTextFilesSuite.fileNames.foreach { fname =>
    +      Files.deleteIfExists(JPaths.get(s"${dir.toString}/$fname"))
    +    }
    +    Files.deleteIfExists(dir)
    +  }
    +}
    +
    +/**
    + * Some final values are defined here. chineseWordsSpark refers to the Chinese-character version
    + * of "Spark"; the bytes are encoded together in UTF-8, with a '\n' at the end. fileNames and
    + * filesContents represent the test data used later. hashCodeOfContents is a Map from
    + * fileName to the hash code of the contents, used for comparing contents, i.e. the contents
    + * written in should be the same as the contents read out.
    + */
    +
    +object SmallTextFilesSuite {
    +  private val chineseWordsSpark = Array(
    --- End diff --
    
    Hmm... sorry, I forgot to update the comments in the test suite along with the other code.
    
    That part is no longer necessary; I'll fix it.


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38410992
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13379/


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10690859
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/util/BatchFileRecordReader.java ---
    @@ -0,0 +1,117 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.util;
    +
    +import java.io.IOException;
    +
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.fs.FSDataInputStream;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +/**
    + * Reads an entire file out in bytes format in <filename, content> format.
    + */
    +
    +public class BatchFileRecordReader extends RecordReader<String, Text> {
    +    private long startOffset;
    +    private long end;
    +    private long pos;
    +    private Path path;
    +
    +    private static final int MAX_BYTES_ALLOCATION = 64 * 1024 * 1024;
    +
    +    private String key = null;
    +    private Text value = null;
    +
    +    private FSDataInputStream fileIn;
    +
    +    public BatchFileRecordReader(
    +            CombineFileSplit split,
    +            TaskAttemptContext context,
    +            Integer index)
    +            throws IOException {
    +        path = split.getPath(index);
    +        startOffset = split.getOffset(index);
    +        pos = startOffset;
    +        end = startOffset + split.getLength(index);
    +
    +        FileSystem fs = path.getFileSystem(context.getConfiguration());
    +        fileIn = fs.open(path);
    +        fileIn.seek(startOffset);
    +    }
    +
    +    @Override
    +    public void initialize(InputSplit arg0, TaskAttemptContext arg1)
    +            throws IOException, InterruptedException {}
    +
    +    @Override
    +    public void close() throws IOException {
    +        if (fileIn != null) {
    +            fileIn.close();
    +        }
    +    }
    +
    +    @Override
    +    public float getProgress() throws IOException {
    +        if (startOffset == end) return 0;
    --- End diff --
    
    should return 1.0f?
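
    That is, the usual RecordReader convention would be roughly (a sketch in
    Scala for brevity, reusing the `startOffset`/`end`/`pos` fields from the
    diff above):
    
    ```scala
    // An empty split (startOffset == end) is trivially 100% done; otherwise
    // report the fraction of bytes consumed so far.
    override def getProgress: Float = {
      if (startOffset == end) {
        1.0f
      } else {
        math.min(1.0f, (pos - startOffset).toFloat / (end - startOffset))
      }
    }
    ```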


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10993628
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/input/WholeTextFileRecordReader.scala ---
    @@ -0,0 +1,72 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.input
    +
    +import com.google.common.io.{ByteStreams, Closeables}
    +
    +import org.apache.hadoop.io.Text
    +import org.apache.hadoop.mapreduce.InputSplit
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit
    +import org.apache.hadoop.mapreduce.RecordReader
    +import org.apache.hadoop.mapreduce.TaskAttemptContext
    +
    +/**
    + * A [[org.apache.hadoop.mapreduce.RecordReader RecordReader]] for reading a single whole text file
    + * out in a key-value pair, where the key is the file path and the value is the entire content of
    + * the file.
    + */
    +private[mllib] class WholeTextFileRecordReader(
    +    split: CombineFileSplit,
    +    context: TaskAttemptContext,
    +    index: Integer)
    +  extends RecordReader[String, String] {
    +
    +  private val path = split.getPath(index)
    +  private val fs = path.getFileSystem(context.getConfiguration)
    +
    +  // True means the current file has been processed, then skip it.
    +  private var processed = false
    +
    +  private val key = path.toString
    +  private var value: String = null
    +
    +  override def initialize(split: InputSplit, context: TaskAttemptContext) = {}
    +
    +  override def close() = {}
    +
    +  override def getProgress = if(processed) 1.0f else 0.0f
    +
    +  override def getCurrentKey = key
    +
    +  override def getCurrentValue = value
    +
    +  override def nextKeyValue = {
    +    if(!processed) {
    --- End diff --
    
    add a space after if


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10692101
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/util/BatchFileRecordReader.java ---
    @@ -0,0 +1,117 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.util;
    +
    +import java.io.IOException;
    +
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.fs.FSDataInputStream;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +/**
    + * Reads an entire file out in bytes format in <filename, content> format.
    --- End diff --
    
    `CombineFileInputFormat` combines splits, not records. If the file is long and you cut it at `MAX_BYTES_ALLOCATION`, how do you guarantee the cut is valid for UTF-8 text?
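
    A tiny illustration (not from the PR) of how a byte-level cut can land
    inside a multi-byte UTF-8 sequence and corrupt the text:
    
    ```scala
    // Each of these two CJK characters is 3 bytes in UTF-8, 6 bytes total.
    val bytes = "星火".getBytes("UTF-8")
    // Cutting after 4 bytes splits the second character mid-sequence, so the
    // decoded string ends in U+FFFD (the replacement character) instead of 火.
    val badCut = new String(bytes.take(4), "UTF-8")
    ```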


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10865863
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileRecordReader.java ---
    @@ -0,0 +1,103 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.input;
    +
    +import java.io.IOException;
    +
    +import com.google.common.io.Closeables;
    +import org.apache.commons.io.IOUtils;
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.fs.FSDataInputStream;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +/**
    + * Reads an entire file out in (filename, content) format. Each element in split is an record of a
    + * unique, whole file. File name is full path name for easy deduplicate.
    --- End diff --
    
    Need to improve the doc. See my comment on `WholeTextFileInputFormat`.


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10865898
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/util/WholeTextFileRecordReaderSuite.scala ---
    @@ -0,0 +1,114 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.util
    +
    +
    +import java.io.DataOutputStream
    +import java.io.FileOutputStream
    +import java.nio.file.Files
    +import java.nio.file.Path
    +import java.nio.file.Paths
    +
    +import scala.collection.immutable.IndexedSeq
    +
    +import org.scalatest.BeforeAndAfterAll
    +import org.scalatest.FunSuite
    +
    +import org.apache.hadoop.io.Text
    +
    +import org.apache.spark.mllib.util.MLUtils._
    +import org.apache.spark.SparkContext
    +
    +/**
    + * Tests the correctness of [[org.apache.spark.mllib.input.WholeTextFileRecordReader]]. A temporary
    + * directory is created as fake input. Temporary storage is deleted in the end.
    + */
    +class WholeTextFileRecordReaderSuite extends FunSuite with BeforeAndAfterAll {
    +  private var sc: SparkContext = _
    +
    +  override def beforeAll() {
    +    sc = new SparkContext("local", "test")
    +  }
    +
    +  override def afterAll() {
    +    sc.stop()
    +  }
    +
    +  private def createNativeFile(inputDir: Path, fileName: String, contents: Array[Byte]) = {
    +    val out = new DataOutputStream(new FileOutputStream(s"${inputDir.toString}/$fileName"))
    +    out.write(contents, 0, contents.length)
    +    out.close()
    +    println("Wrote native file")
    +  }
    +
    +  /**
    +   * This code will test the behaviors of WholeTextFileRecordReader based on local disk. There are
    +   * three aspects to check:
    +   *   1) is all files are read.
    +   *   2) is the fileNames are read correctly.
    --- End diff --
    
    whether paths are read correctly


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38255476
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13316/


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-37910135
  
    Merged build started.


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10865913
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/util/WholeTextFileRecordReaderSuite.scala ---
    @@ -0,0 +1,114 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.util
    +
    +
    +import java.io.DataOutputStream
    +import java.io.FileOutputStream
    +import java.nio.file.Files
    +import java.nio.file.Path
    +import java.nio.file.Paths
    +
    +import scala.collection.immutable.IndexedSeq
    +
    +import org.scalatest.BeforeAndAfterAll
    +import org.scalatest.FunSuite
    +
    +import org.apache.hadoop.io.Text
    +
    +import org.apache.spark.mllib.util.MLUtils._
    +import org.apache.spark.SparkContext
    +
    +/**
    + * Tests the correctness of [[org.apache.spark.mllib.input.WholeTextFileRecordReader]]. A temporary
    + * directory is created as fake input. Temporary storage is deleted in the end.
    + */
    +class WholeTextFileRecordReaderSuite extends FunSuite with BeforeAndAfterAll {
    +  private var sc: SparkContext = _
    +
    +  override def beforeAll() {
    +    sc = new SparkContext("local", "test")
    +  }
    +
    +  override def afterAll() {
    +    sc.stop()
    +  }
    +
    +  private def createNativeFile(inputDir: Path, fileName: String, contents: Array[Byte]) = {
    +    val out = new DataOutputStream(new FileOutputStream(s"${inputDir.toString}/$fileName"))
    +    out.write(contents, 0, contents.length)
    +    out.close()
    +    println("Wrote native file")
    +  }
    +
    +  /**
    +   * This code will test the behaviors of WholeTextFileRecordReader based on local disk. There are
    +   * three aspects to check:
    +   *   1) is all files are read.
    +   *   2) is the fileNames are read correctly.
    +   *   3) is the contents must be the same.
    +   */
    +  test("Correctness of WholeTextFileRecordReader.") {
    +
    +    val dir = Files.createTempDirectory("wholefiles")
    +    println(s"native disk address is ${dir.toString}")
    +
    +    WholeTextFileRecordReaderSuite.fileNames
    +      .zip(WholeTextFileRecordReaderSuite.filesContents)
    +      .foreach { case (fname, contents) =>
    +        createNativeFile(dir, fname, contents)
    +    }
    +
    +    val res = wholeTextFile(sc, dir.toString).collect()
    +
    +    assert(res.size === WholeTextFileRecordReaderSuite.fileNames.size,
    +      "Number of files read out do not fit with the actual value")
    +
    +    for ((fname, contents) <- res) {
    +      val shortName = fname.split('/').last
    +      assert(WholeTextFileRecordReaderSuite.fileNames.contains(shortName),
    +        s"Missing file name $fname.")
    +      assert(contents.hashCode === WholeTextFileRecordReaderSuite.hashCodeOfContents(shortName),
    --- End diff --
    
    verify text directly instead of hashcode
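
    I.e., something along these lines (a sketch; `expectedContents` is a
    hypothetical map from short file name to the original text, not a field of
    the suite as written):
    
    ```scala
    // Compare the decoded text itself: equal hash codes do not imply equal
    // contents, and a direct comparison gives a clearer failure message.
    assert(contents === expectedContents(shortName),
      s"file $fname contents do not match")
    ```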


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10785909
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/BatchFilesRecordReader.java ---
    @@ -0,0 +1,109 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.input;
    +
    +import java.io.IOException;
    +
    +import com.google.common.io.Closeables;
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.fs.FSDataInputStream;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.io.IOUtils;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +/**
    + * Reads an entire file out in bytes format in <filename, content> format.
    + */
    +
    +public class BatchFilesRecordReader extends RecordReader<String, Text> {
    +    private long startOffset;
    +    private int length;
    +    private Path path;
    +
    +    private String key = null;
    +    private Text value = null;
    +
    +    private boolean processed = false;
    +
    +    private FileSystem fs;
    +
    +    public BatchFilesRecordReader(
    +            CombineFileSplit split,
    +            TaskAttemptContext context,
    +            Integer index)
    +            throws IOException {
    +        path = split.getPath(index);
    +        startOffset = split.getOffset(index);
    --- End diff --
    
    You are reading the entire file, which is not splitable. Should offset always be 0?
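
    Put differently, with isSplitable() returning false each file should
    occupy one whole chunk of the CombineFileSplit, so this invariant should
    hold (a sketch, using the fields from the diff above):
    
    ```scala
    // A non-splitable file is one whole chunk: offset 0, length == file size.
    assert(split.getOffset(index) == 0L)
    assert(split.getLength(index) == fs.getFileStatus(path).getLen)
    ```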


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10873431
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileRecordReader.java ---
    @@ -0,0 +1,104 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.input;
    +
    +import java.io.IOException;
    +
    +import com.google.common.io.Closeables;
    +import org.apache.commons.io.IOUtils;
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.fs.FSDataInputStream;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +/**
    + * An <code>org.apache.hadoop.mapreduce.RecordReader</code> for reading whole text file out in
    + * (filename, content) format. Each element in split is an record of a unique, whole file. File name
    + * is the full path name for easy deduplicate.
    + */
    +public class WholeTextFileRecordReader extends RecordReader<String, Text> {
    +  private Path path;
    +
    +  private String key = null;
    +  private Text value = null;
    +
    +  private boolean processed = false;
    +
    +  private FileSystem fs;
    +
    +  public WholeTextFileRecordReader(
    +      CombineFileSplit split,
    +      TaskAttemptContext context,
    +      Integer index)
    +    throws IOException {
    +    path = split.getPath(index);
    +    fs = path.getFileSystem(context.getConfiguration());
    +  }
    +
    +  @Override
    +  public void initialize(InputSplit arg0, TaskAttemptContext arg1)
    +    throws IOException, InterruptedException {
    +  }
    +
    +  @Override
    +  public void close() throws IOException {
    +  }
    +
    +  @Override
    +  public float getProgress() throws IOException {
    +    return processed ? 1.0f : 0.0f;
    +  }
    +
    +  @Override
    +  public String getCurrentKey() throws IOException, InterruptedException {
    +    return key;
    +  }
    +
    +  @Override
    +  public Text getCurrentValue() throws IOException, InterruptedException{
    +    return value;
    +  }
    +
    +  @Override
    +  public boolean nextKeyValue() throws IOException {
    +    if (!processed) {
    +      if (key == null) {
    +        key = path.toString();
    +      }
    +      if (value == null) {
    +        value = new Text();
    +      }
    +
    +      FSDataInputStream fileIn = null;
    +      try {
    +        fileIn = fs.open(path);
    +        byte[] innerBuffer = IOUtils.toByteArray(fileIn);
    --- End diff --
    
    Shall we change `IOUtils.toByteArray` to Guava's `ByteStreams.toByteArray`, so we can remove `commons-io` from our dependencies? Guava is better at maintaining API compatibility across versions.
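
    The read in nextKeyValue() would then look roughly like this (a sketch
    only, assuming the surrounding fields from the diff above and a Guava
    version that provides Closeables.close):
    
    ```scala
    import com.google.common.io.{ByteStreams, Closeables}
    
    // Read the whole file with Guava instead of commons-io.
    val fileIn = fs.open(path)
    try {
      value = new Text(ByteStreams.toByteArray(fileIn))
    } finally {
      // false: propagate any IOException raised while closing.
      Closeables.close(fileIn, false)
    }
    ```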


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/164#discussion_r10691109
  
    --- Diff: mllib/src/main/java/org/apache/spark/mllib/util/BatchFileInputFormat.java ---
    @@ -0,0 +1,52 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.util;
    +
    +import java.io.IOException;
    +
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.JobContext;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +/**
    + * The specific InputFormat reads files in HDFS or local disk. It will be called by
    + * HadoopRDD to generate new BatchFileRecordReader.
    + */
    +public class BatchFileInputFormat
    +        extends CombineFileInputFormat<String, Text> {
    +
    +    @Override
    +    protected boolean isSplitable(JobContext context, Path file) {
    +        return false;
    +    }
    +    @Override
    +    public RecordReader<String, Text> createRecordReader(
    +            InputSplit split,
    +            TaskAttemptContext context) throws IOException {
    +        return new CombineFileRecordReader<String, Text>(
    +                (CombineFileSplit)split,
    --- End diff --
    
    Yep, I will add it.


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38519654
  
    @mengxr I talked to @liancheng about the placement of the WholeTextFiles
    interface. At the time we had no idea whether it would be a commonly used
    interface, so we cautiously chose MLUtils instead of SparkContext.
    
    If the community thinks it is better to put it in SparkContext, I am glad
    to do that. :)
    
    @yinxusen I think the code looks good to me except a few docs. I'm not sure
    whether MLUtils is a good place for wholeTextFile. We can pimp SparkContext
    and put this method there.
    
    --
    Reply to this email directly or view it on GitHub.
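
    For reference, the "pimp SparkContext" route mengxr mentions would look
    roughly like this (a sketch only, assuming the Scala
    WholeTextFileInputFormat from this PR with String key and value types;
    none of the names below are final):
    
    ```scala
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.input.WholeTextFileInputFormat
    
    // Enrich SparkContext with wholeTextFile via an implicit conversion
    // ("pimp my library"), instead of exposing it through MLUtils.
    class WholeTextFileSparkContext(val sc: SparkContext) {
      def wholeTextFile(path: String): RDD[(String, String)] =
        sc.newAPIHadoopFile(path, classOf[WholeTextFileInputFormat],
          classOf[String], classOf[String])
    }
    
    object WholeTextFileSparkContext {
      implicit def fromSparkContext(sc: SparkContext): WholeTextFileSparkContext =
        new WholeTextFileSparkContext(sc)
    }
    
    // Usage, after importing WholeTextFileSparkContext._:
    //   val files = sc.wholeTextFile("hdfs://.../foo/")
    ```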


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38245425
  
    @mengxr There are 2 Java files in my PR, plus 2 Scala files - MLUtils.scala and the test suite. I could only find the Scala code style on the [style page](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide). Does Java code also use 2-space indentation?


---

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/164#issuecomment-38436606
  
    Build started.


---