You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by HyukjinKwon <gi...@git.apache.org> on 2017/07/06 16:50:41 UTC

[GitHub] spark pull request #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat...

GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/18556

    [SPARK-21326][SPARK-21066][ML] Use TextFileFormat in LibSVMFileFormat and allow multiple input paths for determining numFeatures

    ## What changes were proposed in this pull request?
    
    This is related with [SPARK-19918](https://issues.apache.org/jira/browse/SPARK-19918) and [SPARK-18362](https://issues.apache.org/jira/browse/SPARK-18362).
    
    This PR proposes to use `TextFileFormat` and allow multiple input paths (but with a warning) when determining the number of features in LibSVM data source via an extra scan.
    
    There are three points here:
    
    - The main advantage of this change should be to remove file-listing bottlenecks in driver side.
    
    - Another advantage is ones from using `FileScanRDD`. For example, I guess we can use `spark.sql.files.ignoreCorruptFiles` option when determining the number of features.
    
    - We can unify the schema inference code path in text based data sources. This is also a preparation for [SPARK-21289](https://issues.apache.org/jira/browse/SPARK-21289).
    
    ## How was this patch tested?
    
    Unit tests in `LibSVMRelationSuite`.
    
    Closes #18288

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark libsvm-schema

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18556.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18556
    
----

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18556#discussion_r126066952
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala ---
    @@ -102,6 +104,25 @@ object MLUtils extends Logging {
           .map(parseLibSVMRecord)
       }
     
    +  private[spark] def parseLibSVMFile(
    +      sparkSession: SparkSession, paths: Seq[String]): RDD[(Double, Array[Int], Array[Double])] = {
    +    val lines = sparkSession.baseRelationToDataFrame(
    +      DataSource.apply(
    +        sparkSession,
    +        paths = paths,
    +        className = classOf[TextFileFormat].getName
    +      ).resolveRelation(checkFilesExist = false))
    +      .select("value")
    --- End diff --
    
    is this needed? I think text format is known to have only one column.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18556#discussion_r126042950
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala ---
    @@ -89,18 +93,17 @@ private[libsvm] class LibSVMFileFormat extends TextBasedFileFormat with DataSour
           files: Seq[FileStatus]): Option[StructType] = {
         val libSVMOptions = new LibSVMOptions(options)
         val numFeatures: Int = libSVMOptions.numFeatures.getOrElse {
    -      // Infers number of features if the user doesn't specify (a valid) one.
    -      val dataFiles = files.filterNot(_.getPath.getName startsWith "_")
    --- End diff --
    
    Yea, exactly. I believe we are still not doing this within each source. If it does by any change after that commit, it should not as such.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18556#discussion_r126043543
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala ---
    @@ -89,18 +93,17 @@ private[libsvm] class LibSVMFileFormat extends TextBasedFileFormat with DataSour
           files: Seq[FileStatus]): Option[StructType] = {
         val libSVMOptions = new LibSVMOptions(options)
         val numFeatures: Int = libSVMOptions.numFeatures.getOrElse {
    -      // Infers number of features if the user doesn't specify (a valid) one.
    -      val dataFiles = files.filterNot(_.getPath.getName startsWith "_")
    -      val path = if (dataFiles.length == 1) {
    -        dataFiles.head.getPath.toUri.toString
    -      } else if (dataFiles.isEmpty) {
    +      if (files.isEmpty) {
             throw new IOException("No input path specified for libsvm data")
    -      } else {
    -        throw new IOException("Multiple input paths are not supported for libsvm data.")
    +      } else if (files.length > 1) {
    +        logWarning(
    +          "Multiple input paths detected, determining the number of features by going " +
    --- End diff --
    
    Sure.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat in Lib...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18556
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79305/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18556#discussion_r126051332
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala ---
    @@ -89,18 +93,17 @@ private[libsvm] class LibSVMFileFormat extends TextBasedFileFormat with DataSour
           files: Seq[FileStatus]): Option[StructType] = {
         val libSVMOptions = new LibSVMOptions(options)
         val numFeatures: Int = libSVMOptions.numFeatures.getOrElse {
    -      // Infers number of features if the user doesn't specify (a valid) one.
    -      val dataFiles = files.filterNot(_.getPath.getName startsWith "_")
    -      val path = if (dataFiles.length == 1) {
    -        dataFiles.head.getPath.toUri.toString
    -      } else if (dataFiles.isEmpty) {
    +      if (files.isEmpty) {
             throw new IOException("No input path specified for libsvm data")
    --- End diff --
    
    Actually, that should be right after this function call so probably fine :). Yea, but at least using `require` should be shorter.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat in Lib...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18556
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat in Lib...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18556
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18556#discussion_r126045375
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala ---
    @@ -89,18 +93,17 @@ private[libsvm] class LibSVMFileFormat extends TextBasedFileFormat with DataSour
           files: Seq[FileStatus]): Option[StructType] = {
         val libSVMOptions = new LibSVMOptions(options)
         val numFeatures: Int = libSVMOptions.numFeatures.getOrElse {
    -      // Infers number of features if the user doesn't specify (a valid) one.
    -      val dataFiles = files.filterNot(_.getPath.getName startsWith "_")
    -      val path = if (dataFiles.length == 1) {
    -        dataFiles.head.getPath.toUri.toString
    -      } else if (dataFiles.isEmpty) {
    +      if (files.isEmpty) {
             throw new IOException("No input path specified for libsvm data")
    --- End diff --
    
    I could not use `AnalysisException` here for package access modifier. Also, I could not remove this line to defer the message as I am not sure if saying "Unable to infer schema ... It must be specified manually." Is a proper message here -
     @cloud-fan


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat in Lib...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18556
  
    **[Test build #79293 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79293/testReport)** for PR 18556 at commit [`e1e13ac`](https://github.com/apache/spark/commit/e1e13ac1ce610a00fd9b07a8919f2dd2fa241026).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat in Lib...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18556
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat...

Posted by facaiy <gi...@git.apache.org>.
Github user facaiy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18556#discussion_r126050849
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala ---
    @@ -89,18 +93,17 @@ private[libsvm] class LibSVMFileFormat extends TextBasedFileFormat with DataSour
           files: Seq[FileStatus]): Option[StructType] = {
         val libSVMOptions = new LibSVMOptions(options)
         val numFeatures: Int = libSVMOptions.numFeatures.getOrElse {
    -      // Infers number of features if the user doesn't specify (a valid) one.
    -      val dataFiles = files.filterNot(_.getPath.getName startsWith "_")
    -      val path = if (dataFiles.length == 1) {
    -        dataFiles.head.getPath.toUri.toString
    -      } else if (dataFiles.isEmpty) {
    +      if (files.isEmpty) {
             throw new IOException("No input path specified for libsvm data")
    --- End diff --
    
    In my opinion, it is safe / necessary to check whether the parameter is valid in advance. Perhaps `IllegalArgumentException` is more suitable.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat in Lib...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18556
  
    **[Test build #79307 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79307/testReport)** for PR 18556 at commit [`b345cb1`](https://github.com/apache/spark/commit/b345cb14758ae6f16d699b2f38b17eefbf316468).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18556#discussion_r126043309
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala ---
    @@ -89,18 +93,17 @@ private[libsvm] class LibSVMFileFormat extends TextBasedFileFormat with DataSour
           files: Seq[FileStatus]): Option[StructType] = {
         val libSVMOptions = new LibSVMOptions(options)
         val numFeatures: Int = libSVMOptions.numFeatures.getOrElse {
    -      // Infers number of features if the user doesn't specify (a valid) one.
    -      val dataFiles = files.filterNot(_.getPath.getName startsWith "_")
    -      val path = if (dataFiles.length == 1) {
    -        dataFiles.head.getPath.toUri.toString
    -      } else if (dataFiles.isEmpty) {
    +      if (files.isEmpty) {
    --- End diff --
    
    Btw, I think we can't defer the error message here as I guess it is not strictly schema inference. Also, I could not use `AnalysisException`  here due to package access identifier. - @cloud-fan 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat in Lib...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/18556
  
    Thank you @cloud-fan!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat in Lib...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/18556
  
    LGTM, merging to master!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat in Lib...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18556
  
    **[Test build #79305 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79305/testReport)** for PR 18556 at commit [`de33d6d`](https://github.com/apache/spark/commit/de33d6d8809e87edbe42c2dfab4b914da29c7143).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat in Lib...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/18556
  
    @facaiy, thank you for approving this approach. I just addressed your comment.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat in Lib...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18556
  
    **[Test build #79307 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79307/testReport)** for PR 18556 at commit [`b345cb1`](https://github.com/apache/spark/commit/b345cb14758ae6f16d699b2f38b17eefbf316468).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat in Lib...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18556
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79293/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat...

Posted by facaiy <gi...@git.apache.org>.
Github user facaiy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18556#discussion_r126026388
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala ---
    @@ -89,18 +93,17 @@ private[libsvm] class LibSVMFileFormat extends TextBasedFileFormat with DataSour
           files: Seq[FileStatus]): Option[StructType] = {
         val libSVMOptions = new LibSVMOptions(options)
         val numFeatures: Int = libSVMOptions.numFeatures.getOrElse {
    -      // Infers number of features if the user doesn't specify (a valid) one.
    -      val dataFiles = files.filterNot(_.getPath.getName startsWith "_")
    --- End diff --
    
    Do you mean that _success file of hdfs, if existed, will be safely ignored by spark?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18556#discussion_r125956748
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala ---
    @@ -89,18 +93,17 @@ private[libsvm] class LibSVMFileFormat extends TextBasedFileFormat with DataSour
           files: Seq[FileStatus]): Option[StructType] = {
         val libSVMOptions = new LibSVMOptions(options)
         val numFeatures: Int = libSVMOptions.numFeatures.getOrElse {
    -      // Infers number of features if the user doesn't specify (a valid) one.
    -      val dataFiles = files.filterNot(_.getPath.getName startsWith "_")
    -      val path = if (dataFiles.length == 1) {
    -        dataFiles.head.getPath.toUri.toString
    -      } else if (dataFiles.isEmpty) {
    +      if (files.isEmpty) {
             throw new IOException("No input path specified for libsvm data")
    -      } else {
    -        throw new IOException("Multiple input paths are not supported for libsvm data.")
    +      } else if (files.length > 1) {
    +        logWarning(
    +          "Multiple input paths detected, determining the number of features by going " +
    --- End diff --
    
    Yes I tend to agree with this


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/18556


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat in Lib...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18556
  
    **[Test build #79305 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79305/testReport)** for PR 18556 at commit [`de33d6d`](https://github.com/apache/spark/commit/de33d6d8809e87edbe42c2dfab4b914da29c7143).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18556#discussion_r125955384
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala ---
    @@ -89,18 +93,17 @@ private[libsvm] class LibSVMFileFormat extends TextBasedFileFormat with DataSour
           files: Seq[FileStatus]): Option[StructType] = {
         val libSVMOptions = new LibSVMOptions(options)
         val numFeatures: Int = libSVMOptions.numFeatures.getOrElse {
    -      // Infers number of features if the user doesn't specify (a valid) one.
    -      val dataFiles = files.filterNot(_.getPath.getName startsWith "_")
    --- End diff --
    
    I am quite sure that we can remove this - https://github.com/apache/spark/commit/76622c661fcae81eb0352c61f54a2e9e21a4fb98.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat in Lib...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18556
  
    **[Test build #79293 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79293/testReport)** for PR 18556 at commit [`e1e13ac`](https://github.com/apache/spark/commit/e1e13ac1ce610a00fd9b07a8919f2dd2fa241026).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat in Lib...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/18556
  
    cc @MLnick, @srowen, @yanboliang, @cloud-fan, @facaiy and @yanboliang. I happened to take over https://github.com/apache/spark/pull/18288. Could you take a look please?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18556#discussion_r126053740
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala ---
    @@ -89,18 +93,14 @@ private[libsvm] class LibSVMFileFormat extends TextBasedFileFormat with DataSour
           files: Seq[FileStatus]): Option[StructType] = {
         val libSVMOptions = new LibSVMOptions(options)
         val numFeatures: Int = libSVMOptions.numFeatures.getOrElse {
    -      // Infers number of features if the user doesn't specify (a valid) one.
    -      val dataFiles = files.filterNot(_.getPath.getName startsWith "_")
    -      val path = if (dataFiles.length == 1) {
    -        dataFiles.head.getPath.toUri.toString
    -      } else if (dataFiles.isEmpty) {
    -        throw new IOException("No input path specified for libsvm data")
    -      } else {
    -        throw new IOException("Multiple input paths are not supported for libsvm data.")
    -      }
    -
    -      val sc = sparkSession.sparkContext
    -      val parsed = MLUtils.parseLibSVMFile(sc, path, sc.defaultParallelism)
    +      require(files.nonEmpty, "No input path specified for libsvm data")
    --- End diff --
    
    Please refer https://github.com/apache/spark/pull/18556#discussion_r126045375.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat in Lib...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18556
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79307/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat...

Posted by facaiy <gi...@git.apache.org>.
Github user facaiy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18556#discussion_r126023986
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala ---
    @@ -89,18 +93,17 @@ private[libsvm] class LibSVMFileFormat extends TextBasedFileFormat with DataSour
           files: Seq[FileStatus]): Option[StructType] = {
         val libSVMOptions = new LibSVMOptions(options)
         val numFeatures: Int = libSVMOptions.numFeatures.getOrElse {
    -      // Infers number of features if the user doesn't specify (a valid) one.
    -      val dataFiles = files.filterNot(_.getPath.getName startsWith "_")
    -      val path = if (dataFiles.length == 1) {
    -        dataFiles.head.getPath.toUri.toString
    -      } else if (dataFiles.isEmpty) {
    +      if (files.isEmpty) {
             throw new IOException("No input path specified for libsvm data")
    -      } else {
    -        throw new IOException("Multiple input paths are not supported for libsvm data.")
    +      } else if (files.length > 1) {
    +        logWarning(
    +          "Multiple input paths detected, determining the number of features by going " +
    --- End diff --
    
    Good. How about warning only if `numFeature` is missing? I mean, remove the condition `files.length > 1`
    
    ```scala
    if (files.isEmpty) { }
    else { ... logWarning ...}
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org