Posted to user@spark.apache.org by "darion.yaphet" <fl...@163.com> on 2017/06/12 03:46:35 UTC

LibSVM should have just one input file

Hi team :


Currently, when we use SVM to train on a dataset, we found that only one input file is allowed.


The source code is as follows:


val path = if (dataFiles.length == 1) {
  dataFiles.head.getPath.toUri.toString
} else if (dataFiles.isEmpty) {
  throw new IOException("No input path specified for libsvm data")
} else {
  throw new IOException("Multiple input paths are not supported for libsvm data.")
}
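For context, the restriction is easy to reproduce in isolation. Here is a minimal standalone sketch of the same selection logic, with `dataFiles` simplified to plain path strings (the object and method names are hypothetical, not from the Spark source):

```scala
import java.io.IOException

object LibSVMPathCheck {
  // Sketch of the quoted path-selection logic: exactly one input path
  // is accepted; zero or many paths raise an IOException.
  def selectPath(dataFiles: Seq[String]): String =
    if (dataFiles.length == 1) {
      dataFiles.head
    } else if (dataFiles.isEmpty) {
      throw new IOException("No input path specified for libsvm data")
    } else {
      throw new IOException("Multiple input paths are not supported for libsvm data.")
    }
}
```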



A file stored on a distributed file system such as HDFS is split into multiple pieces, so I think this limit is unnecessary. I'm not sure whether this is a bug, or whether I'm using it incorrectly.


Thanks a lot ~~~

Re: LibSVM should have just one input file

Posted by "颜发才 (Yan Facai)" <fa...@gmail.com>.
Hi, yaphet.
It seems that the code you pasted is located in LibSVM, rather than SVM.
Do I misunderstand?

For LibSVMDataSource,
1. if numFeatures is unspecified, only a single file is valid input.

val df = spark.read.format("libsvm")
  .load("data/mllib/sample_libsvm_data.txt")

2. otherwise, multiple files are OK.

val df = spark.read.format("libsvm")
  .option("numFeatures", "780")
  .load("data/mllib/sample_libsvm_data.txt")
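Presumably the single-file restriction exists because, without numFeatures, Spark must scan the data to infer the feature dimension. A rough illustration of that inference in plain Scala (the object and helper names here are hypothetical, not Spark's actual implementation):

```scala
object LibSVMInferSketch {
  // Max feature index on one libsvm line: "label idx:value idx:value ..."
  def maxFeatureIndex(line: String): Int =
    line.trim.split("\\s+")
      .drop(1)                       // drop the leading label
      .map(_.split(":")(0).toInt)    // keep only the feature index
      .foldLeft(0)(math.max)

  // Inferred numFeatures = largest feature index seen across all lines.
  def inferNumFeatures(lines: Seq[String]): Int =
    lines.map(maxFeatureIndex).foldLeft(0)(math.max)
}
```

Passing `.option("numFeatures", "780")` skips this scan entirely, which is why the reader can then accept multiple files.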


For more, see: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.source.libsvm.LibSVMDataSource


On Mon, Jun 12, 2017 at 11:46 AM, darion.yaphet <fl...@163.com> wrote:

> Hi team :
>
> Currently when we using SVM to train dataset we found the input
> files limit only one .
>
> the source code as following :
>
> val path = if (dataFiles.length == 1) {
> dataFiles.head.getPath.toUri.toString
> } else if (dataFiles.isEmpty) {
> throw new IOException("No input path specified for libsvm data")
> } else {
> throw new IOException("Multiple input paths are not supported for libsvm
> data.")
> }
>
> The file store on the Distributed File System such as HDFS is split into
> mutil piece and I think this limit is not necessary . I'm not sure is it a
> bug ? or something I'm using not correctly .
>
> thanks a lot ~~~
