Posted to dev@spark.apache.org by "Ulanov, Alexander" <al...@hp.com> on 2015/03/26 22:16:26 UTC

Storing large data for MLlib machine learning

Hi,

Could you suggest a reasonable file format for storing feature vector data for machine learning in Spark MLlib? Are there any best practices for Spark?

My data is dense feature vectors with labels. Some of the requirements are that the format should be easy to load/serialize, randomly accessible, and have a small footprint (binary). I am considering Parquet, HDF5, and Protocol Buffers (protobuf), but I have little to no experience with them, so any suggestions would be really appreciated.

Best regards, Alexander

RE: Storing large data for MLlib machine learning

Posted by "Ulanov, Alexander" <al...@hp.com>.
Thanks, Jeremy! I also work with time series data right now, so your suggestions are really relevant. However, we want to handle not the raw data but data that has already been processed and prepared for machine learning.

Initially, we also wanted to have our own simple binary format, but we could not agree on how to handle little/big endianness: whether to commit to a specific byte order or to ship that information in a metadata file. And a metadata file sounds like yet another round of data format engineering (i.e., reinventing the wheel). Does this make sense to you?
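As an illustration of the byte-order point, here is a minimal sketch (not from the thread; the file name and shape are made up) of settling endianness in the data itself rather than in a metadata file, by pinning an explicit byte order in the numpy dtype:

# Minimal sketch: pin the byte order in the dtype so the on-disk layout is
# unambiguous regardless of the writer's platform ('<f8' = little-endian float64).
import numpy as np

dat = np.random.randn(100, 5)
dat.astype('<f8').tofile('features.bin')

# Any reader that uses the same explicit dtype recovers the same values.
restored = np.fromfile('features.bin', dtype='<f8').reshape(100, 5)
assert np.allclose(dat, restored)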



Re: Storing large data for MLlib machine learning

Posted by Jeremy Freeman <fr...@gmail.com>.
Hi Ulanov, great question; we've encountered it frequently with scientific data (e.g. time series). Agreed, text is inefficient for dense arrays, and we also found HDF5+Spark to be a pain.

Our strategy has been flat binary files with fixed-length records. Loading these is now supported in Spark via the binaryRecords method, which wraps a custom Hadoop InputFormat we wrote.

An example (in Python):

# write data from an array (binary mode, little-endian float64 on disk)
from numpy import random
dat = random.randn(100, 5)
f = open('test.bin', 'wb')
dat.astype('<f8').tofile(f)
f.close()

# load the data back in
from numpy import frombuffer
nvalues = 5                       # values per record (columns)
valuesize = 8                     # bytes per float64
recordsize = nvalues * valuesize  # 40 bytes per record
data = sc.binaryRecords('test.bin', recordsize)
parsed = data.map(lambda v: frombuffer(v, '<f8'))

# these should be equal
parsed.first()
dat[0, :]

Compared to something like Parquet, this is a little lighter-weight and plays nicer with non-distributed data science tools (e.g. numpy). It also scales well (we use it routinely to process TBs of time series), and it handles single files or directories. But it's extremely simple!
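Since the original question involves labels as well as dense features, here is a hedged sketch of packing both into the same fixed-length records (assumptions: a float64 label stored as the first value of each record, 5 features, toy data and paths):

# Sketch: pack label + features as one fixed-length record of float64 values.
from numpy import random, hstack, frombuffer

labels = random.randint(0, 2, size=(100, 1)).astype('float64')
features = random.randn(100, 5)
records = hstack([labels, features]).astype('<f8')   # 100 records x 6 doubles
records.tofile('labeled.bin')

recordsize = 6 * 8   # 6 float64 values per record

def parse(v):
    a = frombuffer(v, '<f8')
    return (a[0], a[1:])   # (label, dense feature vector)

labeled = sc.binaryRecords('labeled.bin', recordsize).map(parse)
labeled.first()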

-------------------------
jeremyfreeman.net
@thefreemanlab


RE: Storing large data for MLlib machine learning

Posted by "Ulanov, Alexander" <al...@hp.com>.
Thanks for the suggestion, but libsvm is a text format for sparse data, and I have dense vectors. In my opinion, a text format is not appropriate for storing large dense vectors: there is overhead in parsing strings back into numbers, and storing numbers as strings is not space-efficient.
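As a rough illustration of that overhead (a back-of-the-envelope sketch, not a benchmark): a float64 takes 8 bytes in binary but typically 18-25 characters once printed at full precision, and each of those strings has to be parsed back into a number on load.

# Back-of-the-envelope: text vs. binary footprint for one dense vector.
import numpy as np

vec = np.random.randn(1000)
binary_size = vec.astype('<f8').nbytes        # 8000 bytes
text = ' '.join(repr(x) for x in vec)
text_size = len(text)                         # typically well over twice as large

# Reading the text back costs a string-to-float conversion per value.
roundtrip = np.array([float(tok) for tok in text.split()])
assert np.allclose(vec, roundtrip)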



Re: Storing large data for MLlib machine learning

Posted by "Evan R. Sparks" <ev...@gmail.com>.
Protobufs are great for serializing individual records, but Parquet is good for efficiently storing a whole bunch of these objects.

Matt Massie has a good (slightly dated) blog post on using
Spark+Parquet+Avro (and you can pretty much s/Avro/Protobuf/) describing
how they all work together here:
http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/

Your use case (storing dense features, presumably as a single column) is
pretty straightforward and the extra layers of indirection are maybe
overkill.

Lastly - you might consider using some of SparkSQL/DataFrame's built-in
features for persistence, which support lots of storage backends.
https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#data-sources
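For instance, a hedged sketch against the Spark 1.3-era DataFrame API (column names, toy data, and paths are made up): labels plus a dense feature array stored and read back as a Parquet table.

# Sketch (Spark 1.3-era API; names and paths are illustrative): label + dense
# features as a two-column DataFrame persisted to Parquet.
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
rows = sc.parallelize([(1.0, [0.1, 0.2, 0.3]),
                       (0.0, [0.4, 0.5, 0.6])])
df = sqlContext.createDataFrame(rows, ["label", "features"])

df.saveAsParquetFile("data/features.parquet")       # writer API before 1.4
back = sqlContext.parquetFile("data/features.parquet")
back.first()                                        # e.g. Row(label=1.0, features=[0.1, 0.2, 0.3])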


RE: Storing large data for MLlib machine learning

Posted by "Ulanov, Alexander" <al...@hp.com>.
Jeremy, thanks for the explanation!
What if you used the Parquet file format instead? You could still write a number of small files as you do now, but you wouldn't have to implement a writer/reader, because Parquet readers and writers are already available in various languages.



Re: Storing large data for MLlib machine learning

Posted by Jeremy Freeman <fr...@gmail.com>.
@Alexander, re: using flat binary and metadata, you raise excellent points! At least in our case, we decided on a specific endianness, but do end up storing some extremely minimal specification in a JSON file, and have written importers and exporters within our library to parse it. While it does feel a little like reinvention, it’s fast, direct, and scalable, and seems pretty sensible if you know your data will be dense arrays of numerical features.
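A minimal sketch of that kind of sidecar (field names are illustrative, not the actual spec used in their library): the JSON carries just enough (dtype and record width) to interpret the flat binary.

# Sketch of a minimal JSON sidecar next to a flat binary file (fields illustrative).
import json
import numpy as np

dat = np.random.randn(100, 5).astype('<f8')
dat.tofile('features.bin')
with open('features.json', 'w') as f:
    json.dump({'dtype': '<f8', 'nfeatures': 5, 'nrecords': 100}, f)

# A reader consults the sidecar instead of hard-coding the layout.
meta = json.load(open('features.json'))
recordsize = meta['nfeatures'] * np.dtype(meta['dtype']).itemsize
parsed = sc.binaryRecords('features.bin', recordsize) \
           .map(lambda v: np.frombuffer(v, meta['dtype']))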

-------------------------
jeremyfreeman.net
@thefreemanlab


Re: Storing large data for MLlib machine learning

Posted by Hector Yee <he...@gmail.com>.
Just using sc.textFile, then a .map(decode).
Yes, by default it is multiple files; our training data is 1 TB gzipped into 5000 shards.
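For illustration, a sketch of what that load path can look like, with a plain struct decode standing in for the Thrift deserialization (the record layout, feature count, and paths are assumptions, not the actual schema):

# Sketch of the load side: gzipped text shards of base64-encoded records.
import base64
import struct

NFEATURES = 3   # assumed width of each record

def decode(line):
    blob = base64.b64decode(line)
    values = struct.unpack('<%dd' % (1 + NFEATURES), blob)
    return (values[0], list(values[1:]))   # (label, features)

data = sc.textFile('data/encoded')   # picks up every shard in the directory
parsed = data.map(decode)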

-- 
Yee Yang Li Hector
google.com/+HectorYee

RE: Storing large data for MLlib machine learning

Posted by "Ulanov, Alexander" <al...@hp.com>.
Thanks, sounds interesting! How do you load the files into Spark? Did you consider using multiple files instead of file lines?


Re: Storing large data for MLlib machine learning

Posted by Hector Yee <he...@gmail.com>.
I use Thrift, then base64-encode the binary and save it as text-file lines that are snappy- or gzip-encoded.

It makes it very easy to copy small chunks locally and play with subsets of the data, without taking a dependency on HDFS/Hadoop server machinery, for example.
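A sketch of the corresponding write side, again with struct standing in for the Thrift serialization (the record layout, codec class argument, and paths are assumptions):

# Sketch of the write side: serialize, base64-encode, save as gzipped text shards.
import base64
import struct

def encode(label, features):
    # pack label + features as float64s, then base64 so the record is line-safe text
    blob = struct.pack('<%dd' % (1 + len(features)), label, *features)
    return base64.b64encode(blob)

records = sc.parallelize([(1.0, [0.1, 0.2, 0.3]),
                          (0.0, [0.4, 0.5, 0.6])])
lines = records.map(lambda r: encode(r[0], r[1]))
lines.saveAsTextFile('data/encoded',
                     'org.apache.hadoop.io.compress.GzipCodec')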



-- 
Yee Yang Li Hector
google.com/+HectorYee

RE: Storing large data for MLlib machine learning

Posted by "Ulanov, Alexander" <al...@hp.com>.
Thanks, Evan. What do you think about Protobuf? Twitter has a library, elephant-bird, for managing protobuf files in HDFS: https://github.com/twitter/elephant-bird



Re: Storing large data for MLlib machine learning

Posted by "Evan R. Sparks" <ev...@gmail.com>.
On binary file formats: I looked at HDF5+Spark a couple of years ago and found it barely JVM-friendly and very Hadoop-unfriendly (e.g. the APIs needed filenames as input; you couldn't pass them anything like an InputStream). I don't know if it has gotten any better.

Parquet plays much more nicely, and there are lots of Spark-related projects using it already. Keep in mind that it's column-oriented, which might impact performance, but basically you're going to want your features in a byte array, and deserialization should be pretty straightforward.
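The "features in a byte array" part might look like this sketch (numpy doing the packing and unpacking; the surrounding Parquet read/write plumbing is whatever writer you choose):

# Sketch: a dense feature vector to a byte array and back, i.e. the value you
# would store in a single binary Parquet column.
import numpy as np

features = np.random.randn(5)
blob = features.astype('<f8').tostring()      # bytes for the binary column
restored = np.frombuffer(blob, dtype='<f8')   # deserialization on read
assert np.allclose(features, restored)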


Re: Storing large data for MLlib machine learning

Posted by Stephen Boesch <ja...@gmail.com>.
There are some convenience methods you might consider, including:

    MLUtils.loadLibSVMFile

and MLUtils.loadLabeledPoints
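For reference, a quick PySpark sketch of how these helpers are called (paths are placeholders; loadLibSVMFile expects the sparse libsvm text format discussed elsewhere in the thread):

# Sketch: the MLlib convenience loaders from PySpark (paths are placeholders).
from pyspark.mllib.util import MLUtils

# libsvm text format: "<label> <index>:<value> <index>:<value> ..."
examples = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")

# labeled points previously saved in MLlib's LabeledPoint text representation
points = MLUtils.loadLabeledPoints(sc, "data/labeled_points")

examples.first()   # LabeledPoint(label, SparseVector(...))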
