Posted to user@spark.apache.org by habitats <ma...@habitats.no> on 2016/02/04 21:58:52 UTC

Recommended storage solution for my setup (~5M items, 10KB pr.)

Hello

I have ~5 million text documents, each around 10-15KB in size, and split
into ~15 columns. I intend to do machine learning, and thus I need to
extract all of the data at the same time, and potentially update everything
on every run.

So far I've just used JSON serialization, or simply cached the RDD to disk.
However, I feel like there must be a better way.

I have tried HBase, but I had a hard time setting it up and getting it to
work properly. It also felt like a lot of work for my simple requirements. I
want something /simple/.

Any suggestions?



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Recommended-storage-solution-for-my-setup-5M-items-10KB-pr-tp26150.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Recommended storage solution for my setup (~5M items, 10KB pr.)

Posted by Nick Pentreath <ni...@gmail.com>.
If I'm not mistaken, your data comes to roughly 50-75GB of text documents? In which case simple flat text files in JSON or CSV seem ideal, as you are already doing. If you are using Spark, then DataFrames can read/write either of these formats.
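
Roughly, with the 1.x DataFrame API, something like the sketch below (the paths are made up, and CSV needs the spark-csv package on 1.x):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="docs")
    sqlContext = SQLContext(sc)

    # read flat JSON lines (one document per line) straight into a DataFrame
    df = sqlContext.read.json("/data/docs.json")

    # ... featurize / train / update ...

    # write everything back out after the run, replacing the previous version
    df.write.mode("overwrite").json("/data/docs_updated.json")

    # or CSV, via the spark-csv package on 1.x
    df.write.format("com.databricks.spark.csv").option("header", "true").save("/data/docs_csv")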

For that size of data you may not require Spark. Single-instance scikit-learn or VW or whatever should be ok (depending on what model you want to build).
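
Out-of-core on a single machine with scikit-learn could look something like this sketch (the file layout, "text"/"label" fields and chunked reader are made up; hashing features + SGD is just one way to keep memory bounded):

    import json
    from itertools import islice

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    def iter_chunks(path, size=10000):
        """Yield (texts, labels) lists from a JSON-lines file, one chunk at a time."""
        with open(path) as f:
            while True:
                rows = [json.loads(line) for line in islice(f, size)]
                if not rows:
                    break
                yield [r["text"] for r in rows], [r["label"] for r in rows]

    vectorizer = HashingVectorizer(n_features=2**20)
    clf = SGDClassifier()

    # stream the corpus in chunks so the full dataset never sits in memory
    for texts, labels in iter_chunks("/data/docs.json"):
        X = vectorizer.transform(texts)
        clf.partial_fit(X, labels, classes=[0, 1])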

If you need any search & filtering capability I'd recommend Elasticsearch, which has a very good Spark connector in the elasticsearch-hadoop project. It's also easy to set up and get started with (but trickier to actually run in production).
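
With the elasticsearch-hadoop jar on the classpath, writing and reading a DataFrame is roughly this (continuing the sketch above; the index/type name and the ES address are made up):

    # index the DataFrame into the "docs/doc" index/type
    df.write.format("org.elasticsearch.spark.sql") \
        .option("es.nodes", "localhost:9200") \
        .save("docs/doc")

    # read it back; filters on the DataFrame get pushed down to Elasticsearch
    from_es = sqlContext.read.format("org.elasticsearch.spark.sql") \
        .option("es.nodes", "localhost:9200") \
        .load("docs/doc")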

PostgreSQL may also be a good option, with its JSON support.
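
From Spark that would be plain JDBC, again a sketch (the URL, table and credentials are made up, and the Postgres JDBC driver needs to be on the classpath; a text/jsonb column can hold the raw document if you want Postgres's JSON operators):

    props = {"user": "spark", "password": "secret",
             "driver": "org.postgresql.Driver"}

    # write the DataFrame out as a table, replacing the previous run
    df.write.jdbc("jdbc:postgresql://localhost:5432/docs", "documents",
                  mode="overwrite", properties=props)

    # read it back for the next run
    from_pg = sqlContext.read.jdbc("jdbc:postgresql://localhost:5432/docs",
                                   "documents", properties=props)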

Hope that helps

Sent from my iPhone

> On 4 Feb 2016, at 23:23, Patrick Skjennum <ma...@habitats.no> wrote:
> 
> (Am I doing this mailing list thing right? Never used this ...)
> 
> I do not have a cluster.
> 
> Initially I tried to set up hadoop+hbase+spark, but after spending a week trying to get it to work, I gave up. I had a million problems with mismatching versions, and things working locally on the server but not programmatically through my client computer, and vice versa. There was always something that did not work, one way or another.
> 
> And since I had to actually get things done rather than becoming an expert in clustering, I gave up and just used simple serialization.
> 
> Now I'm going to make a second attempt, but this time around I'll ask for help :p
> -- 
> mvh
> Patrick Skjennum
> 
> 
>> On 04.02.2016 22.14, Ted Yu wrote:
>> bq. had a hard time setting it up
>> 
>> Mind sharing your experience in more detail :-)
>> If you already have a hadoop cluster, it should be relatively straightforward to set up.
>> 
>> Tuning needs extra effort.
>> 
>>> On Thu, Feb 4, 2016 at 12:58 PM, habitats <ma...@habitats.no> wrote:
>>> Hello
>>> 
>>> I have ~5 million text documents, each around 10-15KB in size, and split
>>> into ~15 columns. I intend to do machine learning, and thus I need to
>>> extract all of the data at the same time, and potentially update everything
>>> on every run.
>>> 
>>> So far I've just used JSON serialization, or simply cached the RDD to disk.
>>> However, I feel like there must be a better way.
>>> 
>>> I have tried HBase, but I had a hard time setting it up and getting it to
>>> work properly. It also felt like a lot of work for my simple requirements. I
>>> want something /simple/.
>>> 
>>> Any suggestions?
>>> 
>>> 
>>> 
>>> --
>>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Recommended-storage-solution-for-my-setup-5M-items-10KB-pr-tp26150.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>> For additional commands, e-mail: user-help@spark.apache.org
> 

Re: Recommended storage solution for my setup (~5M items, 10KB pr.)

Posted by Patrick Skjennum <ma...@habitats.no>.
(Am I doing this mailing list thing right? Never used this ...)

I do not have a cluster.

Initially I tried to set up hadoop+hbase+spark, but after spending a week 
trying to get it to work, I gave up. I had a million problems with 
mismatching versions, and things working locally on the server but not 
programmatically through my client computer, and vice versa. There was 
/always something/ that did not work, one way or another.

And since I had to actually get things /done/ rather than becoming an 
expert in clustering, I gave up and just used simple serialization.

Now I'm going to make a second attempt, but this time around I'll ask 
for help :p

-- 
mvh
Patrick Skjennum


On 04.02.2016 22.14, Ted Yu wrote:
> bq. had a hard time setting it up
>
> Mind sharing your experience in more detail :-)
> If you already have a hadoop cluster, it should be relatively 
> straightforward to set up.
>
> Tuning needs extra effort.
>
> On Thu, Feb 4, 2016 at 12:58 PM, habitats <mail@habitats.no 
> <ma...@habitats.no>> wrote:
>
>     Hello
>
>     I have ~5 million text documents, each around 10-15KB in size, and
>     split
>     into ~15 columns. I intend to do machine learning, and thus I need to
>     extract all of the data at the same time, and potentially update
>     everything
>     on every run.
>
>     So far I've just used JSON serialization, or simply cached the RDD
>     to disk.
>     However, I feel like there must be a better way.
>
>     I have tried HBase, but I had a hard time setting it up and
>     getting it to
>     work properly. It also felt like a lot of work for my simple
>     requirements. I
>     want something /simple/.
>
>     Any suggestions?
>
>
>
>     --
>     View this message in context:
>     http://apache-spark-user-list.1001560.n3.nabble.com/Recommended-storage-solution-for-my-setup-5M-items-10KB-pr-tp26150.html
>     Sent from the Apache Spark User List mailing list archive at
>     Nabble.com.
>
>     ---------------------------------------------------------------------
>     To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>     <ma...@spark.apache.org>
>     For additional commands, e-mail: user-help@spark.apache.org
>     <ma...@spark.apache.org>
>
>


Re: Recommended storage solution for my setup (~5M items, 10KB pr.)

Posted by Ted Yu <yu...@gmail.com>.
bq. had a hard time setting it up

Mind sharing your experience in more detail :-)
If you already have a hadoop cluster, it should be relatively
straightforward to set up.

Tuning needs extra effort.

On Thu, Feb 4, 2016 at 12:58 PM, habitats <ma...@habitats.no> wrote:

> Hello
>
> I have ~5 million text documents, each around 10-15KB in size, and split
> into ~15 columns. I intend to do machine learning, and thus I need to
> extract all of the data at the same time, and potentially update everything
> on every run.
>
> So far I've just used JSON serialization, or simply cached the RDD to disk.
> However, I feel like there must be a better way.
>
> I have tried HBase, but I had a hard time setting it up and getting it to
> work properly. It also felt like a lot of work for my simple requirements.
> I
> want something /simple/.
>
> Any suggestions?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Recommended-storage-solution-for-my-setup-5M-items-10KB-pr-tp26150.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>
>