Posted to user@spark.apache.org by Brian Cohn <bc...@gmail.com> on 2014/07/21 12:04:06 UTC

Data row-operation processing advice

Hi!

I'm a student interested in using Spark for my big data research project.
I've successfully set up a cluster in the cloud, and now I'm developing a data
pipeline for batch processing.

Input:
An S3 folder of flat files (CSV), all with the same number of columns.

Columns include:
ID (integer)
in_degree connections (list of integer IDs)
out_degree connections (list of integer IDs)
year (integer)
month (string)
latitude (double)
longitude (double)
city (string)
name (string)
descriptionA (a very long string)
descriptionB
descriptionC
descriptionD
etc., etc. (more discrete and categorical variables)



The current total size is about 100 GB, but I'd like to write a workflow that
can scale up to the full 2 TB dataset later.
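
For reference, here is roughly how I'm planning to load and parse the files in
PySpark (the bucket path, the s3n:// scheme, and the plain comma split instead
of a real CSV parser are just assumptions for this sketch):

    from pyspark import SparkContext

    sc = SparkContext(appName="pipeline")

    # Read every CSV file under the (hypothetical) S3 prefix as lines of text
    raw = sc.textFile("s3n://my-bucket/flat-files/*.csv")

    # Naive parse: split each line on commas (no quoting or escaping handled)
    rows = raw.map(lambda line: line.split(","))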

Desired action:

I want to import the data from S3 and then transform the dataset by adding
columns.
For instance, I would want to apply operations to every row in the dataset,
creating new features/columns for each row.
Some of these operations would be simple functions that take in values from
multiple columns and run basic operations (addition, division, split_by_word,
etc.); others would be more complicated functions that interface with an
external website (Google/Facebook) to grab further information about a given
row ID and plug the result into a new column. By nature, all of these
operations can run totally in parallel.
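
To make that concrete, this is the shape of thing I have in mind, continuing
from the parsed rows RDD above (the column indices, the derive_features
helper, and the external lookup are all made up for illustration, and the
external call itself is left commented out):

    # Simple per-row derivations: take values from existing columns and
    # append new ones (e.g. a latitude/longitude ratio and a word count
    # of descriptionA)
    def derive_features(row):
        lat, lon = float(row[5]), float(row[6])   # assumed column positions
        word_count = len(row[9].split())          # assumed descriptionA position
        ratio = lat / lon if lon != 0 else 0.0
        return row + [ratio, word_count]

    enriched = rows.map(derive_features)

    # More expensive per-row work (hitting an external site per row ID):
    # mapPartitions lets one connection/session be reused per partition
    def lookup_external(partition):
        # session = open_http_session()           # hypothetical external client
        for row in partition:
            extra = None                          # extra = session.lookup(row[0])
            yield row + [extra]

    with_external = enriched.mapPartitions(lookup_external)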

What's a workable way to transform my dataset within Spark (or Spark SQL)
into a format that I can run through MLlib?
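
My rough guess is that the end of the pipeline would produce MLlib
LabeledPoint records, something like the sketch below (the choice of label
and feature columns is made up, and I'm only using numeric fields), but I'm
not sure this is the right approach:

    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.linalg import Vectors

    # Turn each enriched row into a LabeledPoint: one numeric label plus a
    # dense vector of numeric features (column choices are just placeholders)
    def to_labeled_point(row):
        label = float(row[3])                     # e.g. year as the label
        features = Vectors.dense([float(row[5]),  # latitude
                                  float(row[6]),  # longitude
                                  float(row[-2]), # derived ratio
                                  float(row[-1])])  # derived word count
        return LabeledPoint(label, features)

    training = enriched.map(to_labeled_point)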


Thanks in advance!

Brian