Posted to user@spark.apache.org by Pradeep Nayak <pr...@gmail.com> on 2016/05/11 18:36:30 UTC

Is this possible to do in Spark?

Hi -

I have a rather unusual problem which I am trying to solve, and I am not sure
if Spark would help here.

I have a directory: /X/Y/a.txt and in the same structure /X/Y/Z/b.txt.

a.txt contains a unique serial number, say:
12345

and b.txt contains key-value pairs:
a,1
b,1
c,0
etc.

Every day you receive data for a system Y, so there are multiple a.txt and
b.txt files per serial number. The serial number doesn't change and is the
key. There are multiple systems, the data for a whole year is available, and
it's huge.

I am trying to generate a report of the unique serial numbers where the value
of option a has changed to 1 over the last few months. Let's say the default
is 0. I also want to figure out how many times it was toggled.


I am not sure how to read two text files in Spark at the same time and
associate them with the serial number. Is there a way of doing this in place,
given that we know the directory structure? Or should we be transforming the
data anyway to solve this?

Re: Is this possible to do in Spark?

Posted by Mathieu Longtin <ma...@closetwork.org>.
Make a function (or lambda) that reads the text file. Make an RDD with a
list of X/Y directories, then map that RDD through the file-reading function.
Do the same with your X/Y/Z directories. You then have RDDs with the content
of each file as a record. Work with those as needed.
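
Here is a minimal sketch of that approach in Scala, assuming the daily drop
directories sit on a filesystem every executor can read (otherwise you would
reach for sc.wholeTextFiles or the Hadoop FileSystem API instead of
scala.io.Source). The directory list, the SerialToggleReport object name, and
sorting by directory name as a stand-in for date order are placeholders for
illustration:

import org.apache.spark.{SparkConf, SparkContext}
import scala.io.Source

object SerialToggleReport {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("serial-toggle-report"))

    // Hypothetical list of daily drop directories, each laid out like the
    // question: <dir>/a.txt and <dir>/Z/b.txt. In practice you would build
    // this list from a filesystem listing.
    val dirs = Seq("/X/Y/day1", "/X/Y/day2", "/X/Y/day3")
    val dirRDD = sc.parallelize(dirs)

    // Read both files for each directory on the executors (assumes the
    // paths are readable from every executor node).
    val perDay = dirRDD.map { dir =>
      val serial = Source.fromFile(s"$dir/a.txt").getLines().next().trim
      val options = Source.fromFile(s"$dir/Z/b.txt").getLines()
        .map(_.split(","))
        .collect { case Array(k, v, _*) => (k.trim, v.trim) }
        .toMap
      (serial, dir, options.getOrElse("a", "0"))   // default of option a is 0
    }

    // Group by serial number, order each serial's records by directory name
    // (standing in for date order), and count how often option a flipped.
    val toggles = perDay
      .groupBy { case (serial, _, _) => serial }
      .mapValues { records =>
        val values = records.toSeq
          .sortBy { case (_, dir, _) => dir }
          .map { case (_, _, a) => a }
        values.sliding(2).count {
          case Seq(prev, curr) => prev != curr
          case _               => false
        }
      }

    // Serial numbers whose option a changed at least once, plus toggle count.
    toggles.filter { case (_, n) => n > 0 }
           .collect()
           .foreach { case (serial, n) => println(s"$serial toggled $n times") }

    sc.stop()
  }
}

Note that the toggle count depends on processing each serial's days in the
right order, so however you build the directory list, make sure it encodes
the date.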

-- 
Mathieu Longtin
1-514-803-8977