
Posted to user@pig.apache.org by Mohit Singh <mo...@gmail.com> on 2012/08/30 06:20:57 UTC

Beginner. Help needed in getting started

I am new to Hadoop and all its derivatives, and I am getting really
intimidated by the abundance of information available.

But one thing I have realized is that to start implementing/using Hadoop or
distributed code, one basically has to change the way one thinks about a
problem.

I was wondering if someone could help me with the following.

So, basically (like anyone else), I have some raw data. I want to parse it,
extract some information, then run some algorithm and save the results.

Let's say I have a text file "foo.txt" where the data looks like this:

 id,$value,garbage_field,time_string\n
  1, 200, grrrr,2012:12:2:13:00:00
  2, 12.22,jlfa,2012:12:4:15:00:00
  1, 2, ajf, 2012:12:22:13:56:00

As you can see, the id can be repeated. Think of the id as a customer and
the value as how much money that customer has spent. What I want to do is
save the result in a file which contains how much money each customer has
spent in the "morning", "afternoon", "evening" and "night" (you can define
your own time buckets for what counts as morning and so on). For example,
here the output for id 1 would probably be:

     1, 0,202,0,0

1 is the id, then $0 spent in the morning, 202 in the afternoon, and 0 in
the evening and at night.

Now, I have Python code for this, but I have to implement it in Pig to get
started. If anyone can write this or guide me through it, that's all I need
to get started.

Thanks

Re: Beginner. Help needed in getting started

Posted by TianYi Zhu <ti...@facilitatedigital.com>.

Hi Mohit,

assuming you are using Pig 0.9+, please check this link to learn how to
write user-defined functions (UDFs) in Python:
http://archive.cloudera.com/cdh4/cdh/4/pig/udf.html#python-udfs

for your problem, you can handle it like this:

1. load data from text file
2. pass the data line by line through your UDF; the UDF should take a line
as input and output the line with an additional
time_information field ("morning", "afternoon", "evening")
3. group the lines by id
4. for each grouped result, filter by time_information and calculate the
sum of the cost for each time_information value
5. write them to file
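
The steps above can be tried out locally in plain Python before moving
anything into Pig. This is only a sketch under assumptions: the hour
boundaries for the buckets, the function names time_bucket and
spend_by_bucket, and the choice to skip the header line are all made up for
illustration and are not from the original code. The time format
(year:month:day:hour:minute:second) is taken from the sample data.

```python
from collections import defaultdict

# Bucket names in output order. The hour boundaries used below are an
# assumption; adjust them to whatever "morning" etc. should mean.
BUCKETS = ("morning", "afternoon", "evening", "night")


def time_bucket(time_string):
    """Map a 'YYYY:MM:DD:HH:MM:SS' string to a bucket name (step 2)."""
    hour = int(time_string.strip().split(":")[3])
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if 18 <= hour < 22:
        return "evening"
    return "night"


def spend_by_bucket(lines):
    """Group by id and sum the value per bucket (steps 3 and 4)."""
    totals = defaultdict(lambda: dict.fromkeys(BUCKETS, 0.0))
    for line in lines:
        # Fields: id, value, garbage_field, time_string
        cust_id, value, _garbage, time_string = [
            field.strip() for field in line.split(",")
        ]
        totals[cust_id][time_bucket(time_string)] += float(value)
    return dict(totals)


# Data rows from foo.txt (the header line is assumed to be skipped).
sample = [
    "1, 200, grrrr,2012:12:2:13:00:00",
    "2, 12.22,jlfa,2012:12:4:15:00:00",
    "1, 2, ajf, 2012:12:22:13:56:00",
]

result = spend_by_bucket(sample)

# Step 5: one output line per id, buckets in the order of BUCKETS.
for cust_id in sorted(result):
    row = result[cust_id]
    print(",".join([cust_id] + [str(row[name]) for name in BUCKETS]))
```

Only time_bucket would need to become a Python UDF (Pig registers such a
file with something like REGISTER 'file.py' USING jython AS namespace;,
where the file name and namespace are placeholders). The grouping and
summing done here by spend_by_bucket would be left to Pig's own GROUP and
SUM.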

additional reference:
http://ofps.oreilly.com/titles/9781449302641/index.html

--
Thanks,
TianYi

not a native English speaker, correct me if I made mistakes....


