Posted to user@pig.apache.org by kiranprasad <ki...@imimobile.com> on 2011/10/13 06:35:00 UTC

How to store each record in a separate file

Hi

After grouping a data set, how do I save each group in a separate file?

ex:
A = LOAD 'E:/data.txt' USING PigStorage(',');
B = GROUP A BY $0;

cat data.txt;

(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)

After grouping 

(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(7,{(7,2,5)}) 
(8,{(8,3,4),(8,4,3)})

How do I save each record in a separate file?


Regards
Kiran.G

Re: How to store each record in a separate file

Posted by kiranprasad <ki...@imimobile.com>.

After using MultiStorage(), the files have been generated based on the
group.
Now, how can I add headers to all the generated files?

Re: How to store each record in a separate file

Posted by kiranprasad <ki...@imimobile.com>.

Thank you all, it is working.

Re: How to store each record in a separate file

Posted by Thomas Kappler <tk...@googlemail.com>.

On Thu, Oct 13, 2011 at 07:56, Ayon Sinha <ay...@yahoo.com> wrote:
> Hi Kiranprasad,
> What is your use case? Are you sure you have picked the right tool for the job? Pig/Hadoop is meant for massive datasets, which means millions and billions of rows. In your case that would lead to millions and billions of files, which Hadoop doesn't like anyway.

I have also found that MultiStorage runs a reducer for each partition,
i.e., each separate file. This will be OK for a small number of
partitions (locations in Kiran's case), but will break down for larger
numbers.

I ended up letting Pig group the records and writing a script that
splits the Pig output into one file per group.

-- Thomas
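
For anyone who wants to try the "let Pig group, then split with a script"
approach described above, here is a minimal sketch in Python. The thread
does not show the original script or say what language it was written in,
so everything here is an assumption: the function name split_by_key is
hypothetical, the Pig output is assumed to be flat delimited text with one
record per line (not the nested bag format shown in the question), the
group key is assumed to be the first field, and each group is written to a
file named <key>.txt.

```python
import csv
import itertools
import os


def split_by_key(lines, out_dir, key_index=0, delimiter=","):
    """Write each record to out_dir/<key>.txt, keyed on one field.

    `lines` is any iterable of text lines (an open file works).
    Returns the sorted list of file names that were written.
    Assumption: key values are safe to use as file names.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = set()
    rows = csv.reader(lines, delimiter=delimiter)
    rows = (row for row in rows if row)  # skip blank lines
    # Grouped or ordered Pig output keeps equal keys adjacent, so groupby
    # lets us hold only one output file open at a time.
    for key, group in itertools.groupby(rows, key=lambda row: row[key_index]):
        name = f"{key}.txt"
        # Truncate the first time a key is seen in this run and append
        # afterwards, so non-adjacent keys still land in the same file.
        mode = "a" if name in written else "w"
        with open(os.path.join(out_dir, name), mode, newline="") as out:
            writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
            writer.writerows(group)
        written.add(name)
    return sorted(written)
```

To use it, open the Pig output file and pass the file object as `lines`,
for example split_by_key(open("part-r-00000", newline=""), "by_location"),
where both names are placeholders. Only one output file is open at any
time, so the number of groups is not limited by open file handles, though
a very large number of small files is still awkward on most file systems.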

Re: How to store each record in a separate file

Posted by kiranprasad <ki...@imimobile.com>.

Hi Ayon

I have just started working with Pig and am trying out different use cases.
In one of them there are 10 million records, and after grouping them
by a field (say, location), I want all the records for a particular location
in a separate file.
I am presently working in local mode.

Kiran.G

Re: How to store each record in a separate file

Posted by Ayon Sinha <ay...@yahoo.com>.

Hi Kiranprasad,
What is your use case? Are you sure you have picked the right tool for the job? Pig/Hadoop is meant for massive datasets, which means millions and billions of rows. In your case that would lead to millions and billions of files, which Hadoop doesn't like anyway.
Now, if your dataset is really small, do you really need Hadoop, or would Perl, Python, shell, or any programming language on a single machine suffice?
Just asking to make sure you are not headed down the wrong path.
OTOH, if you are doing this as an academic exercise, all is justified.
 
-Ayon

Re: How to store each record in a separate file

Posted by kiranprasad <ki...@imimobile.com>.

I want to compare two files, A.txt and B.txt.

cat A.txt;
(1,2,3)
(4,2,1)
(8,3,4)
(8,3,4)
(4,2,1)
(8,3,4)
(4,2,1)
cat B.txt;
1
2
3

Now I want to find the records where A.$0 == B.$0 and write the result to a
separate file.

Re: How to store each record in a separate file

Posted by kiranprasad <ki...@imimobile.com>.

Thank you for the quick response. But how can I do this in local mode?

Re: How to store each record in a separate file

Posted by Jonathan Coveney <jc...@gmail.com>.

To Ayon's point, MultipleOutputFormat can get the job done, but keep in mind
that Hadoop deals better with larger files than smaller ones. Every file is
allocated in blocks (64 MB, 128 MB, 256 MB), so lots of small blocks are bad.

Re: How to store each record in a separate file

Posted by Ayon Sinha <ay...@yahoo.com>.

Leaving aside the bigger question of why you would want to store each record in a separate file:
I'm not sure how to do this in Pig, but it is definitely possible in Hadoop (and also Streaming) via MultipleOutputFormat, where the name of the output file can be based on the base_dir, the key, and the value. You can create your own filename based on those arguments.
http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/mapred/lib/MultipleOutputFormat.html

You can definitely implement your own StoreFunc UDF. 
-Ayon