Posted to user@pig.apache.org by "Moore, Michael A." <Mi...@jhuapl.edu> on 2011/06/07 21:04:21 UTC
Loading Files with Comment Lines
Hello all-
I've got a quick question and Google isn't proving to be much help.
I've got a big file that has a few lines in it prefaced with a pound sign (#) to indicate they are to be ignored. I would like to LOAD this file using PigStorage. Is there a way to do this, or is it handled automatically?
The data might look something like this:
# Data Source: Project A
# Contact MMoore with Questions
# SenderId RecipientId
1 2
3 5
6 7
#2 1
3 6
11 7
Thanks!
-Michael
______________________________________
Michael Moore :: Michael.Moore@jhuapl.edu
The Johns Hopkins University Applied Physics Laboratory
0B7B17EE1AE2A80B pgp
BC31 A861 9726 8211 F79F 7E21 0B7B 17EE 1AE2 A80B pgp fingerprint
RE: Loading Files with Comment Lines
Posted by "Moore, Michael A." <Mi...@jhuapl.edu>.
Brilliant! Thanks Alan!
Re: Loading Files with Comment Lines
Posted by Alan Gates <ga...@yahoo-inc.com>.
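-- Load the file with the default loader (PigStorage).
A = load 'input' as (x, y);
-- Keep only the rows whose first field does not begin with '#'.
B = filter A by SUBSTRING(x, 0, 1) != '#';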
...
Re: Loading Files with Comment Lines
Posted by Daniel Eklund <do...@gmail.com>.
I agree with the pre-processing step, but in case the data is big (i.e.
pound signs scattered over terabytes), you could first load each line into
a relation as one big chararray, filter, and then split it into columns. I
have many similar issues where the default loader won't handle something,
and I have been using this 'design pattern'. Something like:
-- Load each line as one chararray (assumes the lines contain no tab, PigStorage's default delimiter).
A = LOAD 'yourfile' AS (data:chararray);
-- Drop the lines that start with '#'.
B = FILTER A BY SUBSTRING(data,0,1) != '#';
-- SOMETOKENIZEUDF stands in for whatever tokenizing UDF you write.
C = FOREACH B GENERATE SOMETOKENIZEUDF(data) AS ( .. your columns...);
I've become a big fan of the Python UDFs, and you could easily use one as
your own 'loader' in the third step above.
I will not vouch for the efficiency of the approach.
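As a rough illustration of the third step, here is a minimal sketch of what such a Python tokenizing UDF might look like. The function name tokenize_line, the field names sender_id and recipient_id (borrowed from the sample header line), and the whitespace-delimited two-column layout are all assumptions, not something from the original posts. The outputSchema fallback is only there so the file also runs outside Pig; it assumes Pig supplies the real decorator when the script is registered as a Jython UDF.

```python
# Hypothetical tokenizing UDF; names and the two-column layout are assumptions.
try:
    outputSchema  # assumed to be provided by Pig when registered via Jython
except NameError:
    def outputSchema(schema):
        """Stand-in decorator so the module can be run and tested outside Pig."""
        def wrap(func):
            return func
        return wrap


@outputSchema("pair:(sender_id:chararray, recipient_id:chararray)")
def tokenize_line(data):
    """Split one raw line into a (sender_id, recipient_id) tuple.

    Returns None for null input or for lines that do not hold exactly
    two whitespace-separated fields.
    """
    if data is None:
        return None
    fields = data.split()
    if len(fields) != 2:
        return None
    return (fields[0], fields[1])


# Local check that mirrors the FILTER step followed by the tokenizing step.
sample = ["# SenderId RecipientId", "1 2", "#2 1", "11 7"]
rows = [tokenize_line(line) for line in sample if line[0:1] != "#"]
```

Returning None for malformed lines is one possible design choice; it leaves it to a later FILTER to discard the nulls rather than failing the whole job.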
Re: Loading Files with Comment Lines
Posted by "Moore, Michael A." <Mi...@jhuapl.edu>.
Hmm, thanks for the reply. Anyone have a Pig way of doing this? I'd rather not write a UDF to look for comment lines, but I can do so if I have to. This seems like something PigStorage or the like should handle.
RE: Loading Files with Comment Lines
Posted by wi...@thomsonreuters.com.
I do that kind of streaming on HDFS files using Hadoop streaming, outside of Pig. I assume you could do it from inside Pig too, but I haven't tested that.
William F Dowling
Sr Technical Specialist, Software Engineering
Thomson Reuters
0 +1 215 823 3853
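For anyone who would rather not depend on grep being available on every node, the same line filter can be sketched as a small Python script that reads standard input and writes standard output, which is the shape a streaming mapper generally takes. The names drop_comment_lines and main are made up for this example, and how the script gets wired into a Hadoop streaming job is not shown here.

```python
import sys


def drop_comment_lines(lines):
    """Yield only the lines that do not start with '#', like grep -v '^#'."""
    for line in lines:
        if not line.startswith("#"):
            yield line


def main(stream_in=sys.stdin, stream_out=sys.stdout):
    """Copy stream_in to stream_out, skipping comment lines."""
    for kept in drop_comment_lines(stream_in):
        stream_out.write(kept)

# To use this as a stand-alone filter, add a final line that calls main().
```

Like the grep pattern, this only drops lines whose very first character is '#'; a line with leading whitespace before the pound sign would pass through.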
Re: Loading Files with Comment Lines
Posted by "Moore, Michael A." <Mi...@jhuapl.edu>.
Possibly. Can I do that if the file is already in HDFS?
RE: Loading Files with Comment Lines
Posted by wi...@thomsonreuters.com.
Can you stream it through
grep -v '^#'
?