Posted to user@pig.apache.org by "Moore, Michael A." <Mi...@jhuapl.edu> on 2011/06/07 21:04:21 UTC

Loading Files with Comment Lines

Hello all-

I've got a quick question and Google isn't proving to be much help.

I've got a big file that has a few lines in it prefaced with a pound sign (#) to indicate they are to be ignored.  I would like to LOAD this file using PigStorage.  Is there a way to do this, or is it handled automatically?

The data might look something like this:

# Data Source: Project A
# Contact MMoore with Questions
# SenderId	RecipientId
1	2
3	5
6	7
#2	1
3	6
11	7

Thanks!
-Michael

______________________________________
Michael Moore :: Michael.Moore@jhuapl.edu
The Johns Hopkins University Applied Physics Laboratory
0B7B17EE1AE2A80B pgp
BC31 A861 9726 8211 F79F 7E21 0B7B 17EE 1AE2 A80B pgp fingerprint
 


RE: Loading Files with Comment Lines

Posted by "Moore, Michael A." <Mi...@jhuapl.edu>.
Brilliant!  Thanks Alan!

________________________________________
From: Alan Gates [gates@yahoo-inc.com]
Sent: Tuesday, June 07, 2011 4:25 PM
To: user@pig.apache.org
Subject: Re: Loading Files with Comment Lines

A = load 'input' as (x, y);
B = filter A by SUBSTRING(x, 0, 1) != '#';
...




Re: Loading Files with Comment Lines

Posted by Alan Gates <ga...@yahoo-inc.com>.
A = load 'input' as (x, y);
B = filter A by SUBSTRING(x, 0, 1) != '#';
...




Re: Loading Files with Comment Lines

Posted by Daniel Eklund <do...@gmail.com>.
Agree with the pre-processing step... BUT, in case the data is big data
(i.e. pound signs scattered over terabytes), you could first load each line
into a relation as a single chararray, filter, and then split on the
columns... I have many similar issues where the default loader won't handle
something, and I have been using this 'design pattern'... Something like:

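-- TextLoader keeps each line whole; the default PigStorage would split it on tabs
A = LOAD 'yourfile' USING TextLoader() AS (data:chararray);
B = FILTER A BY SUBSTRING(data, 0, 1) != '#';
C = FOREACH B GENERATE SOMETOKENIZEUDF(data) AS ( .. your columns...);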

I've become a big fan of the Python UDFs, and you could easily use them as
your own 'loader' in the third step above.

I will not vouch for the efficiency of the approach.
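
A minimal sketch of what the tokenizing step could look like as a Python UDF, assuming tab-delimited lines with exactly two fields as in the sample data. The file name tokenize_udf.py, the function name split_line, and the field names sender and recipient are all hypothetical. The fallback decorator is only there so the file can be tried outside Pig, and how it attaches the schema string is an assumption.

```python
# tokenize_udf.py (hypothetical file name): sketch of a Python UDF for Pig.
try:
    # Assumption: Pig's Python support makes an outputSchema decorator available.
    from pig_util import outputSchema
except ImportError:
    # Fallback so the function can be exercised in a plain Python interpreter.
    def outputSchema(schema):
        def decorator(func):
            func.outputSchema = schema
            return func
        return decorator


@outputSchema("t:tuple(sender:chararray, recipient:chararray)")
def split_line(data):
    """Split one tab-delimited line into a (sender, recipient) tuple.

    Returns None for missing input or for lines that do not have
    exactly two fields, so malformed rows surface as nulls.
    """
    if data is None:
        return None
    parts = data.split("\t")
    if len(parts) != 2:
        return None
    return (parts[0], parts[1])
```

If the file were registered with something like REGISTER 'tokenize_udf.py' USING jython AS tok; the third step might read C = FOREACH B GENERATE FLATTEN(tok.split_line(data)); (untested).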

On Tue, Jun 7, 2011 at 3:12 PM, <wi...@thomsonreuters.com> wrote:

> Can you stream it through
>
>  grep -v '^#'
>
> ?

Re: Loading Files with Comment Lines

Posted by "Moore, Michael A." <Mi...@jhuapl.edu>.
Hmm, thanks for the reply.  Anyone have a Pig way of doing this?  I'd rather not write a UDF to look for comment lines, but I can do so if I have to.  This seems like something PigStorage or the like should handle.
______________________________________
Michael Moore :: Michael.Moore@jhuapl.edu
The Johns Hopkins University Applied Physics Laboratory
JHUAPL/AISD/VES analytics section
240-228-6768 phone
202-370-7993 mobile

0B7B17EE1AE2A80B pgp
BC31 A861 9726 8211 F79F 7E21 0B7B 17EE 1AE2 A80B pgp fingerprint
 

On Jun 7, 2011, at 3:17 PM, <wi...@thomsonreuters.com> wrote:

> I do that kind of streaming on hdfs files using Hadoop streaming, outside of pig. I assume you could do it from inside pig too, but haven’t tested.


RE: Loading Files with Comment Lines

Posted by wi...@thomsonreuters.com.
I do that kind of streaming on hdfs files using Hadoop streaming, outside of pig. I assume you could do it from inside pig too, but haven’t tested.
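
For anyone who would rather not depend on grep being present on every node, here is a rough Python equivalent of the filter that could serve as a streaming mapper. It is only a sketch: the names filter_comments and main are made up for illustration, and it assumes a comment is any line whose very first character is '#'.

```python
import sys


def filter_comments(lines, marker="#"):
    """Yield only the lines that do not start with the comment marker."""
    for line in lines:
        if not line.startswith(marker):
            yield line


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Streaming entry point: copy non-comment lines from stdin to stdout.
    for line in filter_comments(stdin):
        stdout.write(line)
```

Appending a call to main() at the bottom of the script would let it read from standard input the way a streaming mapper is expected to.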

 

William F Dowling

Sr Technical Specialist, Software Engineering

Thomson Reuters

0 +1 215 823 3853

 

From: Moore, Michael A. [mailto:Michael.Moore@jhuapl.edu] 
Sent: Tuesday, June 07, 2011 3:14 PM
To: user@pig.apache.org
Subject: Re: Loading Files with Comment Lines

 

Possibly.  Can I do that if the file is already in HDFS?



Re: Loading Files with Comment Lines

Posted by "Moore, Michael A." <Mi...@jhuapl.edu>.
Possibly.  Can I do that if the file is already in HDFS?
______________________________________
Michael Moore :: Michael.Moore@jhuapl.edu
The Johns Hopkins University Applied Physics Laboratory
0B7B17EE1AE2A80B pgp
BC31 A861 9726 8211 F79F 7E21 0B7B 17EE 1AE2 A80B pgp fingerprint
 

On Jun 7, 2011, at 3:12 PM, <wi...@thomsonreuters.com> wrote:

> Can you stream it through
>
>  grep -v '^#'
>
> ?


RE: Loading Files with Comment Lines

Posted by wi...@thomsonreuters.com.
Can you stream it through

  grep -v '^#'

?

 

William F Dowling

Sr Technical Specialist, Software Engineering

Thomson Reuters

0 +1 215 823 3853

 
