Posted to user@pig.apache.org by Erik Paulson <ep...@cs.wisc.edu> on 2008/04/01 00:10:29 UTC

Re: Loading data

On Mon, Mar 24, 2008 at 03:20:02PM -0700, Benjamin Reed wrote:
> PigStorage uses regex for splitting as defined in:
> 
> http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html#sum
> 
> It looks like you might need to specify PigStorage('[|]').
> 
> And yes, pig does process directories just like hadoop.
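
To illustrate the regex point above: an unescaped '|' is the alternation
operator in java.util.regex, so as a pattern it matches the empty string at
every position and the line is split between every character. The sketch
below uses plain String.split to show the difference. It is not Pig's
actual PigStorage code, and the class and method names (SplitDemo, split)
are hypothetical.

```java
import java.util.Arrays;

public class SplitDemo {
    // Hypothetical helper: split one input line with a Java regex delimiter.
    static String[] split(String line, String regex) {
        return line.split(regex);
    }

    public static void main(String[] args) {
        String line = "first|second|third";

        // Bare '|' is an alternation of two empty patterns, so it matches
        // between every character and the line falls apart into single chars.
        System.out.println(Arrays.toString(split(line, "|")));

        // A character class makes '|' a literal delimiter.
        System.out.println(Arrays.toString(split(line, "[|]")));

        // A backslash escape also works in Java source, provided both
        // backslash layers survive to reach the regex compiler.
        System.out.println(Arrays.toString(split(line, "\\|")));

        // ',' has no special meaning in a regex, so it needs no escaping.
        System.out.println(Arrays.toString(split("first,second,third", ",")));
    }
}
```

The character-class form avoids backslashes entirely, so it does not depend
on how many layers of string escaping sit between the script and the regex
compiler.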

Sorry to keep asking beginner questions, but what is the
syntax to get Pig to load directories?


grunt> ls /scratch/epaulson/small/test
file:/scratch/epaulson/small/test/foo<r 1>      19
file:/scratch/epaulson/small/test/zot<r 1>      21
file:/scratch/epaulson/small/test/bar<r 1>      19
grunt> cat /scratch/epaulson/small/test/foo
first|second|third
grunt> cat /scratch/epaulson/small/test/bar
fourth|fifth|sixth
grunt> cat /scratch/epaulson/small/test/zot
seventh|eighth|ninth
grunt> dircontents = load '/scratch/epaulson/small/test/' using PigStorage('[|]');
grunt> dump dircontents;
2008-03-31 14:24:26,251 [main] ERROR org.apache.pig.tools.grunt.GruntParser - Unable to open iterator for alias: dircontents

Thanks!

-Erik

> 
> ben
> 
> On Monday 24 March 2008 15:07:39 Erik Paulson wrote:
> > Hello all -
> >
> > I'm trying to load data that is separated by '|' characters, using the
> > PigStorage layer (using today's SVN).
> >
> > From following the code in Tuple, I think I'm doing this right, but maybe
> > something in the parser is eating my character separators?
> >
> >
> >
> > grunt> cat /tmp/pipeseperated
> > first|second|third
> > grunt> cat /tmp/commaseperated
> > first,second,third
> > grunt> pipedata = load '/tmp/pipeseperated' using PigStorage('\\|');
> > grunt> commadata = load '/tmp/commaseperated' using PigStorage(',');
> > grunt> dump pipedata
> > (, f, i, r, s, t, |, s, e, c, o, n, d, |, t, h, i, r, d, )
> > grunt> dump commadata;
> > (first, second, third)
> > grunt> trytwo = load '/tmp/pipeseperated' using PigStorage('|');
> > grunt> dump trytwo
> > (, f, i, r, s, t, |, s, e, c, o, n, d, |, t, h, i, r, d, )
> >
> >
> > And a second question: in Hadoop, it's customary to give a path to a
> > directory containing all of the input files - is the same thing doable in
> > Pig?
> >
> > Thanks!
> >
> > -Erik
>