Posted to user@pig.apache.org by miryala vignesh <mi...@gmail.com> on 2009/11/01 14:27:41 UTC

Using elements of a tuple in other tuples FOREACH statement.

I have a relation X with a single chararray field that holds the
contents of an HTML file:
X = (file:chararray)
X =
(<html><body><h2>hie</h2><h2>hie</h2><h2>hie</h2><h2>hie</h2>djfkdj<p>jhsdaj</p><h2>hie</h2></body></html>)

In Y I have the tag names with their start and end indices:
Y =
tag,start,end
(html,0,105)
(body,6,98)
(h2,12,24)
(h2,24,36)
(h2,36,48)
(h2,48,60)
(p,66,79)
(h2,79,91)

Z = FOREACH Y GENERATE udf(??); (what parameters should I pass to the udf so
that it receives X.file?)
My question is: for each tuple in Y, how do I store the part of the string
file from the start index to the end index in some other alias, say Z?

Join or Cross is not an option because I want to avoid redundant storage.

Any alternative implementation or idea is welcome.

-- 
Vignesh Miriyala
http://web.iiit.ac.in/~miriyala
http://vigneshmiriyala.wordpress.com

Re: Using elements of a tuple in other tuples FOREACH statement.

Posted by Alan Gates <ga...@yahoo-inc.com>.

I'm not sure I understand your question, but it sounds like you want
to commingle data from two relations, X and Y, without doing a join or
cross.  Is that correct?  If so, you can't do that.  If you have a
script like:

X = load 'file_data';
Y = load 'tuple_data';
Z = do something with X and Y

Z must be either a cross, join, or cogroup.  Otherwise Pig has no way  
to understand how to stitch the data together.  Perhaps something like:

A = load 'file_data';   -- contents of html file
B = load 'tuple_data';  -- indices and tag name
C = cogroup A all, B all;
D = foreach C generate udf(A, B);

will do what you want.  This will collect all of your records together
into a single group and pass them to your udf for evaluation.  Obviously
this is not parallelizable.  If you would rather collect them together on
some key, you can change the cogroup statement to group on that key
instead of all.
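
As a rough illustration of what such a udf might do with the two bags it
receives, here is a minimal sketch in plain Python (not Pig, and not an
actual Pig UDF).  The function name slice_tags and the list-of-tuples
representation of the bags are assumptions made for this example; the data
is taken from the original question, and the end index is assumed to be
exclusive, which matches the indices given there.

```python
def slice_tags(file_bag, index_bag):
    """Mimic a udf(A, B) call after 'cogroup A all, B all'.

    file_bag  -- hypothetical stand-in for bag A: one (file,) tuple
    index_bag -- hypothetical stand-in for bag B: (tag, start, end) tuples
    Returns a list of (tag, fragment) tuples, end index exclusive.
    """
    if len(file_bag) != 1:
        raise ValueError("expected exactly one file record")
    (text,) = file_bag[0]
    return [(tag, text[start:end]) for tag, start, end in index_bag]


# Data from the original question.
A = [("<html><body><h2>hie</h2><h2>hie</h2><h2>hie</h2><h2>hie</h2>"
      "djfkdj<p>jhsdaj</p><h2>hie</h2></body></html>",)]
B = [
    ("html", 0, 105),
    ("body", 6, 98),
    ("h2", 12, 24),
    ("h2", 24, 36),
    ("h2", 36, 48),
    ("h2", 48, 60),
    ("p", 66, 79),
    ("h2", 79, 91),
]

Z = slice_tags(A, B)
for tag, fragment in Z:
    print(tag, fragment)
```

The file contents are held once and only sliced per index tuple, which is
the effect the single-group cogroup is meant to achieve.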

Alan.
