Posted to user@pig.apache.org by Renato Marroquín Mogrovejo <re...@gmail.com> on 2010/10/05 16:51:30 UTC

Re: Pig indexing

Hey Dmitriy!
I've been trying to get to this email for the last couple of weeks, and
finally I am here.
You were talking about Pig's merge join, but there is one thing I didn't
quite understand from the wiki (http://wiki.apache.org/pig/PigMergeJoin):
does it sample records to create the index because we know that the files
are sorted, which would result in a "clustered index", right? And is the
intermediate index simply discarded once the merge join finishes?
Do you know where I could find a similar loader that is aware of indexes?
Maybe there is some source code I could look into. In any case, I will
definitely look into the MergeJoinIndexer code to try to get a grasp on it (:
One last thing, what do you mean by "splits for blocks"?
Thanks in advance.

Renato M.
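
To make the question about the merge join index concrete, here is a minimal
sketch of a sparse index over sorted data. It is an illustration only, not
Pig's MergeJoinIndexer code: the function names, the in-memory list standing
in for a sorted file, and the fixed split size are all assumptions. The idea
is that sorted input needs only one index entry per split (its first key),
and a binary search over those entries tells a reader where to start.

```python
from bisect import bisect_left


def build_sparse_index(sorted_keys, split_size):
    """Record (first_key, offset) for every split of a sorted key list.

    Hypothetical helper: a real index would store a file name and byte
    offset instead of a list position.
    """
    return [(sorted_keys[i], i) for i in range(0, len(sorted_keys), split_size)]


def find_start_offset(index, key):
    """Return the offset of the split at which a scan for `key` should start.

    This picks the last split whose first key is strictly smaller than
    `key`, so duplicates of `key` that begin in an earlier split are not
    skipped. If no such split exists, the scan starts at the first split.
    """
    first_keys = [first_key for first_key, _ in index]
    pos = bisect_left(first_keys, key) - 1
    return index[max(pos, 0)][1]
```

Because the index holds one entry per split rather than one per record, it
stays small and can be rebuilt cheaply, which is consistent with treating it
as a temporary structure.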



> 2010/9/22 Dmitriy Ryaboy <dv...@gmail.com>
>
> Renato,
>> Using indexes is "just" a matter of writing a loader that is aware of said
>> indexes. Merge join already builds an index and uses it as part of its
>> internals.
>> With filters being offered to loaders that claim to implement filter
>> push-down, there is no reason not to have a loader that can look up block
>> locations in some index, and only create splits for blocks that contain the
>> unfiltered values, for example.  One thing to note is that currently there
>> is no automatic index creation (since you can load arbitrary data), so you
>> need to code up a way to look up which of the resources you are trying to
>> load have been indexed.
>>
>> -D
>>
>> On Tue, Sep 21, 2010 at 6:32 PM, Renato Marroquín Mogrovejo <
>> renatoj.marroquin@gmail.com> wrote:
>>
>> > Hi everyone!
>> >
>> > After reading Ed's email, I got really intrigued about Pig using indexes, I
>> > thought those were just plans lol
>> > But as commented in here https://issues.apache.org/jira/browse/PIG-209, we
>> > could use indexing through Zebra, right? But that means that we would have
>> > to preload our data into Zebra, "sort it" in a similar way to the sorted
>> > table union example of the wiki, and then if we make a join using them, this
>> > join is made in a similar way to the work of Hung-chih Yang et al. ??
>> > Are there any published papers or technical overviews on Pig/Zebra or
>> > MapReduce/Zebra?
>> > Thanks in advance.
>> >
>> >
>> > Renato M.
>> >
>>
>
>
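
One possible reading of "only create splits for blocks that contain the
unfiltered values" in the quoted message can be sketched as follows. This is
not Pig's loader API: the BlockEntry record, its fields, and the
select_splits function are hypothetical, and the index is assumed to store
the minimum and maximum key of each block. Given a pushed-down range filter,
the loader would emit a split only for blocks whose key range can contain
matching records.

```python
from typing import List, NamedTuple


class BlockEntry(NamedTuple):
    """One hypothetical index entry describing a block of an indexed file."""
    path: str      # file that holds the block
    offset: int    # where the block starts within the file
    length: int    # size of the block
    min_key: int   # smallest key stored in the block
    max_key: int   # largest key stored in the block


def select_splits(index: List[BlockEntry], low: int, high: int) -> List[BlockEntry]:
    """Keep only blocks whose [min_key, max_key] range overlaps [low, high].

    Blocks that cannot contain a key in the filter range are dropped, so no
    split (and no map task) would be created for them.
    """
    return [b for b in index if b.max_key >= low and b.min_key <= high]
```

A block that survives the pruning can still hold records outside the range,
so the filter itself would still have to be applied to the loaded records.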

RE: Pig indexing

Posted by Yan Zhou <ya...@yahoo-inc.com>.
Zebra, a contrib project under Pig, is such a loader, and it builds its own indexes.

Yan
