Posted to dev@pig.apache.org by Charlie Groves <ch...@threerings.net> on 2008/01/18 02:16:10 UTC

Getting query information while loading data

I'd like to expose the running query to my loading code for a few  
reasons:

- To allow the schema of the loaded data to be specified by its usage  
in the query, rather than by an explicit AS.  I know the names of the  
fields in my data, so it seems backwards to me to require them to be  
named in the query.  I'd rather use the data access in the query to  
figure out the names of the fields and pass that to my loader to put  
the data in the right place in a tuple.  This also seems like it  
could be nice for CSV data since it generally has the names as the  
first line.

- Following up on using the query to determine the schema, I'd like  
to use the query-determined schema to decide what to load.  My  
storage is broken out into files by field, so if I know which fields  
are used in a query, I can read only those fields and save a huge  
amount of busywork.

- To optimize filter operations using indexes.  For some of my  
fields, I have metadata that tells me the range of values in that  
file.  If I could find all the filter operations on that field, I  
could reject entire files if their values fell outside the filter range.
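
(A rough sketch in Java of the pruning described in that last bullet.  All
of the names here -- FileRangeStats, filesToRead -- are made up purely for
illustration; nothing below is existing Pig API.)

import java.util.ArrayList;
import java.util.List;

public class RangePruneSketch {

    // Hypothetical per-file metadata: the min and max value of one field
    // in that file.
    static class FileRangeStats {
        final String file;
        final long min;
        final long max;
        FileRangeStats(String file, long min, long max) {
            this.file = file; this.min = min; this.max = max;
        }
    }

    // Keep only the files whose [min, max] range overlaps the filter's
    // [lo, hi] range; every other file can be skipped without being read.
    static List<String> filesToRead(List<FileRangeStats> stats, long lo, long hi) {
        List<String> keep = new ArrayList<String>();
        for (FileRangeStats s : stats) {
            if (s.max >= lo && s.min <= hi) {
                keep.add(s.file);
            }
        }
        return keep;
    }

    public static void main(String[] args) {
        List<FileRangeStats> stats = new ArrayList<FileRangeStats>();
        stats.add(new FileRangeStats("part-0", 0, 99));
        stats.add(new FileRangeStats("part-1", 100, 199));
        stats.add(new FileRangeStats("part-2", 200, 299));
        // A filter like "x > 150" only ever needs part-1 and part-2.
        System.out.println(filesToRead(stats, 151, Long.MAX_VALUE));
    }
}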

Are you interested in some patches to do this sort of thing?  If so,  
what's the best way to expose this information to user code?  My very  
basic, initial thinking for the first two use cases is to write a  
LOVisitor and an EvalSpecVisitor to spider through the built query  
and build a schema to pass to an interested load func.  A load func  
indicates its interest by implementing a new interface that takes the  
schema, and it takes responsibility for making a tuple that conforms  
to the schema.  If a load func isn't interested, it just implements  
the current interface and loads all the data in its input stream.
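
(A rough sketch of what such an opt-in interface might look like.  The names
below -- QuerySchema, SchemaAwareLoadFunc, setRequestedSchema -- are
placeholders for illustration, not existing Pig classes.)

// Placeholder schema type: just enough to show the idea.
interface QuerySchema {
    int numFields();
    String fieldName(int position);
}

// A load func that implements this (hypothetical) interface signals that it
// can accept the schema derived from the query and will lay out the tuples
// it produces to match it.  Load funcs that don't implement it keep the
// current behavior and load everything in their input stream.
interface SchemaAwareLoadFunc {
    void setRequestedSchema(QuerySchema schema);
}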

The final use case seems like it would require exposing EvalFuncs and  
the LogicalPlan to user code, so I'm fine with just going after the  
first two for now and figuring that out later.  However, if there's a  
way that's exposed already in the code that I've missed, or if  
there's a better way to do it, I'd like to check it out since it'd be  
hugely beneficial for what I'm doing.

Thanks,
Charlie

Re: Getting query information while loading data

Posted by Alan Gates <ga...@yahoo-inc.com>.
Comments at the end.

Charlie Groves wrote:
>
> On Feb 4, 2008, at 1:42 PM, Alan Gates wrote:
>> Charlie Groves wrote:
>>>
>>> On Jan 18, 2008, at 11:45 AM, Alan Gates wrote:
>>>> Our thinking of how to provide field metadata (name and eventually 
>>>> types) for pig queries was to allow several options:
>>>>    1) AS in the LOAD, as you can currently do for names.
>>>>    2) using an outside metadata service, where we would tell it the 
>>>> file name and it would tell us the metadata.
>>>>    3) Support self describing data formats such as JSON.
>>>>
>>>> Your suggestion for a very simple schema provided in the first 
>>>> line of the file falls under category 3.  The trick here is that we 
>>>> need to be able to read that metadata about the fields at parse 
>>>> time (because we'd like to be able to do type checking and such).  
>>>> So in addition to the load function itself needing to examine the 
>>>> tuples, we need a way for the load function to read just enough of 
>>>> the file to tell the front end (on the client box, not on the 
>>>> map-reduce backend) the schema.  Maybe the best way to implement 
>>>> this is to have an interface that the load function would implement 
>>>> that lets the parser know that the load function can discover the 
>>>> metadata for it, and then the parser could call that load function 
>>>> before proceeding to type checking.
>>>>
>>>> We're also interested in being able to tell the load function the 
>>>> fields needed in the query.  Even if you don't have field per file 
>>>> storage (aka columnar storage) it's useful to be able to 
>>>> immediately project out fields you know the query won't care about, 
>>>> as you can avoid translation costs and memory storage.
>>>>
>>>> It's not clear to me that we need another interface to implement 
>>>> this.  We could just add a method "void neededColumns(Schema s)" to 
>>>> PigLoader.  As a post parsing step the parser would then visit the 
>>>> plan, as you suggest, and submit a schema to the PigLoader 
>>>> function.  It would be up to the specific loader implementation to 
>>>> decide whether to make use of the provided schema or not.
>>>
>>> I don't see the use for the first new function in addition to the 
>>> second.  If a schema is required by the query, the loader must be 
>>> able to produce data matching that schema.  If the loader can figure 
>>> out an internal schema, it can make that check that you describe in 
>>> function 1 in addition to structuring its data correctly as in 
>>> function 2.  If it can't determine its internal schema until it 
>>> loads data, then it can do neither and we have to wait until runtime 
>>> to see if it succeeds.  What about making the call "Schema 
>>> neededColumns(Schema s) throws IOException"?  The returned Schema is 
>>> the actual Schema that will be loaded which must be a superset of 
>>> the incoming Schema.  If the loader is unable to create the needed 
>>> schema, an IOException is thrown.
>>>
>> I'm not sure I understand what you're proposing.  I was trying to say 
>> that we need two separate things from the load function:
>> 1) A way to discover the schema of the data at parse time for type 
>> checking and query correctness checking (e.g. the user asked for 
>> field 5, is there a field 5?)  This is needed for metadata option 3, 
>> where the metadata is described by the data (as in JSON) or where the 
>> metadata is located in a file associated with the data.  We want to 
>> detect these kinds of errors before we submit to the backend (i.e. 
>> Hadoop) so that we can give the earliest possible error feedback.
>> 2) A way to indicate to the load function the schema it needs to 
>> load, as a way to support columnar storage schemes (such as you 
>> propose) or pushing projection down into the load.
>>
>> Were you saying that you didn't think one of those is necessary, or 
>> are you saying that you think we can accomplish both with one 
>> function being added to the load function?
>
> I'm saying that both can be accomplished with one new function on the 
> load func: Schema neededColumns(Schema s) throws IOException.  s is 
> the schema derived from the query, and the load func can use it to 
> satisfy your first requirement.  If it can check its underlying data, 
> it can then compare it to the schema in s and throw an IOException if 
> it can't satisfy that.  s can also be used to satisfy your second 
> requirement as it indicates to the load func what it's expected to load.
>
> The returned Schema is the form that the actual data returned by the 
> load func will take.  It must be a superset of the passed in Schema, 
> and really just exists to allow the load func to say it isn't going to 
> prune any of the data away at load time and just return everything 
> that it finds.  For load funcs that don't know the structure of their 
> data until they actually read it, they can return the * schema and 
> just wait until runtime to see if things blow up just like things work 
> currently.
>
> I think this makes more sense as a single function because the two 
> requirements are essentially the same operation.  To load enough of 
> the data to check a given schema against what's actually in the store 
> is almost the same work to determine what it'll actually load for 
> requirement two.
>
> Make more sense?
Let's work through a use case with the following script:

a = load 'mydata' using myloader();
b = filter a by $1 matches '.mysite.com';
c = group b by $0;
d = foreach c generate $0, SUM($1.$5);
store d into 'summeddata';

A post-processing step would figure out that the data loaded from 'mydata' 
needs to have at least 6 columns, that column 2 needs to be a string, and 
that column 6 needs to be int, long, float, or double.  It would then compose 
a schema with those slots filled in and call neededColumns, passing in 
that schema.  If myloader was of a type that could push the 
projection down into the load, it would store this information for use 
later when actually loading data.  If myloader was loading some type of 
self-describing data, it would need, in this same function call, to 
discover the schema of the data it is loading.  It would then check this 
against the passed-in schema to ensure it makes sense.  In addition, it 
would create an output schema that describes the data, and return that 
from neededColumns.  In the case where the data was not self-describing, 
it would simply return a star schema (why not the schema passed in, 
since the data should match that or we'll get an error?).  Is that correct?
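
(For concreteness, the schema that post-processing step might compose for the
script above could look something like the sketch below.  The Field and
FieldType classes are stand-ins for illustration, not Pig's actual schema
classes.)

import java.util.Arrays;
import java.util.List;

public class NeededColumnsSketch {

    // Stand-in type lattice: "NUMERIC" here means int, long, float, or double.
    enum FieldType { ANY, CHARARRAY, NUMERIC }

    static class Field {
        final int position;   // 0-based column in the loaded tuple
        final FieldType type;
        Field(int position, FieldType type) {
            this.position = position;
            this.type = type;
        }
        public String toString() {
            return "$" + position + ":" + type;
        }
    }

    public static void main(String[] args) {
        // Derived from the script: $0 is the group key, $1 must match a
        // regex (so it's a string), $5 is summed (so it's numeric), and the
        // loaded tuples therefore need at least 6 columns.
        List<Field> needed = Arrays.asList(
                new Field(0, FieldType.ANY),
                new Field(1, FieldType.CHARARRAY),
                new Field(5, FieldType.NUMERIC));
        System.out.println("schema passed to neededColumns: " + needed);
    }
}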

It still feels to me like you have one function doing two things.  One 
way or another, something needs to check that the schema derived from 
the post-process analysis of the script matches the schema derived by 
investigating the data itself, and in your proposal that can be done in 
this function.

One way or another I think we agree on the needed functionality; 
interface concerns are probably secondary.  I need to update my type 
design doc to address how types are converted (at load time or lazily) 
and how the load function exposes that functionality.  I'll add this to 
the doc at the same time.

Alan.


>
> Charlie
>
>

Re: Getting query information while loading data

Posted by Charlie Groves <ch...@threerings.net>.
On Feb 4, 2008, at 1:42 PM, Alan Gates wrote:
> Charlie Groves wrote:
>>
>> On Jan 18, 2008, at 11:45 AM, Alan Gates wrote:
>>> Our thinking of how to provide field metadata (name and  
>>> eventually types) for pig queries was to allow several options:
>>>    1) AS in the LOAD, as you can currently do for names.
>>>    2) using an outside metadata service, where we would tell it  
>>> the file name and it would tell us the metadata.
>>>    3) Support self describing data formats such as JSON.
>>>
>>> Your suggestion for a very simple schema provided in the first  
>>> line of the file falls under category 3.  The trick here is that  
>>> we need to be able to read that metadata about the fields at  
>>> parse time (because we'd like to be able to do type checking and  
>>> such).  So in addition to the load function itself needing to  
>>> examine the tuples, we need a way for the load function to read  
>>> just enough of the file to tell the front end (on the client box,  
>>> not on the map-reduce backend) the schema.  Maybe the best way to  
>>> implement this is to have an interface that the load function  
>>> would implement that lets the parser know that the load function  
>>> can discover the metadata for it, and then the parser could call  
>>> that load function before proceeding to type checking.
>>>
>>> We're also interested in being able to tell the load function the  
>>> fields needed in the query.  Even if you don't have field per  
>>> file storage (aka columnar storage) it's useful to be able to  
>>> immediately project out fields you know the query won't care  
>>> about, as you can avoid translation costs and memory storage.
>>>
>>> It's not clear to me that we need another interface to implement  
>>> this.  We could just add a method "void neededColumns(Schema s)"  
>>> to PigLoader.  As a post parsing step the parser would then visit  
>>> the plan, as you suggest, and submit a schema to the PigLoader  
>>> function.  It would be up to the specific loader implementation  
>>> to decide whether to make use of the provided schema or not.
>>
>> I don't see the use for the first new function in addition to the  
>> second.  If a schema is required by the query, the loader must be  
>> able to produce data matching that schema.  If the loader can  
>> figure out an internal schema, it can make that check that you  
>> describe in function 1 in addition to structuring its data  
>> correctly as in function 2.  If it can't determine its internal  
>> schema until it loads data, then it can do neither and we have to  
>> wait until runtime to see if it succeeds.  What about making the  
>> call "Schema neededColumns(Schema s) throws IOException"?  The  
>> returned Schema is the actual Schema that will be loaded which  
>> must be a superset of the incoming Schema.  If the loader is  
>> unable to create the needed schema, an IOException is thrown.
>>
> I'm not sure I understand what you're proposing.  I was trying to  
> say that we need two separate things from the load function:
> 1) A way to discover the schema of the data at parse time for type  
> checking and query correctness checking (e.g. the user asked for  
> field 5, is there a field 5?)  This is needed for metadata option  
> 3, where the metadata is described by the data (as in JSON) or  
> where the metadata is located in a file associated with the data.   
> We want to detect these kinds of errors before we submit to the  
> backend (i.e. Hadoop) so that we can give the earliest possible  
> error feedback.
> 2) A way to indicate to the load function the schema it needs to  
> load, as a way to support columnar storage schemes (such as you  
> propose) or pushing projection down into the load.
>
> Were you saying that you didn't think one of those is necessary, or  
> are you saying that you think we can accomplish both with one  
> function being added to the load function?

I'm saying that both can be accomplished with one new function on the  
load func: Schema neededColumns(Schema s) throws IOException.  s is  
the schema derived from the query, and the load func can use it to  
satisfy your first requirement.  If it can check its underlying data,  
it can then compare it to the schema in s and throw an IOException if  
it can't satisfy that.  s can also be used to satisfy your second  
requirement as it indicates to the load func what it's expected to load.

The returned Schema is the form that the actual data returned by the  
load func will take.  It must be a superset of the passed in Schema,  
and really just exists to allow the load func to say it isn't going  
to prune any of the data away at load time and just return everything  
that it finds.  For load funcs that don't know the structure of their  
data until they actually read it, they can return the * schema and  
just wait until runtime to see if things blow up just like things  
work currently.
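
(A sketch of how a columnar loader might implement the single call.  The
SimpleSchema type and the helper methods are stand-ins for illustration, not
real Pig classes.)

import java.io.IOException;
import java.util.List;

// Stand-in for whatever schema type the real call would take.
class SimpleSchema {
    private final List<String> fieldNames;
    SimpleSchema(List<String> fieldNames) { this.fieldNames = fieldNames; }
    List<String> fieldNames() { return fieldNames; }
}

class ColumnarLoaderSketch {
    private List<String> columnsToLoad;

    // Hypothetical form of the proposed call: the schema derived from the
    // query comes in; the schema the loader will actually produce goes out.
    SimpleSchema neededColumns(SimpleSchema fromQuery) throws IOException {
        for (String field : fromQuery.fieldNames()) {
            if (!fieldFileExists(field)) {
                throw new IOException("no per-field file for " + field);
            }
        }
        // Remember the projection; only these per-field files get read later.
        columnsToLoad = fromQuery.fieldNames();
        // This loader prunes to exactly what was asked for, so the returned
        // schema is the one passed in.  A loader that couldn't prune would
        // return its full (or *) schema instead.
        return fromQuery;
    }

    private boolean fieldFileExists(String field) {
        return true;   // stand-in for a real check against the stored files
    }
}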

I think this makes more sense as a single function because the two  
requirements are essentially the same operation.  To load enough of  
the data to check a given schema against what's actually in the store  
is almost the same work as determining what it'll actually load for  
requirement two.

Make more sense?

Charlie



Re: Getting query information while loading data

Posted by Alan Gates <ga...@yahoo-inc.com>.
Charlie,

I apologize, I got busy and let this thread drop.  Comments inlined below.

Charlie Groves wrote:
>
> On Jan 18, 2008, at 11:45 AM, Alan Gates wrote:
>
>> We're definitely interested.
>
> Excellent!
>
>> Our thinking of how to provide field metadata (name and eventually 
>> types) for pig queries was to allow several options:
>>    1) AS in the LOAD, as you can currently do for names.
>>    2) using an outside metadata service, where we would tell it the 
>> file name and it would tell us the metadata.
>>    3) Support self describing data formats such as JSON.
>>
>> Your suggestion for a very simple schema provided in the first line 
>> of the file falls under category 3.  The trick here is that we need 
>> to be able to read that metadata about the fields at parse time 
>> (because we'd like to be able to do type checking and such).  So in 
>> addition to the load function itself needing to examine the tuples, 
>> we need a way for the load function to read just enough of the file 
>> to tell the front end (on the client box, not on the map-reduce 
>> backend) the schema.  Maybe the best way to implement this is to have 
>> an interface that the load function would implement that lets the 
>> parser know that the load function can discover the metadata for it, 
>> and then the parser could call that load function before proceeding 
>> to type checking.
>>
>> We're also interested in being able to tell the load function the 
>> fields needed in the query.  Even if you don't have field per file 
>> storage (aka columnar storage) it's useful to be able to immediately 
>> project out fields you know the query won't care about, as you can 
>> avoid translation costs and memory storage.
>>
>> It's not clear to me that we need another interface to implement 
>> this.  We could just add a method "void neededColumns(Schema s)" to 
>> PigLoader.  As a post parsing step the parser would then visit the 
>> plan, as you suggest, and submit a schema to the PigLoader function.  
>> It would be up to the specific loader implementation to decide 
>> whether to make use of the provided schema or not.
>
> I don't see the use for the first new function in addition to the 
> second.  If a schema is required by the query, the loader must be able 
> to produce data matching that schema.  If the loader can figure out an 
> internal schema, it can make that check that you describe in function 
> 1 in addition to structuring its data correctly as in function 2.  If 
> it can't determine its internal schema until it loads data, then it 
> can do neither and we have to wait until runtime to see if it 
> succeeds.  What about making the call "Schema neededColumns(Schema s) 
> throws IOException"?  The returned Schema is the actual Schema that 
> will be loaded which must be a superset of the incoming Schema.  If 
> the loader is unable to create the needed schema, an IOException is 
> thrown.
>
> Is the necessary Schema known somewhere in the parser, or will I have 
> to figure that out from the Schemas available at each step?  I haven't 
> seen anything like that.
>
> Charlie
I'm not sure I understand what you're proposing.  I was trying to say 
that we need two separate things from the load function:
1) A way to discover the schema of the data at parse time for type 
checking and query correctness checking (e.g. the user asked for field 
5, is there a field 5?)  This is needed for metadata option 3, where the 
metadata is described by the data (as in JSON) or where the metadata is 
located in a file associated with the data.  We want to detect these 
kinds of errors before we submit to the backend (i.e. Hadoop) so that we 
can give the earliest possible error feedback.
2) A way to indicate to the load function the schema it needs to load, 
as a way to support columnar storage schemes (such as you propose) or 
pushing projection down into the load.
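
(In sketch form, the two separate hooks might look like this; all of the
names are placeholders, not committed API.)

import java.io.IOException;
import java.util.List;

// Stand-in schema type for the sketch.
interface HypotheticalSchema {
    List<String> fieldNames();
}

// Hook #1: read just enough of the input on the front end to report its
// schema, so type checking and correctness checking can run at parse time,
// before anything is submitted to the backend.
interface DescribesItsData {
    HypotheticalSchema determineSchema(String inputLocation) throws IOException;
}

// Hook #2: tell the load function which fields the query actually needs,
// so it can push the projection down (e.g. for columnar storage).
interface AcceptsNeededColumns {
    void neededColumns(HypotheticalSchema needed);
}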

Were you saying that you didn't think one of those is necessary, or are 
you saying that you think we can accomplish both with one function being 
added to the load function?

Alan.

Re: Getting query information while loading data

Posted by Charlie Groves <ch...@threerings.net>.
On Jan 18, 2008, at 11:45 AM, Alan Gates wrote:

> We're definitely interested.

Excellent!

> Our thinking of how to provide field metadata (name and eventually  
> types) for pig queries was to allow several options:
>    1) AS in the LOAD, as you can currently do for names.
>    2) using an outside metadata service, where we would tell it the  
> file name and it would tell us the metadata.
>    3) Support self describing data formats such as JSON.
>
> Your suggestion for a very simple schema provided in the first  
> line of the file falls under category 3.  The trick here is that we  
> need to be able to read that metadata about the fields at parse  
> time (because we'd like to be able to do type checking and such).   
> So in addition to the load function itself needing to examine the  
> tuples, we need a way for the load function to read just enough of  
> the file to tell the front end (on the client box, not on the map- 
> reduce backend) the schema.  Maybe the best way to implement this  
> is to have an interface that the load function would implement that  
> lets the parser know that the load function can discover the  
> metadata for it, and then the parser could call that load function  
> before proceeding to type checking.
>
> We're also interested in being able to tell the load function the  
> fields needed in the query.  Even if you don't have field per file  
> storage (aka columnar storage) it's useful to be able to  
> immediately project out fields you know the query won't care about,  
> as you can avoid translation costs and memory storage.
>
> It's not clear to me that we need another interface to implement  
> this.  We could just add a method "void neededColumns(Schema s)" to  
> PigLoader.  As a post parsing step the parser would then visit the  
> plan, as you suggest, and submit a schema to the PigLoader  
> function.  It would be up to the specific loader implementation to  
> decide whether to make use of the provided schema or not.

I don't see the use for the first new function in addition to the  
second.  If a schema is required by the query, the loader must be  
able to produce data matching that schema.  If the loader can figure  
out an internal schema, it can make that check that you describe in  
function 1 in addition to structuring its data correctly as in  
function 2.  If it can't determine its internal schema until it loads  
data, then it can do neither and we have to wait until runtime to see  
if it succeeds.  What about making the call "Schema neededColumns 
(Schema s) throws IOException"?  The returned Schema is the actual  
Schema that will be loaded which must be a superset of the incoming  
Schema.  If the loader is unable to create the needed schema, an  
IOException is thrown.

Is the necessary Schema known somewhere in the parser, or will I have  
to figure that out from the Schemas available at each step?  I  
haven't seen anything like that.

Charlie

> Charlie Groves wrote:
>> I'd like to expose the running query to my loading code for a few  
>> reasons:
>>
>> - To allow the schema of the loaded data to be specified by its  
>> usage in the query, rather than by an explicit AS.  I know the  
>> names of the fields in my data, so it seems backwards to me to  
>> require it to be named in the query.  I'd rather use the data  
>> access in the query to figure out the names of the fields and pass  
>> that to my loader to put the data in the right place in a tuple.   
>> This also seems like it could be nice for CSV data since it  
>> generally has the names as the first line.
>>
>> - Following up on using the query to determine the schema, I'd  
>> like to use the query-determined schema to decide what to load.   
>> My storage is broken out into files by field, so if I know which  
>> fields are used in a query, I can read only those fields and save  
>> a huge amount of busywork.
>>
>> - To optimize filter operations using indexes.  For some of my  
>> fields, I have metadata that tells me the range of values in that  
>> file.  If I could find all the filter operations on that field, I  
>> could reject entire files if their values fell outside the filter  
>> range.
>>
>> Are you interested in some patches to do this sort of thing?  If  
>> so, what's the best way to expose this information to user code?   
>> My very basic, initial thinking for the first two use cases is to  
>> write a LOVisitor and an EvalSpecVisitor to spider through the  
>> built query and build a schema to pass to an interested load  
>> func.  A load func indicates its interest by implementing a new  
>> interface that takes the schema, and it takes responsibility for  
>> making a tuple that conforms to the schema.  If a load func isn't  
>> interested, it just implements the current interface and loads all  
>> the data in its input stream.
>>
>> The final use case seems like it would require exposing EvalFuncs  
>> and the LogicalPlan to user code, so I'm fine with just going  
>> after the first two for now and figuring that out later.  However,  
>> if there's a way that's exposed already in the code that I've  
>> missed, or if there's a better way to do it, I'd like to check it  
>> out since it'd be hugely beneficial for what I'm doing.
>>
>> Thanks,
>> Charlie
>


Re: Getting query information while loading data

Posted by Alan Gates <ga...@yahoo-inc.com>.
We're definitely interested.

Our thinking of how to provide field metadata (name and eventually 
types) for pig queries was to allow several options:
    1) AS in the LOAD, as you can currently do for names.
    2) using an outside metadata service, where we would tell it the 
file name and it would tell us the metadata.
    3) Support self describing data formats such as JSON.

Your suggestion for a very simple schema provided in the first line of 
the file falls under category 3.  The trick here is that we need to be 
able to read that metadata about the fields at parse time (because we'd 
like to be able to do type checking and such).  So in addition to the 
load function itself needing to examine the tuples, we need a way for 
the load function to read just enough of the file to tell the front end 
(on the client box, not on the map-reduce backend) the schema.  Maybe 
the best way to implement this is to have an interface that the load 
function would implement that lets the parser know that the load 
function can discover the metadata for it, and then the parser could 
call that load function before proceeding to type checking.

We're also interested in being able to tell the load function the fields 
needed in the query.  Even if you don't have field per file storage (aka 
columnar storage) it's useful to be able to immediately project out 
fields you know the query won't care about, as you can avoid translation 
costs and memory storage.

It's not clear to me that we need another interface to implement this.  
We could just add a method "void neededColumns(Schema s)" to PigLoader.  
As a post parsing step the parser would then visit the plan, as you 
suggest, and submit a schema to the PigLoader function.  It would be up 
to the specific loader implementation to decide whether to make use of 
the provided schema or not.
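
(Roughly what that post-parsing step might amount to.  The plan and loader
types below are placeholders standing in for the real logical plan and
PigLoader, purely for illustration.)

import java.util.Map;

class ProjectionPushdownSketch {

    // Placeholder for the loader-facing side of the proposed call.
    interface PlaceholderLoader {
        void neededColumns(PlaceholderSchema s);
    }

    // Placeholder schema type.
    interface PlaceholderSchema { }

    // Placeholder for the parsed plan: it can report, for each load, which
    // of its fields are referenced anywhere downstream in the query.
    interface PlaceholderPlan {
        Map<PlaceholderLoader, PlaceholderSchema> referencedFieldsPerLoad();
    }

    // The post-parse step: walk the plan, then hand each loader the schema
    // of the fields the query needs from it.  Whether a loader acts on that
    // schema is entirely up to its implementation.
    static void pushProjections(PlaceholderPlan plan) {
        for (Map.Entry<PlaceholderLoader, PlaceholderSchema> e
                : plan.referencedFieldsPerLoad().entrySet()) {
            e.getKey().neededColumns(e.getValue());
        }
    }
}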

Alan.


   

Charlie Groves wrote:
> I'd like to expose the running query to my loading code for a few 
> reasons:
>
> - To allow the schema of the loaded data to be specified by its usage 
> in the query, rather than by an explicit AS.  I know the names of the 
> fields in my data, so it seems backwards to me to require it to be 
> named in the query.  I'd rather use the data access in the query to 
> figure out the names of the fields and pass that to my loader to put 
> the data in the right place in a tuple.  This also seems like it could 
> be nice for CSV data since it generally has the names as the first line.
>
> - Following up on using the query to determine the schema, I'd like to 
> use the query-determined schema to decide what to load.  My storage is 
> broken out into files by field, so if I know which fields are used in 
> a query, I can read only those fields and save a huge amount of busywork.
>
> - To optimize filter operations using indexes.  For some of my fields, 
> I have metadata that tells me the range of values in that file.  If I 
> could find all the filter operations on that field, I could reject 
> entire files if their values fell outside the filter range.
>
> Are you interested in some patches to do this sort of thing?  If so, 
> what's the best way to expose this information to user code?  My very 
> basic, initial thinking for the first two use cases is to write a 
> LOVisitor and an EvalSpecVisitor to spider through the built query and 
> build a schema to pass to an interested load func.  A load func 
> indicates its interest by implementing a new interface that takes the 
> schema, and it takes responsibility for making a tuple that conforms 
> to the schema.  If a load func isn't interested, it just implements 
> the current interface and loads all the data in its input stream.
>
> The final use case seems like it would require exposing EvalFuncs and 
> the LogicalPlan to user code, so I'm fine with just going after the 
> first two for now and figuring that out later.  However, if there's a 
> way that's exposed already in the code that I've missed, or if there's 
> a better way to do it, I'd like to check it out since it'd be hugely 
> beneficial for what I'm doing.
>
> Thanks,
> Charlie