Posted to user@pig.apache.org by David Capwell <dc...@gmail.com> on 2012/07/05 19:32:12 UTC
Pig not returning the right column when using LoadMetadata to provide
a schema
I have the following code in my LoadFunc
public class Loader extends PigStorage implements LoadMetadata {
    ...
    @Override
    public ResourceSchema getSchema(String location, Job job) throws IOException {
        String fields = "ts: long, ip:chararray, username:chararray, "
                + "host:chararray, method: chararray, uri: chararray, query: map[chararray], "
                + "status:int, bytes:chararray, version:chararray, userAgent:chararray, "
                + "un1:chararray, un2:chararray, cookies:map[chararray], referer:chararray, "
                + "originId:chararray, id:chararray";
        ResourceSchema schema = new ResourceSchema(Utils.getSchemaFromString(fields));
        if (schema == null) {
            throw new IOException("Unable to parse schema");
        }
        return schema;
    }

    @Override
    public ResourceStatistics getStatistics(String s, Job job) throws IOException {
        return null; // To change body of implemented methods use File | Settings | File Templates.
    }

    @Override
    public String[] getPartitionKeys(String s, Job job) throws IOException {
        return new String[0]; // To change body of implemented methods use File | Settings | File Templates.
    }

    @Override
    public void setPartitionFilter(Expression expression) throws IOException {
        // To change body of implemented methods use File | Settings | File Templates.
    }
    ...
}
When I use this with Pig 0.9.2 and 0.10.0 (both Apache releases), Pig keeps
giving me the wrong column, whether I use the column name or the index
(e.g. $5).
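For reference, Pig positional references are zero-based, so the index that goes with each name can be read off the schema string. The following is only a minimal sketch in plain Java with no Pig dependency; the class name SchemaPositions and its methods are hypothetical and not part of the loader. It assumes a flat field list with no commas inside the type declarations, which holds for the string above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the loader: maps field names to Pig positions.
public class SchemaPositions {
    // Same field list as the string passed to Utils.getSchemaFromString above.
    static final String FIELDS =
            "ts: long, ip:chararray, username:chararray, "
            + "host:chararray, method: chararray, uri: chararray, query: map[chararray], "
            + "status:int, bytes:chararray, version:chararray, userAgent:chararray, "
            + "un1:chararray, un2:chararray, cookies:map[chararray], referer:chararray, "
            + "originId:chararray, id:chararray";

    // Returns the field names in declaration order; the list index of a name
    // is its zero-based positional reference ($0, $1, ...).
    // Assumes no type declaration contains a comma.
    static List<String> fieldNames(String fields) {
        List<String> names = new ArrayList<>();
        for (String part : fields.split(",")) {
            names.add(part.split(":")[0].trim());
        }
        return names;
    }

    // Zero-based position of a field name, or -1 if the name is not in the list.
    static int positionOf(String name) {
        return fieldNames(FIELDS).indexOf(name);
    }

    public static void main(String[] args) {
        List<String> names = fieldNames(FIELDS);
        for (int i = 0; i < names.size(); i++) {
            System.out.println("$" + i + " -> " + names.get(i));
        }
    }
}
```

By this count uri is $5, query is $6 and status is $7.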
Here is a sample script that shows pig loading the metadata:
grunt> REGISTER './my-lib-pig-1.0.jar';
grunt> A = LOAD '/path/to/data/logs' USING Loader();
grunt> DESCRIBE A;
A: {ts: long,ip: chararray,username: chararray,host: chararray,method:
chararray,uri: chararray,query: map[chararray],status: int,bytes:
chararray,version: chararray,userAgent: chararray,un1: chararray,un2:
chararray,cookies: map[chararray],referer: chararray,originId:
chararray,id: chararray}
If I add the following:
grunt> B = FOREACH A GENERATE uri, FLATTEN(query);
grunt> DUMP B;
Pig returns (ts, ip). I have tried switching to positional indexes and
get the same result:
grunt> B = FOREACH A GENERATE $6, FLATTEN($7);
grunt> DUMP B;
This also yields (ts, ip).
Am I missing something? If I DUMP A, I see that the tuple contains all 17
columns.
Thanks for your time reading this email.