Posted to user@pig.apache.org by David Capwell <dc...@gmail.com> on 2012/07/05 19:32:12 UTC

Pig not returning the right column when using LoadMetadata to provide a schema

I have the following code in my LoadFunc

public class Loader extends PigStorage implements LoadMetadata {
...
  @Override
  public ResourceSchema getSchema(String location, Job job) throws IOException {
    String fields = "ts: long, ip:chararray, username:chararray, "
        + "host:chararray, method: chararray, uri: chararray, query: map[chararray], "
        + "status:int, bytes:chararray, version:chararray, userAgent:chararray, "
        + "un1:chararray, un2:chararray, cookies:map[chararray], referer:chararray, "
        + "originId:chararray, id:chararray";
    ResourceSchema schema = new ResourceSchema(Utils.getSchemaFromString(fields));
    if (schema == null) {
      throw new IOException("Unable to parse schema");
    }
    return schema;
  }

  @Override
  public ResourceStatistics getStatistics(String s, Job job) throws IOException {
    return null;  //To change body of implemented methods use File | Settings | File Templates.
  }

  @Override
  public String[] getPartitionKeys(String s, Job job) throws IOException {
    return new String[0];  //To change body of implemented methods use File | Settings | File Templates.
  }

  @Override
  public void setPartitionFilter(Expression expression) throws IOException {
    //To change body of implemented methods use File | Settings | File Templates.
  }
...
}

When I use this loader with Pig 0.9.2 or 0.10.0 (both Apache releases), Pig
keeps giving me the wrong column, whether I reference it by column name or by
index (e.g. $5).

Here is a sample script that shows Pig loading the metadata:

grunt> REGISTER './my-lib-pig-1.0.jar';

grunt> A = LOAD '/path/to/data/logs' USING Loader();
grunt> DESCRIBE A;
A: {ts: long,ip: chararray,username: chararray,host: chararray,method:
chararray,uri: chararray,query: map[chararray],status: int,bytes:
chararray,version: chararray,userAgent: chararray,un1: chararray,un2:
chararray,cookies: map[chararray],referer: chararray,originId:
chararray,id: chararray}


If I add the following:
grunt> B = FOREACH A GENERATE uri, FLATTEN(query);

grunt> DUMP B;

Pig returns (ts, ip).  I have tried switching to the index and get the same
result:
grunt> B = FOREACH A GENERATE $6, FLATTEN($7);

grunt> DUMP B;

This also yields (ts, ip).

Am I missing something?  If I DUMP A, I see that the tuple contains all 17
columns.

Thanks for your time reading this email.