Posted to dev@pig.apache.org by "Alan Gates (JIRA)" <ji...@apache.org> on 2008/06/04 01:50:44 UTC
[jira] Commented: (PIG-160) Change LoadFunc interface to work with new types
[ https://issues.apache.org/jira/browse/PIG-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602118#action_12602118 ]
Alan Gates commented on PIG-160:
--------------------------------
Checked in loadfunc_v1 patch.
> Change LoadFunc interface to work with new types
> ------------------------------------------------
>
> Key: PIG-160
> URL: https://issues.apache.org/jira/browse/PIG-160
> Project: Pig
> Issue Type: Sub-task
> Components: impl
> Reporter: Alan Gates
> Assignee: Alan Gates
> Attachments: loadfuncs_v1.patch
>
>
> The LoadFunc interface needs to change to support new types. The load function will need to support two new features:
> 1) type conversion: how to convert the bytes read from the source to Java Integer, Float, String, etc.
> 2) schema discovery: we want to support self-describing data such as JSON, so the load function will need to tell us the schema.
> The proposed new interface is:
> {code:title=LoadFunc.java|borderStyle=solid}
> /**
> * This interface is used to implement functions to parse records
> * from a dataset. This also includes functions to cast raw byte data into various
> * datatypes. These are external functions because we want loaders, whenever
> * possible, to delay casting of datatypes until the last possible moment (i.e.
> * don't do it on load). This means we need to expose the functionality so that
> * other sections of the code can call back to the loader to do the cast.
> */
> public interface LoadFunc {
> /**
> * Specifies a portion of an InputStream to read tuples. Because the
> * starting and ending offsets may not be on record boundaries it is up to
> * the implementor to deal with figuring out the actual starting and ending
> * offsets in such a way that an arbitrarily sliced up file will be processed
> * in its entirety.
> * <p>
> * A common way of handling slices in the middle of records is to start at
> * the given offset and, if the offset is not zero, skip to the end of the
> * first record (which may be a partial record) before reading tuples.
> * Reading continues until a tuple has been read that ends at an offset past
> * the ending offset.
> * <p>
> * <b>The load function should not do any buffering on the input stream</b>. Buffering will
> * cause the offsets returned by is.getPos() to be unreliable.
> *
> * @param fileName the name of the file to be read
> * @param is the stream representing the file to be processed, and which can also provide its position.
> * @param offset the offset to start reading tuples.
> * @param end the ending offset for reading.
> * @throws IOException
> */
> public void bindTo(String fileName,
> BufferedPositionedInputStream is,
> long offset,
> long end) throws IOException;
> /**
> * Retrieves the next tuple to be processed.
> * @return the next tuple to be processed or null if there are no more tuples
> * to be processed.
> * @throws IOException
> */
> public Tuple getNext() throws IOException;
>
> /**
> * Cast data from bytes to boolean value.
> * @param b byte array to be cast.
> * @return Boolean value.
> * @throws IOException if the value cannot be cast.
> */
> public Boolean bytesToBoolean(byte[] b) throws IOException;
>
> /**
> * Cast data from bytes to integer value.
> * @param b byte array to be cast.
> * @return Integer value.
> * @throws IOException if the value cannot be cast.
> */
> public Integer bytesToInteger(byte[] b) throws IOException;
> /**
> * Cast data from bytes to long value.
> * @param b byte array to be cast.
> * @return Long value.
> * @throws IOException if the value cannot be cast.
> */
> public Long bytesToLong(byte[] b) throws IOException;
> /**
> * Cast data from bytes to float value.
> * @param b byte array to be cast.
> * @return Float value.
> * @throws IOException if the value cannot be cast.
> */
> public Float bytesToFloat(byte[] b) throws IOException;
> /**
> * Cast data from bytes to double value.
> * @param b byte array to be cast.
> * @return Double value.
> * @throws IOException if the value cannot be cast.
> */
> public Double bytesToDouble(byte[] b) throws IOException;
> /**
> * Cast data from bytes to chararray value.
> * @param b byte array to be cast.
> * @return String value.
> * @throws IOException if the value cannot be cast.
> */
> public String bytesToCharArray(byte[] b) throws IOException;
> /**
> * Cast data from bytes to map value.
> * @param b byte array to be cast.
> * @return Map value.
> * @throws IOException if the value cannot be cast.
> */
> public Map<Object, Object> bytesToMap(byte[] b) throws IOException;
> /**
> * Cast data from bytes to tuple value.
> * @param b byte array to be cast.
> * @return Tuple value.
> * @throws IOException if the value cannot be cast.
> */
> public Tuple bytesToTuple(byte[] b) throws IOException;
> /**
> * Cast data from bytes to bag value.
> * @param b byte array to be cast.
> * @return Bag value.
> * @throws IOException if the value cannot be cast.
> */
> public DataBag bytesToBag(byte[] b) throws IOException;
> /**
> * Indicate to the loader fields that will be needed. This can be useful for
> * loaders that access data that is stored in a columnar format where indicating
> * columns to be accessed ahead of time will save scans. If the loader
> * function cannot make use of this information, it is free to ignore it.
> * @param schema Schema indicating which columns will be needed.
> */
> public void fieldsToRead(Schema schema);
> /**
> * Find the schema from the loader. This function will be called at parse time
> * (not run time) to see if the loader can provide a schema for the data. The
> * loader may be able to do this if the data is self describing (e.g. JSON). If
> * the loader cannot determine the schema, it can return a null.
> * @param fileName Name of the file to be read.
> * @param in Input stream, so that the function can read enough of the
> * data to determine the schema.
> * @param end Function should not read past this position in the stream.
> * @return a Schema describing the data if possible, or null otherwise.
> * @throws IOException
> */
> public Schema determineSchema(String fileName,
> BufferedPositionedInputStream in,
> long end) throws IOException;
> }
> {code}
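The slice-handling approach described in the bindTo() javadoc above can be sketched roughly as follows. This is a minimal illustration, not code from the patch: SliceReaderSketch and readSlice are hypothetical names, records are assumed to be newline-terminated, and a plain ByteArrayInputStream with a hand-tracked position stands in for BufferedPositionedInputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of Pig. */
final class SliceReaderSketch {
    private final ByteArrayInputStream in; // stand-in for the positioned stream
    private final long end;                // ending offset of this slice
    private long pos;                      // absolute position, tracked by hand

    SliceReaderSketch(byte[] file, int offset, long end) {
        this.in = new ByteArrayInputStream(file, offset, file.length - offset);
        this.pos = offset;
        this.end = end;
        if (offset != 0) {
            // Not at the start of the file: discard the first (possibly
            // partial) record; the previous slice is responsible for it.
            readRecord();
        }
    }

    /** Returns the next record, or null once the slice is exhausted. */
    String getNext() {
        // Stop only after a record has ended past the ending offset.
        if (pos > end) {
            return null;
        }
        return readRecord();
    }

    /** Reads bytes up to and including the next newline; null at end of input. */
    private String readRecord() {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        boolean readAny = false;
        int b;
        while ((b = in.read()) != -1) {
            pos++;
            readAny = true;
            if (b == '\n') {
                break;
            }
            buf.write(b);
        }
        return readAny ? new String(buf.toByteArray(), StandardCharsets.UTF_8) : null;
    }

    /** Convenience helper: collects every record belonging to one slice. */
    static List<String> readSlice(byte[] file, int offset, long end) {
        SliceReaderSketch reader = new SliceReaderSketch(file, offset, end);
        List<String> records = new ArrayList<>();
        for (String rec = reader.getNext(); rec != null; rec = reader.getNext()) {
            records.add(rec);
        }
        return records;
    }
}
```

With this rule, a record that straddles a slice boundary, or starts exactly on it, is read by the earlier slice and skipped by the later one, so each record should be produced exactly once as long as every slice follows the same rule.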
> This bug also covers the work to convert the existing load functions (e.g. PigStorage, BinStorage) to the new interface.
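For a loader whose data is stored as text, the bytesTo* casts might look like the sketch below. It assumes UTF-8 encoded text fields and returns null for null input; both are assumptions made for illustration. TextCastSketch is a hypothetical class, and the conversion actually used by PigStorage may differ.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration only; not part of Pig. */
final class TextCastSketch {

    /** Decodes the bytes as UTF-8 text (assumed encoding). */
    static String bytesToCharArray(byte[] b) throws IOException {
        if (b == null) {
            return null; // assumption: null input maps to a null value
        }
        return new String(b, StandardCharsets.UTF_8);
    }

    /** Parses the textual form of an integer, e.g. the bytes of "42". */
    static Integer bytesToInteger(byte[] b) throws IOException {
        String s = bytesToCharArray(b);
        if (s == null) {
            return null;
        }
        try {
            return Integer.valueOf(s);
        } catch (NumberFormatException e) {
            // The interface reports failed casts as IOException.
            throw new IOException("Cannot cast to integer: " + s, e);
        }
    }

    /** Parses the textual form of a double, e.g. the bytes of "1.5". */
    static Double bytesToDouble(byte[] b) throws IOException {
        String s = bytesToCharArray(b);
        if (s == null) {
            return null;
        }
        try {
            return Double.valueOf(s);
        } catch (NumberFormatException e) {
            throw new IOException("Cannot cast to double: " + s, e);
        }
    }
}
```

Because the casts live on the loader, the rest of the system can keep a field as raw bytes and call back only when a typed value is actually needed.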