Posted to dev@pig.apache.org by "Alan Gates (JIRA)" <ji...@apache.org> on 2008/06/04 01:50:44 UTC

[jira] Commented: (PIG-160) Change LoadFunc interface to work with new types

    [ https://issues.apache.org/jira/browse/PIG-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602118#action_12602118 ] 

Alan Gates commented on PIG-160:
--------------------------------

Checked in the loadfuncs_v1 patch.

> Change LoadFunc interface to work with new types
> ------------------------------------------------
>
>                 Key: PIG-160
>                 URL: https://issues.apache.org/jira/browse/PIG-160
>             Project: Pig
>          Issue Type: Sub-task
>          Components: impl
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>         Attachments: loadfuncs_v1.patch
>
>
> The LoadFunc interface needs to change to support new types.  The load function will need to support two new features:
> 1) type conversion: how to convert the bytes read from the source to Java Integer, Float, String, etc.
> 2) schema discovery: we want to support self-describing data such as JSON, so we will need the load function to tell us that schema.
> The proposed new interface is:
> {code:title=Bar.java|borderStyle=solid}
> /**
>  * This interface is used to implement functions to parse records
>  * from a dataset.  This also includes functions to cast raw byte data into various
>  * datatypes.  These are external functions because we want loaders, whenever
>  * possible, to delay casting of datatypes until the last possible moment (i.e.
>  * don't do it on load).  This means we need to expose the functionality so that
>  * other sections of the code can call back to the loader to do the cast.
>  */
> public interface LoadFunc {
>     /**
>      * Specifies a portion of an InputStream to read tuples. Because the
>      * starting and ending offsets may not be on record boundaries it is up to
>      * the implementor to deal with figuring out the actual starting and ending
>      * offsets in such a way that an arbitrarily sliced up file will be processed
>      * in its entirety.
>      * <p>
>      * A common way of handling slices in the middle of records is to start at
>      * the given offset and, if the offset is not zero, skip to the end of the
>      * first record (which may be a partial record) before reading tuples.
>      * Reading continues until a tuple has been read that ends at an offset past
>      * the ending offset.
>      * <p>
>      * <b>The load function should not do any buffering on the input stream</b>. Buffering will
>      * cause the offsets returned by is.getPos() to be unreliable.
>      *  
>      * @param fileName the name of the file to be read
>      * @param is the stream representing the file to be processed, and which can also provide its position.
>      * @param offset the offset to start reading tuples.
>      * @param end the ending offset for reading.
>      * @throws IOException
>      */
>     public void bindTo(String fileName,
>                        BufferedPositionedInputStream is,
>                        long offset,
>                        long end) throws IOException;
>     /**
>      * Retrieves the next tuple to be processed.
>      * @return the next tuple to be processed or null if there are no more tuples
>      * to be processed.
>      * @throws IOException
>      */
>     public Tuple getNext() throws IOException;
>     
>     /**
>      * Cast data from bytes to boolean value.  
>      * @param b byte array to be cast.
>      * @return Boolean value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public Boolean bytesToBoolean(byte[] b) throws IOException;
>     
>     /**
>      * Cast data from bytes to integer value.  
>      * @param b byte array to be cast.
>      * @return Integer value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public Integer bytesToInteger(byte[] b) throws IOException;
>     /**
>      * Cast data from bytes to long value.  
>      * @param b byte array to be cast.
>      * @return Long value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public Long bytesToLong(byte[] b) throws IOException;
>     /**
>      * Cast data from bytes to float value.  
>      * @param b byte array to be cast.
>      * @return Float value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public Float bytesToFloat(byte[] b) throws IOException;
>     /**
>      * Cast data from bytes to double value.  
>      * @param b byte array to be cast.
>      * @return Double value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public Double bytesToDouble(byte[] b) throws IOException;
>     /**
>      * Cast data from bytes to chararray value.  
>      * @param b byte array to be cast.
>      * @return String value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public String bytesToCharArray(byte[] b) throws IOException;
>     /**
>      * Cast data from bytes to map value.  
>      * @param b byte array to be cast.
>      * @return Map value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public Map<Object, Object> bytesToMap(byte[] b) throws IOException;
>     /**
>      * Cast data from bytes to tuple value.  
>      * @param b byte array to be cast.
>      * @return Tuple value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public Tuple bytesToTuple(byte[] b) throws IOException;
>     /**
>      * Cast data from bytes to bag value.  
>      * @param b byte array to be cast.
>      * @return Bag value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public DataBag bytesToBag(byte[] b) throws IOException;
>     /**
>      * Indicate to the loader which fields will be needed.  This can be useful for
>      * loaders that access data stored in a columnar format, where indicating the
>      * columns to be accessed ahead of time will save scans.  If the loader
>      * function cannot make use of this information, it is free to ignore it.
>      * @param schema Schema indicating which columns will be needed.
>      */
>     public void fieldsToRead(Schema schema);
>     /**
>      * Find the schema from the loader.  This function will be called at parse time
>      * (not run time) to see if the loader can provide a schema for the data.  The
>      * loader may be able to do this if the data is self describing (e.g. JSON).  If
>      * the loader cannot determine the schema, it can return a null.
>      * @param fileName Name of the file to be read.
>      * @param in input stream, so that the function can read enough of the
>      * data to determine the schema.
>      * @param end Function should not read past this position in the stream.
>      * @return a Schema describing the data if possible, or null otherwise.
>      * @throws IOException
>      */
>     public Schema determineSchema(String fileName,
>                                   BufferedPositionedInputStream in,
>                                   long end) throws IOException;
> }
> {code} 
> This bug also covers the work to convert existing load functions (e.g. PigStorage, BinStorage) to the new interface.
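
The slice-handling convention described in the bindTo() comment can be sketched as follows. This is only an illustration under stated assumptions: records are newline-terminated, tuples are replaced by plain strings, and PositionedBytes and SliceReader are hypothetical stand-ins, not Pig classes (in particular, PositionedBytes is not BufferedPositionedInputStream).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a positioned stream (not Pig's BufferedPositionedInputStream).
class PositionedBytes {
    private final byte[] data;
    private int pos;

    PositionedBytes(byte[] data) { this.data = data; }

    long getPosition() { return pos; }

    void seek(long offset) { pos = (int) Math.min(offset, (long) data.length); }

    // Reads one newline-terminated record; returns null once the data is exhausted.
    String readLine() {
        if (pos >= data.length) return null;
        int start = pos;
        while (pos < data.length && data[pos] != '\n') pos++;
        String line = new String(data, start, pos - start, StandardCharsets.UTF_8);
        if (pos < data.length) pos++; // step over the newline
        return line;
    }
}

// Hypothetical reader that returns lines instead of Tuples.
class SliceReader {
    private final PositionedBytes in;
    private final long end;

    // Plays the role of bindTo(): position the stream, then drop the first
    // (possibly partial) record unless the slice starts at offset zero.
    SliceReader(PositionedBytes in, long offset, long end) {
        this.in = in;
        this.end = end;
        in.seek(offset);
        if (offset != 0) in.readLine();
    }

    // Plays the role of getNext(): returns null once a record has ended past 'end'
    // or the data is exhausted.
    String getNext() {
        if (in.getPosition() > end) return null;
        return in.readLine();
    }

    static List<String> readSlice(byte[] data, long offset, long end) {
        SliceReader r = new SliceReader(new PositionedBytes(data), offset, end);
        List<String> out = new ArrayList<>();
        for (String rec = r.getNext(); rec != null; rec = r.getNext()) out.add(rec);
        return out;
    }

    public static void main(String[] args) {
        byte[] data = "aa\nbb\ncc\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(readSlice(data, 0, 4));           // [aa, bb]
        System.out.println(readSlice(data, 4, data.length)); // [cc]
    }
}
```

The record that a slice skips at its start is not lost: the preceding slice keeps reading until a record ends past its own ending offset, so it picks that record up. Because the sketch reads directly from the byte array with no buffering, getPosition() always reflects exactly what has been consumed.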

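As a rough illustration of the bytesToX() casts in the proposed interface, here is a minimal sketch for a loader whose raw bytes are UTF-8 text. Utf8TextCaster is a hypothetical class, and the parsing rules (trimming whitespace, accepting only "true"/"false" for booleans) are assumptions; this is not how PigStorage or BinStorage is known to behave. Only a few of the casts are shown.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical caster for textual data; mirrors a subset of the proposed method names.
class Utf8TextCaster {
    String bytesToCharArray(byte[] b) throws IOException {
        if (b == null) throw new IOException("cannot cast null bytes");
        return new String(b, StandardCharsets.UTF_8);
    }

    Integer bytesToInteger(byte[] b) throws IOException {
        String s = bytesToCharArray(b).trim();
        try {
            return Integer.valueOf(s);
        } catch (NumberFormatException e) {
            // Surface parse failures as IOException, as the interface's @throws describes.
            throw new IOException("cannot cast '" + s + "' to Integer", e);
        }
    }

    Long bytesToLong(byte[] b) throws IOException {
        String s = bytesToCharArray(b).trim();
        try {
            return Long.valueOf(s);
        } catch (NumberFormatException e) {
            throw new IOException("cannot cast '" + s + "' to Long", e);
        }
    }

    Double bytesToDouble(byte[] b) throws IOException {
        String s = bytesToCharArray(b).trim();
        try {
            return Double.valueOf(s);
        } catch (NumberFormatException e) {
            throw new IOException("cannot cast '" + s + "' to Double", e);
        }
    }

    Boolean bytesToBoolean(byte[] b) throws IOException {
        String s = bytesToCharArray(b).trim();
        if (s.equalsIgnoreCase("true")) return Boolean.TRUE;
        if (s.equalsIgnoreCase("false")) return Boolean.FALSE;
        throw new IOException("cannot cast '" + s + "' to Boolean");
    }
}
```

With something like this, a caller can carry fields around as byte[] and call back into the loader only when a typed value is actually needed, which is the delayed casting the interface comment argues for.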