You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@pig.apache.org by pablomar <pa...@gmail.com> on 2012/11/16 21:48:25 UTC
PigStorage
hi all,
I'm using Pig 0.9.2 (Apache Pig version 0.9.2-cdh4.0.1, precisely)
I got a case today on which I needed to clean up some fields before
processing. I will need to do the same for all my scripts. So instead of
doing it inside the scripts, I thought in extending PigStorage and do it
inside my own Loader. My scripts will be shorter and cleaner
in fact, the only method that I needed to overwrite was :
void *readField*(byte[] buf, int start, int end)
Everything was ok and it is working. Problem was that I had to copy/paste a
lot just because private declarations
for example:
private byte fieldDel = '\t';
private ArrayList<Object> mProtoTuple = null;
private TupleFactory mTupleFactory = TupleFactory.getInstance();
private boolean mRequiredColumnsInitialized = false;
and of course:
*private *void readField(byte[] buf, int start, int end)
so I had to copy/paste:
public Tuple getNext() and all the aforementioned variables just to be able
to write my own *readField*
would it be possible in next versions of Pig to have *readField *protected
as well as *mProtoTuple *? I think it could be useful in some cases like
mine
I'm asking because I don't know the reasoning after the decisions of made
them private
thanks a lot,
Re: PigStorage
Posted by pablomar <pa...@gmail.com>.
done. PIG-3057 <https://issues.apache.org/jira/browse/PIG-3057>
On Mon, Nov 19, 2012 at 6:32 PM, Jonathan Coveney <jc...@gmail.com>wrote:
> Make a JIRA and attach the patch, please.
>
>
> 2012/11/19 pablomar <pa...@gmail.com>
>
> > hi all,
> >
> > I did it as simple as I could. What about this changes ?
> >
> >
> > PigStorage.java
> > original:
> > private void readField(byte[] buf, int start, int end) {
> > if (start == end) {
> > // NULL value
> > mProtoTuple.add(null);
> > } else {
> > mProtoTuple.add(new DataByteArray(buf, start, end));
> > }
> > }
> >
> >
> > new:
> > protected void addToCurrentTuple(DataByteArray data) {
> > mProtoTuple.add(data);
> > }
> >
> > protected void readField(byte[] buf, int start, int end) {
> > if (start == end) {
> > // NULL value
> > addToCurrentTuple(null);
> > } else {
> > addToCurrentTuple(new DataByteArray(buf, start, end));
> > }
> > }
> >
> >
> >
> > so, what's the advantage ?
> > with the new version, if I want to extend PigStorage just to override
> > readField, I do something like:
> >
> > import org.apache.pig.builtin.PigStorage;
> > import org.apache.pig.data.DataByteArray;
> >
> > public class MyLoader extends PigStorage {
> > public MyLoader() {
> > this("\t");
> > }
> >
> > public MyLoader(String delimiter) {
> > super(delimiter);
> > }
> >
> > @Override
> > protected void readField(byte[] buf, int start, int end) {
> > // remove the field name (example: min=, max=)
> > for (int i=start; i<=end;i++) {
> > if (buf[i] == '=') {
> > start = i + 1;
> > break;
> > }
> > }
> >
> > if (start == end) {
> > // NULL value
> > addToCurrentTuple(null);
> > } else {
> > addToCurrentTuple(new DataByteArray(buf, start, end));
> > }
> > }
> > }
> >
> > currently, just to override readField, I also need to copy/paste a lot
> from
> > PigStorage:
> >
> > import java.io.IOException;
> > import java.util.ArrayList;
> > import java.util.Properties;
> > import org.apache.hadoop.io.Text;
> > import org.apache.pig.backend.executionengine.ExecException;
> > import org.apache.pig.builtin.PigStorage;
> > import org.apache.pig.PigException;
> > import org.apache.pig.data.DataByteArray;
> > import org.apache.pig.data.Tuple;
> > import org.apache.pig.data.TupleFactory;
> > import org.apache.pig.impl.util.ObjectSerializer;
> > import org.apache.pig.impl.util.StorageUtil;
> > import org.apache.pig.impl.util.UDFContext;
> >
> > public class MyLoaderextends PigStorage {
> > private byte fieldDel = '\t';
> > private ArrayList<Object> mProtoTuple = null;
> > private TupleFactory mTupleFactory = TupleFactory.getInstance();
> > private boolean mRequiredColumnsInitialized = false;
> >
> > public MyLoader() {
> > this("\t");
> > }
> >
> > public MyLoader(String delimiter) {
> > super(delimiter);
> > fieldDel = StorageUtil.parseFieldDel(delimiter);
> > }
> >
> > // exactly the same than in PigStorage, but I needed to modify
> readField
> > which is private, so I have to copy all of it
> > @Override
> > public Tuple getNext() throws IOException {
> > mProtoTuple = new ArrayList<Object>();
> > if (!mRequiredColumnsInitialized) {
> > if (signature!=null) {
> > Properties p =
> > UDFContext.getUDFContext().getUDFProperties(this.getClass());
> > mRequiredColumns =
> > (boolean[])ObjectSerializer.deserialize(p.getProperty(signature));
> > }
> > mRequiredColumnsInitialized = true;
> > }
> > try {
> > boolean notDone = in.nextKeyValue();
> > if (!notDone) {
> > return null;
> >
> > }
> >
> > Text value = (Text) in.getCurrentValue();
> > byte[] buf = value.getBytes();
> > int len = value.getLength();
> > int start = 0;
> > int fieldID = 0;
> > for (int i = 0; i < len; i++) {
> > if (buf[i] == fieldDel) {
> > if (mRequiredColumns==null ||
> > (mRequiredColumns.length>fieldID && mRequiredColumns[fieldID]))
> > readField(buf, start, i);
> > start = i + 1;
> > fieldID++;
> > }
> > }
> > // pick up the last field
> > if (start <= len && (mRequiredColumns==null ||
> > (mRequiredColumns.length>fieldID && mRequiredColumns[fieldID]))) {
> > readField(buf, start, len);
> > }
> > Tuple t = mTupleFactory.newTupleNoCopy(mProtoTuple);
> > return t;
> > } catch (InterruptedException e) {
> > int errCode = 6018;
> > String errMsg = "Error while reading input";
> > throw new ExecException(errMsg, errCode,
> > PigException.REMOTE_ENVIRONMENT, e);
> > }
> >
> > }
> >
> > private void readField(byte[] buf, int start, int end) {
> > // remove the field name (example: min=, max=)
> > for (int i=start; i<=end;i++) {
> > if (buf[i] == '=') {
> > start = i + 1;
> > break;
> > }
> > }
> >
> > if (start == end) {
> > // NULL value
> > mProtoTuple.add(null);
> > } else {
> > mProtoTuple.add(new DataByteArray(buf, start, end));
> > }
> > }
> > }
> >
> >
> > On Mon, Nov 19, 2012 at 12:24 PM, pablomar
> > <pa...@gmail.com>wrote:
> >
> > > sure. My initial (and dirty) idea changed only 2 lines. I completely
> > agree
> > > with you
> > >
> > >
> > >
> > >
> > >
> > > On Mon, Nov 19, 2012 at 12:16 PM, Bill Graham <billgraham@gmail.com
> > >wrote:
> > >
> > >> +1 as well, but I'd suggest we do the following:
> > >>
> > >> - Keep mProtoTuple private and add protected getters/setters instead
> > with
> > >> javadocs describing expected usage.
> > >> - Rename mProtoTuple and the getters/setters to something more
> > descriptive
> > >> than mProtoTuple.
> > >>
> > >>
> > >> On Fri, Nov 16, 2012 at 2:15 PM, Dmitriy Ryaboy <dv...@gmail.com>
> > >> wrote:
> > >>
> > >> > That sounds reasonable, I've run into the same problem. Do you mind
> > >> > submitting a patch?
> > >> >
> > >> > On Fri, Nov 16, 2012 at 12:48 PM, pablomar
> > >> > <pa...@gmail.com> wrote:
> > >> > > hi all,
> > >> > >
> > >> > > I'm using Pig 0.9.2 (Apache Pig version 0.9.2-cdh4.0.1, precisely)
> > >> > > I got a case today on which I needed to clean up some fields
> before
> > >> > > processing. I will need to do the same for all my scripts. So
> > instead
> > >> of
> > >> > > doing it inside the scripts, I thought in extending PigStorage and
> > do
> > >> it
> > >> > > inside my own Loader. My scripts will be shorter and cleaner
> > >> > >
> > >> > > in fact, the only method that I needed to overwrite was :
> > >> > > void *readField*(byte[] buf, int start, int end)
> > >> > >
> > >> > >
> > >> > > Everything was ok and it is working. Problem was that I had to
> > >> > copy/paste a
> > >> > > lot just because private declarations
> > >> > > for example:
> > >> > > private byte fieldDel = '\t';
> > >> > > private ArrayList<Object> mProtoTuple = null;
> > >> > > private TupleFactory mTupleFactory = TupleFactory.getInstance();
> > >> > > private boolean mRequiredColumnsInitialized = false;
> > >> > >
> > >> > > and of course:
> > >> > > *private *void readField(byte[] buf, int start, int end)
> > >> > >
> > >> > > so I had to copy/paste:
> > >> > > public Tuple getNext() and all the aforementioned variables just
> to
> > be
> > >> > able
> > >> > > to write my own *readField*
> > >> > >
> > >> > >
> > >> > > would it be possible in next versions of Pig to have *readField
> > >> > *protected
> > >> > > as well as *mProtoTuple *? I think it could be useful in some
> cases
> > >> like
> > >> > > mine
> > >> > > I'm asking because I don't know the reasoning after the decisions
> of
> > >> made
> > >> > > them private
> > >> > >
> > >> > > thanks a lot,
> > >> >
> > >>
> > >>
> > >>
> > >> --
> > >> *Note that I'm no longer using my Yahoo! email address. Please email
> me
> > at
> > >> billgraham@gmail.com going forward.*
> > >>
> > >
> > >
> >
>
Re: PigStorage
Posted by Jonathan Coveney <jc...@gmail.com>.
Make a JIRA and attach the patch, please.
2012/11/19 pablomar <pa...@gmail.com>
> hi all,
>
> I did it as simple as I could. What about this changes ?
>
>
> PigStorage.java
> original:
> private void readField(byte[] buf, int start, int end) {
> if (start == end) {
> // NULL value
> mProtoTuple.add(null);
> } else {
> mProtoTuple.add(new DataByteArray(buf, start, end));
> }
> }
>
>
> new:
> protected void addToCurrentTuple(DataByteArray data) {
> mProtoTuple.add(data);
> }
>
> protected void readField(byte[] buf, int start, int end) {
> if (start == end) {
> // NULL value
> addToCurrentTuple(null);
> } else {
> addToCurrentTuple(new DataByteArray(buf, start, end));
> }
> }
>
>
>
> so, what's the advantage ?
> with the new version, if I want to extend PigStorage just to override
> readField, I do something like:
>
> import org.apache.pig.builtin.PigStorage;
> import org.apache.pig.data.DataByteArray;
>
> public class MyLoader extends PigStorage {
> public MyLoader() {
> this("\t");
> }
>
> public MyLoader(String delimiter) {
> super(delimiter);
> }
>
> @Override
> protected void readField(byte[] buf, int start, int end) {
> // remove the field name (example: min=, max=)
> for (int i=start; i<=end;i++) {
> if (buf[i] == '=') {
> start = i + 1;
> break;
> }
> }
>
> if (start == end) {
> // NULL value
> addToCurrentTuple(null);
> } else {
> addToCurrentTuple(new DataByteArray(buf, start, end));
> }
> }
> }
>
> currently, just to override readField, I also need to copy/paste a lot from
> PigStorage:
>
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.Properties;
> import org.apache.hadoop.io.Text;
> import org.apache.pig.backend.executionengine.ExecException;
> import org.apache.pig.builtin.PigStorage;
> import org.apache.pig.PigException;
> import org.apache.pig.data.DataByteArray;
> import org.apache.pig.data.Tuple;
> import org.apache.pig.data.TupleFactory;
> import org.apache.pig.impl.util.ObjectSerializer;
> import org.apache.pig.impl.util.StorageUtil;
> import org.apache.pig.impl.util.UDFContext;
>
> public class MyLoaderextends PigStorage {
> private byte fieldDel = '\t';
> private ArrayList<Object> mProtoTuple = null;
> private TupleFactory mTupleFactory = TupleFactory.getInstance();
> private boolean mRequiredColumnsInitialized = false;
>
> public MyLoader() {
> this("\t");
> }
>
> public MyLoader(String delimiter) {
> super(delimiter);
> fieldDel = StorageUtil.parseFieldDel(delimiter);
> }
>
> // exactly the same than in PigStorage, but I needed to modify readField
> which is private, so I have to copy all of it
> @Override
> public Tuple getNext() throws IOException {
> mProtoTuple = new ArrayList<Object>();
> if (!mRequiredColumnsInitialized) {
> if (signature!=null) {
> Properties p =
> UDFContext.getUDFContext().getUDFProperties(this.getClass());
> mRequiredColumns =
> (boolean[])ObjectSerializer.deserialize(p.getProperty(signature));
> }
> mRequiredColumnsInitialized = true;
> }
> try {
> boolean notDone = in.nextKeyValue();
> if (!notDone) {
> return null;
>
> }
>
> Text value = (Text) in.getCurrentValue();
> byte[] buf = value.getBytes();
> int len = value.getLength();
> int start = 0;
> int fieldID = 0;
> for (int i = 0; i < len; i++) {
> if (buf[i] == fieldDel) {
> if (mRequiredColumns==null ||
> (mRequiredColumns.length>fieldID && mRequiredColumns[fieldID]))
> readField(buf, start, i);
> start = i + 1;
> fieldID++;
> }
> }
> // pick up the last field
> if (start <= len && (mRequiredColumns==null ||
> (mRequiredColumns.length>fieldID && mRequiredColumns[fieldID]))) {
> readField(buf, start, len);
> }
> Tuple t = mTupleFactory.newTupleNoCopy(mProtoTuple);
> return t;
> } catch (InterruptedException e) {
> int errCode = 6018;
> String errMsg = "Error while reading input";
> throw new ExecException(errMsg, errCode,
> PigException.REMOTE_ENVIRONMENT, e);
> }
>
> }
>
> private void readField(byte[] buf, int start, int end) {
> // remove the field name (example: min=, max=)
> for (int i=start; i<=end;i++) {
> if (buf[i] == '=') {
> start = i + 1;
> break;
> }
> }
>
> if (start == end) {
> // NULL value
> mProtoTuple.add(null);
> } else {
> mProtoTuple.add(new DataByteArray(buf, start, end));
> }
> }
> }
>
>
> On Mon, Nov 19, 2012 at 12:24 PM, pablomar
> <pa...@gmail.com>wrote:
>
> > sure. My initial (and dirty) idea changed only 2 lines. I completely
> agree
> > with you
> >
> >
> >
> >
> >
> > On Mon, Nov 19, 2012 at 12:16 PM, Bill Graham <billgraham@gmail.com
> >wrote:
> >
> >> +1 as well, but I'd suggest we do the following:
> >>
> >> - Keep mProtoTuple private and add protected getters/setters instead
> with
> >> javadocs describing expected usage.
> >> - Rename mProtoTuple and the getters/setters to something more
> descriptive
> >> than mProtoTuple.
> >>
> >>
> >> On Fri, Nov 16, 2012 at 2:15 PM, Dmitriy Ryaboy <dv...@gmail.com>
> >> wrote:
> >>
> >> > That sounds reasonable, I've run into the same problem. Do you mind
> >> > submitting a patch?
> >> >
> >> > On Fri, Nov 16, 2012 at 12:48 PM, pablomar
> >> > <pa...@gmail.com> wrote:
> >> > > hi all,
> >> > >
> >> > > I'm using Pig 0.9.2 (Apache Pig version 0.9.2-cdh4.0.1, precisely)
> >> > > I got a case today on which I needed to clean up some fields before
> >> > > processing. I will need to do the same for all my scripts. So
> instead
> >> of
> >> > > doing it inside the scripts, I thought in extending PigStorage and
> do
> >> it
> >> > > inside my own Loader. My scripts will be shorter and cleaner
> >> > >
> >> > > in fact, the only method that I needed to overwrite was :
> >> > > void *readField*(byte[] buf, int start, int end)
> >> > >
> >> > >
> >> > > Everything was ok and it is working. Problem was that I had to
> >> > copy/paste a
> >> > > lot just because private declarations
> >> > > for example:
> >> > > private byte fieldDel = '\t';
> >> > > private ArrayList<Object> mProtoTuple = null;
> >> > > private TupleFactory mTupleFactory = TupleFactory.getInstance();
> >> > > private boolean mRequiredColumnsInitialized = false;
> >> > >
> >> > > and of course:
> >> > > *private *void readField(byte[] buf, int start, int end)
> >> > >
> >> > > so I had to copy/paste:
> >> > > public Tuple getNext() and all the aforementioned variables just to
> be
> >> > able
> >> > > to write my own *readField*
> >> > >
> >> > >
> >> > > would it be possible in next versions of Pig to have *readField
> >> > *protected
> >> > > as well as *mProtoTuple *? I think it could be useful in some cases
> >> like
> >> > > mine
> >> > > I'm asking because I don't know the reasoning after the decisions of
> >> made
> >> > > them private
> >> > >
> >> > > thanks a lot,
> >> >
> >>
> >>
> >>
> >> --
> >> *Note that I'm no longer using my Yahoo! email address. Please email me
> at
> >> billgraham@gmail.com going forward.*
> >>
> >
> >
>
Re: PigStorage
Posted by pablomar <pa...@gmail.com>.
hi all,
I did it as simple as I could. What about this changes ?
PigStorage.java
original:
private void readField(byte[] buf, int start, int end) {
if (start == end) {
// NULL value
mProtoTuple.add(null);
} else {
mProtoTuple.add(new DataByteArray(buf, start, end));
}
}
new:
protected void addToCurrentTuple(DataByteArray data) {
mProtoTuple.add(data);
}
protected void readField(byte[] buf, int start, int end) {
if (start == end) {
// NULL value
addToCurrentTuple(null);
} else {
addToCurrentTuple(new DataByteArray(buf, start, end));
}
}
so, what's the advantage ?
with the new version, if I want to extend PigStorage just to override
readField, I do something like:
import org.apache.pig.builtin.PigStorage;
import org.apache.pig.data.DataByteArray;
public class MyLoader extends PigStorage {
public MyLoader() {
this("\t");
}
public MyLoader(String delimiter) {
super(delimiter);
}
@Override
protected void readField(byte[] buf, int start, int end) {
// remove the field name (example: min=, max=)
for (int i=start; i<=end;i++) {
if (buf[i] == '=') {
start = i + 1;
break;
}
}
if (start == end) {
// NULL value
addToCurrentTuple(null);
} else {
addToCurrentTuple(new DataByteArray(buf, start, end));
}
}
}
currently, just to override readField, I also need to copy/paste a lot from
PigStorage:
import java.io.IOException;
import java.util.ArrayList;
import java.util.Properties;
import org.apache.hadoop.io.Text;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.builtin.PigStorage;
import org.apache.pig.PigException;
import org.apache.pig.data.DataByteArray;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.util.ObjectSerializer;
import org.apache.pig.impl.util.StorageUtil;
import org.apache.pig.impl.util.UDFContext;
public class MyLoaderextends PigStorage {
private byte fieldDel = '\t';
private ArrayList<Object> mProtoTuple = null;
private TupleFactory mTupleFactory = TupleFactory.getInstance();
private boolean mRequiredColumnsInitialized = false;
public MyLoader() {
this("\t");
}
public MyLoader(String delimiter) {
super(delimiter);
fieldDel = StorageUtil.parseFieldDel(delimiter);
}
// exactly the same than in PigStorage, but I needed to modify readField
which is private, so I have to copy all of it
@Override
public Tuple getNext() throws IOException {
mProtoTuple = new ArrayList<Object>();
if (!mRequiredColumnsInitialized) {
if (signature!=null) {
Properties p =
UDFContext.getUDFContext().getUDFProperties(this.getClass());
mRequiredColumns =
(boolean[])ObjectSerializer.deserialize(p.getProperty(signature));
}
mRequiredColumnsInitialized = true;
}
try {
boolean notDone = in.nextKeyValue();
if (!notDone) {
return null;
}
Text value = (Text) in.getCurrentValue();
byte[] buf = value.getBytes();
int len = value.getLength();
int start = 0;
int fieldID = 0;
for (int i = 0; i < len; i++) {
if (buf[i] == fieldDel) {
if (mRequiredColumns==null ||
(mRequiredColumns.length>fieldID && mRequiredColumns[fieldID]))
readField(buf, start, i);
start = i + 1;
fieldID++;
}
}
// pick up the last field
if (start <= len && (mRequiredColumns==null ||
(mRequiredColumns.length>fieldID && mRequiredColumns[fieldID]))) {
readField(buf, start, len);
}
Tuple t = mTupleFactory.newTupleNoCopy(mProtoTuple);
return t;
} catch (InterruptedException e) {
int errCode = 6018;
String errMsg = "Error while reading input";
throw new ExecException(errMsg, errCode,
PigException.REMOTE_ENVIRONMENT, e);
}
}
private void readField(byte[] buf, int start, int end) {
// remove the field name (example: min=, max=)
for (int i=start; i<=end;i++) {
if (buf[i] == '=') {
start = i + 1;
break;
}
}
if (start == end) {
// NULL value
mProtoTuple.add(null);
} else {
mProtoTuple.add(new DataByteArray(buf, start, end));
}
}
}
On Mon, Nov 19, 2012 at 12:24 PM, pablomar
<pa...@gmail.com>wrote:
> sure. My initial (and dirty) idea changed only 2 lines. I completely agree
> with you
>
>
>
>
>
> On Mon, Nov 19, 2012 at 12:16 PM, Bill Graham <bi...@gmail.com>wrote:
>
>> +1 as well, but I'd suggest we do the following:
>>
>> - Keep mProtoTuple private and add protected getters/setters instead with
>> javadocs describing expected usage.
>> - Rename mProtoTuple and the getters/setters to something more descriptive
>> than mProtoTuple.
>>
>>
>> On Fri, Nov 16, 2012 at 2:15 PM, Dmitriy Ryaboy <dv...@gmail.com>
>> wrote:
>>
>> > That sounds reasonable, I've run into the same problem. Do you mind
>> > submitting a patch?
>> >
>> > On Fri, Nov 16, 2012 at 12:48 PM, pablomar
>> > <pa...@gmail.com> wrote:
>> > > hi all,
>> > >
>> > > I'm using Pig 0.9.2 (Apache Pig version 0.9.2-cdh4.0.1, precisely)
>> > > I got a case today on which I needed to clean up some fields before
>> > > processing. I will need to do the same for all my scripts. So instead
>> of
>> > > doing it inside the scripts, I thought in extending PigStorage and do
>> it
>> > > inside my own Loader. My scripts will be shorter and cleaner
>> > >
>> > > in fact, the only method that I needed to overwrite was :
>> > > void *readField*(byte[] buf, int start, int end)
>> > >
>> > >
>> > > Everything was ok and it is working. Problem was that I had to
>> > copy/paste a
>> > > lot just because private declarations
>> > > for example:
>> > > private byte fieldDel = '\t';
>> > > private ArrayList<Object> mProtoTuple = null;
>> > > private TupleFactory mTupleFactory = TupleFactory.getInstance();
>> > > private boolean mRequiredColumnsInitialized = false;
>> > >
>> > > and of course:
>> > > *private *void readField(byte[] buf, int start, int end)
>> > >
>> > > so I had to copy/paste:
>> > > public Tuple getNext() and all the aforementioned variables just to be
>> > able
>> > > to write my own *readField*
>> > >
>> > >
>> > > would it be possible in next versions of Pig to have *readField
>> > *protected
>> > > as well as *mProtoTuple *? I think it could be useful in some cases
>> like
>> > > mine
>> > > I'm asking because I don't know the reasoning after the decisions of
>> made
>> > > them private
>> > >
>> > > thanks a lot,
>> >
>>
>>
>>
>> --
>> *Note that I'm no longer using my Yahoo! email address. Please email me at
>> billgraham@gmail.com going forward.*
>>
>
>
Re: PigStorage
Posted by pablomar <pa...@gmail.com>.
sure. My initial (and dirty) idea changed only 2 lines. I completely agree
with you
On Mon, Nov 19, 2012 at 12:16 PM, Bill Graham <bi...@gmail.com> wrote:
> +1 as well, but I'd suggest we do the following:
>
> - Keep mProtoTuple private and add protected getters/setters instead with
> javadocs describing expected usage.
> - Rename mProtoTuple and the getters/setters to something more descriptive
> than mProtoTuple.
>
>
> On Fri, Nov 16, 2012 at 2:15 PM, Dmitriy Ryaboy <dv...@gmail.com>
> wrote:
>
> > That sounds reasonable, I've run into the same problem. Do you mind
> > submitting a patch?
> >
> > On Fri, Nov 16, 2012 at 12:48 PM, pablomar
> > <pa...@gmail.com> wrote:
> > > hi all,
> > >
> > > I'm using Pig 0.9.2 (Apache Pig version 0.9.2-cdh4.0.1, precisely)
> > > I got a case today on which I needed to clean up some fields before
> > > processing. I will need to do the same for all my scripts. So instead
> of
> > > doing it inside the scripts, I thought in extending PigStorage and do
> it
> > > inside my own Loader. My scripts will be shorter and cleaner
> > >
> > > in fact, the only method that I needed to overwrite was :
> > > void *readField*(byte[] buf, int start, int end)
> > >
> > >
> > > Everything was ok and it is working. Problem was that I had to
> > copy/paste a
> > > lot just because private declarations
> > > for example:
> > > private byte fieldDel = '\t';
> > > private ArrayList<Object> mProtoTuple = null;
> > > private TupleFactory mTupleFactory = TupleFactory.getInstance();
> > > private boolean mRequiredColumnsInitialized = false;
> > >
> > > and of course:
> > > *private *void readField(byte[] buf, int start, int end)
> > >
> > > so I had to copy/paste:
> > > public Tuple getNext() and all the aforementioned variables just to be
> > able
> > > to write my own *readField*
> > >
> > >
> > > would it be possible in next versions of Pig to have *readField
> > *protected
> > > as well as *mProtoTuple *? I think it could be useful in some cases
> like
> > > mine
> > > I'm asking because I don't know the reasoning after the decisions of
> made
> > > them private
> > >
> > > thanks a lot,
> >
>
>
>
> --
> *Note that I'm no longer using my Yahoo! email address. Please email me at
> billgraham@gmail.com going forward.*
>
Re: PigStorage
Posted by Bill Graham <bi...@gmail.com>.
+1 as well, but I'd suggest we do the following:
- Keep mProtoTuple private and add protected getters/setters instead with
javadocs describing expected usage.
- Rename mProtoTuple and the getters/setters to something more descriptive
than mProtoTuple.
On Fri, Nov 16, 2012 at 2:15 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:
> That sounds reasonable, I've run into the same problem. Do you mind
> submitting a patch?
>
> On Fri, Nov 16, 2012 at 12:48 PM, pablomar
> <pa...@gmail.com> wrote:
> > hi all,
> >
> > I'm using Pig 0.9.2 (Apache Pig version 0.9.2-cdh4.0.1, precisely)
> > I got a case today on which I needed to clean up some fields before
> > processing. I will need to do the same for all my scripts. So instead of
> > doing it inside the scripts, I thought in extending PigStorage and do it
> > inside my own Loader. My scripts will be shorter and cleaner
> >
> > in fact, the only method that I needed to overwrite was :
> > void *readField*(byte[] buf, int start, int end)
> >
> >
> > Everything was ok and it is working. Problem was that I had to
> copy/paste a
> > lot just because private declarations
> > for example:
> > private byte fieldDel = '\t';
> > private ArrayList<Object> mProtoTuple = null;
> > private TupleFactory mTupleFactory = TupleFactory.getInstance();
> > private boolean mRequiredColumnsInitialized = false;
> >
> > and of course:
> > *private *void readField(byte[] buf, int start, int end)
> >
> > so I had to copy/paste:
> > public Tuple getNext() and all the aforementioned variables just to be
> able
> > to write my own *readField*
> >
> >
> > would it be possible in next versions of Pig to have *readField
> *protected
> > as well as *mProtoTuple *? I think it could be useful in some cases like
> > mine
> > I'm asking because I don't know the reasoning after the decisions of made
> > them private
> >
> > thanks a lot,
>
--
*Note that I'm no longer using my Yahoo! email address. Please email me at
billgraham@gmail.com going forward.*
Re: PigStorage
Posted by Dmitriy Ryaboy <dv...@gmail.com>.
That sounds reasonable, I've run into the same problem. Do you mind
submitting a patch?
On Fri, Nov 16, 2012 at 12:48 PM, pablomar
<pa...@gmail.com> wrote:
> hi all,
>
> I'm using Pig 0.9.2 (Apache Pig version 0.9.2-cdh4.0.1, precisely)
> I got a case today on which I needed to clean up some fields before
> processing. I will need to do the same for all my scripts. So instead of
> doing it inside the scripts, I thought in extending PigStorage and do it
> inside my own Loader. My scripts will be shorter and cleaner
>
> in fact, the only method that I needed to overwrite was :
> void *readField*(byte[] buf, int start, int end)
>
>
> Everything was ok and it is working. Problem was that I had to copy/paste a
> lot just because private declarations
> for example:
> private byte fieldDel = '\t';
> private ArrayList<Object> mProtoTuple = null;
> private TupleFactory mTupleFactory = TupleFactory.getInstance();
> private boolean mRequiredColumnsInitialized = false;
>
> and of course:
> *private *void readField(byte[] buf, int start, int end)
>
> so I had to copy/paste:
> public Tuple getNext() and all the aforementioned variables just to be able
> to write my own *readField*
>
>
> would it be possible in next versions of Pig to have *readField *protected
> as well as *mProtoTuple *? I think it could be useful in some cases like
> mine
> I'm asking because I don't know the reasoning after the decisions of made
> them private
>
> thanks a lot,