Posted to user@crunch.apache.org by Mike Barretta <mi...@gmail.com> on 2012/12/03 23:25:34 UTC

Confusion regarding SeqFileSource

As there are no examples on using non-text files as input, I'm trying to
piece together the steps involved in reading in sequence data.

The main piece looks to be the SeqFileSource (as of 0.5 snapshot) which
takes a path and a PType.  The PType is where my confusion begins.

How does PType relate to InputFormat and OutputFormat? Do I need to
implement my own PTypes and the associated in/out MapFns?

Thanks,
Mike

Re: Confusion regarding SeqFileSource

Posted by Josh Wills <jo...@gmail.com>.

You're welcome, and thanks for the feedback. As part of documenting the
From.java class, I should add functions that do the Writables.writables
part for you (i.e., you just pass in the Class<K extends Writable> and
Class<V extends Writable> arguments) to make it easier to get rolling. I'll
add a JIRA for it.

J


Re: Confusion regarding SeqFileSource

Posted by Mike Barretta <mi...@gmail.com>.

Josh, thank you, that did help.  I'd found the From class, but not the
Writables.writables.


Re: Confusion regarding SeqFileSource

Posted by Josh Wills <jw...@cloudera.com>.

Hey Mike,

Sorry about that, it's mainly b/c they're tedious to write and I've been
lazy about it. Here's the skinny.

For the SeqFileSource, we assume that you're only interested in the "value"
portion of the key-value pair for each record in the SequenceFile. The
PType<T> should be for whatever data type you expect to read from that
value, which is probably a class that implements Writable. The easy way to
do it is:

import static org.apache.crunch.types.writable.Writables.writables;

import org.apache.crunch.io.From;

// This reads the value and ignores the key in each record
PCollection<MyWritable> in = pipeline.read(From.sequenceFile(<path>,
writables(MyWritable.class)));

If you want both the key and the value, you need to read the SequenceFile
as a PTable<K, V>, as:

PTable<MyKey, MyValue> in = pipeline.read(From.sequenceFile(<path>,
writables(MyKey.class), writables(MyValue.class)));

After you read in the values, you're free to convert them to whatever types
you like using parallelDo and friends. I especially recommend using the
Avro-based PTypeFamily, since it will significantly outperform the Writable
family on jobs that involve complex joins or aggregations.
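
On the original question of how a PType relates to the in/out MapFns, here is a
rough plain-Java sketch, not Crunch code. It assumes the simplified view that a
PType bundles two conversion functions (serialized form to user type, and back),
while the file format itself is the Source's concern. ToyPType, ToyPTypeDemo,
readAll and writeAll are hypothetical names made up for illustration; none of
them exist in Crunch.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in, NOT a Crunch class: models a PType as a pair of
// conversion functions between a serialized form (S) and a user type (T).
final class ToyPType<S, T> {
  final Function<S, T> inputFn;   // serialized record -> user object ("in" function)
  final Function<T, S> outputFn;  // user object -> serialized record ("out" function)

  ToyPType(Function<S, T> inputFn, Function<T, S> outputFn) {
    this.inputFn = inputFn;
    this.outputFn = outputFn;
  }
}

final class ToyPTypeDemo {
  // A toy type for strings stored as UTF-8 byte arrays.
  static final ToyPType<byte[], String> STRINGS = new ToyPType<byte[], String>(
      bytes -> new String(bytes, StandardCharsets.UTF_8),
      text -> text.getBytes(StandardCharsets.UTF_8));

  // Mimics the read side: apply the input function to every raw value.
  static <S, T> List<T> readAll(List<S> rawValues, ToyPType<S, T> type) {
    List<T> result = new ArrayList<T>();
    for (S raw : rawValues) {
      result.add(type.inputFn.apply(raw));
    }
    return result;
  }

  // Mimics the write side: apply the output function to every user object.
  static <S, T> List<S> writeAll(List<T> values, ToyPType<S, T> type) {
    List<S> result = new ArrayList<S>();
    for (T value : values) {
      result.add(type.outputFn.apply(value));
    }
    return result;
  }
}
```

In practice you should rarely need to write such a pair yourself: for a class
that already implements Writable, writables(MyWritable.class) from the snippet
above hands you a ready-made PType.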

Hope that helps, feel free to send follow-ups.

Josh

-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>