Posted to user@crunch.apache.org by Jeff Quinn <je...@nunahealth.com> on 2015/05/05 08:38:56 UTC

Byte Offset for Records

Hello,

I would like to know the byte offset (absolute offset, not relative to
split) for each record inside of my crunch pipeline.

My planned approach is to use a custom `InputFormat` class.
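
To make the "absolute offset, not relative to split" requirement concrete, here is a minimal sketch in plain Java. It does not use Crunch or Hadoop; `ByteOffsetSketch`, `OffsetLine`, `readSplit`, and the split boundaries are hypothetical names and values chosen for illustration. The point it tries to show is that a reader which tracks its position as an index into the whole file, rather than into its own split, already holds the absolute offset of each record.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Sketch only: plain Java with no Crunch or Hadoop classes. */
final class ByteOffsetSketch {

    /** One line of text plus the absolute byte offset at which it starts in the file. */
    static final class OffsetLine {
        final long offset;
        final String text;

        OffsetLine(long offset, String text) {
            this.offset = offset;
            this.text = text;
        }
    }

    /** Returns the index just past the next newline at or after from, capped at the file length. */
    static int nextLineStart(byte[] file, int from) {
        int i = from;
        while (i < file.length && file[i] != '\n') {
            i++;
        }
        return Math.min(i + 1, file.length);
    }

    /**
     * Reads the lines owned by the split [start, end). Convention assumed here:
     * a split that does not begin at byte 0 skips its first (possibly partial)
     * line, and every split keeps reading while the line start is <= end, so
     * each line is emitted by exactly one split.
     */
    static List<OffsetLine> readSplit(byte[] file, int start, int end) {
        int pos = start;
        if (start != 0) {
            pos = nextLineStart(file, pos); // the previous split owns this line
        }
        List<OffsetLine> lines = new ArrayList<>();
        while (pos <= end && pos < file.length) {
            int next = nextLineStart(file, pos);
            // Exclude the trailing newline, if any, from the text.
            int textEnd = (file[next - 1] == '\n') ? next - 1 : next;
            // pos indexes the whole file, so it is already an absolute offset.
            lines.add(new OffsetLine(pos,
                    new String(file, pos, textEnd - pos, StandardCharsets.UTF_8)));
            pos = next;
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] file = "alpha\nbeta\ngamma\n".getBytes(StandardCharsets.UTF_8);
        // Two hypothetical splits: [0, 8) and [8, 17).
        for (int[] split : new int[][] {{0, 8}, {8, file.length}}) {
            for (OffsetLine line : readSplit(file, split[0], split[1])) {
                System.out.println(line.offset + "\t" + line.text);
            }
        }
    }
}
```

With the two splits above, the second split begins in the middle of "beta", skips it, and reports "gamma" at offset 11 in the file rather than at an offset relative to byte 8.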

I have tried using `From#formattedFile` to apply a custom
`InputFormat` class; however, the returned class does not implement
`ReadableSource`, and thus cannot be used as a parameter for
`Pipeline#read`.

What is the purpose of the `From#formattedFile` method if the Source class
it returns cannot actually be read? Is using a custom `InputFormat`
class possible or recommended?

Thanks,

Jeff Quinn
Data Engineer
Nuna

-- 
*DISCLAIMER:* The contents of this email, including any attachments, may 
contain information that is confidential, proprietary in nature, protected 
health information (PHI), or otherwise protected by law from disclosure, 
and is solely for the use of the intended recipient(s). If you are not the 
intended recipient, you are hereby notified that any use, disclosure or 
copying of this email, including any attachments, is unauthorized and 
strictly prohibited. If you have received this email in error, please 
notify the sender of this email. Please delete this and all copies of this 
email from your system. Any opinions either expressed or implied in this 
email and all attachments, are those of its author only, and do not 
necessarily reflect those of Nuna Health, Inc.

Re: Byte Offset for Records

Posted by Josh Wills <jw...@cloudera.com>.
https://issues.apache.org/jira/browse/CRUNCH-517

Patch is up. I also fixed that stupid crunch-spark compile error on
hadoop1. I so cannot wait to get rid of hadoop1. :)

J

-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Re: Byte Offset for Records

Posted by Jeff Quinn <je...@nunahealth.com>.
Great. I would definitely agree, that sounds ideal.

Thanks,

Jeff


Re: Byte Offset for Records

Posted by Josh Wills <jw...@cloudera.com>.
On Tue, May 5, 2015 at 9:08 AM, Jeff Quinn <je...@nunahealth.com> wrote:

> Josh,
>
> Thanks so much for your response, you’re correct I hit the error while
> using the MemPipeline. The difference between Source and ReadableSource
> makes much more sense to me now.
>
> It sounds like I just need to implement ReadableSource and override the
> #read and #asReadable methods with behavior that is equivalent to how my
> `InputFormat`  would act. Then I should be able to use my `InputFormat` in
> my test suite with MemPipeline, and in my real pipeline I can rest assured
> those methods will never be called.
>

That will work, but I still think the right thing to do is to make those
formattedFile impls support ReadableSource. And there are definitely places
in the MRPipeline and MemPipeline where ReadableSources would be useful
w/formattedFiles (e.g., mapside joins) that we don't support right now.
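
As a rough illustration of why a map-side join wants a source that can be read directly: one side of the join is loaded fully into memory, and the other side is streamed past it. The sketch below is plain Java with hypothetical names (`MapSideJoinSketch`, `join`, `kv`); it is not Crunch's join implementation, and it assumes keys on the small side are unique.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: plain Java, not Crunch's join code. */
final class MapSideJoinSketch {

    /**
     * Inner-joins two keyed collections. The small side is read fully into a
     * hash map first, which is only possible if it can be read locally; the
     * large side is then streamed past it one record at a time.
     */
    static List<String> join(Iterable<Map.Entry<String, String>> smallSide,
                             Iterable<Map.Entry<String, String>> largeSide) {
        Map<String, String> lookup = new HashMap<>();
        for (Map.Entry<String, String> e : smallSide) {
            lookup.put(e.getKey(), e.getValue()); // assumes unique keys on the small side
        }
        List<String> joined = new ArrayList<>();
        for (Map.Entry<String, String> e : largeSide) {
            String match = lookup.get(e.getKey());
            if (match != null) {
                joined.add(e.getKey() + ": " + match + " / " + e.getValue());
            }
        }
        return joined;
    }

    /** Small helper for building key/value pairs in the example. */
    static Map.Entry<String, String> kv(String k, String v) {
        return new SimpleImmutableEntry<>(k, v);
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> small = Arrays.asList(kv("1", "a"), kv("2", "b"));
        List<Map.Entry<String, String>> large =
                Arrays.asList(kv("2", "x"), kv("3", "y"), kv("1", "z"));
        join(small, large).forEach(System.out::println);
    }
}
```

The first loop is the step that needs a locally readable input; the second loop corresponds to the records a mapper would see anyway.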


Re: Byte Offset for Records

Posted by Jeff Quinn <je...@nunahealth.com>.
Josh,

Thanks so much for your response. You’re correct: I hit the error while using the MemPipeline. The difference between Source and ReadableSource makes much more sense to me now.

It sounds like I just need to implement ReadableSource and override the #read and #asReadable methods with behavior that is equivalent to how my `InputFormat` would act. Then I should be able to use my `InputFormat` in my test suite with MemPipeline, and in my real pipeline I can rest assured those methods will never be called.

Best,

Jeff

Re: Byte Offset for Records

Posted by Josh Wills <jw...@cloudera.com>.
Any Source<T> can be used as the input to an MR/Spark job via
Pipeline.read, but a ReadableSource<T> can read data into the local client
as well-- I'm assuming you're hitting an error trying to use your
formattedFile source w/a MemPipeline job? MemPipeline requires
ReadableSources since everything it does runs client-side, while MRPipeline
and SparkPipeline are happy to use regular Sources, like the one returned
by formattedFile.
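
The distinction can be sketched with toy interfaces. `ToySource`, `ToyReadableSource`, `readInMemory`, and `planClusterJob` below are hypothetical stand-ins written for this illustration, not Crunch's actual classes or code; the sketch only assumes that an in-memory pipeline has to iterate its input in the client JVM, while a cluster pipeline merely needs to know where the input is.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch only: toy stand-ins, not the Crunch API. */
final class SourceSketch {

    /** Stand-in for a source that a cluster job knows how to consume. */
    interface ToySource<T> {
        String name();
    }

    /** Stand-in for a source that can also be iterated in the client JVM. */
    interface ToyReadableSource<T> extends ToySource<T> {
        Iterable<T> read();
    }

    /** A cluster-style pipeline only records where its input comes from. */
    static String planClusterJob(ToySource<?> source) {
        return "job reading from " + source.name();
    }

    /** An in-memory pipeline has no cluster, so it must read the data locally. */
    static <T> List<T> readInMemory(ToySource<T> source) {
        if (!(source instanceof ToyReadableSource)) {
            throw new IllegalArgumentException(source.name() + " cannot be read client-side");
        }
        List<T> out = new ArrayList<>();
        for (T item : ((ToyReadableSource<T>) source).read()) {
            out.add(item);
        }
        return out;
    }

    public static void main(String[] args) {
        ToySource<String> clusterOnly = () -> "formattedFile-like source";
        ToyReadableSource<String> readable = new ToyReadableSource<String>() {
            @Override public String name() { return "readable source"; }
            @Override public Iterable<String> read() { return Arrays.asList("a", "b"); }
        };
        System.out.println(planClusterJob(clusterOnly));
        System.out.println(readInMemory(readable));
        try {
            readInMemory(clusterOnly);
        } catch (IllegalArgumentException e) {
            System.out.println("in-memory read rejected: " + e.getMessage());
        }
    }
}
```

In this toy model the plain source is accepted when planning a cluster job but rejected by the in-memory reader, which mirrors the behavior described above.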

The next question you would ask is "why doesn't formattedFile return a
ReadableSource<T>?" -- and it's a good one. I don't remember if there's a
good reason for it or if I was just being lazy. Will take a look and report
back.

J
