Posted to user@crunch.apache.org by "Durfey,Stephen" <St...@Cerner.com> on 2014/01/23 21:37:41 UTC

Source creation with From.formattedFile

Recently I needed to read a CSV file with Crunch. Reading it as a text file and then splitting on a delimiter wasn’t an option, because values in the CSV file could contain a newline character embedded inside quotes. So a teammate and I wrote a custom input format that reads the file and generates splits at the end of a valid CSV line, rather than at the first newline character.
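
To illustrate the problem described above, here is a minimal sketch in plain Java of how a record boundary can be found when newlines may appear inside quoted fields. It is not the custom InputFormat from this thread: the class name CsvRecordSplitter and the method splitRecords are hypothetical, and it assumes double-quote quoting, doubled quotes ("") as the escape, and "\n" line endings. A real InputFormat would also have to handle a split that starts in the middle of a record, which this sketch does not attempt.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Crunch or Hadoop.
public class CsvRecordSplitter {

    /**
     * Splits CSV text into logical records. A newline ends a record only
     * when it appears outside of a double-quoted field.
     */
    public static List<String> splitRecords(String text) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"') {
                // An escaped quote ("") toggles the flag twice,
                // so the quoted state is unchanged afterwards.
                inQuotes = !inQuotes;
                current.append(c);
            } else if (c == '\n' && !inQuotes) {
                // Unquoted newline: the current record is complete.
                records.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        // Keep a final record that has no trailing newline.
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        String csv = "id,comment\n1,\"line one\nline two\"\n2,plain\n";
        for (String record : splitRecords(csv)) {
            // Show embedded newlines as \n so each record prints on one line.
            System.out.println("[" + record.replace("\n", "\\n") + "]");
        }
    }
}
```

A plain line-based reader would cut the second record in two at the embedded newline, which is why splitting the text file on newlines was not enough here.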

We started using From.formattedFile (I wasn’t aware of it until I read the user-guide, so thanks Josh for throwing that together) to create the TableSource we needed to read the file. After some testing we noticed that the getSplits method we had overridden in our InputFormat wasn’t being called. After some debugging we found our way to ‘CrunchInputFormat’ and saw that our InputFormat was being replaced with ‘CrunchCombineInputFormat’, which was causing our splits to be incorrect. After setting the config key that disables ‘CrunchCombineInputFormat’, everything worked as it should.

I have two possible requests/suggestions:

  1.  If the desired behavior is to use CrunchCombineInputFormat by default (even if the developer specifies their own InputFormat), can this be mentioned in the Source section of the user-guide? The config key for disabling the combine behavior is mentioned in the user-guide, but not near the Source information, so we were unaware of this behavior until we debugged through the code.
  2.  If the developer uses From.formattedFile and explicitly specifies an InputFormat, can that be honored, with CrunchCombineInputFormat disabled without developer intervention?

I would think option 2 is preferred. My expectation was that my InputFormat would be used rather than the code defaulting to a different InputFormat.

Stephen Durfey
Software Engineer|The Record
816-201-2689 | Stephen.Durfey@cerner.com


Re: Source creation with From.formattedFile

Posted by Josh Wills <jw...@cloudera.com>.
And filed: https://issues.apache.org/jira/browse/CRUNCH-331





-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Re: Source creation with From.formattedFile

Posted by Josh Wills <jw...@cloudera.com>.
I think option 2 makes sense-- let's file a JIRA for it.

J




