Posted to user@crunch.apache.org by "Kodimala,Rajashekar" <Ra...@cerner.com> on 2017/04/17 21:11:51 UTC

Regarding Combine File Flag for Sequence Files

Hello Team,

Recently we observed that the Crunch API disables the combine file flag by default for sequence files, but does not disable it when the input files are Avro files. Is there any specific reason why combining files is disabled by default for sequence files?

seqFileSource.inputConf(RuntimeParameters.DISABLE_COMBINE_FILE, "true");

Thanks
--
Rajashekar Kodimala
Software Engineer, Population Health Dev
Rajashekar.Kodimala@cerner.com
www.cerner.com



CONFIDENTIALITY NOTICE This message and any included attachments are from Cerner Corporation and are intended only for the addressee. The information contained in this message is confidential and may constitute inside or non-public information under international, federal, or state securities laws. Unauthorized forwarding, printing, copying, distribution, or use of such information is strictly prohibited and may be unlawful. If you are not the addressee, please promptly delete this message and notify the sender of the delivery error by e-mail or you may call Cerner's corporate offices in Kansas City, Missouri, U.S.A at (+1) (816)221-1024.

Re: Regarding Combine File Flag for Sequence Files

Posted by Micah Whitacre <mk...@gmail.com>.
Yes, the double negative needed to enable combining small files is a bit
confusing.  I think SeqFileTableSource's default behavior around combining
small files is an oversight rather than intentional.

On Mon, Apr 17, 2017 at 8:43 PM, Nithin Asokan <an...@gmail.com> wrote:

> I think we noticed this around SeqFileTableSource. It almost seems like
> the table source doesn't explicitly set those configs, and the
> CrunchInputFormat expects it to be set to *false* to enable combine
> files.
>
> https://github.com/apache/crunch/blob/apache-crunch-0.15.0/crunch-core/src/main/java/org/apache/crunch/io/seq/SeqFileTableSource.java#L38-L48
> https://github.com/apache/crunch/blob/apache-crunch-0.15.0/crunch-core/src/main/java/org/apache/crunch/impl/mr/run/CrunchInputFormat.java#L55-L57
>
> I believe the Avro table source works fine because it extends
> AvroFileSource. SeqFileTableSource doesn't follow the same pattern: it
> extends FileTableSourceImpl, and I wonder if that is part of the
> problem.
>
> Thanks,
> Nithin

Re: Regarding Combine File Flag for Sequence Files

Posted by Nithin Asokan <an...@gmail.com>.
I think we noticed this around SeqFileTableSource. It almost seems like the
table source doesn't explicitly set those configs, and the
CrunchInputFormat expects it to be set to *false* to enable combine files.

https://github.com/apache/crunch/blob/apache-crunch-0.15.0/crunch-core/src/main/java/org/apache/crunch/io/seq/SeqFileTableSource.java#L38-L48
https://github.com/apache/crunch/blob/apache-crunch-0.15.0/crunch-core/src/main/java/org/apache/crunch/impl/mr/run/CrunchInputFormat.java#L55-L57

I believe the Avro table source works fine because it extends
AvroFileSource. SeqFileTableSource doesn't follow the same pattern: it
extends FileTableSourceImpl, and I wonder if that is part of the
problem.

Thanks,
Nithin
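
[Editor's sketch] The double negative discussed above can be easy to misread, so here is a minimal, library-free illustration in plain Java. It does not use Crunch itself. The class name, the map-based config, and the key string are hypothetical stand-ins for RuntimeParameters.DISABLE_COMBINE_FILE, and the "unset means disabled" default is an assumption based on the description in this thread that the flag must be set to false to enable combining.

```java
import java.util.HashMap;
import java.util.Map;

public class CombineFlagSketch {
    // Hypothetical key, standing in for RuntimeParameters.DISABLE_COMBINE_FILE.
    static final String DISABLE_COMBINE_FILE = "crunch.disable.combine.file";

    // Combining is enabled only when the "disable" flag is explicitly "false".
    // Assumption: an unset flag is treated as "true", i.e. combining is off.
    static boolean combineEnabled(Map<String, String> conf) {
        String value = conf.getOrDefault(DISABLE_COMBINE_FILE, "true");
        return !Boolean.parseBoolean(value);
    }

    public static void main(String[] args) {
        // A source that never sets the flag, as described for SeqFileTableSource.
        Map<String, String> unset = new HashMap<>();

        // A source that sets the flag to "false", which turns combining on.
        Map<String, String> explicitFalse = new HashMap<>();
        explicitFalse.put(DISABLE_COMBINE_FILE, "false");

        // A source that sets the flag to "true", which turns combining off.
        Map<String, String> explicitTrue = new HashMap<>();
        explicitTrue.put(DISABLE_COMBINE_FILE, "true");

        System.out.println("unset          -> combine enabled: " + combineEnabled(unset));
        System.out.println("set to \"false\" -> combine enabled: " + combineEnabled(explicitFalse));
        System.out.println("set to \"true\"  -> combine enabled: " + combineEnabled(explicitTrue));
    }
}
```

Under these assumptions, a source that never touches the flag behaves the same as one that sets it to "true", which would explain why a source that skips the explicit "false" ends up without combined inputs.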


On Mon, Apr 17, 2017 at 8:25 PM Micah Whitacre <mk...@gmail.com> wrote:

> It might have been me:
> https://issues.apache.org/jira/browse/CRUNCH-331
>
> Also, can you clarify where you see it being set to true?  In the current
> stream of code they are both set the same [1][2].
>
> [1] -
> https://github.com/apache/crunch/blob/047d8fd36773608a3d2cf6445881173e7d26377c/crunch-core/src/main/java/org/apache/crunch/io/seq/SeqFileSource.java#L42
> [2] -
> https://github.com/apache/crunch/blob/047d8fd36773608a3d2cf6445881173e7d26377c/crunch-core/src/main/java/org/apache/crunch/io/avro/AvroFileSource.java#L44

Re: Regarding Combine File Flag for Sequence Files

Posted by Micah Whitacre <mk...@gmail.com>.
It might have been me:
https://issues.apache.org/jira/browse/CRUNCH-331

Also, can you clarify where you see it being set to true?  In the current
stream of code they are both set the same [1][2].

[1] -
https://github.com/apache/crunch/blob/047d8fd36773608a3d2cf6445881173e7d26377c/crunch-core/src/main/java/org/apache/crunch/io/seq/SeqFileSource.java#L42
[2] -
https://github.com/apache/crunch/blob/047d8fd36773608a3d2cf6445881173e7d26377c/crunch-core/src/main/java/org/apache/crunch/io/avro/AvroFileSource.java#L44


On Mon, Apr 17, 2017 at 7:33 PM, Josh Wills <jo...@gmail.com> wrote:

> +tomwhite
>
> I think Tom was the one who set this originally, but it might be my faulty
> memory. :/
>
> J

Re: Regarding Combine File Flag for Sequence Files

Posted by Josh Wills <jo...@gmail.com>.
+tomwhite

I think Tom was the one who set this originally, but it might be my faulty
memory. :/

J
