Posted to user@hive.apache.org by Arun Patel <ar...@gmail.com> on 2016/10/06 19:56:53 UTC

Re: HDFS small files to Sequence file using Hive

*Is there a way to increase the file/block size beyond 1MB? *

*Thank you!*
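In case it helps anyone else hitting this: the small outputs usually come from one file per mapper, and Hive's merge settings can combine them into larger files. These are taken from the Hive configuration docs; the sizes are just illustrative and I have not yet verified them against sequence file output:

```sql
SET hive.merge.mapfiles=true;                 -- merge small files from map-only jobs
SET hive.merge.mapredfiles=true;              -- merge small files from map-reduce jobs
SET hive.merge.smallfiles.avgsize=134217728;  -- trigger a merge pass if avg file < 128 MB
SET hive.merge.size.per.task=268435456;       -- aim for ~256 MB per merged file
```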

On Mon, Sep 26, 2016 at 7:50 PM, Arun Patel <ar...@gmail.com> wrote:

> Thanks Dudu and Gopal.
>
> I tried HAR files and it works.
>
> I want to use Sequence file because I want to expose data using a table
> (filename and content columns).  *Can this be done for HAR files?*
>
> This is what I am doing to create a sequencefile:
>
> create external table raw_files (raw_data string) location
> '/user/myid/myfiles';
> create table files_seq (key string, value string) stored as sequencefile;
> insert overwrite table files_seq
>   select REGEXP_EXTRACT(INPUT__FILE__NAME, '.*/(.*)/(.*)', 2) as file_name,
>          CONCAT_WS(' ', COLLECT_LIST(raw_data)) as raw_data
>   from raw_files
>   group by INPUT__FILE__NAME;
>
> It works well.  But, I am seeing 1MB files in the files_seq directory.  I am
> using the parameters below. * Is there a way to increase the file/block size?*
>
> SET hive.exec.compress.output=true;
> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
> SET mapred.output.compression.type=BLOCK;
>
>
> On Fri, Sep 23, 2016 at 7:16 PM, Gopal Vijayaraghavan <go...@apache.org>
> wrote:
>
>>
>> > Is there a way to create an external table on a directory, extract
>> 'key' as file name and 'value' as file content and write to a sequence file
>> table?
>>
>> Do you care that it is a sequence file?
>>
>> The HDFS HAR format was invented for this particular problem, check if
>> the "hadoop archive" command works for you and offers a filesystem
>> abstraction.
>>
>> Otherwise, there's always the old Mahout "seqdirectory" job, which is
>> great if you have, say, .jpg files and want to pack them so that HDFS
>> handles them better (like GPS tiles).
>>
>> Cheers,
>> Gopal
>>
>>
>>
>
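
For readers following Gopal's HAR suggestion above, a minimal "hadoop archive" invocation looks like this (the paths are placeholders for your own directories, and it needs a running cluster):

```shell
# Pack everything under /user/myid/myfiles into a single HAR file
hadoop archive -archiveName myfiles.har -p /user/myid myfiles /user/myid/archives

# The archive is then addressable through the har:// filesystem abstraction
hadoop fs -ls har:///user/myid/archives/myfiles.har/myfiles
```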