Posted to user@hive.apache.org by Harsha HN <99...@gmail.com> on 2015/12/03 14:20:17 UTC

Handling LZO files

Hi,

We have LZO-compressed JSON files in our HDFS locations. I am creating an
external table on the data in HDFS for analytics.

There are 3 LZO-compressed part files of sizes 229.16 MB, 705.79 MB, and
157.61 MB respectively, along with their index files.

When I run a count(*) query on the table, I observe only 10 mappers, which
causes a performance bottleneck.

I even tried the following (aiming for a 30 MB split):

1) set mapreduce.input.fileinputformat.split.maxsize=31457280;

2) set dfs.blocksize=31457280;

But I still get only 10 mappers.

Can you please guide me in fixing this?

Thanks,
Sree Harsha
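
For reference, a minimal sketch of one way such a table and session could be
set up so that the LZO index files drive the splits. The jar paths, table
name, columns, and HDFS location are placeholders, and the choice of the
hadoop-lzo DeprecatedLzoTextInputFormat plus the HCatalog JsonSerDe is an
assumption about the setup; whether session-level split settings take effect
also depends on the Hive and Hadoop versions in use.

    -- Placeholder jar paths; these jars must already exist on the cluster.
    ADD JAR /path/to/hadoop-lzo.jar;           -- com.hadoop.mapred.DeprecatedLzoTextInputFormat
    ADD JAR /path/to/hive-hcatalog-core.jar;   -- org.apache.hive.hcatalog.data.JsonSerDe

    -- Placeholder table, columns, and location.
    CREATE EXTERNAL TABLE events_json (
      id      STRING,
      ts      STRING,
      payload STRING
    )
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    STORED AS
      INPUTFORMAT  'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION '/data/events/json_lzo';

    -- Ask for per-file splits so the .lzo.index files can be honoured, then
    -- cap the split size; the legacy mapred.* name is set as well because
    -- older Hive/Hadoop combinations still read it.
    set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
    set mapreduce.input.fileinputformat.split.maxsize=31457280;
    set mapred.max.split.size=31457280;

    SELECT COUNT(*) FROM events_json;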

Re: Handling LZO files

Posted by Jörn Franke <jo...@gmail.com>.
You can pack several small files into one Hadoop archive (HAR). The other alternative is to set the split size of the execution engine (Tez, MR, ...), which you probably do not want to do at a global level. In general, one should replace XML, JSON, etc. with Avro where possible and then use the ORC or Parquet format for analytics.
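
As an illustration of the per-engine split-size knobs, a minimal sketch follows; whether these take effect from a session-level set depends on the Hive and Hadoop versions and on the input format in use, so treat the values as examples rather than a recipe.

    -- MapReduce engine: cap the split size at ~30 MB (old and new property names).
    set mapred.max.split.size=31457280;
    set mapreduce.input.fileinputformat.split.maxsize=31457280;

    -- Tez engine: split grouping is controlled by its own properties.
    set hive.execution.engine=tez;
    set tez.grouping.min-size=16777216;   -- 16 MB
    set tez.grouping.max-size=33554432;   -- 32 MB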

Re: Handling LZO files

Posted by Jörn Franke <jo...@gmail.com>.
Your Hive version is too old. You may also want to use another execution engine. I think your problem might then be related to external tables, to which the parameters you set probably do not apply. I once had the same problem, and I needed to change the block size at the Hadoop level (hdfs-site.xml) or at the Hive level (hive-site.xml); it was definitely not possible within a Hive session (set ...). I would need to check the documentation.
In any case, loading it into ORC or Parquet makes a lot of sense, but only with a recent Hive version and Tez or Spark as the execution engine.

Re: Handling LZO files

Posted by Harsha HN <99...@gmail.com>.
Hi Franke,

It's a 100+ node cluster. Roughly 2 TB of memory and 1000+ vCores were
available when I ran my job, so infrastructure is not a problem here.
The Hive version is 0.13.

As for ORC or Parquet, that would require us to load 5 years of LZO data
into ORC or Parquet format. Though it might be more efficient for queries,
it increases data redundancy.
But we will explore that option.

For now, I want to understand why I am unable to scale up the mappers.

Thanks,
Harsha

Re: Handling LZO files

Posted by Jörn Franke <jo...@gmail.com>.
How many nodes and cores, and how much memory, do you have?
What Hive version?

Do you have the opportunity to use Tez as an execution engine?
Usually I use external tables only for reading the data and inserting it into a table in ORC or Parquet format for analytics.
This is much more performant than JSON or any other text-based format.
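
A sketch of that pattern, assuming a hypothetical external JSON table named events_json (table and column names are placeholders):

    -- One-off conversion: read the external LZO/JSON table once and store it as ORC.
    CREATE TABLE events_orc
    STORED AS ORC
    AS SELECT * FROM events_json;

    -- Subsequent analytics run against the columnar copy.
    SELECT COUNT(*) FROM events_orc;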
