Posted to user@hive.apache.org by Saurabh Santhosh <sa...@gmail.com> on 2019/10/01 06:47:58 UTC

How to manage huge partitioned table with 1000+ columns in Hive

Hi,

I am facing the following problem while trying to store and use a huge
partitioned table with 1000+ columns in Hive. I would like to know how to
solve it, either with Hive or with any other store.

Requirement:

1). There is a table with 1000+ columns, partitioned by date.
2). Each day's partition holds data for around 500 million entities. There
is an id column identifying the entity, and 1000+ columns representing the
attributes of that entity for the day.
3). We have to retain data for around 2 years.
4). New columns may be added, or the logic of an existing column may change,
on any day; when this happens we have to repopulate that column for the
last 2 years.


Our Solution 1:

1). We created a single table with 1000+ columns, partitioned by date
(sketched below).
2). Every day we create a new partition and drop the partition that falls
outside the 2-year window.
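
Schematically, the wide table and the daily partition housekeeping looked
roughly like this (table, column and date values are illustrative, and the
attribute list is truncated):

CREATE TABLE daily_entity_attrs (
  id BIGINT,
  attr_1 STRING,
  attr_2 STRING
  -- ... roughly 1000 more attribute columns
)
PARTITIONED BY (ds STRING)
STORED AS ORC;

-- Load today's partition from a (hypothetical) staging table; in practice
-- the SELECT lists all 1000+ columns
INSERT OVERWRITE TABLE daily_entity_attrs PARTITION (ds = '2019-10-01')
SELECT id, attr_1, attr_2
FROM staging_entity_attrs;

-- Drop the partition that falls out of the 2-year window
ALTER TABLE daily_entity_attrs DROP IF EXISTS PARTITION (ds = '2017-10-01');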

Problems Faced in Solution 1:

Whenever we had to add or modify columns, backfilling the data took a very
long time: repopulating 2 years took months, because every daily partition
has to be read and rewritten in full, which means an enormous amount of IO.
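
To make the IO concrete: backfilling a single changed column still has to
rewrite every column of every daily partition, roughly like this
(entity_attrs_source, attr_2_recomputed and the column names are
hypothetical):

-- Even though only attr_2 changed, all 1000+ columns of the partition
-- must be selected and written back, one day at a time.
INSERT OVERWRITE TABLE daily_entity_attrs PARTITION (ds = '2018-05-17')
SELECT
  s.id,
  s.attr_1,
  r.attr_2,
  s.attr_3
  -- ..., and so on for every remaining column
FROM entity_attrs_source s
JOIN attr_2_recomputed r
  ON s.id = r.id AND s.ds = r.ds
WHERE s.ds = '2018-05-17';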


Our Solution 2:

1). We split the columns into 10 tables of roughly 100 columns each, each
partitioned by date.
2). Every day we create a new partition in each of the small tables and
drop the partition that falls outside the 2-year window.
3). We created a view that joins all the tables on (id, date), as sketched
below.
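
A sketch of the split and of the view, with illustrative names (only 3 of
the 10 subtables shown):

CREATE TABLE entity_attrs_p1 (id BIGINT, attr_1 STRING, attr_2 STRING)
PARTITIONED BY (ds STRING)
STORED AS ORC;
-- ... entity_attrs_p2 through entity_attrs_p10 hold the remaining
-- ~100-column groups

CREATE VIEW entity_attrs_v AS
SELECT a.id, a.ds, a.attr_1, a.attr_2, b.attr_101, c.attr_201
FROM entity_attrs_p1 a
JOIN entity_attrs_p2 b ON a.id = b.id AND a.ds = b.ds
JOIN entity_attrs_p3 c ON a.id = c.id AND a.ds = c.ds;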


Problems Faced in Solution 2:

The backfill time dropped considerably, from months to weeks, because we
only need to refresh the small table that contains the columns being
backfilled, which reduces the IO drastically.
However, this led to very slow queries on top of the view. In particular,
when we query 6 months of data, the queries take more than 10 hours because
of the data shuffling in the joins.

Our Solution 3:

1). We also tried bucketing each small table on the id column (roughly as
below), but this did not give us the desired result: the shuffle still
happened.
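
Roughly, the bucketed variant of each subtable looked like this (bucket
count and names are illustrative); the joins through the view still
shuffled:

CREATE TABLE entity_attrs_p1 (id BIGINT, attr_1 STRING, attr_2 STRING)
PARTITIONED BY (ds STRING)
CLUSTERED BY (id) SORTED BY (id ASC) INTO 256 BUCKETS
STORED AS ORC;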


Can anyone suggest the best approach for this scenario?

Re: How to manage huge partitioned table with 1000+ columns in Hive

Posted by Furcy Pin <pi...@gmail.com>.
Hello,

Sorry for the late reply, but this problem is very interesting. How did you
end up solving it?

I have an idea which is very ugly but might work:

Create a big view that is a union of all partitions:

SELECT
'2019-10-01' as ds, *
FROM test_1 a
JOIN test_2 b ON a.id = b.id
JOIN test_3 c ON c.id = a.id
WHERE a.ds = '2019-10-01' AND b.ds = '2019-10-01' AND c.ds = '2019-10-01'
UNION ALL
SELECT
'2019-10-02' as ds, *
FROM test_1 a
JOIN test_2 b ON a.id = b.id
JOIN test_3 c ON c.id = a.id
WHERE a.ds = '2019-10-02' AND b.ds = '2019-10-02' AND c.ds = '2019-10-02'
UNION ALL
...
;


That way, each part of the union will use the *sortedmerge* optimization,
and hopefully, if the optimizer is smart enough, it will be able to prune
the parts of the union that don't match when you filter on several days.
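
For instance, assuming the view is called big_view (the name is purely
illustrative), you could EXPLAIN a single-day query against it and check
whether the other branches of the union get pruned:

EXPLAIN
SELECT *
FROM big_view
WHERE ds = '2019-10-01';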

I haven't tested it, so I can't tell whether it works or not; it is just an
idea. If Hive has a limit on the maximum size of the resolved query, this
might hit it, though.

Hope this helps.

Furcy



Re: How to manage huge partitioned table with 1000+ columns in Hive

Posted by Pau Tallada <ta...@pic.es>.
Hi,

I would say the most efficient way would be option (3), where all the
subtables are partitioned by date and clustered + *sorted* by id.
This way, efficient SMB map joins can be performed over the 10 tables of
the same partition.

Unfortunately, I haven't found a way to achieve SMB map joins *over more
than one* partition :(

Example:

CREATE TABLE test_1 (id INT, c1 FLOAT, c2 FLOAT)
PARTITIONED BY (ds STRING)
CLUSTERED BY (id)
SORTED BY (id ASC)
INTO 4 BUCKETS
STORED AS ORC;

CREATE TABLE test_2 (id INT, c3 FLOAT, c4 FLOAT)
PARTITIONED BY (ds STRING)
CLUSTERED BY (id)
SORTED BY (id ASC)
INTO 4 BUCKETS
STORED AS ORC;

CREATE TABLE test_3 (id INT, c5 FLOAT, c6 FLOAT)
PARTITIONED BY (ds STRING)
CLUSTERED BY (id)
SORTED BY (id ASC)
INTO 4 BUCKETS
STORED AS ORC;

Over these tables, one can perform efficient single-stage SMB map joins,
*if you filter on a single partition*:

set hive.execution.engine=tez;
set hive.enforce.sortmergebucketmapjoin=false;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask.size=0;

EXPLAIN
SELECT *
FROM test_1 a
JOIN test_2 b
  ON a.id = b.id AND a.ds = b.ds
JOIN test_3 c
  ON b.id = c.id AND b.ds = c.ds
WHERE a.ds = '2019-10-01'
;

When you query on two partitions, however, it does a shuffle :(

set hive.execution.engine=tez;
set hive.enforce.sortmergebucketmapjoin=false;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask.size=0;

EXPLAIN
SELECT *
FROM test_1 a
JOIN test_2 b
  ON a.id = b.id AND a.ds = b.ds
JOIN test_3 c
  ON b.id = c.id AND b.ds = c.ds
WHERE a.ds IN ('2019-10-01', '2019-10-02')
;


My problem is very similar, so let's hope someone out there has the answer
:)

Cheers,

Pau.



-- 
----------------------------------
Pau Tallada Crespí
Dep. d'Astrofísica i Cosmologia
Port d'Informació Científica (PIC)
Tel: +34 93 170 2729
----------------------------------