Posted to dev@hive.apache.org by naveen mahadevuni <nm...@gmail.com> on 2016/09/15 17:56:30 UTC

Load performance with partitioned table

Hi,

I'm using the ORC format for our table storage. The table has a timestamp
column (say TS) and 25 other columns. The other ORC properties we are using
are the storage index and bloom filters. We are loading 100 million records
into this table on a 4-node cluster.

Our source table is a text table in CSV format. In the source table the
timestamp values come as BIGINT. In the INSERT SELECT, we use the function
"from_unixtime(sourceTable.TS)" to convert the BIGINT values to timestamps
in the target ORC table. So the first INSERT SELECT into the non-partitioned
table looks like this:

1) INSERT INTO TARGET SELECT from_unixtime(ts), col1, col2... from SOURCE.

I wanted to test by partitioning the table by date derived from this
timestamp, so I used "to_date(from_unixtime(TS))" in the new INSERT SELECT
with dynamic partitioning. The second one is

2) INSERT INTO TARGET PARTITION(datecol) SELECT from_unixtime(ts), col1,
col2... to_date(from_unixtime(ts)) as datecol from SOURCE.

The load time increased by 50% from statement 1 to statement 2. I understand
the second statement involves creating many more partition directories and
files.

Is there any way we can improve the load time? In the second INSERT SELECT,
will the result of the expression "from_unixtime(ts)" be reused in
"to_date(from_unixtime(ts))"?
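
One commonly cited lever for this kind of dynamic-partition load (a sketch,
not a guaranteed fix; table and column names follow the statements above) is
to enable the sorted dynamic-partition optimization and distribute rows by
the partition key, so each task keeps only one ORC writer and its buffers
open at a time instead of one per partition:

```sql
-- Allow dynamic partitioning without a static partition column
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Sort rows by partition key before writing, so each reducer holds
-- a single open ORC writer rather than one per partition directory
SET hive.optimize.sort.dynamic.partition=true;

INSERT INTO TARGET PARTITION (datecol)
SELECT from_unixtime(ts), col1, col2, to_date(from_unixtime(ts)) AS datecol
FROM SOURCE
DISTRIBUTE BY to_date(from_unixtime(ts));
```

Whether this helps depends on the Hive version and the number of distinct
dates per load; it trades a shuffle for lower writer memory pressure.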

Thanks,
Naveen

Re: Load performance with partitioned table

Posted by naveen mahadevuni <nm...@gmail.com>.
Hi Jörn,

1) We are using 4 identical AWS machines: 8 vCPUs, 32 GB RAM, 1 TB storage.
2) We set up bloom filters on only two other string columns, not all of
them.
3) The data is event data, e.g. syslog.
4) Queries usually run on a timestamp range with additional predicates on
other columns (mostly equality).
5) We use SNAPPY compression with 256 MB blocks.
6) The ORC stripe size is 256 MB; the HDFS block size is 128 MB.
7) The time for the first INSERT is 206 seconds and for the second 302
seconds.
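
For reference, a table with the properties described above would typically
be declared roughly like this (a sketch only; the table name, column names,
and the two bloom-filter columns are placeholders, since the actual schema
is not shown in the thread):

```sql
CREATE TABLE target (
  ts   TIMESTAMP,
  col1 STRING,
  col2 STRING
  -- ... remaining columns
)
PARTITIONED BY (datecol DATE)
STORED AS ORC
TBLPROPERTIES (
  "orc.compress" = "SNAPPY",
  "orc.stripe.size" = "268435456",          -- 256 MB stripes
  "orc.create.index" = "true",              -- row-group (storage) index
  "orc.bloom.filter.columns" = "col1,col2"  -- bloom filters on two string columns only
);
```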

Thanks,
Naveen

On Fri, Sep 16, 2016 at 4:57 AM, Jörn Franke <jo...@gmail.com> wrote:

> What is your hardware setup?
> Are the bloom filters necessary on all columns? Usually they only make
> sense for non-numeric columns. Updating bloom filters takes time and should
> be avoided where they do not make sense.
> Can you provide an example of the data and the select queries that you
> execute on them?
> Do you use compression on the tables? If so which?
> What are the exact times and data volumes?

Re: Load performance with partitioned table

Posted by Jörn Franke <jo...@gmail.com>.
What is your hardware setup?
Are the bloom filters necessary on all columns? Usually they only make sense for non-numeric columns. Updating bloom filters takes time and should be avoided where they do not make sense.
Can you provide an example of the data and the select queries that you execute on them?
Do you use compression on the tables? If so which?
What are the exact times and data volumes?