Posted to user@hive.apache.org by Joaquin Alzola <Jo...@lebara.com> on 2016/12/09 01:08:47 UTC

Hive Stored Textfile to Stored ORC taking long time

Hi List

The transformation from a textfile table to a table stored as ORC takes quite a long time.

The steps are as follows (a rough HiveQL sketch is shown after the list):


1. Create a normal table using the TEXTFILE format.

2. Load the data into this table.

3. Create a table with the schema of the expected results, stored as ORC.

4. Run an INSERT OVERWRITE query to copy the data from the textfile table to the ORC table.
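A minimal sketch of those four steps; the table, column, and path names are placeholders for illustration, not the real schema (which has about 550 columns):

-- 1. plain-text staging table (delimiter is an assumption)
CREATE TABLE cdrs_text (caller STRING, callee STRING, duration STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

-- 2. load the raw file into the staging table
LOAD DATA INPATH '/data/cdrs.txt' INTO TABLE cdrs_text;

-- 3. same schema, stored as ORC
CREATE TABLE cdrs_orc (caller STRING, callee STRING, duration STRING)
STORED AS ORC;

-- 4. copy text to ORC
INSERT OVERWRITE TABLE cdrs_orc SELECT * FROM cdrs_text;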

I have about 1.5 million records with about 550 fields in each row.

Doing step 4 takes about 30 minutes (moving from one format to the other).

I run Spark with only one worker (the same node also runs HDFS), i.e. a standalone server, but that worker has 25 GB of RAM and 14 cores.

BR

Joaquin

RE: Hive Stored Textfile to Stored ORC taking long time

Posted by Joaquin Alzola <Jo...@lebara.com>.
Hi Jan

So you just load the .gz file into the TEXTFILE table and then use an INSERT to copy it from the TEXTFILE table to the ORC table?
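If that is the idea, a sketch using the same placeholder names as before; Hive's text input decompresses .gz files transparently based on the file extension:

-- load the gzipped file as-is into the text staging table
LOAD DATA INPATH '/data/cdrs.csv.gz' INTO TABLE cdrs_text;

-- then convert exactly as before
INSERT OVERWRITE TABLE cdrs_orc SELECT * FROM cdrs_text;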


From: Brotanek, Jan [mailto:Jan.Brotanek@adastragrp.com]
Sent: 09 December 2016 22:29
To: user@hive.apache.org
Subject: RE: Hive Stored Textfile to Stored ORC taking long time

I have this problem as well. It takes forever to insert into the ORC table. My original table's text files are gzipped. I have 4 nodes, each with 64 GB and 16 cores.


RE: Hive Stored Textfile to Stored ORC taking long time

Posted by "Brotanek, Jan" <Ja...@adastragrp.com>.
I have this problem as well. It takes forever to insert into the ORC table. My original table's text files are gzipped. I have 4 nodes, each with 64 GB and 16 cores.


RE: Hive Stored Textfile to Stored ORC taking long time

Posted by Joaquin Alzola <Jo...@lebara.com>.
Hi Jorn

Yes, I will do that test: same file size but with fewer columns.

I created a table with simple columns (all strings, nothing nested) and I do not do any transformations. Both table schemas are attached.

By default hive.vectorized.execution.enabled is set to false; I have not enabled it.
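For reference, enabling it is just the two settings below; worth noting that in Hive 2.x vectorized execution mainly applies to ORC-backed reads, so it may not speed up a text-to-ORC write much:

SET hive.vectorized.execution.enabled = true;
SET hive.vectorized.execution.reduce.enabled = true;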

Just an example; this took over an hour:
0: jdbc:hive2://localhost:10000> insert into table ret_rec_cdrs_orc PARTITION (country='DE',year='2016',month='12') select * from ret_rec_cdrs where country='DE' and year='2016' and month='12';
+---------+--+
| Result  |
+---------+--+
+---------+--+
No rows selected (3837.457 seconds)
0: jdbc:hive2://localhost:10000> select count(*) from ret_rec_cdrs where country='DE' and year='2016' and month='12';
+----------+--+
|   _c0    |
+----------+--+
| 3900155  |
+----------+--+
1 row selected (24.722 seconds)
0: jdbc:hive2://localhost:10000> select count(*) from ret_rec_cdrs_orc where country='DE' and year='2016' and month='12';
+----------+--+
|   _c0    |
+----------+--+
| 3900155  |
+----------+--+
1 row selected (82.071 seconds)


Re: Hive Stored Textfile to Stored ORC taking long time

Posted by Jörn Franke <jo...@gmail.com>.
OK.
No, do not split into smaller files; that is done automatically. Your behavior looks strange: for that file size I would expect it to take under one minute.
Maybe you hit a bug in the Hive-on-Spark engine. You could try a file with fewer columns but the same size. I assume that this is a Hive table with simple columns (nothing deeply nested) and that you do not do any transformations.
What is the CTAS query?
Do you enable vectorization in Hive?

If you just need a simple mapping from CSV to ORC you can use any framework (MR, Tez, Spark, etc.), because performance does not differ much in these cases, especially for the small amount of data you process.
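For readers unfamiliar with the term: a CTAS (CREATE TABLE AS SELECT) folds steps 3 and 4 into a single statement. A sketch with the same placeholder names as above, using the ZLIB compression mentioned elsewhere in the thread:

CREATE TABLE cdrs_orc
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB')
AS SELECT * FROM cdrs_text;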


RE: Hive Stored Textfile to Stored ORC taking long time

Posted by Joaquin Alzola <Jo...@lebara.com>.
Hi Jorn

The file is about 1.5 GB, with 1.5 million records and about 550 fields in each row.

The ORC table is compressed with ZLIB.
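ZLIB is the ORC default; if compression were the bottleneck, the Snappy codec Jörn asks about could be tried instead, declared roughly like this (placeholder schema again):

CREATE TABLE cdrs_orc (caller STRING, callee STRING, duration STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');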

I am using a standalone solution before expanding it, so everything is on the same node.
Hive 2.0.1 --> Spark 1.6.3 --> HDFS 2.6.5

The configuration is more or less standard; I have not changed much.

It cannot be a network issue because all the apps are on the same node.

Since I am doing all of this conversion at the Hive level (from textfile to ORC), I wanted to know if I could do it quicker at the Spark or HDFS level (doing the file conversion some other way), rather than at the top of the “stack”.

We ingest the files once a day, so if I load them as textfile and then convert to ORC it will take me almost half a day just to be able to query the data.

It is basically a time-consuming task, and I want to do it much quicker. A better solution of course would be to ingest smaller files with Flume, but I will do that in the future.


Re: Hive Stored Textfile to Stored ORC taking long time

Posted by Jörn Franke <jo...@gmail.com>.
How large is the file? Might I/O be an issue? How many disks do you have on the single node?

Do you compress the ORC (Snappy)?

What is the Hadoop distribution? Configuration baseline? Hive version?

Not sure if I understood your setup, but might network be an issue?


RE: Hive Stored Textfile to Stored ORC taking long time

Posted by Joaquin Alzola <Jo...@lebara.com>.
Hi Gopal,

Hive version 2.0.1 with Spark 1.6.3.

The textfile was loaded into Hive as plain text; then I created the ORC table and ran the INSERT into the ORC table.

Would it be faster to load the file into the TEXTFILE table already gzipped?
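One caveat: a single .gz text file is not splittable in Hadoop, so the whole conversion would run as one map task. Splitting the input into several gzipped files before loading keeps some parallelism; a sketch with made-up paths:

-- several smaller gzipped files instead of one big one
LOAD DATA INPATH '/data/cdrs_part_00.csv.gz' INTO TABLE cdrs_text;
LOAD DATA INPATH '/data/cdrs_part_01.csv.gz' INTO TABLE cdrs_text;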




Re: Hive Stored Textfile to Stored ORC taking long time

Posted by Gopal Vijayaraghavan <go...@apache.org>.
 

 

> I have spark with only one worker (same for HDFS) so running now a standalone server but with 25G and 14 cores on that worker.

Which version of Hive was this?

And was the input text file compressed with something like gzip?

Cheers,
Gopal

RE: Hive Stored Textfile to Stored ORC taking long time

Posted by Joaquin Alzola <Jo...@lebara.com>.
Did you do anything to mitigate this issue? Like putting it directly on HDFS? Or going through Spark instead of going through Hive?

From: Qiuzhuang Lian [mailto:qiuzhuang.lian@gmail.com]
Sent: 09 December 2016 04:02
To: user@hive.apache.org
Subject: Re: Hive Stored Textfile to Stored ORC taking long time

Yes, we did run into this issue too, typically when the text Hive table exceeds 100 million rows while converting the text table into an ORC table.
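One mitigation sometimes used for large text-to-ORC inserts, though not one suggested in this thread, is to force a reduce stage so several reducers write the ORC files instead of a single mapper; the distribution column is a placeholder:

INSERT OVERWRITE TABLE cdrs_orc
SELECT * FROM cdrs_text
DISTRIBUTE BY caller;  -- adds a shuffle; reducers write the ORC files in parallel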


Re: Hive Stored Textfile to Stored ORC taking long time

Posted by Qiuzhuang Lian <qi...@gmail.com>.
Yes, we did run into this issue too, typically when the text Hive table exceeds 100 million rows while converting the text table into an ORC table.
