Posted to user@spark.apache.org by Patrick Tucci <pa...@gmail.com> on 2023/06/26 15:32:28 UTC

Spark-Sql - Slow Performance With CTAS and Large Gzipped File

Hello,

I'm using Spark 3.4.0 in standalone mode with Hadoop 3.3.5. The master node
has 2 cores and 8GB of RAM. There is a single worker node with 8 cores and
64GB of RAM.

I'm trying to process a large pipe-delimited file that has been compressed
with gzip (9.2GB compressed, ~58GB uncompressed, ~241m records, ~85
columns). I uploaded the gzipped file to HDFS and created an external table
using the attached script. I tried two simpler queries on the same table,
and they finished in ~5 and ~10 minutes respectively:

SELECT COUNT(*) FROM ClaimsImport;
SELECT COUNT(*) FROM ClaimsImport WHERE ClaimLineID = 1;

However, when I tried to create a table stored as ORC using this table as
the input, the query ran for almost 4 hours:

CREATE TABLE Claims STORED AS ORC
AS
SELECT *
FROM ClaimsImport
--Exclude the header record
WHERE ClaimID <> 'ClaimID';

[image: image.png]

Why is there such a speed disparity between these operations? I understand
that the file cannot be read in parallel because gzip is not a splittable
format. I also understand that creating an ORC table from the input will
take more time than a simple COUNT(*). But it doesn't seem like the CREATE
TABLE operation should take more than 24x longer than a simple
SELECT COUNT(*) statement.

Thanks for any help. Please let me know if I can provide any additional
information.

Patrick

Re: Spark-Sql - Slow Performance With CTAS and Large Gzipped File

Posted by Mich Talebzadeh <mi...@gmail.com>.
OK, good news. You have made some progress here :)

bzip2 is splittable because it is block-oriented, whereas gzip is
stream-oriented. I also noticed that you are creating a managed ORC
(Optimized Row Columnar) table. You can bucket and partition an ORC table.
An example below:


DROP TABLE IF EXISTS dummy;

CREATE TABLE dummy (
     ID INT
   , CLUSTERED INT
   , SCATTERED INT
   , RANDOMISED INT
   , RANDOM_STRING VARCHAR(50)
   , SMALL_VC VARCHAR(10)
   , PADDING  VARCHAR(10)
)
CLUSTERED BY (ID) INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES (
"orc.create.index"="true",
"orc.bloom.filter.columns"="ID",
"orc.bloom.filter.fpp"="0.05",
"orc.compress"="SNAPPY",
"orc.stripe.size"="16777216",
"orc.row.index.stride"="10000" )
;
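
A table like this can then be populated with INSERT ... SELECT. Below is a
minimal sketch, assuming a staging table named dummy_staging with a
matching schema (a hypothetical name, not from this thread). Note that
Spark's support for writing Hive-bucketed tables is limited, so verify the
bucket layout after the write:

--Populate the bucketed ORC table from a hypothetical staging table
INSERT INTO dummy
SELECT ID, CLUSTERED, SCATTERED, RANDOMISED, RANDOM_STRING, SMALL_VC, PADDING
FROM dummy_staging;

The bloom filter and row index only pay off for queries that filter on ID,
so pick the bucket and index columns to match your predicates.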

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.





Re: Spark-Sql - Slow Performance With CTAS and Large Gzipped File

Posted by Patrick Tucci <pa...@gmail.com>.
Hi Mich,

Thanks for the reply. I started running ANALYZE TABLE on the external
table, but progress was very slow. The stage had read only about 275MB in
10 minutes; at that rate, scanning the whole 9.2GB file would take about
5.5 hours just to analyze the table.

This might just be the reality of trying to process a 240m record file with
80+ columns, unless there's an obvious issue with my setup that someone
sees. The solution is likely going to involve increasing parallelization.

To that end, I extracted and recompressed the file with bzip2. Since bzip2
is splittable and gzip is not, Spark can read the bzip2 file in parallel.
The same CTAS query took only about 45 minutes. This is still a bit slower
than I had hoped, but the import from bzip2 fully utilized all available
cores, so we can give the cluster more resources if we need the process to
go faster.
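
If we need more write parallelism beyond that, one option I'm considering
(a sketch, not yet benchmarked) is a repartition hint on the CTAS so the
ORC write is spread across more tasks:

CREATE TABLE Claims STORED AS ORC
AS
--Repartition to 64 tasks before the write (tune to the cluster)
SELECT /*+ REPARTITION(64) */ *
FROM ClaimsImport
--Exclude the header record
WHERE ClaimID <> 'ClaimID';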

Patrick


Re: Spark-Sql - Slow Performance With CTAS and Large Gzipped File

Posted by Mich Talebzadeh <mi...@gmail.com>.
OK, for now: have you computed statistics on the Hive external table?

spark-sql (default)> ANALYZE TABLE test.stg_t2 COMPUTE STATISTICS FOR ALL
COLUMNS;
spark-sql (default)> DESC EXTENDED test.stg_t2;

Hive external tables have little optimization without statistics.
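
If FOR ALL COLUMNS is too slow on a table this size, a cheaper first pass
is table-level statistics (row count and size, no per-column scan), and
column statistics can be limited to the filter columns, e.g.:

spark-sql (default)> ANALYZE TABLE test.stg_t2 COMPUTE STATISTICS;
spark-sql (default)> ANALYZE TABLE test.stg_t2 COMPUTE STATISTICS FOR
COLUMNS ClaimID, ClaimLineID;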

HTH



Mich Talebzadeh,
Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



