Posted to user@hive.apache.org by Julien Tane <ju...@solute.de> on 2021/02/12 15:10:06 UTC

Migration of Hadoop Warehouse to newer versions leads to bad performance for JDBC data retrieval

Dear all,

we are in the process of switching from our old cluster with HDP 2.5:

HDFS    2.7.3
YARN    2.7.3
Tez     0.7.0
Hive    1.2.1000

to a new cluster with HDP 3.1:

HDFS    3.1.1.3.1
YARN    3.1.0
Hive    3.0.0.3.1
Tez     0.9.0.3.1

We (1) query and (2) retrieve data from table_0 on the old cluster to target machine machine_a.
We (1) query and (2) retrieve data from table_1 on the new cluster to the same target machine machine_a.
table_0 and table_1 are defined in the exact same way (partitioned by t_date) and hold the exact same data.

Querying table_0 on the old cluster and querying table_1 on the new cluster show the same performance. All is good so far.

After the query is processed and the data is ready to be retrieved, we start the data retrieval via the JDBC driver. The data-retrieval performance from the old cluster is ca. 40'000 rows/sec, whereas the data-retrieval performance from the new cluster is ca. 20'000 rows/sec. This big performance decrease is a problem!
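For reference, the kind of measurement we run looks roughly like the sketch below (host, credentials and query are placeholders to adapt). It uses pyhive, which talks to the same HiveServer2 Thrift endpoint that the JDBC driver uses, so it is only a proxy for the JDBC path:

import time
from pyhive import hive  # pip install 'pyhive[hive]'

# Placeholders -- adapt host/user/query to the cluster under test.
conn = hive.Connection(host="hiveserver2.example.com", port=10000,
                       username="hive_user", database="default")
cursor = conn.cursor()
cursor.arraysize = 10000  # Thrift fetch batch size; worth varying while tuning

cursor.execute("SELECT * FROM table_0 "
               "WHERE t_date BETWEEN '2021-02-01' AND '2021-02-03'")

start = time.time()
rows = 0
while True:
    batch = cursor.fetchmany(10000)
    if not batch:
        break
    rows += len(batch)
elapsed = time.time() - start
print("%d rows in %.1fs -> %.0f rows/sec" % (rows, elapsed, rows / elapsed))

The JDBC-side counterpart of the batch size above is java.sql.Statement.setFetchSize(); if the OLAP tool exposes a fetch-size setting for its driver, that is one of the knobs relevant to q4 below.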

Things we tried:
    - Made sure that there is no bandwidth issue with the new version.
    - We tried downloading and uploading from and to HDFS on both old and new cluster using HDFScli. We observed a difference in data-transfer performance, with the new cluster being ca. 1.5x slower than the old cluster.
    - We made the following observations while experimenting:
        When we filled the table only with 3 days worth of data, the new cluster loaded faster than the old one.

        When we filled the table with 2 years worth of data and selected only 3 days in the SQL statement, the new cluster loaded slower than the old one.

        The old cluster loaded with the same speed each time (regardless of the number of days), whereas the new cluster varied from ca. 25'000 to 42'000 rows/sec for a low number of days.

        So it seems that as the number of partitions increases, the data-retrieval performance from the new cluster decreases, whereas the data-retrieval performance from the old cluster stays approx. the same.

Questions:
    q1) Do you have an idea what could be causing this low data-retrieval performance?
    q2) How do we use the Hive Logging/Debug Infrastructure to find out what the row throughput is?
    q3) How do we use the HDFS Logging/Debug Infrastructure to find out what the row throughput is?
    q4) Which parameters and settings could we use to make sure the data-retrieval performance is as high as possible?
    q5) Could the garbage collector be slowing down the data retrieval to this extent? How can we find out? (See the note below.)
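On q5, a first step we would take (a suggestion, not something we have verified yet) is to enable GC logging on the client JVM that hosts the JDBC driver, e.g. with the Java 8 HotSpot flags

    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/tmp/jdbc_client_gc.log

and then check whether long or frequent collections coincide with the slow retrieval phases. (On Java 9+ the unified -Xlog:gc* option replaces these flags.)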

Looking forward to your ideas,
Julien Tane







Re: Migration of Hadoop Warehouse to newer versions leads to bad performance for JDBC data retrieval

Posted by Julien Tane <ju...@solute.de>.
Hello Mich,


Thank you very much for your answers!

I will have a look at these points tomorrow morning.


Kudos for the answers,


J




Re: Migration of Hadoop Warehouse to newer versions leads to bad performance for JDBC data retrieval

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi Julien,

I am not aware that either Hive or HDFS logs provide performance metrics as such. However, tools like Ganglia <http://ganglia.info/> should give you some performance metrics, and your Hadoop administrator should ideally already have such tooling in place, e.g. through the Hortonworks Data Platform. Obviously the show-stopper seems to be the performance of the new (target) cluster with the new version of Hive. Does Hive use the same underlying execution engine (Tez, MapReduce or Spark) in the new cluster?
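A quick way to check that, for example, is to run

    set hive.execution.engine;

from beeline on both clusters and compare the output (mr, tez or spark).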

Pointing to your observation:

"Things we tried:

    - Made sure that there is no bandwidth issue with the new version.
    - We tried downloading and uploading from and to HDFS on both old and new cluster using HDFScli. We observed a difference in data-transfer performance, with the new cluster being ca. 1.5x slower than the old cluster.
    - We made the following observations while experimenting:
        1. When we filled the table only with 3 days worth of data, the new cluster loaded faster than the old one.
        2. When we filled the table with 2 years worth of data and selected only 3 days in the SQL statement, the new cluster loaded slower than the old one.
        3. The old cluster loaded with the same speed each time (regardless of the number of days), whereas the new cluster varied from ca. 25'000 to 42'000 rows/sec for a low number of days.
    - So it seems that as the number of partitions increases, the data-retrieval performance from the new cluster decreases, whereas the data-retrieval performance from the old cluster stays approx. the same."

IMO this points to the performance of the disks in the new cluster; that is something that requires focus. Another, less likely, scenario would be the statistics maintenance on the new Hive tables. I assume that you are using Parquet for your new tables and that you are updating statistics after load. The performance for three days of data can be attributed to the data being cached in the file system; typically a physical IO (read from disk) is around 20 times slower than a logical IO. A query for three days on top of two years of underlying data still has to scan a larger data set to get only the relevant rows. You mentioned a third-party tool, something like Tableau, to analyse your data; that uses a JDBC connection.

Your UNIX admin looking after the cluster should be able to provide some of these answers: how devices are allocated, whether you are using any RAID, or any other difference in the set-up of your new cluster. Do any other artefacts exhibit the same behaviour, for example HBase? Can you read through PySpark on the new cluster, which is native to your Hive (no need for JDBC there), and check the performance?
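Something along the lines of the sketch below (untested; table name and date filter are placeholders taken from your description) would exercise the read path natively, without JDBC:

from time import time
from pyspark.sql import SparkSession

# Read the Hive table through the metastore -- no JDBC driver involved.
spark = (SparkSession.builder
         .appName("hive-read-check")
         .enableHiveSupport()
         .getOrCreate())

df = spark.sql("SELECT * FROM table_1 "
               "WHERE t_date BETWEEN '2021-02-01' AND '2021-02-03'")

start = time()
n = df.count()  # forces a scan of the selected partitions
elapsed = time() - start
print("%d rows in %.1fs" % (n, elapsed))

If this is fast while JDBC stays slow, the bottleneck is more likely in the HiveServer2/driver result transfer than in the storage layer.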


HTH



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw


Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.





Re: Migration of Hadoop Warehouse to newer versions leads to bad performance for JDBC data retrieval

Posted by Julien Tane <ju...@solute.de>.
Hello Mich, hello all,


First of all, thank you; we appreciate your input, yet we would hope for more specific hints or details on how to thoroughly evaluate the speed of the retrieval. Checking the disk performance on the machine was already a good step.


Yes, you understood correctly, but the newer Hive version is on a new cluster which actually has more processing power and RAM.

However, regarding your suggestion:

    "Another thing is that you try to read from the same table in both old and new, find the timings and compare reads."

we are doing exactly that from the driver with JDBC...

But my main interest remains in finding solutions to:

    q2) How do we use the Hive Logging/Debug Infrastructure to find out what the row throughput is?
    q3) How do we use the HDFS Logging/Debug Infrastructure to find out what the row throughput is?

In particular: which metrics provide information on the throughput?


To answer your other questions:

1) We tried different drivers from two different machines, but mainly:

   - the Amazon driver
   - the Cloudera driver

   The (bad) performance stays quite stable with both.

2) No, neither cluster is kerberized.

3) We tried at least two, and the performance was similar.


I will have a look at the performance of the disks with iostat.


And regarding this question:

"As a matter of interest are you doing all these through Java, Python etc interface?"


We are mainly doing the ETL through pyhive (soon also Spark), but we have an OLAP tool which retrieves the data through JDBC. This has been working reliably for years on the old infrastructure; the poor performance now blocks the migration to the new cluster.


Thanks,


Julien








Re: Migration of Hadoop Warehouse to newer versions leads to bad performance for JDBC data retrieval

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi Julien,

I assume you mean you are using JDBC drivers to retrieve from the source
table in Hive (older version) to the target table in Hive (newer version).

1) What JDBC drivers are you using?
2) Are these environments kerberized in both cases?
3) Have you considered other JDBC drivers for Hive? For example:

 hive_driver: "org.apache.hive.jdbc.HiveDriver"   ## default
hive_driver: com.cloudera.hive.jdbc41.HS2Driver ## Cloudera
hive_driver: "com.ddtek.jdbcx.hive.HiveDataSource" ## Progress direct
hive_driver: "com.ddtek.jdbc.hive.HiveDriver" ## Progress direct

I think besides JDBC there may be other disk-read issues (you can get stats from Unix tools like iostat, etc.).

Another thing: try to read from the same table in both the old and the new cluster, take the timings and compare the reads.

If the issue is the throughput of writes through JDBC, then you can test another driver for it.

As a matter of interest, are you doing all of this through a Java, Python etc. interface?

HTH


