Posted to user@hive.apache.org by David Nies <da...@adition.com> on 2016/06/20 14:43:13 UTC

Network throughput from HiveServer2 to JDBC client too low

Dear Hive mailing list,

in my setup, network throughput from the HiveServer2 to the client seems to be the bottleneck and I’m seeking a way to increase throughput. Let me elaborate on my use case:

I’m using Hive version 1.1.0, which is bundled with Cloudera 5.5.1.

I want to fetch a huge amount of data from our Hive cluster. By huge I mean something around 100 million rows. The Hive table I’m querying is an external table whose data is stored in .avro. On HDFS, the data I want to fetch (i.e. the aforementioned 100 million rows) is about 5 GB in size. A cleverer filtering strategy (to reduce the amount of data) is not an option, sadly, since I need all the data.

I was able to reduce the time the MapReduce job takes to an acceptable duration by tuning `mapreduce.input.fileinputformat.split.maxsize`. The part that takes ages comes after MapReduce: I’m observing that the Hadoop namenode that is hosting the HiveServer2 is sending data at only around 3 MB/s. Our network is capable of much more. Playing around with `fetchSize` did not increase throughput.

Since I identified network throughput as the bottleneck, I restricted my efforts to trying to increase it. To do so, I simply ran the query I’d normally run through JDBC (from Clojure/Java) via `beeline`, dumping the output to `/dev/null`. My `beeline` invocation looks something like this:

beeline \
    -u jdbc:hive2://srv:10000/db \
    -n user -p password \
    --outputformat=csv2 \
    --incremental=true \
    --hiveconf mapreduce.input.fileinputformat.split.maxsize=33554432 \
    -e 'SELECT <a lot of columns> FROM `db`.`table` WHERE (year=2016 AND month=6 AND day=1 AND hour=10)' > /dev/null

I already tried playing around with additional `--hiveconf`s:

    --hiveconf hive.exec.compress.output=true \
    --hiveconf mapred.output.compression.type=BLOCK \
    --hiveconf mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \

without success. 

In all cases, Hive is only able to utilize a tiny fraction of the available bandwidth. Is there a way to increase network throughput?

Thank you in advance!

Yours

David Nies
Developer, Business Intelligence

ADITION technologies AG

Oststraße 55, D-40211 Düsseldorf
Schwarzwaldstraße 78b, D-79117 Freiburg im Breisgau

T +49 211 987400 30
F +49 211 987400 33
E david.nies@adition.com

Technical support is available at +49 1805 2348466
(landline: 14 ct/min; mobile: up to 42 ct/min)

Follow us on XING or visit us at www.adition.com.

Executive Board: Andreas Kleiser, Jörg Klekamp, Dr. Lutz Lowis, Marcus Schlüter
Chairman of the Supervisory Board: Joachim Schneidmadl
Registered with the District Court of Düsseldorf, HRB 54076
VAT ID No.: DE 218 858 434


Re: Network throughput from HiveServer2 to JDBC client too low

Posted by Mich Talebzadeh <mi...@gmail.com>.
This is a classic issue. Are there other users on the same network
connecting to Hive?

Can your Unix admin use a network sniffer to narrow down the issue in your
case?

In normal operations with a modest amount of data, do you see the same issue,
or is this purely due to your load (the number of rows returned) of 100M
rows?
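
One quick way to take Hive out of the picture is a raw bandwidth test between
the HiveServer2 host and the client; a minimal sketch with iperf3, assuming
it is installed on both machines (the hostname is a placeholder):

    # on the HiveServer2 host
    iperf3 -s

    # on the client, run a 30-second test against it
    iperf3 -c hs2-host.example.com -t 30

If iperf3 reports far more than 3 MB/s, the network itself is fine and the
bottleneck is in HiveServer2 or the client.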

Yes, I noticed your version of Hive is 1.1, on a vendor's package.

At this stage the question is what other alternatives there are to fetch
those 100 million rows.

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 21 June 2016 at 08:15, David Nies <da...@adition.com> wrote:

>
>
> On 20.06.2016 at 20:20, Gopal Vijayaraghavan <go...@apache.org> wrote:
>
>
> is hosting the HiveServer2 is sending data at only around 3 MB/s.
> Our network is capable of much more. Playing around with `fetchSize` did
> not increase throughput.
>
> ...
>
> --hiveconf
> mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
> \
>
>
> The current implementation you have is CPU bound in HiveServer2; the
> compression generally makes it worse.
>
> The fetch size does help, but it only prevents the system from doing
> synchronized operations frequently (pausing every 50 rows is too often,
> the default is now 10000 rows).
>
>   -e 'SELECT <a lot of columns> FROM `db`.`table` WHERE (year=2016 AND
> month=6 AND day=1 AND hour=10)' > /dev/null
>
>
> Quick q - are year/month/day/hour partition columns? If so, there might be
> a very different fix to this problem.
>
>
> Yes, year, month, day and hour are partition columns. That is, I want to
> export exactly one partition. In my real use case, I will apply another
> filter (WHERE some_other_column = <x>), but in this case it is exactly the
> data of one partition that I want.
>
>
> In all cases, Hive is only able to utilize a tiny fraction of the
> available bandwidth. Is there a way to increase network throughput?
>
>
> A series of work-items are in progress for fixing the large row-set
> performance in HiveServer2
>
> https://issues.apache.org/jira/browse/HIVE-11527
>
> https://issues.apache.org/jira/browse/HIVE-12427
>
> What would be great would be to attach a profiler to your HiveServer2 and
> see which functions are hot; that will help fix those codepaths as part of
> the joint effort with the ODBC driver teams.
>
>
> I’ll see what I can do. I can’t restart the server at will though, since
> other teams are using it as well.
>
>
> Cheers,
> Gopal
>
>
> Thank you :)
> -David
>
>

Re: Network throughput from HiveServer2 to JDBC client too low

Posted by David Nies <da...@adition.com>.

> On 20.06.2016 at 20:20, Gopal Vijayaraghavan <go...@apache.org> wrote:
> 
> 
>> is hosting the HiveServer2 is sending data at only around 3 MB/s.
>> Our network is capable of much more. Playing around with `fetchSize` did
>> not increase throughput.
> ...
>> --hiveconf 
>> mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
>> \
> 
> The current implementation you have is CPU bound in HiveServer2; the
> compression generally makes it worse.
> 
> The fetch size does help, but it only prevents the system from doing
> synchronized operations frequently (pausing every 50 rows is too often,
> the default is now 10000 rows).
> 
>>   -e 'SELECT <a lot of columns> FROM `db`.`table` WHERE (year=2016 AND
>> month=6 AND day=1 AND hour=10)' > /dev/null
> 
> Quick q - are year/month/day/hour partition columns? If so, there might be
> a very different fix to this problem.

Yes, year, month, day and hour are partition columns. That is, I want to export exactly one partition. In my real use case, I will apply another filter (WHERE some_other_column = <x>), but in this case it is exactly the data of one partition that I want.

> 
>> In all cases, Hive is only able to utilize a tiny fraction of the
>> available bandwidth. Is there a way to increase network throughput?
> 
> A series of work-items are in progress for fixing the large row-set
> performance in HiveServer2
> 
> https://issues.apache.org/jira/browse/HIVE-11527
> 
> https://issues.apache.org/jira/browse/HIVE-12427
> 
> What would be great would be to attach a profiler to your HiveServer2 and
> see which functions are hot; that will help fix those codepaths as part of
> the joint effort with the ODBC driver teams.

I’ll see what I can do. I can’t restart the server at will though, since other teams are using it as well. 

> 
> Cheers,
> Gopal
> 

Thank you :)
-David


Re: Network throughput from HiveServer2 to JDBC client too low

Posted by Gopal Vijayaraghavan <go...@apache.org>.
> is hosting the HiveServer2 is sending data at only around 3 MB/s.
> Our network is capable of much more. Playing around with `fetchSize` did
> not increase throughput.
...
> --hiveconf
> mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
> \

The current implementation you have is CPU bound in HiveServer2; the
compression generally makes it worse.

The fetch size does help, but it only prevents the system from doing
synchronized operations frequently (pausing every 50 rows is too often,
the default is now 10000 rows).
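
For reference, the fetch size is set on the JDBC statement itself; a minimal
Java sketch, reusing the placeholder connection details and placeholder query
from the beeline command (illustrative, not a drop-in for the Clojure client):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class FetchSizeCheck {
        public static void main(String[] args) throws Exception {
            // explicit registration; newer driver jars also auto-register
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://srv:10000/db", "user", "password");
                 Statement stmt = conn.createStatement()) {
                stmt.setFetchSize(10000); // rows per fetch round-trip
                try (ResultSet rs = stmt.executeQuery(
                         "SELECT <a lot of columns> FROM `db`.`table` "
                         + "WHERE year=2016 AND month=6 AND day=1 AND hour=10")) {
                    long rows = 0;
                    while (rs.next()) {
                        rows++; // consume and discard, like > /dev/null
                    }
                    System.out.println("fetched " + rows + " rows");
                }
            }
        }
    }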

>    -e 'SELECT <a lot of columns> FROM `db`.`table` WHERE (year=2016 AND
> month=6 AND day=1 AND hour=10)' > /dev/null

Quick q - are year/month/day/hour partition columns? If so, there might be
a very different fix to this problem.

> In all cases, Hive is only able to utilize a tiny fraction of the
> available bandwidth. Is there a way to increase network throughput?

A series of work-items are in progress for fixing the large row-set
performance in HiveServer2

https://issues.apache.org/jira/browse/HIVE-11527

https://issues.apache.org/jira/browse/HIVE-12427

What would be great would be to attach a profiler to your HiveServer2 and
see which functions are hot; that will help fix those codepaths as part of
the joint effort with the ODBC driver teams.
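
As a first pass that needs no restart, repeated jstack thread dumps of the
live HiveServer2 JVM can already show the hot methods; a rough sketch,
assuming shell access as the user that owns the HiveServer2 process
(<HS2_PID> is a placeholder for its process id):

    # take 20 thread dumps, one second apart, while a large fetch is running
    for i in $(seq 1 20); do
        jstack <HS2_PID> >> hs2-stacks.txt
        sleep 1
    done

Stack frames that recur across most dumps mark the hot codepaths.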

Cheers,
Gopal



Re: Network throughput from HiveServer2 to JDBC client too low

Posted by Jörn Franke <jo...@gmail.com>.
See also the comment below that it is NOT advisable to use JDBC for these data transfers; consider the alternatives mentioned there instead. The alternatives are more reliable and you will save yourself a lot of trouble.
I also doubt that beeline is suitable for these volumes in general. So yes, it could be that beeline is not efficient for these purposes.

> On 21 Jun 2016, at 08:52, David Nies <da...@adition.com> wrote:
> 
> In my test case below, I’m using `beeline` as the Java application receiving the JDBC stream. As I understand, this is the reference command line interface to Hive. Are you saying that the reference command line interface is not efficiently implemented? :)
> 
> -David Nies
> 
>> On 20.06.2016 at 17:46, Jörn Franke <jo...@gmail.com> wrote:
>> 
>> Aside from this, the low network performance could also stem from the Java application receiving the JDBC stream (not threaded, not efficiently implemented, etc.). That being said, do not use JDBC for this.
>> 
>>> On 20 Jun 2016, at 17:28, Jörn Franke <jo...@gmail.com> wrote:
>>> 
>>> Hello,
>>> 
>>> For no database (including traditional ones) is it advisable to fetch this amount of data through JDBC. JDBC is not designed for this (neither for import nor for export of large data volumes). It is a highly questionable approach from a reliability point of view.
>>> 
>>> Export it as a file to HDFS and fetch it from there, or use Oozie to dump the file from HDFS to an SFTP or other server. There are alternatives depending on your use case.
>>> 
>>> Best regards
>>> 
>>>> On 20 Jun 2016, at 16:43, David Nies <da...@adition.com> wrote:
>>>> 
>>>> Dear Hive mailing list,
>>>> 
>>>> in my setup, network throughput from the HiveServer2 to the client seems to be the bottleneck and I’m seeking a way to increase throughput. Let me elaborate on my use case:
>>>> 
>>>> I’m using Hive version 1.1.0, which is bundled with Cloudera 5.5.1.
>>>> 
>>>> I want to fetch a huge amount of data from our Hive cluster. By huge I mean something around 100 million rows. The Hive table I’m querying is an external table whose data is stored in .avro. On HDFS, the data I want to fetch (i.e. the aforementioned 100 million rows) is about 5 GB in size. A cleverer filtering strategy (to reduce the amount of data) is not an option, sadly, since I need all the data.
>>>> 
>>>> I was able to reduce the time the MapReduce job takes to an acceptable duration by tuning `mapreduce.input.fileinputformat.split.maxsize`. The part that takes ages comes after MapReduce: I’m observing that the Hadoop namenode that is hosting the HiveServer2 is sending data at only around 3 MB/s. Our network is capable of much more. Playing around with `fetchSize` did not increase throughput.
>>>> 
>>>> Since I identified network throughput as the bottleneck, I restricted my efforts to trying to increase it. To do so, I simply ran the query I’d normally run through JDBC (from Clojure/Java) via `beeline`, dumping the output to `/dev/null`. My `beeline` invocation looks something like this:
>>>> 
>>>> beeline \
>>>>     -u jdbc:hive2://srv:10000/db \
>>>>     -n user -p password \
>>>>     --outputformat=csv2 \
>>>>     --incremental=true \
>>>>     --hiveconf mapreduce.input.fileinputformat.split.maxsize=33554432 \
>>>>     -e 'SELECT <a lot of columns> FROM `db`.`table` WHERE (year=2016 AND month=6 AND day=1 AND hour=10)' > /dev/null
>>>> 
>>>> I already tried playing around with additional `--hiveconf`s:
>>>> 
>>>>     --hiveconf hive.exec.compress.output=true \
>>>>     --hiveconf mapred.output.compression.type=BLOCK \
>>>>     --hiveconf mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
>>>> 
>>>> without success. 
>>>> 
>>>> In all cases, Hive is only able to utilize a tiny fraction of the available bandwidth. Is there a way to increase network throughput?
>>>> 
>>>> Thank you in advance!
>>>> 
>>>> Yours
>>>> 
>>>> David Nies
> 

Re: Network throughput from HiveServer2 to JDBC client too low

Posted by David Nies <da...@adition.com>.
> On 21.06.2016 at 08:59, Mich Talebzadeh <mi...@gmail.com> wrote:
> 
> is the underlying table partitioned, i.e.
> 
> 'SELECT <a lot of columns> FROM `db`.`table` WHERE (year=2016 AND month=6 AND day=1 AND hour=10)'
Yes, it is; year, month, day and hour are partition columns.

> 
> and also, what is the expected result set (RS) size?
I’m not sure I understand the question. Are you asking what size I expect the result set to be? As I mentioned in the email that started this thread, I’m expecting around 100 million rows. The raw data on HDFS is about 5 GB in size.

> 
> JDBC on its own should work. Is this an ORC table?
> 
> What version of Hive are you using?
Kindly find the answers to these questions in my first email :)

> 
> HTH

-David

> 
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> 
> http://talebzadehmich.wordpress.com
>  
> 
> On 21 June 2016 at 07:52, David Nies <david.nies@adition.com> wrote:
> In my test case below, I’m using `beeline` as the Java application receiving the JDBC stream. As I understand, this is the reference command line interface to Hive. Are you saying that the reference command line interface is not efficiently implemented? :)
> 
> -David Nies
> 
>> On 20.06.2016 at 17:46, Jörn Franke <jornfranke@gmail.com> wrote:
>> 
>> Aside from this, the low network performance could also stem from the Java application receiving the JDBC stream (not threaded, not efficiently implemented, etc.). That being said, do not use JDBC for this.
>> 
>> On 20 Jun 2016, at 17:28, Jörn Franke <jornfranke@gmail.com> wrote:
>> 
>>> Hello,
>>> 
>>> For no database (including traditional ones) is it advisable to fetch this amount of data through JDBC. JDBC is not designed for this (neither for import nor for export of large data volumes). It is a highly questionable approach from a reliability point of view.
>>> 
>>> Export it as a file to HDFS and fetch it from there, or use Oozie to dump the file from HDFS to an SFTP or other server. There are alternatives depending on your use case.
>>> 
>>> Best regards
>>> 
>>> On 20 Jun 2016, at 16:43, David Nies <david.nies@adition.com> wrote:
>>> 
>>>> Dear Hive mailing list,
>>>> 
>>>> in my setup, network throughput from the HiveServer2 to the client seems to be the bottleneck and I’m seeking a way to increase throughput. Let me elaborate on my use case:
>>>> 
>>>> I’m using Hive version 1.1.0, which is bundled with Cloudera 5.5.1.
>>>> 
>>>> I want to fetch a huge amount of data from our Hive cluster. By huge I mean something around 100 million rows. The Hive table I’m querying is an external table whose data is stored in .avro. On HDFS, the data I want to fetch (i.e. the aforementioned 100 million rows) is about 5 GB in size. A cleverer filtering strategy (to reduce the amount of data) is not an option, sadly, since I need all the data.
>>>> 
>>>> I was able to reduce the time the MapReduce job takes to an acceptable duration by tuning `mapreduce.input.fileinputformat.split.maxsize`. The part that takes ages comes after MapReduce: I’m observing that the Hadoop namenode that is hosting the HiveServer2 is sending data at only around 3 MB/s. Our network is capable of much more. Playing around with `fetchSize` did not increase throughput.
>>>> 
>>>> Since I identified network throughput as the bottleneck, I restricted my efforts to trying to increase it. To do so, I simply ran the query I’d normally run through JDBC (from Clojure/Java) via `beeline`, dumping the output to `/dev/null`. My `beeline` invocation looks something like this:
>>>> 
>>>> beeline \
>>>>     -u jdbc:hive2://srv:10000/db \
>>>>     -n user -p password \
>>>>     --outputformat=csv2 \
>>>>     --incremental=true \
>>>>     --hiveconf mapreduce.input.fileinputformat.split.maxsize=33554432 \
>>>>     -e 'SELECT <a lot of columns> FROM `db`.`table` WHERE (year=2016 AND month=6 AND day=1 AND hour=10)' > /dev/null
>>>> 
>>>> I already tried playing around with additional `--hiveconf`s:
>>>> 
>>>>     --hiveconf hive.exec.compress.output=true \
>>>>     --hiveconf mapred.output.compression.type=BLOCK \
>>>>     --hiveconf mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
>>>> 
>>>> without success. 
>>>> 
>>>> In all cases, Hive is only able to utilize a tiny fraction of the available bandwidth. Is there a way to increase network throughput?
>>>> 
>>>> Thank you in advance!
>>>> 
>>>> Yours
>>>> 
>>>> David Nies
>>>> 
> 
> 


Re: Network throughput from HiveServer2 to JDBC client too low

Posted by Mich Talebzadeh <mi...@gmail.com>.
is the underlying table partitioned, i.e.

'SELECT <a lot of columns> FROM `db`.`table` WHERE (year=2016 AND month=6
AND day=1 AND hour=10)'

and also, what is the expected result set (RS) size?

JDBC on its own should work. Is this an ORC table?

What version of Hive are you using?

HTH





Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 21 June 2016 at 07:52, David Nies <da...@adition.com> wrote:

> In my test case below, I’m using `beeline` as the Java application
> receiving the JDBC stream. As I understand, this is the reference command
> line interface to Hive. Are you saying that the reference command line
> interface is not efficiently implemented? :)
>
> -David Nies
>
> On 20.06.2016 at 17:46, Jörn Franke <jo...@gmail.com> wrote:
>
> Aside from this, the low network performance could also stem from the Java
> application receiving the JDBC stream (not threaded, not efficiently
> implemented, etc.). That being said, do not use JDBC for this.
>
> On 20 Jun 2016, at 17:28, Jörn Franke <jo...@gmail.com> wrote:
>
> Hello,
>
> For no database (including traditional ones) is it advisable to fetch this
> amount of data through JDBC. JDBC is not designed for this (neither for
> import nor for export of large data volumes). It is a highly questionable
> approach from a reliability point of view.
>
> Export it as a file to HDFS and fetch it from there, or use Oozie to dump
> the file from HDFS to an SFTP or other server. There are alternatives
> depending on your use case.
>
> Best regards
>
> On 20 Jun 2016, at 16:43, David Nies <da...@adition.com> wrote:
>
> Dear Hive mailing list,
>
> in my setup, network throughput from the HiveServer2 to the client seems
> to be the bottleneck and I’m seeking a way to increase throughput. Let me
> elaborate on my use case:
>
> I’m using Hive version 1.1.0, which is bundled with Cloudera 5.5.1.
>
> I want to fetch a huge amount of data from our Hive cluster. By huge I
> mean something around 100 million rows. The Hive table I’m querying is an
> external table whose data is stored in .avro. On HDFS, the data I want to
> fetch (i.e. the aforementioned 100 million rows) is about 5 GB in size. A
> cleverer filtering strategy (to reduce the amount of data) is not an
> option, sadly, since I need all the data.
>
> I was able to reduce the time the MapReduce job takes to an acceptable
> duration by tuning `mapreduce.input.fileinputformat.split.maxsize`. The
> part that takes ages comes after MapReduce: I’m observing that the Hadoop
> namenode that is hosting the HiveServer2 is sending data at only around
> 3 MB/s. Our network is capable of much more. Playing around with
> `fetchSize` did not increase throughput.
>
> Since I identified network throughput as the bottleneck, I restricted my
> efforts to trying to increase it. To do so, I simply ran the query I’d
> normally run through JDBC (from Clojure/Java) via `beeline`, dumping the
> output to `/dev/null`. My `beeline` invocation looks something like this:
>
> beeline \
>     -u jdbc:hive2://srv:10000/db \
>     -n user -p password \
>     --outputformat=csv2 \
>     --incremental=true \
>     --hiveconf mapreduce.input.fileinputformat.split.maxsize=33554432 \
>     -e 'SELECT <a lot of columns> FROM `db`.`table` WHERE (year=2016 AND
> month=6 AND day=1 AND hour=10)' > /dev/null
>
> I already tried playing around with additional `--hiveconf`s:
>
>     --hiveconf hive.exec.compress.output=true \
>     --hiveconf mapred.output.compression.type=BLOCK \
>     --hiveconf
> mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
>
> without success.
>
> In all cases, Hive is only able to utilize a tiny fraction of the
> available bandwidth. Is there a way to increase network throughput?
>
> Thank you in advance!
>
> Yours
>
> David Nies
>
>
>
>

Re: Network throughput from HiveServer2 to JDBC client too low

Posted by David Nies <da...@adition.com>.
In my test case below, I’m using `beeline` as the Java application receiving the JDBC stream. As I understand, this is the reference command line interface to Hive. Are you saying that the reference command line interface is not efficiently implemented? :)

-David Nies

> On 20.06.2016 at 17:46, Jörn Franke <jo...@gmail.com> wrote:
> 
> Aside from this, the low network performance could also stem from the Java application receiving the JDBC stream (not threaded, not efficiently implemented, etc.). That being said, do not use JDBC for this.
> 
> On 20 Jun 2016, at 17:28, Jörn Franke <jornfranke@gmail.com> wrote:
> 
>> Hello,
>> 
>> For no database (including traditional ones) is it advisable to fetch this amount of data through JDBC. JDBC is not designed for this (neither for import nor for export of large data volumes). It is a highly questionable approach from a reliability point of view.
>> 
>> Export it as a file to HDFS and fetch it from there, or use Oozie to dump the file from HDFS to an SFTP or other server. There are alternatives depending on your use case.
>> 
>> Best regards
>> 
>> On 20 Jun 2016, at 16:43, David Nies <david.nies@adition.com> wrote:
>> 
>>> Dear Hive mailing list,
>>> 
>>> in my setup, network throughput from the HiveServer2 to the client seems to be the bottleneck and I’m seeking a way to increase throughput. Let me elaborate on my use case:
>>> 
>>> I’m using Hive version 1.1.0, which is bundled with Cloudera 5.5.1.
>>> 
>>> I want to fetch a huge amount of data from our Hive cluster. By huge I mean something around 100 million rows. The Hive table I’m querying is an external table whose data is stored in .avro. On HDFS, the data I want to fetch (i.e. the aforementioned 100 million rows) is about 5 GB in size. A cleverer filtering strategy (to reduce the amount of data) is not an option, sadly, since I need all the data.
>>> 
>>> I was able to reduce the time the MapReduce job takes to an acceptable duration by tuning `mapreduce.input.fileinputformat.split.maxsize`. The part that takes ages comes after MapReduce: I’m observing that the Hadoop namenode that is hosting the HiveServer2 is sending data at only around 3 MB/s. Our network is capable of much more. Playing around with `fetchSize` did not increase throughput.
>>> 
>>> Since I identified network throughput as the bottleneck, I restricted my efforts to trying to increase it. To do so, I simply ran the query I’d normally run through JDBC (from Clojure/Java) via `beeline`, dumping the output to `/dev/null`. My `beeline` invocation looks something like this:
>>> 
>>> beeline \
>>>     -u jdbc:hive2://srv:10000/db \
>>>     -n user -p password \
>>>     --outputformat=csv2 \
>>>     --incremental=true \
>>>     --hiveconf mapreduce.input.fileinputformat.split.maxsize=33554432 \
>>>     -e 'SELECT <a lot of columns> FROM `db`.`table` WHERE (year=2016 AND month=6 AND day=1 AND hour=10)' > /dev/null
>>> 
>>> I already tried playing around with additional `--hiveconf`s:
>>> 
>>>     --hiveconf hive.exec.compress.output=true \
>>>     --hiveconf mapred.output.compression.type=BLOCK \
>>>     --hiveconf mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
>>> 
>>> without success. 
>>> 
>>> In all cases, Hive is only able to utilize a tiny fraction of the available bandwidth. Is there a way to increase network throughput?
>>> 
>>> Thank you in advance!
>>> 
>>> Yours
>>> 
>>> David Nies
>>> 


Re: Network throughput from HiveServer2 to JDBC client too low

Posted by Jörn Franke <jo...@gmail.com>.
Aside from this, the low network performance could also stem from the Java application receiving the JDBC stream (not threaded, not efficiently implemented, etc.). That being said, do not use JDBC for this.

> On 20 Jun 2016, at 17:28, Jörn Franke <jo...@gmail.com> wrote:
> 
> Hello,
> 
> For no database (including traditional ones) is it advisable to fetch this amount of data through JDBC. JDBC is not designed for this (neither for import nor for export of large data volumes). It is a highly questionable approach from a reliability point of view.
> 
> Export it as a file to HDFS and fetch it from there, or use Oozie to dump the file from HDFS to an SFTP or other server. There are alternatives depending on your use case.
> 
> Best regards
> 
>> On 20 Jun 2016, at 16:43, David Nies <da...@adition.com> wrote:
>> 
>> Dear Hive mailing list,
>> 
>> in my setup, network throughput from the HiveServer2 to the client seems to be the bottleneck and I’m seeking a way to increase throughput. Let me elaborate on my use case:
>> 
>> I’m using Hive version 1.1.0, which is bundled with Cloudera 5.5.1.
>> 
>> I want to fetch a huge amount of data from our Hive cluster. By huge I mean something around 100 million rows. The Hive table I’m querying is an external table whose data is stored in .avro. On HDFS, the data I want to fetch (i.e. the aforementioned 100 million rows) is about 5 GB in size. A cleverer filtering strategy (to reduce the amount of data) is not an option, sadly, since I need all the data.
>> 
>> I was able to reduce the time the MapReduce job takes to an acceptable duration by tuning `mapreduce.input.fileinputformat.split.maxsize`. The part that takes ages comes after MapReduce: I’m observing that the Hadoop namenode that is hosting the HiveServer2 is sending data at only around 3 MB/s. Our network is capable of much more. Playing around with `fetchSize` did not increase throughput.
>> 
>> Since I identified network throughput as the bottleneck, I restricted my efforts to trying to increase it. To do so, I simply ran the query I’d normally run through JDBC (from Clojure/Java) via `beeline`, dumping the output to `/dev/null`. My `beeline` invocation looks something like this:
>> 
>> beeline \
>>     -u jdbc:hive2://srv:10000/db \
>>     -n user -p password \
>>     --outputformat=csv2 \
>>     --incremental=true \
>>     --hiveconf mapreduce.input.fileinputformat.split.maxsize=33554432 \
>>     -e 'SELECT <a lot of columns> FROM `db`.`table` WHERE (year=2016 AND month=6 AND day=1 AND hour=10)' > /dev/null
>> 
>> I already tried playing around with additional `--hiveconf`s:
>> 
>>     --hiveconf hive.exec.compress.output=true \
>>     --hiveconf mapred.output.compression.type=BLOCK \
>>     --hiveconf mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
>> 
>> without success. 
>> 
>> In all cases, Hive is only able to utilize a tiny fraction of the available bandwidth. Is there a way to increase network throughput?
>> 
>> Thank you in advance!
>> 
>> Yours
>> 
>> David Nies
>> 

Re: Network throughput from HiveServer2 to JDBC client too low

Posted by Jörn Franke <jo...@gmail.com>.
Hello,

For no database (including traditional ones) is it advisable to fetch this amount of data through JDBC. JDBC is not designed for this (neither for import nor for export of large data volumes). It is a highly questionable approach from a reliability point of view.

Export it as a file to HDFS and fetch it from there, or use Oozie to dump the file from HDFS to an SFTP or other server. There are alternatives depending on your use case.
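
As a concrete sketch of that export route (the output directory is a placeholder, `<a lot of columns>` stands for the real column list, and ROW FORMAT DELIMITED on INSERT OVERWRITE DIRECTORY needs Hive 0.11 or later):

    beeline \
        -u jdbc:hive2://srv:10000/db \
        -n user -p password \
        -e 'INSERT OVERWRITE DIRECTORY "/tmp/export_y2016m06d01h10"
            ROW FORMAT DELIMITED FIELDS TERMINATED BY ","
            SELECT <a lot of columns> FROM `db`.`table`
            WHERE year=2016 AND month=6 AND day=1 AND hour=10'

    # then pull the raw files off HDFS, bypassing the Thrift result path
    hdfs dfs -get /tmp/export_y2016m06d01h10 /local/target/

The heavy lifting then happens in the MapReduce job and a plain HDFS transfer, so HiveServer2 never has to serialize 100 million rows.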

Best regards

> On 20 Jun 2016, at 16:43, David Nies <da...@adition.com> wrote:
> 
> Dear Hive mailing list,
> 
> in my setup, network throughput from the HiveServer2 to the client seems to be the bottleneck and I’m seeking a way to increase throughput. Let me elaborate on my use case:
> 
> I’m using Hive version 1.1.0, which is bundled with Cloudera 5.5.1.
> 
> I want to fetch a huge amount of data from our Hive cluster. By huge I mean something around 100 million rows. The Hive table I’m querying is an external table whose data is stored in .avro. On HDFS, the data I want to fetch (i.e. the aforementioned 100 million rows) is about 5 GB in size. A cleverer filtering strategy (to reduce the amount of data) is not an option, sadly, since I need all the data.
> 
> I was able to reduce the time the MapReduce job takes to an acceptable duration by tuning `mapreduce.input.fileinputformat.split.maxsize`. The part that takes ages comes after MapReduce: I’m observing that the Hadoop namenode that is hosting the HiveServer2 is sending data at only around 3 MB/s. Our network is capable of much more. Playing around with `fetchSize` did not increase throughput.
> 
> Since I identified network throughput as the bottleneck, I restricted my efforts to trying to increase it. To do so, I simply ran the query I’d normally run through JDBC (from Clojure/Java) via `beeline`, dumping the output to `/dev/null`. My `beeline` invocation looks something like this:
> 
> beeline \
>     -u jdbc:hive2://srv:10000/db \
>     -n user -p password \
>     --outputformat=csv2 \
>     --incremental=true \
>     --hiveconf mapreduce.input.fileinputformat.split.maxsize=33554432 \
>     -e 'SELECT <a lot of columns> FROM `db`.`table` WHERE (year=2016 AND month=6 AND day=1 AND hour=10)' > /dev/null
> 
> I already tried playing around with additional `--hiveconf`s:
> 
>     --hiveconf hive.exec.compress.output=true \
>     --hiveconf mapred.output.compression.type=BLOCK \
>     --hiveconf mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
> 
> without success. 
> 
> In all cases, Hive is only able to utilize a tiny fraction of the available bandwidth. Is there a way to increase network throughput?
> 
> Thank you in advance!
> 
> Yours
> 
> David Nies
> 

Re: Network throughput from HiveServer2 to JDBC client too low

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi David,

What are you actually trying to do with the data?

Hive and MapReduce are notoriously slow for this type of operation. Hive
is good for storage; that is what I vouch for.

There are other alternatives.

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 20 June 2016 at 15:43, David Nies <da...@adition.com> wrote:

> Dear Hive mailing list,
>
> in my setup, network throughput from the HiveServer2 to the client seems
> to be the bottleneck and I’m seeking a way to increase throughput. Let me
> elaborate on my use case:
>
> I’m using Hive version 1.1.0, which is bundled with Cloudera 5.5.1.
>
> I want to fetch a huge amount of data from our Hive cluster. By huge I
> mean something around 100 million rows. The Hive table I’m querying is an
> external table whose data is stored in .avro. On HDFS, the data I want to
> fetch (i.e. the aforementioned 100 million rows) is about 5 GB in size. A
> cleverer filtering strategy (to reduce the amount of data) is not an
> option, sadly, since I need all the data.
>
> I was able to reduce the time the MapReduce job takes to an acceptable
> duration by tuning `mapreduce.input.fileinputformat.split.maxsize`. The
> part that takes ages comes after MapReduce: I’m observing that the Hadoop
> namenode that is hosting the HiveServer2 is sending data at only around
> 3 MB/s. Our network is capable of much more. Playing around with
> `fetchSize` did not increase throughput.
>
> Since I identified network throughput as the bottleneck, I restricted my
> efforts to trying to increase it. To do so, I simply ran the query I’d
> normally run through JDBC (from Clojure/Java) via `beeline`, dumping the
> output to `/dev/null`. My `beeline` invocation looks something like this:
>
> beeline \
>     -u jdbc:hive2://srv:10000/db \
>     -n user -p password \
>     --outputformat=csv2 \
>     --incremental=true \
>     --hiveconf mapreduce.input.fileinputformat.split.maxsize=33554432 \
>     -e 'SELECT <a lot of columns> FROM `db`.`table` WHERE (year=2016 AND
> month=6 AND day=1 AND hour=10)' > /dev/null
>
> I already tried playing around with additional `--hiveconf`s:
>
>     --hiveconf hive.exec.compress.output=true \
>     --hiveconf mapred.output.compression.type=BLOCK \
>     --hiveconf
> mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
>
> without success.
>
> In all cases, Hive is only able to utilize a tiny fraction of the
> available bandwidth. Is there a way to increase network throughput?
>
> Thank you in advance!
>
> Yours
>
> David Nies
>
>
>

Re: Network throughput from HiveServer2 to JDBC client too low

Posted by Deepak Goel <de...@gmail.com>.
Could you increase this number (maybe to three times the current value) and
see if it has any impact on throughput:

--hiveconf mapreduce.input.fileinputformat.split.maxsize=33554432 \
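
For example, three times the current 33554432 bytes (32 MB) would be
100663296 bytes (96 MB); whether larger splits help here is an experiment,
not a given:

    --hiveconf mapreduce.input.fileinputformat.split.maxsize=100663296 \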

Hey

Namaskara~Nalama~Guten Tag~Bonjour


   --
Keigu

Deepak
73500 12833
www.simtree.net, deepak@simtree.net
deicool@gmail.com

LinkedIn: www.linkedin.com/in/deicool
Skype: thumsupdeicool
Google talk: deicool
Blog: http://loveandfearless.wordpress.com
Facebook: http://www.facebook.com/deicool

"Contribute to the world, environment and more : http://www.gridrepublic.org
"

On Mon, Jun 20, 2016 at 8:13 PM, David Nies <da...@adition.com> wrote:

> Dear Hive mailing list,
>
> in my setup, network throughput from the HiveServer2 to the client seems
> to be the bottleneck and I’m seeking a way to increase throughput. Let me
> elaborate on my use case:
>
> I’m using Hive version 1.1.0, which is bundled with Cloudera 5.5.1.
>
> I want to fetch a huge amount of data from our Hive cluster. By huge I
> mean something around 100 million rows. The Hive table I’m querying is an
> external table whose data is stored in .avro. On HDFS, the data I want to
> fetch (i.e. the aforementioned 100 million rows) is about 5 GB in size. A
> cleverer filtering strategy (to reduce the amount of data) is not an
> option, sadly, since I need all the data.
>
> I was able to reduce the time the MapReduce job takes to an acceptable
> duration by tuning `mapreduce.input.fileinputformat.split.maxsize`. The
> part that takes ages comes after MapReduce: I’m observing that the Hadoop
> namenode that is hosting the HiveServer2 is sending data at only around
> 3 MB/s. Our network is capable of much more. Playing around with
> `fetchSize` did not increase throughput.
>
> Since I identified network throughput as the bottleneck, I restricted my
> efforts to trying to increase it. To do so, I simply ran the query I’d
> normally run through JDBC (from Clojure/Java) via `beeline`, dumping the
> output to `/dev/null`. My `beeline` invocation looks something like this:
>
> beeline \
>     -u jdbc:hive2://srv:10000/db \
>     -n user -p password \
>     --outputformat=csv2 \
>     --incremental=true \
>     --hiveconf mapreduce.input.fileinputformat.split.maxsize=33554432 \
>     -e 'SELECT <a lot of columns> FROM `db`.`table` WHERE (year=2016 AND
> month=6 AND day=1 AND hour=10)' > /dev/null
>
> I already tried playing around with additional `--hiveconf`s:
>
>     --hiveconf hive.exec.compress.output=true \
>     --hiveconf mapred.output.compression.type=BLOCK \
>     --hiveconf
> mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
>
> without success.
>
> In all cases, Hive is only able to utilize a tiny fraction of the
> available bandwidth. Is there a way to increase network throughput?
>
> Thank you in advance!
>
> Yours
>
> David Nies
>
>
>