Posted to users@nifi.apache.org by Mohit <mo...@open-insights.co.in> on 2018/06/27 13:14:25 UTC

SelectHiveQL gets stuck when querying a table containing 12 billion rows

Hi all,

I'm trying to fetch data from Hive using SelectHiveQL. It works fine for
small to medium-sized tables, but when I try to fetch data from a large table
with around 12 billion rows, it gets stuck for hours and appears to do
nothing. I have set the Max Rows Per Flow File property to 10 million.

We have a 4-node NiFi cluster with 150 GB of RAM on each node.

Is there any configuration that needs to be changed to make this work?

Regards,

Mohit


RE: SelectHiveQL gets stuck when querying a table containing 12 billion rows

Posted by Mohit <mo...@open-insights.co.in>.
Thanks Shawn,

I followed a similar approach.

Regards,

Mohit

 



Re: SelectHiveQL gets stuck when querying a table containing 12 billion rows

Posted by Shawn Weeks <sw...@weeksconsulting.us>.
To get the partitions, you can execute 'show partitions table_name'. Then you can use SplitRecord with an AvroReader and a JSON writer to generate one flow file per partition. Each flow file can then be read with EvaluateJsonPath to pull the partition_name into an attribute on the flow file. Finally, use ReplaceText to actually write out the select statement, substituting the partition attribute.
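
As a rough illustration of the last step (writing out one select statement per partition), here is a minimal Python sketch. It is not NiFi configuration and not necessarily how the flow described here was built; it only shows the kind of string substitution that ReplaceText would perform. It assumes each line of the 'show partitions' output looks like dt=2018-06-27/country=IN and that comparing partition columns against quoted string values is acceptable for the table. The table name "events", the column names, and both function names are hypothetical.

```python
from typing import Iterable, List


def partition_to_predicate(partition_spec: str) -> str:
    """Convert one 'show partitions' line into a WHERE-clause predicate.

    Assumes the spec has the form 'col1=val1/col2=val2' (an assumption
    about the output format, not something stated in the thread).
    """
    clauses = []
    for part in partition_spec.split("/"):
        # Split only on the first '=' so values containing '=' survive.
        column, _, value = part.partition("=")
        # Backslash-escape single quotes inside the value.
        escaped = value.replace("'", "\\'")
        clauses.append(f"`{column}` = '{escaped}'")
    return " AND ".join(clauses)


def build_partition_queries(table: str, partition_specs: Iterable[str]) -> List[str]:
    """Build one SELECT statement per partition spec (hypothetical helper)."""
    return [
        f"SELECT * FROM {table} WHERE {partition_to_predicate(spec)}"
        for spec in partition_specs
    ]


if __name__ == "__main__":
    # Hypothetical table and partition values, for illustration only.
    for query in build_partition_queries(
        "events", ["dt=2018-06-27/country=IN", "dt=2018-06-27/country=US"]
    ):
        print(query)
```

Each generated statement would correspond to one flow file in the flow, so the selects can be handed to SelectHiveQL independently.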


Thanks

Shawn


RE: SelectHiveQL gets stuck when querying a table containing 12 billion rows

Posted by Mohit <mo...@open-insights.co.in>.
Hi,

Yes, I tried fetching around 40 million rows; it took a while, but it
completed. I'll try the Avro approach.

How do I break the select into multiple parts? Can you briefly explain the
partition flow to start with?

Thanks,

Mohit

 



Re: SelectHiveQL gets stuck when querying a table containing 12 billion rows

Posted by Shawn Weeks <sw...@weeksconsulting.us>.
It's probably not stuck doing nothing; using a JDBC connection to fetch 12 billion rows is going to be painful no matter what you do. At those sizes you're probably better off having Hive create a temporary table in Avro format and then consuming the Avro files from HDFS into NiFi. The largest number of rows I've pulled into NiFi via JDBC in a single query is around 10-20 million, and that took a long time. You can also try breaking the select into multiple parts and running them simultaneously. I've done something similar, where I first ran a query to get all of the partitions and then executed a select for each partition in parallel.
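
To make the two suggestions a little more concrete, here is a minimal Python sketch under stated assumptions. The first function builds a Hive CREATE TABLE ... STORED AS AVRO AS SELECT statement; it uses a regular staging table rather than a session-scoped temporary one, on the assumption that the files need to remain readable from HDFS afterwards (for example with ListHDFS and FetchHDFS). The second function runs a list of queries with bounded parallelism. The table names, the function names, and the run_query callable are hypothetical, and nothing here is taken from an actual flow.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def avro_staging_statement(source_table: str, staging_table: str) -> str:
    """Build a CTAS statement that has Hive write the data as Avro files."""
    return (
        f"CREATE TABLE {staging_table} STORED AS AVRO "
        f"AS SELECT * FROM {source_table}"
    )


def run_in_parallel(
    queries: Iterable[str],
    run_query: Callable[[str], int],
    max_workers: int = 4,
) -> List[int]:
    """Run each query through run_query on a bounded thread pool.

    run_query is supplied by the caller (hypothetical here); results are
    returned in the same order as the input queries.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_query, queries))


if __name__ == "__main__":
    print(avro_staging_statement("big_table", "big_table_avro"))
    # Stand-in for a real query runner: it just reports the statement length.
    print(run_in_parallel(["SELECT 1", "SELECT 22"], len, max_workers=2))
```

Capping max_workers keeps the number of simultaneous Hive connections bounded; in NiFi the comparable knob would be the number of concurrent tasks on the processor that executes the selects.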


Thanks

Shawn
