Posted to user@spark.apache.org by SRK <sw...@gmail.com> on 2016/06/09 20:01:45 UTC

How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

Hi,

How can I insert data into 2000 partitions (directories) of ORC/Parquet at a
time using Spark SQL? It does not seem to be performant when I try to insert
into 2000 directories of Parquet/ORC using Spark SQL. Has anyone faced this issue?

Thanks!



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-insert-data-into-2000-partitions-directories-of-ORC-parquet-at-a-time-using-Spark-SQL-tp27132.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

Posted by swetha kasireddy <sw...@gmail.com>.
400 cores are assigned to this job.

On Thu, Jun 9, 2016 at 1:16 PM, Stephen Boesch <ja...@gmail.com> wrote:

> How many workers (/cpu cores) are assigned to this job?

Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

Posted by swetha kasireddy <sw...@gmail.com>.
Hi Mich,

No, I have not tried that. My requirement is to insert the data from an hourly
Spark batch job. How would it be different to insert with the Hive CLI or
beeline?

Thanks,
Swetha



On Tue, Jun 14, 2016 at 10:44 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:

> Hi Swetha,
>
> Have you actually tried doing this in Hive using Hive CLI or beeline?
>
> Thanks
>
> Dr Mich Talebzadeh

Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi Swetha,

Have you actually tried doing this in Hive using Hive CLI or beeline?

Thanks

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 14 June 2016 at 18:43, Mich Talebzadeh <mi...@gmail.com> wrote:

> In all probability there is no user database created in Hive

Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

Posted by Mich Talebzadeh <mi...@gmail.com>.
In all probability there is no user database created in Hive.

Create a database yourself:

sql("CREATE DATABASE IF NOT EXISTS test")

It would be helpful to grasp some concepts of Hive databases.

HTH


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com
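
For reference, a minimal sketch of creating and then using a database from
Spark SQL's HiveContext (the database name "test" and the sqlContext variable
are illustrative assumptions, not something given in this thread):

// Create the database if it is missing, then make it the current one.
sqlContext.sql("CREATE DATABASE IF NOT EXISTS test")
sqlContext.sql("USE test")
sqlContext.sql("show databases").show()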




Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

Posted by Sree Eedupuganti <sr...@inndata.in>.
Hi Spark users, I am new to Spark. I am trying to connect to Hive using
SparkJavaContext, but I am unable to connect to the database. By executing the
code below, I can see only the "default" database. Can anyone help me out?
What I need is a sample program for querying Hive using SparkJavaContext. I
need to pass values like this:

userDF.registerTempTable("userRecordsTemp")

sqlContext.sql("SET hive.default.fileformat=Orc  ")
sqlContext.sql("set hive.enforce.bucketing = true; ")
sqlContext.sql("set hive.enforce.sorting = true; ")

public static void main(String[] args) throws Exception {
    SparkConf sparkConf = new SparkConf().setAppName("SparkSQL").setMaster("local");
    SparkContext ctx = new SparkContext(sparkConf);
    HiveContext hiveql = new org.apache.spark.sql.hive.HiveContext(ctx);
    DataFrame df = hiveql.sql("show databases");
    df.show();
}

Any suggestions, please? Thanks.
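
If only the "default" database shows up, a common cause (an assumption on my
part, not something confirmed in this thread) is that hive-site.xml is not on
Spark's classpath, so HiveContext falls back to a local embedded metastore. A
minimal Scala sketch, with the usual fixes noted in comments:

// HiveContext reads hive-site.xml from the classpath (typically
// $SPARK_HOME/conf); without it, Spark 1.6 creates a local embedded
// metastore that contains only the "default" database.
val sc = new org.apache.spark.SparkContext(
  new org.apache.spark.SparkConf().setAppName("SparkSQL").setMaster("local"))
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
// The metastore can also be pointed at explicitly (host/port illustrative):
hiveContext.setConf("hive.metastore.uris", "thrift://metastore-host:9083")
hiveContext.sql("show databases").show()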

Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

Posted by swetha kasireddy <sw...@gmail.com>.
Hi Bijay,

This approach might not work for me, as I have to do partial
inserts/overwrites into a given table, and data_frame.write.partitionBy will
overwrite the entire table.

Thanks,
Swetha
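
For partial overwrites, one possible pattern (a sketch under my own
assumptions, not something verified in this thread) is to filter the dataframe
down to the partitions being refreshed, write each slice directly into its
partition directory, and register that directory with the metastore:

// Overwrite a single (idPartitioner, dtPartitioner) slice; the layout
// '/user/userId/userRecords/idPartitioner=X/dtPartitioner=Y' and the
// literal values here are illustrative.
val slice = userDF.filter("idPartitioner = '1' AND dtPartitioner = '2016-06-09'")
slice.select("userId", "userRecord")
  .write.mode("overwrite")
  .orc("/user/userId/userRecords/idPartitioner=1/dtPartitioner=2016-06-09")
// Make sure the partition is known to the metastore:
sqlContext.sql("ALTER TABLE users ADD IF NOT EXISTS PARTITION " +
  "(idPartitioner='1', dtPartitioner='2016-06-09') LOCATION " +
  "'/user/userId/userRecords/idPartitioner=1/dtPartitioner=2016-06-09'")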


Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

Posted by Bijay Pathak <bi...@cloudwick.com>.
Hi Swetha,

One option is to use Hive with the above issue fixed, which is Hive 2.0, or
Cloudera CDH Hive 1.2, which has the above issue resolved. One thing to
remember is that it is not the Hive you have installed but the Hive that Spark
is using, which in Spark 1.6 is Hive version 1.2 as of now.

The workaround I used for this issue was to write the dataframe directly using
the dataframe write method and to create the Hive table on top of that, which
brought my processing time down from 4+ hours to just under 1 hour.


data_frame.write.partitionBy('idPartitioner','dtPartitioner').orc("path/to/final/location")

And the ORC format is supported with HiveContext only.

Thanks,
Bijay
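
A fuller Scala sketch of this workaround, assuming the table and column names
from the snippets in this thread (the exact steps are my reading of the
description above, not a verified recipe):

// 1) Write the partitioned ORC data directly with the dataframe writer,
//    bypassing Hive's dynamic-partition insert path (ORC needs HiveContext
//    in Spark 1.6):
data_frame.write
  .partitionBy("idPartitioner", "dtPartitioner")
  .orc("path/to/final/location")

// 2) Create the external Hive table over the written directories:
sqlContext.sql("CREATE EXTERNAL TABLE IF NOT EXISTS users (userId STRING, " +
  "userRecord STRING) PARTITIONED BY (idPartitioner STRING, dtPartitioner STRING) " +
  "STORED AS ORC LOCATION 'path/to/final/location'")

// 3) Let the metastore discover the partition directories:
sqlContext.sql("MSCK REPAIR TABLE users")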



Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

Posted by swetha kasireddy <sw...@gmail.com>.
Hi Mich,

Following is a sample code snippet:

val userDF = userRecsDF.toDF("idPartitioner", "dtPartitioner", "userId", "userRecord").persist()
System.out.println("userRecsDF.partitions.size " + userRecsDF.partitions.size)

userDF.registerTempTable("userRecordsTemp")

sqlContext.sql("SET hive.default.fileformat=Orc")
sqlContext.sql("SET hive.enforce.bucketing = true")
sqlContext.sql("SET hive.enforce.sorting = true")
sqlContext.sql("CREATE EXTERNAL TABLE IF NOT EXISTS users (userId STRING, " +
  "userRecord STRING) PARTITIONED BY (idPartitioner STRING, dtPartitioner STRING) " +
  "STORED AS ORC LOCATION '/user/userId/userRecords'")
sqlContext.sql(
  """FROM userRecordsTemp ps
    |INSERT OVERWRITE TABLE users PARTITION (idPartitioner, dtPartitioner)
    |SELECT ps.userId, ps.userRecord, ps.idPartitioner, ps.dtPartitioner
    |CLUSTER BY idPartitioner, dtPartitioner""".stripMargin)



Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

Posted by swetha kasireddy <sw...@gmail.com>.
Hi Bijay,

If I am hitting this issue,
https://issues.apache.org/jira/browse/HIVE-11940, what needs to be done? Is
upgrading to a higher version of Hive the only solution?

Thanks!


Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

Posted by swetha kasireddy <sw...@gmail.com>.
Hi,

Following is a sample code snippet:

val userDF = userRecsDF.toDF("idPartitioner", "dtPartitioner", "userId", "userRecord").persist()
System.out.println("userRecsDF.partitions.size " + userRecsDF.partitions.size)

userDF.registerTempTable("userRecordsTemp")

sqlContext.sql("SET hive.default.fileformat=Orc")
sqlContext.sql("SET hive.enforce.bucketing = true")
sqlContext.sql("SET hive.enforce.sorting = true")
sqlContext.sql("CREATE EXTERNAL TABLE IF NOT EXISTS users (userId STRING, " +
  "userRecord STRING) PARTITIONED BY (idPartitioner STRING, dtPartitioner STRING) " +
  "STORED AS ORC LOCATION '/user/userId/userRecords'")
sqlContext.sql(
  """FROM userRecordsTemp ps
    |INSERT OVERWRITE TABLE users PARTITION (idPartitioner, dtPartitioner)
    |SELECT ps.userId, ps.userRecord, ps.idPartitioner, ps.dtPartitioner
    |CLUSTER BY idPartitioner, dtPartitioner""".stripMargin)





Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

Posted by Bijay Pathak <bi...@cloudwick.com>.
Hello,

Looks like you are hitting this:
https://issues.apache.org/jira/browse/HIVE-11940.

Thanks,
Bijay




Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

Posted by Mich Talebzadeh <mi...@gmail.com>.
Can you provide a code snippet of how you are populating the target table
from the temp table?


HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com




Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

Posted by swetha kasireddy <sw...@gmail.com>.
No, I am reading the data from HDFS, transforming it, registering the data in
a temp table using registerTempTable, and then doing an insert overwrite using
Spark SQL's hiveContext.

On Thu, Jun 9, 2016 at 3:40 PM, Mich Talebzadeh <mi...@gmail.com> wrote:

> How are you doing the insert? From an existing table?

Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

Posted by Mich Talebzadeh <mi...@gmail.com>.
How are you doing the insert? From an existing table?

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 9 June 2016 at 21:16, Stephen Boesch <ja...@gmail.com> wrote:

> How many workers (/cpu cores) are assigned to this job?

Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

Posted by Stephen Boesch <ja...@gmail.com>.
How many workers (/cpu cores) are assigned to this job?

2016-06-09 13:01 GMT-07:00 SRK <sw...@gmail.com>:

> Hi,
>
> How can I insert data into 2000 partitions (directories) of ORC/Parquet at a
> time using Spark SQL? It does not seem to be performant when I try to insert
> into 2000 directories of Parquet/ORC using Spark SQL. Has anyone faced this
> issue?
>
> Thanks!