You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Chetan Khatri <ch...@gmail.com> on 2021/05/06 02:15:18 UTC

Performance Improvement: Collect in spark taking huge time

Hi All, Collect in spark is taking huge time. I want to get list of values
of one column to Scala collection. How can I do this?
 val newDynamicFieldTablesDF = cachedPhoenixAppMetaDataForCreateTableDF
            .select(col("reporting_table")).except(clientSchemaDF)
          logger.info(s"####### except with client-schema done " +
LocalDateTime.now())
          // newDynamicFieldTablesDF.cache()


            val detailsForCreateTableDF =
cachedPhoenixAppMetaDataForCreateTableDF
              .join(broadcast(newDynamicFieldTablesDF),
Seq("reporting_table"), "inner")
            logger.info(s"####### join with newDF done " +
LocalDateTime.now())

//            detailsForCreateTableDF.cache()

            val newDynamicFieldTablesList = newDynamicFieldTablesDF.map(r
=> r.getString(0)).collect().toSet


Later, I am iterating this list for one the use case to create a custom
definition table:

newDynamicFieldTablesList.foreach(table => { // running here Create table
DDL/SQL query })

Thank you so much

Re: Performance Improvement: Collect in spark taking huge time

Posted by Chetan Khatri <ch...@gmail.com>.
Hi All,

Do you think, replacing the collect() (for having scala collection for
loop) with below codeblock will have any benefit?

cachedColumnsAddTableDF.select("reporting_table").distinct().foreach(r => {
  r.getAs("reporting_table").asInstanceOf[String]
})


On Wed, May 5, 2021 at 10:15 PM Chetan Khatri <ch...@gmail.com>
wrote:

> Hi All, Collect in spark is taking huge time. I want to get list of values
> of one column to Scala collection. How can I do this?
>  val newDynamicFieldTablesDF = cachedPhoenixAppMetaDataForCreateTableDF
>             .select(col("reporting_table")).except(clientSchemaDF)
>           logger.info(s"####### except with client-schema done " +
> LocalDateTime.now())
>           // newDynamicFieldTablesDF.cache()
>
>
>             val detailsForCreateTableDF =
> cachedPhoenixAppMetaDataForCreateTableDF
>               .join(broadcast(newDynamicFieldTablesDF),
> Seq("reporting_table"), "inner")
>             logger.info(s"####### join with newDF done " +
> LocalDateTime.now())
>
> //            detailsForCreateTableDF.cache()
>
>             val newDynamicFieldTablesList = newDynamicFieldTablesDF.map(r
> => r.getString(0)).collect().toSet
>
>
> Later, I am iterating this list for one the use case to create a custom
> definition table:
>
> newDynamicFieldTablesList.foreach(table => { // running here Create table
> DDL/SQL query })
>
> Thank you so much
>