Posted to user@spark.apache.org by Rohit Verma <ro...@rokittech.com> on 2017/02/08 10:58:53 UTC

Dataset count on database or parquet

Hi, which of the following is the better approach when the database table has a very large number of rows?


      final Dataset<Row> dataset = spark.sqlContext().read()
                .format("jdbc")
                .option("url", params.getJdbcUrl())
                .option("driver", params.getDriver())
                .option("dbtable", params.getSqlQuery())
//                .option("partitionColumn", hashFunction)
//                .option("lowerBound", 0)
//                .option("upperBound", 10)
//                .option("numPartitions", 10)
//                .option("oracle.jdbc.timezoneAsRegion", "false")
                .option("fetchSize", 100000)
                .load();
        dataset.write().parquet(params.getPath());

// The goal is to get the count of the persisted rows.


        // Approach 1, i.e. getting the count directly from the dataset.
        // As I understand it, this count will be translated to JDBCRDD.count and could run on the database.
        long countFromJdbc = dataset.count();
        // Approach 2, i.e. reading back the saved parquet and getting the count from it.
        long countFromParquet = spark.read().parquet(params.getPath()).count();
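
        For reference, here is a rough sketch of the partitioned read that the
        commented-out options above point at. The column name "id" and the
        bounds are placeholders, not from the actual schema; lowerBound and
        upperBound should span the column's real min/max so the partitions
        get roughly even slices.

        final Dataset<Row> partitioned = spark.sqlContext().read()
                .format("jdbc")
                .option("url", params.getJdbcUrl())
                .option("driver", params.getDriver())
                .option("dbtable", params.getSqlQuery())
                // "id" is a hypothetical numeric column to split the read on.
                .option("partitionColumn", "id")
                .option("lowerBound", 0)
                .option("upperBound", 10000000)
                .option("numPartitions", 10)
                .option("fetchSize", 100000)
                .load();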


Regards
Rohit

Re: Dataset count on database or parquet

Posted by Suresh Thalamati <su...@gmail.com>.
If you have to get the data into Parquet format for other reasons anyway, then I think count() on the Parquet should be better. If it is just the count you need from the database, sending dbtable = (select count(*) from <tablename>) might be quicker; it will avoid unnecessary data transfer from the database to Spark.
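
For illustration, a minimal sketch of that push-down in the style of the snippet above. "mytable" and the "cnt" alias are placeholders (most databases require the subquery to be aliased), and spark/params are assumed from the earlier snippet:

        final Dataset<Row> countDf = spark.read()
                .format("jdbc")
                .option("url", params.getJdbcUrl())
                .option("driver", params.getDriver())
                // Only one row crosses the wire; the database does the counting.
                .option("dbtable", "(select count(*) as cnt from mytable) t")
                .load();
        // count(*) may come back as DECIMAL or BIGINT depending on the database,
        // so go through Number rather than assuming a long column.
        final long rowCount = ((Number) countDf.first().get(0)).longValue();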


Hope that helps
-suresh
