Posted to user@spark.apache.org by Rohit Verma <ro...@rokittech.com> on 2017/02/08 10:58:53 UTC
Dataset count on database or parquet
Hi, which of the following is the better approach when the database table has a very large number of rows?
final Dataset<Row> dataset = spark.sqlContext().read()
        .format("jdbc")
        .option("url", params.getJdbcUrl())
        .option("driver", params.getDriver())
        .option("dbtable", params.getSqlQuery())
        // .option("partitionColumn", hashFunction)
        // .option("lowerBound", 0)
        // .option("upperBound", 10)
        // .option("numPartitions", 10)
        // .option("oracle.jdbc.timezoneAsRegion", "false")
        .option("fetchSize", 100000)
        .load();
dataset.write().parquet(params.getPath());
// Target: get the count of the persisted rows.

// Approach 1: get the count directly from the dataset.
// As I understand it, this count is translated to JdbcRDD.count and could run on the database.
long countFromJdbc = dataset.count();

// Approach 2: read the saved parquet back and count it.
long countFromParquet = spark.read().parquet(params.getPath()).count();
Regards
Rohit
Re: Dataset count on database or parquet
Posted by Suresh Thalamati <su...@gmail.com>.
If you have to get the data into parquet format for other reasons anyway, then I think count() on the parquet should be better. If it is just the count you need, sending dbtable = (select count(*) from <tablename>) to the database might be quicker; it will avoid unnecessary data transfer from the database to Spark.
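A minimal sketch of that pushdown idea. The table name "orders" and the helper method are illustrative assumptions, not something from this thread; note that a subquery passed as dbtable generally needs to carry an alias, and the exact column type of count(*) depends on the database dialect.

```java
// Hedged sketch of the count-pushdown suggestion: pass a count(*)
// subquery as the "dbtable" option so only one row crosses the wire.
// "orders" and countSubquery(...) are illustrative assumptions.
public class CountPushdown {

    // Build the aliased count subquery to pass as the "dbtable" option.
    static String countSubquery(String table) {
        return "(select count(*) as cnt from " + table + ") t";
    }

    public static void main(String[] args) {
        String dbtable = countSubquery("orders");
        System.out.println(dbtable);

        // With a live SparkSession `spark` and JDBC connection details,
        // the read would look roughly like:
        //
        //   Dataset<Row> one = spark.read()
        //       .format("jdbc")
        //       .option("url", jdbcUrl)
        //       .option("driver", driver)
        //       .option("dbtable", dbtable)
        //       .load();
        //   long count = one.first().getLong(0);  // type may vary by dialect
    }
}
```

Whether this beats counting the parquet files depends on whether the parquet write is needed anyway, as noted above.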
Hope that helps
-suresh
> On Feb 8, 2017, at 2:58 AM, Rohit Verma <ro...@rokittech.com> wrote: