Posted to user@spark.apache.org by Arkadiusz Bicz <ar...@gmail.com> on 2016/01/06 13:08:17 UTC
Spark DataFrame limit question
Hi,
Does limit work for DataFrames, Spark SQL, and Hive Context without a
full scan of Parquet in Spark 1.6?
I used it to create a small Parquet file from a large number of
Parquet files and found that it does a full scan of all the data
instead of reading just the limited number of rows.
All of the commands below do a full scan:
val results = sqlContext.read.load("/largenumberofparquetfiles/")
results.limit(1).write.parquet("/tmp/smallresults1")
results.registerTempTable("resultTemp")
val select = sqlContext.sql("select * from resultTemp limit 1")
select.write.parquet("/tmp/smallresults2")
The same happens when I create an external table over the results in a Hive context:
hiveContext.sql("select * from results limit
1").write.parquet("/tmp/results/one3")
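For reference, here is a minimal self-contained sketch of the three variants above as a single Spark 1.6 application. The app name and the `LimitRepro` object are placeholders I made up; the input path and output paths are the ones from my snippets above, and the Hive variant assumes a table named "results" already exists:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

// Sketch only: reproduces the three limit-then-write attempts on Spark 1.6.
object LimitRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("limit-repro"))
    val sqlContext = new SQLContext(sc)

    // Load the directory of many Parquet files as one DataFrame.
    val results = sqlContext.read.load("/largenumberofparquetfiles/")

    // 1) DataFrame API: limit before writing.
    results.limit(1).write.parquet("/tmp/smallresults1")

    // 2) Spark SQL over a registered temp table.
    results.registerTempTable("resultTemp")
    sqlContext.sql("select * from resultTemp limit 1")
      .write.parquet("/tmp/smallresults2")

    // 3) HiveContext over an existing table named "results".
    val hiveContext = new HiveContext(sc)
    hiveContext.sql("select * from results limit 1")
      .write.parquet("/tmp/results/one3")

    sc.stop()
  }
}
```

In every variant the job still scans all input files rather than stopping after one row.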
Thanks,
Arkadiusz Bicz
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org