Posted to user@spark.apache.org by Arkadiusz Bicz <ar...@gmail.com> on 2016/01/06 13:08:17 UTC

Spark DataFrame limit question

Hi,

Does limit work for DataFrames, Spark SQL and HiveContext without a
full scan of Parquet in Spark 1.6?

I just used it to create a small Parquet file from a large number of
Parquet files and found out that it does a full scan of all the data
instead of just reading the limited number of rows:

All of the commands below do a full scan:

val results = sqlContext.read.load("/largenumberofparquetfiles/")

results.limit(1).write.parquet("/tmp/smallresults1")

results.registerTempTable("resultTemp")

val select = sqlContext.sql("select * from resultTemp limit 1")

select.write.parquet("/tmp/smallresults2")
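
For reference, this is roughly how I checked the scan behaviour (a
minimal sketch; I looked at the explain output and the input size
reported in the Spark UI for the write job):

// Print the physical plan to see whether the limit is applied
// at the Parquet scan or only after reading everything
results.limit(1).explain(true)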

The same happens when I create an external table named "results" in the Hive context:

hiveContext.sql("select * from results limit
1").write.parquet("/tmp/results/one3")


Thanks,

Arkadiusz Bicz

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org