Posted to user@spark.apache.org by rajat mishra <mi...@gmail.com> on 2018/08/22 04:15:12 UTC

CBO not predicting cardinality on partition columns for Parquet tables

Hi All,

I have an external table in Spark whose underlying data files are in
Parquet format. The table is partitioned. When I compute statistics for a
query that filters on the partition column in the WHERE clause, the
statistics returned contain only sizeInBytes and no row count.

  // Register the partitioned Parquet data as an external table and
  // discover its partitions.
  val ddl = """create external table test_p (Address String, Age String,
    CustomerID String, CustomerName String, CustomerSuffix String,
    Location String, Mobile String, Occupation String, Salary String)
    PARTITIONED BY (Country String) STORED AS PARQUET
    LOCATION '/dev/test3'"""
  spark.sql(ddl)
  spark.sql("msck repair table test_p")

  // Collect column-level and partition-level statistics.
  spark.sql("analyze table test_p compute statistics for columns " +
    "Address,Age,CustomerID,CustomerName,CustomerSuffix,Location," +
    "Mobile,Occupation,Salary,Country").show()
  spark.sql("analyze table test_p partition(Country) compute statistics").show()

  println(spark.sql("select * from test_p where country='Korea'")
    .queryExecution.toStringWithStats)


The output of the toStringWithStats call is:

== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('country = Korea)
   +- 'UnresolvedRelation `test_p`

== Analyzed Logical Plan ==
Address: string, Age: string, CustomerID: string, CustomerName:
string, CustomerSuffix: string, Location: string, Mobile: string,
Occupation: string, Salary: string, Country: string
Project [Address#0, Age#1, CustomerID#2, CustomerName#3,
CustomerSuffix#4, Location#5, Mobile#6, Occupation#7, Salary#8,
Country#9]
+- Filter (country#9 = Korea)
   +- SubqueryAlias test_p
      +- Relation[Address#0,Age#1,CustomerID#2,CustomerName#3,CustomerSuffix#4,Location#5,Mobile#6,Occupation#7,Salary#8,Country#9]
parquet

== Optimized Logical Plan ==
Project [Address#0, Age#1, CustomerID#2, CustomerName#3,
CustomerSuffix#4, Location#5, Mobile#6, Occupation#7, Salary#8,
Country#9], Statistics(sizeInBytes=2.2 KB, hints=none)
+- Filter (isnotnull(country#9) && (country#9 = Korea)),
Statistics(sizeInBytes=2.2 KB, hints=none)
   +- Relation[Address#0,Age#1,CustomerID#2,CustomerName#3,CustomerSuffix#4,Location#5,Mobile#6,Occupation#7,Salary#8,Country#9]
parquet, Statistics(sizeInBytes=2.2 KB, hints=none)

== Physical Plan ==
*FileScan parquet
default.test_p[Address#0,Age#1,CustomerID#2,CustomerName#3,CustomerSuffix#4,Location#5,Mobile#6,Occupation#7,Salary#8,Country#9]
Batched: true, Format: Parquet, Location:
PrunedInMemoryFileIndex[file:/C:/dev/tests2/Country=Korea],
PartitionCount: 1, PartitionFilters: [isnotnull(Country#9), (Country#9
= Korea)], PushedFilters: [], ReadSchema:
struct<Address:string,Age:string,CustomerID:string,CustomerName:string,CustomerSuffix:string,Loca...
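
For what it's worth, my understanding is that a rowCount only appears in
these Statistics when the cost-based optimizer is enabled (it is off by
default in Spark 2.x), and it is on in my session:

  // CBO is needed for the optimizer to propagate row counts from the
  // analyzed stats; spark.sql.cbo.enabled defaults to false in Spark 2.x.
  spark.conf.set("spark.sql.cbo.enabled", "true")
  println(spark.conf.get("spark.sql.cbo.enabled"))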


The same works fine if the table's underlying data file format is
TextFile.
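
For comparison, the TextFile variant is declared the same way (a sketch;
the table name and location below are made up, the schema and the ANALYZE
steps are identical to the Parquet case), and there the optimized plan
does include a rowCount:

  // Same layout, but stored as plain text files.
  // test_t and '/dev/test3_text' are illustrative names only.
  spark.sql("""create external table test_t (Address String, Age String,
    CustomerID String, CustomerName String, CustomerSuffix String,
    Location String, Mobile String, Occupation String, Salary String)
    PARTITIONED BY (Country String) STORED AS TEXTFILE
    LOCATION '/dev/test3_text'""")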

Am I missing a step above, or is this a known issue in Spark?

Any help would be appreciated.


Thanks,
Rajat