Posted to issues@spark.apache.org by "jaeboo jung (JIRA)" <ji...@apache.org> on 2015/06/11 05:45:00 UTC
[jira] [Created] (SPARK-8304) Table with a large number of columns
jaeboo jung created SPARK-8304:
----------------------------------
Summary: Table with a large number of columns
Key: SPARK-8304
URL: https://issues.apache.org/jira/browse/SPARK-8304
Project: Spark
Issue Type: Bug
Affects Versions: 1.3.1
Reporter: jaeboo jung
SQLContext cannot handle a table with a very large number of columns. Creating the DataFrame works, but when a user tries to run a query on it, Spark does not respond. To reproduce, run the code below in spark-shell.
{code:java}
import org.apache.spark.sql._
import org.apache.spark.sql.types._
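// Column indices 1..500000; each one becomes a nullable column declared as StringType.
val arr = (1 to 500000)
val columns = StructType(arr.map(x => StructField("columnNum_"+x , StringType, true)))
// 500,000 rows, each holding the same 500,000 values.
val data = arr.map(x => arr)
val rdd = sc.parallelize(data , 1000).map(Row.fromSeq(_))
val df = sqlContext.createDataFrame(rdd,columns)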
// Select a few columns out of the 500,000 and print the elapsed time in milliseconds.
def select1() = {
  val t1 = System.currentTimeMillis
  df.select("columnNum_1")
  println( System.currentTimeMillis - t1 )
}
def select2() = {
  val t1 = System.currentTimeMillis
  df.select("columnNum_1","columnNum_2")
  println( System.currentTimeMillis - t1 )
}
def select3() = {
  val t1 = System.currentTimeMillis
  df.select("columnNum_1","columnNum_2","columnNum_3")
  println( System.currentTimeMillis - t1 )
}
def select4() = {
  val t1 = System.currentTimeMillis
  df.select("columnNum_1","columnNum_2","columnNum_3","columnNum_4")
  println( System.currentTimeMillis - t1 )
}
def select5() = {
  val t1 = System.currentTimeMillis
  df.select("columnNum_1","columnNum_2","columnNum_3","columnNum_4","columnNum_5")
  println( System.currentTimeMillis - t1 )
}
def select6() = {
  val t1 = System.currentTimeMillis
  df.select("columnNum_1","columnNum_2","columnNum_3","columnNum_4","columnNum_5","columnNum_6")
  println( System.currentTimeMillis - t1 )
}
def select7() = {
  val t1 = System.currentTimeMillis
  df.select("columnNum_1","columnNum_2","columnNum_3","columnNum_4","columnNum_5","columnNum_6","columnNum_7")
  println( System.currentTimeMillis - t1 )
}
{code}
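The seven near-identical functions above could also be written as one loop. The following is only a sketch: timeSelects is a hypothetical helper name that is not part of the report, and it takes the select call as a function argument so that it can be tried without a Spark session. As in the report, only the select call is timed and no action is run on the result.

```scala
// Hypothetical helper (not from the original report): times a select over the
// first n columns, for n = 1..maxColumns, and returns (n, elapsed ms) pairs.
def timeSelects(maxColumns: Int)(select: Seq[String] => Unit): Seq[(Int, Long)] =
  (1 to maxColumns).map { n =>
    // Same naming scheme as the schema in the report: columnNum_1, columnNum_2, ...
    val names = (1 to n).map(i => "columnNum_" + i)
    val t1 = System.currentTimeMillis
    select(names)
    (n, System.currentTimeMillis - t1)
  }

// Assumed usage in spark-shell, with `df` from the report in scope:
// timeSelects(7)(names => df.select(names.head, names.tail: _*)).foreach(println)
```

The usage line relies on the select(col: String, cols: String*) overload of DataFrame, which is why the name list is split into head and tail.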
Running select1 through select7 gives the following elapsed times in milliseconds:
{code}
select1
20552
select2
25391
select3
29619
select4
33695
select5
42220
select6
44790
select7
49101
{code}
The elapsed time of select increases by about 4,000 to 5,000 ms on average with each additional column.
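To make the per-column increase explicit, the differences between consecutive timings can be computed directly from the figures reported above. This is a small sketch in plain Scala; the names elapsedMs, deltas and meanDelta are chosen here for illustration only.

```scala
// Elapsed times in ms for select1..select7, copied from the results above.
val elapsedMs = Vector(20552L, 25391L, 29619L, 33695L, 42220L, 44790L, 49101L)

// Increase in elapsed time for each additional selected column.
val deltas = elapsedMs.zip(elapsedMs.tail).map { case (a, b) => b - a }

// Average increase per additional column.
val meanDelta = deltas.sum.toDouble / deltas.size

println(deltas.mkString(", ")) // prints 4839, 4228, 4076, 8525, 2570, 4311
```

The individual steps range from 2,570 ms to 8,525 ms, and the mean is a little under 4,760 ms per added column.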