Posted to issues@spark.apache.org by "jifei_yang (JIRA)" <ji...@apache.org> on 2018/01/30 01:10:00 UTC
[jira] [Closed] (SPARK-21664) Use the column name as the file name.
[ https://issues.apache.org/jira/browse/SPARK-21664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
jifei_yang closed SPARK-21664.
------------------------------
We can use partitionBy so that the column values become part of the output path, for example:
{code:java}
import scala.util.Random
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}

case class UserInfo(name: String, favorite_number: Int, favorite_color: String)

def mainSaveAsParquet(args: Array[String]): Unit = {
  // Random directory name; with SaveMode.ErrorIfExists the write fails if the path already exists.
  val fileName = new Random().nextInt(43952858)
  val outPath = s"G:/project/idea15/xlwl/bigdata002/bigdata/sparkmvn/outpath/user/spark/parquet/temp/$fileName"
  val sparkConf = new SparkConf().setAppName("Spark Avro Test").setMaster("local[4]")
  // MyKryoRegistrator is a project-specific helper that is not shown here.
  MyKryoRegistrator.register(sparkConf)
  val sc = new SparkContext(sparkConf)
  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._

  // Build 3001 sample rows, alternating between two user-agent strings.
  val array = new Array[UserInfo](3001)
  for (i <- 0 to 3000) {
    i % 2 match {
      case 0 => array(i) = UserInfo("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36", 256 + (i / 102), "blue")
      case 1 => array(i) = UserInfo("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36 Edge/15.15063", 256 + i, "blue")
    }
  }

  val records: DataFrame = sc.parallelize(array).toDF()
  // partitionBy creates one directory level per listed column, in the order given.
  records.repartition(1).write.partitionBy("name", "favorite_number").format("parquet").mode(SaveMode.ErrorIfExists).save(outPath)
  sc.stop()
}
{code}
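This writes the output under one directory per distinct value of the name and favorite_number columns (name=<value>/favorite_number=<value>/), so the column values appear in the output path. The Parquet files inside each directory keep Spark's generated part-* names.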
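To see the resulting layout on a small scale, here is a minimal sketch. It assumes the Spark 2.x SparkSession API; the object name PartitionLayoutSketch, the three sample rows, and the temporary output directory are made up for illustration and are not part of the original example. It partitions by favorite_color only, to keep the number of directories small.

```scala
import java.io.File
import java.nio.file.Files
import org.apache.spark.sql.{SaveMode, SparkSession}

// Hypothetical sketch object; none of these names come from the original issue.
object PartitionLayoutSketch {

  // Writes three sample rows partitioned by favorite_color and returns the
  // sorted names of the partition directories created under outPath.
  def writeAndListPartitionDirs(spark: SparkSession, outPath: String): Seq[String] = {
    import spark.implicits._

    val records = Seq(
      ("alice", 256, "blue"),
      ("bob", 257, "red"),
      ("carol", 258, "blue")
    ).toDF("name", "favorite_number", "favorite_color")

    records.write
      .partitionBy("favorite_color")
      .mode(SaveMode.Overwrite)
      .parquet(outPath)

    // Keep only the column=value directories; marker files such as _SUCCESS are skipped.
    new File(outPath)
      .listFiles()
      .filter(f => f.isDirectory && f.getName.contains("="))
      .map(_.getName)
      .sorted
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partition-layout-sketch")
      .master("local[1]")
      .getOrCreate()
    try {
      val outPath = Files.createTempDirectory("partition-sketch").resolve("out").toString
      writeAndListPartitionDirs(spark, outPath).foreach(println)
      // Reading the directory back restores favorite_color as a regular column.
      spark.read.parquet(outPath).printSchema()
    } finally {
      spark.stop()
    }
  }
}
```

Run with Spark on the classpath, this should list the directories favorite_color=blue and favorite_color=red. The partition column is stored in the path rather than in the data files, and Spark recovers it from the directory names when the output is read back.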
> Use the column name as the file name.
> --------------------------------------
>
> Key: SPARK-21664
> URL: https://issues.apache.org/jira/browse/SPARK-21664
> Project: Spark
> Issue Type: Question
> Components: Input/Output
> Affects Versions: 2.2.0
> Reporter: jifei_yang
> Priority: Major
>
> When we save a DataFrame, we want to use a column's value as the file name. This is achievable with PairRDDFunctions. Can the same be done with a DataFrame? Thank you.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)