Posted to issues@spark.apache.org by "jifei_yang (JIRA)" <ji...@apache.org> on 2018/01/30 01:10:00 UTC
[jira] [Closed] (SPARK-21664) Use the column name as the file name.

     [ https://issues.apache.org/jira/browse/SPARK-21664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

jifei_yang closed SPARK-21664.
------------------------------

We can use partitionBy to put the column values into the output path, for example:

{code:java}
import scala.util.Random

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}

// Case classes are already Serializable, so no explicit "extends Serializable" is needed.
case class UserInfo(name: String, favorite_number: Int, favorite_color: String)

def mainSaveAsParquet(args: Array[String]): Unit = {
  // Random suffix so that repeated runs do not collide under SaveMode.ErrorIfExists.
  val fileName = new Random().nextInt(43952858)
  val outPath = s"G:/project/idea15/xlwl/bigdata002/bigdata/sparkmvn/outpath/user/spark/parquet/temp/$fileName"
  val sparkConf = new SparkConf().setAppName("Spark Avro Test").setMaster("local[4]")

  // MyKryoRegistrator is the author's own helper; its definition is not shown here.
  MyKryoRegistrator.register(sparkConf)

  val sc = new SparkContext(sparkConf)
  val sqlContext = new SQLContext(sc)

  // Build 3001 sample records: even indices get the Chrome user agent, odd indices the Edge one.
  val array = new Array[UserInfo](3001)
  for (i <- 0 to 3000) {
    i % 2 match {
      case 0 => array(i) = UserInfo("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36", 256 + (i / 102), "blue")
      case 1 => array(i) = UserInfo("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36 Edge/15.15063", 256 + i, "blue")
    }
  }

  import sqlContext.implicits._
  val records: DataFrame = sc.parallelize(array).toDF()
  // partitionBy writes one directory per distinct (name, favorite_number) combination.
  records.repartition(1).write.partitionBy("name", "favorite_number").format("parquet").mode(SaveMode.ErrorIfExists).save(outPath)
  sc.stop()
}
{code}

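With partitionBy, the values of name and favorite_number become part of the output path (directories of the form name=<value>/favorite_number=<value>) instead of being stored as columns inside the Parquet files.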
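
To see the resulting layout on a small scale, here is a minimal sketch, assuming Spark 2.x with SparkSession on the classpath. The object name PartitionBySketch, the User case class, the sample rows, and the temporary output directory are all hypothetical and chosen only for illustration. It writes three rows partitioned by name and favorite_number, lists the top-level partition directories, and reads the data back from the root path.

```scala
import java.io.File
import java.nio.file.Files

import org.apache.spark.sql.{SaveMode, SparkSession}

object PartitionBySketch {
  // Hypothetical record type for this sketch; it mirrors the shape of UserInfo.
  case class User(name: String, favorite_number: Int, favorite_color: String)

  /** Writes a tiny partitioned dataset to a temp directory and reads it back.
    * Returns the top-level partition directory names and the number of rows
    * whose name is "alice". */
  def run(): (Seq[String], Long) = {
    val spark = SparkSession.builder()
      .appName("partitionBy sketch")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    try {
      // Under SaveMode.ErrorIfExists the target path must not exist yet,
      // so write into a fresh child of a new temp directory.
      val outPath =
        new File(Files.createTempDirectory("partition-sketch").toFile, "users").getPath

      Seq(
        User("alice", 7, "blue"),
        User("alice", 8, "red"),
        User("bob", 7, "green")
      ).toDF()
        .write
        .partitionBy("name", "favorite_number")
        .mode(SaveMode.ErrorIfExists)
        .parquet(outPath)

      // Top-level entries are directories such as name=alice and name=bob,
      // next to marker files such as _SUCCESS.
      val partitionDirs =
        new File(outPath).list().filter(_.startsWith("name=")).sorted.toSeq

      // Reading the root path lets Spark rebuild the name and favorite_number
      // columns from the directory names.
      val aliceRows =
        spark.read.parquet(outPath).filter($"name" === "alice").count()

      (partitionDirs, aliceRows)
    } finally {
      spark.stop()
    }
  }
}
```

Calling PartitionBySketch.run() returns the partition directory names together with the row count for one name. Note that this controls directory names only; the part-file names inside each directory are still generated by Spark.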

>  Use the column name as the file name.
> --------------------------------------
>
>                 Key: SPARK-21664
>                 URL: https://issues.apache.org/jira/browse/SPARK-21664
>             Project: Spark
>          Issue Type: Question
>          Components: Input/Output
>    Affects Versions: 2.2.0
>            Reporter: jifei_yang
>            Priority: Major
>
> When we save a DataFrame, we want to use a column's value as the file name. This is achievable with PairRDDFunctions. Can the same be done with a DataFrame? Thank you.


