Posted to user@spark.apache.org by "Barona, Ricardo" <ri...@intel.com> on 2017/06/09 17:17:56 UTC
RDD saveAsText and DataFrame write.mode(SaveMode).text(Path) duplicating rows
In Spark 1.6.0 I’m having an issue with both RDD saveAsTextFile and DataFrame write.mode(SaveMode).text(path). I have a data frame with 1M+ rows and I do:
dataFrame.limit(500).map(_.mkString("\t")).toDF("row").write.mode(SaveMode.Overwrite).text("myHDFSFolder/results")
When I then check the results file, I see 900+ rows instead of 500. On further analysis, I found that some of the rows are duplicated.
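To confirm the duplication independently of Spark, one option is to count how often each line occurs in the written output. The following is only a sketch in plain Scala: `DuplicateCheck` and `duplicateLines` are hypothetical names, and the sample lines stand in for the contents of the part files under myHDFSFolder/results, which would first have to be read back (for example after copying them to local disk).

```scala
object DuplicateCheck {
  // Hypothetical helper: returns each line that occurs more than once,
  // together with its number of occurrences.
  def duplicateLines(lines: Seq[String]): Map[String, Int] =
    lines
      .groupBy(identity)                              // line -> all its occurrences
      .map { case (line, hits) => line -> hits.size } // line -> occurrence count
      .filter { case (_, count) => count > 1 }        // keep only repeated lines

  def main(args: Array[String]): Unit = {
    // Stand-in data; in practice these would be the lines of the part files.
    val sample = Seq("a\t1", "b\t2", "a\t1", "c\t3", "a\t1")
    val dups = duplicateLines(sample)
    println(s"total=${sample.size} distinct=${sample.distinct.size} duplicated=$dups")
  }
}
```

Comparing the total and distinct counts shows whether the extra rows are exact copies or merely additional rows.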
Does anyone know if this is something that has been reported before?
The only outstanding characteristic of my data is that I have a column that exceeds 2000 characters.
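One thing that may be worth ruling out with such a long column is embedded tabs or line breaks, because a row joined with mkString("\t") and written as text would then span several physical lines. Whether this applies to the data here is an assumption, not something confirmed in this thread, and it would not by itself explain exact duplicates. A minimal sketch in plain Scala, with `TsvLines`, `sanitizeField` and `toTsvLine` as hypothetical names:

```scala
object TsvLines {
  // Hypothetical helper: replaces runs of tabs, carriage returns and line feeds
  // with a single space so one value cannot break the one-row-per-line layout.
  def sanitizeField(value: Any): String = {
    val text = if (value == null) "null" else value.toString
    text.replaceAll("[\\t\\r\\n]+", " ")
  }

  // Hypothetical helper: joins the sanitized values of one row with tabs.
  // For a Spark Row, row.toSeq yields such a Seq[Any].
  def toTsvLine(values: Seq[Any]): String =
    values.map(sanitizeField).mkString("\t")

  def main(args: Array[String]): Unit = {
    // Stand-in row whose second value contains an embedded line break.
    val row = Seq(42, "first part\nsecond part", null)
    val raw = row.mkString("\t")
    val clean = toTsvLine(row)
    println(s"raw lines=${raw.split("\n").length} clean lines=${clean.split("\n").length}")
  }
}
```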
Appreciate your help, thanks.
Cheers,
Ricardo Barona
Re: RDD saveAsText and DataFrame write.mode(SaveMode).text(Path) duplicating rows
Posted by "Barona, Ricardo" <ri...@intel.com>.
Thanks Manjunath. Please take a look at line 64 of this file:
https://github.com/apache/incubator-spot/blob/master/spot-ml/src/main/scala/org/apache/spot/proxy/ProxySuspiciousConnectsAnalysis.scala
I’m trying to get sample data, but no luck so far. I will let you know if I get some.
Thanks.
From: "Manjunath, Kiran" <ki...@akamai.com>
Date: Friday, June 9, 2017 at 1:47 PM
To: "Barona, Ricardo" <ri...@intel.com>, "user@spark.apache.org" <us...@spark.apache.org>
Subject: Re: RDD saveAsText and DataFrame write.mode(SaveMode).text(Path) duplicating rows
Can you post your code and sample input?
That should help us understand whether the bug is in your code or in the platform.
Regards,
Kiran
Re: RDD saveAsText and DataFrame write.mode(SaveMode).text(Path) duplicating rows
Posted by "Manjunath, Kiran" <ki...@akamai.com>.
Can you post your code and sample input?
That should help us understand whether the bug is in your code or in the platform.
Regards,
Kiran