Posted to user@spark.apache.org by YaoPau <jo...@gmail.com> on 2014/11/30 00:24:33 UTC

Appending with saveAsTextFile?

I am using Spark to aggregate logs that land in HDFS throughout the day.  The
job kicks off 15 minutes after the hour and processes anything that landed
during the previous hour.

For example, the 2:15pm job will process anything that came in from
1:00pm-2:00pm.  99.9% of that data will consist of logs actually from the
1:00pm-2:00pm timespan.  But 0.1% will be data that, for one of several
reasons, belongs to the 12:00pm hour or even earlier and only trickled in
late.

What I'd like to do is split my RDD by timestamp into several RDDs, then use
saveAsTextFile() to write each RDD to its proper location on disk.  So 99.9%
of the example data would go to /user/me/output/2014-11-29/13, while a small
portion would go to /user/me/output/2014-11-29/12, and if a couple of rows
trickle in from the 10am hour, that aggregation goes to
/user/me/output/2014-11-29/10.
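
Concretely, here is roughly the shape of what I'm attempting (a sketch from
spark-shell, where sc is provided; the input path and the timestamp parsing
are stand-ins, since the real log format is specific to my data):

    // Sketch: assumes each log line begins with a timestamp like
    // "2014-11-29 13:42:07", so the first 13 characters identify the hour.
    val logs = sc.textFile("/user/me/input/current-batch").cache()

    // Map a line to its output subdirectory, e.g. "2014-11-29/13".
    def hourOf(line: String): String =
      line.take(10) + "/" + line.slice(11, 13)

    // One filtered save per distinct hour -- fine while the number of
    // distinct hours per batch stays small.
    val hours = logs.map(hourOf).distinct().collect()
    for (hour <- hours) {
      logs.filter(line => hourOf(line) == hour)
          .saveAsTextFile("/user/me/output/" + hour)
    }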

But when I run the job, I get errors for the trickle /12 and /10 data saying
those directories already exist.  Is there a way to do something like an
INSERT INTO with saveAsTextFile, i.e., to "append" to an existing output
directory?
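
For reference, the failure looks roughly like this (the exception comes from
Hadoop's output-directory check; the exact message may differ by version):

    // The 1:15pm run already created /user/me/output/2014-11-29/12, so when
    // the 2:15pm run tries to save its trickle subset (trickleRdd here is
    // just a placeholder name) to the same path:
    trickleRdd.saveAsTextFile("/user/me/output/2014-11-29/12")
    // => org.apache.hadoop.mapred.FileAlreadyExistsException:
    //    Output directory /user/me/output/2014-11-29/12 already exists

The only workaround I can think of is writing each run into its own
run-stamped subdirectory under the hour (e.g. .../12/run-201411291415) and
having downstream readers glob over the hour directory, but that feels like
a hack -- hence the question.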



