Posted to user@spark.apache.org by mrm <ma...@skimlinks.com> on 2014/06/19 17:54:35 UTC

Getting different answers running same line of code

Hi,

I have had this issue for some time: I get different answers when I run
the same line of code twice. I have run some experiments to see what is
happening; please help me! Here is the code and the answers that I get.
I suspect I have a problem when reading large datasets from S3.

rd1 = sc.textFile('s3n://blabla')
rd1.persist()
rd2 = rd1.filter(lambda x: filter1(x)).map(lambda x: map1(x))

Note: both filter1() and map1() are deterministic

rd2.count()  ==> 294928559
rd2.count()  ==> 294928559

So far so good, I get the same counts. Now when I unpersist rd1, that's when
I start getting problems!

rd1.unpersist()
rd2 = rd1.filter(lambda x: filter1(x)).map(lambda x: map1(x))
rd2.count()  ==> 294928559
rd2.count()  ==> 294509501
rd2.count()  ==> 294679795
...

I would appreciate it if you could help me!

Thanks,
Maria





--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Getting-different-answers-running-same-line-of-code-tp7920.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Getting different answers running same line of code

Posted by Daniel Darabos <da...@lynxanalytics.com>.
The easiest explanation would be that some other process is continuously
modifying the files. You could make a copy in a new directory and run on
that to eliminate this possibility.

What do you see if you print "rd1.count()" multiple times?
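
A minimal sketch of that check, assuming rd1 is the RDD from the original
post. The helper names repeated_counts and is_stable are made up for
illustration; the only RDD method used is count().

```python
def repeated_counts(rdd, runs=3):
    """Call rdd.count() `runs` times and return the results in order."""
    return [rdd.count() for _ in range(runs)]


def is_stable(counts):
    """True when every run returned the same count."""
    return len(set(counts)) <= 1


# Usage in the PySpark shell, with rd1 as defined in the original post:
#   counts = repeated_counts(rd1)
#   print(counts, is_stable(counts))
```

If the counts of rd1 already differ between runs, the variation comes from
reading the input rather than from filter1() or map1().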

Have you tried the experiment on a smaller set of files? I don't know why a
file would cause this problem, but maybe you can find the culprit that way.

Are you using a wildcard ("s3n://blabla/*.txt") or direct filenames? Maybe
S3 forgets about the existence of some files some of the time. There could
be a limit on the number of files returned to a directory query, and maybe
the order is not fixed, so different files get cut off at times.

(Sorry about the wild, uneducated guesses.)

