Posted to user@spark.apache.org by Steve Lewis <lo...@gmail.com> on 2014/12/06 02:21:23 UTC

Problems creating and reading a large test file

I am trying to investigate problems reading a data file over 4 GB, so as a
first step I am trying to create such a file.
My plan is to generate a FASTA file (a simple text format used in biology)
that looks like this:
>1
TCCTTACGGAGTTCGGGTGTTTATCTTACTTATCGCGGTTCGCTGCCGCTCCGGGAGCCCGGATAGGCTGCGTTAATACCTAAGGAGCGCGTATTGAAAA
>2
GTCTGATCTAAATGCGACGACGTCTTTAGTGCTAAGTGGAACCCAATCTTAAGACCCAGGCTCTTAAGCAGAAACAGACCGTCCCTGCCTCCTGGAGTAT
>3
...
I create a list of 5000 structures, use flatMap to expand each entry into
5000 records, and then either call saveAsTextFile or call
dnaFragmentIterator = mySet.toLocalIterator(); and write the records to
HDFS myself. A stripped-down sketch of the generation step follows.
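
In stripped-down form, the generation step looks roughly like this (the
class name, HDFS path, and record layout are simplified placeholders, not
my exact code):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;

public class FastaGenerator {
    private static final String BASES = "ACGT";

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("FastaGenerator");
        JavaSparkContext ctx = new JavaSparkContext(conf);

        // 5000 seed entries; flatMap fans each one out into 5000
        // FASTA records, for 25 million records in total.
        List<Integer> seeds = new ArrayList<Integer>();
        for (int i = 0; i < 5000; i++) {
            seeds.add(i);
        }

        JavaRDD<String> fasta = ctx.parallelize(seeds).flatMap(
            new FlatMapFunction<Integer, String>() {
                @Override
                public Iterable<String> call(Integer seed) {
                    Random rnd = new Random(seed);
                    List<String> lines = new ArrayList<String>();
                    for (int j = 0; j < 5000; j++) {
                        long id = seed * 5000L + j + 1;
                        lines.add(">" + id);                 // header line
                        lines.add(randomSequence(rnd, 100)); // 100-base sequence
                    }
                    return lines;
                }
            });

        // Write the records straight to HDFS as text.
        fasta.saveAsTextFile("hdfs:///tmp/large-test.fasta");
        ctx.stop();
    }

    private static String randomSequence(Random rnd, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(BASES.charAt(rnd.nextInt(BASES.length())));
        }
        return sb.toString();
    }
}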

Then I try to read the file back with JavaRDD<String> lines =
ctx.textFile(hdfsFileName);
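
In simplified form, using the same placeholder path as the sketch above:

// Read the generated file back and force an action so Spark
// actually scans the whole file.
JavaRDD<String> lines = ctx.textFile("hdfs:///tmp/large-test.fasta");
// 25 million records, two lines each: expect 50 million lines.
System.out.println("read " + lines.count() + " lines");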

What I get on a 16-node cluster is:
14/12/06 01:49:21 ERROR SendingConnection: Exception while reading
SendingConnection to ConnectionManagerId(pltrd007.labs.uninett.no,50119)
java.nio.channels.ClosedChannelException

14/12/06 01:49:35 ERROR BlockManagerMasterActor: Got two different block
manager registrations on 20140711-081617-711206558-5050-2543-13

The code is at the link below - I did not want to spam the group, although
it is only a couple of pages.
I am baffled - there are no issues when I create a few thousand records,
but things blow up when I try 25 million records, a file of 6 GB or so.

Can someone take a look? It is not a lot of code.

https://drive.google.com/file/d/0B4cgoSGuA4KWUmo3UzBZRmU5M3M/view?usp=sharing