Posted to issues@spark.apache.org by "Victor Bashurov (JIRA)" <ji...@apache.org> on 2015/04/20 02:38:58 UTC

[jira] [Created] (SPARK-7001) Partitions for a long single line file

Victor Bashurov created SPARK-7001:
--------------------------------------

             Summary: Partitions for a long single line file
                 Key: SPARK-7001
                 URL: https://issues.apache.org/jira/browse/SPARK-7001
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.3.0, 1.2.1, 1.2.0
            Reporter: Victor Bashurov


Here is the issue as reported on stackoverflow.com (http://stackoverflow.com/questions/29689175/spark-partitions-processing-the-file)

I am using Spark 1.2.1 (local mode) to extract and process log information from a file. The file can be larger than 100 MB and contains a single very long line, so I am using a regular expression to split it into log data rows.

MyApp.java
JavaSparkContext sc = new JavaSparkContext(conf);
// textFile() splits the input file into multiple partitions
JavaRDD<String> txtFileRdd = sc.textFile(filename);
JavaRDD<MyLog> logRDD = txtFileRdd.flatMap(LogParser::parseFromLogLine).cache();

LogParser.java
public static Iterable<MyLog> parseFromLogLine(String logline) {
    List<MyLog> logs = new LinkedList<MyLog>();
    Matcher m = PATTERN.matcher(logline);
    // Each regex match becomes one log record
    while (m.find()) {
        logs.add(new MyLog(m.group(0)));
    }
    System.out.println("Logs detected " + logs.size());
    return logs;
}

The actual size of the processed file is about 100 MB, and it contains 323863 log items. When I use Spark to extract the log items from the file, logRDD.count() returns 455651, which is not correct.
I think this happens because of file partitioning; checking the output I see the following:
Logs detected 18694
Logs detected 113104
Logs detected 323863
And the total sum is 455651! So my partitions overlap each other, keeping duplicate items, and I need to prevent that behavior. The workaround is the following:

txtFileRdd.repartition(1).flatMap(LogParser::parseFromLogLine).cache();

This gives the expected result of 323863, but I don't think collapsing everything into a single partition is good for performance.
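
A possible alternative to repartition(1), sketched below and not part of the original report: read the file as a single unsplit record with wholeTextFiles, so the regex scans one complete string and no match can straddle a partition boundary. This assumes the ~100 MB file fits in a single task's memory.

MyApp.java (sketch)
import scala.Tuple2;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;

// wholeTextFiles yields one (path, content) pair per file, so the
// parser sees the entire line exactly once instead of per-partition chunks
JavaPairRDD<String, String> files = sc.wholeTextFiles(filename);
JavaRDD<MyLog> logRDD = files.flatMap(pair -> LogParser.parseFromLogLine(pair._2())).cache();

If the log records begin with a fixed marker, another option would be to set textinputformat.record.delimiter on a Hadoop Configuration and read the file via newAPIHadoopFile, letting Spark split on record boundaries instead of newlines; the exact delimiter depends on the log format, which isn't shown here.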




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org