You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Victor Bashurov (JIRA)" <ji...@apache.org> on 2015/04/20 02:38:58 UTC
[jira] [Created] (SPARK-7001) Partitions for a long single line
file
Victor Bashurov created SPARK-7001:
--------------------------------------
Summary: Partitions for a long single line file
Key: SPARK-7001
URL: https://issues.apache.org/jira/browse/SPARK-7001
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.3.0, 1.2.1, 1.2.0
Reporter: Victor Bashurov
Here is the issue from stackoverflow.com (http://stackoverflow.com/questions/29689175/spark-partitions-processing-the-file)
I am using Spark 1.2.1 (local mode) to extract and process log information from file. The size of the file could be more than 100Mb. The file contains a very long single line so I'm using regular expression to split this file into log data rows.
MyApp.java
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> txtFileRdd = sc.textFile(filename);
JavaRDD<MyLog> logRDD = txtFileRdd.flatMap(LogParser::parseFromLogLine).cache();
LogParser.java
public static Iterable<MyLog> parseFromLogLine(String logline) {
List<MyLog> logs = new LinkedList<MyLog>();
Matcher m = PATTERN.matcher(logline);
while (m.find()) {
logs.add(new MyLog(m.group(0)));
}
System.out.println("Logs detected " + logs.size());
return logs;
}
Actual size of the processed file is about 100 Mb and it actually contains 323863 log items. When I use Spark to extract my log items from file I get 455651 [logRDD.count()] log items which is not correct.
I think it happens because of file partitions, checking the output I see the following:
Logs detected 18694
Logs detected 113104
Logs detected 323863
And the total sum is 455651! So I see that my partitions are merged with each other keeping duplicate items and I need to prevent that behavior. The workaround is the following method:
txtFileRdd.repartition(1).flatMap(LogParser::parseFromLogLine).cache();
And it'll give me the desired result 323863, but I don't think that it's good for performance
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org