Posted to issues@spark.apache.org by "Ding Fei (JIRA)" <ji...@apache.org> on 2016/09/28 10:53:20 UTC

[jira] [Comment Edited] (SPARK-17633) textFile() and wholeTextFiles() count difference

    [ https://issues.apache.org/jira/browse/SPARK-17633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15529252#comment-15529252 ] 

Ding Fei edited comment on SPARK-17633 at 9/28/16 10:53 AM:
------------------------------------------------------------

I think this count discrepancy can be considered a bug.

HadoopRDD uses TextInputFormat, which uses LineRecordReader to read lines.
According to the comments in LineRecordReader:

  // If this is not the first split, we always throw away first record
  // because we always (except the last split) read one extra line in
  // next() method.

and

  // We always read one extra line, which lies outside the upper
  // split limit i.e. (end - 1)
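To make that rule concrete, here is a minimal sketch in plain Scala (not
Hadoop code; recordsPerSplit and the hard-coded byte offsets are made up for
illustration) that simulates how LineRecordReader assigns whole lines to
byte-range splits:

  // Simulate LineRecordReader's line-to-split assignment over a string.
  // `splits` are (start, end) byte ranges; `end` is the next split's start.
  def recordsPerSplit(data: String, splits: Seq[(Int, Int)]): Seq[Seq[String]] =
    splits.map { case (start, end) =>
      // Every split except the first throws away the record it starts in,
      // because the previous split has already read one line past its end.
      var pos = if (start == 0) 0 else data.indexOf('\n', start) + 1
      val out = Seq.newBuilder[String]
      // Keep reading while the line starts at or before `end`: the last
      // line read may lie entirely beyond the split limit (the extra line).
      while (pos <= end && pos < data.length) {
        val nl  = data.indexOf('\n', pos)
        val eol = if (nl == -1) data.length else nl
        out += data.substring(pos, eol)
        pos = eol + 1
      }
      out.result()
    }

  val data = "1,\"a\"\n2,\"b\"\n3,\"c\"\n4,\"d\"\n" // 24 bytes, lines at 0,6,12,18
  recordsPerSplit(data, Seq((0, 12), (12, 24))).map(_.size) // => List(3, 1)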

So if we have data.csv of:

	1,"a"
	2,"b"
	3,"c"
	4,"d"

and

  sc.textFile("data.csv", 2).count // two partitions here

If the file were partitioned evenly, the logical split layout would be:

  // split 1
  1,"a"
  2,"b"
  // split 2
  3,"c"
  4,"d"

but LineRecordReader actually works like this:

  // split 1
  1,"a"
  2,"b"
  3,"c"
  // split 2
  4,"d"

The real per-split counts are [3, 1], so [3, 1].sum = 4. For the last
split, the extra record is still read, but it yields only an empty Text():
hitting EOF ends the read and marks that extra record as invalid
(hasNext() == false).
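
This [3, 1] distribution can also be checked empirically from the shell,
e.g. with mapPartitionsWithIndex (assuming the same two-partition data.csv):

  val rdd = sc.textFile("data.csv", 2)
  // Count the records each partition actually holds.
  rdd.mapPartitionsWithIndex((i, it) => Iterator((i, it.size))).collect()
  // => Array((0,3), (1,1)): split 1 holds three records, split 2 holds one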

So if the file is appended to, the last split no longer hits EOF while
reading its extra record. Instead the file pointer (the seek position)
passes the split boundary and reading stops there, yielding exactly one
extra valid record, which is why the count grows by only 1 no matter how
many lines were appended.
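
A rough spark-shell repro of this append case could look like the
following (local-mode sketch; the file name and appended lines are just
examples):

  import java.nio.file.{Files, Paths, StandardOpenOption}

  Files.write(Paths.get("data.csv"),
    "1,\"a\"\n2,\"b\"\n3,\"c\"\n4,\"d\"\n".getBytes)
  val rdd = sc.textFile("data.csv", 2)
  rdd.count() // 4, as expected

  // Append two lines after the splits have already been computed.
  Files.write(Paths.get("data.csv"), "5,\"e\"\n6,\"f\"\n".getBytes,
    StandardOpenOption.APPEND)
  rdd.count() // 5, not 6: the last split reads exactly one extra record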

If the file's RDD has more than one partition and the file shrinks too
much, LineRecordReader will definitely throw an EOFException (I've
verified this), and the job fails.
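
Continuing the same sketch, shrinking the file reproduces that failure:

  // Truncate the file to a single line, below the second split's start.
  Files.write(Paths.get("data.csv"), "1,\"a\"\n".getBytes)
  rdd.count() // second split starts past EOF -> EOFException, job fails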

We define RDDs as immutable, but the files underneath them can still mutate.



> textFile() and wholeTextFiles() count difference
> ------------------------------------------------
>
>                 Key: SPARK-17633
>                 URL: https://issues.apache.org/jira/browse/SPARK-17633
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 1.6.2
>         Environment: Unix/Linux
>            Reporter: Anshul
>
> sc.textFile() creates an RDD of strings from a text file.
> When count is then performed, the line count is correct; but if more than one line is appended to the file manually, counting the same RDD of strings increments the output/result by only 1.
> But in the case of sc.wholeTextFiles() the output/result is correct.


