Posted to common-issues@hadoop.apache.org by "Yulei Li (JIRA)" <ji...@apache.org> on 2016/09/22 06:26:20 UTC

[jira] [Assigned] (HADOOP-13619) missing data intermittently when reading avro file in Spark from Swift storage

     [ https://issues.apache.org/jira/browse/HADOOP-13619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yulei Li reassigned HADOOP-13619:
---------------------------------

    Assignee: Yulei Li

> missing data intermittently when reading avro file in Spark from Swift storage
> ------------------------------------------------------------------------------
>
>                 Key: HADOOP-13619
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13619
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/swift
>    Affects Versions: 2.6.0
>         Environment: Linux EL6
>            Reporter: Steve Yang
>            Assignee: Yulei Li
>            Priority: Blocker
>
> library used: org.apache.hadoop:hadoop-openstack:2.6.0
> We are loading avro files from an Oracle Storage Service server (i.e., a Swift server) into a Spark DataFrame object through the Spark Data Source API. For example:
> return hiveCtx.read().format("com.databricks.spark.avro").load(objectName);
> The number of records is less than the actual record count in the avro file when reading the avro file from Storage Service server using OpenStack Swift API.
> If we run a SQL query on top of the returned data frame, like "select count(\*) as C1 from <temp table>", we can see the record count is smaller than when reading the same avro file from the local file system.
> For a large avro file (awclassic.avro, 105M) the count is always wrong (42451 records vs. 60855). From the log file we can see the reading of the file is split into 4:
> 2016-09-01 14:18:27 INFO HadoopRDD:59 - Input split: swift://qaTestData.oracleswift/testAvro/awclassic.avro:100663296+10044747
> 2016-09-01 14:18:27 INFO HadoopRDD:59 - Input split: swift://qaTestData.oracleswift/testAvro/awclassic.avro:33554432+33554432
> 2016-09-01 14:18:27 INFO HadoopRDD:59 - Input split: swift://qaTestData.oracleswift/testAvro/awclassic.avro:0+33554432
> 2016-09-01 14:18:27 INFO HadoopRDD:59 - Input split: swift://qaTestData.oracleswift/testAvro/awclassic.avro:67108864+33554432
> For a smaller avro file (wine.avro, 19M) the count is sometimes correct (57076 records) and sometimes wrong (26999 records). Running the same Spark SQL 10 times back-to-back produces the following record counts:
> run 1: 26999
> run 2: 26999
> run 3: 57076
> run 4: 57056
> run 5: 57076
> run 6: 26999
> run 7: 57076
> run 8: 57076
> run 9: 57076
> run 10: 57076
> For this wine.avro test case there are two splits:
> 2016-08-31 17:42:32 INFO HadoopRDD:59 - Input split: swift://qaTestData.oracleswift/testAvro/wine.avro:9965269+9965270
> 2016-08-31 17:42:32 INFO HadoopRDD:59 - Input split: swift://qaTestData.oracleswift/testAvro/wine.avro:0+9965269
> I will attach a zip file containing the smaller avro file in question and the debug log sections from reading wine.avro - one from a successful read (C4.ok) and one from a read with missing records (C5.miss).
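
One quick sanity check on the logged "Input split" lines is whether the byte ranges tile the file without gaps or overlaps. The following is a minimal sketch of that check, not part of the original report; the class name SplitCoverage and the method contiguousLength are hypothetical helpers introduced here for illustration, and the check only inspects the logged ranges, not what each split reader actually returned.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not from the report): checks logged split ranges.
public class SplitCoverage {
    // Matches the trailing ":<offset>+<length>" of an "Input split" log line.
    private static final Pattern SPLIT = Pattern.compile(":(\\d+)\\+(\\d+)\\s*$");

    /**
     * Returns the total number of bytes covered if the splits start at
     * offset 0 and are contiguous (no gaps, no overlaps); returns -1 otherwise.
     */
    static long contiguousLength(List<String> logLines) {
        List<long[]> splits = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = SPLIT.matcher(line);
            if (!m.find()) {
                throw new IllegalArgumentException("No split range in: " + line);
            }
            splits.add(new long[] {Long.parseLong(m.group(1)), Long.parseLong(m.group(2))});
        }
        // Log lines are not emitted in offset order, so sort by offset first.
        splits.sort(Comparator.comparingLong(s -> s[0]));
        long expectedOffset = 0;
        for (long[] s : splits) {
            if (s[0] != expectedOffset) {
                return -1; // gap or overlap between consecutive splits
            }
            expectedOffset = s[0] + s[1];
        }
        return expectedOffset;
    }

    public static void main(String[] args) {
        String base = "swift://qaTestData.oracleswift/testAvro/";
        List<String> wine = Arrays.asList(
            base + "wine.avro:9965269+9965270",
            base + "wine.avro:0+9965269");
        List<String> awclassic = Arrays.asList(
            base + "awclassic.avro:100663296+10044747",
            base + "awclassic.avro:33554432+33554432",
            base + "awclassic.avro:0+33554432",
            base + "awclassic.avro:67108864+33554432");
        System.out.println("wine.avro bytes covered: " + contiguousLength(wine));
        System.out.println("awclassic.avro bytes covered: " + contiguousLength(awclassic));
    }
}
```

For the splits logged above, the ranges are contiguous and sum to 19930539 bytes for wine.avro and 110708043 bytes for awclassic.avro, which is consistent with the stated 19M and 105M file sizes. That suggests the split boundaries themselves cover each file, so the missing records may originate in how each range is read rather than in how the file is divided, though this check alone cannot confirm that.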


