Posted to dev@gobblin.apache.org by "Abhishek Tiwari (JIRA)" <ji...@apache.org> on 2018/06/14 23:07:00 UTC

[jira] [Resolved] (GOBBLIN-98) HiveSerDeConverter. Write to ORC records duplication with queue.capacity=1

     [ https://issues.apache.org/jira/browse/GOBBLIN-98?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Abhishek Tiwari resolved GOBBLIN-98.
------------------------------------
       Resolution: Fixed
    Fix Version/s: 0.13.0

Issue resolved by pull request #2283
[https://github.com/apache/incubator-gobblin/pull/2283]

> HiveSerDeConverter. Write to ORC records duplication with queue.capacity=1
> --------------------------------------------------------------------------
>
>                 Key: GOBBLIN-98
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-98
>             Project: Apache Gobblin
>          Issue Type: Bug
>            Reporter: Abhishek Tiwari
>            Priority: Major
>              Labels: help, wanted
>             Fix For: 0.13.0
>
>
> I'm using HiveSerDeConverter in my project to convert Avro to ORC, just as described here: [gobblin.readthedocs.io/en/latest/case-studies/Writing-ORC-Data](http://gobblin.readthedocs.io/en/latest/case-studies/Writing-ORC-Data/).
> The problem is duplicated records. I have `fork.record.queue.capacity=1` in my job file, but it seems like this parameter only works partially. Let me explain in a bit more depth.
> What my project does: connect to SFTP, read the file list, extract all records from the file inside a ZIP archive, convert them to Avro and then to ORC just like in the link above, and write to the output.
> I have an SFTP server on my localhost for testing purposes. The file that needs to be read contains 7 data records. When I look at the output ORC file, it contains:
> - record 3
> - record 3
> - record 3
> - record 4
> - record 5
> - record 6
> - record 7
> So, the first three records are all replaced by the THIRD record, and after that the output is normal. I tried hard to understand this behavior, so I reimplemented the classes `HiveWritableHdfsDataWriterBuilder` and `HiveWritableHdfsDataWriter` in my package and added some dirty debug info to the write method.
> ```
> public void write(Writable record) throws IOException {
>     Preconditions.checkNotNull(record);
>     log.info("MY WRITABLE: " + record.toString());
>     this.writer.write(record);
>     this.count.incrementAndGet();
> }
> ```
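> One more debug line that makes object reuse easy to spot is an explicit identity hash per record (the identical `@22012154` suffix on every `MY WRITABLE` line in the log below carries the same information, printed by the default `toString`). A hypothetical addition to the same `write` method:
> ```
> // Hypothetical extra debug line: if the same object instance is handed to
> // write() repeatedly, this prints the same number every time.
> log.info("MY WRITABLE HASH: " + System.identityHashCode(record));
> ```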
> I also added some debug info in my converter class and in the DataWriterBuilder.
> ```
> public DataWriter<Writable> build() throws IOException {
>     log.info("Data writer Builder start");
>     ...
> ```
> Let's look at the logs. `Debug message:` is a message from my converter class; `MY WRITABLE` is from the DataWriter `write` method. What I see here: 3 debug messages appear before the DataWriterBuilder class is called. That means 3 records were processed before the writer started to work, so `queue.capacity=1` doesn't really work here, and that's why we have 3 identical records. I may still be ridiculously wrong, so I need help.
> ```
> gobblin.source.extractor.filebased.FileBasedExtractor: Will start downloading file: /home/remoteuser/Files/01.zip:::file.csv
> TSStatsConverter: Debug message:: 2016-05-19 00:00:00;123456789012347;123456789012347;7317.00;4878.00;2439.00
> org.apache.hadoop.hive.serde2.avro.AvroDeserializer: Adding new valid RRID :7c5adecd:155b9394a0f:-8000
> TSStatsConverter: Debug message:: 2016-05-19 00:00:00;123456789012352;123456789012352;732.00;488.00;244.00
> TSStatsConverter: Debug message:: 2016-05-19 00:00:00;123456789012349;123456789012349;366.00;244.00;122.00
> TSStatsConverter: ------------------------
> TSStatsConverter: Data writer Builder start
> TSStatsConverter: Preconditions ran succesfully
> TSStatsConverter: WRITER.WRITABLE.CLASS=org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow
> TSStatsConverter: WRITER.OUTPUT.FORMAT.CLASS=org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
> TSStatsConverter: MY WRITABLE: org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow@22012154
> TSStatsConverter: MY WRITABLE: org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow@22012154
> TSStatsConverter: MY WRITABLE: org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow@22012154
> TSStatsConverter: Debug message:: 2016-05-19 00:00:00;123456789012351;123456789012351;366.00;244.00;122.00
> TSStatsConverter: MY WRITABLE: org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow@22012154
> TSStatsConverter: Debug message:: 2016-05-19 00:00:00;123456789012345;123456789012345;363.00;242.00;121.00
> TSStatsConverter: MY WRITABLE: org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow@22012154
> TSStatsConverter: Debug message:: 2016-05-19 00:00:00;123456789012346;123456789012346;363.00;242.00;121.00
> TSStatsConverter: MY WRITABLE: org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow@22012154
> TSStatsConverter: Debug message:: 2016-05-19 00:00:00;123456789012348;123456789012348;363.00;242.00;121.00
> TSStatsConverter: MY WRITABLE: org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow@22012154
> gobblin.source.extractor.extract.sftp.SftpFsHelper: Transfer finished 1 src: /home/remoteuser/Files/01.zip dest: ?? in 0 at 0.00
> gobblin.source.extractor.filebased.FileBasedExtractor: Unable to get file size. Will skip increment to bytes counter Failed to get size for file at path /home/remoteuser/Files/01.zip:::file.csv due to error No such file
> gobblin.source.extractor.filebased.FileBasedExtractor: Finished reading records from all files
> gobblin.runtime.Task: Extracted 7 data records
> ```
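> To make the suspicion above concrete, here is a tiny self-contained illustration (plain Java, not Gobblin code, and the queue capacity is arbitrary) of how buffering references to a single reused object produces exactly this pattern: every record still sitting in the queue when the consumer starts shows the last value written.
> ```
> import java.util.concurrent.ArrayBlockingQueue;
> import java.util.concurrent.BlockingQueue;
>
> public class ReusedRecordDemo {
>
>     // Stand-in for a reused writable such as OrcSerde$OrcSerdeRow.
>     static final class MutableRecord {
>         String value;
>     }
>
>     public static void main(String[] args) throws InterruptedException {
>         BlockingQueue<MutableRecord> queue = new ArrayBlockingQueue<>(8);
>         MutableRecord reused = new MutableRecord();  // one instance, overwritten for each record
>
>         for (String v : new String[] {"record 1", "record 2", "record 3"}) {
>             reused.value = v;   // the converter overwrites the same object...
>             queue.put(reused);  // ...and the queue only stores a reference to it
>         }
>
>         // The writer starts late and drains the queue: all three entries print "record 3".
>         while (!queue.isEmpty()) {
>             System.out.println(queue.take().value);
>         }
>     }
> }
> ```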
>  
> *Github Url* : https://github.com/linkedin/gobblin/issues/1086 
> *Github Reporter* : *skyshineb* 
> *Github Created At* : 2016-07-05T07:34:05Z 
> *Github Updated At* : 2016-10-29T06:39:15Z 
> h3. Comments 
> ----
> [~stakiar] wrote on 2016-07-09T03:12:27Z : In which file did you set `fork.record.queue.capacity=1`? Can you try setting it in both the `.job` file and `.properties` file?
>  
>  
> *Github Url* : https://github.com/linkedin/gobblin/issues/1086#issuecomment-231511677 
> ----
> *skyshineb* wrote on 2016-07-11T04:11:45Z : I set it in the `.pull` file where my job config lives. Putting it in the `.properties` file didn't help. I tested it earlier by logging this property in different classes, and it looked like it was being picked up fine.
> @sahilTakiar, I found some new info. I created a new converter similar to the original one, but it uses a new `SerDe` object for every record. So, no more duplicated records at the converter and writer level (each record now has its own hash), but the problem is still here: in the output, the first three records are all the third record. Maybe this is a problem with the writer? (A rough sketch of the per-record-object idea is at the end of this comment.) Logs:
> ```
> FileBasedExtractor: Will start downloading file: /home/remoteuser/Files/Terminal_Tethering01.zip:::Export Tethering Subscribers.csv
> TSStatsConverter: Debug message:: 2016-05-19 00:00:00;123456789012347;123456789012347;7317.00;4878.00;2439.00
> AvroDeserializer: Adding new valid RRID :-4b494a8a:155c4244e69:-8000
> InstrumentedConverter: CONVERTED RECORD:::org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow@61bbb8a4
> InstrumentedConverter: CONVERTED HASH:::1639692452
> TSStatsConverter: Debug message:: 2016-05-19 00:00:00;123456789012352;123456789012352;732.00;488.00;244.00
> InstrumentedConverter: CONVERTED RECORD:::org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow@8e53cf0
> InstrumentedConverter: CONVERTED HASH:::149241072
> TSStatsConverter: Debug message:: 2016-05-19 00:00:00;123456789012349;123456789012349;366.00;244.00;122.00
> InstrumentedConverter: CONVERTED RECORD:::org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow@1169d3a
> InstrumentedConverter: CONVERTED HASH:::18259258
> TSStatsConverter: Debug message:: 2016-05-19 00:00:00;123456789012351;123456789012351;366.00;244.00;122.00
> InstrumentedConverter: CONVERTED RECORD:::org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow@35b0050
> InstrumentedConverter: CONVERTED HASH:::56295504
> TSStatsConverter: Debug message:: 2016-05-19 00:00:00;123456789012345;123456789012345;363.00;242.00;121.00
> InstrumentedConverter: CONVERTED RECORD:::org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow@75b734cb
> InstrumentedConverter: CONVERTED HASH:::1974940875
> TSStatsConverter: Debug message:: 2016-05-19 00:00:00;123456789012346;123456789012346;363.00;242.00;121.00
> InstrumentedConverter: CONVERTED RECORD:::org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow@212341d5
> InstrumentedConverter: CONVERTED HASH:::555958741
> TSStatsConverter: Debug message:: 2016-05-19 00:00:00;123456789012348;123456789012348;363.00;242.00;121.00
> InstrumentedConverter: CONVERTED RECORD:::org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow@7f7305e8
> InstrumentedConverter: CONVERTED HASH:::2138244584
> SftpFsHelper: Transfer finished 1 src: /home/remoteuser/Files/Terminal_Tethering01.zip dest: ?? in 0 at 0.00
> FsDataWriterBuilder: ------------------------
> FsDataWriterBuilder: Data writer Builder start
> FileBasedExtractor: Finished reading records from all files
> gobblin.runtime.Task: Extracted 7 data records
> gobblin.runtime.Task: Row quality checker finished with results: 
> FsDataWriter: MY WRITABLE HASH: 1639692452
> FsDataWriter: MY WRITABLE HASH: 149241072
> FsDataWriter: MY WRITABLE HASH: 18259258
> FsDataWriter: MY WRITABLE HASH: 56295504
> FsDataWriter: MY WRITABLE HASH: 1974940875
> FsDataWriter: MY WRITABLE HASH: 555958741
> FsDataWriter: MY WRITABLE HASH: 2138244584
> gobblin.publisher.TaskPublisher: All components finished successfully, checking quality tests
> gobblin.publisher.TaskPublisher: All required test passed for this task passed.
> gobblin.publisher.TaskPublisher: Cleanup for task publisher executed successfully.
> gobblin.runtime.Fork-0: Committing data for fork 0 of task task_cmd_pre_post_1467874476697_0
> ```
> More debug info now. In the destination file, rows that should read:
> ```
> 2016-05-19 00:00:00;123456789012347;123456789012347;7317.00;4878.00;2439.00
> 2016-05-19 00:00:00;123456789012352;123456789012352;732.00;488.00;244.00
> 2016-05-19 00:00:00;123456789012349;123456789012349;366.00;244.00;122.00
> ```
> actually look like:
> ```
> 2016-05-19 00:00:00 123456789012349 123456789012349 366.00  244.00  122.00  
> 2016-05-19 00:00:00 123456789012349 123456789012349 366.00  244.00  122.00  
> 2016-05-19 00:00:00 123456789012349 123456789012349 366.00  244.00  122.00
> ```
> I tested it with files of different sizes and content, but the first three records are still wrong. I will make a new Gobblin build this week and create clean job files. Maybe something will change.
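> For reference, here is a rough sketch of the "new object per record" idea described at the top of this comment. It is written against the Gobblin `Converter` API as I understand it (package and class names may differ between versions), and Hadoop `Text` stands in for the real ORC writable, so treat it as an illustration rather than the actual converter used here.
> ```
> import gobblin.configuration.WorkUnitState;
> import gobblin.converter.Converter;
> import gobblin.converter.DataConversionException;
> import gobblin.converter.SchemaConversionException;
> import gobblin.converter.SingleRecordIterable;
> import org.apache.hadoop.io.Text;
>
> // Sketch only: emit an independent copy per record instead of reusing one object,
> // so records already buffered downstream cannot be changed by later records.
> public class CopyPerRecordConverter extends Converter<String, String, Text, Text> {
>
>   @Override
>   public String convertSchema(String inputSchema, WorkUnitState workUnit)
>       throws SchemaConversionException {
>     return inputSchema;
>   }
>
>   @Override
>   public Iterable<Text> convertRecord(String outputSchema, Text inputRecord, WorkUnitState workUnit)
>       throws DataConversionException {
>     // new Text(inputRecord) copies the underlying bytes, so a later mutation of
>     // inputRecord upstream cannot affect records waiting in the fork queue.
>     return new SingleRecordIterable<>(new Text(inputRecord));
>   }
> }
> ```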
>  
>  
> *Github Url* : https://github.com/linkedin/gobblin/issues/1086#issuecomment-231638514 
> ----
> [~abti] wrote on 2016-07-13T11:44:24Z : @kSky7000 Can you please share your pull file, the changes you made to any classes, and some sample data? I will try to reproduce it locally and debug.
>  
>  
> *Github Url* : https://github.com/linkedin/gobblin/issues/1086#issuecomment-232331431 
> ----
> *skyshineb* wrote on 2016-07-14T16:01:09Z : @abti Thank you very much.
> https://github.com/kSky7000/cuddly-pancake
> The repo contains the pull file, a zip with a data sample, and the sources.
>  
>  
> *Github Url* : https://github.com/linkedin/gobblin/issues/1086#issuecomment-232710146 
> ----
> *skyshineb* wrote on 2016-08-02T10:54:23Z : @abti Really sorry to bother you, but I need help. Any news about this bug? Did you try to reproduce it?
>  
>  
> *Github Url* : https://github.com/linkedin/gobblin/issues/1086#issuecomment-236870993 
> ----
> [~abti] wrote on 2016-08-03T05:14:44Z : Hi Nikolay,
> Apologies for dropping the ball on this. I haven't had a chance to revisit it, and I am traveling right now. I will try to dig into this early next week.
> Regards
> Abhishek
>  
>  
> *Github Url* : https://github.com/linkedin/gobblin/issues/1086#issuecomment-237141426 
> ----
> *skyshineb* wrote on 2016-08-08T11:10:34Z : @abti I found another interesting thing. Today I used the example JSON project with the ORC addition to it and got the same result, with the first three records being duplicated. What do you think about it? For Hadoop I'm using Hortonworks HDP 2.2.
>  
>  
> *Github Url* : https://github.com/linkedin/gobblin/issues/1086#issuecomment-238206554 
> ----
> *jeffwang66* wrote on 2016-08-18T00:32:23Z : I hit the same issue. I used HiveSerDeConverter and HiveWritableHdfsDataWriterBuilder to write records to ORC files. I noticed that some records are duplicated and others are missing, even though I set `fork.record.queue.capacity=1`. Is there any plan to fix this issue? @abti
> @kSky7000 Did you find a solution for it?
> Thanks
>  
>  
> *Github Url* : https://github.com/linkedin/gobblin/issues/1086#issuecomment-240590679 
> ----
> *skyshineb* wrote on 2016-08-24T15:24:50Z : @jeffwang66 Unfortunately I did not resolve this issue.
>  
>  
> *Github Url* : https://github.com/linkedin/gobblin/issues/1086#issuecomment-242104816 
> ----
> *ipostanogov* wrote on 2016-10-29T06:39:15Z : I have the same issue. `fork.record.queue.capacity=1` does not help.
>  
>  
> *Github Url* : https://github.com/linkedin/gobblin/issues/1086#issuecomment-257074751



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)