Posted to issues@hbase.apache.org by "Laxman (Created) (JIRA)" <ji...@apache.org> on 2012/03/12 16:16:37 UTC

[jira] [Created] (HBASE-5564) Bulkload is discarding duplicate records

Bulkload is discarding duplicate records
----------------------------------------

                 Key: HBASE-5564
                 URL: https://issues.apache.org/jira/browse/HBASE-5564
             Project: HBase
          Issue Type: Bug
          Components: mapreduce
    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
         Environment: HBase 0.92
            Reporter: Laxman
            Assignee: Laxman


Duplicate records are discarded when they exist in the same input file, and more specifically when they exist in the same split.
Duplicate records are kept if they come from different splits.

Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Anoop Sam John (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13236441#comment-13236441 ] 

Anoop Sam John commented on HBASE-5564:
---------------------------------------

{quote}
In bulkload, if multiple records have the same timestamp, then only the last KV entry processed by the reducer will be persisted (TreeSet in Reducer)
{quote}

It is the first KV processed by the Reducer that is persisted, right?

Yes, I agree with you: it may not be possible to predict on the reducer side which one is the latest.
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are kept if they come from different splits.
> Version under test: HBase 0.92


[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Anoop Sam John (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233238#comment-13233238 ] 

Anoop Sam John commented on HBASE-5564:
---------------------------------------

@Laxman
ImportTsv
{code}
+    // If timestamp option is not specified, use current system time.
+    long timstamp = conf.getLong(TIMESTAMP_CONF_KEY, System.currentTimeMillis());
+
+    // Set it back to replace invalid timestamp (non-numeric) with current system time
+    conf.setLong(TIMESTAMP_CONF_KEY, timstamp);
{code}

Doing this will use the same timestamp across all the mappers. Is that the intention of this change? If so, in TsvImporterMapper, conf.getLong(ImportTsv.TIMESTAMP_CONF_KEY, 0) will always find a value in the conf.
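
The effect described above can be sketched without Hadoop. In the snippet below a plain Map stands in for the job Configuration, and the key value, the helper methods and the fallback on non-numeric input are assumptions made for illustration, not the actual ImportTsv code. The driver resolves the timestamp once and writes it back, so every mapper that reads the key later sees the same value and never reaches its own default.

```java
import java.util.HashMap;
import java.util.Map;

public class SharedTimestampSketch {
    // Hypothetical key value; it stands in for ImportTsv.TIMESTAMP_CONF_KEY.
    static final String TIMESTAMP_CONF_KEY = "importtsv.timestamp";

    // Mimics a getLong(key, default) lookup on a string-valued configuration.
    static long getLong(Map<String, String> conf, String key, long defaultValue) {
        String raw = conf.get(key);
        if (raw == null) {
            return defaultValue;
        }
        try {
            return Long.parseLong(raw.trim());
        } catch (NumberFormatException e) {
            // Assumption for this sketch: a non-numeric value falls back to the default.
            return defaultValue;
        }
    }

    // Driver side: resolve the timestamp once and store it back in the conf.
    static void resolveTimestamp(Map<String, String> conf, long now) {
        long timestamp = getLong(conf, TIMESTAMP_CONF_KEY, now);
        conf.put(TIMESTAMP_CONF_KEY, Long.toString(timestamp));
    }

    // Mapper side: the fallback is only used when the key is absent or invalid.
    static long mapperTimestamp(Map<String, String> conf, long fallback) {
        return getLong(conf, TIMESTAMP_CONF_KEY, fallback);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        resolveTimestamp(conf, 1000L);                  // driver, "now" fixed for the demo
        System.out.println(mapperTimestamp(conf, 0L));  // first mapper: 1000
        System.out.println(mapperTimestamp(conf, 0L));  // second mapper: 1000
    }
}
```

With the value written back by the driver, the mapper-side default of 0 is never used, which is the point raised in the question above.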
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>         Attachments: HBASE-5564_trunk.patch
>
>
> Duplicate records are discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are kept if they come from different splits.
> Version under test: HBase 0.92


[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Anoop Sam John (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233417#comment-13233417 ] 

Anoop Sam John commented on HBASE-5564:
---------------------------------------

Comment from Jesse Yates
{quote}
The question is, if you have a TSV file with the same row key, which value should be considered the most recent version? Should any of them - maybe that is actually a problem and we want to have a warning/error when that occurs?
{quote}

Do we need to handle this? The issue is the TreeSet used by PutSortReducer and KeyValueSortReducer, as mentioned by Laxman.
In normal data insertion using Puts, all the duplicate values go into the memstore (and finally into HFiles), and a scan retrieves the last one entered. In the bulk load case only the first entry is inserted, because the data structure drops duplicates. Is this a behaviour mismatch? It depends on which entry in the TSV file should be considered the most recent version. If we say that the last entry in the file is the most recent version...
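
The TreeSet behaviour mentioned above can be reproduced in isolation. This is a minimal sketch, not the reducer code: the Cell class and the KEY_ONLY comparator are hypothetical stand-ins for a KeyValue and its comparator, under the assumption that the value is not part of the sort key. When two elements compare as equal, TreeSet.add() keeps the element already in the set and ignores the new one, so the first entry wins.

```java
import java.util.Comparator;
import java.util.TreeSet;

public class TreeSetDuplicateSketch {
    // Hypothetical stand-in for a KeyValue: the value is not part of the sort key.
    static final class Cell {
        final String row;
        final long timestamp;
        final String value;

        Cell(String row, long timestamp, String value) {
            this.row = row;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    // Orders cells by row and timestamp only, so two cells that differ
    // only in their value compare as equal.
    static final Comparator<Cell> KEY_ONLY =
        Comparator.comparing((Cell c) -> c.row).thenComparingLong(c -> c.timestamp);

    // Adds one cell per value, all with the same row and timestamp,
    // and returns the value of the cell that remains in the set.
    static String survivingValue(String... values) {
        TreeSet<Cell> sorted = new TreeSet<>(KEY_ONLY);
        for (String v : values) {
            // add() returns false on a tie and leaves the existing element in place
            sorted.add(new Cell("row1", 42L, v));
        }
        return sorted.first().value;
    }

    public static void main(String[] args) {
        System.out.println(survivingValue("first", "second", "third")); // prints "first"
    }
}
```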

                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>         Attachments: HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are kept if they come from different splits.
> Version under test: HBase 0.92


[jira] [Updated] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Laxman (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Laxman updated HBASE-5564:
--------------------------

    Status: Patch Available  (was: Open)

Ted, thanks for your review. I have attached the patch after addressing the review comments.
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>         Attachments: HBASE-5564_trunk.1.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are kept if they come from different splits.
> Version under test: HBase 0.92


[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Todd Lipcon (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13227742#comment-13227742 ] 

Todd Lipcon commented on HBASE-5564:
------------------------------------

I think it's a feature, not a bug, that the timestamps are all identical. The whole point is that, in a bulk-load-only workflow, you can identify each bulk load exactly, and correlate it to the MR job that inserted it. If you want to use custom timestamps, you should specify a timestamp column in your data, or write your own MR job (ImportTsv is just an example which is useful for some cases, but for anything advanced I would expect users to write their own code).
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>
> Duplicate records are discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are kept if they come from different splits.
> Version under test: HBase 0.92


[jira] [Updated] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan updated HBASE-5564:
------------------------------------------

    Status: Open  (was: Patch Available)
    
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, 5564v5.txt, HBASE-5564.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are kept if they come from different splits.
> Version under test: HBase 0.92


[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241307#comment-13241307 ] 

stack commented on HBASE-5564:
------------------------------

Committed to trunk.  Thanks for the patch Laxman.  Thanks for the reminder on updating the count Uma. It seems that my minor addition only stopped the count rising so I didn't have to change the findbugs count (the test build was seeing two new findbug warnings when in fact there were none -- a variable name change was making it think the findbugs count had gone up).
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, 5564v5.txt, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are kept if they come from different splits.
> Version under test: HBase 0.92


[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233910#comment-13233910 ] 

Hadoop QA commented on HBASE-5564:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12519054/HBASE-5564_trunk.1.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 165 new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed these unit tests:
                  org.apache.hadoop.hbase.regionserver.TestServerCustomProtocol
                  org.apache.hadoop.hbase.mapreduce.TestImportTsv
                  org.apache.hadoop.hbase.mapred.TestTableMapReduce
                  org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1231//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1231//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1231//console

This message is automatically generated.
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>         Attachments: HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are kept if they come from different splits.
> Version under test: HBase 0.92


[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238506#comment-13238506 ] 

Hadoop QA commented on HBASE-5564:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12519960/HBASE-5564_trunk.2.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 2 new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed these unit tests:
                  org.apache.hadoop.hbase.mapreduce.TestImportTsv
                  org.apache.hadoop.hbase.mapred.TestTableMapReduce
                  org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1305//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1305//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1305//console

This message is automatically generated.
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are kept if they come from different splits.
> Version under test: HBase 0.92


[jira] [Updated] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Laxman (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Laxman updated HBASE-5564:
--------------------------

    Status: Open  (was: Patch Available)
    
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are kept if they come from different splits.
> Version under test: HBase 0.92


[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13260282#comment-13260282 ] 

stack commented on HBASE-5564:
------------------------------

@Laxman Any luck?
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, 5564v5.txt, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are kept if they come from different splits.
> Version under test: HBase 0.92


[jira] [Updated] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Laxman (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Laxman updated HBASE-5564:
--------------------------

    Status: Open  (was: Patch Available)
    
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>         Attachments: HBASE-5564_trunk.patch
>
>
> Duplicate records are discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are kept if they come from different splits.
> Version under test: HBase 0.92


[jira] [Updated] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Laxman (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Laxman updated HBASE-5564:
--------------------------

    Status: Patch Available  (was: Open)

The QA bot didn't pick up the previous patch, so I am resubmitting it.
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>         Attachments: HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are kept if they come from different splits.
> Version under test: HBase 0.92


[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Laxman (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13240189#comment-13240189 ] 

Laxman commented on HBASE-5564:
-------------------------------

It's my mistake, stack. While fixing the findbugs warning, I overlooked the Base64 behavior; I was expecting UTF-8 encoding from this utility. Thanks for pointing it out. I will fix this.

I will also add some unit tests to check that the timestamps are parsed properly.
Thanks, stack, for pointing out the problem.
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are kept if they come from different splits.
> Version under test: HBase 0.92


[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Laxman (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13240241#comment-13240241 ] 

Laxman commented on HBASE-5564:
-------------------------------

I found another problem in my testing: a line with an invalid timestamp does not respect the skip.bad.lines configuration.
I will update the patch for this as well and add some unit tests.
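
The expected behaviour can be sketched as follows. This is only an illustration under stated assumptions: the parseTimestamps helper, the tsIndex parameter and the skipBadLines flag are hypothetical stand-ins for the importtsv mapper logic and its skip.bad.lines option, not the code in the patch. A line whose timestamp field is not numeric is either dropped or reported as an error, depending on the flag.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BadTimestampSketch {
    // Parses the timestamp field of each tab-separated line.
    // tsIndex and skipBadLines are hypothetical stand-ins for the importtsv options.
    static List<Long> parseTimestamps(List<String> lines, int tsIndex, boolean skipBadLines) {
        List<Long> timestamps = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.split("\t", -1);
            try {
                if (tsIndex >= fields.length) {
                    throw new NumberFormatException("missing timestamp field");
                }
                timestamps.add(Long.parseLong(fields[tsIndex].trim()));
            } catch (NumberFormatException e) {
                if (!skipBadLines) {
                    // Bad lines are fatal unless the skip option is enabled.
                    throw new IllegalArgumentException("Bad line: " + line, e);
                }
                // skipBadLines is set: drop this line and continue with the next one.
            }
        }
        return timestamps;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("r1\t100\tv1", "r2\tabc\tv2", "r3\t300\tv3");
        System.out.println(parseTimestamps(lines, 1, true)); // prints [100, 300]
    }
}
```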
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are kept if they come from different splits.
> Version under test: HBase 0.92


[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Laxman (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228212#comment-13228212 ] 

Laxman commented on HBASE-5564:
-------------------------------

bq. ts++, or ts--, could be an option?

ts++ or ts-- will not solve this problem, because each mapper runs in a new JVM where ts is reset to its initial value. So there is still a chance of a timestamp collision.
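
A small sketch of the collision, assuming for illustration that two map tasks start from the same base timestamp. The assignTimestamps and countCollisions helpers are hypothetical. Each call models one mapper JVM with its own counter, so incrementing makes timestamps unique within a task but not across tasks.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class CounterCollisionSketch {
    // Models one map task: the counter starts from the base value again,
    // because no state is shared with the other task JVMs.
    static long[] assignTimestamps(long base, int records) {
        long[] timestamps = new long[records];
        long next = base;
        for (int i = 0; i < records; i++) {
            timestamps[i] = next++;
        }
        return timestamps;
    }

    // Counts how many timestamps of task b were also handed out by task a.
    static int countCollisions(long[] a, long[] b) {
        Set<Long> seen = new HashSet<>();
        for (long ts : a) {
            seen.add(ts);
        }
        int collisions = 0;
        for (long ts : b) {
            if (seen.contains(ts)) {
                collisions++;
            }
        }
        return collisions;
    }

    public static void main(String[] args) {
        long[] mapperA = assignTimestamps(1000L, 3);
        long[] mapperB = assignTimestamps(1000L, 3);
        System.out.println(Arrays.toString(mapperA));           // [1000, 1001, 1002]
        System.out.println(Arrays.toString(mapperB));           // [1000, 1001, 1002]
        System.out.println(countCollisions(mapperA, mapperB));  // 3
    }
}
```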

bq. that the timestamps are all identical. The whole point is that, in a bulk-load-only workflow, you can identify each bulk load exactly, and correlate it to the MR job that inserted it.

No, Todd. The implementation, at least, is buggy and does not match this expected behavior.
A new timestamp is generated for each map task (i.e., for each split) in TsvImporterMapper.doSetup.
Please check my previous comments.

bq. So this is only about ImportTsv? Should change the title in that case.
I'm not aware of what other tools come under bulkload. The bulkload documentation talks only about importtsv.
http://hbase.apache.org/bulk-loads.html

But if you feel we should change the title, feel free to modify it.

bq. If you want to use custom timestamps, you should specify a timestamp column in your data, or write your own MR job (ImportTsv is just an example which is useful for some cases, but for anything advanced I would expect users to write their own code)

I think we can allow the timestamp column to be specified as an argument (like the ROWKEY column).
Example: importtsv.columns='HBASE_ROW_KEY, HBASE_TS_KEY, emp:name,emp:sal,dept:code'

This makes importtsv more usable. Otherwise, the user has to copy and paste the entire importtsv code just to make this minor modification.

Please let me know your suggestions on this.
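
To make the proposal concrete, here is a rough sketch of how a columns specification with a timestamp entry could be interpreted. It is an assumption-laden illustration, not the patch: the indexOf and timestampFor helpers are hypothetical, and HBASE_TS_KEY is simply the name suggested in the example above.

```java
public class ColumnSpecSketch {
    // Column name proposed in the comment above; treated here as a plain marker.
    static final String TIMESTAMP_SPEC = "HBASE_TS_KEY";

    // Returns the position of the named column in the comma-separated spec, or -1.
    static int indexOf(String columnsSpec, String name) {
        String[] columns = columnsSpec.split(",");
        for (int i = 0; i < columns.length; i++) {
            if (columns[i].trim().equals(name)) {
                return i;
            }
        }
        return -1;
    }

    // Reads the timestamp from a tab-separated line; uses the fallback
    // when the spec does not declare a timestamp column.
    static long timestampFor(String columnsSpec, String tsvLine, long fallback) {
        int tsIndex = indexOf(columnsSpec, TIMESTAMP_SPEC);
        if (tsIndex < 0) {
            return fallback;
        }
        String[] fields = tsvLine.split("\t", -1);
        if (tsIndex >= fields.length) {
            throw new IllegalArgumentException("Line has no timestamp field: " + tsvLine);
        }
        return Long.parseLong(fields[tsIndex].trim());
    }

    public static void main(String[] args) {
        String spec = "HBASE_ROW_KEY, HBASE_TS_KEY, emp:name,emp:sal,dept:code";
        String line = "row1\t1331565397000\talice\t100\tD1";
        System.out.println(timestampFor(spec, line, 0L)); // prints 1331565397000
    }
}
```

With a spec that has no HBASE_TS_KEY entry, timestampFor returns the fallback, which would correspond to the existing behaviour of using a job-supplied timestamp.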

                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>
> Duplicate records are getting discarded when duplicate records exist in the same input file, and more specifically if they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


        

[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242785#comment-13242785 ] 

stack commented on HBASE-5564:
------------------------------

Thanks Ted.  I reverted the patch for now.  Laxman, mind taking a looksee at the failures Ted found in TestImportTsv#testMROnTableWithCustomMapper?
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, 5564v5.txt, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are getting discarded when duplicate records exist in the same input file, and more specifically if they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


        

[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Ted Yu (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233179#comment-13233179 ] 

Ted Yu commented on HBASE-5564:
-------------------------------

{code}
+    public int getTimestapKeyColumnIndex() {
{code}
Please fix typo in the above method name.
{code}
+      "  -D" + TIMESTAMP_CONF_KEY + "=currentTimeAsLong - use the specified timestamp for the import. This option is ignored if HBASE_TS_KEY is specfied in 'importtsv.columns'\n" +
{code}
Please wrap the long line above.
{code}
+    // Should never get 0.
+    ts = conf.getLong(ImportTsv.TIMESTAMP_CONF_KEY, 0);
{code}
Please explain why 0 wouldn't be returned.
{code}
+      if (parser.getTimestapKeyColumnIndex() != -1)
+        ts = parsed.getTimestamp();
{code}
Please use curly braces around the assignment.
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>         Attachments: HBASE-5564_trunk.patch
>
>
> Duplicate records are getting discarded when duplicate records exist in the same input file, and more specifically if they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


        

[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Uma Maheswara Rao G (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241167#comment-13241167 ] 

Uma Maheswara Rao G commented on HBASE-5564:
--------------------------------------------

Also, don't forget to update the count in test-patch.properties to match the present count if we fix any existing findbugs warnings.

+Uma
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, 5564v5.txt, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are getting discarded when duplicate records exist in the same input file, and more specifically if they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


        

[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Laxman (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234082#comment-13234082 ] 

Laxman commented on HBASE-5564:
-------------------------------

All MR tests seem to be failing. The failures are not caused by the patch.
I will check these failures.

@anoop
In bulkload, if multiple records have the same timestamp, then only the last KV entry processed by the reducer will be persisted (TreeSet in Reducer). I don't see this as a behavior inconsistency. Bulkload can't judge which KV entry should be retained (considering that duplicate records may exist across input splits/files). So, in this case, the user can develop a custom MR job to achieve this functionality.
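
A side note on the TreeSet behavior mentioned above, as a minimal sketch rather than the actual reducer code. java.util.TreeSet.add leaves the set unchanged when an element that compares equal is already present, so with a comparator that ignores the value, only one entry per (row, column, timestamp) survives, and it is the one that was added first. The Cell class and KEY_ORDER comparator are hypothetical simplifications of KeyValue and its comparator.

```java
import java.util.Comparator;
import java.util.TreeSet;

final class TreeSetDedupSketch {
    // Hypothetical simplified cell; the real KeyValue has more fields.
    static final class Cell {
        final String row;
        final String column;
        final long ts;
        final String value;

        Cell(String row, String column, long ts, String value) {
            this.row = row;
            this.column = column;
            this.ts = ts;
            this.value = value;
        }
    }

    // Orders by row, then column, then newest timestamp first.
    // The value is deliberately not part of the comparison.
    static final Comparator<Cell> KEY_ORDER = (a, b) -> {
        int c = a.row.compareTo(b.row);
        if (c != 0) {
            return c;
        }
        c = a.column.compareTo(b.column);
        if (c != 0) {
            return c;
        }
        return Long.compare(b.ts, a.ts);
    };

    // Adds two cells with identical keys and reports "<size>:<surviving value>".
    static String survivor(String firstValue, String secondValue) {
        TreeSet<Cell> cells = new TreeSet<>(KEY_ORDER);
        cells.add(new Cell("row1", "emp:name", 100L, firstValue));
        // Compares equal to the cell above, so the set is left unchanged.
        cells.add(new Cell("row1", "emp:name", 100L, secondValue));
        return cells.size() + ":" + cells.first().value;
    }

    public static void main(String[] args) {
        System.out.println(survivor("first", "second"));
    }
}
```

Which record counts as "first" still depends on the order in which values reach the reducer, so the surviving value is not something the input file order alone determines.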
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>         Attachments: HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are getting discarded when duplicate records exist in the same input file, and more specifically if they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


        

[jira] [Updated] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan updated HBASE-5564:
------------------------------------------

    Status: Patch Available  (was: Reopened)
    
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, 5564v5.txt, HBASE-5564.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are getting discarded when duplicate records exist in the same input file, and more specifically if they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


        

[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Laxman (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228297#comment-13228297 ] 

Laxman commented on HBASE-5564:
-------------------------------

Scope of this issue.

1) Avoid the behavioral inconsistency with timestamp parameter.

{noformat}
Currently in code,
a) If the timestamp parameter is configured, duplicate records will be overwritten.
b) If it is not configured, some duplicate records are maintained as different versions.
{noformat}

This fix should be in line with the expectation Todd has mentioned.

bq. The whole point is that, in a bulk-load-only workflow, you can identify each bulk load exactly, and correlate it to the MR job that inserted it.

2) Provide an option to look up timestamp column value from input data. (Like ROWKEY column)
Example : importtsv.columns='HBASE_ROW_KEY, HBASE_TS_KEY, emp:name,emp:sal,dept:code'
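
One way the two points above could fit together, as a sketch only. A plain Map stands in for the job configuration, and JobTimestampSketch, TS_CONF_KEY, prepare and resolve are hypothetical names, not the patch itself. The idea is that the driver fixes a single timestamp per job when the user gave none, and a timestamp column value, if present, takes precedence per record.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; a Map stands in for the job configuration.
final class JobTimestampSketch {
    // Hypothetical configuration key, not a real HBase property name.
    static final String TS_CONF_KEY = "example.import.timestamp";

    // Driver side: pin one timestamp for the whole job unless the user set one.
    static void prepare(Map<String, String> conf, long now) {
        conf.putIfAbsent(TS_CONF_KEY, Long.toString(now));
    }

    // Task side: a timestamp parsed from the input column wins,
    // otherwise every task falls back to the same job-wide value.
    static long resolve(Map<String, String> conf, Long columnTimestamp) {
        if (columnTimestamp != null) {
            return columnTimestamp;
        }
        return Long.parseLong(conf.get(TS_CONF_KEY));
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        prepare(conf, System.currentTimeMillis());
        System.out.println(resolve(conf, null) == resolve(conf, null));
    }
}
```

Because the value is decided once before any task starts, all splits see the same timestamp, which matches the expectation that a bulk load can be correlated to the MR job that inserted it.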

I will submit the patch with the above mentioned approach.

Any other addons?
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>
> Duplicate records are getting discarded when duplicate records exist in the same input file, and more specifically if they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


        

[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13292746#comment-13292746 ] 

ramkrishna.s.vasudevan commented on HBASE-5564:
-----------------------------------------------

We found the problem.  It was caused by a space that the latest patch introduced in the testcase argument
'" = org.apache.hadoop.hbase.mapreduce.TsvImporterCustomTestMapper",'.  There should not be any space before or after '='.

Will rebase the patch so that it can be recommitted.
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, 5564v5.txt, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are getting discarded when duplicate records exist in the same input file, and more specifically if they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


        

[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235327#comment-13235327 ] 

Zhihong Yu commented on HBASE-5564:
-----------------------------------

We have been using 80 characters as line length for a while.

At Google, line length is enforced, though the limit is a bit longer.

Feel free to start discussion on dev@hbase about the acceptable limit.
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are getting discarded when duplicate records exist in the same input file, and more specifically if they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


        

[jira] [Updated] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Laxman (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Laxman updated HBASE-5564:
--------------------------

    Attachment: HBASE-5564_trunk.1.patch
    
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>         Attachments: HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are getting discarded when duplicate records exist in the same input file, and more specifically if they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


        

[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228467#comment-13228467 ] 

stack commented on HBASE-5564:
------------------------------

Googling it, it's either that something is already listening on the port or your 127.0.0.1 has been removed?   See http://www-01.ibm.com/support/docview.wss?uid=swg21233733
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>
> Duplicate records are getting discarded when duplicate records exist in the same input file, and more specifically if they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


        

[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239392#comment-13239392 ] 

Hadoop QA commented on HBASE-5564:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12520096/HBASE-5564_trunk.3.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

     -1 core tests.  The patch failed these unit tests:
                       org.apache.hadoop.hbase.mapreduce.TestImportTsv
                  org.apache.hadoop.hbase.mapred.TestTableMapReduce
                  org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1319//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1319//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1319//console

This message is automatically generated.
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are getting discarded when duplicate records exist in the same input file, and more specifically if they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


        

[jira] [Updated] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan updated HBASE-5564:
------------------------------------------

    Attachment: HBASE-5564.patch

New patch for trunk.  This time the testcases should run.  Pls review and provide your comments.
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, 5564v5.txt, HBASE-5564.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are getting discarded when duplicate records exist in the same input file, and more specifically if they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


        

[jira] [Updated] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "stack (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-5564:
-------------------------

    Status: Patch Available  (was: Open)
    
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, 5564v5.txt, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are getting discarded when duplicate records exist in the same input file, and more specifically if they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


        

[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Jesse Yates (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13227696#comment-13227696 ] 

Jesse Yates commented on HBASE-5564:
------------------------------------

Hmm, I think you're right about this being a problem. It would be totally reasonable to change
{code}
       KeyValue kv = new KeyValue(
            lineBytes, parsed.getRowKeyOffset(), parsed.getRowKeyLength(),
            parser.getFamily(i), 0, parser.getFamily(i).length,
            parser.getQualifier(i), 0, parser.getQualifier(i).length,
            ts,
            KeyValue.Type.Put,
            lineBytes, parsed.getColumnOffset(i), parsed.getColumnLength(i));
{code}

to use something like: {code}ts++{code}

The question is, if you have a TSV file with the same row key, which value should be considered the most recent version? Should any of them - maybe that is actually a problem and we want to have a warning/error when that occurs?
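
As a sketch of the warning idea, something along these lines could flag repeated row keys within one batch of lines. DuplicateRowKeySketch is a hypothetical helper and not part of ImportTsv; it assumes the row key is the first column, and it only sees the lines handed to it, so duplicates spread across splits would not be caught this way.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper; assumes the row key is the first column of each line.
final class DuplicateRowKeySketch {
    // Returns the row keys that occur more than once, in first-repeat order.
    static Set<String> duplicateRowKeys(String[] lines, char separator) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new LinkedHashSet<>();
        for (String line : lines) {
            int end = line.indexOf(separator);
            String rowKey = end < 0 ? line : line.substring(0, end);
            // add() returns false when the key was already seen.
            if (!seen.add(rowKey)) {
                duplicates.add(rowKey);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        String[] lines = {"r1\tAlice", "r2\tBob", "r1\tCarol"};
        for (String rowKey : duplicateRowKeys(lines, '\t')) {
            System.err.println("WARN: duplicate row key " + rowKey);
        }
    }
}
```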
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>
> Duplicate records are getting discarded when duplicate records exist in the same input file, and more specifically if they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


        

[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295345#comment-13295345 ] 

stack commented on HBASE-5564:
------------------------------

+1 on commit.
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, 5564v5.txt, HBASE-5564.patch, HBASE-5564_1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are getting discarded when duplicate records exist in the same input file, and more specifically if they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


        

[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13240345#comment-13240345 ] 

Hadoop QA commented on HBASE-5564:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12520255/HBASE-5564_trunk.4_final.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 2 new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

     -1 core tests.  The patch failed these unit tests:
                       org.apache.hadoop.hbase.mapreduce.TestImportTsv
                  org.apache.hadoop.hbase.mapred.TestTableMapReduce
                  org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1326//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1326//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1326//console

This message is automatically generated.
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are getting discarded when duplicate records exist in the same input file, and more specifically if they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


        

[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243043#comment-13243043 ] 

Hudson commented on HBASE-5564:
-------------------------------

Integrated in HBase-TRUNK-security #155 (See [https://builds.apache.org/job/HBase-TRUNK-security/155/])
    HBASE-5564 Bulkload is discarding duplicate records (Revision 1307629)

     Result = SUCCESS
stack : 
Files : 
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/mapreduce/TsvImporterMapper.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/mapreduce/TestImportTsv.java

                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, 5564v5.txt, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are getting discarded when duplicate records exist in the same input file and, more specifically, if they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


        

[jira] [Updated] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Laxman (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Laxman updated HBASE-5564:
--------------------------

    Status: Open  (was: Patch Available)
    
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are getting discarded when duplicate records exist in the same input file and, more specifically, if they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


        

[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Laxman (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235320#comment-13235320 ] 

Laxman commented on HBASE-5564:
-------------------------------

Ted, all these comments are related to line wrapping.
IMO, an 80-character line length is too short and makes the code a bit ugly.

If you strongly feel we need to stick to the 80-character limit, I will fix these.

                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are getting discarded when duplicate records exist in the same input file and, more specifically, if they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


        

[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233424#comment-13233424 ] 

Zhihong Yu commented on HBASE-5564:
-----------------------------------

@Laxman:
Please take a look at https://builds.apache.org/job/PreCommit-HBASE-Build/1229/console and see which test timed out.

I have sent an email to builds@apache.org, informing them of the issue for Hadoop QA.
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>         Attachments: HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are getting discarded when duplicate records exist in the same input file and, more specifically, if they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


        

[jira] [Updated] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Laxman (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Laxman updated HBASE-5564:
--------------------------

    Status: Patch Available  (was: Open)
    
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>         Attachments: HBASE-5564_trunk.patch
>
>
> Duplicate records are getting discarded when duplicate records exist in the same input file and, more specifically, if they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


        

[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "ramkrishna.s.vasudevan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234244#comment-13234244 ] 

ramkrishna.s.vasudevan commented on HBASE-5564:
-----------------------------------------------

The test cases that failed commonly fail in HadoopQA.  As your patch changes the ImportTsv part, people will be worried.
But as you have run the tests locally and confirmed that they pass, the main build should pass as well.
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>         Attachments: HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are getting discarded when duplicate records exist in the same input file and, more specifically, if they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


        

[jira] [Updated] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Laxman (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Laxman updated HBASE-5564:
--------------------------

    Status: Patch Available  (was: Open)
    
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are getting discarded when duplicate records exist in the same input file and, more specifically, if they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


        

[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242065#comment-13242065 ] 

Hudson commented on HBASE-5564:
-------------------------------

Integrated in HBase-TRUNK-security #154 (See [https://builds.apache.org/job/HBase-TRUNK-security/154/])
    HBASE-5564 Bulkload is discarding duplicate records (Revision 1306907)

     Result = FAILURE
stack : 
Files : 
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/mapreduce/TsvImporterMapper.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/mapreduce/TestImportTsv.java

                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, 5564v5.txt, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are getting discarded when duplicate records exist in the same input file and, more specifically, if they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


        

[jira] [Issue Comment Edited] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "ramkrishna.s.vasudevan (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13227597#comment-13227597 ] 

ramkrishna.s.vasudevan edited comment on HBASE-5564 at 3/12/12 3:21 PM:
------------------------------------------------------------------------

I think this is a bug and not intentional behavior.

The use of TreeSet in the code snippet below is causing the issue.

PutSortReducer.reduce()
======================
{code}
      // A TreeSet silently drops any KeyValue that compares equal to one already added
      TreeSet<KeyValue> map = new TreeSet<KeyValue>(KeyValue.COMPARATOR);
      long curSize = 0;
      // stop at the end or the RAM threshold
      while (iter.hasNext() && curSize < threshold) {
        Put p = iter.next();
        for (List<KeyValue> kvs : p.getFamilyMap().values()) {
          for (KeyValue kv : kvs) {
            map.add(kv);
            curSize += kv.getLength();
          }
        }
      }
{code}
Changing this back to a List and then sorting explicitly will solve the issue.
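
To illustrate the difference between the two approaches, here is a minimal, self-contained sketch. It does not use HBase classes: Cell and KEY_ORDER are hypothetical stand-ins for KeyValue and KeyValue.COMPARATOR, on the assumption that the comparator orders by key (row, column, timestamp) and ignores the value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

class DuplicateKeyDemo {
  /** Hypothetical stand-in for a KeyValue; the key is (row, column, timestamp). */
  static final class Cell {
    final String row;
    final String column;
    final long timestamp;
    final String value;

    Cell(String row, String column, long timestamp, String value) {
      this.row = row;
      this.column = column;
      this.timestamp = timestamp;
      this.value = value;
    }
  }

  /** Orders by key only (newest timestamp first); the value is ignored. */
  static final Comparator<Cell> KEY_ORDER = new Comparator<Cell>() {
    @Override
    public int compare(Cell a, Cell b) {
      int c = a.row.compareTo(b.row);
      if (c != 0) return c;
      c = a.column.compareTo(b.column);
      if (c != 0) return c;
      return Long.compare(b.timestamp, a.timestamp);
    }
  };

  /** TreeSet approach: a cell whose key equals an existing one is not added. */
  static int sizeWithTreeSet(List<Cell> cells) {
    TreeSet<Cell> set = new TreeSet<Cell>(KEY_ORDER);
    set.addAll(cells);
    return set.size();
  }

  /** List approach: every cell is kept, and the list is sorted explicitly. */
  static List<Cell> sortedList(List<Cell> cells) {
    List<Cell> copy = new ArrayList<Cell>(cells);
    Collections.sort(copy, KEY_ORDER); // stable sort, equal keys keep input order
    return copy;
  }

  public static void main(String[] args) {
    // Two records with the same row, column and timestamp but different values
    List<Cell> cells = Arrays.asList(
        new Cell("rowA", "colA", 1L, "val1"),
        new Cell("rowA", "colA", 1L, "val2"));
    System.out.println("TreeSet keeps " + sizeWithTreeSet(cells) + " cell(s)");
    System.out.println("Sorted list keeps " + sortedList(cells).size() + " cell(s)");
  }
}
```

In this sketch the TreeSet retains only the first of the two cells, while the sorted list retains both.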
                
      was (Author: lakshman):
    I think this is a bug and not intentional behavior.

Usage of TreeSet in the below code snippet is causing the issue.

PutSortReducer.reduce()
======================
      TreeSet<KeyValue> map = new TreeSet<KeyValue>(KeyValue.COMPARATOR);
      long curSize = 0;
      // stop at the end or the RAM threshold
      while (iter.hasNext() && curSize < threshold) {
        Put p = iter.next();
        for (List<KeyValue> kvs : p.getFamilyMap().values()) {
          for (KeyValue kv : kvs) {
            map.add(kv);
            curSize += kv.getLength();
          }
        }

Changing this back to List and then sort explicitly will solve the issue.
                  
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>
> Duplicate records are getting discarded when duplicate records exist in the same input file and, more specifically, if they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


        

[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293469#comment-13293469 ] 

ramkrishna.s.vasudevan commented on HBASE-5564:
-----------------------------------------------

OK, I will make that change and re-upload the patch. Thanks, Ted.
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, 5564v5.txt, HBASE-5564.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are getting discarded when duplicate records exist in the same input file and, more specifically, if they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


        

[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Lars Hofhansl (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228221#comment-13228221 ] 

Lars Hofhansl commented on HBASE-5564:
--------------------------------------

@Laxman: so what you have in your CSV file is entries like:
rowA, colA, val1
rowA, colA, val2

And the expectation is that HBase should create two versions:
(rowA, colA, ts1) -> val1
(rowA, colA, ts2) -> val2
?

Seems like a pretty constructed case to me :)
How would you know ahead of time how many versions you'd need to configure for your column family? 3 is the default, but what if you have 100 versions of the same row/col combo in your CSV file?

But anyway, having an option to specify a column for the TS is a good idea.
Do you want to take a stab at it Laxman?
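
A rough sketch of the two points above (a timestamp taken from an input column, and a cap on the number of versions kept). This is a toy model only, not ImportTsv or HBase code: the class, the method versionsFor, the optional fourth field and the parameters defaultTs and maxVersions are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

class TimestampColumnDemo {
  /**
   * Returns the values held for one (row, column) pair, newest first,
   * keeping at most maxVersions entries.
   * Each input line is "row,column,value[,timestamp]"; when the optional
   * timestamp field is missing, defaultTs is used for that line.
   */
  static List<String> versionsFor(List<String> lines, String row, String column,
                                  long defaultTs, int maxVersions) {
    // Timestamps in descending order. In this toy model a repeated
    // timestamp overwrites the value stored earlier under it.
    TreeMap<Long, String> byTimestamp =
        new TreeMap<Long, String>(Collections.<Long>reverseOrder());
    for (String line : lines) {
      String[] f = line.split(",");
      if (!f[0].trim().equals(row) || !f[1].trim().equals(column)) {
        continue;
      }
      long ts = f.length > 3 ? Long.parseLong(f[3].trim()) : defaultTs;
      byTimestamp.put(ts, f[2].trim());
    }
    List<String> out = new ArrayList<String>();
    for (String value : byTimestamp.values()) {
      if (out.size() == maxVersions) {
        break;
      }
      out.add(value);
    }
    return out;
  }

  public static void main(String[] args) {
    // No timestamp column: both lines share defaultTs, so only one version survives
    List<String> plain = Arrays.asList("rowA, colA, val1", "rowA, colA, val2");
    System.out.println(versionsFor(plain, "rowA", "colA", 5L, 3));

    // With a timestamp column: each line becomes its own version
    List<String> stamped = Arrays.asList("rowA,colA,val1,1", "rowA,colA,val2,2");
    System.out.println(versionsFor(stamped, "rowA", "colA", 5L, 3));
  }
}
```

With more distinct timestamps than maxVersions, only the newest maxVersions values are returned, which mirrors the concern about 100 versions of the same row/col combination.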

                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>
> Duplicate records are getting discarded when duplicate records exist in the same input file and, more specifically, if they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


        

[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293367#comment-13293367 ] 

ramkrishna.s.vasudevan commented on HBASE-5564:
-----------------------------------------------

All the tests are passing. Will integrate tomorrow if there are no objections.
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, 5564v5.txt, HBASE-5564.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are getting discarded when duplicate records exist in the same input file and, more specifically, if they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


        

[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Jonathan Hsieh (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241485#comment-13241485 ] 

Jonathan Hsieh commented on HBASE-5564:
---------------------------------------

I meant to say "ideally it does not go up".  I think stack's action (he didn't lower the findbugs number on a normal patch) captured the same idea.
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, 5564v5.txt, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are getting discarded when duplicate records exist in the same input file and, more specifically, if they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


        

[jira] [Updated] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "stack (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-5564:
-------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)
    
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, 5564v5.txt, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are getting discarded when duplicate records exist in the same input file and, more specifically, if they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


        

[jira] [Updated] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Zhihong Yu (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Yu updated HBASE-5564:
------------------------------

    Attachment: 5564.lint

@Laxman:
5564.lint contains the warnings 'arc lint' found w.r.t. your patch.
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>         Attachments: 5564.lint, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are getting discarded when duplicate records exist in the same input file and, more specifically, if they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


        

[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Jonathan Hsieh (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241341#comment-13241341 ] 

Jonathan Hsieh commented on HBASE-5564:
---------------------------------------

Maybe not worry about findbugs for normal patches? (Ideally it does go up, though.) The findbugs number isn't the focus of this patch.

                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, 5564v5.txt, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are getting discarded when duplicate records exist in the same input file and, more specifically, if they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


        

[jira] [Updated] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Laxman (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Laxman updated HBASE-5564:
--------------------------

    Attachment: HBASE-5564_trunk.4_final.patch
    
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are getting discarded when duplicate records exist in the same input file and, more specifically, if they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


        

[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Lars Hofhansl (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13227926#comment-13227926 ] 

Lars Hofhansl commented on HBASE-5564:
--------------------------------------

So this is only about ImportTsv? Should change the title in that case.

I agree with Todd, at least for ImportTsv.
Import/Export should not (and hopefully do not) exhibit this behavior (since we want to be able to import/export KVs with multiple versions).

                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>
> Duplicate records are getting discarded when duplicate records exist in the same input file and, more specifically, if they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


        

[jira] [Updated] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "stack (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-5564:
-------------------------

    Status: Open  (was: Patch Available)
    
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, 5564v5.txt, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are considered if they are from different splits.
> Version under test: HBase 0.92


        

[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293815#comment-13293815 ] 

Hadoop QA commented on HBASE-5564:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12531854/HBASE-5564_1.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 hadoop2.0.  The patch compiles against the hadoop 2.0 profile.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 5 new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

     -1 core tests.  The patch failed these unit tests:
                       org.apache.hadoop.hbase.coprocessor.TestClassLoading

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2153//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2153//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2153//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2153//console

This message is automatically generated.
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, 5564v5.txt, HBASE-5564.patch, HBASE-5564_1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are considered if they are from different splits.
> Version under test: HBase 0.92


        

[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "ramkrishna.s.vasudevan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241076#comment-13241076 ] 

ramkrishna.s.vasudevan commented on HBASE-5564:
-----------------------------------------------

+1 on v5.  

Thanks for the patch Lakshman and Stack.  
@Stack
So any new patch we submit should not carry any findbugs warnings, even ones that come from the old existing code? OK, I will take care of this and make sure people submitting patches here do the same. Thanks.
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, 5564v5.txt, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are considered if they are from different splits.
> Version under test: HBase 0.92


        

[jira] [Updated] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Laxman (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Laxman updated HBASE-5564:
--------------------------

    Status: Patch Available  (was: Open)
    
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>         Attachments: HBASE-5564_trunk.1.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are considered if they are from different splits.
> Version under test: HBase 0.92


        

[jira] [Updated] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Lars Hofhansl (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars Hofhansl updated HBASE-5564:
---------------------------------

    Fix Version/s: 0.96.0
    
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are considered if they are from different splits.
> Version under test: HBase 0.92


        

[jira] [Updated] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Laxman (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Laxman updated HBASE-5564:
--------------------------

    Status: Patch Available  (was: Open)
    
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are considered if they are from different splits.
> Version under test: HBase 0.92


        

[jira] [Updated] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "stack (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-5564:
-------------------------

    Attachment: 5564v5.txt

Same as v4 but uses Bytes to try to get rid of the findbugs warnings. (Laxman, you have probably noticed our new 'sensitivity' to the findbugs output. You did not introduce these warnings; they were in the original code. Let me try to get rid of them with this v5. Thanks.)
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, 5564v5.txt, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are considered if they are from different splits.
> Version under test: HBase 0.92


        

[jira] [Reopened] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Zhihong Yu (Reopened) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Yu reopened HBASE-5564:
-------------------------------


By reverting the patch applied to trunk, TestImportTsv#testMROnTableWithCustomMapper passes.
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, 5564v5.txt, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are considered if they are from different splits.
> Version under test: HBase 0.92


        

[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Laxman (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245999#comment-13245999 ] 

Laxman commented on HBASE-5564:
-------------------------------

Yes Stack, I will take a look. The changes in this patch are in the default mapper. IMO they shouldn't cause failures in the custom mapper.
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, 5564v5.txt, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are considered if they are from different splits.
> Version under test: HBase 0.92


        

[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Zhihong Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293375#comment-13293375 ] 

Zhihong Ted Yu commented on HBASE-5564:
---------------------------------------

Minor comment:
{code}
+          throw new BadTsvLineException("Invalid timestamp");
{code}
Can the timestamp string be included in the message?
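
For illustration only, here is a minimal sketch of what including the offending text could look like. It is not taken from the patch: the nested BadTsvLineException is a hypothetical stand-in for the HBase class of the same name, and parseTimestamp is an assumed helper name.

```java
final class TimestampParseSketch {
    // Hypothetical stand-in for HBase's BadTsvLineException, so the sketch compiles on its own.
    static final class BadTsvLineException extends Exception {
        BadTsvLineException(String message) {
            super(message);
        }
    }

    // Assumed helper: parses the timestamp column and reports the raw text when it is not a long.
    static long parseTimestamp(String raw) throws BadTsvLineException {
        try {
            return Long.parseLong(raw);
        } catch (NumberFormatException e) {
            // The raw string is appended so the bad input line is easier to find.
            throw new BadTsvLineException("Invalid timestamp: " + raw);
        }
    }

    public static void main(String[] args) {
        try {
            parseTimestamp("12ab");
        } catch (BadTsvLineException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With the value in the message, a failed line can be matched back to the input file without rerunning the job.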
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, 5564v5.txt, HBASE-5564.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are considered if they are from different splits.
> Version under test: HBase 0.92


        

[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Laxman (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233283#comment-13233283 ] 

Laxman commented on HBASE-5564:
-------------------------------

bq. Doing this will use the same TS across all the mappers. Is this the intention for this change? So in TsvImporterMapper, conf.getLong(ImportTsv.TIMESTAMP_CONF_KEY, 0) will always have value to get from conf.

Yes Anoop, we should have the same timestamp for all mappers.
Please check my previous comments on the scope of the issue.

https://issues.apache.org/jira/browse/HBASE-5564?focusedCommentId=13228297&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13228297
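
A rough sketch of the idea, for readers following along: the driver fixes one timestamp in the job configuration, and every mapper reads that same value. This is not the patch itself. A plain Map stands in for the Hadoop Configuration, the key string is an assumed value for ImportTsv.TIMESTAMP_CONF_KEY, and both method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class SharedTimestampSketch {
    // Assumed key string, for illustration; the thread only names the constant TIMESTAMP_CONF_KEY.
    static final String TIMESTAMP_CONF_KEY = "importtsv.timestamp";

    // Driver side (hypothetical helper): store the timestamp once, unless one was already supplied.
    static void setTimestampOnce(Map<String, String> conf, long now) {
        conf.putIfAbsent(TIMESTAMP_CONF_KEY, Long.toString(now));
    }

    // Mapper side (hypothetical helper): read the stored value, falling back to 0 if it is missing.
    static long readTimestamp(Map<String, String> conf) {
        return Long.parseLong(conf.getOrDefault(TIMESTAMP_CONF_KEY, "0"));
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        setTimestampOnce(conf, System.currentTimeMillis());

        // Two "mappers" reading the same configuration see the same timestamp.
        long mapperA = readTimestamp(conf);
        long mapperB = readTimestamp(conf);
        System.out.println("same timestamp in both mappers: " + (mapperA == mapperB));
    }
}
```

Because the value is written before the mappers start, the fallback in readTimestamp is never expected to be used, which matches the point quoted above.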
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>         Attachments: HBASE-5564_trunk.1.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are considered if they are from different splits.
> Version under test: HBase 0.92


        

[jira] [Updated] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Laxman (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Laxman updated HBASE-5564:
--------------------------

    Attachment: HBASE-5564_trunk.patch

Initial patch on trunk for review.
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>         Attachments: HBASE-5564_trunk.patch
>
>
> Duplicate records are discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are considered if they are from different splits.
> Version under test: HBase 0.92


        

[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Laxman (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235342#comment-13235342 ] 

Laxman commented on HBASE-5564:
-------------------------------

Thanks, Ted, for taking the trouble to get the lint comments.
As you suggested, I will start a discussion on dev@hbase.

I just wanted to quote one example from this patch here.
{code}
    long timstamp = conf.getLong(TIMESTAMP_CONF_KEY, System.currentTimeMillis());
{code}

After formatting, the above code snippet turned into:
{code}
    long timstamp = conf
        .getLong(TIMESTAMP_CONF_KEY, System.currentTimeMillis());
{code}

                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are considered if they are from different splits.
> Version under test: HBase 0.92


        

[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242825#comment-13242825 ] 

Hudson commented on HBASE-5564:
-------------------------------

Integrated in HBase-TRUNK #2701 (See [https://builds.apache.org/job/HBase-TRUNK/2701/])
    HBASE-5564 Bulkload is discarding duplicate records (Revision 1307629)

     Result = SUCCESS
stack : 
Files : 
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/mapreduce/TsvImporterMapper.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/mapreduce/TestImportTsv.java

                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, 5564v5.txt, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are considered if they are from different splits.
> Version under test: HBase 0.92


        

[jira] [Updated] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Laxman (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Laxman updated HBASE-5564:
--------------------------

    Attachment: HBASE-5564_trunk.3.patch

Final patch for commit to trunk.
Changes from the previous patch:
1) Minor improvements to getTimestamp (readability).
2) Findbugs warning (default encoding): corrected using the Base64 utility.
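
As background on the default-encoding warning in item 2: converting between String and byte[] without naming a charset depends on the platform default, so the same code can yield different bytes on different machines. The sketch below shows the general remedy with an explicit charset. It is an illustration with hypothetical names, not the Base64-based change made in this patch.

```java
import java.nio.charset.StandardCharsets;

final class ExplicitCharsetSketch {
    // Encodes with a named charset instead of the platform default used by String.getBytes().
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decodes with the same named charset, so the round trip does not depend on the host.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "row-\u00e9";
        String roundTripped = decode(encode(original));
        System.out.println("round trip preserved: " + original.equals(roundTripped));
    }
}
```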
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are considered if they are from different splits.
> Version under test: HBase 0.92


        

[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Laxman (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234281#comment-13234281 ] 

Laxman commented on HBASE-5564:
-------------------------------

Thanks for the info, Ram.

I spent some time analyzing these failures but couldn't find a clue.
I filed a separate JIRA, HBASE-5608, to fix these test failures.

As mentioned earlier, all these tests pass in my local environment.

Should we wait for HBASE-5608 or proceed with review and commit?
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>         Attachments: HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are considered if they are from different splits.
> Version under test: HBase 0.92


        

[jira] [Updated] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Laxman (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Laxman updated HBASE-5564:
--------------------------

    Status: Open  (was: Patch Available)
    
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are considered if they are from different splits.
> Version under test: HBase 0.92


        

[jira] [Updated] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan updated HBASE-5564:
------------------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Committed to trunk.
Thanks for the patch Laxman.
Thanks for the review Stack, Ted, Lars, Todd, Jesse and Anoop.
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, 5564v5.txt, HBASE-5564.patch, HBASE-5564_1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are getting discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13227740#comment-13227740 ] 

stack commented on HBASE-5564:
------------------------------

The TreeSet is what's going to be used once the edits make it into the server, so losing them in the reducer is probably optimal?  The Jesse ts++, or ts--, could be an option?
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>
> Duplicate records are getting discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Laxman (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13227678#comment-13227678 ] 

Laxman commented on HBASE-5564:
-------------------------------

I tested again with the proposed patch.
> > Changing this back to List and then sort explicitly will solve the issue.

The same problem still persists, which makes this issue a bit more complicated.
I think the use of the same timestamp for all records in a split is causing the issue.

Currently in code,
a) If configured, we are using static timestamp for all mappers.
b) If not configured, we are using current system time generated for each split.

TsvImporterMapper.doSetup
====================
{code}
ts = conf.getLong(ImportTsv.TIMESTAMP_CONF_KEY, System.currentTimeMillis());
{code}

Should we think of an approach to generate a unique sequence number and use it as a timestamp?

Any other thoughts?
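
The sequence-number idea raised above could be sketched roughly as follows. This is only an illustration: SequenceTimestamps and nextTimestamp are hypothetical names, and nothing in this thread indicates the committed patch works this way. Note also that a counter like this would run ahead of the wall clock and could still collide across map tasks unless each task were given a disjoint range.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper, not part of HBase: hands out a distinct timestamp
// per record by counting upwards from a base value.
class SequenceTimestamps {
  private final AtomicLong next;

  SequenceTimestamps(long baseTimestamp) {
    this.next = new AtomicLong(baseTimestamp);
  }

  // Returns the current value and advances by one, so no two calls on the
  // same instance ever return the same timestamp.
  long nextTimestamp() {
    return next.getAndIncrement();
  }
}
```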
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>
> Duplicate records are getting discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Laxman (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238452#comment-13238452 ] 

Laxman commented on HBASE-5564:
-------------------------------

@Stack, updated the patch after addressing your comments. Thanks for the review.
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are getting discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Laxman (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228936#comment-13228936 ] 

Laxman commented on HBASE-5564:
-------------------------------

Thanks Stack. Let me give it a try.
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>
> Duplicate records are getting discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Laxman (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13227597#comment-13227597 ] 

Laxman commented on HBASE-5564:
-------------------------------

I think this is a bug and not intentional behavior.

The use of TreeSet in the code snippet below is causing the issue.

PutSortReducer.reduce()
======================
      TreeSet<KeyValue> map = new TreeSet<KeyValue>(KeyValue.COMPARATOR);
      long curSize = 0;
      // stop at the end or the RAM threshold
      while (iter.hasNext() && curSize < threshold) {
        Put p = iter.next();
        for (List<KeyValue> kvs : p.getFamilyMap().values()) {
          for (KeyValue kv : kvs) {
            map.add(kv);
            curSize += kv.getLength();
          }
        }

Changing this back to a List and then sorting explicitly will solve the issue.
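
The reducer-side effect described above can be reproduced without HBase. The sketch below uses a hypothetical Cell class in place of KeyValue and an assumed ordering (row, column, newest timestamp first) in place of KeyValue.COMPARATOR. A TreeSet built with such a comparator silently rejects an element that compares equal to one already present, while a sorted List keeps both. It illustrates the reducer step only; another comment in this thread reports that switching to a List did not resolve the issue on its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Hypothetical stand-in for KeyValue: the key fields plus a value.
final class Cell {
  final String row;
  final String column;
  final long timestamp;
  final String value;

  Cell(String row, String column, long timestamp, String value) {
    this.row = row;
    this.column = column;
    this.timestamp = timestamp;
    this.value = value;
  }
}

class TreeSetDuplicateDemo {
  // Assumed ordering: row, then column, then newest timestamp first.
  // The value is ignored, so two cells with the same key compare as equal.
  static final Comparator<Cell> KEY_ORDER = (a, b) -> {
    int c = a.row.compareTo(b.row);
    if (c != 0) return c;
    c = a.column.compareTo(b.column);
    if (c != 0) return c;
    return Long.compare(b.timestamp, a.timestamp);
  };

  // Two records share row, column and timestamp; the third is an older version.
  static List<Cell> sample() {
    return Arrays.asList(
        new Cell("row1", "cf:a", 100L, "first"),
        new Cell("row1", "cf:a", 100L, "second"),
        new Cell("row1", "cf:a", 99L, "older"));
  }

  // TreeSet.add returns false for an element that compares equal to an existing one.
  static int keptByTreeSet(List<Cell> cells) {
    TreeSet<Cell> set = new TreeSet<>(KEY_ORDER);
    set.addAll(cells);
    return set.size();
  }

  // A List has no uniqueness constraint; sorting only reorders.
  static int keptBySortedList(List<Cell> cells) {
    List<Cell> list = new ArrayList<>(cells);
    list.sort(KEY_ORDER);
    return list.size();
  }

  // Value of the newest cell that survives in the TreeSet.
  static String survivor(List<Cell> cells) {
    TreeSet<Cell> set = new TreeSet<>(KEY_ORDER);
    for (Cell cell : cells) {
      set.add(cell);
    }
    return set.first().value;
  }
}
```

With the sample input, the TreeSet keeps the first of the two same-key cells it sees and drops the second.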
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>
> Duplicate records are getting discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


[jira] [Updated] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan updated HBASE-5564:
------------------------------------------

    Status: Patch Available  (was: Open)
    
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, 5564v5.txt, HBASE-5564.patch, HBASE-5564_1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are getting discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293238#comment-13293238 ] 

Hadoop QA commented on HBASE-5564:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12531698/HBASE-5564.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 hadoop2.0.  The patch compiles against the hadoop 2.0 profile.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 5 new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2136//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2136//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2136//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2136//console

This message is automatically generated.
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, 5564v5.txt, HBASE-5564.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are getting discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Laxman (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242189#comment-13242189 ] 

Laxman commented on HBASE-5564:
-------------------------------

Thanks for the commit stack.
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, 5564v5.txt, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are getting discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237781#comment-13237781 ] 

stack commented on HBASE-5564:
------------------------------

Patch seems reasonable.

Add curlies here:

{code}
+      if (parser.getTimestampKeyColumnIndex() != -1)
+        ts = parsed.getTimestamp();
{code}

Convention is you can do w/o curlies if all in one line (as you do later in this file) but if not on one line, need curlies.

Can you confirm that current behavior -- setting ts to System.currentTimeMillis -- is default?  It seems to be ... we set System.currentTimeMillis as time to use setting up the job.

A define for NO_TIMESTAMP_KEYCOLUMN_INDEX instead of using -1 directly might help for timestampKeyColumnIndex == -1?  Or put this test into a method whose name makes it obvious what the test is about ... e.g. hasTimeStampColumn()....

Patch adds nice usage commentary explaining new facility.

Looks good.
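
The two review suggestions above (braces around a multi-line if, and a named constant or helper instead of a bare -1) might look roughly like this. TimestampColumnSketch, hasTimestamp and chooseTimestamp are illustrative names chosen here; only NO_TIMESTAMP_KEYCOLUMN_INDEX comes from the comment, and the actual patch may differ.

```java
// Minimal sketch, not the patch itself.
class TimestampColumnSketch {
  // Named constant replacing the bare -1 sentinel.
  static final int NO_TIMESTAMP_KEYCOLUMN_INDEX = -1;

  private final int timestampKeyColumnIndex;

  TimestampColumnSketch(int timestampKeyColumnIndex) {
    this.timestampKeyColumnIndex = timestampKeyColumnIndex;
  }

  // The method name states what the comparison is testing.
  boolean hasTimestamp() {
    return timestampKeyColumnIndex != NO_TIMESTAMP_KEYCOLUMN_INDEX;
  }

  // Uses the per-record timestamp when a timestamp column is configured,
  // otherwise falls back to the default. Braces are kept because the body
  // is not on the same line as the condition.
  long chooseTimestamp(long parsedTimestamp, long defaultTimestamp) {
    if (hasTimestamp()) {
      return parsedTimestamp;
    }
    return defaultTimestamp;
  }
}
```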
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are getting discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234457#comment-13234457 ] 

Hadoop QA commented on HBASE-5564:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12519254/5564.lint
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 1 new or modified tests.

    -1 patch.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1235//console

This message is automatically generated.
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>         Attachments: 5564.lint, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are getting discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


[jira] [Updated] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "stack (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-5564:
-------------------------

    Affects Version/s:     (was: 0.92.2)
                           (was: 0.90.7)
                           (was: 0.94.0)
         Hadoop Flags: Reviewed

Patch looks good.  Is this right:

{code}
+        return Long.parseLong(Base64.encodeBytes(lineBytes,
+            getColumnOffset(timestampKeyColumnIndex), getColumnLength(timestampKeyColumnIndex)));
{code}

As I read it, encode some passed bytes into a base64 String and then try to parse it as a long (it doesn't look like parseLong can interpret base64'd longs)?  Am I reading it wrong?

I was going to mark this an incompatible change but thinking on it, setting timestamp for the MR job once rather than per mapper seems like a bug fix.

Please write a bit of a release note at least explaining the changed behavior.

If the above is right and I'm just reading it wrong, will commit.  Let me know.  Thanks Laxman.
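
The concern about parsing a base64 string as a long can be checked in isolation. The sketch below uses java.util.Base64 as a stand-in for the HBase Base64 utility named in the snippet (an assumption; the two need not behave identically in every detail) and a made-up input line. It compares decoding the column bytes as text before parsing with base64-encoding them first.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

class Base64ParseDemo {
  // Interprets the column bytes as text and parses that text as a long.
  static long parseDirect(byte[] line, int offset, int length) {
    return Long.parseLong(new String(line, offset, length, StandardCharsets.UTF_8));
  }

  // Base64-encodes the column bytes first, then tries to parse the result.
  // Returns true when Long.parseLong rejects the encoded string.
  static boolean base64ThenParseFails(byte[] line, int offset, int length) {
    byte[] column = Arrays.copyOfRange(line, offset, offset + length);
    String encoded = Base64.getEncoder().encodeToString(column);
    try {
      Long.parseLong(encoded);
      return false;
    } catch (NumberFormatException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    // Hypothetical TSV line: row key, timestamp, value.
    byte[] line = "row1\t1332000000000\tvalue".getBytes(StandardCharsets.UTF_8);
    System.out.println(parseDirect(line, 5, 13));
    System.out.println(base64ThenParseFails(line, 5, 13));
  }
}
```

The base64 form of an ASCII digit string begins with a letter, so Long.parseLong cannot accept it.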
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are getting discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228219#comment-13228219 ] 

Zhihong Yu commented on HBASE-5564:
-----------------------------------

bq. I think we can provide the provision to specify the timestamp column (Like ROWKEY column) as arguments.
The above is reasonable.
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>
> Duplicate records are getting discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


[jira] [Updated] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Laxman (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Laxman updated HBASE-5564:
--------------------------

    Attachment: HBASE-5564_trunk.1.patch
    
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>         Attachments: HBASE-5564_trunk.1.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are getting discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233204#comment-13233204 ] 

Hadoop QA commented on HBASE-5564:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12519017/HBASE-5564_trunk.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 165 new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

     -1 core tests.  The patch failed these unit tests:
     

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1229//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1229//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1229//console

This message is automatically generated.
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>         Attachments: HBASE-5564_trunk.patch
>
>
> Duplicate records are getting discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


[jira] [Updated] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Laxman (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Laxman updated HBASE-5564:
--------------------------

    Status: Open  (was: Patch Available)
    
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>         Attachments: HBASE-5564_trunk.1.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are getting discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Laxman (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238095#comment-13238095 ] 

Laxman commented on HBASE-5564:
-------------------------------

@Anoop, thanks for the clarification.

@Stack, thanks for the review. I will update the patch.

bq. need curlies
bq. NO_TIMESTAMP_KEYCOLUMN_INDEX 

I will update the patch for the above 2 comments.

bq. Can you confirm that current behavior – setting ts to System.currentTimeMillis – is default? It seems to be ... we set System.currentTimeMillis as time to use setting up the job.

Before the patch, we set ts to System.currentTimeMillis in TsvImporterMapper.doSetup. This setup method is called for each mapper, i.e., for each input split, so each map task uses a new timestamp.

After the patch, we set ts from conf.getLong, which is the same in all map tasks.

I hope I understood your question correctly.
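
The before/after difference can be sketched with a plain Map standing in for the job Configuration and an injected clock standing in for System.currentTimeMillis. All names below are illustrative, and the key string "importtsv.timestamp" is an assumption about the value of ImportTsv.TIMESTAMP_CONF_KEY.

```java
import java.util.Map;
import java.util.function.LongSupplier;

class JobTimestampSketch {
  // Assumed key name; the real constant is ImportTsv.TIMESTAMP_CONF_KEY.
  static final String TIMESTAMP_KEY = "importtsv.timestamp";

  // Before: every map task reads the clock in its own setup,
  // so each input split gets a different timestamp.
  static long perMapperTimestamp(LongSupplier clock) {
    return clock.getAsLong();
  }

  // After: job setup stores one timestamp in the configuration,
  // unless the user has already supplied one.
  static void configureJob(Map<String, String> conf, LongSupplier clock) {
    conf.putIfAbsent(TIMESTAMP_KEY, Long.toString(clock.getAsLong()));
  }

  // After: every map task reads the same stored value.
  static long mapperTimestamp(Map<String, String> conf) {
    return Long.parseLong(conf.get(TIMESTAMP_KEY));
  }
}
```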

                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.patch
>
>
> Duplicate records are getting discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92


[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13296053#comment-13296053 ] 

Hudson commented on HBASE-5564:
-------------------------------

Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #55 (See [https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/55/])
    HBASE-5564 Bulkload is discarding duplicate records

Submitted by:Laxman	
Reviewed by:iStack, Ted, Ram (Revision 1350691)

     Result = FAILURE
ramkrishna : 
Files : 
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/TsvImporterMapper.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestImportTsv.java

                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, 5564v5.txt, HBASE-5564.patch, HBASE-5564_1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch

        

[jira] [Updated] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan updated HBASE-5564:
------------------------------------------

    Attachment: HBASE-5564_1.patch

Updated patch addressing Ted's comments.  This is what I am planning to commit if there are no objections.
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, 5564v5.txt, HBASE-5564.patch, HBASE-5564_1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch

        

[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242477#comment-13242477 ] 

Hudson commented on HBASE-5564:
-------------------------------

Integrated in HBase-TRUNK #2698 (See [https://builds.apache.org/job/HBase-TRUNK/2698/])
    HBASE-5564 Bulkload is discarding duplicate records (Revision 1306907)

     Result = FAILURE
stack : 
Files : 
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/mapreduce/TsvImporterMapper.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/mapreduce/TestImportTsv.java

                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, 5564v5.txt, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch

        

[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Laxman (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239222#comment-13239222 ] 

Laxman commented on HBASE-5564:
-------------------------------

The Findbugs warnings reported by the QA bot are about use of the default encoding. This behavior is in line with the existing code.


bug #1
{noformat}
TEST 	Unknown bug pattern DM_DEFAULT_ENCODING in org.apache.hadoop.hbase.mapreduce.ImportTsv$TsvParser$ParsedLine.getTimestamp()
{noformat}

bug #2
{noformat}
TEST 	Unknown bug pattern DM_DEFAULT_ENCODING in org.apache.hadoop.hbase.mapreduce.ImportTsv.createSubmittableJob(Configuration, String[])
{noformat}

Bug #2 already exists in the code; that code is only included in the patch file, with no changes.

The test case failures are not caused by this patch. They are to be addressed as part of HBASE-5608.
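
For context on what DM_DEFAULT_ENCODING flags, here is a small sketch (the class and method names are made up for illustration and are not from ImportTsv). String.getBytes() and new String(byte[]) without a charset depend on the JVM's default encoding; naming the charset makes the result the same on every platform.

```java
import java.nio.charset.StandardCharsets;

// Illustrative helper only; names are hypothetical.
class EncodingSketch {
    // Relies on the JVM default charset: the kind of call DM_DEFAULT_ENCODING reports.
    static byte[] encodeDefault(String s) {
        return s.getBytes();
    }

    // Names the charset explicitly, so the bytes do not depend on the platform.
    static byte[] encodeUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Decodes with the same explicit charset.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```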
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.patch

        

[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Laxman (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233356#comment-13233356 ] 

Laxman commented on HBASE-5564:
-------------------------------

Any idea why the QA bot is not testing this patch?
Can someone trigger it explicitly?
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>         Attachments: HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.patch

        

[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Laxman (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228406#comment-13228406 ] 

Laxman commented on HBASE-5564:
-------------------------------

While testing the patch locally, I'm getting the following error on trunk.
Any hints on this, please?

{noformat}
java.lang.RuntimeException: java.io.IOException: Call to localhost/127.0.0.1:0 failed on local exception: java.net.BindException: Cannot assign requested address: no further information
	at org.apache.hadoop.mapred.MiniMRCluster.waitUntilIdle(MiniMRCluster.java:323)
	at org.apache.hadoop.mapred.MiniMRCluster.<init>(MiniMRCluster.java:524)
	at org.apache.hadoop.mapred.MiniMRCluster.<init>(MiniMRCluster.java:462)
	at org.apache.hadoop.mapred.MiniMRCluster.<init>(MiniMRCluster.java:454)
	at org.apache.hadoop.mapred.MiniMRCluster.<init>(MiniMRCluster.java:446)
	at org.apache.hadoop.mapred.MiniMRCluster.<init>(MiniMRCluster.java:436)
	at org.apache.hadoop.mapred.MiniMRCluster.<init>(MiniMRCluster.java:426)
	at org.apache.hadoop.mapred.MiniMRCluster.<init>(MiniMRCluster.java:417)
	at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniMapReduceCluster(HBaseTestingUtility.java:1269)
	at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniMapReduceCluster(HBaseTestingUtility.java:1255)
	at org.apache.hadoop.hbase.mapreduce.TestImportTsv.doMROnTableTest(TestImportTsv.java:189)
	at org.apache.hadoop.hbase.mapreduce.TestImportTsv.testMROnTable(TestImportTsv.java:162)
{noformat}
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader

        

[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241021#comment-13241021 ] 

Hadoop QA commented on HBASE-5564:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12520371/5564v5.txt
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed these unit tests:
                  org.apache.hadoop.hbase.mapreduce.TestImportTsv
                  org.apache.hadoop.hbase.mapred.TestTableMapReduce
                  org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1336//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1336//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1336//console

This message is automatically generated.
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, 5564v5.txt, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch

        

[jira] [Updated] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Laxman (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Laxman updated HBASE-5564:
--------------------------

    Release Note: 
1) Provision for using the existing timestamp (HBASE_TS_KEY)
2) Bug fix to use same timestamp across mappers.
          Status: Patch Available  (was: Open)

Attached the final patch for review and commit.

Changes from previous patch
1) Encoding issue
2) Proper handling for bad records (with invalid timestamp)
3) New unit tests to test the parser (with valid & invalid timestamp)

Note: QA may report 2 new Findbugs warnings. As explained earlier, these are due to use of the default encoding (String.getBytes, new String), which is in line with the existing behavior.
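
As a rough sketch of the two behaviors in the release note and change list (using a timestamp taken from the input, and rejecting records whose timestamp is invalid), the following shows one way it could look. This is not the patch itself: the class, the exception, and the NO_TIMESTAMP_COLUMN constant are hypothetical names chosen for illustration.

```java
// Hypothetical illustration; not the ImportTsv parser.
class TsvTimestampSketch {
    // Assumed sentinel meaning "the input has no timestamp column".
    static final int NO_TIMESTAMP_COLUMN = -1;

    // Assumed exception type for records that cannot be parsed.
    static class BadRecordException extends RuntimeException {
        BadRecordException(String message) {
            super(message);
        }
    }

    // Returns the timestamp for one tab-separated line: the value in tsColumn
    // when such a column is configured, otherwise the job-wide default.
    static long timestampOf(String line, int tsColumn, long defaultTs) {
        if (tsColumn == NO_TIMESTAMP_COLUMN) {
            return defaultTs;
        }
        String[] fields = line.split("\t", -1);
        if (tsColumn >= fields.length) {
            throw new BadRecordException("Timestamp column missing in line");
        }
        try {
            return Long.parseLong(fields[tsColumn]);
        } catch (NumberFormatException e) {
            // Invalid timestamp: report the record as bad instead of guessing a value.
            throw new BadRecordException("Invalid timestamp: " + fields[tsColumn]);
        }
    }
}
```

A caller would typically catch the exception, count the line as a bad record, and continue with the next line.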

                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch

        

[jira] [Updated] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Laxman (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Laxman updated HBASE-5564:
--------------------------

    Attachment: HBASE-5564_trunk.2.patch
    
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.patch

        

[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295775#comment-13295775 ] 

Hudson commented on HBASE-5564:
-------------------------------

Integrated in HBase-TRUNK #3030 (See [https://builds.apache.org/job/HBase-TRUNK/3030/])
    HBASE-5564 Bulkload is discarding duplicate records

Submitted by:Laxman	
Reviewed by:iStack, Ted, Ram (Revision 1350691)

     Result = FAILURE
ramkrishna : 
Files : 
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/TsvImporterMapper.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestImportTsv.java

                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, 5564v5.txt, HBASE-5564.patch, HBASE-5564_1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch

        

[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Laxman (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234157#comment-13234157 ] 

Laxman commented on HBASE-5564:
-------------------------------

These tests are passing in my dev environment.

{noformat}
Running org.apache.hadoop.hbase.mapreduce.TestImportTsv
Tests run: 9, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 168.578 sec

Results :

Tests run: 9, Failures: 0, Errors: 0, Skipped: 0

[INFO]
[INFO] --- maven-surefire-plugin:2.12-TRUNK-HBASE-2:test (secondPartTestsExecution) @ hbase ---
[INFO] Tests are skipped.
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
{noformat}

Also, I can see these MR tests are failing in previous builds as well [HBase-5529].

Will check more. 
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>         Attachments: HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.patch

        

[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Laxman (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241078#comment-13241078 ] 

Laxman commented on HBASE-5564:
-------------------------------

@stack, thanks for your review and clearing the findbugs.
I was avoiding these changes as these are unrelated to this JIRA.

@ram, thanks for reviewing the patch.
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>             Fix For: 0.96.0
>
>         Attachments: 5564.lint, 5564v5.txt, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch

        

[jira] [Updated] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Zhihong Yu (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Yu updated HBASE-5564:
------------------------------

    Comment: was deleted

(was: -1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12519254/5564.lint
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 1 new or modified tests.

    -1 patch.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1235//console

This message is automatically generated.)
    
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>         Attachments: 5564.lint, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.patch

        

[jira] [Updated] (HBASE-5564) Bulkload is discarding duplicate records

Posted by "Laxman (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Laxman updated HBASE-5564:
--------------------------

    Status: Open  (was: Patch Available)
    
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>         Attachments: HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.patch