
Posted to mapreduce-issues@hadoop.apache.org by "Arun A K (JIRA)" <ji...@apache.org> on 2012/08/06 13:03:02 UTC

[jira] [Created] (MAPREDUCE-4519) In TextInputFormat, while specifying textinputformat.record.delimiter the character/character sequences in data file similar to starting character/starting character sequence in delimiter were found missing in certain cases in the Map Output

Arun A K created MAPREDUCE-4519:
-----------------------------------

             Summary: In TextInputFormat, while specifying textinputformat.record.delimiter the character/character sequences in data file similar to starting character/starting character sequence in delimiter were found missing in certain cases in the Map Output
                 Key: MAPREDUCE-4519
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4519
             Project: Hadoop Map/Reduce
          Issue Type: Bug
    Affects Versions: 0.20.2
         Environment: Linux- Ubuntu 10.04
            Reporter: Arun A K
             Fix For: 0.20.2


Set textinputformat.record.delimiter to "</entity>".

Suppose the input is a text file with the following content:
<entity><id>1</id><name>User1</name></entity><entity><id>2</id><name>User2</name></entity><entity><id>3</id><name>User3</name></entity><entity><id>4</id><name>User4</name></entity><entity><id>5</id><name>User5</name></entity>

The Mapper was expected to receive the following values:

Value 1 - <entity><id>1</id><name>User1</name>
Value 2 - <entity><id>2</id><name>User2</name>
Value 3 - <entity><id>3</id><name>User3</name>
Value 4 - <entity><id>4</id><name>User4</name>
Value 5 - <entity><id>5</id><name>User5</name>

Because of this bug, the Mapper instead receives:

Value 1 - entity><id>1</id><name>User1</name>
Value 2 - <entity>id>2</id><name>User2</name>
Value 3 - <entity><id>3id><name>User3</name>
Value 4 - <entity><id>4</id><name>User4name>
Value 5 - <entity><id>5</id><name>User5</name>

The pattern shown above need not occur in values 1, 2 and 3 specifically. The bug appears at seemingly random positions in the map input.
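
As a rough cross-check of the report above, the following sketch splits the sample input on the delimiter to get the expected values, then works out which characters are absent from each reported value. It is plain Python rather than Hadoop code, and the helper name dropped_characters is made up for this illustration. On the sample data, every dropped run is a prefix of the delimiter ("<" or "</").

```python
import os

DELIMITER = "</entity>"

# Input line as given in the report.
data = (
    "<entity><id>1</id><name>User1</name></entity>"
    "<entity><id>2</id><name>User2</name></entity>"
    "<entity><id>3</id><name>User3</name></entity>"
    "<entity><id>4</id><name>User4</name></entity>"
    "<entity><id>5</id><name>User5</name></entity>"
)

# Expected map values: the input split on the delimiter,
# with the empty piece after the final delimiter dropped.
expected = [piece for piece in data.split(DELIMITER) if piece]

# Values the report says the Mapper actually received.
observed = [
    "entity><id>1</id><name>User1</name>",
    "<entity>id>2</id><name>User2</name>",
    "<entity><id>3id><name>User3</name>",
    "<entity><id>4</id><name>User4name>",
    "<entity><id>5</id><name>User5</name>",
]


def dropped_characters(want, got):
    """Return the single run of characters present in `want` but not in `got`.

    Hypothetical helper for this illustration only.
    """
    prefix_len = len(os.path.commonprefix([want, got]))
    gap = len(want) - len(got)
    missing = want[prefix_len:prefix_len + gap]
    # Sanity check: removing exactly that run must reproduce the observed value.
    assert want[:prefix_len] + want[prefix_len + gap:] == got
    return missing


missing = [dropped_characters(w, g) for w, g in zip(expected, observed)]

# Each dropped run is a leading fragment of the delimiter itself.
all_are_delimiter_prefixes = all(DELIMITER.startswith(m) for m in missing)
```

This only analyses the five values quoted in the report. It says nothing about where in a larger file the loss would occur.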
 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-4519) In TextInputFormat, while specifying textinputformat.record.delimiter the character/character sequences in data file similar to starting character/starting character sequence in delimiter were found missing in certain cases in the Map Output

Posted by "Gelesh (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13429082#comment-13429082 ] 

Gelesh commented on MAPREDUCE-4519:
-----------------------------------

I have found a similar bug, and a fix, in MAPREDUCE-4512. Please refer to that patch and kindly incorporate it.
While fixing it I also encountered this scenario. I think it occurs at the end of the buffer, which holds 4096 characters.
My understanding is that at the end of one buffer and the beginning of the next, the delimiter indexes are not handled properly.
This results in bugs of one kind or another.

I tried solving it, but the fix introduced some new bugs. Once all the scenarios are identified, we can settle on a possible fix.
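
To make the buffer-boundary explanation above concrete, here is a minimal sketch of a buffered reader that splits on a multi-byte delimiter. It is not Hadoop's LineReader and not the attached patch. The function name read_records and its parameters are assumptions made for this illustration. The idea it shows is that bytes which might begin a delimiter stay in the record until the whole delimiter is confirmed, and that the partial record survives a buffer refill.

```python
import io


def read_records(stream, delimiter, buffer_size=4096):
    """Yield records from a binary stream, split on a multi-byte delimiter.

    Hypothetical helper for illustration; it is not Hadoop's LineReader.
    """
    if not delimiter:
        raise ValueError("delimiter must be non-empty")
    record = bytearray()  # survives across buffer refills
    while True:
        chunk = stream.read(buffer_size)  # one buffer refill
        if not chunk:
            break
        for byte in chunk:
            # Keep every byte, including ones that might start a delimiter.
            record.append(byte)
            # Only a complete delimiter ends the record. A partial match,
            # such as the "</" of "</id>", simply remains part of the record.
            if record.endswith(delimiter):
                del record[-len(delimiter):]
                yield bytes(record)
                record.clear()
    if record:
        # Trailing data without a closing delimiter forms the last record.
        yield bytes(record)


# Sample input from the report, read with a deliberately tiny buffer so that
# delimiters and look-alike prefixes straddle many refills.
sample = b"".join(
    b"<entity><id>%d</id><name>User%d</name></entity>" % (i, i)
    for i in range(1, 6)
)
records = list(read_records(io.BytesIO(sample), b"</entity>", buffer_size=7))
```

Checking the tail of the record on every byte is simple, not fast. A production reader would track the match position instead, which is where the index handling described above becomes delicate.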
                
> In TextInputFormat, while specifying textinputformat.record.delimiter the character/character sequences in data file similar to starting character/starting character sequence in delimiter were found missing in certain cases in the Map Output
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4519
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4519
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.20.2
>         Environment: Linux- Ubuntu 10.04
>            Reporter: Arun A K
>              Labels: hadoop, mapreduce, textinputformat, textinputformat.record.delimiter
>             Fix For: 0.20.2
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>

[jira] [Commented] (MAPREDUCE-4519) In TextInputFormat, while specifying textinputformat.record.delimiter the character/character sequences in data file similar to starting character/starting character sequence in delimiter were found missing in certain cases in the Map Output

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13429097#comment-13429097 ] 

Hadoop QA commented on MAPREDUCE-4519:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12539291/MAPREDUCE-4519.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 eclipse:eclipse.  The patch built with eclipse:eclipse.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed unit tests in hadoop-common-project/hadoop-common.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2709//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2709//console

This message is automatically generated.
                

[jira] [Updated] (MAPREDUCE-4519) In TextInputFormat, while specifying textinputformat.record.delimiter the character/character sequences in data file similar to starting character/starting character sequence in delimiter were found missing in certain cases in the Map Output

Posted by "Meria Joseph (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-4519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Meria Joseph updated MAPREDUCE-4519:
------------------------------------

    Attachment: MAPREDUCE-4519.patch

A few lines changed in LineReader; the MAPREDUCE-4512 patch is also incorporated.
                

[jira] [Updated] (MAPREDUCE-4519) In TextInputFormat, while specifying textinputformat.record.delimiter the character/character sequences in data file similar to starting character/starting character sequence in delimiter were found missing in certain cases in the Map Output

Posted by "Meria Joseph (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-4519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Meria Joseph updated MAPREDUCE-4519:
------------------------------------

    Release Note: A few lines changed in LineReader; the MAPREDUCE-4512 patch is also incorporated.
    Hadoop Flags: Reviewed
          Status: Patch Available  (was: Open)

A few lines changed in LineReader; the MAPREDUCE-4512 patch is also incorporated.
                