You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@commons.apache.org by "Chris Pimlott (JIRA)" <ji...@apache.org> on 2009/11/22 03:52:39 UTC

[jira] Created: (CODEC-91) Handling of embedded padding in base64 encoded data changed in 1.4

Handling of embedded padding in base64 encoded data changed in 1.4
------------------------------------------------------------------

                 Key: CODEC-91
                 URL: https://issues.apache.org/jira/browse/CODEC-91
             Project: Commons Codec
          Issue Type: Bug
    Affects Versions: 1.4
            Reporter: Chris Pimlott


1.4 changed the way that embedded padding characters (i.e. "=") were handled when decoding base64 data.  Previously, the decoder ignored them and decoded all the data.  Now it stops upon encountering the first padding byte.  This breaks compatibility with previous versions.

For example, in 1.4,

b64.decode("Y29tbW9ucwo=".getBytes());

gives the same result as

b64.decode("Y29tbW9ucwo=Y29tbW9ucwo=".getBytes());


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Issue Comment Edited: (CODEC-91) Handling of embedded padding in base64 encoded data changed in 1.4

Posted by "Julius Davies (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CODEC-91?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12874811#action_12874811 ] 

Julius Davies edited comment on CODEC-91 at 6/2/10 5:39 PM:
------------------------------------------------------------

I carefully analyzed the codec-1.3 logic.  It doesn't quite handle this scenario correctly.  There's a bug.  Result of running the test case originally supplied by Chris Pimlott against codec-1.3 is the following:

{code}
byte[] result = b64.decode("Y29tbW9ucwo=Y29tbW9ucwo=".getBytes());

// result now contains:
{'c', 'o', 'm', 'm', 'o', 'n', 's', NULL, 'c', 'o', 'm', 'm', 'o', 'n', 's'}
{code}

That NULL shouldn't be there (actually a zero since it's a primitive, but you know what I mean), so the decode in the commons-codec-1.3 logic is actually causing a corruption of the data in this scenario.

----------------------
So what should we do for 1.4.X?  Here's what's on the table at the moment:

Option 0:  Treat first PAD as EOF and stop all processing (current 1.4 behaviour).

Option 1:  Restart decode state when PAD is encountered in any location.

Option 2:  Only restart decode state if PAD's are correctly placed in the stream (ie. AB==, ABC=).  This is closest to the codec-1.3.jar behaviour, but without the NULL bug.


Personally I'm +1 to option 1, for the following reasons:

- Probably easier to program.  The implementation doesn't really need to know much about the PAD when it encounters one (ie. no need to ask is this a good PAD or a bad PAD?).

- Correctly decodes all Base64 inputs where PADs are in the correct place, so in a way we're already implementing option 2 here.

- Also properly decodes the invalid input "AB=AB=" if stream corruption happened because of: s/==/=/g 

- For all other inputs it's garbage-in garbage-out





      was (Author: juliusdavies):
    I carefully analyzed the codec-1.3 logic.  It doesn't quite handle this scenario correctly.  There's a bug.  Result of running the test case originally supplied by Chris Pimlott against codec-1.3 is the following:

{code}
byte[] result = b64.decode("Y29tbW9ucwo=Y29tbW9ucwo=".getBytes());

// result now contains:
{'c', 'o', 'm', 'm', 'o', 'n', 's', NULL, 'c', 'o', 'm', 'm', 'o', 'n', 's'}

That NULL shouldn't be there (actually a zero since it's a primitive, but you know what I mean), so the decode in the commons-codec-1.3 logic is actually causing a corruption of the data in this scenario.

----------------------
So what should we do for 1.4.X?  Here's what's on the table at the moment:

Option 0:  Treat first PAD as EOF and stop all processing (current 1.4 behaviour).

Option 1:  Restart decode state when PAD is encountered in any location.

Option 2:  Only restart decode state if PAD's are correctly placed in the stream (ie. AB==, ABC=).  This is closest to the codec-1.3.jar behaviour, but without the NULL bug.


Personally I'm +1 to option 1, for the following reasons:

- Probably easier to program.  The implementation doesn't really need to know much about the PAD when it encounters one (ie. no need to ask is this a good PAD or a bad PAD?).

- Correctly decodes all Base64 inputs where PADs are in the correct place, so in a way we're already implementing option 2 here.

- Also properly decodes the invalid input "AB=AB=" if stream corruption happened because of: s/==/=/g 

- For all other inputs it's garbage-in garbage-out




  
> Handling of embedded padding in base64 encoded data changed in 1.4
> ------------------------------------------------------------------
>
>                 Key: CODEC-91
>                 URL: https://issues.apache.org/jira/browse/CODEC-91
>             Project: Commons Codec
>          Issue Type: Bug
>    Affects Versions: 1.4
>            Reporter: Chris Pimlott
>         Attachments: codec-91-actually-works-and-tests-yay.patch
>
>
> 1.4 changed the way that embedded padding characters (i.e. "=") were handled when decoding base64 data.  Previously, the decoder ignored them and decoded all the data.  Now it stops upon encountering the first padding byte.  This breaks compatibility with previous versions.
> For example, in 1.4,
> b64.decode("Y29tbW9ucwo=".getBytes());
> gives the same result as
> b64.decode("Y29tbW9ucwo=Y29tbW9ucwo=".getBytes());

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (CODEC-91) Handling of embedded padding in base64 encoded data changed in 1.4

Posted by "Julius Davies (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CODEC-91?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julius Davies updated CODEC-91:
-------------------------------

    Attachment: codec-91-actually-works-and-tests-yay.patch

This is a much better patch - it actually works.  And to respond to Sebb, I think we should bring back the original 1.3 behaviour as the default behaviour.  We can later consider allowing alternative behaviour via flags or config or some other mechanism.

> Handling of embedded padding in base64 encoded data changed in 1.4
> ------------------------------------------------------------------
>
>                 Key: CODEC-91
>                 URL: https://issues.apache.org/jira/browse/CODEC-91
>             Project: Commons Codec
>          Issue Type: Bug
>    Affects Versions: 1.4
>            Reporter: Chris Pimlott
>         Attachments: codec-91-actually-works-and-tests-yay.patch
>
>
> 1.4 changed the way that embedded padding characters (i.e. "=") were handled when decoding base64 data.  Previously, the decoder ignored them and decoded all the data.  Now it stops upon encountering the first padding byte.  This breaks compatibility with previous versions.
> For example, in 1.4,
> b64.decode("Y29tbW9ucwo=".getBytes());
> gives the same result as
> b64.decode("Y29tbW9ucwo=Y29tbW9ucwo=".getBytes());

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (CODEC-91) Handling of embedded padding in base64 encoded data changed in 1.4

Posted by "Jochen Wiedmann (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CODEC-91?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12875013#action_12875013 ] 

Jochen Wiedmann commented on CODEC-91:
--------------------------------------

I'm clearly in favour of the current behaviour: Any non-blank characters following the *proper* number of pad characters indicates a broken data stream.

Note, that the user is able to achieve his strange wish by parsing the data stream with a simpe regex before handling it over to commons codec. For example (not verified, in particular not for performance, just to give an impression):

    String base64String = "Y29tbW9ucwo=Y29tbW9ucwo=";
    Pattern p = Pattern.compile("\\=([^\\=\\s])");
    for (;;) {
        Matcher m = p.matcher(base64String);
        while (m.find()) {
            base64String = m.replaceFirst(m.group(1));
        }
    }



> Handling of embedded padding in base64 encoded data changed in 1.4
> ------------------------------------------------------------------
>
>                 Key: CODEC-91
>                 URL: https://issues.apache.org/jira/browse/CODEC-91
>             Project: Commons Codec
>          Issue Type: Bug
>    Affects Versions: 1.4
>            Reporter: Chris Pimlott
>            Assignee: Julius Davies
>         Attachments: codec-91-actually-works-and-tests-yay.patch
>
>
> 1.4 changed the way that embedded padding characters (i.e. "=") were handled when decoding base64 data.  Previously, the decoder ignored them and decoded all the data.  Now it stops upon encountering the first padding byte.  This breaks compatibility with previous versions.
> For example, in 1.4,
> b64.decode("Y29tbW9ucwo=".getBytes());
> gives the same result as
> b64.decode("Y29tbW9ucwo=Y29tbW9ucwo=".getBytes());

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (CODEC-91) Handling of embedded padding in base64 encoded data changed in 1.4

Posted by "Sebb (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CODEC-91?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12874333#action_12874333 ] 

Sebb commented on CODEC-91:
---------------------------

The buffer output code when PAD is detected is the same as for the final "if (eof && modulus != 0)".

This does nothing if modulus == 1; I wonder if this is the cause of the behaviour reported above by Raphael, which should probably be a new JIRA.

Seems to me there need to be tests which exercise all possible moduli.

> Handling of embedded padding in base64 encoded data changed in 1.4
> ------------------------------------------------------------------
>
>                 Key: CODEC-91
>                 URL: https://issues.apache.org/jira/browse/CODEC-91
>             Project: Commons Codec
>          Issue Type: Bug
>    Affects Versions: 1.4
>            Reporter: Chris Pimlott
>         Attachments: codec-91-actually-works-and-tests-yay.patch
>
>
> 1.4 changed the way that embedded padding characters (i.e. "=") were handled when decoding base64 data.  Previously, the decoder ignored them and decoded all the data.  Now it stops upon encountering the first padding byte.  This breaks compatibility with previous versions.
> For example, in 1.4,
> b64.decode("Y29tbW9ucwo=".getBytes());
> gives the same result as
> b64.decode("Y29tbW9ucwo=Y29tbW9ucwo=".getBytes());

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (CODEC-91) Handling of embedded padding in base64 encoded data changed in 1.4

Posted by "Julius Davies (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CODEC-91?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julius Davies updated CODEC-91:
-------------------------------

    Attachment:     (was: codec-91.patch)

> Handling of embedded padding in base64 encoded data changed in 1.4
> ------------------------------------------------------------------
>
>                 Key: CODEC-91
>                 URL: https://issues.apache.org/jira/browse/CODEC-91
>             Project: Commons Codec
>          Issue Type: Bug
>    Affects Versions: 1.4
>            Reporter: Chris Pimlott
>
> 1.4 changed the way that embedded padding characters (i.e. "=") were handled when decoding base64 data.  Previously, the decoder ignored them and decoded all the data.  Now it stops upon encountering the first padding byte.  This breaks compatibility with previous versions.
> For example, in 1.4,
> b64.decode("Y29tbW9ucwo=".getBytes());
> gives the same result as
> b64.decode("Y29tbW9ucwo=Y29tbW9ucwo=".getBytes());

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (CODEC-91) Handling of embedded padding in base64 encoded data changed in 1.4

Posted by "Julius Davies (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CODEC-91?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12874811#action_12874811 ] 

Julius Davies commented on CODEC-91:
------------------------------------

I carefully analyzed the codec-1.3 logic.  It doesn't quite handle this scenario correctly.  There's a bug.  Result of running the test case originally supplied by Chris Pimlott against codec-1.3 is the following:

{code}
byte[] result = b64.decode("Y29tbW9ucwo=Y29tbW9ucwo=".getBytes());

// result now contains:
{'c', 'o', 'm', 'm', 'o', 'n', 's', NULL, 'c', 'o', 'm', 'm', 'o', 'n', 's'}

That NULL shouldn't be there (actually a zero since it's a primitive, but you know what I mean), so the decode in the commons-codec-1.3 logic is actually causing a corruption of the data in this scenario.

----------------------
So what should we do for 1.4.X?  Here's what's on the table at the moment:

Option 0:  Treat first PAD as EOF and stop all processing (current 1.4 behaviour).

Option 1:  Restart decode state when PAD is encountered in any location.

Option 2:  Only restart decode state if PAD's are correctly placed in the stream (ie. AB==, ABC=).  This is closest to the codec-1.3.jar behaviour, but without the NULL bug.


Personally I'm +1 to option 1, for the following reasons:

- Probably easier to program.  The implementation doesn't really need to know much about the PAD when it encounters one (ie. no need to ask is this a good PAD or a bad PAD?).

- Correctly decodes all Base64 inputs where PADs are in the correct place, so in a way we're already implementing option 2 here.

- Also properly decodes the invalid input "AB=AB=" if stream corruption happened because of: s/==/=/g 

- For all other inputs it's garbage-in garbage-out





> Handling of embedded padding in base64 encoded data changed in 1.4
> ------------------------------------------------------------------
>
>                 Key: CODEC-91
>                 URL: https://issues.apache.org/jira/browse/CODEC-91
>             Project: Commons Codec
>          Issue Type: Bug
>    Affects Versions: 1.4
>            Reporter: Chris Pimlott
>         Attachments: codec-91-actually-works-and-tests-yay.patch
>
>
> 1.4 changed the way that embedded padding characters (i.e. "=") were handled when decoding base64 data.  Previously, the decoder ignored them and decoded all the data.  Now it stops upon encountering the first padding byte.  This breaks compatibility with previous versions.
> For example, in 1.4,
> b64.decode("Y29tbW9ucwo=".getBytes());
> gives the same result as
> b64.decode("Y29tbW9ucwo=Y29tbW9ucwo=".getBytes());

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Issue Comment Edited: (CODEC-91) Handling of embedded padding in base64 encoded data changed in 1.4

Posted by "Chris Pimlott (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CODEC-91?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781465#action_12781465 ] 

Chris Pimlott edited comment on CODEC-91 at 11/23/09 4:38 PM:
--------------------------------------------------------------

It's not clear to me if the old behavior was strictly a bug, given the "may".  RFC4648 states:

   Furthermore, such specifications MAY ignore the pad
   character, "=", treating it as non-alphabet data, if it is present
   before the end of the encoded data.

http://tools.ietf.org/html/rfc4648#section-3.3

It sounds like either way (ignoring or stopping) of treating the embedded "=" is acceptable, which would recast the change as an implementation decision, rather than a bug fix.

At the very least, the change in behavior should have been made clear in the release notes and in the javadoc.  An option to restore the previous behavior would be appreciated as well.

      was (Author: pimlottc):
    It's not clear to me if the old behavior was strictly a bug, given the "may".  RFC4648 states:

   Furthermore, such specifications MAY ignore the pad
   character, "=", treating it as non-alphabet data, if it is present
   before the end of the encoded data.

http://tools.ietf.org/html/rfc4648#section-3.3

It sounds like either way (ignoring or stopping) of treating the embedded "=" is acceptable, which would recast the change as a implementation decision, rather than a bug fix.

At the very least, the change in behavior should have been made clear in the release notes and in the javadoc.  An option to restore the previous behavior would be appreciated as well.
  
> Handling of embedded padding in base64 encoded data changed in 1.4
> ------------------------------------------------------------------
>
>                 Key: CODEC-91
>                 URL: https://issues.apache.org/jira/browse/CODEC-91
>             Project: Commons Codec
>          Issue Type: Bug
>    Affects Versions: 1.4
>            Reporter: Chris Pimlott
>
> 1.4 changed the way that embedded padding characters (i.e. "=") were handled when decoding base64 data.  Previously, the decoder ignored them and decoded all the data.  Now it stops upon encountering the first padding byte.  This breaks compatibility with previous versions.
> For example, in 1.4,
> b64.decode("Y29tbW9ucwo=".getBytes());
> gives the same result as
> b64.decode("Y29tbW9ucwo=Y29tbW9ucwo=".getBytes());

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (CODEC-91) Handling of embedded padding in base64 encoded data changed in 1.4

Posted by "Sebb (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CODEC-91?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12874371#action_12874371 ] 

Sebb commented on CODEC-91:
---------------------------

OK, understood.

But I still think it is wrong to restart after a single PAD, unless the modulus == 3.
For example, ABC=DEF= would be OK to treat as two encodings, but AB=CD and A=B= should not be treated as two encodings.
ISTM that either such PADs should be completely ignored, or an error should be generated - anything else has little meaning.

> Handling of embedded padding in base64 encoded data changed in 1.4
> ------------------------------------------------------------------
>
>                 Key: CODEC-91
>                 URL: https://issues.apache.org/jira/browse/CODEC-91
>             Project: Commons Codec
>          Issue Type: Bug
>    Affects Versions: 1.4
>            Reporter: Chris Pimlott
>         Attachments: codec-91-actually-works-and-tests-yay.patch
>
>
> 1.4 changed the way that embedded padding characters (i.e. "=") were handled when decoding base64 data.  Previously, the decoder ignored them and decoded all the data.  Now it stops upon encountering the first padding byte.  This breaks compatibility with previous versions.
> For example, in 1.4,
> b64.decode("Y29tbW9ucwo=".getBytes());
> gives the same result as
> b64.decode("Y29tbW9ucwo=Y29tbW9ucwo=".getBytes());

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (CODEC-91) Handling of embedded padding in base64 encoded data changed in 1.4

Posted by "Julius Davies (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CODEC-91?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12874348#action_12874348 ] 

Julius Davies commented on CODEC-91:
------------------------------------

When we encode we tend to have at least 1 byte to encode == 8 bits.  Base64 results in 6 bits of encoding per emitted byte, so modulus of 1 should be impossible.  That first encoded byte will always result in two emitted bytes:   [6 bits] [2 bits].

Still it's good info to see that codec-1.4 accidentally changed the behaviour here.  I'll create a new ticket based on Raphael's comment.

> Handling of embedded padding in base64 encoded data changed in 1.4
> ------------------------------------------------------------------
>
>                 Key: CODEC-91
>                 URL: https://issues.apache.org/jira/browse/CODEC-91
>             Project: Commons Codec
>          Issue Type: Bug
>    Affects Versions: 1.4
>            Reporter: Chris Pimlott
>         Attachments: codec-91-actually-works-and-tests-yay.patch
>
>
> 1.4 changed the way that embedded padding characters (i.e. "=") were handled when decoding base64 data.  Previously, the decoder ignored them and decoded all the data.  Now it stops upon encountering the first padding byte.  This breaks compatibility with previous versions.
> For example, in 1.4,
> b64.decode("Y29tbW9ucwo=".getBytes());
> gives the same result as
> b64.decode("Y29tbW9ucwo=Y29tbW9ucwo=".getBytes());

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (CODEC-91) Handling of embedded padding in base64 encoded data changed in 1.4

Posted by "Sebb (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CODEC-91?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850446#action_12850446 ] 

Sebb commented on CODEC-91:
---------------------------

I agree, the behaviour should be selectable. This means we will have to choose - and document - the default behaviour.

> Handling of embedded padding in base64 encoded data changed in 1.4
> ------------------------------------------------------------------
>
>                 Key: CODEC-91
>                 URL: https://issues.apache.org/jira/browse/CODEC-91
>             Project: Commons Codec
>          Issue Type: Bug
>    Affects Versions: 1.4
>            Reporter: Chris Pimlott
>         Attachments: codec-91.patch
>
>
> 1.4 changed the way that embedded padding characters (i.e. "=") were handled when decoding base64 data.  Previously, the decoder ignored them and decoded all the data.  Now it stops upon encountering the first padding byte.  This breaks compatibility with previous versions.
> For example, in 1.4,
> b64.decode("Y29tbW9ucwo=".getBytes());
> gives the same result as
> b64.decode("Y29tbW9ucwo=Y29tbW9ucwo=".getBytes());

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (CODEC-91) Handling of embedded padding in base64 encoded data changed in 1.4

Posted by "Raphael Ackermann (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CODEC-91?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12869146#action_12869146 ] 

Raphael Ackermann commented on CODEC-91:
----------------------------------------

This might be related: When decoding a one byte string as shown below, the new behaviour returns an empty byte array whereas before it would return a byte array with 0 in it.

String enc = "1";
byte[] result = Base64.decodeBase64(enc.getBytes())

where in 1.3 result would be [0]
and in 1.4 result is an empty byte array []



> Handling of embedded padding in base64 encoded data changed in 1.4
> ------------------------------------------------------------------
>
>                 Key: CODEC-91
>                 URL: https://issues.apache.org/jira/browse/CODEC-91
>             Project: Commons Codec
>          Issue Type: Bug
>    Affects Versions: 1.4
>            Reporter: Chris Pimlott
>         Attachments: codec-91.patch
>
>
> 1.4 changed the way that embedded padding characters (i.e. "=") were handled when decoding base64 data.  Previously, the decoder ignored them and decoded all the data.  Now it stops upon encountering the first padding byte.  This breaks compatibility with previous versions.
> For example, in 1.4,
> b64.decode("Y29tbW9ucwo=".getBytes());
> gives the same result as
> b64.decode("Y29tbW9ucwo=Y29tbW9ucwo=".getBytes());

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (CODEC-91) Handling of embedded padding in base64 encoded data changed in 1.4

Posted by "Gary Gregory (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CODEC-91?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781278#action_12781278 ] 

Gary Gregory commented on CODEC-91:
-----------------------------------

The old behavior was a bug I think. 

>From the spec http://tools.ietf.org/html/rfc2045#section-6.8:
{quote}
Because it is used only for padding at the end of the data, the
   occurrence of any "=" characters may be taken as evidence that the
   end of the data has been reached (without truncation in transit).
{quote}

> Handling of embedded padding in base64 encoded data changed in 1.4
> ------------------------------------------------------------------
>
>                 Key: CODEC-91
>                 URL: https://issues.apache.org/jira/browse/CODEC-91
>             Project: Commons Codec
>          Issue Type: Bug
>    Affects Versions: 1.4
>            Reporter: Chris Pimlott
>
> 1.4 changed the way that embedded padding characters (i.e. "=") were handled when decoding base64 data.  Previously, the decoder ignored them and decoded all the data.  Now it stops upon encountering the first padding byte.  This breaks compatibility with previous versions.
> For example, in 1.4,
> b64.decode("Y29tbW9ucwo=".getBytes());
> gives the same result as
> b64.decode("Y29tbW9ucwo=Y29tbW9ucwo=".getBytes());

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (CODEC-91) Handling of embedded padding in base64 encoded data changed in 1.4

Posted by "Gary Gregory (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CODEC-91?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781497#action_12781497 ] 

Gary Gregory commented on CODEC-91:
-----------------------------------

Good points, it sounds like the behavior should have a toggle ivar.

> Handling of embedded padding in base64 encoded data changed in 1.4
> ------------------------------------------------------------------
>
>                 Key: CODEC-91
>                 URL: https://issues.apache.org/jira/browse/CODEC-91
>             Project: Commons Codec
>          Issue Type: Bug
>    Affects Versions: 1.4
>            Reporter: Chris Pimlott
>
> 1.4 changed the way that embedded padding characters (i.e. "=") were handled when decoding base64 data.  Previously, the decoder ignored them and decoded all the data.  Now it stops upon encountering the first padding byte.  This breaks compatibility with previous versions.
> For example, in 1.4,
> b64.decode("Y29tbW9ucwo=".getBytes());
> gives the same result as
> b64.decode("Y29tbW9ucwo=Y29tbW9ucwo=".getBytes());

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (CODEC-91) Handling of embedded padding in base64 encoded data changed in 1.4

Posted by "Julius Davies (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CODEC-91?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781297#action_12781297 ] 

Julius Davies commented on CODEC-91:
------------------------------------


FYI:  there are two additional changes to the decoding behaviour (as well as what you've identified here in CODEC-91):


1.  Ability to decode even when == is missing (at EOF):  https://issues.apache.org/jira/browse/CODEC-75?focusedCommentId=12712517&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12712517


2.  URL-Safe characters are now auto-decoded:  CODEC-75


I think these were the only three "on purpose" changes to base64 decoding in 1.4.



> Handling of embedded padding in base64 encoded data changed in 1.4
> ------------------------------------------------------------------
>
>                 Key: CODEC-91
>                 URL: https://issues.apache.org/jira/browse/CODEC-91
>             Project: Commons Codec
>          Issue Type: Bug
>    Affects Versions: 1.4
>            Reporter: Chris Pimlott
>
> 1.4 changed the way that embedded padding characters (i.e. "=") were handled when decoding base64 data.  Previously, the decoder ignored them and decoded all the data.  Now it stops upon encountering the first padding byte.  This breaks compatibility with previous versions.
> For example, in 1.4,
> b64.decode("Y29tbW9ucwo=".getBytes());
> gives the same result as
> b64.decode("Y29tbW9ucwo=Y29tbW9ucwo=".getBytes());

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (CODEC-91) Handling of embedded padding in base64 encoded data changed in 1.4

Posted by "Sebb (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CODEC-91?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12874355#action_12874355 ] 

Sebb commented on CODEC-91:
---------------------------

OK, but it's still possible to have PAD after a single input character, so maybe your patch needs to allow for this, even if the final if clause does not need to.

However, I'm now wondering whether it makes sense to treat PAD differently from non-alphabet data.
Why should a single PAD restart the decoding? Was that how 1.3 behaved?


> Handling of embedded padding in base64 encoded data changed in 1.4
> ------------------------------------------------------------------
>
>                 Key: CODEC-91
>                 URL: https://issues.apache.org/jira/browse/CODEC-91
>             Project: Commons Codec
>          Issue Type: Bug
>    Affects Versions: 1.4
>            Reporter: Chris Pimlott
>         Attachments: codec-91-actually-works-and-tests-yay.patch
>
>
> 1.4 changed the way that embedded padding characters (i.e. "=") were handled when decoding base64 data.  Previously, the decoder ignored them and decoded all the data.  Now it stops upon encountering the first padding byte.  This breaks compatibility with previous versions.
> For example, in 1.4,
> b64.decode("Y29tbW9ucwo=".getBytes());
> gives the same result as
> b64.decode("Y29tbW9ucwo=Y29tbW9ucwo=".getBytes());

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Issue Comment Edited: (CODEC-91) Handling of embedded padding in base64 encoded data changed in 1.4

Posted by "Chris Pimlott (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CODEC-91?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781465#action_12781465 ] 

Chris Pimlott edited comment on CODEC-91 at 11/23/09 4:37 PM:
--------------------------------------------------------------

It's not clear to me if the old behavior was strictly a bug, given the "may".  RFC4648 states:

   Furthermore, such specifications MAY ignore the pad
   character, "=", treating it as non-alphabet data, if it is present
   before the end of the encoded data.

http://tools.ietf.org/html/rfc4648#section-3.3

It sounds like either way (ignoring or stopping) of treating the embedded "=" is acceptable, which would recast the change as a implementation decision, rather than a bug fix.

At the very least, the change in behavior should have been made clear in the release notes and in the javadoc.  An option to restore the previous behavior would be appreciated as well.

      was (Author: pimlottc):
    It's not clear to me if the old behavior was strictly a bug, given the "may".  RFC4648 states:

   Furthermore, such specifications MAY ignore the pad
   character, "=", treating it as non-alphabet data, if it is present
   before the end of the encoded data.

http://tools.ietf.org/html/rfc4648#section-3.3

It sounds like either way (ignoring or stopping) of treating the embedded "=" is acceptable.

At the very least, the change in behavior should have been made clear in the release notes and in the javadoc.  An option to restore the previous behavior would be appreciated as well.
  
> Handling of embedded padding in base64 encoded data changed in 1.4
> ------------------------------------------------------------------
>
>                 Key: CODEC-91
>                 URL: https://issues.apache.org/jira/browse/CODEC-91
>             Project: Commons Codec
>          Issue Type: Bug
>    Affects Versions: 1.4
>            Reporter: Chris Pimlott
>
> 1.4 changed the way that embedded padding characters (i.e. "=") were handled when decoding base64 data.  Previously, the decoder ignored them and decoded all the data.  Now it stops upon encountering the first padding byte.  This breaks compatibility with previous versions.
> For example, in 1.4,
> b64.decode("Y29tbW9ucwo=".getBytes());
> gives the same result as
> b64.decode("Y29tbW9ucwo=Y29tbW9ucwo=".getBytes());

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (CODEC-91) Handling of embedded padding in base64 encoded data changed in 1.4

Posted by "Julius Davies (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CODEC-91?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julius Davies updated CODEC-91:
-------------------------------

    Attachment: codec-91.patch


The attached patch brings back the original commons-codec-1.3.jar behaviour as requested by Chris Pimlott.

NOTE:  This is another way to unearth the bug in CODEC-98, so we'll need the CODEC-98 fix if we accept this patch.

> Handling of embedded padding in base64 encoded data changed in 1.4
> ------------------------------------------------------------------
>
>                 Key: CODEC-91
>                 URL: https://issues.apache.org/jira/browse/CODEC-91
>             Project: Commons Codec
>          Issue Type: Bug
>    Affects Versions: 1.4
>            Reporter: Chris Pimlott
>         Attachments: codec-91.patch
>
>
> 1.4 changed the way that embedded padding characters (i.e. "=") were handled when decoding base64 data.  Previously, the decoder ignored them and decoded all the data.  Now it stops upon encountering the first padding byte.  This breaks compatibility with previous versions.
> For example, in 1.4,
> b64.decode("Y29tbW9ucwo=".getBytes());
> gives the same result as
> b64.decode("Y29tbW9ucwo=Y29tbW9ucwo=".getBytes());

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Issue Comment Edited: (CODEC-91) Handling of embedded padding in base64 encoded data changed in 1.4

Posted by "Julius Davies (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CODEC-91?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12874811#action_12874811 ] 

Julius Davies edited comment on CODEC-91 at 6/2/10 5:40 PM:
------------------------------------------------------------

I carefully analyzed the codec-1.3 logic.  It doesn't quite handle this scenario correctly.  There's a bug.  Result of running the test case originally supplied by Chris Pimlott against codec-1.3 is the following:

{code}
byte[] result = b64.decode("Y29tbW9ucwo=Y29tbW9ucwo=".getBytes());

// result now contains:
{'c', 'o', 'm', 'm', 'o', 'n', 's', NULL, 'c', 'o', 'm', 'm', 'o', 'n', 's'}
{code}

That NULL shouldn't be there (actually a zero since it's a primitive, but you know what I mean), so the decode in the commons-codec-1.3 logic is actually causing a corruption of the data in this scenario.

----------------------
So what should we do for 1.4.X?  Here's what's on the table at the moment:

Option 0:  Treat first PAD as EOF and stop all processing (current 1.4 behaviour).

Option 1:  Restart decode state when PAD is encountered in any location.

Option 2:  Only restart decode state if PAD's are correctly placed in the stream (ie. AB==, ABC=).  This is closest to the codec-1.3.jar behaviour, but without the NULL bug.


Personally I'm +1 to option 1, for the following reasons:

- Probably easier to program.  The implementation doesn't really need to know much about the PAD when it encounters one (ie. no need to ask is this a good PAD or a bad PAD?).

- Correctly decodes all Base64 inputs where PADs are in the correct place, so in a way we're already implementing option 2 here.

- Also properly decodes the invalid input "AB=AB=" if corruption happened because of: s/==/=/g 

- For all other inputs it's garbage-in garbage-out





      was (Author: juliusdavies):
    I carefully analyzed the codec-1.3 logic.  It doesn't quite handle this scenario correctly.  There's a bug.  Result of running the test case originally supplied by Chris Pimlott against codec-1.3 is the following:

{code}
byte[] result = b64.decode("Y29tbW9ucwo=Y29tbW9ucwo=".getBytes());

// result now contains:
{'c', 'o', 'm', 'm', 'o', 'n', 's', NULL, 'c', 'o', 'm', 'm', 'o', 'n', 's'}
{code}

That NULL shouldn't be there (actually a zero since it's a primitive, but you know what I mean), so the decode in the commons-codec-1.3 logic is actually causing a corruption of the data in this scenario.

----------------------
So what should we do for 1.4.X?  Here's what's on the table at the moment:

Option 0:  Treat first PAD as EOF and stop all processing (current 1.4 behaviour).

Option 1:  Restart decode state when PAD is encountered in any location.

Option 2:  Only restart decode state if PAD's are correctly placed in the stream (ie. AB==, ABC=).  This is closest to the codec-1.3.jar behaviour, but without the NULL bug.


Personally I'm +1 to option 1, for the following reasons:

- Probably easier to program.  The implementation doesn't really need to know much about the PAD when it encounters one (ie. no need to ask is this a good PAD or a bad PAD?).

- Correctly decodes all Base64 inputs where PADs are in the correct place, so in a way we're already implementing option 2 here.

- Also properly decodes the invalid input "AB=AB=" if stream corruption happened because of: s/==/=/g 

- For all other inputs it's garbage-in garbage-out




  
> Handling of embedded padding in base64 encoded data changed in 1.4
> ------------------------------------------------------------------
>
>                 Key: CODEC-91
>                 URL: https://issues.apache.org/jira/browse/CODEC-91
>             Project: Commons Codec
>          Issue Type: Bug
>    Affects Versions: 1.4
>            Reporter: Chris Pimlott
>         Attachments: codec-91-actually-works-and-tests-yay.patch
>
>
> 1.4 changed the way that embedded padding characters (i.e. "=") were handled when decoding base64 data.  Previously, the decoder ignored them and decoded all the data.  Now it stops upon encountering the first padding byte.  This breaks compatibility with previous versions.
> For example, in 1.4,
> b64.decode("Y29tbW9ucwo=".getBytes());
> gives the same result as
> b64.decode("Y29tbW9ucwo=Y29tbW9ucwo=".getBytes());

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (CODEC-91) Handling of embedded padding in base64 encoded data changed in 1.4

Posted by "Gary Gregory (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CODEC-91?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12874386#action_12874386 ] 

Gary Gregory commented on CODEC-91:
-----------------------------------

On a different tack, I think we should offer an argument to decide if you want to follow the specification WRT "Padding at the end of the data is performed using the "=" character." This would have to be a new API obviously. The default is TDB: should it be the [codec] 1.3 or 1.4 behavior.

> Handling of embedded padding in base64 encoded data changed in 1.4
> ------------------------------------------------------------------
>
>                 Key: CODEC-91
>                 URL: https://issues.apache.org/jira/browse/CODEC-91
>             Project: Commons Codec
>          Issue Type: Bug
>    Affects Versions: 1.4
>            Reporter: Chris Pimlott
>         Attachments: codec-91-actually-works-and-tests-yay.patch
>
>
> 1.4 changed the way that embedded padding characters (i.e. "=") were handled when decoding base64 data.  Previously, the decoder ignored them and decoded all the data.  Now it stops upon encountering the first padding byte.  This breaks compatibility with previous versions.
> For example, in 1.4,
> b64.decode("Y29tbW9ucwo=".getBytes());
> gives the same result as
> b64.decode("Y29tbW9ucwo=Y29tbW9ucwo=".getBytes());

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (CODEC-91) Handling of embedded padding in base64 encoded data changed in 1.4

Posted by "Chris Pimlott (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CODEC-91?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781465#action_12781465 ] 

Chris Pimlott commented on CODEC-91:
------------------------------------

It's not clear to me if the old behavior was strictly a bug, given the "may".  RFC4648 states:

   Furthermore, such specifications MAY ignore the pad
   character, "=", treating it as non-alphabet data, if it is present
   before the end of the encoded data.

http://tools.ietf.org/html/rfc4648#section-3.3

It sounds like either way (ignoring or stopping) of treating the embedded "=" is acceptable.

At the very least, the change in behavior should have been made clear in the release notes and in the javadoc.  An option to restore the previous behavior would be appreciated as well.

> Handling of embedded padding in base64 encoded data changed in 1.4
> ------------------------------------------------------------------
>
>                 Key: CODEC-91
>                 URL: https://issues.apache.org/jira/browse/CODEC-91
>             Project: Commons Codec
>          Issue Type: Bug
>    Affects Versions: 1.4
>            Reporter: Chris Pimlott
>
> 1.4 changed the way that embedded padding characters (i.e. "=") were handled when decoding base64 data.  Previously, the decoder ignored them and decoded all the data.  Now it stops upon encountering the first padding byte.  This breaks compatibility with previous versions.
> For example, in 1.4,
> b64.decode("Y29tbW9ucwo=".getBytes());
> gives the same result as
> b64.decode("Y29tbW9ucwo=Y29tbW9ucwo=".getBytes());

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (CODEC-91) Handling of embedded padding in base64 encoded data changed in 1.4

Posted by "Julius Davies (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CODEC-91?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12874364#action_12874364 ] 

Julius Davies commented on CODEC-91:
------------------------------------

PAD after a single character should be impossible.  Something bad has happened in that case.  There's no way to stuff 8 bits into a single 6 bit container.

Look at the 4 possibilities:

{code}
ABCD --> 4 * 6 = 24 bits --> 3 bytes
ABC= --> 3 * 6 = 18 bits --> 2 bytes + 2 bits of padding
AB== --> 2 * 6 = 12 bits --> 1 byte + 4 bits of padding
A=== --> 1 * 6 =  6 bits --> original byte cannot be reconstructed
{code}


A single PAD restarts the decoding, but so does the 2nd PAD if it's there.  Both restart it.  Notice the 1st PAD will set modulus to 0, so the if (modulus != 0) branch will be skipped for the 2nd PAD.

I put the unit-test into Base64Codec13Test.java because the end result with codec-1.3.jar is the same, although the logic is a bit different.

I will still look into what we can do for modulus==1 case, even though 2 bits are definitely lost in that case.



> Handling of embedded padding in base64 encoded data changed in 1.4
> ------------------------------------------------------------------
>
>                 Key: CODEC-91
>                 URL: https://issues.apache.org/jira/browse/CODEC-91
>             Project: Commons Codec
>          Issue Type: Bug
>    Affects Versions: 1.4
>            Reporter: Chris Pimlott
>         Attachments: codec-91-actually-works-and-tests-yay.patch
>
>
> 1.4 changed the way that embedded padding characters (i.e. "=") were handled when decoding base64 data.  Previously, the decoder ignored them and decoded all the data.  Now it stops upon encountering the first padding byte.  This breaks compatibility with previous versions.
> For example, in 1.4,
> b64.decode("Y29tbW9ucwo=".getBytes());
> gives the same result as
> b64.decode("Y29tbW9ucwo=Y29tbW9ucwo=".getBytes());

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.