Posted to user@poi.apache.org by Daniel Noll <da...@nuix.com.au> on 2005/03/30 03:12:01 UTC

Most recent release? (Codepage support in HPSF)

I'm in need of codepage support in HPSF, and noticed that according to 
CVS, codepage support was added on the 2nd of December, 2003.

The last release (poi-src-2.5.1-final-20040804) is dated the 4th of 
August, 2004, but as far as I can tell, this release has no codepage 
support.  In fact, it lacks the class which the support was added to... 
meaning that the release really dates to more than 19 months ago.

Given that POI CVS doesn't appear to have any branches in it, I'm just 
getting a little confused over when this release was actually cut.  
Perhaps it was supposed to say 2003 instead of 2004?

Either way I need this codepage support (at least for UTF-8), so how 
stable would people say the current CVS is?

Daniel

-- 
Daniel Noll

NUIX Pty Ltd
Level 8, 143 York Street, Sydney 2000
Phone: (02) 9283 9010
Fax:   (02) 9283 9020

This message is intended only for the named recipient. If you are not
the intended recipient you are notified that disclosing, copying,
distributing or taking any action in reliance on the contents of this
message or attachment is strictly prohibited.


---------------------------------------------------------------------
To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
Mailing List:     http://jakarta.apache.org/site/mail2.html#poi
The Apache Jakarta Poi Project:  http://jakarta.apache.org/poi/


Re: Patch: Dictionary should be read in the codepage of the section it's in.

Posted by Daniel Noll <da...@nuix.com.au>.
Rainer Klute wrote:

>On Thursday, 31.03.2005, at 10:37 +1000, Daniel Noll wrote:
>
>>Danny Mui wrote:
>>
>>>Can you attach it to a bug with a prefix of [PATCH]?  Easier to track 
>>>down changes down the road.
>>>
>>>http://issues.apache.org/bugzilla/enter_bug.cgi?product=POI
>>
>>Done: http://issues.apache.org/bugzilla/show_bug.cgi?id=34247
>>
>>Daniel
>
>Just a short note to let you know that I am working on this. However,
>UTF-8 causes problems. It is sooo good to have those test cases!

I was wondering about that.  Our particular UTF-8 example works, but 
it's only Russian, so might be one of the lucky cases which works.

I assume that the count is actually the number of characters... which 
makes life very interesting indeed for encodings which are not of a 
fixed byte count. :-)
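
To illustrate why a character count is awkward for a variable-width encoding, here is a minimal sketch. It is not taken from POI; the class and method names are made up for the illustration. It only compares the number of characters in a string with the number of bytes it occupies in UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of POI.
class LengthDemo {
    /** Number of bytes the text occupies when encoded as UTF-8. */
    static int utf8ByteLength(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String latin = "Title";
        // "Привет" written with escapes so the source file encoding does not matter.
        String russian = "\u041f\u0440\u0438\u0432\u0435\u0442";

        // For ASCII text the two counts agree.
        System.out.println(latin.length() + " chars, " + utf8ByteLength(latin) + " bytes");
        // Each Cyrillic letter takes two bytes in UTF-8, so the counts differ.
        System.out.println(russian.length() + " chars, " + utf8ByteLength(russian) + " bytes");
    }
}
```

If the stored count were in characters, a reader could not know how many bytes to consume without decoding as it goes.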

Daniel



Re: Patch: Dictionary should be read in the codepage of the section it's in.

Posted by Rainer Klute <kl...@rainer-klute.de>.
On Thursday, 31.03.2005, at 10:37 +1000, Daniel Noll wrote:
> Danny Mui wrote:
> 
> > Can you attach it to a bug with a prefix of [PATCH]?  Easier to track 
> > down changes down the road.
> >
> > http://issues.apache.org/bugzilla/enter_bug.cgi?product=POI
> 
> Done: http://issues.apache.org/bugzilla/show_bug.cgi?id=34247
> 
> Daniel

Just a short note to let you know that I am working on this. However,
UTF-8 causes problems. It is sooo good to have those test cases!

Best regards
Rainer Klute

                           Rainer Klute IT-Consulting GmbH
  Dipl.-Inform.
  Rainer Klute             E-Mail:  klute@rainer-klute.de
  Körner Grund 24          Phone:   +49 172 2324824
D-44143 Dortmund           Fax:     +49 231 5349423

Inhibit software patents: http://www.nosoftwarepatents.com/




Re: Patch: Dictionary should be read in the codepage of the section it's in.

Posted by Daniel Noll <da...@nuix.com.au>.
Danny Mui wrote:

> Can you attach it to a bug with a prefix of [PATCH]?  Easier to track 
> down changes down the road.
>
> http://issues.apache.org/bugzilla/enter_bug.cgi?product=POI

Done: http://issues.apache.org/bugzilla/show_bug.cgi?id=34247

Daniel



Re: Patch: Dictionary should be read in the codepage of the section it's in.

Posted by Danny Mui <da...@muibros.com>.
Can you attach it to a bug with a prefix of [PATCH]?  Easier to track 
down changes down the road.

http://issues.apache.org/bugzilla/enter_bug.cgi?product=POI

Daniel Noll wrote:
> Daniel Noll wrote:
> 
>> I'll submit a patch in a few minutes if I can clean up my 
>> code sufficiently, but it won't be a clean patch so you'll probably 
>> have to rearrange it a little. :-/
> 
> 
> Patch attached as promised / threatened. ;-)
> 
> To compare before and after, you might want to construct a test 
> document, and create a custom property with non-latin name and value.  
> The previous code will work for the value but not the name, and this 
> update should hopefully make it work for the property name as well.
> 
> I'm not sure if this works in all cases, but it actually seems to behave 
> for our test UTF-8 custom properties, which should be the most exotic 
> encoding one would come across (fingers crossed.)
> 
> Daniel
> 
> 
> ------------------------------------------------------------------------
> 
> Index: src/java/org/apache/poi/hpsf/Property.java
> ===================================================================
> RCS file: /home/cvspublic/jakarta-poi/src/java/org/apache/poi/hpsf/Property.java,v
> retrieving revision 1.20
> diff -u -r1.20 Property.java
> --- src/java/org/apache/poi/hpsf/Property.java  31 Aug 2004 20:45:00 -0000      1.20
> +++ src/java/org/apache/poi/hpsf/Property.java  30 Mar 2005 06:30:44 -0000
> @@ -170,9 +170,12 @@
>       * @param length The dictionary contains at most this many bytes.
>       * @param codepage The codepage of the string values.
>       * @return The dictonary
> +     * @exception UnsupportedEncodingException if the specified codepage is not
> +     * supported.
>       */
>      protected Map readDictionary(final byte[] src, final long offset,
>                                   final int length, final int codepage)
> +    throws UnsupportedEncodingException
>      {
>          /* Check whether "offset" points into the "src" array". */
>          if (offset < 0 || offset > src.length)
> @@ -202,19 +205,23 @@
>              long sLength = LittleEndian.getUInt(src, o);
>              o += LittleEndian.INT_SIZE;
> 
> -            /* Read the bytes or characters depending on whether the
> -             * character set is Unicode or not. */
> -            StringBuffer b = new StringBuffer((int) sLength);
> -            for (int j = 0; j < sLength; j++)
> -                if (codepage == Constants.CP_UNICODE)
> -                {
> -                    final int i1 = o + (j * 2);
> -                    final int i2 = i1 + 1;
> -                    b.append((char) ((src[i2] << 8) + src[i1]));
> -                }
> -                else
> -                    b.append((char) src[o + j]);
> -
> +            String value;
> +            switch (codepage)
> +            {
> +                case -1:
> +                    value = new String(src, o, (int) sLength);
> +                    break;
> +                case Constants.CP_UNICODE:
> +                    // In the case of UTF-16, the length represents the number of characters.
> +                    value = new String(src, o, (int) sLength * 2, VariantSupport.codepageToEncoding(codepage));
> +                    break;
> +                default:
> +                    // TODO: Confirm the behaviour of UTF-8.
> +                    value = new String(src, o, (int) sLength, VariantSupport.codepageToEncoding(codepage));
> +            }
> +
> +            StringBuffer b = new StringBuffer(value);
> +
>              /* Strip 0x00 characters from the end of the string: */
>              while (b.length() > 0 && b.charAt(b.length() - 1) == 0x00)
>                  b.setLength(b.length() - 1);
> 
> 
> 
> ------------------------------------------------------------------------
> 


Patch: Dictionary should be read in the codepage of the section it's in.

Posted by Daniel Noll <da...@nuix.com.au>.
Daniel Noll wrote:

> I'll submit a patch in a few minutes if I can clean up my 
> code sufficiently, but it won't be a clean patch so you'll probably 
> have to rearrange it a little. :-/

Patch attached as promised / threatened. ;-)

To compare before and after, you might want to construct a test 
document, and create a custom property with non-latin name and value.  
The previous code will work for the value but not the name, and this 
update should hopefully make it work for the property name as well.
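
For anyone wondering why a non-latin name comes out wrong with the previous code: the removed loop in the patch appends one char per byte for non-Unicode codepages. A minimal sketch of the difference follows. The class and method names are hypothetical, and UTF-8 is assumed as the section's encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of POI.
class ByteCastSketch {
    /** Mirrors the removed loop: every byte becomes one char. */
    static String castEachByte(byte[] src, int offset, int length) {
        StringBuilder b = new StringBuilder(length);
        for (int j = 0; j < length; j++) {
            b.append((char) src[offset + j]);
        }
        return b.toString();
    }

    /** Decodes the same bytes through a charset, as the patch does via new String(...). */
    static String decode(byte[] src, int offset, int length) {
        return new String(src, offset, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = "\u0418\u043c\u044f"; // "Имя", a non-latin property name
        byte[] bytes = name.getBytes(StandardCharsets.UTF_8);

        // Casting bytes to chars does not reassemble multi-byte sequences.
        System.out.println(castEachByte(bytes, 0, bytes.length).equals(name));
        // Decoding with the charset restores the original name.
        System.out.println(decode(bytes, 0, bytes.length).equals(name));
    }
}
```

Plain ASCII names survive the byte cast, which is presumably why the problem only shows up with non-latin names.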

I'm not sure if this works in all cases, but it actually seems to behave 
for our test UTF-8 custom properties, which should be the most exotic 
encoding one would come across (fingers crossed.)

Daniel



Re: Most recent release? (Codepage support in HPSF)

Posted by Daniel Noll <da...@nuix.com.au>.
Rainer Klute wrote:

>Regarding HPSF the CVS HEAD is stable. If you need any codepages that
>are unsupported yet, they can easily be added.
>
>  
>
I had a look at the VariantSupport class, and it seems to be pretty 
straightforward to add new types.  If we find any that aren't there, 
I'll submit patches.

However...

I've just discovered that the dictionary in the custom property set 
isn't being read with the codepage specified for that property set.

I've tracked the issue down to the method Property.readDictionary, which 
is doing the reading manually whereas Property itself uses 
VariantSupport.read() to perform all the decoding.

I'm doing some twiddling, doing a VariantSupport.codepageToEncoding call 
to determine how to create the string, but I'm not sure it will work 
because I don't fully understand what the length means in this case.  If 
the code page is 65001 (UTF-8), is the length the length in bytes or the 
length in characters?  If it's the length in bytes, this becomes easy, 
and it seems to work for the test file we have.  If it's the length in 
characters, it might take longer to get right.
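
Here is a rough sketch of the "length in bytes" reading, to make the question concrete. This is not the POI code. The class and method names are hypothetical, and the assumption that the stored length counts bytes is exactly the open question above.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not part of POI.
class DictionaryEntrySketch {
    /**
     * Reads a 4-byte little-endian length followed by that many bytes of text,
     * then strips trailing 0x00 characters.
     * ASSUMPTION: the length counts bytes, not characters.
     */
    static String readEntry(byte[] src, int offset, Charset charset) {
        long length = ByteBuffer.wrap(src, offset, 4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getInt() & 0xFFFFFFFFL;
        String value = new String(src, offset + 4, (int) length, charset);
        int end = value.length();
        while (end > 0 && value.charAt(end - 1) == '\0') {
            end--;
        }
        return value.substring(0, end);
    }

    public static void main(String[] args) {
        // "Имя" plus a terminating NUL: 4 characters, 7 bytes in UTF-8.
        byte[] text = "\u0418\u043c\u044f\0".getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(4 + text.length).order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(text.length).put(text);

        System.out.println(readEntry(buf.array(), 0, StandardCharsets.UTF_8));
    }
}
```

If the stored count were the number of characters instead (4 in this example), the same read would stop after 4 of the 7 bytes and cut the name short.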

I'll submit a patch in a few minutes if I can clean up my 
code sufficiently, but it won't be a clean patch so you'll probably have 
to rearrange it a little. :-/

Daniel



Re: Most recent release? (Codepage support in HPSF)

Posted by Rainer Klute <kl...@rainer-klute.de>.
On Wednesday, 30.03.2005, at 11:12 +1000, Daniel Noll wrote:
> I'm in need of codepage support in HPSF, and noticed that according to 
> CVS, codepage support was added on the 2nd of December, 2003.
> 
> The last release (poi-src-2.5.1-final-20040804) is dated the 4th of 
> August, 2004, but as far as I can tell, this release has no codepage 
> support.  In fact, it lacks the class which the support was added to... 
> meaning that the release really dates to more than 19 months ago.
> 
> Given that POI CVS doesn't appear to have any branches in it, I'm just 
> getting a little confused over when this release was actually cut.  
> Perhaps it was supposed to say 2003 instead of 2004?
> 
> Either way I need this codepage support (at least for UTF-8), so how 
> stable would people say the current CVS is?

Regarding HPSF the CVS HEAD is stable. If you need any codepages that
are unsupported yet, they can easily be added.
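
As an illustration of what adding a codepage amounts to, here is a minimal sketch of a codepage-to-charset lookup. This is not the VariantSupport code. The class, the map and the method name are hypothetical, and only a handful of well-known Windows codepage numbers are listed.

```java
import java.nio.charset.Charset;
import java.nio.charset.UnsupportedCharsetException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not part of POI.
class CodepageSketch {
    private static final Map<Integer, String> ENCODINGS = new HashMap<>();
    static {
        ENCODINGS.put(1200, "UTF-16LE");      // Windows "Unicode" codepage
        ENCODINGS.put(1251, "windows-1251");  // Cyrillic
        ENCODINGS.put(1252, "windows-1252");  // Western European
        ENCODINGS.put(65001, "UTF-8");
        // Supporting another codepage is one more line here.
    }

    /** Returns the Java charset for a Windows codepage number, or throws if unknown. */
    static Charset charsetFor(int codepage) {
        String name = ENCODINGS.get(codepage);
        if (name == null) {
            throw new UnsupportedCharsetException("cp" + codepage);
        }
        return Charset.forName(name);
    }

    public static void main(String[] args) {
        System.out.println(charsetFor(65001));
    }
}
```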

Best regards
Rainer Klute
