Posted to dev@abdera.apache.org by "Chris Berry (JIRA)" <ji...@apache.org> on 2007/09/04 05:24:58 UTC

[jira] Created: (ABDERA-60) Invalid UTF-8 chars in the AbderaClient

Invalid UTF-8 chars in the AbderaClient
---------------------------------------

                 Key: ABDERA-60
                 URL: https://issues.apache.org/jira/browse/ABDERA-60
             Project: Abdera
          Issue Type: Bug
    Affects Versions: 0.3.0
         Environment: N/A
            Reporter: Chris Berry
             Fix For: 0.3.0


After upgrading to the latest 0.3-SNAPSHOT SVN trunk (on ~8/27/2007) from a 0.3-SNAPSHOT download from a couple of months ago,
and after making all required modifications (to catch up with all the API changes), I am seeing "Invalid UTF-8" errors.
Note that these errors only occur in the AbderaClient when I call "entry.getContent()".

I have attached a small, self-contained JUnit test case which reproduces/demonstrates this issue.
It builds and runs out-of-the-box (using mvn install).
There is also a README.txt that details the output/issue.

This JUnit reproduces the error. It is as small as I could get it.
My Atom Store is based on a Store and StoreProvider (based on code I received from Ugo Cei as a starting point).
Note that all of the code in src/main/java is essentially unchanged between the latest 0.3-SNAPSHOT and the 0.3-SNAPSHOT that works.
In other words, my code stayed as fixed as possible, and the latest 0.3-SNAPSHOT is the only real variable.

I'm not saying that the bug isn't in my code, only that it never showed up until my upgrade to the latest 0.3-SNAPSHOT.

I actually suspect that it may be an issue w/ woodstox, which the latest 0.3-SNAPSHOT significantly upgrades.

Note: I have looked very closely at the XML file(s) that are causing this issue.
I ran the Unix util "iconv" on them, and AFAICT they do not contain improper UTF-8.

Chris Berry
chriswberry at gmail dot com


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (ABDERA-60) Invalid UTF-8 chars in the AbderaClient

Posted by "Chris Berry (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ABDERA-60?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12524776 ] 

Chris Berry commented on ABDERA-60:
-----------------------------------


The last line in the JUnit should be changed to:

        //assertEquals("Some Textx", entry.getContent());
        assertTrue( entry.getContent().indexOf( "id=\"9999\"" ) != -1 );

The original assertion fails properly with the UTF-8 bug,
but it would not pass once the bug is fixed ;-)



[jira] Updated: (ABDERA-60) Invalid UTF-8 chars in the AbderaClient

Posted by "Chris Berry (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ABDERA-60?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Berry updated ABDERA-60:
------------------------------

    Attachment: abdera-utf8-bug.tar.gz



[jira] Commented: (ABDERA-60) Invalid UTF-8 chars in the AbderaClient

Posted by "Chris Berry (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ABDERA-60?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12524775 ] 

Chris Berry commented on ABDERA-60:
-----------------------------------

I want to point out several incidental facts:

1) I saw a similar problem with an earlier 0.3. I was mixing the latest woodstox (3.2.1) with Abdera.
Or, more correctly, maven was bringing in some chained dependencies, one of which brought in woodstox 3.2.1.
Abdera was using woodstox 2.0.5 at that time.
The problem went away when I excluded that dependency using the maven <exclusions> element.
So this problem exists in code from a long time back...

2) We are using woodstox 3.2.1 in another project with these exact same XMLs without problem.

3) I ran these XML documents with the supposedly invalid chars through 2 different UTF-8 conversions as I read them from disk, before putting them into the <content> (as seen in the JUnit).
I also processed them with the Unix "iconv" utility.


So I am pretty darn sure that there are no invalid chars in the XML.



[jira] Commented: (ABDERA-60) Invalid UTF-8 chars in the AbderaClient

Posted by "Chris Berry (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ABDERA-60?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12525263 ] 

Chris Berry commented on ABDERA-60:
-----------------------------------

Reviewing my changes, this may not truly be an Abdera bug,
although my code below does work around the problem.

When we call

             FOMParser.parse( InputStream is,...  

it subsequently calls Axiom's

            StAXUtils.createXMLStreamReader(in, charset);     

(when there is a charset -- which there should be, at least in my case).
This presumably should create a Reader with the proper charset.
But it definitely does not, so there is a bug somewhere in Axiom or possibly even Woodstox??

So what is happening is that the Reader (created by StAXUtils and subsequently Woodstox) 
uses the default encoding (MacRoman in my case)
Which is the reason why it works in Linux -- the default encoding is UTF-8.
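
To illustrate the default-encoding effect described above, here is a minimal Java sketch. It is not taken from Abdera, Axiom, or Woodstox: it simply encodes a string as UTF-8 and decodes the bytes twice, once as UTF-8 and once with a different charset. ISO-8859-1 stands in for a non-UTF-8 platform default such as MacRoman, and the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {

    /** Encodes the text as UTF-8, then decodes those bytes with the given charset. */
    static String roundTrip(String text, Charset decodeWith) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, decodeWith);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // "cafe" with an acute accent on the e

        // Decoding with the charset the bytes were written in preserves the text.
        System.out.println(roundTrip(original, StandardCharsets.UTF_8).equals(original));      // prints "true"

        // Decoding with another charset (a stand-in for a platform default) corrupts it.
        System.out.println(roundTrip(original, StandardCharsets.ISO_8859_1).equals(original)); // prints "false"
    }
}
```

On a machine whose default encoding happens to be UTF-8, the same mistake goes unnoticed, which would be consistent with the behaviour seen on Linux.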

I don't know what Herbert's default encoding is....

>>Would it be possible for you to put together a patch file with these
>>changes?

I would gladly produce a patch. 
BUT I really think you need to decide how to handle this.
When I call 

             FOMParser.parse( Reader rr,...  

This bypasses a bit of code. 

IMHO, I think that you should simply roll the required  "FOMParser.parse( InputStream is,..."  code into  "FOMParser.parse( Reader rr,... "
And not rely on the underlying code to do the right thing.

Oh, and for the Content-Type header, the right thing to do is call the
getCharacterEncoding method on ClientResponse.  You will still need to
verify that the value specified for the parameter is correct

So this should be something like this.....

  public BaseResponseContext(T base, boolean chunked) {
    this.base = base;
    setStatus(200);
    setStatusText("OK");
    this.chunked = chunked;
    try {
           //  setContentType(getContentType().toString());
           setContentType(getContentType().toString() + "; charset=" + getCharacterEncoding() );
    } catch (Exception e) {}
  }




[jira] Resolved: (ABDERA-60) Invalid UTF-8 chars in the AbderaClient

Posted by "James M Snell (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ABDERA-60?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James M Snell resolved ABDERA-60.
---------------------------------

    Resolution: Fixed

Checked in a change to 0.3.0 and trunk that forces the use of UTF-8 if the charset is not otherwise specified. Also, have the server return the charset parameter in the content-type.



[jira] Commented: (ABDERA-60) Invalid UTF-8 chars in the AbderaClient

Posted by "Chris Berry (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ABDERA-60?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12525262 ] 

Chris Berry commented on ABDERA-60:
-----------------------------------

We figured it out. AFAICT, both my issue and Herbert's are the same.
I believe this is a bug in Abdera.  

There are actually two issues:

-----------------------
First, Abdera uses HttpClient's

        method.getResponseBodyAsStream(); 

in order to obtain a raw stream of bytes for Woodstox (which is the correct thing to do for performance).

But Woodstox does NOT assume UTF-8, so it fails when parsing valid UTF-8 characters.

The fix is to change the following line in AbstractClientResponse

  public <T extends Element>Document<T> getDocument( Parser parser,  ParserOptions options)
         throws ParseException {
    try {
      .......
      // Document<T> doc = parser.parse( getInputStream(), base, options);
      Document<T> doc = parser.parse(getReader(), base, options);
      ....

And to add the following method to AbstractClientResponse
    
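  public java.io.Reader getReader() throws java.io.IOException {
    String header = getHeader("Content-Type");

    String charset = "UTF-8"; // default to UTF-8 when no charset parameter is present
    if (header != null) { // the response may not carry a Content-Type header at all
      // capture the charset value, tolerating optional quotes and trailing parameters
      java.util.regex.Matcher matcher =
        java.util.regex.Pattern.compile(".*charset\\s*=\\s*\"?([^\\s;\"]+).*").matcher(header);
      if (matcher.matches()) {
        charset = matcher.group(1);
      }
    }

    // decode the raw response bytes with the declared (or default) charset
    return new java.io.InputStreamReader(getInputStream(), charset);
  }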

Although, there is likely a cleaner way to get the "charset" param in Abdera??
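
As one possible answer to that question, here is a standalone sketch of the charset lookup, written against plain JDK classes only. It is an assumption about how this could be done, not Abdera's API: the class name ContentTypeCharset and the method charsetOf are hypothetical. It matches the parameter case-insensitively, tolerates quoted values, and falls back to UTF-8 when the header is missing or names a charset the JVM does not know.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContentTypeCharset {

    // Matches a "charset" parameter such as "; charset=utf-8" or ";charset=\"UTF-8\"".
    private static final Pattern CHARSET_PARAM =
        Pattern.compile(";\\s*charset\\s*=\\s*\"?([^\\s;\"]+)\"?", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in a Content-Type header value, or UTF-8 if none is usable. */
    static Charset charsetOf(String contentType) {
        if (contentType == null) {
            return StandardCharsets.UTF_8;
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: fall through to the default.
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("application/atom+xml; charset=ISO-8859-1"));
        System.out.println(charsetOf("application/atom+xml"));
    }
}
```

The resulting Charset can be passed straight to the InputStreamReader constructor, so the reader never depends on the platform default.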

-----------------------------
Second, Abdera is NOT adding the "charset" parameter (e.g. ";charset=utf-8") to the Content-Type HTTP Header of the Response.

So a fix might be to change the following line in BaseResponseContext:

  public BaseResponseContext(T base, boolean chunked) {
    this.base = base;
    setStatus(200);
    setStatusText("OK");
    this.chunked = chunked;
    try {

      //  setContentType(getContentType().toString());
      setContentType(getContentType().toString() + "; charset=utf-8");

    } catch (Exception e) {}
  }

Although there are likely better ways/places to accomplish this within Abdera.
Perhaps I need to set this in my SpringAbderaServlet??
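
Wherever the header ends up being set, it may be worth appending the parameter only when the content type does not already carry one. The following is a minimal sketch of that check using plain JDK classes; the class name ContentTypeHeader and the method withCharset are hypothetical and not part of Abdera.

```java
import java.util.regex.Pattern;

public class ContentTypeHeader {

    // Detects an existing "charset" parameter, ignoring case and optional whitespace.
    private static final Pattern HAS_CHARSET =
        Pattern.compile(";\\s*charset\\s*=", Pattern.CASE_INSENSITIVE);

    /** Appends "; charset=<charset>" unless the content type already declares a charset. */
    static String withCharset(String contentType, String charset) {
        if (contentType == null || contentType.isEmpty()) {
            return contentType; // nothing to decorate
        }
        if (HAS_CHARSET.matcher(contentType).find()) {
            return contentType; // keep the charset that is already declared
        }
        return contentType + "; charset=" + charset;
    }

    public static void main(String[] args) {
        System.out.println(withCharset("application/atom+xml", "utf-8"));
        System.out.println(withCharset("application/atom+xml; charset=ISO-8859-1", "utf-8"));
    }
}
```

This avoids producing a header with two charset parameters when the application has already set one.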





[jira] Updated: (ABDERA-60) Invalid UTF-8 chars in the AbderaClient

Posted by "Chris Berry (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ABDERA-60?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Berry updated ABDERA-60:
------------------------------

    Attachment: abdera-utf8-bug.tar.gz



[jira] Commented: (ABDERA-60) Invalid UTF-8 chars in the AbderaClient

Posted by "Chris Berry (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ABDERA-60?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12524770 ] 

Chris Berry commented on ABDERA-60:
-----------------------------------

This problem has been boiled down to an interaction between woodstox 3.2.1 and Abdera.

Per James Snell (emails to the Abdera-Users list)::
===============
 ... there are no changes in Abdera 0.3.0 that *require* the new
version of woodstox.  If dropping down to an older version addresses the
issue, then we can explore that as a solution.

Per Chris Berry::
============-

That fixes it!!! 

I modified all of the pertinent POMs accordingly;
I.e. 
<!--    
      <dependency>
        <groupId>org.codehaus.woodstox</groupId>
        <artifactId>wstx-asl</artifactId>
        <version>3.2.1</version>
        <scope>runtime</scope>	   
      </dependency>
-->
      <dependency>
        <groupId>woodstox</groupId>
        <artifactId>wstx-asl</artifactId>
        <version>2.0.5</version>
        <scope>runtime</scope>	   
      </dependency>

9 POMs were affected::

dogstar:~/java/abdera/svn-head-using-old-woostox/trunk cberry$ find . -name "*.xml" | xargs grep woodstox
./extensions/gdata/pom.xml:      <groupId>org.codehaus.woodstox</groupId>
./extensions/geo/pom.xml:      <groupId>org.codehaus.woodstox</groupId>
./extensions/json/pom.xml:      <groupId>org.codehaus.woodstox</groupId>
./extensions/main/pom.xml:      <groupId>org.codehaus.woodstox</groupId>
./extensions/media/pom.xml:      <groupId>org.codehaus.woodstox</groupId>
./extensions/opensearch/pom.xml:      <groupId>org.codehaus.woodstox</groupId>
./extensions/sharing/pom.xml:      <groupId>org.codehaus.woodstox</groupId>
./parser/pom.xml:      <groupId>org.codehaus.woodstox</groupId>
./pom.xml:        <groupId>org.codehaus.woodstox</groupId>





[jira] Commented: (ABDERA-60) Invalid UTF-8 chars in the AbderaClient

Posted by "Chris Berry (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ABDERA-60?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12524835 ] 

Chris Berry commented on ABDERA-60:
-----------------------------------

Just to be positive, I have added code to the previous JUnit that actually retrieves text from the XML w/ woodstox.
This is pretty unequivocal now...

package com.homeaway.hcdata.store.provider.blogs;

import junit.framework.Test; 
import junit.framework.TestCase; 
import junit.framework.TestSuite;

import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;

import java.io.FileInputStream;

import com.ctc.wstx.stax.WstxInputFactory; 

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

/**
 */
public class WoodstoxTest extends TestCase {

    static private Log log = LogFactory.getLog( WoodstoxTest.class );

    private static final String userdir = System.getProperty( "user.dir" );
    private static final String filename =  userdir + "/var/blogs/cberry/99/9999/en/blog_9999.xml" ;
    
    public static Test suite() 
    { return new TestSuite( WoodstoxTest.class ); }

    public void tearDown() throws Exception 
    { super.tearDown(); } 

    public void setUp() throws Exception 
    { super.setUp(); } 

    public void testWoodstox1() throws Exception {
        // we will simply walk the doc and see if it throws an Exception
        XMLInputFactory xif = new WstxInputFactory();
        XMLStreamReader r = xif.createXMLStreamReader( new FileInputStream( filename ) );
        while (r.hasNext()) r.next();
        r.close();
    }

    public void testWoodstox2() throws Exception {
        // walk the doc again, this time logging each event and its text
        XMLInputFactory xif = new WstxInputFactory();
        XMLStreamReader reader = xif.createXMLStreamReader( new FileInputStream( filename ) );

        while ( reader.hasNext() ) {
            printEventInfo( reader );
        }
        reader.close();
    }

    private static void printEventInfo(XMLStreamReader reader) throws XMLStreamException {
        int eventCode = reader.next();
        String val = null;
        switch (eventCode) {
            case 1 :
                val= reader.getLocalName(); 
                log.debug("event = START_ELEMENT");
                log.debug("Localname = "+val);
                break;
            case 2 :
                val= reader.getLocalName(); 
                log.debug("event = END_ELEMENT");
                log.debug("Localname = "+val);
                break;
            case 3 :
                val= reader.getPIData();
                log.debug("event = PROCESSING_INSTRUCTION");
                log.debug("PIData = " + val);
                break;
            case 4 :
                val= reader.getText();
                log.debug("event = CHARACTERS");
                log.debug("Characters = " + val);
                break;
            case 5 :
                val= reader.getText();
                log.debug("event = COMMENT");
                log.debug("Comment = " + val);
                break;
            case 6 :
                val= reader.getText();
                log.debug("event = SPACE");
                log.debug("Space = " + val);
                break;
            case 7 :
                log.debug("event = START_DOCUMENT");
                log.debug("Document Started.");
                break;
            case 8 :
                log.debug("event = END_DOCUMENT");
                log.debug("Document Ended");
                break;
            case 9 :
                val= reader.getText();
                log.debug("event = ENTITY_REFERENCE");
                log.debug("Text = " + val);
                break;
            case 11 :
                val= reader.getText();
                log.debug("event = DTD");
                log.debug("DTD = " + val);

                break;
            case 12 :
                val= reader.getText();
                log.debug("event = CDATA");
                log.debug("CDATA = " + val);
                break;
        }
    }

}
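
For comparison, the same kind of walk can be sketched with the named constants from javax.xml.stream.XMLStreamConstants instead of numeric event codes, and with the encoding passed explicitly to the factory. This sketch assumes the JDK's default StAX implementation rather than Woodstox and parses an in-memory document rather than the blog file; the class name StaxWalkSketch and the method collectText are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class StaxWalkSketch {

    /** Walks the document and returns all character data, decoding the bytes with the given encoding. */
    static String collectText(byte[] xml, String encoding) {
        StringBuilder text = new StringBuilder();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Passing the encoding explicitly avoids relying on the platform default.
            XMLStreamReader reader =
                factory.createXMLStreamReader(new ByteArrayInputStream(xml), encoding);
            try {
                while (reader.hasNext()) {
                    int event = reader.next();
                    if (event == XMLStreamConstants.CHARACTERS || event == XMLStreamConstants.CDATA) {
                        text.append(reader.getText());
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("XML could not be parsed", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        byte[] xml = "<content id=\"9999\">caf\u00e9</content>".getBytes(StandardCharsets.UTF_8);
        System.out.println(collectText(xml, "UTF-8"));
    }
}
```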


> Invalid UTF-8 chars in the AbderaClient
> ---------------------------------------
>
>                 Key: ABDERA-60
>                 URL: https://issues.apache.org/jira/browse/ABDERA-60
>             Project: Abdera
>          Issue Type: Bug
>    Affects Versions: 0.3.0
>         Environment: N/A
>            Reporter: Chris Berry
>             Fix For: 0.3.0
>
>         Attachments: abdera-utf8-bug.tar.gz
>
>
> After upgrading to the latest 0.3-SNAPSHOT SVN trunk (on ~8/27/2007) from a 0.3-SNAPSHOT download from a couple of months ago,
> and after making all required modifications (to catch up with all the API changes), I am seeing "Invalid UTF-8" errors.
> Note that these errors only occur in the AbderaClient when I call "entry.getContent()".
> I have attached a small, self-contained JUnit test case which reproduces/demonstrates this issue.
> It builds and runs out-of-the-box (using mvn install).
> There is also a README.txt that details the output/issue.
> This JUnit reproduces the error. It is as small as I could get it.
> My Atom Store is based on a Store and StoreProvider (based on code I received from Ugo Cei as a starting point).
> Note that all of the code in src/main/java is relatively fixed between the latest 0.3-SNAPSHOT and the 0.3-SNAPSHOT that works.
> In other words, my code stayed as fixed as possible, and the latest 0.3-SNAPSHOT is the only real variable.
> I'm not saying that the bug isn't in my code, only that it never showed up until my upgrade to the latest 0.3-SNAPSHOT.
> I actually suspect that it may be an issue w/ woodstox, which the latest 0.3-SNAPSHOT significantly upgrades.
> Note: I have looked very closely at the XML file(s) causing this issue.
> I ran the Unix util "iconv" on them, and AFAICT they do not contain improper UTF-8.
> Chris Berry
> chriswberry at gmail dot com

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (ABDERA-60) Invalid UTF-8 chars in the AbderaClient

Posted by "Chris Berry (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ABDERA-60?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Berry updated ABDERA-60:
------------------------------

    Attachment:     (was: abdera-utf8-bug.tar.gz)



[jira] Commented: (ABDERA-60) Invalid UTF-8 chars in the AbderaClient

Posted by "Chris Berry (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ABDERA-60?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12524812 ] 

Chris Berry commented on ABDERA-60:
-----------------------------------

I added the following JUnit test, which I think shows that woodstox 3.2.1 is not the issue.
It passes fine (no exceptions thrown).

===================================
package com.homeaway.hcdata.store.provider.blogs;


import junit.framework.Test; 
import junit.framework.TestCase; 
import junit.framework.TestSuite;

import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLInputFactory;

import java.io.FileInputStream;

import com.ctc.wstx.stax.WstxInputFactory; 

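/**
 * Walks a stored blog entry with Woodstox alone (no Abdera involved)
 * to check whether the parser itself rejects the file.
 */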
public class WoodstoxTest extends TestCase {

    private static final String userdir = System.getProperty( "user.dir" );
    
    public static Test suite() 
    { return new TestSuite( WoodstoxTest.class ); }

    public void tearDown() throws Exception 
    { super.tearDown(); } 

    public void setUp() throws Exception
    { super.setUp(); }

    public void testWoodstox() throws Exception {

        String filename =  userdir + "/var/blogs/cberry/99/9999/en/blog_9999.xml" ;

        // we will simply walk the doc and see if it throws an Exception
        XMLInputFactory xif = new WstxInputFactory();
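        FileInputStream in = new FileInputStream( filename );
        try {
            XMLStreamReader r = xif.createXMLStreamReader( in );
            while (r.hasNext()) r.next();
            r.close();
        } finally {
            // XMLStreamReader.close() does not close the underlying stream
            in.close();
        }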
    }


}
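
The iconv check mentioned in the issue description can also be done from Java, independent of any XML parser, which may help separate "the bytes on disk are bad" from "something in the client path is decoding them wrongly". Below is a rough sketch using a strict java.nio CharsetDecoder. The class name Utf8Check and the method isValidUtf8 are hypothetical, not Abdera or Woodstox API, and the sketch assumes the whole file fits in memory. StandardCharsets needs Java 7 or later; on older JDKs Charset.forName("UTF-8") would be the equivalent.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (an illustration, not part of the attached test case):
// reports whether a byte array is well-formed UTF-8.
final class Utf8Check {

    private Utf8Check() {}

    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw on bad input instead of silently
        // substituting U+FFFD, which is the default for String conversions.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

To apply it to the file used in the test above, the bytes could be read with java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(filename)) and passed to isValidUtf8. A result of true would agree with the iconv finding and point away from the stored file itself.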

