
Posted to solr-dev@lucene.apache.org by "Koji Sekiguchi (JIRA)" <ji...@apache.org> on 2007/04/25 14:14:15 UTC

[jira] Created: (SOLR-214) deficit of InputStreamReader support in anonymous class of ContentStream

deficit of InputStreamReader support in anonymous class of ContentStream
------------------------------------------------------------------------

                 Key: SOLR-214
                 URL: https://issues.apache.org/jira/browse/SOLR-214
             Project: Solr
          Issue Type: Bug
            Reporter: Koji Sekiguchi


After SOLR-197 was applied, POSTed Japanese XML content turns into garbled characters in the index.
I can see the garbled characters through Luke. The issue was never seen before SOLR-197.
The cause of this problem is the lack of InputStreamReader support in the anonymous ContentStream class in the SolrRequestParsers.parseParamsAndFillStreams() method.

Before SOLR-197, InputStreamReader was used in the XmlUpdateRequestHandler.handleRequestBody() method:

    // Cycle through each stream
    for( ContentStream stream : req.getContentStreams() ) {
      // Use the charset declared in the Content-Type header, if any
      String charset = getCharsetFromContentType( stream.getContentType() );
      Reader reader = null;
      if( charset == null ) {
        // No charset declared: InputStreamReader uses the platform default
        reader = new InputStreamReader( stream.getStream() );
      }
      else {
        reader = new InputStreamReader( stream.getStream(), charset );
      }
      rsp.add( "update", this.update( reader ) );
      
      // Make sure it's closed
      try { reader.close(); } catch( Exception ex ){}
    }

The patch applies the same handling to SolrRequestParsers.
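
For readers who want to try the idea outside Solr, here is a minimal standalone sketch of charset-aware decoding. It is not the attached patch. The class name CharsetAwareReaderSketch, the helper charsetFromContentType, and the UTF-8 fallback are assumptions made for this example; the handler code quoted above falls back to the platform default instead.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class CharsetAwareReaderSketch {

    // Hypothetical helper: extracts the charset parameter from a Content-Type
    // value such as "text/xml; charset=UTF-8". Returns null if none is present.
    static String charsetFromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    // Wraps the raw byte stream in a Reader that decodes with the declared
    // charset. Assumption: fall back to UTF-8 when no charset is declared.
    // Charset.forName throws an unchecked exception for unknown names.
    static Reader openReader(InputStream in, String contentType) {
        String name = charsetFromContentType(contentType);
        Charset cs = (name == null) ? StandardCharsets.UTF_8 : Charset.forName(name);
        return new InputStreamReader(in, cs);
    }

    // Reads the whole reader into a String and closes it.
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = reader) {
            char[] buf = new char[1024];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Three Japanese characters, written as escapes so the source file's
        // own encoding does not matter.
        String xml = "<add><doc><field name=\"title\">\u65e5\u672c\u8a9e</field></doc></add>";
        byte[] body = xml.getBytes(StandardCharsets.UTF_8);
        Reader reader = openReader(new ByteArrayInputStream(body), "text/xml; charset=UTF-8");
        System.out.println(readAll(reader).equals(xml));
    }
}
```

The point of the sketch is only that the decoding charset comes from the request's Content-Type rather than from whatever the container or platform happens to choose.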

regards,


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-214) deficit of InputStreamReader support in anonymous class of ContentStream

Posted by "Toru Matsuzawa (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491926 ] 

Toru Matsuzawa commented on SOLR-214:
-------------------------------------

This problem can be reproduced with Tomcat 5.5.23.

It also occurred with "/update" before the SOLR-197 change.
The reader returned by stream.getReader() is an org.apache.catalina.connector.CoyoteReader.

CoyoteReader uses org.apache.catalina.connector.InputBuffer#realReadBytes().
realReadBytes() reads the input in byte order.
As a result, the characters in the index are garbled.
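
To make the symptom concrete, independent of any Tomcat internals, the following small sketch shows what happens when UTF-8 bytes are decoded with a single-byte charset. The class name MojibakeSketch and the choice of ISO-8859-1 as the wrong charset are assumptions for illustration only; the thread does not say which charset the container actually applied.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeSketch {

    // Decodes raw bytes with the given charset.
    static String decodeAs(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Three Japanese characters, written as escapes so the source file's
        // own encoding does not matter.
        String original = "\u65e5\u672c\u8a9e";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        // Wrong charset: every byte becomes its own character.
        String garbled = decodeAs(utf8, StandardCharsets.ISO_8859_1);
        // Right charset: the original text comes back.
        String correct = decodeAs(utf8, StandardCharsets.UTF_8);

        System.out.println("bytes on the wire: " + utf8.length);
        System.out.println("chars after wrong decode: " + garbled.length());
        System.out.println("wrong decode equals original: " + garbled.equals(original));
        System.out.println("right decode equals original: " + correct.equals(original));
    }
}
```

Each of the three characters occupies three bytes in UTF-8, so the wrongly decoded string has nine characters instead of three, which is the kind of garbling that would then be written to the index.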



[jira] Assigned: (SOLR-214) deficit of InputStreamReader support in anonymous class of ContentStream

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan McKinley reassigned SOLR-214:
----------------------------------

    Assignee: Ryan McKinley


[jira] Commented: (SOLR-214) deficit of InputStreamReader support in anonymous class of ContentStream

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491711 ] 

Ryan McKinley commented on SOLR-214:
------------------------------------

Weird - the javadocs are pretty explicit that request.getReader() should take care of the character encoding:
http://java.sun.com/javaee/5/docs/api/javax/servlet/ServletRequest.html#getReader()

What app server are you running?

Does this happen when you use /update from the servlet (that is, when /update is not mapped in solrconfig.xml)?

SolrUpdateServlet.java has always used getReader().



[jira] Closed: (SOLR-214) deficit of InputStreamReader support in anonymous class of ContentStream

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi closed SOLR-214.
-------------------------------

    Resolution: Invalid

Closing as invalid. The servlet container should take care of character encoding.


[jira] Commented: (SOLR-214) deficit of InputStreamReader support in anonymous class of ContentStream

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491746 ] 

Ken Krugler commented on SOLR-214:
----------------------------------

There's some complex interplay between the content-type in the request, the charset (if any) in the request, and the container being used. So some interesting questions are:

1. Exactly how is the content being posted (e.g. via the example script)?
2. What request header values are being sent along with the post?
3. Which servlet container (and version) is being used?



[jira] Resolved: (SOLR-214) deficit of InputStreamReader support in anonymous class of ContentStream

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan McKinley resolved SOLR-214.
--------------------------------

    Resolution: Fixed

added in rev 536019


[jira] Updated: (SOLR-214) deficit of InputStreamReader support in anonymous class of ContentStream

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated SOLR-214:
--------------------------------

    Attachment: UseInputStreamReader.patch

The patch is attached.


[jira] Commented: (SOLR-214) deficit of InputStreamReader support in anonymous class of ContentStream

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491938 ] 

Koji Sekiguchi commented on SOLR-214:
-------------------------------------

> Weird - the javadocs are pretty explicit that request.getReader() should take care of the character encoding:
> http://java.sun.com/javaee/5/docs/api/javax/servlet/ServletRequest.html#getReader()

Good point. I had simply assumed the cause of this problem was the missing InputStreamReader support after SOLR-197.
But according to the javadoc, the servlet container should take care of the encoding. We are using Tomcat 5.5.23, so we should check the servlet container. Thanks.


[jira] Commented: (SOLR-214) deficit of InputStreamReader support in anonymous class of ContentStream

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494151 ] 

Koji Sekiguchi commented on SOLR-214:
-------------------------------------

For now, to work around this problem, we are looking into using a servlet filter.
But we would be happy if Solr handled character encoding explicitly. We are using Tomcat 5.5.23.



[jira] Reopened: (SOLR-214) deficit of InputStreamReader support in anonymous class of ContentStream

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan McKinley reopened SOLR-214:
--------------------------------


Without this patch, Resin balks at UTF-8 input:

http://www.nabble.com/UTF-8-problem-with-Resin-tf3704271.html

If Resin and Tomcat don't handle "getReader()" correctly, maybe we should handle it explicitly.



