Posted to dev@lucene.apache.org by "James Dyer (JIRA)" <ji...@apache.org> on 2013/01/03 23:10:12 UTC

[jira] [Updated] (SOLR-2347) Use InputStream and not Reader for XML parsing

     [ https://issues.apache.org/jira/browse/SOLR-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Dyer updated SOLR-2347:
-----------------------------

    Attachment: SOLR-2347.patch

Here is an attempt to fix DIH.  However, we cannot make DataSource always deal in InputStreams.  JdbcDataSource, for instance, is entirely different and returns Maps.

One difficulty here is FieldReaderDataSource, which can take a java.sql.Clob and pass it down to an XML or other text processor.  Clob#getCharacterStream() returns a Reader, not an InputStream.  So I ended up having XPathEntityProcessor take a DataSource<?> and use instanceof to check whether it was handed a Reader or an InputStream.
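
As a rough illustration of that instanceof check (this is not the code in the attached patch), the dispatch could look something like the sketch below. It uses the JDK's StAX API rather than DIH's own classes, and the class and method names (SourceDispatchSketch, openParser, rootText) are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class SourceDispatchSketch {

    // Hypothetical helper: takes whatever object a data source returned
    // and picks the matching StAX factory method.
    static XMLStreamReader openParser(Object data) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        if (data instanceof InputStream) {
            // Preferred path: the parser sees raw bytes and can honor
            // the encoding declared in the XML itself.
            return factory.createXMLStreamReader((InputStream) data);
        } else if (data instanceof Reader) {
            // Fallback path, e.g. for Clob#getCharacterStream():
            // the characters have already been decoded by someone else.
            return factory.createXMLStreamReader((Reader) data);
        }
        throw new IllegalArgumentException("Unsupported data type: "
                + (data == null ? "null" : data.getClass().getName()));
    }

    // Collects all character data in the document, to show that both
    // paths end up feeding the same parsing code.
    static String rootText(Object data) throws XMLStreamException {
        XMLStreamReader reader = openParser(data);
        try {
            StringBuilder text = new StringBuilder();
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.CHARACTERS) {
                    text.append(reader.getText());
                }
            }
            return text.toString();
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc>hello</doc>";
        System.out.println(rootText(new StringReader(xml)));
        System.out.println(rootText(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))));
    }
}
```

The unchecked cast on an Object is what makes this feel awkward: the type of the payload is only known at runtime, which is the same coupling between processor and data source discussed below.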

All of this makes me wonder whether having DataSource separate from EntityProcessor is really good design here.  The EntityProcessors are very much married to their DataSources, and you really can't mix and match them nearly as much as the design would lead one to believe.
                
> Use InputStream and not Reader for XML parsing
> ----------------------------------------------
>
>                 Key: SOLR-2347
>                 URL: https://issues.apache.org/jira/browse/SOLR-2347
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - DataImportHandler
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 4.2, 5.0
>
>         Attachments: SOLR-2347.patch
>
>
> Followup to SOLR-96:
> Solr mostly uses java.io.Reader and passes this Reader to the XML parser. According to the XML spec, an XML file should initially be treated as a binary stream with a default charset of UTF-8 or another charset given by the network protocol (like the Content-Type header in HTTP). Importantly, this default charset is only a "hint" to the parser; the charset given in the XML header processing instruction is the mandatory one. Because of this, the parser must be able to change the charset when reading the XML header (possibly also when seeing BOM markers). This is not possible if the XML parser gets a java.io.Reader instead of a java.io.InputStream. SOLR-96 already fixed this for the XmlUpdateRequestHandler and the DocumentAnalysisRequestHandler. This issue should fix the rest to conform to the XML spec (open schema.xml and config.xml as InputStream, not Reader, and others).
> This change would not break anything in Solr (perhaps only backwards compatibility in the API), as the default used by XML parsers is UTF-8.
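
To make the point in the issue description concrete, here is a minimal sketch using the JDK's StAX API (not Solr code). The class and method names (XmlCharsetSketch, sampleBytes, textFromStream, textFromReader) and the sample document are assumptions made up for the example. The document declares ISO-8859-1; the parser honors that when given the raw bytes, but cannot when the caller has already decoded them as UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class XmlCharsetSketch {

    // Sample document: declares ISO-8859-1 and is really encoded that way.
    // The text content is "caf" followed by U+00E9 (e with acute accent).
    static byte[] sampleBytes() {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<doc>caf\u00e9</doc>";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Collects all character data reported by the parser.
    private static String allText(XMLStreamReader reader) throws XMLStreamException {
        try {
            StringBuilder text = new StringBuilder();
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.CHARACTERS) {
                    text.append(reader.getText());
                }
            }
            return text.toString();
        } finally {
            reader.close();
        }
    }

    // InputStream path: the parser reads the XML declaration itself
    // and decodes the bytes with the declared charset.
    static String textFromStream(byte[] bytes) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        return allText(factory.createXMLStreamReader(new ByteArrayInputStream(bytes)));
    }

    // Reader path: the caller has to pick a charset before the parser has
    // seen the declaration, so a wrong guess cannot be corrected later.
    static String textFromReader(byte[] bytes, Charset guess) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        return allText(factory.createXMLStreamReader(
                new InputStreamReader(new ByteArrayInputStream(bytes), guess)));
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = sampleBytes();
        String expected = "caf\u00e9";
        System.out.println("stream matches: "
                + expected.equals(textFromStream(bytes)));
        System.out.println("reader (UTF-8 guess) matches: "
                + expected.equals(textFromReader(bytes, StandardCharsets.UTF_8)));
    }
}
```

With the UTF-8 guess, the byte 0xE9 is not valid UTF-8 on its own, so the Reader hands the parser a damaged character before the parser ever gets a chance to look at the declaration.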
