Posted to dev@lucene.apache.org by Shai Erera <se...@gmail.com> on 2009/04/10 08:47:10 UTC

Benchmark: EnwikiDocMaker does not use fileIn (BufferedReader)

I started working on the patch for 1591, and noticed EnwikiDocMaker uses the
FileInputStream instance from LineDocMaker and not the BufferedReader. I
don't see any reason for this, as InputSource accepts a Reader. I can change
it as part of 1591, unless you think I'm missing something.

Re: Benchmark: EnwikiDocMaker does not use fileIn (BufferedReader)

Posted by Shai Erera <se...@gmail.com>.
Thanks Uwe. Then I think we should at least wrap the InputStream in a
BufferedInputStream in EnwikiDocMaker (that's what I wanted to achieve in the
first place by reusing LineDocMaker's BufferedReader)?
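
A minimal sketch of the wrapping suggested above, not the actual
EnwikiDocMaker code. The class name BufferedXmlInput, the method
countElements and the 64 KiB buffer size are made up for illustration. The
point is only that the parser still receives bytes (so it can honor the
encoding declaration), while reads from the file are buffered.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class BufferedXmlInput {

    // Hypothetical helper: parses the file from a buffered byte stream and
    // returns the number of start tags seen.
    static int countElements(File file) throws Exception {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                count[0]++;
            }
        };
        // Assumption: 64 KiB buffer; any reasonable size would do.
        try (InputStream in =
                 new BufferedInputStream(new FileInputStream(file), 1 << 16)) {
            // The InputSource wraps the byte stream, not a Reader, so the
            // parser decides the character encoding itself.
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(in), handler);
        }
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countElements(new File(args[0])));
    }
}
```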

On Fri, Apr 10, 2009 at 10:22 AM, Uwe Schindler <uw...@thetaphi.de> wrote:

>  Hi Shai,
>
>
>
> With XML parsers you should generally avoid using Readers, unless you know
> for certain that the underlying XML encoding really is the one given to the
> Reader. Readers should only be passed for sources that are independent of
> the encoding (like Java Strings containing XML, and without an encoding
> declaration!).
>
>
>
> Good examples of correctly using a Reader are:
>
> - new InputSource(new StringReader("<tag>..</tag>"));  // no xml declaration
>
> - XML serialized from SAX/DOM directly to a Writer (so it carries no
> encoding), e.g. kept as a String in a Lucene stored field.
>
>
>
> But documents from an unknown source should always be handled as byte
> streams. The XML parser must be able to switch the encoding according to
> the declaration it finds in the XML header; this is not possible with
> Readers.
>
>
>
> Uwe
>
>
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>   ------------------------------
>
> *From:* Shai Erera [mailto:serera@gmail.com]
> *Sent:* Friday, April 10, 2009 8:47 AM
> *To:* java-dev@lucene.apache.org
> *Subject:* Benchmark: EnwikiDocMaker does not use fileIn (BufferedReader)
>
>
>
> I started working on the patch for 1591, and noticed EnwikiDocMaker uses
> the FileInputStream instance from LineDocMaker and not the BufferedReader. I
> don't see any reason for this, as InputSource accepts a Reader. I can change
> it as part of 1591, unless you think I'm missing something.
>

RE: Benchmark: EnwikiDocMaker does not use fileIn (BufferedReader)

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi Shai,

 

With XML parsers you should generally avoid using Readers, unless you know
for certain that the underlying XML encoding really is the one given to the
Reader. Readers should only be passed for sources that are independent of
the encoding (like Java Strings containing XML, and without an encoding
declaration!).

 

Good examples of correctly using a Reader are:

- new InputSource(new StringReader("<tag>..</tag>"));  // no xml declaration

- XML serialized from SAX/DOM directly to a Writer (so it carries no
encoding), e.g. kept as a String in a Lucene stored field.
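
To make the first case concrete, here is a small sketch (the class name
StringReaderSource and the method rootText are made up for illustration). The
XML is already a Java String, so no byte-to-character decoding takes place and
a Reader is safe:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class StringReaderSource {

    // Hypothetical helper: parses XML held in a String and returns the text
    // content of the root element.
    static String rootText(String xml) throws Exception {
        // Characters in, characters out: there is no encoding to get wrong.
        InputSource source = new InputSource(new StringReader(xml));
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(source);
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // No XML declaration in the input, as recommended above.
        System.out.println(rootText("<tag>caf\u00e9</tag>"));
    }
}
```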

 

But documents from an unknown source should always be handled as byte
streams. The XML parser must be able to switch the encoding according to the
declaration it finds in the XML header; this is not possible with Readers.
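
A rough illustration of the difference, as a sketch only (the class
EncodingDemo and its method names are invented for this example). The same
ISO-8859-1 bytes are parsed once as a byte stream and once through a Reader
that was given the wrong charset. With the byte stream the parser follows the
declaration; with the Reader the bytes are decoded before the parser sees
them, so the accented character should come out garbled.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

public class EncodingDemo {

    // A document that declares ISO-8859-1 and is really encoded that way.
    static final byte[] XML_BYTES =
        "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><tag>caf\u00e9</tag>"
            .getBytes(StandardCharsets.ISO_8859_1);

    // Returns the text content of the root element.
    static String parse(InputSource source) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(source).getDocumentElement().getTextContent();
    }

    // Byte stream: the parser reads the declaration and picks the decoder.
    static String viaByteStream() throws Exception {
        return parse(new InputSource(new ByteArrayInputStream(XML_BYTES)));
    }

    // Reader with a guessed (wrong) charset: the damage is done before the
    // parser gets a chance to look at the declaration.
    static String viaWrongReader() throws Exception {
        Reader reader = new InputStreamReader(
            new ByteArrayInputStream(XML_BYTES), StandardCharsets.UTF_8);
        return parse(new InputSource(reader));
    }

    public static void main(String[] args) throws Exception {
        System.out.println("byte stream:  " + viaByteStream());
        System.out.println("wrong reader: " + viaWrongReader());
    }
}
```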

 

Uwe

 

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

  _____  

From: Shai Erera [mailto:serera@gmail.com] 
Sent: Friday, April 10, 2009 8:47 AM
To: java-dev@lucene.apache.org
Subject: Benchmark: EnwikiDocMaker does not use fileIn (BufferedReader)

 

I started working on the patch for 1591, and noticed EnwikiDocMaker uses the
FileInputStream instance from LineDocMaker and not the BufferedReader. I
don't see any reason for this, as InputSource accepts a Reader. I can change
it as part of 1591, unless you think I'm missing something.