You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2009/04/08 20:39:12 UTC
[jira] Commented: (LUCENE-1591) Enable bzip compression in benchmark

    [ https://issues.apache.org/jira/browse/LUCENE-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12697143#action_12697143 ] 

Michael McCandless commented on LUCENE-1591:
--------------------------------------------

I'm hitting this, when trying to convert the 20090306 Wikipedia export to a line file:

{code}
Exception in thread "Thread-0" java.lang.ArrayIndexOutOfBoundsException: 2048
	at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
	at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
	at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
	at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
	at org.apache.lucene.benchmark.byTask.feeds.EnwikiDocMaker$Parser.run(EnwikiDocMaker.java:77)
	at java.lang.Thread.run(Thread.java:619)
{code}

>From this:

  http://marc.info/?l=xerces-j-user&m=120452263925040&w=2

It sounds likely an upgrade to xerces 2.9.1 will fix it.  I'm testing it now... if it fixes the issue, I'll commit the upgrade to contrib/benchmark.

> Enable bzip compression in benchmark
> ------------------------------------
>
>                 Key: LUCENE-1591
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1591
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>            Reporter: Shai Erera
>             Fix For: 2.9
>
>
> bzip compression can aid the benchmark package by not requiring extracting bzip files (such as enwiki) in order to index them. The plan is to add a config parameter bzip.compression=true/false and in the relevant tasks either decompress the input file or compress the output file using the bzip streams.
> It will add a dependency on ant.jar which contains two classes similar to GZIPOutputStream and GZIPInputStream which compress/decompress files using the bzip algorithm.
> bzip is known to be superior in its compression performance to the gzip algorithm (~20% better compression), although it does the compression/decompression a bit slower.
> I wil post a patch which adds this parameter and implement it in LineDocMaker, EnwikiDocMaker and WriteLineDoc task. Maybe even add the capability to DocMaker or some of the super classes, so it can be inherited by all sub-classes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org