Posted to java-user@lucene.apache.org by tsuraan <ts...@gmail.com> on 2009/03/06 23:02:34 UTC

ZipFile directory implementation

I wrote a really basic read-only Directory implementation for indices
contained in zip files.  It's read-only because that's what Java's API
supports, and it has no documentation or anything else because I
haven't gotten to that yet.  It also claims its package is
org.apache.lucene.store since that's how I was testing it.

Anyhow, it's really ugly, but seems to work.  I was wondering if
anybody wanted to have a glance at it to see if there's anything
obvious that I'm doing wrong, simple off-by-one errors, that sort of
thing.

The code is on github,
http://github.com/tsuraan/zipdirectory/tree/master .  If anybody wants
to have a look, test it out a bit, whatever, I'd be grateful.  There's
no license headers on the source either; I figured public domain, bsd,
apache-2.0, whatever works would be fine.  I'm also open to better
methods of packaging it; I assume that putting it in the lucene
package like that myself isn't quite the right way to do things...

Thanks!

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: ZipFile directory implementation

Posted by tsuraan <ts...@gmail.com>.

> Also, have you looked at how it performs?

I tried making a directory of 1,000,000 documents and reading from it, and
it looks like this implementation is probably unbearably slow, unless
Lucene has some really good caching.  ZipFile gives InputStreams for
the zip contents, and InputStreams don't support a seek (I'm guessing
that zip files probably don't either), so nearly every time Lucene
calls seek on the IndexInput we actually have to do a ton of reading.
I'm guessing that there's no good way around this, so my ZipDirectory
is probably not good for much unless you're really hard-up for drive
space, or if you want to use Directory.copy to put it into a
RAMDirectory for actual use :)
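
To make the cost concrete, here is a minimal sketch of how a seek can be
emulated on top of the forward-only InputStream that ZipFile hands out.
It is not the code from the zipdirectory repository; the class name
SeekableZipEntry and the bytesSkipped counter are made up for
illustration.  A backwards seek reopens the entry and decompresses from
the start, which is where the "ton of reading" comes from.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Hypothetical helper: emulates seek() on a single zip entry. */
final class SeekableZipEntry implements AutoCloseable {
    private final ZipFile zip;
    private final ZipEntry entry;
    private InputStream in;
    private long position;      // offset of the next byte read() will return
    private long bytesSkipped;  // bytes read and discarded only to reposition

    SeekableZipEntry(ZipFile zip, String name) throws IOException {
        this.zip = zip;
        this.entry = zip.getEntry(name);
        if (entry == null) {
            throw new IOException("no such entry: " + name);
        }
        this.in = zip.getInputStream(entry);
    }

    /** Moves to an absolute offset within the uncompressed entry. */
    void seek(long target) throws IOException {
        if (target < position) {
            // The stream cannot go backwards: reopen and start from byte 0.
            in.close();
            in = zip.getInputStream(entry);
            position = 0;
        }
        while (position < target) {
            long n = in.skip(target - position);
            if (n <= 0) {
                // skip() may legitimately return 0; fall back to one read.
                if (in.read() < 0) {
                    throw new EOFException("seek past end of entry");
                }
                n = 1;
            }
            position += n;
            bytesSkipped += n;
        }
    }

    /** Reads one byte, or returns -1 at the end of the entry. */
    int read() throws IOException {
        int b = in.read();
        if (b >= 0) {
            position++;
        }
        return b;
    }

    long bytesSkipped() {
        return bytesSkipped;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}
```

With this sketch, seeking to offset 500 and then back to offset 10
discards 510 bytes in total, so an access pattern that jumps backwards
often pays roughly the full decompression cost on each jump.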

Is there any compressed format that does support random access?
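
One general approach to this question is to compress the data in
independent fixed-size blocks and keep an index of them, so a seek only
has to decompress the one block that holds the target offset (dictzip
and BGZF are existing formats built on that idea).  The following is
only an in-memory sketch of the technique using java.util.zip; the class
name BlockCompressedBytes and the block size are assumptions, and
nothing here is part of ZipDirectory or Lucene.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical sketch: random access over independently deflated blocks. */
final class BlockCompressedBytes {
    private final int blockSize;
    private final long length;
    private final List<byte[]> blocks = new ArrayList<>();

    BlockCompressedBytes(byte[] data, int blockSize) {
        this.blockSize = blockSize;
        this.length = data.length;
        // Each block is compressed on its own, so it can be inflated alone.
        for (int off = 0; off < data.length; off += blockSize) {
            int len = Math.min(blockSize, data.length - off);
            blocks.add(deflate(data, off, len));
        }
    }

    int blockCount() {
        return blocks.size();
    }

    /** Returns the byte at pos, inflating only the block that contains it. */
    byte readByte(long pos) throws DataFormatException {
        if (pos < 0 || pos >= length) {
            throw new IndexOutOfBoundsException("pos=" + pos);
        }
        byte[] block = inflate(blocks.get((int) (pos / blockSize)));
        return block[(int) (pos % blockSize)];
    }

    private static byte[] deflate(byte[] data, int off, int len) {
        Deflater deflater = new Deflater();
        deflater.setInput(data, off, len);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    private byte[] inflate(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] block = new byte[blockSize];
        int total = 0;
        while (!inflater.finished() && total < block.length) {
            int n = inflater.inflate(block, total, block.length - total);
            if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                break; // truncated or unexpected input; stop instead of spinning
            }
            total += n;
        }
        inflater.end();
        return block;
    }
}
```

The trade-off is that smaller blocks make seeks cheaper but usually
compress worse, since each block starts with an empty compression
history.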

Re: ZipFile directory implementation

Posted by Michael McCandless <lu...@mikemccandless.com>.

tsuraan wrote:

>> Sounds interesting.  Can you tell us a bit more about the use case for
>> it?  Is it basically you are in a situation where you can't unzip the
>> index?
>
> Indices compress pretty nicely: 30% to 50% in my experience.  So, if your
> indices are read-only anyhow (mine aren't live; we do batch jobs to modify
> them, so they're mostly read-only), they might as well be stored compressed
> to save on disk usage.  Sometimes on-disk compression of files (in general)
> can help throughput, since the drive IO tends to be a bottleneck rather than
> the CPU load; I don't know whether that's true of zipped lucene indices
> though.
>
>> Also, have you looked at how it performs?
>
> No, I'm not sure how to do this; what are good benchmarks of store
> performance?  Write speed tends to be a significant thing to test, but my
> ZipDirectory doesn't support writing.  What other operations tend to be
> commonly done in searching?  I could create an IndexReader and call document
> and getTermFreqVectors for each doc in my reader.  Is that a useful test, or
> is there some established body of useful measures on a store?

You could use contrib/benchmark.

I think query performance, for simple term queries, AND, OR, phrase,  
etc., would be interesting.

It sounds like the model is, you use a normal Lucene directory to  
create the index, then you zip it up, at which point you can then use  
ZipDirectory to search it.
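
The "zip it up" step of that model could look roughly like the sketch
below, which uses only the JDK (java.nio.file needs Java 7 or later).
The class and method names are hypothetical, and it assumes the index
directory is flat, as Lucene index directories normally are, and that
the zip file is written somewhere outside that directory.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper: packs the files of an index directory into a zip. */
final class IndexZipper {
    /** Zips every regular file in indexDir and returns how many were added. */
    static int zipDirectory(Path indexDir, Path zipFile) throws IOException {
        int count = 0;
        try (OutputStream out = Files.newOutputStream(zipFile);
             ZipOutputStream zip = new ZipOutputStream(out);
             DirectoryStream<Path> files = Files.newDirectoryStream(indexDir)) {
            for (Path file : files) {
                if (!Files.isRegularFile(file)) {
                    continue; // subdirectories are ignored in this sketch
                }
                // Entry names are the bare file names, e.g. "segments_1".
                zip.putNextEntry(new ZipEntry(file.getFileName().toString()));
                Files.copy(file, zip);
                zip.closeEntry();
                count++;
            }
        }
        return count;
    }
}
```

The index should not be modified while it is being zipped; otherwise the
archive may capture an inconsistent set of files.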

I think this would make a great contribution -- any chance you could  
package it up and attach a patch to a new Jira issue?

Mike

Re: ZipFile directory implementation

Posted by tsuraan <ts...@gmail.com>.

> Sounds interesting.  Can you tell us a bit more about the use case for it?
> Is it basically you are in a situation where you can't unzip the index?

Indices compress pretty nicely: 30% to 50% in my experience.  So, if your
indices are read-only anyhow (mine aren't live; we do batch jobs to modify
them, so they're mostly read-only), they might as well be stored compressed
to save on disk usage.  Sometimes on-disk compression of files (in general)
can help throughput, since the drive IO tends to be a bottleneck rather than
the CPU load; I don't know whether that's true of zipped lucene indices
though.

> Also, have you looked at how it performs?

No, I'm not sure how to do this; what are good benchmarks of store
performance?  Write speed tends to be a significant thing to test, but my
ZipDirectory doesn't support writing.  What other operations tend to be
commonly done in searching?  I could create an IndexReader and call document
and getTermFreqVectors for each doc in my reader.  Is that a useful test, or
is there some established body of useful measures on a store?

Re: ZipFile directory implementation

Posted by Grant Ingersoll <gs...@apache.org>.

Hi,

Sounds interesting.  Can you tell us a bit more about the use case for  
it?  Is it basically you are in a situation where you can't unzip the  
index?

Also, have you looked at how it performs?

-Grant

On Mar 6, 2009, at 5:02 PM, tsuraan wrote:

> I wrote a really basic read-only Directory implementation for indices
> contained in zip files.  It's read-only because that's what Java's API
> supports, and it has no documentation or anything else because I
> haven't gotten to that yet.  It also claims its package is
> org.apache.lucene.store since that's how I was testing it.
>
> Anyhow, it's really ugly, but seems to work.  I was wondering if
> anybody wanted to have a glance at it to see if there's anything
> obvious that I'm doing wrong, simple off-by-one errors, that sort of
> thing.
>
> The code is on github,
> http://github.com/tsuraan/zipdirectory/tree/master .  If anybody wants
> to have a look, test it out a bit, whatever, I'd be grateful.  There's
> no license headers on the source either; I figured public domain, bsd,
> apache-2.0, whatever works would be fine.  I'm also open to better
> methods of packaging it; I assume that putting it in the lucene
> package like that myself isn't quite the right way to do things...
>
> Thanks!

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search

