Posted to solr-user@lucene.apache.org by Frédéric Olier <FO...@wooxo.fr> on 2015/11/04 15:41:00 UTC

Managing ZIP files inside ZIP files

Hi,

I have an archive (tar.gz) that contains more than 100 other tar.gz files.

Solr takes ages to ingest the document.

I'd like to know whether other users have experience with such a configuration and what solutions they found.

Is there a way to tell Solr to go only one level deep while analysing the archive contents?
Is that the right approach?

Thanks for your response.

F. OLIER.


Re: Managing ZIP files inside ZIP files

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
How are you ingesting them now?

I'd probably use Java 8 with SolrJ and the NIO virtual file system approach
to read right out of the zip. (The built-in provider covers zip/jar only;
gzip goes through java.util.zip.GZIPInputStream instead.)
http://docs.oracle.com/javase/8/docs/api/java/nio/file/FileSystems.html#newFileSystem-java.nio.file.Path-java.lang.ClassLoader-
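
A minimal sketch of that approach, assuming the input is a real zip file: it mounts the archive as a file system and lists the regular files inside it without unpacking anything to disk. The class name ZipWalk and the method listEntries are made up for illustration, and sending the entries to Solr via SolrJ is left out.

```java
import java.io.IOException;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

public class ZipWalk {
    /** Lists the regular files inside a zip archive without unpacking it to disk. */
    public static List<String> listEntries(Path zip) throws IOException {
        List<String> names = new ArrayList<>();
        // A null ClassLoader means only the installed providers are searched,
        // which includes the JDK's zip provider. The cast keeps the call
        // unambiguous on newer JDKs that have other two-argument overloads.
        try (FileSystem fs = FileSystems.newFileSystem(zip, (ClassLoader) null)) {
            for (Path root : fs.getRootDirectories()) {
                try (Stream<Path> walk = Files.walk(root)) {
                    walk.filter(Files::isRegularFile)
                        .forEach(p -> names.add(p.toString()));
                }
            }
        }
        return names;
    }
}
```

Each returned name is a path inside the archive (for example /dir/b.txt). While the file system is still open, the same Path objects can be passed to Files.newInputStream to read the content.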

Tar is a bit harder. Apache Commons Compress can read it, but probably
not in the Java 8 (NIO file system) way. You may have to extract each entry
into a memory buffer and read the file from that.

But basically both tar and gzip are streaming formats, so you should be
able to do a single pass through them with in-memory decompression.
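
To illustrate the single-pass idea, here is a rough sketch that reads a tar.gz stream once, front to back, and opens nested .tar.gz entries in memory up to a fixed depth (1 for "one level deep"). Several things here are assumptions and not something from this thread: the class name TarGzOneLevel, the "!/" separator for nested names, and above all the hand-rolled tar header parsing, which is only there to keep the example free of dependencies. It ignores checksums, ustar name prefixes, GNU long names and pax headers; in practice a library reader such as TarArchiveInputStream from Apache Commons Compress would normally take its place.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

public class TarGzOneLevel {
    private static final int BLOCK = 512; // tar works in 512-byte blocks

    /** Lists entry names; nested .tar.gz/.tgz entries are opened while depth > 0. */
    public static List<String> list(InputStream gz, int depth) throws IOException {
        List<String> names = new ArrayList<>();
        DataInputStream in = new DataInputStream(new GZIPInputStream(gz));
        byte[] header = new byte[BLOCK];
        while (true) {
            try {
                in.readFully(header);
            } catch (EOFException e) {
                break; // stream ended without the usual zero-block marker
            }
            if (header[0] == 0) break; // zero block marks the end of the archive
            String name = field(header, 0, 100);
            String octal = field(header, 124, 12);
            long size = octal.isEmpty() ? 0 : Long.parseLong(octal, 8);
            long padded = (size + BLOCK - 1) / BLOCK * BLOCK;
            boolean regular = header[156] == '0' || header[156] == 0;
            if (regular) names.add(name);
            boolean nested = name.endsWith(".tar.gz") || name.endsWith(".tgz");
            if (regular && nested && depth > 0) {
                // Buffer the nested archive in memory (assumes it is under 2 GB).
                byte[] data = new byte[(int) size];
                in.readFully(data);
                skipFully(in, padded - size);
                for (String inner : list(new ByteArrayInputStream(data), depth - 1)) {
                    names.add(name + "!/" + inner);
                }
            } else {
                skipFully(in, padded); // not descending: skip the entry body
            }
        }
        return names;
    }

    /** Reads a NUL-terminated header field and trims surrounding spaces. */
    private static String field(byte[] b, int off, int len) {
        int end = off;
        while (end < off + len && b[end] != 0) end++;
        return new String(b, off, end - off, StandardCharsets.UTF_8).trim();
    }

    /** skip() may skip fewer bytes than asked, so loop until done. */
    private static void skipFully(DataInputStream in, long n) throws IOException {
        while (n > 0) {
            long skipped = in.skip(n);
            if (skipped <= 0) {
                if (in.read() < 0) throw new EOFException();
                skipped = 1;
            }
            n -= skipped;
        }
    }
}
```

Calling TarGzOneLevel.list(Files.newInputStream(path), 1) reports the outer entries plus the contents of each nested archive, while archives nested deeper are listed by name only and skipped over. Entries that are not descended into are never buffered, so memory use is bounded by the largest nested archive.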

Still, without knowing what you do, it is hard to tell where "slow" is
coming from.

Regards,
   Alex.

----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/
