
Posted to java-user@lucene.apache.org by Luke Shannon <ls...@futurebrand.com> on 2005/03/01 16:34:17 UTC

Zip Files

Hello;

Does anyone have any ideas on how to index the contents of zip files?

Thanks,

Luke


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: Zip Files

Posted by Luke Shannon <ls...@futurebrand.com>.

Hello;

I posted a question about ZIP files a while back. I thought I would post the
solution I arrived at.

Below is how I ended up handling ZIP files. If a ZIP contains another ZIP, I
ignore the embedded ZIP. If requests come in to handle this situation, I think
I will have to unzip to a temp folder, index, and then delete the temp folder.

NOTE:
In our system all documents contained in a ZIP need to be associated with a
single document in our CMS. This is why all the fields obtained from the
archived files are added to the same collection (and eventually written to
the same document).

Luke

//zip files
        else if (attached.getPath().endsWith(".zip")) {
            Document attachedDoc = new Document();
            Trace.DEBUG("Got a zip file to index: " + attached.getPath());
            try {
                ZipFile zip = new ZipFile(attached);
                ZipEntry zipEntry;
                Enumeration files = zip.entries();
                // collects the fields returned for every entry in the archive
                Vector totalEnummaration = new Vector();
                while (files.hasMoreElements()) {
                    zipEntry = (ZipEntry) files.nextElement();
                    Trace.DEBUG("The zip contains file: " + zipEntry.getName());
                    Enumeration data = indexAttached(zip.getInputStream(zipEntry),
                            attached.getPath(), zipEntry.getName());
                    /*
                     * put the returned fields in the vector
                     */
                    insertFields(totalEnummaration, data);
                    data = null;
                }
                return totalEnummaration.elements();
            } catch (Exception e) {
                Trace.ERROR("INDEXING ERROR: Was unable to index Zip file: "
                        + attached.getPath() + " " + e);
                return null;
            }
        }
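
For the ZIP-inside-ZIP case mentioned above, a temp folder may not be needed. The following is only a sketch, not part of the posted solution: it assumes the entries can be read as streams, and the class name NestedZipWalker, the method listEntries, and the "!/" path separator are made up for illustration. It wraps the stream of an embedded ZIP entry in another ZipInputStream and recurses, collecting entry names where a real indexer would call its parser.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not from the original post.
class NestedZipWalker {

    /** Collects the names of all non-archive entries, descending into embedded ZIPs. */
    static List<String> listEntries(InputStream in, String prefix) throws IOException {
        List<String> names = new ArrayList<>();
        // Deliberately not closed here: closing a nested ZipInputStream would
        // also close the outer stream it reads from. The caller closes `in`.
        ZipInputStream zis = new ZipInputStream(in);
        ZipEntry entry;
        while ((entry = zis.getNextEntry()) != null) {
            if (entry.isDirectory()) {
                continue;
            }
            String path = prefix + entry.getName();
            if (entry.getName().toLowerCase().endsWith(".zip")) {
                // The current entry is itself an archive: read it as a ZIP in place.
                names.addAll(listEntries(zis, path + "!/"));
            } else {
                // A real indexer would hand `zis` to the parser for this entry here.
                names.add(path);
            }
        }
        return names;
    }
}
```

The trade-off is that ZipInputStream reads sequentially, so entries can only be visited in archive order, and a parser must not close the stream it is given.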




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Zip Files

Posted by Chris Lamprecht <cl...@gmail.com>.

Luke,

Look at the javadocs for java.io.ByteArrayInputStream - it wraps a
byte array and makes it accessible as an InputStream.  Also see
java.util.zip.ZipFile.  You should be able to read and parse all
contents of the zip file in memory.

http://java.sun.com/j2se/1.4.2/docs/api/java/io/ByteArrayInputStream.html
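
A minimal sketch of that idea, assuming each entry is small enough to hold in memory: drain the entry's stream into a byte array, then wrap the array in a ByteArrayInputStream. The class EntryBuffer and its method names are hypothetical; only the java.io classes are real.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration.
class EntryBuffer {

    /** Drains a stream into a byte array. Does not close the stream. */
    static byte[] toBytes(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    /** Returns an independent in-memory stream over a copy of the entry content. */
    static InputStream inMemory(InputStream entryStream) throws IOException {
        return new ByteArrayInputStream(toBytes(entryStream));
    }
}
```

With a ZipFile this might be used as EntryBuffer.inMemory(zip.getInputStream(zipEntry)); the returned stream stays readable after the ZipFile is closed, at the cost of buffering the whole entry.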


On Tue, 1 Mar 2005 12:39:17 -0500, Luke Shannon
<ls...@futurebrand.com> wrote:
> Thanks Ernesto.
> 
> I'm struggling with how I can work with an array of bytes instead of a
> Java File.
> 
> It would be easier to unzip the zip to a temp directory, parse the files, and
> then delete the directory. But this would greatly slow indexing and use up
> disk space.
> 
> Luke

Re: Zip Files

Posted by Luke Shannon <ls...@futurebrand.com>.

Thanks Ernesto.

The issue I'm working on now (this is more lack of experience than
anything) is getting an input I can index. All my indexing classes (doc,
pdf, xml, ppt) take a File object as a parameter and return a Lucene
Document containing all the fields I need.

I'm struggling with how I can work with an array of bytes instead of a
Java File.

It would be easier to unzip the zip to a temp directory, parse the files, and
then delete the directory. But this would greatly slow indexing and use up
disk space.

Luke
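
One way around the File-only signatures, sketched here under assumptions: move the parsing logic into a method that takes an InputStream and keep the File method as a thin wrapper around it. PlainTextExtractor is a hypothetical stand-in for the indexing classes described above, and it returns a String rather than a Lucene Document so the example has no dependencies.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for one of the File-based indexing classes.
class PlainTextExtractor {

    /** File-based entry point: existing callers keep working unchanged. */
    static String extract(File file) throws IOException {
        try (InputStream in = new FileInputStream(file)) {
            return extract(in);
        }
    }

    /** Stream-based entry point: works for files, zip entries, or byte arrays. */
    static String extract(InputStream in) throws IOException {
        // Assumes UTF-8 text; a real parser would detect or be told the encoding.
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        StringBuilder text = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            text.append(buf, 0, n);
        }
        return text.toString();
    }
}
```

A byte array can then be indexed with extract(new ByteArrayInputStream(bytes)), and a zip entry with the stream obtained from the archive, without writing anything to disk.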


Re: Zip Files

Posted by Ernesto De Santis <er...@colaborativa.net>.

Hello,

First, you need a parser for each file type: pdf, txt, word, etc.
Then use a Java API to iterate over the zip contents; see:

http://java.sun.com/j2se/1.4.2/docs/api/java/util/zip/ZipInputStream.html

Use the getNextEntry() method.

A little example:

ZipInputStream zis = new ZipInputStream(fileInputStream);
ZipEntry zipEntry;
// the assignment needs its own parentheses; otherwise the comparison binds first
while ((zipEntry = zis.getNextEntry()) != null) {
    // use zipEntry to get the name, etc.
    // pick the proper parser for the current entry
    // run the parser on zis (the ZipInputStream)
}
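
A fuller version of the loop above might look like the following sketch. The EntryParser interface, the ZipEntryDispatcher class, and the choice to pick a parser by file extension are assumptions made for the example; only the java.util.zip calls come from the JDK.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical dispatcher for illustration.
class ZipEntryDispatcher {

    /** Hypothetical parser contract: read the stream, return the extracted text. */
    interface EntryParser {
        String parse(InputStream in) throws IOException;
    }

    /** Parses every entry that has a registered parser; returns text keyed by entry name. */
    static Map<String, String> parseAll(InputStream zipStream,
                                        Map<String, EntryParser> parsersByExtension)
            throws IOException {
        Map<String, String> results = new LinkedHashMap<>();
        try (ZipInputStream zis = new ZipInputStream(zipStream)) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                String name = entry.getName();
                int dot = name.lastIndexOf('.');
                String ext = dot < 0 ? "" : name.substring(dot + 1).toLowerCase();
                EntryParser parser = parsersByExtension.get(ext);
                if (parser == null) {
                    continue; // no parser registered for this file type
                }
                // zis reports end-of-stream at the end of the current entry,
                // so the parser sees only this entry's bytes. It must not close zis.
                results.put(name, parser.parse(zis));
            }
        }
        return results;
    }
}
```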

good luck
Ernesto

-- 
Ernesto De Santis - Colaborativa.net
Córdoba 1147 Piso 6 Oficinas 3 y 4
(S2000AWO) Rosario, SF, Argentina.


