Posted to user@tika.apache.org by Vigneshwaran <vi...@gmail.com> on 2012/09/27 10:47:14 UTC

Extract only the filenames from an archive

Hello all,

I am new to Apache Tika. I want Tika to output only the names of the
files within the archive (if the input file is an archive) and the
file content as usual if the input file is not an archive. Is there a
way I can do that?

Thank you,
Vigneshwaran R

Re: Extract only the filenames from an archive

Posted by Vigneshwaran <vi...@gmail.com>.

Hi,

Sorry, the previous message was a mistake. I think I've got it now.

<snip>
context.set(EmbeddedDocumentExtractor.class,
        new ParsingEmbeddedDocumentExtractor(this.context) {
            @Override
            public void parseEmbedded(
                    InputStream stream, ContentHandler handler,
                    Metadata metadata, boolean outputHtml)
                    throws SAXException, IOException {
                if (outputHtml) {
                    AttributesImpl attributes = new AttributesImpl();
                    attributes.addAttribute(
                            "", "class", "class", "CDATA", "package-entry");
                    handler.startElement(XHTML, "div", "div", attributes);
                }

                String name = metadata.get(Metadata.RESOURCE_NAME_KEY);
                if (name != null && name.length() > 0 && outputHtml) {
                    handler.startElement(XHTML, "h1", "h1", new AttributesImpl());
                    char[] chars = name.toCharArray();
                    handler.characters(chars, 0, chars.length);
                    handler.endElement(XHTML, "h1", "h1");
                }

                // Just removed the parsing logic here, and it works :)

                if (outputHtml) {
                    handler.endElement(XHTML, "div", "div");
                }
            }
        });
</snip>

-- 
Vigneshwaran R

Re: Extract only the filenames from an archive

Posted by Vigneshwaran <vi...@gmail.com>.

Hi,

I'd rather not create my own parser.

I was trying this:

        context.set(DocumentSelector.class, new DocumentSelector() {
            @Override
            public boolean select(Metadata mtdt) {
                return false;
            }
        });


It's close to what I want. The parse method now works as usual for
normal files, but it doesn't extract anything from archives. I need
just the names of the files inside the archive. How do I get those
with this approach?

-- 
Vigneshwaran R

Re: Extract only the filenames from an archive

Posted by Nick Burch <ap...@gagravarr.org>.

On Thu, 27 Sep 2012, Vigneshwaran wrote:
> I am new to Apache Tika. I want Tika to output only the names of the 
> files within the archive (if the input file is an archive) and the file 
> content as usual if the input file is not an archive. Is there a way I 
> can do that?

Yup. Rather than passing in something like AutoDetectParser in the
ParseContext, pass in your own custom one. When that is called for an
embedded document (e.g. a document within an archive), rather than
processing the embedded resource, simply print out the name and return.

Nick
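
For anyone who wants to see the requested behaviour in isolation, here
is a minimal sketch that does not use Tika at all. It swaps in the JDK's
java.util.zip as a stand-in, so it only recognises ZIP archives. The
class name ArchiveNames and the methods describe and zipOf are made up
for this illustration. The idea is the same as in the thread: for an
archive, collect each entry's name and skip its content; for anything
else, return the content as usual.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of Tika: ZIP-only stand-in for the idea above.
final class ArchiveNames {

    /**
     * Returns the entry names (one per line) if data looks like a ZIP
     * archive, otherwise the data decoded as UTF-8 text.
     */
    static String describe(byte[] data) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            ZipEntry entry;
            // getNextEntry() returns null when no (further) ZIP entry header is found.
            while ((entry = zip.getNextEntry()) != null) {
                // Keep only the name; the entry's content is skipped, not returned.
                names.add(entry.getName());
            }
        }
        if (names.isEmpty()) {
            // Assumption: no entries means "not an archive", so treat it as text.
            return new String(data, StandardCharsets.UTF_8);
        }
        return String.join("\n", names);
    }

    /** Builds a small in-memory ZIP with the given entry names, for the demo below. */
    static byte[] zipOf(String... names) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write(("content of " + name).getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Archive input: only the entry names are printed.
        System.out.println(describe(zipOf("a.txt", "docs/b.txt")));
        // Non-archive input: the content itself is printed.
        System.out.println(describe("plain text".getBytes(StandardCharsets.UTF_8)));
    }
}
```

Note that an empty ZIP file has no entries and would fall through to
the text branch here. With Tika the type detection and the other
archive formats are handled for you, so the approaches discussed above
are probably the better fit when Tika is already in use.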