Posted to user@tika.apache.org by Vigneshwaran <vi...@gmail.com> on 2012/09/27 10:47:14 UTC
Extract only the filenames from an archive
Hello all,
I am new to Apache Tika. I want Tika to output only the names of the
files within the archive (if the input file is an archive) and the
file content as usual if the input file is not an archive. Is there a
way I can do that?
Thank you,
Vigneshwaran R
Re: Extract only the filenames from an archive
Posted by Vigneshwaran <vi...@gmail.com>.
Hi,
Sorry, the previous message was a mistake. I think I've got it now.
<snip>
context.set(EmbeddedDocumentExtractor.class,
        new ParsingEmbeddedDocumentExtractor(this.context) {
    public void parseEmbedded(
            InputStream stream, ContentHandler handler,
            Metadata metadata, boolean outputHtml)
            throws SAXException, IOException {
        if (outputHtml) {
            AttributesImpl attributes = new AttributesImpl();
            attributes.addAttribute("", "class", "class",
                    "CDATA", "package-entry");
            handler.startElement(XHTML, "div", "div", attributes);
        }
        String name = metadata.get(Metadata.RESOURCE_NAME_KEY);
        if (name != null && name.length() > 0 && outputHtml) {
            handler.startElement(XHTML, "h1", "h1", new AttributesImpl());
            char[] chars = name.toCharArray();
            handler.characters(chars, 0, chars.length);
            handler.endElement(XHTML, "h1", "h1");
        }
        // Just removed the parsing logic here, and it works :)
        if (outputHtml) {
            handler.endElement(XHTML, "div", "div");
        }
    }
});
</snip>
--
Vigneshwaran R
Re: Extract only the filenames from an archive
Posted by Vigneshwaran <vi...@gmail.com>.
Hi,
I'd rather not create my own parser.
I was trying this:
context.set(DocumentSelector.class, new DocumentSelector() {
    @Override
    public boolean select(Metadata mtdt) {
        return false;
    }
});
It's close to what I want. The parse method now works as usual for
normal files, but it doesn't extract anything from archives. I need
just the names of the files inside the archives. How can I get those
with this approach?
--
Vigneshwaran R
Re: Extract only the filenames from an archive
Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 27 Sep 2012, Vigneshwaran wrote:
> I am new to Apache Tika. I want Tika to output only the names of the
> files within the archive (if the input file is an archive) and the file
> content as usual if the input file is not an archive. Is there a way I
> can do that?
Yup. Rather than passing in something like AutoDetectParser in the
ParseContext, pass in your own custom one. When that is called for an
embedded document (e.g. a document within an archive), rather than
processing the embedded resource, simply print out the name and return.
Nick
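For a quick cross-check of the names Tika reports, the entry names of a ZIP file can also be listed with the JDK alone. The sketch below is not Tika and does not implement the custom-parser approach described above; it swaps in java.util.zip and assumes the archive is a ZIP. The class name ZipEntryNames and its two helper methods are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class for illustration; not part of Tika.
final class ZipEntryNames {

    // Returns the names of all entries in a ZIP stream, in stored order.
    // Input that does not start with a ZIP local-file header yields an empty list.
    static List<String> list(InputStream in) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName()); // only the name is kept
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    // Builds a small in-memory ZIP so the example needs no files on disk.
    static byte[] sampleZip(String... entryNames) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : entryNames) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write(("content of " + name).getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] archive = sampleZip("a.txt", "docs/b.txt");
        for (String name : list(new ByteArrayInputStream(archive))) {
            System.out.println(name);
        }
    }
}
```

Because this only covers ZIP, it is a sanity check rather than a replacement: Tika's embedded-document handling is what covers the other archive formats it detects.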