Posted to user@tika.apache.org by Augusto Callejas <ac...@appliedminds.com> on 2011/04/07 21:48:18 UTC
extract metadata, but not content
hi-
i'm using the AutoDetectParser to extract metadata and content from a file.
is there a way to turn off content extraction, but keep metadata extraction on?
===
final AutoDetectParser parser = new AutoDetectParser();
final FileInputStream input = new FileInputStream(file);
final StringWriter writer = new StringWriter();
final ContentHandler handler = new BodyContentHandler(writer);
final Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, file.getName());
final ParseContext context = new ParseContext();
parser.parse(input, handler, metadata, context);
===
thanks,
augusto.
Re: extract metadata, but not content
Posted by Nick Burch <ni...@alfresco.com>.
On Thu, 7 Apr 2011, Augusto Callejas wrote:
> i'm using the AutoDetectParser to extract metadata and content from a
> file.
>
> is there a way to turn off content extraction, but keep metadata
> extraction on?
You could pass in a ContentHandler that just ignores all the xhtml events.
However, there's no way to tell the parsers not to generate them in the
first place, sorry.
Nick
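
A minimal sketch of the "ignore all the events" approach described above, assuming the standard org.xml.sax.helpers.DefaultHandler from the JDK is acceptable as the ContentHandler: its callbacks are all no-ops, so nothing is collected. The class name MetadataOnlySketch and the helper methods are hypothetical, and the event-emitting method only stands in for a real parser so the example runs without Tika on the classpath.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class MetadataOnlySketch {

    private static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // DefaultHandler implements every ContentHandler callback as a no-op,
    // so any XHTML events sent to it are simply discarded.
    static ContentHandler discardingHandler() {
        return new DefaultHandler();
    }

    // Hypothetical stand-in for a parser: fires the kind of SAX events
    // that would normally carry the document body.
    static void emitSampleEvents(ContentHandler handler) throws SAXException {
        char[] text = "some body text".toCharArray();
        handler.startDocument();
        handler.startElement(XHTML_NS, "p", "p", new AttributesImpl());
        handler.characters(text, 0, text.length);
        handler.endElement(XHTML_NS, "p", "p");
        handler.endDocument();
    }

    // Returns true if the handler accepted all events without error.
    static boolean runDemo() {
        try {
            emitSampleEvents(discardingHandler());
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // With Tika available, the same handler would take the place of the
        // BodyContentHandler in the original snippet (an assumption here,
        // since Tika is not imported in this sketch):
        //   parser.parse(input, discardingHandler(), metadata, context);
        // Afterwards only the Metadata object holds results.
        System.out.println(runDemo() ? "events discarded" : "handler failed");
    }
}
```

As noted in the reply, this only throws the content away after the fact: the parser still does the work of generating the events.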