Posted to user@tika.apache.org by Augusto Callejas <ac...@appliedminds.com> on 2011/04/07 21:48:18 UTC

extract metadata, but not content

Hi,

I'm using the AutoDetectParser to extract metadata and content from a file.

Is there a way to turn off content extraction, but keep metadata extraction on?


===
    final Parser parser = new AutoDetectParser();
    final FileInputStream input = new FileInputStream(file);
    final StringWriter writer = new StringWriter();
    final ContentHandler handler = new BodyContentHandler(writer);

    final Metadata metadata = new Metadata();
    metadata.set(Metadata.RESOURCE_NAME_KEY, file.getName());

    final ParseContext context = new ParseContext();
    parser.parse(input, handler, metadata, context);
===

thanks,
augusto.

Re: extract metadata, but not content

Posted by Nick Burch <ni...@alfresco.com>.
On Thu, 7 Apr 2011, Augusto Callejas wrote:
> I'm using the AutoDetectParser to extract metadata and content from a
> file.
>
> Is there a way to turn off content extraction, but keep metadata
> extraction on?

You could pass in a ContentHandler that simply ignores all the XHTML events.
However, there's no way to tell the parsers not to generate them in the
first place, sorry.

Nick
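
A minimal sketch of the suggestion above, not something posted in the thread: the JDK's org.xml.sax.helpers.DefaultHandler implements ContentHandler with callbacks that do nothing, so it can serve as the "ignore everything" handler. To keep the example runnable without Tika on the classpath, emitSampleEvents below is a hypothetical stand-in for a parser; the class name and sample text are likewise assumptions made for illustration.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class IgnoreContentSketch {

    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Hypothetical stand-in for a parser: pushes a few XHTML-style SAX events
    // at whatever handler it is given and returns how many events it emitted.
    static int emitSampleEvents(ContentHandler handler) {
        char[] text = "body text nobody asked for".toCharArray();
        try {
            handler.startDocument();
            handler.startElement(XHTML, "p", "p", new AttributesImpl());
            handler.characters(text, 0, text.length);
            handler.endElement(XHTML, "p", "p");
            handler.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return 5;
    }

    public static void main(String[] args) {
        // DefaultHandler's callbacks are no-ops, so every event is discarded.
        ContentHandler ignoring = new DefaultHandler();
        int emitted = emitSampleEvents(ignoring);

        // The events were still generated; they were just thrown away.
        System.out.println("events emitted and discarded: " + emitted);
    }
}
```

In the original snippet this would presumably mean replacing the BodyContentHandler with new DefaultHandler() in the parser.parse(input, handler, metadata, context) call and reading the Metadata object afterwards. As noted in the reply, the parser still does the work of generating the events; the handler only avoids buffering the text.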