Posted to dev@manifoldcf.apache.org by "Karl Wright (JIRA)" <ji...@apache.org> on 2014/10/28 13:40:34 UTC
[jira] [Resolved] (CONNECTORS-1088) Augment Tika extractor to allow
full use of boilerpipe content extraction
[ https://issues.apache.org/jira/browse/CONNECTORS-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Karl Wright resolved CONNECTORS-1088.
-------------------------------------
Resolution: Fixed
> Augment Tika extractor to allow full use of boilerpipe content extraction
> -------------------------------------------------------------------------
>
> Key: CONNECTORS-1088
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1088
> Project: ManifoldCF
> Issue Type: Improvement
> Components: Tika extractor
> Affects Versions: ManifoldCF 1.8, ManifoldCF 2.0
> Reporter: Karl Wright
> Assignee: Karl Wright
> Fix For: ManifoldCF 1.8, ManifoldCF 2.0
>
>
> Boilerpipe can process content further than our current Tika extractor implementation allows. Specifically, we should allow a user to specify a Boilerpipe extractor class from the following package (and, one expects, from other places too):
> http://boilerpipe.googlecode.com/svn/trunk/boilerpipe-core/javadoc/1.0/de/l3s/boilerpipe/extractors/package-summary.html
> If the extractor is specified, then our ContentHandler creation code in the Tika extractor changes from:
> {code}
> ContentHandler handler = new BodyContentHandler(w);
> {code}
> to:
> {code}
> ContentHandler handler = new BodyContentHandler(w);
> // Expand the user-supplied short name into a fully qualified class name.
> boilerpipe = "de.l3s.boilerpipe.extractors." + boilerpipe;
> try {
>   ClassLoader loader = BoilerpipeExtractor.class.getClassLoader();
>   Class<?> extractorClass = loader.loadClass(boilerpipe);
>   BoilerpipeExtractor boilerpipeExtractor = (BoilerpipeExtractor)extractorClass.newInstance();
>   // Wrap the plain handler so boilerpipe filters the content before it is written.
>   handler = new BoilerpipeContentHandler(handler, boilerpipeExtractor);
> } catch (ClassNotFoundException e) {
>   // On any failure, fall through with the unwrapped BodyContentHandler.
>   log.warn("BoilerpipeExtractor " + boilerpipe + " not found!");
> } catch (InstantiationException e) {
>   log.warn("Could not instantiate " + boilerpipe);
> } catch (Exception e) {
>   log.warn(e.toString());
> }
> {code}
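The reflective lookup in the snippet above (prefix a short name with a fixed package, load the class, instantiate it, and keep the default on any failure) can be tried in isolation. The following is only a minimal sketch of that pattern, not the ManifoldCF implementation. So that it runs without Tika or boilerpipe on the classpath, it assumes java.util as a stand-in for de.l3s.boilerpipe.extractors and java.util.List as a stand-in for BoilerpipeExtractor. The class name ExtractorLoaderSketch and the method loadOrDefault are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class ExtractorLoaderSketch {
    // Assumption: stand-in for "de.l3s.boilerpipe.extractors." so the sketch needs only the JDK.
    static final String PACKAGE_PREFIX = "java.util.";

    /**
     * Loads PACKAGE_PREFIX + shortName and instantiates it through its no-argument
     * constructor. Returns the fallback when no name is given, or when the class
     * is missing, is not a List, or cannot be instantiated.
     */
    public static List<String> loadOrDefault(String shortName, List<String> fallback) {
        if (shortName == null || shortName.isEmpty()) {
            return fallback; // nothing specified: keep the default behaviour
        }
        String className = PACKAGE_PREFIX + shortName;
        try {
            Class<?> loaded = Class.forName(className);
            // Reject classes of the wrong type before trying to construct them.
            Class<? extends List> typed = loaded.asSubclass(List.class);
            @SuppressWarnings("unchecked")
            List<String> instance = (List<String>) typed.getDeclaredConstructor().newInstance();
            return instance;
        } catch (ClassNotFoundException e) {
            System.err.println(className + " not found");
        } catch (ReflectiveOperationException | ClassCastException e) {
            System.err.println("Could not instantiate " + className + ": " + e);
        }
        return fallback;
    }

    public static void main(String[] args) {
        List<String> fallback = new ArrayList<>();
        // A valid short name yields an instance of the requested class.
        System.out.println(loadOrDefault("LinkedList", fallback).getClass().getName());
        // An unknown name and a class of the wrong type both keep the fallback.
        System.out.println(loadOrDefault("NoSuchList", fallback) == fallback);
        System.out.println(loadOrDefault("HashMap", fallback) == fallback);
    }
}
```

Checking the type with asSubclass before instantiating reports a wrongly typed class as a load failure rather than as a ClassCastException at the cast. Whether the real extractor should do the same is a design choice this sketch does not settle.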
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)