Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2011/01/23 14:24:43 UTC

[jira] Created: (NUTCH-961) Expose Tika's boilerpipe support

Expose Tika's boilerpipe support
--------------------------------

                 Key: NUTCH-961
                 URL: https://issues.apache.org/jira/browse/NUTCH-961
             Project: Nutch
          Issue Type: New Feature
          Components: parser
            Reporter: Markus Jelsma
             Fix For: 1.3, 2.0


Tika 0.8 comes with the Boilerpipe content handler, which can be used to strip boilerplate content from HTML pages. We should see how we can expose Boilerpipe in the Nutch configuration.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support

Posted by "Gabriele Kahlout (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabriele Kahlout updated NUTCH-961:
-----------------------------------

    Attachment: NUTCH-961-1.3-tikaparser1.patch

Modified to also include the necessary changes to parse-plugins.xml.

> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>             Fix For: 2.0
>
>         Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, NUTCH-961-1.3-tikaparser1.patch

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13047501#comment-13047501 ] 

Markus Jelsma commented on NUTCH-961:
-------------------------------------

Ah, that's great! Is this in 0.9 or trunk? We still bind with 0.9. This may be useful because this patch doesn't add anchors to the detected outlinks. The last anchor(s) may contain the complete BP body! =D

> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>             Fix For: 1.4, 2.0
>
>         Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, NUTCH-961v2.patch

[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support

Posted by "Markus Jelsma (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-961:
--------------------------------

    Fix Version/s:     (was: 1.5)
                   1.6

20120304-push-1.6
                
> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.6
>
>         Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.3-3.patch, NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch, NUTCH-961v2.patch

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

Posted by "Gabriele Kahlout (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13047122#comment-13047122 ] 

Gabriele Kahlout commented on NUTCH-961:
----------------------------------------

BTW, have you considered a more general patch to support (rather than expose) all of Tika's options? I'm just thinking that perhaps no special Boilerpipe support per se should be exposed at the Nutch level (for the sake of code maintainability), but only an ability to pass parameters to Tika. So at the Nutch level one sets properties in nutch-site.xml (or even tika-site.xml) and those are forwarded to Tika by the Tika-delegating parser plugin.
There should therefore be no need for any Boilerpipe testing, for example, but rather Tika integration testing.
I'm just thinking out loud (w/o any patch).
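
To make the forwarding idea concrete, here is a minimal sketch in plain Java. It is not taken from any patch on this issue: the class TikaOptionForwarder, the method extract and the "tika." prefix convention are all assumptions for illustration (the thread itself only mentions property names such as tika.use_boilerpipe). The sketch picks out every property with the prefix and hands it on with the prefix removed.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch or Tika.
final class TikaOptionForwarder {
    // Assumed naming convention: everything under "tika." is meant for Tika.
    static final String PREFIX = "tika.";

    /** Returns the entries whose key starts with PREFIX, with the prefix removed. */
    static Map<String, String> extract(Map<String, String> nutchConf) {
        Map<String, String> forwarded = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : nutchConf.entrySet()) {
            if (e.getKey().startsWith(PREFIX)) {
                forwarded.put(e.getKey().substring(PREFIX.length()), e.getValue());
            }
        }
        return forwarded;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("http.agent.name", "test-crawler");
        conf.put("tika.use_boilerpipe", "true");
        conf.put("tika.boilerpipe.extractor", "ArticleExtractor");
        // Only the two tika.* entries are forwarded.
        System.out.println(extract(conf));
    }
}
```

With something like this the parser plugin would not need to know individual option names, which is the maintainability argument made above.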

> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>             Fix For: 1.4, 2.0
>
>         Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, NUTCH-961v2.patch

[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

Posted by "Gabriele Kahlout (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13047130#comment-13047130 ] 

Gabriele Kahlout commented on NUTCH-961:
----------------------------------------

{quote}it needs to use a different ContentHandler in parse-tika itself.{quote}
[Documentation opportunity] why?

My intuition is that the default SAX ContentHandler returns the full page and then Tika handles it, this time with the boilerpipe option.

> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>             Fix For: 1.4, 2.0
>
>         Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, NUTCH-961v2.patch

[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

Posted by "Gabriele Kahlout (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025297#comment-13025297 ] 

Gabriele Kahlout commented on NUTCH-961:
----------------------------------------

Yeah, I was looking for an issue that I think was about replacing parse-html with parse-tika as the default, but I found only NUTCH-869[1]. It has just been mentioned on the mailing list (by Julien) and I thought an issue had been filed for it.

[1] https://issues.apache.org/jira/browse/NUTCH-869

> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>             Fix For: 2.0
>
>         Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.3-tikaparser.patch

[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-961:
--------------------------------

    Fix Version/s: 1.4

> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>             Fix For: 1.4, 2.0
>
>         Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, NUTCH-961v2.patch

[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-961:
--------------------------------

    Attachment: BoilerpipeExtractorRepository.java

Here's the correct file.

> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>             Fix For: 2.0
>
>         Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.3-tikaparser.patch

[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-961:
--------------------------------

    Attachment:     (was: BoilerpipeExtractorRepository.java)

> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>             Fix For: 2.0
>
>         Attachments: NUTCH-961-1.3-tikaparser.patch

[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support

Posted by "Gabriele Kahlout (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabriele Kahlout updated NUTCH-961:
-----------------------------------

    Attachment:     (was: NUTCH-961v2.patch)

> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>             Fix For: 2.0
>
>         Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch

[jira] [Issue Comment Edited] (NUTCH-961) Expose Tika's boilerpipe support

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055530#comment-13055530 ] 

Markus Jelsma edited comment on NUTCH-961 at 6/27/11 1:16 PM:
--------------------------------------------------------------

Patch to include markup from Tika. Anchors are now detected but fewer outlinks are found! Does anyone have a good suggestion on where to fetch our outlinks with the anchors from?

      was (Author: markus17):
    Patch to include markup from Tika. Anchors are now detected but fewer outlinks are found!
  
> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>             Fix For: 1.4, 2.0
>
>         Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.3-3.patch, NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, NUTCH-961v2.patch

        

[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13047490#comment-13047490 ] 

Ken Krugler commented on NUTCH-961:
-----------------------------------

The way that Boilerpipe in Tika works is that it acts as a delegate, processing the SAX events generated by the default content handler that knows how to help clean up broken HTML.

So it's incremental processing (you don't need to get the full page first).

Separate note: Tika's Boilerpipe support now has an option to return HTML markup, so you could run it in this mode to get anchors/anchor text.
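
The delegate arrangement described above can be sketched with the JDK's own SAX classes. This is only an illustration, not Tika's Boilerpipe handler: the class names are made up, and the rule of dropping nav and footer elements is an assumption standing in for Boilerpipe's real classification. What it shows is a handler that sits between the parser and a downstream handler and decides, event by event, what to pass on.

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for a Boilerpipe-style handler; not Tika's class.
final class SkippingDelegateHandler extends DefaultHandler {
    private final ContentHandler delegate;
    private final Set<String> skipped;
    private int skipDepth = 0; // > 0 while inside a skipped element

    SkippingDelegateHandler(ContentHandler delegate, Set<String> skipped) {
        this.delegate = delegate;
        this.skipped = skipped;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        if (skipDepth > 0 || skipped.contains(qName)) {
            skipDepth++; // swallow this element and everything nested in it
            return;
        }
        delegate.startElement(uri, local, qName, atts);
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (skipDepth > 0) {
            skipDepth--;
            return;
        }
        delegate.endElement(uri, local, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (skipDepth == 0) {
            delegate.characters(ch, start, length);
        }
    }
}

// Downstream handler that just accumulates the text it receives.
final class TextCollector extends DefaultHandler {
    final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }
}

final class DelegateDemo {
    static String extract(String xhtml) {
        TextCollector sink = new TextCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xhtml)),
                new SkippingDelegateHandler(sink, Set.of("nav", "footer")));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return sink.text.toString();
    }

    public static void main(String[] args) {
        String page = "<html><body><nav>Home | About</nav>"
            + "<p>Main article text.</p><footer>Copyright</footer></body></html>";
        System.out.println(extract(page)); // prints "Main article text."
    }
}
```

The input here must be well-formed XML because the JDK parser is used; in Tika the upstream HTML parser takes care of cleaning up broken markup before the events reach the handler.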


> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>             Fix For: 1.4, 2.0
>
>         Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, NUTCH-961v2.patch

[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-961:
--------------------------------

    Attachment: NUTCH-961-1.4-dombuilder-1.patch

With BP enabled you can get a java.util.EmptyStackException from DOMBuilder. This is fixed in this patch by adding another check around the peek 'n pop methods.

http://mail-archives.apache.org/mod_mbox/nutch-user/201107.mbox/%3C201107151523.18511.markus.jelsma@openindex.io%3E

There is no answer yet as to why this can occur, but I think checking before pop or peek is good anyway.
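
For readers without the patch at hand, the kind of guard meant here can be sketched as follows. This is not the code from NUTCH-961-1.4-dombuilder-1.patch; GuardedStack and its method names are hypothetical. It only relies on the fact that java.util.Stack throws EmptyStackException from peek() and pop() when the stack is empty.

```java
import java.util.Stack;

// Hypothetical helper showing the check-before-peek/pop idea.
final class GuardedStack {
    /** Returns the top element, or null instead of throwing when the stack is empty. */
    static <T> T peekOrNull(Stack<T> stack) {
        return stack.isEmpty() ? null : stack.peek();
    }

    /** Removes and returns the top element, or returns null when the stack is empty. */
    static <T> T popOrNull(Stack<T> stack) {
        return stack.isEmpty() ? null : stack.pop();
    }

    public static void main(String[] args) {
        Stack<String> elements = new Stack<>();
        elements.push("body");
        System.out.println(popOrNull(elements)); // prints "body"
        System.out.println(popOrNull(elements)); // prints "null", no exception
    }
}
```

Callers then have to cope with a null result, so the guard hides the symptom rather than explaining why an end event arrives with nothing on the stack.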

> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>             Fix For: 1.4, 2.0
>
>         Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.3-3.patch, NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961v2.patch

        

[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

Posted by "Gabriele Kahlout (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025286#comment-13025286 ] 

Gabriele Kahlout commented on NUTCH-961:
----------------------------------------

@Markus - Thank you.

Watch out for [1] in parse-plugins.xml. .html pages may indeed be XHTML. You can safely delete all parse-html mimeType associations, as long as you have [2] (and you want to use parse-tika instead of parse-html).

[1]
<mimeType name="application/xhtml+xml">
  <plugin id="parse-html" />
</mimeType>

[2]
<!-- by default if the mimeType is set to *, or
     if it can't be determined, use parse-tika -->
<mimeType name="*">
  <plugin id="parse-tika" />
</mimeType>
 

> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>             Fix For: 2.0
>
>         Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.3-tikaparser.patch

[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support

Posted by "Gabriele Kahlout (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabriele Kahlout updated NUTCH-961:
-----------------------------------

    Attachment: NUTCH-961v2.patch

Cleaned-up patch.
To reproduce:
{code}
export NUTCH_HOME=`pwd`"/nutch"; svn co -r 1101540 http://svn.apache.org/repos/asf/nutch/branches/branch-1.3 $NUTCH_HOME
cp $MR_HOME/BoilerpipeExtractorRepository.java $NUTCH_HOME/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/BoilerpipeExtractorRepository.java
cd $NUTCH_HOME; patch -p0 -ui $MR_HOME/NUTCH-961v2.patch
ant
{code}

> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>             Fix For: 2.0
>
>         Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, NUTCH-961v2.patch

[jira] Updated: (NUTCH-961) Expose Tika's boilerpipe support

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-961:
--------------------------------

    Fix Version/s:     (was: 1.3)

Tika 0.8 has some issues with PDF parsing; it would be better to use the next release instead. This won't be done as part of the 1.3 release, as this is new functionality and not a bugfix.

> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>             Fix For: 2.0


[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-961:
--------------------------------

    Attachment: NUTCH-961-1.3-3.patch

Patch to include markup from Tika. Anchors are now detected but fewer outlinks are found!

> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>             Fix For: 1.4, 2.0
>
>         Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.3-3.patch, NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, NUTCH-961v2.patch

        

[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

Posted by "Gabriele Kahlout (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13021110#comment-13021110 ] 

Gabriele Kahlout commented on NUTCH-961:
----------------------------------------

@Markus - BoilerpipeExtractorRepository.java == NUTCH-961-1.3-tikaparser.patch, content-wise.

> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>             Fix For: 2.0
>
>         Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.3-tikaparser.patch

[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support

Posted by "Gabriele Kahlout (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabriele Kahlout updated NUTCH-961:
-----------------------------------

    Attachment:     (was: NUTCH-961-1.3-tikaparser1.patch)

> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>             Fix For: 2.0
>
>         Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch

[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support

Posted by "Gabriele Kahlout (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabriele Kahlout updated NUTCH-961:
-----------------------------------

    Attachment: NUTCH-961-1.3-tikaparser1.patch

Same as NUTCH-961-1.3-tikaparser.patch by Markus but adds necessary configuration to nutch-default.xml (!nutch-site.xml!) as discussed on the mailing list or privately some time ago.

> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>             Fix For: 2.0
>
>         Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch

[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-961:
--------------------------------

    Attachment: NUTCH-961-1.3-tikaparser.patch
                BoilerpipeExtractorRepository.java

Here's a WIP for 1.3 adding a repository (or factory) and patching parse-tika. Use the following settings to enable:

tika.use_boilerpipe=true
tika.boilerpipe.extractor=ArticleExtractor|CanolaExtractor

Test with bin/nutch org.apache.nutch.parse.ParserChecker -dumpText <url>

There is an issue with extracting anchors of outlinks from the source text. There may also be issues with the repository of which I'm currently unaware.
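
As a rough picture of what a repository (or factory) driven by the two settings above might look like, here is a self-contained sketch. It is not the attached BoilerpipeExtractorRepository.java: the classes NamedExtractor and ExtractorRepository, the caching of instances, and the fallback to ArticleExtractor when no extractor is named are all assumptions. Only the two property names and the two extractor names come from the settings above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical placeholder for a real extractor object.
final class NamedExtractor {
    final String name;

    NamedExtractor(String name) {
        this.name = name;
    }
}

// Hypothetical repository: maps extractor names to cached instances.
final class ExtractorRepository {
    private static final Map<String, NamedExtractor> CACHE = new ConcurrentHashMap<>();

    /** Returns the cached extractor for the given name, creating it on first use. */
    static NamedExtractor get(String name) {
        return CACHE.computeIfAbsent(name, ExtractorRepository::create);
    }

    private static NamedExtractor create(String name) {
        switch (name) {
            case "ArticleExtractor":
            case "CanolaExtractor":
                return new NamedExtractor(name);
            default:
                throw new IllegalArgumentException("Unknown extractor: " + name);
        }
    }

    /** Returns null when Boilerpipe is disabled, otherwise the configured extractor. */
    static NamedExtractor fromSettings(Map<String, String> conf) {
        if (!Boolean.parseBoolean(conf.getOrDefault("tika.use_boilerpipe", "false"))) {
            return null;
        }
        // Assumed default; the original comment does not state one.
        return get(conf.getOrDefault("tika.boilerpipe.extractor", "ArticleExtractor"));
    }

    public static void main(String[] args) {
        Map<String, String> conf = Map.of(
            "tika.use_boilerpipe", "true",
            "tika.boilerpipe.extractor", "CanolaExtractor");
        System.out.println(fromSettings(conf).name); // prints "CanolaExtractor"
    }
}
```

Failing fast on an unknown name is a design choice of this sketch; silently falling back to a default would hide typos in the configuration.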

> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>             Fix For: 2.0
>
>         Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.3-tikaparser.patch

[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support

Posted by "Markus Jelsma (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-961:
--------------------------------

    Fix Version/s:     (was: 1.4)
                       (was: 2.0)
                   1.5
         Assignee: Markus Jelsma

It works in production but is still a big hack when dealing with outlinks. Marking as 1.5.
                
> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.5
>
>         Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.3-3.patch, NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961v2.patch

        

[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support

Posted by "Markus Jelsma (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-961:
--------------------------------

    Attachment: NUTCH-961-1.5-1.patch

Here's a working patch we use in production. This includes a nasty workaround in TikaParser to collect all outlinks. Without it, only outlinks from the extracted text are collected.

This is a bit nasty and I'd appreciate it if anyone with a bit more experience with Tika can shed some light on this.
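
One way to picture the problem, without claiming this is what the patch does: if links are recorded before the events reach the boilerplate-removing handler, every outlink is seen, not just those in the extracted text. The sketch below uses only JDK SAX classes; LinkRecordingHandler and OutlinkDemo are hypothetical names, and a plain DefaultHandler stands in for whatever downstream handler would do the filtering.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records every link, then forwards the event unchanged.
final class LinkRecordingHandler extends DefaultHandler {
    final List<String> links = new ArrayList<>();
    private final ContentHandler next;

    LinkRecordingHandler(ContentHandler next) {
        this.next = next;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        String href = atts.getValue("href");
        if ("a".equals(qName) && href != null) {
            links.add(href); // seen before any downstream filtering happens
        }
        next.startElement(uri, local, qName, atts);
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        next.endElement(uri, local, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        next.characters(ch, start, length);
    }
}

final class OutlinkDemo {
    static List<String> allLinks(String xhtml) {
        // DefaultHandler is a placeholder for the filtering handler further down the chain.
        LinkRecordingHandler recorder = new LinkRecordingHandler(new DefaultHandler());
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), recorder);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return recorder.links;
    }

    public static void main(String[] args) {
        String page = "<html><body><nav><a href=\"/about\">About</a></nav>"
            + "<p>Text <a href=\"/story\">story</a></p></body></html>";
        System.out.println(allLinks(page)); // prints "[/about, /story]"
    }
}
```

This sketch records only the href values; collecting the anchor text as well would mean buffering characters between the start and end of each a element.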
                
> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.5
>
>         Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.3-3.patch, NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch, NUTCH-961v2.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to extract boilerplate content from HTML pages. We should see how we can expose Boilerpipe in the Nutch configuration.


[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025295#comment-13025295 ] 

Markus Jelsma commented on NUTCH-961:
-------------------------------------

Not safely; there are still issues regarding HTML parsing with Tika, even without this nasty boilerpipe hack.

> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>             Fix For: 2.0
>
>         Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.3-tikaparser.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to extract boilerplate content from HTML pages. We should see how we can expose Boilerpipe in the Nutch configuration.


[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13047127#comment-13047127 ] 

Markus Jelsma commented on NUTCH-961:
-------------------------------------

This is not a general patch and won't be. It can, however, be a dependency for a broader Tika patch, but I haven't seen other tickets yet.

This patch cannot work by just passing parameters to Tika as it needs to use a different ContentHandler in parse-tika itself.

> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>             Fix For: 1.4, 2.0
>
>         Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, NUTCH-961v2.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to extract boilerplate content from HTML pages. We should see how we can expose Boilerpipe in the Nutch configuration.


[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

Posted by "Markus Jelsma (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176194#comment-13176194 ] 

Markus Jelsma commented on NUTCH-961:
-------------------------------------

Fixed already. See NUTCH-1233 for a patch!
                
> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.5
>
>         Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.3-3.patch, NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch, NUTCH-961v2.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to extract boilerplate content from HTML pages. We should see how we can expose Boilerpipe in the Nutch configuration.


[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support

Posted by "Gabriele Kahlout (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabriele Kahlout updated NUTCH-961:
-----------------------------------

    Attachment: NUTCH-961v2.patch

Tested the patch against a checkout of the 1.3 branch at revision 1101540, and made some trivial changes to the TikaParser code.
More interestingly, I've also removed the following from parse-plugins.xml:

-        <mimeType name="application/xhtml+xml">
-		<plugin id="parse-html" />
-	</mimeType>
-

> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>             Fix For: 2.0
>
>         Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, NUTCH-961v2.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to extract boilerplate content from HTML pages. We should see how we can expose Boilerpipe in the Nutch configuration.


[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-961:
--------------------------------

    Patch Info: [Patch Available]

> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>             Fix For: 1.4, 2.0
>
>         Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.3-3.patch, NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, NUTCH-961v2.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to extract boilerplate content from HTML pages. We should see how we can expose Boilerpipe in the Nutch configuration.


[jira] Commented: (NUTCH-961) Expose Tika's boilerpipe support

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987575#action_12987575 ] 

Markus Jelsma commented on NUTCH-961:
-------------------------------------

Boilerpipe comes with several algorithms for stripping away the boilerplate content. Although the ArticleExtractor is recommended, it certainly fails for many types of pages. Pages such as news overviews with blocks and lists are extracted much better with the CanolaExtractor. This poses a problem: we cannot have just one single configuration directive telling the parser which extractor to use for a whole crawl.

Some thoughts on how to deal with it:
- use Boilerpipe's estimator to automatically determine which extractor to use
- have a facility to override false positives returned by the estimator and hardcode which extractor to use for URL groups (not unlike the subcollection plugin)
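
The override idea in the second point could be sketched roughly as follows. This is only an illustration under stated assumptions: the class ExtractorSelector, its method names, and the example URL pattern are all hypothetical, and the default extractor name stands in for whatever the estimator would pick. It uses plain JDK classes and does not call Boilerpipe itself.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical helper: picks an extractor name per URL group, with a fallback. */
class ExtractorSelector {
    // Overrides are checked in registration order; the first match wins.
    private final List<Map.Entry<Pattern, String>> overrides = new ArrayList<>();
    private final String defaultExtractor;

    ExtractorSelector(String defaultExtractor) {
        this.defaultExtractor = defaultExtractor;
    }

    /** Registers an override: URLs matching the regex use the given extractor. */
    ExtractorSelector override(String urlRegex, String extractorName) {
        overrides.add(new SimpleImmutableEntry<>(Pattern.compile(urlRegex), extractorName));
        return this;
    }

    /** Returns the first matching override, or the default extractor name. */
    String select(String url) {
        for (Map.Entry<Pattern, String> entry : overrides) {
            if (entry.getKey().matcher(url).find()) {
                return entry.getValue();
            }
        }
        return defaultExtractor;
    }
}
```

For example, a selector built with new ExtractorSelector("ArticleExtractor").override("^https?://news\\.example\\.org/overview/", "CanolaExtractor") would route overview pages on that (made-up) host to the CanolaExtractor and leave every other URL on the default. The rules themselves could live in a configuration file, much like the subcollection plugin's definitions.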


> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>             Fix For: 2.0
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to extract boilerplate content from HTML pages. We should see how we can expose Boilerpipe in the Nutch configuration.
