Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2011/06/07 15:39:58 UTC

[jira] [Created] (NUTCH-1005) Index headings h1 and h2

Index headings h1 and h2
------------------------

                 Key: NUTCH-1005
                 URL: https://issues.apache.org/jira/browse/NUTCH-1005
             Project: Nutch
          Issue Type: New Feature
          Components: indexer, parser
            Reporter: Markus Jelsma


Very simple plugin i needed for quickly extracting h1 and h2 values. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (NUTCH-1005) Index headings plugin

Posted by "Markus Jelsma (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1005:
---------------------------------

    Attachment: NUTCH-1005-1.5-5.patch

New patch without indexing capabilities. Use NUTCH-1264 for indexing instead.
                
> Index headings plugin
> ---------------------
>
>                 Key: NUTCH-1005
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1005
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, parser
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: HeadingsIndexingFilter.java, HeadingsParseFilter.java, NUTCH-1005-1.4-2.patch, NUTCH-1005-1.4-3.patch, NUTCH-1005-1.5-4.patch, NUTCH-1005-1.5-5.patch
>
>
> Very simple plugin for extracting and indexing a comma separated list of headings via the headings configuration directive.

[jira] [Updated] (NUTCH-1005) Index headings h1 and h2

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1005:
---------------------------------

         Priority: Minor  (was: Major)
    Fix Version/s: 1.4
         Assignee: Markus Jelsma

> Index headings h1 and h2
> ------------------------
>
>                 Key: NUTCH-1005
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1005
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, parser
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: HeadingsIndexingFilter.java, HeadingsParseFilter.java
>
>
> Very simple plugin i needed for quickly extracting h1 and h2 values. 

[jira] [Commented] (NUTCH-1005) Index headings plugin

Posted by "Lewis John McGibbney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125436#comment-13125436 ] 

Lewis John McGibbney commented on NUTCH-1005:
---------------------------------------------

Hi Markus & Julien, I really like the idea of merging these plugins as per your thoughts and comments. As I've previously said, you guys did a great job refining the plugins for the last 1.3 release, so I think it makes sense to stick with the 'less-is-more' approach to reduce the likelihood of the plugin directory becoming overcrowded again! This looks like the issue with the most work involved prior to getting a 1.4 RC, so what do you think is a realistic way of taking it forward?
                
> Index headings plugin
> ---------------------
>
>                 Key: NUTCH-1005
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1005
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, parser
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: HeadingsIndexingFilter.java, HeadingsParseFilter.java, NUTCH-1005-1.4-2.patch, NUTCH-1005-1.4-3.patch
>
>
> Very simple plugin for extracting and indexing a comma separated list of headings via the headings configuration directive.

[jira] [Commented] (NUTCH-1005) Index headings plugin

Posted by "Markus Jelsma (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116743#comment-13116743 ] 

Markus Jelsma commented on NUTCH-1005:
--------------------------------------

I agree with Julien, as it's the most flexible solution, although it may be a bit more work to set up for simple extraction. But since such a feature is missing right now and heading text is very important in relevance ranking, I feel we should add this in 1.4 anyway.

If not, we should mark this for no version and set it as won't fix.
                
> Index headings plugin
> ---------------------
>
>                 Key: NUTCH-1005
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1005
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, parser
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: HeadingsIndexingFilter.java, HeadingsParseFilter.java, NUTCH-1005-1.4-2.patch, NUTCH-1005-1.4-3.patch
>
>
> Very simple plugin for extracting and indexing a comma separated list of headings via the headings configuration directive.

[jira] [Updated] (NUTCH-1005) Index headings plugin

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1005:
---------------------------------

    Attachment: NUTCH-1005-1.4-3.patch

New patch that trims values at parse time. It prevents bad output in parsechecker for many websites.
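
To illustrate the kind of clean-up this refers to, here is a minimal sketch. It is not code from the patch: the class and method names are hypothetical, and collapsing inner whitespace goes a step beyond the plain trim mentioned above.

```java
class HeadingTextSketch {

  // Hypothetical helper: collapses every run of whitespace (spaces, tabs,
  // newlines) into a single space and strips leading/trailing whitespace.
  static String normalize(String raw) {
    if (raw == null) {
      return "";
    }
    return raw.replaceAll("\\s+", " ").trim();
  }

  public static void main(String[] args) {
    // Heading text as it might come out of indented HTML markup.
    System.out.println("[" + normalize("\n   Breaking\n\t News  ") + "]");
  }
}
```

Doing this once at parse time means every later consumer (parsechecker, indexing filters) sees the same cleaned value.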

> Index headings plugin
> ---------------------
>
>                 Key: NUTCH-1005
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1005
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, parser
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: HeadingsIndexingFilter.java, HeadingsParseFilter.java, NUTCH-1005-1.4-2.patch, NUTCH-1005-1.4-3.patch
>
>
> Very simple plugin for extracting and indexing a comma separated list of headings via the headings configuration directive.

[jira] [Updated] (NUTCH-1005) Index headings plugin

Posted by "Chris A. Mattmann (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated NUTCH-1005:
-------------------------------------

    Fix Version/s:     (was: 1.4)
                   1.5

- push 
                
> Index headings plugin
> ---------------------
>
>                 Key: NUTCH-1005
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1005
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, parser
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: HeadingsIndexingFilter.java, HeadingsParseFilter.java, NUTCH-1005-1.4-2.patch, NUTCH-1005-1.4-3.patch
>
>
> Very simple plugin for extracting and indexing a comma separated list of headings via the headings configuration directive.

[jira] [Commented] (NUTCH-1005) Index headings plugin

Posted by "Markus Jelsma (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197894#comment-13197894 ] 

Markus Jelsma commented on NUTCH-1005:
--------------------------------------

index-meta comes to mind! It's exactly what it does right?

I'll try the patch with the headings indexing filter disabled and, if the results are good, provide a new patch without the indexing filter extension.
                
> Index headings plugin
> ---------------------
>
>                 Key: NUTCH-1005
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1005
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, parser
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: HeadingsIndexingFilter.java, HeadingsParseFilter.java, NUTCH-1005-1.4-2.patch, NUTCH-1005-1.4-3.patch, NUTCH-1005-1.5-4.patch
>
>
> Very simple plugin for extracting and indexing a comma separated list of headings via the headings configuration directive.

[jira] [Commented] (NUTCH-1005) Parse headings plugin

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13202403#comment-13202403 ] 

Hudson commented on NUTCH-1005:
-------------------------------

Integrated in nutch-trunk-maven #139 (See [https://builds.apache.org/job/nutch-trunk-maven/139/])
    NUTCH-1005 Parse headings plugin

markus : http://svn.apache.org/viewvc/nutch/trunk/viewvc/?view=rev&root=&revision=1241460
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/plugin/build.xml
* /nutch/trunk/src/plugin/headings
* /nutch/trunk/src/plugin/headings/build.xml
* /nutch/trunk/src/plugin/headings/ivy.xml
* /nutch/trunk/src/plugin/headings/plugin.xml
* /nutch/trunk/src/plugin/headings/src
* /nutch/trunk/src/plugin/headings/src/java
* /nutch/trunk/src/plugin/headings/src/java/org
* /nutch/trunk/src/plugin/headings/src/java/org/apache
* /nutch/trunk/src/plugin/headings/src/java/org/apache/nutch
* /nutch/trunk/src/plugin/headings/src/java/org/apache/nutch/parse
* /nutch/trunk/src/plugin/headings/src/java/org/apache/nutch/parse/headings
* /nutch/trunk/src/plugin/headings/src/java/org/apache/nutch/parse/headings/HeadingsParseFilter.java

                
> Parse headings plugin
> ---------------------
>
>                 Key: NUTCH-1005
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1005
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, parser
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: HeadingsIndexingFilter.java, HeadingsParseFilter.java, NUTCH-1005-1.4-2.patch, NUTCH-1005-1.4-3.patch, NUTCH-1005-1.5-4.patch, NUTCH-1005-1.5-5.patch
>
>
> Very simple plugin for extracting and indexing a comma separated list of headings via the headings configuration directive.

[jira] [Commented] (NUTCH-1005) Index headings plugin

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13105456#comment-13105456 ] 

Julien Nioche commented on NUTCH-1005:
--------------------------------------

You are right. I'd read your comments too quickly.

'db.parsemeta.to.crawldb' could be used to copy the values extracted by your parser into the crawldb and from there reuse URLMetaIndexingFilter which will index any metadata stored in the crawldb and listed in urlmeta.tags. 

This means using the crawldb as a temporary storage, which probably does not make too much sense.

What we should probably do is rename url-meta to something more meaningful and make it more generic. We should have an indexer able to index anything stored as crawldb, fetch or parse metadata via configuration. Then people would only have to define custom parsers; the indexing part should be doable in a generic way.
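
As a rough illustration of the generic, configuration-driven indexer described here, the following sketch uses plain Java maps as stand-ins. It does not use Nutch's IndexingFilter API and is not the implementation later proposed in NUTCH-1264; the class name, the method name and the comma-separated key format are all assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class GenericMetaIndexerSketch {

  // Hypothetical: copies every key named in `configuredKeys` (a
  // comma-separated list) from one metadata source into the document
  // fields. Keys that the source does not contain are skipped.
  static void copyConfigured(String configuredKeys,
                             Map<String, String> source,
                             Map<String, String> docFields) {
    for (String raw : configuredKeys.split(",")) {
      String key = raw.trim();
      if (key.isEmpty()) {
        continue;
      }
      String value = source.get(key);
      if (value != null) {
        docFields.put(key, value);
      }
    }
  }

  public static void main(String[] args) {
    // Stand-ins for parse metadata and crawldb metadata.
    Map<String, String> parseMeta = Map.of("h1", "Title", "h2", "Subtitle");
    Map<String, String> crawlDbMeta = Map.of("section", "news");

    Map<String, String> doc = new LinkedHashMap<>();
    // One configured key list per metadata source.
    copyConfigured("h1, h2", parseMeta, doc);
    copyConfigured("section", crawlDbMeta, doc);
    System.out.println(doc);
  }
}
```

With something along these lines, a parser plugin only needs to put values into the metadata; which of them end up in the index is decided by configuration.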

I seem to remember that I filed a patch for parsing / indexing description and keywords from HTML docs, which is quite close to what you are offering. Why not have it all in one parser, or at least in one plugin?



> Index headings plugin
> ---------------------
>
>                 Key: NUTCH-1005
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1005
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, parser
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: HeadingsIndexingFilter.java, HeadingsParseFilter.java, NUTCH-1005-1.4-2.patch, NUTCH-1005-1.4-3.patch
>
>
> Very simple plugin for extracting and indexing a comma separated list of headings via the headings configuration directive.

[jira] [Commented] (NUTCH-1005) Index headings plugin

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13104431#comment-13104431 ] 

Markus Jelsma commented on NUTCH-1005:
--------------------------------------

I'm not sure how. This plugin extracts hN tag values from the document fragment in the parse filter; I don't see URLmeta extracting values from the DOM. Can you elaborate?
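
For readers unfamiliar with what extracting hN values from the DOM involves, here is a minimal sketch using only the JDK's org.w3c.dom classes. It is not the attached HeadingsParseFilter.java: the class and method names are hypothetical, and the input is parsed as well-formed XHTML for simplicity instead of arriving as a document fragment from an HTML parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class HeadingWalkSketch {

  // Returns the trimmed text of every element named `tag` (e.g. "h1"),
  // in document order.
  static List<String> headings(Node root, String tag) {
    List<String> found = new ArrayList<>();
    collect(root, tag, found);
    return found;
  }

  // Depth-first walk over the node and all of its descendants.
  private static void collect(Node node, String tag, List<String> found) {
    if (node.getNodeType() == Node.ELEMENT_NODE
        && node.getNodeName().equalsIgnoreCase(tag)) {
      String text = node.getTextContent().trim();
      if (!text.isEmpty()) {
        found.add(text);
      }
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collect(children.item(i), tag, found);
    }
  }

  // Test helper (an assumption of this sketch): parses well-formed XHTML.
  static Document parse(String xhtml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    } catch (Exception e) {
      throw new IllegalArgumentException("input is not well-formed", e);
    }
  }

  public static void main(String[] args) {
    Document doc = parse(
        "<html><body><h1> Title </h1><p>text</p><h2>Sub <b>one</b></h2></body></html>");
    System.out.println(headings(doc, "h1"));
    System.out.println(headings(doc, "h2"));
  }
}
```

The point of the comment above is that this step needs access to the parsed document tree, which is available in a parse filter but not in an indexing filter that only sees stored metadata.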

> Index headings plugin
> ---------------------
>
>                 Key: NUTCH-1005
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1005
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, parser
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: HeadingsIndexingFilter.java, HeadingsParseFilter.java, NUTCH-1005-1.4-2.patch, NUTCH-1005-1.4-3.patch
>
>
> Very simple plugin for extracting and indexing a comma separated list of headings via the headings configuration directive.

[jira] [Commented] (NUTCH-1005) Index headings plugin

Posted by "Julien Nioche (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197853#comment-13197853 ] 

Julien Nioche commented on NUTCH-1005:
--------------------------------------

Markus, the parser should store the MD in getData().getParseData() and not getContentMeta() which is used by the fetcher. 
See [https://issues.apache.org/jira/browse/NUTCH-1264] for a generic indexing filter as discussed above
                
> Index headings plugin
> ---------------------
>
>                 Key: NUTCH-1005
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1005
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, parser
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: HeadingsIndexingFilter.java, HeadingsParseFilter.java, NUTCH-1005-1.4-2.patch, NUTCH-1005-1.4-3.patch
>
>
> Very simple plugin for extracting and indexing a comma separated list of headings via the headings configuration directive.

[jira] [Commented] (NUTCH-1005) Index headings plugin

Posted by "Julien Nioche (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197898#comment-13197898 ] 

Julien Nioche commented on NUTCH-1005:
--------------------------------------

bq. index-meta comes to mind! It's exactly what it does right?

Not only. It also overrides index-static, which is not based on metadata. I suppose we could leave index-static as-is and restrict index-meta to, well, metadata (wherever it comes from).
                
> Index headings plugin
> ---------------------
>
>                 Key: NUTCH-1005
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1005
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, parser
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: HeadingsIndexingFilter.java, HeadingsParseFilter.java, NUTCH-1005-1.4-2.patch, NUTCH-1005-1.4-3.patch, NUTCH-1005-1.5-4.patch
>
>
> Very simple plugin for extracting and indexing a comma separated list of headings via the headings configuration directive.

[jira] [Commented] (NUTCH-1005) Index headings plugin

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13109406#comment-13109406 ] 

Markus Jelsma commented on NUTCH-1005:
--------------------------------------

Comments?

> Index headings plugin
> ---------------------
>
>                 Key: NUTCH-1005
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1005
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, parser
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: HeadingsIndexingFilter.java, HeadingsParseFilter.java, NUTCH-1005-1.4-2.patch, NUTCH-1005-1.4-3.patch
>
>
> Very simple plugin for extracting and indexing a comma separated list of headings via the headings configuration directive.

[jira] [Resolved] (NUTCH-1005) Parse headings plugin

Posted by "Markus Jelsma (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma resolved NUTCH-1005.
----------------------------------

    Resolution: Fixed

Committed for 1.5 in rev. 1241460.
                
> Parse headings plugin
> ---------------------
>
>                 Key: NUTCH-1005
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1005
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, parser
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: HeadingsIndexingFilter.java, HeadingsParseFilter.java, NUTCH-1005-1.4-2.patch, NUTCH-1005-1.4-3.patch, NUTCH-1005-1.5-4.patch, NUTCH-1005-1.5-5.patch
>
>
> Very simple plugin for extracting and indexing a comma separated list of headings via the headings configuration directive.

[jira] [Updated] (NUTCH-1005) Index headings h1 and h2

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1005:
---------------------------------

    Attachment: NUTCH-1005-1.4-2.patch

Here's a complete patch. I added a new setting with which users can specify a comma separated list of headings to extract and index. 

Headings are trimmed at index time. Perhaps this should be part of the parser instead. 

Please comment.
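
A minimal sketch of how such a comma-separated setting might be read is shown below. This is an illustration only, not code from the patch: the class and method names are hypothetical, and lowercasing and de-duplicating the tag names are assumptions of the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class HeadingsConfigSketch {

  // Hypothetical: turns a value such as "h1, H2" into [h1, h2].
  // Empty entries are dropped and duplicates are kept only once.
  static List<String> parseHeadings(String value) {
    List<String> tags = new ArrayList<>();
    if (value == null) {
      return tags;
    }
    for (String part : value.split(",")) {
      String tag = part.trim().toLowerCase(Locale.ROOT);
      if (!tag.isEmpty() && !tags.contains(tag)) {
        tags.add(tag);
      }
    }
    return tags;
  }

  public static void main(String[] args) {
    System.out.println(parseHeadings("h1, H2,,h1 "));
  }
}
```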


> Index headings h1 and h2
> ------------------------
>
>                 Key: NUTCH-1005
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1005
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, parser
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: HeadingsIndexingFilter.java, HeadingsParseFilter.java, NUTCH-1005-1.4-2.patch
>
>
> Very simple plugin i needed for quickly extracting h1 and h2 values. 

[jira] [Commented] (NUTCH-1005) Index headings plugin

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13109435#comment-13109435 ] 

Julien Nioche commented on NUTCH-1005:
--------------------------------------

Let's try and come up with a single plugin for index-extra, urlmeta, NUTCH-809 and this one. Many of these things are related and could be done in a generic way.

> Index headings plugin
> ---------------------
>
>                 Key: NUTCH-1005
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1005
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, parser
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: HeadingsIndexingFilter.java, HeadingsParseFilter.java, NUTCH-1005-1.4-2.patch, NUTCH-1005-1.4-3.patch
>
>
> Very simple plugin for extracting and indexing a comma separated list of headings via the headings configuration directive.

[jira] [Commented] (NUTCH-1005) Index headings plugin

Posted by "Julien Nioche (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197888#comment-13197888 ] 

Julien Nioche commented on NUTCH-1005:
--------------------------------------

bq. I assume i have to disable the indexing filter of this plugin but keep the parse filter since your patch does not do any parsing right?

Absolutely. Your plugin can focus on the parsing; the indexing will be done by index-extra.
I will refactor existing plugins such as urlmeta or parse-metatags so that they use index-extra once it has been committed.

Thanks

Julien
                
> Index headings plugin
> ---------------------
>
>                 Key: NUTCH-1005
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1005
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, parser
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: HeadingsIndexingFilter.java, HeadingsParseFilter.java, NUTCH-1005-1.4-2.patch, NUTCH-1005-1.4-3.patch, NUTCH-1005-1.5-4.patch
>
>
> Very simple plugin for extracting and indexing a comma separated list of headings via the headings configuration directive.

[jira] [Commented] (NUTCH-1005) Index headings plugin

Posted by "Markus Jelsma (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197870#comment-13197870 ] 

Markus Jelsma commented on NUTCH-1005:
--------------------------------------

Hi!

Don't you mean:
{code}
parse.getData().getParseMeta().set(headings[i], heading.trim());
{code}

That still works well with the indexfilter when testing via indexchecker.
                
> Index headings plugin
> ---------------------
>
>                 Key: NUTCH-1005
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1005
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, parser
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: HeadingsIndexingFilter.java, HeadingsParseFilter.java, NUTCH-1005-1.4-2.patch, NUTCH-1005-1.4-3.patch
>
>
> Very simple plugin for extracting and indexing a comma separated list of headings via the headings configuration directive.

[jira] [Commented] (NUTCH-1005) Index headings plugin

Posted by "Julien Nioche (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197890#comment-13197890 ] 

Julien Nioche commented on NUTCH-1005:
--------------------------------------

BTW if you can think of a better name for index-extra.... 
                
> Index headings plugin
> ---------------------
>
>                 Key: NUTCH-1005
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1005
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, parser
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: HeadingsIndexingFilter.java, HeadingsParseFilter.java, NUTCH-1005-1.4-2.patch, NUTCH-1005-1.4-3.patch, NUTCH-1005-1.5-4.patch
>
>
> Very simple plugin for extracting and indexing a comma separated list of headings via the headings configuration directive.

[jira] [Updated] (NUTCH-1005) Index headings plugin

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1005:
---------------------------------

    Description: Very simple plugin for extracting and indexing a comma separated list of headings via the headings configuration directive.  (was: Very simple plugin i needed for quickly extracting h1 and h2 values. )
        Summary: Index headings plugin  (was: Index headings h1 and h2)

> Index headings plugin
> ---------------------
>
>                 Key: NUTCH-1005
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1005
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, parser
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: HeadingsIndexingFilter.java, HeadingsParseFilter.java, NUTCH-1005-1.4-2.patch
>
>
> Very simple plugin for extracting and indexing a comma separated list of headings via the headings configuration directive.

[jira] [Commented] (NUTCH-1005) Index headings plugin

Posted by "Julien Nioche (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197876#comment-13197876 ] 

Julien Nioche commented on NUTCH-1005:
--------------------------------------

Yep, you've corrected the typo yourself.

bq. That still works well with the indexfilter when testing via indexchecker.

That's because of the way you generate the field in your indexer, i.e. parse.getData().getMeta(heading), which gets the value from either the parse or the content metadata. I was not saying that your code did not work, just that it would be conceptually more correct to put it in the parse metadata, because it is obtained during the parse. My other point was that it would be better to use the generic indexer from NUTCH-1264. Could you please give it a try?
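
The lookup behaviour described above can be sketched with two plain maps. This is a stand-in, not Nutch's ParseData class: the class name is hypothetical, and the order in which the two maps are consulted is an assumption of the sketch.

```java
import java.util.HashMap;
import java.util.Map;

class MetaLookupSketch {

  final Map<String, String> parseMeta = new HashMap<>();
  final Map<String, String> contentMeta = new HashMap<>();

  // Looks in the parse metadata first and falls back to the content
  // metadata, so a value stored in either map is found.
  String getMeta(String name) {
    String value = parseMeta.get(name);
    return value != null ? value : contentMeta.get(name);
  }

  public static void main(String[] args) {
    MetaLookupSketch data = new MetaLookupSketch();
    // A heading stored in the content metadata is still found ...
    data.contentMeta.put("h1", "Stored by mistake in content meta");
    System.out.println(data.getMeta("h1"));
    // ... which is why the indexing filter kept working either way.
    data.parseMeta.put("h1", "Stored in parse meta");
    System.out.println(data.getMeta("h1"));
  }
}
```

A combined lookup like this hides where a value was stored, which is why the choice of map did not show up as a difference in indexchecker.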
                
> Index headings plugin
> ---------------------
>
>                 Key: NUTCH-1005
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1005
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, parser
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: HeadingsIndexingFilter.java, HeadingsParseFilter.java, NUTCH-1005-1.4-2.patch, NUTCH-1005-1.4-3.patch, NUTCH-1005-1.5-4.patch
>
>
> Very simple plugin for extracting and indexing a comma separated list of headings via the headings configuration directive.


        

[jira] [Commented] (NUTCH-1005) Index headings plugin

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13104427#comment-13104427 ] 

Julien Nioche commented on NUTCH-1005:
--------------------------------------

Can't you do that with urlmeta already?

> Index headings plugin
> ---------------------
>
>                 Key: NUTCH-1005
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1005
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, parser
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: HeadingsIndexingFilter.java, HeadingsParseFilter.java, NUTCH-1005-1.4-2.patch, NUTCH-1005-1.4-3.patch
>
>
> Very simple plugin for extracting and indexing a comma separated list of headings via the headings configuration directive.


        

[jira] [Commented] (NUTCH-1005) Parse headings plugin

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13203268#comment-13203268 ] 

Hudson commented on NUTCH-1005:
-------------------------------

Integrated in Nutch-trunk #1752 (See [https://builds.apache.org/job/Nutch-trunk/1752/])
    NUTCH-1005 Parse headings plugin

markus : http://svn.apache.org/viewvc/nutch/trunk/viewvc/?view=rev&root=.&revision=1241460
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/plugin/build.xml
* /nutch/trunk/src/plugin/headings
* /nutch/trunk/src/plugin/headings/build.xml
* /nutch/trunk/src/plugin/headings/ivy.xml
* /nutch/trunk/src/plugin/headings/plugin.xml
* /nutch/trunk/src/plugin/headings/src
* /nutch/trunk/src/plugin/headings/src/java
* /nutch/trunk/src/plugin/headings/src/java/org
* /nutch/trunk/src/plugin/headings/src/java/org/apache
* /nutch/trunk/src/plugin/headings/src/java/org/apache/nutch
* /nutch/trunk/src/plugin/headings/src/java/org/apache/nutch/parse
* /nutch/trunk/src/plugin/headings/src/java/org/apache/nutch/parse/headings
* /nutch/trunk/src/plugin/headings/src/java/org/apache/nutch/parse/headings/HeadingsParseFilter.java

                
> Parse headings plugin
> ---------------------
>
>                 Key: NUTCH-1005
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1005
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, parser
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: HeadingsIndexingFilter.java, HeadingsParseFilter.java, NUTCH-1005-1.4-2.patch, NUTCH-1005-1.4-3.patch, NUTCH-1005-1.5-4.patch, NUTCH-1005-1.5-5.patch
>
>
> Very simple plugin for extracting and indexing a comma separated list of headings via the headings configuration directive.


        

[jira] [Commented] (NUTCH-1005) Index headings plugin

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13105994#comment-13105994 ] 

Markus Jelsma commented on NUTCH-1005:
--------------------------------------

{quote}
You are right. I'd read your comments too quickly.

'db.parsemeta.to.crawldb' could be used to copy the values extracted by your parser into the crawldb, and from there you could reuse URLMetaIndexingFilter, which will index any metadata stored in the crawldb and listed in urlmeta.tags.

This means using the crawldb as a temporary storage, which probably does not make too much sense.
{quote}

Indeed. The less data there is in the CrawlDB, the better.

{quote}
What we should probably do is rename url-meta to something more meaningful and make it more generic. We should have an indexer able to index anything stored as crawldb, fetch or parse metadata via configuration. Then people would only have to define custom parsers; the indexing part should be doable in a generic way.

I seem to remember that I had filed a patch for parsing / indexing description and keywords from HTML docs, which is quite close to what you are offering. Why not have it all in one parser, or at least in one plugin?
{quote}

I believe this is what you're looking for:
https://issues.apache.org/jira/browse/NUTCH-809

I agree it would be nice to have such a mechanism, but does that mean that, in your opinion, this plugin should not be included?

> Index headings plugin
> ---------------------
>
>                 Key: NUTCH-1005
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1005
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, parser
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: HeadingsIndexingFilter.java, HeadingsParseFilter.java, NUTCH-1005-1.4-2.patch, NUTCH-1005-1.4-3.patch
>
>
> Very simple plugin for extracting and indexing a comma separated list of headings via the headings configuration directive.


        

[jira] [Commented] (NUTCH-1005) Index headings plugin

Posted by "Markus Jelsma (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197886#comment-13197886 ] 

Markus Jelsma commented on NUTCH-1005:
--------------------------------------

Yes, I'll give it a shot this week. Your patch can index fields from content, parse and db metadata, which replaces the indexing filter of this headings plugin. I assume I have to disable the indexing filter of this plugin but keep the parse filter, since your patch does not do any parsing, right?
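
The division of labour discussed here (a parse filter that only fills metadata, plus a generic indexer that copies configured metadata keys into the index document) might be sketched as follows. This is not the NUTCH-1264 patch. The class GenericMetaIndexerSketch and the configuration keys sketch.index.db, sketch.index.content and sketch.index.parse are hypothetical names chosen for illustration, and plain maps stand in for the Nutch metadata and document types.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a configuration-driven indexing filter.
class GenericMetaIndexerSketch {

  /** Splits a comma separated list of metadata keys, skipping blanks. */
  static List<String> fields(String directive) {
    List<String> names = new ArrayList<>();
    for (String name : directive.split(",")) {
      String trimmed = name.trim();
      if (!trimmed.isEmpty()) {
        names.add(trimmed);
      }
    }
    return names;
  }

  /** Copies each configured key that is present in the source into the document. */
  static void copy(List<String> names, Map<String, String> source, Map<String, String> doc) {
    for (String name : names) {
      String value = source.get(name);
      if (value != null) {
        doc.put(name, value);
      }
    }
  }

  /** Builds the index document from db, content and parse metadata, in that order. */
  static Map<String, String> index(Map<String, String> config,
      Map<String, String> dbMeta,
      Map<String, String> contentMeta,
      Map<String, String> parseMeta) {
    Map<String, String> doc = new LinkedHashMap<>();
    copy(fields(config.getOrDefault("sketch.index.db", "")), dbMeta, doc);
    copy(fields(config.getOrDefault("sketch.index.content", "")), contentMeta, doc);
    copy(fields(config.getOrDefault("sketch.index.parse", "")), parseMeta, doc);
    return doc;
  }

  public static void main(String[] args) {
    Map<String, String> config = Map.of("sketch.index.parse", "h1,h2");
    Map<String, String> parseMeta = Map.of("h1", "Title", "h2", "Section");
    System.out.println(index(config, Map.of(), Map.of(), parseMeta));
  }
}
```

In such a design the headings plugin would need no indexing filter of its own: listing h1 and h2 in the parse metadata setting would be enough.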
                
> Index headings plugin
> ---------------------
>
>                 Key: NUTCH-1005
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1005
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, parser
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: HeadingsIndexingFilter.java, HeadingsParseFilter.java, NUTCH-1005-1.4-2.patch, NUTCH-1005-1.4-3.patch, NUTCH-1005-1.5-4.patch
>
>
> Very simple plugin for extracting and indexing a comma separated list of headings via the headings configuration directive.


        

[jira] [Updated] (NUTCH-1005) Index headings plugin

Posted by "Markus Jelsma (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1005:
---------------------------------

    Attachment: NUTCH-1005-1.5-4.patch

New patch as per Julien's comments.
                
> Index headings plugin
> ---------------------
>
>                 Key: NUTCH-1005
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1005
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, parser
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: HeadingsIndexingFilter.java, HeadingsParseFilter.java, NUTCH-1005-1.4-2.patch, NUTCH-1005-1.4-3.patch, NUTCH-1005-1.5-4.patch
>
>
> Very simple plugin for extracting and indexing a comma separated list of headings via the headings configuration directive.


        

[jira] [Updated] (NUTCH-1005) Index headings h1 and h2

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1005:
---------------------------------

    Attachment: HeadingsParseFilter.java
                HeadingsIndexingFilter.java

> Index headings h1 and h2
> ------------------------
>
>                 Key: NUTCH-1005
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1005
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, parser
>            Reporter: Markus Jelsma
>         Attachments: HeadingsIndexingFilter.java, HeadingsParseFilter.java
>
>
> Very simple plugin i needed for quickly extracting h1 and h2 values. 


[jira] [Updated] (NUTCH-1005) Parse headings plugin

Posted by "Markus Jelsma (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1005:
---------------------------------

    Summary: Parse headings plugin  (was: Index headings plugin)
    
> Parse headings plugin
> ---------------------
>
>                 Key: NUTCH-1005
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1005
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, parser
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: HeadingsIndexingFilter.java, HeadingsParseFilter.java, NUTCH-1005-1.4-2.patch, NUTCH-1005-1.4-3.patch, NUTCH-1005-1.5-4.patch, NUTCH-1005-1.5-5.patch
>
>
> Very simple plugin for extracting and indexing a comma separated list of headings via the headings configuration directive.


        

[jira] [Commented] (NUTCH-1005) Index headings plugin

Posted by "Markus Jelsma (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13202219#comment-13202219 ] 

Markus Jelsma commented on NUTCH-1005:
--------------------------------------

I'll commit this one shortly if there are no objections.
Thanks.
                
> Index headings plugin
> ---------------------
>
>                 Key: NUTCH-1005
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1005
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, parser
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: HeadingsIndexingFilter.java, HeadingsParseFilter.java, NUTCH-1005-1.4-2.patch, NUTCH-1005-1.4-3.patch, NUTCH-1005-1.5-4.patch, NUTCH-1005-1.5-5.patch
>
>
> Very simple plugin for extracting and indexing a comma separated list of headings via the headings configuration directive.
