Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2012/07/20 13:37:33 UTC

[jira] [Created] (NUTCH-1434) Indexer to delete robots noIndex

Markus Jelsma created NUTCH-1434:
------------------------------------

             Summary: Indexer to delete robots noIndex
                 Key: NUTCH-1434
                 URL: https://issues.apache.org/jira/browse/NUTCH-1434
             Project: Nutch
          Issue Type: New Feature
          Components: indexer
    Affects Versions: 1.5.1
            Reporter: Markus Jelsma
            Assignee: Markus Jelsma
             Fix For: 1.6


Nutch does not treat pages with meta robots="noindex" properly. All it does is remove the title and content fields from the parsed data. It does not stop those pages from being indexed, nor can it delete existing pages from the index if they change.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (NUTCH-1434) Indexer to delete robots noIndex

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1434:
---------------------------------

    Attachment: NUTCH-1434-1.6-3.patch

New patch removing command line switch and adding configuration.
                

[jira] [Commented] (NUTCH-1434) Indexer to delete robots noIndex

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419128#comment-13419128 ] 

Markus Jelsma commented on NUTCH-1434:
--------------------------------------

You're right about the current behaviour, but there is no further problem. With this patch, documents are never passed to the index, but we have to send a delete request because the same page may have had no NOINDEX meta tag yesterday. The same goes for 404s: we have to delete those too because we don't know whether we added them to the index before (which is possible).
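
The reasoning above can be sketched in a few lines of Java. This is only an illustration, not the code in the patch: the class name, the Action enum and the decide() method are hypothetical, and treating every "gone" page as a delete is a simplification of what the comment describes.

```java
import java.util.Locale;

public class NoIndexDecisionSketch {
    /** What the indexing step emits for one URL (hypothetical names). */
    public enum Action { ADD, DELETE }

    /**
     * gone:                the page returned a 404 on the last fetch
     * robotsMeta:          content of the robots meta tag, or null if absent
     * deleteRobotsNoIndex: assumed to mirror the setting discussed in this issue
     */
    public static Action decide(boolean gone, String robotsMeta, boolean deleteRobotsNoIndex) {
        // A gone page may have been indexed on an earlier crawl, so ask for a delete.
        if (gone) {
            return Action.DELETE;
        }
        boolean noIndex = robotsMeta != null
                && robotsMeta.toLowerCase(Locale.ROOT).contains("noindex");
        // The same page may have had no noindex tag yesterday, so a delete
        // request is sent instead of silently dropping the document.
        if (noIndex && deleteRobotsNoIndex) {
            return Action.DELETE;
        }
        return Action.ADD;
    }

    public static void main(String[] args) {
        System.out.println(decide(false, "NOINDEX, follow", true));
        System.out.println(decide(true, null, true));
        System.out.println(decide(false, "index, follow", true));
    }
}
```

The point of the sketch is that "do not index" and "delete from the index" are the same action here, because the indexer cannot know what an earlier run sent.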

Thanks
                

[jira] [Commented] (NUTCH-1434) Indexer to delete robots noIndex

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13433512#comment-13433512 ] 

Markus Jelsma commented on NUTCH-1434:
--------------------------------------

Any comments? Commit? Bad patch? I'd like to get this in, it has been baking for quite some time.
                

[jira] [Updated] (NUTCH-1434) Indexer to delete robots noIndex

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1434:
---------------------------------

    Attachment: NUTCH-1434-1.6-1.patch

Patch for 1.6. I had to move the code for skipping notModified pages to the point where collection of the reduce values has finished. Otherwise the record could be skipped if the CrawlDatum object comes before the ParseData.
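
A minimal sketch of that ordering problem, in plain Java with no Nutch or Hadoop dependency. The Datum and Parse classes are hypothetical stand-ins for CrawlDatum and ParseData, and the order of the checks after the loop is an assumption drawn from the description above rather than from the patch itself.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class CollectThenDecideSketch {
    /** Hypothetical stand-in for CrawlDatum. */
    public static class Datum {
        final boolean notModified;
        public Datum(boolean notModified) { this.notModified = notModified; }
    }

    /** Hypothetical stand-in for ParseData. */
    public static class Parse {
        final String robotsMeta;
        public Parse(String robotsMeta) { this.robotsMeta = robotsMeta; }
    }

    /** Returns "DELETE", "SKIP" or "ADD" for the values grouped under one URL. */
    public static String reduce(List<Object> values) {
        Datum datum = null;
        Parse parse = null;
        // Step 1: only collect. Returning from inside this loop would make
        // the outcome depend on the order in which the values arrive.
        for (Object value : values) {
            if (value instanceof Datum) {
                datum = (Datum) value;
            } else if (value instanceof Parse) {
                parse = (Parse) value;
            }
        }
        // Step 2: decide once everything has been seen.
        if (parse != null && parse.robotsMeta != null
                && parse.robotsMeta.toLowerCase(Locale.ROOT).contains("noindex")) {
            return "DELETE";
        }
        if (datum == null || parse == null || datum.notModified) {
            return "SKIP";
        }
        return "ADD";
    }

    public static void main(String[] args) {
        Object datum = new Datum(true);
        Object parse = new Parse("noindex");
        // Both orders give the same answer because no decision is made mid-loop.
        System.out.println(reduce(Arrays.asList(datum, parse)));
        System.out.println(reduce(Arrays.asList(parse, datum)));
    }
}
```

If the notModified check sat inside the loop, the first call in main would skip the record before the Parse value was ever inspected.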

please comment
                

[jira] [Updated] (NUTCH-1434) Indexer to delete robots noIndex

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1434:
---------------------------------

    Attachment: NUTCH-1434-1.6-2.patch

New patch fixing potential NPE if CrawlDatum object no longer exists but is still in segment.
                

[jira] [Commented] (NUTCH-1434) Indexer to delete robots noIndex

Posted by "Lewis John McGibbney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13434078#comment-13434078 ] 

Lewis John McGibbney commented on NUTCH-1434:
---------------------------------------------

Hi Markus. I think this is fine. Your patch doesn't include the property in nutch-default.xml, but I'm sure you have it. Something pretty trivial, but for consistency: can you clarify whether the name of the property should be indexer.blah.blah? That prefix tends to be used for indexing filter functionality, whereas solr.blah.blah is used for the Solr indexer properties. Once these are sorted, +1 from me. Thank you.
                

[jira] [Commented] (NUTCH-1434) Indexer to delete robots noIndex

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13434927#comment-13434927 ] 

Julien Nioche commented on NUTCH-1434:
--------------------------------------

Well, let's do configuration only then. After all, it can be set on the command line with -D just as well, and it means that we don't have to change the code that reads the params, etc.
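
To illustrate why a configuration-only setting loses nothing, here is a small sketch of layered lookup using java.util.Properties. It does not use the Hadoop Configuration class that Nutch actually relies on; the three layers, the resolve() helper and the default of "false" are assumptions made for the example. Only the property name comes from this issue.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class LayeredConfigSketch {
    public static final String INDEXER_DELETE_ROBOTS_NOINDEX = "indexer.delete.robots.noindex";

    /**
     * Resolves the setting from three layers, lowest priority first:
     * shipped defaults, site overrides, and values passed with -D.
     */
    public static boolean resolve(Map<String, String> defaults,
                                  Map<String, String> site,
                                  Map<String, String> dashD) {
        Properties defaultLayer = new Properties();
        defaultLayer.putAll(defaults);
        // Each layer falls back to the one below it when a key is missing.
        Properties siteLayer = new Properties(defaultLayer);
        siteLayer.putAll(site);
        Properties dashDLayer = new Properties(siteLayer);
        dashDLayer.putAll(dashD);
        // Assumed default: the feature is off unless some layer enables it.
        return Boolean.parseBoolean(
                dashDLayer.getProperty(INDEXER_DELETE_ROBOTS_NOINDEX, "false"));
    }

    public static void main(String[] args) {
        Map<String, String> none = Collections.emptyMap();
        Map<String, String> off = Collections.singletonMap(INDEXER_DELETE_ROBOTS_NOINDEX, "false");
        Map<String, String> on = Collections.singletonMap(INDEXER_DELETE_ROBOTS_NOINDEX, "true");
        // Site file enables the feature, nothing passed with -D.
        System.out.println(resolve(off, on, none));
        // A -D value still wins over the site file.
        System.out.println(resolve(off, on, off));
    }
}
```

With one lookup path there is no separate switch that could silently override the site configuration, which was the source of confusion raised in this thread.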
                

[jira] [Commented] (NUTCH-1434) Indexer to delete robots noIndex

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13440940#comment-13440940 ] 

Hudson commented on NUTCH-1434:
-------------------------------

Integrated in Nutch-trunk #1935 (See [https://builds.apache.org/job/Nutch-trunk/1935/])
    NUTCH-1434 Indexer to delete robots noindex (Revision 1376394)

     Result = SUCCESS
markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1376394
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java

                

[jira] [Commented] (NUTCH-1434) Indexer to delete robots noIndex

Posted by "Ferdy Galema (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435000#comment-13435000 ] 

Ferdy Galema commented on NUTCH-1434:
-------------------------------------

+1 for removing command-line args and using configuration. (I'd actually like to see this done for many more tools, as this allows for the greatest flexibility, but that is another discussion.)
                

[jira] [Commented] (NUTCH-1434) Indexer to delete robots noIndex

Posted by "Lewis John McGibbney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419125#comment-13419125 ] 

Lewis John McGibbney commented on NUTCH-1434:
---------------------------------------------

Can we clarify exactly what we mean by NoIndex? If we currently simply remove some fields and then index the rest, then this is not good practice. I get that this patch enables us to delete such documents, and so to be responsible about the content residing within the index. To me, though, this doesn't target the source of the problem, i.e. that documents are still being passed to the indexer even though robots.txt says they are not to be indexed. Or am I getting this the wrong way?
                

[jira] [Commented] (NUTCH-1434) Indexer to delete robots noIndex

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13434093#comment-13434093 ] 

Markus Jelsma commented on NUTCH-1434:
--------------------------------------

Hi Lewis - I haven't added the configuration because it's overridden by the command line switch regardless of the nutch-site.xml configuration. The property name can be seen in the IndexerMapReduce.java patch:

+  public static final String INDEXER_DELETE_ROBOTS_NOINDEX = "indexer.delete.robots.noindex";

It's indeed not Solr because it's Solr agnostic.
                

[jira] [Commented] (NUTCH-1434) Indexer to delete robots noIndex

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13440115#comment-13440115 ] 

Hudson commented on NUTCH-1434:
-------------------------------

Integrated in nutch-trunk-maven #398 (See [https://builds.apache.org/job/nutch-trunk-maven/398/])
    NUTCH-1434 Indexer to delete robots noindex (Revision 1376394)

     Result = SUCCESS
markus : 
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java

                

[jira] [Commented] (NUTCH-1434) Indexer to delete robots noIndex

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13434908#comment-13434908 ] 

Markus Jelsma commented on NUTCH-1434:
--------------------------------------

Alright. I'll add the property with description to nutch-default.xml in the commit.
                

[jira] [Resolved] (NUTCH-1434) Indexer to delete robots noIndex

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma resolved NUTCH-1434.
----------------------------------

    Resolution: Fixed

Committed for 1.6 in rev. 1376394.
Thanks
                

[jira] [Commented] (NUTCH-1434) Indexer to delete robots noIndex

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13434911#comment-13434911 ] 

Markus Jelsma commented on NUTCH-1434:
--------------------------------------

I still think it only leads to confusion. We also removed the -parse switch in favour of the configuration option because only one of them would ever work.
                

[jira] [Commented] (NUTCH-1434) Indexer to delete robots noIndex

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13434893#comment-13434893 ] 

Julien Nioche commented on NUTCH-1434:
--------------------------------------

bq.  I haven't added the configuration because it's overridden by the command line switch regardless of the nutch-site.xml configuration.

I'd rather do it the way it's done in other parts of the code, i.e. take into account any value set in nutch-site.xml if nothing is set on the command line (see for instance fetcher.parse), and include the property in nutch-default.xml.
                