Posted to dev@nutch.apache.org by "James Sullivan (JIRA)" <ji...@apache.org> on 2012/10/07 03:51:02 UTC

[jira] [Created] (NUTCH-1475) Nutch 2.1 Index-More Plugin -- A better fall back value for date field

James Sullivan created NUTCH-1475:
-------------------------------------

             Summary: Nutch 2.1 Index-More Plugin -- A better fall back value for date field
                 Key: NUTCH-1475
                 URL: https://issues.apache.org/jira/browse/NUTCH-1475
             Project: Nutch
          Issue Type: Bug
    Affects Versions: nutchgora, 2.1
         Environment: All
            Reporter: James Sullivan
            Priority: Minor
         Attachments: index-more-2x.patch

Among other fields, the more plugin for Nutch 2.x provides a "last modified" and a "date" field for the Solr index. The "last modified" field holds the last modified date from the HTTP headers if available; otherwise it is left empty. Currently, the "date" field is the same as the "last modified" field unless that field is empty, in which case getFetchTime is used as a fallback. I think getFetchTime is not a good fallback because it is the next fetch time, often a month or more in the future, which doesn't make sense for the date field. Users do not expect web pages or documents with future dates. A more sensible fallback would be the current date at the time of indexing.

This is possible by simply changing line 97 of https://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java from


time = page.getFetchTime(); // use fetch time

to

time = new Date().getTime();


Users interested in the getFetchTime value can still get it from the "tstamp" field.
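
The proposed fallback can be sketched in isolation as follows. This is a minimal sketch, not the plugin's actual code: DateFallbackSketch and resolveDate are hypothetical names, and it assumes that a missing last modified value is represented as a non-positive number of milliseconds.

```java
import java.util.Date;

public class DateFallbackSketch {

    /**
     * Hypothetical helper (not part of Nutch): picks the value for the "date" field.
     * Assumption: lastModifiedMillis <= 0 means no last modified header was available.
     */
    public static long resolveDate(long lastModifiedMillis, long nowMillis) {
        if (lastModifiedMillis > 0) {
            return lastModifiedMillis; // "date" mirrors "last modified" when present
        }
        return nowMillis; // proposed fallback: the time of indexing
    }

    public static void main(String[] args) {
        long now = new Date().getTime();
        // Illustrative next fetch time about 30 days ahead, like the old fallback.
        long nextFetch = now + 30L * 24 * 60 * 60 * 1000;

        long date = resolveDate(0L, now);
        System.out.println("new fallback is in the future: " + (date > now));
        System.out.println("old fallback is in the future: " + (nextFetch > now));
    }
}
```

Passing the current time in as a parameter, rather than calling new Date() inside the helper, is only a choice made here to keep the sketch easy to test.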




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (NUTCH-1475) Nutch 2.1 Index-More Plugin -- A better fall back value for date field

Posted by "Sebastian Nagel (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13474460#comment-13474460 ] 

Sebastian Nagel commented on NUTCH-1475:
----------------------------------------

Indeed, a modified time in the future is a bad choice.
But CrawlDatum and WebPage both have a modifiedTime field. It should contain the time of the last fetch or, ideally, the time of an earlier fetch if the document has not been modified since.
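
One way to read this suggestion is as a fallback chain that prefers a stored modified time over the current time. The following is a minimal sketch under stated assumptions: ModifiedTimeFallbackSketch and chooseDate are hypothetical names, unset timestamps are assumed to be stored as non-positive values, and the real CrawlDatum and WebPage classes are not used. Ending the chain with the current time is an assumption of this sketch, taken from the original proposal in this issue.

```java
public class ModifiedTimeFallbackSketch {

    /**
     * Hypothetical helper (not part of Nutch): chooses the "date" value from
     * three candidates, in order of preference.
     */
    public static long chooseDate(long headerLastModified, long modifiedTime, long nowMillis) {
        if (headerLastModified > 0) {
            return headerLastModified; // value from the HTTP headers
        }
        if (modifiedTime > 0) {
            return modifiedTime; // stored modified time, as suggested in this comment
        }
        return nowMillis; // last resort: time of indexing (assumption of this sketch)
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // No header value, but a stored modified time is available.
        System.out.println(chooseDate(0L, 1000L, now));
    }
}
```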
                
> Nutch 2.1 Index-More Plugin -- A better fall back value for date field
> ----------------------------------------------------------------------
>
>                 Key: NUTCH-1475
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1475
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 2.1, 1.5.1
>         Environment: All
>            Reporter: James Sullivan
>            Priority: Minor
>              Labels: index-more, plugins
>             Fix For: 1.6, 2.2
>
>         Attachments: index-more-1xand2x.patch, index-more-2x.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h

[jira] [Updated] (NUTCH-1475) Nutch 2.1 Index-More Plugin -- A better fall back value for date field

Posted by "James Sullivan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Sullivan updated NUTCH-1475:
----------------------------------

    Attachment: index-more-2x.patch
    
> Nutch 2.1 Index-More Plugin -- A better fall back value for date field
> ----------------------------------------------------------------------
>
>                 Key: NUTCH-1475
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1475
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: nutchgora, 2.1
>         Environment: All
>            Reporter: James Sullivan
>            Priority: Minor
>              Labels: index-more, plugins
>         Attachments: index-more-2x.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h

[jira] [Commented] (NUTCH-1475) Nutch 2.1 Index-More Plugin -- A better fall back value for date field

Posted by "James Sullivan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13475757#comment-13475757 ] 

James Sullivan commented on NUTCH-1475:
---------------------------------------

Agreed, fetch time would be even better, but this seems a simple interim solution until NUTCH-1457 happens.
                

[jira] [Updated] (NUTCH-1475) Nutch 2.1 Index-More Plugin -- A better fall back value for date field

Posted by "Lewis John McGibbney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1475:
----------------------------------------

    Fix Version/s: 2.2
                   1.6
    

[jira] [Commented] (NUTCH-1475) Nutch 2.1 Index-More Plugin -- A better fall back value for date field

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13474198#comment-13474198 ] 

Julien Nioche commented on NUTCH-1475:
--------------------------------------

Nope, looks like a reasonable thing to do.
                

[jira] [Updated] (NUTCH-1475) Nutch 2.1 Index-More Plugin -- A better fall back value for date field

Posted by "James Sullivan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Sullivan updated NUTCH-1475:
----------------------------------

    Attachment: index-more-1xand2x.patch

Attaching a new patch that covers both 1.x and 2.x.
                
> Nutch 2.1 Index-More Plugin -- A better fall back value for date field
> ----------------------------------------------------------------------
>
>                 Key: NUTCH-1475
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1475
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 2.1, 1.5.1
>         Environment: All
>            Reporter: James Sullivan
>            Priority: Minor
>              Labels: index-more, plugins
>         Attachments: index-more-1xand2x.patch, index-more-2x.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h

[jira] [Commented] (NUTCH-1475) Nutch 2.1 Index-More Plugin -- A better fall back value for date field

Posted by "Lewis John McGibbney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13474159#comment-13474159 ] 

Lewis John McGibbney commented on NUTCH-1475:
---------------------------------------------

Any objections to committing this?
                

[jira] [Updated] (NUTCH-1475) Nutch 2.1 Index-More Plugin -- A better fall back value for date field

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-1475:
---------------------------------

    Affects Version/s:     (was: nutchgora)
                       1.5.1

This is an issue for the 1.x branch as well.
                
> Nutch 2.1 Index-More Plugin -- A better fall back value for date field
> ----------------------------------------------------------------------
>
>                 Key: NUTCH-1475
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1475
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 2.1, 1.5.1
>         Environment: All
>            Reporter: James Sullivan
>            Priority: Minor
>              Labels: index-more, plugins
>         Attachments: index-more-2x.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h