Posted to dev@manifoldcf.apache.org by "Jan Høydahl (JIRA)" <ji...@apache.org> on 2011/08/22 17:10:29 UTC

[jira] [Created] (CONNECTORS-243) Web crawler must get the "Last-Modified" HTTP header and pass it as metadata to output

Web crawler must get the "Last-Modified" HTTP header and pass it as metadata to output
--------------------------------------------------------------------------------------

                 Key: CONNECTORS-243
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-243
             Project: ManifoldCF
          Issue Type: New Feature
          Components: Web connector
    Affects Versions: ManifoldCF 0.2
            Reporter: Jan Høydahl


Last-Modified is important in web search, as it may be used for (de)boosting based on date.
In fact, ManifoldCF should have the ability to parse any (or all) HTTP headers from the source document and pass them on.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Resolved] (CONNECTORS-243) Web crawler must get the "Last-Modified" HTTP header and pass it as metadata to output

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CONNECTORS-243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright resolved CONNECTORS-243.
------------------------------------

       Resolution: Fixed
    Fix Version/s: ManifoldCF 0.3

> Web crawler must get the "Last-Modified" HTTP header and pass it as metadata to output
> --------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-243
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-243
>             Project: ManifoldCF
>          Issue Type: New Feature
>          Components: Web connector
>    Affects Versions: ManifoldCF 0.2
>            Reporter: Jan Høydahl
>            Assignee: Karl Wright
>              Labels: last-modified
>             Fix For: ManifoldCF 0.3
>
>
> Last-Modified is important in web search, as it may be used for (de)boosting based on date.
> In fact, ManifoldCF should have the ability to parse any (or all) HTTP headers from the source document and pass them on.


       

[jira] [Commented] (CONNECTORS-243) Web crawler must get the "Last-Modified" HTTP header and pass it as metadata to output

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089159#comment-13089159 ] 

Karl Wright commented on CONNECTORS-243:
----------------------------------------

r1160504.




       

[jira] [Commented] (CONNECTORS-243) Web crawler must get the "Last-Modified" HTTP header and pass it as metadata to output

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088754#comment-13088754 ] 

Karl Wright commented on CONNECTORS-243:
----------------------------------------

I'll try to have a look at this this evening.




       

[jira] [Assigned] (CONNECTORS-243) Web crawler must get the "Last-Modified" HTTP header and pass it as metadata to output

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CONNECTORS-243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright reassigned CONNECTORS-243:
--------------------------------------

    Assignee: Karl Wright



       

[jira] [Commented] (CONNECTORS-243) Web crawler must get the "Last-Modified" HTTP header and pass it as metadata to output

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089128#comment-13089128 ] 

Karl Wright commented on CONNECTORS-243:
----------------------------------------

Looking at this further, there are a number of headers that would be bad to include in metadata.  For example, you would not want to include anything authentication-related or session-related.  Any transient information should also be excluded, since it would force ManifoldCF to refetch the document on every job run.  Here's the list of exclusions I've come up with so far:

Age
WWW-Authenticate
Proxy-Authenticate
Date
Set-Cookie
Via

Any I've missed?
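
To make the exclusion idea concrete, here is a minimal sketch of how such a filter might look. It is written in Python purely for illustration and is not the code committed for this issue; the names EXCLUDED_HEADERS and headers_to_metadata are hypothetical and do not correspond to any ManifoldCF API. It assumes the response headers are available as (name, value) pairs and compares names case-insensitively, since HTTP header names are case-insensitive.

```python
# Hypothetical sketch: the names below are illustrative, not ManifoldCF APIs.

# The exclusions proposed in this comment, lowercased because HTTP header
# names are case-insensitive.
EXCLUDED_HEADERS = frozenset({
    "age",
    "www-authenticate",
    "proxy-authenticate",
    "date",
    "set-cookie",
    "via",
})


def headers_to_metadata(headers):
    """Return a dict of header name -> list of values, minus exclusions.

    `headers` is an iterable of (name, value) pairs; a list of values is
    kept per name because a response may repeat a header.
    """
    metadata = {}
    for name, value in headers:
        key = name.strip().lower()
        if key in EXCLUDED_HEADERS:
            continue  # transient, or authentication/session related
        metadata.setdefault(key, []).append(value.strip())
    return metadata


# Made-up response headers, for demonstration only.
response_headers = [
    ("Last-Modified", "Mon, 22 Aug 2011 17:10:29 GMT"),
    ("Date", "Tue, 23 Aug 2011 08:00:00 GMT"),
    ("Set-Cookie", "session=abc123"),
    ("Content-Type", "text/html"),
]
metadata = headers_to_metadata(response_headers)
print(metadata)
```

With these sample headers, Last-Modified and Content-Type survive as metadata, while Date and Set-Cookie are dropped. An exclusion list (rather than an inclusion list) matches the request to pass on "any (or all)" headers, at the cost of having to keep the list current as new transient headers turn up.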



