Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2011/06/30 01:23:28 UTC

[jira] [Created] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

Dynamically set fetchInterval by MIME-type
------------------------------------------

                 Key: NUTCH-1024
                 URL: https://issues.apache.org/jira/browse/NUTCH-1024
             Project: Nutch
          Issue Type: New Feature
            Reporter: Markus Jelsma
            Priority: Minor
             Fix For: 2.0


Add a facility to configure default or fixed fetchInterval values by MIME-type. This helps conserve resources, because some file types are known to change frequently, others never change, and the rest fall somewhere in between.

* simple key\tvalue\n configuration file
* only set fetchInterval for new documents
* keep max fetchInterval fixed by current config
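
As a rough illustration of the three points above (not the attached implementation), the sketch below reads a tab-separated file of MIME-type and interval pairs and picks the initial interval for a new document, capped by a configured maximum. The class name, method names, the unit (seconds), and the handling of comment and malformed lines are all assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper for illustration; not taken from the Nutch code base. */
final class MimeIntervalSketch {

  /**
   * Parses "mime<TAB>seconds" lines into a lookup table.
   * Skipping blank, '#'-prefixed, and malformed lines is an assumption.
   */
  static Map<String, Integer> parse(String text) {
    Map<String, Integer> intervals = new HashMap<>();
    for (String line : text.split("\n")) {
      String trimmed = line.trim();
      if (trimmed.isEmpty() || trimmed.startsWith("#")) {
        continue;
      }
      String[] parts = trimmed.split("\t");
      if (parts.length != 2) {
        continue;
      }
      intervals.put(parts[0].trim().toLowerCase(Locale.ROOT),
          Integer.parseInt(parts[1].trim()));
    }
    return intervals;
  }

  /**
   * Chooses the interval for a NEW document only: the per-MIME value if one
   * is configured, otherwise the default, never above the configured maximum.
   */
  static int initialInterval(Map<String, Integer> intervals, String mimeType,
      int defaultInterval, int maxInterval) {
    String key = mimeType == null ? "" : mimeType.toLowerCase(Locale.ROOT);
    int configured = intervals.getOrDefault(key, defaultInterval);
    return Math.min(configured, maxInterval);
  }

  public static void main(String[] args) {
    Map<String, Integer> intervals =
        parse("# mime\tseconds\napplication/pdf\t7776000\ntext/html\t86400\n");
    System.out.println(initialInterval(intervals, "text/html", 2592000, 7776000));
    System.out.println(initialInterval(intervals, "image/png", 2592000, 7776000));
  }
}
```

Existing documents would keep whatever interval the active schedule has already computed for them; only the first assignment consults the table.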

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1024:
---------------------------------

    Attachment: MimeAdaptiveFetchSchedule.java

New version with proper handling of the Content-Type attribute. In my test I didn't include the charset parameter, which is present in real-world tests.
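
For readers unfamiliar with the problem: a Content-Type value such as "text/html; charset=UTF-8" will not match a lookup key of "text/html" unless the parameters are removed first. The following is a minimal sketch of that normalization, assuming a plain string header value; the class and method names are hypothetical and the attached code may handle this differently.

```java
import java.util.Locale;

/** Hypothetical helper for illustration; not taken from the attachment. */
final class ContentTypeSketch {

  /**
   * Returns the bare, lower-cased MIME type without parameters such as
   * "; charset=UTF-8", or null if no type can be extracted.
   */
  static String bareMimeType(String contentType) {
    if (contentType == null) {
      return null;
    }
    int semicolon = contentType.indexOf(';');
    String mime = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
    mime = mime.trim().toLowerCase(Locale.ROOT);
    return mime.isEmpty() ? null : mime;
  }

  public static void main(String[] args) {
    System.out.println(bareMimeType("text/html; charset=UTF-8"));
    System.out.println(bareMimeType("Application/PDF"));
  }
}
```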



        

[jira] [Commented] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13090082#comment-13090082 ] 

Julien Nioche commented on NUTCH-1024:
--------------------------------------

Do you mind if we wait a bit? I'd like to spend some time on it first and see how this would fit with the refresh info we would get from the sitemap entries.



        

[jira] [Commented] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

Posted by "Markus Jelsma (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241090#comment-13241090 ] 

Markus Jelsma commented on NUTCH-1024:
--------------------------------------

I'll change the legacy sys.out to logging. HttpHeaders doesn't have Text representations of the strings, but I'll be happy to add them if you want.
                


        

[jira] [Commented] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295797#comment-13295797 ] 

Hudson commented on NUTCH-1024:
-------------------------------

Integrated in Nutch-trunk #1869 (See [https://builds.apache.org/job/Nutch-trunk/1869/])
    NUTCH-1024 Dynamically set fetchInterval by MIME-type (Revision 1349226)

     Result = SUCCESS
markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1349226
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/adaptive-mimetypes.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java
* /nutch/trunk/src/java/org/apache/nutch/crawl/MimeAdaptiveFetchSchedule.java
* /nutch/trunk/src/java/org/apache/nutch/metadata/HttpHeaders.java
* /nutch/trunk/src/java/org/apache/nutch/metadata/Nutch.java

                


        

[jira] [Updated] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

Posted by "Markus Jelsma (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1024:
---------------------------------

    Attachment:     (was: NUTCH-1024-1.5-3.patch)
    


        

[jira] [Updated] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

Posted by "Markus Jelsma (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1024:
---------------------------------

    Attachment: NUTCH-1024-1.5-2.patch

New patch for 1.5 with modifications as per Julien's comments.
                


        

[jira] [Updated] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1024:
---------------------------------

    Attachment:     (was: MimeAdaptiveFetchSchedule.java)



        

[jira] [Issue Comment Edited] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

Posted by "Markus Jelsma (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220787#comment-13220787 ] 

Markus Jelsma edited comment on NUTCH-1024 at 3/2/12 9:05 AM:
--------------------------------------------------------------

New patch for trunk! This also includes a change to the injector, where the injected fetchInterval is added to the CrawlDatum metadata. In AdaptiveFetchSchedule this injected interval overrides anything else. This is useful for sites where you want to use AdaptiveFetchSchedule but still want the generator to select an injected homepage every N hours.
                
      was (Author: markus17):
    New patch for trunk! This also includes a change to the injector where injected fetchInterval is added to CrawlDatum MD. In AdaptiveFetchSchedule this injected interval overrides anything else.
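
The override described in this comment can be pictured with a small sketch. It uses a plain Map as a stand-in for the CrawlDatum metadata, and the metadata key name is hypothetical; neither is taken from the patch, so treat this only as an outline of the precedence rule (injected interval first, adaptive value otherwise).

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical outline of the precedence rule; not code from the patch. */
final class InjectedIntervalSketch {

  /** Hypothetical metadata key; the real key name is defined by the patch. */
  static final String FIXED_INTERVAL_KEY = "example.fixedInterval";

  /**
   * Returns the injected interval if the metadata carries one,
   * otherwise the interval computed by the adaptive schedule.
   */
  static float chooseInterval(Map<String, String> metadata, float adaptiveInterval) {
    String injected = metadata.get(FIXED_INTERVAL_KEY);
    if (injected != null) {
      return Float.parseFloat(injected);
    }
    return adaptiveInterval;
  }

  public static void main(String[] args) {
    Map<String, String> homepage = new HashMap<>();
    homepage.put(FIXED_INTERVAL_KEY, "3600");
    System.out.println(chooseInterval(homepage, 86400f));
    System.out.println(chooseInterval(new HashMap<>(), 86400f));
  }
}
```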
                  


        

[jira] [Commented] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291601#comment-13291601 ] 

Markus Jelsma commented on NUTCH-1024:
--------------------------------------

I'll commit this one in the next few days unless there are objections or improvements. Thanks
                


        

[jira] [Commented] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

Posted by "Markus Jelsma (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116746#comment-13116746 ] 

Markus Jelsma commented on NUTCH-1024:
--------------------------------------

Integration with sitemaps and crawler commons is something that's not being implemented now. Should we include this in 1.4? It does offer good flexibility on large crawls with semi-immutable MIME-types.
                


        

[jira] [Updated] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

Posted by "Markus Jelsma (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1024:
---------------------------------

    Fix Version/s:     (was: 1.4)
                   1.5
    


        

[jira] [Commented] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

Posted by "Lewis John McGibbney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241207#comment-13241207 ] 

Lewis John McGibbney commented on NUTCH-1024:
---------------------------------------------

I like this, Markus. I need to be honest and say that I've not had time to give it a spin recently, so apologies for this. It looks like the process to date has been a bit frustrating, so I apologize for not chipping in earlier. Anyway, we don't rely on commons for logging; could you please replace this with
{code}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
{code}

Another further point from me:

You make reference to the following configuration properties:
{code}
SCHEDULE_INC_RATE = "db.fetch.schedule.adaptive.inc_rate";
SCHEDULE_DEC_RATE = "db.fetch.schedule.adaptive.dec_rate";
SCHEDULE_MIME_FILE = "db.fetch.schedule.mime.file";
{code}

I don't see the new MIME_FILE property added in the patch, and I also don't see the INC and DEC properties added to nutch-default.xml.
Thanks

                


        

[jira] [Commented] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

Posted by "Markus Jelsma (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241229#comment-13241229 ] 

Markus Jelsma commented on NUTCH-1024:
--------------------------------------

I'll fix the logging; this is old code. The inc and dec rate properties are already in nutch-default.xml, but the mime-file property and the file itself are missing.
                


        

[jira] [Commented] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089761#comment-13089761 ] 

Markus Jelsma commented on NUTCH-1024:
--------------------------------------

I'd like to commit this issue this Friday unless there are objections or other comments.



        

[jira] [Assigned] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma reassigned NUTCH-1024:
------------------------------------

    Assignee: Markus Jelsma



        

[jira] [Updated] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1024:
---------------------------------

    Attachment: MimeAdaptiveFetchSchedule.java
                adaptive-mimetypes.txt

New version that allows for separate inc and dec rate values per MIME-type. The conf file format is now: mime\tinc_rate\tdec_rate. The code uses an internal struct to store the rates per MIME-type in a hashmap.

Please comment.
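
To make the format in this update concrete, here is a minimal sketch of a per-MIME-type rate table. It assumes the usual adaptive rule in which an unmodified page has its interval lengthened by inc_rate and a modified page has it shortened by dec_rate; the class names, the skipping of comment and malformed lines, and the omission of min/max clamping are assumptions of this example, not details of the attachment.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of a per-MIME-type rate table; not the attached class. */
final class MimeRatesSketch {

  /** Small value holder for the two rates of one MIME type. */
  static final class Rates {
    final float inc;
    final float dec;

    Rates(float inc, float dec) {
      this.inc = inc;
      this.dec = dec;
    }
  }

  /** Parses "mime<TAB>inc_rate<TAB>dec_rate" lines into a hash map. */
  static Map<String, Rates> parse(String text) {
    Map<String, Rates> table = new HashMap<>();
    for (String line : text.split("\n")) {
      String trimmed = line.trim();
      if (trimmed.isEmpty() || trimmed.startsWith("#")) {
        continue; // assumption: blank and comment lines are ignored
      }
      String[] parts = trimmed.split("\t");
      if (parts.length != 3) {
        continue; // assumption: malformed lines are ignored
      }
      table.put(parts[0].trim().toLowerCase(Locale.ROOT),
          new Rates(Float.parseFloat(parts[1].trim()),
              Float.parseFloat(parts[2].trim())));
    }
    return table;
  }

  /** Shortens the interval after a change, lengthens it otherwise. */
  static float nextInterval(float interval, boolean modified, Rates rates) {
    return modified
        ? interval * (1.0f - rates.dec)
        : interval * (1.0f + rates.inc);
  }

  public static void main(String[] args) {
    Map<String, Rates> table = parse("application/pdf\t0.8\t0.1\n");
    Rates pdf = table.get("application/pdf");
    System.out.println(nextInterval(1000f, false, pdf));
    System.out.println(nextInterval(1000f, true, pdf));
  }
}
```

A MIME type missing from the table would fall back to the globally configured inc and dec rates.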



        

[jira] [Updated] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

Posted by "Markus Jelsma (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1024:
---------------------------------

    Attachment: NUTCH-1024-1.5-3.patch

Something went wrong here. 
                

For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Resolved] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma resolved NUTCH-1024.
----------------------------------

    Resolution: Fixed

Committed for 1.6 in rev. 1349226.
Thanks!
                


        

[jira] [Updated] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1024:
---------------------------------

      Component/s: generator
       Patch Info: [Patch Available]
    Fix Version/s:     (was: 2.0)
                   1.4

> Dynamically set fetchInterval by MIME-type
> ------------------------------------------
>
>                 Key: NUTCH-1024
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1024
>             Project: Nutch
>          Issue Type: New Feature
>          Components: generator
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: AdaptiveFetchSchedule.patch, MimeAdaptiveFetchSchedule.java, Nutch.patch, adaptive-mimetypes.txt
>
>
> Add facility to configure default or fixed fetchInterval values by MIME-type. This is useful for conserving resources for files that are known to change frequently or never and everything in between.
> * simple key\tvalue\n configuration file
> * only set fetchInterval for new documents
> * keep max fetchInterval fixed by current config


        

[jira] [Updated] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1024:
---------------------------------

    Attachment:     (was: MimeAdaptiveFetchSchedule.java)

> Dynamically set fetchInterval by MIME-type
> ------------------------------------------
>
>                 Key: NUTCH-1024
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1024
>             Project: Nutch
>          Issue Type: New Feature
>          Components: generator
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: AdaptiveFetchSchedule.patch, MimeAdaptiveFetchSchedule.java, Nutch.patch, adaptive-mimetypes.txt
>
>
> Add facility to configure default or fixed fetchInterval values by MIME-type. This is useful for conserving resources for files that are known to change frequently or never and everything in between.
> * simple key\tvalue\n configuration file
> * only set fetchInterval for new documents
> * keep max fetchInterval fixed by current config


        

[jira] [Commented] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

Posted by "Markus Jelsma (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13240430#comment-13240430 ] 

Markus Jelsma commented on NUTCH-1024:
--------------------------------------

Thoughts? I'd like to send this one in.
                
> Dynamically set fetchInterval by MIME-type
> ------------------------------------------
>
>                 Key: NUTCH-1024
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1024
>             Project: Nutch
>          Issue Type: New Feature
>          Components: generator
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: AdaptiveFetchSchedule.patch, MimeAdaptiveFetchSchedule.java, NUTCH-1024-1.5-1.patch, Nutch.patch, adaptive-mimetypes.txt
>
>
> Add facility to configure default or fixed fetchInterval values by MIME-type. This is useful for conserving resources for files that are known to change frequently or never and everything in between.
> * simple key\tvalue\n configuration file
> * only set fetchInterval for new documents
> * keep max fetchInterval fixed by current config


        

[jira] [Updated] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1024:
---------------------------------

    Attachment: Nutch.patch
                AdaptiveFetchSchedule.patch
                MimeAdaptiveFetchSchedule.java
                adaptive-mimetypes.txt

Here's a first WIP. It extends AdaptiveFetchSchedule and changes the INC/DEC rates depending on the current MIME-type. It also patches AdaptiveFetchSchedule so that the INC and DEC properties are protected and settable from the child class. I also added two properties to metadata.Nutch for reading the Content-Type key as a Writable from the CrawlDatum MetaData. That was a bit of trickery.

It uses the original INC and DEC rate values for any CrawlDatum without a Content-Type in its MetaData or with an unconfigured Content-Type.

Please comment. There must be something wrong as it seems to work. :)
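
The lookup described above can be sketched roughly as follows. This is a minimal illustration and not the attached MimeAdaptiveFetchSchedule.java: the class MimeRateSketch, its Rates holder, and the three-column "mime<TAB>inc<TAB>dec" line layout are assumptions made for the example. It shows only the fallback to the default rates when the Content-Type is missing or unconfigured, plus stripping of parameters such as a charset.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of per-MIME-type rate lookup; names are illustrative only. */
final class MimeRateSketch {

  /** Pair of adaptive rates: inc lengthens the interval, dec shortens it. */
  static final class Rates {
    final float inc;
    final float dec;

    Rates(float inc, float dec) {
      this.inc = inc;
      this.dec = dec;
    }
  }

  private final Map<String, Rates> byMime = new HashMap<>();
  private final Rates defaults;

  MimeRateSketch(Rates defaults) {
    this.defaults = defaults;
  }

  /** Loads lines of the assumed form "mime\tinc\tdec"; blank, '#' and malformed lines are skipped. */
  void load(String config) {
    for (String line : config.split("\n")) {
      String trimmed = line.trim();
      if (trimmed.isEmpty() || trimmed.startsWith("#")) {
        continue;
      }
      String[] cols = trimmed.split("\t");
      if (cols.length != 3) {
        continue; // not a three-column line, ignore it
      }
      try {
        byMime.put(cols[0].toLowerCase(Locale.ROOT),
            new Rates(Float.parseFloat(cols[1]), Float.parseFloat(cols[2])));
      } catch (NumberFormatException e) {
        // non-numeric rate, ignore the line
      }
    }
  }

  /** Drops parameters such as "; charset=UTF-8" and lower-cases the type. */
  static String cleanMime(String contentType) {
    int semi = contentType.indexOf(';');
    String base = semi >= 0 ? contentType.substring(0, semi) : contentType;
    return base.trim().toLowerCase(Locale.ROOT);
  }

  /** Returns the configured rates, or the defaults for a missing or unconfigured type. */
  Rates ratesFor(String contentType) {
    if (contentType == null) {
      return defaults;
    }
    Rates configured = byMime.get(cleanMime(contentType));
    return configured != null ? configured : defaults;
  }
}
```

In this sketch a caller would pick the rates with ratesFor(...) before delegating to the parent schedule, so documents without a usable Content-Type behave exactly as they do with the plain adaptive schedule.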

> Dynamically set fetchInterval by MIME-type
> ------------------------------------------
>
>                 Key: NUTCH-1024
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1024
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 2.0
>
>         Attachments: AdaptiveFetchSchedule.patch, MimeAdaptiveFetchSchedule.java, Nutch.patch, adaptive-mimetypes.txt
>
>
> Add facility to configure default or fixed fetchInterval values by MIME-type. This is useful for conserving resources for files that are known to change frequently or never and everything in between.
> * simple key\tvalue\n configuration file
> * only set fetchInterval for new documents
> * keep max fetchInterval fixed by current config


        

[jira] [Updated] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1024:
---------------------------------

    Attachment:     (was: adaptive-mimetypes.txt)

> Dynamically set fetchInterval by MIME-type
> ------------------------------------------
>
>                 Key: NUTCH-1024
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1024
>             Project: Nutch
>          Issue Type: New Feature
>          Components: generator
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: AdaptiveFetchSchedule.patch, MimeAdaptiveFetchSchedule.java, Nutch.patch, adaptive-mimetypes.txt
>
>
> Add facility to configure default or fixed fetchInterval values by MIME-type. This is useful for conserving resources for files that are known to change frequently or never and everything in between.
> * simple key\tvalue\n configuration file
> * only set fetchInterval for new documents
> * keep max fetchInterval fixed by current config


        

[jira] [Commented] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

Posted by "Julien Nioche (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241085#comment-13241085 ] 

Julien Nioche commented on NUTCH-1024:
--------------------------------------

Hi Markus

Will have a closer look later. 2 quick comments for now

AdaptiveFetchSchedule => remove calls to System.out and use logging instead
Metadata/Nutch => MIME_TYPE_KEY duplicates the one in Metadata/HttpHeaders
                
> Dynamically set fetchInterval by MIME-type
> ------------------------------------------
>
>                 Key: NUTCH-1024
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1024
>             Project: Nutch
>          Issue Type: New Feature
>          Components: generator
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: AdaptiveFetchSchedule.patch, MimeAdaptiveFetchSchedule.java, NUTCH-1024-1.5-1.patch, Nutch.patch, adaptive-mimetypes.txt
>
>
> Add facility to configure default or fixed fetchInterval values by MIME-type. This is useful for conserving resources for files that are known to change frequently or never and everything in between.
> * simple key\tvalue\n configuration file
> * only set fetchInterval for new documents
> * keep max fetchInterval fixed by current config


        

[jira] [Commented] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13090105#comment-13090105 ] 

Markus Jelsma commented on NUTCH-1024:
--------------------------------------

Sure, but what do you mean by info from sitemap entries? Is there an issue you can point to?

> Dynamically set fetchInterval by MIME-type
> ------------------------------------------
>
>                 Key: NUTCH-1024
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1024
>             Project: Nutch
>          Issue Type: New Feature
>          Components: generator
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: AdaptiveFetchSchedule.patch, MimeAdaptiveFetchSchedule.java, Nutch.patch, adaptive-mimetypes.txt
>
>
> Add facility to configure default or fixed fetchInterval values by MIME-type. This is useful for conserving resources for files that are known to change frequently or never and everything in between.
> * simple key\tvalue\n configuration file
> * only set fetchInterval for new documents
> * keep max fetchInterval fixed by current config


        

[jira] [Updated] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

Posted by "Markus Jelsma (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1024:
---------------------------------

    Fix Version/s:     (was: 1.5)
                   1.6

20120304-push-1.6
                
> Dynamically set fetchInterval by MIME-type
> ------------------------------------------
>
>                 Key: NUTCH-1024
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1024
>             Project: Nutch
>          Issue Type: New Feature
>          Components: generator
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.6
>
>         Attachments: AdaptiveFetchSchedule.patch, MimeAdaptiveFetchSchedule.java, NUTCH-1024-1.5-1.patch, NUTCH-1024-1.5-2.patch, NUTCH-1024-1.5-3.patch, Nutch.patch, adaptive-mimetypes.txt
>
>
> Add facility to configure default or fixed fetchInterval values by MIME-type. This is useful for conserving resources for files that are known to change frequently or never and everything in between.
> * simple key\tvalue\n configuration file
> * only set fetchInterval for new documents
> * keep max fetchInterval fixed by current config


        

[jira] [Updated] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

Posted by "Markus Jelsma (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1024:
---------------------------------

    Attachment: NUTCH-1024-1.5-3.patch

New patch with proper logging and configuration files.
                
> Dynamically set fetchInterval by MIME-type
> ------------------------------------------
>
>                 Key: NUTCH-1024
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1024
>             Project: Nutch
>          Issue Type: New Feature
>          Components: generator
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: AdaptiveFetchSchedule.patch, MimeAdaptiveFetchSchedule.java, NUTCH-1024-1.5-1.patch, NUTCH-1024-1.5-2.patch, NUTCH-1024-1.5-3.patch, Nutch.patch, adaptive-mimetypes.txt
>
>
> Add facility to configure default or fixed fetchInterval values by MIME-type. This is useful for conserving resources for files that are known to change frequently or never and everything in between.
> * simple key\tvalue\n configuration file
> * only set fetchInterval for new documents
> * keep max fetchInterval fixed by current config


        

[jira] [Updated] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

Posted by "Markus Jelsma (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1024:
---------------------------------

    Attachment: NUTCH-1024-1.5-1.patch

New patch for trunk! This also includes a change to the injector, where the injected fetchInterval is added to the CrawlDatum metadata. In AdaptiveFetchSchedule this injected interval overrides anything else.
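
A rough sketch of that override, assuming the metadata is modelled as a plain string map. The class InjectedIntervalSketch and the key "sketch.fixedInterval" are hypothetical and are not the names used in the patch. The point is only that an interval carried in the metadata wins over whatever the adaptive calculation produced.

```java
import java.util.Map;

/** Hypothetical sketch: an injected interval in the metadata overrides the adaptive one. */
final class InjectedIntervalSketch {

  /** Illustrative metadata key; the real patch defines its own key. */
  static final String FIXED_INTERVAL_KEY = "sketch.fixedInterval";

  /**
   * Returns the injected interval (seconds) when present and numeric,
   * otherwise the adaptively computed interval.
   */
  static int chooseInterval(Map<String, String> metadata, int adaptiveSeconds) {
    String injected = metadata.get(FIXED_INTERVAL_KEY);
    if (injected == null) {
      return adaptiveSeconds;
    }
    try {
      return Integer.parseInt(injected.trim());
    } catch (NumberFormatException e) {
      // unparsable value: fall back to the adaptive result
      return adaptiveSeconds;
    }
  }
}
```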
                
> Dynamically set fetchInterval by MIME-type
> ------------------------------------------
>
>                 Key: NUTCH-1024
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1024
>             Project: Nutch
>          Issue Type: New Feature
>          Components: generator
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: AdaptiveFetchSchedule.patch, MimeAdaptiveFetchSchedule.java, NUTCH-1024-1.5-1.patch, Nutch.patch, adaptive-mimetypes.txt
>
>
> Add facility to configure default or fixed fetchInterval values by MIME-type. This is useful for conserving resources for files that are known to change frequently or never and everything in between.
> * simple key\tvalue\n configuration file
> * only set fetchInterval for new documents
> * keep max fetchInterval fixed by current config


        

[jira] [Commented] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13090115#comment-13090115 ] 

Julien Nioche commented on NUTCH-1024:
--------------------------------------

There is a JIRA issue for 2.0 https://issues.apache.org/jira/browse/NUTCH-882, but I'd like to do it in 1.4

We've talked about processing sitemaps on the mailing lists for some time and now have crawler-commons to help us with the parsing. Entries in sitemaps have some info about how frequently they are likely to be modified so it is somewhat related to this issue.
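
To illustrate how the sitemap hint relates to a fetch interval, here is a small sketch that maps the changefreq values defined by the sitemap protocol to an interval in seconds. The class ChangeFreqSketch, the chosen second counts, and the clamping to a configured minimum and maximum are assumptions for the example; neither crawler-commons nor Nutch is used here.

```java
import java.util.Locale;

/** Hypothetical sketch: turn a sitemap changefreq hint into a fetch interval. */
final class ChangeFreqSketch {

  /**
   * Maps a changefreq value to seconds, clamped to [minSeconds, maxSeconds].
   * Missing or unknown values yield defaultSeconds.
   */
  static int intervalSeconds(String changefreq, int minSeconds, int maxSeconds,
      int defaultSeconds) {
    if (changefreq == null) {
      return defaultSeconds;
    }
    long hint;
    switch (changefreq.trim().toLowerCase(Locale.ROOT)) {
      case "always":  hint = minSeconds;     break; // refetch as often as allowed
      case "hourly":  hint = 3600L;          break;
      case "daily":   hint = 86400L;         break;
      case "weekly":  hint = 7 * 86400L;     break;
      case "monthly": hint = 30 * 86400L;    break; // approximate month
      case "yearly":  hint = 365 * 86400L;   break;
      case "never":   hint = maxSeconds;     break; // still bounded by the configured max
      default:        return defaultSeconds;
    }
    return (int) Math.max(minSeconds, Math.min(maxSeconds, hint));
  }
}
```

Clamping keeps the hint inside the configured bounds, in line with the issue's requirement that the maximum fetchInterval stays fixed by the current config.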

> Dynamically set fetchInterval by MIME-type
> ------------------------------------------
>
>                 Key: NUTCH-1024
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1024
>             Project: Nutch
>          Issue Type: New Feature
>          Components: generator
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: AdaptiveFetchSchedule.patch, MimeAdaptiveFetchSchedule.java, Nutch.patch, adaptive-mimetypes.txt
>
>
> Add facility to configure default or fixed fetchInterval values by MIME-type. This is useful for conserving resources for files that are known to change frequently or never and everything in between.
> * simple key\tvalue\n configuration file
> * only set fetchInterval for new documents
> * keep max fetchInterval fixed by current config


        

[jira] [Commented] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293543#comment-13293543 ] 

Hudson commented on NUTCH-1024:
-------------------------------

Integrated in nutch-trunk-maven #310 (See [https://builds.apache.org/job/nutch-trunk-maven/310/])
    NUTCH-1024 Dynamically set fetchInterval by MIME-type (Revision 1349226)

     Result = SUCCESS
markus : 
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/adaptive-mimetypes.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java
* /nutch/trunk/src/java/org/apache/nutch/crawl/MimeAdaptiveFetchSchedule.java
* /nutch/trunk/src/java/org/apache/nutch/metadata/HttpHeaders.java
* /nutch/trunk/src/java/org/apache/nutch/metadata/Nutch.java

                
> Dynamically set fetchInterval by MIME-type
> ------------------------------------------
>
>                 Key: NUTCH-1024
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1024
>             Project: Nutch
>          Issue Type: New Feature
>          Components: generator
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.6
>
>         Attachments: AdaptiveFetchSchedule.patch, MimeAdaptiveFetchSchedule.java, NUTCH-1024-1.5-1.patch, NUTCH-1024-1.5-2.patch, NUTCH-1024-1.5-3.patch, Nutch.patch, adaptive-mimetypes.txt
>
>
> Add facility to configure default or fixed fetchInterval values by MIME-type. This is useful for conserving resources for files that are known to change frequently or never and everything in between.
> * simple key\tvalue\n configuration file
> * only set fetchInterval for new documents
> * keep max fetchInterval fixed by current config
