Posted to droids-dev@incubator.apache.org by "Mingfai Ma (JIRA)" <ji...@apache.org> on 2009/04/19 11:18:47 UTC

[jira] Created: (DROIDS-48) Support prioritizing in the TaskQueue

Support prioritizing in the TaskQueue
-------------------------------------

                 Key: DROIDS-48
                 URL: https://issues.apache.org/jira/browse/DROIDS-48
             Project: Droids
          Issue Type: New Feature
          Components: core
    Affects Versions: 0.01
            Reporter: Mingfai Ma


Use case:
 - When crawling a directory (imagine someone who doesn't know that the dmoz database can be downloaded and tries to crawl it with Droids), we collect a lot of links that will be handled later. Assume the requirement is to fetch the dmoz directory plus one link outside dmoz.org. With the original mechanism, new links simply keep being added to the TaskQueue. Ideally, there should be a mechanism to give higher priority to the non-dmoz.org links, so that when non-dmoz links are added they are processed first and removed from the TaskQueue as soon as possible.

With the patch in DROIDS-47, a constructor is added to SimpleTaskQueue to support a custom Queue. This issue proposes changing SimpleTaskQueue to use a PriorityBlockingQueue by default and adding a getWeight method to the Task interface.

I'm also thinking about a more complex TaskQueue, to be discussed on the mailing list later.
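
A minimal sketch of the use case, not the code from the patch: DemoTask, weightFor and the weight values 0 and 10 are hypothetical names and numbers chosen for illustration. It only shows that a PriorityBlockingQueue with a weight comparator hands out an external link before dmoz.org links that were queued earlier.

```java
import java.net.URI;
import java.util.Comparator;
import java.util.concurrent.PriorityBlockingQueue;

public class DmozPriorityDemo {

    /** Hypothetical stand-in for a Droids task; only the URI and a weight are modelled. */
    static final class DemoTask {
        final URI uri;
        final int weight;

        DemoTask(URI uri, int weight) {
            this.uri = uri;
            this.weight = weight;
        }
    }

    /** Assumed policy: links outside dmoz.org get the higher weight. */
    static int weightFor(URI uri) {
        String host = uri.getHost();
        boolean insideDmoz = host != null
                && (host.equals("dmoz.org") || host.endsWith(".dmoz.org"));
        return insideDmoz ? 0 : 10;
    }

    static PriorityBlockingQueue<DemoTask> newQueue() {
        // PriorityBlockingQueue returns the smallest element first,
        // so the comparator is reversed to put the highest weight at the head.
        Comparator<DemoTask> higherWeightFirst = new Comparator<DemoTask>() {
            public int compare(DemoTask a, DemoTask b) {
                return Integer.compare(b.weight, a.weight);
            }
        };
        return new PriorityBlockingQueue<DemoTask>(11, higherWeightFirst);
    }

    static void enqueue(PriorityBlockingQueue<DemoTask> queue, String link) {
        URI uri = URI.create(link);
        queue.add(new DemoTask(uri, weightFor(uri)));
    }

    public static void main(String[] args) {
        PriorityBlockingQueue<DemoTask> queue = newQueue();
        enqueue(queue, "http://www.dmoz.org/Computers/");
        enqueue(queue, "http://www.dmoz.org/Science/");
        enqueue(queue, "http://example.org/");
        // The external link is polled before the dmoz.org links queued earlier.
        System.out.println(queue.poll().uri);
    }
}
```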






-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (DROIDS-48) Support prioritizing in the TaskQueue

Posted by "Mingfai Ma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/DROIDS-48?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mingfai Ma updated DROIDS-48:
-----------------------------

    Attachment: DROIDS-48d2.patch

The previous patches broke a test case; this patch fixes it. It changes only one file and should be applied on top of DROIDS-48d.patch.

> Support prioritizing in the TaskQueue
> -------------------------------------
>
>                 Key: DROIDS-48
>                 URL: https://issues.apache.org/jira/browse/DROIDS-48
>             Project: Droids
>          Issue Type: New Feature
>          Components: core
>    Affects Versions: 0.01
>            Reporter: Mingfai Ma
>         Attachments: DROIDS-48d.patch, DROIDS-48d2.patch


[jira] Issue Comment Edited: (DROIDS-48) Support prioritizing in the TaskQueue

Posted by "Mingfai Ma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/DROIDS-48?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722486#action_12722486 ] 

Mingfai Ma edited comment on DROIDS-48 at 6/21/09 9:50 PM:
-----------------------------------------------------------

I just came up with an even better design for weight.

Weighted interface:
{code}
/** Implemented by anything that wants a say in its queue position. */
public interface Weighted {
    int getWeight();
}
{code}

The Link/LinkTask, assuming it extends HashMap:
{code}
public class WeightedLink extends Link implements Weighted { // or LinkTask
    public int getWeight() {
        // A missing "weight" entry counts as 0 instead of failing in parseInt("null").
        Object weight = this.get("weight");
        return weight == null ? 0 : Integer.parseInt(String.valueOf(weight));
    }
}
{code}

WeightComparator:
{code}
import java.util.Comparator;

public class WeightComparator implements Comparator<Object> {
    public int compare(Object link1, Object link2) {
        // Objects that do not implement Weighted are treated as weight 0.
        int weight1 = link1 instanceof Weighted ? ((Weighted) link1).getWeight() : 0;
        int weight2 = link2 instanceof Weighted ? ((Weighted) link2).getWeight() : 0;
        // Higher weight sorts first; Integer.compare avoids the overflow of weight2 - weight1.
        return Integer.compare(weight2, weight1);
    }
}
{code}

Task queue:
{code}
Queue<Task> queue = new PriorityBlockingQueue<Task>(10, new WeightComparator());
{code}

So weighting becomes optional. Users who want to support weight implement Weighted and decide for themselves how to weight.
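
To illustrate the "optional" part, here is a self-contained sketch that mixes an element that does not implement Weighted with two that do. UrgentTask and drain are hypothetical helpers, and the comparator uses Integer.compare rather than subtraction; the weights are distinct so the drain order does not depend on how ties are resolved.

```java
import java.util.Comparator;
import java.util.concurrent.PriorityBlockingQueue;

public class OptionalWeightDemo {

    interface Weighted {
        int getWeight();
    }

    /** Treats anything that does not implement Weighted as weight 0. */
    static final class WeightComparator implements Comparator<Object> {
        public int compare(Object a, Object b) {
            int wa = a instanceof Weighted ? ((Weighted) a).getWeight() : 0;
            int wb = b instanceof Weighted ? ((Weighted) b).getWeight() : 0;
            return Integer.compare(wb, wa); // higher weight first
        }
    }

    /** Hypothetical task that opts in to weighting. */
    static final class UrgentTask implements Weighted {
        final String name;
        final int weight;

        UrgentTask(String name, int weight) {
            this.name = name;
            this.weight = weight;
        }

        public int getWeight() {
            return weight;
        }

        public String toString() {
            return name;
        }
    }

    /** Polls until the queue is empty and joins the elements with commas. */
    static String drain(PriorityBlockingQueue<Object> queue) {
        StringBuilder order = new StringBuilder();
        Object next;
        while ((next = queue.poll()) != null) {
            if (order.length() > 0) {
                order.append(',');
            }
            order.append(next);
        }
        return order.toString();
    }

    public static void main(String[] args) {
        PriorityBlockingQueue<Object> queue =
                new PriorityBlockingQueue<Object>(10, new WeightComparator());
        queue.add("plain");                          // not Weighted, treated as 0
        queue.add(new UrgentTask("urgent", 5));
        queue.add(new UrgentTask("background", -1));
        System.out.println(drain(queue));            // urgent,plain,background
    }
}
```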

> Support prioritizing in the TaskQueue
> -------------------------------------
>
>                 Key: DROIDS-48
>                 URL: https://issues.apache.org/jira/browse/DROIDS-48
>             Project: Droids
>          Issue Type: New Feature
>          Components: core
>    Affects Versions: 0.01
>            Reporter: Mingfai Ma
>         Attachments: DROIDS-48d.patch, DROIDS-48d2.patch


[jira] Updated: (DROIDS-48) Support prioritizing in the TaskQueue

Posted by "Mingfai Ma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/DROIDS-48?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mingfai Ma updated DROIDS-48:
-----------------------------

    Attachment:     (was: DROIDS-48c.patch)

> Support prioritizing in the TaskQueue
> -------------------------------------
>
>                 Key: DROIDS-48
>                 URL: https://issues.apache.org/jira/browse/DROIDS-48
>             Project: Droids
>          Issue Type: New Feature
>          Components: core
>    Affects Versions: 0.01
>            Reporter: Mingfai Ma
>         Attachments: DROIDS-48d.patch, DROIDS-48d2.patch


[jira] Updated: (DROIDS-48) Support prioritizing in the TaskQueue

Posted by "Mingfai Ma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/DROIDS-48?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mingfai Ma updated DROIDS-48:
-----------------------------

    Attachment: DROIDS-48d.patch

Merged the code so that the patch applies to the current snapshot. No functional change.

> Support prioritizing in the TaskQueue
> -------------------------------------
>
>                 Key: DROIDS-48
>                 URL: https://issues.apache.org/jira/browse/DROIDS-48
>             Project: Droids
>          Issue Type: New Feature
>          Components: core
>    Affects Versions: 0.01
>            Reporter: Mingfai Ma
>         Attachments: DROIDS-48b.patch, DROIDS-48c.patch, DROIDS-48d.patch


[jira] Issue Comment Edited: (DROIDS-48) Support prioritizing in the TaskQueue

Posted by "Mingfai Ma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/DROIDS-48?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722486#action_12722486 ] 

Mingfai Ma edited comment on DROIDS-48 at 6/21/09 9:52 PM:
-----------------------------------------------------------

P.S. I'm designing a filter framework that works in a broader sense than the URL filter. The Weighted interface is also designed to cater for the ordering of Filters.

> Support prioritizing in the TaskQueue
> -------------------------------------
>
>                 Key: DROIDS-48
>                 URL: https://issues.apache.org/jira/browse/DROIDS-48
>             Project: Droids
>          Issue Type: New Feature
>          Components: core
>    Affects Versions: 0.01
>            Reporter: Mingfai Ma
>         Attachments: DROIDS-48d.patch, DROIDS-48d2.patch


[jira] Updated: (DROIDS-48) Support prioritizing in the TaskQueue

Posted by "Mingfai Ma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/DROIDS-48?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mingfai Ma updated DROIDS-48:
-----------------------------

    Attachment:     (was: DROIDS-48b.patch)

> Support prioritizing in the TaskQueue
> -------------------------------------
>
>                 Key: DROIDS-48
>                 URL: https://issues.apache.org/jira/browse/DROIDS-48
>             Project: Droids
>          Issue Type: New Feature
>          Components: core
>    Affects Versions: 0.01
>            Reporter: Mingfai Ma
>         Attachments: DROIDS-48d.patch, DROIDS-48d2.patch


[jira] Updated: (DROIDS-48) Support prioritizing in the TaskQueue

Posted by "Mingfai Ma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/DROIDS-48?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mingfai Ma updated DROIDS-48:
-----------------------------

    Attachment:     (was: DROIDS-48.patch)

> Support prioritizing in the TaskQueue
> -------------------------------------
>
>                 Key: DROIDS-48
>                 URL: https://issues.apache.org/jira/browse/DROIDS-48
>             Project: Droids
>          Issue Type: New Feature
>          Components: core
>    Affects Versions: 0.01
>            Reporter: Mingfai Ma
>         Attachments: DROIDS-48b.patch


[jira] Commented: (DROIDS-48) Support prioritizing in the TaskQueue

Posted by "Mingfai Ma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/DROIDS-48?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722486#action_12722486 ] 

Mingfai Ma commented on DROIDS-48:
----------------------------------

I just came up with an even better design for weight.

Weighted interface:
{code}
/** Implemented by anything that wants a say in its queue position. */
public interface Weighted {
    int getWeight();
}
{code}

The Link/LinkTask, assuming it extends HashMap:
{code}
public class WeightedLink extends Link implements Weighted { // or LinkTask
    public int getWeight() {
        // A missing "weight" entry counts as 0 instead of failing in parseInt("null").
        Object weight = this.get("weight");
        return weight == null ? 0 : Integer.parseInt(String.valueOf(weight));
    }
}
{code}

WeightComparator:
{code}
import java.util.Comparator;

public class WeightComparator implements Comparator<Object> {
    public int compare(Object link1, Object link2) {
        // Objects that do not implement Weighted are treated as weight 0.
        int weight1 = link1 instanceof Weighted ? ((Weighted) link1).getWeight() : 0;
        int weight2 = link2 instanceof Weighted ? ((Weighted) link2).getWeight() : 0;
        // Higher weight sorts first; Integer.compare avoids the overflow of weight2 - weight1.
        return Integer.compare(weight2, weight1);
    }
}
{code}

Task queue:
{code}
Queue<Task> queue = new PriorityBlockingQueue<Task>(10, new WeightComparator());
{code}

So weighting becomes optional. Users who want to support weight implement Weighted and decide for themselves how to weight.

> Support prioritizing in the TaskQueue
> -------------------------------------
>
>                 Key: DROIDS-48
>                 URL: https://issues.apache.org/jira/browse/DROIDS-48
>             Project: Droids
>          Issue Type: New Feature
>          Components: core
>    Affects Versions: 0.01
>            Reporter: Mingfai Ma
>         Attachments: DROIDS-48d.patch, DROIDS-48d2.patch


[jira] Commented: (DROIDS-48) Support prioritizing in the TaskQueue

Posted by "Mingfai Ma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/DROIDS-48?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12710225#action_12710225 ] 

Mingfai Ma commented on DROIDS-48:
----------------------------------

Any comments on this feature?

Could a weight field be added to Task? Or could Task be enhanced to support a map of custom data? Without adding weight to the Task interface, this feature cannot be implemented.

For the Queue, there are different options:
 1. include it in SimpleTaskQueue, as provided in this patch, or
 2. make a separate TaskQueue implementation, e.g. PrioritizedTaskQueue, or
 3. do not include it in the distribution (maybe provide it as an example)

Regarding 1 versus 2: the prioritization is not too complex, so I think it is OK to include it in SimpleTaskQueue rather than separate it into another queue, if it is to be included in the distribution at all.
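
A rough sketch of what option 1 could look like, under the assumption that the queue class simply delegates to an injected java.util.Queue. The Task interface, the SimpleTaskQueue shown here, its add/poll methods and NamedTask are simplified, hypothetical stand-ins, not the Droids classes. The point is that a prioritized default and a plain FIFO queue can share one class when the backing Queue is a constructor argument.

```java
import java.util.Comparator;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.PriorityBlockingQueue;

public class QueueOptionsSketch {

    /** Hypothetical minimal Task; the real interface has more to it. */
    interface Task {
        int getWeight();
    }

    static final Comparator<Task> HIGHER_WEIGHT_FIRST = new Comparator<Task>() {
        public int compare(Task a, Task b) {
            return Integer.compare(b.getWeight(), a.getWeight());
        }
    };

    /** Option 1: prioritized by default, but any Queue can be injected. */
    static class SimpleTaskQueue {
        private final Queue<Task> queue;

        SimpleTaskQueue() {
            this(new PriorityBlockingQueue<Task>(11, HIGHER_WEIGHT_FIRST));
        }

        SimpleTaskQueue(Queue<Task> queue) {
            this.queue = queue;
        }

        void add(Task task) {
            queue.add(task);
        }

        Task poll() {
            return queue.poll();
        }
    }

    /** Hypothetical task used only to make the ordering visible. */
    static final class NamedTask implements Task {
        final String name;
        final int weight;

        NamedTask(String name, int weight) {
            this.name = name;
            this.weight = weight;
        }

        public int getWeight() {
            return weight;
        }
    }

    public static void main(String[] args) {
        SimpleTaskQueue prioritized = new SimpleTaskQueue();
        prioritized.add(new NamedTask("low", 1));
        prioritized.add(new NamedTask("high", 9));
        System.out.println(((NamedTask) prioritized.poll()).name); // high

        // Callers who prefer plain FIFO behaviour inject their own queue.
        SimpleTaskQueue fifo = new SimpleTaskQueue(new ConcurrentLinkedQueue<Task>());
        fifo.add(new NamedTask("low", 1));
        fifo.add(new NamedTask("high", 9));
        System.out.println(((NamedTask) fifo.poll()).name);        // low
    }
}
```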




> Support prioritizing in the TaskQueue
> -------------------------------------
>
>                 Key: DROIDS-48
>                 URL: https://issues.apache.org/jira/browse/DROIDS-48
>             Project: Droids
>          Issue Type: New Feature
>          Components: core
>    Affects Versions: 0.01
>            Reporter: Mingfai Ma
>         Attachments: DROIDS-48b.patch, DROIDS-48c.patch


[jira] Updated: (DROIDS-48) Support prioritizing in the TaskQueue

Posted by "Mingfai Ma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/DROIDS-48?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mingfai Ma updated DROIDS-48:
-----------------------------

    Attachment: DROIDS-48.patch

The patch changes quite a number of files, but it all comes down to:
- Added int getWeight() to Task.
   Remarks: LinkTask consumes 72 bytes per instance in a sample test. If the servers do not handle links fast enough, LinkTask instances will keep accumulating in memory. As a quick calculation (maybe wrong), 1.5 GB of memory could hold about 20M LinkTask instances. It is preferable to minimize the number of fields in a LinkTask and to use the smallest field types (int instead of long).

- Changed SimpleTaskQueue from using ConcurrentLinkedQueue to PriorityBlockingQueue by default. Note that there is a constructor that lets the user provide a Queue, so it is not necessary to provide more configuration options such as a comparator (there is no harm in doing so, however).

- Note that the method for FileTask is not implemented. I'm not sure whether a FileTask needs a weight.
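
The quick calculation in the remarks can be checked with plain integer arithmetic. This sketch takes the 72-byte figure from the remark as given and counts only the per-instance size, so queue overhead and any objects a LinkTask references are ignored; the class and method names are made up for the example.

```java
public class LinkTaskMemoryEstimate {

    /** Number of whole tasks that fit, counting only the per-instance size. */
    static long tasksThatFit(long heapBytes, long bytesPerTask) {
        return heapBytes / bytesPerTask;
    }

    public static void main(String[] args) {
        long bytesPerTask = 72L;                  // figure quoted from the sample test
        long decimalHeap = 1_500_000_000L;        // 1.5 GB read as 1.5 * 10^9 bytes
        long binaryHeap = 3L * 512 * 1024 * 1024; // 1.5 GiB
        System.out.println(tasksThatFit(decimalHeap, bytesPerTask)); // 20833333
        System.out.println(tasksThatFit(binaryHeap, bytesPerTask));  // 22369621
    }
}
```

Either reading of "1.5G" lands a little above 20 million instances, which agrees with the estimate in the remark.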

How it works:
  - When a task is added to the queue, its weight is checked to decide whether it should be positioned at the top.
  - If two tasks have the same weight, the older one goes first.
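
One caveat on the second rule: java.util.concurrent.PriorityBlockingQueue makes no guarantee about the order of elements with equal priority, so "older one goes first" needs a secondary key. How the patch achieves this is not shown in this thread. The sketch below is one possible approach, with a hypothetical Entry class that records an insertion sequence number and a comparator that falls back to it on ties.

```java
import java.util.Comparator;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

public class FifoTieBreakDemo {

    private static final AtomicLong SEQUENCE = new AtomicLong();

    /** Hypothetical queue entry: a weight plus the order in which it was created. */
    static final class Entry {
        final String name;
        final int weight;
        final long seq = SEQUENCE.getAndIncrement();

        Entry(String name, int weight) {
            this.name = name;
            this.weight = weight;
        }
    }

    /** Higher weight first; on equal weight, the older entry (lower seq) first. */
    static final Comparator<Entry> ORDER = new Comparator<Entry>() {
        public int compare(Entry a, Entry b) {
            int byWeight = Integer.compare(b.weight, a.weight);
            return byWeight != 0 ? byWeight : Long.compare(a.seq, b.seq);
        }
    };

    /** Polls until the queue is empty and joins the entry names with commas. */
    static String drain(PriorityBlockingQueue<Entry> queue) {
        StringBuilder order = new StringBuilder();
        Entry next;
        while ((next = queue.poll()) != null) {
            if (order.length() > 0) {
                order.append(',');
            }
            order.append(next.name);
        }
        return order.toString();
    }

    public static void main(String[] args) {
        PriorityBlockingQueue<Entry> queue = new PriorityBlockingQueue<Entry>(11, ORDER);
        queue.add(new Entry("a", 1));
        queue.add(new Entry("b", 1));
        queue.add(new Entry("c", 5));
        queue.add(new Entry("d", 1));
        System.out.println(drain(queue)); // c,a,b,d
    }
}
```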




> Support prioritizing in the TaskQueue
> -------------------------------------
>
>                 Key: DROIDS-48
>                 URL: https://issues.apache.org/jira/browse/DROIDS-48
>             Project: Droids
>          Issue Type: New Feature
>          Components: core
>    Affects Versions: 0.01
>            Reporter: Mingfai Ma
>         Attachments: DROIDS-48.patch


[jira] Updated: (DROIDS-48) Support prioritizing in the TaskQueue

Posted by "Mingfai Ma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/DROIDS-48?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mingfai Ma updated DROIDS-48:
-----------------------------

    Attachment: DROIDS-48c.patch

The previous patch had the order reversed. The higher the weight, the sooner a task should be processed.

> Support prioritizing in the TaskQueue
> -------------------------------------
>
>                 Key: DROIDS-48
>                 URL: https://issues.apache.org/jira/browse/DROIDS-48
>             Project: Droids
>          Issue Type: New Feature
>          Components: core
>    Affects Versions: 0.01
>            Reporter: Mingfai Ma
>         Attachments: DROIDS-48b.patch, DROIDS-48c.patch


[jira] Commented: (DROIDS-48) Support prioritizing in the TaskQueue

Posted by "Mingfai Ma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/DROIDS-48?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721121#action_12721121 ] 

Mingfai Ma commented on DROIDS-48:
----------------------------------

Let me submit another patch. I have a habit of using my IDE's formatter, but I haven't set it up for this project's coding style, so... :-P

P.S. This issue could be handled just by adding an integer weight field, but I feel it is most flexible if the LinkTask can hold arbitrary data. The simplest way is to make it extend a Map implementation.

{code}
import java.io.Serializable;
import java.net.URI;
import java.util.HashMap;

public class LinkTask extends HashMap<String, Serializable> { // other interfaces are omitted
    protected final String id; // whatever data type suits the ID
    protected final URI uri;   // see DROIDS-52: URI may cause problems here

    public LinkTask(String id, URI uri) { this.id = id; this.uri = uri; }
    // all other data is optional and lives in the map entries
}
{code}

Use cases:
- Say, when submitting a link, we want to associate cookie/HTTP header information with it, so the fetcher can use the cookie info when fetching.
- Any optional field, such as weight, could be used.
- Any component, such as a filter or a parser, could attach an arbitrary tag to a link. Say, a parser/factory may read a "parser"/"contentType" value to decide how the data should be parsed (so the parser doesn't depend on HttpEntity in its interface), or the outlinks could be attached directly to a LinkTask.

I'm throwing the initial idea out here to see if anyone has comments. More details on the implementation can be provided.
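
The use cases above can be sketched as follows. This is an illustration under assumptions, not the proposed implementation: the nested LinkTask class, the key names "cookie", "weight" and "contentType", and the getWeight helper are all hypothetical.

```java
import java.io.Serializable;
import java.net.URI;
import java.util.HashMap;

public class LinkTaskMetadataDemo {

    /** Hypothetical map-backed task: id and uri are fixed, everything else is optional. */
    static class LinkTask extends HashMap<String, Serializable> {
        private static final long serialVersionUID = 1L;

        final String id;
        final URI uri;

        LinkTask(String id, URI uri) {
            this.id = id;
            this.uri = uri;
        }

        /** Optional weight; entries that are missing or not an Integer count as 0. */
        int getWeight() {
            Serializable weight = get("weight");
            return weight instanceof Integer ? ((Integer) weight).intValue() : 0;
        }
    }

    public static void main(String[] args) {
        LinkTask task = new LinkTask("1", URI.create("http://example.org/feed"));
        task.put("cookie", "session=abc");              // read by a fetcher
        task.put("weight", Integer.valueOf(5));         // read by the queue comparator
        task.put("contentType", "application/rss+xml"); // read by a parser factory

        System.out.println(task.getWeight());           // 5
        System.out.println(task.get("contentType"));    // application/rss+xml

        LinkTask bare = new LinkTask("2", URI.create("http://example.org/"));
        System.out.println(bare.getWeight());           // 0
    }
}
```

One design consequence worth keeping in mind: a class that extends HashMap inherits equals and hashCode from the map entries, so two tasks with different id or uri but identical entries compare as equal unless those methods are overridden.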

> Support prioritizing in the TaskQueue
> -------------------------------------
>
>                 Key: DROIDS-48
>                 URL: https://issues.apache.org/jira/browse/DROIDS-48
>             Project: Droids
>          Issue Type: New Feature
>          Components: core
>    Affects Versions: 0.01
>            Reporter: Mingfai Ma
>         Attachments: DROIDS-48d.patch, DROIDS-48d2.patch


[jira] Updated: (DROIDS-48) Support prioritizing in the TaskQueue

Posted by "Mingfai Ma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/DROIDS-48?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mingfai Ma updated DROIDS-48:
-----------------------------

    Attachment: DROIDS-48b.patch

The previous attachment was missing some files.

> Support prioritizing in the TaskQueue
> -------------------------------------
>
>                 Key: DROIDS-48
>                 URL: https://issues.apache.org/jira/browse/DROIDS-48
>             Project: Droids
>          Issue Type: New Feature
>          Components: core
>    Affects Versions: 0.01
>            Reporter: Mingfai Ma
>         Attachments: DROIDS-48b.patch



[jira] Commented: (DROIDS-48) Support prioritizing in the TaskQueue

Posted by "Thorsten Scherler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/DROIDS-48?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721118#action_12721118 ] 

Thorsten Scherler commented on DROIDS-48:
-----------------------------------------

I am fine with the feature, but I have problems with the diff. It adds formatting changes to the code, which makes it hard to identify the real changes.

> Support prioritizing in the TaskQueue
> -------------------------------------
>
>                 Key: DROIDS-48
>                 URL: https://issues.apache.org/jira/browse/DROIDS-48
>             Project: Droids
>          Issue Type: New Feature
>          Components: core
>    Affects Versions: 0.01
>            Reporter: Mingfai Ma
>         Attachments: DROIDS-48d.patch, DROIDS-48d2.patch



[jira] Commented: (DROIDS-48) Support prioritizing in the TaskQueue

Posted by "Thorsten Scherler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/DROIDS-48?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721133#action_12721133 ] 

Thorsten Scherler commented on DROIDS-48:
-----------------------------------------

Hmm, yeah, that sounds more flexible for the future, and I see the point of storing related info. I like it.

> Support prioritizing in the TaskQueue
> -------------------------------------
>
>                 Key: DROIDS-48
>                 URL: https://issues.apache.org/jira/browse/DROIDS-48
>             Project: Droids
>          Issue Type: New Feature
>          Components: core
>    Affects Versions: 0.01
>            Reporter: Mingfai Ma
>         Attachments: DROIDS-48d.patch, DROIDS-48d2.patch
