Posted to droids-dev@incubator.apache.org by "Mingfai Ma (JIRA)" <ji...@apache.org> on 2009/04/19 11:18:47 UTC
[jira] Created: (DROIDS-48) Support prioritizing in the TaskQueue
Support prioritizing in the TaskQueue
-------------------------------------
Key: DROIDS-48
URL: https://issues.apache.org/jira/browse/DROIDS-48
Project: Droids
Issue Type: New Feature
Components: core
Affects Versions: 0.01
Reporter: Mingfai Ma
Use case:
- When crawling a directory (imagine someone who doesn't know the dmoz database can be downloaded and tries to crawl it with Droids), we collect a lot of links that will be handled later. Assume the requirement is to fetch the dmoz directory plus one link outside dmoz.org. In the original mechanism, it keeps adding new links to the TaskQueue. Ideally, there should be a mechanism to give a higher priority to the non-dmoz.org links, so that when non-dmoz links are added, they are processed first and removed from the TaskQueue asap.
With the patch in DROIDS-47, a constructor is added to the SimpleTaskQueue to support a custom Queue. This issue suggests changing the SimpleTaskQueue to use a PriorityBlockingQueue by default, and adding a getWeight to the Task interface.
I'm also thinking about a more complex TaskQueue, to be discussed on the mailing list later.
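To make the use case concrete, here is a minimal sketch of how a weight could be derived from a link's host, so that links outside dmoz.org are preferred. The class name DmozWeights, the method weightFor and the two weight values are hypothetical and are not part of Droids or of any attached patch.

```java
import java.net.URI;
import java.util.Locale;

public class DmozWeights {

    // Hypothetical weight values; only their relative order matters.
    static final int INSIDE_WEIGHT = 0;
    static final int OUTSIDE_WEIGHT = 10;

    /** Returns a higher weight for links that leave dmoz.org. */
    static int weightFor(URI link) {
        String host = link.getHost();
        if (host == null) {
            // Relative links are assumed to stay inside the directory.
            return INSIDE_WEIGHT;
        }
        host = host.toLowerCase(Locale.ROOT);
        boolean inside = host.equals("dmoz.org") || host.endsWith(".dmoz.org");
        return inside ? INSIDE_WEIGHT : OUTSIDE_WEIGHT;
    }

    public static void main(String[] args) {
        System.out.println(weightFor(URI.create("http://www.dmoz.org/Arts/"))); // prints 0
        System.out.println(weightFor(URI.create("http://example.com/")));       // prints 10
    }
}
```

With a queue that serves higher weights first, the outside links would then leave the queue before the backlog of directory pages.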
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (DROIDS-48) Support prioritizing in the TaskQueue
Posted by "Mingfai Ma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/DROIDS-48?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mingfai Ma updated DROIDS-48:
-----------------------------
Attachment: DROIDS-48d2.patch
The previous patch broke a test case, which is now fixed. This patch changes only one file and should be applied on top of DROIDS-48d.patch.
[jira] Issue Comment Edited: (DROIDS-48) Support prioritizing in the TaskQueue
Posted by "Mingfai Ma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/DROIDS-48?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722486#action_12722486 ]
Mingfai Ma edited comment on DROIDS-48 at 6/21/09 9:50 PM:
-----------------------------------------------------------
Just came up with an even better design for weight.
Weighted interface:
{code}
public interface Weighted {
    int getWeight();
}
{code}
The Link/LinkTask, assuming it extends HashMap:
{code}
public class WeightedLink extends Link implements Weighted { // or LinkTask
    public int getWeight() {
        Object weight = this.get("weight");
        // a missing weight counts as 0, matching the default in WeightComparator
        return weight == null ? 0 : Integer.parseInt(String.valueOf(weight));
    }
}
{code}
WeightComparator:
{code}
import java.util.Comparator;

public class WeightComparator implements Comparator<Object> {
    public int compare(Object link1, Object link2) {
        // anything that does not implement Weighted gets weight 0
        int weight1 = link1 instanceof Weighted ? ((Weighted) link1).getWeight() : 0;
        int weight2 = link2 instanceof Weighted ? ((Weighted) link2).getWeight() : 0;
        // higher weight sorts first; Integer.compare avoids the overflow of weight2 - weight1
        return Integer.compare(weight2, weight1);
    }
}
{code}
Task Queue:
{code}
Queue<Object> queue = new PriorityBlockingQueue<Object>(10, new WeightComparator());
{code}
So Weighted becomes optional. Users who want to support weight implement Weighted and decide for themselves how to compute the weight.
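As a rough check of the design above, the following self-contained sketch puts the pieces together and shows the order in which a PriorityBlockingQueue hands elements back. WeightedTask and drainOrder are hypothetical helpers invented for this illustration; only Weighted and WeightComparator come from the comment.

```java
import java.util.Comparator;
import java.util.Queue;
import java.util.concurrent.PriorityBlockingQueue;

public class WeightedQueueDemo {

    interface Weighted {
        int getWeight();
    }

    /** Hypothetical task type that opts in to weighting. */
    static final class WeightedTask implements Weighted {
        final String name;
        final int weight;

        WeightedTask(String name, int weight) {
            this.name = name;
            this.weight = weight;
        }

        public int getWeight() {
            return weight;
        }

        @Override
        public String toString() {
            return name;
        }
    }

    /** Higher weight sorts first; objects that are not Weighted count as weight 0. */
    static final class WeightComparator implements Comparator<Object> {
        public int compare(Object a, Object b) {
            int wa = a instanceof Weighted ? ((Weighted) a).getWeight() : 0;
            int wb = b instanceof Weighted ? ((Weighted) b).getWeight() : 0;
            return Integer.compare(wb, wa);
        }
    }

    /** Adds all items to a priority queue and returns the order in which they are polled. */
    static String drainOrder(Object... items) {
        Queue<Object> queue = new PriorityBlockingQueue<Object>(10, new WeightComparator());
        for (Object item : items) {
            queue.add(item);
        }
        StringBuilder order = new StringBuilder();
        Object next;
        while ((next = queue.poll()) != null) {
            if (order.length() > 0) {
                order.append(',');
            }
            order.append(next);
        }
        return order.toString();
    }

    public static void main(String[] args) {
        // prints high,low,plain
        System.out.println(drainOrder(
                new WeightedTask("low", 1), "plain", new WeightedTask("high", 5)));
    }
}
```

The plain String does not implement Weighted, so it is treated as weight 0 and comes out last. This is what makes the interface optional.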
[jira] Updated: (DROIDS-48) Support prioritizing in the TaskQueue
Posted by "Mingfai Ma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/DROIDS-48?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mingfai Ma updated DROIDS-48:
-----------------------------
Attachment: (was: DROIDS-48c.patch)
[jira] Updated: (DROIDS-48) Support prioritizing in the TaskQueue
Posted by "Mingfai Ma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/DROIDS-48?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mingfai Ma updated DROIDS-48:
-----------------------------
Attachment: DROIDS-48d.patch
Merged the code so that it applies to the current snapshot. No functional change.
[jira] Issue Comment Edited: (DROIDS-48) Support prioritizing in the TaskQueue
Posted by "Mingfai Ma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/DROIDS-48?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722486#action_12722486 ]
Mingfai Ma edited comment on DROIDS-48 at 6/21/09 9:52 PM:
-----------------------------------------------------------
Just came up with an even better design for weight.
Weighted interface:
{code}
public interface Weighted {
    int getWeight();
}
{code}
The Link/LinkTask, assuming it extends HashMap:
{code}
public class WeightedLink extends Link implements Weighted { // or LinkTask
    public int getWeight() {
        Object weight = this.get("weight");
        // a missing weight counts as 0, matching the default in WeightComparator
        return weight == null ? 0 : Integer.parseInt(String.valueOf(weight));
    }
}
{code}
WeightComparator:
{code}
import java.util.Comparator;

public class WeightComparator implements Comparator<Object> {
    public int compare(Object link1, Object link2) {
        // anything that does not implement Weighted gets weight 0
        int weight1 = link1 instanceof Weighted ? ((Weighted) link1).getWeight() : 0;
        int weight2 = link2 instanceof Weighted ? ((Weighted) link2).getWeight() : 0;
        // higher weight sorts first; Integer.compare avoids the overflow of weight2 - weight1
        return Integer.compare(weight2, weight1);
    }
}
{code}
Task Queue:
{code}
Queue<Object> queue = new PriorityBlockingQueue<Object>(10, new WeightComparator());
{code}
So Weighted becomes optional. Users who want to support weight implement Weighted and decide for themselves how to compute the weight.
P.S. I'm designing a filter framework that works in a broader sense than the URL filter. The Weighted interface is actually designed to cater for the ordering of Filters as well.
[jira] Updated: (DROIDS-48) Support prioritizing in the TaskQueue
Posted by "Mingfai Ma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/DROIDS-48?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mingfai Ma updated DROIDS-48:
-----------------------------
Attachment: (was: DROIDS-48b.patch)
[jira] Updated: (DROIDS-48) Support prioritizing in the TaskQueue
Posted by "Mingfai Ma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/DROIDS-48?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mingfai Ma updated DROIDS-48:
-----------------------------
Attachment: (was: DROIDS-48.patch)
[jira] Commented: (DROIDS-48) Support prioritizing in the TaskQueue
Posted by "Mingfai Ma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/DROIDS-48?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722486#action_12722486 ]
Mingfai Ma commented on DROIDS-48:
----------------------------------
Just came up with an even better design for weight.
Weighted interface:
{code}
public interface Weighted {
    int getWeight();
}
{code}
The Link/LinkTask, assuming it extends HashMap:
{code}
public class WeightedLink extends Link implements Weighted { // or LinkTask
    public int getWeight() {
        Object weight = this.get("weight");
        // a missing weight counts as 0, matching the default in WeightComparator
        return weight == null ? 0 : Integer.parseInt(String.valueOf(weight));
    }
}
{code}
WeightComparator:
{code}
import java.util.Comparator;

public class WeightComparator implements Comparator<Object> {
    public int compare(Object link1, Object link2) {
        // anything that does not implement Weighted gets weight 0
        int weight1 = link1 instanceof Weighted ? ((Weighted) link1).getWeight() : 0;
        int weight2 = link2 instanceof Weighted ? ((Weighted) link2).getWeight() : 0;
        // higher weight sorts first; Integer.compare avoids the overflow of weight2 - weight1
        return Integer.compare(weight2, weight1);
    }
}
{code}
Task Queue:
{code}
Queue<Object> queue = new PriorityBlockingQueue<Object>(10, new WeightComparator());
{code}
So Weighted becomes optional. Users who want to support weight implement Weighted and decide for themselves how to compute the weight.
[jira] Commented: (DROIDS-48) Support prioritizing in the TaskQueue
Posted by "Mingfai Ma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/DROIDS-48?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12710225#action_12710225 ]
Mingfai Ma commented on DROIDS-48:
----------------------------------
Any comments on this feature?
Could a weight field be added to Task? Or could Task be enhanced to support a map of custom data? Without adding weight to the Task interface, this feature cannot be implemented.
For the Queue, there are different options:
1. include it in SimpleTaskQueue as provided in this patch, or
2. make a separate TaskQueue implementation, e.g. PrioritizedTaskQueue, or
3. do not include it in the distribution (maybe provide it as an example)
Regarding 1 versus 2, the so-called prioritization is not too complex, so I think it is ok to include it in SimpleTaskQueue rather than separating it into another queue, if it is to be included in the dist at all.
[jira] Updated: (DROIDS-48) Support prioritizing in the TaskQueue
Posted by "Mingfai Ma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/DROIDS-48?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mingfai Ma updated DROIDS-48:
-----------------------------
Attachment: DROIDS-48.patch
The patch changes quite a number of files, but it all comes down to:
- added int getWeight() to Task
remarks: a LinkTask consumes 72 bytes per instance in a sample test. If the servers do not handle links fast enough, LinkTasks will keep accumulating in memory. Just a quick calculation (maybe wrong): 1.5G of memory could hold about 20M LinkTasks. It is preferable to minimize the number of fields in a LinkTask and to use the smallest field types (int instead of long).
- changed the SimpleTaskQueue from using ConcurrentLinkedQueue to PriorityBlockingQueue by default. Notice that there is a constructor for the user to provide a Queue, so it's not necessary to provide more configuration options such as providing a comparator. (There is no harm in doing so, however.)
- notice that the method for FileTask is not implemented. Not sure if a FileTask needs a weight.
How it works:
- when a task is added to the queue, the queue checks the weight to decide whether the task should be positioned at the top.
- if two tasks have the same weight, the older one goes first.
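One point worth spelling out: PriorityBlockingQueue on its own makes no guarantee about the order of elements with equal priority, so "the older one goes first" needs an explicit tie-breaker. How the attached patch achieves this is not shown in this thread. The sketch below uses a common technique, an insertion sequence number as a secondary sort key. The names FifoWeightedQueue, Entry and drainOrder are hypothetical.

```java
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

public class FifoWeightedQueue {

    /** Wraps a payload with its weight and an insertion sequence number. */
    static final class Entry implements Comparable<Entry> {
        private static final AtomicLong SEQUENCE = new AtomicLong();

        final String payload;
        final int weight;
        final long seq = SEQUENCE.getAndIncrement();

        Entry(String payload, int weight) {
            this.payload = payload;
            this.weight = weight;
        }

        public int compareTo(Entry other) {
            // Higher weight first.
            int byWeight = Integer.compare(other.weight, weight);
            // Equal weight: the entry created earlier goes first.
            return byWeight != 0 ? byWeight : Long.compare(seq, other.seq);
        }
    }

    /** Enqueues payloads with their weights and returns the order in which they are polled. */
    static String drainOrder(String[] payloads, int[] weights) {
        PriorityBlockingQueue<Entry> queue = new PriorityBlockingQueue<Entry>();
        for (int i = 0; i < payloads.length; i++) {
            queue.add(new Entry(payloads[i], weights[i]));
        }
        StringBuilder order = new StringBuilder();
        Entry next;
        while ((next = queue.poll()) != null) {
            if (order.length() > 0) {
                order.append(',');
            }
            order.append(next.payload);
        }
        return order.toString();
    }

    public static void main(String[] args) {
        // prints b,d,a,c
        System.out.println(drainOrder(
                new String[] {"a", "b", "c", "d"}, new int[] {1, 5, 1, 5}));
    }
}
```

The extra long per entry costs memory, which matters given the remark above about keeping LinkTask small. A timestamp field would serve the same purpose but can collide for tasks created within the same clock tick.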
[jira] Updated: (DROIDS-48) Support prioritizing in the TaskQueue
Posted by "Mingfai Ma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/DROIDS-48?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mingfai Ma updated DROIDS-48:
-----------------------------
Attachment: DROIDS-48c.patch
The previous patch had the order reversed: the higher the weight, the sooner a task should be processed.
[jira] Commented: (DROIDS-48) Support prioritizing in the TaskQueue
Posted by "Mingfai Ma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/DROIDS-48?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721121#action_12721121 ]
Mingfai Ma commented on DROIDS-48:
----------------------------------
Let me submit another patch. I have a habit of using my IDE's formatter, but I haven't set it up to use this project's coding style, so... :-P
P.S. For this issue, it could be handled just by adding a weight integer field. But I feel it is most flexible if the LinkTask could hold any arbitrary data, and the simplest way is to make it extend Map.
{code}
public class LinkTask extends HashMap<String, Serializable> { // other interfaces are skipped
    protected final String id; // whatever data type for ID
    protected final URI uri;   // refer to DROIDS-52, this may cause problems for URI
    // all the other data are optional
    // constructor and other members omitted
}
{code}
Use cases:
- say, when submitting a link, we want to associate information about cookies/HTTP headers, so the fetcher could use the cookie info when fetching
- any optional field like weight could be used
- any component, such as a filter or parser or whatever, could mark a link with an arbitrary tag. Say, a parser/factory may read a "parser"/"contentType" value to decide how the data should be parsed (so the parser doesn't depend on HttpEntity in its interface), or the outlinks could be attached directly to a LinkTask.
I'm throwing the initial idea out here to see if anyone has comments. More details on the implementation could be provided.
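To illustrate the map-based idea, here is a minimal compilable sketch. It is not the Droids LinkTask: the class name MapBackedLinkTask, its constructor, and the "weight" and "contentType" keys are assumptions made for this example.

```java
import java.io.Serializable;
import java.net.URI;
import java.util.HashMap;

public class MapBackedLinkTask extends HashMap<String, Serializable> {
    private static final long serialVersionUID = 1L;

    // The two mandatory fields from the snippet above.
    protected final String id;
    protected final URI uri;

    public MapBackedLinkTask(String id, URI uri) {
        this.id = id;
        this.uri = uri;
    }

    public String getId() {
        return id;
    }

    public URI getUri() {
        return uri;
    }

    /** Reads the optional "weight" entry; a missing or non-numeric value counts as 0. */
    public int getWeight() {
        Serializable weight = get("weight");
        return weight instanceof Number ? ((Number) weight).intValue() : 0;
    }

    public static void main(String[] args) {
        MapBackedLinkTask task = new MapBackedLinkTask("1", URI.create("http://example.com/"));
        task.put("weight", 5);                // optional data, used by the queue
        task.put("contentType", "text/html"); // arbitrary tag set by some component
        System.out.println(task.getWeight()); // prints 5
    }
}
```

One design caveat: HashMap's equals and hashCode look only at the map entries, so two tasks with different ids but identical optional data would compare as equal unless those methods are overridden. Each instance also carries the overhead of a hash table, which works against the goal of keeping LinkTask small.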
[jira] Updated: (DROIDS-48) Support prioritizing in the TaskQueue
Posted by "Mingfai Ma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/DROIDS-48?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mingfai Ma updated DROIDS-48:
-----------------------------
Attachment: DROIDS-48b.patch
The previous attachment was missing some files.
[jira] Commented: (DROIDS-48) Support prioritizing in the TaskQueue
Posted by "Thorsten Scherler (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/DROIDS-48?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721118#action_12721118 ]
Thorsten Scherler commented on DROIDS-48:
-----------------------------------------
I am fine with the feature but I have problems with the diff. It adds formatting changes to the code, which makes it hard to identify the real changes.
[jira] Commented: (DROIDS-48) Support prioritizing in the TaskQueue
Posted by "Thorsten Scherler (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/DROIDS-48?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721133#action_12721133 ]
Thorsten Scherler commented on DROIDS-48:
-----------------------------------------
Hmm, yeah, that sounds more flexible for the future and I see the point of storing related info. I like it.