Posted to droids-dev@incubator.apache.org by "Eugen Paraschiv (JIRA)" <ji...@apache.org> on 2011/05/24 19:44:47 UTC

[jira] [Created] (DROIDS-144) The AlreadyVisitedFilter should not ignore the parameters of the URI

The AlreadyVisitedFilter should not ignore the parameters of the URI
--------------------------------------------------------------------

                 Key: DROIDS-144
                 URL: https://issues.apache.org/jira/browse/DROIDS-144
             Project: Droids
          Issue Type: Improvement
          Components: core
    Affects Versions: 0.0.2
            Reporter: Eugen Paraschiv
             Fix For: 0.0.2


This filter strips the parameters from the URI and stores only the resulting URI as the key in its visited map. This severely limits the filter: distinct URIs that differ only in their parameters are treated as already visited and skipped, when in fact they have not been crawled.
An example - these are pages to be crawled: 
http://www.domain.com/abc/?page=0&start=
http://www.domain.com/abc/?page=1&start=
Once the first one is analyzed, only the host and path are considered:
http://www.domain.com/abc/
and so the second URI will be rejected as already visited, when in fact it's a completely new page. 
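
The difference between the two keying strategies can be sketched as follows. This is a minimal illustration only, not the actual AlreadyVisitedFilter code and not the attached patch: the class name VisitedSketch, the keepQuery flag, and the key/accept methods are all hypothetical, and the real filter's interface may differ.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; NOT the Droids AlreadyVisitedFilter implementation.
class VisitedSketch {
    private final Set<String> visited = new HashSet<>();
    private final boolean keepQuery;

    // keepQuery = false mimics the reported behaviour (parameters stripped);
    // keepQuery = true keeps the query string as part of the key.
    VisitedSketch(boolean keepQuery) {
        this.keepQuery = keepQuery;
    }

    // Builds the key under which a URI is remembered.
    String key(URI uri) {
        String base = uri.getScheme() + "://" + uri.getAuthority() + uri.getPath();
        String query = uri.getRawQuery();
        return (keepQuery && query != null) ? base + "?" + query : base;
    }

    // Returns true the first time a key is seen, false on every later visit.
    boolean accept(URI uri) {
        return visited.add(key(uri));
    }
}
```

With keepQuery set to false, the two example URIs above map to the same key, so the second one is rejected; with keepQuery set to true, each gets its own key and both are accepted.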

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (DROIDS-144) The AlreadyVisitedFilter should not ignore the parameters of the URI

Posted by "Eugen Paraschiv (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/DROIDS-144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eugen Paraschiv updated DROIDS-144:
-----------------------------------

    Attachment: DROIDS-144.patch



[jira] [Updated] (DROIDS-144) The AlreadyVisitedFilter should not ignore the parameters of the URI

Posted by "Richard Frovarp (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/DROIDS-144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Frovarp updated DROIDS-144:
-----------------------------------

    Fix Version/s:     (was: 0.2.0)
                   0.3.0
    
> The AlreadyVisitedFilter should not ignore the parameters of the URI
> --------------------------------------------------------------------
>
>                 Key: DROIDS-144
>                 URL: https://issues.apache.org/jira/browse/DROIDS-144
>             Project: Droids
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 0.2.0
>            Reporter: Eugen Paraschiv
>             Fix For: 0.3.0
>
>         Attachments: DROIDS-144.patch
