Posted to dev@nutch.apache.org by "Lewis John McGibbney (JIRA)" <ji...@apache.org> on 2012/09/04 15:59:07 UTC

[jira] [Created] (NUTCH-1465) Support sitemaps in Nutch

Lewis John McGibbney created NUTCH-1465:
-------------------------------------------

             Summary: Support sitemaps in Nutch
                 Key: NUTCH-1465
                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
             Project: Nutch
          Issue Type: New Feature
          Components: parser
            Reporter: Lewis John McGibbney
             Fix For: 1.6, 2.1


I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps as per the discussion here[1].

[0] http://sourceforge.net/projects/sitemap-parser/
[1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

Posted by "Lewis John McGibbney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13448596#comment-13448596 ] 

Lewis John McGibbney commented on NUTCH-1465:
---------------------------------------------

Hi Ken,
> I could start a thread, but I also don't want to flog a dead horse

I thought there had been renewed interest over @ CC but it looks like this is not the case. So I guess that we can progress with moving the sitemap-parser into Nutch. People from the community have said they would like it, so I see no reason not to. The canonical tag topic also came up again in the thread I cited above (and there are already issues logged on our Jira for it), so it will be interesting to see what the code contains.
                

[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

Posted by "Lewis John McGibbney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13447708#comment-13447708 ] 

Lewis John McGibbney commented on NUTCH-1465:
---------------------------------------------

I think I can envisage the next comment on this thread... this is yet another reason to use crawler commons :0)
Ken, I wonder if you would be so kind as to start a thread over on dev@nutch regarding the atmosphere over @ CC... I had thought we were flogging a dead horse with this conversation, but duplicating issues over here that are quite clearly covered in CC seems rather ridiculous.
                

[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13448774#comment-13448774 ] 

Ken Krugler commented on NUTCH-1465:
------------------------------------

Hi Lewis,

Just to be clear, I think the dead horse is trying to get people interested in porting their code to crawler-commons, and then switching existing functionality to rely on cc.

For anything new (like sitemap parsing) I think it's a no-brainer to use cc, unless the API is totally borked. E.g. if you didn't, then you wouldn't have picked up our BOM fix.

-- Ken
                

[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13447700#comment-13447700 ] 

Ken Krugler commented on NUTCH-1465:
------------------------------------

The sitemap parsing code referenced in the discussion you note has been placed in crawler-commons. We just finished using it during a crawl (fixed one bug, dealing with sitemaps that have a BOM) and it worked fine for the sites we were crawling.
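
To illustrate the kind of BOM problem mentioned above, here is a minimal sketch, not the actual crawler-commons fix. It assumes the sitemap arrives as raw bytes that may start with a UTF-8 byte order mark (EF BB BF), strips it, and then reads the <loc> entries with the JDK's DOM parser. The class and method names (SitemapBomSketch, stripUtf8Bom, extractLocs) are hypothetical and exist only for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical illustration only; this is not code from crawler-commons or Nutch.
public class SitemapBomSketch {

    // Drop a leading UTF-8 byte order mark (EF BB BF) if one is present.
    static byte[] stripUtf8Bom(byte[] content) {
        if (content.length >= 3
                && (content[0] & 0xFF) == 0xEF
                && (content[1] & 0xFF) == 0xBB
                && (content[2] & 0xFF) == 0xBF) {
            return Arrays.copyOfRange(content, 3, content.length);
        }
        return content;
    }

    // Collect the text of every <loc> element in an XML sitemap.
    static List<String> extractLocs(byte[] content) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(stripUtf8Bom(content)));
        NodeList locs = doc.getElementsByTagName("loc");
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < locs.getLength(); i++) {
            urls.add(locs.item(i).getTextContent().trim());
        }
        return urls;
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = ("<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
                + "<url><loc>http://example.com/</loc></url>"
                + "</urlset>").getBytes(StandardCharsets.UTF_8);

        // Prepend a BOM to simulate a sitemap served with one.
        byte[] withBom = new byte[xml.length + 3];
        withBom[0] = (byte) 0xEF;
        withBom[1] = (byte) 0xBB;
        withBom[2] = (byte) 0xBF;
        System.arraycopy(xml, 0, withBom, 3, xml.length);

        for (String url : extractLocs(withBom)) {
            System.out.println(url);
        }
    }
}
```

Some XML parsers tolerate a BOM on their own; stripping it up front mainly protects code paths that inspect the first bytes or decode the content to a String before parsing.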
                

[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13447797#comment-13447797 ] 

Ken Krugler commented on NUTCH-1465:
------------------------------------

Hi Lewis - I could start a thread, but I also don't want to flog a dead horse :)

I'm spending occasional small amounts of time trying to move code from Bixo over to CC, and the plan is for the 0.9 release of Bixo to switch over to using CC where possible.

But the lack of excitement among Droids, Heritrix, Common Crawl, Nutch, etc. has made it pretty clear that getting widespread adoption would be an uphill battle, one that I don't currently have the time to fight.

-- Ken
                

[jira] [Updated] (NUTCH-1465) Support sitemaps in Nutch

Posted by "Lewis John McGibbney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1465:
----------------------------------------

    Fix Version/s:     (was: 2.1)
                   2.2
    

[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

Posted by "Lewis John McGibbney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13448816#comment-13448816 ] 

Lewis John McGibbney commented on NUTCH-1465:
---------------------------------------------

So CC it is for sitemap parsing support in Nutch :0) 
                