Posted to dev@nutch.apache.org by "Markus Jelsma (Created) (JIRA)" <ji...@apache.org> on 2012/04/02 22:05:22 UTC

[jira] [Created] (NUTCH-1323) AjaxNormalizer

AjaxNormalizer
--------------

                 Key: NUTCH-1323
                 URL: https://issues.apache.org/jira/browse/NUTCH-1323
             Project: Nutch
          Issue Type: New Feature
            Reporter: Markus Jelsma
            Assignee: Markus Jelsma
             Fix For: 1.6


A two-way normalizer for Nutch that handles AJAX URLs, converting them to _escaped_fragment_ URLs and back to AJAX URLs.

https://developers.google.com/webmasters/ajax-crawling/
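
To make the two directions concrete, here is a minimal sketch of the conversion described above. It is not the code from the attached patch: the class name AjaxUrlSketch and its methods are hypothetical, and the escaping is deliberately simplified to the few characters that would otherwise break the query string.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper for illustration only; not the AjaxNormalizer from the patch.
class AjaxUrlSketch {
  static final String HASHBANG = "#!";
  static final String ESCAPED = "_escaped_fragment_=";

  /** Converts a hashbang URL to its _escaped_fragment_ form; other URLs are returned unchanged. */
  static String escape(String url) {
    int pos = url.indexOf(HASHBANG);
    if (pos < 0) {
      return url;
    }
    String base = url.substring(0, pos);
    String fragment = url.substring(pos + HASHBANG.length());
    // Simplified escaping (assumption): only %, #, & and + are percent-encoded.
    String encoded = fragment.replace("%", "%25").replace("#", "%23")
        .replace("&", "%26").replace("+", "%2B");
    // Append to an existing query string if there is one.
    String separator = base.contains("?") ? "&" : "?";
    return base + separator + ESCAPED + encoded;
  }

  /** Converts an _escaped_fragment_ URL back to its hashbang form; other URLs are returned unchanged. */
  static String unescape(String url) {
    int pos = url.indexOf(ESCAPED);
    if (pos < 1) {
      return url;
    }
    char before = url.charAt(pos - 1);
    if (before != '?' && before != '&') {
      return url;
    }
    String base = url.substring(0, pos - 1);
    String encoded = url.substring(pos + ESCAPED.length());
    try {
      // URLDecoder turns '+' into a space, so protect literal plus signs first.
      String fragment = URLDecoder.decode(encoded.replace("+", "%2B"), "UTF-8");
      return base + HASHBANG + fragment;
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException("UTF-8 is always supported", e);
    }
  }
}
```

In this sketch everything after _escaped_fragment_= is treated as the fragment value, which assumes the parameter is the last one in the query string.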

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1323) AjaxNormalizer

Posted by "behnam nikbakht (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13274529#comment-13274529 ] 

behnam nikbakht commented on NUTCH-1323:
----------------------------------------

Yes, it works correctly. Thank you.
                

        

[jira] [Updated] (NUTCH-1323) AjaxNormalizer

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1323:
---------------------------------

    Attachment: NUTCH-1323-1.6-1.patch

Patch for 1.6. See unit tests for examples. Please comment. There must be something wrong as all tests pass. Any tests to add?
                

        

[jira] [Commented] (NUTCH-1323) AjaxNormalizer

Posted by "Sebastian Nagel (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273954#comment-13273954 ] 

Sebastian Nagel commented on NUTCH-1323:
----------------------------------------

After a small test crawl on http://si.draagle.com:
# usage is cumbersome because you have to think carefully about the steps in which URLs are normalized. AjaxNormalizer acts as a flip-flop: hashbang URLs are escaped, and escaped ones are unescaped. If URLs are normalized during parsing and then again during the CrawlDb update, you get the hashbang URL back.
# relative hashbang links are not resolved correctly. The outlink of
{noformat}
 base: http://si.draagle.com/?_escaped_fragment_=browse/group/root/
 <a href="#!static/draagle_pogoji_uporabe.html">
{noformat}
should be
{noformat}
http://si.draagle.com/?_escaped_fragment_=static/draagle_pogoji_uporabe.html
{noformat}
and certainly not
{noformat}
http://si.draagle.com/?_escaped_fragment_=browse/group/root/&_escaped_fragment_=static/draagle_pogoji_uporabe.html
{noformat}
# the outlink set of a page with an escaped base URL may contain escaped and unescaped URLs at the same time, as a result of
** a relative link without hashbang, e.g., {{<a href="#search">}}
** a global link with hashbang
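
One possible way to get the expected outlink from point 2 is to unescape the base URL before resolving the link and to escape the result afterwards. The following is only a sketch under that assumption, not what the patch does: HashbangOutlinkSketch and resolveOutlink are hypothetical names, and the escape/unescape helpers skip percent-encoding for brevity.

```java
import java.net.URI;

// Hypothetical helper for illustration only; not part of the attached patch.
class HashbangOutlinkSketch {
  static final String HASHBANG = "#!";
  static final String ESCAPED = "_escaped_fragment_=";

  // Simplified (assumption): the fragment value is copied without percent-encoding.
  static String escape(String url) {
    int pos = url.indexOf(HASHBANG);
    if (pos < 0) {
      return url;
    }
    String base = url.substring(0, pos);
    String separator = base.contains("?") ? "&" : "?";
    return base + separator + ESCAPED + url.substring(pos + HASHBANG.length());
  }

  // Simplified (assumption): _escaped_fragment_ is the last query parameter.
  static String unescape(String url) {
    int pos = url.indexOf(ESCAPED);
    if (pos < 1) {
      return url;
    }
    return url.substring(0, pos - 1) + HASHBANG + url.substring(pos + ESCAPED.length());
  }

  /** Resolves href against the hashbang form of the base, then escapes the result. */
  static String resolveOutlink(String escapedBase, String href) {
    URI base = URI.create(unescape(escapedBase));
    // A fragment-only reference replaces the fragment of the base URI.
    return escape(base.resolve(href).toString());
  }
}
```

With the base and link from point 2, resolveOutlink returns the URL with a single _escaped_fragment_ parameter. A link without a hashbang, such as #search, is left unescaped by this sketch, which is the mixed outlink set described in point 3.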

If I understood correctly:
* URLs with escaped fragments are used
** in crawlDb, segments, linkDb (URL acts as key)
** for fetching
* unescaped hashbang URLs
** are used in the index (and shown to the user)
** may appear in outlinks, redirects, and seeds

Couldn't we bind the decision whether to (un)escape to the current normalizer scope:
* if URL contains #!
  and scope is one of { inject, fetcher/redirect, outlink, ?crawldb/update? }
  => escape
* if URL contains _escaped_fragment_=
  and scope is index
  => unescape
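
The scope-bound rule proposed above could be sketched roughly as follows. This is a standalone illustration, not Nutch code: the class name is hypothetical, the scope strings are placeholders copied from the list above (in Nutch they would presumably map to the scope names passed to URL normalizers), and the escape/unescape helpers omit percent-encoding.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of a scope-aware normalizer; scope names are placeholders.
class ScopedAjaxNormalizerSketch {
  static final String HASHBANG = "#!";
  static final String ESCAPED = "_escaped_fragment_=";

  // Scopes in which hashbang URLs are escaped (taken from the proposal above).
  static final Set<String> ESCAPE_SCOPES =
      new HashSet<>(Arrays.asList("inject", "fetcher", "outlink", "crawldb"));
  // Scope in which escaped URLs are turned back into hashbang URLs.
  static final String UNESCAPE_SCOPE = "index";

  static String normalize(String url, String scope) {
    if (ESCAPE_SCOPES.contains(scope) && url.contains(HASHBANG)) {
      return escape(url);
    }
    if (UNESCAPE_SCOPE.equals(scope) && url.contains(ESCAPED)) {
      return unescape(url);
    }
    return url; // never flips direction outside the matching scope
  }

  // Simplified (assumption): no percent-encoding of the fragment value.
  static String escape(String url) {
    int pos = url.indexOf(HASHBANG);
    String base = url.substring(0, pos);
    String separator = base.contains("?") ? "&" : "?";
    return base + separator + ESCAPED + url.substring(pos + HASHBANG.length());
  }

  // Simplified (assumption): _escaped_fragment_ is the last query parameter.
  static String unescape(String url) {
    int pos = url.indexOf(ESCAPED);
    if (pos < 1) {
      return url;
    }
    return url.substring(0, pos - 1) + HASHBANG + url.substring(pos + ESCAPED.length());
  }
}
```

Because each scope only ever converts in one direction, normalizing the same URL during parsing and again during the CrawlDb update leaves it escaped instead of flipping it back.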

                

        

[jira] [Commented] (NUTCH-1323) AjaxNormalizer

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13274517#comment-13274517 ] 

Markus Jelsma commented on NUTCH-1323:
--------------------------------------

@sebastian:
yes, it should honor scoping rules.

@behnam:
You should work around this by changing the URL normalizer order depending on your scope.

However, we may also change the basic normalizer so that reference removal can be disabled via configuration. Changing the order at fetch and index time to work around this is cumbersome.
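
As a rough illustration of such a switch, the sketch below drops the reference by default and keeps it when a flag is set. It is a standalone example, not the BasicURLNormalizer code: the class name and the keepRef parameter are hypothetical, and how the flag would be read from the Nutch configuration is left open.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

// Hypothetical sketch of optional reference removal; not the actual BasicURLNormalizer.
class RefRemovalSketch {
  /**
   * Rebuilds the URL without its reference (the part after '#'),
   * unless keepRef is true, in which case the reference is preserved.
   */
  static String normalize(String urlString, boolean keepRef) {
    URL url;
    try {
      url = URI.create(urlString).toURL();
    } catch (MalformedURLException e) {
      throw new IllegalArgumentException("Not a valid URL: " + urlString, e);
    }
    String file = url.getFile(); // path plus query, without the reference
    if (keepRef && url.getRef() != null) {
      file = file + "#" + url.getRef();
    }
    return url.getProtocol() + "://" + url.getAuthority() + file;
  }
}
```

With keepRef set, a hashbang URL would survive this step and could still be escaped by a normalizer that runs later.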
                

        

[jira] [Commented] (NUTCH-1323) AjaxNormalizer

Posted by "behnam nikbakht (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13274488#comment-13274488 ] 

behnam nikbakht commented on NUTCH-1323:
----------------------------------------

Hi,
when I try to crawl a dynamic URL like this:
http://d43.me/storage/examples/highslide-dynamic-urls/gallery.html#!mountain
AjaxNormalizer should convert it to:
http://d43.me/storage/examples/highslide-dynamic-urls/gallery.html?_escaped_fragment_=mountain
but there is a problem:
other normalizers remove the # part from URLs, based on the rules in regex-normalize.xml.
Also, in
src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java
there is a block that removes the ref:
if (url.getRef() != null) {
...
In my tests, this had to be changed to:
if (url.getRef() != null) {                 // keep the ref instead of removing it
  file = file + "#" + url.getRef();
  changed = true;
}
With that change, and with the rules removed from regex-normalize.xml, the plugin works correctly.
                

        

[jira] [Updated] (NUTCH-1323) AjaxNormalizer

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1323:
---------------------------------

    Patch Info: Patch Available
    