Posted to dev@nutch.apache.org by "Markus Jelsma (Created) (JIRA)" <ji...@apache.org> on 2012/04/02 22:05:22 UTC
[jira] [Created] (NUTCH-1323) AjaxNormalizer
AjaxNormalizer
--------------
Key: NUTCH-1323
URL: https://issues.apache.org/jira/browse/NUTCH-1323
Project: Nutch
Issue Type: New Feature
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.6
A two-way normalizer for Nutch that handles AJAX URLs, converting them to _escaped_fragment_ URLs and back to AJAX URLs.
https://developers.google.com/webmasters/ajax-crawling/
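As a rough illustration of the two-way mapping described above, here is a minimal Java sketch. It is not taken from any patch on this issue: the class name AjaxUrlSketch, its method names, and the example.com URL are hypothetical, and the percent-escaping of special characters in the fragment that the crawling scheme calls for is left out to keep the example short.

```java
/** Hypothetical sketch of the hashbang <-> _escaped_fragment_ mapping (no percent-escaping). */
final class AjaxUrlSketch {
  static final String HASHBANG = "#!";
  static final String ESCAPED = "_escaped_fragment_=";

  /** Converts a #! URL to its _escaped_fragment_ form; other URLs are returned unchanged. */
  static String escape(String url) {
    int i = url.indexOf(HASHBANG);
    if (i < 0) {
      return url;
    }
    String base = url.substring(0, i);
    String fragment = url.substring(i + HASHBANG.length());
    // Append as a new query parameter if the URL already has a query string.
    String sep = base.contains("?") ? "&" : "?";
    return base + sep + ESCAPED + fragment;
  }

  /** Converts an _escaped_fragment_ URL back to its #! form; other URLs are returned unchanged. */
  static String unescape(String url) {
    int i = url.indexOf(ESCAPED);
    if (i < 1) {
      return url;
    }
    char sep = url.charAt(i - 1);
    if (sep != '?' && sep != '&') {
      return url;
    }
    // Assumes _escaped_fragment_ is the last query parameter.
    String base = url.substring(0, i - 1);
    String fragment = url.substring(i + ESCAPED.length());
    return base + HASHBANG + fragment;
  }

  public static void main(String[] args) {
    String ajax = "http://example.com/gallery.html#!mountain";
    String escaped = escape(ajax);
    System.out.println(escaped);
    System.out.println(unescape(escaped));
  }
}
```

Note that calling escape and then unescape returns the original URL, which is exactly the two-way behavior the normalizer is meant to provide.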
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1323) AjaxNormalizer
Posted by "behnam nikbakht (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13274529#comment-13274529 ]
behnam nikbakht commented on NUTCH-1323:
----------------------------------------
Yes, it works correctly. Thank you.
[jira] [Updated] (NUTCH-1323) AjaxNormalizer
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1323:
---------------------------------
Attachment: NUTCH-1323-1.6-1.patch
Patch for 1.6. See unit tests for examples. Please comment. There must be something wrong as all tests pass. Any tests to add?
[jira] [Commented] (NUTCH-1323) AjaxNormalizer
Posted by "Sebastian Nagel (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273954#comment-13273954 ]
Sebastian Nagel commented on NUTCH-1323:
----------------------------------------
After a small test crawl on http://si.draagle.com:
# usage is cumbersome because you have to think carefully about which steps should normalize URLs. This is because AjaxNormalizer acts as a flip-flop: hashbang URLs are escaped, escaped ones are unescaped. If URLs are normalized during parsing and then again during CrawlDb update, you get the hashbang URL back.
# relative hashbang links are not resolved correctly. The outlink of
{noformat}
base: http://si.draagle.com/?_escaped_fragment_=browse/group/root/
<a href="#!static/draagle_pogoji_uporabe.html">
{noformat}
should be
{noformat}
http://si.draagle.com/?_escaped_fragment_=static/draagle_pogoji_uporabe.html
{noformat}
rather than
{noformat}
http://si.draagle.com/?_escaped_fragment_=browse/group/root/&_escaped_fragment_=static/draagle_pogoji_uporabe.html
{noformat}
# the outlink set of one page with an escaped base URL may contain escaped and unescaped URLs at the same time, as a result of
** a relative link without hashbang, e.g., {{<a href="#search">}}
** an absolute link with hashbang
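To make the expected result for relative hashbang links concrete, here is a minimal Java sketch. It is an illustration only and not part of the patch: the class HashbangOutlinkSketch and its methods are hypothetical, and it assumes that _escaped_fragment_ is the last query parameter of the base URL. The idea is to drop the base URL's own escaped fragment first, so that the link's fragment replaces it instead of being appended to it.

```java
/** Hypothetical sketch: resolve a relative "#!..." link against an escaped base URL. */
final class HashbangOutlinkSketch {
  private static final String HASHBANG = "#!";
  private static final String ESCAPED = "_escaped_fragment_=";

  /** Removes a trailing _escaped_fragment_ parameter, leaving the page the fragment applies to. */
  static String stripEscapedFragment(String url) {
    int i = url.indexOf(ESCAPED);
    if (i < 1) {
      return url;
    }
    char sep = url.charAt(i - 1);
    if (sep != '?' && sep != '&') {
      return url;
    }
    return url.substring(0, i - 1);
  }

  /** Resolves a relative hashbang href so that it replaces the base URL's escaped fragment. */
  static String resolve(String escapedBase, String href) {
    if (!href.startsWith(HASHBANG)) {
      throw new IllegalArgumentException("not a relative hashbang link: " + href);
    }
    String page = stripEscapedFragment(escapedBase);
    String sep = page.contains("?") ? "&" : "?";
    return page + sep + ESCAPED + href.substring(HASHBANG.length());
  }

  public static void main(String[] args) {
    System.out.println(resolve(
        "http://si.draagle.com/?_escaped_fragment_=browse/group/root/",
        "#!static/draagle_pogoji_uporabe.html"));
  }
}
```

With the base URL and link from the example above, this yields a URL with a single _escaped_fragment_ parameter instead of two.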
If I understand correctly:
* URLs with escaped fragments are used
** in crawlDb, segments, linkDb (URL acts as key)
** for fetching
* unescaped hashbang URLs
** are used in the index (and shown to the user)
** may appear in outlinks, redirects, and seeds
Couldn't we bind the decision whether to (un)escape to the current normalizer scope:
* if URL contains #!
and scope is one of { inject, fetcher/redirect, outlink, ?crawldb/update? }
=> escape
* if URL contains _escaped_fragment_=
and scope is index
=> unescape
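The scope-bound decision proposed above could look roughly like the following Java sketch. This is only an illustration under stated assumptions: the class ScopedAjaxNormalizerSketch is hypothetical and standalone, the scope names are plain strings taken from the list above rather than verified Nutch constants, and percent-escaping of the fragment is omitted.

```java
import java.util.Set;

/** Hypothetical sketch: escape or unescape depending on the normalizer scope. */
final class ScopedAjaxNormalizerSketch {
  private static final String HASHBANG = "#!";
  private static final String ESCAPED = "_escaped_fragment_=";

  // Placeholder scope names from the proposal; not verified against Nutch's own constants.
  private static final Set<String> ESCAPE_SCOPES =
      Set.of("inject", "fetcher", "outlink", "crawldb");
  private static final String UNESCAPE_SCOPE = "index";

  /** Escapes hashbang URLs in crawl-side scopes and unescapes only in the index scope. */
  static String normalize(String url, String scope) {
    if (ESCAPE_SCOPES.contains(scope) && url.contains(HASHBANG)) {
      return escape(url);
    }
    if (UNESCAPE_SCOPE.equals(scope) && url.contains(ESCAPED)) {
      return unescape(url);
    }
    return url; // Any other combination leaves the URL untouched.
  }

  private static String escape(String url) {
    int i = url.indexOf(HASHBANG);
    String base = url.substring(0, i);
    String sep = base.contains("?") ? "&" : "?";
    return base + sep + ESCAPED + url.substring(i + HASHBANG.length());
  }

  private static String unescape(String url) {
    int i = url.indexOf(ESCAPED);
    if (i < 1 || (url.charAt(i - 1) != '?' && url.charAt(i - 1) != '&')) {
      return url;
    }
    return url.substring(0, i - 1) + HASHBANG + url.substring(i + ESCAPED.length());
  }

  public static void main(String[] args) {
    String seen = normalize("http://si.draagle.com/#!browse/group/root/", "outlink");
    System.out.println(seen);
    System.out.println(normalize(seen, "crawldb"));
    System.out.println(normalize(seen, "index"));
  }
}
```

Because each scope maps to at most one direction, normalizing an already escaped URL a second time in a crawl-side scope leaves it unchanged, which avoids the flip-flop effect described in the first point.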
[jira] [Commented] (NUTCH-1323) AjaxNormalizer
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13274517#comment-13274517 ]
Markus Jelsma commented on NUTCH-1323:
--------------------------------------
@sebastian:
Yes, it should honor scoping rules.
@behnam:
You should work around this by changing the URL normalizer order depending on your scope.
However, we may also change the basic normalizer so that reference removal can be disabled via configuration. Changing the order at fetch and index time to work around this is cumbersome.
[jira] [Commented] (NUTCH-1323) AjaxNormalizer
Posted by "behnam nikbakht (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13274488#comment-13274488 ]
behnam nikbakht commented on NUTCH-1323:
----------------------------------------
Hi,
When I try to crawl a dynamic URL like this:
http://d43.me/storage/examples/highslide-dynamic-urls/gallery.html#!mountain
AjaxNormalizer must convert it to:
http://d43.me/storage/examples/highslide-dynamic-urls/gallery.html?_escaped_fragment_=mountain
But there is a problem:
other normalizers remove the # part from URLs, based on the rules in regex-normalize.xml.
Also, in
src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java
there is a line that removes the ref:
if (url.getRef() != null) {
...
In my test I changed this to:
if (url.getRef() != null) { // keep the ref instead of removing it
  file = file + "#" + url.getRef();
  changed = true;
}
With this change, and with the rules removed from regex-normalize.xml, the plugin works correctly.
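The effect described above can be reproduced outside Nutch with plain java.net.URL. The following is a minimal sketch, not the BasicURLNormalizer code: the class RefRemovalSketch and its rebuild method are hypothetical and only mimic the idea of reassembling a URL from its parts with or without the ref.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

/** Hypothetical sketch: rebuilding a URL without its ref loses the hashbang fragment. */
final class RefRemovalSketch {

  /** Reassembles protocol, host, optional port and file; appends the ref only if keepRef is set. */
  static String rebuild(String urlString, boolean keepRef) {
    URL url;
    try {
      url = URI.create(urlString).toURL();
    } catch (MalformedURLException e) {
      throw new IllegalArgumentException(e);
    }
    String file = url.getFile(); // path plus query, without the ref
    if (keepRef && url.getRef() != null) {
      file = file + "#" + url.getRef();
    }
    String port = url.getPort() == -1 ? "" : ":" + url.getPort();
    return url.getProtocol() + "://" + url.getHost() + port + file;
  }

  public static void main(String[] args) {
    String ajax = "http://d43.me/storage/examples/highslide-dynamic-urls/gallery.html#!mountain";
    System.out.println(rebuild(ajax, false)); // the "#!mountain" part is dropped
    System.out.println(rebuild(ajax, true));  // the "#!mountain" part survives
  }
}
```

If a normalizer that drops the ref runs before the AJAX conversion, there is no hashbang left to convert, which is why the order of the normalizers matters here.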
[jira] [Updated] (NUTCH-1323) AjaxNormalizer
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1323:
---------------------------------
Patch Info: Patch Available