Posted to droids-dev@incubator.apache.org by Tony Dietrich <to...@dietrich.org.uk> on 2011/04/11 20:15:35 UTC

Question re: Ajax processing

OK, I'm getting ready to make use of Droids in an application for my
company, BUT:

 

As far as I can tell, the current Droids Http Client implementation does not
return a fully populated w3c document if the remote page uses
Ajax/JavaScript to synchronously populate the document.

Correct/Not correct?

 

(If I'm correct, this is a show-stopper for me.)

 

If I'm wrong, can someone point me in the right way to ensure that a remote
crawl of a website will indeed return a fully populated document whether or
not the site uses Ajax/JavaScript to populate elements within the page after
load?

 

Tony Dietrich

 


RE: Question re: Ajax processing

Posted by Fuad Efendi <fu...@efendi.ca>.
>>Err, sorry Fuad, but you are wrong. Even Google disagrees with you /grin!
- they happily crawl my company's website and return fully fleshed out pages
(from their cache!) from our Ajax-based website. However, they don't seem to
be willing to share their secrets /sigh.

- I wrote about "Search Engine Friendliness". I meant: different content for
different user-agent signatures. AJAX for IE8, and HTML for Googlebot. Very
specific HTML for people with disabilities. And specific HTML for those who
hate JavaScript.

Yes, my laptop can generate HTML with color, images, etc. from multiple
chunks of data, but that's a different story... Googlebot would have to spend
10,000 times more CPU cycles than it does now to do the same. Two cores and a
few seconds to render a page (AJAX, HtmlUnit, Mozilla, whatever) is OK for a
single page, but it is not OK for Googlebot: a few milliseconds per page,
billions of pages per day. (A billion pages a day is over 10,000 pages per
second, so whole seconds of CPU per page would mean tens of thousands of
cores running flat out.)

Your website probably uses extremely basic "deterministic" AJAX: dynamically
loaded HTML snippets, where each snippet can have a static
(search-engine-friendly) URL and carries its own embedded HTML tags. What
about a REAL AJAX application? What about Adobe ActionScript (with its
specific search-engine API? It failed...)?

Are you sure about "fully fleshed out pages"? What it was: JSON objects
converted to the DOM by a very specific "transformation"!? Yes, Google can
"emulate" the initial screen with CSS and the basic "onload"-generated stuff,
but it takes so much CPU that it can do it only for the home pages of the
most important sites.

And JSON... the subject of this discussion is similar to "form submission":
can Googlebot discover ALL imaginable URLs generated as form submissions
(including JavaScript-generated URLs), even if you don't publish such URLs
explicitly as part of an SEO strategy? And even if it could, it wouldn't
matter: zero rank, since there are no incoming links...


RE: Question re: Ajax processing

Posted by Fuad Efendi <fu...@efendi.ca>.
Tony, an extremely simple use case: dynamically assigning a CSS class to HTML
elements... Search engines operate on plain text retrieved from
HTML/PDF/(even the meta-tags of video files)/...; but what you describe here
is "regenerating the user's screen", "emulating a web browser"; that is a
different use case (some websites, such as www.alexa.com, www.quantcast.com,
and even www.google.com, do that for home pages).

Also, "w3c document" is not the same as "SGML"... and "AJAX" is sometimes
workaround around buggy browsers & dynamic CSS, DOJO developers are mostly
worried about IE8 vs Mozilla... and it's not just "dynamic HTML"; it could
be popup window (which can't be defined as single HTML)


This is my further (mis-)understanding of the discussion:
- we already have computers with browsers that can do this
- we already have digital cameras that can take a snapshot of a screen
- is this related to PLAIN TEXT retrieval for indexing?
- what is the UNIQUE "resource identifier" for this plain text, and do we
have one for generic AJAX use cases? (We do have one for very naïve
"onload()" AJAX.)

Just as a sample of the technology... web portals can use AJAX to load
portlets; each portlet has a unique URL for its "MAXIMIZED" state, and each
such "portal" page lists those URLs for robots ("search-engine
friendliness"). But that is very basic AJAX. To convert a sophisticated
OOP-style JavaScript object into HTML you need some kind of "transformation
rules", which are themselves JavaScript, and you will have to worry about
memory leaks in the (still mostly buggy) popular libraries (which are
sometimes huge)... And what about viruses and threats inside JavaScript? I
can't imagine a "polite" robot trying to run AJAX. Sorry if I
misunderstood...


Of course... It would be nice to have an Apache library that could generate
an image (JPEG) of a homepage!




-----Original Message-----
From: Tony Dietrich [mailto:tony@dietrich.org.uk] 
Sent: April-11-11 4:56 PM
To: droids-dev@incubator.apache.org
Subject: RE: Question re: Ajax processing

Err, sorry Fuad, but you are wrong. Even Google disagrees with you /grin! -
they happily crawl my company's website and return fully fleshed out pages
(from their cache!) from our Ajax-based website. However, they don't seem to
be willing to share their secrets /sigh.

I've previously implemented a service using HtmlUnit that does just this.
Although the package isn't intended for the purpose, it worked well. It
needed some manipulation to work well in a multi-threaded environment, since
HtmlUnit isn't thread-safe, but it still worked. However, that service wasn't
a web-crawler, just a web-based download-on-request, parse-and-copy-content
service.

I also have a test-case, based on WebSPHINX, that uses HtmlUnit as the
downloader. However, WebSPHINX hasn't been maintained for many years, and
bringing the code-base up to scratch would take as long as writing my own.

Basically HtmlUnit is a headless browser which includes both a CSS and a
JavaScript processor and which can be queried to return the downloaded (and
finalised) page. It includes the capability to perform asynchronous Ajax
transactions.

I'm not particularly worried about the events that might be triggered by
user interaction, more about the onload() events that leave the page fully
initialised with all its content, as first seen by a browser user. As used in
the first-mentioned service, it worked a treat.

Implemented as a queue-based, 'multi-windowed' headless browser, the service
overcame the cost of initialising the browser by creating a singleton
instance shared between requests. Each requested page was opened and
processed in a new 'window', the 'window' was then closed to reclaim its
memory, and the result was returned to the querying thread. The individual
'windows' fire a listener on various events, such as the completion of the
page load, and I used that feature to hand the result back to the querying
thread, which sat waiting on a monitor.

I appreciate this is a heavy-weight component, since creating each 'window'
takes quite a lot of time/cycles, and don't expect to be crawling huge
numbers of pages.

If Droids doesn't currently have this capability, is there anyone who can
talk me through (off-list) the process for creating it? I'd be happy to
contribute it back to the code-base if requested, or to make the code
available to anyone who needs it.


Tony

-----Original Message-----
From: Fuad Efendi [mailto:fuad@efendi.ca]
Sent: 11 April 2011 21:21
To: droids-dev@incubator.apache.org
Subject: RE: Question re: Ajax processing

>>>If I'm wrong, can someone point me in the right way to ensure that a
remote crawl of a website will indeed return a fully populated document
whether or not the site uses Ajax/JavaScript to populate elements within the
page after load?

 This is ABSOLUTELY impossible; no one can do it. It is even THEORETICALLY
impossible, because DOM manipulations are event-driven and unpredictable.
AJAX-based websites can be "crawled" only if they generate
search-engine-friendly HTML (for instance, if the website is fully
functional even for users with JavaScript disabled).



RE: Question re: Ajax processing

Posted by Tony Dietrich <to...@dietrich.org.uk>.
Err, sorry Fuad, but you are wrong. Even Google disagrees with you /grin! -
they happily crawl my company's website and return fully fleshed out pages
(from their cache!) from our Ajax-based website. However, they don't seem to
be willing to share their secrets /sigh.

I've previously implemented a service using HtmlUnit that does just this.
Although the package isn't intended for the purpose, it worked well. It
needed some manipulation to work well in a multi-threaded environment, since
HtmlUnit isn't thread-safe, but it still worked. However, that service wasn't
a web-crawler, just a web-based download-on-request, parse-and-copy-content
service.

I also have a test-case, based on WebSPHINX, that uses HtmlUnit as the
downloader. However, WebSPHINX hasn't been maintained for many years, and
bringing the code-base up to scratch would take as long as writing my own.

Basically HtmlUnit is a headless browser which includes both a CSS and a
JavaScript processor and which can be queried to return the downloaded (and
finalised) page. It includes the capability to perform asynchronous Ajax
transactions.
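
For illustration, fetching a page and letting the onload() Ajax settle looks
roughly like this against the HtmlUnit 2.x API of the day (only a sketch: the
five-second timeout and the lenient script-error setting are illustrative
choices, not requirements):

import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class AjaxFetch {
    public static String fetchRendered(String url) throws Exception {
        WebClient webClient = new WebClient();
        try {
            // Re-route asynchronous XMLHttpRequests onto the main thread
            // so onload()-triggered Ajax completes before we read the DOM.
            webClient.setAjaxController(new NicelyResynchronizingAjaxController());
            // Real-world pages often contain script errors; don't abort on them.
            webClient.setThrowExceptionOnScriptError(false);

            HtmlPage page = webClient.getPage(url);
            // Give any remaining background scripts up to five seconds.
            webClient.waitForBackgroundJavaScript(5000);

            // Serialise the finalised DOM.
            return page.asXml();
        } finally {
            webClient.closeAllWindows(); // release window memory
        }
    }
}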

I'm not particularly worried about the events that might be triggered by
user interaction, more about the onload() events that leave the page fully
initialised with all its content, as first seen by a browser user. As used in
the first-mentioned service, it worked a treat.

Implemented as a queue-based, 'multi-windowed' headless browser, the service
overcame the cost of initialising the browser by creating a singleton
instance shared between requests. Each requested page was opened and
processed in a new 'window', the 'window' was then closed to reclaim its
memory, and the result was returned to the querying thread. The individual
'windows' fire a listener on various events, such as the completion of the
page load, and I used that feature to hand the result back to the querying
thread, which sat waiting on a monitor.
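
As a sketch of that shape (not the actual service code: the CountDownLatch
stands in for the monitor described above, the cast to TopLevelWindow assumes
openWindow() returns one, the timeouts are arbitrary, and requests are
serialised with synchronized because HtmlUnit is not thread-safe):

import java.net.URL;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

import com.gargoylesoftware.htmlunit.TopLevelWindow;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebWindow;
import com.gargoylesoftware.htmlunit.WebWindowEvent;
import com.gargoylesoftware.htmlunit.WebWindowListener;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class WindowedBrowser {
    // One shared browser instance; creating WebClients is the expensive part.
    private final WebClient client = new WebClient();

    public synchronized String fetch(String url, String windowName) throws Exception {
        final CountDownLatch loaded = new CountDownLatch(1);
        WebWindowListener listener = new WebWindowListener() {
            public void webWindowOpened(WebWindowEvent event) { }
            public void webWindowContentChanged(WebWindowEvent event) {
                loaded.countDown(); // page content has arrived in the window
            }
            public void webWindowClosed(WebWindowEvent event) { }
        };
        client.addWebWindowListener(listener);
        WebWindow window = client.openWindow(new URL(url), windowName);
        try {
            loaded.await(10, TimeUnit.SECONDS);       // wait for the load event
            client.waitForBackgroundJavaScript(5000); // let onload() Ajax settle
            return ((HtmlPage) window.getEnclosedPage()).asXml();
        } finally {
            client.removeWebWindowListener(listener);
            ((TopLevelWindow) window).close(); // close the 'window' to free memory
        }
    }
}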

I appreciate this is a heavy-weight component, since creating each 'window'
takes quite a lot of time/cycles, and don't expect to be crawling huge
numbers of pages.

If Droids doesn't currently have this capability, is there anyone who can
talk me through (off-list) the process for creating it? I'd be happy to
contribute it back to the code-base if requested, or to make the code
available to anyone who needs it.


Tony

-----Original Message-----
From: Fuad Efendi [mailto:fuad@efendi.ca] 
Sent: 11 April 2011 21:21
To: droids-dev@incubator.apache.org
Subject: RE: Question re: Ajax processing

>>>If I'm wrong, can someone point me in the right way to ensure that a
remote crawl of a website will indeed return a fully populated document
whether or not the site uses Ajax/JavaScript to populate elements within the
page after load?

 This is ABSOLUTELY impossible; no one can do it. It is even THEORETICALLY
impossible, because DOM manipulations are event-driven and unpredictable.
AJAX-based websites can be "crawled" only if they generate
search-engine-friendly HTML (for instance, if the website is fully
functional even for users with JavaScript disabled).



RE: Question re: Ajax processing

Posted by Fuad Efendi <fu...@efendi.ca>.
>>>If I'm wrong, can someone point me in the right way to ensure that a
remote crawl of a website will indeed return a fully populated document
whether or not the site uses Ajax/JavaScript to populate elements within the
page after load?

 This is ABSOLUTELY impossible; no one can do it. It is even THEORETICALLY
impossible, because DOM manipulations are event-driven and unpredictable.
AJAX-based websites can be "crawled" only if they generate
search-engine-friendly HTML (for instance, if the website is fully
functional even for users with JavaScript disabled).



Re: Question re: Ajax processing

Posted by Chapuis Bertil <bc...@agimem.com>.
You pointed out the thread-safety problem. A good starting point may be to
initialise an HtmlUnit WebClient for each Worker instance. However, I'm not
able to evaluate how much work that would require.
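
For instance, something like this (purely a sketch: the ThreadLocal holder is
one way to get the one-client-per-worker-thread effect, and the fetch()
helper is illustrative rather than Droids' actual Worker API):

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public final class PerWorkerClient {
    // WebClient is not thread-safe, but independent instances are fine,
    // so each worker thread lazily gets its own.
    private static final ThreadLocal<WebClient> CLIENT =
            new ThreadLocal<WebClient>() {
                @Override
                protected WebClient initialValue() {
                    return new WebClient();
                }
            };

    public static String fetch(String url) throws Exception {
        HtmlPage page = CLIENT.get().getPage(url);
        CLIENT.get().waitForBackgroundJavaScript(5000); // let Ajax finish
        return page.asXml();
    }
}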

On 11 April 2011 22:57, Tony Dietrich <to...@dietrich.org.uk> wrote:

> Thanks Chapuis, I know about HtmlUnit.
>
> See my reply to Fuad.
>
> However, I have no idea how to integrate this into Droids. Help?
>
> Tony
>
> -----Original Message-----
> From: Chapuis Bertil [mailto:bchapuis@agimem.com]
> Sent: 11 April 2011 21:52
> To: droids-dev@incubator.apache.org
> Subject: Re: Question re: Ajax processing
>
> Yes, you are right: the HttpClient does not interpret JavaScript, and no
> support is provided in Droids for such a use case. However, this could
> probably be achieved by using another client, like the one provided by
> HtmlUnit, which can retrieve information from web sites and works with most
> JavaScript libraries.
>
> http://htmlunit.sourceforge.net/javascript-howto.html
>
> On 11 April 2011 22:15, Tony Dietrich <to...@dietrich.org.uk> wrote:
>
> > OK, I'm getting ready to make use of Droids in an application for my
> > company, BUT:
> >
> >
> >
> > As far as I can tell, the current Droids Http Client implementation does
> > not
> > return a fully populated w3c document if the remote page uses
> > Ajax/JavaScript to synchronously populate the document.
> >
> > Correct/Not correct?
> >
> >
> >
> > (If I'm correct, this is a show-stopper for me.)
> >
> >
> >
> > If I'm wrong, can someone point me in the right way to ensure that a
> remote
> > crawl of a website will indeed return a fully populated document whether
> or
> > not the site uses Ajax/JavaScript to populate elements within the page
> > after
> > load?
> >
> >
> >
> > Tony Dietrich
> >
> >
> >
> >
>
>
> --
> Bertil Chapuis
> Agimem Sàrl
> http://www.agimem.com
>
>


-- 
Bertil Chapuis
Agimem Sàrl
http://www.agimem.com

RE: Question re: Ajax processing

Posted by Tony Dietrich <to...@dietrich.org.uk>.
Thanks Chapuis, I know about HtmlUnit.

See my reply to Fuad.

However, I have no idea how to integrate this into Droids. Help?

Tony

-----Original Message-----
From: Chapuis Bertil [mailto:bchapuis@agimem.com] 
Sent: 11 April 2011 21:52
To: droids-dev@incubator.apache.org
Subject: Re: Question re: Ajax processing

Yes, you are right: the HttpClient does not interpret JavaScript, and no
support is provided in Droids for such a use case. However, this could
probably be achieved by using another client, like the one provided by
HtmlUnit, which can retrieve information from web sites and works with most
JavaScript libraries.

http://htmlunit.sourceforge.net/javascript-howto.html

On 11 April 2011 22:15, Tony Dietrich <to...@dietrich.org.uk> wrote:

> OK, I'm getting ready to make use of Droids in an application for my
> company, BUT:
>
>
>
> As far as I can tell, the current Droids Http Client implementation does
> not
> return a fully populated w3c document if the remote page uses
> Ajax/JavaScript to synchronously populate the document.
>
> Correct/Not correct?
>
>
>
> (If I'm correct, this is a show-stopper for me.)
>
>
>
> If I'm wrong, can someone point me in the right way to ensure that a remote
> crawl of a website will indeed return a fully populated document whether or
> not the site uses Ajax/JavaScript to populate elements within the page
> after
> load?
>
>
>
> Tony Dietrich
>
>
>
>


-- 
Bertil Chapuis
Agimem Sàrl
http://www.agimem.com


Re: Question re: Ajax processing

Posted by Chapuis Bertil <bc...@agimem.com>.
Yes, you are right: the HttpClient does not interpret JavaScript, and no
support is provided in Droids for such a use case. However, this could
probably be achieved by using another client, like the one provided by
HtmlUnit, which can retrieve information from web sites and works with most
JavaScript libraries.

http://htmlunit.sourceforge.net/javascript-howto.html

On 11 April 2011 22:15, Tony Dietrich <to...@dietrich.org.uk> wrote:

> OK, I'm getting ready to make use of Droids in an application for my
> company, BUT:
>
>
>
> As far as I can tell, the current Droids Http Client implementation does
> not
> return a fully populated w3c document if the remote page uses
> Ajax/JavaScript to synchronously populate the document.
>
> Correct/Not correct?
>
>
>
> (If I'm correct, this is a show-stopper for me.)
>
>
>
> If I'm wrong, can someone point me in the right way to ensure that a remote
> crawl of a website will indeed return a fully populated document whether or
> not the site uses Ajax/JavaScript to populate elements within the page
> after
> load?
>
>
>
> Tony Dietrich
>
>
>
>


-- 
Bertil Chapuis
Agimem Sàrl
http://www.agimem.com