Posted to users@sling.apache.org by sam lee <sk...@gmail.com> on 2011/02/24 21:20:15 UTC

crawling html in asynchronous service?

Hey,

I am using the Scheduler to crawl HTML files.

The job runs every minute, and it needs to crawl /content/foo.html.

If I use Apache Commons HttpClient to GET /content/foo.html, I need to set
up authentication (Basic Auth?).

However, since all the HTML pages that I want to crawl are served within Sling,
is there an API that "resolves" (or "renders") paths like /content/foo.html,
/content/bar.json, etc.?


If I actually have to make an HTTP request, where should I get the
authentication information? The Scheduler does not know about the JCR Session.
Should I explicitly get a ResourceResolver (using
ResourceResolverFactory.getAdministrativeResourceResolver()) every time the
job is fired?
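
For the HTTP route described above, Basic authentication comes down to sending an Authorization header with the Base64-encoded "user:password" pair. The sketch below is only an illustration: it swaps in the JDK's own java.net.http client (Java 11+) instead of Apache Commons HttpClient, and the class name, method names, base URL and credentials are all hypothetical placeholders.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper; names are illustrative and not part of Sling.
final class BasicAuthSketch {

    // Builds the value of the Authorization header for HTTP Basic auth:
    // "Basic " followed by Base64("user:password").
    static String basicAuthHeader(String user, String password) {
        String credentials = user + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
    }

    // Prepares (but does not send) a GET request for a path such as
    // /content/foo.html. baseUrl is an assumption, e.g. a local instance.
    static HttpRequest crawlRequest(String baseUrl, String path,
                                    String user, String password) {
        return HttpRequest.newBuilder(URI.create(baseUrl + path))
                .header("Authorization", basicAuthHeader(user, password))
                .GET()
                .build();
    }
}
```

The request would then be sent with an HttpClient from the scheduled job; where the credentials come from is still the open question raised above.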

Re: crawling html in asynchronous service?

Posted by sam lee <sk...@gmail.com>.
Thank you.

Using SlingRequestProcessor with mocked-up HTTP request and response objects
worked well.


On Thu, Feb 24, 2011 at 3:34 PM, Bertrand Delacretaz <bdelacretaz@apache.org> wrote:

> Hi,
>
> On Thu, Feb 24, 2011 at 9:20 PM, sam lee <sk...@gmail.com> wrote:
> > ...since all html pages that I want to crawl are served within Sling,
> > is there an API that "resolves" (or "renders") paths like /content/foo.html,
> > /content/bar.json ... etc?
>
> You can use the SlingRequestProcessor to make requests directly
> without going through http.
>
> For an example see the TestAllPaths class [1] which uses it. Don't be
> scared by the use of static member variables, that class is a somewhat
> tricky bridge with JUnit to run server-side tests.
>
> You will have to provide a ResourceResolver to use it, and that's
> built from a Session, login happens when that session is created.
>
> -Bertrand
>
> [1]
> http://svn.apache.org/repos/asf/sling/trunk/testing/junit/scriptable/src/main/java/org/apache/sling/junit/scriptable/TestAllPaths.java
>

Re: crawling html in asynchronous service?

Posted by Bertrand Delacretaz <bd...@apache.org>.
Hi,

On Thu, Feb 24, 2011 at 9:20 PM, sam lee <sk...@gmail.com> wrote:
> ...since all html pages that I want to crawl are served within Sling,
> is there an API that "resolves" (or "renders") paths like /content/foo.html,
> /content/bar.json ... etc?

You can use the SlingRequestProcessor to make requests directly,
without going through HTTP.

For an example, see the TestAllPaths class [1], which uses it. Don't be
scared by the use of static member variables; that class is a somewhat
tricky bridge with JUnit to run server-side tests.

You will have to provide a ResourceResolver to use it. That resolver is
built from a Session, and login happens when that session is created.

-Bertrand

[1] http://svn.apache.org/repos/asf/sling/trunk/testing/junit/scriptable/src/main/java/org/apache/sling/junit/scriptable/TestAllPaths.java
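
The approach discussed in this thread (render a path in-process with a resolver obtained for the job, and capture the output in a buffer) can be sketched roughly as follows. This is not Sling code: Resolver and InProcessRenderer are hypothetical stand-ins, defined here only so the sketch compiles on its own. In Sling the corresponding pieces would be a ResourceResolver and SlingRequestProcessor.processRequest(HttpServletRequest, HttpServletResponse, ResourceResolver), called with mocked-up request and response objects. Opening and closing the resolver on every run is an assumption of this sketch, not something the thread prescribes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.function.Supplier;

// Hypothetical stand-in for a resolver bound to a logged-in session.
interface Resolver extends AutoCloseable {
    @Override
    void close();
}

// Hypothetical stand-in for an in-process renderer: writes the rendering
// of the given path to out, using the given resolver.
interface InProcessRenderer {
    void render(String path, Resolver resolver, OutputStream out) throws IOException;
}

// A job meant to be fired periodically (for example once a minute).
final class CrawlJobSketch implements Runnable {
    private final Supplier<Resolver> login;   // performs the login, one resolver per call
    private final InProcessRenderer renderer;
    private final String path;
    private volatile String lastBody = "";

    CrawlJobSketch(Supplier<Resolver> login, InProcessRenderer renderer, String path) {
        this.login = login;
        this.renderer = renderer;
        this.path = path;
    }

    @Override
    public void run() {
        // Assumption: obtain a fresh resolver per run and always close it.
        try (Resolver resolver = login.get()) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            renderer.render(path, resolver, buffer);
            lastBody = new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            // Rendering failed; drop the previous result instead of keeping stale content.
            lastBody = "";
        }
    }

    // Returns the body captured by the most recent run.
    String lastBody() {
        return lastBody;
    }
}
```

The try-with-resources block is the main design point: whatever session backs the resolver is released even when rendering throws, so a job that fires every minute does not accumulate open sessions.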