Posted to users@sling.apache.org by sam lee <sk...@gmail.com> on 2011/02/24 21:20:15 UTC
crawling html in asynchronous service?
Hey,
I am using the Scheduler to crawl HTML files.
The job runs every minute.
It needs to crawl /content/foo.html
If I use Apache Commons HttpClient to GET /content/foo.html, I need to set
up authentication (Basic Auth?).
However, since all html pages that I want to crawl are served within Sling,
is there an API that "resolves" (or "renders") paths like /content/foo.html,
/content/bar.json ... etc?
If I actually have to make an HTTP request, where should I get the
authentication information? The Scheduler does not know about the JCR Session.
Should I explicitly get a ResourceResolver (using
ResourceResolverFactory.getAdministrativeResourceResolver()) every time the
job is fired?
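
For the plain HTTP route mentioned above, a minimal sketch of attaching Basic Auth credentials to a GET request might look like the following. It uses the JDK's own java.net.http client (Java 11+) rather than Apache Commons HttpClient, only builds the request without sending it, and the base URL http://localhost:8080 and the admin/admin credentials are placeholder assumptions, not values from this thread.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BasicAuthSketch {
    // Builds the value of an HTTP Basic "Authorization" header:
    // "Basic " followed by base64("user:password").
    static String basicAuthHeader(String user, String password) {
        String credentials = user + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
    }

    // Builds (but does not send) a GET request for the given path.
    // baseUrl, user and password are assumptions supplied by the caller.
    static HttpRequest buildRequest(String baseUrl, String path,
                                    String user, String password) {
        return HttpRequest.newBuilder(URI.create(baseUrl + path))
                .header("Authorization", basicAuthHeader(user, password))
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // Placeholder host and credentials, for illustration only.
        HttpRequest request = buildRequest(
                "http://localhost:8080", "/content/foo.html", "admin", "admin");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(
                request.headers().firstValue("Authorization").orElse("(none)"));
    }
}
```

A scheduled job would still need to obtain those credentials from somewhere, which is the drawback the question points at.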
Re: crawling html in asynchronous service?
Posted by sam lee <sk...@gmail.com>.
Thank you.
Using SlingRequestProcessor with mocked-up HTTP request and response objects
worked well.
On Thu, Feb 24, 2011 at 3:34 PM, Bertrand Delacretaz <bdelacretaz@apache.org> wrote:
> Hi,
>
> On Thu, Feb 24, 2011 at 9:20 PM, sam lee <sk...@gmail.com> wrote:
> > ...since all html pages that I want to crawl are served within Sling,
> > is there an API that "resolves" (or "renders") paths like /content/foo.html,
> > /content/bar.json ... etc?
>
> You can use the SlingRequestProcessor to make requests directly
> without going through http.
>
> For an example see the TestAllPaths class [1] which uses it. Don't be
> scared by the use of static member variables, that class is a somewhat
> tricky bridge with JUnit to run server-side tests.
>
> You will have to provide a ResourceResolver to use it, and that's
> built from a Session, login happens when that session is created.
>
> -Bertrand
>
> [1]
> http://svn.apache.org/repos/asf/sling/trunk/testing/junit/scriptable/src/main/java/org/apache/sling/junit/scriptable/TestAllPaths.java
>
Re: crawling html in asynchronous service?
Posted by Bertrand Delacretaz <bd...@apache.org>.
Hi,
On Thu, Feb 24, 2011 at 9:20 PM, sam lee <sk...@gmail.com> wrote:
> ...since all html pages that I want to crawl are served within Sling,
> is there an API that "resolves" (or "renders") paths like /content/foo.html,
> /content/bar.json ... etc?
You can use the SlingRequestProcessor to make requests directly
without going through http.
For an example see the TestAllPaths class [1] which uses it. Don't be
scared by the use of static member variables, that class is a somewhat
tricky bridge with JUnit to run server-side tests.
You will have to provide a ResourceResolver to use it, and that's
built from a Session, login happens when that session is created.
-Bertrand
[1] http://svn.apache.org/repos/asf/sling/trunk/testing/junit/scriptable/src/main/java/org/apache/sling/junit/scriptable/TestAllPaths.java
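
The approach discussed in this thread (rendering a path in-process with mocked request and response objects instead of going through HTTP) can be sketched roughly as follows. This is a JDK-only illustration of the pattern, not Sling code: FakeRequest, FakeResponse and Renderer are hypothetical stand-ins. In a real Sling bundle the corresponding types would be HttpServletRequest, HttpServletResponse and SlingRequestProcessor, whose processRequest method also takes the ResourceResolver as a third argument.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

public class InMemoryRenderSketch {
    // Hypothetical stand-in for a mocked request: just the method and path.
    static final class FakeRequest {
        final String method = "GET";
        final String path;
        FakeRequest(String path) { this.path = path; }
    }

    // Hypothetical stand-in for a mocked response: buffers output in memory
    // so the caller can read the rendered body afterwards.
    static final class FakeResponse {
        private final StringWriter buffer = new StringWriter();
        private final PrintWriter writer = new PrintWriter(buffer);
        PrintWriter getWriter() { return writer; }
        String body() { writer.flush(); return buffer.toString(); }
    }

    // Hypothetical stand-in for the component that renders a path in-process.
    interface Renderer {
        void process(FakeRequest request, FakeResponse response);
    }

    // Renders one path and returns the captured output as a string.
    static String crawl(Renderer renderer, String path) {
        FakeResponse response = new FakeResponse();
        renderer.process(new FakeRequest(path), response);
        return response.body();
    }

    public static void main(String[] args) {
        // Demo renderer that echoes the requested path; a real one would
        // delegate to the server's request processing.
        Renderer demo = (req, res) ->
                res.getWriter().print("<html>" + req.path + "</html>");
        System.out.println(crawl(demo, "/content/foo.html"));
    }
}
```

The point of the in-memory response is that the scheduled job gets the rendered markup back as a string without opening a socket. Authentication is then handled when the session behind the ResourceResolver is created, as described above.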