Posted to users@tomcat.apache.org by Mitch Claborn <mi...@claborn.net> on 2015/07/16 19:58:21 UTC

Check if a URL exists programmatically

Short question: How can I, from within code running under Tomcat, 
determine if a given URL request to that tomcat instance would result in 
a 404 or not, without calling back to the Tomcat using an HTTP HEAD or GET?

Background: We use google custom search by calling the google server and 
then formatting the results on our search page.  Our range of products 
is fairly fluid, and there is occasionally a gap between when a product 
goes away and the google search index is updated, which would result in 
a 404 if a user clicked that link in the search results.  (I know that I 
can ask google to re-index, but I still need to solve this problem.)

Rather than write a ton of code for the various types of pages that we 
have (product, category, etc) I'd like to just be able to call some 
Tomcat method to determine if the URL that I get back from google would 
result in a 404 or not.  I'm currently calling back to the Tomcat 
instance using an HTTP HEAD call, but that is a waste of resources and 
during periods of high volume uses up processing threads that I want to 
reserve for actual customers.

We are using Tomcat 7 with Struts.


-- 

Mitch


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org


Re: Check if a URL exists programmatically

Posted by Christopher Schultz <ch...@christopherschultz.net>.

Mitch,

On 7/20/15 2:09 PM, Mitch Claborn wrote:
> On 07/17/2015 10:48 AM, Mitch Claborn wrote:
>> On 07/16/2015 02:19 PM, chris derham wrote:
>>>> I already have a custom error page. When I detect that a URL 
>>>> returned by google would return a 404, I exclude it from the
>>>> search results so that the user never sees it.
>>>> 
>>>> Mitch
>>> Mitch,
>>> 
>>> Ok I see now what you mean. Sorry your original email was quite
>>> clear.
>>> 
>>> Hmm interesting challenge. Big picture terms, I guess the two
>>> obvious choices seem to be to not use google for searching, or
>>> parse the google results, and determine the url validity as you
>>> are doing. Depending on the urls you use, that could be
>>> horrible. Guess that's where you are. Is not using google an
>>> option?
>>> 
>>> Please let us know how you resolve it.
>>> 
>>> Chris
>>> 
>> Doing without google is not an option.  We are quite happy with
>> them except for this one, admittedly minor, glitch.
>> 
>> I spent some time yesterday digging through code without much
>> luck. Today I'm going to experiment with this: getting a
>> RequestDispatcher for the URL from the ServletContext, creating a
>> dummy ServletRequest and ServletResponse object and invoking
>> include(request, response) or forward() on that dispatcher.  With
>> luck, I'll be able to get what would be the response from a HEAD
>> or a GET request in some sort of output stream in the response
>> object, then examine that output stream for the result.
>> 
> 
> I guess I'm giving up on this. I tried the approach described
> above, but can't seem to make it work.  Trying the case of a
> known-good URL as a baseline.  When I invoke
> dispatcher.forward(request, response) my dummy response object
> gets called with a sendError(404, "/url.html"), but I can also see
> evidence that the code that should run for that URL (a struts
> action) is running and is returning a good Struts response.  When I
> enable low level logging, it appears to me that the JSP that
> renders the output is being called, but the output is not making it
> back to my dummy response object.
> 
> That sendError() is coming from the DefaultServlet, which is odd
> because I would think that should not be called as Struts is
> (should be) intercepting all of the requests.

Struts 2 (S2) is implemented as a Filter. If nothing matches in the S2
configuration, it will probably just call down the Filter chain,
eventually ending up at the DefaultServlet. So a 404 is pretty much
always handled by the DefaultServlet.

> I must be setting something up wrong somewhere.  The only next step
> I can think of is to compile Tomcat for myself so I can debug the 
> execution path from the forward() to figure out what's going on.
> I can't justify that much time and effort on this.
> 
> I'm guessing the RequestDispatcher only works down below the
> filters, which is where Struts is invoked.

RequestDispatcher will act pretty much just like an incoming request.
At this point, you may just want to make the loopback request. You
mentioned wasting resources using this approach. Which resources? If
you're willing to call the RequestDispatcher, you're pretty much using
those resources already. About the only difference is the use of
another Thread. You can limit the number of threads used for these
loopback requests by creating a second <Connector> that is only used
for loopback requests, and use an <Executor> that has a small number
of threads. Of course, if any incoming request can result in a
loopback request, then it's possible to DoS your server just by making
lots of requests that will trigger these kinds of loopback requests.
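
A minimal Java sketch of the bounded loopback check described above. It
assumes a dedicated loopback connector is reachable at some base URL
such as http://127.0.0.1:8081; that port, the class name LoopbackProbe,
and the timeout values are illustrative assumptions, not something from
this thread. The small fixed thread pool caps how many HEAD requests are
in flight on the calling side, which complements a small <Executor> on
the connector side.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical helper: filters out paths that answer 404 on a loopback connector. */
public class LoopbackProbe {
    private final String loopbackBase;   // e.g. "http://127.0.0.1:8081" (assumed port)
    private final ExecutorService pool;  // bounds the number of concurrent probes

    public LoopbackProbe(String loopbackBase, int maxThreads) {
        this.loopbackBase = loopbackBase;
        this.pool = Executors.newFixedThreadPool(maxThreads);
    }

    /** Sends one HEAD request and returns the HTTP status code. */
    static int headStatus(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        try {
            conn.setRequestMethod("HEAD");
            conn.setInstanceFollowRedirects(false);
            conn.setConnectTimeout(1000);  // illustrative timeouts
            conn.setReadTimeout(2000);
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    /** Returns the paths that did not answer 404, in their original order. */
    public List<String> filterExisting(List<String> paths) throws Exception {
        List<Future<Integer>> pending = new ArrayList<>();
        for (String path : paths) {
            pending.add(pool.submit(() -> headStatus(loopbackBase + path)));
        }
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < paths.size(); i++) {
            if (pending.get(i).get() != HttpURLConnection.HTTP_NOT_FOUND) {
                kept.add(paths.get(i));
            }
        }
        return kept;
    }

    public void shutdown() {
        pool.shutdown();
    }
}
```

Note that this only limits the cost of each search; it does not remove
the amplification risk mentioned above, since every search request can
still fan out into several loopback requests.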

-chris


Re: Check if a URL exists programmatically

Posted by Mitch Claborn <mi...@claborn.net>.
On 07/17/2015 10:48 AM, Mitch Claborn wrote:
> On 07/16/2015 02:19 PM, chris derham wrote:
>>> I already have a custom error page. When I detect that a URL 
>>> returned by
>>> google would return a 404, I exclude it from the search results so 
>>> that the
>>> user never sees it.
>>>
>>> Mitch
>> Mitch,
>>
>> Ok I see now what you mean. Sorry your original email was quite clear.
>>
>> Hmm interesting challenge. Big picture terms, I guess the two obvious
>> choices seem to be to not use google for searching, or parse the
>> google results, and determine the url validity as you are doing.
>> Depending on the urls you use, that could be horrible. Guess that's
>> where you are. Is not using google an option?
>>
>> Please let us know how you resolve it.
>>
>> Chris
>>
> Doing without google is not an option.  We are quite happy with them 
> except for this one, admittedly minor, glitch.
>
> I spent some time yesterday digging through code without much luck. 
> Today I'm going to experiment with this: getting a RequestDispatcher 
> for the URL from the ServletContext, creating a dummy ServletRequest 
> and ServletResponse object and invoking include(request, response) or 
> forward() on that dispatcher.  With luck, I'll be able to get what 
> would be the response from a HEAD or a GET request in some sort of 
> output stream in the response object, then examine that output stream 
> for the result.
>

I guess I'm giving up on this. I tried the approach described above, but 
can't seem to make it work.  Trying the case of a known-good URL as a 
baseline.  When I invoke dispatcher.forward(request, response) my dummy 
response object gets called with a sendError(404, "/url.html"), but I 
can also see evidence that the code that should run for that URL (a 
struts action) is running and is returning a good Struts response.  When 
I enable low level logging, it appears to me that the JSP that renders 
the output is being called, but the output is not making it back to my 
dummy response object.

That sendError() is coming from the DefaultServlet, which is odd because 
I would think that should not be called as Struts is (should be) 
intercepting all of the requests.

I must be setting something up wrong somewhere.  The only next step I 
can think of is to compile Tomcat for myself so I can debug the 
execution path from the forward() to figure out what's going on.  I 
can't justify that much time and effort on this.

I'm guessing the RequestDispatcher only works down below the filters, 
which is where Struts is invoked.

I welcome any further ideas.


-- 

Mitch


Re: Check if a URL exists programmatically

Posted by Mitch Claborn <mi...@claborn.net>.
2015-07-17 18:48 GMT+03:00 Mitch Claborn <mi...@claborn.net>:
>> I spent some time yesterday digging through code without much luck. Today
>> I'm going to experiment with this: getting a RequestDispatcher for the URL
>> from the ServletContext, creating a dummy ServletRequest and ServletResponse
>> object and invoking include(request, response) or forward() on that
>> dispatcher.  With luck, I'll be able to get what would be the response from
>> a HEAD or a GET request in some sort of output stream in the response
>> object, then examine that output stream for the result.

The way I finally solved this is a bit of a shortcut, but it works.
Since our site is completely based on Struts, I'm reading the struts.xml
file and matching the action names against the URL returned by google.
For those that don't have a specific match, I call a routine in my
"default" action that checks for various dynamically named pages
(categories, products, etc.).  It runs super fast and doesn't need the
dummy request and response objects.

I was hoping for something that would be framework agnostic, but this
will do for now.
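
A rough Java sketch of the approach described above, for anyone who
finds this thread later. It is not the actual code: the class name
StrutsActionIndex, the ".action" style extension parameter, and the
dynamicPageCheck callback (standing in for the routine in the "default"
action) are all hypothetical. It also assumes a Struts 2 style
struts.xml with <package namespace="..."> and <action name="..."> 
elements, an application deployed at the root context, and it ignores
wildcard action names and <include> files.

```java
import java.io.InputStream;
import java.io.StringReader;
import java.net.URI;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical index of exact action paths read from a struts.xml stream. */
public class StrutsActionIndex {
    private final Set<String> actionPaths = new HashSet<>();
    private final Predicate<String> dynamicPageCheck;

    public StrutsActionIndex(InputStream strutsXml, String extension,
                             Predicate<String> dynamicPageCheck) throws Exception {
        this.dynamicPageCheck = dynamicPageCheck;
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Resolve any external DTD to an empty document so parsing stays offline.
        builder.setEntityResolver((publicId, systemId) -> new InputSource(new StringReader("")));
        NodeList packages = builder.parse(strutsXml).getElementsByTagName("package");
        for (int i = 0; i < packages.getLength(); i++) {
            Element pkg = (Element) packages.item(i);
            String ns = pkg.getAttribute("namespace");  // "" when the attribute is absent
            if (ns.equals("/")) {
                ns = "";
            }
            NodeList actions = pkg.getElementsByTagName("action");
            for (int j = 0; j < actions.getLength(); j++) {
                String name = ((Element) actions.item(j)).getAttribute("name");
                if (!name.contains("*")) {  // wildcard mappings are skipped in this sketch
                    actionPaths.add(ns + "/" + name + extension);
                }
            }
        }
    }

    /** True if the URL's path is a known action or passes the dynamic-page check. */
    public boolean exists(String url) {
        String path;
        try {
            path = URI.create(url).getPath();
        } catch (IllegalArgumentException e) {
            return false;  // malformed URL from the search results
        }
        if (path == null) {
            return false;
        }
        return actionPaths.contains(path) || dynamicPageCheck.test(path);
    }
}
```

The index is built once (for example at startup), so each search result
costs only a set lookup plus whatever the dynamic-page check does.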

Thanks all for your help and suggestions.

Mitch



Re: Check if a URL exists programmatically

Posted by Konstantin Kolinko <kn...@gmail.com>.
2015-07-17 18:48 GMT+03:00 Mitch Claborn <mi...@claborn.net>:
>
> I spent some time yesterday digging through code without much luck. Today
> I'm going to experiment with this: getting a RequestDispatcher for the URL
> from the ServletContext, creating a dummy ServletRequest and ServletResponse
> object and invoking include(request, response) or forward() on that
> dispatcher.  With luck, I'll be able to get what would be the response from
> a HEAD or a GET request in some sort of output stream in the response
> object, then examine that output stream for the result.


Using dummy objects with those APIs is disallowed by the Servlet
specification.

If you run in "strict compliance mode", Tomcat will check this
requirement. As far as I remember, the error message mentions the
chapter number of the specification.

http://tomcat.apache.org/tomcat-7.0-doc/config/systemprops.html#Specification
See "WRAP_SAME_OBJECT".


Testing for the existence of static pages should be easy, with
ServletContext.getResource[AsStream]() or with other APIs (using
Tomcat's internal resources APIs, or accessing the files directly).
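
As an illustration of the "accessing the files directly" variant, here
is a small Java sketch. The class name StaticResourceCheck is
hypothetical, and it assumes the web application is unpacked to a
directory on disk (inside a servlet, a non-null result from
ServletContext.getResource(path) would be the equivalent check and also
works when no real directory is available). It rejects paths that escape
the web root and anything under WEB-INF or META-INF, since those are
never served to clients.

```java
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;

/** Hypothetical check for static files under an unpacked web application root. */
public class StaticResourceCheck {
    private final Path webRoot;

    public StaticResourceCheck(Path webRoot) {
        this.webRoot = webRoot.toAbsolutePath().normalize();
    }

    /** True if urlPath (starting with "/") maps to a regular, publicly served file. */
    public boolean exists(String urlPath) {
        if (urlPath == null || !urlPath.startsWith("/")) {
            return false;
        }
        final Path candidate;
        try {
            candidate = webRoot.resolve(urlPath.substring(1)).normalize();
        } catch (InvalidPathException e) {
            return false;  // characters the file system cannot represent
        }
        if (!candidate.startsWith(webRoot)) {
            return false;  // "../" traversal or an absolute path outside the root
        }
        String first = webRoot.relativize(candidate).getName(0).toString();
        if (first.equalsIgnoreCase("WEB-INF") || first.equalsIgnoreCase("META-INF")) {
            return false;  // protected directories are not reachable over HTTP
        }
        return Files.isRegularFile(candidate);
    }
}
```

This only answers the question for static files; anything mapped to a
servlet, filter, or JSP still needs one of the other approaches.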

Testing for the existence of dynamic pages may be hard. You cannot
check for existence without making an actual request (preferably a
HEAD request rather than a GET).

If you are unlucky, a GET request may trigger some action. For example,
the Tomcat Manager application suffered from this problem:
https://bz.apache.org/bugzilla/show_bug.cgi?id=50231

A HEAD request is better, as it produces no output, but e.g. for JSPs
a HEAD request is implemented as GET + suppressing output, so it
actually performs the same processing.

Best regards,
Konstantin Kolinko



Re: Check if a URL exists programmatically

Posted by Mitch Claborn <mi...@claborn.net>.
On 07/16/2015 02:19 PM, chris derham wrote:
>> I already have a custom error page.  When I detect that a URL returned by
>> google would return a 404, I exclude it from the search results so that the
>> user never sees it.
>>
>> Mitch
> Mitch,
>
> Ok I see now what you mean. Sorry your original email was quite clear.
>
> Hmm interesting challenge. Big picture terms, I guess the two obvious
> choices seem to be to not use google for searching, or parse the
> google results, and determine the url validity as you are doing.
> Depending on the urls you use, that could be horrible. Guess that's
> where you are. Is not using google an option?
>
> Please let us know how you resolve it.
>
> Chris
>
Doing without google is not an option.  We are quite happy with them 
except for this one, admittedly minor, glitch.

I spent some time yesterday digging through code without much luck. 
Today I'm going to experiment with this: getting a RequestDispatcher 
for the URL from the ServletContext, creating a dummy ServletRequest and 
ServletResponse object and invoking include(request, response) or 
forward() on that dispatcher.  With luck, I'll be able to get what would 
be the response from a HEAD or a GET request in some sort of output 
stream in the response object, then examine that output stream for the 
result.

-- 

Mitch



Re: Check if a URL exists programmatically

Posted by chris derham <ch...@derham.me.uk>.
> I already have a custom error page.  When I detect that a URL returned by
> google would return a 404, I exclude it from the search results so that the
> user never sees it.
>
> Mitch

Mitch,

OK, I see now what you mean. Sorry, your original email was quite clear.

Hmm interesting challenge. Big picture terms, I guess the two obvious
choices seem to be to not use google for searching, or parse the
google results, and determine the url validity as you are doing.
Depending on the urls you use, that could be horrible. Guess that's
where you are. Is not using google an option?

Please let us know how you resolve it.

Chris



Re: Check if a URL exists programmatically

Posted by Mitch Claborn <mi...@claborn.net>.
On 07/16/2015 01:04 PM, chris derham wrote:
>> Short question: How can I, from within code running under Tomcat, determine
>> if a given URL request to that tomcat instance would result in a 404 or not,
>> without calling back to the Tomcat using an HTTP HEAD or GET?
>>
>> Background: We use google custom search by calling the google server and
>> then formatting the results on our search page.  Our range of products is
>> fairly fluid, and there is occasionally a gap between when a product goes
>> away and the google search index is updated, which would result in a 404 if
>> user clicked that link in the search results.  (I know that I can ask google
>> to re-index, but I still need to solve this problem.)
>>
>> Rather than write a ton of code for the various types of pages that we have
>> (product, category, etc) I'd like to just be able to call some Tomcat method
>> to determine if the URL that I get back from google would result in a 404 or
>> not.  I'm currently calling back to the Tomcat instance using an HTTP HEAD
>> call, but that is a waste of resources and during periods of high volume
>> uses up processing threads that I want to reserve for actual customers.
>>
>> We are using Tomcat 7 with Struts.
> Mitch,
>
> What will you do when you detect a 404? Couldn't you just implement a
> custom 404 error page, that does what ever it is?
>
> Chris
>

I already have a custom error page.  When I detect that a URL returned 
by google would return a 404, I exclude it from the search results so 
that the user never sees it.

Mitch



Re: Check if a URL exists programmatically

Posted by chris derham <ch...@derham.me.uk>.
> Short question: How can I, from within code running under Tomcat, determine
> if a given URL request to that tomcat instance would result in a 404 or not,
> without calling back to the Tomcat using an HTTP HEAD or GET?
>
> Background: We use google custom search by calling the google server and
> then formatting the results on our search page.  Our range of products is
> fairly fluid, and there is occasionally a gap between when a product goes
> away and the google search index is updated, which would result in a 404 if
> user clicked that link in the search results.  (I know that I can ask google
> to re-index, but I still need to solve this problem.)
>
> Rather than write a ton of code for the various types of pages that we have
> (product, category, etc) I'd like to just be able to call some Tomcat method
> to determine if the URL that I get back from google would result in a 404 or
> not.  I'm currently calling back to the Tomcat instance using an HTTP HEAD
> call, but that is a waste of resources and during periods of high volume
> uses up processing threads that I want to reserve for actual customers.
>
> We are using Tomcat 7 with Struts.

Mitch,

What will you do when you detect a 404? Couldn't you just implement a
custom 404 error page that does whatever it is?

Chris
