Posted to users@tomcat.apache.org by Scott Purcell <sp...@vertisinc.com> on 2006/02/11 15:34:20 UTC

Robots cannot read JSP?

Tim,
Thanks a lot for the info. I got to thinking, and tried invoking curl
from my box on the url, and saw exactly what you saw: the JS screwing
things up.

So I decided to run curl on different pages, and I came to the
conclusion that only .htm or .html pages show up via curl?

Does anyone think that the robots are just like curl, and that they can
only read HTML files?

Thanks for all, I know this is a bit off topic ...and I hope I don't
hack anyone off.

Thanks
Scott

-----Original Message-----
From: Tim Funk [mailto:funkman@joedog.org] 
Sent: Friday, February 10, 2006 8:50 PM
To: Tomcat Users List
Subject: Re: Access log to see where robots go.

The problem is your home page, not robots.txt. When / is requested, the
following is served back; notice the JavaScript redirect (the full file
is below):

----
   function invokeWebApp() {
     top.location.href = "http://www.theuniquepear.com/unique/index.jsp";
   }
----
Search engines do not execute JavaScript, and there are no links on the
page, so search engines have nowhere to go. (Except someone else's site).

As much as I detest SEO companies, you might find it helpful to search
for one for some assistance.

<html>
<head>
   <head>
     <title>The Unique Pear | Unique Home Decor & Accessories</title>
     <meta name="description" content="The Unique Pear is an online boutique specializing in home decor & accessories. Products include clocks, candles, wall decor, garden, lighting, bath and more.">
     <meta name="keywords" content="The Unique Pear Timework clocks, lamps, lamp shades, candles, aroma, aroma difuser, wall decor, wall scounces, wrought iron, pitchers, bookstands, jaqua bath products, candleholders">
                 <meta name="description" content="">
<meta name="keywords" content="">
  </head>
<body bgcolor="#FFFFFF">

<script language = "javascript">
   //<!--
   function invokeWebApp() {
     top.location.href = "http://www.theuniquepear.com/unique/index.jsp";
   }
   invokeWebApp();
   // -->
</script>

hello
</body>
</html>

-Tim
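
A rough way to see Tim's point is to look at the page the way a client
that never runs scripts would. The sketch below is JavaScript for Node.js
and is not from the thread: servedHomePage is an abbreviated copy of the
file above, withPlainLink is a hypothetical variant with one ordinary link
added, and the regex-based extractLinks is only a simplification of what a
real crawler's HTML parser does.

```javascript
// Rough model of a crawler that does not run scripts: it only sees markup.
// The regex-based extraction is a simplification, not a real HTML parser.

// Abbreviated copy of the home page quoted in the thread.
const servedHomePage = `
<html>
<head><title>The Unique Pear | Unique Home Decor & Accessories</title></head>
<body bgcolor="#FFFFFF">
<script language="javascript">
   function invokeWebApp() {
     top.location.href = "http://www.theuniquepear.com/unique/index.jsp";
   }
   invokeWebApp();
</script>
hello
</body>
</html>`;

// Hypothetical variant: the same page with one ordinary link in the body.
const withPlainLink = servedHomePage.replace(
  'hello',
  '<a href="/unique/index.jsp">Enter the store</a>'
);

// Drop <script> blocks (they are never executed), then collect the href
// values of <a> tags from what is left.
function extractLinks(html) {
  const withoutScripts = html.replace(/<script\b[\s\S]*?<\/script>/gi, '');
  const anchor = /<a\b[^>]*\bhref\s*=\s*["']([^"']+)["']/gi;
  return Array.from(withoutScripts.matchAll(anchor), (m) => m[1]);
}

console.log(extractLinks(servedHomePage)); // []
console.log(extractLinks(withPlainLink));  // [ '/unique/index.jsp' ]
```

The URL inside the script block is invisible to this kind of client, so
the page as served yields no links at all; a single plain anchor is enough
to give it somewhere to go.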

Scott Purcell wrote:
> I have had trouble getting search engines to see my site. I built it
> with struts, and use some tags from the index.html page to get business
> logic, to finally get to my page. The url is
> http://www.theuniquepear.com
> 
> Anyway, upon talking to some co-workers, they suggested I watch my
> access log, so I can see what files they are indexing. I thought I had
> the access log turned on for the site, and see when someone hits my web
> site, but as far as the searchbots go, I only see this in my logs daily.
> 
> $ cat  localhost_access_log.2006-02-07.txt | less
> 67.15.16.30 - - [07/Feb/2006:03:44:55 -0600] "GET /robots.txt HTTP/1.0" 404 985
> 67.15.16.30 - - [07/Feb/2006:03:46:21 -0600] "GET / HTTP/1.0" 200 844
> 67.15.16.30 - - [07/Feb/2006:03:51:57 -0600] "GET /robots.txt HTTP/1.0" 404 985
> 62.114.208.233 - - [07/Feb/2006:03:52:42 -0600] "GET /unique/welcome.do?OVRAW=home%20decorating%20ideas&OVKEY=home
> 62.114.208.233 - - [07/Feb/2006:03:52:44 -0600] "GET /unique/includes/siteWide.css HTTP/1.1" 200 15402
> 62.114.208.233 - - [07/Feb/2006:03:52:44 -0600] "GET /unique/images/header_pear.jpg HTTP/1.1" 200 11227
> 
> 
> I see the entry for robots.txt, but I have no idea where they are
> going, or what they are doing.
> 
> I turned on the access log in server.xml like so:
>         <Valve className="org.apache.catalina.valves.AccessLogValve"
>                  directory="logs"  prefix="localhost_access_log." suffix=".txt"
>                  pattern="common" resolveHosts="false"/>
> 
> And that is a snippet of the log from above.
>
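
One way to answer "where do the robots go" from a log like this is to
group requests by client. A minimal sketch in JavaScript (Node.js), not
from the thread: it assumes the "common" pattern configured in the Valve
above, uses only the complete lines from the excerpt, and treats any host
that requested /robots.txt as a probable robot, which is a heuristic
rather than a reliable test. parseLine and robotTrails are illustrative
names.

```javascript
// Access-log lines in the "common" pattern look like:
//   host ident user [timestamp] "METHOD path protocol" status bytes
const COMMON_LOG = /^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) (\S+)$/;

// Complete lines copied from the log excerpt in the thread.
const sampleLines = [
  '67.15.16.30 - - [07/Feb/2006:03:44:55 -0600] "GET /robots.txt HTTP/1.0" 404 985',
  '67.15.16.30 - - [07/Feb/2006:03:46:21 -0600] "GET / HTTP/1.0" 200 844',
  '67.15.16.30 - - [07/Feb/2006:03:51:57 -0600] "GET /robots.txt HTTP/1.0" 404 985',
  '62.114.208.233 - - [07/Feb/2006:03:52:44 -0600] "GET /unique/includes/siteWide.css HTTP/1.1" 200 15402',
  '62.114.208.233 - - [07/Feb/2006:03:52:44 -0600] "GET /unique/images/header_pear.jpg HTTP/1.1" 200 11227',
];

// Turn one log line into an object, or null if it does not match.
function parseLine(line) {
  const m = COMMON_LOG.exec(line);
  if (!m) return null;
  return { host: m[1], time: m[2], method: m[3], path: m[4], status: Number(m[5]) };
}

// Heuristic: any host that asked for /robots.txt is a probable robot.
// Returns, per such host, everything it requested, in log order.
function robotTrails(lines) {
  const entries = lines.map(parseLine).filter(Boolean);
  const robotHosts = new Set(
    entries.filter((e) => e.path === '/robots.txt').map((e) => e.host)
  );
  const trails = {};
  for (const e of entries) {
    if (!robotHosts.has(e.host)) continue;
    (trails[e.host] = trails[e.host] || []).push(`${e.path} ${e.status}`);
  }
  return trails;
}

console.log(robotTrails(sampleLines));
// { '67.15.16.30': [ '/robots.txt 404', '/ 200', '/robots.txt 404' ] }
```

On this excerpt the only such client fetched robots.txt and / and nothing
else, which fits a home page that offers no links to follow.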

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org



Re: Robots cannot read JSP?

Posted by Mike Sabroff <mi...@cygnusb2b.com>.

Scott,
Your assessment is incorrect!  First off, curl doesn't "read" HTML pages;
it does a GET or POST to a URL just as your browser does when you click a
link (and curl can do a lot of other things besides). Second, it is not
the JSP that is the problem; it is the JavaScript, as Tim said, and the
lack of links.

Mike


-- 
Mike Sabroff
Web Services Developer
mike.sabroff@cygnusb2b.com
920-568-8379



Re: Robots cannot read JSP?

Posted by David Smith <dn...@cornell.edu>.

I doubt the problem is with curl not being able to read files other than
.htm or .html. The problem is that only browsers execute JavaScript. Think
of curl or the search engines as a browser with JavaScript disabled.
What would you get in IE or Firefox if you disabled JavaScript?


-- David





RE: Robots cannot read JSP?

Posted by Tim Lucia <ti...@yahoo.com>.

It's not about the HTML or JSP nature of things.  You are returning text/html
for the MIME type, and a real HTML document.  The problem is that the content
you return does not give the robots any place to go.

Perhaps responding with a redirect (302) will give them somewhere to go.
You can use a meta refresh, or <logic:redirect>, or, if front-ended with Apache,
just provide a RedirectMatch ^/$ /unique/index.jsp line...

HTH,
Tim
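
As a sketch of the 302 idea, the JavaScript below (Node.js, standing in
for whatever actually fronts the site; it is not Tomcat, Struts, or Apache
configuration) shows the server side answering / with a Location header
and the client side resolving it without running any script. routeRequest
and followRedirect are hypothetical names; the target path is the one from
the thread.

```javascript
// Sketch: answer "/" with an HTTP 302 instead of a JavaScript redirect.
const http = require('http');

// Target path taken from the thread.
const STORE_ENTRY = '/unique/index.jsp';

// Decide the response for a request path. Kept as a pure function so the
// decision can be checked without starting a server.
function routeRequest(path) {
  if (path === '/') {
    return { status: 302, headers: { Location: STORE_ENTRY }, body: '' };
  }
  return { status: 404, headers: { 'Content-Type': 'text/plain' }, body: 'Not found\n' };
}

// Wiring into Node's built-in HTTP server. The server is created but not
// started; call server.listen(<port>) to try it with curl -i.
const server = http.createServer((req, res) => {
  const { status, headers, body } = routeRequest(req.url);
  res.writeHead(status, headers);
  res.end(body);
});

// What a client that runs no scripts does with a 3xx answer: resolve the
// Location header against the URL it requested. Returns null otherwise.
function followRedirect(requestUrl, response) {
  const location = response.headers.Location;
  if (response.status >= 300 && response.status < 400 && location) {
    return new URL(location, requestUrl).href;
  }
  return null;
}

console.log(followRedirect('http://www.theuniquepear.com/', routeRequest('/')));
// http://www.theuniquepear.com/unique/index.jsp
```

The redirect target travels in a response header, so any HTTP client can
follow it, whereas the original page hides the same URL inside a script.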


