Posted to user@nutch.apache.org by J S <ve...@hotmail.com> on 2005/06/09 23:21:46 UTC

parsing msword docs

Hi,

Complete newbie here, so sorry if this is a silly question! I was wondering 
about the following message in my crawl.log:

050609 221715 fetch okay, but can't parse 
http://planet.bp.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/BAKC+10/$FILE/No+10.+Customer+Compensation.doc, 
reason: Content-Type not text/html: application/msword

Would my search be more efficient if I turned on the plugin to parse 
Microsoft Word docs? If so, how do I turn the plugin on?

Thanks for any help,

JS.



Re: JavaScript Urls

Posted by Jack Tang <hi...@gmail.com>.
:) 
Thanks for your tip

Regards
/Jack

On 9/7/05, Andrzej Bialecki <ab...@getopt.org> wrote:
> Jack Tang wrote:
> > Hi Andrzej
> >
> > I think a javascript-function-to-URL mapping is a good solution.
> > Say
> > domainName.javascript:go = http://www.a.com/b.jsp?id={0}
> >
> > "go" is the JavaScript function and it takes one param, and
> > "http://www.a.com/b.jsp?id={0}" is the URL template for the "go" function;
> > {0} is the actual param, which should be substituted in when the "go"
> > function is detected.
> > The problem I now face is that in the "go" function a form is submitted,
> > and its "action" is "POST".
> 
> Wow, that's a pretty old thread... The JS "pseudo-parser" plugin is just
> that - it doesn't really understand JavaScript, it just tries to extract
> urls, and does it with quite high error rate... but still better than
> nothing.
> 
> If you want a full-fledged solution that can actually interpret your
> scripts, then take a look at HttpUnit or HtmlUnit frameworks - both of
> which can be turned into Javascript-aware crawlers.
> 
> --
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
> 


-- 
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: JavaScript Urls

Posted by Andrzej Bialecki <ab...@getopt.org>.
Jack Tang wrote:
> Hi Andrzej
> 
> I think a javascript-function-to-URL mapping is a good solution.
> Say
> domainName.javascript:go = http://www.a.com/b.jsp?id={0}
> 
> "go" is the JavaScript function and it takes one param, and
> "http://www.a.com/b.jsp?id={0}" is the URL template for the "go" function;
> {0} is the actual param, which should be substituted in when the "go"
> function is detected.
> The problem I now face is that in the "go" function a form is submitted,
> and its "action" is "POST".

Wow, that's a pretty old thread... The JS "pseudo-parser" plugin is just 
that - it doesn't really understand JavaScript, it just tries to extract 
urls, and does it with quite high error rate... but still better than 
nothing.

If you want a full-fledged solution that can actually interpret your 
scripts, then take a look at HttpUnit or HtmlUnit frameworks - both of 
which can be turned into Javascript-aware crawlers.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: JavaScript Urls

Posted by Jack Tang <hi...@gmail.com>.
Hi Andrzej

I think a javascript-function-to-URL mapping is a good solution.
Say
domainName.javascript:go = http://www.a.com/b.jsp?id={0}

"go" is the JavaScript function and it takes one param, and
"http://www.a.com/b.jsp?id={0}" is the URL template for the "go" function;
{0} is the actual param, which should be substituted in when the "go"
function is detected.
The problem I now face is that in the "go" function a form is submitted,
and its "action" is "POST".

Regards
/Jack

On 6/10/05, Andrzej Bialecki <ab...@getopt.org> wrote:
> Howie Wang wrote:
> > I think you have to hack the parse-html plugin. Look at the getOutlinks
> > method in DOMContentUtils.java. You'll probably have to look for targets
> > that start with "javascript:" and do some string replacing.
> 
> The latest SVN version already has a JavaScript link extractor
> (JSParseFilter in parse-js plugin). Currently it handles extraction of
> JS snippets from HTML events (onload, onclick, onmouseover, etc), and of
> course from <script> elements. The only thing missing to handle your
> case is to add a clause to handle the "javascript:" in any other attribute.
> 
> I can make this change. Watch the commit messages so that you know when
> to sync your source.
> 
> --
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
> 


-- 
Keep Discovering ... ...
http://www.jroller.com/page/jmars
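Jack's function-to-URL-template idea maps cleanly onto java.text.MessageFormat, which already uses the {0} placeholder syntax from his example. A minimal sketch (the class, method, and mapping table are mine, not Nutch code, and this only covers GET-style URLs; his POST case would still need real form handling):

```java
import java.text.MessageFormat;
import java.util.HashMap;
import java.util.Map;

public class JsUrlTemplates {

    // Hypothetical mapping table, built from config entries such as
    //   domainName.javascript:go = http://www.a.com/b.jsp?id={0}
    static final Map<String, String> TEMPLATES = new HashMap<>();
    static {
        TEMPLATES.put("go", "http://www.a.com/b.jsp?id={0}");
    }

    /** Expands a detected call like go('aaa') into its mapped URL, or null if unmapped. */
    static String expand(String function, String arg) {
        String template = TEMPLATES.get(function);
        return template == null ? null : MessageFormat.format(template, arg);
    }

    public static void main(String[] args) {
        System.out.println(expand("go", "aaa")); // http://www.a.com/b.jsp?id=aaa
    }
}
```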

Re: JavaScript Urls

Posted by Andrzej Bialecki <ab...@getopt.org>.
Howie Wang wrote:
> I think you have to hack the parse-html plugin. Look at the getOutlinks
> method in DOMContentUtils.java. You'll probably have to look for targets
> that start with "javascript:" and do some string replacing.

The latest SVN version already has a JavaScript link extractor 
(JSParseFilter in parse-js plugin). Currently it handles extraction of 
JS snippets from HTML events (onload, onclick, onmouseover, etc), and of 
course from <script> elements. The only thing missing to handle your 
case is to add a clause to handle the "javascript:" in any other attribute.

I can make this change. Watch the commit messages so that you know when 
to sync your source.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
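The missing clause Andrzej mentions, treating a "javascript:" value in any attribute as a script snippet and then extracting URLs from it, might look roughly like this. The regexes and class name are illustrative only, not the actual JSParseFilter code, and they inherit the pseudo-parser's high error rate:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JsHrefExtractor {

    // attr="javascript:..." or attr='javascript:...' in a raw HTML fragment;
    // group 2 captures the script body between the matching quotes.
    static final Pattern JS_ATTR = Pattern.compile(
        "\\w+\\s*=\\s*([\"'])javascript:(.+?)\\1", Pattern.CASE_INSENSITIVE);

    // Crude pseudo-parsing: anything in the script text that looks like an absolute URL.
    static final Pattern URL = Pattern.compile("https?://[^\\s\"'()<>]+");

    static List<String> extract(String html) {
        List<String> urls = new ArrayList<>();
        Matcher attr = JS_ATTR.matcher(html);
        while (attr.find()) {
            Matcher url = URL.matcher(attr.group(2));
            while (url.find()) {
                urls.add(url.group());
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href=\"javascript:window.open('http://www.a.com/b.jsp?id=1')\">go</a>";
        System.out.println(extract(html)); // [http://www.a.com/b.jsp?id=1]
    }
}
```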


RE: JavaScript Urls

Posted by Howie Wang <ho...@hotmail.com>.
I think you have to hack the parse-html plugin. Look at the getOutlinks
method in DOMContentUtils.java. You'll probably have to look for targets
that start with "javascript:" and do some string replacing.

Howie

>Hi,
>
>  Anyone here know how to make Nutch read "<a href=javascript(aaa);>" as
>http://www.myurl.com/one.php?id=aaa ?
>
>Thanks in advance.
>
>Marco
>
>
>



JavaScript Urls

Posted by lu...@uol.com.br.
Hi,

 Anyone here know how to make Nutch read "<a href=javascript(aaa);>" as
http://www.myurl.com/one.php?id=aaa ?

Thanks in advance.

Marco




Re: parsing msword docs

Posted by J S <ve...@hotmail.com>.
Hi,

Thanks for your reply. I'm running the crawl again now with that expression 
added in, so it will be interesting to see the results later.

I also changed the protocol to <value>protocol-(http|https|ftp)|parse .... 
to try and pick up ftp and https sites as well. However it looks as if https 
doesn't work:

050610 082219 fetch of http://my.bp.com/login.do failed with: 
org.apache.nutch.protocol.http.HttpException: Not an HTTP 
url:https://my.bp.com/password/redirect.jsp

Have I configured this wrong, or is SSL support not added yet?

JS.


>First, sorry for my English. You should look at conf/nutch-default.xml:
>
><property>
>  <name>plugin.includes</name>
>  <value>protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)</value>
>  <description>Regular expression naming plugin directory names to
>  include.  Any plugin not matching this expression is excluded.  By
>  default Nutch includes crawling just HTML and plain text via HTTP,
>  and basic indexing and search plugins.
>  </description>
></property>
>
>By default Nutch only works with text and HTML files, so to parse
>msword you should override this in conf/nutch-site.xml:
>
><property>
>  <name>plugin.includes</name>
>  <value>protocol-http|parse-(text|html|msword|pdf|rtf)|index-basic|query-(basic|site|url)</value>
>  <description>Regular expression naming plugin directory names to
>  include.  Any plugin not matching this expression is excluded.  By
>  default Nutch includes crawling just HTML and plain text via HTTP,
>  and basic indexing and search plugins.
>  </description>
></property>
>
>
>
>
>On 6/9/05, J S <ve...@hotmail.com> wrote:
> > Hi,
> >
> > Complete newbie here so sorry if this is a silly question! I was wondering
> > about the following message in the crawl.log I have:
> >
> > 050609 221715 fetch okay, but can't parse
> > http://planet.bp.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/BAKC+10/$FILE/No+10.+Customer+Compensation.doc,
> > reason: Content-Type not text/html: application/msword
> >
> > Would my search be more efficient if I turned on the plugin to parse
> > Microsoft Word docs? If so, how do I turn the plugin on?
> >
> > Thanks for any help,
> >
> > JS.
> >
> >
> >



Re: parsing msword docs

Posted by Santi Gori <ha...@gmail.com>.
First, sorry for my English. You should look at conf/nutch-default.xml:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.  By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>

By default Nutch only works with text and HTML files, so to parse
msword you should override this in conf/nutch-site.xml:

 <property>
  <name>plugin.includes</name>
  <value>protocol-http|parse-(text|html|msword|pdf|rtf)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.  By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>

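As the <description> text says, plugin.includes is an ordinary regular expression matched against plugin directory names, and any plugin not matching it is excluded, which is why extending the parse-(...) group is all that is needed. A quick self-check of the value above using plain java.util.regex, nothing Nutch-specific:

```java
import java.util.regex.Pattern;

public class PluginIncludesCheck {

    // The plugin.includes value from the nutch-site.xml override above.
    static final Pattern INCLUDES = Pattern.compile(
        "protocol-http|parse-(text|html|msword|pdf|rtf)|index-basic|query-(basic|site|url)");

    /** True if a plugin directory name is selected by plugin.includes. */
    static boolean included(String pluginDir) {
        return INCLUDES.matcher(pluginDir).matches();
    }

    public static void main(String[] args) {
        System.out.println(included("parse-msword")); // true
        System.out.println(included("parse-mp3"));    // false
    }
}
```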




On 6/9/05, J S <ve...@hotmail.com> wrote:
> Hi,
> 
> Complete newbie here so sorry if this is a silly question! I was wondering
> about the following message in the crawl.log I have:
>  
> 050609 221715 fetch okay, but can't parse
> http://planet.bp.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/BAKC+10/$FILE/No+10.+Customer+Compensation.doc,
> reason: Content-Type not text/html: application/msword
> 
> Would my search be more efficient if I turned on the plugin to parse
> Microsoft Word docs? If so, how do I turn the plugin on?
> 
> Thanks for any help,
> 
> JS.
> 
> 
>