Posted to user@nutch.apache.org by J S <ve...@hotmail.com> on 2005/06/09 23:21:46 UTC
parsing msword docs
Hi,
Complete newbie here so sorry if this is a silly question! I was wondering
about the following message in the crawl.log I have:
050609 221715 fetch okay, but can't parse
http://planet.bp.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/BAKC+10/$FILE/No+10.+Customer+Compensation.doc,
reason: Content-Type not text/html: application/msword
Would my search be more efficient if I turned on the plugin to parse
Microsoft Word docs? If so, how do I turn the plugin on?
Thanks for any help,
JS.
Re: JavaScript Urls
Posted by Jack Tang <hi...@gmail.com>.
:)
Thanks for your tip
Regards
/Jack
On 9/7/05, Andrzej Bialecki <ab...@getopt.org> wrote:
> Jack Tang wrote:
> > Hi Andrzej
> >
> > I think javascript-function-and-url mapping is a good solution.
> > Say
> > domainName.javascript:go = http://www.a.com/b.jsp?id={0}
> >
> > "go" is the javascipt function and it contains one param. And
> > "http://www.a.com/b.jsp?id={0}" is the URL template for "go" function.
> > and {0} is the exactly param, it should be merged when "go" function
> > is detected.
> > Now the problem I face is in "go" function the form is submited, and
> > the "action" is "POST".
>
> Wow, that's a pretty old thread... The JS "pseudo-parser" plugin is just
> that - it doesn't really understand JavaScript, it just tries to extract
> URLs, and does so with a quite high error rate... but still better than
> nothing.
>
> If you want a full-fledged solution that can actually interpret your
> scripts, then take a look at the HttpUnit or HtmlUnit frameworks - both of
> which can be turned into JavaScript-aware crawlers.
>
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
>
>
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
Re: JavaScript Urls
Posted by Andrzej Bialecki <ab...@getopt.org>.
Jack Tang wrote:
> Hi Andrzej
>
> I think javascript-function-and-url mapping is a good solution.
> Say
> domainName.javascript:go = http://www.a.com/b.jsp?id={0}
>
> "go" is the javascipt function and it contains one param. And
> "http://www.a.com/b.jsp?id={0}" is the URL template for "go" function.
> and {0} is the exactly param, it should be merged when "go" function
> is detected.
> Now the problem I face is in "go" function the form is submited, and
> the "action" is "POST".
Wow, that's a pretty old thread... The JS "pseudo-parser" plugin is just
that - it doesn't really understand JavaScript, it just tries to extract
URLs, and does so with a quite high error rate... but still better than
nothing.
If you want a full-fledged solution that can actually interpret your
scripts, then take a look at the HttpUnit or HtmlUnit frameworks - both of
which can be turned into JavaScript-aware crawlers.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: JavaScript Urls
Posted by Jack Tang <hi...@gmail.com>.
Hi Andrzej
I think javascript-function-and-url mapping is a good solution.
Say
domainName.javascript:go = http://www.a.com/b.jsp?id={0}
"go" is the javascipt function and it contains one param. And
"http://www.a.com/b.jsp?id={0}" is the URL template for "go" function.
and {0} is the exactly param, it should be merged when "go" function
is detected.
Now the problem I face is in "go" function the form is submited, and
the "action" is "POST".
Regards
/Jack
On 6/10/05, Andrzej Bialecki <ab...@getopt.org> wrote:
> Howie Wang wrote:
> > I think you have to hack the parse-html plugin. Look at getOutlinks()
> > in DOMContentUtils.java. You'll probably have to look for targets that
> > start with "javascript:" and do some string replacing.
>
> The latest SVN version already has a JavaScript link extractor
> (JSParseFilter in parse-js plugin). Currently it handles extraction of
> JS snippets from HTML events (onload, onclick, onmouseover, etc), and of
> course from <script> elements. The only thing missing to handle your
> case is to add a clause to handle the "javascript:" in any other attribute.
>
> I can make this change. Watch the commit messages so that you know when
> to sync your source.
>
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
>
>
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
Re: JavaScript Urls
Posted by Andrzej Bialecki <ab...@getopt.org>.
Howie Wang wrote:
> I think you have to hack the parse-html plugin. Look at getOutlinks()
> in DOMContentUtils.java. You'll probably have to look for targets that
> start with "javascript:" and do some string replacing.
The latest SVN version already has a JavaScript link extractor
(JSParseFilter in parse-js plugin). Currently it handles extraction of
JS snippets from HTML events (onload, onclick, onmouseover, etc), and of
course from <script> elements. The only thing missing to handle your
case is to add a clause to handle the "javascript:" in any other attribute.
I can make this change. Watch the commit messages so that you know when
to sync your source.
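To make concrete what "just tries to extract urls" means, here is a
much-simplified stand-in for that style of extraction (not the actual
JSParseFilter code, just an illustration of the approach):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JsLinkSniffer {
    // Naive URL extraction from a JS snippet: grab quoted string literals
    // that look like absolute http(s) URLs. No interpretation of the
    // script happens, which is exactly why the error rate is high.
    static List<String> extractUrls(String script) {
        List<String> urls = new ArrayList<String>();
        Matcher m = Pattern.compile("[\"'](https?://[^\"']+)[\"']").matcher(script);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        String onclick = "window.open('http://www.example.com/page.jsp?id=7', 'win')";
        System.out.println(extractUrls(onclick));
    }
}
```

URLs built dynamically by string concatenation in the script are invisible
to this kind of scan, which is where the real interpreters mentioned above
come in.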
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
RE: JavaScript Urls
Posted by Howie Wang <ho...@hotmail.com>.
I think you have to hack the parse-html plugin. Look at getOutlinks() in
DOMContentUtils.java. You'll probably have to look for targets that start
with "javascript:" and do some string replacing.
Howie
>Hi,
>
> Anyone here know how to make Nutch read "<a href=javascript(aaa);>" as
>http://www.myurl.com/one.php?id=aaa ?
>
>Thanks in advance.
>
>Marco
>
>
>
JavaScript Urls
Posted by lu...@uol.com.br.
Hi,
Anyone here know how to make Nutch read "<a href=javascript(aaa);>" as
http://www.myurl.com/one.php?id=aaa ?
Thanks in advance.
Marco
Re: parsing msword docs
Posted by J S <ve...@hotmail.com>.
Hi,
Thanks for your reply. I'm running the crawl again now with that expression
added in, so it will be interesting to see the results later.
I also changed the protocol to <value>protocol-(http|https|ftp)|parse ....
to try and pick up ftp and https sites as well. However it looks as if https
doesn't work:
050610 082219 fetch of http://my.bp.com/login.do failed with:
org.apache.nutch.protocol.http.HttpException: Not an HTTP
url:https://my.bp.com/password/redirect.jsp
Have I configured this wrong, or is SSL support not added in yet?
JS.
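PS: looking at the exception again, protocol-http seems to flatly refuse
anything that isn't an http: URL, so adding https to the regex alone may
not help; an https-capable protocol plugin would be needed. A sketch of
what that might look like in nutch-site.xml, assuming such a plugin exists
in your build (the protocol-httpclient name is a guess here; check the
plugins directory for what is actually available):

```xml
<property>
  <name>plugin.includes</name>
  <!-- protocol-httpclient is assumed to exist and to handle both
       http and https; verify against your Nutch version -->
  <value>protocol-httpclient|parse-(text|html|msword)|index-basic|query-(basic|site|url)</value>
</property>
```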
>First, sorry for my English.
>You should look at conf/nutch-default.xml:
>
><property>
> <name>plugin.includes</name>
>
><value>protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)</value>
> <description>Regular expression naming plugin directory names to
> include. Any plugin not matching this expression is excluded. By
> default Nutch includes crawling just HTML and plain text via HTTP,
> and basic indexing and search plugins.
> </description>
></property>
>
>By default Nutch only parses text and HTML files, so to parse msword
>you should make some changes in conf/nutch-site.xml:
>
> <property>
> <name>plugin.includes</name>
>
><value>protocol-http|parse-(text|html|msword|pdf|rtf)|index-basic|query-(basic|site|url)</value>
> <description>Regular expression naming plugin directory names to
> include. Any plugin not matching this expression is excluded. By
> default Nutch includes crawling just HTML and plain text via HTTP,
> and basic indexing and search plugins.
> </description>
></property>
>
>
>
>
>
>On 6/9/05, J S <ve...@hotmail.com> wrote:
> > Hi,
> >
> > Complete newbie here so sorry if this is a silly question! I was
> > wondering
> > about the following message in the crawl.log I have:
> >
> > 050609 221715 fetch okay, but can't parse
> >
> > http://planet.bp.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/BAKC+10/$FILE/No+10.+Customer+Compensation.doc,
> > reason: Content-Type not text/html: application/msword
> >
> > Would my search be more efficient if I turned on the plugin to parse
> > Microsoft Word docs? If so, how do I turn the plugin on?
> >
> > Thanks for any help,
> >
> > JS.
> >
> >
> >
Re: parsing msword docs
Posted by Santi Gori <ha...@gmail.com>.
First, sorry for my English.
You should look at conf/nutch-default.xml:
<property>
<name>plugin.includes</name>
<value>protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins.
</description>
</property>
By default Nutch only parses text and HTML files, so to parse msword
you should make some changes in conf/nutch-site.xml:
<property>
<name>plugin.includes</name>
<value>protocol-http|parse-(text|html|msword|pdf|rtf)|index-basic|query-(basic|site|url)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins.
</description>
</property>
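Note that the <value> is just a regular expression matched against plugin
directory names, so you can sanity-check which plugins a given value
admits with a few lines of Java (illustrative only, not Nutch code):

```java
import java.util.regex.Pattern;

public class PluginIncludesCheck {
    public static void main(String[] args) {
        // the plugin.includes value from nutch-site.xml above
        Pattern includes = Pattern.compile(
            "protocol-http|parse-(text|html|msword|pdf|rtf)|index-basic|query-(basic|site|url)");
        for (String plugin : new String[] {
                "parse-msword", "parse-pdf", "parse-js", "protocol-http"}) {
            // prints e.g. "parse-msword -> true", "parse-js -> false"
            System.out.println(plugin + " -> " + includes.matcher(plugin).matches());
        }
    }
}
```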
On 6/9/05, J S <ve...@hotmail.com> wrote:
> Hi,
>
> Complete newbie here so sorry if this is a silly question! I was wondering
> about the following message in the crawl.log I have:
>
> 050609 221715 fetch okay, but can't parse
> http://planet.bp.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/BAKC+10/$FILE/No+10.+Customer+Compensation.doc,
> reason: Content-Type not text/html: application/msword
>
> Would my search be more efficient if I turned on the plugin to parse
> Microsoft Word docs? If so, how do I turn the plugin on?
>
> Thanks for any help,
>
> JS.
>
>
>