Posted to user@nutch.apache.org by Elwin <ma...@gmail.com> on 2006/02/17 08:51:06 UTC

extract links problem with parse-html plugin

It seems that the parse-html plugin may not process many pages well: I have
found that the plugin can't extract all of the valid links on a page when I
test it in my code.
I guess this may be caused by the style of the HTML page? When I "view
source" on a page I tried to parse, I saw that some elements in the source
are broken up by stray spaces and line breaks. This situation is quite
common on the pages of large portal sites and news sites.
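For what it's worth, extra whitespace and line breaks inside a tag are legal
HTML, so they alone should not defeat a lenient parser. A minimal sketch,
using the JDK's built-in Swing HTML parser rather than the Nutch parse-html
plugin, that still finds a link whose attributes are spread over several
lines:

  import java.io.IOException;
  import java.io.StringReader;
  import javax.swing.text.MutableAttributeSet;
  import javax.swing.text.html.HTML;
  import javax.swing.text.html.HTMLEditorKit;
  import javax.swing.text.html.parser.ParserDelegator;

  public class WhitespaceLinkTest {
    public static void main(String[] args) throws IOException {
      // An anchor whose attributes are spread over several lines;
      // this is still well-formed HTML and should yield one link.
      String html = "<html><body>\n"
                  + "<a\n"
                  + "   href = \"http://example.com/news/1.html\"\n"
                  + "   target=\"_blank\" >story</a>\n"
                  + "</body></html>";

      HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
          if (tag == HTML.Tag.A) {
            System.out.println("found link: " + attrs.getAttribute(HTML.Attribute.HREF));
          }
        }
      };
      new ParserDelegator().parse(new StringReader(html), callback, true);
    }
  }

If links like this are still dropped, the cause is more likely elsewhere
(badly broken markup further up the page, truncated content, or a charset
problem) than the whitespace itself.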

Re: extract links problem with parse-html plugin

Posted by Elwin <ma...@gmail.com>.
I have written a test class HtmlWrapper; here is some code:

  HtmlWrapper wrapper = new HtmlWrapper();
  Content c = getHttpContent("http://blog.sina.com.cn/lm/hot/index.html");

  // dump the raw page source
  String temp = new String(c.getContent());
  System.out.println(temp);

  // parse the content and collect all outlinks into an ArrayList
  wrapper.parseHttpContent(c);
  ArrayList links = wrapper.getBlogLinks();
  for (int i = 0; i < links.size(); i++) {
    String urlString = (String) links.get(i);
    System.out.println(urlString);
  }

I can only get a few of the links from that page.

The URL is from a Chinese site; you can just skip the non-English content
and look at the HTML elements.
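As a rough cross-check (my own suggestion, not something HtmlWrapper
provides), you could count the href attributes in the raw source you
already print and compare that number with what the plugin returns. This
continues the snippet above and reuses its temp and links variables; the
regex is only a sanity check on the input, not how the parse-html plugin
works internally:

  // add to the imports of the test class:
  // import java.util.regex.Matcher;
  // import java.util.regex.Pattern;

  // Count href="..." occurrences in the raw page source (temp) and compare
  // with the number of outlinks HtmlWrapper returned (links).
  Pattern href = Pattern.compile("<a\\s[^>]*href\\s*=\\s*[\"']?([^\"'\\s>]+)",
                                 Pattern.CASE_INSENSITIVE);
  Matcher m = href.matcher(temp);
  int count = 0;
  while (m.find()) {
    System.out.println(m.group(1));
    count++;
  }
  System.out.println(count + " href attributes in the raw source, "
                     + links.size() + " outlinks from HtmlWrapper");

If the raw count is high but links.size() is small, the page reaches the
parser intact and the problem is in the parsing step; if both numbers are
small, the fetched content itself may already be incomplete.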

2006/2/17, Guenter, Matthias <Ma...@ipi.ch>:
>
> Hi Elwin
> Can you provide samples of the links that don't work, and your code? And
> put it into JIRA?
> Kind regards
> Matthias


--
The Final Combat (《盖世豪侠》) won rave reviews and kept TVB's ratings high,
yet TVB, pleased as it was, still gave him no major roles. Stephen Chow was
never one to stay a small fish in a pond: once his comic talent had shown
itself, he would not accept being sidelined, so he moved into film and
displayed his flair on the big screen. TVB had found its thoroughbred and
then lost it, and naturally regretted it too late.

Re: AW: extract links problem with parse-html plugin

Posted by Po...@acocon.de.
Hi Guenter,

the site I have trouble with is

http://www.dmgbielefeld.de/de,dmg,dmg-bielefeld

Some of the site's links are extracted, but up to 80% are not. I have
switched the JavaScript plugin on.

Maybe you can take a look...

That would help me...

"Guenter, Matthias" <Ma...@ipi.ch> wrote on 17.02.2006 09:04:12:

> Hi Elwin
> Can you provide samples of the links that don't work, and your code? And
> put it into JIRA?
> Kind regards
> Matthias


AW: extract links problem with parse-html plugin

Posted by "Guenter, Matthias" <Ma...@ipi.ch>.
Hi Elwin
Can you provide samples of the links that don't work, and your code? And put it into JIRA?
Kind regards
Matthias




Re: extract links problem with parse-html plugin

Posted by Po...@acocon.de.
I have observed the same thing.

On my site the HTML source is roughly 160 kByte per page.
The parser definitely has problems here (whether or not JavaScript is used
on a page).

Before deciding on Nutch I tested the Java/Lucene-based open source
solution Oxygen ( http://sourceforge.net/projects/oxyus/ ). Its parser
does not have problems with the site.

I am not a developer; perhaps one of the Nutch developers could take a look
at the source code of the parser used there.

Perhaps it helps.
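One thing that may be worth ruling out here, offered as an assumption
rather than something confirmed in this thread: Nutch truncates fetched
pages at the http.content.limit property, whose default in nutch-default.xml
is 65536 bytes. On a 160 kByte page, everything after the first 64 kB,
including the links there, would then never reach the parser. A minimal
nutch-site.xml override to test this could look like:

  <!-- nutch-site.xml: raise or disable the fetch size cap (default 65536
       bytes). A negative value means no truncation at all. -->
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>

After changing it, the affected pages have to be re-fetched before the
parser sees the full content.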


