
Posted to user@nutch.apache.org by Matthias Naber <na...@informatik.hu-berlin.de> on 2011/06/23 21:16:13 UTC

Problem implementing my own HtmlParseFilter

Hey,

I'm new to the Nutch project and have just started to test some
things. I followed this example,
http://wiki.apache.org/nutch/WritingPluginExample, and implemented my
own HtmlParseFilter.

My custom MyHtmlParseFilter works fine on most pages, but isn't
called at all on others. (I also implemented an IndexingFilter that
works just fine.)

The goal was to add a new field to the search index. For most pages
my code is called and adds a custom field to the resulting search
index documents. For a few pages, my code is ignored and I don't see
this field in the index documents.

To sum up: my ParseFilter doesn't get called at all for a few
seemingly random pages ... why!?

I guess this may be related to the MIME type of the pages being
parsed? Does anyone have an idea what may cause this?

Regards,
mana

# I'm using nutch v.1.3 stable

Re: Problem implementing my own HtmlParseFilter

Posted by Markus Jelsma <ma...@openindex.io>.

Can you provide steps to reproduce with a public or sample XHTML document? 
Received HTTP headers may be interesting as well (e.g. type, length, 
redirections).
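
One way to collect the headers asked for here is a small stand-alone program run against a working URL and a failing URL. The following is only a sketch that uses the JDK's HttpURLConnection, not Nutch's own fetcher, so the server may answer it differently than it answers the crawler. The class name HeaderDump and its method names are made up for illustration.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Nutch.
class HeaderDump {

    // Formats the status code plus the headers mentioned above
    // (type, length, redirect target).
    static String summarize(int status, Map<String, List<String>> headers) {
        StringBuilder sb = new StringBuilder("Status: " + status + "\n");
        for (String name : new String[] {"Content-Type", "Content-Length", "Location"}) {
            sb.append(name).append(": ").append(firstValue(headers, name)).append("\n");
        }
        return sb.toString();
    }

    // Header names are case-insensitive; the status line is stored under a null key.
    static String firstValue(Map<String, List<String>> headers, String name) {
        for (Map.Entry<String, List<String>> e : headers.entrySet()) {
            if (e.getKey() != null && e.getKey().equalsIgnoreCase(name)
                    && !e.getValue().isEmpty()) {
                return e.getValue().get(0);
            }
        }
        return "(none)";
    }

    // Sends a GET request and summarizes the response headers.
    static String fetch(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        conn.setInstanceFollowRedirects(false); // show a redirect instead of following it
        try {
            return summarize(conn.getResponseCode(), conn.getHeaderFields());
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        for (String address : args) {
            System.out.println(address);
            System.out.print(fetch(address));
        }
    }
}
```

Pass the URLs as command-line arguments and compare the output for a page where the filter runs with one where it does not.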


Re: Problem implementing my own HtmlParseFilter

Posted by Matthias Naber <na...@informatik.hu-berlin.de>.

Hey,

first of all, I'm using Nutch v.1.3 stable.

The goal was to crawl a web app and then publish the data to Solr. For
the crawling and parsing part I use Nutch. Therefore I implemented my
own HtmlParseFilter. The only thing it does is extract a certain node
from the DOM and write its contents (node.getTextContent()) into a new
field. This field was added to the Solr schema and to the Nutch-Solr
mapping, and everything works quite well.

Except for some URLs that are not handled properly (i.e. "my
ParseFilter is not invoked"). Those URLs do not differ from the ones
that work. The XHTML is valid; it's a simple .(x)html document (the
MIME type detected via magic is something like application/xhtml+xml).
These pages are parsed by parse-html but, as I said, my ParseFilter is
not invoked for a subset of the pages. There is no exception. The
document shows up in Solr, but without my custom field from above.

I surrounded my whole code with a try{...}catch(Throwable th) in case
something weird happens within my code, but this still doesn't do the
trick. And since it doesn't get called, there is not much to log. No
exceptions or errors at all :(
Does a ParseFilter have to be registered for a certain MIME type?

Regards,
mana
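
The extraction step described above (find one node in the DOM, take its text) can be tried outside Nutch. This is only a sketch under assumptions: it uses the JDK's DOM classes rather than any Nutch API, it assumes the node is identified by an id attribute, and the names NodeTextExtractor and "summary" are invented for illustration. The method takes a plain org.w3c.dom.Node, so the same lookup could in principle be applied to whatever DOM node a filter receives.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch.
class NodeTextExtractor {

    // Depth-first search for the first element whose id attribute matches.
    // Returns its trimmed text content, or null if no such element exists.
    static String textOfElementWithId(Node root, String id) {
        if (root instanceof Element && id.equals(((Element) root).getAttribute("id"))) {
            return root.getTextContent().trim();
        }
        NodeList children = root.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            String found = textOfElementWithId(children.item(i), id);
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    // Parses a well-formed XHTML string with the JDK's XML parser.
    static Document parse(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<div id=\"summary\"> Hello <b>world</b> </div></body></html>";
        String text = textOfElementWithId(parse(page), "summary");
        // Report both outcomes, so a missing node is visible rather than silent.
        System.out.println(text == null ? "summary node missing" : "summary=" + text);
    }
}
```

Logging the "node missing" case as well as the success case helps to tell "the filter was never called" apart from "the filter ran but found nothing to add".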


Re: Problem implementing my own HtmlParseFilter

Posted by lewis john mcgibbney <le...@gmail.com>.

Hi Mana,

I think it would be best if you provided details on the following:

- What the HtmlParseFilter plugin does
- Some log data showing how it works with some URLs but not with
  others, so we can see the nature of the URLs it does and does not
  work with
- Which version of Nutch you are using

Some comments on your indexing plugin: in my opinion it is much easier
to create fields to be indexed if we define them in our mapping schema
and in our Solr configuration. My assumption is that you are not using
Solr for indexing, and that this is why you are having problems getting
your fields to map to the index. Would it be convenient to try Solr?
Without access to the code for your plugin, it is extremely hard to
root out the problem you are experiencing.

-- 
*Lewis*