Posted to user@nutch.apache.org by "nutch.buddy@gmail.com" <nu...@gmail.com> on 2012/03/07 14:34:07 UTC

Multiple parsers

Hi
I've looked at Nutch's code in ParseUtil and it seems that it was designed
so that only one parser is eventually activated on a single url.
What's the reason for this?
What should I do if I want, in addition to the existing parsers, to add a
parser that will get a certain field out of the url, and run this behaviour
on all the urls?
Do I have to add this code to all the parsers?


thanks.


--
View this message in context: http://lucene.472066.n3.nabble.com/Multiple-parsers-tp3806721p3806721.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Multiple parsers

Posted by Julien Nioche <li...@gmail.com>.

As I said, use the Tika parser and implement your own HtmlParseFilter - it
will get called on the XHTML representation of the doc whatever its mimetype.

J.
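
For the "get a certain field out of the url" part of the question, the extraction logic itself does not depend on Nutch. Below is a minimal sketch in plain Java, assuming the field is a query parameter. The class name UrlFieldExtractor and the parameter names are hypothetical and not part of Nutch. A filter implementation could call such a helper with the URL of the page being parsed and store the result in the parse metadata.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical helper, not part of Nutch: pulls one query parameter out of a URL.
final class UrlFieldExtractor {

    private UrlFieldExtractor() {}

    /** Returns the decoded value of the first query parameter called name, if present. */
    static Optional<String> queryParam(String url, String name) {
        final String rawQuery;
        try {
            rawQuery = new URI(url).getRawQuery();
        } catch (URISyntaxException e) {
            return Optional.empty(); // malformed URL: nothing to extract
        }
        if (rawQuery == null) {
            return Optional.empty(); // URL has no query string
        }
        for (String pair : rawQuery.split("&")) {
            int eq = pair.indexOf('=');
            String key = eq >= 0 ? pair.substring(0, eq) : pair;
            String value = eq >= 0 ? pair.substring(eq + 1) : "";
            if (decode(key).equals(name)) {
                return Optional.of(decode(value));
            }
        }
        return Optional.empty();
    }

    // Percent-decodes a query component; falls back to the raw text if it is malformed.
    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, StandardCharsets.UTF_8);
        } catch (IllegalArgumentException e) {
            return s;
        }
    }
}
```

Because the helper only looks at the URL string, it behaves the same for every document type, which is the property asked for in the original question.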

On 8 March 2012 07:29, nutch buddy <nu...@gmail.com> wrote:

> I looked at HtmlParseFilter.
> I think that's exactly what I need, but for other file types as well,
> not just html.
> Any reason why this behaviour was implemented only for html files?
>
> I'm thinking of extending this implementation so it would be available for
> other types. Any advice on that?
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: Multiple parsers

Posted by nutch buddy <nu...@gmail.com>.

I looked at HtmlParseFilter.
I think that's exactly what I need, but for other file types as well,
not just html.
Any reason why this behaviour was implemented only for html files?

I'm thinking of extending this implementation so it would be available for
other types. Any advice on that?

On Wed, Mar 7, 2012 at 11:14 PM, Ferdy Galema <fe...@kalooga.com> wrote:

> Hi,
>
> Do you mean running multiple parsers in a single parse action? That is
> currently only possible for html types. Take a look at HtmlParseFilter for
> that. You can chain multiple parsers for a single url, in addition to
> regular html parsing. For other types it's not possible.
>
> If this is about running a parse implementation on all urls regardless of
> mimetype, you have to change the parser mappings in parse-plugins.xml
> and the parser's plugin.xml. But again there is only support for running
> one Parser on a single document.
>
> Ferdy.

Re: Multiple parsers

Posted by Ferdy Galema <fe...@kalooga.com>.

Thanks for correcting me on this one.

The endpoint is org.apache.nutch.parse.ParseFilter in Nutchgora.

On Thu, Mar 8, 2012 at 10:33 AM, Julien Nioche <
lists.digitalpebble@gmail.com> wrote:

> >
> > Do you mean running multiple parsers in a single parse action? That is
> > currently only possible for html types. Take a look at HtmlParseFilter
> for
> > that. You can chain multiple parsers for a single url, in addition to
> > regular html parsing. For other types it's not possible.
> >
>
> Not only for html docs mind you. The tika parser produces a normalised
> XHTML representation of the docs which is then passed on to the
> HTMLParseFilter implementations (I think I renamed the endpoint in
> Nutchgora some time ago)

Re: Multiple parsers

Posted by Julien Nioche <li...@gmail.com>.

>
> Do you mean running multiple parsers in a single parse action? That is
> currently only possible for html types. Take a look at HtmlParseFilter for
> that. You can chain multiple parsers for a single url, in addition to
> regular html parsing. For other types it's not possible.
>

Not only for html docs mind you. The tika parser produces a normalised
XHTML representation of the docs which is then passed on to the
HTMLParseFilter implementations (I think I renamed the endpoint in
Nutchgora some time ago)
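
To make the point above concrete: a filter works on a DOM built from the normalised XHTML, not on the original bytes, so the same DOM code applies whatever the source format was. The following is only a stand-alone sketch using the JDK's XML parser on a hand-written XHTML string. It does not use Nutch or Tika, and the class name XhtmlTitleDemo is hypothetical. In Nutch 1.x the DOM is handed to the filter as a DocumentFragment argument rather than parsed from a string as done here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical stand-alone demo: reads a value out of an XHTML DOM,
// the kind of structure a parse filter receives after Tika has run.
final class XhtmlTitleDemo {

    private XhtmlTitleDemo() {}

    /** Returns the trimmed text of the first title element, or "" if there is none. */
    static String firstTitle(String xhtml) {
        try {
            // Default factory is not namespace-aware, so tag names match as written.
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            NodeList titles = doc.getElementsByTagName("title");
            return titles.getLength() == 0
                    ? ""
                    : titles.item(0).getTextContent().trim();
        } catch (Exception e) {
            // Input was not well-formed XML, or the parser could not be created.
            throw new IllegalArgumentException("Could not parse XHTML", e);
        }
    }
}
```

Whether the XHTML came from a PDF, a Word file or an HTML page makes no difference to code like this, which is why the filter extension point is usable beyond html documents.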



>
> If this is about running a parse implementation on all urls regardless of
> mimetype, you have to change the parser mappings in parse-plugins.xml
> and the parser's plugin.xml. But again there is only support for running
> one Parser on a single document.
>
> Ferdy.




Re: Multiple parsers

Posted by Ferdy Galema <fe...@kalooga.com>.

Hi,

Do you mean running multiple parsers in a single parse action? That is
currently only possible for html types. Take a look at HtmlParseFilter for
that. You can chain multiple parsers for a single url, in addition to
regular html parsing. For other types it's not possible.

If this is about running a parse implementation on all urls regardless of
mimetype, you have to change the parser mappings in parse-plugins.xml
and the parser's plugin.xml. But again there is only support for running
one Parser on a single document.

Ferdy.
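
As an illustration of the parser mappings mentioned above, here is a sketch of what routing every mimetype to a single parser could look like in conf/parse-plugins.xml. It assumes the stock layout of that file in Nutch 1.x and uses parse-tika as the example parser; check the element names and the extension id against the file shipped with your version before relying on it.

```xml
<parse-plugins>
  <!-- "*" is the catch-all entry used when no specific mimeType matches -->
  <mimeType name="*">
    <plugin id="parse-tika" />
  </mimeType>

  <!-- Example of pointing a specific type at the same parser -->
  <mimeType name="text/html">
    <plugin id="parse-tika" />
  </mimeType>

  <aliases>
    <alias name="parse-tika"
           extension-id="org.apache.nutch.parse.tika.TikaParser" />
  </aliases>
</parse-plugins>
```

The chosen parser's own plugin.xml also has to accept the content types it is mapped to, and the mapping still selects a single Parser per document.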

On Wed, Mar 7, 2012 at 2:34 PM, nutch.buddy@gmail.com <nutch.buddy@gmail.com> wrote:

> Hi
> I've looked at Nutch's code in ParseUtil and it seems that it was designed
> so that only one parser is eventually activated on a single url.
> What's the reason for this?
> What should I do if I want, in addition to the existing parsers, to add a
> parser that will get a certain field out of the url, and run this behaviour
> on all the urls?
> Do I have to add this code to all the parsers?
>
>
> thanks.