Posted to dev@nutch.apache.org by Albin Vigier <al...@gmail.com> on 2014/09/25 10:24:51 UTC

Re: Generic xsl parser plugin

Hello everybody,

I'm just wondering if it is possible to fetch specific metadata with
an existing Nutch plugin.

Let's take an example.
I want to extract some metadata from "div" or "td" tags that have
specific ids in HTML pages, and name them the way I like (this is
done at parse time).
Then, at indexing time, I would use index-metadata (a very good plugin)
to add my custom metadata.

Currently, from what I've seen on the wiki and from a quick look at the
plugins, I suppose I have to code my own plugin each time I've got a
new site (with a new HTML structure). I've already done that by using
a node walker in a custom htmlParseFilter, but writing the extraction
can be a little bit boring :)

So on my side I've coded a little plugin that lets me specify
XPaths in an XML file. But before diving into more functionality, I'm
just wondering whether I have missed something.
This work allowed me to explore some aspects of Nutch, but I don't want to
reinvent the wheel or miss something.

Albin
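
To make the idea above concrete, here is a minimal sketch of what "XPaths
in an XML file" could look like in plain Java. It is not the plugin's code:
the rules format (<rules> with <field name=... xpath=...> entries), the
class name XPathRuleExtractor and the sample page are all assumptions made
for illustration. The page is well-formed XML, whereas real HTML would
first have to go through a tolerant parser.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Illustrative only: neither the plugin's code nor a Nutch API. */
class XPathRuleExtractor {

    // Hypothetical rules file: one <field> per metadata name, with the XPath to evaluate.
    static final String RULES =
        "<rules>"
      + "<field name='price' xpath=\"//td[@id='price']\"/>"
      + "<field name='summary' xpath=\"//div[@id='summary']\"/>"
      + "</rules>";

    // Stand-in for an already parsed, well-formed page.
    static final String PAGE =
        "<html><body>"
      + "<div id='summary'>A small flat</div>"
      + "<table><tr><td id='price'>42</td></tr></table>"
      + "</body></html>";

    /** Parses an XML string into a DOM document. */
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    /** Evaluates every rule against the page and returns field name -> extracted text. */
    static Map<String, String> extract(String rulesXml, String pageXml) throws Exception {
        Document rules = parse(rulesXml);
        Document page = parse(pageXml);
        XPath xpath = XPathFactory.newInstance().newXPath();
        Map<String, String> metadata = new LinkedHashMap<>();
        NodeList fields = rules.getElementsByTagName("field");
        for (int i = 0; i < fields.getLength(); i++) {
            Element field = (Element) fields.item(i);
            // STRING evaluation yields the text content of the first matching node.
            String value = (String) xpath.evaluate(
                field.getAttribute("xpath"), page, XPathConstants.STRING);
            metadata.put(field.getAttribute("name"), value.trim());
        }
        return metadata;
    }

    public static void main(String[] args) throws Exception {
        extract(RULES, PAGE).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

With this kind of layout, supporting a new site means adding lines to the
rules file rather than writing a new parse filter.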

Re: Generic xsl parser plugin

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.

Great work!

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Albinscode <al...@gmail.com>
Reply-To: "dev@nutch.apache.org" <de...@nutch.apache.org>
Date: Sunday, October 5, 2014 at 1:09 PM
To: "dev@nutch.apache.org" <de...@nutch.apache.org>
Subject: Re: Generic xsl parser plugin

>@Chris Thank you for your suggestion too.
>
>As requested, I've created
>https://issues.apache.org/jira/browse/NUTCH-1870 and provided a patch.
>
>Feel free to give me feedback. I'll continue to work on my branch ;)

Re: Generic xsl parser plugin

Posted by Albinscode <al...@gmail.com>.

@Chris Thank you for your suggestion too.

As requested, I've created
https://issues.apache.org/jira/browse/NUTCH-1870 and provided a patch.

Feel free to give me feedback. I'll continue to work on my branch ;)

2014-10-03 10:03 GMT+02:00 Albinscode <al...@gmail.com>:
> Hello Sebastian,
>
> Thank you for having taken a look at the overall mechanism.
> I've tried to make it as simple as possible, to focus on "what to extract?".
>
> Currently I've got lots of needs (and so ideas). The code will
> naturally evolve (support for XSLT 2.0) and I would be happy to
> give this code fully to the community.
>
> Of course, I'll create a JIRA and prepare a patch. I'll take the time
> to make it as clean as possible.
>
> Thank you for your interest.

Re: Generic xsl parser plugin

Posted by Albinscode <al...@gmail.com>.

Hello Sebastian,

Thank you for having taken a look at the overall mechanism.
I've tried to make it as simple as possible, to focus on "what to extract?".

Currently I've got lots of needs (and so ideas). The code will
naturally evolve (support for XSLT 2.0) and I would be happy to
give this code fully to the community.

Of course, I'll create a JIRA and prepare a patch. I'll take the time
to make it as clean as possible.

Thank you for your interest.

2014-10-03 6:59 GMT+02:00 Mattmann, Chris A (3980)
<ch...@jpl.nasa.gov>:
> Agree with Sebastian, if we could make this part of Nutch it
> would be great, as I think it would help us do page scraping
> a lot better!
>
> What do you think Albin?
>
> Cheers,
> Chris

Re: Generic xsl parser plugin

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.

Agree with Sebastian, if we could make this part of Nutch it
would be great, as I think it would help us do page scraping
a lot better!

What do you think Albin?

Cheers,
Chris

-----Original Message-----
From: Sebastian Nagel <wa...@googlemail.com>
Reply-To: "dev@nutch.apache.org" <de...@nutch.apache.org>
Date: Thursday, October 2, 2014 at 3:03 PM
To: "dev@nutch.apache.org" <de...@nutch.apache.org>
Subject: Re: Generic xsl parser plugin

>Hi Albin,
>
>the plugin looks very nice!
>I like the clean and extensible way in which
>fields are filled by XPath statements.
>Using XSLT functions to cleanse
>the extracted text (you can hardly ever do without it!)
>is an excellent idea!
>
>I hope to find the time soon to look at it in more detail
>and give it a try.
>
>Even more I would like to see the plugin as part of Nutch.
>Are you willing to open a Jira for it and provide a patch?
>
>Thanks a lot,
>Sebastian

Re: Generic xsl parser plugin

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi Albin,

the plugin looks very nice!
I like the clean and extensible way in which
fields are filled by XPath statements.
Using XSLT functions to cleanse
the extracted text (you can hardly ever do without it!)
is an excellent idea!

I hope to find the time soon to look at it in more detail
and give it a try.

Even more I would like to see the plugin as part of Nutch.
Are you willing to open a Jira for it and provide a patch?

Thanks a lot,
Sebastian
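
As an illustration of the point about cleansing with XSLT functions, the
sketch below applies a small stylesheet with the JDK's built-in XSLT 1.0
processor: xsl:for-each selects the cells and normalize-space() trims and
collapses their whitespace. The stylesheet, the <fields>/<field> output
elements and the class name XslFieldExtractor are hypothetical and are not
taken from the plugin.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Illustrative only: neither the plugin's code nor a Nutch API. */
class XslFieldExtractor {

    // Hypothetical stylesheet: emits one <field> per matching cell, with cleaned-up text.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'>"
      + "<fields>"
      + "<xsl:for-each select=\"//td[@class='tag']\">"
      + "<field name='tag'><xsl:value-of select='normalize-space(.)'/></field>"
      + "</xsl:for-each>"
      + "</fields>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Stand-in for a well-formed page whose cells carry untidy whitespace.
    static final String PAGE =
        "<html><body><table><tr>"
      + "<td class='tag'>  open   source </td>"
      + "<td class='tag'>\n crawler </td>"
      + "</tr></table></body></html>";

    /** Runs the stylesheet over the page and returns the serialized result. */
    static String transform(String stylesheet, String page) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(stylesheet)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(page)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transform(STYLESHEET, PAGE));
    }
}
```

The same stylesheet can loop, rename and clean fields in one place, which
a flat list of XPath expressions cannot do on its own.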

On 10/02/2014 10:26 AM, Albinscode wrote:
> Hi all,
> 
> I've created two posts on my blog to describe and use the xsl plugin:
> http://albinscoding.wordpress.com/2014/09/25/xsl-parser-for-apache-nutch/
> http://albinscoding.wordpress.com/2014/09/17/fast-nutch-configuration/
> 
> The source code is available on https://code.google.com/p/nutch-parse-xsl-plugin/.
> I'll update the google code wiki to gather information from my blog.
> 
> If you have any comments, feel free to share them.
> As I'm currently using it to crawl different web sites related to searching friends, I'll have lots
> of examples to provide.
> 
> Have a nice day!
> 
> Albin

Re: Generic xsl parser plugin

Posted by Albinscode <al...@gmail.com>.

Hi all,

I've created two posts on my blog to describe and use the xsl plugin:
http://albinscoding.wordpress.com/2014/09/25/xsl-parser-for-apache-nutch/
http://albinscoding.wordpress.com/2014/09/17/fast-nutch-configuration/

The source code is available on
https://code.google.com/p/nutch-parse-xsl-plugin/.
I'll update the google code wiki to gather information from my blog.

If you have any comments, feel free to share them.
As I'm currently using it to crawl different web sites related to searching
friends, I'll have lots of examples to provide.

Have a nice day!

Albin

2014-09-25 16:18 GMT+02:00 Albin Vigier <al...@gmail.com>:

> OK, perfect, so I didn't waste my time. I'm finishing my basic
> implementation for my own needs and I'll post it to Google Code or another
> repo if the community is interested.
> I'll work on a small doc too.
> Thank you for your answer.

Re: Generic xsl parser plugin

Posted by Albin Vigier <al...@gmail.com>.

OK, perfect, so I didn't waste my time. I'm finishing my basic
implementation for my own needs and I'll post it to Google Code or another
repo if the community is interested.
I'll work on a small doc too.
Thank you for your answer.

On Thu, Sep 25, 2014 at 2:49 PM, Julien Nioche <
lists.digitalpebble@gmail.com> wrote:

> Hi Albin,
>
> You don't have to have a separate plugin for each html structure you want
> to parse. You can have a single plugin with multiple HTMLParseFilters.
>
> Having a generic extractor with the extraction logic configured in an
> external file is definitely a good idea and would make a great contribution
> to the project. In a nutshell, you haven't missed anything and that wheel
> definitely needs inventing ;-)
>
> Best
>
> Julien
>
>
> On 25 September 2014 09:24, Albin Vigier <al...@gmail.com> wrote:
>
>> Hello everybody,
>>
>> I'm just wondering if it is possible to fetch specific metadata with
>> an existing nutch plugin.
>>
>> Let's take an example.
>> I want to extract some metadata from "div" or "td" tags from html
>> pages that have specific ids and name them the way I like (this is
>> done at parser time).
>> Then, at indexer time, I would use index-metadata (a very good plugin)
>> to add my custom metadata.
>>
>> Currently from what I've seen on the wiki and by quickly analyzing
>> plugins I suppose I have to code my own plugin each time I've got a
>> new site (with a new html structure). I've already done that by using
>> a node walker in a custom htmlParseFilter but the extraction can be a
>> little bit boring :)
>>
>> So on my side i've coded a little plugin that enables me to specify
>> xpaths in an xml file. But before diving into more functionalities I'm
>> just wondering if I did not missed something.
>> This work allowed me to explore some nutch aspects but I don't want to
>> reinvent the wheel or miss something.
>>
>> Albin
>>
>
>
>
> --
>
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>

Re: Generic xsl parser plugin

Posted by Albinscode <al...@gmail.com>.

Perfect, things are moving fast here ;)
I've looked at Emir's code but there is a small limitation: you can only
specify XPath expressions, it is not full XSL. So it doesn't fit my needs.
I need to perform real transformations (with xsl:for-each and custom XSL
functions, not only XPath).
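
As a rough illustration of the difference (a minimal sketch, not taken from
either plugin): a stylesheet with xsl:for-each can emit one output element per
matching node in a single pass, which a lone XPath expression cannot express.
The class name, the "price" class and the <documents>/<field> output format
below are all made up for the example; only the standard javax.xml.transform
API is used, and the input is assumed to be well-formed XML.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XslForEachSketch {

    // Hypothetical stylesheet: one <field> per table cell with class "price".
    static final String XSL =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'>"
      + "<documents>"
      + "<xsl:for-each select=\"//td[@class='price']\">"
      + "<field name='price'><xsl:value-of select='normalize-space(.)'/></field>"
      + "</xsl:for-each>"
      + "</documents>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Hypothetical, already well-formed page used only for the demo.
    static final String PAGE =
        "<html><body><table><tr>"
      + "<td class='price'> 10 EUR </td><td class='price'>12 EUR</td>"
      + "</tr></table></body></html>";

    // Applies the stylesheet to the XML input and returns the serialized result.
    static String transform(String xml, String xsl) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSL transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(PAGE, XSL));
    }
}
```

In a real plugin the stylesheet would presumably live in an external file and
the result would be mapped to parse metadata; both of those steps are left out
here.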

Another thing: when implementing the parser, I ran into a problem when
trying to apply XPath to the already provided DocumentFragment (generated by
htmlParser or tikaParser). It seems that Emir had a problem too, because he
recreates the whole DOM from the raw content instead of reusing it. He then
cleans up the DOM nodes to XMLize them with another HTML node cleaner (html
cleaner) instead of the already used NekoHtml or TagSoup. I think I'll start
a new thread on this mailing list and ask Emir, because it can be a
performance issue in both our plugins ;)
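
For what it's worth, one way to reuse an existing DocumentFragment without
reparsing the raw content could look like the sketch below. It is only an
assumption that this helps with the problem mentioned above, since the exact
failure is not described here. The idea is to deep-copy the fragment under a
synthetic root element in a fresh Document and evaluate XPath against that.
The class name, the "root" element and the sample div are hypothetical; the
calls are plain JAXP and W3C DOM.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;

public class FragmentXPathSketch {

    static Document newDocument() throws ParserConfigurationException {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    }

    // Copies the fragment under a synthetic root and evaluates the expression
    // against the resulting Document; returns the trimmed string value.
    static String evaluate(DocumentFragment fragment, String expression) {
        try {
            Document doc = newDocument();
            Element root = doc.createElement("root"); // hypothetical wrapper name
            doc.appendChild(root);
            // importNode deep-copies the fragment into the new document;
            // appending a fragment inserts its children, not the fragment itself.
            root.appendChild(doc.importNode(fragment, true));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, doc).trim();
        } catch (ParserConfigurationException | XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    // Builds a tiny fragment by hand, standing in for the one a parser provides.
    static String demo() {
        try {
            Document owner = newDocument();
            DocumentFragment fragment = owner.createDocumentFragment();
            Element div = owner.createElement("div");
            div.setAttribute("id", "price");
            div.setTextContent(" 10 EUR ");
            fragment.appendChild(div);
            return evaluate(fragment, "//div[@id='price']");
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Could not create a DOM document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The deep copy has its own cost, so it would need measuring against rebuilding
the DOM from the raw content before drawing any conclusion about performance.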

I've written a HOWTO describing the main mechanism, with a comparison to the
NodeWalker implementation. I'm doing some cleanup and I'll upload the
code:
http://albinscoding.wordpress.com/2014/09/25/xsl-parser-for-apache-nutch/

2014-09-26 10:26 GMT+02:00 Julien Nioche <li...@gmail.com>:

> Hi Nima
>
> Thanks for reminding me about this JIRA issue, it hasn't been commented on
> for some time and I'd forgotten about it. Judging by the discussion on
> NUTCH-978 <https://issues.apache.org/jira/browse/NUTCH-978> things got
> stuck when Emmanuel tried to get in touch with Emir (who in the meantime
> seems to have stopped using Nutch - see
> http://www.atlantbh.com/book-review-web-crawling-and-data-mining-with-apache-nutch/
> ).
>
> It would be a good thing to get in touch with him indeed, alternatively
> Albin's plugin could be a good starting point. There clearly is a need for
> such a functionality and quite a few people keen to make it happen.
>
> Thanks
>
> Julien
>
>
> On 25 September 2014 18:19, Nima Falaki <nf...@popsugar.com> wrote:
>
>> And the reason why I think this is because of this ticket (Look at the
>> conversation at the bottom between Emmanuel and Lewis John)
>>
>> https://issues.apache.org/jira/browse/NUTCH-978
>>
>> On Thu, Sep 25, 2014 at 8:44 AM, Nima Falaki <nf...@popsugar.com>
>> wrote:
>>
>>> Hi Julien:
>>>
>>> I was under the impression that the nutch community was going to use a
>>> generic xls parser? This one.
>>> http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/ Is
>>> the nutch community going to use this?
>>>
>>>
>>>
>>> On Thu, Sep 25, 2014 at 5:49 AM, Julien Nioche <
>>> lists.digitalpebble@gmail.com> wrote:
>>>
>>>> Hi Albin,
>>>>
>>>> You don't have to have a separate plugin for each html structure you
>>>> want to parse. You can have a single plugin with multiple HTMLParseFilters.
>>>>
>>>> Having a generic extractor with the extraction logic configured in an
>>>> external file is definitely a good idea and would make a great contribution
>>>> to the project. In a nutshell, you haven't missed anything and that wheel
>>>> definitely needs inventing ;-)
>>>>
>>>> Best
>>>>
>>>> Julien
>>>>
>>>>
>>>> On 25 September 2014 09:24, Albin Vigier <al...@gmail.com> wrote:
>>>>
>>>>> Hello everybody,
>>>>>
>>>>> I'm just wondering if it is possible to fetch specific metadata with
>>>>> an existing nutch plugin.
>>>>>
>>>>> Let's take an example.
>>>>> I want to extract some metadata from "div" or "td" tags from html
>>>>> pages that have specific ids and name them the way I like (this is
>>>>> done at parser time).
>>>>> Then, at indexer time, I would use index-metadata (a very good plugin)
>>>>> to add my custom metadata.
>>>>>
>>>>> Currently from what I've seen on the wiki and by quickly analyzing
>>>>> plugins I suppose I have to code my own plugin each time I've got a
>>>>> new site (with a new html structure). I've already done that by using
>>>>> a node walker in a custom htmlParseFilter but the extraction can be a
>>>>> little bit boring :)
>>>>>
>>>>> So on my side i've coded a little plugin that enables me to specify
>>>>> xpaths in an xml file. But before diving into more functionalities I'm
>>>>> just wondering if I did not missed something.
>>>>> This work allowed me to explore some nutch aspects but I don't want to
>>>>> reinvent the wheel or miss something.
>>>>>
>>>>> Albin
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Open Source Solutions for Text Engineering
>>>>
>>>> http://digitalpebble.blogspot.com/
>>>> http://www.digitalpebble.com
>>>> http://twitter.com/digitalpebble
>>>>
>>>
>>>
>>>
>>> --
>>>
>>>
>>>
>>> Nima Falaki
>>> Software Engineer
>>> nfalaki@popsugar.com
>>>
>>>
>>
>>
>> --
>>
>>
>>
>> Nima Falaki
>> Software Engineer
>> nfalaki@popsugar.com
>>
>>
>
>
> --
>
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>

Re: Generic xsl parser plugin

Posted by Julien Nioche <li...@gmail.com>.

Hi Nima

Thanks for reminding me about this JIRA issue, it hasn't been commented on
for some time and I'd forgotten about it. Judging by the discussion on
NUTCH-978 <https://issues.apache.org/jira/browse/NUTCH-978> things got
stuck when Emmanuel tried to get in touch with Emir (who in the meantime
seems to have stopped using Nutch - see
http://www.atlantbh.com/book-review-web-crawling-and-data-mining-with-apache-nutch/
).

It would be a good thing to get in touch with him indeed; alternatively,
Albin's plugin could be a good starting point. There is clearly a need for
such functionality and quite a few people are keen to make it happen.

Thanks

Julien


On 25 September 2014 18:19, Nima Falaki <nf...@popsugar.com> wrote:

> And the reason why I think this is because of this ticket (Look at the
> conversation at the bottom between Emmanuel and Lewis John)
>
> https://issues.apache.org/jira/browse/NUTCH-978
>
> On Thu, Sep 25, 2014 at 8:44 AM, Nima Falaki <nf...@popsugar.com> wrote:
>
>> Hi Julien:
>>
>> I was under the impression that the nutch community was going to use a
>> generic xls parser? This one.
>> http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/ Is
>> the nutch community going to use this?
>>
>>
>>
>> On Thu, Sep 25, 2014 at 5:49 AM, Julien Nioche <
>> lists.digitalpebble@gmail.com> wrote:
>>
>>> Hi Albin,
>>>
>>> You don't have to have a separate plugin for each html structure you
>>> want to parse. You can have a single plugin with multiple HTMLParseFilters.
>>>
>>> Having a generic extractor with the extraction logic configured in an
>>> external file is definitely a good idea and would make a great contribution
>>> to the project. In a nutshell, you haven't missed anything and that wheel
>>> definitely needs inventing ;-)
>>>
>>> Best
>>>
>>> Julien
>>>
>>>
>>> On 25 September 2014 09:24, Albin Vigier <al...@gmail.com> wrote:
>>>
>>>> Hello everybody,
>>>>
>>>> I'm just wondering if it is possible to fetch specific metadata with
>>>> an existing nutch plugin.
>>>>
>>>> Let's take an example.
>>>> I want to extract some metadata from "div" or "td" tags from html
>>>> pages that have specific ids and name them the way I like (this is
>>>> done at parser time).
>>>> Then, at indexer time, I would use index-metadata (a very good plugin)
>>>> to add my custom metadata.
>>>>
>>>> Currently from what I've seen on the wiki and by quickly analyzing
>>>> plugins I suppose I have to code my own plugin each time I've got a
>>>> new site (with a new html structure). I've already done that by using
>>>> a node walker in a custom htmlParseFilter but the extraction can be a
>>>> little bit boring :)
>>>>
>>>> So on my side i've coded a little plugin that enables me to specify
>>>> xpaths in an xml file. But before diving into more functionalities I'm
>>>> just wondering if I did not missed something.
>>>> This work allowed me to explore some nutch aspects but I don't want to
>>>> reinvent the wheel or miss something.
>>>>
>>>> Albin
>>>>
>>>
>>>
>>>
>>> --
>>>
>>> Open Source Solutions for Text Engineering
>>>
>>> http://digitalpebble.blogspot.com/
>>> http://www.digitalpebble.com
>>> http://twitter.com/digitalpebble
>>>
>>
>>
>>
>> --
>>
>>
>>
>> Nima Falaki
>> Software Engineer
>> nfalaki@popsugar.com
>>
>>
>
>
> --
>
>
>
> Nima Falaki
> Software Engineer
> nfalaki@popsugar.com
>
>


-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: Generic xsl parser plugin

Posted by Albin Vigier <al...@gmail.com>.

Perfect, things are moving fast here ;)
I've looked at Emir's code but there is a small limitation: you can only
specify XPath expressions, it is not full XSL. So it doesn't fit my needs.
I need to perform real transformations (with xsl:for-each and custom XSL
functions, not only XPath).

Another thing: when implementing the parser, I ran into a problem when
trying to apply XPath to the already provided DocumentFragment (generated by
htmlParser or tikaParser). It seems that Emir had a problem too, because he
recreates the whole DOM from the raw content instead of reusing it. He then
cleans up the DOM nodes to XMLize them with another HTML node cleaner (html
cleaner) instead of the already used NekoHtml or TagSoup. I think I'll start
a new thread on this mailing list and ask Emir, because it can be a
performance issue in both our plugins ;)

I've written a HOWTO describing the main mechanism, with a comparison to the
NodeWalker implementation. I'm doing some cleanup and I'll upload the
code:
http://albinscoding.wordpress.com/2014/09/25/xsl-parser-for-apache-nutch/




On Fri, Sep 26, 2014 at 6:01 AM, Nima Falaki <nf...@popsugar.com> wrote:

> Yes please share. It would be useful.
> On Sep 25, 2014 8:54 PM, "Talat Uyarer" <ta...@uyarer.com> wrote:
>
>> Last thing I wrote a how to use it document. :)
>> On Sep 26, 2014 6:52 AM, "Talat Uyarer" <ta...@uyarer.com> wrote:
>>
>>> Hi all,
>>>
>>> I made some changes Emir's plugin for completable with 2.x That is
>>> useful If you need I can share my fork.
>>>
>>> Talat
>>> On Sep 26, 2014 6:47 AM, "Nima Falaki" <nf...@popsugar.com> wrote:
>>>
>>>> Hi:
>>>>
>>>> Yes, it would be very interesting. Let me know what Emir says
>>>>
>>>> Nima
>>>>
>>>> On Thu, Sep 25, 2014 at 12:43 PM, Albinscode <al...@gmail.com>
>>>> wrote:
>>>>
>>>>> Oh thanks Nima, I did found this topic last year but I thought the
>>>>> project was dead. I think there is a little reference in the nutch wiki too
>>>>> I cannot find it now.
>>>>>
>>>>> It looks like we have the same xsl approach so it can be interesting
>>>>> to share. I'll try to contact Emir while continuing documenting my small
>>>>> plugin.
>>>>>
>>>>> Thanks again for the valuable information!
>>>>>
>>>>> 2014-09-25 19:19 GMT+02:00 Nima Falaki <nf...@popsugar.com>:
>>>>>
>>>>>> And the reason why I think this is because of this ticket (Look at
>>>>>> the conversation at the bottom between Emmanuel and Lewis John)
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/NUTCH-978
>>>>>>
>>>>>> On Thu, Sep 25, 2014 at 8:44 AM, Nima Falaki <nf...@popsugar.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Julien:
>>>>>>>
>>>>>>> I was under the impression that the nutch community was going to use
>>>>>>> a generic xls parser? This one.
>>>>>>> http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/
>>>>>>> Is the nutch community going to use this?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Sep 25, 2014 at 5:49 AM, Julien Nioche <
>>>>>>> lists.digitalpebble@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Albin,
>>>>>>>>
>>>>>>>> You don't have to have a separate plugin for each html structure
>>>>>>>> you want to parse. You can have a single plugin with multiple
>>>>>>>> HTMLParseFilters.
>>>>>>>>
>>>>>>>> Having a generic extractor with the extraction logic configured in
>>>>>>>> an external file is definitely a good idea and would make a great
>>>>>>>> contribution to the project. In a nutshell, you haven't missed anything and
>>>>>>>> that wheel definitely needs inventing ;-)
>>>>>>>>
>>>>>>>> Best
>>>>>>>>
>>>>>>>> Julien
>>>>>>>>
>>>>>>>>
>>>>>>>> On 25 September 2014 09:24, Albin Vigier <al...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hello everybody,
>>>>>>>>>
>>>>>>>>> I'm just wondering if it is possible to fetch specific metadata
>>>>>>>>> with
>>>>>>>>> an existing nutch plugin.
>>>>>>>>>
>>>>>>>>> Let's take an example.
>>>>>>>>> I want to extract some metadata from "div" or "td" tags from html
>>>>>>>>> pages that have specific ids and name them the way I like (this is
>>>>>>>>> done at parser time).
>>>>>>>>> Then, at indexer time, I would use index-metadata (a very good
>>>>>>>>> plugin)
>>>>>>>>> to add my custom metadata.
>>>>>>>>>
>>>>>>>>> Currently from what I've seen on the wiki and by quickly analyzing
>>>>>>>>> plugins I suppose I have to code my own plugin each time I've got a
>>>>>>>>> new site (with a new html structure). I've already done that by
>>>>>>>>> using
>>>>>>>>> a node walker in a custom htmlParseFilter but the extraction can
>>>>>>>>> be a
>>>>>>>>> little bit boring :)
>>>>>>>>>
>>>>>>>>> So on my side i've coded a little plugin that enables me to specify
>>>>>>>>> xpaths in an xml file. But before diving into more functionalities
>>>>>>>>> I'm
>>>>>>>>> just wondering if I did not missed something.
>>>>>>>>> This work allowed me to explore some nutch aspects but I don't
>>>>>>>>> want to
>>>>>>>>> reinvent the wheel or miss something.
>>>>>>>>>
>>>>>>>>> Albin
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> Open Source Solutions for Text Engineering
>>>>>>>>
>>>>>>>> http://digitalpebble.blogspot.com/
>>>>>>>> http://www.digitalpebble.com
>>>>>>>> http://twitter.com/digitalpebble
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Nima Falaki
>>>>>>> Software Engineer
>>>>>>> nfalaki@popsugar.com
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>>
>>>>>>
>>>>>> Nima Falaki
>>>>>> Software Engineer
>>>>>> nfalaki@popsugar.com
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>>
>>>>
>>>> Nima Falaki
>>>> Software Engineer
>>>> nfalaki@popsugar.com
>>>>
>>>>

Re: Generic xsl parser plugin

Posted by Nima Falaki <nf...@popsugar.com>.

Yes please share. It would be useful.
On Sep 25, 2014 8:54 PM, "Talat Uyarer" <ta...@uyarer.com> wrote:

> Last thing I wrote a how to use it document. :)
> On Sep 26, 2014 6:52 AM, "Talat Uyarer" <ta...@uyarer.com> wrote:
>
>> Hi all,
>>
>> I made some changes Emir's plugin for completable with 2.x That is useful
>> If you need I can share my fork.
>>
>> Talat
>> On Sep 26, 2014 6:47 AM, "Nima Falaki" <nf...@popsugar.com> wrote:
>>
>>> Hi:
>>>
>>> Yes, it would be very interesting. Let me know what Emir says
>>>
>>> Nima
>>>
>>> On Thu, Sep 25, 2014 at 12:43 PM, Albinscode <al...@gmail.com>
>>> wrote:
>>>
>>>> Oh thanks Nima, I did found this topic last year but I thought the
>>>> project was dead. I think there is a little reference in the nutch wiki too
>>>> I cannot find it now.
>>>>
>>>> It looks like we have the same xsl approach so it can be interesting to
>>>> share. I'll try to contact Emir while continuing documenting my small
>>>> plugin.
>>>>
>>>> Thanks again for the valuable information!
>>>>
>>>> 2014-09-25 19:19 GMT+02:00 Nima Falaki <nf...@popsugar.com>:
>>>>
>>>>> And the reason why I think this is because of this ticket (Look at the
>>>>> conversation at the bottom between Emmanuel and Lewis John)
>>>>>
>>>>> https://issues.apache.org/jira/browse/NUTCH-978
>>>>>
>>>>> On Thu, Sep 25, 2014 at 8:44 AM, Nima Falaki <nf...@popsugar.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Julien:
>>>>>>
>>>>>> I was under the impression that the nutch community was going to use
>>>>>> a generic xls parser? This one.
>>>>>> http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/
>>>>>> Is the nutch community going to use this?
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Sep 25, 2014 at 5:49 AM, Julien Nioche <
>>>>>> lists.digitalpebble@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Albin,
>>>>>>>
>>>>>>> You don't have to have a separate plugin for each html structure you
>>>>>>> want to parse. You can have a single plugin with multiple HTMLParseFilters.
>>>>>>>
>>>>>>> Having a generic extractor with the extraction logic configured in
>>>>>>> an external file is definitely a good idea and would make a great
>>>>>>> contribution to the project. In a nutshell, you haven't missed anything and
>>>>>>> that wheel definitely needs inventing ;-)
>>>>>>>
>>>>>>> Best
>>>>>>>
>>>>>>> Julien
>>>>>>>
>>>>>>>
>>>>>>> On 25 September 2014 09:24, Albin Vigier <al...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello everybody,
>>>>>>>>
>>>>>>>> I'm just wondering if it is possible to fetch specific metadata with
>>>>>>>> an existing nutch plugin.
>>>>>>>>
>>>>>>>> Let's take an example.
>>>>>>>> I want to extract some metadata from "div" or "td" tags from html
>>>>>>>> pages that have specific ids and name them the way I like (this is
>>>>>>>> done at parser time).
>>>>>>>> Then, at indexer time, I would use index-metadata (a very good
>>>>>>>> plugin)
>>>>>>>> to add my custom metadata.
>>>>>>>>
>>>>>>>> Currently from what I've seen on the wiki and by quickly analyzing
>>>>>>>> plugins I suppose I have to code my own plugin each time I've got a
>>>>>>>> new site (with a new html structure). I've already done that by
>>>>>>>> using
>>>>>>>> a node walker in a custom htmlParseFilter but the extraction can be
>>>>>>>> a
>>>>>>>> little bit boring :)
>>>>>>>>
>>>>>>>> So on my side i've coded a little plugin that enables me to specify
>>>>>>>> xpaths in an xml file. But before diving into more functionalities
>>>>>>>> I'm
>>>>>>>> just wondering if I did not missed something.
>>>>>>>> This work allowed me to explore some nutch aspects but I don't want
>>>>>>>> to
>>>>>>>> reinvent the wheel or miss something.
>>>>>>>>
>>>>>>>> Albin
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> Open Source Solutions for Text Engineering
>>>>>>>
>>>>>>> http://digitalpebble.blogspot.com/
>>>>>>> http://www.digitalpebble.com
>>>>>>> http://twitter.com/digitalpebble
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>>
>>>>>>
>>>>>> Nima Falaki
>>>>>> Software Engineer
>>>>>> nfalaki@popsugar.com
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>>
>>>>>
>>>>> Nima Falaki
>>>>> Software Engineer
>>>>> nfalaki@popsugar.com
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>>
>>>
>>>
>>> Nima Falaki
>>> Software Engineer
>>> nfalaki@popsugar.com
>>>
>>>

Re: Generic xsl parser plugin

Posted by Talat Uyarer <ta...@uyarer.com>.

One last thing: I wrote a how-to document for it. :)
On Sep 26, 2014 6:52 AM, "Talat Uyarer" <ta...@uyarer.com> wrote:

> Hi all,
>
> I made some changes Emir's plugin for completable with 2.x That is useful
> If you need I can share my fork.
>
> Talat
> On Sep 26, 2014 6:47 AM, "Nima Falaki" <nf...@popsugar.com> wrote:
>
>> Hi:
>>
>> Yes, it would be very interesting. Let me know what Emir says
>>
>> Nima
>>
>> On Thu, Sep 25, 2014 at 12:43 PM, Albinscode <al...@gmail.com>
>> wrote:
>>
>>> Oh thanks Nima, I did found this topic last year but I thought the
>>> project was dead. I think there is a little reference in the nutch wiki too
>>> I cannot find it now.
>>>
>>> It looks like we have the same xsl approach so it can be interesting to
>>> share. I'll try to contact Emir while continuing documenting my small
>>> plugin.
>>>
>>> Thanks again for the valuable information!
>>>
>>> 2014-09-25 19:19 GMT+02:00 Nima Falaki <nf...@popsugar.com>:
>>>
>>>> And the reason why I think this is because of this ticket (Look at the
>>>> conversation at the bottom between Emmanuel and Lewis John)
>>>>
>>>> https://issues.apache.org/jira/browse/NUTCH-978
>>>>
>>>> On Thu, Sep 25, 2014 at 8:44 AM, Nima Falaki <nf...@popsugar.com>
>>>> wrote:
>>>>
>>>>> Hi Julien:
>>>>>
>>>>> I was under the impression that the nutch community was going to use a
>>>>> generic xls parser? This one.
>>>>> http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/ Is
>>>>> the nutch community going to use this?
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Sep 25, 2014 at 5:49 AM, Julien Nioche <
>>>>> lists.digitalpebble@gmail.com> wrote:
>>>>>
>>>>>> Hi Albin,
>>>>>>
>>>>>> You don't have to have a separate plugin for each html structure you
>>>>>> want to parse. You can have a single plugin with multiple HTMLParseFilters.
>>>>>>
>>>>>> Having a generic extractor with the extraction logic configured in an
>>>>>> external file is definitely a good idea and would make a great contribution
>>>>>> to the project. In a nutshell, you haven't missed anything and that wheel
>>>>>> definitely needs inventing ;-)
>>>>>>
>>>>>> Best
>>>>>>
>>>>>> Julien
>>>>>>
>>>>>>
>>>>>> On 25 September 2014 09:24, Albin Vigier <al...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello everybody,
>>>>>>>
>>>>>>> I'm just wondering if it is possible to fetch specific metadata with
>>>>>>> an existing nutch plugin.
>>>>>>>
>>>>>>> Let's take an example.
>>>>>>> I want to extract some metadata from "div" or "td" tags from html
>>>>>>> pages that have specific ids and name them the way I like (this is
>>>>>>> done at parser time).
>>>>>>> Then, at indexer time, I would use index-metadata (a very good
>>>>>>> plugin)
>>>>>>> to add my custom metadata.
>>>>>>>
>>>>>>> Currently from what I've seen on the wiki and by quickly analyzing
>>>>>>> plugins I suppose I have to code my own plugin each time I've got a
>>>>>>> new site (with a new html structure). I've already done that by using
>>>>>>> a node walker in a custom htmlParseFilter but the extraction can be a
>>>>>>> little bit boring :)
>>>>>>>
>>>>>>> So on my side i've coded a little plugin that enables me to specify
>>>>>>> xpaths in an xml file. But before diving into more functionalities
>>>>>>> I'm
>>>>>>> just wondering if I did not missed something.
>>>>>>> This work allowed me to explore some nutch aspects but I don't want
>>>>>>> to
>>>>>>> reinvent the wheel or miss something.
>>>>>>>
>>>>>>> Albin
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Open Source Solutions for Text Engineering
>>>>>>
>>>>>> http://digitalpebble.blogspot.com/
>>>>>> http://www.digitalpebble.com
>>>>>> http://twitter.com/digitalpebble
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>>
>>>>>
>>>>> Nima Falaki
>>>>> Software Engineer
>>>>> nfalaki@popsugar.com
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>>
>>>>
>>>> Nima Falaki
>>>> Software Engineer
>>>> nfalaki@popsugar.com
>>>>
>>>>
>>>
>>
>>
>> --
>>
>>
>>
>> Nima Falaki
>> Software Engineer
>> nfalaki@popsugar.com
>>
>>

Re: Generic xsl parser plugin

Posted by Talat Uyarer <ta...@uyarer.com>.

Hi all,

I made some changes to Emir's plugin to make it compatible with 2.x, which
has been useful. If you need it, I can share my fork.

Talat
On Sep 26, 2014 6:47 AM, "Nima Falaki" <nf...@popsugar.com> wrote:

> Hi:
>
> Yes, it would be very interesting. Let me know what Emir says
>
> Nima
>
> On Thu, Sep 25, 2014 at 12:43 PM, Albinscode <al...@gmail.com> wrote:
>
>> Oh thanks Nima, I did found this topic last year but I thought the
>> project was dead. I think there is a little reference in the nutch wiki too
>> I cannot find it now.
>>
>> It looks like we have the same xsl approach so it can be interesting to
>> share. I'll try to contact Emir while continuing documenting my small
>> plugin.
>>
>> Thanks again for the valuable information!
>>
>> 2014-09-25 19:19 GMT+02:00 Nima Falaki <nf...@popsugar.com>:
>>
>>> And the reason why I think this is because of this ticket (Look at the
>>> conversation at the bottom between Emmanuel and Lewis John)
>>>
>>> https://issues.apache.org/jira/browse/NUTCH-978
>>>
>>> On Thu, Sep 25, 2014 at 8:44 AM, Nima Falaki <nf...@popsugar.com>
>>> wrote:
>>>
>>>> Hi Julien:
>>>>
>>>> I was under the impression that the nutch community was going to use a
>>>> generic xls parser? This one.
>>>> http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/ Is
>>>> the nutch community going to use this?
>>>>
>>>>
>>>>
>>>> On Thu, Sep 25, 2014 at 5:49 AM, Julien Nioche <
>>>> lists.digitalpebble@gmail.com> wrote:
>>>>
>>>>> Hi Albin,
>>>>>
>>>>> You don't have to have a separate plugin for each html structure you
>>>>> want to parse. You can have a single plugin with multiple HTMLParseFilters.
>>>>>
>>>>> Having a generic extractor with the extraction logic configured in an
>>>>> external file is definitely a good idea and would make a great contribution
>>>>> to the project. In a nutshell, you haven't missed anything and that wheel
>>>>> definitely needs inventing ;-)
>>>>>
>>>>> Best
>>>>>
>>>>> Julien
>>>>>
>>>>>
>>>>> On 25 September 2014 09:24, Albin Vigier <al...@gmail.com> wrote:
>>>>>
>>>>>> Hello everybody,
>>>>>>
>>>>>> I'm just wondering if it is possible to fetch specific metadata with
>>>>>> an existing nutch plugin.
>>>>>>
>>>>>> Let's take an example.
>>>>>> I want to extract some metadata from "div" or "td" tags from html
>>>>>> pages that have specific ids and name them the way I like (this is
>>>>>> done at parser time).
>>>>>> Then, at indexer time, I would use index-metadata (a very good plugin)
>>>>>> to add my custom metadata.
>>>>>>
>>>>>> Currently from what I've seen on the wiki and by quickly analyzing
>>>>>> plugins I suppose I have to code my own plugin each time I've got a
>>>>>> new site (with a new html structure). I've already done that by using
>>>>>> a node walker in a custom htmlParseFilter but the extraction can be a
>>>>>> little bit boring :)
>>>>>>
>>>>>> So on my side i've coded a little plugin that enables me to specify
>>>>>> xpaths in an xml file. But before diving into more functionalities I'm
>>>>>> just wondering if I did not missed something.
>>>>>> This work allowed me to explore some nutch aspects but I don't want to
>>>>>> reinvent the wheel or miss something.
>>>>>>
>>>>>> Albin
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Open Source Solutions for Text Engineering
>>>>>
>>>>> http://digitalpebble.blogspot.com/
>>>>> http://www.digitalpebble.com
>>>>> http://twitter.com/digitalpebble
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>>
>>>>
>>>> Nima Falaki
>>>> Software Engineer
>>>> nfalaki@popsugar.com
>>>>
>>>>
>>>
>>>
>>> --
>>>
>>>
>>>
>>> Nima Falaki
>>> Software Engineer
>>> nfalaki@popsugar.com
>>>
>>>
>>
>
>
> --
>
>
>
> Nima Falaki
> Software Engineer
> nfalaki@popsugar.com
>
>

Re: Generic xsl parser plugin

Posted by Nima Falaki <nf...@popsugar.com>.

Hi:

Yes, it would be very interesting. Let me know what Emir says

Nima

On Thu, Sep 25, 2014 at 12:43 PM, Albinscode <al...@gmail.com> wrote:

> Oh thanks Nima, I did found this topic last year but I thought the project
> was dead. I think there is a little reference in the nutch wiki too I
> cannot find it now.
>
> It looks like we have the same xsl approach so it can be interesting to
> share. I'll try to contact Emir while continuing documenting my small
> plugin.
>
> Thanks again for the valuable information!
>
> 2014-09-25 19:19 GMT+02:00 Nima Falaki <nf...@popsugar.com>:
>
>> And the reason why I think this is because of this ticket (Look at the
>> conversation at the bottom between Emmanuel and Lewis John)
>>
>> https://issues.apache.org/jira/browse/NUTCH-978
>>
>> On Thu, Sep 25, 2014 at 8:44 AM, Nima Falaki <nf...@popsugar.com>
>> wrote:
>>
>>> Hi Julien:
>>>
>>> I was under the impression that the nutch community was going to use a
>>> generic xls parser? This one.
>>> http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/ Is
>>> the nutch community going to use this?
>>>
>>>
>>>
>>> On Thu, Sep 25, 2014 at 5:49 AM, Julien Nioche <
>>> lists.digitalpebble@gmail.com> wrote:
>>>
>>>> Hi Albin,
>>>>
>>>> You don't have to have a separate plugin for each html structure you
>>>> want to parse. You can have a single plugin with multiple HTMLParseFilters.
>>>>
>>>> Having a generic extractor with the extraction logic configured in an
>>>> external file is definitely a good idea and would make a great contribution
>>>> to the project. In a nutshell, you haven't missed anything and that wheel
>>>> definitely needs inventing ;-)
>>>>
>>>> Best
>>>>
>>>> Julien
>>>>
>>>>
>>>> On 25 September 2014 09:24, Albin Vigier <al...@gmail.com> wrote:
>>>>
>>>>> Hello everybody,
>>>>>
>>>>> I'm just wondering if it is possible to fetch specific metadata with
>>>>> an existing nutch plugin.
>>>>>
>>>>> Let's take an example.
>>>>> I want to extract some metadata from "div" or "td" tags from html
>>>>> pages that have specific ids and name them the way I like (this is
>>>>> done at parser time).
>>>>> Then, at indexer time, I would use index-metadata (a very good plugin)
>>>>> to add my custom metadata.
>>>>>
>>>>> Currently from what I've seen on the wiki and by quickly analyzing
>>>>> plugins I suppose I have to code my own plugin each time I've got a
>>>>> new site (with a new html structure). I've already done that by using
>>>>> a node walker in a custom htmlParseFilter but the extraction can be a
>>>>> little bit boring :)
>>>>>
>>>>> So on my side i've coded a little plugin that enables me to specify
>>>>> xpaths in an xml file. But before diving into more functionalities I'm
>>>>> just wondering if I did not missed something.
>>>>> This work allowed me to explore some nutch aspects but I don't want to
>>>>> reinvent the wheel or miss something.
>>>>>
>>>>> Albin
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Open Source Solutions for Text Engineering
>>>>
>>>> http://digitalpebble.blogspot.com/
>>>> http://www.digitalpebble.com
>>>> http://twitter.com/digitalpebble
>>>>
>>>
>>>
>>>
>>> --
>>>
>>>
>>>
>>> Nima Falaki
>>> Software Engineer
>>> nfalaki@popsugar.com
>>>
>>>
>>
>>
>> --
>>
>>
>>
>> Nima Falaki
>> Software Engineer
>> nfalaki@popsugar.com
>>
>>
>


-- 



Nima Falaki
Software Engineer
nfalaki@popsugar.com

Re: Generic xsl parser plugin

Posted by Albinscode <al...@gmail.com>.

Oh thanks Nima, I did find this topic last year but I thought the project
was dead. I think there is a small reference to it in the Nutch wiki too,
but I cannot find it now.

It looks like we have the same xsl approach so it can be interesting to
share. I'll try to contact Emir while continuing documenting my small
plugin.

Thanks again for the valuable information!

2014-09-25 19:19 GMT+02:00 Nima Falaki <nf...@popsugar.com>:

> The reason I think this is the following ticket (see the conversation at
> the bottom between Emmanuel and Lewis John):
>
> https://issues.apache.org/jira/browse/NUTCH-978

Re: Generic xsl parser plugin

Posted by Nima Falaki <nf...@popsugar.com>.

The reason I think this is the following ticket (see the conversation at
the bottom between Emmanuel and Lewis John):

https://issues.apache.org/jira/browse/NUTCH-978

On Thu, Sep 25, 2014 at 8:44 AM, Nima Falaki <nf...@popsugar.com> wrote:

> Hi Julien:
>
> I was under the impression that the Nutch community was going to use a
> generic XSL parser, namely this one:
> http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/
> Is the Nutch community going to use it?


-- 



Nima Falaki
Software Engineer
nfalaki@popsugar.com

Re: Generic xsl parser plugin

Posted by Nima Falaki <nf...@popsugar.com>.

Hi Julien:

I was under the impression that the Nutch community was going to use a
generic XSL parser, namely this one:
http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/
Is the Nutch community going to use it?



On Thu, Sep 25, 2014 at 5:49 AM, Julien Nioche <
lists.digitalpebble@gmail.com> wrote:

> Hi Albin,
>
> You don't have to have a separate plugin for each html structure you want
> to parse. You can have a single plugin with multiple HTMLParseFilters.
>
> Having a generic extractor with the extraction logic configured in an
> external file is definitely a good idea and would make a great contribution
> to the project. In a nutshell, you haven't missed anything and that wheel
> definitely needs inventing ;-)
>
> Best
>
> Julien



-- 



Nima Falaki
Software Engineer
nfalaki@popsugar.com

Re: Generic xsl parser plugin

Posted by Julien Nioche <li...@gmail.com>.

Hi Albin,

You don't have to have a separate plugin for each html structure you want
to parse. You can have a single plugin with multiple HTMLParseFilters.

Having a generic extractor with the extraction logic configured in an
external file is definitely a good idea and would make a great contribution
to the project. In a nutshell, you haven't missed anything and that wheel
definitely needs inventing ;-)
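
To make the idea concrete, here is a minimal sketch of a generic extractor
driven by a table of field names and XPath expressions. It is an
illustration only, not Albin's plugin and not an existing Nutch API: the
class name XPathMetadataExtractor, its methods and the sample field names
are hypothetical, the rules are hard-coded where a real plugin would read
them from an external file, and the input is well-formed XHTML parsed with
the JDK's own parser rather than the DOM that Nutch hands to a parse filter.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical class: not part of Nutch, only an illustration.
class XPathMetadataExtractor {

    // Evaluates every rule (field name -> XPath expression) against the
    // document and keeps the non-empty results, in rule order.
    static Map<String, String> extract(Document doc, Map<String, String> rules)
            throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Map<String, String> metadata = new LinkedHashMap<>();
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            // evaluate(String, Object) returns the string value of the first
            // matching node, or an empty string when nothing matches.
            String value = xpath.evaluate(rule.getValue(), doc).trim();
            if (!value.isEmpty()) {
                metadata.put(rule.getKey(), value);
            }
        }
        return metadata;
    }

    // Parses a well-formed XML/XHTML string into a DOM document.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample rules; a real plugin would load these from a file.
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("price", "//div[@id='price']");
        rules.put("author", "//td[@id='author']");

        Document doc = parse("<html><body><div id='price'>42 EUR</div>"
                + "<table><tr><td id='author'>Albin</td></tr></table>"
                + "</body></html>");
        System.out.println(extract(doc, rules));
    }
}
```

In a real plugin the extracted values would presumably be copied into the
parse metadata so that index-metadata can pick them up at indexing time;
that wiring is left out here.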

Best

Julien


On 25 September 2014 09:24, Albin Vigier <al...@gmail.com> wrote:

> Hello everybody,
>
> I'm just wondering if it is possible to fetch specific metadata with
> an existing nutch plugin.
>
> Let's take an example.
> I want to extract some metadata from "div" or "td" tags from html
> pages that have specific ids and name them the way I like (this is
> done at parser time).
> Then, at indexer time, I would use index-metadata (a very good plugin)
> to add my custom metadata.
>
> Currently from what I've seen on the wiki and by quickly analyzing
> plugins I suppose I have to code my own plugin each time I've got a
> new site (with a new html structure). I've already done that by using
> a node walker in a custom htmlParseFilter but the extraction can be a
> little bit boring :)
>
> So on my side I've coded a little plugin that enables me to specify
> XPath expressions in an XML file. But before adding more functionality,
> I'm just wondering whether I missed something.
> This work allowed me to explore some nutch aspects but I don't want to
> reinvent the wheel or miss something.
>
> Albin
>



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble