Posted to dev@nutch.apache.org by "Lewis John McGibbney (JIRA)" <ji...@apache.org> on 2014/11/01 19:47:34 UTC

[jira] [Updated] (NUTCH-1644) Should have a parser that uses xpath

     [ https://issues.apache.org/jira/browse/NUTCH-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1644:
----------------------------------------
    Fix Version/s:     (was: 2.3)
                   2.4

> Should have a parser that uses xpath
> ------------------------------------
>
>                 Key: NUTCH-1644
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1644
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 2.2.1
>            Reporter: cihad güzel
>            Assignee: Lewis John McGibbney
>              Labels: parser, xpath
>             Fix For: 2.4
>
>         Attachments: NUTCH-1644.patch
>
>
> Users may want to parse some URLs via XPath, for example pages from blog or news web sites. There should be a plugin that parses content using XPath.
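
To illustrate the kind of extraction this request describes, here is a minimal sketch using only the JDK's javax.xml.xpath API. It is not the code from NUTCH-1644.patch. The class name, method name, sample markup, and XPath expressions are all hypothetical, and it assumes the input is well-formed XML/XHTML (real-world HTML would likely need a tolerant parser first).

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only; not part of Nutch.
public class XPathFieldSketch {

    // Evaluates an XPath expression against well-formed XML/XHTML
    // and returns the trimmed string value of the first match.
    static String extract(String xml, String expression) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        InputSource source = new InputSource(new StringReader(xml));
        return xpath.evaluate(expression, source).trim();
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample page; a crawler would supply the fetched content instead.
        String page = "<html><body>"
                + "<h1 class=\"title\">Hello</h1>"
                + "<span class=\"author\">Jane</span>"
                + "</body></html>";
        System.out.println(extract(page, "//h1[@class='title']"));
        System.out.println(extract(page, "//span[@class='author']"));
    }
}
```

A plugin along these lines would presumably read the expressions from configuration rather than hard-coding them, so that new fields can be extracted without writing a new plugin.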



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Re: [jira] [Updated] (NUTCH-1644) Should have a parser that uses xpath

Posted by Albin Vigier <al...@gmail.com>.
Hello Sebastian,

I'll look at the xjb failure. I'm glad to see that it will be integrated
into ivy!

As for the examples, I should already have added some commented tests in the
tests folders. I'll also provide a conf if one does not already exist. I'll
keep you posted.


Thanks,
Albin

On Mon, Nov 3, 2014 at 11:50 PM, Sebastian Nagel <wastl.nagel@googlemail.com> wrote:

> Hi Albin,
>
> you mean NUTCH-1870, right?
> I'm in the process of reviewing your patch.
> I'm just stuck preparing the boilerplate required
> to integrate parse-xsl into the build, tests, and javadoc.
> I've added the jaxb dependencies to ivy,
> but the xjb task fails, presumably because
> there is a version mismatch.
> See the attached patch. If you can resolve this problem,
> that would be great!
>
> Also, we need a configuration template in conf/.
> Just one rules file and one transformer file,
> ideally with some examples (commented out)
> that people can start from, so they do not need
> to read external material. Your blog [1] is great,
> but it's better to have it at hand. Also, conf/
> is the first place people look.
>
> Thanks,
> Sebastian
>
> [1]
> http://albinscoding.wordpress.com/2014/09/25/xsl-parser-for-apache-nutch/
>
>
> On 11/01/2014 09:48 PM, Albinscode wrote:
> > Hello everybody,
> >
> > If more work is needed on NUTCH-1740, I'll be glad to
> > help. I developed this plugin because I was among the people who didn't
> > want to create new plugins just for a few metadata extraction tasks ;)

Re: [jira] [Updated] (NUTCH-1644) Should have a parser that uses xpath

Posted by Albinscode <al...@gmail.com>.
Hello Sebastian,

I'll look at the xjb failure. I'm glad to see that it will be
integrated into ivy!

As for the examples, I should already have added some commented tests in
the tests folders. I'll also provide a conf if one does not already
exist. I'll keep you posted.


Thanks,
Albin

2014-11-03 23:50 GMT+01:00 Sebastian Nagel <wa...@googlemail.com>:
> Hi Albin,
>
> you mean NUTCH-1870, right?
> I'm in the process of reviewing your patch.
> I'm just stuck preparing the boilerplate required
> to integrate parse-xsl into the build, tests, and javadoc.
> I've added the jaxb dependencies to ivy,
> but the xjb task fails, presumably because
> there is a version mismatch.
> See the attached patch. If you can resolve this problem,
> that would be great!
>
> Also, we need a configuration template in conf/.
> Just one rules file and one transformer file,
> ideally with some examples (commented out)
> that people can start from, so they do not need
> to read external material. Your blog [1] is great,
> but it's better to have it at hand. Also, conf/
> is the first place people look.
>
> Thanks,
> Sebastian
>
> [1] http://albinscoding.wordpress.com/2014/09/25/xsl-parser-for-apache-nutch/
>
>
> On 11/01/2014 09:48 PM, Albinscode wrote:
>> Hello everybody,
>>
>> If more work is needed on NUTCH-1740, I'll be glad to
>> help. I developed this plugin because I was among the people who didn't
>> want to create new plugins just for a few metadata extraction tasks ;)

Re: [jira] [Updated] (NUTCH-1644) Should have a parser that uses xpath

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Albin,

you mean NUTCH-1870, right?
I'm in the process of reviewing your patch.
I'm just stuck preparing the boilerplate required
to integrate parse-xsl into the build, tests, and javadoc.
I've added the jaxb dependencies to ivy,
but the xjb task fails, presumably because
there is a version mismatch.
See the attached patch. If you can resolve this problem,
that would be great!

Also, we need a configuration template in conf/.
Just one rules file and one transformer file,
ideally with some examples (commented out)
that people can start from, so they do not need
to read external material. Your blog [1] is great,
but it's better to have it at hand. Also, conf/
is the first place people look.

Thanks,
Sebastian

[1] http://albinscoding.wordpress.com/2014/09/25/xsl-parser-for-apache-nutch/


On 11/01/2014 09:48 PM, Albinscode wrote:
> Hello everybody,
> 
> If more work is needed on NUTCH-1740, I'll be glad to
> help. I developed this plugin because I was among the people who didn't
> want to create new plugins just for a few metadata extraction tasks ;)


Re: [jira] [Updated] (NUTCH-1644) Should have a parser that uses xpath

Posted by Albinscode <al...@gmail.com>.
Hello everybody,

If more work is needed on NUTCH-1740, I'll be glad to
help. I developed this plugin because I was among the people who didn't
want to create new plugins just for a few metadata extraction tasks ;)

2014-11-01 19:47 GMT+01:00 Lewis John McGibbney (JIRA) <ji...@apache.org>:
>
>      [ https://issues.apache.org/jira/browse/NUTCH-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Lewis John McGibbney updated NUTCH-1644:
> ----------------------------------------
>     Fix Version/s:     (was: 2.3)
>                    2.4
>
>> Should have a parser that uses xpath
>> ------------------------------------
>>
>>                 Key: NUTCH-1644
>>                 URL: https://issues.apache.org/jira/browse/NUTCH-1644
>>             Project: Nutch
>>          Issue Type: New Feature
>>          Components: parser
>>    Affects Versions: 2.2.1
>>            Reporter: cihad güzel
>>            Assignee: Lewis John McGibbney
>>              Labels: parser, xpath
>>             Fix For: 2.4
>>
>>         Attachments: NUTCH-1644.patch
>>
>>
>> Users may want to parse some URLs via XPath, for example pages from blog or news web sites. There should be a plugin that parses content using XPath.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)