Posted to user@nutch.apache.org by Sebastian Nagel <wa...@googlemail.com> on 2018/06/11 14:34:50 UTC

Preparing to release Nutch 1.15 ?

Hi all,

Almost 80 fixes and improvements are done now, including:

NUTCH-2375: upgrade to the new MapReduce API
  It was a huge change affecting more than 10,000 lines of code. Thanks, Omkar!
  There were some regressions, but those are resolved now. Tests in
  pseudo-distributed mode [1] succeeded, as did a mid-size test crawl (180
  million pages) on a Hadoop cluster.
  It would be great if anybody could test the Nutch master in combination with
  a non-HDFS file system (e.g. s3://)! Please let us know whether this works. Thanks!

NUTCH-1480: Multiple index writer instances with different configurations
  Thanks to Roannel it's now possible to index into multiple Solr or Elasticsearch
  instances. With NUTCH- (needs to be reviewed) the routing of documents
  to the index will also be configurable.

NUTCH-2583: Ralf contributed a huge upgrade of dependencies.
   Nutch now compiles and runs on Java 9 and 10. Only errors in unit tests
   still need to be addressed, in NUTCH-2596.

And two important issues are almost ready to be committed:

NUTCH-2549: a long list of fixes and improvements to protocol-http. Thanks to
   Gerard Bouchard!

NUTCH-2576: plugin protocol-okhttp, a new HTTP protocol implementation based
   on the okhttp library. Supports HTTP/2.


The full list of fixes and improvements is available at [2].

I plan to work through the remaining 70 open issues over the next few
days, hoping to commit or resolve 15-25 of them and move the rest
to Nutch 1.16.

Please vote for issues you want to get included. If there are open
pull requests, it will help if these can be merged, the unit tests
pass, and any review comments are addressed. Thanks!

If there are any objections or blockers, please also let us know!

I also plan to run a test crawl on Hadoop in the middle of this week,
but any help in testing is welcome.

Note that the tutorial needs to be updated (will be done after 1.15
is finally released) to reflect the changes related to NUTCH-1480.


Thanks,
Sebastian


[1] https://github.com/sebastian-nagel/nutch-test-single-node-cluster
[2] https://issues.apache.org/jira/projects/NUTCH/versions/12342302


Re: Preparing to release Nutch 1.15 ?

Posted by Omkar Reddy <om...@gmail.com>.
+1

On 14 June 2018 at 03:09, Furkan KAMACI <fu...@gmail.com> wrote:

> +1
>
>
> On Wed, 13 Jun 2018 at 21:04, Joe Obernberger <
> joseph.obernberger@gmail.com> wrote:
>
>> Woot!
>>
>>
>>
>> On 6/11/2018 11:55 AM, Chris Mattmann wrote:
>> > ++1!
>> >
>> >
>> >
>> > Sounds great.
>> >
>> >
>> >
>> > Cheers,
>> >
>> > Chris

Re: Preparing to release Nutch 1.15 ?

Posted by Furkan KAMACI <fu...@gmail.com>.
+1



Re: Preparing to release Nutch 1.15 ?

Posted by Joe Obernberger <jo...@gmail.com>.
Woot!



Re: [MASSMAIL]Re: Preparing to release Nutch 1.15 ?

Posted by Jurian Broertjes <ju...@openindex.io>.
+1 Nice work all!


On 11-06-18 23:44, BlackIce wrote:
> +1
>
> stoopid question, but I can't find any info on it... can we now parse Open
> Graph metatags?
>
> Greetz
>
> On Mon, Jun 11, 2018 at 9:11 PM Roannel Fernández Hernández <ro...@uci.cu>
> wrote:
>
>> +1
>>
>> Regards
>>


Re: Opengraph metadata was [MASSMAIL]Re: Preparing to release Nutch 1.15 ?

Posted by BlackIce <bl...@gmail.com>.
nutch-site.xml, that's what I meant...
So the syntax in index.parse.md would be:
metatag.og:image,metatag.og:image:alt?

Since we are at it...
Rel-tag: I believe we have a plugin for this, but from what I gather it
only extracts "rel=tag" and no other "rel" tags, or am I mistaken?

And

In nutch-site.xml we have:
<name>metatags.names</name>
<value>*</value>

Apparently one can just use the wildcard "*" as a value for this, but can
this also be done for:

<name>index.parse.md</name>
<value>metatag.*</value>
?
If so, would this be indexed into Solr as a dynamic field
like: <dynamicField name="metatag.*"... ?
(and from there one could just copy the metatags that are relevant to
perform whatever magic)

Sorry if this seems trivial, but I've been doing so many things just by
trial and error that at some point I just need to ask.

Greetz

On Tue, Jun 12, 2018 at 3:02 PM Sebastian Nagel <wa...@googlemail.com>
wrote:

> Yes, of course, defining properties in nutch-site.xml (but not "site.xml")
> also works. It's the usual hierarchy:
>  bin/nutch command -Dkey=value ...
>   overrides the property in nutch-site.xml
>      (must be on the classpath: runtime/local/conf, or inside the nutch.job)
>    which overrides the definition in nutch-default.xml
>

Re: Opengraph metadata was [MASSMAIL]Re: Preparing to release Nutch 1.15 ?

Posted by Sebastian Nagel <wa...@googlemail.com>.
Yes, of course, defining properties in nutch-site.xml (but not "site.xml")
also works. It's the usual hierarchy:
 bin/nutch command -Dkey=value ...
  overrides the property in nutch-site.xml
     (must be on the classpath: runtime/local/conf, or inside the nutch.job)
   which overrides the definition in nutch-default.xml
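
As a rough sketch of the nutch-site.xml variant (an illustration, not part of
the original reply): the -Dindex.parse.md value from the indexchecker example
elsewhere in this thread could presumably be written as a property like the
one below. It assumes the index-metadata plugin is also enabled through
plugin.includes, as it was on that command line.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml (sketch): entries here override nutch-default.xml -->
<configuration>
  <property>
    <name>index.parse.md</name>
    <!-- same value as -Dindex.parse.md=... in the indexchecker example -->
    <value>og:image,og:title,og:description</value>
  </property>
</configuration>
```

Following the hierarchy above, a -Dindex.parse.md=... given on the bin/nutch
command line would still take precedence over this entry.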

On 06/12/2018 02:26 PM, BlackIce wrote:
> PS: Does this work when configured in site.xml like regular metadata?


Re: Opengraph metadata was [MASSMAIL]Re: Preparing to release Nutch 1.15 ?

Posted by BlackIce <bl...@gmail.com>.
PS: Does this work when configured in site.xml like regular metadata?


Re: Opengraph metadata was [MASSMAIL]Re: Preparing to release Nutch 1.15 ?

Posted by BlackIce <bl...@gmail.com>.
sweet thnx!


Re: Opengraph metadata was [MASSMAIL]Re: Preparing to release Nutch 1.15 ?

Posted by Sebastian Nagel <wa...@googlemail.com>.
> stoopid question, but I can't find any info on it... can we now parse Open
> Graph metatags?

parse-tika extracts og:* metatags

% bin/nutch parsechecker -Dplugin.includes='protocol-http|parse-tika' http://ogp.me/
...
Parse Metadata: og:image=http://ogp.me/logo.png og:type=website og:image:width=300
  og:image:alt=The Open Graph logo og:title=Open Graph protocol ...

% bin/nutch indexchecker -Dindex.parse.md=og:image,og:title,og:description \
    -Dplugin.includes='protocol-http|parse-tika|index-metadata' http://ogp.me/
...
og:image :      http://ogp.me/logo.png
og:title :      Open Graph protocol
digest :        f98d6d5e5894ef83561630ebef3bf060
id :    http://ogp.me/
og:description :        The Open Graph protocol enables any web page to become a rich object in a
social graph.
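
The -D options above set ordinary Nutch configuration properties, so the same
settings can presumably also go into conf/nutch-site.xml. A minimal sketch,
assuming the standard property names index.parse.md and plugin.includes; the
plugin list shown is only illustrative (an assumption, not taken from this
thread), so keep whatever plugins your setup already needs and make sure
parse-tika and index-metadata are among them:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Parse metadata keys that index-metadata should copy into index fields -->
  <property>
    <name>index.parse.md</name>
    <value>og:image,og:title,og:description</value>
  </property>
  <!-- Illustrative plugin list (assumption): uses parse-tika for HTML, as in
       the command-line example, and adds index-metadata to the indexing filters -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-tika|index-(basic|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```

Running the indexchecker call from above without the -D options should then
show whether the og:* fields are still picked up from the site configuration.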




Re: [MASSMAIL]Re: Preparing to release Nutch 1.15 ?

Posted by BlackIce <bl...@gmail.com>.
+1

stoopid question, but I can't find any info on it... can we now parse Open
Graph metatags?

Greetz


Re: [MASSMAIL]Re: Preparing to release Nutch 1.15 ?

Posted by Roannel Fernández Hernández <ro...@uci.cu>.
+1

Regards


UCIENCIA 2018: III Conferencia Científica Internacional de la Universidad de las Ciencias Informáticas (3rd International Scientific Conference of the Universidad de las Ciencias Informáticas).
September 24-26, 2018 http://uciencia.uci.cu http://eventos.uci.cu

Re: Preparing to release Nutch 1.15 ?

Posted by Chris Mattmann <ma...@apache.org>.
++1!

Sounds great.

Cheers,
Chris

