Posted to user@nutch.apache.org by Sebastian Nagel <wa...@googlemail.com.INVALID> on 2019/07/26 09:39:31 UTC

Nutch Wiki migrated

Hi all,

the Nutch wiki has been migrated from MoinMoin to Confluence.

You'll find it now on
  https://cwiki.apache.org/confluence/display/NUTCH/Home

Work on improving the Wiki - updating information and moving outdated stuff
into "Archive and Legacy" - is ongoing. Help is welcome. If you want to
contribute documentation, please read
  https://cwiki.apache.org/confluence/display/NUTCH/Home#Home-HowtoeditthisWiki

Cheers,
Sebastian

Re: Injection from webservice

Posted by Dave Beckstrom <db...@collectivefls.com>.
Or use a scheduled wget job to pull them from the remote server and store
them on a path that Nutch can access locally.
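
A minimal sketch of such a scheduled job, written in Python instead of wget so
that a failed download never overwrites the previous seed file. The function
names, the "urls" directory, the "seed.txt" file name and the "crawl/crawldb"
path are illustrative assumptions and not part of Nutch; the only Nutch-specific
piece is the "bin/nutch inject <crawldb> <url_dir>" invocation.

```python
import os
import shutil
import tempfile
import urllib.request


def download_seed_file(url, seed_dir, name="seed.txt"):
    """Fetch a remote seed list and store it as seed_dir/name."""
    os.makedirs(seed_dir, exist_ok=True)
    target = os.path.join(seed_dir, name)
    # Download into a temporary file in the same directory first, then
    # rename it, so a partial download never replaces the old seed file.
    fd, tmp_path = tempfile.mkstemp(dir=seed_dir, prefix=".seed-", suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp, \
                urllib.request.urlopen(url, timeout=30) as resp:
            shutil.copyfileobj(resp, tmp)
        os.replace(tmp_path, target)
    except BaseException:
        # Clean up the partial file before re-raising the error.
        os.unlink(tmp_path)
        raise
    return target


def inject_command(crawldb, seed_dir, nutch="bin/nutch"):
    """Build the argument list for running the injector on a local directory."""
    return [nutch, "inject", crawldb, seed_dir]


# Example use from a cron-driven script (paths are assumptions):
#   download_seed_file("http://example.org/seed", "urls")
#   subprocess.run(inject_command("crawl/crawldb", "urls"), check=True)
```

The download and the injection are kept as separate steps on purpose: if the
remote server is unreachable, the exception stops the script before the
injector runs.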

Regards,

Dave Beckstrom
Technical Delivery Manager / Senior Developer
em: dbeckstrom@collectivefls.com <ah...@collectivefls.com>
ph: 763.323.3499


On Mon, Sep 16, 2019 at 12:14 PM Jorge Betancourt <
betancourt.jorge@gmail.com> wrote:

> Hi Roannel,
>
> The current implementation of the injector only accepts a path (actually an
> org.apache.hadoop.fs.Path) this means that there is no way to feed an URL
> directly unless you download the content first.
>
> If you use the REST API you can send the seed file using the API endpoint.
> Otherwise, you could write your own injector with the proper logic to deal
> with a list of URLs coming from an URL.
>
> The REST API implementation just writes the content in the expected format
> (
>
> https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/service/resources/SeedResource.java#L92-L113
> )
>
> Best Regards,
> Jorge
>
> On Mon, Sep 16, 2019 at 4:59 PM Roannel Fernandez Hernandez <
> roannel@uci.cu>
> wrote:
>
> > Hi folks,
> >
> > Is there any way in Nutch 1.15 to inject a remote seed file (accessible
> > via http or https)?
> >
> > I mean this, for instance:
> >
> > bin/nutch inject crawl http://example.org/seed
> >
> > Regards
> > 1519-2019: 500th anniversary of the town of San Cristóbal de La Habana
> > For Havana, the greatest. #Habana500 #UCIxHabana500
> >
> >
>

Re: [MASSMAIL]Re: Injection from webservice

Posted by Jorge Betancourt <be...@gmail.com>.
TBH I'm not entirely sure. Downloading the file can be scripted without
much trouble. My feeling is that the Injector class has a good
enough scope already. There are valid reasons for having a custom injector
(reading the seed URLs from a DB comes to mind). When I needed a custom
injector it was for very specific requirements, and it made more sense to have a
custom injector instead of generating a seed file (this was before having a
REST API, which right now provides a nice API around the injector).

It is a valid point that we don't have an extension point for the Injector
logic which could allow for having different seed URL providers without
developers needing to worry about the specific injection logic.

My main concern is whether we want to put this additional complexity in Nutch.
Is it really valuable to all of our users to have HTTP/DB/custom injectors
available out of the box in a pluggable way?
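
To make the idea of pluggable seed URL providers concrete, here is a rough
sketch, in Python rather than Java purely for brevity. Every name in it
(SeedProvider, FileSeedProvider, StaticSeedProvider, collect_seeds) is
hypothetical; Nutch has no such extension point, which is exactly the gap
discussed above.

```python
from typing import Dict, Iterable, Iterator, List, Protocol


class SeedProvider(Protocol):
    """Hypothetical extension point: anything that can yield seed URLs."""

    def seed_urls(self) -> Iterable[str]:
        ...


class FileSeedProvider:
    """Reads seeds from a local text file, skipping blank and '#' lines."""

    def __init__(self, path: str) -> None:
        self.path = path

    def seed_urls(self) -> Iterator[str]:
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line and not line.startswith("#"):
                    yield line


class StaticSeedProvider:
    """Serves seeds from memory; stands in for an HTTP- or DB-backed source."""

    def __init__(self, urls: Iterable[str]) -> None:
        self.urls = list(urls)

    def seed_urls(self) -> Iterable[str]:
        return self.urls


def collect_seeds(providers: Iterable[SeedProvider]) -> List[str]:
    """Shared step: merge all providers, dropping duplicates but keeping order."""
    seen: Dict[str, None] = {}
    for provider in providers:
        for url in provider.seed_urls():
            seen.setdefault(url, None)
    return list(seen)
```

The point of the split is that providers only know where URLs come from, while
the shared step (here just deduplication) stays in one place.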

I would love to hear what other people have to say.

Best Regards,
Jorge

On Mon, Sep 16, 2019 at 8:53 PM Roannel Fernandez Hernandez <ro...@uci.cu>
wrote:

> Thanks Jorge for your answer. Do you think an injector that accepts
> local/hdfs paths and in addition API endpoints could be a good improvement
> for Nutch.
>
> Regards, Roannel
>
> ----- Original Message -----
> > From: "Jorge Betancourt" <be...@gmail.com>
> > To: "user" <us...@nutch.apache.org>
> > Sent: Monday, 16 September 2019 13:14:36
> > Subject: [MASSMAIL]Re: Injection from webservice
>
> > Hi Roannel,
> >
> > The current implementation of the injector only accepts a path (actually
> an
> > org.apache.hadoop.fs.Path) this means that there is no way to feed an URL
> > directly unless you download the content first.
> >
> > If you use the REST API you can send the seed file using the API
> endpoint.
> > Otherwise, you could write your own injector with the proper logic to
> deal
> > with a list of URLs coming from an URL.
> >
> > The REST API implementation just writes the content in the expected
> format (
> >
> https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/service/resources/SeedResource.java#L92-L113
> > )
> >
> > Best Regards,
> > Jorge
> >
> > On Mon, Sep 16, 2019 at 4:59 PM Roannel Fernandez Hernandez <
> roannel@uci.cu>
> > wrote:
> >
> >> Hi folks,
> >>
> >> Is there any way in Nutch 1.15 to inject a remote seed file (accessible
> >> via http or https)?
> >>
> >> I mean this, for instance:
> >>
> >> bin/nutch inject crawl http://example.org/seed
> >>
> >> Regards
> >> 1519-2019: 500th anniversary of the town of San Cristóbal de La Habana
> >> For Havana, the greatest. #Habana500 #UCIxHabana500
> >>
> 1519-2019: 500th anniversary of the town of San Cristóbal de La Habana
> For Havana, the greatest. #Habana500 #UCIxHabana500
>
>

Re: [MASSMAIL]Re: Injection from webservice

Posted by Roannel Fernandez Hernandez <ro...@uci.cu>.
Thanks Jorge for your answer. Do you think an injector that accepts local/HDFS paths and, in addition, API endpoints could be a good improvement for Nutch?

Regards, Roannel

----- Original Message -----
> From: "Jorge Betancourt" <be...@gmail.com>
> To: "user" <us...@nutch.apache.org>
> Sent: Monday, 16 September 2019 13:14:36
> Subject: [MASSMAIL]Re: Injection from webservice

> Hi Roannel,
> 
> The current implementation of the injector only accepts a path (actually an
> org.apache.hadoop.fs.Path) this means that there is no way to feed an URL
> directly unless you download the content first.
> 
> If you use the REST API you can send the seed file using the API endpoint.
> Otherwise, you could write your own injector with the proper logic to deal
> with a list of URLs coming from an URL.
> 
> The REST API implementation just writes the content in the expected format (
> https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/service/resources/SeedResource.java#L92-L113
> )
> 
> Best Regards,
> Jorge
> 
> On Mon, Sep 16, 2019 at 4:59 PM Roannel Fernandez Hernandez <ro...@uci.cu>
> wrote:
> 
>> Hi folks,
>>
>> Is there any way in Nutch 1.15 to inject a remote seed file (accessible
>> via http or https)?
>>
>> I mean this, for instance:
>>
>> bin/nutch inject crawl http://example.org/seed
>>
>> Regards
>> 1519-2019: 500th anniversary of the town of San Cristóbal de La Habana
>> For Havana, the greatest. #Habana500 #UCIxHabana500
>>
1519-2019: 500th anniversary of the town of San Cristóbal de La Habana
For Havana, the greatest. #Habana500 #UCIxHabana500


Re: Injection from webservice

Posted by Jorge Betancourt <be...@gmail.com>.
Hi Roannel,

The current implementation of the injector only accepts a path (actually an
org.apache.hadoop.fs.Path). This means that there is no way to feed a URL
directly unless you download the content first.

If you use the REST API you can send the seed file using the API endpoint.
Otherwise, you could write your own injector with the proper logic to deal
with a list of URLs coming from a URL.

The REST API implementation just writes the content in the expected format (
https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/service/resources/SeedResource.java#L92-L113
)
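
For illustration, the expected format is plain text with one URL per line,
optionally followed by tab-separated name=value metadata (for example
nutch.score). The sketch below writes such a file from a list held in memory.
The helper names and the "seed.txt" file name are hypothetical and are not
taken from SeedResource.

```python
import os


def format_seed_line(url, metadata=None):
    """Return one seed line: the URL, then optional tab-separated name=value pairs."""
    url = url.strip()
    # Tabs and line breaks are separators in the seed file, so reject them.
    if not url or any(c in url for c in "\t\n\r"):
        raise ValueError(f"not a usable seed URL: {url!r}")
    fields = [url]
    for key, value in (metadata or {}).items():
        fields.append(f"{key}={value}")
    return "\t".join(fields)


def write_seed_file(entries, seed_dir, name="seed.txt"):
    """Write (url, metadata) pairs to seed_dir/name, one entry per line."""
    os.makedirs(seed_dir, exist_ok=True)
    path = os.path.join(seed_dir, name)
    with open(path, "w", encoding="utf-8", newline="\n") as out:
        for url, metadata in entries:
            out.write(format_seed_line(url, metadata) + "\n")
    return path
```

The resulting directory can then be passed to the injector as the usual local
path argument.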

Best Regards,
Jorge

On Mon, Sep 16, 2019 at 4:59 PM Roannel Fernandez Hernandez <ro...@uci.cu>
wrote:

> Hi folks,
>
> Is there any way in Nutch 1.15 to inject a remote seed file (accessible
> via http or https)?
>
> I mean this, for instance:
>
> bin/nutch inject crawl http://example.org/seed
>
> Regards
> 1519-2019: 500th anniversary of the town of San Cristóbal de La Habana
> For Havana, the greatest. #Habana500 #UCIxHabana500
>
>

Injection from webservice

Posted by Roannel Fernandez Hernandez <ro...@uci.cu>.
Hi folks,

Is there any way in Nutch 1.15 to inject a remote seed file (accessible via http or https)?

I mean this, for instance:

bin/nutch inject crawl http://example.org/seed

Regards
1519-2019: 500th anniversary of the town of San Cristóbal de La Habana
For Havana, the greatest. #Habana500 #UCIxHabana500


Re: Nutch Wiki migrated

Posted by Sebastian Nagel <wa...@googlemail.com.INVALID>.
Hi Furkan,

yes, of course. The Confluence wiki provides a useful tree view
which was absent in the MoinMoin wiki. And we need to organize
the page list into a real tree to make the navigation easier.
In the old wiki there was only the start/home page for navigation.
Of course, we could just start with this tree-like structure.

I've already started to move the obviously outdated stuff
("Nutch 0.9 tutorial" etc.) below "Archive and Legacy".
It's unbelievable how much outdated stuff we have. In the
old wiki it was just invisible.

There are other pages which are just stubs, such as
  https://cwiki.apache.org/confluence/display/NUTCH/PythonLanguage
Let's just remove it. I've done so just now.

If you could help here, feel free to carry on. Thanks!

> or something like that to gather suggestions of the new wiki structure?

Let's just start a wiki page to collectively develop and discuss the structure.
Here it is:
  https://cwiki.apache.org/confluence/display/NUTCH/Wiki+Page+Tree+Structure
If you have good ideas, just go on!

Thanks,
Sebastian


On 7/26/19 11:58 AM, Furkan KAMACI wrote:
> Hi Sebastian,
> 
> It seems that we need to organize wiki pages. There are 110 child pages, and some of them are
> useless (i.e. https://cwiki.apache.org/confluence/display/NUTCH/PythonLanguage). We can create a
> Google Docs document or something like that to gather suggestions of the new wiki structure?
> 
> Kind Regards,
> Furkan KAMACI
> 
> On Fri, Jul 26, 2019 at 12:39 PM Sebastian Nagel <wa...@googlemail.com.invalid> wrote:
> 
>     Hi all,
> 
>     the Nutch wiki has been migrated from MoinMoin to Confluence.
> 
>     You'll find it now on
>       https://cwiki.apache.org/confluence/display/NUTCH/Home
> 
>     Work on improving the Wiki - updating information and moving outdated stuff
>     into "Archive and Legacy" - is ongoing. Help is welcome, if you want to
>     contribute documentation please read
>       https://cwiki.apache.org/confluence/display/NUTCH/Home#Home-HowtoeditthisWiki
> 
>     Cheers,
>     Sebastian
> 


Re: Nutch Wiki migrated

Posted by Furkan KAMACI <fu...@gmail.com>.
Hi Sebastian,

It seems that we need to organize the wiki pages. There are 110 child pages,
and some of them are useless (e.g.
https://cwiki.apache.org/confluence/display/NUTCH/PythonLanguage). Could we
create a Google Docs document or something like that to gather suggestions
for the new wiki structure?

Kind Regards,
Furkan KAMACI

On Fri, Jul 26, 2019 at 12:39 PM Sebastian Nagel
<wa...@googlemail.com.invalid> wrote:

> Hi all,
>
> the Nutch wiki has been migrated from MoinMoin to Confluence.
>
> You'll find it now on
>   https://cwiki.apache.org/confluence/display/NUTCH/Home
>
> Work on improving the Wiki - updating information and moving outdated stuff
> into "Archive and Legacy" - is ongoing. Help is welcome, if you want to
> contribute documentation please read
>
> https://cwiki.apache.org/confluence/display/NUTCH/Home#Home-HowtoeditthisWiki
>
> Cheers,
> Sebastian
>
