Posted to dev@nutch.apache.org by Sami Siren <ss...@gmail.com> on 2009/03/28 20:53:52 UTC

[ANNOUNCE] Apache Nutch 1.0

I am pleased to announce the availability of Apache Nutch 1.0.

Apache Nutch, a subproject of Apache Lucene, is open source web-search
software. It builds on Lucene Java, adding web-specific components such as a
crawler, a link-graph database, and parsers for HTML and other document formats.

Apache Nutch 1.0 contains a number of bug fixes and improvements, such as
Solr integration, a new indexing framework and a new scoring framework, just
to mention a few. Details can be found in the changes file:

http://svn.apache.org/repos/asf/lucene/nutch/tags/release-1.0/CHANGES.txt

Apache Nutch is available for download from the following download page:
http://www.apache.org/dyn/closer.cgi/lucene/nutch/nutch-1.0.tar.gz

When downloading from a mirror site, please remember to verify the 
downloads using signatures found on the Apache site:
http://www.apache.org/dist/lucene/nutch/KEYS

For more information on Apache Nutch, visit the project home page:
http://lucene.apache.org/nutch

-- Sami Siren (on behalf of the Apache Nutch community)
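
The verification step mentioned in the announcement could look roughly like
the sketch below. It assumes GnuPG is installed, that the KEYS file has been
downloaded from the Apache site, and that the detached signature is named
after the tarball with an .asc suffix (an assumption, not stated in the
announcement). The run and verify_release helpers are hypothetical names;
with DRY_RUN=1 the commands are only printed.

```shell
#!/bin/sh
# Sketch: verify a downloaded release against its detached signature.
# Assumptions: gpg is on PATH; the signature file is "<tarball>.asc".

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1 = release tarball, $2 = KEYS file fetched from the Apache site
verify_release() {
    run gpg --import "$2" &&
    run gpg --verify "$1.asc" "$1"
}

# Example (not run automatically):
#   verify_release nutch-1.0.tar.gz KEYS
```

gpg --verify exits with a non-zero status when the signature does not match,
so the function can be used directly in a script condition.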

Re: [ANNOUNCE] Apache Nutch 1.0

Posted by Dennis Kubes <ku...@apache.org>.

That is already in the works.  See:

https://issues.apache.org/jira/browse/NUTCH-650

Dennis

Ryan Smith wrote:
> Dennis,
> Thank you.  Ok, then one other question please :).  I want to use heritrix,
> and the plugin for heritrix that writes records directly to hbase using
> hbase-writer:
> http://code.google.com/p/hbase-writer/
> (Hbase runs on top of hadoop)
> Would it be feasible/make sense for someone (maybe myself) to write a new
> plugin for nutch to read its input data from hbase tables instead of arc
> files?
> Thanks again.
> -Ryan

Re: [ANNOUNCE] Apache Nutch 1.0

Posted by Ryan Smith <ry...@gmail.com>.

Dennis,
Thank you. OK, then one other question, please :). I want to use Heritrix
with hbase-writer, the Heritrix plugin that writes records directly to HBase:
http://code.google.com/p/hbase-writer/
(HBase runs on top of Hadoop.)
Would it be feasible, and would it make sense, for someone (maybe myself) to
write a new plugin for Nutch that reads its input data from HBase tables
instead of ARC files?
Thanks again.
-Ryan

On Sat, Mar 28, 2009 at 5:22 PM, Dennis Kubes <ku...@apache.org> wrote:

> To a point, yes.  Heritrix will output in ARC format.  Then you can use
> o.a.n.tools.arc.ArcSegmentCreator to convert the ARC files to segments.
> From there you can run other tools on the segments as normal.  What you
> won't get is Heritrix access to the crawldb.
>
> Dennis

Re: [ANNOUNCE] Apache Nutch 1.0

Posted by Dennis Kubes <ku...@apache.org>.

To a point, yes.  Heritrix will output in ARC format.  Then you can use
o.a.n.tools.arc.ArcSegmentCreator to convert the ARC files to
segments.  From there you can run other tools on the segments as normal.
What you won't get is Heritrix access to the crawldb.

Dennis
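
The conversion described above might be invoked roughly as follows. This is
a sketch under assumptions: the class name used here,
org.apache.nutch.tools.arc.ArcSegmentCreator, and its two-argument usage
(ARC input directory, segments output directory) should be checked against
your Nutch build, and the install path, directory names and helper names
are hypothetical.

```shell
#!/bin/sh
# Sketch: turn Heritrix ARC files into a Nutch segment.
# Assumption: bin/nutch accepts a fully qualified class name as its command.

NUTCH_HOME=${NUTCH_HOME:-/opt/nutch-1.0}   # hypothetical install location

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1 = directory with ARC files, $2 = segments output directory
arcs_to_segment() {
    run "$NUTCH_HOME/bin/nutch" org.apache.nutch.tools.arc.ArcSegmentCreator \
        "$1" "$2"
}

# Example (not run automatically):
#   arcs_to_segment heritrix/arcs crawl/segments
```

The resulting segment can then be passed to the usual segment-based tools;
as noted above, the crawldb is not updated by Heritrix.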

Ryan Smith wrote:
> Is it possible to use heritrix as nutch's crawler?

Re: [ANNOUNCE] Apache Nutch 1.0

Posted by Ryan Smith <ry...@gmail.com>.

Is it possible to use heritrix as nutch's crawler?



Re: [ANNOUNCE] Apache Nutch 1.0

Posted by Ryan Smith <ry...@gmail.com>.

Dennis, Thanks a lot.
-Ryan

2009/3/28 Tony Wang <iv...@gmail.com>

> Hi Sami,
>
> Thank you so much for the good news. Is there going to be documentation for
> Solr integration? Sorry to Otis, I know you are going to ask me to try to
> find it out by myself ;)
>
> Thanks! - Tony

Re: [ANNOUNCE] Apache Nutch 1.0

Posted by Tony Wang <iv...@gmail.com>.

Hi Sami,

Thank you so much for the good news. Is there going to be documentation for
Solr integration? Sorry to Otis, I know you are going to ask me to try to
find it out by myself ;)

Thanks! - Tony




-- 
Are you RCholic? www.RCholic.com
Gentleness, kindness, respect, frugality, deference, benevolence, righteousness, propriety, wisdom, trust
~ ..~
 (oo)

Re: lukeall-0.9.1 to manually add indexes

Posted by Thorsten Scherler <th...@juntadeandalucia.es>.

On Mon, 2009-03-30 at 01:16 -0400, alxsss@aim.com wrote:
> Hello,
> 
> I used lukeall-0.9.1 to manually add a document to indexes generated by nutch-1.0. However, in search the manually added documents do not show up. 
> Thanks for any suggestions.

Not sure (not using nutch anymore), but in Solr you would need to
commit. Maybe in Nutch there is something like that as well.

HTH
-- 
Thorsten Scherler <thorsten.at.apache.org>
Open Source Java <consulting, training and solutions>

Sociedad Andaluza para el Desarrollo de la Sociedad 
de la Información, S.A.U. (SADESI)

Re: nutch-1.0 with solr

Posted by al...@aim.com.

The add request looks like this:

curl http://localhost:8983/solr/update -H "Content-Type: text/xml" --data-binary '<add>
 <doc boost="2.5">
 <field name="segment">20090512170318</field>
 <field name="digest">86937aaee8e748ac3007ed8b66477624</field>
 <field name="boost">0.21189615</field>
 <field name="url">test.com</field>
 <field name="title">test test</field>
 <field name="tstamp"> 20090513003210909</field>
 </doc> </add>'
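
A minimal sketch of the full sequence, assuming a Solr instance at the URL
used above: unless autocommit is configured, an added document only becomes
visible to searches after a commit has been sent. The helper names
(build_add_xml, post_xml, add_and_commit) are hypothetical, the two fields
are a subset of those in the request above, and field values are assumed
to contain no XML metacharacters.

```shell
#!/bin/sh
# Sketch: add one document to Solr, then commit so that searches can see it.

SOLR_UPDATE_URL=${SOLR_UPDATE_URL:-http://localhost:8983/solr/update}

# Print an <add> request for one document. $1 = url, $2 = title
build_add_xml() {
    printf '<add><doc><field name="url">%s</field><field name="title">%s</field></doc></add>\n' "$1" "$2"
}

# POST an XML body ($1) to the update handler.
post_xml() {
    curl -s "$SOLR_UPDATE_URL" -H "Content-Type: text/xml" --data-binary "$1"
}

# Send the add request first, then the commit.
add_and_commit() {
    post_xml "$(build_add_xml "$1" "$2")" &&
    post_xml '<commit/>'
}

# Example (not run automatically):
#   add_and_commit test.com "test test"
```

Both requests should return a response with status 0; a query for the new
title afterwards is a quick way to confirm the document is searchable.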


 


Re: nutch-1.0 with solr

Posted by al...@aim.com.

I went through that page. But when I try to add documents to the index manually using

curl http://localhost:8983/solr/update -H "Content-Type: text/xml" --data-binary '<commit waitFlush="false" waitSearcher="false"/>'
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">453</int></lst>
</response>

I get

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">113</int></lst>
</response>


then I do 

curl http://localhost:8983/solr/update -H "Content-Type: text/xml" --data-binary '<commit waitFlush="false" waitSearcher="false"/>'
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">453</int></lst>
</response>


and the added keywords are not in the search results.
So I am not sure what went wrong.

Thanks.
Alex.


 


 


Re: nutch-1.0 with solr

Posted by Raymond Balmès <ra...@gmail.com>.

Just a perfect page; it worked the first time for me.

http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/

-Raymond-
2009/5/12 <al...@aim.com>

>
>  Hello,
>
> I just heard that nutch-1.0 has Solr integration. Are there any tutorials on
> how to add data to nutch-1.0 indexes using Solr manually?
>
> Thanks.
> Alex.

nutch-1.0 with solr

Posted by al...@aim.com.

Hello,

I just heard that nutch-1.0 has Solr integration. Are there any tutorials on how to add data to nutch-1.0 indexes using Solr manually?

Thanks.
Alex.

Re: lukeall-0.9.1 to manually add indexes

Posted by Andrzej Bialecki <ab...@getopt.org>.

alxsss@aim.com wrote:
> Hello,
> 
> Thanks all for your suggestions. My situation is the following. I had
> Nutch 1.0 crawl, fetch and index a lot of files. Then I needed to
> index a few more files. But I know the keywords for those files and their
> locations. I thought it would be easier to add the keywords to the index
> that I have instead of having nutch-1.0 do the crawling, fetching and
> indexing. So, what is the step-by-step procedure for manually adding data
> to the index that I have?

There is no procedure except running a Fetcher and fetching these 
additional urls, and then creating an additional small index for this 
new segment. You can then merge this small index with the main index (or 
use it as it is - NutchBean handles multiple indexes).

This situation is related to the fact that Nutch was designed and 
optimized for large crawls and massive updates - as a consequence small 
updates are cumbersome and inefficient.

If you need to do frequent small updates, please consider using Solr.
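
The fetch, index and merge procedure described above could be sketched
roughly as follows. The sub-commands (inject, generate, fetch, updatedb,
invertlinks, index, merge) are those of the bin/nutch script; using a
separate small crawldb for the extra URLs, the directory layout and the
helper names are assumptions made for illustration. The sketch also assumes
the fetcher parses pages while fetching; if not, the segment needs a
separate parse step before updatedb. With DRY_RUN=1 the commands are only
printed.

```shell
#!/bin/sh
# Sketch: fetch a few extra URLs into a new segment, index that segment,
# and merge the small index with the main one.

NUTCH=${NUTCH:-bin/nutch}

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1 = directory with seed URL files, $2 = working directory
prepare_segment() {
    run "$NUTCH" inject "$2/crawldb" "$1" &&
    run "$NUTCH" generate "$2/crawldb" "$2/segments"
}

# $1 = working directory, $2 = path of the generated segment
fetch_and_index() {
    run "$NUTCH" fetch "$2" &&
    run "$NUTCH" updatedb "$1/crawldb" "$2" &&
    run "$NUTCH" invertlinks "$1/linkdb" "$2" &&
    run "$NUTCH" index "$1/indexes" "$1/crawldb" "$1/linkdb" "$2"
}

# $1 = merged output index, $2 = main indexes dir, $3 = extra indexes dir
merge_indexes() {
    run "$NUTCH" merge "$1" "$2" "$3"
}

# Example (not run automatically):
#   prepare_segment extra-urls crawl-extra
#   segment=$(ls -d crawl-extra/segments/* | tail -1)
#   fetch_and_index crawl-extra "$segment"
#   merge_indexes crawl/index-merged crawl/indexes crawl-extra/indexes
```

The new segment has to remain available to the searcher alongside the
index, since search results are built from the segment data as well.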

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: lukeall-0.9.1 to manually add indexes

Posted by al...@aim.com.

Hello,

Thanks all for your suggestions. My situation is the following. I had Nutch 1.0 crawl, fetch and index a lot of files. Then I needed to index a few more files. But I know the keywords for those files and their locations. I thought it would be easier to add the keywords to the index that I have instead of having nutch-1.0 do the crawling, fetching and indexing. So, what is the step-by-step procedure for manually adding data to the index that I have?

Thanks in advance.
Alex.


Re: lukeall-0.9.1 to manually add indexes

Posted by Andrzej Bialecki <ab...@getopt.org>.

Lyndon Maydwell wrote:
> I've noticed that you need to optimize the index for nutch to pick up changes.
> 
> Have you tried this?
> 
> On Wed, Apr 1, 2009 at 12:42 PM,  <al...@aim.com> wrote:
>>  Thanks for your response. In
>> Luke there is also an option to commit. I opened the new index again, and
>> the document I created is there. But the search does not return
>> anything for the added keywords. I will try Solr to see if it works.

Hm, I don't know what you are trying to do ... First, the information 
from alxsss is misleading - there is no commit() operation in Nutch. 
Also, the index doesn't have to be optimized. The most likely reason why 
the added document is not visible is that Nutch also needs a 
corresponding record in the segments/... data. This is not possible to 
create separately, you need to use Fetcher to create a new segment 
(which you can subsequently merge with the first segment), and then 
create a new index from this new segment.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: lukeall-0.9.1 to manually add indexes

Posted by Lyndon Maydwell <ma...@gmail.com>.

I've noticed that you need to optimize the index for nutch to pick up changes.

Have you tried this?

On Wed, Apr 1, 2009 at 12:42 PM,  <al...@aim.com> wrote:
>
>  Thanks for your response. In
> Luke there is also an option to commit. I opened the new index again, and
> the document I created is there. But the search does not return
> anything for the added keywords. I will try Solr to see if it works.

Re: lukeall-0.9.1 to manually add indexes

Posted by al...@aim.com.

Thanks for your response. In
Luke there is also an option to commit. I opened the new index again, and
the document I created is there. But the search does not return
anything for the added keywords. I will try Solr to see if it works.

lukeall-0.9.1 to manually add indexes

Posted by al...@aim.com.

Hello,

I used lukeall-0.9.1 to manually add a document to the indexes generated by nutch-1.0. However, the manually added document does not show up in search.
Thanks for any suggestions.
A.