Posted to user@nutch.apache.org by Tim Patton <tp...@dealcatcher.com> on 2006/03/02 22:14:29 UTC

Question about Index Writing/Merging

I'm working on a project that uses pieces of Nutch to store a Lucene index
in Hadoop (basically I am using the FsDirectory and related classes).  When
trying to write to an index I got an unsupported exception: FsDirectory
doesn't support "seek", which Lucene uses on closing an IndexWriter, because
the file system is write-once.  After looking through the Nutch code I saw
that an index is worked on locally, either with writing or merging, then
transferred into the dfs when finished.  I just wanted to check that I
understood this correctly.  If I were to work on a multi-gigabyte index, I
would need that much free space on my local drive to hold the index, and it
would take a while to copy each way.  How does this work for the really huge
indexes people want to build with Nutch?  Would there be many smaller Lucene
indexes in the dfs, since obviously one huge terabyte index couldn't be
downloaded?  I'm just trying to get a better understanding of how Nutch
works.

 

Thanks,

Tim

 


Re[2]: Nutch administration web interface?

Posted by Dima Mazmanov <nu...@proservice.ge>.
Hi, Stefan.

Pity....


> just 0.8.

> On 11.04.2006 at 23:08, carmmello wrote:

>> Will this interface also cope with Nutch 0.7 or just the new 0.8?
>>
>>
>> ----- Original Message -----
>> From: "Stefan Groschupf" <sg@media-style.com>
>> To: <nu...@lucene.apache.org>
>> Sent: Tuesday, April 11, 2006 5:53 PM
>> Subject: Re: Nutch administration web interface?
>>
>>
>>> ... a beta will be available soon.
>>> On 11.04.2006 at 22:22, Rida Benjelloun wrote:
>>>> Hi Robert,
>>>> You can see this page
>>>> http://wiki.apache.org/nutch/NutchAdministrationUserInterface. But I don't
>>>> have any idea about the advancement of this project.
>>>> Best regards.
>>>> On 4/10/06, Robert Douglass <ro...@robshouse.net> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> has anyone done any work on a web interface for administering Nutch?
>>>>>
>>>>> How would one go about doing this? In Java, I imagine you'd use the Java
>>>>> classes directly (the command line tool is just a wrapper for the Java,
>>>>> after all), but in other languages (I'm thinking PHP), would it be most
>>>>> sensible to call the command line tools?
>>>>>
>>>>> cheers,
>>>>>
>>>>> Robert Douglass
>>>>>
>>> ---------------------------------------------
>>> blog: http://www.find23.org
>>> company: http://www.media-style.com
>>>
>>

> ---------------------------------------------
> blog: http://www.find23.org
> company: http://www.media-style.com









-- 
Regards,
 Dima                          mailto:nuther@proservice.ge


Search for keywords by URL

Posted by Richard Braman <rb...@bramantax.com>.
If I wanted to submit a URL to Nutch and see what keywords it scored on,
how would I do that?

Richard


Re: Nutch administration web interface?

Posted by Stefan Groschupf <sg...@media-style.com>.
just 0.8.

On 11.04.2006 at 23:08, carmmello wrote:

> Will this interface also cope with Nutch 0.7 or just the new 0.8?
>
>
> ----- Original Message -----
> From: "Stefan Groschupf" <sg@media-style.com>
> To: <nu...@lucene.apache.org>
> Sent: Tuesday, April 11, 2006 5:53 PM
> Subject: Re: Nutch administration web interface?
>
>
>> ... a beta will be available soon.
>> On 11.04.2006 at 22:22, Rida Benjelloun wrote:
>>> Hi Robert,
>>> You can see this page
>>> http://wiki.apache.org/nutch/NutchAdministrationUserInterface. But I don't
>>> have any idea about the advancement of this project.
>>> Best regards.
>>> On 4/10/06, Robert Douglass <ro...@robshouse.net> wrote:
>>>>
>>>> Hi,
>>>>
>>>> has anyone done any work on a web interface for administering Nutch?
>>>>
>>>> How would one go about doing this? In Java, I imagine you'd use the Java
>>>> classes directly (the command line tool is just a wrapper for the Java,
>>>> after all), but in other languages (I'm thinking PHP), would it be most
>>>> sensible to call the command line tools?
>>>>
>>>> cheers,
>>>>
>>>> Robert Douglass
>>>>
>> ---------------------------------------------
>> blog: http://www.find23.org
>> company: http://www.media-style.com
>>
>

---------------------------------------------
blog: http://www.find23.org
company: http://www.media-style.com



Re: Nutch administration web interface?

Posted by carmmello <ca...@globo.com>.
Will this interface also cope with Nutch 0.7 or just the new 0.8?


----- Original Message ----- 
From: "Stefan Groschupf" <sg...@media-style.com>
To: <nu...@lucene.apache.org>
Sent: Tuesday, April 11, 2006 5:53 PM
Subject: Re: Nutch administration web interface?


> ... a beta will be available soon.
> 
> On 11.04.2006 at 22:22, Rida Benjelloun wrote:
>
>> Hi Robert,
>> You can see this page
>> http://wiki.apache.org/nutch/NutchAdministrationUserInterface. But I don't
>> have any idea about the advancement of this project.
>> Best regards.
>> On 4/10/06, Robert Douglass <ro...@robshouse.net> wrote:
>>>
>>> Hi,
>>>
>>> has anyone done any work on a web interface for administering Nutch?
>>>
>>> How would one go about doing this? In Java, I imagine you'd use the Java
>>> classes directly (the command line tool is just a wrapper for the Java,
>>> after all), but in other languages (I'm thinking PHP), would it be most
>>> sensible to call the command line tools?
>>>
>>> cheers,
>>>
>>> Robert Douglass
>>>
> 
> ---------------------------------------------
> blog: http://www.find23.org
> company: http://www.media-style.com
> 
> 
> 
> 
> 
>

Re: Nutch administration web interface?

Posted by Stefan Groschupf <sg...@media-style.com>.
... a beta will be available soon.

On 11.04.2006 at 22:22, Rida Benjelloun wrote:

> Hi Robert,
> You can see this page
> http://wiki.apache.org/nutch/NutchAdministrationUserInterface. But I don't
> have any idea about the advancement of this project.
> Best regards.
> On 4/10/06, Robert Douglass <ro...@robshouse.net> wrote:
>>
>> Hi,
>>
>> has anyone done any work on a web interface for administering Nutch?
>>
>> How would one go about doing this? In Java, I imagine you'd use the Java
>> classes directly (the command line tool is just a wrapper for the Java,
>> after all), but in other languages (I'm thinking PHP), would it be most
>> sensible to call the command line tools?
>>
>> cheers,
>>
>> Robert Douglass
>>

---------------------------------------------
blog: http://www.find23.org
company: http://www.media-style.com



Re: Nutch administration web interface?

Posted by Rida Benjelloun <ri...@doculibre.com>.
Hi Robert,
You can see this page
http://wiki.apache.org/nutch/NutchAdministrationUserInterface. But I don't
have any idea about the advancement of this project.
Best regards.
On 4/10/06, Robert Douglass <ro...@robshouse.net> wrote:
>
> Hi,
>
> has anyone done any work on a web interface for administering Nutch?
>
> How would one go about doing this? In Java, I imagine you'd use the Java
> classes directly (the command line tool is just a wrapper for the Java,
> after all), but in other languages (I'm thinking PHP), would it be most
> sensible to call the command line tools?
>
> cheers,
>
> Robert Douglass
>

Nutch administration web interface?

Posted by Robert Douglass <ro...@robshouse.net>.
Hi,

has anyone done any work on a web interface for administering Nutch?

How would one go about doing this? In Java, I imagine you'd use the Java 
classes directly (the command line tool is just a wrapper for the Java, 
after all), but in other languages (I'm thinking PHP), would it be most 
sensible to call the command line tools?

cheers,

Robert Douglass

Re: Question about Index Writing/Merging

Posted by Doug Cutting <cu...@apache.org>.
Tim Patton wrote:
> Thanks, that's exactly what I was thinking.  Do you have any recommendations
> on maximum index size (obviously we'd be testing ourselves, but it's good to
> get an idea)?

Searches tend to get too slow somewhere between 10M and 100M pages.

Using a sorted index (IndexSorter & searcher.max.hits) can improve the
situation dramatically but may not be appropriate for all applications.
For a discussion of this feature, see:

http://www.mail-archive.com/nutch-developers@lists.sourceforge.net/msg06423.html

and

http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg01950.html

Doug
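
A minimal sketch of why a sorted index plus a hit limit can help. It assumes the index is reordered by a query-independent document score (highest first) and that the searcher stops after a fixed number of matches; that reading of IndexSorter and searcher.max.hits is an assumption here, and `sort_index`, `search`, and the document layout are hypothetical illustrations, not Nutch or Lucene APIs.

```python
def sort_index(docs):
    """Reorder documents so the highest static score comes first."""
    return sorted(docs, key=lambda d: d["score"], reverse=True)


def search(sorted_docs, term, max_hits):
    """Scan in index order and stop once max_hits matches are collected.

    Returns the matching document ids and how many documents were examined.
    """
    hits = []
    examined = 0
    for doc in sorted_docs:
        examined += 1
        if term in doc["terms"]:
            hits.append(doc["id"])
            if len(hits) >= max_hits:
                break  # early exit: the rest of the index is never read
    return hits, examined


# Hypothetical toy index: id, static score, and the terms each page contains.
docs = [
    {"id": 1, "score": 0.2, "terms": {"nutch"}},
    {"id": 2, "score": 0.9, "terms": {"nutch", "lucene"}},
    {"id": 3, "score": 0.5, "terms": {"hadoop"}},
    {"id": 4, "score": 0.7, "terms": {"nutch"}},
]
hits, examined = search(sort_index(docs), "nutch", max_hits=2)
```

In this sketch the hits are the best pages by static score rather than by query relevance, which is one reason the approach may not suit every application.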

RE: Question about Index Writing/Merging

Posted by Tim Patton <tp...@dealcatcher.com>.
Thanks, that's exactly what I was thinking.  Do you have any recommendations
on maximum index size (obviously we'd be testing ourselves, but it's good to
get an idea)?

Tim

-----Original Message-----
From: Doug Cutting [mailto:cutting@apache.org] 
Sent: Thursday, March 02, 2006 7:34 PM
To: nutch-user@lucene.apache.org
Subject: Re: Question about Index Writing/Merging


Tim Patton wrote:
> I'm working on a project that uses pieces of Nutch to store a Lucene index
> in Hadoop (basically I am using the FsDirectory and related classes).  When
> trying to write to an index I got an unsupported exception since FsDirectory
> doesn't support "seek" which Lucene uses on closing an IndexWriter, the file
> system is write-once.  After looking through the Nutch code I saw that an
> index is worked on locally, either with writing or merging, then transferred
> into the dfs when finished.  I just was checking to make sure I understood
> this correctly.

Yes, this is correct.

> If I was to work on a multi-gigabyte index I would need
> that much free space on my local drive to transfer the index to and it would
> take a while to copy each way.  How does this work for the really huge
> indexes people want to build with Nutch?  Would there be many smaller Lucene
> indexes in the dfs, since obviously one huge terabyte index couldn't be
> downloaded?  I'm just trying to have a better understanding of how Nutch
> works.

Terabyte indexes aren't actually very useful, since they take too long 
to search.  So with big collections (>100M pages) one will keep multiple 
indexes and use distributed search to search them all in parallel.

Doug


Re: Question about Index Writing/Merging

Posted by Doug Cutting <cu...@apache.org>.
Tim Patton wrote:
> I'm working on a project that uses pieces of Nutch to store a Lucene index
> in Hadoop (basically I am using the FsDirectory and related classes).  When
> trying to write to an index I got an unsupported exception since FsDirectory
> doesn't support "seek" which Lucene uses on closing an IndexWriter, the file
> system is write-once.  After looking through the Nutch code I saw that an
> index is worked on locally, either with writing or merging, then transferred
> into the dfs when finished.  I just was checking to make sure I understood
> this correctly.

Yes, this is correct.
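
The build-locally-then-copy workflow confirmed above can be sketched roughly as follows. This is an illustration only: `build_index_locally` is a hypothetical stand-in for Lucene index writing, an ordinary directory plays the role of the write-once dfs, and none of the names are Nutch or Hadoop APIs.

```python
import shutil
import tempfile
from pathlib import Path


def build_index_locally(docs, local_dir: Path) -> None:
    """Stand-in for index writing, which needs random-access (seek) writes."""
    segment = local_dir / "segment.bin"
    with open(segment, "wb") as f:
        f.write(b"\x00" * 4)                   # placeholder header
        for doc in docs:
            f.write(doc.encode("utf-8") + b"\n")
        f.seek(0)                              # seek back: not possible on a write-once store
        f.write(len(docs).to_bytes(4, "big"))  # patch the header with the doc count


def publish_index(docs, dfs_root: Path, name: str) -> Path:
    """Build in local scratch space, then copy the finished files in one pass."""
    target = dfs_root / name
    if target.exists():
        raise FileExistsError(f"{target} already exists (write-once)")
    with tempfile.TemporaryDirectory() as scratch:
        local_dir = Path(scratch) / name
        local_dir.mkdir()
        build_index_locally(docs, local_dir)
        shutil.copytree(local_dir, target)     # sequential copy, no seeks needed
    return target
```

The local scratch directory has to hold the whole index before the copy, which is the free-space cost raised in the question.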

> If I was to work on a multi-gigabyte index I would need
> that much free space on my local drive to transfer the index to and it would
> take a while to copy each way.  How does this work for the really huge
> indexes people want to build with Nutch?  Would there be many smaller Lucene
> indexes in the dfs, since obviously one huge terabyte index couldn't be
> downloaded?  I'm just trying to have a better understanding of how Nutch
> works.

Terabyte indexes aren't actually very useful, since they take too long 
to search.  So with big collections (>100M pages) one will keep multiple 
indexes and use distributed search to search them all in parallel.

Doug
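
As a rough illustration of searching several smaller indexes in parallel and merging the results, here is a minimal sketch. It assumes each index can return its own top-k scored hits; `search_shard`, `distributed_search`, and the in-memory shard layout are hypothetical and not the Nutch distributed search API.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, term, k):
    """Return the top-k (score, doc_id) pairs from one index."""
    matches = [
        (score, doc_id)
        for doc_id, (score, terms) in shard.items()
        if term in terms
    ]
    return heapq.nlargest(k, matches)


def distributed_search(shards, term, k):
    """Query every index in parallel, then merge the partial top-k lists."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = list(pool.map(lambda s: search_shard(s, term, k), shards))
    return heapq.nlargest(k, (hit for part in partials for hit in part))


# Hypothetical toy shards: doc_id -> (score, terms).
shards = [
    {"a": (0.9, {"nutch"}), "b": (0.1, {"nutch"})},
    {"c": (0.5, {"nutch"}), "d": (0.8, {"lucene"})},
]
top = distributed_search(shards, "nutch", k=2)
```

Each index only has to return k hits, so the merge step stays small no matter how large the individual indexes are.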