Posted to user@nutch.apache.org by Trey Spiva <tr...@spiva.com> on 2007/12/04 18:20:55 UTC

Hadoop distributed search.

According to the Hadoop tutorial on the wiki
(http://wiki.apache.org/nutch/NutchHadoopTutorial),

"you don't want to search using DFS, you want to search using local  
filesystems.    Once the index has been created on the DFS you can  
use the hadoop copyToLocal command to move it to the local file  
system as such" ... "Understand that at this point we are not using  
the DFS or MapReduce to do the searching, all of it is on a local  
machine".

So my understanding is that Hadoop is only good for batch index
building, and is not suitable for incremental index building or
search. Is this true?

The reason I am asking is that when I read the ACM article by Mike
Cafarella and Doug Cutting, it sounded to me like the concern was
making the index structures fit in primary memory, not the entire
crawled database.  Did I misunderstand the ACM article?

Re: Hadoop distributed search.

Posted by Dennis Kubes <ku...@apache.org>.

Trey Spiva wrote:
> Thanks for your help.
> 
> On Dec 4, 2007, at 10:08 AM, Jasper Kamperman wrote:
> 
>>>>> According to a hadoop tutorial  
>>>>> (http://wiki.apache.org/nutch/NutchHadoopTutorial) on wiki,
>>>>> "you don't want to search using DFS, you want to search using local 
>>>>> filesystems.    Once the index has been created on the DFS you can 
>>>>> use the hadoop copyToLocal command to move it to the local file 
>>>>> system as such" ... "Understand that at this point we are not using 
>>>>> the DFS or MapReduce to do the searching, all of it is on a local 
>>>>> machine".
>>>>> So my understanding is that hadoop is only good for batch index 
>>>>> building, and is not proper for incremental index building and 
>>>>> search. Is this true?
>>>>
>>>> That is correct.  DFS for batch processing and MapReduce jobs.  
>>>> Local servers (disks) for serving indexes.  Even better put local 
>>>> indexes (not segments, just indexes) in RAM.
>>>
>>> In the NutchHadoopTutorial it says "the directory which it points to 
>>> should contain not just the index directory but also the linkdb, 
>>> segments, etc. All of these different databases are used by the 
>>> search. This is why we copied over the crawled directory and not just 
>>> the index directory."

Yes, that is correct.  Only indexes in memory, other databases on local 
disk.

We have found that a 4G machine can handle roughly 2M pages in the
index with no swapping occurring.  Load on the machine also drops to
practically nothing, even at 20+ queries per second, because there is
virtually zero IO.
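
If you want to verify this on your own hardware, something like the
following will show whether swap or disk IO kicks in under query load
(a rough sketch; the URL assumes the standard Nutch search webapp on
port 8080 -- adjust the host, port, and query for your deployment):

# in one terminal: watch swap (si/so) and block IO (bi/bo) -- they
# should stay at or near zero while queries are running
vmstat 1

# in another terminal: fire some queries at the search front end
for i in $(seq 1 100); do
  curl -s -o /dev/null "http://localhost:8080/search.jsp?query=nutch"
done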

Dennis Kubes

>>>
>>> If I understand your comment correctly, you are saying to not copy 
>>> the linkdb, and segments data just the index directory.  Is that 
>>> correct?  I think this is the source of my confusion, because it 
>>> sounds like the entire crawl data needs to be copied to each search 
>>> machine.
>>
>> In the trick below, /path/to is the directory that has all the crawl 
>> data, /path/to/linkdb, /path/to/segments, etc. The trick moves the 
>> /path/to/indexes to another directory, then mounts a RAM filesystem on 
>> /path/to/indexes. So to your nutch everything just looks like a big 
>> crawl dir, but whenever it is accessing an index it is actually 
>> getting it from RAM.
>>
>>>
>>>>
>>>>> The reason I am asking is that when I read the article ACM article 
>>>>> by Mike Cafarella and Doug Cutting, to me it  sounded like the 
>>>>> concern was to make the index structures fit in the primary memory, 
>>>>> not the entire crawled database.  Did I miss understand the ACM 
>>>>> article?
>>>>
>>>> No, what they are saying is the more pages per index per machine on 
>>>> hard disk the slower the search.  Keeping the main indexes, but not 
>>>> the segments which hold raw page content, in RAM can speed up search 
>>>> significantly.
>>>>
>>>> One way to do this if you are running on linux is to create a tempfs 
>>>> (which is ram) and then mount the filesystem in the ram.  Then your 
>>>> index acts normally to the application but is essentially served 
>>>> from Ram.  This is how we server the Nutch lucene indexes on our web 
>>>> search engine (www.visvo.com) which is ~100M pages.  Below is how 
>>>> you can achieve this, assuming your indexes are in /path/to/indexes:
>>>>
>>>>
>>>> mv /path/to/indexes /path/to/indexes.dist
>>>> mkdir /path/to/indexes
>>>> cd /path/to
>>>> mount -t tmpfs -o size=2684354560 none /path/to/indexes
>>>> rsync --progress -aptv indexes.dist/* indexes/
>>>> chown -R user:group indexes
>>>>
>>>> This would of course be limited by the amount of RAM you have on the 
>>>> machine.  But with this approach most searches are sub-second.
>>>
>>> Thanks for the information.
>>>
>>>>
>>>> Dennis Kubes
>>>
>>>
>>
> 

Re: Hadoop distributed search.

Posted by Trey Spiva <tr...@spiva.com>.
Thanks for your help.

On Dec 4, 2007, at 10:08 AM, Jasper Kamperman wrote:

>>>> According to a hadoop tutorial  (http://wiki.apache.org/nutch/ 
>>>> NutchHadoopTutorial) on wiki,
>>>> "you don't want to search using DFS, you want to search using  
>>>> local filesystems.    Once the index has been created on the DFS  
>>>> you can use the hadoop copyToLocal command to move it to the  
>>>> local file system as such" ... "Understand that at this point we  
>>>> are not using the DFS or MapReduce to do the searching, all of  
>>>> it is on a local machine".
>>>> So my understanding is that hadoop is only good for batch index  
>>>> building, and is not proper for incremental index building and  
>>>> search. Is this true?
>>>
>>> That is correct.  DFS for batch processing and MapReduce jobs.   
>>> Local servers (disks) for serving indexes.  Even better put local  
>>> indexes (not segments, just indexes) in RAM.
>>
>> In the NutchHadoopTutorial it says "the directory which it points  
>> to should contain not just the index directory but also the  
>> linkdb, segments, etc. All of these different databases are used  
>> by the search. This is why we copied over the crawled directory  
>> and not just the index directory."
>>
>> If I understand your comment correctly, you are saying to not copy  
>> the linkdb, and segments data just the index directory.  Is that  
>> correct?  I think this is the source of my confusion, because it  
>> sounds like the entire crawl data needs to be copied to each  
>> search machine.
>
> In the trick below, /path/to is the directory that has all the  
> crawl data, /path/to/linkdb, /path/to/segments, etc. The trick  
> moves the /path/to/indexes to another directory, then mounts a RAM  
> filesystem on /path/to/indexes. So to your nutch everything just  
> looks like a big crawl dir, but whenever it is accessing an index  
> it is actually getting it from RAM.
>
>>
>>>
>>>> The reason I am asking is that when I read the article ACM  
>>>> article by Mike Cafarella and Doug Cutting, to me it  sounded  
>>>> like the concern was to make the index structures fit in the  
>>>> primary memory, not the entire crawled database.  Did I miss  
>>>> understand the ACM article?
>>>
>>> No, what they are saying is the more pages per index per machine  
>>> on hard disk the slower the search.  Keeping the main indexes,  
>>> but not the segments which hold raw page content, in RAM can  
>>> speed up search significantly.
>>>
>>> One way to do this if you are running on linux is to create a  
>>> tempfs (which is ram) and then mount the filesystem in the ram.   
>>> Then your index acts normally to the application but is  
>>> essentially served from Ram.  This is how we server the Nutch  
>>> lucene indexes on our web search engine (www.visvo.com) which is  
>>> ~100M pages.  Below is how you can achieve this, assuming your  
>>> indexes are in /path/to/indexes:
>>>
>>>
>>> mv /path/to/indexes /path/to/indexes.dist
>>> mkdir /path/to/indexes
>>> cd /path/to
>>> mount -t tmpfs -o size=2684354560 none /path/to/indexes
>>> rsync --progress -aptv indexes.dist/* indexes/
>>> chown -R user:group indexes
>>>
>>> This would of course be limited by the amount of RAM you have on  
>>> the machine.  But with this approach most searches are sub-second.
>>
>> Thanks for the information.
>>
>>>
>>> Dennis Kubes
>>
>>
>


Re: Hadoop distributed search.

Posted by Jasper Kamperman <ja...@openwaternet.com>.
>>> According to a hadoop tutorial  (http://wiki.apache.org/nutch/ 
>>> NutchHadoopTutorial) on wiki,
>>> "you don't want to search using DFS, you want to search using  
>>> local filesystems.    Once the index has been created on the DFS  
>>> you can use the hadoop copyToLocal command to move it to the  
>>> local file system as such" ... "Understand that at this point we  
>>> are not using the DFS or MapReduce to do the searching, all of it  
>>> is on a local machine".
>>> So my understanding is that hadoop is only good for batch index  
>>> building, and is not proper for incremental index building and  
>>> search. Is this true?
>>
>> That is correct.  DFS for batch processing and MapReduce jobs.   
>> Local servers (disks) for serving indexes.  Even better put local  
>> indexes (not segments, just indexes) in RAM.
>
> In the NutchHadoopTutorial it says "the directory which it points  
> to should contain not just the index directory but also the linkdb,  
> segments, etc. All of these different databases are used by the  
> search. This is why we copied over the crawled directory and not  
> just the index directory."
>
> If I understand your comment correctly, you are saying to not copy  
> the linkdb, and segments data just the index directory.  Is that  
> correct?  I think this is the source of my confusion, because it  
> sounds like the entire crawl data needs to be copied to each search  
> machine.

In the trick below, /path/to is the directory that has all the crawl
data: /path/to/linkdb, /path/to/segments, etc.  The trick moves
/path/to/indexes to another directory, then mounts a RAM filesystem
on /path/to/indexes.  So to your Nutch instance everything still
looks like a big crawl dir, but whenever it accesses an index it is
actually getting it from RAM.

>
>>
>>> The reason I am asking is that when I read the article ACM  
>>> article by Mike Cafarella and Doug Cutting, to me it  sounded  
>>> like the concern was to make the index structures fit in the  
>>> primary memory, not the entire crawled database.  Did I miss  
>>> understand the ACM article?
>>
>> No, what they are saying is the more pages per index per machine  
>> on hard disk the slower the search.  Keeping the main indexes, but  
>> not the segments which hold raw page content, in RAM can speed up  
>> search significantly.
>>
>> One way to do this if you are running on linux is to create a  
>> tempfs (which is ram) and then mount the filesystem in the ram.   
>> Then your index acts normally to the application but is  
>> essentially served from Ram.  This is how we server the Nutch  
>> lucene indexes on our web search engine (www.visvo.com) which is  
>> ~100M pages.  Below is how you can achieve this, assuming your  
>> indexes are in /path/to/indexes:
>>
>>
>> mv /path/to/indexes /path/to/indexes.dist
>> mkdir /path/to/indexes
>> cd /path/to
>> mount -t tmpfs -o size=2684354560 none /path/to/indexes
>> rsync --progress -aptv indexes.dist/* indexes/
>> chown -R user:group indexes
>>
>> This would of course be limited by the amount of RAM you have on  
>> the machine.  But with this approach most searches are sub-second.
>
> Thanks for the information.
>
>>
>> Dennis Kubes
>
>


Re: Hadoop distributed search.

Posted by Trey Spiva <tr...@spiva.com>.
On Dec 4, 2007, at 9:37 AM, Dennis Kubes wrote:

>
>
> Trey Spiva wrote:
>> According to a hadoop tutorial  (http://wiki.apache.org/nutch/ 
>> NutchHadoopTutorial) on wiki,
>> "you don't want to search using DFS, you want to search using  
>> local filesystems.    Once the index has been created on the DFS  
>> you can use the hadoop copyToLocal command to move it to the local  
>> file system as such" ... "Understand that at this point we are not  
>> using the DFS or MapReduce to do the searching, all of it is on a  
>> local machine".
>> So my understanding is that hadoop is only good for batch index  
>> building, and is not proper for incremental index building and  
>> search. Is this true?
>
> That is correct.  DFS for batch processing and MapReduce jobs.   
> Local servers (disks) for serving indexes.  Even better put local  
> indexes (not segments, just indexes) in RAM.

In the NutchHadoopTutorial it says "the directory which it points to  
should contain not just the index directory but also the linkdb,  
segments, etc. All of these different databases are used by the  
search. This is why we copied over the crawled directory and not just  
the index directory."

If I understand your comment correctly, you are saying not to copy
the linkdb and segments data, just the index directory.  Is that
correct?  I think this is the source of my confusion, because it
sounds like the entire crawl data needs to be copied to each search
machine.

>
>> The reason I am asking is that when I read the article ACM article  
>> by Mike Cafarella and Doug Cutting, to me it  sounded like the  
>> concern was to make the index structures fit in the primary  
>> memory, not the entire crawled database.  Did I miss understand  
>> the ACM article?
>
> No, what they are saying is the more pages per index per machine on  
> hard disk the slower the search.  Keeping the main indexes, but not  
> the segments which hold raw page content, in RAM can speed up  
> search significantly.
>
> One way to do this if you are running on linux is to create a  
> tempfs (which is ram) and then mount the filesystem in the ram.   
> Then your index acts normally to the application but is essentially  
> served from Ram.  This is how we server the Nutch lucene indexes on  
> our web search engine (www.visvo.com) which is ~100M pages.  Below  
> is how you can achieve this, assuming your indexes are in /path/to/ 
> indexes:
>
>
> mv /path/to/indexes /path/to/indexes.dist
> mkdir /path/to/indexes
> cd /path/to
> mount -t tmpfs -o size=2684354560 none /path/to/indexes
> rsync --progress -aptv indexes.dist/* indexes/
> chown -R user:group indexes
>
> This would of course be limited by the amount of RAM you have on  
> the machine.  But with this approach most searches are sub-second.

Thanks for the information.

>
> Dennis Kubes


Re: Hadoop distributed search.

Posted by Dennis Kubes <ku...@apache.org>.
I haven't done those benchmarks, but if I get some time I will.

The way we run the processes at Visvo (and I will open source the
framework soon) is with Python scripts that run a continuous job
stream of indexing and moving shards of x million pages to search
servers.  Each search server has the linkdb, segments, and indexes
for only that shard.  We use a master crawldb and linkdb but have
processes that create shard linkdbs for only the content in the
shard segments.

Each shard is pushed out to its own search server.  The indexes are
about 2G in size and the segments about 20G.  So the tmpfs was a quick
hack that turned out to work very well.  We can push the shard out and
then run a simple script to mount the indexes in memory, and unmount
them back to disk if needed.  All of this is transparent to the
SearchServer, which thinks it is looking at a local file system.
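
As a rough sketch, such a script might look something like this (the
paths and tmpfs size here are assumptions for illustration, not the
actual Visvo scripts):

#!/bin/sh
# mount-indexes.sh -- serve a shard's Lucene indexes from RAM (sketch)
SHARD=/d01/crawl                # shard root on the search server
SIZE=2684354560                 # tmpfs size in bytes (~2.5G)

mv $SHARD/indexes $SHARD/indexes.dist
mkdir $SHARD/indexes
mount -t tmpfs -o size=$SIZE none $SHARD/indexes
rsync -apt $SHARD/indexes.dist/ $SHARD/indexes/
chown -R user:group $SHARD/indexes

# to go back to serving from disk:
#   umount $SHARD/indexes && rmdir $SHARD/indexes
#   mv $SHARD/indexes.dist $SHARD/indexes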

I do think it would be useful for the SearchServer to have an option
for a RAMDirectory; I don't know if it currently does.  That gets into
discussions of creating a master/slave-type framework for monitoring
and maintaining shards, though.

Dennis Kubes

Enis Soztutar wrote:
> Dennis,
> 
> Have you tried using o.a.lucene.store.RAMDirectory instead of tempfs. 
> Intuitively I believe RAMDirectory should be faster, isn't it ? Do you 
> have any benchmark for the two?
> 
> Dennis Kubes wrote:
>>
>>
>> Trey Spiva wrote:
>>> According to a hadoop tutorial  
>>> (http://wiki.apache.org/nutch/NutchHadoopTutorial) on wiki,
>>>
>>> "you don't want to search using DFS, you want to search using local 
>>> filesystems.    Once the index has been created on the DFS you can 
>>> use the hadoop copyToLocal command to move it to the local file 
>>> system as such" ... "Understand that at this point we are not using 
>>> the DFS or MapReduce to do the searching, all of it is on a local 
>>> machine".
>>>
>>> So my understanding is that hadoop is only good for batch index 
>>> building, and is not proper for incremental index building and 
>>> search. Is this true?
>>
>> That is correct.  DFS for batch processing and MapReduce jobs.  Local 
>> servers (disks) for serving indexes.  Even better put local indexes 
>> (not segments, just indexes) in RAM.
>>
>>>
>>> The reason I am asking is that when I read the article ACM article by 
>>> Mike Cafarella and Doug Cutting, to me it  sounded like the concern 
>>> was to make the index structures fit in the primary memory, not the 
>>> entire crawled database.  Did I miss understand the ACM article?
>>
>> No, what they are saying is the more pages per index per machine on 
>> hard disk the slower the search.  Keeping the main indexes, but not 
>> the segments which hold raw page content, in RAM can speed up search 
>> significantly.
>>
>> One way to do this if you are running on linux is to create a tempfs 
>> (which is ram) and then mount the filesystem in the ram.  Then your 
>> index acts normally to the application but is essentially served from 
>> Ram.  This is how we server the Nutch lucene indexes on our web search 
>> engine (www.visvo.com) which is ~100M pages.  Below is how you can 
>> achieve this, assuming your indexes are in /path/to/indexes:
>>
>>
>> mv /path/to/indexes /path/to/indexes.dist
>> mkdir /path/to/indexes
>> cd /path/to
>> mount -t tmpfs -o size=2684354560 none /path/to/indexes
>> rsync --progress -aptv indexes.dist/* indexes/
>> chown -R user:group indexes
>>
>> This would of course be limited by the amount of RAM you have on the 
>> machine.  But with this approach most searches are sub-second.
>>
>> Dennis Kubes
>>

Re: Hadoop distributed search.

Posted by Enis Soztutar <en...@gmail.com>.
Dennis,

Have you tried using o.a.lucene.store.RAMDirectory instead of tmpfs?
Intuitively I believe RAMDirectory should be faster, shouldn't it?  Do
you have any benchmarks comparing the two?

Dennis Kubes wrote:
>
>
> Trey Spiva wrote:
>> According to a hadoop tutorial  
>> (http://wiki.apache.org/nutch/NutchHadoopTutorial) on wiki,
>>
>> "you don't want to search using DFS, you want to search using local 
>> filesystems.    Once the index has been created on the DFS you can 
>> use the hadoop copyToLocal command to move it to the local file 
>> system as such" ... "Understand that at this point we are not using 
>> the DFS or MapReduce to do the searching, all of it is on a local 
>> machine".
>>
>> So my understanding is that hadoop is only good for batch index 
>> building, and is not proper for incremental index building and 
>> search. Is this true?
>
> That is correct.  DFS for batch processing and MapReduce jobs.  Local 
> servers (disks) for serving indexes.  Even better put local indexes 
> (not segments, just indexes) in RAM.
>
>>
>> The reason I am asking is that when I read the article ACM article by 
>> Mike Cafarella and Doug Cutting, to me it  sounded like the concern 
>> was to make the index structures fit in the primary memory, not the 
>> entire crawled database.  Did I miss understand the ACM article?
>
> No, what they are saying is the more pages per index per machine on 
> hard disk the slower the search.  Keeping the main indexes, but not 
> the segments which hold raw page content, in RAM can speed up search 
> significantly.
>
> One way to do this if you are running on linux is to create a tempfs 
> (which is ram) and then mount the filesystem in the ram.  Then your 
> index acts normally to the application but is essentially served from 
> Ram.  This is how we server the Nutch lucene indexes on our web search 
> engine (www.visvo.com) which is ~100M pages.  Below is how you can 
> achieve this, assuming your indexes are in /path/to/indexes:
>
>
> mv /path/to/indexes /path/to/indexes.dist
> mkdir /path/to/indexes
> cd /path/to
> mount -t tmpfs -o size=2684354560 none /path/to/indexes
> rsync --progress -aptv indexes.dist/* indexes/
> chown -R user:group indexes
>
> This would of course be limited by the amount of RAM you have on the 
> machine.  But with this approach most searches are sub-second.
>
> Dennis Kubes
>

Re: Hadoop distributed search.

Posted by Dennis Kubes <ku...@apache.org>.

Trey Spiva wrote:
> According to a hadoop tutorial  
> (http://wiki.apache.org/nutch/NutchHadoopTutorial) on wiki,
> 
> "you don't want to search using DFS, you want to search using local 
> filesystems.    Once the index has been created on the DFS you can use 
> the hadoop copyToLocal command to move it to the local file system as 
> such" ... "Understand that at this point we are not using the DFS or 
> MapReduce to do the searching, all of it is on a local machine".
> 
> So my understanding is that hadoop is only good for batch index 
> building, and is not proper for incremental index building and search. 
> Is this true?

That is correct.  DFS is for batch processing and MapReduce jobs;
local servers (disks) are for serving indexes.  Even better, put the
local indexes (not segments, just indexes) in RAM.
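
For reference, the copyToLocal step the tutorial mentions is just a
copy from DFS onto the search box's local disk, something like the
following (the local path is an assumption; "crawl" is the crawl
directory on DFS):

bin/hadoop dfs -copyToLocal crawl /d01/local/crawl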

> 
> The reason I am asking is that when I read the article ACM article by 
> Mike Cafarella and Doug Cutting, to me it  sounded like the concern was 
> to make the index structures fit in the primary memory, not the entire 
> crawled database.  Did I miss understand the ACM article?

No, what they are saying is that the more pages per index per machine
on hard disk, the slower the search.  Keeping the main indexes (but
not the segments, which hold the raw page content) in RAM can speed
up search significantly.

One way to do this if you are running on Linux is to create a tmpfs
(which is backed by RAM) and mount it over the index directory.  Then
the index looks normal to the application but is essentially served
from RAM.  This is how we serve the Nutch Lucene indexes on our web
search engine (www.visvo.com), which is ~100M pages.  Below is how
you can achieve this, assuming your indexes are in /path/to/indexes:


mv /path/to/indexes /path/to/indexes.dist
mkdir /path/to/indexes
cd /path/to
mount -t tmpfs -o size=2684354560 none /path/to/indexes
rsync --progress -aptv indexes.dist/* indexes/
chown -R user:group indexes

This would of course be limited by the amount of RAM you have on the 
machine.  But with this approach most searches are sub-second.
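
A quick way to check whether an index will fit before mounting it
this way (a sketch):

du -sb /path/to/indexes     # on-disk size of the index, in bytes
free -m                     # RAM and swap currently available

Size the tmpfs a bit larger than the index, and leave headroom for
the JVM and OS cache.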

Dennis Kubes

Re: Hadoop distributed search.

Posted by hzhong <he...@gmail.com>.
Hello,

Why do we not want to search using DFS?  Why is it not proper for
incremental indexing?

Thanks


Trey Spiva-3 wrote:
> 
> According to a hadoop tutorial  (http://wiki.apache.org/nutch/ 
> NutchHadoopTutorial) on wiki,
> 
> "you don't want to search using DFS, you want to search using local  
> filesystems.    Once the index has been created on the DFS you can  
> use the hadoop copyToLocal command to move it to the local file  
> system as such" ... "Understand that at this point we are not using  
> the DFS or MapReduce to do the searching, all of it is on a local  
> machine".
> 
> So my understanding is that hadoop is only good for batch index  
> building, and is not proper for incremental index building and  
> search. Is this true?
> 
> The reason I am asking is that when I read the article ACM article by  
> Mike Cafarella and Doug Cutting, to me it  sounded like the concern  
> was to make the index structures fit in the primary memory, not the  
> entire crawled database.  Did I miss understand the ACM article?
> 
