Posted to common-user@hadoop.apache.org by Safdar Kureishy <sa...@gmail.com> on 2012/06/23 16:44:34 UTC

SolrIndex eats up lots of disk space for intermediate data

Hi,

I couldn't find an answer to this question online, so I'm posting to the
mailing list.

I've got a crawl of about 10M *fetched* pages (the crawldb has about 50M
pages, since it includes the fetched + failed + unfetched pages). I've also
got a freshly updated linkdb and webgraphdb (having run linkrank). I'm
trying to index the fetched pages (content + anchor links) using solrindex.
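
For context, the steps I ran to build the linkdb and webgraphdb were
roughly the following (Nutch 1.x commands; exact flags can differ
between versions, and the paths are just placeholders for my actual
HDFS paths):

    # invert anchor links from the crawled segments into the linkdb
    bin/nutch invertlinks linkdb -dir segments

    # build the web graph and score pages with linkrank
    bin/nutch webgraph -segmentDir segments -webgraphdb webgraphdb
    bin/nutch linkrank -webgraphdb webgraphdb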

When I launch the "bin/nutch solrindex <solrurl> <crawldb> -linkdb <linkdb>
-dir <segmentsdir>" command, the disk space utilization really jumps.
Before running the solrindex stage, I had about 50% of disk space remaining
for HDFS on my nodes (5 nodes) -- I had consumed about 100G and had about
100G left over. However, when running the solrindex phase, by the end of
the map phase, the disk space utilization nears 100% and the available HDFS
space drops below 1%. Running "hadoop dfsadmin -report" shows that the jump
in storage is for non-DFS data (i.e. intermediate data) and it happens
during the map phase of the IndexerMapReduce job (solrindex).
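
For reference, here is roughly how I'm measuring this (the local path in
the second command is just my guess at where mapred.local.dir points on
my nodes; the real value is whatever mapred-site.xml sets):

    # per-datanode breakdown of DFS vs. non-DFS usage
    hadoop dfsadmin -report

    # size of the local directory holding map-side spill files
    # (substitute your cluster's actual mapred.local.dir)
    du -sh /hadoop/mapred/local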

What can I do to reduce the intermediate data being generated for
solrindex? Any configuration settings I should change? I'm using all the
defaults for the indexing phase, and I'm not using any custom plugins
either.
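
For example, one setting I've been wondering about is compressing the
intermediate map output -- a sketch of what I mean, using Hadoop 1.x
property names (I haven't verified that this actually shrinks the spill
for this particular job):

    <!-- mapred-site.xml: compress map output before it is spilled to disk -->
    <property>
      <name>mapred.compress.map.output</name>
      <value>true</value>
    </property>
    <property>
      <name>mapred.map.output.compression.codec</name>
      <value>org.apache.hadoop.io.compress.DefaultCodec</value>
    </property>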

Thanks,
Safdar

Re: SolrIndex eats up lots of disk space for intermediate data

Posted by Harsh J <ha...@cloudera.com>.
Hey Safdar,

This question is best asked on the Apache Solr mailing lists. I
believe you'll get better responses there, so I've redirected to
Solr's own list (solr-user[at]lucene.apache.org).

BCC'd common-user[at]hadoop.apache.org and CC'd you in case you
haven't subscribed to Solr.

On Sat, Jun 23, 2012 at 8:14 PM, Safdar Kureishy
<sa...@gmail.com> wrote:
> Hi,
>
> I couldn't find an answer to this question online, so I'm posting to the
> mailing list.
>
> I've got a crawl of about 10M *fetched* pages (the crawldb has about 50M
> pages, since it includes the fetched + failed + unfetched pages). I've also
> got a freshly updated linkdb and webgraphdb (having run linkrank). I'm
> trying to index the fetched pages (content + anchor links) using solrindex.
>
> When I launch the "bin/nutch solrindex <solrurl> <crawldb> -linkdb <linkdb>
> -dir <segmentsdir>" command, the disk space utilization really jumps.
> Before running the solrindex stage, I had about 50% of disk space remaining
> for HDFS on my nodes (5 nodes) -- I had consumed about 100G and had about
> 100G left over. However, when running the solrindex phase, by the end of
> the map phase, the disk space utilization nears 100% and the available HDFS
> space drops below 1%. Running "hadoop dfsadmin -report" shows that the jump
> in storage is for non-DFS data (i.e. intermediate data) and it happens
> during the map phase of the IndexerMapReduce job (solrindex).
>
> What can I do to reduce the intermediate data being generated for
> solrindex? Any configuration settings I should change? I'm using all the
> defaults for the indexing phase, and I'm not using any custom plugins
> either.
>
> Thanks,
> Safdar



-- 
Harsh J
