Posted to solr-user@lucene.apache.org by Harsh J <ha...@cloudera.com> on 2012/06/23 18:57:00 UTC

Re: SolrIndex eats up lots of disk space for intermediate data

Hey Safdar,

This question is best asked on the Apache Solr mailing lists. I
believe you'll get better responses there, so I've redirected it to
Solr's own list (solr-user[at]lucene.apache.org).

BCC'd common-user[at]hadoop.apache.org and CC'd you in case you
haven't subscribed to the Solr list.

On Sat, Jun 23, 2012 at 8:14 PM, Safdar Kureishy
<sa...@gmail.com> wrote:
> Hi,
>
> I couldn't find an answer to this question online, so I'm posting to the
> mailing list.
>
> I've got a crawl of about 10M *fetched* pages (the crawldb has about 50M
> pages, since it includes the fetched + failed + unfetched pages). I've also
> got a freshly updated linkdb and webgraphdb (having run linkrank). I'm
> trying to index the fetched pages (content + anchor links) using solrindex.
>
> When I launch the "bin/nutch solrindex <solrurl> <crawldb> -linkdb <linkdb>
> -dir <segmentsdir>" command, the disk space utilization really jumps.
> Before running the solrindex stage, I had about 50% of disk space remaining
> for HDFS on my nodes (5 nodes) -- I had consumed about 100G and had about
> 100G left over. However, when running the solrindex phase, by the end of
> the map phase, the disk space utilization nears 100% and the available HDFS
> space drops below 1%. Running "hadoop dfsadmin -report" shows that the jump
> in storage is for non-DFS data (i.e. intermediate data) and it happens
> during the map phase of the IndexerMapReduce job (solrindex).
>
> What can I do to reduce the intermediate data being generated by
> solrindex? Are there any configuration settings I should change? I'm using
> all the defaults for the indexing phase, and I'm not using any custom
> plugins either.
>
> Thanks,
> Safdar
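
A thought in the meantime: the non-DFS usage you're describing is most
likely the map-side spill files, which are written to local disk
(mapred.local.dir) rather than to HDFS. Compressing the intermediate
map output usually shrinks that footprint considerably. A minimal
sketch for mapred-site.xml, assuming Hadoop 1.x property names (the
codec here is just an example; I haven't tested this against the
solrindex job specifically):

    <property>
      <name>mapred.compress.map.output</name>
      <value>true</value>
    </property>
    <property>
      <name>mapred.map.output.compression.codec</name>
      <!-- DefaultCodec (zlib) works without native libraries;
           SnappyCodec is faster if you have them installed. -->
      <value>org.apache.hadoop.io.compress.DefaultCodec</value>
    </property>

You can also point mapred.local.dir at a comma-separated list of
directories on your larger volumes so the spills get spread across
disks. Depending on your Nutch version, you may be able to set these
per-job with -D on the command line instead of editing the site
config.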



-- 
Harsh J