Posted to user@nutch.apache.org by Chip Calhoun <cc...@aip.org> on 2011/08/01 21:26:39 UTC

RE: Nutch not indexing full collection

I'm still having trouble with this.  In addition to the nutch-site.xml posted below, I have now modified my schema.xml (in both Nutch and Solr) to include the following important line:
<field name="content" type="text" stored="true" indexed="true"/>

Now, when I search, the full text of each document shows up under <str name="content">.  I'm clearly getting everything.  And yet, when I search for text toward the end of a long document, I still don't get that document in my search results.  

It sounds like this might be an issue with my Solr setup.  Can anyone think of what I might be missing?
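A test that makes the symptom concrete (just a sketch, assuming the stock Solr example install on localhost:8983 and the standard select handler): pick a word that appears only near the end of a long transcript and query for it directly.

# If the stored "content" field shows the full text but this query returns
# no hits for a term that occurs only near the end of the page, the tail of
# the document was never analyzed into the index.
curl "http://localhost:8983/solr/select?q=content:SOME_LATE_TERM&fl=id,title"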

Chip

-----Original Message-----
From: Chip Calhoun [mailto:ccalhoun@aip.org] 
Sent: Thursday, July 28, 2011 3:29 PM
To: user@nutch.apache.org
Subject: RE: Nutch not indexing full collection

Thanks!  This has solved half of my problem.  I am now indexing material from every document I want.  However, I'm still not indexing words toward the end of longer documents.  I'm not sure what else I could be missing.

The current contents of my nutch-site.xml are:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
 <property>
  <name>http.agent.name</name>
  <value>OHI Spider</value>
 </property>
 <property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
 </property>
 <property>
  <name>http.content.limit</name>
  <value>-1</value>
 </property>
</configuration>

And I'm still indexing with this command:
bin/nutch crawl urls -dir crawl -depth 15 -topN 500000
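
Side note for anyone reading this in the archives: in 1.3 the crawl command can also post straight to Solr in the same run.  A sketch, assuming Solr is up on its default port:

bin/nutch crawl urls -solr http://localhost:8983/solr/ -dir crawl -depth 15 -topN 500000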


-----Original Message-----
From: lewis john mcgibbney [mailto:lewis.mcgibbney@gmail.com]
Sent: Wednesday, July 27, 2011 12:18 PM
To: user@nutch.apache.org
Subject: Re: Nutch not indexing full collection

Has this been solved?

If your http.content.limit has not been increased in nutch-site.xml, then you will not be able to store this data and index it with Solr.

On Mon, Jul 25, 2011 at 6:18 PM, Chip Calhoun <cc...@aip.org> wrote:

> I'm still having trouble.  I've set a Windows environment variable, 
> NUTCH_HOME, which for me is C:\Apache\nutch-1.3\runtime\local .  I now 
> have my urls and crawl directories in that 
> C:\Apache\nutch-1.3\runtime\local folder.  But I'm still not crawling 
> files later in my urls list, and apparently I can't search for words 
> or phrases toward the end of any of my documents.  Am I misremembering 
> that there was a total file size value somewhere in Nutch or Solr that needs to be increased?
>
> -----Original Message-----
> From: lewis john mcgibbney [mailto:lewis.mcgibbney@gmail.com]
> Sent: Wednesday, July 20, 2011 5:23 PM
> To: user@nutch.apache.org
> Subject: Re: Nutch not indexing full collection
>
> Hi Chip,
>
> I would try running your scripts after setting the environment 
> variable $NUTCH_HOME to nutch/runtime/local/NUTCH_HOME
>
> On Wed, Jul 20, 2011 at 4:01 PM, Chip Calhoun <cc...@aip.org> wrote:
>
> > I've been working with
> > $NUTCH_HOME/runtime/local/conf/nutch-site.xml,
> > and I'm pretty sure that's the correct file.  I run my commands 
> > while in $NUTCH_HOME/ , which means all of my commands begin with 
> > "runtime/local/bin/nutch..." .  That means my urls directory is 
> > $NUTCH_HOME/urls/ and my crawl directory ends up being 
> > $NUTCH_HOME/crawl/ (as opposed to $NUTCH_HOME/runtime/local/urls/ 
> > and so forth), but it does seem to at least be getting my urlfilters 
> > from $NUTCH_HOME/runtime/local/conf/ .
> >
> > I get no output when I try runtime/local/bin/nutch readdb -stats , 
> > so that's weird.
> >
> > I dimly recall there being a total index size value somewhere in 
> > Nutch or Solr which has to be increased, but I can no longer find 
> > any reference to it.
> >
> > Chip
> >
> > -----Original Message-----
> > From: Julien Nioche [mailto:lists.digitalpebble@gmail.com]
> > Sent: Wednesday, July 20, 2011 10:06 AM
> > To: user@nutch.apache.org
> > Subject: Re: Nutch not indexing full collection
> >
> > I'd have suspected db.max.outlinks.per.page, but you seem to have set 
> > it up correctly.  Are you running Nutch in runtime/local?  In that 
> > case you modified nutch-site.xml in runtime/local/conf, right?
> >
> > nutch readdb -stats will give you the total number of pages known, etc.
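> >
> > For example, assuming your crawl output lives in ./crawl, the full
> > invocation is along these lines (the crawldb path comes first;
> > without it the tool has nothing to read):
> >
> >   bin/nutch readdb crawl/crawldb -stats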
> >
> > Julien
> >
> > On 20 July 2011 14:51, Chip Calhoun <cc...@aip.org> wrote:
> >
> > > Hi,
> > >
> > > I'm using Nutch 1.3 to crawl a section of our website, and it 
> > > doesn't seem to crawl the entire thing.  I'm probably missing 
> > > something simple, so I hope somebody can help me.
> > >
> > > My urls/nutch file contains a single URL:
> > > http://www.aip.org/history/ohilist/transcripts.html , which is an 
> > > alphabetical listing of other pages.  It looks like the indexer 
> > > stops partway down this page, meaning that entries later in the 
> > > alphabet aren't indexed.
> > >
> > > My nutch-site.xml has the following content:
> > > <?xml version="1.0"?>
> > > <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> > > <!-- Put site-specific property overrides in this file. -->
> > > <configuration>
> > >  <property>
> > >   <name>http.agent.name</name>
> > >   <value>OHI Spider</value>
> > >  </property>
> > >  <property>
> > >   <name>db.max.outlinks.per.page</name>
> > >   <value>-1</value>
> > >   <description>The maximum number of outlinks that we'll process for a page.
> > >   If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
> > >   will be processed for a page; otherwise, all outlinks will be processed.
> > >   </description>
> > >  </property>
> > > </configuration>
> > >
> > > My regex-urlfilter.txt and crawl-urlfilter.txt both include the 
> > > following, which should allow access to everything I want:
> > > # accept hosts in MY.DOMAIN.NAME
> > > +^http://([a-z0-9]*\.)*aip\.org/history/ohilist/
> > > # skip everything else
> > > -.
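> > >
> > > A quick sanity check for these rules (a sketch, assuming the stock
> > > URLFilterChecker tool that ships with Nutch, which reads URLs from
> > > stdin and echoes each back with a leading + if accepted or - if
> > > rejected):
> > >
> > > echo "http://www.aip.org/history/ohilist/transcripts.html" | runtime/local/bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined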
> > >
> > > I've crawled with the following command:
> > > runtime/local/bin/nutch crawl urls -dir crawl -depth 15 -topN 500000
> > >
> > > Note that since we don't have NutchBean anymore, I can't tell 
> > > whether this is actually a Nutch problem or whether something is 
> > > failing when I port to Solr.  What am I missing?
> > >
> > > Thanks,
> > > Chip
> > >
> >
> >
> >
> > --
> > Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> >
>
>
>
> --
> Lewis
>



--
Lewis

Re: Nutch not indexing full collection

Posted by Markus Jelsma <ma...@openindex.io>.
Be careful: only use a limit of -1 for sites you trust, or it will bite you.
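
If you don't want unlimited downloads from hosts you don't control, a large finite cap works too.  Just a sketch; the 10 MB figure is an arbitrary example:

 <property>
  <name>http.content.limit</name>
  <value>10485760</value>
  <description>Cap downloaded content at 10 MB rather than disabling the
  limit entirely, so one huge or hostile page can't tie up the fetcher.
  </description>
 </property>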


RE: Nutch not indexing full collection

Posted by Chip Calhoun <cc...@aip.org>.
That did it!  For the convenience of anyone who finds this in the list archives later on, I'll paste what it took:

$NUTCH_HOME/runtime/local/conf/nutch-site.xml (full contents):
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
 <property>
  <name>http.agent.name</name>
  <value>OHI Spider</value>
 </property>
 <property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
 </property>
 <property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be 
  truncated; otherwise, no truncation at all.
  </description>
 </property>
</configuration>

$NUTCH_HOME/runtime/local/conf/schema.xml & $SOLR_HOME/example/solr/conf/schema.xml:
Replace this:
	<field name="content" type="text" stored="false" indexed="true"/>
With this:
	<field name="content" type="text" stored="true" indexed="true"/>

$SOLR_HOME/example/solr/conf/solrconfig.xml:
Replace this:
	<maxFieldLength>10000</maxFieldLength>
With this:
	<maxFieldLength>2147483647</maxFieldLength>
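
One step that's easy to forget: schema.xml and solrconfig.xml changes only take effect after a Solr restart, and only for documents indexed afterwards, so restart Solr and push the crawl again.  A sketch, assuming the standard Nutch 1.3 solrindex flow and Solr on its default port:

runtime/local/bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*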


Re: Nutch not indexing full collection

Posted by Markus Jelsma <ma...@openindex.io>.
Nutch truncates content longer than configured and Solr truncates content 
exceeding max field length. Maybe check your limits.
