Posted to user@nutch.apache.org by Jann Forrer <ja...@id.uzh.ch> on 2011/09/19 11:02:17 UTC

nutch 1.3 solrindex empty content field

Hi

I tried to run Nutch 1.3 together with Solr 3.x according to
http://wiki.apache.org/nutch/NutchTutorial.

That worked as described, but if I try to search the index using the Solr
admin interface I always get an empty result.

http://localhost:8983/solr/admin/schema.jsp

Using the Schema Browser I see entries in different fields (e.g. the url
field) but the content field is empty. I looked for similar problems on the
mailing list but didn't find a solution.

Here is what I did:

1.) ./bin/nutch crawl urls -dir crawl -depth 3 -topN 5
2.) Dumped the segment (./bin/nutch readseg -dump crawl/segments/20110916124747 test).
      The dump also contained the content of the web pages, so all seems to be OK here.
3.) Copied the Nutch schema.xml to the Solr conf directory.
4.) bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
5.) Then tried to search using http://localhost:8983/solr/admin/ but didn't find any HTML content.
      However, if there was a PDF file to crawl, its content was found.
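
For anyone who wants to repeat steps 1 and 4 from a script, here is a minimal sketch in Python. It only assembles the two command lines from the list above and prints them. The function name nutch_commands and its default arguments are made up for this example, and the segment path is simply the one from step 2.

```python
import shlex

def nutch_commands(crawl_dir="crawl",
                   solr_url="http://127.0.0.1:8983/solr/",
                   segments=("crawl/segments/20110916124747",)):
    """Return the crawl and solrindex command lines as argument lists."""
    # Step 1: crawl with the same depth and topN values as in the post.
    crawl = ["bin/nutch", "crawl", "urls", "-dir", crawl_dir,
             "-depth", "3", "-topN", "5"]
    # Step 4: send crawldb, linkdb and the given segments to Solr.
    index = ["bin/nutch", "solrindex", solr_url,
             f"{crawl_dir}/crawldb", f"{crawl_dir}/linkdb", *segments]
    return [crawl, index]

for argv in nutch_commands():
    # Print only; pass argv to subprocess.run(argv, check=True) to execute it.
    print(shlex.join(argv))
```

Passing the segments explicitly avoids relying on the shell to expand crawl/segments/*.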

BTW, using Nutch 1.1 and Solr 1.4.1 everything worked as expected. I could use
these versions, but I am upgrading from an older Nutch version and it would
be nice if I could use the newer version, where Nutch and Solr are better
integrated.

Any ideas what might be wrong?

Jann



-- 

Jann Forrer
Informatikdienste
Universität Zürich
Winterthurerstr. 190
CH-8057 Zürich

oooO   mail:  jann.forrer@id.uzh.ch
(  )   phone: +41 44 63 56772
  \ (   fax:   +41 44 63 54505
   \_)  http://www.id.uzh.ch

Re: nutch 1.3 solrindex empty content field

Posted by Markus Jelsma <ma...@openindex.io>.


On Monday 19 September 2011 15:58:35 lewis john mcgibbney wrote:
> Yes, I think what Markus has pointed out is the problem, Jann. This means
> you need to change the stored and indexed values to true and re-index
> your data.
> 
> Markus, out of interest, do you know the pros/cons if we were to make
> this the default in the Nutch schema? For example, with small indexes I
> wouldn't imagine there would be much difference, but for non-trivially
> sized indexes I would imagine it would be a different story...

The index size ~ *2.1
> 
> Any thoughts?
> 

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: nutch 1.3 solrindex empty content field

Posted by Jann Forrer <ja...@id.uzh.ch>.

Hi

Thanks for your fast help.

On 09/19/2011 04:26 PM, lewis john mcgibbney wrote:
> Does this solve your problem, Jann?
>
No, unfortunately not. I changed the content entry within the Nutch schema
    runtime/local/conf/schema.xml
and the Solr schema
    example/solr/conf/schema.xml
to
<field name="content" type="text" stored="true" indexed="true"/>

After that I deleted the whole crawl directory and the Solr data directory
and tried to re-index using:

bin/nutch crawl urls -dir crawl -depth 3 -topN 5
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

But I still got no results doing a simple search. Looking at the content
field within the Solr admin page I got:

Field Type: text

Properties: Indexed, Tokenized, Stored

Schema: Indexed, Tokenized, Stored

Position Increment Gap: 100

Index Analyzer: org.apache.solr.analysis.TokenizerChain Details

Tokenizer Class: org.apache.solr.analysis.WhitespaceTokenizerFactory

Filters:

    1. org.apache.solr.analysis.StopFilterFactory args:{words: stopwords.txt ignoreCase: true luceneMatchVersion: LUCENE_31 }
    2. org.apache.solr.analysis.WordDelimiterFilterFactory args:{splitOnCaseChange: 1 generateNumberParts: 1 catenateWords: 1 luceneMatchVersion: LUCENE_31 generateWordParts: 1 catenateAll: 0 catenateNumbers: 1 }
    3. org.apache.solr.analysis.LowerCaseFilterFactory args:{luceneMatchVersion: LUCENE_31 }
    4. org.apache.solr.analysis.EnglishPorterFilterFactory args:{protected: protwords.txt luceneMatchVersion: LUCENE_31 }
    5. org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory args:{luceneMatchVersion: LUCENE_31 }

Query Analyzer: org.apache.solr.analysis.TokenizerChain Details

Tokenizer Class: org.apache.solr.analysis.WhitespaceTokenizerFactory

Filters:

    1. org.apache.solr.analysis.StopFilterFactory args:{words: stopwords.txt ignoreCase: true luceneMatchVersion: LUCENE_31 }
    2. org.apache.solr.analysis.WordDelimiterFilterFactory args:{splitOnCaseChange: 1 generateNumberParts: 1 catenateWords: 1 luceneMatchVersion: LUCENE_31 generateWordParts: 1 catenateAll: 0 catenateNumbers: 1 }
    3. org.apache.solr.analysis.LowerCaseFilterFactory args:{luceneMatchVersion: LUCENE_31 }
    4. org.apache.solr.analysis.EnglishPorterFilterFactory args:{protected: protwords.txt luceneMatchVersion: LUCENE_31 }
    5. org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory args:{luceneMatchVersion: LUCENE_31 }

Docs: 0


BTW, I crawled http://www.rauchfrei.uzh.ch/ and tried to search for
"Passivrauchen", a word occurring on the index page.

Jann
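
A search like the one described above can also be written as a plain query URL, which makes it explicit which field is being searched. The following Python sketch only builds such a URL and does not contact Solr. The helper name solr_select_url is made up for this example, and it assumes Solr's standard /select handler with the q, fl and wt parameters.

```python
from urllib.parse import urlencode

def solr_select_url(base_url, field, term, return_fields="url"):
    """Build a /select URL that searches one field for one term."""
    params = {
        "q": f"{field}:{term}",   # field-qualified query, e.g. content:Passivrauchen
        "fl": return_fields,      # fields to return for each hit
        "wt": "json",             # response format
    }
    return base_url.rstrip("/") + "/select?" + urlencode(params)

print(solr_select_url("http://127.0.0.1:8983/solr/", "content", "Passivrauchen"))
```

A field-qualified query does not depend on whatever default search field the schema declares, so an empty result points more directly at the content field itself.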

> Is this worth filing an issue for? It is rather trivial to address but
> could help more users unfamiliar with the specifics of the Nutch (or Solr) schema(s).
>


-- 
Jann Forrer
Informatikdienste
Universität Zürich
Winterthurerstr. 190
CH-8057 Zürich

oooO   mail:  jann.forrer@id.uzh.ch
(  )   phone: +41 44 63 56772
  \ (   fax:   +41 44 63 54505
   \_)  http://www.id.uzh.ch

Re: nutch 1.3 solrindex empty content field

Posted by lewis john mcgibbney <le...@gmail.com>.

Does this solve your problem, Jann?

Is this worth filing an issue for? It is rather trivial to address but
could help more users unfamiliar with the specifics of the Nutch (or Solr) schema(s).




-- 
*Lewis*

Re: nutch 1.3 solrindex empty content field

Posted by Markus Jelsma <ma...@openindex.io>.

*previous sent by accident

On Monday 19 September 2011 15:58:35 lewis john mcgibbney wrote:
> Yes, I think what Markus has pointed out is the problem, Jann. This means
> you need to change the stored and indexed values to true and re-index
> your data.
> 
> Markus, out of interest, do you know the pros/cons if we were to make
> this the default in the Nutch schema? For example, with small indexes I
> wouldn't imagine there would be much difference, but for non-trivially
> sized indexes I would imagine it would be a different story...

The index size grows by a factor of roughly 2.1, depending on analyzers etc.
(stopwords mostly). However, users who set up very large indexes are expected
to be at least intermediate Solr users and to have a proper understanding of
the schema.

They will toggle settings as they see fit, whereas new users won't but still
expect output.
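
As a back-of-the-envelope illustration of that factor, here is a small Python sketch. It assumes the rough 2.1 multiplier simply scales the size of an index built without stored content. The function name estimated_index_size and the example sizes are made up, and real numbers will depend on analyzers and data.

```python
def estimated_index_size(unstored_size_gb, factor=2.1):
    """Scale an index size by the rough factor quoted above for storing content."""
    return unstored_size_gb * factor

for size_gb in (1, 50):
    # Example sizes only; they are not measurements.
    print(f"{size_gb} GB -> about {estimated_index_size(size_gb):.1f} GB")
```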

> 
> Any thoughts?
> 

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: nutch 1.3 solrindex empty content field

Posted by lewis john mcgibbney <le...@gmail.com>.

Yes, I think what Markus has pointed out is the problem, Jann. This means you
need to change the stored and indexed values to true and re-index your data.

Markus, out of interest, do you know the pros/cons if we were to make this
the default in the Nutch schema? For example, with small indexes I wouldn't
imagine there would be much difference, but for non-trivially sized indexes I
would imagine it would be a different story...

Any thoughts?




-- 
*Lewis*

Re: nutch 1.3 solrindex empty content field

Posted by Markus Jelsma <ma...@openindex.io>.

Check line 79 of your Solr schema:
http://svn.apache.org/viewvc/nutch/branches/branch-1.3/conf/schema.xml?view=markup

Maybe we should configure the field to be stored in 1.4. I can imagine this 
causes a lot of headaches for new users. Also highlighting will never work 
with unstored fields.
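
A quick way to check how a field is declared, without opening the schema in an editor, is to parse schema.xml and look at its attributes. The Python sketch below is only an illustration: the helper name field_flags and the inline sample schema are made up for this example and are not copied from the Nutch schema, so point it at the contents of your own schema.xml instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a Solr schema.xml, used only for this example.
SAMPLE_SCHEMA = """
<schema name="nutch" version="1.3">
  <fields>
    <field name="url" type="string" stored="true" indexed="true"/>
    <field name="content" type="text" stored="false" indexed="true"/>
  </fields>
</schema>
"""

def field_flags(schema_xml, field_name):
    """Return (stored, indexed) for the named <field>, or None if it is absent."""
    root = ET.fromstring(schema_xml)
    for field in root.iter("field"):
        if field.get("name") == field_name:
            # Compare against the literal "true"; a missing attribute counts as False here.
            return (field.get("stored") == "true", field.get("indexed") == "true")
    return None

print(field_flags(SAMPLE_SCHEMA, "content"))
```

For the sample above this reports the content field as indexed but not stored, which is the situation described here.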


-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350