Posted to user@nutch.apache.org by Nestor <ro...@gmail.com> on 2016/10/04 00:07:44 UTC

why the results have diff number of fields

In my Solr query results for "url:*", the number of returned fields differs
from my second query (see bottom):
<result name="response" numFound="4861" start="0">
<doc>
<str name="body">...</str>
<str name="changed">2010-10-13T18:58:28</str>
<str name="created">2010-10-13T18:58:28</str>
<str name="entity">file</str>
<str name="hash">hvvzxf</str>
<str name="id">hvvzxf/file/53-623</str>
<arr name="im_vid_9">...</arr>
<str name="language">und</str>
<str name="name"/>
<str name="nid">623</str>
<str name="path">sites/default/files/HomePage.pdf</str>
<str name="promote">F</str>
<str name="site">http://www.mysite.com/</str>
<str name="sm_facetbuilder_solr_type">solr_type:facet_3</str>
<arr name="sm_vid_Project_Type">...</arr>
<arr name="spell">...</arr>
<str name="ss_file_node_title">Training Test 2</str>
<str name="ss_file_node_url">http://www.mysite.com/training-test-2</str>
<str name="ss_filemime">application/pdf</str>
<str name="status">T</str>
<str name="sticky">F</str>
<str name="teaser">...</str>
<arr name="tid">...</arr>
<str name="timestamp">2012-11-28T05:05:52.623</str>
<str name="title">HomePage.pdf</str>
<str name="ts_vid_9_names">Construction Professional Services</str>
<str name="uid">0</str>
<str name="url">...</str>
<arr name="vid">...</arr>
</doc>

When I run the Solr query "content:water" I get fewer fields in the results:
<result name="response" numFound="177" start="0">
<doc>
<float name="boost">0.027676692</float>
<str name="digest">4872e938706f9bee4d928330e5713623</str>
<str name="id">http://www.mysite.com/es/biographies</str>
<str name="segment">20161003150513</str>
<str name="title">Biographies</str>
<date name="tstamp">2016-10-03T15:21:45.346Z</date>
<str name="url">http://www.mysite.com/es/biographies</str>
</doc>

Why is that?


Thanks,



--
View this message in context: http://lucene.472066.n3.nabble.com/why-the-results-have-diff-number-of-fields-tp4299378.html
Sent from the Nutch - User mailing list archive at Nabble.com.

RE: why the results have diff number of fields

Posted by Nestor <ro...@gmail.com>.
OK,
I found out that part of my problem was a robots.txt file that would not
allow me to crawl my site.
The lessons and gotchas of learning Nutch.

Thanks for your help
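
For anyone hitting the same robots.txt problem, here is a minimal sketch of checking offline whether a given rule set would block a crawler. It uses Python's standard urllib.robotparser and is not part of Nutch. The robots.txt content and the agent name "my-nutch-crawler" are assumptions for illustration; the actual file from this thread was not posted.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (the real file is not shown in the thread).
ROBOTS_TXT = """User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # "my-nutch-crawler" is a made-up agent name for this sketch.
    url = "http://www.mysite.com/es/biographies"
    print(url, "allowed:", is_allowed(ROBOTS_TXT, "my-nutch-crawler", url))
```

If the check reports that the URL is not allowed, the crawler is being turned away by the site's own rules rather than by the crawl configuration.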



--
View this message in context: http://lucene.472066.n3.nabble.com/why-the-results-have-diff-number-of-fields-tp4299378p4299592.html
Sent from the Nutch - User mailing list archive at Nabble.com.

RE: why the results have diff number of fields

Posted by Markus Jelsma <ma...@openindex.io>.
There is usually a URL filter in your way. Use bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined to check whether your URLs are being filtered.
Markus
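
To make the idea of a URL filter "in your way" concrete, below is a rough Python emulation of regex-style filter rules. It is a sketch under assumptions, not Nutch code: the rules are hypothetical, written in the "+regex" / "-regex" style of conf/regex-urlfilter.txt, and the emulation assumes the first matching rule decides and that a URL matching no rule is rejected. Python's re module also differs from Java's regex engine in some details, so treat the result as an approximation.

```python
import re

# Hypothetical rules restricting a crawl to mysite.com/subfolder/.
RULES = r"""
# accept only pages under /subfolder/
+^https?://(www\.)?mysite\.com/subfolder/
# reject everything else
-.
"""


def load_rules(text):
    """Parse '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Assumed semantics: first matching rule wins; no match means rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


if __name__ == "__main__":
    rules = load_rules(RULES)
    for url in ("http://www.mysite.com/subfolder/page.html",
                "http://www.mysite.com/es/biographies"):
        print("+" if accepts(rules, url) else "-", url)
```

With rules like these, pages in the parent directory are rejected, which is the behaviour wanted when only a subfolder should be crawled.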

 
 
-----Original message-----
> From:Néstor <ro...@gmail.com>
> Sent: Tuesday 4th October 2016 18:57
> To: user@nutch.apache.org
> Subject: Re: why the results have diff number of fields
> 
> Maybe because I am trying to just crawl a subfolder mysite.com/subfolder and
> I am having problems configuring it to do this and is going and crawling
> other pages from the parent directory.
> 
> Thanks!
> 
> 
> 
> On Tue, Oct 4, 2016 at 4:00 AM, Markus Jelsma <ma...@openindex.io>
> wrote:
> 
> > Well, probably because you or something indexes different stuff to the
> > Solr index. The first doesn't come from Nutch, the second does.
> > Markus
> 
> 
> -- 
> Né§t☼r  *Authority gone to one's head is the greatest enemy of Truth*
> 

Re: why the results have diff number of fields

Posted by Néstor <ro...@gmail.com>.
Maybe it is because I am trying to crawl only a subfolder, mysite.com/subfolder.
I am having problems configuring Nutch to do this, and it goes on to crawl
other pages from the parent directory.

Thanks!



On Tue, Oct 4, 2016 at 4:00 AM, Markus Jelsma <ma...@openindex.io>
wrote:

> Well, probably because you or something indexes different stuff to the
> Solr index. The first doesn't come from Nutch, the second does.
> Markus



-- 
Né§t☼r  *Authority gone to one's head is the greatest enemy of Truth*

RE: why the results have diff number of fields

Posted by Markus Jelsma <ma...@openindex.io>.
Well, probably because you, or something else, index different content into the Solr index. The first document doesn't come from Nutch; the second does.
Markus
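
One way to see this for yourself is to compare the field names of the two result documents. The sketch below does that with Python's standard xml.etree.ElementTree. The first document is abbreviated from the one posted in this thread. The helper looks_like_nutch is a hypothetical heuristic that only checks for segment, digest and tstamp, the fields that show up in the second result but not the first; it is not an official test.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the first result document from the thread.
FIRST_DOC = """<doc>
<str name="entity">file</str>
<str name="hash">hvvzxf</str>
<str name="id">hvvzxf/file/53-623</str>
<str name="nid">623</str>
<str name="title">HomePage.pdf</str>
<str name="url">...</str>
</doc>"""

# Second result document from the thread.
SECOND_DOC = """<doc>
<float name="boost">0.027676692</float>
<str name="digest">4872e938706f9bee4d928330e5713623</str>
<str name="id">http://www.mysite.com/es/biographies</str>
<str name="segment">20161003150513</str>
<str name="title">Biographies</str>
<date name="tstamp">2016-10-03T15:21:45.346Z</date>
<str name="url">http://www.mysite.com/es/biographies</str>
</doc>"""


def field_names(doc_xml):
    """Collect the name attribute of every field element inside a <doc>."""
    return {child.get("name") for child in ET.fromstring(doc_xml)}


def looks_like_nutch(names):
    """Heuristic (assumption): fields seen only in the second document."""
    return {"segment", "digest", "tstamp"} <= names


if __name__ == "__main__":
    first, second = field_names(FIRST_DOC), field_names(SECOND_DOC)
    print("shared fields:", sorted(first & second))
    print("only in first:", sorted(first - second))
    print("only in second:", sorted(second - first))
    print("first looks like Nutch:", looks_like_nutch(first))
    print("second looks like Nutch:", looks_like_nutch(second))
```

Documents that share a Solr index but were written by different indexers only carry the fields their own indexer sent, so the field sets differ per document rather than per query.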

 
 