Posted to user@manifoldcf.apache.org by Erlend Garåsen <e....@usit.uio.no> on 2011/01/20 15:08:22 UTC

Indexing Solr with the web crawler

I have started the Jetty server, configured the web crawler, a Solr 
connector and created a job. First I try to crawl the following site:
http://ridder.uio.no/
which contains nothing but an index.html with links to different kinds 
of document types (pdf, html, doc etc.).

I have three questions.

1. Why do I now have a lot of these lines in the above host's access_log 
after the crawler has been started?
193.157.137.137 - - [20/Jan/2011:14:28:54 +0100] "GET / HTTP/1.1" 200 
588 "-" "ApacheManifoldCFWebCrawler;"
193.157.137.137 - - [20/Jan/2011:14:29:54 +0100] "GET / HTTP/1.1" 200 
588 "-" "ApacheManifoldCFWebCrawler;"
193.157.137.137 - - [20/Jan/2011:14:30:55 +0100] "GET / HTTP/1.1" 200 
588 "-" "ApacheManifoldCFWebCrawler;"
193.157.137.137 - - [20/Jan/2011:14:31:55 +0100] "GET / HTTP/1.1" 200 
588 "-" "ApacheManifoldCFWebCrawler;"

What is the crawler trying to do which it probably cannot do? Why is it 
fetching the same URL over and over again?

2. How can I index Solr when I don't know which fields ManifoldCF's web 
crawler collects? There is a field mapper in the job configuration, but 
I only know about the fields I have configured in Solr's schema.xml.

3. Will the web crawler parse document types such as PDF, doc, rtf etc.? 
If it does not use Apache Tika, is it possible to configure the web 
crawler to use Tika for document parsing and language detection?

Erlend

-- 
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050

Re: Indexing Solr with the web crawler

Posted by Karl Wright <da...@gmail.com>.
Perhaps it is acceptable to use the release version of Solr, plus
specific patches for the ticket or tickets in question?  There should
be a Solr tag for the release - you might be able to svn export from
that tag and pull the release code into your local svn, before
applying the patch, and then committing that also.  That way you have
a reproducible image to work with.  That's often what we needed to do
at MetaCarta.  It's a pain I know but that's life in the open-source
world.
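
In practice that might look something like this rough sketch (the repository
URLs and patch file name here are illustrative, not actual paths):

  # export the release tag without .svn metadata
  svn export http://svn.apache.org/repos/asf/lucene/solr/tags/release-1.4.1 solr-1.4.1
  # import it into your local repository for a reproducible baseline
  svn import solr-1.4.1 http://svn.example.org/repos/solr-1.4.1 -m "Solr 1.4.1 release"
  # check out, apply the SOLR-1902 patch, and commit the patched tree
  svn checkout http://svn.example.org/repos/solr-1.4.1 work
  cd work
  patch -p0 < SOLR-1902.patch
  svn commit -m "Apply SOLR-1902 patch"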

Karl


On Tue, Jan 25, 2011 at 4:59 AM, Erlend Garåsen <e....@usit.uio.no> wrote:
> On 24.01.11 14.48, Karl Wright wrote:
>
>> Thanks for the information.
>> What I'd like to do is wait until your research is done and then post
>> the rough instructions to dev@lucene.apache.org for confirmation that
>> your approach is the preferred one.  I'd also like to know if you
>> check out the latest solr release from the svn tag and just build it,
>> whether you have any of these problems.  I've been building
>> solr/lucene trunk and not using the binary distribution, which may be
>> why I never noticed that this has gone away in the main distribution.
>
> OK, it might take a week or so, but here are some details I just figured
> out:
> - There is a bug in the current Solr release (1.4.1) which makes it
> impossible to extract content using the ExtractingRequestHandler. I
> think it is related to this Jira issue:
> https://issues.apache.org/jira/browse/SOLR-1902
> - This issue is now fixed; if I check out the latest code from trunk,
> content can be extracted by Tika.
>
> What I still need to test is whether I have to place the tika/extracting jars
> manually in a lib folder when I deploy solr.war on Resin using the latest
> trunk version from SVN. When this is done, I will let you know.
>
> Anyway, I don't want to build a search application for my university on the
> latest trunk version; I would prefer an official release. So maybe I will try
> to backport the changes from trunk instead. I can already see that trunk
> ships a newer Tika than the official 1.4.1 release, i.e. tika-core-0.8.jar
> instead of tika-core-0.4.jar.
>
> Erlend
>
> --
> Erlend Garåsen
> Center for Information Technology Services
> University of Oslo
> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>

Re: Indexing Solr with the web crawler

Posted by Erlend Garåsen <e....@usit.uio.no>.
On 24.01.11 14.48, Karl Wright wrote:

> Thanks for the information.
> What I'd like to do is wait until your research is done and then post
> the rough instructions to dev@lucene.apache.org for confirmation that
> your approach is the preferred one.  I'd also like to know if you
> check out the latest solr release from the svn tag and just build it,
> whether you have any of these problems.  I've been building
> solr/lucene trunk and not using the binary distribution, which may be
> why I never noticed that this has gone away in the main distribution.

OK, it might take a week or so, but here are some details I just figured 
out:
- There is a bug in the current Solr release (1.4.1) which makes it 
impossible to extract content using the ExtractingRequestHandler. 
I think it is related to this Jira issue:
https://issues.apache.org/jira/browse/SOLR-1902
- This issue is now fixed; if I check out the latest code from trunk, 
content can be extracted by Tika.

What I still need to test is whether I have to place the tika/extracting 
jars manually in a lib folder when I deploy solr.war on Resin using the 
latest trunk version from SVN. When this is done, I will let you know.
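
One quick way to check, as a sketch (assuming an "ant dist" build; the war
file name depends on the trunk version):

  # build the war, then list its contents and look for bundled Tika jars
  ant dist
  jar tf dist/apache-solr-*.war | grep -i tika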

Anyway, I don't want to build a search application for my university on 
the latest trunk version; I would prefer an official release. So maybe I 
will try to backport the changes from trunk instead. I can already see 
that trunk ships a newer Tika than the official 1.4.1 release, i.e. 
tika-core-0.8.jar instead of tika-core-0.4.jar.

Erlend

-- 
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050

Re: Indexing Solr with the web crawler

Posted by Karl Wright <da...@gmail.com>.
Lucene Revolution could not fit me in, so my employer decided to send
me to Berlin Buzzwords instead.
Karl

On Mon, Mar 7, 2011 at 9:01 AM, Karl Wright <da...@gmail.com> wrote:
> Ok, I think I finally have the conference schedule all worked out with
> my employer.
>
> (1) I've put in a ManifoldCF presentation proposal for Lucene
> Revolution on May 25-26.  Topic is using ManifoldCF and Solr to secure
> documents.  If that's accepted, great; if not, I will probably go to
> the Berlin Buzzwords conference instead, for company reasons.
>
> (2) I plan on attending (and presenting something related to
> ManifoldCF, if accepted) at ApacheCon North America on November 7-11.
> Topic TBD depending on what happens with the Lucene Revolution talk in
> May, and what people seem to be interested in hearing about.  To that
> end, I'd love to hear ideas.
>
> Thanks!
> Karl
>
> On Wed, Jan 26, 2011 at 4:24 AM, Karl Wright <da...@gmail.com> wrote:
>> I'm told Eurocon will likely be sometime in October, and it will be
>> put on by Lucid, if it happens at all.  So I can present ManifoldCF
>> then, if appropriate arrangements can be worked out.
>>
>> Karl
>>
>> On Mon, Jan 24, 2011 at 10:45 AM, Karl Wright <da...@gmail.com> wrote:
>>> You're right.  For some reason I misread the date on the note.
>>>
>>> So it is indeed possible that I can present at the Lucene Revolution
>>> conference - but if so, that would be a search-related talk, not about
>>> ManifoldCF.  I *may* be able to present at Eurocon, if it's not May
>>> 17-21.  I probably wouldn't be able to do both.
>>>
>>> Karl
>>>
>>>
>>> On Mon, Jan 24, 2011 at 10:34 AM, Erik Hatcher <er...@gmail.com> wrote:
>>>>
>>>> On Jan 24, 2011, at 08:48 , Karl Wright wrote:
>>>>
>>>>> On Mon, Jan 24, 2011 at 8:40 AM, Erlend Garåsen <e....@usit.uio.no> wrote:
>>>>>> On 21.01.11 17.38, Karl Wright wrote:
>>>>>>>
>>>>>>> I will not be talking about ManifoldCF at this year's conference, most
>>>>>>> likely, because the conference conflicts with my daughter's college
>>>>>>> graduation.  Sorry about that!
>>>>>>
>>>>>> I'm not sure when the conference will be held anyway - I guess the date is
>>>>>> not officially published yet.
>>>>>>
>>>>>
>>>>> I received the email last week.  The conference is currently set for
>>>>> May 18 and 19 in San Francisco.
>>>>
>>>> Actually that's not correct, according to <http://www.lucidimagination.com/revolution/2011> it's May 23-26 in San Francisco, CA.  And that's the Lucene *Revolution* conference, whereas it was Lucene *EuroCon* that was being mentioned by Erlend.  I don't have details on EuroCon 2011 yet myself.
>>>>
>>>>        Erik
>>>>
>>>>
>>>
>>
>

Re: Indexing Solr with the web crawler

Posted by Karl Wright <da...@gmail.com>.
Ok, I think I finally have the conference schedule all worked out with
my employer.

(1) I've put in a ManifoldCF presentation proposal for Lucene
Revolution on May 25-26.  Topic is using ManifoldCF and Solr to secure
documents.  If that's accepted, great; if not, I will probably go to
the Berlin Buzzwords conference instead, for company reasons.

(2) I plan on attending (and presenting something related to
ManifoldCF, if accepted) at ApacheCon North America on November 7-11.
Topic TBD depending on what happens with the Lucene Revolution talk in
May, and what people seem to be interested in hearing about.  To that
end, I'd love to hear ideas.

Thanks!
Karl

On Wed, Jan 26, 2011 at 4:24 AM, Karl Wright <da...@gmail.com> wrote:
> I'm told Eurocon will likely be sometime in October, and it will be
> put on by Lucid, if it happens at all.  So I can present ManifoldCF
> then, if appropriate arrangements can be worked out.
>
> Karl
>
> On Mon, Jan 24, 2011 at 10:45 AM, Karl Wright <da...@gmail.com> wrote:
>> You're right.  For some reason I misread the date on the note.
>>
>> So it is indeed possible that I can present at the Lucene Revolution
>> conference - but if so, that would be a search-related talk, not about
>> ManifoldCF.  I *may* be able to present at Eurocon, if it's not May
>> 17-21.  I probably wouldn't be able to do both.
>>
>> Karl
>>
>>
>> On Mon, Jan 24, 2011 at 10:34 AM, Erik Hatcher <er...@gmail.com> wrote:
>>>
>>> On Jan 24, 2011, at 08:48 , Karl Wright wrote:
>>>
>>>> On Mon, Jan 24, 2011 at 8:40 AM, Erlend Garåsen <e....@usit.uio.no> wrote:
>>>>> On 21.01.11 17.38, Karl Wright wrote:
>>>>>>
>>>>>> I will not be talking about ManifoldCF at this year's conference, most
>>>>>> likely, because the conference conflicts with my daughter's college
>>>>>> graduation.  Sorry about that!
>>>>>
>>>>> I'm not sure when the conference will be held anyway - I guess the date is
>>>>> not officially published yet.
>>>>>
>>>>
>>>> I received the email last week.  The conference is currently set for
>>>> May 18 and 19 in San Francisco.
>>>
>>> Actually that's not correct, according to <http://www.lucidimagination.com/revolution/2011> it's May 23-26 in San Francisco, CA.  And that's the Lucene *Revolution* conference, whereas it was Lucene *EuroCon* that was being mentioned by Erlend.  I don't have details on EuroCon 2011 yet myself.
>>>
>>>        Erik
>>>
>>>
>>
>

Re: Indexing Solr with the web crawler

Posted by Karl Wright <da...@gmail.com>.
I'm told Eurocon will likely be sometime in October, and it will be
put on by Lucid, if it happens at all.  So I can present ManifoldCF
then, if appropriate arrangements can be worked out.

Karl

On Mon, Jan 24, 2011 at 10:45 AM, Karl Wright <da...@gmail.com> wrote:
> You're right.  For some reason I misread the date on the note.
>
> So it is indeed possible that I can present at the Lucene Revolution
> conference - but if so, that would be a search-related talk, not about
> ManifoldCF.  I *may* be able to present at Eurocon, if it's not May
> 17-21.  I probably wouldn't be able to do both.
>
> Karl
>
>
> On Mon, Jan 24, 2011 at 10:34 AM, Erik Hatcher <er...@gmail.com> wrote:
>>
>> On Jan 24, 2011, at 08:48 , Karl Wright wrote:
>>
>>> On Mon, Jan 24, 2011 at 8:40 AM, Erlend Garåsen <e....@usit.uio.no> wrote:
>>>> On 21.01.11 17.38, Karl Wright wrote:
>>>>>
>>>>> I will not be talking about ManifoldCF at this year's conference, most
>>>>> likely, because the conference conflicts with my daughter's college
>>>>> graduation.  Sorry about that!
>>>>
>>>> I'm not sure when the conference will be held anyway - I guess the date is
>>>> not officially published yet.
>>>>
>>>
>>> I received the email last week.  The conference is currently set for
>>> May 18 and 19 in San Francisco.
>>
>> Actually that's not correct, according to <http://www.lucidimagination.com/revolution/2011> it's May 23-26 in San Francisco, CA.  And that's the Lucene *Revolution* conference, whereas it was Lucene *EuroCon* that was being mentioned by Erlend.  I don't have details on EuroCon 2011 yet myself.
>>
>>        Erik
>>
>>
>

Re: Indexing Solr with the web crawler

Posted by Karl Wright <da...@gmail.com>.
You're right.  For some reason I misread the date on the note.

So it is indeed possible that I can present at the Lucene Revolution
conference - but if so, that would be a search-related talk, not about
ManifoldCF.  I *may* be able to present at Eurocon, if it's not May
17-21.  I probably wouldn't be able to do both.

Karl


On Mon, Jan 24, 2011 at 10:34 AM, Erik Hatcher <er...@gmail.com> wrote:
>
> On Jan 24, 2011, at 08:48 , Karl Wright wrote:
>
>> On Mon, Jan 24, 2011 at 8:40 AM, Erlend Garåsen <e....@usit.uio.no> wrote:
>>> On 21.01.11 17.38, Karl Wright wrote:
>>>>
>>>> I will not be talking about ManifoldCF at this year's conference, most
>>>> likely, because the conference conflicts with my daughter's college
>>>> graduation.  Sorry about that!
>>>
>>> I'm not sure when the conference will be held anyway - I guess the date is
>>> not officially published yet.
>>>
>>
>> I received the email last week.  The conference is currently set for
>> May 18 and 19 in San Francisco.
>
> Actually that's not correct, according to <http://www.lucidimagination.com/revolution/2011> it's May 23-26 in San Francisco, CA.  And that's the Lucene *Revolution* conference, whereas it was Lucene *EuroCon* that was being mentioned by Erlend.  I don't have details on EuroCon 2011 yet myself.
>
>        Erik
>
>

Re: Indexing Solr with the web crawler

Posted by Erik Hatcher <er...@gmail.com>.
On Jan 24, 2011, at 08:48 , Karl Wright wrote:

> On Mon, Jan 24, 2011 at 8:40 AM, Erlend Garåsen <e....@usit.uio.no> wrote:
>> On 21.01.11 17.38, Karl Wright wrote:
>>> 
>>> I will not be talking about ManifoldCF at this year's conference, most
>>> likely, because the conference conflicts with my daughter's college
>>> graduation.  Sorry about that!
>> 
>> I'm not sure when the conference will be held anyway - I guess the date is
>> not officially published yet.
>> 
> 
> I received the email last week.  The conference is currently set for
> May 18 and 19 in San Francisco.

Actually that's not correct, according to <http://www.lucidimagination.com/revolution/2011> it's May 23-26 in San Francisco, CA.  And that's the Lucene *Revolution* conference, whereas it was Lucene *EuroCon* that was being mentioned by Erlend.  I don't have details on EuroCon 2011 yet myself.

	Erik


Re: Indexing Solr with the web crawler

Posted by Karl Wright <da...@gmail.com>.
On Mon, Jan 24, 2011 at 8:40 AM, Erlend Garåsen <e....@usit.uio.no> wrote:
> On 21.01.11 17.38, Karl Wright wrote:
>>
>> I will not be talking about ManifoldCF at this year's conference, most
>> likely, because the conference conflicts with my daughter's college
>> graduation.  Sorry about that!
>
> I'm not sure when the conference will be held anyway - I guess the date is
> not officially published yet.
>

I received the email last week.  The conference is currently set for
May 18 and 19 in San Francisco.

>> I hadn't heard that they removed the extracting update request handler
>> from Solr.  That's unfortunate.  Please let me know how hard you find
>> it to install the jar, and I'll update the instructions accordingly.
>
> It's finally working, but not perfectly. Here's what I had to do:
> - Run "ant example"
> - Create a <solr.home>/lib directory
> - Place all jars in contrib/extraction/lib/ and contrib/extraction/build/
> into this lib folder.
>
> I also had to use the schema.xml file from the example. My own schema
> configuration is different, so I guess I need to adapt it later. Content is
> missing, title is not. And maybe I need to create my own request handler in
> order to implement language detection. I will try to dive deeper into all
> the configuration settings.
>

Thanks for the information.
What I'd like to do is wait until your research is done and then post
the rough instructions to dev@lucene.apache.org for confirmation that
your approach is the preferred one.  I'd also like to know if you
check out the latest solr release from the svn tag and just build it,
whether you have any of these problems.  I've been building
solr/lucene trunk and not using the binary distribution, which may be
why I never noticed that this has gone away in the main distribution.

Thanks again!
Karl

Re: Indexing Solr with the web crawler

Posted by Erlend Garåsen <e....@usit.uio.no>.
On 21.01.11 17.38, Karl Wright wrote:
> I will not be talking about ManifoldCF at this year's conference, most
> likely, because the conference conflicts with my daughter's college
> graduation.  Sorry about that!

I'm not sure when the conference will be held anyway - I guess the date 
is not officially published yet.

> I hadn't heard that they removed the extracting update request handler
> from Solr.  That's unfortunate.  Please let me know how hard you find
> it to install the jar, and I'll update the instructions accordingly.

It's finally working, but not perfectly. Here's what I had to do:
- Run "ant example"
- Create a <solr.home>/lib directory
- Place all jars in contrib/extraction/lib/ and 
contrib/extraction/build/ into this lib folder.
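
In shell terms, roughly (a sketch assuming a Solr 1.4.1 source tree and
<solr.home> at /var/solr -- adjust the paths to your own layout):

  ant example                                        # also builds contrib/extraction
  mkdir -p /var/solr/lib
  cp contrib/extraction/lib/*.jar   /var/solr/lib/   # Tika and its dependencies
  cp contrib/extraction/build/*.jar /var/solr/lib/   # the extraction handler jar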

I also had to use the schema.xml file from the example. My own schema 
configuration is different, so I guess I need to adapt it later. Content 
is missing, title is not. And maybe I need to create my own request 
handler in order to implement language detection. I will try to dive 
deeper into all the configuration settings.

Erlend

-- 
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050

Re: Indexing Solr with the web crawler

Posted by Karl Wright <da...@gmail.com>.
I will not be talking about ManifoldCF at this year's conference, most
likely, because the conference conflicts with my daughter's college
graduation.  Sorry about that!

I hadn't heard that they removed the extracting update request handler
from Solr.  That's unfortunate.  Please let me know how hard you find
it to install the jar, and I'll update the instructions accordingly.

Karl

On Fri, Jan 21, 2011 at 10:32 AM, Erlend Garåsen
<e....@usit.uio.no> wrote:
>
> I knew that I had heard your name before, Karl. You gave an LCF presentation
> in Prague. Unfortunately, I attended the other presentation in track 2, so I
> missed it.
>
> I hope similar presentations will be held at this year's conference.
>
> Anyway, I figured out that it is the commit step that causes the problems.
> I entered the following URL, which I found in Resin's access_log:
> http://hoppalong.uio.no:8081/solr/update/extract?commit=true
>
> I'm not going to bother you with the complete stack trace, but here's the
> relevant line:
> Caused by: java.lang.ClassNotFoundException:
> org.apache.solr.handler.extraction.ExtractingRequestHandler
>
> Jack sent me a link about the ExtractingRequestHandler, and after I read
> this document I found the reason:
> "The ExtractingRequestHandler is not incorporated into the solr war file,
> you have to install it separately."
>
> So I will try to place the missing jar file into my lib folder next week.
>
> Erlend
>
>
> On 20.01.11 16.23, Erlend Garåsen wrote:
>>
>> On 20.01.11 16.15, Jack Krupansky wrote:
>>>
>>> Here's one email thread that details at least one cause of the lazy
>>> loading error:
>>>
>>>
>>> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200910.mbox/%3C4AD5EC8C.6000308@gmail.com%3E
>>>
>>
>> Thanks. Now I can see that I have the following lines in Resin's access
>> log:
>> 127.0.0.1 - - [20/Jan/2011:16:19:09 +0100] "GET
>> /solr/update/extract?commit=true HTTP/1.0" 500 5598 "-" "-"
>>
>> I run Solr on Resin, so maybe there is something more I need to
>> configure. I'll take a deeper look at this right now.
>>
>> Erlend
>>
>
>
> --
> Erlend Garåsen
> Center for Information Technology Services
> University of Oslo
> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>

Re: Indexing Solr with the web crawler

Posted by Erlend Garåsen <e....@usit.uio.no>.
I knew that I had heard your name before, Karl. You gave an LCF 
presentation in Prague. Unfortunately, I attended the other presentation 
in track 2, so I missed it.

I hope similar presentations will be held at this year's conference.

Anyway, I figured out that it is the commit step that causes the 
problems. I entered the following URL, which I found in Resin's access_log:
http://hoppalong.uio.no:8081/solr/update/extract?commit=true

I'm not going to bother you with the complete stack trace, but here's 
the relevant line:
Caused by: java.lang.ClassNotFoundException: 
org.apache.solr.handler.extraction.ExtractingRequestHandler

Jack sent me a link about the ExtractingRequestHandler, and after I read 
this document I found the reason:
"The ExtractingRequestHandler is not incorporated into the solr war 
file, you have to install it separately."

So I will try to place the missing jar file into my lib folder next week.
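
For reference, the handler also has to be registered in solrconfig.xml; a
minimal sketch (the fmap.content target field is just an example, assuming
a "text" field exists in the schema):

  <requestHandler name="/update/extract"
                  class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
    <lst name="defaults">
      <!-- map Tika's "content" output to the schema's "text" field -->
      <str name="fmap.content">text</str>
    </lst>
  </requestHandler>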

Erlend


On 20.01.11 16.23, Erlend Garåsen wrote:
> On 20.01.11 16.15, Jack Krupansky wrote:
>> Here's one email thread that details at least one cause of the lazy
>> loading error:
>>
>> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200910.mbox/%3C4AD5EC8C.6000308@gmail.com%3E
>>
>
> Thanks. Now I can see that I have the following lines in Resin's access
> log:
> 127.0.0.1 - - [20/Jan/2011:16:19:09 +0100] "GET
> /solr/update/extract?commit=true HTTP/1.0" 500 5598 "-" "-"
>
> I run Solr on Resin, so maybe there is something more I need to
> configure. I'll take a deeper look at this right now.
>
> Erlend
>


-- 
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050

Re: Indexing Solr with the web crawler

Posted by Erlend Garåsen <e....@usit.uio.no>.
On 20.01.11 16.15, Jack Krupansky wrote:
> Here's one email thread that details at least one cause of the lazy
> loading error:
>
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200910.mbox/%3C4AD5EC8C.6000308@gmail.com%3E

Thanks. Now I can see that I have the following lines in Resin's access log:
127.0.0.1 - - [20/Jan/2011:16:19:09 +0100] "GET 
/solr/update/extract?commit=true HTTP/1.0" 500 5598 "-" "-"

I run Solr on Resin, so maybe there is something more I need to 
configure. I'll take a deeper look at this right now.

Erlend

-- 
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050

Re: Indexing Solr with the web crawler

Posted by Jack Krupansky <ja...@lucidimagination.com>.
Here's one email thread that details at least one cause of the lazy loading 
error:

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200910.mbox/%3C4AD5EC8C.6000308@gmail.com%3E

-- Jack Krupansky

-----Original Message----- 
From: Karl Wright
Sent: Thursday, January 20, 2011 10:02 AM
To: connectors-user@incubator.apache.org
Subject: Re: Indexing Solr with the web crawler

> It says:
>
> 01-20-2011 15:14:18.914         document ingest (solr_indexer)
> http://ridder.uio.no/
>        500     588     9       lazy loading error

So what is happening is that either your solr instance or your Solr
output connection is misconfigured, and when ManifoldCF tries to send
the document to Solr it returns with an error.  I don't know what
Solr's "lazy loading error" is, but hopefully you can find out either
from the doc or from the Solr/Lucene newsgroup.


> Thanks for clarifying. I can try to configure Solr to parse these 
> documents.
> Nutch did a good job except that it cannot detect whether a document was
> modified in order to send an update/delete command to Solr. That function
> is crucial for us.

That's exactly what ManifoldCF is good at.

> I'm unsure what you mean by anonymous fields in Solr. Can't I define
> the fields I need in schema.xml as I want? I have created duplicate fields
> for title and content in order to use different stemmers (I need to 
> support
> English and Norwegian). In Nutch there is a simple configuration file for
> mapping fields from Nutch to Solr.

I'm probably using the wrong terminology.  I think they are actually
called "dynamic fields".

> I haven't filled out the "expiration interval (if continuous)" under the
> scheduling folder. Is this the reason why ManifoldCF is recrawling the 
> page
> every minute?

The reason it's retrying is because the Solr connector is getting that
error, and it's telling ManifoldCF that it should retry.  That's
because it hasn't figured out that the error is due to setup, rather
than some transient condition.

The expiration model for continuous crawling is going to take more to
describe than I can here.  I suggest you read about it in the online
end-user documentation.  If that's not enough, there's a book on the
way from Manning Publications, called ManifoldCF in Action.  There
should be some chapters that might help you available soon through the
Manning Early Access Program.

Thanks!
Karl


On Thu, Jan 20, 2011 at 9:50 AM, Erlend Garåsen <e....@usit.uio.no> 
wrote:
> On 20.01.11 15.21, Karl Wright wrote:
>>
>> Hi Erlend,
>
> Hi Karl,
>
> Thank you for replying and for your comments. It's much appreciated.
>
>> (1) The best way to find out what ManifoldCF thinks it is doing is to
>> look at the Simple History report in the UI.
>
> It says:
>
> 01-20-2011 15:14:18.914         document ingest (solr_indexer)
> http://ridder.uio.no/
>        500     588     9       lazy loading error
> 01-20-2011 15:14:18.800         fetch   http://ridder.uio.no/
>        200     588     103
> 01-20-2011 15:13:18.581         document ingest (solr_indexer)
> http://ridder.uio.no/
>        500     588     16      lazy loading error
> 01-20-2011 15:13:18.448         fetch   http://ridder.uio.no/
>        200     588     111
>
>
>> (2) The Web Connector in ManifoldCF does not have the ability, at this
>> time, to extract links from Word docs, pdfs, etc., but Solr can
>> extract *content* from these documents if you configure it to use
>> Tika.  The document is sent to Solr in binary form, and Tika extracts
>> whatever metadata it can find.  ManifoldCF does not get involved in
>> that at all.  Usually, setting up Solr with anonymous fields is the
>> way to go in this case.
>
> Thanks for clarifying. I can try to configure Solr to parse these 
> documents.
> Nutch did a good job except that it cannot detect whether a document was
> modified in order to send an update/delete commando to Solr. That function
> is crucial for us.
>
> I'm unsure what you mean by anonymous fields in Solr. Can't I define
> the fields I need in schema.xml as I want? I have created duplicate fields
> for title and content in order to use different stemmers (I need to 
> support
> English and Norwegian). In Nutch there is a simple configuration file for
> mapping fields from Nutch to Solr.
>
>> If this is an open site, I'll crawl it here myself momentarily and let
>> you know what I find.
>
> Please do that. It's just my workstation with an Apache server running. 
> It's
> open.
>
> BTW, I think I have set things up correctly for the crawler:
> Seeds: http://ridder.uio.no/
> Inclusions: ^http://ridder.uio.no/.* (checked for "include only hosts
> matching seeds")
>
> I haven't filled out the "expiration interval (if continuous)" under the
> scheduling folder. Is this the reason why ManifoldCF is recrawling the 
> page
> every minute?
>
> Erlend
>
> --
> Erlend Garåsen
> Center for Information Technology Services
> University of Oslo
> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 
> 31050
> 


Re: Indexing Solr with the web crawler

Posted by Karl Wright <da...@gmail.com>.
> It says:
>
> 01-20-2011 15:14:18.914         document ingest (solr_indexer)
> http://ridder.uio.no/
>        500     588     9       lazy loading error

So what is happening is that either your solr instance or your Solr
output connection is misconfigured, and when ManifoldCF tries to send
the document to Solr it returns with an error.  I don't know what
Solr's "lazy loading error" is, but hopefully you can find out either
from the doc or from the Solr/Lucene newsgroup.


> Thanks for clarifying. I can try to configure Solr to parse these documents.
> Nutch did a good job except that it cannot detect whether a document was
> modified in order to send an update/delete command to Solr. That function
> is crucial for us.

That's exactly what ManifoldCF is good at.

> I'm unsure what you mean by anonymous fields in Solr. Can't I define
> the fields I need in schema.xml as I want? I have created duplicate fields
> for title and content in order to use different stemmers (I need to support
> English and Norwegian). In Nutch there is a simple configuration file for
> mapping fields from Nutch to Solr.

I'm probably using the wrong terminology.  I think they are actually
called "dynamic fields".

> I haven't filled out the "expiration interval (if continuous)" under the
> scheduling folder. Is this the reason why ManifoldCF is recrawling the page
> every minute?

The reason it's retrying is because the Solr connector is getting that
error, and it's telling ManifoldCF that it should retry.  That's
because it hasn't figured out that the error is due to setup, rather
than some transient condition.

The expiration model for continuous crawling is going to take more to
describe than I can here.  I suggest you read about it in the online
end-user documentation.  If that's not enough, there's a book on the
way from Manning Publications, called ManifoldCF in Action.  There
should be some chapters that might help you available soon through the
Manning Early Access Program.

Thanks!
Karl


On Thu, Jan 20, 2011 at 9:50 AM, Erlend Garåsen <e....@usit.uio.no> wrote:
> On 20.01.11 15.21, Karl Wright wrote:
>>
>> Hi Erlend,
>
> Hi Karl,
>
> Thank you for replying and for your comments. It's much appreciated.
>
>> (1) The best way to find out what ManifoldCF thinks it is doing is to
>> look at the Simple History report in the UI.
>
> It says:
>
> 01-20-2011 15:14:18.914         document ingest (solr_indexer)
> http://ridder.uio.no/
>        500     588     9       lazy loading error
> 01-20-2011 15:14:18.800         fetch   http://ridder.uio.no/
>        200     588     103
> 01-20-2011 15:13:18.581         document ingest (solr_indexer)
> http://ridder.uio.no/
>        500     588     16      lazy loading error
> 01-20-2011 15:13:18.448         fetch   http://ridder.uio.no/
>        200     588     111
>
>
>> (2) The Web Connector in ManifoldCF does not have the ability, at this
>> time, to extract links from Word docs, pdfs, etc., but Solr can
>> extract *content* from these documents if you configure it to use
>> Tika.  The document is sent to Solr in binary form, and Tika extracts
>> whatever metadata it can find.  ManifoldCF does not get involved in
>> that at all.  Usually, setting up Solr with anonymous fields is the
>> way to go in this case.
>
> Thanks for clarifying. I can try to configure Solr to parse these documents.
> Nutch did a good job except that it cannot detect whether a document was
> modified in order to send an update/delete command to Solr. That function
> is crucial for us.
>
> I'm unsure what you mean by anonymous fields in Solr. Can't I define
> the fields I need in schema.xml as I want? I have created duplicate fields
> for title and content in order to use different stemmers (I need to support
> English and Norwegian). In Nutch there is a simple configuration file for
> mapping fields from Nutch to Solr.
>
>> If this is an open site, I'll crawl it here myself momentarily and let
>> you know what I find.
>
> Please do that. It's just my workstation with an Apache server running. It's
> open.
>
> BTW, I think I have set things up correctly for the crawler:
> Seeds: http://ridder.uio.no/
> Inclusions: ^http://ridder.uio.no/.* (checked for "include only hosts
> matching seeds")
>
> I haven't filled out the "expiration interval (if continuous)" under the
> scheduling folder. Is this the reason why ManifoldCF is recrawling the page
> every minute?
>
> Erlend
>
> --
> Erlend Garåsen
> Center for Information Technology Services
> University of Oslo
> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>

Re: Indexing Solr with the web crawler

Posted by Erlend Garåsen <e....@usit.uio.no>.
On 20.01.11 15.21, Karl Wright wrote:
> Hi Erlend,

Hi Karl,

Thank you for replying and for your comments. It's much appreciated.

> (1) The best way to find out what ManifoldCF thinks it is doing is to
> look at the Simple History report in the UI.

It says:

Start Time               Activity                        Identifier             Code  Bytes  Time  Result Description
01-20-2011 15:14:18.914  document ingest (solr_indexer)  http://ridder.uio.no/  500   588    9    lazy loading error
01-20-2011 15:14:18.800  fetch                           http://ridder.uio.no/  200   588    103
01-20-2011 15:13:18.581  document ingest (solr_indexer)  http://ridder.uio.no/  500   588    16   lazy loading error
01-20-2011 15:13:18.448  fetch                           http://ridder.uio.no/  200   588    111


> (2) The Web Connector in ManifoldCF does not have the ability, at this
> time, to extract links from Word docs, pdfs, etc., but Solr can
> extract *content* from these documents if you configure it to use
> Tika.  The document is sent to Solr in binary form, and Tika extracts
> whatever metadata it can find.  ManifoldCF does not get involved in
> that at all.  Usually, setting up Solr with anonymous fields is the
> way to go in this case.

Thanks for clarifying. I can try to configure Solr to parse these 
documents. Nutch did a good job except that it cannot detect whether a 
document was modified in order to send an update/delete command to 
Solr. That function is crucial for us.

I'm unsure what you mean by anonymous fields in Solr. Can't I define 
the fields I need in schema.xml as I want? I have created 
duplicate fields for title and content in order to use different 
stemmers (I need to support English and Norwegian). In Nutch there is a 
simple configuration file for mapping fields from Nutch to Solr.

> If this is an open site, I'll crawl it here myself momentarily and let
> you know what I find.

Please do that. It's just my workstation with an Apache server running. 
It's open.

BTW, I think I have set things up correctly for the crawler:
Seeds: http://ridder.uio.no/
Inclusions: ^http://ridder.uio.no/.* (checked for "include only hosts 
matching seeds")

I haven't filled out the "expiration interval (if continuous)" under 
the scheduling folder. Is this the reason why ManifoldCF is recrawling 
the page every minute?

Erlend

-- 
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050

Re: Indexing Solr with the web crawler

Posted by Karl Wright <da...@gmail.com>.
Hmm, right now I'm behind a firewall, unfortunately, so I won't be
able to try this myself until this evening.  But if you post the
output of your simple history report I can help interpret it for you.

Karl

On Thu, Jan 20, 2011 at 9:21 AM, Karl Wright <da...@gmail.com> wrote:
> Hi Erlend,
>
> (1) The best way to find out what ManifoldCF thinks it is doing is to
> look at the Simple History report in the UI.
>
> (2) The Web Connector in ManifoldCF does not have the ability, at this
> time, to extract links from Word docs, pdfs, etc., but Solr can
> extract *content* from these documents if you configure it to use
> Tika.  The document is sent to Solr in binary form, and Tika extracts
> whatever metadata it can find.  ManifoldCF does not get involved in
> that at all.  Usually, setting up Solr with anonymous fields is the
> way to go in this case.
>
> If this is an open site, I'll crawl it here myself momentarily and let
> you know what I find.
>
> Karl
>
>
>
> On Thu, Jan 20, 2011 at 9:08 AM, Erlend Garåsen <e....@usit.uio.no> wrote:
>>
>> I have started the Jetty server, configured the web crawler, a Solr
>> connector and created a job. First I try to crawl the following site:
>> http://ridder.uio.no/
>> which contains nothing but an index.html with links to different kinds of
>> document types (pdf, html, doc etc.).
>>
>> I have three questions.
>>
>> 1. Why do I now have a lot of these lines in the above host's access_log
>> after the crawler has been started?
>> 193.157.137.137 - - [20/Jan/2011:14:28:54 +0100] "GET / HTTP/1.1" 200 588
>> "-" "ApacheManifoldCFWebCrawler;"
>> 193.157.137.137 - - [20/Jan/2011:14:29:54 +0100] "GET / HTTP/1.1" 200 588
>> "-" "ApacheManifoldCFWebCrawler;"
>> 193.157.137.137 - - [20/Jan/2011:14:30:55 +0100] "GET / HTTP/1.1" 200 588
>> "-" "ApacheManifoldCFWebCrawler;"
>> 193.157.137.137 - - [20/Jan/2011:14:31:55 +0100] "GET / HTTP/1.1" 200 588
>> "-" "ApacheManifoldCFWebCrawler;"
>>
>> What is the crawler trying to do which it probably cannot do? Why is it
>> fetching the same URL over and over again?
>>
>> 2. How can I index Solr when I don't know which fields ManifoldCF's web
>> crawler collects? There is a field mapper in the job configuration, but I
>> only know about the fields I have configured in Solr's schema.xml.
>>
>> 3. Will the web crawler parse document types such as PDF, doc, rtf etc.? If
>> it does not use Apache Tika, is it possible to configure the web crawler to
>> use Tika for document parsing and language detection?
>>
>> Erlend
>>
>> --
>> Erlend Garåsen
>> Center for Information Technology Services
>> University of Oslo
>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>>
>

Re: Indexing Solr with the web crawler

Posted by Karl Wright <da...@gmail.com>.
Hi Erlend,

(1) The best way to find out what ManifoldCF thinks it is doing is to
look at the Simple History report in the UI.

(2) The Web Connector in ManifoldCF does not have the ability, at this
time, to extract links from Word docs, pdfs, etc., but Solr can
extract *content* from these documents if you configure it to use
Tika.  The document is sent to Solr in binary form, and Tika extracts
whatever metadata it can find.  ManifoldCF does not get involved in
that at all.  Usually, setting up Solr with anonymous fields is the
way to go in this case.

If this is an open site, I'll crawl it here myself momentarily and let
you know what I find.

Karl



On Thu, Jan 20, 2011 at 9:08 AM, Erlend Garåsen <e....@usit.uio.no> wrote:
>
> I have started the Jetty server, configured the web crawler, a Solr
> connector and created a job. First I try to crawl the following site:
> http://ridder.uio.no/
> which contains nothing but an index.html with links to different kinds of
> document types (pdf, html, doc etc.).
>
> I have three questions.
>
> 1. Why do I now have a lot of these lines in the above host's access_log
> after the crawler has been started?
> 193.157.137.137 - - [20/Jan/2011:14:28:54 +0100] "GET / HTTP/1.1" 200 588
> "-" "ApacheManifoldCFWebCrawler;"
> 193.157.137.137 - - [20/Jan/2011:14:29:54 +0100] "GET / HTTP/1.1" 200 588
> "-" "ApacheManifoldCFWebCrawler;"
> 193.157.137.137 - - [20/Jan/2011:14:30:55 +0100] "GET / HTTP/1.1" 200 588
> "-" "ApacheManifoldCFWebCrawler;"
> 193.157.137.137 - - [20/Jan/2011:14:31:55 +0100] "GET / HTTP/1.1" 200 588
> "-" "ApacheManifoldCFWebCrawler;"
>
> What is the crawler trying to do which it probably cannot do? Why is it
> fetching the same URL over and over again?
>
> 2. How can I index Solr when I don't know which fields ManifoldCF's web
> crawler collects? There is a field mapper in the job configuration, but I
> only know about the fields I have configured in Solr's schema.xml.
>
> 3. Will the web crawler parse document types such as PDF, doc, rtf etc.? If
> it does not use Apache Tika, is it possible to configure the web crawler to
> use Tika for document parsing and language detection?
>
> Erlend
>
> --
> Erlend Garåsen
> Center for Information Technology Services
> University of Oslo
> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>

Re: Indexing Solr with the web crawler

Posted by Jack Krupansky <ja...@lucidimagination.com>.
The Solr connector is designed to send raw document content (unparsed) to 
Solr Cell (the ExtractingRequestHandler), which then uses Tika for MIME type 
detection and document parsing. If you run Tika directly it will tell you 
what metadata is extracted from a particular document type, which varies.

See:
http://wiki.apache.org/solr/ExtractingRequestHandler

You can also access Solr Cell with the "Extract Only" option to see what 
Tika is generating within Solr Cell for a particular input document and then 
use those metadata field names to construct MCF field mappings to your 
schema fields.

See:
http://wiki.apache.org/solr/ExtractingRequestHandler#Extract_Only
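
As a quick sketch (the host, port, and file name here are illustrative),
you can see what Tika produces for a document without indexing anything:

  curl "http://localhost:8983/solr/update/extract?extractOnly=true" \
       -F "myfile=@sample.pdf"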

-- Jack Krupansky

-----Original Message----- 
From: Erlend Garåsen
Sent: Thursday, January 20, 2011 9:08 AM
To: connectors-user@incubator.apache.org
Subject: Indexing Solr with the web crawler


I have started the Jetty server, configured the web crawler, a Solr
connector and created a job. First I try to crawl the following site:
http://ridder.uio.no/
which contains nothing but an index.html with links to different kinds
of document types (pdf, html, doc etc.).

I have three questions.

1. Why do I now have a lot of these lines in the above host's access_log
after the crawler has been started?
193.157.137.137 - - [20/Jan/2011:14:28:54 +0100] "GET / HTTP/1.1" 200
588 "-" "ApacheManifoldCFWebCrawler;"
193.157.137.137 - - [20/Jan/2011:14:29:54 +0100] "GET / HTTP/1.1" 200
588 "-" "ApacheManifoldCFWebCrawler;"
193.157.137.137 - - [20/Jan/2011:14:30:55 +0100] "GET / HTTP/1.1" 200
588 "-" "ApacheManifoldCFWebCrawler;"
193.157.137.137 - - [20/Jan/2011:14:31:55 +0100] "GET / HTTP/1.1" 200
588 "-" "ApacheManifoldCFWebCrawler;"

What is the crawler trying to do which it probably cannot do? Why is it
fetching the same URL over and over again?

2. How can I index Solr when I don't know which fields ManifoldCF's web
crawler collects? There is a field mapper in the job configuration, but
I only know about the fields I have configured in Solr's schema.xml.

3. Will the web crawler parse document types such as PDF, doc, rtf etc.?
If it does not use Apache Tika, is it possible to configure the web
crawler to use Tika for document parsing and language detection?

Erlend

-- 
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050