Posted to user@manifoldcf.apache.org by Erlend Garåsen <e....@usit.uio.no> on 2011/01/20 15:08:22 UTC
Indexing Solr with the web crawler
I have started the Jetty server, configured the web crawler, a Solr
connector and created a job. First I try to crawl the following site:
http://ridder.uio.no/
which contains nothing but an index.html with links to different kinds
of document types (pdf, html, doc etc.).
I have three questions.
1. Why do I now have a lot of these lines in the above host's access_log
after the crawler has been started?
193.157.137.137 - - [20/Jan/2011:14:28:54 +0100] "GET / HTTP/1.1" 200
588 "-" "ApacheManifoldCFWebCrawler;"
193.157.137.137 - - [20/Jan/2011:14:29:54 +0100] "GET / HTTP/1.1" 200
588 "-" "ApacheManifoldCFWebCrawler;"
193.157.137.137 - - [20/Jan/2011:14:30:55 +0100] "GET / HTTP/1.1" 200
588 "-" "ApacheManifoldCFWebCrawler;"
193.157.137.137 - - [20/Jan/2011:14:31:55 +0100] "GET / HTTP/1.1" 200
588 "-" "ApacheManifoldCFWebCrawler;"
What is the crawler trying to do that it apparently cannot? Why is it
fetching the same URL over and over again?
2. How can I index Solr when I don't know which fields ManifoldCF's web
crawler collects? There is a field mapper in the job configuration, but
I only know about the fields I have configured in Solr's schema.xml.
3. Will the web crawler parse document types such as PDF, doc, rtf etc.?
If it does not use Apache Tika, is it possible to configure the web
crawler to use Tika for document parsing and language detection?
Erlend
--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: Indexing Solr with the web crawler
Posted by Karl Wright <da...@gmail.com>.
Perhaps it is acceptable to use the release version of Solr, plus
specific patches for the ticket or tickets in question? There should
be a Solr tag for the release - you might be able to svn export from
that tag and pull the release code into your local svn, before
applying the patch, and then committing that also. That way you have
a reproducible image to work with. That's often what we needed to do
at MetaCarta. It's a pain, I know, but that's life in the open-source
world.
Karl
On Tue, Jan 25, 2011 at 4:59 AM, Erlend Garåsen <e....@usit.uio.no> wrote:
> On 24.01.11 14.48, Karl Wright wrote:
>
>> Thanks for the information.
>> What I'd like to do is wait until your research is done and then post
>> the rough instructions to dev@lucene.apache.org for confirmation that
>> your approach is the preferred one. I'd also like to know if you
>> check out the latest solr release from the svn tag and just build it,
>> whether you have any of these problems. I've been building
>> solr/lucene trunk and not using the binary distribution, which may be
>> why I never noticed that this has gone away in the main distribution.
>
> OK, it might take a week or so, but here are some details I just figured
> out:
> - There is a bug with the current Solr release (1.4.1) which makes it
> impossible to extract the content by using the ExtractingRequestHandler. I
> think it is related to this Jira issue:
> https://issues.apache.org/jira/browse/SOLR-1902?page=com.atlassian.jira.plugin.ext.subversion%3Asubversion-commits-tabpanel
> - This issue is now fixed, and if I check out the latest release from trunk,
> content can now be extracted by Tika.
>
> What I need to test is whether I need to place the tika/extracting jars
> manually in a lib folder when I deploy solr.war on Resin by using the latest
> trunk version from SVN. When this is done, I can inform you.
>
> Anyway, I don't want to build a search application for my university on
> the latest version from trunk; I would rather use an official release.
> So maybe I will try to implement the changes from trunk
> instead. I can already see that Tika has a newer version in trunk
> compared to the official 1.4.1 release, i.e. tika-core-0.8.jar instead of
> tika-core-0.4.jar.
>
> Erlend
>
> --
> Erlend Garåsen
> Center for Information Technology Services
> University of Oslo
> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>
Re: Indexing Solr with the web crawler
Posted by Erlend Garåsen <e....@usit.uio.no>.
On 24.01.11 14.48, Karl Wright wrote:
> Thanks for the information.
> What I'd like to do is wait until your research is done and then post
> the rough instructions to dev@lucene.apache.org for confirmation that
> your approach is the preferred one. I'd also like to know if you
> check out the latest solr release from the svn tag and just build it,
> whether you have any of these problems. I've been building
> solr/lucene trunk and not using the binary distribution, which may be
> why I never noticed that this has gone away in the main distribution.
OK, it might take a week or so, but here are some details I just figured
out:
- There is a bug with the current Solr release (1.4.1) which makes it
impossible to extract the content by using the ExtractingRequestHandler.
I think it is related to this Jira issue:
https://issues.apache.org/jira/browse/SOLR-1902?page=com.atlassian.jira.plugin.ext.subversion%3Asubversion-commits-tabpanel
- This issue is now fixed, and if I check out the latest release from
trunk, content can now be extracted by Tika.
What I need to test is whether I need to place the tika/extracting jars
manually in a lib folder when I deploy solr.war on Resin by using the
latest trunk version from SVN. When this is done, I can inform you.
Anyway, I don't want to build a search application for my university on
the latest version from trunk; I would rather use an official release.
So maybe I will try to implement the changes from
trunk instead. I can already see that Tika has a newer version in
trunk compared to the official 1.4.1 release, i.e. tika-core-0.8.jar
instead of tika-core-0.4.jar.
Erlend
--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: Indexing Solr with the web crawler
Posted by Karl Wright <da...@gmail.com>.
Lucene Revolution could not fit me in, so my employer decided to send
me to Berlin Buzzwords in Berlin instead.
Karl
On Mon, Mar 7, 2011 at 9:01 AM, Karl Wright <da...@gmail.com> wrote:
> Ok, I think I finally have the conference schedule all worked out with
> my employer.
>
> (1) I've put in a ManifoldCF presentation proposal for Lucene
> Revolution in May 25-26. Topic is using ManifoldCF and Solr to secure
> documents. If that's accepted, great; if not, I will probably go to
> the Berlin Buzzwords conference instead, for company reasons.
>
> (2) I plan on attending (and presenting something related to
> ManifoldCF, if accepted) at ApacheCon North America on November 7-11.
> Topic TBD depending on what happens with the Lucene Revolution talk in
> May, and what people seem to be interested in hearing about. To that
> end, I'd love to hear ideas.
>
> Thanks!
> Karl
>
> On Wed, Jan 26, 2011 at 4:24 AM, Karl Wright <da...@gmail.com> wrote:
>> I'm told Eurocon will likely be sometime in October, and it will be
>> put on by Lucid, if it happens at all. So I can present ManifoldCF
>> then, if appropriate arrangements can be worked out.
>>
>> Karl
>>
>> On Mon, Jan 24, 2011 at 10:45 AM, Karl Wright <da...@gmail.com> wrote:
>>> You're right. For some reason I misread the date on the note.
>>>
>>> So it is indeed possible that I can present at the Lucene Revolution
>>> conference - but if so, that would be a search-related talk, not about
>>> ManifoldCF. I *may* be able to present at Eurocon, if it's not May
>>> 17-21. I probably wouldn't be able to do both.
>>>
>>> Karl
>>>
>>>
>>> On Mon, Jan 24, 2011 at 10:34 AM, Erik Hatcher <er...@gmail.com> wrote:
>>>>
>>>> On Jan 24, 2011, at 08:48 , Karl Wright wrote:
>>>>
>>>>> On Mon, Jan 24, 2011 at 8:40 AM, Erlend Garåsen <e....@usit.uio.no> wrote:
>>>>>> On 21.01.11 17.38, Karl Wright wrote:
>>>>>>>
>>>>>>> I will not be talking about ManifoldCF at this year's conference, most
>>>>>>> likely, because the conference conflicts with my daughter's college
>>>>>>> graduation. Sorry about that!
>>>>>>
>>>>>> I'm not sure when the conference will be held anyway - I guess the date is
>>>>>> not officially published yet.
>>>>>>
>>>>>
>>>>> I received the email last week. The conference is currently set for
>>>>> May 18 and 19 in San Francisco.
>>>>
>>>> Actually that's not correct, according to <http://www.lucidimagination.com/revolution/2011> it's May 23-26 in San Francisco, CA. And that's the Lucene *Revolution* conference, whereas it was Lucene *EuroCon* that was being mentioned by Erlend. I don't have details on EuroCon 2011 yet myself.
>>>>
>>>> Erik
>>>>
>>>>
>>>
>>
>
Re: Indexing Solr with the web crawler
Posted by Karl Wright <da...@gmail.com>.
Ok, I think I finally have the conference schedule all worked out with
my employer.
(1) I've put in a ManifoldCF presentation proposal for Lucene
Revolution in May 25-26. Topic is using ManifoldCF and Solr to secure
documents. If that's accepted, great; if not, I will probably go to
the Berlin Buzzwords conference instead, for company reasons.
(2) I plan on attending (and presenting something related to
ManifoldCF, if accepted) at ApacheCon North America on November 7-11.
Topic TBD depending on what happens with the Lucene Revolution talk in
May, and what people seem to be interested in hearing about. To that
end, I'd love to hear ideas.
Thanks!
Karl
On Wed, Jan 26, 2011 at 4:24 AM, Karl Wright <da...@gmail.com> wrote:
> I'm told Eurocon will likely be sometime in October, and it will be
> put on by Lucid, if it happens at all. So I can present ManifoldCF
> then, if appropriate arrangements can be worked out.
>
> Karl
>
> On Mon, Jan 24, 2011 at 10:45 AM, Karl Wright <da...@gmail.com> wrote:
>> You're right. For some reason I misread the date on the note.
>>
>> So it is indeed possible that I can present at the Lucene Revolution
>> conference - but if so, that would be a search-related talk, not about
>> ManifoldCF. I *may* be able to present at Eurocon, if it's not May
>> 17-21. I probably wouldn't be able to do both.
>>
>> Karl
>>
>>
>> On Mon, Jan 24, 2011 at 10:34 AM, Erik Hatcher <er...@gmail.com> wrote:
>>>
>>> On Jan 24, 2011, at 08:48 , Karl Wright wrote:
>>>
>>>> On Mon, Jan 24, 2011 at 8:40 AM, Erlend Garåsen <e....@usit.uio.no> wrote:
>>>>> On 21.01.11 17.38, Karl Wright wrote:
>>>>>>
>>>>>> I will not be talking about ManifoldCF at this year's conference, most
>>>>>> likely, because the conference conflicts with my daughter's college
>>>>>> graduation. Sorry about that!
>>>>>
>>>>> I'm not sure when the conference will be held anyway - I guess the date is
>>>>> not officially published yet.
>>>>>
>>>>
>>>> I received the email last week. The conference is currently set for
>>>> May 18 and 19 in San Francisco.
>>>
>>> Actually that's not correct, according to <http://www.lucidimagination.com/revolution/2011> it's May 23-26 in San Francisco, CA. And that's the Lucene *Revolution* conference, whereas it was Lucene *EuroCon* that was being mentioned by Erlend. I don't have details on EuroCon 2011 yet myself.
>>>
>>> Erik
>>>
>>>
>>
>
Re: Indexing Solr with the web crawler
Posted by Karl Wright <da...@gmail.com>.
I'm told Eurocon will likely be sometime in October, and it will be
put on by Lucid, if it happens at all. So I can present ManifoldCF
then, if appropriate arrangements can be worked out.
Karl
On Mon, Jan 24, 2011 at 10:45 AM, Karl Wright <da...@gmail.com> wrote:
> You're right. For some reason I misread the date on the note.
>
> So it is indeed possible that I can present at the Lucene Revolution
> conference - but if so, that would be a search-related talk, not about
> ManifoldCF. I *may* be able to present at Eurocon, if it's not May
> 17-21. I probably wouldn't be able to do both.
>
> Karl
>
>
> On Mon, Jan 24, 2011 at 10:34 AM, Erik Hatcher <er...@gmail.com> wrote:
>>
>> On Jan 24, 2011, at 08:48 , Karl Wright wrote:
>>
>>> On Mon, Jan 24, 2011 at 8:40 AM, Erlend Garåsen <e....@usit.uio.no> wrote:
>>>> On 21.01.11 17.38, Karl Wright wrote:
>>>>>
>>>>> I will not be talking about ManifoldCF at this year's conference, most
>>>>> likely, because the conference conflicts with my daughter's college
>>>>> graduation. Sorry about that!
>>>>
>>>> I'm not sure when the conference will be held anyway - I guess the date is
>>>> not officially published yet.
>>>>
>>>
>>> I received the email last week. The conference is currently set for
>>> May 18 and 19 in San Francisco.
>>
>> Actually that's not correct, according to <http://www.lucidimagination.com/revolution/2011> it's May 23-26 in San Francisco, CA. And that's the Lucene *Revolution* conference, whereas it was Lucene *EuroCon* that was being mentioned by Erlend. I don't have details on EuroCon 2011 yet myself.
>>
>> Erik
>>
>>
>
Re: Indexing Solr with the web crawler
Posted by Karl Wright <da...@gmail.com>.
You're right. For some reason I misread the date on the note.
So it is indeed possible that I can present at the Lucene Revolution
conference - but if so, that would be a search-related talk, not about
ManifoldCF. I *may* be able to present at Eurocon, if it's not May
17-21. I probably wouldn't be able to do both.
Karl
On Mon, Jan 24, 2011 at 10:34 AM, Erik Hatcher <er...@gmail.com> wrote:
>
> On Jan 24, 2011, at 08:48 , Karl Wright wrote:
>
>> On Mon, Jan 24, 2011 at 8:40 AM, Erlend Garåsen <e....@usit.uio.no> wrote:
>>> On 21.01.11 17.38, Karl Wright wrote:
>>>>
>>>> I will not be talking about ManifoldCF at this year's conference, most
>>>> likely, because the conference conflicts with my daughter's college
>>>> graduation. Sorry about that!
>>>
>>> I'm not sure when the conference will be held anyway - I guess the date is
>>> not officially published yet.
>>>
>>
>> I received the email last week. The conference is currently set for
>> May 18 and 19 in San Francisco.
>
> Actually that's not correct, according to <http://www.lucidimagination.com/revolution/2011> it's May 23-26 in San Francisco, CA. And that's the Lucene *Revolution* conference, whereas it was Lucene *EuroCon* that was being mentioned by Erlend. I don't have details on EuroCon 2011 yet myself.
>
> Erik
>
>
Re: Indexing Solr with the web crawler
Posted by Erik Hatcher <er...@gmail.com>.
On Jan 24, 2011, at 08:48 , Karl Wright wrote:
> On Mon, Jan 24, 2011 at 8:40 AM, Erlend Garåsen <e....@usit.uio.no> wrote:
>> On 21.01.11 17.38, Karl Wright wrote:
>>>
>>> I will not be talking about ManifoldCF at this year's conference, most
>>> likely, because the conference conflicts with my daughter's college
>>> graduation. Sorry about that!
>>
>> I'm not sure when the conference will be held anyway - I guess the date is
>> not officially published yet.
>>
>
> I received the email last week. The conference is currently set for
> May 18 and 19 in San Francisco.
Actually that's not correct, according to <http://www.lucidimagination.com/revolution/2011> it's May 23-26 in San Francisco, CA. And that's the Lucene *Revolution* conference, whereas it was Lucene *EuroCon* that was being mentioned by Erlend. I don't have details on EuroCon 2011 yet myself.
Erik
Re: Indexing Solr with the web crawler
Posted by Karl Wright <da...@gmail.com>.
On Mon, Jan 24, 2011 at 8:40 AM, Erlend Garåsen <e....@usit.uio.no> wrote:
> On 21.01.11 17.38, Karl Wright wrote:
>>
>> I will not be talking about ManifoldCF at this year's conference, most
>> likely, because the conference conflicts with my daughter's college
>> graduation. Sorry about that!
>
> I'm not sure when the conference will be held anyway - I guess the date is
> not officially published yet.
>
I received the email last week. The conference is currently set for
May 18 and 19 in San Francisco.
>> I hadn't heard that they removed the extracting update request handler
>> from Solr. That's unfortunate. Please let me know how hard you find
>> it to install the jar, and I'll update the instructions accordingly.
>
> It's finally working, but not perfectly. Here's what I had to do:
> - Run "ant example"
> - Create a <solr.home>/lib directory
> - Place all jars in contrib/extraction/lib/ and contrib/extraction/build/
> into this lib folder.
>
> I also had to use the schema.xml file from the example. My own schema
> configuration is different, so I guess I need to adapt it later. Content is
> missing, title is not. And maybe I need to create my own request handler in
> order to implement language detection. I will try to dive deeper into all
> the configuration settings.
>
Thanks for the information.
What I'd like to do is wait until your research is done and then post
the rough instructions to dev@lucene.apache.org for confirmation that
your approach is the preferred one. I'd also like to know if you
check out the latest solr release from the svn tag and just build it,
whether you have any of these problems. I've been building
solr/lucene trunk and not using the binary distribution, which may be
why I never noticed that this has gone away in the main distribution.
Thanks again!
Karl
Re: Indexing Solr with the web crawler
Posted by Erlend Garåsen <e....@usit.uio.no>.
On 21.01.11 17.38, Karl Wright wrote:
> I will not be talking about ManifoldCF at this year's conference, most
> likely, because the conference conflicts with my daughter's college
> graduation. Sorry about that!
I'm not sure when the conference will be held anyway - I guess the date
is not officially published yet.
> I hadn't heard that they removed the extracting update request handler
> from Solr. That's unfortunate. Please let me know how hard you find
> it to install the jar, and I'll update the instructions accordingly.
It's finally working, but not perfectly. Here's what I had to do:
- Run "ant example"
- Create a <solr.home>/lib directory
- Place all jars in contrib/extraction/lib/ and
contrib/extraction/build/ into this lib folder.
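The steps above can be sketched as a small script (illustrative only: SOLR_HOME is an assumed path, and the ant and cp steps are left as comments because they require a Solr 1.4.x source checkout):

```shell
# Sketch of the three steps above. SOLR_HOME is an assumed location
# (defaulting to ./solr-home here purely for illustration).
SOLR_HOME=${SOLR_HOME:-./solr-home}

# 1. Build the example webapp from the Solr source checkout:
#      ant example
# 2. Create the lib directory that Solr scans for plugin jars:
mkdir -p "$SOLR_HOME/lib"
# 3. Copy the extraction contrib jars into it:
#      cp contrib/extraction/lib/*.jar contrib/extraction/build/*.jar "$SOLR_HOME/lib/"
echo "lib directory ready: $SOLR_HOME/lib"
```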
I also had to use the schema.xml file from the example. My own schema
configuration is different, so I guess I need to adapt it later. Content
is missing, title is not. And maybe I need to create my own request
handler in order to implement language detection. I will try to dive
deeper into all the configuration settings.
Erlend
--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: Indexing Solr with the web crawler
Posted by Karl Wright <da...@gmail.com>.
I will not be talking about ManifoldCF at this year's conference, most
likely, because the conference conflicts with my daughter's college
graduation. Sorry about that!
I hadn't heard that they removed the extracting update request handler
from Solr. That's unfortunate. Please let me know how hard you find
it to install the jar, and I'll update the instructions accordingly.
Karl
On Fri, Jan 21, 2011 at 10:32 AM, Erlend Garåsen
<e....@usit.uio.no> wrote:
>
> I knew that I had heard your name before, Karl. You held an LCF presentation
> in Prague. Unfortunately, I attended the other presentation at track 2, so I
> missed it.
>
> I hope similar presentations will be held at this year's conference.
>
> Anyway, I figured out that it is the commit part which causes the problems.
> I entered the following url I saw from Resin's access_log:
> http://hoppalong.uio.no:8081/solr/update/extract?commit=true
>
> I'm not going to bother you with the complete stack trace, but here's the
> relevant line:
> Caused by: java.lang.ClassNotFoundException:
> org.apache.solr.handler.extraction.ExtractingRequestHandler
>
> Jack sent me a link about the ExtractingRequestHandler, and after I read
> this document I found the reason:
> "The ExtractingRequestHandler is not incorporated into the solr war file,
> you have to install it separately."
>
> So I will try to place the missing jar file into my lib folder next week.
>
> Erlend
>
>
> On 20.01.11 16.23, Erlend Garåsen wrote:
>>
>> On 20.01.11 16.15, Jack Krupansky wrote:
>>>
>>> Here's one email thread that details at least one cause of the lazy
>>> loading error:
>>>
>>>
>>> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200910.mbox/%3C4AD5EC8C.6000308@gmail.com%3E
>>>
>>
>> Thanks. Now I can see that I have the following lines in Resin's access
>> log:
>> 127.0.0.1 - - [20/Jan/2011:16:19:09 +0100] "GET
>> /solr/update/extract?commit=true HTTP/1.0" 500 5598 "-" "-"
>>
>> I run Solr on Resin, so maybe there is something more I need to
>> configure. I'll take a deeper look at this right now.
>>
>> Erlend
>>
>
>
> --
> Erlend Garåsen
> Center for Information Technology Services
> University of Oslo
> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>
Re: Indexing Solr with the web crawler
Posted by Erlend Garåsen <e....@usit.uio.no>.
I knew that I had heard your name before, Karl. You held an LCF
presentation in Prague. Unfortunately, I attended the other presentation
at track 2, so I missed it.
I hope similar presentations will be held at this year's conference.
Anyway, I figured out that it is the commit part which causes the
problems. I entered the following url I saw from Resin's access_log:
http://hoppalong.uio.no:8081/solr/update/extract?commit=true
I'm not going to bother you with the complete stack trace, but here's
the relevant line:
Caused by: java.lang.ClassNotFoundException:
org.apache.solr.handler.extraction.ExtractingRequestHandler
Jack sent me a link about the ExtractingRequestHandler, and after I read
this document I found the reason:
"The ExtractingRequestHandler is not incorporated into the solr war
file, you have to install it separately."
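For context, the registration behind this is, if I recall the Solr 1.4 example config correctly, something like the following in solrconfig.xml (a sketch, not copied from any particular install); the startup="lazy" attribute would explain why a missing jar surfaces as a "lazy loading error" on the first request rather than as a startup failure:

```xml
<!-- Sketch, modeled on the Solr 1.4 example solrconfig.xml.
     With startup="lazy", the handler class is only loaded on the
     first request, so a missing extraction jar shows up as a
     "lazy loading error" / ClassNotFoundException at query time. -->
<requestHandler name="/update/extract"
                class="org.apache.solr.handler.extraction.ExtractingRequestHandler"
                startup="lazy"/>
```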
So I will try to place the missing jar file into my lib folder next week.
Erlend
On 20.01.11 16.23, Erlend Garåsen wrote:
> On 20.01.11 16.15, Jack Krupansky wrote:
>> Here's one email thread that details at least one cause of the lazy
>> loading error:
>>
>> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200910.mbox/%3C4AD5EC8C.6000308@gmail.com%3E
>>
>
> Thanks. Now I can see that I have the following lines in Resin's access
> log:
> 127.0.0.1 - - [20/Jan/2011:16:19:09 +0100] "GET
> /solr/update/extract?commit=true HTTP/1.0" 500 5598 "-" "-"
>
> I run Solr on Resin, so maybe there is something more I need to
> configure. I'll take a deeper look at this right now.
>
> Erlend
>
--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: Indexing Solr with the web crawler
Posted by Erlend Garåsen <e....@usit.uio.no>.
On 20.01.11 16.15, Jack Krupansky wrote:
> Here's one email thread that details at least one cause of the lazy
> loading error:
>
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200910.mbox/%3C4AD5EC8C.6000308@gmail.com%3E
Thanks. Now I can see that I have the following lines in Resin's access log:
127.0.0.1 - - [20/Jan/2011:16:19:09 +0100] "GET
/solr/update/extract?commit=true HTTP/1.0" 500 5598 "-" "-"
I run Solr on Resin, so maybe there is something more I need to
configure. I'll take a deeper look at this right now.
Erlend
--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: Indexing Solr with the web crawler
Posted by Jack Krupansky <ja...@lucidimagination.com>.
Here's one email thread that details at least one cause of the lazy loading
error:
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200910.mbox/%3C4AD5EC8C.6000308@gmail.com%3E
-- Jack Krupansky
-----Original Message-----
From: Karl Wright
Sent: Thursday, January 20, 2011 10:02 AM
To: connectors-user@incubator.apache.org
Subject: Re: Indexing Solr with the web crawler
> It says:
>
> 01-20-2011 15:14:18.914 document ingest (solr_indexer)
> http://ridder.uio.no/
> 500 588 9 lazy loading error
So what is happening is that either your solr instance or your Solr
output connection is misconfigured, and when ManifoldCF tries to send
the document to Solr it returns with an error. I don't know what
Solr's "lazy loading error" is, but hopefully you can find out either
from the doc or from the Solr/Lucene newsgroup.
> Thanks for clarifying. I can try to configure Solr to parse these
> documents.
> Nutch did a good job except that it cannot detect whether a document was
> modified in order to send an update/delete command to Solr. That function
> is crucial for us.
That's exactly what ManifoldCF is good at.
> I'm unsure about what you mean by anonymous fields in Solr. I cannot
> define
> the fields I need in schema.xml as I want? I have created duplicate fields
> for title and content in order to use different stemmers (I need to
> support
> English and Norwegian). In Nutch there is a simple configuration file for
> mapping fields from Nutch to Solr.
I'm probably using the wrong terminology. I think they are actually
called "dynamic fields".
> I haven't filled out the "expiration interval (if continuous)." under the
> scheduling folder. Is this the reason why ManifoldCF is recrawling the
> page
> every minute?
The reason it's retrying is because the Solr connector is getting that
error, and it's telling ManifoldCF that it should retry. That's
because it hasn't figured out that the error is due to setup, rather
than some transient condition.
The expiration model for continuous crawling is going to take more to
describe than I can here. I suggest you read about it in the online
end-user documentation. If that's not enough, there's a book on the
way from Manning Publishing, called ManifoldCF in Action. There
should be some chapters that might help you available soon through the
Manning Early Access Program.
Thanks!
Karl
On Thu, Jan 20, 2011 at 9:50 AM, Erlend Garåsen <e....@usit.uio.no>
wrote:
> On 20.01.11 15.21, Karl Wright wrote:
>>
>> Hi Erlend,
>
> Hi Karl,
>
> Thank you for replying and for your comments. It's much appreciated.
>
>> (1) The best way to find out what ManifoldCF thinks it is doing is to
>> look at the Simple History report in the UI.
>
> It says:
>
> 01-20-2011 15:14:18.914 document ingest (solr_indexer)
> http://ridder.uio.no/
> 500 588 9 lazy loading error
> 01-20-2011 15:14:18.800 fetch http://ridder.uio.no/
> 200 588 103
> 01-20-2011 15:13:18.581 document ingest (solr_indexer)
> http://ridder.uio.no/
> 500 588 16 lazy loading error
> 01-20-2011 15:13:18.448 fetch http://ridder.uio.no/
> 200 588 111
>
>
>> (2) The Web Connector in ManifoldCF does not have the ability, at this
>> time, to extract links from Word docs, pdfs, etc., but Solr can
>> extract *content* from these documents if you configure it to use
>> Tika. The document is sent to Solr in binary form, and Tika extracts
>> whatever metadata it can find. ManifoldCF does not get involved in
>> that at all. Usually, setting up Solr with anonymous fields is the
>> way to go in this case.
>
> Thanks for clarifying. I can try to configure Solr to parse these
> documents.
> Nutch did a good job except that it cannot detect whether a document was
> modified in order to send an update/delete command to Solr. That function
> is crucial for us.
>
> I'm unsure about what you mean by anonymous fields in Solr. I cannot
> define
> the fields I need in schema.xml as I want? I have created duplicate fields
> for title and content in order to use different stemmers (I need to
> support
> English and Norwegian). In Nutch there is a simple configuration file for
> mapping fields from Nutch to Solr.
>
>> If this is an open site, I'll crawl it here myself momentarily and let
>> you know what I find.
>
> Please do that. It's just my workstation with an Apache server running.
> It's
> open.
>
> BTW, I think I have set things up correctly for the crawler:
> Seeds: http://ridder.uio.no/
> Inclusions: ^http://ridder.uio.no/.* (checked for "include only hosts
> matching seeds")
>
> I haven't filled out the "expiration interval (if continuous)." under the
> scheduling folder. Is this the reason why ManifoldCF is recrawling the
> page
> every minute?
>
> Erlend
>
> --
> Erlend Garåsen
> Center for Information Technology Services
> University of Oslo
> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP:
> 31050
>
Re: Indexing Solr with the web crawler
Posted by Karl Wright <da...@gmail.com>.
> It says:
>
> 01-20-2011 15:14:18.914 document ingest (solr_indexer)
> http://ridder.uio.no/
> 500 588 9 lazy loading error
So what is happening is that either your solr instance or your Solr
output connection is misconfigured, and when ManifoldCF tries to send
the document to Solr it returns with an error. I don't know what
Solr's "lazy loading error" is, but hopefully you can find out either
from the doc or from the Solr/Lucene newsgroup.
> Thanks for clarifying. I can try to configure Solr to parse these documents.
> Nutch did a good job except that it cannot detect whether a document was
> modified in order to send an update/delete command to Solr. That function
> is crucial for us.
That's exactly what ManifoldCF is good at.
> I'm unsure about what you mean by anonymous fields in Solr. I cannot define
> the fields I need in schema.xml as I want? I have created duplicate fields
> for title and content in order to use different stemmers (I need to support
> English and Norwegian). In Nutch there is a simple configuration file for
> mapping fields from Nutch to Solr.
I'm probably using the wrong terminology. I think they are actually
called "dynamic fields".
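A dynamic field rule in schema.xml looks something like this (a sketch; the *_t name and text type are illustrative, not taken from any specific schema):

```xml
<!-- Sketch of a dynamicField rule in schema.xml; the name pattern
     and type are illustrative. Any incoming field whose name ends
     in "_t" (e.g. metadata Tika extracts) is accepted as an indexed
     text field without being declared explicitly. -->
<dynamicField name="*_t" type="text" indexed="true" stored="true"/>
```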
> I haven't filled out the "expiration interval (if continuous)." under the
> scheduling folder. Is this the reason why ManifoldCF is recrawling the page
> every minute?
The reason it's retrying is because the Solr connector is getting that
error, and it's telling ManifoldCF that it should retry. That's
because it hasn't figured out that the error is due to setup, rather
than some transient condition.
The expiration model for continuous crawling is going to take more to
describe than I can here. I suggest you read about it in the online
end-user documentation. If that's not enough, there's a book on the
way from Manning Publishing, called ManifoldCF in Action. There
should be some chapters that might help you available soon through the
Manning Early Access Program.
Thanks!
Karl
On Thu, Jan 20, 2011 at 9:50 AM, Erlend Garåsen <e....@usit.uio.no> wrote:
> On 20.01.11 15.21, Karl Wright wrote:
>>
>> Hi Erlend,
>
> Hi Karl,
>
> Thank you for replying and for your comments. It's much appreciated.
>
>> (1) The best way to find out what ManifoldCF thinks it is doing is to
>> look at the Simple History report in the UI.
>
> It says:
>
> 01-20-2011 15:14:18.914 document ingest (solr_indexer)
> http://ridder.uio.no/
> 500 588 9 lazy loading error
> 01-20-2011 15:14:18.800 fetch http://ridder.uio.no/
> 200 588 103
> 01-20-2011 15:13:18.581 document ingest (solr_indexer)
> http://ridder.uio.no/
> 500 588 16 lazy loading error
> 01-20-2011 15:13:18.448 fetch http://ridder.uio.no/
> 200 588 111
>
>
>> (2) The Web Connector in ManifoldCF does not have the ability, at this
>> time, to extract links from Word docs, pdfs, etc., but Solr can
>> extract *content* from these documents if you configure it to use
>> Tika. The document is sent to Solr in binary form, and Tika extracts
>> whatever metadata it can find. ManifoldCF does not get involved in
>> that at all. Usually, setting up Solr with anonymous fields is the
>> way to go in this case.
>
> Thanks for clarifying. I can try to configure Solr to parse these documents.
> Nutch did a good job, except that it cannot detect whether a document was
> modified in order to send an update/delete command to Solr. That function
> is crucial for us.
>
> I'm unsure about what you mean by anonymous fields in Solr. Can't I define
> the fields I need in schema.xml as I want? I have created duplicate fields
> for title and content in order to use different stemmers (I need to support
> English and Norwegian). In Nutch there is a simple configuration file for
> mapping fields from Nutch to Solr.
>
>> If this is an open site, I'll crawl it here myself momentarily and let
>> you know what I find.
>
> Please do that. It's just my workstation with an Apache server running. It's
> open.
>
> BTW, I think I have set things up correctly for the crawler:
> Seeds: http://ridder.uio.no/
> Inclusions: ^http://ridder.uio.no/.* (checked "include only hosts
> matching seeds")
>
> I haven't filled out the "expiration interval (if continuous)" setting under
> the scheduling folder. Is this the reason why ManifoldCF is recrawling the
> page every minute?
>
> Erlend
>
>
Re: Indexing Solr with the web crawler
Posted by Erlend Garåsen <e....@usit.uio.no>.
On 20.01.11 15.21, Karl Wright wrote:
> Hi Erlend,
Hi Karl,
Thank you for replying and for your comments. It's much appreciated.
> (1) The best way to find out what ManifoldCF thinks it is doing is to
> look at the Simple History report in the UI.
It says:
01-20-2011 15:14:18.914  document ingest (solr_indexer)  http://ridder.uio.no/  500  588  9  lazy loading error
01-20-2011 15:14:18.800  fetch  http://ridder.uio.no/  200  588  103
01-20-2011 15:13:18.581  document ingest (solr_indexer)  http://ridder.uio.no/  500  588  16  lazy loading error
01-20-2011 15:13:18.448  fetch  http://ridder.uio.no/  200  588  111
> (2) The Web Connector in ManifoldCF does not have the ability, at this
> time, to extract links from Word docs, pdfs, etc., but Solr can
> extract *content* from these documents if you configure it to use
> Tika. The document is sent to Solr in binary form, and Tika extracts
> whatever metadata it can find. ManifoldCF does not get involved in
> that at all. Usually, setting up Solr with anonymous fields is the
> way to go in this case.
Thanks for clarifying. I can try to configure Solr to parse these
documents. Nutch did a good job, except that it cannot detect whether a
document was modified in order to send an update/delete command to
Solr. That function is crucial for us.
I'm unsure about what you mean by anonymous fields in Solr. Can't I
define the fields I need in schema.xml as I want? I have created
duplicate fields for title and content in order to use different
stemmers (I need to support English and Norwegian). In Nutch there is a
simple configuration file for mapping fields from Nutch to Solr.
> If this is an open site, I'll crawl it here myself momentarily and let
> you know what I find.
Please do that. It's just my workstation with an Apache server running.
It's open.
BTW, I think I have set things up correctly for the crawler:
Seeds: http://ridder.uio.no/
Inclusions: ^http://ridder.uio.no/.* (checked "include only hosts
matching seeds")
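A small sketch of how an inclusion pattern like the one above filters discovered URLs. The candidate URLs are made up for illustration; note also that the unescaped dots in "ridder.uio.no" match any character, so escaping them makes the filter stricter:

```python
import re

# The inclusion pattern from the job configuration, with the dots
# escaped so they only match literal dots.
inclusion = re.compile(r"^http://ridder\.uio\.no/.*")

# Hypothetical URLs a crawl might discover.
candidates = [
    "http://ridder.uio.no/docs/test.pdf",  # matches the pattern
    "http://www.uio.no/index.html",        # filtered out
]

matched = [u for u in candidates if inclusion.match(u)]
print(matched)  # -> ['http://ridder.uio.no/docs/test.pdf']
```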
I haven't filled out the "expiration interval (if continuous)" setting
under the scheduling folder. Is this the reason why ManifoldCF is
recrawling the page every minute?
Erlend
--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: Indexing Solr with the web crawler
Posted by Karl Wright <da...@gmail.com>.
Hmm, right now I'm behind a firewall, unfortunately, so I won't be
able to try this myself until this evening. But if you post the
output of your Simple History report, I can help interpret it for you.
Karl
On Thu, Jan 20, 2011 at 9:21 AM, Karl Wright <da...@gmail.com> wrote:
> Hi Erlend,
>
> (1) The best way to find out what ManifoldCF thinks it is doing is to
> look at the Simple History report in the UI.
>
> (2) The Web Connector in ManifoldCF does not have the ability, at this
> time, to extract links from Word docs, pdfs, etc., but Solr can
> extract *content* from these documents if you configure it to use
> Tika. The document is sent to Solr in binary form, and Tika extracts
> whatever metadata it can find. ManifoldCF does not get involved in
> that at all. Usually, setting up Solr with anonymous fields is the
> way to go in this case.
>
> If this is an open site, I'll crawl it here myself momentarily and let
> you know what I find.
>
> Karl
>
>
>
> On Thu, Jan 20, 2011 at 9:08 AM, Erlend Garåsen <e....@usit.uio.no> wrote:
>>
>> I have started the Jetty server, configured the web crawler, a Solr
>> connector and created a job. First I try to crawl the following site:
>> http://ridder.uio.no/
>> which contains nothing but an index.html with links to different kinds of
>> document types (pdf, html, doc etc.).
>>
>> I have three questions.
>>
>> 1. Why do I now have a lot of these lines in the above host's access_log
>> after the crawler has been started?
>> 193.157.137.137 - - [20/Jan/2011:14:28:54 +0100] "GET / HTTP/1.1" 200 588
>> "-" "ApacheManifoldCFWebCrawler;"
>> 193.157.137.137 - - [20/Jan/2011:14:29:54 +0100] "GET / HTTP/1.1" 200 588
>> "-" "ApacheManifoldCFWebCrawler;"
>> 193.157.137.137 - - [20/Jan/2011:14:30:55 +0100] "GET / HTTP/1.1" 200 588
>> "-" "ApacheManifoldCFWebCrawler;"
>> 193.157.137.137 - - [20/Jan/2011:14:31:55 +0100] "GET / HTTP/1.1" 200 588
>> "-" "ApacheManifoldCFWebCrawler;"
>>
>> What is the crawler trying to do which it probably cannot do? Why is it
>> fetching the same URL over and over again?
>>
>> 2. How can I index Solr when I don't know which fields ManifoldCF's web
>> crawler collects? There is a field mapper in the job configuration, but I
>> only know about the fields I have configured in Solr's schema.xml.
>>
>> 3. Will the web crawler parse document types such as PDF, doc, rtf etc.? If
>> it does not use Apache Tika, is it possible to configure the web crawler to
>> use Tika for document parsing and language detection?
>>
>> Erlend
>>
>>
>
Re: Indexing Solr with the web crawler
Posted by Karl Wright <da...@gmail.com>.
Hi Erlend,
(1) The best way to find out what ManifoldCF thinks it is doing is to
look at the Simple History report in the UI.
(2) The Web Connector in ManifoldCF does not have the ability, at this
time, to extract links from Word docs, pdfs, etc., but Solr can
extract *content* from these documents if you configure it to use
Tika. The document is sent to Solr in binary form, and Tika extracts
whatever metadata it can find. ManifoldCF does not get involved in
that at all. Usually, setting up Solr with anonymous fields is the
way to go in this case.
If this is an open site, I'll crawl it here myself momentarily and let
you know what I find.
Karl
On Thu, Jan 20, 2011 at 9:08 AM, Erlend Garåsen <e....@usit.uio.no> wrote:
>
> I have started the Jetty server, configured the web crawler, a Solr
> connector and created a job. First I try to crawl the following site:
> http://ridder.uio.no/
> which contains nothing but an index.html with links to different kinds of
> document types (pdf, html, doc etc.).
>
> I have three questions.
>
> 1. Why do I now have a lot of these lines in the above host's access_log
> after the crawler has been started?
> 193.157.137.137 - - [20/Jan/2011:14:28:54 +0100] "GET / HTTP/1.1" 200 588
> "-" "ApacheManifoldCFWebCrawler;"
> 193.157.137.137 - - [20/Jan/2011:14:29:54 +0100] "GET / HTTP/1.1" 200 588
> "-" "ApacheManifoldCFWebCrawler;"
> 193.157.137.137 - - [20/Jan/2011:14:30:55 +0100] "GET / HTTP/1.1" 200 588
> "-" "ApacheManifoldCFWebCrawler;"
> 193.157.137.137 - - [20/Jan/2011:14:31:55 +0100] "GET / HTTP/1.1" 200 588
> "-" "ApacheManifoldCFWebCrawler;"
>
> What is the crawler trying to do which it probably cannot do? Why is it
> fetching the same URL over and over again?
>
> 2. How can I index Solr when I don't know which fields ManifoldCF's web
> crawler collects? There is a field mapper in the job configuration, but I
> only know about the fields I have configured in Solr's schema.xml.
>
> 3. Will the web crawler parse document types such as PDF, doc, rtf etc.? If
> it does not use Apache Tika, is it possible to configure the web crawler to
> use Tika for document parsing and language detection?
>
> Erlend
>
>
Re: Indexing Solr with the web crawler
Posted by Jack Krupansky <ja...@lucidimagination.com>.
The Solr connector is designed to send raw document content (unparsed) to
Solr Cell (the ExtractingRequestHandler), which then uses Tika for MIME type
detection and document parsing. If you run Tika directly, it will tell you
what metadata is extracted from a particular document type, which varies.
See:
http://wiki.apache.org/solr/ExtractingRequestHandler
You can also call Solr Cell with the "Extract Only" option to see what
Tika generates within Solr Cell for a particular input document, and then
use those metadata field names to construct MCF field mappings to your
schema fields.
See:
http://wiki.apache.org/solr/ExtractingRequestHandler#Extract_Only
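A minimal sketch of calling the handler in extract-only mode, so you can see which metadata fields Tika produces before setting up field mappings. The host, port, and handler path below are the stock Solr defaults and are assumptions here; adjust them to your installation:

```python
from urllib.parse import urlencode

# Assumed default location of Solr Cell's extraction handler.
SOLR_EXTRACT = "http://localhost:8983/solr/update/extract"

params = {
    "extractOnly": "true",  # return Tika's output instead of indexing it
    "wt": "json",           # JSON responses are easier to scan than XML
}
url = SOLR_EXTRACT + "?" + urlencode(params)
print(url)

# The document itself is then POSTed as the raw request body, e.g.:
#   curl "$URL" --data-binary @some.pdf -H "Content-Type: application/pdf"
# The response lists the metadata field names Tika extracted, which you
# can then map to your schema fields in the MCF job configuration.
```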
-- Jack Krupansky
-----Original Message-----
From: Erlend Garåsen
Sent: Thursday, January 20, 2011 9:08 AM
To: connectors-user@incubator.apache.org
Subject: Indexing Solr with the web crawler
I have started the Jetty server, configured the web crawler, a Solr
connector and created a job. First I try to crawl the following site:
http://ridder.uio.no/
which contains nothing but an index.html with links to different kinds
of document types (pdf, html, doc etc.).
I have three questions.
1. Why do I now have a lot of these lines in the above host's access_log
after the crawler has been started?
193.157.137.137 - - [20/Jan/2011:14:28:54 +0100] "GET / HTTP/1.1" 200
588 "-" "ApacheManifoldCFWebCrawler;"
193.157.137.137 - - [20/Jan/2011:14:29:54 +0100] "GET / HTTP/1.1" 200
588 "-" "ApacheManifoldCFWebCrawler;"
193.157.137.137 - - [20/Jan/2011:14:30:55 +0100] "GET / HTTP/1.1" 200
588 "-" "ApacheManifoldCFWebCrawler;"
193.157.137.137 - - [20/Jan/2011:14:31:55 +0100] "GET / HTTP/1.1" 200
588 "-" "ApacheManifoldCFWebCrawler;"
What is the crawler trying to do which it probably cannot do? Why is it
fetching the same URL over and over again?
2. How can I index Solr when I don't know which fields ManifoldCF's web
crawler collects? There is a field mapper in the job configuration, but
I only know about the fields I have configured in Solr's schema.xml.
3. Will the web crawler parse document types such as PDF, doc, rtf etc.?
If it does not use Apache Tika, is it possible to configure the web
crawler to use Tika for document parsing and language detection?
Erlend