Posted to user@nutch.apache.org by Claudio Martella <cl...@tis.bz.it> on 2009/12/16 17:36:10 UTC

Accessing crawled data

Hello list,

I'm using Nutch 1.0 to crawl some intranet sites, and I want to later push the crawled
data into my Solr server. Although Nutch 1.0 comes with Solr support
"out of the box", I don't think that solution fits my case. I need to run my
own code on the crawled data, specifically on what comes out AFTER the
parser (I'm crawling PDF, DOC, etc.): I want to extract keywords with my own
code, and I want to do language detection to choose which fields to put the
text into (each of my Solr fields has its own stopwords and Snowball
stemming). With the crawl command I end up with data that has already been
indexed, which is what I'd like to avoid.

Basically, I'd like the crawler to "mirror" the sites and give me, for each
URL, the text at that URL. Any suggestions, please?

Claudio

-- 
Claudio Martella
Digital Technologies
Unit Research & Development - Analyst

TIS innovation park
Via Siemens 19 | Siemensstr. 19
39100 Bolzano | 39100 Bozen
Tel. +39 0471 068 123
Fax  +39 0471 068 129
claudio.martella@tis.bz.it http://www.tis.bz.it




Re: Accessing crawled data

Posted by Claudio Martella <cl...@tis.bz.it>.
Andrzej Bialecki wrote:
> On 2009-12-22 16:07, Claudio Martella wrote:
>> Andrzej Bialecki wrote:
>>> On 2009-12-22 13:16, Claudio Martella wrote:
>>>> Yes, I'am aware of that. The problem is that i have some fields of the
>>>> SolrDocument that i want to compute by text analysis (basically i want
>>>> to do some smart keywords extraction) so i have to get in the middle
>>>> between crawling and indexing! My actual solution is to dump the
>>>> content
>>>> in a file through the segreader, parse it and then use SolrJ to
>>>> send the
>>>> documents. Probably the best solution is to set my own analyzer for
>>>> the
>>>> field on solr side, and do keywords extraction there.
>>>>
>>>> Thanks for the script, you'll use it!
>>>
>>> Likely the solution that you are looking for is an IndexingFilter -
>>> this receives a copy of the document with all fields collected just
>>> before it's sent to the indexing backend - and you can freely modify
>>> the content of NutchDocument, e.g. do additional analysis,
>>> add/remove/modify fields, etc.
>>>
>> This sounds very interesting. So the idea is to take the NutchDocument
>> as it comes out of the crawling and modify it (inside of an
>> IndexingFilter) before it's sent to indexing (inside of nutch),  right?
>
> Correct - IndexingFilter-s work no matter whether you use Nutch or
> Solr indexing.
>
>> So how does it relate to nutch schema and solr schema? Can you give me
>> some pointers?
>>
>
> Please take a look at how e.g. the index-more filter is implemented -
> basically you need to copy this filter and make whatever modifications
> you need ;)
>
> Keep in mind that any fields that you create in NutchDocument need to
> be properly declared in schema.xml when using Solr indexing.
>
OK, I understand the rationale behind this.
Another question is how to set up the pipeline. For instance, I want the
LanguageIdentifier to run first and then move the content into a
language-specific field (imagine something called "content-$lang"), because
I want each content-* field to have its own filtering (stopwords, stemming,
etc.) on the Solr side.
What I still don't get is where I decide when the LanguageIdentifier runs
and when my own filter does.
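
For illustration, a rough sketch of the filter I have in mind, assuming the
language-identifier plugin has already added a "lang" field to the document
and that the indexingfilter.order property in nutch-site.xml lists it before
this filter. Class, package and field names are made up, and the exact
interface methods and NutchDocument accessors may differ slightly per Nutch
version:

// Sketch only: relies on the "lang" field added by the language-identifier
// plugin, which must run before this filter (see the indexingfilter.order
// property in nutch-site.xml).
package org.example.nutch;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class LanguageFieldIndexingFilter implements IndexingFilter {

  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    String lang = doc.getFieldValue("lang");   // set by language-identifier
    if (lang == null) {
      lang = "en";                             // arbitrary fallback
    }
    // Put the parsed text into a language-specific field (content-it,
    // content-de, ...); each one must be declared in Solr's schema.xml.
    doc.add("content-" + lang, parse.getText());
    return doc;
  }

  public void addIndexBackendOptions(Configuration conf) {
    // only used by the Lucene indexing backend; nothing to do for Solr
  }

  public Configuration getConf() {
    return conf;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }
}

The class still has to be wrapped in a plugin (a plugin.xml pointing at the
org.apache.nutch.indexer.IndexingFilter extension point) and the plugin id
added to plugin.includes.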

thanks

Claudio

-- 
Claudio Martella
Digital Technologies
Unit Research & Development - Analyst

TIS innovation park
Via Siemens 19 | Siemensstr. 19
39100 Bolzano | 39100 Bozen
Tel. +39 0471 068 123
Fax  +39 0471 068 129
claudio.martella@tis.bz.it http://www.tis.bz.it




Re: Accessing crawled data

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2009-12-22 16:07, Claudio Martella wrote:
> Andrzej Bialecki wrote:
>> On 2009-12-22 13:16, Claudio Martella wrote:
>>> Yes, I'am aware of that. The problem is that i have some fields of the
>>> SolrDocument that i want to compute by text analysis (basically i want
>>> to do some smart keywords extraction) so i have to get in the middle
>>> between crawling and indexing! My actual solution is to dump the content
>>> in a file through the segreader, parse it and then use SolrJ to send the
>>> documents. Probably the best solution is to set my own analyzer for the
>>> field on solr side, and do keywords extraction there.
>>>
>>> Thanks for the script, you'll use it!
>>
>> Likely the solution that you are looking for is an IndexingFilter -
>> this receives a copy of the document with all fields collected just
>> before it's sent to the indexing backend - and you can freely modify
>> the content of NutchDocument, e.g. do additional analysis,
>> add/remove/modify fields, etc.
>>
> This sounds very interesting. So the idea is to take the NutchDocument
> as it comes out of the crawling and modify it (inside of an
> IndexingFilter) before it's sent to indexing (inside of nutch),  right?

Correct - IndexingFilter-s work no matter whether you use Nutch or Solr 
indexing.

> So how does it relate to nutch schema and solr schema? Can you give me
> some pointers?
>

Please take a look at how e.g. the index-more filter is implemented - 
basically you need to copy this filter and make whatever modifications 
you need ;)

Keep in mind that any fields that you create in NutchDocument need to be 
properly declared in schema.xml when using Solr indexing.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Accessing crawled data

Posted by Claudio Martella <cl...@tis.bz.it>.
Andrzej Bialecki wrote:
> On 2009-12-22 13:16, Claudio Martella wrote:
>> Yes, I'am aware of that. The problem is that i have some fields of the
>> SolrDocument that i want to compute by text analysis (basically i want
>> to do some smart keywords extraction) so i have to get in the middle
>> between crawling and indexing! My actual solution is to dump the content
>> in a file through the segreader, parse it and then use SolrJ to send the
>> documents. Probably the best solution is to set my own analyzer for the
>> field on solr side, and do keywords extraction there.
>>
>> Thanks for the script, you'll use it!
>
> Likely the solution that you are looking for is an IndexingFilter -
> this receives a copy of the document with all fields collected just
> before it's sent to the indexing backend - and you can freely modify
> the content of NutchDocument, e.g. do additional analysis,
> add/remove/modify fields, etc.
>
This sounds very interesting. So the idea is to take the NutchDocument
as it comes out of the crawl and modify it (inside an IndexingFilter)
before it is sent to indexing (inside Nutch), right?
So how does it relate to the Nutch schema and the Solr schema? Can you give
me some pointers?

-- 
Claudio Martella
Digital Technologies
Unit Research & Development - Analyst

TIS innovation park
Via Siemens 19 | Siemensstr. 19
39100 Bolzano | 39100 Bozen
Tel. +39 0471 068 123
Fax  +39 0471 068 129
claudio.martella@tis.bz.it http://www.tis.bz.it




Re: Accessing crawled data

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2009-12-22 13:16, Claudio Martella wrote:
> Yes, I'am aware of that. The problem is that i have some fields of the
> SolrDocument that i want to compute by text analysis (basically i want
> to do some smart keywords extraction) so i have to get in the middle
> between crawling and indexing! My actual solution is to dump the content
> in a file through the segreader, parse it and then use SolrJ to send the
> documents. Probably the best solution is to set my own analyzer for the
> field on solr side, and do keywords extraction there.
>
> Thanks for the script, you'll use it!

Likely the solution that you are looking for is an IndexingFilter - this 
receives a copy of the document with all fields collected just before 
it's sent to the indexing backend - and you can freely modify the 
content of NutchDocument, e.g. do additional analysis, add/remove/modify 
fields, etc.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Accessing crawled data

Posted by Claudio Martella <cl...@tis.bz.it>.
Yes, I'm aware of that. The problem is that I have some fields of the
SolrDocument that I want to compute by text analysis (basically some smart
keyword extraction), so I have to get in between crawling and indexing.
My current solution is to dump the content to a file with the segment
reader, parse it, and then use SolrJ to send the documents. Probably the
best solution is to set my own analyzer for the field on the Solr side and
do the keyword extraction there.

Thanks for the script, I'll use it!

Just one question about it: your approach will also dump pages outside of
the domain that are referenced by URLs. How far does it go?
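
For reference, the "send with SolrJ" step could look roughly like this
(SolrJ 1.x-era API; the Solr URL and the field names are invented and have
to match your schema.xml):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class PushToSolr {

  public static void main(String[] args) throws Exception {
    // hypothetical Solr URL and field names -- adjust to your setup
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "http://intranet.example/some/page.pdf");
    doc.addField("content-it", "... parsed text after keyword extraction ...");
    doc.addField("keywords", "example");

    server.add(doc);
    server.commit();
  }
}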


BELLINI ADAM wrote:
> hi
>
> do you know that you can index your data for SOLR...the command is solrindex. so you dont need the nutch index.
>
> and dont you know that you are not obliged to use crawl command ? so if you want so skip index steps you can just use your own steps to inject, generate, fetch, update in a loop. at the end of the loop you can index your data to solr with solrindex command..here is the code
>   
>> steps=10
>> echo "----- Inject (Step 1 of $steps) -----"
>> $NUTCH_HOME/bin/nutch inject $crawl/crawldb urls
>>
>> echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
>> for((i=0; i < $depth; i++))
>> do
>>  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
>>
>>
>> $NUTCH_HOME/bin/nutch generate $crawl/crawldb $crawl/segments
>>
>>  if [ $? -ne 0 ]
>>  then
>>    echo "runbot: Stopping at depth $depth. No more URLs to fetch."
>>    break
>>  fi
>>  segment=`ls -d $crawl/segments/* | tail -1`
>>
>>  $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
>>  if [ $? -ne 0 ]
>>  then
>>    echo "runbot: fetch $segment at depth `expr $i + 1` failed."
>>    echo "runbot: Deleting segment $segment."
>>    rm $RMARGS $segment
>>    continue
>>  fi
>>
>> echo " ----- Updating Dadatabase ( $steps) -----"
>>
>>
>>  $NUTCH_HOME/bin/nutch updatedb $crawl/crawldb $segment
>> done
>>
>> $NUTCH_HOME/bin/nutch solrindex URL_OF_YOUR_SOLR_SERVER $crawl/crawldb $crawl/linkdb $crawl/segments/*
>>     
>
>
> and you can also use segreader to extract content of the pages : here is the command :
>
>
> ./bin/nutch readseg -dump crawl_dublin/segments/20091001145126/ dump_folder -nofetch -nogenerate -noparse -noparsedata -noparsetext
>
>
> this command will return only the content (source pages)
>
>
> hope it will help.
>
>
>
>   
>> Date: Thu, 17 Dec 2009 15:32:33 +0100
>> From: claudio.martella@tis.bz.it
>> To: nutch-user@lucene.apache.org
>> Subject: Re: Accessing crawled data
>>
>> Hi,
>>
>> actually i completely mis-explained myself. I'll try to make myself
>> clear: i'd like to extract the information in the segments by using the
>> parsers.
>>
>> This means i can basically use the crawl command but this will also
>> index the data and that's a waist of resources. So basically what i will
>> do is copy the code in the org.apache.nutch.craw.Crawl until the data is
>> indexed and will skip that part. What I'm missing at the moment (and
>> maybe i should check the readseg command
>> (org.apache.nutch.segment.SegmentReader class)) is the understanding of
>> how i can basically extract, from the segments, the list of urls fetched
>> and the text connected to the urls. After i have the list of urls and
>> the function:url<->text i can use my text analysis algorithms and create
>> the xml messages to send to my solr server.
>>
>> Any pointer to how to handle segments to extract text or extract list of
>> all the urls in the db?
>>
>> thanks
>>
>> Claudio
>>
>>
>> reinhard schwab wrote:
>>     
>>> if you dont want to refetch already fetched pages,
>>> i think of 3 possibilities:
>>>
>>> a/ set a very high fetch interval
>>> b/ use a customized fetch schedule class instead of DefaultFetchSchedule
>>> implement there a method
>>> public boolean shouldFetch(Text url, CrawlDatum datum, long curTime)
>>> which returns false if a datum is already fetched.
>>> this should theoretically work, have not done this.
>>> in nutch-site.xml you have then to set the property
>>>
>>> <property>
>>>   <name>db.fetch.schedule.class</name>
>>>   <value>org.apache.nutch.crawl.CustomizedFetchSchedule</value>
>>>   <description>The implementation of fetch schedule.
>>> DefaultFetchSchedule simply
>>>   adds the original fetchInterval to the last fetch time, regardless of
>>>   page changes.</description>
>>> </property>
>>>
>>> c/ modify the class you are now using as fetch schedule class and adapt
>>> the method shouldFetch
>>> to the behaviour you want
>>>
>>> regards
>>>
>>> Claudio Martella schrieb:
>>>   
>>>       
>>>> Hello list,
>>>>
>>>> I'm using nutch 1.0 to crawl some intranet sites and i want to later put
>>>> the crawled data into my solr server. Though nutch 1.0 comes with solr
>>>> support "out of the box" i think that solution doesn't fit me. First, i
>>>> need to run my own code on the crawled data (particularly what comes out
>>>> AFTER the parser (as i'm crawling both pdf, doc etc)) as i want to
>>>> extract keywords with my own code and i want to do some language
>>>> detection to choose in what fields to put the text (each solr field for
>>>> me has different stopwords and snowball stemming). What happens with the
>>>> crawl command is that i get data that has already been indexed, but i'd
>>>> like to avoid it.
>>>>
>>>> Basically i'd like the crawler to "mirror" the site and for each URL get
>>>> the text in that url. Any suggestions please?
>>>>
>>>> Claudio
>>>>
>>>>   
>>>>     
>>>>         
>>>   
>>>       
>> -- 
>> Claudio Martella
>> Digital Technologies
>> Unit Research & Development - Analyst
>>
>> TIS innovation park
>> Via Siemens 19 | Siemensstr. 19
>> 39100 Bolzano | 39100 Bozen
>> Tel. +39 0471 068 123
>> Fax  +39 0471 068 129
>> claudio.martella@tis.bz.it http://www.tis.bz.it
>>
>>
>>
>>     


-- 
Claudio Martella
Digital Technologies
Unit Research & Development - Analyst

TIS innovation park
Via Siemens 19 | Siemensstr. 19
39100 Bolzano | 39100 Bozen
Tel. +39 0471 068 123
Fax  +39 0471 068 129
claudio.martella@tis.bz.it http://www.tis.bz.it




RE: Accessing crawled data

Posted by BELLINI ADAM <mb...@msn.com>.
hi

did you know that you can index your data directly into Solr? The command is
solrindex, so you don't need the Nutch index.

And you are not obliged to use the crawl command: if you want to skip the
indexing step, you can run your own inject, generate, fetch and update steps
in a loop, and at the end of the loop index the data into Solr with the
solrindex command. Here is the code:
> steps=10
> depth=5              # example value: number of generate/fetch/update rounds
> threads=10           # example value: fetcher threads
> crawl=crawl          # example value: crawl output directory
>
> echo "----- Inject (Step 1 of $steps) -----"
> $NUTCH_HOME/bin/nutch inject $crawl/crawldb urls
>
> echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
> for ((i=0; i < $depth; i++))
> do
>   echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
>
>   $NUTCH_HOME/bin/nutch generate $crawl/crawldb $crawl/segments
>   if [ $? -ne 0 ]
>   then
>     echo "runbot: Stopping at depth $depth. No more URLs to fetch."
>     break
>   fi
>
>   segment=`ls -d $crawl/segments/* | tail -1`
>
>   $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
>   if [ $? -ne 0 ]
>   then
>     echo "runbot: fetch $segment at depth `expr $i + 1` failed."
>     echo "runbot: Deleting segment $segment."
>     rm -rf $segment
>     continue
>   fi
>
>   echo "----- Updating Database -----"
>   $NUTCH_HOME/bin/nutch updatedb $crawl/crawldb $segment
> done
>
> # build the linkdb that solrindex expects
> $NUTCH_HOME/bin/nutch invertlinks $crawl/linkdb -dir $crawl/segments
>
> $NUTCH_HOME/bin/nutch solrindex URL_OF_YOUR_SOLR_SERVER $crawl/crawldb $crawl/linkdb $crawl/segments/*


You can also use the segment reader to extract the content of the pages.
Here is the command:


./bin/nutch readseg -dump crawl_dublin/segments/20091001145126/ dump_folder -nofetch -nogenerate -noparse -noparsedata -noparsetext


This command dumps only the content (the page sources).


Hope it helps.



> Date: Thu, 17 Dec 2009 15:32:33 +0100
> From: claudio.martella@tis.bz.it
> To: nutch-user@lucene.apache.org
> Subject: Re: Accessing crawled data
> 
> Hi,
> 
> actually i completely mis-explained myself. I'll try to make myself
> clear: i'd like to extract the information in the segments by using the
> parsers.
> 
> This means i can basically use the crawl command but this will also
> index the data and that's a waist of resources. So basically what i will
> do is copy the code in the org.apache.nutch.craw.Crawl until the data is
> indexed and will skip that part. What I'm missing at the moment (and
> maybe i should check the readseg command
> (org.apache.nutch.segment.SegmentReader class)) is the understanding of
> how i can basically extract, from the segments, the list of urls fetched
> and the text connected to the urls. After i have the list of urls and
> the function:url<->text i can use my text analysis algorithms and create
> the xml messages to send to my solr server.
> 
> Any pointer to how to handle segments to extract text or extract list of
> all the urls in the db?
> 
> thanks
> 
> Claudio
> 
> 
> reinhard schwab wrote:
> > if you dont want to refetch already fetched pages,
> > i think of 3 possibilities:
> >
> > a/ set a very high fetch interval
> > b/ use a customized fetch schedule class instead of DefaultFetchSchedule
> > implement there a method
> > public boolean shouldFetch(Text url, CrawlDatum datum, long curTime)
> > which returns false if a datum is already fetched.
> > this should theoretically work, have not done this.
> > in nutch-site.xml you have then to set the property
> >
> > <property>
> >   <name>db.fetch.schedule.class</name>
> >   <value>org.apache.nutch.crawl.CustomizedFetchSchedule</value>
> >   <description>The implementation of fetch schedule.
> > DefaultFetchSchedule simply
> >   adds the original fetchInterval to the last fetch time, regardless of
> >   page changes.</description>
> > </property>
> >
> > c/ modify the class you are now using as fetch schedule class and adapt
> > the method shouldFetch
> > to the behaviour you want
> >
> > regards
> >
> > Claudio Martella schrieb:
> >   
> >> Hello list,
> >>
> >> I'm using nutch 1.0 to crawl some intranet sites and i want to later put
> >> the crawled data into my solr server. Though nutch 1.0 comes with solr
> >> support "out of the box" i think that solution doesn't fit me. First, i
> >> need to run my own code on the crawled data (particularly what comes out
> >> AFTER the parser (as i'm crawling both pdf, doc etc)) as i want to
> >> extract keywords with my own code and i want to do some language
> >> detection to choose in what fields to put the text (each solr field for
> >> me has different stopwords and snowball stemming). What happens with the
> >> crawl command is that i get data that has already been indexed, but i'd
> >> like to avoid it.
> >>
> >> Basically i'd like the crawler to "mirror" the site and for each URL get
> >> the text in that url. Any suggestions please?
> >>
> >> Claudio
> >>
> >>   
> >>     
> >
> >
> >   
> 
> 
> -- 
> Claudio Martella
> Digital Technologies
> Unit Research & Development - Analyst
> 
> TIS innovation park
> Via Siemens 19 | Siemensstr. 19
> 39100 Bolzano | 39100 Bozen
> Tel. +39 0471 068 123
> Fax  +39 0471 068 129
> claudio.martella@tis.bz.it http://www.tis.bz.it
> 
> 
> 
 		 	   		  

Re: Accessing crawled data

Posted by Claudio Martella <cl...@tis.bz.it>.
Hi,

actually I completely mis-explained myself. I'll try to be clearer: I'd like
to extract the information in the segments by using the parsers.

This means I could basically use the crawl command, but that also indexes
the data, which is a waste of resources. So what I would do is copy the code
from org.apache.nutch.crawl.Crawl up to the point where the data gets
indexed, and skip that part. What I'm missing at the moment (and maybe I
should look at the readseg command, i.e. the
org.apache.nutch.segment.SegmentReader class) is an understanding of how to
extract, from the segments, the list of fetched URLs and the text associated
with each URL. Once I have the list of URLs and the URL-to-text mapping, I
can run my text analysis algorithms and create the XML messages to send to
my Solr server.

Any pointers on how to handle segments to extract the text, or to extract
the list of all the URLs in the db?
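
As a minimal sketch of that (roughly what SegmentReader does internally when
dumping the parse text), here is one way to iterate over the URL-to-text
pairs in a segment's parse_text directory; the class name and the spot where
the analysis hooks in are placeholders:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapFileOutputFormat;
import org.apache.nutch.parse.ParseText;
import org.apache.nutch.util.NutchConfiguration;

public class DumpParseText {

  public static void main(String[] args) throws IOException {
    // args[0] is a segment directory, e.g. crawl/segments/20091216123456
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    Path parseText = new Path(args[0], ParseText.DIR_NAME); // "parse_text"

    MapFile.Reader[] readers = MapFileOutputFormat.getReaders(fs, parseText, conf);
    Text url = new Text();
    ParseText text = new ParseText();
    for (MapFile.Reader reader : readers) {
      while (reader.next(url, text)) {
        // url -> extracted text; plug keyword extraction / language
        // detection in here and build the Solr documents from the result
        System.out.println(url + "\t" + text.getText().replace('\n', ' '));
      }
      reader.close();
    }
  }
}

Alternatively, the readseg -dump command shown earlier in the thread also
writes the parse text if you leave out the -noparsetext switch.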

thanks

Claudio


reinhard schwab wrote:
> if you dont want to refetch already fetched pages,
> i think of 3 possibilities:
>
> a/ set a very high fetch interval
> b/ use a customized fetch schedule class instead of DefaultFetchSchedule
> implement there a method
> public boolean shouldFetch(Text url, CrawlDatum datum, long curTime)
> which returns false if a datum is already fetched.
> this should theoretically work, have not done this.
> in nutch-site.xml you have then to set the property
>
> <property>
>   <name>db.fetch.schedule.class</name>
>   <value>org.apache.nutch.crawl.CustomizedFetchSchedule</value>
>   <description>The implementation of fetch schedule.
> DefaultFetchSchedule simply
>   adds the original fetchInterval to the last fetch time, regardless of
>   page changes.</description>
> </property>
>
> c/ modify the class you are now using as fetch schedule class and adapt
> the method shouldFetch
> to the behaviour you want
>
> regards
>
> Claudio Martella schrieb:
>   
>> Hello list,
>>
>> I'm using nutch 1.0 to crawl some intranet sites and i want to later put
>> the crawled data into my solr server. Though nutch 1.0 comes with solr
>> support "out of the box" i think that solution doesn't fit me. First, i
>> need to run my own code on the crawled data (particularly what comes out
>> AFTER the parser (as i'm crawling both pdf, doc etc)) as i want to
>> extract keywords with my own code and i want to do some language
>> detection to choose in what fields to put the text (each solr field for
>> me has different stopwords and snowball stemming). What happens with the
>> crawl command is that i get data that has already been indexed, but i'd
>> like to avoid it.
>>
>> Basically i'd like the crawler to "mirror" the site and for each URL get
>> the text in that url. Any suggestions please?
>>
>> Claudio
>>
>>   
>>     
>
>
>   


-- 
Claudio Martella
Digital Technologies
Unit Research & Development - Analyst

TIS innovation park
Via Siemens 19 | Siemensstr. 19
39100 Bolzano | 39100 Bozen
Tel. +39 0471 068 123
Fax  +39 0471 068 129
claudio.martella@tis.bz.it http://www.tis.bz.it




Re: Accessing crawled data

Posted by reinhard schwab <re...@aon.at>.
If you don't want to refetch already fetched pages,
I can think of 3 possibilities:

a/ set a very high fetch interval
b/ use a customized fetch schedule class instead of DefaultFetchSchedule
and implement a method
public boolean shouldFetch(Text url, CrawlDatum datum, long curTime)
that returns false if a datum has already been fetched (a sketch follows
after this list). This should work in theory; I have not done it myself.
In nutch-site.xml you then have to set the property:

<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.CustomizedFetchSchedule</value>
  <description>The implementation of fetch schedule.
DefaultFetchSchedule simply
  adds the original fetchInterval to the last fetch time, regardless of
  page changes.</description>
</property>

c/ modify the class you are now using as the fetch schedule class and adapt
its shouldFetch method to the behaviour you want.
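
A minimal sketch of option b/, assuming the Nutch 1.0 FetchSchedule API (the
class name matches the property value above; the status check is an
assumption about how "already fetched" shows up in your crawldb):

package org.apache.nutch.crawl;

import org.apache.hadoop.io.Text;

// Sketch only: skip anything the crawldb already marks as fetched and fall
// back to the default schedule for everything else.
public class CustomizedFetchSchedule extends DefaultFetchSchedule {

  @Override
  public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) {
    if (datum.getStatus() == CrawlDatum.STATUS_DB_FETCHED) {
      return false;
    }
    return super.shouldFetch(url, datum, curTime);
  }
}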

regards

Claudio Martella schrieb:
> Hello list,
>
> I'm using nutch 1.0 to crawl some intranet sites and i want to later put
> the crawled data into my solr server. Though nutch 1.0 comes with solr
> support "out of the box" i think that solution doesn't fit me. First, i
> need to run my own code on the crawled data (particularly what comes out
> AFTER the parser (as i'm crawling both pdf, doc etc)) as i want to
> extract keywords with my own code and i want to do some language
> detection to choose in what fields to put the text (each solr field for
> me has different stopwords and snowball stemming). What happens with the
> crawl command is that i get data that has already been indexed, but i'd
> like to avoid it.
>
> Basically i'd like the crawler to "mirror" the site and for each URL get
> the text in that url. Any suggestions please?
>
> Claudio
>
>