Posted to user@nutch.apache.org by Michael Ji <fj...@yahoo.com> on 2006/03/23 03:18:47 UTC

crawling pdf and word file

hi there,

Are there any specific settings that need to be added to the
configuration file in order to crawl and index PDF and
Word files?

thanks,

Michael,

__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com

Re: parsing pdf file

Posted by Ravi Chintakunta <ra...@gmail.com>.

Hi Michael,

The default value for the content limit in nutch-default.xml is 65536.
This is set in these properties:

http.content.limit
file.content.limit
ftp.content.limit

So irrespective of the file size, the download is truncated at this limit.

To allow parsing of files that exceed this limit, copy the above three
properties into nutch-site.xml and increase them to the value you
need.
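
As a rough sketch of the override described above, the three properties could look like this in nutch-site.xml. The property names are the ones listed in this message; the value 1048576 (1 MiB) is only an assumed example and should be sized to the largest documents you expect to fetch.

```xml
<!-- Illustrative overrides for nutch-site.xml.
     1048576 bytes (1 MiB) is an assumed example value, not a recommendation. -->
<property>
  <name>http.content.limit</name>
  <value>1048576</value>
</property>
<property>
  <name>file.content.limit</name>
  <value>1048576</value>
</property>
<property>
  <name>ftp.content.limit</name>
  <value>1048576</value>
</property>
```

These entries go inside the root element of nutch-site.xml, alongside any other overridden properties.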

- Ravi Chintakunta

On 3/24/06, Michael Ji <fj...@yahoo.com> wrote:
> Hi there,
>
> I got the following errors;
>
> 060324 095216 http.max.delays = 10000
> 060324 095217 fetch okay, but can't parse
> http://www.ucis.pitt.edu/cwes/papers/work_papers/wp6_2005.pdf,
> reason: failed(2,202): Content truncated at 69266
> bytes. Parser can't handle incomplete pdf file.
>
> Fetching seems to succeed, but parsing does not. I
> raised the delays to 10000; is that still not enough?
>
> thanks,
>
> Michael
>

Re: fetching https pages

Posted by kauu <ba...@gmail.com>.

I think you need a protocol plugin to handle https, so change this in
your nutch-site.xml if you have the protocol-https plugin:

<property>
<name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|protocol-https|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>

<description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>

On 3/27/06, Michael Ji <fj...@yahoo.com> wrote:
>
> hi there:
>
> Will the following line in nutch-site.xml let
> Nutch fetch https pages?
>
> "protocol-(http|https)"
>
> I tried that, but it gives me this error message:
>
> "
> failed with:
> org.apache.nutch.protocol.ProtocolNotFound: protocol
> not found for url=https
> "
>
> Any idea how to fix it?
>
> thanks,
>
> Michael
>



--
www.babatu.com

Re: fetching https pages

Posted by Andrzej Bialecki <ab...@getopt.org>.

Michael Ji wrote:
> hi there:
>
> Will the following line in nutch-site.xml let
> Nutch fetch https pages?
>
> "protocol-(http|https)"
>   

No. There is no plugin named "protocol-https". In order to handle HTTPS 
you need to use the "protocol-httpclient" plugin, which handles both 
HTTP and HTTPS - and then you should remove "protocol-http" from your 
config.
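
A minimal sketch of that change, assuming the rest of plugin.includes follows the default-style value quoted elsewhere in this thread. Only the plugin name protocol-httpclient comes from the explanation above; the other entries are illustrative and should match your existing configuration.

```xml
<!-- Sketch for nutch-site.xml: protocol-http is replaced by protocol-httpclient,
     which handles both HTTP and HTTPS. The remaining plugin list is assumed. -->
<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>
```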

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

fetching https pages

Posted by Michael Ji <fj...@yahoo.com>.

hi there:

Will the following line in nutch-site.xml let
Nutch fetch https pages?

"protocol-(http|https)"

I tried that, but it gives me this error message:

"
failed with:
org.apache.nutch.protocol.ProtocolNotFound: protocol
not found for url=https
"

Any idea how to fix it?

thanks,

Michael





a way to fetch, parse, index and query pdf/msword

Posted by Michael Ji <fj...@yahoo.com>.

hi there,

In nutch-site.xml, I added pdf|msword to the
parse-, index- and query- plugin groups.

Is this the proper way to tell Nutch to
fetch, index and query these two file formats?

thanks,

Michael,

---------------------------------------------------

<property>
<name>plugin.includes</name>

<value>

nutch-extensionpoints|protocol-http|
urlfilter-regex|
parse-(text|html|pdf|msword)|
index-(basic|pdf|msword)|
query-(basic|site|url|pdf|msword)

</value>
  <description> </description>
</property>



Re: search word file

Posted by Michael Ji <fj...@yahoo.com>.

I found that my index.done file has size 0. Is that wrong?

I can't find any error in the indexing log, though:

"060324 095226 * Moving index to NFS if needed...
060324 095226 DONE indexing segment 20060324095213:
total 1 records in 0.688 s (Infinity rec/s).
060324 095226 done indexing
"

thanks,

Michael,

--- Michael Ji <fj...@yahoo.com> wrote:

> hi there,
> 
> I can fetch and parse the Word file
> successfully,
>
> "060324 094040 fetching
> http://www.ala.org/ala/rusa/rusaprotools/referenceguide/illformprint.doc
> 060324 094040 http.proxy.host = null
> 060324 094040 http.proxy.port = 8080
> 060324 094040 http.timeout = 10000
> "
>
> I can use lukeAll to check the content of the
> segment and can see the letter.
>
> But I can't find the letter through the Nutch search
> page. Do I need more configuration to make Word files
> searchable?
> 
> thanks,
> 
> Michael,
> 



search word file

Posted by Michael Ji <fj...@yahoo.com>.

hi there,

I can fetch and parse the Word file successfully,

"060324 094040 fetching
http://www.ala.org/ala/rusa/rusaprotools/referenceguide/illformprint.doc
060324 094040 http.proxy.host = null
060324 094040 http.proxy.port = 8080
060324 094040 http.timeout = 10000
"

I can use lukeAll to check the content of the segment
and can see the letter.

But I can't find the letter through the Nutch search page. Do
I need more configuration to make Word files
searchable?

thanks,

Michael,


parsing pdf file

Posted by Michael Ji <fj...@yahoo.com>.

Hi there,

I got the following errors;

060324 095216 http.max.delays = 10000
060324 095217 fetch okay, but can't parse
http://www.ucis.pitt.edu/cwes/papers/work_papers/wp6_2005.pdf,
reason: failed(2,202): Content truncated at 69266
bytes. Parser can't handle incomplete pdf file.

Fetching seems to succeed, but parsing does not. I
raised the delays to 10000; is that still not enough?

thanks,

Michael



Re: crawling pdf and word file

Posted by Michael Ji <fj...@yahoo.com>.

hi Sudhendra:

I used the same configuration you suggested in
nutch-site.xml.

I ran a test and, after looking at the fetch log, found
the following error message:

"
fetch okay, but can't parse
http://www.ucis.pitt.edu/cwes/papers/work_papers/wp6_2005.pdf,
reason: failed(2,203): Content-Type not text/html:
application/pdf
"

Does that mean the PDF is downloaded but not parsed
successfully, so we can't search for words in the PDF file
directly?

thanks,

Michael,

By the way, I am using Nutch 0.7 for testing.



--- sudhendra seshachala <su...@yahoo.com> wrote:

> In nutch-default.xml, include the plugins for Word and
> PDF as shown below.
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url|jobs)</value>
>   <description>Regular expression naming plugin directory names to
>   include.  Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin. By
>   default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins.
>   </description>
> </property>
> But the recommendation is to put the property in
> nutch-site.xml instead.
> 
> Hope this helps.
> 
> Michael Ji <fj...@yahoo.com> wrote: 
> hi there,
> 
> Are there any specific settings that need to be added to the
> configuration file in order to crawl and index PDF and
> Word files?
> 
> thanks,
> 
> Michael,
> 
> 
> 
>   Sudhi Seshachala
>   http://sudhilogs.blogspot.com/
>    
> 
> 
> 		



Re: crawling pdf and word file

Posted by sudhendra seshachala <su...@yahoo.com>.

In nutch-default.xml, include the plugins for Word and PDF as shown below.

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url|jobs)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>
But the recommendation is to put the property in nutch-site.xml instead.
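
For illustration, a complete nutch-site.xml carrying this override might look like the sketch below. It assumes the <nutch-conf> root element used by the 0.7-era configuration files (later releases use <configuration>), and it leaves out the site-specific "jobs" query plugin from the value above.

```xml
<?xml version="1.0"?>
<!-- nutch-site.xml: site-specific overrides of nutch-default.xml.
     Assumption: nutch-conf is the root element in Nutch 0.7;
     later releases use configuration instead. -->
<nutch-conf>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url)</value>
  </property>
</nutch-conf>
```

Keeping the override in nutch-site.xml leaves nutch-default.xml untouched, so the shipped defaults stay available for reference.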

Hope this helps.

Michael Ji <fj...@yahoo.com> wrote: 
hi there,

Are there any specific settings that need to be added to the
configuration file in order to crawl and index PDF and
Word files?

thanks,

Michael,




  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   


		