
Posted to user@nutch.apache.org by hala <ro...@yahoo.com> on 2011/02/13 13:47:11 UTC

nutch crawling arabic pdf site

When I crawl a site that links to a PDF containing Arabic words, Nutch does
not return the Arabic words from the PDF when I search for them.
What can I do? Please help me.
-- 
View this message in context: http://lucene.472066.n3.nabble.com/nutch-crawling-arabic-pdf-site-tp2485360p2485360.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: nutch crawling arabic pdf site

Posted by Markus Jelsma <ma...@openindex.io>.

Hi,

1. Is the PDF actually fetched, parsed and indexed? Doesn't your
regex-urlfilter skip PDF?
2. Is the PDF too large and being truncated by Nutch?
3. Does Tika actually parse the PDF as you expect?
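
Point 1 can be checked by hand against conf/regex-urlfilter.txt. The sketch
below is not Nutch code. It only mimics the rule format as I understand it:
each line is a `+` or `-` followed by a regular expression, the first rule
that matches decides, and a URL that matches no rule is rejected. The sample
rules and the example.com URL are made up for illustration, and Python's
regex dialect is not identical to Java's.

```python
import re


def first_match_decision(rules_text, url):
    """Return True if the URL is accepted, False if rejected.

    Assumed semantics: first matching rule wins, no match means reject.
    """
    for line in rules_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            continue  # not a rule line in this simplified reading
        if re.search(pattern, url):
            return sign == "+"
    return False


# Hypothetical rule set whose suffix rule also lists pdf.
RULES_WITH_PDF = r"""
# skip some binary suffixes
-\.(gif|jpg|png|zip|pdf)$
# accept everything else
+.
"""

# The same hypothetical rule set with pdf removed from the suffix rule.
RULES_WITHOUT_PDF = r"""
-\.(gif|jpg|png|zip)$
+.
"""
```

If a check like this rejects your PDF URL with your own rules, the PDF never
reaches the fetcher, so Tika is not the first thing to look at.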

There may be problems at more than one of these stages. You can use the
parser checker to confirm that Tika is working.

bin/nutch org.apache.nutch.parse.ParserChecker -dumpText http://www.apache.org/licenses/icla.pdf
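
Once you have saved the dumped text for your own PDF, a quick way to see
whether any Arabic survived parsing is to count the Arabic code points in it.
This is only a rough sketch: the function name is mine, and the ranges cover
the basic Arabic block plus the two presentation-form blocks that PDF text
extraction sometimes produces.

```python
# Unicode ranges treated as "Arabic" in this sketch.
ARABIC_RANGES = (
    (0x0600, 0x06FF),  # Arabic
    (0xFB50, 0xFDFF),  # Arabic Presentation Forms-A
    (0xFE70, 0xFEFF),  # Arabic Presentation Forms-B
)


def count_arabic(text):
    """Count characters whose code point falls in one of ARABIC_RANGES."""
    total = 0
    for ch in text:
        cp = ord(ch)
        if any(lo <= cp <= hi for lo, hi in ARABIC_RANGES):
            total += 1
    return total
```

A count of zero for a PDF that visibly contains Arabic suggests the text was
lost or garbled during parsing rather than during search.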

Cheers,

On Wednesday 16 February 2011 08:31:23 hala wrote:
> Thank you for your reply.
> I do a complete crawl cycle (generate, fetch, update, index) and I use
> Nutch's internal search. I crawl a site that has a link to a PDF file. The
> PDF contains Arabic words, and I want to search for them with Nutch.
> If the site itself has Arabic words, Nutch returns them to me, but if the
> Arabic words are in the PDF, Nutch doesn't return them.
> Please give me any help.

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: nutch crawling arabic pdf site

Posted by hala <ro...@yahoo.com>.


Thank you for your reply.
I do a complete crawl cycle (generate, fetch, update, index) and I use
Nutch's internal search. I crawl a site that has a link to a PDF file. The
PDF contains Arabic words, and I want to search for them with Nutch.
If the site itself has Arabic words, Nutch returns them to me, but if the
Arabic words are in the PDF, Nutch doesn't return them.
Please give me any help.

Re: nutch crawling arabic pdf site

Posted by Markus Jelsma <ma...@openindex.io>.

Hi,

What configuration are you using? Did you actually complete a full crawl
cycle (generate, fetch, update, index) successfully? Are you using Nutch's
internal search or are you using Solr as the search backend? Can your servlet
container handle non-Latin input for GET requests?
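
To illustrate the last question: a non-Latin query term travels in the GET
request as percent-encoded bytes, and the server has to decode those bytes
with the same charset the browser used. The sketch below assumes the browser
sends UTF-8 and compares a UTF-8 decode with an ISO-8859-1 decode. It says
nothing about how your particular container is configured.

```python
from urllib.parse import quote, unquote

# An Arabic query term, written with escapes so the bytes are unambiguous.
query = "\u0639\u0631\u0628\u064a"

# What ends up in the URL when the client encodes the term as UTF-8.
encoded = quote(query)

# A server that decodes as UTF-8 recovers the original term.
decoded_utf8 = unquote(encoded, encoding="utf-8")

# A server that decodes as ISO-8859-1 gets a different string, so the
# search would look for a term that is not in the index.
decoded_latin1 = unquote(encoded, encoding="latin-1")
```

If the search page echoes the query back as garbage, this mismatch is a more
likely culprit than the index itself.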

Using Nutch and Solr I can query for almost any kind of language and character
set. I haven't used Nutch's own search, but since Solr and Nutch share quite
a few components I'm sure that should work as well.

Please provide more details and there is no need to start a second thread on 
the same subject ;)

Cheers,
> When I crawl a site that links to a PDF containing Arabic words, Nutch does
> not return the Arabic words from the PDF when I search for them.
> What can I do? Please help me.

Re: nutch crawling arabic pdf site

Posted by Markus Jelsma <ma...@openindex.io>.

The problem isn't fixed in the 0.9 release of Tika, so you're still stuck here
and there is no other parse-pdf plugin which you can use. There is, however,
the parse-ext plugin [1] which you perhaps could use to execute pdf2text and
return the parsed content. I haven't used this plugin and I don't know how to
configure it. If you successfully manage to get it up and running then please
post your findings on the list.

As a last resort you might have to write a custom plugin [2]. But I imagine
it'd do the same job as the parse-ext plugin with pdf2text.
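
The general idea behind both options is to hand the raw document to an
external converter and take whatever it prints as the parsed text. The sketch
below shows only that idea and is not how parse-ext is configured or
implemented. The function name is hypothetical, the converter command is left
to the caller, and the stand-in command simply echoes its input so the sketch
can run without any PDF tool installed.

```python
import subprocess
import sys


def convert_with_external_command(document_bytes, command, timeout=30):
    """Feed the document to `command` on stdin and return its stdout as text.

    Assumes the converter writes UTF-8; undecodable bytes are replaced.
    """
    result = subprocess.run(
        command,
        input=document_bytes,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
        check=True,  # raise if the converter exits with an error
    )
    return result.stdout.decode("utf-8", errors="replace")


# Stand-in for a real converter: copies stdin to stdout unchanged.
ECHO_COMMAND = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())",
]
```

Whatever converter you end up using, it is worth checking that its output
encoding matches what the caller decodes, otherwise the Arabic is lost again
at this step.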

[1]: http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext/ExtParser.java?view=markup
[2]: http://wiki.apache.org/nutch/WritingPluginExample

> Thanks a lot for your help.
> You have wide experience,
> but the problem still exists.
> I don't know what I can do.

Re: nutch crawling arabic pdf site

Posted by hala <ro...@yahoo.com>.

Thanks a lot for your help.
You have wide experience,
but the problem still exists.
I don't know what I can do.

Re: nutch crawling arabic pdf site

Posted by Markus Jelsma <ma...@openindex.io>.

I dug a bit deeper and I now believe you're the victim of TIKA-469:
https://issues.apache.org/jira/browse/TIKA-469



On Sunday 13 February 2011 13:47:11 hala wrote:
> When I crawl a site that links to a PDF containing Arabic words, Nutch does
> not return the Arabic words from the PDF when I search for them.
> What can I do? Please help me.
