Posted to user@nutch.apache.org by hala <ro...@yahoo.com> on 2011/02/13 13:47:11 UTC
nutch crawling arabic pdf site
When I crawl a site that links to a PDF containing Arabic words, Nutch does not
return the Arabic words from the PDF when I search for them.
What can I do? Please help me.
--
View this message in context: http://lucene.472066.n3.nabble.com/nutch-crawling-arabic-pdf-site-tp2485360p2485360.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: nutch crawling arabic pdf site
Posted by Markus Jelsma <ma...@openindex.io>.
Hi,
1. Is the PDF actually fetched, parsed and indexed? Does your regex-urlfilter
skip PDFs?
2. Is the PDF so large that Nutch truncates it?
3. Does Tika actually parse the PDF as you expect?
The problem may lie at any of these stages. You can use the parser checker to
confirm that Tika is working:
bin/nutch org.apache.nutch.parse.ParserChecker -dumpText http://www.apache.org/licenses/icla.pdf
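For point 1, one way to see whether the URL filter drops PDF links is to replay
its rules by hand. The sketch below assumes the usual regex-urlfilter
convention: a line starting with + accepts, a line starting with - rejects, the
first matching rule decides, and a URL that matches no rule is dropped.
check_url is a hypothetical helper, not part of Nutch, and grep -E only
approximates the Java regular expressions Nutch itself uses.

```shell
#!/bin/sh
# Sketch only: replay regex-urlfilter-style rules against one URL.
# Assumptions: "+regex" accepts, "-regex" rejects, first match wins,
# no match means the URL is dropped. grep -E is an approximation of
# Java regex syntax, so exotic patterns may behave differently.
check_url() {
    rules_file=$1
    url=$2
    while IFS= read -r line; do
        case $line in
            ''|'#'*) continue ;;            # skip blank lines and comments
        esac
        sign=${line%"${line#?}"}            # first character: + or -
        pattern=${line#?}                   # remainder of the line: the regex
        if printf '%s\n' "$url" | grep -Eq -- "$pattern"; then
            case $sign in
                +) echo accepted; return 0 ;;
                -) echo rejected; return 0 ;;
            esac
        fi
    done < "$rules_file"
    echo "rejected (no rule matched)"
}
```

Typical use would be something like: check_url conf/regex-urlfilter.txt
followed by the PDF's URL. If the answer is "rejected", the PDF never reaches
the fetcher, let alone Tika.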
Cheers,
On Wednesday 16 February 2011 08:31:23 hala wrote:
> Thank you for your reply.
> I ran a complete crawl cycle (generate, fetch, update, index) and I use
> Nutch's internal search. I crawled a site that has a link to a PDF file. The
> PDF contains Arabic words, and I want to search for them with Nutch.
> If the Arabic words are in the site's pages, Nutch returns them, but if they
> are in the PDF, Nutch does not return them.
> Please give me any help.
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: nutch crawling arabic pdf site
Posted by hala <ro...@yahoo.com>.
Thank you for your reply.
I ran a complete crawl cycle (generate, fetch, update, index) and I use
Nutch's internal search. I crawled a site that has a link to a PDF file. The
PDF contains Arabic words, and I want to search for them with Nutch.
If the Arabic words are in the site's pages, Nutch returns them, but if they
are in the PDF, Nutch does not return them.
Please give me any help.
--
View this message in context: http://lucene.472066.n3.nabble.com/nutch-crawling-arabic-pdf-site-tp2485360p2507554.html
Re: nutch crawling arabic pdf site
Posted by Markus Jelsma <ma...@openindex.io>.
Hi,
What configuration are you using? Did you actually complete a full crawl cycle
(generate, fetch, update, index)? Are you using Nutch's internal search or are
you using Solr as the search backend? Can your servlet container handle
non-Latin input for GET requests?
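On that last question: a non-Latin search term normally travels in the GET
request as percent-encoded UTF-8 bytes, and the servlet container has to decode
it as UTF-8 as well, otherwise the query reaching the index is garbage. The
sketch below shows what an Arabic term looks like on the wire. percent_encode
is a hypothetical helper; for simplicity it encodes every byte, including plain
ASCII, and it assumes the script and terminal are UTF-8.

```shell
#!/bin/sh
# Sketch only: percent-encode every byte of a string.
# Assumes the input is already UTF-8 (script file and terminal encoding).
percent_encode() {
    # Dump the bytes as hex, strip all whitespace, uppercase the digits,
    # then put a '%' in front of each two-digit byte.
    printf '%s' "$1" \
        | od -An -v -tx1 \
        | tr -d '[:space:]' \
        | tr 'a-f' 'A-F' \
        | sed 's/../%&/g'
}

# The Arabic word for "book": four letters, two UTF-8 bytes each.
percent_encode 'كتاب'; echo
# with UTF-8 input this prints %D9%83%D8%AA%D8%A7%D8%A8
```

If the container decodes those bytes as ISO-8859-1 instead, each letter turns
into two unrelated characters and the search finds nothing. In Tomcat, for
instance, the decoding charset is set with the Connector's URIEncoding
attribute; whether Tomcat is the container in use here is an assumption.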
Using Nutch and Solr I can query in almost any language and character set. I
haven't used Nutch's own search, but since Solr and Nutch share quite a few
components I'm sure that should work as well.
Please provide more details. There is no need to start a second thread on the
same subject ;)
Cheers,
> When I crawl a site that links to a PDF containing Arabic words, Nutch does
> not return the Arabic words from the PDF when I search for them.
> What can I do? Please help me.
Re: nutch crawling arabic pdf site
Posted by Markus Jelsma <ma...@openindex.io>.
The problem isn't fixed in the 0.9 release of Tika, so you're still stuck here,
and there is no other parse-pdf plugin you can use. There is, however, the
parse-ext plugin [1], which you could perhaps use to execute pdf2text and
return the parsed content. I haven't used this plugin and I don't know how to
configure it. If you manage to get it up and running, please post your findings
on the list.
As a last resort you might have to write a custom plugin [2]. But I imagine
it'd do the same job as the parse-ext plugin with pdf2text.
[1]: http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext/ExtParser.java?view=markup
[2]: http://wiki.apache.org/nutch/WritingPluginExample
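A minimal sketch of the kind of wrapper an external-parser setup might call.
It rests on assumptions not confirmed in this thread: that the external command
receives the raw PDF on standard input and must write plain text to standard
output, and that a converter such as pdftotext (from xpdf or poppler) is
installed. pdf_to_text is a hypothetical name.

```shell
#!/bin/sh
# Sketch only: read a PDF from stdin, write its text to stdout as UTF-8.
# Assumes pdftotext is on the PATH; the stdin/stdout contract with the
# calling plugin is an assumption and should be checked against ExtParser.
pdf_to_text() {
    tmp=$(mktemp) || return 1
    cat > "$tmp"                        # spool stdin to a temporary file
    pdftotext -enc UTF-8 "$tmp" -       # "-" as output file means stdout
    status=$?
    rm -f "$tmp"                        # clean up whether or not it worked
    return $status
}
```

Saved as a script, the last line would simply call pdf_to_text. The -enc UTF-8
option matters for this thread's case: it asks the converter to emit the
extracted Arabic text as UTF-8 rather than in its default output encoding.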
> Thanks a lot for your help.
> You have wide experience,
> but the problem still exists.
> I don't know what I can do.
Re: nutch crawling arabic pdf site
Posted by hala <ro...@yahoo.com>.
Thanks a lot for your help.
You have wide experience,
but the problem still exists.
I don't know what I can do.
--
View this message in context: http://lucene.472066.n3.nabble.com/nutch-crawling-arabic-pdf-site-tp2485360p2538118.html
Re: nutch crawling arabic pdf site
Posted by Markus Jelsma <ma...@openindex.io>.
I dug a bit deeper and I now believe you're the victim of TIKA-469:
https://issues.apache.org/jira/browse/TIKA-469
On Sunday 13 February 2011 13:47:11 hala wrote:
> When I crawl a site that links to a PDF containing Arabic words, Nutch does
> not return the Arabic words from the PDF when I search for them.
> What can I do? Please help me.