Posted to user@nutch.apache.org by "McGibbney, Lewis John" <Le...@gcu.ac.uk> on 2011/01/24 15:13:18 UTC

PDF Content Extraction

Hello list,

I have been using Nutch 1.2 to crawl the web for a small number of very relevant HTML pages and associated URLs that point to PDF documents. I have then been using Luke v1.0.1 to look inside my index and confirm that the specific PDF documents linked from these pages have been indexed. When I search my index, each relevant hit returns a hyperlink (amongst other information). I intend to implement a content extraction mechanism that also returns relevant information from within the indexed PDF documents whenever a user submits a query. For example, if someone submitted a query relating to a clause within a legal document, the content extraction tool would parse the PDF file and show a snippet of the relevant text from that PDF in the search result.

I hope I have explained my problem properly. I am posting here because I have been aware for some time that Tika was possibly the solution, but I am only just getting round to working on this now.

Does anyone have a suggestion for how I can implement this in Nutch 1.2? Thank you.

Lewis

Glasgow Caledonian University is a registered Scottish charity, number SC021474

Winner: Times Higher Education’s Widening Participation Initiative of the Year 2009 and Herald Society’s Education Initiative of the Year 2009
http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html

RE: PDF Content Extraction

Posted by "McGibbney, Lewis John" <Le...@gcu.ac.uk>.

This is great, thanks for explaining. I will take your points on board and hopefully get it sorted out.

Lewis

________________________________________
From: Claudio Martella [claudio.martella@tis.bz.it]
Sent: 24 January 2011 14:30
To: user@nutch.apache.org
Subject: Re: PDF Content Extraction

I might not understand your question correctly, but it looks like you
can send your data to Solr and issue your queries there. You can ask Solr
to return the snippet of the content that matches the query.

If your question is about how to extract the data from the PDF, you
can configure Nutch to use the parse-tika plugin; it will extract the
text just as Nutch is doing for your HTML right now.

I hope I'm answering your question.

Re: PDF Content Extraction

Posted by Claudio Martella <cl...@tis.bz.it>.

I might not understand your question correctly, but it looks like you
can send your data to Solr and issue your queries there. You can ask Solr
to return the snippet of the content that matches the query.
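
To illustrate the Solr suggestion above, here is a minimal Python sketch of a query that asks Solr for highlighted snippets. `hl`, `hl.fl`, `hl.snippets`, `hl.fragsize` and `wt` are standard Solr request parameters; the base URL, the field name `content`, the helper names and the sample response are assumptions made for this example, not details from the thread. Highlighting only works on a stored field, so the Solr schema would need `content` to be stored.

```python
import json
from urllib.parse import urlencode


def build_highlight_url(base_url, user_query, field="content",
                        snippets=2, fragsize=150):
    """Build a Solr /select URL that requests highlighted snippets.

    base_url and field are assumptions; adjust them to your own setup.
    """
    params = {
        "q": user_query,
        "wt": "json",             # ask for a JSON response
        "hl": "true",             # switch highlighting on
        "hl.fl": field,           # field to take snippets from
        "hl.snippets": snippets,  # max snippets per document
        "hl.fragsize": fragsize,  # approximate snippet length in characters
    }
    return base_url.rstrip("/") + "/select?" + urlencode(params)


def snippets_by_document(response_text, field="content"):
    """Map each document key to its list of highlighted snippets."""
    data = json.loads(response_text)
    highlighting = data.get("highlighting", {})
    return {doc_id: fields.get(field, [])
            for doc_id, fields in highlighting.items()}


# Hand-written sample that mimics the shape of a Solr JSON response.
# It is illustrative only and was not produced by a real server.
SAMPLE_RESPONSE = json.dumps({
    "response": {"numFound": 1,
                 "docs": [{"id": "http://example.org/lease.pdf"}]},
    "highlighting": {
        "http://example.org/lease.pdf": {
            "content": ["... the <em>termination</em> clause applies ..."]
        }
    },
})

if __name__ == "__main__":
    print(build_highlight_url("http://localhost:8983/solr",
                              "termination clause"))
    print(snippets_by_document(SAMPLE_RESPONSE))
```

In a real setup the URL would be fetched (for instance with `urllib.request.urlopen`) and the response body passed to `snippets_by_document`. The `highlighting` section of a Solr response is keyed by each document's unique key, which is why the snippets can be matched back to the hits.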

If your question is about how to extract the data from the PDF, you
can configure Nutch to use the parse-tika plugin; it will extract the
text just as Nutch is doing for your HTML right now.
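
As a rough illustration of the parse-tika suggestion: Nutch decides which plugins to load from the `plugin.includes` property, a regular expression over plugin ids that is normally overridden in `conf/nutch-site.xml`. The Python sketch below parses an example configuration and checks whether a plugin id is matched. The property value shown is an assumption for illustration, not the stock Nutch 1.2 default, and the helper names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Example nutch-site.xml content. The plugin list is illustrative only;
# keep whatever other plugins your own installation already uses.
SAMPLE_NUTCH_SITE = """<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)</value>
  </property>
</configuration>
"""


def get_property(xml_text, name):
    """Return the value of a Hadoop-style <property>, or None if absent."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


def plugin_enabled(xml_text, plugin_id):
    """Check whether plugin_id is matched in full by plugin.includes."""
    pattern = get_property(xml_text, "plugin.includes")
    return pattern is not None and re.fullmatch(pattern, plugin_id) is not None


if __name__ == "__main__":
    for plugin in ("parse-html", "parse-tika", "parse-pdf"):
        print(plugin, plugin_enabled(SAMPLE_NUTCH_SITE, plugin))
```

Running the same check against your own `conf/nutch-site.xml` is a quick way to confirm that `parse-tika` is actually included before re-crawling.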

I hope I'm answering your question.

-- 
Claudio Martella
Digital Technologies
Unit Research & Development - Analyst

TIS innovation park
Via Siemens 19 | Siemensstr. 19
39100 Bolzano | 39100 Bozen
Tel. +39 0471 068 123
Fax  +39 0471 068 129
claudio.martella@tis.bz.it http://www.tis.bz.it
