Posted to user@nutch.apache.org by Jorge Luis Betancourt Gonzalez <jl...@uci.cu> on 2013/01/28 16:53:04 UTC

Solr dynamic fields

Hi:

I'm currently working on a platform for crawling a large number of PDF files. Using Nutch (and Tika) I'm able to extract and store the textual content of the files in Solr, but now we want to extract the content of the PDFs by page; that is, we want to store several Solr fields (one per page in the document). Is there any recommended way of accomplishing this in Nutch/Solr? With a parse plugin I could store the text from each page in the document's metadata; would anything else be needed?

Regards
--
"It is only in the mysterious equation of love that any 
logical reasons can be found."
"Good programmers often confuse halloween (31 OCT) with 
christmas (25 DEC)"

RE: Solr dynamic fields

Posted by Markus Jelsma <ma...@openindex.io>.

Hi

 
 
-----Original message-----
> From:Jorge Luis Betancourt Gonzalez <jl...@uci.cu>
> Sent: Mon 28-Jan-2013 17:01
> To: user@nutch.apache.org
> Subject: Solr dynamic fields
> 
> Hi:
> 
> I'm currently working on a platform for crawling a large number of PDF files. Using Nutch (and Tika) I'm able to extract and store the textual content of the files in Solr, but now we want to extract the content of the PDFs by page; that is, we want to store several Solr fields (one per page in the document). Is there any recommended way of accomplishing this in Nutch/Solr? With a parse plugin I could store the text from each page in the document's metadata; would anything else be needed?

Yes, make a custom indexing filter that reads your parsed metadata and adds page-specific fields to the NutchDocument. That should work fine.
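
The field-naming step of such a filter can be sketched without any Nutch classes. The Java below is only an illustration under stated assumptions: plain maps stand in for the parse metadata and the NutchDocument, the metadata keys "page.1", "page.2", ... are a hypothetical convention for what the parse plugin writes, and the output names "page_1", "page_2", ... assume a matching dynamic field pattern (for example "page_*") has been declared in the Solr schema. The class and method names are made up for this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Sketch of the per-page field naming a custom indexing filter could apply.
 * Plain maps stand in for Nutch's parse metadata and NutchDocument.
 */
final class PageFieldSketch {

    // Hypothetical key format written by the parse plugin: "page.1", "page.2", ...
    private static final Pattern PAGE_KEY = Pattern.compile("page\\.(\\d{1,9})");

    private PageFieldSketch() {
    }

    /**
     * Turns per-page metadata entries into index fields named "page_N".
     * Entries whose key is not a page key are ignored.
     */
    static Map<String, String> toPageFields(Map<String, String> parseMetadata) {
        // Collect pages keyed by number so that page 10 sorts after page 2.
        TreeMap<Integer, String> pages = new TreeMap<>();
        for (Map.Entry<String, String> entry : parseMetadata.entrySet()) {
            Matcher m = PAGE_KEY.matcher(entry.getKey());
            if (m.matches()) {
                pages.put(Integer.parseInt(m.group(1)), entry.getValue());
            }
        }

        // Emit one field per page; the number of fields follows the document,
        // so nothing here needs to know the page count in advance.
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<Integer, String> page : pages.entrySet()) {
            fields.put("page_" + page.getKey(), page.getValue());
        }
        return fields;
    }
}
```

In an actual plugin the same loop would presumably run inside the indexing filter, reading from the parse metadata and adding each resulting field to the NutchDocument instead of returning a map.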


Re: Solr dynamic fields

Posted by Jorge Luis Betancourt Gonzalez <jl...@uci.cu>.

Sorry, the message was sent before I had finished :-(. The question is basically: can I send the content of one or several fields into one dynamic field using solrmapping.xml? The problem arises from the unknown variable (the number of pages), so is there any advice on how this could be accomplished?
