Posted to solr-user@lucene.apache.org by Khai Doan <kh...@gmail.com> on 2009/09/03 02:04:01 UTC

How to use DataImportHandler with ExtractingRequestHandler?

Hi all,

My name is Khai.  I have a table in a relational database.  I have
successfully used DataImportHandler to import this data into Apache Solr.
However, one of the columns stores the location of a PDF file.  How can I
configure DataImportHandler to use ExtractingRequestHandler to extract the
content of the PDF?

Thanks!

Khai Doan

Re: How to use DataImportHandler with ExtractingRequestHandler?

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.

On Fri, Nov 20, 2009 at 9:13 PM, javaxmlsoapdev <vi...@yahoo.com> wrote:

>
> Did you extend DIH to do this work? Can you share code samples? I have a
> similar requirement: I need to index database records, and each record
> has a column with a document path, so I need to create another index for
> the documents (we allow users to search both indexes separately) in
> parallel with reading some metadata of the documents from the database.
> I have all sorts of different document formats to index. FYI, I am on
> Solr 1.4.0. Any pointers would be appreciated.
>
>
He did not extend DIH for this. He extracted the text from his documents,
saved it into files, and used XPathEntityProcessor (you can use
PlainTextEntityProcessor) to index them.

I don't know much about ExtractingRequestHandler, but if you want to use DIH,
you'll have to extend it to add Tika support. You may want to look at a
couple of open issues:

   1. https://issues.apache.org/jira/browse/SOLR-1358
   2. https://issues.apache.org/jira/browse/SOLR-1583

-- 
Regards,
Shalin Shekhar Mangar.

Re: How to use DataImportHandler with ExtractingRequestHandler?

Posted by javaxmlsoapdev <vi...@yahoo.com>.

Does anyone have any ideas?

-- 
View this message in context: http://old.nabble.com/How-to-use-DataImportHandler-with-ExtractingRequestHandler--tp25267745p26485245.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: How to use DataImportHandler with ExtractingRequestHandler?

Posted by javaxmlsoapdev <vi...@yahoo.com>.

Did you extend DIH to do this work? Can you share code samples? I have a
similar requirement: I need to index database records, and each record
has a column with a document path, so I need to create another index for
the documents (we allow users to search both indexes separately) in
parallel with reading some metadata of the documents from the database.
I have all sorts of different document formats to index. FYI, I am on
Solr 1.4.0. Any pointers would be appreciated.

Thanks,

Re: How to use DataImportHandler with ExtractingRequestHandler?

Posted by Sascha Szott <sz...@zib.de>.

Hi Khai,

A few weeks ago, I was facing the same problem.

In my case, this workaround helped (assuming you're using Solr 1.3):
For each row, extract the content from the corresponding PDF file using
a parser library of your choice (I suggest Apache PDFBox, or Apache Tika
in case you need to process other file types as well), put it between

	<foo><![CDATA[

and

	]]></foo>

and store it in a text file (with an .xml extension, to match the url
attribute used below). To keep the relationship between a file and its
corresponding database row, use the primary key as the file name.
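
As an illustration of the file layout described above, here is a sketch of
what one such file might contain. The primary key value 42 (and therefore
the file name 42.xml) and the text itself are made up for the example.

```xml
<!-- 42.xml: hypothetical file for the database row whose primary key is 42 -->
<foo><![CDATA[
Text extracted from the PDF that belongs to row 42 goes here.
Characters such as < and & need no escaping inside a CDATA section.
]]></foo>
```

One caveat: the extracted text must not itself contain the sequence ]]>,
because that would end the CDATA section early. Any such occurrence would
have to be split across two CDATA sections before the file is written.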

Within data-config.xml, use the XPathEntityProcessor as follows (replace
dbRow and primaryKey with the name of your parent entity and its primary
key column, respectively):

<entity name="pdfcontent"
	processor="XPathEntityProcessor"
	forEach="/foo"
	url="${dbRow.primaryKey}.xml">
   <field column="pdftext" xpath="/foo"/>
</entity>
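
To show where this entity sits, here is a rough sketch of a complete
data-config.xml, assuming the entity is nested inside the entity that reads
the database rows. Everything outside the pdfcontent entity is an
assumption made for illustration: the data source names (db, files), the
JDBC driver, URL and credentials, the base path, the SQL query, and the
title column are placeholders, not values from this thread.

```xml
<dataConfig>
  <!-- Placeholder JDBC settings; substitute your own driver, URL and credentials -->
  <dataSource name="db" type="JdbcDataSource"
              driver="org.example.Driver"
              url="jdbc:example://localhost/mydb"
              user="solr" password="secret"/>
  <!-- Placeholder directory that holds the pre-extracted <primaryKey>.xml files -->
  <dataSource name="files" type="FileDataSource"
              basePath="/path/to/extracted" encoding="UTF-8"/>
  <document>
    <!-- Parent entity: one Solr document per database row (hypothetical table) -->
    <entity name="dbRow" dataSource="db"
            query="SELECT primaryKey, title FROM documents">
      <field column="primaryKey" name="id"/>
      <field column="title" name="title"/>
      <!-- Child entity: reads the file named after the current row's primary key -->
      <entity name="pdfcontent" dataSource="files"
              processor="XPathEntityProcessor"
              forEach="/foo"
              url="${dbRow.primaryKey}.xml">
        <field column="pdftext" xpath="/foo"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

The pdftext field (and any other field named here) would also need to be
declared in schema.xml.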


And, by the way, in Solr 1.4 you do not have to put your content between
XML tags: use the PlainTextEntityProcessor instead of XPathEntityProcessor.
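
A sketch of what the Solr 1.4 variant might look like, assuming the
extracted text is stored unwrapped in files named after the primary key
with a .txt extension, and that a FileDataSource named files (a
hypothetical name) is defined in the same data-config.xml:

```xml
<entity name="pdfcontent" dataSource="files"
        processor="PlainTextEntityProcessor"
        url="${dbRow.primaryKey}.txt">
  <!-- PlainTextEntityProcessor puts the whole file into one implicit column, plainText -->
  <field column="plainText" name="pdftext"/>
</entity>
```

No forEach or xpath attributes are needed here, since the file content is
not parsed.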

Best,
Sascha

Re: How to use DataImportHandler with ExtractingRequestHandler?

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>.

Unfortunately, DIH is not yet integrated with ExtractingRequestHandler.
See https://issues.apache.org/jira/browse/SOLR-1358






-- 
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com