Posted to solr-user@lucene.apache.org by "Sharma, Vikas" <vs...@medassets.com> on 2013/10/24 00:39:16 UTC

single core for extracted text from pdf/other doc types and metadata fields about that doc from the database

Can I create a single core where one subset of fields comes from a database source via the DataImportHandler,
and another subset of fields comes from the DataImportHandler's Apache Tika processor?

For example, in the indexed doc I want the following fields to come from the database source:

1              Id
2              DocFilePath (nullable)
3              Subject
4              KeyWords
5              Description
6              Text

and another set of field(s) to come from documents on the filesystem, with text extracted using the Apache Tika processor:

7              DocText


so that the final doc fields are as follows,
where DocText is the text of the document whose path is given in the DocFilePath column:

1              Id
2              DocFilePath (nullable)
3              Subject
4              KeyWords
5              Description
6              Text
7              DocText


Thanks,
Vikas

Vikas Sharma | Senior Software Engineer | MedAssets


Re: single core for extracted text from pdf/other doc types and metadata fields about that doc from the database

Posted by Otis Gospodnetic <ot...@gmail.com>.
You can accomplish your end goal easily by writing your own indexer,
which is straightforward and gives you more power and flexibility.
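
As a rough illustration, a custom indexer along these lines might look like the
sketch below. It is only a sketch under several assumptions: the table name
`docs`, the SQLite connection, and the function names are all hypothetical, and
`extract_text` is a stand-in that reads plain text where a real indexer would
call Apache Tika. The update URL passed to `post_to_solr` is likewise assumed
to point at the core's JSON update handler.

```python
import json
import sqlite3
import urllib.request
from pathlib import Path

# Metadata columns taken from the question; the table name "docs" is assumed.
FIELDS = ("Id", "DocFilePath", "Subject", "KeyWords", "Description", "Text")


def extract_text(path):
    """Placeholder extractor (hypothetical): reads the file as plain text.

    A real indexer would hand the file to Apache Tika here so that PDF,
    Word, and other formats are handled.
    """
    return Path(path).read_text(encoding="utf-8", errors="replace")


def build_docs(conn, extractor=extract_text):
    """Yield one Solr document per database row, merged with the file text."""
    query = "SELECT " + ", ".join(FIELDS) + " FROM docs"
    for row in conn.execute(query):
        # Drop NULL columns so they are simply absent from the Solr document.
        doc = {name: value for name, value in zip(FIELDS, row) if value is not None}
        # DocFilePath is nullable: only extract text when a path is present.
        path = doc.get("DocFilePath")
        if path:
            doc["DocText"] = extractor(path)
        yield doc


def post_to_solr(docs, update_url):
    """POST the documents as a JSON array to the given update URL and commit."""
    body = json.dumps(list(docs)).encode("utf-8")
    request = urllib.request.Request(
        update_url + "?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

With a real database connection in place of SQLite, the two pieces would be
combined as `post_to_solr(build_docs(conn), update_url)`. Rows with no
DocFilePath are still indexed, just without a DocText field.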

Otis
Solr & ElasticSearch Support
http://sematext.com/