Posted to solr-user@lucene.apache.org by Anupam Bhattacharya <an...@gmail.com> on 2012/03/06 18:24:55 UTC

Re: How to Index Custom XML structure

Thanks Erick, for the prompt response,

Both suggestions would be useful for a one-time indexing activity. Since
DIH would be a one-time process of indexing the repository, it is of no
use in my case. Writing a standalone Java program utilizing SolrJ would
again be a one-time indexing process.

I want to write a separate handler which will be called by a ManifoldCF job
to create indexes in Solr. In my case the repository is a Documentum Content
Server. I found a relevant article at
https://community.emc.com/docs/DOC-6520 which is quite similar to my
requirement.

I modified the code to parse the XML and add the parsed values to the
document properties. This works fine when I test it with my curl program
with parameters, but when the same handler is called from a ManifoldCF job,
the job gets terminated within a few minutes. I am not sure of the reason
for that. The handler is written along the lines of /update/extract, which
is ExtractingRequestHandler.

Is ExtractingRequestHandler capable of extracting tag names and values,
using some of its defined parameters like capture, captureAttr, extractOnly
etc., so that they can be added to the document index?
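
For illustration only, the parsing step described above (reading tag names
and values out of the custom XML so they can be attached as document
fields) might look roughly like the sketch below. It is plain Python using
the standard library, not a Solr request handler and not the code from the
linked article. The function name extract_fields and the decision to skip
empty elements are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the custom RECORD structure from the original question.
SAMPLE_RECORD = b"""<?xml version="1.0" encoding="UTF-8"?>
<RECORD>
<doc_id>1</doc_id>
<abstract>This is an abstract.</abstract>
<subject>Text Subject</subject>
<availability />
<indexing>
<index_group></index_group>
<keyterms></keyterms>
<keyterms></keyterms>
</indexing>
<publication_date></publication_date>
</RECORD>"""


def extract_fields(xml_bytes):
    """Collect tag name -> list of text values for non-empty leaf elements.

    Hypothetical helper: elements that have children (RECORD, indexing) and
    elements without text (availability, keyterms, ...) are skipped.
    """
    root = ET.fromstring(xml_bytes)
    fields = {}
    for elem in root.iter():
        text = (elem.text or "").strip()
        if len(elem) == 0 and text:
            # A list per tag, so repeated tags could map to multi-valued fields.
            fields.setdefault(elem.tag, []).append(text)
    return fields


fields = extract_fields(SAMPLE_RECORD)
```

Each key in the resulting dictionary is a candidate field name, and each
list holds the values that would be added to the document under that name.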

On Tue, Feb 28, 2012 at 8:26 AM, Erick Erickson <er...@gmail.com> wrote:

> You might be able to do something with the XSL Transformer step in DIH.
>
> It might also be easier to just write a SolrJ program to parse the XML and
> construct a SolrInputDocument to send to Solr. It's really pretty
> straightforward.
>
> Best
> Erick
>
> On Sun, Feb 26, 2012 at 11:31 PM, Anupam Bhattacharya
> <an...@gmail.com> wrote:
> > Hi,
> >
> > I am using ManifoldCF to crawl data from a Documentum repository. I am able
> > to successfully read the metadata/properties for the defined document types
> > in Documentum using the out-of-the-box Documentum Connector in ManifoldCF.
> > Unfortunately, there is also one XML file present, which has a custom XML
> > structure; I need to read its element values and add them for indexing in
> > Lucene through Solr.
> >
> > Is there any mechanism to index a document with an arbitrary XML structure
> > in Solr?
> >
> > I checked the Solr Cell framework, which supports the structure below:
> >
> > <add>
> >  <doc>
> >    <field name="id">9885A004</field>
> >    <field name="name">Canon PowerShot SD500</field>
> >    <field name="category">camera</field>
> >    <field name="features">3x optical zoom</field>
> >    <field name="features">aluminum case</field>
> >    <field name="weight">6.4</field>
> >    <field name="price">329.95</field>
> >  </doc>
> >  <doc>
> >    <field name="id">9885A003</field>
> >    <field name="name">Canon PowerShot SD504</field>
> >    <field name="category">camera1</field>
> >    <field name="features">3x optical zoom1</field>
> >    <field name="features">aluminum case1</field>
> >    <field name="weight">6.41</field>
> >    <field name="price">329.956</field>
> >  </doc>
> > </add>
> >
> > My custom XML structure has the following format, from which I need to
> > read the *subject* and *abstract* fields for indexing. I checked the Tika
> > project but couldn't find anything useful.
> >
> > <?xml version="1.0" encoding="UTF-8"?>
> > <RECORD>
> > <doc_id>1</doc_id>
> > <abstract>This is an abstract.</abstract>
> > <subject>Text Subject</subject>
> > <availability />
> > <indexing>
> > <index_group></index_group>
> > <keyterms></keyterms>
> > <keyterms></keyterms>
> > </indexing>
> > <publication_date></publication_date>
> > <physical_storage />
> > <log_entry />
> > <legal_category />
> > <legal_category_notes />
> > <citation_only></citation_only>
> > <citation_only_desc />
> > <export_control />
> > <export_control_desc />
> > </RECORD>
> >
> > Appreciate any help on this.
> >
> > Regards
> > Anupam
>

-- 
Thanks & Regards
Anupam Bhattacharya

Re: How to Index Custom XML structure

Posted by Jan Høydahl <ja...@cominvent.com>.

You could set up a ManifoldCF job to fetch the XMLs and then set up a new SolrOutputConnection for /solr/update/xslt?tr=myStyleSheet.xsl, where myStyleSheet.xsl is the stylesheet to use for that kind of XML. See http://wiki.apache.org/solr/XsltUpdateRequestHandler
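
As a rough illustration of the mapping such a stylesheet would have to perform, the sketch below converts the custom RECORD format from the original question into Solr's <add><doc> update format. It is written in Python rather than XSLT, purely to make the field mapping concrete. The target field names (id, subject, abstract) and the helper name record_to_solr_add are assumptions for this example; the real names would have to match the fields defined in the Solr schema.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from elements in the custom RECORD XML to Solr field names.
FIELD_MAP = {"doc_id": "id", "subject": "subject", "abstract": "abstract"}

# Shortened copy of the custom RECORD structure from the original question.
SAMPLE_RECORD = b"""<?xml version="1.0" encoding="UTF-8"?>
<RECORD>
<doc_id>1</doc_id>
<abstract>This is an abstract.</abstract>
<subject>Text Subject</subject>
<availability />
<publication_date></publication_date>
</RECORD>"""


def record_to_solr_add(record_xml):
    """Return a Solr <add><doc>...</doc></add> string for one RECORD document."""
    record = ET.fromstring(record_xml)
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for source_tag, solr_field in FIELD_MAP.items():
        # Only direct children of RECORD are considered; empty ones are skipped.
        for elem in record.findall(source_tag):
            text = (elem.text or "").strip()
            if text:
                field = ET.SubElement(doc, "field", name=solr_field)
                field.text = text
    return ET.tostring(add, encoding="unicode")


solr_xml = record_to_solr_add(SAMPLE_RECORD)
```

The string in solr_xml has the same shape as the <add><doc> example quoted earlier in the thread, which is the output a stylesheet for the XSLT update handler would need to produce for each RECORD.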

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 7 March 2012, at 14:04, Erick Erickson wrote:

> Well, I'm ManifoldCF ignorant, so I'll have to defer on this one....
> 
> Best
> Erick
> 

Re: How to Index Custom XML structure

Posted by Erick Erickson <er...@gmail.com>.

Well, I'm ManifoldCF ignorant, so I'll have to defer on this one....

Best
Erick
