Posted to solr-user@lucene.apache.org by meghana <me...@amultek.com> on 2011/12/23 10:43:55 UTC

PlainTextEntityProcessor and RegexTransformer in DataImport Handler

Hi all, 

I need to import data from a text file that contains HTML, and apply some
formatting to it. I want the text inside each <p> tag, preceded by the value
of that tag's myvar attribute, as shown below.

Original Text
------------------------------------------------------------------------------------------
<div><p  myvar="12" myvar1="xyz">Hello World!!</p><p  myvar="14"
myvar1="abc">Welcome to Solr.</p><p  myvar="15" myvar1="def">Enjoy</p></div>


Needed Text After Formatting
------------------------------------------------------------------------------------------
12 : Hello World!!
14 : Welcome to Solr.
15 : Enjoy

I tried combining PlainTextEntityProcessor with RegexTransformer and
TemplateTransformer as shown below, but I get a ConfigurationError when I
use this configuration.

<entity name="xx" onError="continue"  processor="PlainTextEntityProcessor"
transformer="TemplateTransformer,RegexTransformer" url="${URL.MyTxtFile}"
dataSource="MDataSource">
                       <field column="plainText" name="FullText"   />
                       <field column="FullText"
template="${xx.FullText}" regex='&lt;p (?:\s+[^>]+)?
myvar="([^<"]*)" (?:\s+[^>]+)?>([^<]*)</p>' replaceWith="$2 : $4"/>
               </entity>

I should add that I am able to do this using TemplateTransformer and a
multivalued field by setting foreach on the entity, but I need the above
format in a single-valued field, and I have not managed to do that.

Can anybody help me get my desired result, or tell me what I am doing
wrong in the above transformer?
Thanks
Meghana
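
One thing that can be checked independently of the ConfigurationError is the regex itself. The snippet below is only a sketch: it assumes the pattern is the one in the regex attribute above (with &lt; decoded to < and the line break read as a space), and it uses JavaScript's RegExp, which as far as I know treats these particular constructs the same way as Java's regex engine. The names countGroups, source and sample are made up for illustration.

```javascript
// The pattern from the regex attribute, written as a JavaScript string.
var source = '<p (?:\\s+[^>]+)? myvar="([^<"]*)" (?:\\s+[^>]+)?>([^<]*)</p>';

// Hypothetical helper: counts capturing groups. Appending an empty
// alternative lets the pattern match "", and exec() then returns one
// slot for the whole match plus one slot per capturing group.
function countGroups(patternSource) {
    return new RegExp(patternSource + "|").exec("").length - 1;
}

// First <p> element of the sample text from the post.
var sample = '<div><p  myvar="12" myvar1="xyz">Hello World!!</p></div>';

var groupCount = countGroups(source);                 // (?:...) groups are not counted
var matchesSample = new RegExp(source).test(sample);  // does the pattern hit the sample?

console.log(groupCount, matchesSample);
```

This reports two capturing groups, so there is no group 4 for replaceWith="$2 : $4" to refer to; "$1 : $2" would address the two groups that exist. It also reports that the pattern does not match the sample: after myvar="12" the pattern consumes one literal space and then expects either more whitespace or the closing >, but the next character is the m of myvar1.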

--
View this message in context: http://lucene.472066.n3.nabble.com/PlainTextEntityProcessor-and-RegexTransformer-in-DataImport-Handler-tp3608449p3608449.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: PlainTextEntityProcessor and RegexTransformer in DataImport Handler

Posted by meghana <me...@amultek.com>.
Thanks Matthew ,

It really helped a lot. I am almost done with this.


Re: PlainTextEntityProcessor and RegexTransformer in DataImport Handler

Posted by Matthew Parker <mp...@apogeeintegration.com>.
I would try something like the following:

<dataConfig>
    <dataSource type="FileDataSource" />
    <script><![CDATA[
                function format(row){
                    var text = row.get("plainText");

                    // Do the regex processing with JavaScript's RegExp object
                    // and assign the formatted text to results.
                    var results = text;  // placeholder: unprocessed text

                    row.put("all_text", results);  // store results in the "all_text" field
                    return row;
                }
        ]]></script>
    <document>
        <entity name="f" processor="FileListEntityProcessor" baseDir="[path
to text file directory]" fileName=".*txt" rootEntity="false"
dataSource="null">
            <entity name="x" processor="PlainTextEntityProcessor"
url="${f.fileAbsolutePath}" rootEntity="true" dataSource="null"
transformer="script:format"></entity>
        </entity>
    </document>
</dataConfig>
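
The regex step inside format() could look roughly like the following. This is a minimal sketch rather than a tested DIH configuration: formatParagraphs is a hypothetical helper name, and the pattern assumes every <p> tag carries a myvar="..." attribute and contains no nested markup, as in the sample text.

```javascript
// Hypothetical helper: builds one "<myvar> : <text>" line per <p> element.
function formatParagraphs(html) {
    // Group 1: value of the myvar attribute. Group 2: text inside the <p> tag.
    // The ="..." directly after myvar keeps myvar1 from being picked up.
    var re = /<p\s[^>]*?\bmyvar="([^"]*)"[^>]*>([^<]*)<\/p>/g;
    var lines = [];
    var m;
    // With the g flag, exec() resumes after the previous match on each call.
    while ((m = re.exec(html)) !== null) {
        lines.push(m[1] + " : " + m[2]);
    }
    // Join into a single string so the result fits a single-valued field.
    return lines.join("\n");
}
```

Inside format(row) the result would then be stored with something like row.put("all_text", formatParagraphs(String(row.get("plainText")))). The String(...) conversion is an assumption on my part: the value returned by row.get is a Java string, and wrapping it is a common way to make sure it behaves as a JavaScript string inside the script engine.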


On Fri, Dec 23, 2011 at 7:41 AM, meghana <me...@amultek.com> wrote:

> Hi,
>
> Does anybody have any idea how I can achieve this?
>
> Also, if it is possible to convert a multivalued field to a non-multivalued
> field, that would also work for me.
>
> I have a custom multivalued field, ArrText, which has the values shown below:
> <arr name="ArrText" >
>    <str>12 : Hello World!!</str>
>    <str>14 : Welcome to Solr.</str>
>    <str>15 : Enjoy</str>
> </arr>
>
> If we can convert this to my desired result, that would be great.
> Thanks in advance.
> Meghana


Re: PlainTextEntityProcessor and RegexTransformer in DataImport Handler

Posted by meghana <me...@amultek.com>.
Hi,

Does anybody have any idea how I can achieve this?

Also, if it is possible to convert a multivalued field to a non-multivalued
field, that would also work for me.

I have a custom multivalued field, ArrText, which has the values shown below:
<arr name="ArrText" >
    <str>12 : Hello World!!</str>
    <str>14 : Welcome to Solr.</str>
    <str>15 : Enjoy</str>
</arr>

If we can convert this to my desired result, that would be great.
Thanks in advance.
Meghana
Meghana
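
Collapsing the values of a multivalued field into one string is, at its core, a join with a separator. The sketch below shows this in plain JavaScript; joinValues is a hypothetical helper, and it assumes the values are available as a JavaScript array, which may not be how a script transformer receives them.

```javascript
// Hypothetical helper: collapses a list of values into a single string.
function joinValues(values, separator) {
    var parts = [];
    for (var i = 0; i < values.length; i++) {
        // String() guards against non-string entries in the list.
        parts.push(String(values[i]));
    }
    return parts.join(separator);
}

// The three values of ArrText from the post, joined with newlines.
var single = joinValues(
    ["12 : Hello World!!", "14 : Welcome to Solr.", "15 : Enjoy"],
    "\n"
);
```

If the values arrive as a Java list instead of a JavaScript array, the loop would presumably need size() and get(i) in place of length and index access.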
