Posted to solr-user@lucene.apache.org by Chantal Ackermann <ch...@btelligent.de> on 2011/08/01 12:17:45 UTC

Re: Store complete XML record (DIH & XPathEntityProcessor)

Hi g,

OK, I understand your problem now. (Sorry for the late reply.)

I don't think PlainTextEntityProcessor can help you: it does not take a
regex. LineEntityProcessor does, but your record elements probably do not
each come on their own line, and you wouldn't want to depend on that
anyway.

I guess you would be best off writing your own entity processor, maybe
by extending XPathEntityProcessor if that gives you some advantage. You
can of course also implement your own importer using SolrJ and your
favourite XML parser framework, or in any other programming language.
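
As a rough illustration of the "own importer in any other programming
language" route, here is a minimal Python sketch that uses only the
standard library. It walks the file record by record and keeps the XML of
each <record> next to its parsed child fields. The function name
split_records and the field name raw_xml are made up for this example,
and the stored XML is a re-serialization by ElementTree, so it may differ
from the input in whitespace or attribute quoting. Sending the resulting
documents to Solr is deliberately left out.

```python
import xml.etree.ElementTree as ET
from io import BytesIO


def split_records(xml_bytes, tag="record"):
    """Yield one dict per <tag> element, including that element's own XML."""
    for _event, elem in ET.iterparse(BytesIO(xml_bytes), events=("end",)):
        if elem.tag != tag:
            continue
        elem.tail = None  # drop whitespace that follows the closing tag
        # XML of this record only (re-serialized, not byte-identical to input).
        raw = ET.tostring(elem, encoding="unicode")
        # Direct children become ordinary fields, similar to XPath field mappings.
        doc = {child.tag: (child.text or "").strip() for child in elem}
        doc["raw_xml"] = raw  # hypothetical field name for the stored record
        yield doc
        elem.clear()  # release the record's children before continuing


if __name__ == "__main__":
    sample = b"<xml><record><id>1</id></record><record><id>2</id></record></xml>"
    for doc in split_records(sample):
        print(doc["id"], doc["raw_xml"])
```

Each yielded dict could then be posted to Solr with whatever client you
prefer; the point is only that the per-record XML is captured at the same
place where the per-record fields are extracted.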

If you are looking for a config-only solution, I'm not sure that there
is one. Someone else might be able to comment on that?

Cheers,
Chantal

On Thu, 2011-07-28 at 19:17 +0200, solruser@9913 wrote:
> Thanks Chantal
> I am OK with the second call and I already tried using that.  Unfortunately
> it reads the whole file into a field.  My file looks like the example below:
> <xml > 
>   <record> 
>       ... 
>   </record>
>   
>   <record> 
>       ... 
>   </record>
>  
>    <record> 
>       ... 
>   </record>
> 
> </xml>
> 
> Now the XPath does the 'for each /record' part.  For each record I also need
> to store the raw log in there.  If I use the PlainTextEntityProcessor then
> it gives me the whole file (from <xml> .. </xml>) and not each of the
> <record> </record> elements.
> 
> Am I using the PlainTextEntityProcessor wrong?
> 
> Thanks
> g

Re: Store complete XML record (DIH & XPathEntityProcessor)

Posted by Michael Sokolov <so...@ifactory.com>.

On 8/1/2011 6:17 AM, Chantal Ackermann wrote:
> If you are looking for a config-only solution, I'm not sure that there
> is one. Someone else might be able to comment on that?
>
You might want to take a look at SOLR-2597; it has a patch for
XmlStripCharFilter, which strips tags from XML for indexing (like
HtmlStripCharFilter) and also lets you specify XML element names
to include or exclude.  Not full XPath, but it might work for you.  You
would have to compile the two Java files and place them in your Solr
classpath, since the patch has not been committed.

-Mike

Re: Store complete XML record (DIH & XPathEntityProcessor)

Posted by ka...@gmx.de.

Hi g, hi Chantal,

I had the same problem.
You can use XPathEntityProcessor, but you have to add an XSL stylesheet. The drawback is wasted performance; see
http://lucene.472066.n3.nabble.com/DIH-Enhance-XPathRecordReader-to-deal-with-body-FLATTEN-true-and-body-h1-td2799005.html

Best regards
  Karsten
