Posted to solr-user@lucene.apache.org by Erick Erickson <er...@gmail.com> on 2012/06/03 14:42:44 UTC

Re: Efficiently mining or parsing data out of XML source files

This seems really odd. How big are these XML files? Where are you parsing them?
You could consider using a SolrJ program with a SAX-style parser.

But the first question I'd answer is "what is slow?". The implication
of your post is that parsing the XML is the slow part, and it really
shouldn't be taking anywhere near this long IMO...

Best
Erick
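
A minimal sketch of that SolrJ + SAX idea, written against the <mydoc>
example quoted below (the Solr URL, class name, and field names here are
just placeholders, and this is the 3.x-era SolrJ API with HttpSolrServer):

import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SaxIndexer {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();

        final SolrInputDocument doc = new SolrInputDocument();
        parser.parse(new File(args[0]), new DefaultHandler() {
            private StringBuilder title;   // buffers <title> text between events

            @Override
            public void startElement(String uri, String local, String qName,
                                     Attributes atts) {
                if ("mydoc".equals(qName)) {
                    doc.addField("id", atts.getValue("id"));      // <mydoc id="...">
                } else if ("bar".equals(qName)) {
                    doc.addField("bar", atts.getValue("attr1"));  // <bar attr1="..."/>
                } else if ("title".equals(qName)) {
                    title = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (title != null) title.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if ("title".equals(qName)) {
                    doc.addField("title", title.toString());
                    title = null;
                }
            }
            // <baz> is never looked at, so its "garbage data" is skipped
        });

        solr.add(doc);
        solr.commit();
    }
}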

On Thu, May 31, 2012 at 9:14 AM, Van Tassell, Kristian
<kr...@siemens.com> wrote:
> I'm just wondering what the general consensus is on indexing XML data to Solr in terms of parsing and mining the relevant data out of the file and putting it into Solr fields. Assume that this is the XML file and the resulting Solr fields:
>
> XML data:
> <mydoc id="1234">
> <title>foo</title>
> <bar attr1="val1"/>
> <baz>garbage data</baz>
> </mydoc>
>
> Solr Fields:
> Id=1234
> Title=foo
> Bar=val1
>
> I'd previously set this process up using XSLT and have since tested using XMLBeans, JAXB, etc. to get the relevant data. The speed at which this occurs, however, is not acceptable. 2800 objects take 11 minutes to parse and index into Solr.
>
> The big slowdown appears to be that I'm parsing the data with an XML parser.
>
> So, now I'm testing mining the data by opening the file as just a text file (using Groovy) and picking out relevant data using regular expression matching. I'm now able to parse (mine) the data and index the 2800 files in 72 seconds.
>
> So I'm wondering if the typical solution people use is to go with a non-XML solution. It seems to make sense, considering the search index only needs to store the relevant data and shouldn't have to rely on the incoming documents being XML-compliant.
>
> Thanks in advance for any thoughts on this!
> -Kristian
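
For comparison, the regex mining Kristian describes above could look
roughly like the sketch below (in Java rather than Groovy, assuming one
<mydoc> per file and the exact markup shown above; entities, CDATA
sections, or reordered attributes would all break these patterns, which is
the usual caveat with regex instead of a real parser):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexMiner {
    // Patterns keyed to the exact markup in the example above.
    private static final Pattern ID    = Pattern.compile("<mydoc\\s+id=\"([^\"]*)\"");
    private static final Pattern TITLE = Pattern.compile("<title>([^<]*)</title>");
    private static final Pattern BAR   = Pattern.compile("<bar\\s+attr1=\"([^\"]*)\"");

    public static void main(String[] args) throws Exception {
        String text = new String(Files.readAllBytes(Paths.get(args[0])),
                                 StandardCharsets.UTF_8);
        System.out.println("id="    + firstGroup(ID, text));
        System.out.println("title=" + firstGroup(TITLE, text));
        System.out.println("bar="   + firstGroup(BAR, text));
    }

    private static String firstGroup(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group(1) : null;
    }
}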

Re: Efficiently mining or parsing data out of XML source files

Posted by Jack Krupansky <ja...@basetechnology.com>.
I did see a mention yesterday of a situation involving DIH and large XML
files where it was unusually slow, but if the big XML file was broken into
many smaller files it went really fast for the same amount of data. If that
is the case, you don't need to parse all of the XML, just detect the
boundaries between "documents" and break them into smaller XML files.

-- Jack Krupansky
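
That boundary detection can stay at the level of plain string matching, no
XML parser needed. A sketch, assuming a hypothetical <mydoc> element per
logical document inside the big file (the element name and the output file
naming are made up):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.PrintWriter;

public class XmlSplitter {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        PrintWriter out = null;
        int chunk = 0;
        String line;
        while ((line = in.readLine()) != null) {
            // Each "<mydoc" marks a document boundary; start a new chunk there.
            if (line.contains("<mydoc")) {
                if (out != null) out.close();
                out = new PrintWriter(new FileWriter("chunk-" + (chunk++) + ".xml"));
            }
            if (out != null) out.println(line);  // text before the first boundary is dropped
        }
        if (out != null) out.close();
        in.close();
    }
}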

-----Original Message----- 
From: Mike Sokolov
Sent: Wednesday, June 06, 2012 8:02 AM
To: solr-user@lucene.apache.org
Cc: Erick Erickson
Subject: Re: Efficiently mining or parsing data out of XML source files

I agree, that seems odd.  We routinely index XML using either
HTMLStripCharFilter, or XmlCharFilter (see patch:
https://issues.apache.org/jira/browse/SOLR-2597), both of which parse
the XML, and we don't see such a huge speed difference from indexing 
other field types.  XmlCharFilter also allows you to specify which
elements to index if you don't want the whole file.

-Mike



Re: Efficiently mining or parsing data out of XML source files

Posted by Mike Sokolov <so...@ifactory.com>.
I agree, that seems odd.  We routinely index XML using either 
HTMLStripCharFilter, or XmlCharFilter (see patch: 
https://issues.apache.org/jira/browse/SOLR-2597), both of which parse 
the XML, and we don't see such a huge speed difference from indexing 
other field types.  XmlCharFilter also allows you to specify which 
elements to index if you don't want the whole file.

-Mike
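
For reference, a char filter like that is wired into the analyzer chain in
schema.xml roughly as below; the fieldType name is invented, the factories
are stock Solr, and XmlCharFilter from the SOLR-2597 patch would slot into
the same place once the patch is applied:

<!-- "text_xml" is a made-up name; strips markup before tokenizing -->
<fieldType name="text_xml" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>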
