Posted to users@camel.apache.org by Justinson <ju...@googlemail.com> on 2010/03/23 20:24:17 UTC

Re: handling large files

Unfortunately I'm getting an OutOfMemoryError using XPath splitting the way
you showed. I'm parsing a file with about 500,000 XML messages.

How can we use Apache Digester instead?
 

Claus Ibsen-2 wrote:
> 
> Hi
> 
> This is as far as I got with the XPath expression for splitting:
> http://svn.apache.org/viewvc?rev=825156&view=rev
> 
> 
> 
> On Wed, Oct 14, 2009 at 4:40 PM, Claus Ibsen <cl...@gmail.com>
> wrote:
>> On Wed, Oct 14, 2009 at 4:21 PM, Claus Ibsen <cl...@gmail.com>
>> wrote:
>>> Hi
>>>
>>> On Wed, Oct 14, 2009 at 4:16 PM, mcarson <mc...@amsa.com> wrote:
>>>>
>>>> It looks like the scanner might provide me with the capabilities I was
>>>> looking for regarding reading in a file in delimited chunks.  I'm
>>>> assuming I
>>>> would implement this as a bean... can the bean component be used as a
>>>> "from"
>>>> in a camel route?  I'm new to Camel, and I have never seen that done.
>>>>  Is
>>>> there an example bean (that is a consumer of some sort) that I could
>>>> use to
>>>> model my code after?
>>>>
>>>
>>> Since you use XPath, I took a dive into how to split big
>>> files.
>>> Using InputSource seems to do the trick, as it allows XPath to use
>>> SAX events, which fits with streaming.
>>>
>>> I will work a bit to get it supported nicely out of the box, and
>>> provide details on how to do it in 2.0.
>>>
>>
>> Ah yeah, the XPath evaluation will still hold the whole result in memory.
>>
>> As you can only get a result of the types listed here:
>> http://java.sun.com/j2se/1.5.0/docs/api/javax/xml/xpath/XPathConstants.html
>>
>> And none of them is stream based.
>>
>> So even with SAX parsing of the big XML file, the XPath expression
>> evaluation will result in all data being loaded into memory, or at
>> least the NodeList which contains all the split entries.
>>
>> So maybe the Scanner is better if you can do some custom clipping. I
>> believe it's regexp based, so you may be able to find a good regexp that
>> can split on </person> or something.
>>
>>
>>
>>
>>
>>
>>
>>>
>>>
>>>>
>>>>
>>>> Claus Ibsen-2 wrote:
>>>>>
>>>>> Hi
>>>>>
>>>>> How do you want to split the file?
>>>>> Is there a special character that denotes a new "record"?
>>>>>
>>>>> Using java.util.Scanner is great, as it can do streaming. That is also
>>>>> what Camel uses if you, for example, want to split by new line etc.
>>>>>
>>>>> --
>>>>> Claus Ibsen
>>>>> Apache Camel Committer
>>>>>
>>>>> Open Source Integration: http://fusesource.com
>>>>> Blog: http://davsclaus.blogspot.com/
>>>>> Twitter: http://twitter.com/davsclaus
>>>>>
>>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://www.nabble.com/handling-large-files-tp25826380p25891924.html
>>>> Sent from the Camel - Users mailing list archive at Nabble.com.
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
> 
> 
> 
> 
> 

-- 
View this message in context: http://old.nabble.com/handling-large-files-tp25826380p28005868.html
Sent from the Camel - Users mailing list archive at Nabble.com.


Re: handling large files

Posted by Claus Ibsen <cl...@gmail.com>.
On Fri, Mar 26, 2010 at 9:15 AM, Justinson <ju...@googlemail.com> wrote:
>
> Thank you very much for your advice.
>
>
> Claus Ibsen-2 wrote:
>>
>> On Tue, Mar 23, 2010 at 8:24 PM, Justinson <ju...@googlemail.com>
>> wrote:
>>>
>>> Unfortunately I'm getting an OutOfMemoryError using XPath splitting the
>>> way you showed. I'm parsing a file with about 500,000 XML messages.
>>
>> You could pre process the big file and split it into X files.
>> Maybe by using the java.util.Scanner to identify "good places" to
>> split the big file.
>>
>
> I'm just trying to handle the "format stack" properly: it's a byte stream in
> the base layer but an XML stream in the second layer. In my case the byte
> stream has no structure of its own, so I cannot split it there. Therefore I'd
> try to apply your second suggestion and use XML-aware parsing.
>
>
> Claus Ibsen-2 wrote:
>>
>>
>> Or you could try using SAX-based XML parsing when splitting, to reduce
>> the memory overhead.
>> Just use a bean for that. Something like this:
>>
>> public Iterator splitBigFile(java.io.File file) {
>>     // SAX parse the big file and return an iterator (or similar)
>>     // that can walk the XML messages you want
>> }
>>
>> And use the bean with the Camel Split EIP
>>
>>
>
> How is it possible to integrate a "push" parser paradigm more smoothly into
> Camel than hiding it behind an iterator?
>
> (For iterator-based XML splitting, StAX "pull" XML parsing is probably
> the more appropriate choice.)
>

Try googling for a solution using XPath in Java, as that is what is used
under the covers.
It has an XPathFactory where you can set features and whatnot. It may
offer ways to tweak whether it runs in pull or push mode, and whether it
can stream the result etc.
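To make that point concrete: even when the input is supplied as a SAX InputSource, javax.xml.xpath can only hand back one of the XPathConstants result types, so a NODESET evaluation still materializes a full NodeList in memory. A minimal sketch (class and method names are made up for illustration):

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathSplitDemo {

    // Evaluate an XPath against an InputSource. The implementation may read
    // the input via SAX, but the NODESET return type still builds a NodeList
    // holding every matched node in memory.
    static int countMatches(String xml, String expr) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                expr, new InputSource(new StringReader(xml)),
                XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countMatches(
                "<people><person>a</person><person>b</person></people>",
                "/people/person")); // prints 2
    }
}
```

So the InputSource overload helps the parsing side, but not the size of the result.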



>
> Claus Ibsen-2 wrote:
>>
>>
>>> How can we use Apache Digester instead?
>>
>>
>
> The Commons Digester supports an XPath-like pattern-matching syntax and uses
> SAX behind the scenes. It also exhibits the "push" paradigm of SAX, but
> introduces a stack concept for match results. That is why stream-like
> handling is supported. Unfortunately, Camel does not have support for
> Digester at the moment.
>
> Another idea: would you recommend using XStream for this task?
>
>
> Claus Ibsen-2 wrote:
>>
>>
>>> Claus Ibsen-2 wrote:
>>>>
>>>> Hi
>>>>
>>>> This is as far as I got with the XPath expression for splitting:
>>>> http://svn.apache.org/viewvc?rev=825156&view=rev
>>
>>
> --
> View this message in context: http://old.nabble.com/handling-large-files-tp25826380p28038839.html
> Sent from the Camel - Users mailing list archive at Nabble.com.
>
>



-- 
Claus Ibsen
Apache Camel Committer

Author of Camel in Action: http://www.manning.com/ibsen/
Open Source Integration: http://fusesource.com
Blog: http://davsclaus.blogspot.com/
Twitter: http://twitter.com/davsclaus

Re: handling large files

Posted by Justinson <ju...@googlemail.com>.
Thank you very much for your advice.


Claus Ibsen-2 wrote:
> 
> On Tue, Mar 23, 2010 at 8:24 PM, Justinson <ju...@googlemail.com>
> wrote:
>>
>> Unfortunately I'm getting an OutOfMemoryError using XPath splitting the
>> way you showed. I'm parsing a file with about 500,000 XML messages.
> 
> You could pre process the big file and split it into X files.
> Maybe by using the java.util.Scanner to identify "good places" to
> split the big file.
> 

I'm just trying to handle the "format stack" properly: it's a byte stream in
the base layer but an XML stream in the second layer. In my case the byte
stream has no structure of its own, so I cannot split it there. Therefore I'd
try to apply your second suggestion and use XML-aware parsing.


Claus Ibsen-2 wrote:
> 
> 
> Or you could try using SAX-based XML parsing when splitting, to reduce
> the memory overhead.
> Just use a bean for that. Something like this:
> 
> public Iterator splitBigFile(java.io.File file) {
>     // SAX parse the big file and return an iterator (or similar)
>     // that can walk the XML messages you want
> }
> 
> And use the bean with the Camel Split EIP
> 
> 

How is it possible to integrate a "push" parser paradigm more smoothly into
Camel than hiding it behind an iterator?

(For iterator-based XML splitting, StAX "pull" XML parsing is probably
the more appropriate choice.)
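As a sketch of that StAX idea (the <person> element name is an assumption for illustration; collecting into a list keeps the example short, whereas a real splitter bean would wrap the pull loop in an Iterator and hand back one message at a time):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxSplitDemo {

    // Pull-parse the document and collect the text of each <person> element.
    // The reader only ever holds the current event, so memory use stays flat
    // no matter how many records the input contains.
    static List<String> personTexts(String xml) throws Exception {
        List<String> out = new ArrayList<>();
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        while (r.hasNext()) {
            if (r.next() == XMLStreamConstants.START_ELEMENT
                    && "person".equals(r.getLocalName())) {
                out.add(r.getElementText()); // advances past </person>
            }
        }
        r.close();
        return out;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(personTexts(
                "<people><person>alice</person><person>bob</person></people>"));
        // prints [alice, bob]
    }
}
```

Because the application drives the reader, wrapping this loop in an Iterator is straightforward, which is exactly why pull parsing fits Camel's splitter better than SAX callbacks.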


Claus Ibsen-2 wrote:
> 
> 
>> How can we use Apache Digester instead?
> 
> 

The Commons Digester supports an XPath-like pattern-matching syntax and uses
SAX behind the scenes. It also exhibits the "push" paradigm of SAX, but
introduces a stack concept for match results. That is why stream-like
handling is supported. Unfortunately, Camel does not have support for
Digester at the moment.

Another idea: would you recommend using XStream for this task?


Claus Ibsen-2 wrote:
> 
> 
>> Claus Ibsen-2 wrote:
>>>
>>> Hi
>>>
>>> This is as far as I got with the XPath expression for splitting:
>>> http://svn.apache.org/viewvc?rev=825156&view=rev
> 
> 
-- 
View this message in context: http://old.nabble.com/handling-large-files-tp25826380p28038839.html
Sent from the Camel - Users mailing list archive at Nabble.com.


Re: handling large files

Posted by Claus Ibsen <cl...@gmail.com>.
On Tue, Mar 23, 2010 at 8:24 PM, Justinson <ju...@googlemail.com> wrote:
>
> Unfortunately I'm getting an OutOfMemoryError using XPath splitting the way
> you showed. I'm parsing a file with about 500,000 XML messages.
>

You could pre-process the big file and split it into X files,
maybe by using java.util.Scanner to identify "good places" to
split the big file.
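A rough sketch of that pre-processing idea with java.util.Scanner, using a regexp lookbehind as the delimiter so each token ends right after a closing tag (the <person> element name is just an assumption for illustration, and a real document would also have a root wrapper to skip):

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class ScannerClipDemo {

    // Stream the input and cut it after every </person> close tag, using a
    // regexp lookbehind as the Scanner delimiter. Scanner only buffers a
    // window of the input, so this works on files far larger than memory.
    static List<String> clip(Reader in) {
        List<String> records = new ArrayList<>();
        try (Scanner scanner = new Scanner(in)) {
            scanner.useDelimiter("(?<=</person>)");
            while (scanner.hasNext()) {
                String record = scanner.next().trim();
                if (!record.isEmpty()) {
                    records.add(record);
                }
            }
        }
        return records;
    }

    public static void main(String[] args) {
        System.out.println(clip(new StringReader(
                "<person>a</person><person>b</person>")));
        // prints [<person>a</person>, <person>b</person>]
    }
}
```

Each record could then be written to its own smaller file, or fed straight into a route.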

Or you could try using SAX-based XML parsing when splitting, to reduce
the memory overhead.
Just use a bean for that. Something like this:

public Iterator splitBigFile(java.io.File file) {
    // SAX parse the big file and return an iterator (or similar)
    // that can walk the XML messages you want
}

And use the bean with the Camel Split EIP.
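A minimal sketch of such a bean, using a plain SAX DefaultHandler to rebuild each <person> fragment (the element name and the Reader parameter are assumptions for illustration; collecting into a list keeps it short, whereas a production bean would stream records out as the handler sees them):

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxSplitterBean {

    // Parse with SAX and rebuild each <person>...</person> fragment as a
    // string, returning an iterator over the fragments.
    public Iterator<String> splitBigFile(Reader in) throws Exception {
        final List<String> records = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            StringBuilder current;

            @Override
            public void startElement(String uri, String local, String qName,
                                     Attributes atts) {
                if ("person".equals(qName)) {
                    current = new StringBuilder("<person>");
                }
            }

            @Override
            public void characters(char[] ch, int start, int len) {
                if (current != null) {
                    current.append(ch, start, len);
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if ("person".equals(qName)) {
                    records.add(current.append("</person>").toString());
                    current = null;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(in), handler);
        return records.iterator();
    }
}
```

With the bean registered, a route could then invoke it from the Split EIP, e.g. split().method("mySplitter", "splitBigFile") — names hypothetical.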


> How can we use Apache Digester instead?
>
>
> Claus Ibsen-2 wrote:
>>
>> Hi
>>
>> This is as far as I got with the XPath expression for splitting:
>> http://svn.apache.org/viewvc?rev=825156&view=rev
>>
>>
>>
>> On Wed, Oct 14, 2009 at 4:40 PM, Claus Ibsen <cl...@gmail.com>
>> wrote:
>>> On Wed, Oct 14, 2009 at 4:21 PM, Claus Ibsen <cl...@gmail.com>
>>> wrote:
>>>> Hi
>>>>
>>>> On Wed, Oct 14, 2009 at 4:16 PM, mcarson <mc...@amsa.com> wrote:
>>>>>
>>>>> It looks like the scanner might provide me with the capabilities I was
>>>>> looking for regarding reading in a file in delimited chunks.  I'm
>>>>> assuming I
>>>>> would implement this as a bean... can the bean component be used as a
>>>>> "from"
>>>>> in a camel route?  I'm new to Camel, and I have never seen that done.
>>>>>  Is
>>>>> there an example bean (that is a consumer of some sort) that I could
>>>>> use to
>>>>> model my code after?
>>>>>
>>>>
>>>> Since you use XPath, I took a dive into how to split big
>>>> files.
>>>> Using InputSource seems to do the trick, as it allows XPath to use
>>>> SAX events, which fits with streaming.
>>>>
>>>> I will work a bit to get it supported nicely out of the box, and
>>>> provide details on how to do it in 2.0.
>>>>
>>>
>>> Ah yeah, the XPath evaluation will still hold the whole result in memory.
>>>
>>> As you can only get a result of the types listed here:
>>> http://java.sun.com/j2se/1.5.0/docs/api/javax/xml/xpath/XPathConstants.html
>>>
>>> And none of them is stream based.
>>>
>>> So even with SAX parsing of the big XML file, the XPath expression
>>> evaluation will result in all data being loaded into memory, or at
>>> least the NodeList which contains all the split entries.
>>>
>>> So maybe the Scanner is better if you can do some custom clipping. I
>>> believe it's regexp based, so you may be able to find a good regexp that
>>> can split on </person> or something.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>>
>>>>
>>>>>
>>>>>
>>>>> Claus Ibsen-2 wrote:
>>>>>>
>>>>>> Hi
>>>>>>
>>>>>> How do you want to split the file?
>>>>>> Is there a special character that denotes a new "record"?
>>>>>>
>>>>>> Using java.util.Scanner is great, as it can do streaming. That is also
>>>>>> what Camel uses if you, for example, want to split by new line etc.
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> View this message in context:
>>>>> http://www.nabble.com/handling-large-files-tp25826380p25891924.html
>>>>> Sent from the Camel - Users mailing list archive at Nabble.com.
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>>
>
> --
> View this message in context: http://old.nabble.com/handling-large-files-tp25826380p28005868.html
> Sent from the Camel - Users mailing list archive at Nabble.com.
>
>



-- 
Claus Ibsen
Apache Camel Committer

Author of Camel in Action: http://www.manning.com/ibsen/
Open Source Integration: http://fusesource.com
Blog: http://davsclaus.blogspot.com/
Twitter: http://twitter.com/davsclaus