Posted to users@camel.apache.org by Kevin Jackson <fo...@gmail.com> on 2010/01/25 11:23:08 UTC

Scanning/Splitting Large input files

Hi,

I have a problem which I assume is a relatively normal use case.  I
have to process a 700+MB input text file (not XML) and I want to
generate events/messages from this file based on splitting the file
into individual records.

Having done some digging through the archives, it seems that there are
a couple of solutions:
# use the claim-check EIP
We do not want to use a database at this point in our processing, and
with a file datastore we would run into the 'too many files in one
directory' problem. If I start implementing a partitioning scheme, the
code to handle splitting the origin data becomes much more complex than
the data processing itself - this seems like a hack upon a hack, and an
approach I wish to avoid.

# some kind of custom scanner
This post[1] seems to imply that it's possible to implement some kind
of custom splitting strategy based on a record delimiter - this is
what I would prefer to do.  Is there any further documentation on this
aspect of Camel apart from the 'splitter' page[2], which seems to
assume processing the split message in one pass, whereas I need to
generate not a List<MyMessage> but individual MyMessage instances?

Having just looked at the src, the SplitterPojoTest is indeed
processing the entire message in one pass

Thanks,
Kev


[1] http://osdir.com/ml/users-camel-apache/2009-10/msg00289.html
[2] http://camel.apache.org/splitter.html

Re: Scanning/Splitting Large input files

Posted by Claus Ibsen <cl...@gmail.com>.
On Mon, Jan 25, 2010 at 7:37 PM, Kevin Jackson <fo...@gmail.com> wrote:
> Hi,
>
>> How do you want to split your file? Can you split it on some sort of
>> token, as required by the Scanner?
>
> I need to split on '**'
>
> I wrote a simple POJO with a splitBody() method that returns a
> List<String>, using a Scanner inside the method to split on that token.
>
>> You can use the POJO to return a String value, which will then be used by
>> the Scanner.
>> For example, the splitBody method in the sample above could be defined as:
>>
>> public String splitBody() {
>>  return "\n";
>> }
>
> Ah, that is much simpler - I can use the underlying Scanner instead of
> instantiating a new one.
>
> Can you confirm that the Spring XML config is correct?  I can only
> find examples of the Java DSL for this - I'll happily provide a patch
> to the docs if the Spring XML config I posted earlier is correct.
>

A good idea is to peek into the camel-spring component, as there are a
bunch of tests using Spring XML.

e.g. this one:
https://svn.apache.org/repos/asf/camel/trunk/components/camel-spring/src/test/resources/org/apache/camel/spring/processor/splitterMethodCallTest.xml

> Thanks I'll refactor tomorrow and save myself some LOC,
> Kev
>

If you only need to split on '**' then you can do it all in XML only - no
need for a POJO:
https://svn.apache.org/repos/asf/camel/trunk/components/camel-spring/src/test/resources/org/apache/camel/spring/processor/splitterTokenizerTest.xml
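
For reference, a minimal sketch of what such an XML-only route could
look like (this assumes the tokenize expression element of the Spring
XML DSL and placeholder endpoint URIs - check the linked test for the
authoritative syntax):

<route>
  <from uri="direct:start"/>
  <split streaming="true">
    <!-- tokenize the incoming stream on the '**' record delimiter -->
    <tokenize token="**"/>
    <to uri="mock:result"/>
  </split>
</route>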

If that works for you, feel free to help with the wiki documentation:
http://camel.apache.org/contributing.html

-- 
Claus Ibsen
Apache Camel Committer

Author of Camel in Action: http://www.manning.com/ibsen/
Open Source Integration: http://fusesource.com
Blog: http://davsclaus.blogspot.com/
Twitter: http://twitter.com/davsclaus

Re: Scanning/Splitting Large input files

Posted by Kevin Jackson <fo...@gmail.com>.
Hi,

> How do you want to split your file? Can you split it on some sort of
> token, as required by the Scanner?

I need to split on '**'

I wrote a simple POJO with a splitBody() method that returns a
List<String>, using a Scanner inside the method to split on that token.
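
Roughly along these lines (an illustrative sketch of that approach
rather than the exact code; the class and method names are just
placeholders):

import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class MySplitter {

    // Scan the whole body and collect each '**'-delimited record;
    // the Camel splitter then turns each list element into a message.
    public List<String> splitBody(String body) {
        List<String> records = new ArrayList<String>();
        Scanner scanner = new Scanner(body);
        scanner.useDelimiter("\\*\\*"); // the delimiter is a regex, so '*' must be escaped
        while (scanner.hasNext()) {
            records.add(scanner.next());
        }
        scanner.close();
        return records;
    }
}

The drawback is that the whole record list is built up in memory before
the splitter sees it, which is what the simpler approach quoted below
avoids.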

> You can use the POJO to return a String value, which will then be used by
> the Scanner.
> For example, the splitBody method in the sample above could be defined as:
>
> public String splitBody() {
>  return "\n";
> }

Ah, that is much simpler - I can use the underlying Scanner instead of
instantiating a new one.

Can you confirm that the Spring XML config is correct?  I can only
find examples of the Java DSL for this - I'll happily provide a patch
to the docs if the Spring XML config I posted earlier is correct.

Thanks, I'll refactor tomorrow and save myself some LOC,
Kev

Re: Scanning/Splitting Large input files

Posted by Claus Ibsen <cl...@gmail.com>.
On Mon, Jan 25, 2010 at 4:22 PM, Kevin Jackson <fo...@gmail.com> wrote:
> Hi,
>
>> The JDK offers java.util.Scanner, which allows you to split a stream
>> on-the-fly. Camel leverages this scanner under the covers as well.
>>
>> For example, suppose you want to split a 700 MB file per line; then you
>> can use the Camel splitter and have it tokenize on \n, which should
>> leverage that Scanner under the covers. You can also enable the
>> streaming mode of the Splitter, which should prevent reading the 700 MB
>> into memory.
>>
>> So enabling streaming and having the big message split by the Scanner
>> should allow you to do this with low memory usage.
>>
>>
>> It's the createIterator method on ObjectHelper that the Camel splitter
>> will use if you use body().tokenize("\n") as the split expression.
>
> And is this also the case when you use the POJO splitting method?  I
> assumed it was, as that made the most sense, so I have followed the
> example in SplitterPojoTest.
>
> For the Java DSL :
> from("direct:start").split().method("mySplitter",
> "splitBody").streaming().to("mock:result");
>
> The equivalent spring xml :
> <route>
>  <from uri="direct:start"/>
>  <split streaming="true">
>    <bean ref="mySplitter" method="splitBody"/>
>    <to uri="mock:result"/>
>  </split>
> </route>
>
> Given that I cannot tokenize on a simple \n

How do you want to split your file? Can you split it on some sort of
token, as required by the Scanner?

You can use the POJO to return a String value, which will then be used by
the Scanner.
For example, the splitBody method in the sample above could be defined as:

public String splitBody() {
  return "\n";
}
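
For the '**' delimiter in this thread, that should presumably boil down
to something like (untested sketch; the class name is just a placeholder):

public class MySplitter {

    // Return the token the splitter's underlying java.util.Scanner
    // should split the stream on, rather than doing the scanning here.
    public String splitBody() {
        return "**";
    }
}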




>
> Thanks,
> Kev
>



-- 
Claus Ibsen
Apache Camel Committer

Author of Camel in Action: http://www.manning.com/ibsen/
Open Source Integration: http://fusesource.com
Blog: http://davsclaus.blogspot.com/
Twitter: http://twitter.com/davsclaus

Re: Scanning/Splitting Large input files

Posted by Kevin Jackson <fo...@gmail.com>.
Hi,

> The JDK offers java.util.Scanner, which allows you to split a stream
> on-the-fly. Camel leverages this scanner under the covers as well.
>
> For example, suppose you want to split a 700 MB file per line; then you
> can use the Camel splitter and have it tokenize on \n, which should
> leverage that Scanner under the covers. You can also enable the
> streaming mode of the Splitter, which should prevent reading the 700 MB
> into memory.
>
> So enabling streaming and having the big message split by the Scanner
> should allow you to do this with low memory usage.
>
>
> It's the createIterator method on ObjectHelper that the Camel splitter
> will use if you use body().tokenize("\n") as the split expression.

And is this also the case when you use the POJO splitting method?  I
assumed it was, as that made the most sense, so I have followed the
example in SplitterPojoTest.

For the Java DSL:
from("direct:start")
    .split().method("mySplitter", "splitBody").streaming()
    .to("mock:result");

The equivalent spring xml :
<route>
  <from uri="direct:start"/>
  <split streaming="true">
    <bean ref="mySplitter" method="splitBody"/>
    <to uri="mock:result"/>
  </split>
</route>

Given that I cannot tokenize on a simple \n

Thanks,
Kev

Re: Scanning/Splitting Large input files

Posted by Claus Ibsen <cl...@gmail.com>.
Hi Kevin

I think we have debated this before here on the Camel user forum so
you may be able to dig out some of those topics.

The JDK offers java.util.Scanner, which allows you to split a stream
on-the-fly. Camel leverages this scanner under the covers as well.

For example, suppose you want to split a 700 MB file per line; then you
can use the Camel splitter and have it tokenize on \n, which should
leverage that Scanner under the covers. You can also enable the
streaming mode of the Splitter, which should prevent reading the 700 MB
into memory.

So enabling streaming and having the big message split by the Scanner
should allow you to do this with low memory usage.


It's the createIterator method on ObjectHelper that the Camel splitter
will use if you use body().tokenize("\n") as the split expression.
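
Put together in the Java DSL, that would look something along the lines
of (a sketch only; the endpoint URIs are placeholders):

from("file:inbox?noop=true")
    // split the incoming file per line; streaming() keeps the splitter
    // from loading the whole 700 MB body into memory at once
    .split(body().tokenize("\n")).streaming()
    .to("seda:records");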



On Mon, Jan 25, 2010 at 11:23 AM, Kevin Jackson <fo...@gmail.com> wrote:
> Hi,
>
> I have a problem which I assume is a relatively normal use case.  I
> have to process a 700+MB input text file (not XML) and I want to
> generate events/messages from this file based on splitting the file
> into individual records.
>
> Having done some digging through the archives, it seems that there are
> a couple of solutions:
> # use the claim-check EIP
> We do not want to use a database at this point in our processing, and
> with a file datastore we would run into the 'too many files in one
> directory' problem. If I start implementing a partitioning scheme, the
> code to handle splitting the origin data becomes much more complex than
> the data processing itself - this seems like a hack upon a hack, and an
> approach I wish to avoid.
>
> # some kind of custom scanner
> This post[1] seems to imply that it's possible to implement some kind
> of custom splitting strategy based on a record delimiter - this is
> what I would prefer to do.  Is there any further documentation on this
> aspect of Camel apart from the 'splitter' page[2], which seems to
> assume processing the split message in one pass, whereas I need to
> generate not a List<MyMessage> but individual MyMessage instances?
>
> Having just looked at the src, the SplitterPojoTest is indeed
> processing the entire message in one pass
>
> Thanks,
> Kev
>
>
> [1] http://osdir.com/ml/users-camel-apache/2009-10/msg00289.html
> [2] http://camel.apache.org/splitter.html
>



-- 
Claus Ibsen
Apache Camel Committer

Author of Camel in Action: http://www.manning.com/ibsen/
Open Source Integration: http://fusesource.com
Blog: http://davsclaus.blogspot.com/
Twitter: http://twitter.com/davsclaus