Posted to users@camel.apache.org by Bart Frackiewicz <ba...@open-medium.com> on 2008/09/02 17:39:41 UTC

Splitter for big files

Hi,

I am using this route for a couple of CSV file routes:

   from("file:/tmp/input/?delete=true")
   .splitter(body(InputStream.class).tokenize("\r\n"))
   .beanRef("myBean", "process")
   .to("file:/tmp/output/?append=true")

This works fine for small CSV files, but for big files I noticed
that Camel uses a lot of memory; it seems that Camel reads the
whole file into memory. What is the configuration to use a stream
in the splitter?

I noticed the same behaviour with the XPath splitter:

   from("file:/tmp/input/?delete=true")
   .splitter(ns.xpath("//member"))
   ...

BTW, I found a posting from March where James suggests the following
implementation for a custom splitter:

-- quote --

   from("file:///c:/temp?noop=true").
     splitter().method("myBean", "split").
     to("activemq:someQueue")

Then register "myBean" with a split method...

class SomeBean {
   public Iterator split(File file) {
      // figure out how to split this file into rows...
   }
}
-- quote --

But this won't work for me (Camel 1.4).

Bart

RE: [SPAM] RE: Splitter for big files

Posted by Claus Ibsen <ci...@silverbullet.dk>.
Hi

With or without these improvements, the transaction issue stays the same.

The patches just improve memory usage so that the entire file is no longer loaded into memory before splitting.

The transactional issue should be handled by an external transaction manager such as Spring, or JTA in a J2EE container. Note that this usually only works with JMS and JDBC.

So if, for instance, you want to read a big file, split it into lines, process each line and store each line in a database, you could put the exchanges on a JMS queue before they are stored in the database to establish a safe point. JMS can then redeliver until the database is updated:

from(file).split().to(jms);
from(jms).process().to(jdbc);


Med venlig hilsen
 
Claus Ibsen
......................................
Silverbullet
Skovsgårdsvænget 21
8362 Hørning
Tlf. +45 2962 7576
Web: www.silverbullet.dk



RE: Splitter for big files

Posted by cmoulliard <cm...@gmail.com>.
If we implement what the different stakeholders propose, can we guarantee
that, if a problem occurs while parsing the file, the messages already
created (by the batch or by the tokenisation) will be rolled back?

Kind regards,

 



-----
Enterprise Architect

Xpectis
12, route d'Esch
L-1470 Luxembourg

Phone +352 25 10 70 470
Mobile +352 621 45 36 22

e-mail : cmoulliard@xpectis.com
web site : www.xpectis.com
My Blog : http://cmoulliard.blogspot.com/
-- 
View this message in context: http://www.nabble.com/Splitter-for-big-files-tp19272583s22882p19289425.html
Sent from the Camel - Users mailing list archive at Nabble.com.


RE: Splitter for big files

Posted by Claus Ibsen <ci...@silverbullet.dk>.
Hi

I have created 2 tickets to track this:
CAMEL-875, CAMEL-876



RE: Splitter for big files

Posted by Claus Ibsen <ci...@silverbullet.dk>.
Ah, of course, well spotted. The tokenize expression is the memory hog. Good idea with java.util.Scanner.

So combined with the batch feature we should be able to operate on really big files without consuming too much memory ;)



Re: Splitter for big files

Posted by Gert Vanthienen <ge...@skynet.be>.
L.S.,

Just added my pair of eyes ;).  One part of the problem is indeed the 
list of exchanges that is returned by the expression, but I think you're 
also reading the entire file into memory a first time when tokenizing 
it.  ExpressionBuilder.tokenizeExpression() converts the body to a String 
and then uses a StringTokenizer on it.  I think we could add support 
there for tokenizing Files, InputStreams and Readers directly using a 
Scanner.

Regards,

Gert
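[Editor's note] What Gert describes could look roughly like this (a hypothetical sketch, not the actual Camel change; the class and method names are made up): feed the stream to a java.util.Scanner so tokens are pulled on demand, instead of converting the whole body to one String first.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class ScannerTokenizeSketch {
    // Tokenize a stream with Scanner: tokens are read on demand, so the
    // input never has to be materialized as one big String. (The tokens
    // are collected into a list here only to make the demo easy to check.)
    public static List<String> tokenize(InputStream in, String delimiter) {
        Scanner scanner = new Scanner(in);
        scanner.useDelimiter(delimiter);
        List<String> tokens = new ArrayList<String>();
        while (scanner.hasNext()) {
            tokens.add(scanner.next());
        }
        scanner.close();
        return tokens;
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream("row1\r\nrow2\r\nrow3".getBytes());
        System.out.println(tokenize(in, "\r\n"));   // [row1, row2, row3]
    }
}
```

The same Scanner works for File and Readable sources, which is why it fits all three types Gert mentions.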


RE: Splitter for big files

Posted by Claus Ibsen <ci...@silverbullet.dk>.
Hi

CAMEL-876 covers any type of file. The batch option just means that Camel will chop its internal list of "lines" into chunks of a fixed size and then process the "lines" batch by batch, e.g.:

After
=====
Read 20 lines
Process 20 exchanges
Read 20 lines
Process 20 exchanges
...


Before
======
Read *all* lines
Process *all* lines
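[Editor's note] In plain Java the batching idea could be sketched as follows (a hypothetical illustration, not the CAMEL-875 patch itself; a real implementation would process each batch and discard it rather than accumulate the results):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class BatchSplitSketch {
    // Read lines in fixed-size batches so that only one batch of lines
    // needs to be held in memory at a time.
    public static List<List<String>> readInBatches(Reader source, int batchSize) throws IOException {
        BufferedReader reader = new BufferedReader(source);
        List<List<String>> processed = new ArrayList<List<String>>();
        List<String> batch = new ArrayList<String>(batchSize);
        String line;
        while ((line = reader.readLine()) != null) {
            batch.add(line);
            if (batch.size() == batchSize) {
                processed.add(batch);   // "process" the full batch here
                batch = new ArrayList<String>(batchSize);
            }
        }
        if (!batch.isEmpty()) {
            processed.add(batch);       // the final, possibly smaller batch
        }
        reader.close();
        return processed;
    }

    public static void main(String[] args) throws IOException {
        Reader r = new StringReader("1\n2\n3\n4\n5");
        System.out.println(readInBatches(r, 2));   // [[1, 2], [3, 4], [5]]
    }
}
```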


Workaround
==========
If you are already using ServiceMix then you are in luck, as you have found a solution there that works. ServiceMix integrates with Camel via the camel-jbi component. However, I am not familiar with how that works; the ServiceMix user forum will be able to help.

We will soon start working on these issues, though, so we will be able to help you with Camel alone. If you have a little patience and are able to test as we go, that would be very helpful.




RE: Splitter for big files

Posted by Claus Ibsen <ci...@silverbullet.dk>.
Hi Bart

Glad it's working for you. Please feel welcome to write again if you need assistance.



Re: Splitter for big files

Posted by Bart Frackiewicz <ba...@open-medium.com>.
Hi Claus,

we used your code as a first draft and it works like a charm. I would 
like to thank you for your great support.

Note: we added scanner.close() and also set the typical headers (size 
and counter) that the splitter provides.

Bart



RE: Splitter for big files

Posted by Claus Ibsen <ci...@silverbullet.dk>.
Hi

A workaround I can think of right now:

// read the big file, split it into lines and send each line to the seda queue (async)
from("file://mybigfiles").process(new Processor() {
    public void process(Exchange e) throws Exception {
        // need to get hold of a producer template to easily send the lines to the seda queue
        final ProducerTemplate template = e.getContext().createProducerTemplate();

        // get the file from the body
        File file = e.getIn().getBody(File.class);
        // create the scanner that splits into lines without reading everything into memory
        // (mind you can also pass an encoding to the scanner if you like)
        Scanner scanner = new Scanner(file);
        // use our token as the delimiter
        scanner.useDelimiter("\r\n");
        // loop and send each line to the seda queue
        while (scanner.hasNext()) {
            String line = scanner.next();
            template.sendBody("seda:myfileline", line);
        }
    }
});

// then do what you normally would do next
from("seda:myfileline").to("xxxx");


Mind the code is pseudo code written in an email editor.



RE: Splitter for big files

Posted by Claus Ibsen <ci...@silverbullet.dk>.
Hi 

A workaround, or a solution if you will, is also to do the splitting yourself.

You can do this in a processor, or in a POJO using bean binding.

Camel is very flexible, and if there is something you are missing you can always code it yourself and integrate it easily with Camel. It never takes the power of coding away from you.
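[Editor's note] A sketch of such a POJO (the class name is made up), in the spirit of the SomeBean example quoted in the first post: the split method returns a lazy Iterator, so whoever drives it, e.g. Camel via bean binding, only ever pulls one line at a time.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Iterator;
import java.util.Scanner;

public class LineSplitBean {
    // Returns a lazy iterator over the lines of the file; the file is
    // read incrementally as the iterator is consumed, never all at once.
    public Iterator<String> split(File file) throws FileNotFoundException {
        final Scanner scanner = new Scanner(file);
        scanner.useDelimiter("\r\n|\n");
        return new Iterator<String>() {
            private boolean closed = false;
            public boolean hasNext() {
                if (closed) {
                    return false;
                }
                if (!scanner.hasNext()) {
                    scanner.close();   // release the file handle when done
                    closed = true;
                    return false;
                }
                return true;
            }
            public String next() { return scanner.next(); }
            public void remove() { throw new UnsupportedOperationException(); }
        };
    }
}
```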




Re: Splitter for big files

Posted by Bart Frackiewicz <ba...@open-medium.com>.
Hi Claus,

Claus Ibsen schrieb:
> 
> Could you create a JIRA ticket for this improvement?

Thank you for creating the JIRA tickets. Does CAMEL-876 cover the XPath 
issue? The batch functionality was new to me.

> Btw, how big are the files you use?

We process old datasets of up to 1 GB; the XML files are about 250 MB.

Can I create a workaround for this? A colleague of mine found a good 
implementation in ServiceMix, which splits the lines and sends the new 
exchanges directly onto the message bus. I am not experienced enough 
with Camel to judge this. Maybe you can give me a hint how you would 
solve it.

Bart

RE: Splitter for big files

Posted by Claus Ibsen <ci...@silverbullet.dk>.
Hi

Looking into the source code of the splitter, it looks like it creates the list of split exchanges before they are processed. That is why it consumes a lot of memory for big files.

Maybe some kind of batch size option is needed, so that you could set for instance 20 as the batch size:

   .splitter(body(InputStream.class).tokenize("\r\n").batchSize(20))

Could you create a JIRA ticket for this improvement?
Btw, how big are the files you use?

The file component uses a File as the body object. So when you split using the input stream, Camel should use the type converter from File -> InputStream, which doesn't read the entire content into memory. It is in the splitter, where it creates the entire list of new exchanges to fire, that the memory is consumed.

At least that is what I can read from the source code after a long day's work, so please read the code too, as 4 eyes are better than 2 ;)


