Posted to users@camel.apache.org by Claus Ibsen <ci...@silverbullet.dk> on 2008/09/03 16:04:47 UTC

RE: [SPAM] RE: Splitter for big files

Hi

With or without these improvements, the transaction issue is still the same.

The patches just improve memory usage, so that the entire file is no longer loaded into memory before splitting.

The transactional issue should be handled by an external transaction manager, such as Spring, or JTA in a J2EE container. Note that this usually only works with JMS and JDBC resources.

So if you, for instance, want to read a big file, split it into lines, process each line, and store each line in a database, then you could put the exchanges on a JMS queue as a safe point before they are stored in the database. The JMS broker can then redeliver until the database update succeeds.

from(file).split().to(jms);
from(jms).process().to(jdbc);
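
To make the idea concrete, here is a rough sketch of such a pair of routes in the Java DSL. The endpoint URIs and the processor are made-up placeholders, and the jdbc endpoint assumes the processor has turned each line into an SQL statement:

   // Sketch only: "activemq" is assumed to be a transacted JMS component,
   // and LineToSqlProcessor is a hypothetical processor
   from("file:/tmp/input?delete=true")
       .splitter(body(InputStream.class).tokenize("\r\n"))
       .to("activemq:queue:lines");       // safe point: each line is persisted on the broker

   from("activemq:queue:lines")           // a transacted consumer redelivers on failure
       .process(new LineToSqlProcessor()) // hypothetical: turn the line into an SQL statement
       .to("jdbc:myDataSource");          // an exception here rolls back and triggers redelivery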


Kind regards
 
Claus Ibsen
......................................
Silverbullet
Skovsgårdsvænget 21
8362 Hørning
Tel. +45 2962 7576
Web: www.silverbullet.dk

-----Original Message-----
From: cmoulliard [mailto:cmoulliard@gmail.com] 
Sent: 3. september 2008 15:41
To: camel-user@activemq.apache.org
Subject: [SPAM] RE: Splitter for big files


If we implement what the different stakeholders propose, can we guarantee
that, if a problem occurs during the parsing of the file, the messages
already created (by the batching or the tokenisation) will be rolled back?

Kind regards,

 

Claus Ibsen wrote:
> 
> Hi
> 
> I have created 2 tickets to track this:
> CAMEL-875, CAMEL-876
> 
> Kind regards
>  
> Claus Ibsen
> ......................................
> Silverbullet
> Skovsgårdsvænget 21
> 8362 Hørning
> Tel. +45 2962 7576
> Web: www.silverbullet.dk
> 
> -----Original Message-----
> From: Claus Ibsen [mailto:ci@silverbullet.dk] 
> Sent: 2. september 2008 21:44
> To: camel-user@activemq.apache.org
> Subject: RE: Splitter for big files
> 
> Ah, of course, well spotted. The tokenize expression is the memory hog.
> Good idea with java.util.Scanner.
> 
> So, combined with the batch stuff, we should be able to operate on really
> big files without consuming too much memory ;)
> 
> 
> Kind regards
>  
> Claus Ibsen
> ......................................
> Silverbullet
> Skovsgårdsvænget 21
> 8362 Hørning
> Tel. +45 2962 7576
> Web: www.silverbullet.dk
> -----Original Message-----
> From: Gert Vanthienen [mailto:gert.vanthienen@skynet.be] 
> Sent: 2. september 2008 21:28
> To: camel-user@activemq.apache.org
> Subject: Re: Splitter for big files
> 
> L.S.,
> 
> Just added my pair of eyes ;).  One part of the problem is indeed the
> list of exchanges that is returned by the expression, but I think you're
> also reading the entire file into memory once just to tokenize it.
> ExpressionBuilder.tokenizeExpression() converts the body to a String
> and then uses a StringTokenizer on it.  I think we could add support
> there for tokenizing File, InputStream and Reader instances directly
> using a Scanner.
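> 
> As a rough illustration of the idea (a sketch, not the actual patch), a
> Scanner can walk an InputStream token by token without slurping the whole
> file, and it even implements Iterator<String>:
> 
>    import java.io.InputStream;
>    import java.util.Scanner;
> 
>    // Sketch: stream tokens lazily instead of building one big String;
>    // inputStream is assumed to be the (not yet consumed) file stream
>    Scanner scanner = new Scanner(inputStream).useDelimiter("\r\n");
>    while (scanner.hasNext()) {
>        String line = scanner.next(); // only one token held in memory at a time
>        // ... create and process an exchange per line ...
>    }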
> 
> Regards,
> 
> Gert
> 
> Claus Ibsen wrote:
>> Hi
>>
>> Looking into the source code of the splitter, it looks like it creates the
>> full list of split exchanges before they are processed. That is why it
>> consumes so much memory for big files.
>>
>> Maybe some kind of batch size option is needed, so you can set a number,
>> say 20, as the batch size:
>>
>>    .splitter(body(InputStream.class).tokenize("\r\n").batchSize(20))
>>
>> Could you create a JIRA ticket for this improvement?
>> Btw, how big are the files you use?
>>
>> The file component uses a File as the message body.
>> So when you split using the input stream, Camel should use the type
>> converter from File -> InputStream, which doesn't read the entire content
>> into memory. The problem happens in the splitter, where it creates the
>> entire list of new exchanges to fire.
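>>
>> For illustration, the conversion Camel performs implicitly looks roughly
>> like this explicit call (a sketch; the exchange variable is assumed):
>>
>>    import java.io.File;
>>    import java.io.InputStream;
>>
>>    // Ask the CamelContext's type converter for a streaming view of the
>>    // file instead of reading its content into memory
>>    File file = exchange.getIn().getBody(File.class);
>>    InputStream is = exchange.getContext().getTypeConverter()
>>            .convertTo(InputStream.class, file);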
>>
>> At least that is what I can read from the source code after a long day's
>> work, so please read the code too, as 4 eyes are better than 2 ;)
>>
>>
>>
>> Kind regards
>>  
>> Claus Ibsen
>> ......................................
>> Silverbullet
>> Skovsgårdsvænget 21
>> 8362 Hørning
>> Tel. +45 2962 7576
>> Web: www.silverbullet.dk
>>
>> -----Original Message-----
>> From: Bart Frackiewicz [mailto:bart@open-medium.com] 
>> Sent: 2. september 2008 17:40
>> To: camel-user@activemq.apache.org
>> Subject: Splitter for big files
>>
>> Hi,
>>
>> I am using this route for a couple of CSV files:
>>
>>    from("file:/tmp/input/?delete=true")
>>    .splitter(body(InputStream.class).tokenize("\r\n"))
>>    .beanRef("myBean", "process")
>>    .to("file:/tmp/output/?append=true")
>>
>> This works fine for small CSV files, but for big files I noticed that
>> Camel uses a lot of memory; it seems that Camel is reading the whole file
>> into memory. What is the configuration to use a stream in the splitter?
>>
>> I noticed the same behaviour in the XPath splitter:
>>
>>    from("file:/tmp/input/?delete=true")
>>    .splitter(ns.xpath("//member"))
>>    ...
>>
>> BTW, I found a posting from March where James suggests the following
>> implementation for a custom splitter:
>>
>> -- quote --
>>
>>    from("file:///c:/temp?noop=true)").
>>      splitter().method("myBean", "split").
>>      to("activemq:someQueue")
>>
>> Then register "myBean" with a split method...
>>
>> class SomeBean {
>>    public Iterator<String> split(File file) throws FileNotFoundException {
>>       // figure out how to split this file into rows, e.g. lazily with a
>>       // Scanner (java.util.Scanner implements Iterator<String>)
>>       return new Scanner(file).useDelimiter("\r\n");
>>    }
>> }
>> -- quote --
>>
>> But this won't work for me (Camel 1.4).
>>
>> Bart
>>
>>   
> 
> 
> 


-----
Enterprise Architect

Xpectis
12, route d'Esch
L-1470 Luxembourg

Phone +352 25 10 70 470
Mobile +352 621 45 36 22

e-mail : cmoulliard@xpectis.com
web site : www.xpectis.com
My Blog : http://cmoulliard.blogspot.com/