Posted to derby-user@db.apache.org by hcadavid <he...@yahoo.com> on 2008/09/11 14:39:38 UTC

Derby problem: 13GB of space with 200000 records!

Dear friends,

I'm using Derby to record word frequencies from a large text corpus with a
Java program. It works nicely with plain statements such as "INSERT INTO
WORDS VALUES('"+word+"',1)" (it takes about 50MB to store 400,000 words),
but when I switched to prepared statements with an inner select (to improve
performance) and repeated the process, after a few hours of processing
(200MB of plain text) the database's disk consumption reached an absurd
size: 13GB! That is, 13GB of disk space to store 400,000 words (of ordinary
length) and their frequencies! What could the problem be?
The biggest file is seg0\c3c0.dat (13GB); the log files are not the problem.

Here is how I'm making insertions and updates:

        Connection con = EmbeddedDBMSConnectionBroker.getConnection();
        PreparedStatement st = con.prepareStatement("INSERT INTO WORDS VALUES(?,1)");
        st.setString(1, word);

        word = word.trim().toLowerCase();

        try {
            st.execute();
        }
        catch (SQLIntegrityConstraintViolationException e) {
            PreparedStatement ps = con.prepareStatement(
                "update words set frequency=((select frequency from words where word=?)+1) where word=?");
            ps.setString(1, word);
            ps.setString(2, word);
            ps.execute();
        }

        con.commit();
        con.close();

This method is used concurrently by 100 threads. Does anyone know the cause
of this strange behavior from Derby? (Using gigabytes of disk space just to
store a few words isn't reasonable!)

Thanks in advance

Héctor
-- 
View this message in context: http://www.nabble.com/Derby-problem%3A-13GB-of-space-with-200000-records%21-tp19433858p19433858.html
Sent from the Apache Derby Users mailing list archive at Nabble.com.


Re: Derby problem: 13GB of space with 200000 records!

Posted by hcadavid <he...@yahoo.com>.
Dear friends,

Thanks for your suggestions.

The problem is fixed! I changed the frequency-update logic from
try { insert } catch { update } to count = update; if (count == 0) insert,
keeping the multithreaded strategy (and reusing connections), and it works
like a charm: 700MB of text now produces a 13MB database.
I still don't know why this happens (apparently the problem was doing the
update inside the exception handler), and the trim operation didn't affect
the overall result.
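
As a sketch of the update-first logic described above: a hypothetical
recordWord helper, with a plain Map standing in for the WORDS table so the
example is self-contained (in real JDBC the "update" step would be an
executeUpdate() whose return value is the affected row count):

```java
import java.util.HashMap;
import java.util.Map;

public class WordCounter {
    // Update-first upsert: try the update, and only insert when no row was
    // touched. In JDBC terms: if executeUpdate("update words set
    // frequency=frequency+1 where word=?") returns 0, run the insert.
    public static void recordWord(Map<String, Integer> table, String rawWord) {
        String word = rawWord.trim().toLowerCase(); // normalize BEFORE storing

        // "update": read the current value; null means the row is absent
        Integer prev = table.get(word);
        if (prev != null) {
            table.put(word, prev + 1);   // update path: row existed
        } else {
            table.put(word, 1);          // insert path: update touched 0 rows
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> table = new HashMap<>();
        recordWord(table, "Home");
        recordWord(table, "  home ");
        recordWord(table, "cat");
        System.out.println(table.get("home")); // 2
        System.out.println(table.get("cat"));  // 1
    }
}
```

Unlike try { insert } catch { update }, this path never raises and discards
a constraint-violation exception on the common case of an existing word.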

Best regards,
Héctor




Sai Pullabhotla wrote:
> 
> The word.trim().toLowerCase() should definitely appear before setting the
> value on the statement. Otherwise, the statement actually inserts the
> original word, with whitespace and mixed case, which could be why there
> are so many rows.
> 
> Also, when you do the update in the catch block, you don't need the inner
> select. You could simply use a statement like this:
> 
> update words set frequency=frequency+1 where word=?
> 
> Sai Pullabhotla
> Phone: (402) 408-5753
> Fax: (402) 408-6861
> www.jMethods.com
> 
> 
> 
> On Thu, Sep 11, 2008 at 7:39 AM, hcadavid <he...@yahoo.com> wrote:
>> [original message trimmed]

-- 
View this message in context: http://www.nabble.com/Derby-problem%3A-13GB-of-space-with-200000-records%21-tp19433858p19465767.html
Sent from the Apache Derby Users mailing list archive at Nabble.com.


Re: Derby problem: 13GB of space with 200000 records!

Posted by Sai Pullabhotla <sa...@jmethods.com>.
The word.trim().toLowerCase() should definitely appear before setting the
value on the statement. Otherwise, the statement actually inserts the
original word, with whitespace and mixed case, which could be why there are
so many rows.

Also, when you do the update in the catch block, you don't need the inner
select. You could simply use a statement like this:

update words set frequency=frequency+1 where word=?

Sai Pullabhotla
Phone: (402) 408-5753
Fax: (402) 408-6861
www.jMethods.com



On Thu, Sep 11, 2008 at 7:39 AM, hcadavid <he...@yahoo.com> wrote:
> [original message trimmed]

Re: Derby problem: 13GB of space with 200000 records!

Posted by Suavi Ali Demir <de...@yahoo.com>.
Does the same happen if you use single thread? Does disk space use go down if you compress table?
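
(For reference: Derby does not return unused pages to the operating system
on its own, so a bloated .dat file stays large until the table is
compressed. A minimal call, assuming the WORDS table lives in Derby's
default APP schema, would be:

```sql
-- Reclaim unused space in the WORDS table; the third argument (1)
-- requests the slower sequential rebuild that uses less memory.
CALL SYSCS_UTIL.SYSCS_COMPRESS_TABLE('APP', 'WORDS', 1);
```

If disk use drops sharply after this, the space was dead row/page overhead
rather than live data.)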

Should the 

word=word.trim().toLowerCase(); 

appear before setString?
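
The ordering matters: in the posted code, setString(1, word) binds the
un-normalized string, so "Home", " home " and "HOME" would be inserted as
three distinct rows. A self-contained illustration (a plain Set standing in
for the table; the variants here are made-up sample input):

```java
import java.util.HashSet;
import java.util.Set;

public class NormalizeOrder {
    // Distinct rows stored if the raw value is bound first
    // (the ordering in the posted code).
    public static int distinctRaw(String[] words) {
        Set<String> rows = new HashSet<>();
        for (String w : words) rows.add(w);
        return rows.size();
    }

    // Distinct rows stored if the value is normalized before binding.
    public static int distinctNormalized(String[] words) {
        Set<String> rows = new HashSet<>();
        for (String w : words) rows.add(w.trim().toLowerCase());
        return rows.size();
    }

    public static void main(String[] args) {
        String[] corpus = { "Home", " home ", "HOME" };
        System.out.println(distinctRaw(corpus));        // 3 "words"
        System.out.println(distinctNormalized(corpus)); // 1 word
    }
}
```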

When you store the frequency of words, you would have records like:
"home", 1217

Then the number of rows should not exceed a few thousand. Does this text
really have 200k unique words? Do you have a unique index on the word column?

A suggestion: a single thread that re-uses the connection and the prepared
statements might go faster than multiple threads.
Regards,
Ali

--- On Thu, 9/11/08, hcadavid <he...@yahoo.com> wrote:

> From: hcadavid <he...@yahoo.com>
> Subject: Derby problem: 13GB of space with 200000 records!
> To: derby-user@db.apache.org
> Date: Thursday, September 11, 2008, 5:39 AM
> [original message trimmed]