Posted to users@jackrabbit.apache.org by Marcel Bruch <ma...@gmail.com> on 2011/09/25 15:40:39 UTC

Using Jackrabbit/JCR as IDE workspace data backend

Hi,

I'm looking for some advice whether Jackrabbit might be a good choice for my problem. Any comments on this are greatly appreciated.


= Short description of the challenge =

We've built an Eclipse-based tool that analyzes Java source files and stores its analysis results in additional files. The workspace potentially has a hundred or more projects, and each project may have up to a thousand files. Say there will be 100 projects and 1000 Java source files per project in a single workspace; then there will be 100*1000 = 100,000 files.

On a full workspace build, all these 100k files have to be compiled (by the IDE) and analyzed (by our tool) at once and the analysis results have to be dumped to disk rather fast.
But the most common use case is that a single file is changed several times per minute and thus gets frequently analyzed.

At the moment, the analysis results are dumped to disk as plain JSON files, one JSON file for each Java class. Each JSON file is around 5 to 100 KB in size; some files grow to several megabytes (<10 MB) and contain a few hundred complex JSON nodes (which might map perfectly to nodes in JCR).

= Question =

We would like to replace the simple file system approach with a more sophisticated one, and I wonder whether Jackrabbit might be a suitable backend for this use case. Since we already map all our data to JSON, Jackrabbit/JCR looks like a perfect fit, but I can't say for sure. What's your suggestion? Is Jackrabbit capable of quickly loading and storing JSON-like data, even if 100k files (nodes plus their sub-nodes) have to be updated in a very short time?

Thanks for your suggestions. If you need more details on what operations are performed or what the data looks like, I'd be glad to answer your questions.

Marcel
-- 
Eclipse Code Recommenders:
 w www.eclipse.org/recommenders
 tw www.twitter.com/marcelbruch
 g+ www.gplus.to/marcelbruch


Re: Using Jackrabbit/JCR as IDE workspace data backend

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

For some related performance metrics, see my latest performance report [1].

The SmallFileWriteTest report [2] is probably closest to what you're
trying to achieve. It measures the time it takes to write a hundred
10kB files (i.e. about 1MB of data) to the repository. With the latest
Jackrabbit version on my mid-range desktop computer (4 cores at
2.4GHz, 8GB RAM, reasonably fast disk), this takes about 200ms.

For comparison writing a hundred 10kB files directly to the file
system using a simple Groovy script takes only about 35ms, so the file
system is still the way to go if you're mostly looking for raw
performance.
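
For a rough idea of the shape of such a write test, here is a minimal sketch (this is not the actual benchmark code; the folder name, the file names, and the 10kB dummy payload are made up):

import java.io.ByteArrayInputStream;
import java.util.Calendar;

import javax.jcr.Binary;
import javax.jcr.Node;
import javax.jcr.RepositoryException;
import javax.jcr.Session;

public class SmallFileWriteSketch {

    // Sketch: write a hundred 10kB files as nt:file/nt:resource nodes
    // under a test folder and save once at the end.
    static void writeSmallFiles(Session session) throws RepositoryException {
        Node folder = session.getRootNode().addNode("smallfiles", "nt:folder");
        byte[] payload = new byte[10 * 1024]; // 10kB dummy payload
        for (int i = 0; i < 100; i++) {
            Node file = folder.addNode("file-" + i, "nt:file");
            Node content = file.addNode("jcr:content", "nt:resource");
            Binary data = session.getValueFactory()
                    .createBinary(new ByteArrayInputStream(payload));
            content.setProperty("jcr:data", data);
            content.setProperty("jcr:mimeType", "application/octet-stream");
            content.setProperty("jcr:lastModified", Calendar.getInstance());
        }
        session.save(); // one save for the whole batch
    }
}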

[1] http://people.apache.org/~jukka/jackrabbit/report-2011-09-27/report.html
[2] http://people.apache.org/~jukka/jackrabbit/report-2011-09-27/SmallFileWriteTest.png

BR,

Jukka Zitting

Re: Using Jackrabbit/JCR as IDE workspace data backend

Posted by Marcel Bruch <ma...@gmail.com>.
On 26.09.2011, at 20:42, Stefan Guggisberg wrote:

>> However, the overall performance is still a bit low (2:24-3:05 minutes in a clean repository). Any idea how the performance could be improved? Am I doing something conceptually wrong?
> 
> did you run my test with the same test data (local svn export of jackrabbit trunk)?
> 

On an svn export of jackrabbit trunk it takes ~1 minute (note that in your code the counter is incremented twice in the loop, so I have only half as many units as you had).
However, it's a factor of two slower than on your machine. The size of the attached data does not seem to be the main limiting factor, though, right?
(27 MB & 3k files -> 1 minute; 240 MB & 6k files -> 3 minutes)

Mailing List approach:

0:00:06.440: 500 units persisted. data  3 MB
0:00:15.066: 1000 units persisted. data  7 MB
0:00:47.602: 1500 units persisted. data  15 MB
0:00:50.955: 2000 units persisted. data  18 MB
0:00:54.042: 2500 units persisted. data  22 MB
0:01:00.552: 3000 units persisted. data  27 MB
Run took 0:01:07.887


Jackrabbit First Hops adapted: 

0:00:03.192: 500 units persisted.  data 3 MB 
0:00:08.062: 1000 units persisted.  data 8 MB 
0:00:38.184: 1500 units persisted.  data 19 MB 
0:00:40.160: 2000 units persisted.  data 22 MB 
0:00:42.568: 2500 units persisted.  data 25 MB 
0:00:46.677: 3000 units persisted.  data 32 MB 
Mon Sep 26 21:02:15 CEST 2011: 3245 units persisted
Run took 0:00:51.384

Re: Using Jackrabbit/JCR as IDE workspace data backend

Posted by Stefan Guggisberg <st...@gmail.com>.

On 26.09.2011, at 19:56, Marcel Bruch <ma...@gmail.com> wrote:

> Hi Stefan,
> 
> On 26.09.2011, at 18:13, Stefan Guggisberg wrote:
> 
>>> I wrote a fairly ad-hoc dump of the 5900 data files into Jackrabbit.
>>> Storing ~240 MB took roughly 3 minutes. Is this the expected time such
>>> an operation takes? Is it possible to improve the performance somehow?
>> 
>> the performance seems rather poor. it's hard to tell what's wrong
>> without having the test data. i noticed that you're storing the
>> content of the .json files as string properties. why aren't you
>> storing the json data as nodes & properties?
> 
> I had no code available for serializing the data as JCR nodes. Is there any simple snippet available somewhere?
> However, I thought this would work as a first baseline.
> 
> 
>> anyway, i quickly ran an adapted ad hoc test on my machine
>> (macbook pro 2.66 ghz, standard harddisk). the test imports
>> an 'svn export' of jackrabbit/trunk.
>> 
>> importing ~6500 files takes ~30s which is IMO decent.
> 
> Thanks for writing your test against your local files!
> 
> I ran your code and compared the execution times. Unfortunately, it's not performing faster :(
> The delta might be caused by some differences in file traversal or by the additional nodes/properties created in your code.
> 
> However, the overall performance is still a bit low (2:24-3:05 minutes in a clean repository). Any idea how the performance could be improved? Am I doing something conceptually wrong?

did you run my test with the same test data (local svn export of jackrabbit trunk)?

cheers
stefan

> I'm assuming that there is no big delta between creating hundreds of nodes and properties compared to dumping a file's content into Jackrabbit. Is this correct?
> 
> Thanks,
> Marcel
> 
> === Experiments performance results ===
> 
> 
> Jackrabbit First Hops code adapted:
> 
> 0:00:08.522: 500 units persisted.  data 17 MB 
> 0:00:17.057: 1000 units persisted.  data 33 MB 
> 0:00:31.763: 1500 units persisted.  data 53 MB 
> 0:00:41.404: 2000 units persisted.  data 72 MB 
> 0:00:53.140: 2500 units persisted.  data 97 MB 
> 0:01:02.988: 3000 units persisted.  data 113 MB 
> 0:01:16.314: 3500 units persisted.  data 133 MB 
> 0:01:35.171: 4000 units persisted.  data 143 MB 
> 0:01:49.414: 4500 units persisted.  data 173 MB 
> 0:02:04.617: 5000 units persisted.  data 204 MB 
> 0:02:12.593: 5500 units persisted.  data 221 MB 
> Mon Sep 26 19:54:58 CEST 2011: 5927 units persisted
> Run took 0:02:24.505
> 
> 
> Mailing List proposal:
> 
> 0:00:14.853: 500 units persisted. data  17 MB
> 0:00:26.353: 1000 units persisted. data  33 MB
> 0:00:36.114: 1500 units persisted. data  53 MB
> 0:00:53.274: 2000 units persisted. data  72 MB
> 0:01:06.643: 2500 units persisted. data  97 MB
> 0:01:18.230: 3000 units persisted. data  113 MB
> 0:01:36.765: 3500 units persisted. data  133 MB
> 0:01:44.245: 4000 units persisted. data  143 MB
> 0:02:04.026: 4500 units persisted. data  173 MB
> 0:02:37.533: 5000 units persisted. data  204 MB
> 0:02:48.089: 5500 units persisted. data  221 MB
> Run took 0:03:08.458
> 
> 

Re: Using Jackrabbit/JCR as IDE workspace data backend

Posted by Marcel Bruch <ma...@gmail.com>.
Hi Stefan,

On 26.09.2011, at 18:13, Stefan Guggisberg wrote:

>> I wrote a fairly ad-hoc dump of the 5900 data files into Jackrabbit.
>> Storing ~240 MB took roughly 3 minutes. Is this the expected time such
>> an operation takes? Is it possible to improve the performance somehow?
> 
> the performance seems rather poor. it's hard to tell what's wrong
> without having the test data. i noticed that you're storing the
> content of the .json files as string properties. why aren't you
> storing the json data as nodes & properties?

I had no code available for serializing the data as JCR nodes. Is there any simple snippet available somewhere?
However, I thought this would work as a first baseline.


> anyway, i quickly ran an adapted ad hoc test on my machine
> (macbook pro 2.66 ghz, standard harddisk). the test imports
> an 'svn export' of jackrabbit/trunk.
> 
> importing ~6500 files takes ~30s which is IMO decent.

Thanks for writing your test against your local files!

I ran your code and compared the execution times. Unfortunately, it's not performing faster :(
The delta might be caused by some differences in file traversal or by the additional nodes/properties created in your code.

However, the overall performance is still a bit low (2:24-3:05 minutes in a clean repository). Any idea how the performance could be improved? Am I doing something conceptually wrong?
I'm assuming that there is no big delta between creating hundreds of nodes and properties compared to dumping a file's content into Jackrabbit. Is this correct?

Thanks,
Marcel

=== Experiments performance results ===


Jackrabbit First Hops code adapted:

0:00:08.522: 500 units persisted.  data 17 MB 
0:00:17.057: 1000 units persisted.  data 33 MB 
0:00:31.763: 1500 units persisted.  data 53 MB 
0:00:41.404: 2000 units persisted.  data 72 MB 
0:00:53.140: 2500 units persisted.  data 97 MB 
0:01:02.988: 3000 units persisted.  data 113 MB 
0:01:16.314: 3500 units persisted.  data 133 MB 
0:01:35.171: 4000 units persisted.  data 143 MB 
0:01:49.414: 4500 units persisted.  data 173 MB 
0:02:04.617: 5000 units persisted.  data 204 MB 
0:02:12.593: 5500 units persisted.  data 221 MB 
Mon Sep 26 19:54:58 CEST 2011: 5927 units persisted
Run took 0:02:24.505


Mailing List proposal:

0:00:14.853: 500 units persisted. data  17 MB
0:00:26.353: 1000 units persisted. data  33 MB
0:00:36.114: 1500 units persisted. data  53 MB
0:00:53.274: 2000 units persisted. data  72 MB
0:01:06.643: 2500 units persisted. data  97 MB
0:01:18.230: 3000 units persisted. data  113 MB
0:01:36.765: 3500 units persisted. data  133 MB
0:01:44.245: 4000 units persisted. data  143 MB
0:02:04.026: 4500 units persisted. data  173 MB
0:02:37.533: 5000 units persisted. data  204 MB
0:02:48.089: 5500 units persisted. data  221 MB
Run took 0:03:08.458



Re: Using Jackrabbit/JCR as IDE workspace data backend

Posted by Stefan Guggisberg <st...@gmail.com>.
On Mon, Sep 26, 2011 at 3:51 PM, Marcel Bruch <ma...@gmail.com> wrote:
> Thanks Stefan. I gave it a try. Could you or someone else comment on
> the code and its performance?
>
> I wrote a fairly ad-hoc dump of the 5900 data files into Jackrabbit.
> Storing ~240 MB took roughly 3 minutes. Is this the expected time such
> an operation takes? Is it possible to improve the performance somehow?

the performance seems rather poor. it's hard to tell what's wrong
without having the test data. i noticed that you're storing the
content of the .json files as string properties. why aren't you
storing the json data as nodes & properties?
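
For illustration, a minimal sketch of such a mapping, assuming the JSON is parsed with org.json (node and property names are taken straight from the JSON keys and may need escaping; storing arrays as multi-valued string properties is just one possible choice):

import java.util.Iterator;

import javax.jcr.Node;
import javax.jcr.RepositoryException;

import org.json.JSONArray;
import org.json.JSONObject;

public class JsonToJcr {

    // Recursively maps a parsed JSON object to child nodes and properties.
    static void store(Node parent, JSONObject json) throws RepositoryException {
        for (Iterator<?> keys = json.keys(); keys.hasNext();) {
            String key = (String) keys.next(); // may need escaping to be a legal JCR name
            Object value = json.opt(key);
            if (value instanceof JSONObject) {
                // nested object -> child node
                store(parent.addNode(key, "nt:unstructured"), (JSONObject) value);
            } else if (value instanceof JSONArray) {
                // array -> multi-valued string property (a simplification)
                JSONArray array = (JSONArray) value;
                String[] values = new String[array.length()];
                for (int i = 0; i < array.length(); i++) {
                    values[i] = String.valueOf(array.opt(i));
                }
                parent.setProperty(key, values);
            } else {
                // primitive -> string property (typed setters could be used instead)
                parent.setProperty(key, String.valueOf(value));
            }
        }
    }
}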

anyway, i quickly ran an adapted ad hoc test on my machine
(macbook pro 2.66 ghz, standard harddisk). the test imports
an 'svn export' of jackrabbit/trunk.

importing ~6500 files takes ~30s which is IMO decent.

cheers
stefan


/////////////////////////////////////////////////////////////////////////////////////////////////////////
import org.apache.commons.io.FileUtils;
import org.apache.jackrabbit.core.TransientRepository;

import javax.jcr.Node;
import javax.jcr.Session;
import javax.jcr.SimpleCredentials;
import java.io.File;
import java.io.FileInputStream;
import java.util.Calendar;

public class JcrArtifactStoreTest {

    static final String FILE_ROOT = "/Users/stefan/tmp/jackrabbit-src/";

    static final boolean STORE_BINARY = false;

    static int count = 0;
    static long size = 0;
    static long ts = 0;

    public static void main(String[] args) throws Exception {

        TransientRepository repository = new TransientRepository();
        Session session = repository.login(
                new SimpleCredentials("admin", "admin".toCharArray()));

        ts = System.currentTimeMillis();
        long ts0 = ts;

        importNode(new File(FILE_ROOT), session.getRootNode());

        session.save();

        long ts1 = System.currentTimeMillis();
        System.out.printf("%d ms: %d units persisted. data %s\n", ts1
- ts, count,
                FileUtils.byteCountToDisplaySize(size));
        ts = ts1;

        System.out.println("Total time: " + (ts1 - ts0) + " ms");
    }

    static void importNode(File file, Node parent) throws Exception {
        if (file.isDirectory()) {
            Node newNode = parent.addNode(file.getName(), "nt:folder");
            File[] children = file.listFiles();
            if (children != null) {
                for (int i = 0; i < children.length; i++) {
                    importNode(children[i], newNode);
                }
            }
        } else {
            Node newNode = parent.addNode(file.getName(), "nt:file");
            String nt = STORE_BINARY ? "nt:resource" : "nt:unstructured";
            Node content = newNode.addNode("jcr:content", nt);
            if (STORE_BINARY) {
                content.setProperty("jcr:data", new FileInputStream(file));
            } else {
                content.setProperty("jcr:data",
FileUtils.readFileToString(file));
            }
            content.setProperty("jcr:lastModified", Calendar.getInstance());
            content.setProperty("jcr:mimeType", "application/octet-stream");

            size += file.length();
            count++;
            // note: count is incremented a second time in the check below,
            // i.e. it goes up by two per file
            if (++count % 500 == 0) {
                parent.getSession().save();

                long ts1 = System.currentTimeMillis();

                System.out.printf("%d ms: %d units persisted. data
%s\n", ts1 - ts, count,
                        FileUtils.byteCountToDisplaySize(size));
                ts = ts1;
            }
        }
    }
}



>
> The code I used to persist the data is given below. The pure I/O time
> without Jackrabbit is ~1 second on a solid state disk.
>
> Thanks for your comments,
> Marcel
>
> Mon Sep 26 15:39:05 CEST 2011: 200 units persisted.  data 5 MB
> Mon Sep 26 15:39:11 CEST 2011: 400 units persisted.  data 13 MB
> Mon Sep 26 15:39:21 CEST 2011: 600 units persisted.  data 21 MB
> Mon Sep 26 15:39:31 CEST 2011: 800 units persisted.  data 28 MB
> Mon Sep 26 15:39:35 CEST 2011: 1000 units persisted.  data 33 MB
> Mon Sep 26 15:39:40 CEST 2011: 1200 units persisted.  data 42 MB
> Mon Sep 26 15:39:44 CEST 2011: 1400 units persisted.  data 49 MB
> Mon Sep 26 15:39:50 CEST 2011: 1600 units persisted.  data 57 MB
> Mon Sep 26 15:39:54 CEST 2011: 1800 units persisted.  data 65 MB
> Mon Sep 26 15:39:58 CEST 2011: 2000 units persisted.  data 72 MB
> Mon Sep 26 15:40:10 CEST 2011: 2200 units persisted.  data 88 MB
> Mon Sep 26 15:40:15 CEST 2011: 2400 units persisted.  data 94 MB
> Mon Sep 26 15:40:22 CEST 2011: 2600 units persisted.  data 102 MB
> Mon Sep 26 15:40:26 CEST 2011: 2800 units persisted.  data 107 MB
> Mon Sep 26 15:40:30 CEST 2011: 3000 units persisted.  data 113 MB
> Mon Sep 26 15:40:36 CEST 2011: 3200 units persisted.  data 123 MB
> Mon Sep 26 15:40:40 CEST 2011: 3400 units persisted.  data 129 MB
> Mon Sep 26 15:40:45 CEST 2011: 3600 units persisted.  data 136 MB
> Mon Sep 26 15:40:48 CEST 2011: 3800 units persisted.  data 140 MB
> Mon Sep 26 15:40:58 CEST 2011: 4000 units persisted.  data 143 MB
> Mon Sep 26 15:41:18 CEST 2011: 4200 units persisted.  data 154 MB
> Mon Sep 26 15:41:24 CEST 2011: 4400 units persisted.  data 164 MB
> Mon Sep 26 15:41:38 CEST 2011: 4600 units persisted.  data 185 MB
> Mon Sep 26 15:41:43 CEST 2011: 4800 units persisted.  data 193 MB
> Mon Sep 26 15:41:50 CEST 2011: 5000 units persisted.  data 204 MB
> Mon Sep 26 15:41:56 CEST 2011: 5200 units persisted.  data 211 MB
> Mon Sep 26 15:42:00 CEST 2011: 5400 units persisted.  data 218 MB
> Mon Sep 26 15:42:05 CEST 2011: 5600 units persisted.  data 226 MB
> Mon Sep 26 15:42:10 CEST 2011: 5800 units persisted.  data 235 MB
> Mon Sep 26 15:42:15 CEST 2011: 5927 units persisted
>
>
> public class JcrArtifactStoreTest {
>
>    private TransientRepository repository;
>    private Session session;
>
>    @Before
>    public void setup() throws RepositoryException {
>
>        final File basedir = new File("recommenders/").getAbsoluteFile();
>        basedir.mkdir();
>        repository = new TransientRepository(basedir);
>        session = repository.login(
>                new SimpleCredentials("username", "password".toCharArray()));
>    }
>
>    @Test
>    public void test2() throws ConfigurationException, RepositoryException, IOException {
>
>        int i = 0;
>        int size = 0;
>        final Iterator<File> it = findDataFiles();
>        final Node rootNode = session.getRootNode();
>
>        while (it.hasNext()) {
>            final File file = it.next();
>            Node activeNode = rootNode;
>            for (final String segment : new Path(file.getAbsolutePath()).segments()) {
>                activeNode = JcrUtils.getOrAddNode(activeNode, segment);
>            }
>            // System.out.println(activeNode.getPath());
>            final String content = Files.toString(file, Charsets.UTF_8);
>            size += content.getBytes().length;
>            activeNode.setProperty("cu", content);
>            if (++i % 200 == 0) {
>                session.save();
>                System.out.printf("%s: %d units persisted.  data %s\n",
>                        new Date(), i, FileUtils.byteCountToDisplaySize(size));
>            }
>        }
>        session.save();
>        System.out.printf("%s: %d units persisted\n", new Date(), i);
>    }
>
>    @SuppressWarnings("unchecked")
>    private Iterator<File> findDataFiles() {
>        return FileUtils.iterateFiles(
>                new File("/Users/Marcel/Repositories/frankenberger-android-example-apps/"),
>                FileFilterUtils.suffixFileFilter(".json"), TrueFileFilter.TRUE);
>    }
>
>
>
>
> 2011/9/26 Stefan Guggisberg <st...@gmail.com>:
>> hi marcel,
>>
>> On Sun, Sep 25, 2011 at 3:40 PM, Marcel Bruch <ma...@gmail.com> wrote:
>>> Hi,
>>>
>>> I'm looking for some advice whether Jackrabbit might be a good choice for my problem. Any comments on this are greatly appreciated.
>>>
>>>
>>> = Short description of the challenge =
>>>
>>> We've built an Eclipse-based tool that analyzes Java source files and stores its analysis results in additional files. The workspace potentially has hundreds of projects, and each project may have up to a few thousand files. Say there will be 200 projects and 1000 Java source files per project in a single workspace; then there will be 200*1000 = 200,000 files.
>>>
>>> On a full workspace build, all these 200k files have to be compiled (by the IDE) and analyzed (by our tool) at once and the analysis results have to be dumped to disk rather fast.
>>> But the most common use case is that a single file is changed several times per minute and thus gets frequently analyzed.
>>>
>>> At the moment, the analysis results are dumped to disk as plain JSON files, one JSON file for each Java class. Each JSON file is around 5 to 100 KB in size; some files grow to several megabytes (<10 MB) and contain a few hundred complex JSON nodes (which might map perfectly to nodes in JCR).
>>>
>>> = Question =
>>>
>>> We would like to replace the simple file system approach with a more sophisticated one, and I wonder whether Jackrabbit might be a suitable backend for this use case. Since we already map all our data to JSON, Jackrabbit/JCR looks like a perfect fit, but I can't say for sure.
>>>
>>> What's your suggestion? Is Jackrabbit capable of quickly loading and storing JSON-like data, even if 200k files (nodes plus their sub-nodes) have to be updated in a very short time?
>>
>> absolutely. if the data is reasonably structured/organized jackrabbit
>> should be a perfect fit.
>> i suggest to leverage the java package space hierarchy for organizing the data
>> (i.e. org.apache.jackrabbit.core.TransientRepository ->
>> /org/apache/jackrabbit/core/TransientRepository).
>> for further data modeling recommendations see [0].
>>
>> cheers
>> stefan
>>
>> [0] http://wiki.apache.org/jackrabbit/DavidsModel
>>
>>>
>>>
>>> Thanks for your suggestions. If you need more details on what operations are performed or what the data looks like, I'd be glad to answer your questions.
>>>
>>> Marcel
>>>
>

Re: Using Jackrabbit/JCR as IDE workspace data backend

Posted by Marcel Bruch <ma...@gmail.com>.
Thanks Stefan. I gave it a try. Could you or someone else comment on
the code and its performance?

I wrote a fairly ad-hoc dump of the 5900 data files into Jackrabbit.
Storing ~240 MB took roughly 3 minutes. Is this the expected time such
an operation takes? Is it possible to improve the performance somehow?

The code I used to persist the data is given below. The pure I/O time
without Jackrabbit is ~1 second on a solid state disk.

Thanks for your comments,
Marcel

Mon Sep 26 15:39:05 CEST 2011: 200 units persisted.  data 5 MB
Mon Sep 26 15:39:11 CEST 2011: 400 units persisted.  data 13 MB
Mon Sep 26 15:39:21 CEST 2011: 600 units persisted.  data 21 MB
Mon Sep 26 15:39:31 CEST 2011: 800 units persisted.  data 28 MB
Mon Sep 26 15:39:35 CEST 2011: 1000 units persisted.  data 33 MB
Mon Sep 26 15:39:40 CEST 2011: 1200 units persisted.  data 42 MB
Mon Sep 26 15:39:44 CEST 2011: 1400 units persisted.  data 49 MB
Mon Sep 26 15:39:50 CEST 2011: 1600 units persisted.  data 57 MB
Mon Sep 26 15:39:54 CEST 2011: 1800 units persisted.  data 65 MB
Mon Sep 26 15:39:58 CEST 2011: 2000 units persisted.  data 72 MB
Mon Sep 26 15:40:10 CEST 2011: 2200 units persisted.  data 88 MB
Mon Sep 26 15:40:15 CEST 2011: 2400 units persisted.  data 94 MB
Mon Sep 26 15:40:22 CEST 2011: 2600 units persisted.  data 102 MB
Mon Sep 26 15:40:26 CEST 2011: 2800 units persisted.  data 107 MB
Mon Sep 26 15:40:30 CEST 2011: 3000 units persisted.  data 113 MB
Mon Sep 26 15:40:36 CEST 2011: 3200 units persisted.  data 123 MB
Mon Sep 26 15:40:40 CEST 2011: 3400 units persisted.  data 129 MB
Mon Sep 26 15:40:45 CEST 2011: 3600 units persisted.  data 136 MB
Mon Sep 26 15:40:48 CEST 2011: 3800 units persisted.  data 140 MB
Mon Sep 26 15:40:58 CEST 2011: 4000 units persisted.  data 143 MB
Mon Sep 26 15:41:18 CEST 2011: 4200 units persisted.  data 154 MB
Mon Sep 26 15:41:24 CEST 2011: 4400 units persisted.  data 164 MB
Mon Sep 26 15:41:38 CEST 2011: 4600 units persisted.  data 185 MB
Mon Sep 26 15:41:43 CEST 2011: 4800 units persisted.  data 193 MB
Mon Sep 26 15:41:50 CEST 2011: 5000 units persisted.  data 204 MB
Mon Sep 26 15:41:56 CEST 2011: 5200 units persisted.  data 211 MB
Mon Sep 26 15:42:00 CEST 2011: 5400 units persisted.  data 218 MB
Mon Sep 26 15:42:05 CEST 2011: 5600 units persisted.  data 226 MB
Mon Sep 26 15:42:10 CEST 2011: 5800 units persisted.  data 235 MB
Mon Sep 26 15:42:15 CEST 2011: 5927 units persisted


// imports reconstructed from the calls below (best guess):
import java.io.File;
import java.io.IOException;
import java.util.Date;
import java.util.Iterator;

import javax.jcr.Node;
import javax.jcr.RepositoryException;
import javax.jcr.Session;
import javax.jcr.SimpleCredentials;

import org.apache.commons.io.FileUtils;
import org.apache.commons.io.filefilter.FileFilterUtils;
import org.apache.commons.io.filefilter.TrueFileFilter;
import org.apache.jackrabbit.commons.JcrUtils;
import org.apache.jackrabbit.core.TransientRepository;
import org.apache.jackrabbit.core.config.ConfigurationException;
import org.eclipse.core.runtime.Path;
import org.junit.Before;
import org.junit.Test;

import com.google.common.base.Charsets;
import com.google.common.io.Files;

public class JcrArtifactStoreTest {

    private TransientRepository repository;
    private Session session;

    @Before
    public void setup() throws RepositoryException {

        final File basedir = new File("recommenders/").getAbsoluteFile();
        basedir.mkdir();
        repository = new TransientRepository(basedir);
        session = repository.login(
                new SimpleCredentials("username", "password".toCharArray()));
    }

    @Test
    public void test2() throws ConfigurationException, RepositoryException, IOException {

        int i = 0;
        int size = 0;
        final Iterator<File> it = findDataFiles();
        final Node rootNode = session.getRootNode();

        while (it.hasNext()) {
            final File file = it.next();
            Node activeNode = rootNode;
            for (final String segment : new Path(file.getAbsolutePath()).segments()) {
                activeNode = JcrUtils.getOrAddNode(activeNode, segment);
            }
            // System.out.println(activeNode.getPath());
            final String content = Files.toString(file, Charsets.UTF_8);
            size += content.getBytes().length;
            activeNode.setProperty("cu", content);
            if (++i % 200 == 0) {
                session.save();
                System.out.printf("%s: %d units persisted.  data %s
\n", new Date(), i,
                        FileUtils.byteCountToDisplaySize(size));
            }
        }
        session.save();
        System.out.printf("%s: %d units persisted\n", new Date(), i);
    }

    @SuppressWarnings("unchecked")
    private Iterator<File> findDataFiles() {
        return FileUtils.iterateFiles(
                new File("/Users/Marcel/Repositories/frankenberger-android-example-apps/"),
                FileFilterUtils.suffixFileFilter(".json"), TrueFileFilter.TRUE);
    }
}




2011/9/26 Stefan Guggisberg <st...@gmail.com>:
> hi marcel,
>
> On Sun, Sep 25, 2011 at 3:40 PM, Marcel Bruch <ma...@gmail.com> wrote:
>> Hi,
>>
>> I'm looking for some advice whether Jackrabbit might be a good choice for my problem. Any comments on this are greatly appreciated.
>>
>>
>> = Short description of the challenge =
>>
>> We've built an Eclipse-based tool that analyzes Java source files and stores its analysis results in additional files. The workspace potentially has hundreds of projects, and each project may have up to a few thousand files. Say there will be 200 projects and 1000 Java source files per project in a single workspace; then there will be 200*1000 = 200,000 files.
>>
>> On a full workspace build, all these 200k files have to be compiled (by the IDE) and analyzed (by our tool) at once and the analysis results have to be dumped to disk rather fast.
>> But the most common use case is that a single file is changed several times per minute and thus gets frequently analyzed.
>>
>> At the moment, the analysis results are dumped to disk as plain JSON files, one JSON file for each Java class. Each JSON file is around 5 to 100 KB in size; some files grow to several megabytes (<10 MB) and contain a few hundred complex JSON nodes (which might map perfectly to nodes in JCR).
>>
>> = Question =
>>
>> We would like to replace the simple file system approach with a more sophisticated one, and I wonder whether Jackrabbit might be a suitable backend for this use case. Since we already map all our data to JSON, Jackrabbit/JCR looks like a perfect fit, but I can't say for sure.
>>
>> What's your suggestion? Is Jackrabbit capable of quickly loading and storing JSON-like data, even if 200k files (nodes plus their sub-nodes) have to be updated in a very short time?
>
> absolutely. if the data is reasonably structured/organized jackrabbit
> should be a perfect fit.
> i suggest to leverage the java package space hierarchy for organizing the data
> (i.e. org.apache.jackrabbit.core.TransientRepository ->
> /org/apache/jackrabbit/core/TransientRepository).
> for further data modeling recommendations see [0].
>
> cheers
> stefan
>
> [0] http://wiki.apache.org/jackrabbit/DavidsModel
>
>>
>>
>> Thanks for your suggestions. If you need more details on what operations are performed or what the data looks like, I'd be glad to answer your questions.
>>
>> Marcel
>>

Re: Using Jackrabbit/JCR as IDE workspace data backend

Posted by Stefan Guggisberg <st...@gmail.com>.
hi marcel,

On Sun, Sep 25, 2011 at 3:40 PM, Marcel Bruch <ma...@gmail.com> wrote:
> Hi,
>
> I'm looking for some advice whether Jackrabbit might be a good choice for my problem. Any comments on this are greatly appreciated.
>
>
> = Short description of the challenge =
>
> We've built an Eclipse-based tool that analyzes Java source files and stores its analysis results in additional files. The workspace potentially has hundreds of projects, and each project may have up to a few thousand files. Say there will be 200 projects and 1000 Java source files per project in a single workspace; then there will be 200*1000 = 200,000 files.
>
> On a full workspace build, all these 200k files have to be compiled (by the IDE) and analyzed (by our tool) at once and the analysis results have to be dumped to disk rather fast.
> But the most common use case is that a single file is changed several times per minute and thus gets frequently analyzed.
>
> At the moment, the analysis results are dumped to disk as plain JSON files, one JSON file for each Java class. Each JSON file is around 5 to 100 KB in size; some files grow to several megabytes (<10 MB) and contain a few hundred complex JSON nodes (which might map perfectly to nodes in JCR).
>
> = Question =
>
> We would like to replace the simple file system approach with a more sophisticated one, and I wonder whether Jackrabbit might be a suitable backend for this use case. Since we already map all our data to JSON, Jackrabbit/JCR looks like a perfect fit, but I can't say for sure.
>
> What's your suggestion? Is Jackrabbit capable of quickly loading and storing JSON-like data, even if 200k files (nodes plus their sub-nodes) have to be updated in a very short time?

absolutely. if the data is reasonably structured/organized jackrabbit
should be a perfect fit.
i suggest to leverage the java package space hierarchy for organizing the data
(i.e. org.apache.jackrabbit.core.TransientRepository ->
/org/apache/jackrabbit/core/TransientRepository).
for further data modeling recommendations see [0].

cheers
stefan

[0] http://wiki.apache.org/jackrabbit/DavidsModel
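
A minimal sketch of that mapping, using JcrUtils.getOrAddNode from jackrabbit-jcr-commons (the class name below is just an example; error handling and explicit node types are left out):

import javax.jcr.Node;
import javax.jcr.RepositoryException;
import javax.jcr.Session;

import org.apache.jackrabbit.commons.JcrUtils;

public class ClassNameToPath {

    // Maps a fully qualified class name like
    // "org.apache.jackrabbit.core.TransientRepository" to the node at
    // /org/apache/jackrabbit/core/TransientRepository, creating missing segments.
    static Node nodeForClass(Session session, String fqn) throws RepositoryException {
        Node node = session.getRootNode();
        for (String segment : fqn.split("\\.")) {
            node = JcrUtils.getOrAddNode(node, segment);
        }
        return node;
    }
}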

>
>
> Thanks for your suggestions. If you need more details on what operations are performed or what the data looks like, I'd be glad to answer your questions.
>
> Marcel
>
> --
> Eclipse Code Recommenders:
>  w www.eclipse.org/recommenders
>  tw www.twitter.com/marcelbruch
>  g+ www.gplus.to/marcelbruch
>
>