Posted to users@jena.apache.org by Anuj Kumar <an...@gmail.com> on 2011/03/14 19:42:15 UTC

Is this the right way to work with large number of N-Triples?

Hi All,

I am new to Jena and am exploring it to work with a large number of
N-Triples. The requirement is to read a large number of N-Triples, for
example an .nt file from a DBpedia dump that may run into GBs. I have to read
these triples, pick specific ones, and link them to resources in
another set of triples. The goal is to link some of the entities based on
the Linked Data concept. Once the mapping is done, I have to query the model
from that point onwards. I don't want to load both the source and
target datasets in memory.

To achieve this, I have first created a file model maker and then a named
model for the specific dataset being mapped. Now I need to read the triples
and add the mappings to this new model. What is the right approach?

One way is to load the model using FileManager, iterate through the
statements, map them accordingly into the named model (i.e. our mapped
model), and close it at the end. This will work, but it loads all of the
triples into memory. Is this the right way to proceed, or is there a way to
read the model sequentially at the time of mapping?
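For concreteness, a rough sketch of this load-and-iterate approach (the file
names, the model name and the selection logic are only illustrative):

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.ModelMaker;
import com.hp.hpl.jena.rdf.model.Statement;
import com.hp.hpl.jena.rdf.model.StmtIterator;
import com.hp.hpl.jena.util.FileManager;

public class MapTriples {
    public static void main(String[] args) {
        // File-backed model maker and the named model that will hold the mapped triples
        ModelMaker maker = ModelFactory.createFileModelMaker("models");
        Model mapped = maker.createModel("dbpedia-mapped");

        // loadModel reads the whole dump into memory -- the step I want to avoid for GB-sized files
        Model source = FileManager.get().loadModel("dbpedia_dump.nt");

        StmtIterator it = source.listStatements();
        while (it.hasNext()) {
            Statement stmt = it.nextStatement();
            // pick the statements of interest and add them (or a derived mapping) to the named model
            mapped.add(stmt);
        }
        it.close();
        source.close();
        mapped.close();
    }
}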

Just trying to understand the efficient way to map a large set of N-Triples.
Need your suggestions.

Thanks,
Anuj

Re: Is this the right way to work with large number of N-Triples?

Posted by Anuj Kumar <an...@gmail.com>.
Sure. Will take a look at that as well. Interesting!
Regarding SARQ, I just tried it once. The errors were related to cleanup of
the Solr indexes during the tests. Here are the details:

 INFO [33039485@qtp-1012673-2] (SolrCore.java:1324) - [sarq] webapp=/solr path=/update params={wt=javabin&version=1} status=500 QTime=5
ERROR [33039485@qtp-1012673-2] (SolrException.java:139) - java.io.IOException: Cannot delete .\solr\sarq\data\index\lucene-d12b45df2c6d6ae2efebf4cb75b8da25-write.lock
        at org.apache.lucene.store.NativeFSLockFactory.clearLock(NativeFSLockFactory.java:143)
        at org.apache.lucene.store.Directory.clearLock(Directory.java:141)
        at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1541)
        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1402)
        at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:190)
        at org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:98)
        at org.apache.solr.update.DirectUpdateHandler2.deleteAll(DirectUpdateHandler2.java:167)
        at org.apache.solr.update.DirectUpdateHandler2.deleteByQuery(DirectUpdateHandler2.java:323)
        at org.apache.solr.update.processor.RunUpdateProcessor.processDelete(RunUpdateProcessorFactory.java:71)
        at org.apache.solr.handler.XMLLoader.processDelete(XMLLoader.java:234)
        at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:180)
        at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:440)
        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
        at org.mortbay.jetty.Server.handle(Server.java:326)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
        at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:943)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:843)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
        at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
        at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

ERROR [Finalizer] (SolrIndexWriter.java:242) - SolrIndexWriter was not closed prior to finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!
 INFO [33039485@qtp-1012673-2] (DirectUpdateHandler2.java:165) - [sarq] REMOVING ALL DOCUMENTS FROM INDEX
 INFO [33039485@qtp-1012673-2] (LogUpdateProcessorFactory.java:171) - {} 0 5
ERROR [33039485@qtp-1012673-2] (SolrException.java:139) - java.io.IOException: Cannot delete .\solr\sarq\data\index\lucene-d12b45df2c6d6ae2efebf4cb75b8da25-write.lock
        at org.apache.lucene.store.NativeFSLockFactory.clearLock(NativeFSLockFactory.java:143)
        at org.apache.lucene.store.Directory.clearLock(Directory.java:141)
        at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1541)
        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1402)
        at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:190)
        at org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:98)
        at org.apache.solr.update.DirectUpdateHandler2.deleteAll(DirectUpdateHandler2.java:167)
        at org.apache.solr.update.DirectUpdateHandler2.deleteByQuery(DirectUpdateHandler2.java:323)
        at org.apache.solr.update.processor.RunUpdateProcessor.processDelete(RunUpdateProcessorFactory.java:71)
        at org.apache.solr.handler.XMLLoader.processDelete(XMLLoader.java:234)
        at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:180)
        at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:440)
        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
        at org.mortbay.jetty.Server.handle(Server.java:326)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
        at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:943)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:843)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
        at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
        at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

I can take a look at it but first I need to understand the integration
point.

Thanks,
Anuj


Re: Is this the right way to work with large number of N-Triples?

Posted by Paolo Castagna <ca...@googlemail.com>.
About the failing tests, strange... I don't see any failures:
Tests run: 41, Failures: 0, Errors: 0, Skipped: 0
Share the details of your failures and I might have a look (but not today).

If you are keen, you can look at EARQ as well, which is not just about ElasticSearch.
It was done to experiment with a refactoring that made it easier to plug in different
indexes (and indeed EARQ has Lucene, Solr and ElasticSearch in it):
https://github.com/castagna/EARQ

Paolo


Re: Is this the right way to work with large number of N-Triples?

Posted by Anuj Kumar <an...@gmail.com>.
Sure, I will let you know in case I have any queries. The tests were failing
when I built SARQ on my machine, but I will look into it later. As you
mentioned, it is really useful to understand the LARQ integration as a
reference, so I am doing that.

Thanks for the info.

- Anuj


Re: Is this the right way to work with large number of N-Triples?

Posted by Paolo Castagna <ca...@googlemail.com>.

Anuj Kumar wrote:
> Thanks Paolo. I am looking into LARQ and also SARQ.

Be warned: SARQ is just an experiment (and currently unsupported).
However, if you prefer to use Solr, share with us your use case and your reasons,
and let me know if you have problems with it.

SARQ might be a little behind with respect to removals from the index,
but you can look at what LARQ does and port the same approach to SARQ.

Paolo


Re: Is this the right way to work with large number of N-Triples?

Posted by Anuj Kumar <an...@gmail.com>.
Thanks Paolo. I am looking into LARQ and also SARQ.


Re: Is this the right way to work with large number of N-Triples?

Posted by Paolo Castagna <ca...@googlemail.com>.

Yes, using LARQ (which is included in ARQ) will greatly speed up your query.
LARQ documentation is here: http://jena.sourceforge.net/ARQ/lucene-arq.html
You will need to build the Lucene index first, though.
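For reference, a minimal sketch of the LARQ pattern (the file name and the
"Futurama" search term are just illustrative): build a Lucene index over the
literals of the model with IndexBuilderString, register it as the default
index, then use the pf:textMatch property function instead of the regex FILTER:

import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.ResultSetFormatter;
import com.hp.hpl.jena.query.larq.IndexBuilderString;
import com.hp.hpl.jena.query.larq.IndexLARQ;
import com.hp.hpl.jena.query.larq.LARQ;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.util.FileManager;

public class LarqExample {
    public static void main(String[] args) {
        Model model = FileManager.get().loadModel("abstracts.nt"); // illustrative file

        // Build a Lucene index over the string literals in the model and register it
        IndexBuilderString builder = new IndexBuilderString();
        builder.indexStatements(model.listStatements());
        builder.closeWriter();
        IndexLARQ index = builder.getIndex();
        LARQ.setDefaultIndex(index);

        // Free-text lookup via pf:textMatch replaces the regex FILTER
        String q =
            "PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#> " +
            "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> " +
            "SELECT ?abstract WHERE { " +
            "  ?l pf:textMatch 'Futurama' . " +
            "  ?resource rdfs:label ?l . " +
            "  ?resource <http://dbpedia.org/ontology/abstract> ?abstract }";
        QueryExecution qe = QueryExecutionFactory.create(q, model);
        try {
            ResultSetFormatter.out(qe.execSelect());
        } finally {
            qe.close();
        }
    }
}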

Paolo


Re: Is this the right way to work with large number of N-Triples?

Posted by Anuj Kumar <an...@gmail.com>.
Hi Andy,

I have loaded a few N-Triples files into TDB offline using tdbloader.
Loading as well as querying is fast, but if I use a regex it becomes
very slow, taking a few minutes. On my 32-bit machine it takes more than
10 minutes (expected, due to limited memory of ~1.5GB) and on my 64-bit machine
(8GB) it takes around 5 minutes.

The query is pretty exhaustive; correct me if the slowness is due to the
filter:

SELECT ?abstract
WHERE {
 ?resource <http://www.w3.org/2000/01/rdf-schema#label> ?l .
 FILTER regex(?l, "Futurama", "i") .
 ?resource <http://dbpedia.org/ontology/abstract> ?abstract
}

I have loaded a few abstracts from the DBpedia dump and I am trying to get the
abstracts from the labels. This is very slow. If I remove the FILTER and give
the exact label, it is fast (presumably because of TDB indexing).

What is the right way to do such a regex or text search over the graph?
I have seen suggestions to use Lucene, and I also saw the LARQ initiative. Is
that the right way to go?

Thanks,
Anuj


Re: Is this the right way to work with large number of N-Triples?

Posted by Andy Seaborne <an...@epimorphics.com>.
Just so you know: The TDB bulkloader can load all the data offline - 
it's faster than using Fuseki for data loading online.
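For example (all names illustrative): something like "tdbloader --loc=DB dbpedia.nt"
builds the database offline, and the resulting store can then be opened and
queried from Java along these lines:

import com.hp.hpl.jena.query.Dataset;
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.ResultSetFormatter;
import com.hp.hpl.jena.tdb.TDBFactory;

public class QueryTdbStore {
    public static void main(String[] args) {
        // Open the TDB store that tdbloader populated offline ("DB" is the illustrative directory)
        Dataset dataset = TDBFactory.createDataset("DB");
        QueryExecution qe = QueryExecutionFactory.create(
                "SELECT * WHERE { ?s ?p ?o } LIMIT 10", dataset);
        try {
            ResultSetFormatter.out(qe.execSelect());
        } finally {
            qe.close();
            dataset.close();
        }
    }
}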

	Andy


Re: Is this the right way to work with large number of N-Triples?

Posted by Anuj Kumar <an...@gmail.com>.
Hi Andy,

Thanks for the info. I have loaded a few GBs using the Fuseki server, but I
didn't try RiotReader or the Java APIs for TDB. Will try that.
Thanks for the response.

Regards,
Anuj


Re: Is this the right way to work with large number of N-Triples?

Posted by Andy Seaborne <an...@epimorphics.com>.
1/ Have you considered reading the DBpedia data into TDB?  This would 
keep the triples on-disk (and have cached in-memory versions of a subset).

2/ A file can be read sequentially by using the parser directly (See 
RiotReader and pass in a Sink<Triple> that processes the stream of triples).
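A rough sketch of what 2/ might look like (the package names and the
RiotReader.parseTriples entry point are assumptions based on the ARQ/RIOT of
this era -- check the version you have; the file name and filtering logic are
illustrative):

import org.openjena.atlas.lib.Sink;
import org.openjena.riot.RiotReader;
import com.hp.hpl.jena.graph.Triple;

public class StreamNTriples {
    public static void main(String[] args) {
        // The sink is handed each triple as it is parsed; nothing accumulates in memory.
        Sink<Triple> sink = new Sink<Triple>() {
            public void send(Triple t) {
                // pick out the triples of interest here, e.g. add a mapping to another model
                System.out.println(t);
            }
            public void flush() {}
            public void close() {}
        };
        // Assumed entry point: parse the N-Triples file, pushing every triple into the sink.
        RiotReader.parseTriples("dbpedia_dump.nt", sink);
        sink.close();
    }
}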

	Andy
