Posted to users@jena.apache.org by Yasunori Yamamoto <yy...@dbcls.rois.ac.jp> on 2019/08/06 14:05:45 UTC

How to parse huge RDF data in a tar.gz file.

Hello, I'm trying to learn how to parse RDF data archived in a tar.gz
file (e.g., rdfdatasets.tar.gz, which contains a set of RDF data files)
within my Java program.
The following code works, but it is inefficient because it reads the
entire contents of each entry in the tar.gz file into main memory
before parsing.
Could you please let me know a better way that uses less memory?

TarArchiveInputStream tarInput = new TarArchiveInputStream(
    new GzipCompressorInputStream(new FileInputStream(filename)));
TarArchiveEntry currentEntry;
PipedRDFIterator<Triple> iter =
    new PipedRDFIterator<Triple>(buffersize, false, pollTimeout, maxPolls);
final PipedRDFStream<Triple> inputStream = new PipedTriplesStream(iter);

while ((currentEntry = tarInput.getNextTarEntry()) != null) {
  String currentFile = currentEntry.getName();
  Lang lang = RDFLanguages.filenameToLang(currentFile);
  parser_object = RDFParserBuilder
    .create()
    .errorHandler(ErrorHandlerFactory.errorHandlerDetailed())
    .source(new StringReader(CharStreams.toString(new InputStreamReader(tarInput))))
    .checking(checking)
    .lang(lang)
    .build();
  parser_object.parse(inputStream);
}
tarInput.close();

Sincerely yours,
Yasunori Yamamoto

Re: How to parse huge RDF data in a tar.gz file.

Posted by Yasunori Yamamoto <yy...@dbcls.rois.ac.jp>.
Thank you very much!
Yes, the crash was on the second entry, and I modified the code to use
CloseShieldInputStream.
It works without any problems.


Re: How to parse huge RDF data in a tar.gz file.

Posted by Andy Seaborne <an...@apache.org>.
Presumably on the second entry?

Protect the parser stream from the unwanted close with 
CloseShieldInputStream:

On 07/08/2019 17:57, Yasunori Yamamoto wrote:
> Hi Andy,
> 
> Thank you for your reply.
> Is the following code what you assume?
> If so, it crashed with Exception in thread "main"
> java.lang.NullPointerException.
> 
> TarArchiveInputStream tarInput = new TarArchiveInputStream(new ...);
> TarArchiveEntry currentEntry;
> while ((currentEntry = tarInput.getNextTarEntry()) != null) {
> ...

     InputStream in = new CloseShieldInputStream(tarInput);

>    parser_object = RDFParserBuilder
>      .create()
>      .errorHandler(ErrorHandlerFactory.errorHandlerDetailed())
>      .source(tarInput)

    .source(in)

>      .checking(checking)
>      .lang(lang)
>      .build();
> ...
> }
> 


which is also a good way of putting a breakpoint to track down who is 
calling close()

(TokenizerText if the first entry is a Turtle file, and the JDK XML
parser if the first entry is RDF/XML).

     Andy
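
The shielding and the "who is calling close()" trick can be illustrated with the JDK alone. What follows is only a sketch under stated assumptions: SharedStream, TracingShield and consumeAndClose are hypothetical names invented for this illustration, TracingShield is a hand-rolled stand-in for Commons IO's CloseShieldInputStream, and consumeAndClose merely imitates a parser that closes its input. It is not how Jena or Commons IO is implemented.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class CloseShieldSketch {

    /** Stand-in for the shared archive stream: reading after close() fails. */
    static class SharedStream extends FilterInputStream {
        boolean closed = false;
        SharedStream(InputStream in) { super(in); }
        @Override public int read() throws IOException {
            if (closed) throw new IOException("stream closed");
            return super.read();
        }
        @Override public int read(byte[] b, int off, int len) throws IOException {
            if (closed) throw new IOException("stream closed");
            return super.read(b, off, len);
        }
        @Override public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Hand-rolled shield: close() is swallowed and the caller is recorded. */
    static class TracingShield extends FilterInputStream {
        String closedBy = null;
        TracingShield(InputStream in) { super(in); }
        @Override public void close() {
            // Index 0 is this close() method; index 1 is whoever called it.
            StackTraceElement[] stack = new Throwable().getStackTrace();
            closedBy = stack.length > 1 ? stack[1].getMethodName() : "unknown";
            // Deliberately do not close the wrapped stream.
        }
    }

    /** Hypothetical consumer that closes its input, as some parsers do. */
    static int consumeAndClose(InputStream in, int n) throws IOException {
        byte[] buf = new byte[n];
        int read = in.read(buf, 0, n);
        in.close();
        return read;
    }

    public static void main(String[] args) throws IOException {
        SharedStream shared = new SharedStream(
            new ByteArrayInputStream("aaaabbbb".getBytes(StandardCharsets.US_ASCII)));
        TracingShield first = new TracingShield(shared);
        consumeAndClose(first, 4);
        System.out.println("close() called by: " + first.closedBy);
        // The shared stream is still open, so a second consumer can read from it.
        System.out.println("second read: " + consumeAndClose(new TracingShield(shared), 4));
        shared.close();
    }
}
```

In a debugger, a breakpoint inside the overridden close() serves the same purpose as the recorded caller name. Only the code that owns the shared stream should close it, once all entries are done.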


Re: How to parse huge RDF data in a tar.gz file.

Posted by Yasunori Yamamoto <yy...@dbcls.rois.ac.jp>.
Hi Andy,

Thank you for your reply.
Is the following code what you had in mind?
If so, it crashed with: Exception in thread "main"
java.lang.NullPointerException.

TarArchiveInputStream tarInput = new TarArchiveInputStream(new ...);
TarArchiveEntry currentEntry;
while ((currentEntry = tarInput.getNextTarEntry()) != null) {
...
  parser_object = RDFParserBuilder
    .create()
    .errorHandler(ErrorHandlerFactory.errorHandlerDetailed())
    .source(tarInput)
    .checking(checking)
    .lang(lang)
    .build();
...
}

The stack trace follows.
at org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream.read(GzipCompressorInputStream.java:296)
at java.io.InputStream.skip(java.base@9-internal/InputStream.java:351)
at org.apache.commons.compress.utils.IOUtils.skip(IOUtils.java:111)
at org.apache.commons.compress.archivers.tar.TarArchiveInputStream.skipRecordPadding(TarArchiveInputStream.java:344)
at org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextTarEntry(TarArchiveInputStream.java:271)
at ... ( where my code calls tarInput.getNextTarEntry() )

Regards,
Yasunori


Re: How to parse huge RDF data in a tar.gz file.

Posted by Andy Seaborne <an...@apache.org>.
Yasunori,

It should be possible to pass the InputStream for the tar entry contents 
directly to the RDFParserBuilder.source, no need to convert to a string 
first.

IIRC TarArchiveInputStream is a bit weird - it signals "end of file" at
the end of the tar archive entry, then the app moves to the next entry
and the input stream is then for that entry and can be passed to a new
RDFParserBuilder call.

An RDFParser does not close an inputStream it is passed.

It will need a new RDFParser for each entry.

If that is not what happens, please let us know.

     Andy
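
The per-entry "end of file" behaviour described above can be tried without any extra libraries, using a JDK analogue. This is only a sketch under stated assumptions: it swaps in java.util.zip.ZipInputStream for TarArchiveInputStream (it is not the tar API), the class and method names are hypothetical, and the sample archive is built in memory purely for illustration. Each entry is streamed through a fixed-size buffer rather than being converted to a String first.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class PerEntryStreamingSketch {

    /** Builds a small in-memory archive with two entries (sample data only). */
    static byte[] sampleArchive() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("a.ttl"));
            zip.write("<s> <p> <o> .\n".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("b.ttl"));
            zip.write("<s> <p> <o1> .\n<s> <p> <o2> .\n".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    /** Streams each entry through a fixed-size buffer and returns its byte count. */
    static Map<String, Long> entrySizes(InputStream archive) throws IOException {
        Map<String, Long> sizes = new LinkedHashMap<>();
        byte[] buffer = new byte[8192];
        try (ZipInputStream zip = new ZipInputStream(archive)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                long total = 0;
                int n;
                // read() returns -1 at the end of the current entry, not of the archive.
                while ((n = zip.read(buffer)) != -1) {
                    total += n;
                }
                sizes.put(entry.getName(), total);
            }
        }
        return sizes;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(entrySizes(new ByteArrayInputStream(sampleArchive())));
    }
}
```

Memory use stays bounded by the buffer size regardless of how large an entry is, which is the point of handing the entry stream to the parser directly.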



Re: How to parse huge RDF data in a tar.gz file.

Posted by Yasunori Yamamoto <yy...@dbcls.rois.ac.jp>.
The files in the tar are in RDF/XML or Turtle.

Yasunori


Re: How to parse huge RDF data in a tar.gz file.

Posted by ajs6f <aj...@apache.org>.
In what format are these RDF files?

ajs6f
