Posted to users@jena.apache.org by Abhishek Shivkumar <ab...@gmail.com> on 2013/01/08 12:00:40 UTC

Re: Parsing a freebase RDF dump.

Hi Andy,

   I am using the script to correct the errors. When I run the dwim script
on all the part files, it prints error messages and continues processing.
Have these errors been corrected, or do they still exist and need attention?
A sample error message is:

ERROR [line:25335, col:25] Unknown char: \(92)

Just wanted to know if we can ignore these messages while running the dwim
script.

Thank you!

With Regards,
Abhishek S


On Sat, Dec 29, 2012 at 1:58 PM, Andy Seaborne <an...@apache.org> wrote:

> If you want to parse the Freebase dump, try this:
>
> http://people.apache.org/~andy/Freebase20121223/Notes.txt
>
> It takes about 90 minutes on my home desktop machine to fix and parse the
> data.
>
> To load it, get a very large machine - it has been reported [1] that a
> previous dump has been loaded into TDB.
>
>         Andy
>
> [1] http://lists.freebase.com/pipermail/freebase-discuss/2012-December/010169.html
>
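
For reference, the \(92) in the sample message above is code point 92, the backslash. What follows is a minimal Python sketch, not part of the dwim script or of Jena, for pulling the line and column out of messages of this shape and listing the lines around the reported position. The names LOG_RE, parse_log_line and context are hypothetical, and the pattern is an assumption based only on the ERROR and WARN messages quoted in this thread.

```python
import re
from itertools import islice

# Assumed message shape, taken from the examples quoted in this thread:
#   ERROR [line:25335, col:25] Unknown char: \(92)
#   WARN  [line: 4632165, col: 55] Bad IRI: <...>
LOG_RE = re.compile(r"^(ERROR|WARN)\s+\[line:\s*(\d+),\s*col:\s*(\d+)\]\s*(.*)$")


def parse_log_line(text):
    """Return (level, line, col, message), or None if the text does not match."""
    m = LOG_RE.match(text.strip())
    if m is None:
        return None
    level, line, col, message = m.groups()
    return level, int(line), int(col), message


def context(lines, line_no, radius=2):
    """Return [(number, text)] for the lines around a 1-based line number.

    `lines` can be any iterable of strings, such as an open file, so the
    whole dump never has to be held in memory.
    """
    first = max(1, line_no - radius)
    window = islice(lines, first - 1, line_no + radius)
    return [(first + i, text.rstrip("\n")) for i, text in enumerate(window)]
```

Passing an open file as `lines`, for example gzip.open(path, "rt", encoding="utf-8", errors="replace"), reads only as far as the reported line. If the dump has been split, the line number refers to whichever part file was being parsed when the message was printed.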

Re: Parsing a freebase RDF dump.

Posted by Andy Seaborne <an...@apache.org>.
On 27/08/13 00:21, Yuhan Zhang wrote:
> thanks for the suggestion, Andy. I'll convert the turtle file and retry it.

It seems their encoding of any odd characters is $xxxx; I'm not sure
where that came from.  After the report on the 20121223 dump I thought
they'd clean this up - maybe a new one has got through.  It's worth a
report to the freebase list.

	Andy
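
Before reporting to the freebase list, it may help to know which $xxxx sequences actually occur. Below is a minimal Python sketch that counts them. It assumes the encoding is "$" followed by exactly four hex digits naming a code point, which is a guess from the single line quoted in this thread; DOLLAR_HEX and dollar_codes are hypothetical names.

```python
import re
from collections import Counter

# Assumption: odd characters are written as '$' plus exactly four hex digits.
DOLLAR_HEX = re.compile(r"\$([0-9A-Fa-f]{4})")


def dollar_codes(lines):
    """Count each $XXXX sequence, keyed by the character it appears to stand for."""
    counts = Counter()
    for text in lines:
        for hex_digits in DOLLAR_HEX.findall(text):
            counts[chr(int(hex_digits, 16))] += 1
    return counts
```

Run over the dump (or one part file), the resulting counts give a short list of affected characters to include in the report.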



Re: Parsing a freebase RDF dump.

Posted by Yuhan Zhang <yz...@onescreen.com>.
thanks for the suggestion, Andy. I'll convert the turtle file and retry it.

Yuhan Zhang
Senior Software Engineer
OneScreen Inc.
www.onescreen.com
(949) 525-4825 Ext: 177
yzhang@onescreen.com <eh...@onescreen.com>




Re: Parsing a freebase RDF dump.

Posted by Andy Seaborne <an...@apache.org>.
On 26/08/13 18:41, Yuhan Zhang wrote:
> Hi Andy,
>
> the line 4643044 looks like this:
>
> ns:m.04ln83j key:user.robert.world$0027s_tallest.building "preceded_by"

Raw $ is not allowed in a prefixed name.

But I guess $0027 is intended (which is ') so use %27.

- - - - - -

You can use \$ (in which case the URI will have a real $ in it), but some
tools may have problems with it, or use %24 (in which case the URI will have
the 3 chars %-2-4 in it).

[172s]  PN_LOCAL_ESC  ::=  '\' ('_' | '~' | '.' | '-' | '!' | '$' | '&' | "'"
                           | '(' | ')' | '*' | '+' | ',' | ';' | '=' | '/'
                           | '?' | '#' | '@' | '%')

http://www.w3.org/TR/turtle/#sec-grammar-grammar

	Andy
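
The two options above can be sketched in Python. This is only an illustration under stated assumptions, not a tested fix for the whole dump: it assumes $XXXX is "$" plus four hex digits naming a code point (as $0027 suggests), and the function names are hypothetical. The first helper turns $0027 into %27; the second backslash-escapes any "$" that is left, as PN_LOCAL_ESC permits.

```python
import re
from urllib.parse import quote

# Assumption: odd characters are written as '$' plus exactly four hex digits.
DOLLAR_HEX = re.compile(r"\$([0-9A-Fa-f]{4})")


def percent_encode_dollar_codes(local_name):
    """Rewrite $XXXX as the percent-encoded UTF-8 bytes of that code point.

    Example: $0027 becomes %27. Characters that never need percent-encoding
    (letters, digits, '_', '.', '-', '~') come back unencoded. Surrogate code
    points (D800-DFFF) are not handled and raise UnicodeEncodeError.
    """
    def repl(match):
        return quote(chr(int(match.group(1), 16)), safe="")
    return DOLLAR_HEX.sub(repl, local_name)


def escape_remaining_dollars(local_name):
    """Backslash-escape any '$' that is not already preceded by a backslash."""
    return re.sub(r"(?<!\\)\$", r"\\$", local_name)
```

Applying the percent-encoding step first and the escaping step second keeps the two cases apart: known $XXXX codes become %XX, and any stray "$" becomes \$, which some tools may still have trouble with, as noted above.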

>
>
> Yuhan Zhang
> Senior Software Engineer
> OneScreen Inc.
> www.onescreen.com
> (949) 525-4825 Ext: 177
> yzhang@onescreen.com <eh...@onescreen.com>
>
>
> On Sat, Aug 24, 2013 at 2:50 AM, Andy Seaborne <an...@apache.org> wrote:
>
>> On 24/08/13 03:29, Yuhan Zhang wrote:
>>
>>> I reached a similar error with  jena-2.10.1 but with a different character
>>> when parsing a more recent version of freebase-rdf-2013-08-04-00-00.
>>>
>>
>> Yes - they keep changing things and don't check much.
>>
>>
>>> WARN  [line: 4632165, col: 55] Bad IRI:
>>> <http://croctail.corpwatch.org/#cw_506630,cw_{key}>
>>> Code: 4/UNWISE_CHARACTER
>>> in FRAGMENT: The character matches no grammar rules of URIs/IRIs. These
>>> characters are permitted in RDF URI References, XML system identifiers,
>>> and XML Schema anyURIs.
>>>
>>
>> That is only a warning: '{' and '}' are not allowed in IRIs.
>>
>>
>>> ERROR [line: 4643044, col: 35] Unknown char: $(36;0x0024)
>>>
>>
>> What's that line?  $ is illegal in some places, but legal in others.
>>
>>          Andy
>>
>>
>>> Yuhan Zhang
>>> Senior Software Engineer
>>> OneScreen Inc.
>>> www.onescreen.com
>>> (949) 525-4825 Ext: 177
>>> yzhang@onescreen.com <eh...@onescreen.com>
>>>
>>>
>>>
>>> On Tue, Jan 8, 2013 at 4:24 AM, Andy Seaborne <an...@apache.org> wrote:
>>>
>>>> On 08/01/13 11:49, Rob Vesse wrote:
>>>>
>>>>> 2.10.0 is the current development snapshot; you can get this via Maven
>>>>> by setting the version for your Jena dependencies to 2.10.0-SNAPSHOT
>>>>>
>>>>>
>>>>> If you need to download the JARs (i.e. non-Maven builds) you can find
>>>>> them on the Apache artifactory at
>>>>> https://repository.apache.org/index.html#nexus-search;quick~jena
>>>>>
>>>>> You need to click on Show All Versions for the module you want in order
>>>>> to see download links for snapshots
>>>>>
>>>>> Rob
>>>>>
>>>>>
>>>> And the download is available at:
>>>>
>>>> https://repository.apache.org/content/repositories/snapshots/org/apache/jena/apache-jena/
>>>>
>>>>
>>>> (cough - see message of 28/Dec in this thread)
>>>>
>>>>           Andy
>>>>
>>>>
>>>>
>>>>   On 1/8/13 11:45 AM, "Abhishek Shivkumar" <ab...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>    1. I am using the correct version of rdf file that you have.
>>>>>
>>>>>> 2. This error of unknown char (\92) is appearing in all the files at
>>>>>> different line numbers. I am not sure what this unknown char \(92) is.
>>>>>> Tried to look in the surrounding of the line number in the file
>>>>>> contents
>>>>>> but can't find it :(
>>>>>> 3. I can only find version 2.7.4 at
>>>>>> http://www.apache.org/dist/jena/binaries/.
>>>>>>
>>>>>> May be THIS is the reason. Do
>>>>>> you know where I can download the 2.10.0 version?
>>>>>>
>>>>>> Thanks much!
>>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>> With Regards,
>>>>>> Abhishek S
>>>>>>
>>>>>>
>>>>>> On Tue, Jan 8, 2013 at 5:26 AM, Andy Seaborne <an...@apache.org> wrote:
>>>>>>
>>>>>>    On 08/01/13 11:00, Abhishek Shivkumar wrote:
>>>>>>
>>>>>>>
>>>>>>>    Hi Andy,
>>>>>>>
>>>>>>>>
>>>>>>>>        I am using the script to correct the errors. When I run the
>>>>>>>> script
>>>>>>>> dwim
>>>>>>>> on all the part files, it shows error messages, and continues
>>>>>>>> processing.
>>>>>>>> Are these errors that are corrected, or still existing that need
>>>>>>>> attention?
>>>>>>>> Sample error message is:
>>>>>>>>
>>>>>>>> ERROR [line:25335, col:25] Unknown char: \(92)
>>>>>>>>
>>>>>>>>
>>>>>>>>   What's on the lines around there?
>>>>>>> And if you've split the dump, which file?
>>>>>>>
>>>>>>> That needs correcting in the source.  I can pare the first 30k lines
>>>>>>> of
>>>>>>> the file with Jena with no fixups.
>>>>>>>
>>>>>>> Maybe you don't have exactly the version of Freebase that I did
>>>>>>> freebase-rdf-2012-12-23-00-00.gz.  There is no suspect forms around
>>>>>>> line 25K of my copy.
>>>>>>>
>>>>>>> ns:award.award_winner   ns:type.type.instance   ns:m.03cpgmq.
>>>>>>> ns:award.award_winner   ns:type.type.instance   ns:m.05x3tbk.
>>>>>>> <---25335
>>>>>>> ns:award.award_winner   ns:type.type.instance   ns:m.05q_rp.
>>>>>>>
>>>>>>> You also need the latest version of Jena (recent 2.10.0 SNAPSHOT).
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>    Just wanted to know if we can ignore these messages while running
>>>>>>> the
>>>>>>>
>>>>>>>> dwim
>>>>>>>> script.
>>>>>>>>
>>>>>>>>
>>>>>>>>   You can ignore WARN.  ERRORs usually stop the parser as they
>>>>>>> indicate
>>>>>>> structural problems.
>>>>>>>
>>>>>>>            Andy
>>>>>>>
>>>>>>>
>>>>>>>    Thank you!
>>>>>>>
>>>>>>>>
>>>>>>>> With Regards,
>>>>>>>> Abhishek S
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, Dec 29, 2012 at 1:58 PM, Andy Seaborne <an...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>     If you want to parse the Freebase dump, try this:
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> http://people.apache.org/~andy/Freebase20121223/Notes.txt
>>>>>>>>>
>>>>>>>>> It takes about 90 minutes on my home desktop machine to fix and
>>>>>>>>> parse
>>>>>>>>> the
>>>>>>>>> data.
>>>>>>>>>
>>>>>>>>> To load it, get a very large machine - it has been reported [1]
>>>>>>>>> that a
>>>>>>>>> previous dump has been loaded into TDB.
>>>>>>>>>
>>>>>>>>>             Andy
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> http://lists.freebase.com/pipermail/freebase-discuss/2012-December/010169.html
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>>
>


Re: Parsing a freebase RDF dump.

Posted by Yuhan Zhang <yz...@onescreen.com>.
Hi Andy,

the line 4643044 looks like this:

ns:m.04ln83j key:user.robert.world$0027s_tallest.building "preceded_by"
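
For readers following along: the $0027 in that key looks like a four-digit
hexadecimal escape (0x27 is the apostrophe), so the key presumably stands
for "world's_tallest.building". Below is a minimal Python sketch of decoding
such keys, assuming every escape is "$" followed by exactly four hex digits.
The helper name decode_dollar_escapes is made up for illustration; it is not
part of Jena or of any Freebase tooling.

```python
import re

# Assumption: an escape is "$" followed by exactly four hexadecimal digits,
# each escape standing for one Unicode code point.
_DOLLAR_ESCAPE = re.compile(r"\$([0-9A-Fa-f]{4})")


def decode_dollar_escapes(key: str) -> str:
    """Replace every $XXXX escape in a key with the character it encodes."""
    return _DOLLAR_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), key)


# The key from the offending line of the dump.
print(decode_dollar_escapes("user.robert.world$0027s_tallest.building"))
```

Note that the decoded form contains an apostrophe, which is not legal in a
Turtle prefixed name either, so a fix-up script would more likely rewrite
the "$" escape into something the parser accepts than decode it in place.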


Yuhan Zhang
Senior Software Engineer
OneScreen Inc.
www.onescreen.com
(949) 525-4825 Ext: 177
yzhang@onescreen.com <eh...@onescreen.com>


On Sat, Aug 24, 2013 at 2:50 AM, Andy Seaborne <an...@apache.org> wrote:

> On 24/08/13 03:29, Yuhan Zhang wrote:
>
>> I reached a similar error with  jena-2.10.1 but with a different character
>> when parsing a more recent version of freebase-rdf-2013-08-04-00-00.
>>
>
> Yes - they keep changing things and don't check much.
>
>
>  WARN  [line: 4632165, col: 55] Bad IRI: <
>> http://croctail.corpwatch.org/#cw_506630,cw_{key}> Code: 4/UNWISE_CHARACTER
>> in FRAGMENT: The character matches no grammar rules of URIs/IRIs. These
>> characters are permitted in RDF URI References, XML system identifiers,
>> and
>> XML Schema anyURIs.
>>
>
> Only a warning - no '{' or '}' in IRIs
>
>
>  ERROR [line: 4643044, col: 35] Unknown char: $(36;0x0024)
>>
>
> What's that line?  $ is illegal in some places, but legal in others.
>
>         Andy
>
>
>> Yuhan Zhang
>> Senior Software Engineer
>> OneScreen Inc.
>> www.onescreen.com
>> (949) 525-4825 Ext: 177
>> yzhang@onescreen.com <eh...@onescreen.com>
>>
>>
>>
>> On Tue, Jan 8, 2013 at 4:24 AM, Andy Seaborne <an...@apache.org> wrote:
>>
>>  On 08/01/13 11:49, Rob Vesse wrote:
>>>
>>>  2.10.0 is the current development snapshot, you can get this via maven
>>>> by
>>>> setting the version for your Jena dependencies to 2.10.0-SNAPSHOT
>>>>
>>>>
>>>> If you need to download the JARs (I.e. non-maven builds) you can find
>>>> them
>>>> on the Apache artifactory at
>>>> https://repository.apache.org/index.html#nexus-search;quick~jena
>>>>
>>>>
>>>> You need to click on Show All Versions for the module you want in order
>>>> to
>>>> see download links for snapshots
>>>>
>>>> Rob
>>>>
>>>>
>>> And the download is available at:
>>>
>>> https://repository.apache.org/content/repositories/snapshots/org/apache/jena/apache-jena/
>>>
>>>
>>> (cough - see message of 28/Dec in this thread)
>>>
>>>          Andy
>>>
>>>
>>>
>>>  On 1/8/13 11:45 AM, "Abhishek Shivkumar" <ab...@gmail.com>
>>>> wrote:
>>>>
>>>>   1. I am using the correct version of rdf file that you have.
>>>>
>>>>> 2. This error of unknown char (\92) is appearing in all the files at
>>>>> different line numbers. I am not sure what this unknown char \(92) is.
>>>>> Tried to look in the surrounding of the line number in the file
>>>>> contents
>>>>> but can't find it :(
>>>>> 3. I can only find version 2.7.4 at
>>>>> http://www.apache.org/dist/jena/binaries/.
>>>>>
>>>>> May be THIS is the reason. Do
>>>>> you know where I can download the 2.10.0 version?
>>>>>
>>>>> Thanks much!
>>>>>
>>>>> Thank you!
>>>>>
>>>>> With Regards,
>>>>> Abhishek S
>>>>>
>>>>>
>>>>> On Tue, Jan 8, 2013 at 5:26 AM, Andy Seaborne <an...@apache.org> wrote:
>>>>>
>>>>>   On 08/01/13 11:00, Abhishek Shivkumar wrote:
>>>>>
>>>>>>
>>>>>>   Hi Andy,
>>>>>>
>>>>>>>
>>>>>>>       I am using the script to correct the errors. When I run the
>>>>>>> script
>>>>>>> dwim
>>>>>>> on all the part files, it shows error messages, and continues
>>>>>>> processing.
>>>>>>> Are these errors that are corrected, or still existing that need
>>>>>>> attention?
>>>>>>> Sample error message is:
>>>>>>>
>>>>>>> ERROR [line:25335, col:25] Unknown char: \(92)
>>>>>>>
>>>>>>>
>>>>>>>  What's on the lines around there?
>>>>>> And if you've split the dump, which file?
>>>>>>
>>>>>> That needs correcting in the source.  I can pare the first 30k lines
>>>>>> of
>>>>>> the file with Jena with no fixups.
>>>>>>
>>>>>> Maybe you don't have exactly the version of Freebase that I did
>>>>>> freebase-rdf-2012-12-23-00-00.gz.  There is no suspect forms around
>>>>>> line 25K of my copy.
>>>>>>
>>>>>> ns:award.award_winner   ns:type.type.instance   ns:m.03cpgmq.
>>>>>> ns:award.award_winner   ns:type.type.instance   ns:m.05x3tbk.
>>>>>> <---25335
>>>>>> ns:award.award_winner   ns:type.type.instance   ns:m.05q_rp.
>>>>>>
>>>>>> You also need the latest version of Jena (recent 2.10.0 SNAPSHOT).
>>>>>>
>>>>>>
>>>>>>
>>>>>>   Just wanted to know if we can ignore these messages while running
>>>>>> the
>>>>>>
>>>>>>> dwim
>>>>>>> script.
>>>>>>>
>>>>>>>
>>>>>>>  You can ignore WARN.  ERRORs usually stop the parser as they
>>>>>> indicate
>>>>>> structural problems.
>>>>>>
>>>>>>           Andy
>>>>>>
>>>>>>
>>>>>>   Thank you!
>>>>>>
>>>>>>>
>>>>>>> With Regards,
>>>>>>> Abhishek S
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Dec 29, 2012 at 1:58 PM, Andy Seaborne <an...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>    If you want to parse the Freebase dump, try this:
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> http://people.apache.org/~andy/Freebase20121223/Notes.txt
>>>>>>>>
>>>>>>>> It takes about 90 minutes on my home desktop machine to fix and
>>>>>>>> parse
>>>>>>>> the
>>>>>>>> data.
>>>>>>>>
>>>>>>>> To load it, get a very large machine - it has been reported [1]
>>>>>>>> that a
>>>>>>>> previous dump has been loaded into TDB.
>>>>>>>>
>>>>>>>>            Andy
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> http://lists.freebase.com/pipermail/freebase-discuss/2012-December/010169.html
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>
>>
>

-- 
The information contained in this e-mail is for the exclusive use of the 
intended recipient(s) and may be confidential, proprietary, and/or legally 
privileged. Inadvertent disclosure of this message does not constitute a 
waiver of any privilege.  If you receive this message in error, please do 
not directly or indirectly print, copy, retransmit, disseminate, or 
otherwise use the information. In addition, please delete this e-mail and 
all copies and notify the sender.

Re: Parsing a freebase RDF dump.

Posted by Andy Seaborne <an...@apache.org>.
On 24/08/13 03:29, Yuhan Zhang wrote:
> I reached a similar error with  jena-2.10.1 but with a different character
> when parsing a more recent version of freebase-rdf-2013-08-04-00-00.

Yes - they keep changing things and don't check much.

> WARN  [line: 4632165, col: 55] Bad IRI: <
> http://croctail.corpwatch.org/#cw_506630,cw_{key}> Code: 4/UNWISE_CHARACTER
> in FRAGMENT: The character matches no grammar rules of URIs/IRIs. These
> characters are permitted in RDF URI References, XML system identifiers, and
> XML Schema anyURIs.

That is only a warning: '{' and '}' are not allowed in IRIs.

> ERROR [line: 4643044, col: 35] Unknown char: $(36;0x0024)

What's that line?  $ is illegal in some places, but legal in others.
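
One way to answer "what's that line?" for a dump too large to open in an
editor is to stream the file and print only the lines around the reported
number. This is a rough Python sketch, not part of the fix-up scripts. The
helper name lines_around is hypothetical, and it assumes the file is plain
UTF-8 text and that the parser's line numbers are 1-based lines of the very
file being read (if the dump was split, the number is relative to the part
file).

```python
from itertools import islice


def lines_around(path, line_no, context=2, encoding="utf-8"):
    """Return (number, text) pairs for line_no and `context` lines either side.

    The file is streamed, so memory use stays small even for huge dumps.
    """
    start = max(1, line_no - context)
    stop = line_no + context
    with open(path, encoding=encoding, errors="replace") as f:
        # islice takes 0-based indices and an exclusive stop, so this yields
        # the lines numbered start..stop when counting from 1.
        window = islice(f, start - 1, stop)
        return [(n, text.rstrip("\n")) for n, text in enumerate(window, start=start)]


# Example call (the file name is a placeholder):
#   for n, text in lines_around("freebase-part.ttl", 4643044):
#       print(n, text)
```

For a gzipped dump, replacing open(path, ...) with gzip.open(path, "rt", ...)
should work the same way, at the cost of decompressing everything up to the
requested line.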

	Andy

>
> Yuhan Zhang
> Senior Software Engineer
> OneScreen Inc.
> www.onescreen.com
> (949) 525-4825 Ext: 177
> yzhang@onescreen.com <eh...@onescreen.com>
>
>
> On Tue, Jan 8, 2013 at 4:24 AM, Andy Seaborne <an...@apache.org> wrote:
>
>> On 08/01/13 11:49, Rob Vesse wrote:
>>
>>> 2.10.0 is the current development snapshot, you can get this via maven by
>>> setting the version for your Jena dependencies to 2.10.0-SNAPSHOT
>>>
>>>
>>> If you need to download the JARs (I.e. non-maven builds) you can find them
>>> on the Apache artifactory at
>>> https://repository.apache.org/index.html#nexus-search;quick~jena
>>>
>>> You need to click on Show All Versions for the module you want in order to
>>> see download links for snapshots
>>>
>>> Rob
>>>
>>
>> And the download is available at:
>>
>> https://repository.apache.org/content/repositories/snapshots/org/apache/jena/apache-jena/
>>
>> (cough - see message of 28/Dec in this thread)
>>
>>          Andy
>>
>>
>>
>>> On 1/8/13 11:45 AM, "Abhishek Shivkumar" <ab...@gmail.com>
>>> wrote:
>>>
>>>   1. I am using the correct version of rdf file that you have.
>>>> 2. This error of unknown char (\92) is appearing in all the files at
>>>> different line numbers. I am not sure what this unknown char \(92) is.
>>>> Tried to look in the surrounding of the line number in the file contents
>>>> but can't find it :(
>>>> 3. I can only find version 2.7.4 at
>>>> http://www.apache.org/dist/jena/binaries/.
>>>> May be THIS is the reason. Do
>>>> you know where I can download the 2.10.0 version?
>>>>
>>>> Thanks much!
>>>>
>>>> Thank you!
>>>>
>>>> With Regards,
>>>> Abhishek S
>>>>
>>>>
>>>> On Tue, Jan 8, 2013 at 5:26 AM, Andy Seaborne <an...@apache.org> wrote:
>>>>
>>>>   On 08/01/13 11:00, Abhishek Shivkumar wrote:
>>>>>
>>>>>   Hi Andy,
>>>>>>
>>>>>>       I am using the script to correct the errors. When I run the script
>>>>>> dwim
>>>>>> on all the part files, it shows error messages, and continues
>>>>>> processing.
>>>>>> Are these errors that are corrected, or still existing that need
>>>>>> attention?
>>>>>> Sample error message is:
>>>>>>
>>>>>> ERROR [line:25335, col:25] Unknown char: \(92)
>>>>>>
>>>>>>
>>>>> What's on the lines around there?
>>>>> And if you've split the dump, which file?
>>>>>
>>>>> That needs correcting in the source.  I can pare the first 30k lines of
>>>>> the file with Jena with no fixups.
>>>>>
>>>>> Maybe you don't have exactly the version of Freebase that I did
>>>>> freebase-rdf-2012-12-23-00-00.gz.  There is no suspect forms around
>>>>> line 25K of my copy.
>>>>>
>>>>> ns:award.award_winner   ns:type.type.instance   ns:m.03cpgmq.
>>>>> ns:award.award_winner   ns:type.type.instance   ns:m.05x3tbk. <---25335
>>>>> ns:award.award_winner   ns:type.type.instance   ns:m.05q_rp.
>>>>>
>>>>> You also need the latest version of Jena (recent 2.10.0 SNAPSHOT).
>>>>>
>>>>>
>>>>>
>>>>>   Just wanted to know if we can ignore these messages while running the
>>>>>> dwim
>>>>>> script.
>>>>>>
>>>>>>
>>>>> You can ignore WARN.  ERRORs usually stop the parser as they indicate
>>>>> structural problems.
>>>>>
>>>>>           Andy
>>>>>
>>>>>
>>>>>   Thank you!
>>>>>>
>>>>>> With Regards,
>>>>>> Abhishek S
>>>>>>
>>>>>>
>>>>>> On Sat, Dec 29, 2012 at 1:58 PM, Andy Seaborne <an...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>    If you want to parse the Freebase dump, try this:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> http://people.apache.org/~andy/Freebase20121223/Notes.txt
>>>>>>>
>>>>>>> It takes about 90 minutes on my home desktop machine to fix and parse
>>>>>>> the
>>>>>>> data.
>>>>>>>
>>>>>>> To load it, get a very large machine - it has been reported [1] that a
>>>>>>> previous dump has been loaded into TDB.
>>>>>>>
>>>>>>>            Andy
>>>>>>>
>>>>>>> [1]
>>>>>>> http://lists.freebase.com/pipermail/freebase-discuss/2012-December/010169.html
>>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>
>>
>


Re: Parsing a freebase RDF dump.

Posted by Yuhan Zhang <yz...@onescreen.com>.
I hit a similar error with jena-2.10.1, but with a different character,
when parsing a more recent dump, freebase-rdf-2013-08-04-00-00.


WARN  [line: 4632165, col: 55] Bad IRI: <
http://croctail.corpwatch.org/#cw_506630,cw_{key}> Code: 4/UNWISE_CHARACTER
in FRAGMENT: The character matches no grammar rules of URIs/IRIs. These
characters are permitted in RDF URI References, XML system identifiers, and
XML Schema anyURIs.
ERROR [line: 4643044, col: 35] Unknown char: $(36;0x0024)
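
When a run produces many messages like these, it can help to separate the
WARN lines (which the parser survives) from the ERROR lines and pull out
the line and column numbers. The following is a small Python sketch under
the assumption that every message starts with WARN or ERROR followed by
"[line: N, col: M]", as in the samples quoted in this thread; the function
name parse_parser_messages is invented for the example.

```python
import re
from collections import Counter

# Matches e.g. "ERROR [line: 4643044, col: 35] Unknown char: $(36;0x0024)"
# and the older, tighter form "ERROR [line:25335, col:25] ...".
_MESSAGE = re.compile(r"^(WARN|ERROR)\s+\[line:\s*(\d+),\s*col:\s*(\d+)\]\s*(.*)$")


def parse_parser_messages(log_text):
    """Return (level, line, col, message) tuples for each recognised log line.

    Continuation lines of a wrapped message do not match and are skipped.
    """
    found = []
    for raw in log_text.splitlines():
        m = _MESSAGE.match(raw.strip())
        if m:
            level, line, col, message = m.groups()
            found.append((level, int(line), int(col), message))
    return found


sample = r"""WARN  [line: 4632165, col: 55] Bad IRI: <
ERROR [line: 4643044, col: 35] Unknown char: $(36;0x0024)
ERROR [line:25335, col:25] Unknown char: \(92)"""

messages = parse_parser_messages(sample)
counts = Counter(level for level, _line, _col, _msg in messages)
for level, line, col, message in messages:
    if level == "ERROR":
        print(line, col, message)
```

The ERROR entries are the ones worth chasing in the source data, since
those are the ones that stop the parse.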

Yuhan Zhang
Senior Software Engineer
OneScreen Inc.
www.onescreen.com
(949) 525-4825 Ext: 177
yzhang@onescreen.com <eh...@onescreen.com>


On Tue, Jan 8, 2013 at 4:24 AM, Andy Seaborne <an...@apache.org> wrote:

> On 08/01/13 11:49, Rob Vesse wrote:
>
>> 2.10.0 is the current development snapshot, you can get this via maven by
>> setting the version for your Jena dependencies to 2.10.0-SNAPSHOT
>>
>>
>> If you need to download the JARs (I.e. non-maven builds) you can find them
>> on the Apache artifactory at
>> https://repository.apache.org/index.html#nexus-search;quick~jena
>>
>> You need to click on Show All Versions for the module you want in order to
>> see download links for snapshots
>>
>> Rob
>>
>
> And the download is available at:
>
> https://repository.apache.org/content/repositories/snapshots/org/apache/jena/apache-jena/
>
> (cough - see message of 28/Dec in this thread)
>
>         Andy
>
>
>
>> On 1/8/13 11:45 AM, "Abhishek Shivkumar" <ab...@gmail.com>
>> wrote:
>>
>>  1. I am using the correct version of rdf file that you have.
>>> 2. This error of unknown char (\92) is appearing in all the files at
>>> different line numbers. I am not sure what this unknown char \(92) is.
>>> Tried to look in the surrounding of the line number in the file contents
>>> but can't find it :(
>>> 3. I can only find version 2.7.4 at
>>> http://www.apache.org/dist/jena/binaries/.
>>> May be THIS is the reason. Do
>>> you know where I can download the 2.10.0 version?
>>>
>>> Thanks much!
>>>
>>> Thank you!
>>>
>>> With Regards,
>>> Abhishek S
>>>
>>>
>>> On Tue, Jan 8, 2013 at 5:26 AM, Andy Seaborne <an...@apache.org> wrote:
>>>
>>>  On 08/01/13 11:00, Abhishek Shivkumar wrote:
>>>>
>>>>  Hi Andy,
>>>>>
>>>>>      I am using the script to correct the errors. When I run the script
>>>>> dwim
>>>>> on all the part files, it shows error messages, and continues
>>>>> processing.
>>>>> Are these errors that are corrected, or still existing that need
>>>>> attention?
>>>>> Sample error message is:
>>>>>
>>>>> ERROR [line:25335, col:25] Unknown char: \(92)
>>>>>
>>>>>
>>>> What's on the lines around there?
>>>> And if you've split the dump, which file?
>>>>
>>>> That needs correcting in the source.  I can pare the first 30k lines of
>>>> the file with Jena with no fixups.
>>>>
>>>> Maybe you don't have exactly the version of Freebase that I did
>>>> freebase-rdf-2012-12-23-00-00.gz.  There is no suspect forms around
>>>> line 25K of my copy.
>>>>
>>>> ns:award.award_winner   ns:type.type.instance   ns:m.03cpgmq.
>>>> ns:award.award_winner   ns:type.type.instance   ns:m.05x3tbk. <---25335
>>>> ns:award.award_winner   ns:type.type.instance   ns:m.05q_rp.
>>>>
>>>> You also need the latest version of Jena (recent 2.10.0 SNAPSHOT).
>>>>
>>>>
>>>>
>>>>  Just wanted to know if we can ignore these messages while running the
>>>>> dwim
>>>>> script.
>>>>>
>>>>>
>>>> You can ignore WARN.  ERRORs usually stop the parser as they indicate
>>>> structural problems.
>>>>
>>>>          Andy
>>>>
>>>>
>>>>  Thank you!
>>>>>
>>>>> With Regards,
>>>>> Abhishek S
>>>>>
>>>>>
>>>>> On Sat, Dec 29, 2012 at 1:58 PM, Andy Seaborne <an...@apache.org>
>>>>> wrote:
>>>>>
>>>>>   If you want to parse the Freebase dump, try this:
>>>>>
>>>>>>
>>>>>>
>>>>>> http://people.apache.org/~andy/Freebase20121223/Notes.txt
>>>>>>
>>>>>> It takes about 90 minutes on my home desktop machine to fix and parse
>>>>>> the
>>>>>> data.
>>>>>>
>>>>>> To load it, get a very large machine - it has been reported [1] that a
>>>>>> previous dump has been loaded into TDB.
>>>>>>
>>>>>>           Andy
>>>>>>
>>>>>> [1]
>>>>>> http://lists.freebase.com/pipermail/freebase-discuss/2012-December/010169.html
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>
>


Re: Parsing a freebase RDF dump.

Posted by Andy Seaborne <an...@apache.org>.
On 08/01/13 11:49, Rob Vesse wrote:
> 2.10.0 is the current development snapshot, you can get this via maven by
> setting the version for your Jena dependencies to 2.10.0-SNAPSHOT
>
>
> If you need to download the JARs (I.e. non-maven builds) you can find them
> on the Apache artifactory at
> https://repository.apache.org/index.html#nexus-search;quick~jena
>
> You need to click on Show All Versions for the module you want in order to
> see download links for snapshots
>
> Rob

And the download is available at:

https://repository.apache.org/content/repositories/snapshots/org/apache/jena/apache-jena/

(cough - see message of 28/Dec in this thread)

	Andy

>
> On 1/8/13 11:45 AM, "Abhishek Shivkumar" <ab...@gmail.com> wrote:
>
>> 1. I am using the correct version of rdf file that you have.
>> 2. This error of unknown char (\92) is appearing in all the files at
>> different line numbers. I am not sure what this unknown char \(92) is.
>> Tried to look in the surrounding of the line number in the file contents
>> but can't find it :(
>> 3. I can only find version 2.7.4 at
>> http://www.apache.org/dist/jena/binaries/. May be THIS is the reason. Do
>> you know where I can download the 2.10.0 version?
>>
>> Thanks much!
>>
>> Thank you!
>>
>> With Regards,
>> Abhishek S
>>
>>
>> On Tue, Jan 8, 2013 at 5:26 AM, Andy Seaborne <an...@apache.org> wrote:
>>
>>> On 08/01/13 11:00, Abhishek Shivkumar wrote:
>>>
>>>> Hi Andy,
>>>>
>>>>      I am using the script to correct the errors. When I run the script
>>>> dwim
>>>> on all the part files, it shows error messages, and continues
>>>> processing.
>>>> Are these errors that are corrected, or still existing that need
>>>> attention?
>>>> Sample error message is:
>>>>
>>>> ERROR [line:25335, col:25] Unknown char: \(92)
>>>>
>>>
>>> What's on the lines around there?
>>> And if you've split the dump, which file?
>>>
>>> That needs correcting in the source.  I can pare the first 30k lines of
>>> the file with Jena with no fixups.
>>>
>>> Maybe you don't have exactly the version of Freebase that I did
>>> freebase-rdf-2012-12-23-00-00.gz.  There is no suspect forms around
>>> line 25K of my copy.
>>>
>>> ns:award.award_winner   ns:type.type.instance   ns:m.03cpgmq.
>>> ns:award.award_winner   ns:type.type.instance   ns:m.05x3tbk. <---25335
>>> ns:award.award_winner   ns:type.type.instance   ns:m.05q_rp.
>>>
>>> You also need the latest version of Jena (recent 2.10.0 SNAPSHOT).
>>>
>>>
>>>
>>>> Just wanted to know if we can ignore these messages while running the
>>>> dwim
>>>> script.
>>>>
>>>
>>> You can ignore WARN.  ERRORs usually stop the parser as they indicate
>>> structural problems.
>>>
>>>          Andy
>>>
>>>
>>>> Thank you!
>>>>
>>>> With Regards,
>>>> Abhishek S
>>>>
>>>>
>>>> On Sat, Dec 29, 2012 at 1:58 PM, Andy Seaborne <an...@apache.org> wrote:
>>>>
>>>>   If you want to parse the Freebase dump, try this:
>>>>>
>>>>>
>>>>> http://people.apache.org/~andy/Freebase20121223/Notes.txt
>>>>>
>>>>> It takes about 90 minutes on my home desktop machine to fix and parse
>>>>> the
>>>>> data.
>>>>>
>>>>> To load it, get a very large machine - it has been reported [1] that a
>>>>> previous dump has been loaded into TDB.
>>>>>
>>>>>           Andy
>>>>>
>>>>> [1]
>>>>> http://lists.freebase.com/pipermail/freebase-discuss/2012-December/010169.html
>>>>>
>>>>>
>>>>
>>>
>


Re: Parsing a freebase RDF dump.

Posted by Rob Vesse <rv...@yarcdata.com>.
2.10.0 is the current development snapshot; you can get it via Maven by
setting the version of your Jena dependencies to 2.10.0-SNAPSHOT.


If you need to download the JARs (i.e. non-Maven builds), you can find them
in the Apache repository at
https://repository.apache.org/index.html#nexus-search;quick~jena

You need to click on "Show All Versions" for the module you want in order to
see download links for snapshots.

Rob

On 1/8/13 11:45 AM, "Abhishek Shivkumar" <ab...@gmail.com> wrote:

>1. I am using the correct version of rdf file that you have.
>2. This error of unknown char (\92) is appearing in all the files at
>different line numbers. I am not sure what this unknown char \(92) is.
>Tried to look in the surrounding of the line number in the file contents
>but can't find it :(
>3. I can only find version 2.7.4 at
>http://www.apache.org/dist/jena/binaries/. May be THIS is the reason. Do
>you know where I can download the 2.10.0 version?
>
>Thanks much!
>
>Thank you!
>
>With Regards,
>Abhishek S
>
>
>On Tue, Jan 8, 2013 at 5:26 AM, Andy Seaborne <an...@apache.org> wrote:
>
>> On 08/01/13 11:00, Abhishek Shivkumar wrote:
>>
>>> Hi Andy,
>>>
>>>     I am using the script to correct the errors. When I run the script
>>> dwim
>>> on all the part files, it shows error messages, and continues
>>>processing.
>>> Are these errors that are corrected, or still existing that need
>>> attention?
>>> Sample error message is:
>>>
>>> ERROR [line:25335, col:25] Unknown char: \(92)
>>>
>>
>> What's on the lines around there?
>> And if you've split the dump, which file?
>>
>> That needs correcting in the source.  I can pare the first 30k lines of
>> the file with Jena with no fixups.
>>
>> Maybe you don't have exactly the version of Freebase that I did
>> freebase-rdf-2012-12-23-00-00.gz.  There is no suspect forms around
>> line 25K of my copy.
>>
>> ns:award.award_winner   ns:type.type.instance   ns:m.03cpgmq.
>> ns:award.award_winner   ns:type.type.instance   ns:m.05x3tbk. <---25335
>> ns:award.award_winner   ns:type.type.instance   ns:m.05q_rp.
>>
>> You also need the latest version of Jena (recent 2.10.0 SNAPSHOT).
>>
>>
>>
>>> Just wanted to know if we can ignore these messages while running the
>>>dwim
>>> script.
>>>
>>
>> You can ignore WARN.  ERRORs usually stop the parser as they indicate
>> structural problems.
>>
>>         Andy
>>
>>
>>> Thank you!
>>>
>>> With Regards,
>>> Abhishek S
>>>
>>>
>>> On Sat, Dec 29, 2012 at 1:58 PM, Andy Seaborne <an...@apache.org> wrote:
>>>
>>>  If you want to parse the Freebase dump, try this:
>>>>
>>>> 
>>>>http://people.apache.org/~****andy/Freebase20121223/Notes.****txt<http:
>>>>//people.apache.org/%7E**andy/Freebase20121223/Notes.**txt>
>>>> 
>>>><http://people.apache.org/%**7Eandy/Freebase20121223/Notes.**txt<http:/
>>>>/people.apache.org/%7Eandy/Freebase20121223/Notes.txt>
>>>> >
>>>>
>>>>
>>>> It takes about 90 minutes on my home desktop machine to fix and parse
>>>>the
>>>> data.
>>>>
>>>> To load it, get a very large machine - it has been reported [1] that a
>>>> previous dump has been loaded into TDB.
>>>>
>>>>          Andy
>>>>
>>>> [1] 
>>>>http://lists.freebase.com/****pipermail/freebase-discuss/**<http://list
>>>>s.freebase.com/**pipermail/freebase-discuss/**>
>>>> 2012-December/010169.html<http**://lists.freebase.com/**
>>>> 
>>>>pipermail/freebase-discuss/**2012-December/010169.html<http://lists.fre
>>>>ebase.com/pipermail/freebase-discuss/2012-December/010169.html>
>>>> >
>>>>
>>>>
>>>
>>


Re: Parsing a freebase RDF dump.

Posted by Abhishek Shivkumar <ab...@gmail.com>.
1. I am using the same version of the RDF file that you have.
2. The "Unknown char: \(92)" error appears in all the files, at
different line numbers. I am not sure what this unknown char \(92) is.
I looked at the lines surrounding the reported line number in the file
but can't find it :(
3. I can only find version 2.7.4 at
http://www.apache.org/dist/jena/binaries/. Maybe THIS is the reason. Do
you know where I can download the 2.10.0 version?

Thanks much!

With Regards,
Abhishek S


On Tue, Jan 8, 2013 at 5:26 AM, Andy Seaborne <an...@apache.org> wrote:

> On 08/01/13 11:00, Abhishek Shivkumar wrote:
>
>> Hi Andy,
>>
>>     I am using the script to correct the errors. When I run the script
>> dwim
>> on all the part files, it shows error messages, and continues processing.
>> Are these errors that are corrected, or still existing that need
>> attention?
>> Sample error message is:
>>
>> ERROR [line:25335, col:25] Unknown char: \(92)
>>
>
> What's on the lines around there?
> And if you've split the dump, which file?
>
> That needs correcting in the source.  I can parse the first 30k lines of
> the file with Jena with no fixups.
>
> Maybe you don't have exactly the same version of Freebase that I did:
> freebase-rdf-2012-12-23-00-00.gz.  There are no suspect forms around
> line 25K of my copy.
>
> ns:award.award_winner   ns:type.type.instance   ns:m.03cpgmq.
> ns:award.award_winner   ns:type.type.instance   ns:m.05x3tbk. <---25335
> ns:award.award_winner   ns:type.type.instance   ns:m.05q_rp.
>
> You also need the latest version of Jena (recent 2.10.0 SNAPSHOT).
>
>
>
>> Just wanted to know if we can ignore these messages while running the dwim
>> script.
>>
>
> You can ignore WARN.  ERRORs usually stop the parser as they indicate
> structural problems.
>
>         Andy
>

Re: Parsing a freebase RDF dump.

Posted by Andy Seaborne <an...@apache.org>.
On 08/01/13 11:00, Abhishek Shivkumar wrote:
> Hi Andy,
>
>     I am using the script to correct the errors. When I run the script dwim
> on all the part files, it shows error messages, and continues processing.
> Are these errors that are corrected, or still existing that need attention?
> Sample error message is:
>
> ERROR [line:25335, col:25] Unknown char: \(92)

What's on the lines around there?
And if you've split the dump, which file?
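
One way to look at those lines is sketched below. It is only a minimal
sketch, assuming Python is available and that the part file is
gzip-compressed UTF-8 text. The names lines_around and print_context are
made up for illustration; they are not part of Jena or of the dwim script.

```python
import gzip
from itertools import islice


def lines_around(lines, target, radius=2):
    """Return (line_number, text) pairs for the lines near `target` (1-based)."""
    start = max(1, target - radius)
    stop = target + radius
    # islice reads only as far as `stop`, so a huge dump is not loaded into memory.
    window = islice(lines, start - 1, stop)
    return [(n, text.rstrip("\n")) for n, text in enumerate(window, start=start)]


def print_context(path, target, radius=2):
    """Print the lines around `target` in a gzip-compressed text file.

    `path` is whichever part file the error was reported for (an assumption:
    the error line number is relative to that file, not to the whole dump).
    """
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for number, text in lines_around(handle, target, radius):
            marker = "<---" if number == target else ""
            print(f"{number}: {text} {marker}")
```

Calling print_context on the part file that produced the error, with the
reported line number (25335 in the message above), should show what is
actually at that position in your copy.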

That needs correcting in the source.  I can parse the first 30k lines of 
the file with Jena with no fixups.

Maybe you don't have exactly the same version of Freebase that I did: 
freebase-rdf-2012-12-23-00-00.gz.  There are no suspect forms around line 
25K of my copy.

ns:award.award_winner	ns:type.type.instance	ns:m.03cpgmq.
ns:award.award_winner	ns:type.type.instance	ns:m.05x3tbk. <---25335
ns:award.award_winner	ns:type.type.instance	ns:m.05q_rp.
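
To check whether your own copy has a stray character where mine does not,
something like the following could be used. It assumes that the number in
"\(92)" is the decimal code of the offending character (92 is the
backslash in ASCII), and that the file is gzip-compressed UTF-8 text. The
names find_backslashes and scan_file are illustrative only and are not
part of Jena.

```python
import gzip


def find_backslashes(lines):
    """Yield (line_number, column) for every backslash; both are 1-based."""
    for line_no, line in enumerate(lines, start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) == 92:  # code point 92 is the backslash character
                yield line_no, col


def scan_file(path, limit=20):
    """Print the first `limit` backslash positions found in a gzipped text file.

    Assumption: columns here are counted per character; the parser may count
    columns differently, so compare line numbers first.
    """
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for count, (line_no, col) in enumerate(find_backslashes(handle), start=1):
            print(f"line {line_no}, col {col}")
            if count >= limit:
                break
```

If this reports a backslash on the line named in the error message, that
line is the one to look at in the source data.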

You also need the latest version of Jena (recent 2.10.0 SNAPSHOT).

>
> Just wanted to know if we can ignore these messages while running the dwim
> script.

You can ignore WARN.  ERRORs usually stop the parser as they indicate 
structural problems.

	Andy



Re: Parsing a freebase RDF dump.

Posted by Olivier Rossel <ol...@gmail.com>.
Isn't BaseKB supposed to be a clean RDF version of Freebase?
http://basekb.com/



On Tue, Jan 8, 2013 at 12:00 PM, Abhishek Shivkumar
<ab...@gmail.com> wrote:
> Hi Andy,
>
>    I am using the script to correct the errors. When I run the script dwim
> on all the part files, it shows error messages, and continues processing.
> Are these errors that are corrected, or still existing that need attention?
> Sample error message is:
>
> ERROR [line:25335, col:25] Unknown char: \(92)
>
> Just wanted to know if we can ignore these messages while running the dwim
> script.
>
> Thank you!
>
> With Regards,
> Abhishek S