Posted to commits@stanbol.apache.org by rw...@apache.org on 2013/11/21 13:12:05 UTC

svn commit: r1544151 - in /stanbol/trunk/entityhub/indexing/freebase: README.md fbrankings-uri.sh

Author: rwesten
Date: Thu Nov 21 12:12:04 2013
New Revision: 1544151

URL: http://svn.apache.org/r1544151
Log:
STANBOL-1214: added the freebase entity ranking file provided by Viktor Gal. Also updated the documentation to link to this file. I kept the old script (as google might switch back to the old dump format) and use fbrankings-uri.sh as name for the new one.

Added:
    stanbol/trunk/entityhub/indexing/freebase/fbrankings-uri.sh
Modified:
    stanbol/trunk/entityhub/indexing/freebase/README.md

Modified: stanbol/trunk/entityhub/indexing/freebase/README.md
URL: http://svn.apache.org/viewvc/stanbol/trunk/entityhub/indexing/freebase/README.md?rev=1544151&r1=1544150&r2=1544151&view=diff
==============================================================================
--- stanbol/trunk/entityhub/indexing/freebase/README.md (original)
+++ stanbol/trunk/entityhub/indexing/freebase/README.md Thu Nov 21 12:12:04 2013
@@ -87,21 +87,25 @@ folder.
 
 The Entityhub Indexing tool supports the use of index time boosts. Those
 boosts can be set based on the number of referenced an Entity has within the
-freebase knowledge base by calling 
+freebase knowledge base by using one of the following scripts:
 
-    gunzip -c ${FB_DUMP} \
-        | grep "^ns:m\..*\t.*\tns:m\." \
-        | cut -f 3 | sed 's/.$//' \
-        | sort -S $MAX_SORT_MEM \
-        | uniq -c  \
-        | sort -nr -S $MAX_SORT_MEM > $INCOMING_FILE
+1. [fbrankings.sh](fbrankings.sh): intended for freebase dumps that use
+namespace prefix mappings.
+2. [fbrankings-uri.sh](fbrankings-uri.sh): intended for freebase dumps that
+use fully qualified URIs.
 
-NOTE: Ubuntu requires a different syntax for grep e.g.
+
+NOTE: Ubuntu requires a different syntax for grep, e.g. in `fbrankings.sh`,
+instead of
+
+    grep "^ns:m\..*\t.*\tns:m\."
+
+you will need to use
 
     grep $'^ns:m\..*\t.*\tns:m\.'
 
-See also the [fbranking.sh] script in the same directory. The $INCOMING_FILE
-needs to be copied to 'indexing/resource/incoming_lings.txt'.
+The resulting $INCOMING_FILE needs to be copied to 'indexing/resources/incoming_links.txt'.
+The file will have a size of about 1.5 GByte.
 
 ### (4) Dealing with corrupted RDF
 

Added: stanbol/trunk/entityhub/indexing/freebase/fbrankings-uri.sh
URL: http://svn.apache.org/viewvc/stanbol/trunk/entityhub/indexing/freebase/fbrankings-uri.sh?rev=1544151&view=auto
==============================================================================
--- stanbol/trunk/entityhub/indexing/freebase/fbrankings-uri.sh (added)
+++ stanbol/trunk/entityhub/indexing/freebase/fbrankings-uri.sh Thu Nov 21 12:12:04 2013
@@ -0,0 +1,37 @@
+#!/usr/bin/env bash
+
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+echo ">> Building incoming links File for freebase.com <<"
+
+WORKSPACE=indexing/resources
+
+MAX_SORT_MEM=4G
+
+# Turn on echoing and exit on error
+set -x -e -o pipefail
+
+INCOMING_FILE=${WORKSPACE}/incoming_links.txt
+FB_DUMP=$1
+
+gunzip -c "${FB_DUMP}" \
+| grep "^<http://rdf.freebase.com/ns/m\..*<.*>.*<http://rdf.freebase.com/ns/m\." \
+| cut -f 3 \
+| sed 's/.*\/ns\/\(.*\)>/\1/g' \
+| sort -S $MAX_SORT_MEM \
+| uniq -c \
+| sort -nr -S $MAX_SORT_MEM > "$INCOMING_FILE"
+
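
For readers who want to see what the pipeline in fbrankings-uri.sh produces, here is a minimal sketch that runs the same grep/cut/sed/sort/uniq steps on a tiny in-memory sample instead of a gzipped dump. The four sample triples, the predicate names and the variable names `sample` and `ranking` are made up for illustration, and the sketch assumes the dump is tab-separated with one triple per line. The trailing awk call is not part of the committed script; it only strips the padding that `uniq -c` adds so the result is easy to compare.

```shell
#!/usr/bin/env bash
set -e -o pipefail

# Hypothetical sample: three entity-to-entity links and one literal value.
# printf reuses the format for every group of three arguments.
sample=$(printf '%s\t%s\t%s\t.\n' \
  '<http://rdf.freebase.com/ns/m.01>' '<http://rdf.freebase.com/ns/example.knows>' '<http://rdf.freebase.com/ns/m.02>' \
  '<http://rdf.freebase.com/ns/m.03>' '<http://rdf.freebase.com/ns/example.knows>' '<http://rdf.freebase.com/ns/m.02>' \
  '<http://rdf.freebase.com/ns/m.01>' '<http://rdf.freebase.com/ns/example.knows>' '<http://rdf.freebase.com/ns/m.03>' \
  '<http://rdf.freebase.com/ns/m.01>' '<http://rdf.freebase.com/ns/example.name>' '"Example"@en')

# Same steps as the script: keep triples whose subject and object are both
# m.* entities, take the object column, reduce the URI to its local name,
# then count how often each entity is referenced and sort by that count.
ranking=$(printf '%s\n' "$sample" \
  | grep "^<http://rdf.freebase.com/ns/m\..*<.*>.*<http://rdf.freebase.com/ns/m\." \
  | cut -f 3 \
  | sed 's/.*\/ns\/\(.*\)>/\1/g' \
  | sort \
  | uniq -c \
  | sort -nr \
  | awk '{print $1, $2}')

echo "$ranking"
# prints:
# 2 m.02
# 1 m.03
```

The triple with the literal object is dropped by grep, so only incoming links between entities are counted. On a real dump the two sort calls dominate the runtime, which is why the script passes `-S $MAX_SORT_MEM` to give them a larger buffer.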