Posted to dev@stanbol.apache.org by Andrea Di Menna <an...@inqmobile.com> on 2012/11/02 17:40:18 UTC

EntityHub Referenced Site and redirects

Hi all,

I have created an EntityHub Solr index from dbpedia 3.8 using the default
settings for the dbpedia indexing tool.
The index was created successfully.

Now that I am working on it I am noticing that Wikipedia redirects are
completely missing from the EntityHub.

I have used the fetch_prepare.sh tool to download data from DBpedia, and
among the resources there is also redirects_en.nt.bz2
There is a rule in the mappings.txt file to map dbp-ont:wikiPageRedirects
to rdfs:seeAlso.

From what I can see, the problem seems to be that the indexing tool
only takes into account the resources listed in the incoming_links.txt
file.
This file is built from page_links_en.nt.bz2 and ranks entities on the
basis of their incoming links.
Page redirects will never have incoming links, hence they will not be
listed in incoming_links.txt.

Is my understanding correct or am I missing anything?
Should I forcibly insert page redirects entities in the incoming_links file
to get them included in the Solr index?

Thank you very much for your time

-- 
Andrea Di Menna




This e-mail is only intended for the person(s) to whom it is addressed and may contain CONFIDENTIAL information. Any opinions or views are personal to the writer and do not represent those of INQ Mobile Limited, Hutchison Whampoa Limited or its group companies.  If you  are not the intended recipient, you are hereby notified that any use, retention, disclosure, copying, printing, forwarding or dissemination of this communication is strictly prohibited. If you have received this  communication in error, please erase all copies of the message and its  attachments and notify the sender immediately. INQ Mobile Limited is  a company registered in the British Virgin Islands. www.inqmobile.com.


Re: EntityHub Referenced Site and redirects

Posted by Andrea Di Menna <an...@inqmobile.com>.
Hi Rupert,

thanks for your help, very detailed and focused as always.

I like solution (3) so I think I will work on that, as it seems to have all
the properties I need.

P.S. In order to build the DBpedia 3.8 index I had to preprocess almost all
the dbpedia resource files with the rules used in fetch_prepare (sed on
some unicode chars) to stop RIOT complaints.

Cheers
Andrea

2012/11/3 Rupert Westenthaler <ru...@gmail.com>

>
> http://dev.iks-project.eu/downloads/stanbol-indices/dbpedia-3.8/incoming_links_en.txt.bz2






Re: EntityHub Referenced Site and redirects

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi

Your observation sounds reasonable. With dbpedia 3.7 pagelinks for
redirected URIs were still present. With dbpedia 3.8 there are two
pagelinks files

    http://downloads.dbpedia.org/3.8/en/page_links_en.nt.bz2
    http://downloads.dbpedia.org/3.8/en/page_links_unredirected_en.nt.bz2

so I assume that the default one "page_links_en.nt.bz2" counts
pagelinks from redirected URIs with the target URI of the redirect.

I see three possible solutions:

(1) Using the "page_links_unredirected_en.nt.bz2" to calculate the
"incoming_links.txt" file. This should result in the same behavior as
for dbpedia 3.7.
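
A minimal sketch of option (1), assuming the counting pipeline from fetch_prepare.sh is simply pointed at the unredirected dump. The function name count_incoming_links and the sample triples are made up for illustration; the real file is very large, so sort memory would need tuning as in the original script.

```shell
#!/usr/bin/env bash
# Count incoming links per DBpedia resource from N-Triples read on stdin.
# Output: "<count> <resource-local-name>", highest count first.
count_incoming_links() {
    # keep only the local name of the object URI (the link target)
    sed -e 's|.*<http://dbpedia\.org/resource/\([^>]*\)> \.|\1|' \
    | sort \
    | uniq -c \
    | sed 's/^ *//' \
    | sort -k1,1nr -k2,2
}

# Real usage (assumption: the dump sits in the current directory):
#   bzcat page_links_unredirected_en.nt.bz2 | count_incoming_links > incoming_links.txt

# Made-up sample: two links to Paris, one link to the redirect page Paris,_France
sample='<http://dbpedia.org/resource/A> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://dbpedia.org/resource/Paris> .
<http://dbpedia.org/resource/B> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://dbpedia.org/resource/Paris> .
<http://dbpedia.org/resource/C> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://dbpedia.org/resource/Paris,_France> .'

result=$(printf '%s\n' "$sample" | count_incoming_links)
printf '%s\n' "$result"
```

With the unredirected file the redirect page keeps its own count, so it would pass the minimum-incoming-links filter on its own merits.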

(2) You can add all pages with redirects in a 2nd run of the tool
(with a different configuration). For that you will need to deactivate
the "entityIdIterator" and "entityDataProvider" in the
'indexing.properties' file and instead activate

entityDataIterable=org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource,source:rdfdata
entityScoreProvider=org.apache.stanbol.entityhub.indexing.core.source.StaticEntityScoreProvider,score:0.1

Note the static score of "0.1" ~ the same as for entities with two
incoming links

Finally you need to configure the
"org.apache.stanbol.entityhub.indexing.core.processor.FieldValueFilter,config:entityTypes".
The FieldValueFilter is already included in the configuration, so you
only need to change the "entityTypes.properties" file. You need to
set the

field=dbp-ont:wikiPageRedirects
#comment the values property (see STANBOL-794)
#values=*

NOTE: make sure to update to the newest trunk version as I implemented
STANBOL-794 only minutes ago

After these configuration changes the IndexingTool will ONLY index
Entities with redirects (and will therefore not override existing data in
the SolrIndex).

(3) Usually redirected Entities are only included to have their labels
available. However, since the implementation of the
LDPathSourceProcessor (STANBOL-590) it is also possible to index those
labels directly within the document of the Entity the redirect
points to. To do this you need to adapt the configuration as
follows:

In "indexing.properties" add the
'org.apache.stanbol.entityhub.indexing.core.processor.LdpathSourceProcessor'
to the entityProcessors.

entityProcessor=org.apache.stanbol.entityhub.indexing.core.processor.FieldValueFilter,config:entityTypes;org.apache.stanbol.entityhub.indexing.core.processor.LdpathSourceProcessor,ldpath:dbpediacontext.ldpath;org.apache.stanbol.entityhub.indexing.core.processor.FiledMapperProcessor

The parameter "ldpath:dbpediacontext.ldpath" points to a file with
this name in the indexing/config directory. It needs to contain the
LDPath statement used to obtain the data. To copy the labels of
incoming redirects to "skos:altLabel" you can add the statement

skos:altLabel = ^dbp-ont:wikiPageRedirects/rdfs:label;

Finally in the mappings.txt file I would also recommend to configure
the following mappings

# add rdfs:labels and rdfs:labels of redirected sites to dbp-ont:surfaceForm
rdfs:label > dbp-ont:surfaceForm
skos:altLabel > dbp-ont:surfaceForm

This will allow you to configure e.g. the KeywordLinkingEngine to
match against "dbp-ont:surfaceForm" and thereby find Entities for
any label (including redirected ones). You should also deactivate
redirecting if you use option (3), as this will save you a lot of Solr
queries.

best
Rupert

ps. An incoming_links.txt file compiled from "page_links_en.nt.bz2" is
available at http://dev.iks-project.eu/downloads/stanbol-indices/dbpedia-3.8/incoming_links_en.txt.bz2




-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: EntityHub Referenced Site and redirects

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi (again)

> (2) dbp-ont:surfaceForm
>
> I recommended that you copy the labels of redirected pages to the
> "dbp-ont:surfaceForm" field. In the meantime I made some tests with an
> index built like that. The results were really bad, so I
> must revoke this recommendation!
>
> The reason for that is that the scoring algorithm of Solr is affected
> by the multi-valued "dbp-ont:surfaceForm" field. e.g. for
> dbpedia:Paris you have ~35 "dbp-ont:surfaceForm" values where only
> about ~15 contain "Paris". So if you now make a query for Paris in
> this field
>
>     (((@en/dbp\-ont\:surfaceForm/:"paris")))
>
> you will notice that dbpedia:Paris is not within the top 10 search
> results. Instead Entities like "Paris Barclay" are listed because they
> do have only a single value for "dbp-ont:surfaceForm" and therefore
> the match for "Paris" is much more relevant.

Just talked about this problem with Sebastian Schaffert. He suggested
to try setting

    omitNorms="true"

for all fields used for labels within the Entityhub. This should have
the effect that Entities with a lot of "dbp-ont:surfaceForm" values
are no longer penalized by the Solr ranking algorithm. Testing that
will require some time.
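
To make the penalty concrete, here is a rough sketch. It assumes Lucene's classic TF-IDF similarity, where the length norm of a field is about 1/sqrt(number of terms in the field) as long as norms are not omitted. The term counts are invented for illustration (roughly 35 surface forms of two words each for dbpedia:Paris, one two-word surface form for "Paris Barclay"), and the real norm is additionally stored in a lossy one-byte encoding, so the numbers in an actual index would differ somewhat.

```shell
#!/usr/bin/env bash
# Approximate length norm of the classic Lucene similarity: 1 / sqrt(terms in field).
length_norm() {
    LC_ALL=C awk -v n="$1" 'BEGIN { printf "%.3f\n", 1 / sqrt(n) }'
}

# Hypothetical term counts, for illustration only
paris_terms=70      # ~35 surface forms, ~2 words each
barclay_terms=2     # single surface form "Paris Barclay"

echo "dbpedia:Paris         norm ~ $(length_norm "$paris_terms")"
echo "dbpedia:Paris_Barclay norm ~ $(length_norm "$barclay_terms")"
```

Under these assumptions the field of the entity with many values is weighted several times lower for the same matching term; with omitNorms="true" this factor drops out of the score.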

best
Rupert


>
> This means that the current index-layout where URIs of redirected
> pages are represented as own Entities within the index is much better
> suited for entity extraction.


-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: EntityHub Referenced Site and redirects

Posted by Andrea Di Menna <ni...@gmail.com>.
Hi Rupert,

thank you very much for your help re (1) and (2).

Regarding the script used to insert redirects into the
incoming_links.txt file, I am using something like this:

# Rank entities by popularity by counting the number of incoming links in the
# wikipedia graph: computing this takes around 2 hours
if [ ! -f "$WORKSPACE/indexing/resources/incoming_links.txt" ]
then
    if [ ! -f page_links_en.nt.bz2 ]
    then
        # -O saves the download under its remote name instead of printing it
        curl -O "$DBPEDIA/en/page_links_en.nt.bz2"
    fi
    bzcat page_links_en.nt.bz2 \
    | sed -e 's/.*<http\:\/\/dbpedia\.org\/resource\/\([^>]*\)> ./\1/' \
    | sort -S $MAX_SORT_MEM \
    | uniq -c  \
    | sort -nr -S $MAX_SORT_MEM \
    > "$WORKSPACE/indexing/resources/incoming_links.txt"

    # Sort the incoming links on the entities, removing initial spaces added by uniq
    cat "$WORKSPACE/indexing/resources/incoming_links.txt" \
    | sed 's/^\s*//' \
    | sort -k 2b,2 > "$WORKSPACE/indexing/resources/incoming_links_sorted_k2.txt"

    mv "$WORKSPACE/indexing/resources/incoming_links.txt" \
       "$WORKSPACE/indexing/resources/original_incoming_links.txt"

    # Sort redirects (output lines: "<redirect> <target>", sorted on the target)
    zcat redirects_en.nt.gz | grep -v "^#" \
    | sed 's/^<http\:\/\/dbpedia\.org\/resource\/\([^>]*\)>.*<http\:\/\/dbpedia\.org\/resource\/\([^>]*\)> ./\1 \2/' \
    | sort -k 2b,2 > "$WORKSPACE/indexing/resources/redirects_sorted_k2.txt"

    # Join redirects with the original incoming links to assign the same
    # ranking to redirects (output lines: "<count of target> <redirect>")
    join -j 2 -o 2.1,1.1 \
        "$WORKSPACE/indexing/resources/redirects_sorted_k2.txt" \
        "$WORKSPACE/indexing/resources/incoming_links_sorted_k2.txt" \
    > "$WORKSPACE/indexing/resources/incoming_links_redirects.txt"

    # Merge the two files - maybe use sort merge?!
    cat "$WORKSPACE/indexing/resources/incoming_links_redirects.txt" \
        "$WORKSPACE/indexing/resources/incoming_links_sorted_k2.txt" \
    | sort -nr -S $MAX_SORT_MEM \
    > "$WORKSPACE/indexing/resources/incoming_links.txt"

    # WE ARE NOT REMOVING INTERMEDIATE FILES
    # rm -f $WORKSPACE/indexing/resources/incoming_links_sorted_k2.txt
    # rm -f $WORKSPACE/indexing/resources/redirects_sorted_k2.txt
    # rm -f $WORKSPACE/indexing/resources/incoming_links_redirects.txt
fi

Do you see any problem with that?

Comparing the line counts of the redirects file, the original
incoming_links file and the new incoming_links file, I noticed that a
number of redirects map to entities which are not in the
incoming_links.txt file. That means those entities have no incoming
links (e.g. Template pages).
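
One way to list these cases is sketched below. It assumes the two intermediate files produced by the script above: redirects_sorted_k2.txt with lines "redirect target" and incoming_links_sorted_k2.txt with lines "count entity", both sorted on field 2. The function name, the temporary directory and the sample contents are made up; join -v 1 prints the lines of the first file that have no partner in the second.

```shell
#!/usr/bin/env bash
# Print redirects whose target has no entry in the incoming-links file.
# Both inputs must be sorted on field 2 using the same locale.
redirects_without_rank() {
    LC_ALL=C join -v 1 -1 2 -2 2 "$1" "$2"
}

tmp=$(mktemp -d)
# made-up sample data: "redirect target" and "count entity"
printf '%s\n' 'Paris,_France Paris' 'Tmpl Template:Infobox' \
    | LC_ALL=C sort -k 2b,2 > "$tmp/redirects_sorted_k2.txt"
printf '%s\n' '120 Paris' '7 Rome' \
    | LC_ALL=C sort -k 2b,2 > "$tmp/incoming_links_sorted_k2.txt"

missing=$(redirects_without_rank "$tmp/redirects_sorted_k2.txt" \
                                 "$tmp/incoming_links_sorted_k2.txt")
printf '%s\n' "$missing"
rm -rf "$tmp"
```

Each output line starts with the join field (the redirect target), followed by the redirect itself, so piping the result through wc -l gives the number of redirects that would still be missing from the merged file.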

I am wondering if there are other DBpedia resources with no incoming
links which should be taken into account though when creating the Solr
index.
What are your thoughts?

Moreover, is the page_links_en.nt.bz2 file needed for the index?
If not, then after computing incoming_links.txt it should
probably be moved out of the rdfdata dir; otherwise it will be
processed by tdbloader (the file has about 150M rows...)

Cheers
Andrea


Re: EntityHub Referenced Site and redirects

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi Andrea,

A followup:

(1) Sharing your indexes:

This would be great! I talked with a colleague of mine. Most likely we
will add an FTP upload folder to the dev.iks-project.eu server. For
that we will need to add more HDD space to this virtual host, which
might take some more time to accomplish. I will notify you as soon as
we are ready.

(2) dbp-ont:surfaceForm

I recommended that you copy the labels of redirected pages to the
"dbp-ont:surfaceForm" field. In the meantime I made some tests with an
index built like that. The results were really bad, so I
must revoke this recommendation!

The reason for that is that the scoring algorithm of Solr is affected
by the multi-valued "dbp-ont:surfaceForm" field. e.g. for
dbpedia:Paris you have ~35 "dbp-ont:surfaceForm" values where only
about ~15 contain "Paris". So if you now make a query for Paris in
this field

    (((@en/dbp\-ont\:surfaceForm/:"paris")))

you will notice that dbpedia:Paris is not within the top 10 search
results. Instead Entities like "Paris Barclay" are listed because they
do have only a single value for "dbp-ont:surfaceForm" and therefore
the match for "Paris" is much more relevant.

This means that the current index-layout where URIs of redirected
pages are represented as own Entities within the index is much better
suited for entity extraction.

On Mon, Nov 5, 2012 at 10:59 AM, Andrea Di Menna <an...@inqmobile.com> wrote:
> Hi Rupert,
> I would be more than happy to share the indexes.
> I have also created one including redirects by forcibly inserting
> redirecting entities into the incoming_links.txt file.

Do you have a script for creating such an incoming_links.txt file?
This would be very useful for properly creating indexes that
include Entities of redirected pages.

best
Rupert

> Redirects have been assigned the same entity rank as the entities they
> redirect to.
>
> Please let me know how and where to store those indexes.
>
> Cheers
>
> 2012/11/3 Rupert Westenthaler <ru...@gmail.com>
>
>> Hi,
>>
>> I have started to play around with indexing dbpedia 3.8 myself as well
>> and I can confirm that one has to preprocess nearly all files. Because
>> of that I have written a nice shell script that downloads, processes
>> and re-compresses the RDF files
>>
>> # array syntax is ({item-1} {items-2} ... {item-n})
>> # names need to include the language path segment!
>> files=(dbpedia_3.8.owl \
>>     en/labels_en.nt \
>>     {all-the-other-files-you-need} \
>>     )
>>
>> for i in "${files[@]}"
>> do
>>     :
>>     # clean possible encoding errors
>>     filename=$(basename $i)
>>     if [ ! -f ${filename}.gz ]
>>     then
>>         url=${DBPEDIA}/${i}.bz2
>>         wget -c ${url}
>>         echo "cleaning $filename ..."
>>         #corrects encoding and recompress using gz
>>         #gz is used because it is faster
>>         bzcat ${filename}.bz2 \
>>             | sed 's/\\\\/\\u005c\\u005c/g;s/\\\([^u"]\)/\\u005c\1/g' \
>>             | gzip -c > ${filename}.gz
>>         rm -f ${filename}.bz2
>>     fi
>> done
>>
>> > the SolrIndex zip file is about 3.5GB.
>> > I am using a min-score=2 in minincoming.properties
>> > I think the 3.7 index file from the IKS project downloads site was
>> created
>> > with min-score=10.
>>
>> The dbpedia 3.7 index was built by ogrisel, but I think you are right.
>> 3.5GByte for all entities with >=2 incoming links (should be about
>> 4 million entities) sounds reasonable. If you want to share your index
>> with the Stanbol community I am sure we can find a server to host it.
>>
>>
>> Note about languages:
>>
>> While it is easy to include labels, comments, abstracts of additional
>> languages it is not so easy to add proper Solr field definitions for
>> languages. While there is a great wiki page that provides all the
>> necessary links [1] I find it still very hard to add configurations
>> for languages I do not understand. So if someone can help with that I
>> am happy to improve the Solr schemas used by the Entityhub (and the
>> Entityhub Indexing tool)!
>>
>>
>> Upgrading the default DBpedia index:
>>
>> After the ApacheCon I will work on replacing the default dbpedia index
>> used with the Stanbol launchers with a dbpedia 3.8 based version (the
>> current one is still based on 3.6). This will need some time because I
>> expect that I will need to adapt a lot of unit/integration tests
>> affected by data changes.
>>
>> [1] http://wiki.apache.org/solr/LanguageAnalysis
>>
>> >
>> > I have indexed english resources and labels from other languages, as this
>> > is what I currently need.
>> >
>> > Cheers
>> > Andrea
>> >
>> > 2012/11/2 harish suvarna <hs...@gmail.com>
>> >
>> >> Andrea,
>> >> Thanks for the update. I was also trying to create the Chinese and
>> English
>> >> dbpedia3.8 indexes. But ranout hardware power.
>> >> What is the size of the dbpedia.solr.index.zip file? It used to be 1.9
>> GB
>> >> (zip file). But I guess that contained labels from all languages.
>> >>
>> >> Did you index English only?
>> >>
>> >> -harish
>> >>



-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: EntityHub Referenced Site and redirects

Posted by Andrea Di Menna <an...@inqmobile.com>.
Hi Rupert,
I would be more than happy to share the indexes.
I have also created one that includes redirects, by forcibly inserting the
redirecting entities into the incoming_links.txt file.
Each redirect has been assigned the same entity rank as the entity it
redirects to.
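
A minimal sketch of that step, not necessarily how the index mentioned above was built. It assumes that incoming_links.txt holds one "<count> <local-name>" pair per line (the output style of uniq -c) and that redirects_en.nt has been decompressed. The function name add_redirects is made up for illustration.

```shell
#!/bin/sh
# Append redirect entities to a ranking file, giving each redirect the
# rank of the entity it redirects to.
# ASSUMPTION: ranking lines look like "<count> <local-name>" (uniq -c style).
# ASSUMPTION: redirect triples link two http://dbpedia.org/resource/ URIs
#             via a predicate ending in "wikiPageRedirects".
add_redirects() {
    # $1 = ranking file, $2 = uncompressed redirects N-Triples file
    awk '
        # ranking file: remember the rank of each entity, echo the line as is
        FILENAME == ARGV[1] { rank[$2] = $1; print; next }
        # redirects file: emit the source with the rank of its target,
        # unless the target is unranked or the source is already listed
        $2 ~ /wikiPageRedirects>$/ {
            src = local_name($1)
            dst = local_name($3)
            if ((dst in rank) && !(src in rank)) print rank[dst], src
        }
        # strip the surrounding <...> and the DBpedia resource prefix
        function local_name(uri) {
            sub(/^<http:\/\/dbpedia\.org\/resource\//, "", uri)
            sub(/>$/, "", uri)
            return uri
        }
    ' "$1" "$2"
}
```

Usage would be along the lines of: add_redirects incoming_links.txt redirects_en.nt > incoming_links_with_redirects.txt. The added lines are appended after the original ones, so if the consumer expects the file to be sorted by rank, the result would still need to go through sort -nr.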

Please let me know how and where to store those indexes.

Cheers




-- 
Andrea Di Menna
INQ - Engineering
+393925803119
skype: ninniux
inqmobile.com
INQ¹ – Winner of the 2009 Best Handset







Re: EntityHub Referenced Site and redirects

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi,

I have started to play around with indexing dbpedia 3.8 myself as well
and I can confirm that one has to preprocess nearly all files. Because
of that I have written a small shell script that downloads, cleans
and re-compresses the RDF files:

# DBPEDIA is expected to hold the base URL of the DBpedia downloads
# (it is not set in this snippet)
# array syntax is ({item-1} {item-2} ... {item-n})
# names need to include the language path segment!
files=(dbpedia_3.8.owl \
    en/labels_en.nt \
    {all-the-other-files-you-need} \
    )

for i in "${files[@]}"
do
    filename=$(basename "$i")
    # skip files that have already been cleaned and recompressed
    if [ ! -f "${filename}.gz" ]
    then
        url="${DBPEDIA}/${i}.bz2"
        wget -c "${url}"
        echo "cleaning $filename ..."
        # clean possible encoding errors (stray backslashes)
        # and recompress using gz, because gz is faster
        bzcat "${filename}.bz2" \
            | sed 's/\\\\/\\u005c\\u005c/g;s/\\\([^u"]\)/\\u005c\1/g' \
            | gzip -c > "${filename}.gz"
        rm -f "${filename}.bz2"
    fi
done
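
To see what the sed expression in the script above does, it can be tried on a few lines in isolation. The following is only an illustration: the function name clean_backslashes and the sample lines are made up, and the sed program itself is copied unchanged from the script.

```shell
#!/bin/sh
# Same sed program as in the download script, wrapped in a function so it
# can be applied to arbitrary input on stdin.
clean_backslashes() {
    # 1st expression: a doubled backslash becomes \u005c\u005c
    # 2nd expression: a backslash not followed by u or " becomes \u005c
    sed 's/\\\\/\\u005c\\u005c/g;s/\\\([^u"]\)/\\u005c\1/g'
}

# Single quotes keep the backslashes literal, and printf %s prints the
# arguments without interpreting them (the sample lines are made up).
printf '%s\n' 'a\\b' 'a\xb' 'say \"hi\"' 'caf\u00e9' | clean_backslashes
```

The first two sample lines come out as a\u005c\u005cb and a\u005cxb, while the escaped quotes and the \u00e9 escape are left untouched. Note that, as written, the second expression would also rewrite escapes such as \n or \t, which may or may not be intended for a given input file.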

> the SolrIndex zip file is about 3.5GB.
> I am using a min-score=2 in minincoming.properties
> I think the 3.7 index file from the IKS project downloads site was created
> with min-score=10.

The dbpedia 3.7 index was built by ogrisel, but I think you are right.
3.5 GB for all entities with >= 2 incoming links (should be about
4 million entities) sounds reasonable. If you want to share your index
with the Stanbol community, I am sure we can find a server to host it.
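
The number of entities that a given min-score lets through can be estimated from incoming_links.txt before running the indexer. A minimal sketch, assuming each line of that file is "<count> <local-name>" (the output style of uniq -c); the function name count_min_score is made up for illustration.

```shell
#!/bin/sh
# Count how many entities in a ranking file reach a minimum number of
# incoming links.
# ASSUMPTION: each line is "<count> <local-name>", as produced by uniq -c.
count_min_score() {
    # $1 = ranking file, $2 = minimum number of incoming links
    awk -v min="$2" '
        $1 + 0 >= min + 0 { n++ }   # compare numerically, count matches
        END { print n + 0 }          # print 0 rather than an empty line
    ' "$1"
}
```

For example, count_min_score incoming_links.txt 2 prints the number of entities with at least two incoming links, which can be compared against the rough 4 million figure above.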


Note about languages:

While it is easy to include labels, comments and abstracts in additional
languages, it is not so easy to add proper Solr field definitions for
those languages. There is a great wiki page that provides all the
necessary links [1], but I still find it very hard to add configurations
for languages I do not understand. So if someone can help with that, I
am happy to improve the Solr schemas used by the Entityhub (and the
Entityhub Indexing tool)!


Upgrading the default DBpedia index:

After ApacheCon I will work on replacing the default dbpedia index
used with the Stanbol launchers with a dbpedia 3.8 based version (the
current one is still based on 3.6). This will take some time because I
expect that I will need to adapt a lot of unit/integration tests
affected by the data changes.

[1] http://wiki.apache.org/solr/LanguageAnalysis




-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: EntityHub Referenced Site and redirects

Posted by Andrea Di Menna <an...@inqmobile.com>.
Hi Harish,

The SolrIndex zip file is about 3.5 GB.
I am using min-score=2 in minincoming.properties.
I think the 3.7 index file from the IKS project downloads site was created
with min-score=10.

I have indexed English resources plus labels from other languages, as this
is what I currently need.

Cheers
Andrea







Re: EntityHub Referenced Site and redirects

Posted by harish suvarna <hs...@gmail.com>.
Andrea,
Thanks for the update. I was also trying to create the Chinese and English
dbpedia 3.8 indexes, but I ran out of hardware resources.
What is the size of the dbpedia.solr.index.zip file? It used to be 1.9 GB
(zip file), but I guess that contained labels from all languages.

Did you index English only?

-harish



-- 
Thanks
Harish