Posted to user@any23.apache.org by Bianca Pereira <bi...@gmail.com> on 2014/07/10 13:50:33 UTC

Extracting Blank Nodes instead of IRIs

Hi all,

I recently started using Any23 and ran into an issue extracting
information from one website (IMDB.com).

I want to extract triples from its web pages, and I faced the following
problem:

Even when there is an IRI that could serve as the identifier for a
concept, it is not used and a blank node is used instead. In the example
below, the actor Marco Nanini is represented by a blank node
(*_:nodec984d7c9ee5436ea92571ccd94b946*) even though he has an IRI that
could be used as the identifier (*file:/name/nm0620847/?ref_=tt_cl_t1*).
That blank node is then used to link him to a Movie, which is also
identified by a blank node.

It seems that in this specific case I could use the value of the
*/Person/url* property as the unique identifier (*IRI*) for the entity. I
suppose this is not a problem with the extractor but with how the page was
marked up. But since many people are using schema.org, I was wondering
whether there is a general solution for this case. I would be very glad if
anyone has an idea for one.

<file:index.html%3Fref_=fn_al_tt_4> <http://purl.org/dc/terms/title> "Copacabana (2001) - IMDb" .
_:nodee59ff091c1fa911a94a42244c38ab99a <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Movie> .

_:nodec984d7c9ee5436ea92571ccd94b946 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
_:nodec984d7c9ee5436ea92571ccd94b946 <http://schema.org/Person/name> "Marco Nanini" .
_:nodec984d7c9ee5436ea92571ccd94b946 <http://schema.org/Person/url> <file:/name/nm0620847/?ref_=tt_cl_t1> .
_:nodee59ff091c1fa911a94a42244c38ab99a <http://schema.org/Movie/actor> _:nodec984d7c9ee5436ea92571ccd94b946 .

_:nodebf90e351418e786432aede35cceb807 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
_:nodebf90e351418e786432aede35cceb807 <http://schema.org/Person/name> "Walderez de Barros" .
_:nodebf90e351418e786432aede35cceb807 <http://schema.org/Person/url> <file:/name/nm0207281/?ref_=tt_cl_t2> .
_:nodee59ff091c1fa911a94a42244c38ab99a <http://schema.org/Movie/actor> _:nodebf90e351418e786432aede35cceb807 .
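
The idea of using the value of the schema.org url property as the
identifier could be sketched as a post-processing step over the extracted
triples. This is only an illustration under stated assumptions, not an
Any23 feature: the function name promote_url_to_iri is hypothetical, the
input is assumed to be N-Triples with one complete triple per line, and
any blank node that has a property whose IRI starts with
http://schema.org/ and ends with /url is assumed to be safely replaceable
by that URL.

```python
import re

# One N-Triples statement per line: subject, <predicate>, object, then " ."
TRIPLE = re.compile(r'^(\S+)\s+<([^>]+)>\s+(.+?)\s*\.$')


def promote_url_to_iri(ntriples):
    """Hypothetical helper (not part of Any23).

    Replaces each blank node that has a schema.org ".../url" property
    with the IRI given by that property, in both subject and object
    position. Blank nodes without such a property are left unchanged.
    """
    triples = []
    for line in ntriples.splitlines():
        match = TRIPLE.match(line.strip())
        if match:  # lines that are not well-formed triples are skipped
            triples.append(match.groups())

    # First pass: blank node label -> "<iri>" taken from its url property.
    # Assumption: the first url value found for a node is the one to use.
    mapping = {}
    for subj, pred, obj in triples:
        is_url_property = (
            pred.startswith("http://schema.org/") and pred.endswith("/url")
        )
        if subj.startswith("_:") and is_url_property and obj.startswith("<"):
            mapping.setdefault(subj, obj)

    # Second pass: substitute mapped blank nodes wherever they appear.
    rewritten = [
        f"{mapping.get(subj, subj)} <{pred}> {mapping.get(obj, obj)} ."
        for subj, pred, obj in triples
    ]
    return "\n".join(rewritten)
```

Applied to the output above, the Person nodes would be rewritten to their
file:/name/... IRIs, while the Movie node would stay blank because it has
no url property in this excerpt.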

Best Regards,

Bianca Pereira