Posted to user@tika.apache.org by b m <bm...@gmail.com> on 2011/11/21 03:20:51 UTC

Extracting Text from XML

I'm trying to extract text using the XMLParser from the following XML
document with no success:

<?xml version="1.0"?>
<RDF:RDF xmlns:MAF="http://maf.mozdev.org/metadata/rdf#"
         xmlns:NC="http://home.netscape.com/NC-rdf#"
         xmlns:RDF="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <RDF:Description RDF:about="urn:root">
    <MAF:originalurl RDF:resource="http://news.google.com/"/>
    <MAF:title RDF:resource="Google News"/>
    <MAF:archivetime RDF:resource="Thu, 17 Nov 2011 09:12:39 -0500"/>
    <MAF:indexfilename RDF:resource="index.html"/>
    <MAF:charset RDF:resource="UTF-8"/>
  </RDF:Description>
</RDF:RDF>


Here's my source code...

File file = new File("/tmp/test.rdf");
InputStream is = new FileInputStream(file);
Metadata metaData = new Metadata();
AbstractParser parser = new RdfParser();
DefaultHandler handler = new ToTextContentHandler();
parser.parse(is, handler, metaData, new ParseContext());
System.out.println("handler: " + handler);
System.out.println("metadata: " + metaData);



The result contains no parsed text. What am I doing wrong?

=======================sample code output==================

handler:






















metadata: Content-Type=application/xml

===========================================================

Re: Extracting Text from XML

Posted by Nick Burch <ni...@alfresco.com>.

On Mon, 21 Nov 2011, b m wrote:
> I'm a little surprised that Tika doesn't have something that extracts
> element attribute values.

If you define your own content handler, I think you should be able to get
at all of them. However, the ToTextContentHandler only outputs the textual
content, which is why your code wasn't working. You need to use a
different handler for your case.

Nick

Re: Extracting Text from XML

Posted by b m <bm...@gmail.com>.

I'm a little surprised that Tika doesn't have something that extracts
element attribute values. The closest I've come to what I want is
overriding the XMLParser's getContentHandler method:


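import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.xml.AttributeMetadataHandler;
import org.apache.tika.parser.xml.XMLParser;
import org.apache.tika.sax.TeeContentHandler;
import org.xml.sax.ContentHandler;

public class RdfParser extends XMLParser
{
  private static final String RDF_NS =
      "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

  // Copies each RDF:<localName> attribute value into the metadata field <name>.
  private static ContentHandler getRdfHandler(Metadata metadata, String name,
      String localName)
  {
    return new AttributeMetadataHandler(RDF_NS, localName, metadata, name);
  }

  // Runs the default XML text handling alongside the two attribute handlers.
  @Override
  protected ContentHandler getContentHandler(ContentHandler handler,
      Metadata metadata, ParseContext context)
  {
    return new TeeContentHandler(
        super.getContentHandler(handler, metadata, context),
        getRdfHandler(metadata, "about", "about"),
        getRdfHandler(metadata, "resource", "resource"));
  }
}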


But even that doesn't give me what I want, as it puts every value into the
metadata's "resource" property. I'd like a way to extract the resource
attribute value for the <MAF:title> element and for the <MAF:originalurl>
element separately.
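
One way to keep the values apart is a SAX handler that records the
RDF:resource value under the name of the element that carries it. The
following is only a minimal sketch using the JDK's own SAX parser, not Tika:
the class ResourceAttributeHandler, its extract method and the SAMPLE constant
are hypothetical names made up for illustration. A handler like this could
presumably be added to the TeeContentHandler in a getContentHandler override,
next to the AttributeMetadataHandler instances, but that wiring is an
assumption and is not tested here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of Tika): maps each element's local name
// to the value of its RDF:resource attribute.
class ResourceAttributeHandler extends DefaultHandler {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Shortened copy of the document from this thread.
    static final String SAMPLE =
        "<?xml version=\"1.0\"?>"
        + "<RDF:RDF xmlns:MAF=\"http://maf.mozdev.org/metadata/rdf#\""
        + " xmlns:RDF=\"" + RDF_NS + "\">"
        + "<RDF:Description RDF:about=\"urn:root\">"
        + "<MAF:originalurl RDF:resource=\"http://news.google.com/\"/>"
        + "<MAF:title RDF:resource=\"Google News\"/>"
        + "</RDF:Description>"
        + "</RDF:RDF>";

    final Map<String, String> resources = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attrs) {
        // Namespace-aware lookup; returns null when the attribute is absent.
        String value = attrs.getValue(RDF_NS, "resource");
        if (value != null) {
            resources.put(localName, value);
        }
    }

    static Map<String, String> extract(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // Needed so that uri and localName are populated in startElement.
        factory.setNamespaceAware(true);
        ResourceAttributeHandler handler = new ResourceAttributeHandler();
        factory.newSAXParser().parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
            handler);
        return handler.resources;
    }

    public static void main(String[] args) throws Exception {
        // Prints one "element = value" line per element with RDF:resource.
        extract(SAMPLE).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Keying on the local name means two elements with the same name would
overwrite each other; a multi-valued map would be needed if that can happen.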


=============================output================================

handler:





metadata: resource=http://news.google.com/, Google News, Thu, 17 Nov 2011
09:12:39 -0500, index.html, UTF-8 about=urn:root
Content-Type=application/xml
======================================================================

On Mon, Nov 21, 2011 at 11:57 AM, Nick Burch <ni...@alfresco.com> wrote:

> On Sun, 20 Nov 2011, b m wrote:
>
>> <?xml version="1.0"?>
>> <RDF:RDF xmlns:MAF="http://maf.mozdev.org/metadata/rdf#"
>>        xmlns:NC="http://home.netscape.com/NC-rdf#"
>>        xmlns:RDF="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
>>  <RDF:Description RDF:about="urn:root">
>>   <MAF:originalurl RDF:resource="http://news.google.com/"/>
>>   <MAF:title RDF:resource="Google News"/>
>>   <MAF:archivetime RDF:resource="Thu, 17 Nov 2011 09:12:39 -0500"/>
>>   <MAF:indexfilename RDF:resource="index.html"/>
>>   <MAF:charset RDF:resource="UTF-8"/>
>>  </RDF:Description>
>> </RDF:RDF>
>>
>
> This XML doesn't have any text nodes, it only has attributes (i.e.
> nothing like <foo>this is text</foo>).
>
>
>> Here's my source code...
>>
>> File file = new File("/tmp/test.rdf");
>> InputStream is = new FileInputStream(file);
>> Metadata metaData = new Metadata();
>> AbstractParser parser = new RdfParser();
>> DefaultHandler handler = new ToTextContentHandler();
>>
>
> This handler will only give you the contents of text nodes, but you don't
> have any!
>
> Nick
>

Re: Extracting Text from XML

Posted by Nick Burch <ni...@alfresco.com>.

On Sun, 20 Nov 2011, b m wrote:
> <?xml version="1.0"?>
> <RDF:RDF xmlns:MAF="http://maf.mozdev.org/metadata/rdf#"
>         xmlns:NC="http://home.netscape.com/NC-rdf#"
>         xmlns:RDF="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
>  <RDF:Description RDF:about="urn:root">
>    <MAF:originalurl RDF:resource="http://news.google.com/"/>
>    <MAF:title RDF:resource="Google News"/>
>    <MAF:archivetime RDF:resource="Thu, 17 Nov 2011 09:12:39 -0500"/>
>    <MAF:indexfilename RDF:resource="index.html"/>
>    <MAF:charset RDF:resource="UTF-8"/>
>  </RDF:Description>
> </RDF:RDF>

This XML doesn't have any text nodes, it only has attributes (i.e.
nothing like <foo>this is text</foo>).

> Here's my source code...
>
> File file = new File("/tmp/test.rdf");
> InputStream is = new FileInputStream(file);
> Metadata metaData = new Metadata();
> AbstractParser parser = new RdfParser();
> DefaultHandler handler = new ToTextContentHandler();

This handler will only give you the contents of text nodes, but you don't 
have any!
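
To illustrate the difference between text nodes and attributes, here is a
small sketch with the JDK's SAX parser. TextOnlyDemo and textOf are
hypothetical names, and the class is only a stand-in for a text-only handler;
it is not Tika's ToTextContentHandler. It collects whatever arrives through
characters(), which is where text nodes are reported, and ignores attributes.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for a text-only handler (not a Tika class).
class TextOnlyDemo extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only character data (text nodes) is delivered to this callback;
        // attribute values are passed to startElement instead.
        text.append(ch, start, length);
    }

    static String textOf(String xml) throws Exception {
        TextOnlyDemo handler = new TextOnlyDemo();
        SAXParserFactory.newInstance().newSAXParser().parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
            handler);
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        String attributesOnly = "<root><title resource=\"Google News\"/></root>";
        String withTextNode = "<root><title>Google News</title></root>";
        System.out.println("[" + textOf(attributesOnly) + "]"); // prints "[]"
        System.out.println("[" + textOf(withTextNode) + "]");   // prints "[Google News]"
    }
}
```

Whitespace between elements also counts as character data when no DTD is
present, which is probably where the blank lines in the sample output come
from.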

Nick