Posted to user@tika.apache.org by b m <bm...@gmail.com> on 2011/11/21 03:20:51 UTC
Extracting Text from XML
I'm trying to extract text using the XMLParser from the following XML
document with no success:
<?xml version="1.0"?>
<RDF:RDF xmlns:MAF="http://maf.mozdev.org/metadata/rdf#"
xmlns:NC="http://home.netscape.com/NC-rdf#"
xmlns:RDF="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<RDF:Description RDF:about="urn:root">
<MAF:originalurl RDF:resource="http://news.google.com/"/>
<MAF:title RDF:resource="Google News"/>
<MAF:archivetime RDF:resource="Thu, 17 Nov 2011 09:12:39 -0500"/>
<MAF:indexfilename RDF:resource="index.html"/>
<MAF:charset RDF:resource="UTF-8"/>
</RDF:Description>
</RDF:RDF>
Here's my source code...
File file = new File("/tmp/test.rdf");
InputStream is = new FileInputStream(file);
Metadata metaData = new Metadata();
AbstractParser parser = new RdfParser();
DefaultHandler handler = new ToTextContentHandler();
parser.parse(is, handler, metaData, new ParseContext());
System.out.println("handler: " + handler);
System.out.println("metadata: " + metaData);
And the result is no parsed text. What am I doing wrong?
=======================sample code output==================
handler:
metadata: Content-Type=application/xml
===========================================================
Re: Extracting Text from XML
Posted by Nick Burch <ni...@alfresco.com>.
On Mon, 21 Nov 2011, b m wrote:
> I'm a little surprised that Tika doesn't have something that reads or
> extracts element attribute values.
If you define your own content handler, I think you should be able to get
at all of them. However, ToTextContentHandler only outputs the textual
content (text nodes), which is why your code produced no text. You need to
use a different handler for your case.
Nick
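A minimal sketch of the kind of custom handler described above, assuming plain JDK SAX rather than Tika's own classes. The class name AttributeTextHandler, its extract helper, and the trimmed SAMPLE document are hypothetical names introduced for illustration. The sketch drives the handler with the JDK's SAX parser directly; how (or whether) a given Tika parser forwards the original elements and attributes to your handler is not shown here and would need checking.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects attribute values instead of text nodes.
class AttributeTextHandler extends DefaultHandler {
    // Trimmed version of the RDF document from the original post.
    static final String SAMPLE =
        "<?xml version=\"1.0\"?>"
        + "<RDF:RDF xmlns:MAF=\"http://maf.mozdev.org/metadata/rdf#\""
        + " xmlns:RDF=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">"
        + "<RDF:Description RDF:about=\"urn:root\">"
        + "<MAF:originalurl RDF:resource=\"http://news.google.com/\"/>"
        + "<MAF:title RDF:resource=\"Google News\"/>"
        + "</RDF:Description></RDF:RDF>";

    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        // Append every attribute value of every element, one per line.
        for (int i = 0; i < atts.getLength(); i++) {
            text.append(atts.getValue(i)).append('\n');
        }
    }

    @Override
    public String toString() {
        return text.toString();
    }

    // Parses the given XML string and returns the collected attribute values.
    static String extract(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        AttributeTextHandler handler = new AttributeTextHandler();
        factory.newSAXParser().parse(
            new InputSource(new StringReader(xml)), handler);
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(extract(SAMPLE));
    }
}
```

Unlike ToTextContentHandler, this handler ignores character data entirely and looks only at the Attributes passed to startElement, which is where all of the values in this document live.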
Re: Extracting Text from XML
Posted by b m <bm...@gmail.com>.
I'm a little surprised that Tika doesn't have something that reads or
extracts element attribute values. The only thing I've found that does
something close to what I want is overriding the XMLParser's
getContentHandler method:
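import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.xml.AttributeMetadataHandler;
import org.apache.tika.parser.xml.XMLParser;
import org.apache.tika.sax.TeeContentHandler;
import org.xml.sax.ContentHandler;

public class RdfParser extends XMLParser
{
    private static final String RDF_NS =
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Copies each RDF_NS:localName attribute value into metadata under "name".
    private static ContentHandler getRdfHandler(Metadata metadata, String name,
        String localName)
    {
        return new AttributeMetadataHandler(RDF_NS, localName, metadata, name);
    }

    @Override
    protected ContentHandler getContentHandler(ContentHandler handler,
        Metadata metadata, ParseContext context)
    {
        // Run the default text handler alongside the two attribute handlers.
        return new TeeContentHandler(
            super.getContentHandler(handler, metadata, context),
            getRdfHandler(metadata, "about", "about"),
            getRdfHandler(metadata, "resource", "resource"));
    }
}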
But that doesn't give me what I want either, as it puts every value into
the metadata's "resource" property. I'd like a way to extract the resource
attribute value for the <MAF:title> element and, separately, the resource
attribute value for the <MAF:originalurl> element.
=============================output================================
handler:
metadata: resource=http://news.google.com/, Google News, Thu, 17 Nov 2011
09:12:39 -0500, index.html, UTF-8 about=urn:root
Content-Type=application/xml
======================================================================
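One way to keep the values apart, as asked for above, is to key each RDF:resource value by the element it was found on. The following is a rough sketch using only the JDK SAX API, not Tika's AttributeMetadataHandler; the class name ResourceByElementHandler, its extract helper, and the trimmed SAMPLE document are hypothetical and exist only for illustration.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: maps element local name -> its RDF:resource value.
class ResourceByElementHandler extends DefaultHandler {
    private static final String RDF_NS =
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Trimmed version of the RDF document from the original post.
    static final String SAMPLE =
        "<?xml version=\"1.0\"?>"
        + "<RDF:RDF xmlns:MAF=\"http://maf.mozdev.org/metadata/rdf#\""
        + " xmlns:RDF=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">"
        + "<RDF:Description RDF:about=\"urn:root\">"
        + "<MAF:originalurl RDF:resource=\"http://news.google.com/\"/>"
        + "<MAF:title RDF:resource=\"Google News\"/>"
        + "</RDF:Description></RDF:RDF>";

    private final Map<String, String> resources = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        // Look up the attribute by namespace URI and local name, so the
        // prefix used in the document (RDF:, rdf:, ...) does not matter.
        String resource = atts.getValue(RDF_NS, "resource");
        if (resource != null) {
            resources.put(localName, resource);
        }
    }

    // Parses the given XML string and returns element name -> resource value.
    static Map<String, String> extract(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed for the URI-based lookup
        ResourceByElementHandler handler = new ResourceByElementHandler();
        factory.newSAXParser().parse(
            new InputSource(new StringReader(xml)), handler);
        return handler.resources;
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> found = extract(SAMPLE);
        System.out.println("title: " + found.get("title"));
        System.out.println("originalurl: " + found.get("originalurl"));
    }
}
```

With this shape, the title and the original URL end up under separate keys ("title" and "originalurl") instead of being concatenated into a single "resource" property. If an element name repeats, the later value overwrites the earlier one, so a real implementation might need a list per key.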
On Mon, Nov 21, 2011 at 11:57 AM, Nick Burch <ni...@alfresco.com> wrote:
> On Sun, 20 Nov 2011, b m wrote:
>
>> <?xml version="1.0"?>
>> <RDF:RDF xmlns:MAF="http://maf.mozdev.org/metadata/rdf#"
>> xmlns:NC="http://home.netscape.com/NC-rdf#"
>> xmlns:RDF="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
>> <RDF:Description RDF:about="urn:root">
>> <MAF:originalurl RDF:resource="http://news.google.com/"/>
>> <MAF:title RDF:resource="Google News"/>
>> <MAF:archivetime RDF:resource="Thu, 17 Nov 2011 09:12:39 -0500"/>
>> <MAF:indexfilename RDF:resource="index.html"/>
>> <MAF:charset RDF:resource="UTF-8"/>
>> </RDF:Description>
>> </RDF:RDF>
>>
>
> This XML doesn't have any text nodes; it only has attributes (i.e.
> nothing like <foo>this is text</foo>).
>
>
>> Here's my source code...
>>
>> File file = new File("/tmp/test.rdf");
>> InputStream is = new FileInputStream(file);
>> Metadata metaData = new Metadata();
>> AbstractParser parser = new RdfParser();
>> DefaultHandler handler = new ToTextContentHandler();
>>
>
> This handler will only give you the contents of text nodes, but you don't
> have any!
>
> Nick
>
Re: Extracting Text from XML
Posted by Nick Burch <ni...@alfresco.com>.
On Sun, 20 Nov 2011, b m wrote:
> <?xml version="1.0"?>
> <RDF:RDF xmlns:MAF="http://maf.mozdev.org/metadata/rdf#"
> xmlns:NC="http://home.netscape.com/NC-rdf#"
> xmlns:RDF="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
> <RDF:Description RDF:about="urn:root">
> <MAF:originalurl RDF:resource="http://news.google.com/"/>
> <MAF:title RDF:resource="Google News"/>
> <MAF:archivetime RDF:resource="Thu, 17 Nov 2011 09:12:39 -0500"/>
> <MAF:indexfilename RDF:resource="index.html"/>
> <MAF:charset RDF:resource="UTF-8"/>
> </RDF:Description>
> </RDF:RDF>
This XML doesn't have any text nodes; it only has attributes (i.e.
nothing like <foo>this is text</foo>).
> Here's my source code...
>
> File file = new File("/tmp/test.rdf");
> InputStream is = new FileInputStream(file);
> Metadata metaData = new Metadata();
> AbstractParser parser = new RdfParser();
> DefaultHandler handler = new ToTextContentHandler();
This handler will only give you the contents of text nodes, but you don't
have any!
Nick