Posted to user@tika.apache.org by Rupak Khurana <kh...@gmail.com> on 2014/02/10 20:22:50 UTC

Parse out Name:Value pairs

Hello

I have a plain text file that has several "Name : Value" pairs that I want
to parse out. Note that this is not an XML or HTML file. I was hoping that a
startElement SAX event would be fired whenever a "Name" element is
encountered. Is there any ContentHandler that can do this? Currently, with
BodyContentHandler, I just get <body> All Name:Value pairs </body>. I am
not sure whether ElementMappingContentHandler can do the trick, or how to
use it. Any pointers, please?

thanks
RK

Re: Parse out Name:Value pairs

Posted by Rupak Khurana <kh...@gmail.com>.
Please ignore my previous request for ContentHandler examples. I looked at
the interface spec, but what matters more is whether Tika can actually parse
a "Name" and fire an event containing its "Value". If the entire text
content of the input file is simply output under the <body> tag of the
XHTML, then it does not serve my purpose.


On Mon, Feb 10, 2014 at 2:35 PM, Rupak Khurana <kh...@gmail.com> wrote:

> I am trying to parse JIL (Job Information Language) scripts, which happen
> to contain Name:Value pairs. Perhaps Tika is overkill, but I wanted to use
> its parsing ability and SAX event firing to make life easier. Could you
> please point me to some examples of a custom ContentHandler, if you know of
> any?
>
> thanks
>
>
> On Mon, Feb 10, 2014 at 2:27 PM, Ken Krugler <kk...@transpac.com> wrote:
>
>>
>> On Feb 10, 2014, at 11:22am, Rupak Khurana <kh...@gmail.com>
>> wrote:
>>
>> Hello
>>
>> I have a plain text file that has several "Name : Value" pairs that I
>> want to parse out. Note that this is not an XML or HTML file. I was hoping
>> that a startElement SAX event would be fired whenever a "Name" element is
>> encountered. Is there any ContentHandler that can do this? Currently, with
>> BodyContentHandler, I just get <body> All Name:Value pairs </body>. I am
>> not sure whether ElementMappingContentHandler can do the trick, or how to
>> use it. Any pointers, please?
>>
>>
>> If it's just plain text, then why do you want to deal with SAX events? Is
>> it that the file is too big?
>>
>> In any case, I imagine you could get the desired behavior by implementing
>> your own ContentHandler.
>>
>> -- Ken
>>
>>
>>    --------------------------
>>     Ken Krugler
>> +1 530-210-6378
>> http://www.scaleunlimited.com
>> custom big data solutions & training
>> Hadoop, Cascading, Cassandra & Solr
>>
>>
>>
>>
>>
>>
>

Re: Parse out Name:Value pairs

Posted by Ken Krugler <kk...@transpac.com>.
Hi Rupak,

You're parsing XML? That's an important bit of information.

In that case, you don't want to be using Tika - just use Dom4J or any one of the other many XML parsers.

Tika is designed to extract "text content" from a variety of input sources. Its parse output is designed to be XHTML 1.0-compatible, which means it's not what you want to be using for precise extraction of XML data.
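
As a rough illustration of the point above, here is a minimal sketch that feeds XML straight to the SAX parser bundled with the JDK (javax.xml.parsers), rather than Dom4J or Tika. With a plain XML parser, startElement is called for every element in the input. The class name PlainSaxDemo and the helper elementNames are made up for this example, and the inline XML is a shortened version of the sample.xml quoted in this thread.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: not part of Tika or of the code posted in this thread.
class PlainSaxDemo {

    // Parses the XML with the JDK's SAX parser and returns the element
    // names in the order their startElement events arrive.
    static List<String> elementNames(String xml) throws Exception {
        final List<String> names = new ArrayList<>();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes attributes) {
                        // The default factory is not namespace-aware, so use qName.
                        names.add(qName);
                    }
                });
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<transformation><info><name>sample_normalize</name>"
                + "<description/></info></transformation>";
        // prints [transformation, info, name, description]
        System.out.println(elementNames(xml));
    }
}
```

The same DefaultHandler subclass works here as with Tika; the difference is only in which parser drives it.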

-- Ken

On Feb 11, 2014, at 12:13pm, Rupak Khurana <kh...@gmail.com> wrote:

> Hello,
> 
> I have a small XML document that I want to parse using Tika, and I expect to get SAX events for each element in the input XML file. However, I get output only for html, head, meta, title, body and p; I don't get events for the elements in the XML file itself. See the code for the ContentHandler further below. Please advise.
> 
> **** Output *****
> 
> StartDocument
> StartElement html
> StartElement head
> StartElement meta
> EndElement meta
> StartElement title
> EndElement title
> EndElement head
> StartElement body
> StartElement p
> EndElement p
> EndElement body
> EndElement html
> EndDocument
> 
> 
> ***** sample.xml ******
> 
> <transformation>
>   <info>
>     <name>sample_normalize</name>
>     <description/>
>     <parameters>
>        <parameter>
>             <name>AS_OF_DATE</name>
>             <default_value>2012-06-01</default_value>
>             <description/>
>         </parameter>
>     </parameters>
>   </info>
> </transformation>
> 
> 
> ***************** XYZContentHandler ****************
> 
> public class XYZContentHandler extends DefaultHandler {
> 
>     public XYZContentHandler() {
>     }
>     
>     @Override
>     public void startElement(String uri, String localName, String qName, Attributes attributes)
>              throws SAXException {        
>         System.out.println("StartElement "+qName);
>     }
>     
>     @Override
>     public void endElement(String uri, String local, String name) throws SAXException {        
>         System.out.println("EndElement "+name);
>     }
> 
>     @Override
>     public void startDocument() throws SAXException {
>         System.out.println("StartDocument");
>     }
> 
>     @Override
>     public void endDocument() throws SAXException {
>         System.out.println("EndDocument");
>     }
> }
> 
> 
> ****** Actual Code *******
> 
>            stream = new FileInputStream(new File(filename));
>            Metadata metadata = new Metadata();            
>            metadata.set(Metadata.CONTENT_TYPE, "application/xml");
> 
>             XYZContentHandler handler = new XYZContentHandler();
>             ParseContext context = new ParseContext();
> 
>             //Parser parser = new AutoDetectParser();
>             Parser parser = new XMLParser();            
>             parser.parse(stream, handler, metadata, context);
> 
> 
> 
> 
> 
> 
> 
> 
> On Mon, Feb 10, 2014 at 3:30 PM, Nick Burch <ap...@gagravarr.org> wrote:
> On Mon, 10 Feb 2014, Rupak Khurana wrote:
> I am trying to parse JIL (Job Information Language) scripts, which happen
> to contain Name:Value pairs. Perhaps Tika is overkill, but I wanted to use its
> parsing ability and SAX event firing to make life easier.
> 
> Sounds like you'll want to define / identify a suitable mimetype for these, add some mime magic so they get detected, then write your own parser that spots these name/value pairs and emits suitable SAX events for you to consume.
> 
> See http://tika.apache.org/1.4/parser_guide.html for a guide as to how to do all of that
> 
> Nick
> 

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr






Re: Parse out Name:Value pairs

Posted by Rupak Khurana <kh...@gmail.com>.
Hello,

I have a small XML document that I want to parse using Tika, and I expect to
get SAX events for each element in the input XML file. However, I get output
only for html, head, meta, title, body and p; I don't get events for the
elements in the XML file itself. See the code for the ContentHandler further
below. Please advise.

**** Output *****

StartDocument
StartElement html
StartElement head
StartElement meta
EndElement meta
StartElement title
EndElement title
EndElement head
StartElement body
StartElement p
EndElement p
EndElement body
EndElement html
EndDocument


***** sample.xml ******

<transformation>
  <info>
    <name>sample_normalize</name>
    <description/>
    <parameters>
       <parameter>
            <name>AS_OF_DATE</name>
            <default_value>2012-06-01</default_value>
            <description/>
        </parameter>
    </parameters>
  </info>
</transformation>


***************** XYZContentHandler ****************

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Logs every SAX event it receives, so the output shows exactly which
// elements the parser reports.
public class XYZContentHandler extends DefaultHandler {

    public XYZContentHandler() {
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) throws SAXException {
        System.out.println("StartElement " + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        System.out.println("EndElement " + qName);
    }

    @Override
    public void startDocument() throws SAXException {
        System.out.println("StartDocument");
    }

    @Override
    public void endDocument() throws SAXException {
        System.out.println("EndDocument");
    }
}


****** Actual Code *******

    stream = new FileInputStream(new File(filename));
    Metadata metadata = new Metadata();
    metadata.set(Metadata.CONTENT_TYPE, "application/xml");

    XYZContentHandler handler = new XYZContentHandler();
    ParseContext context = new ParseContext();

    //Parser parser = new AutoDetectParser();
    Parser parser = new XMLParser();
    parser.parse(stream, handler, metadata, context);








On Mon, Feb 10, 2014 at 3:30 PM, Nick Burch <ap...@gagravarr.org> wrote:

> On Mon, 10 Feb 2014, Rupak Khurana wrote:
>
>> I am trying to parse JIL (Job Information Language) scripts, which happen
>> to contain Name:Value pairs. Perhaps Tika is overkill, but I wanted to use
>> its parsing ability and SAX event firing to make life easier.
>>
>
> Sounds like you'll want to define / identify a suitable mimetype for
> these, add some mime magic so they get detected, then write your own parser
> that spots these name/value pairs and emits suitable SAX events for you to
> consume.
>
> See http://tika.apache.org/1.4/parser_guide.html for a guide as to how to
> do all of that
>
> Nick
>

Re: Parse out Name:Value pairs

Posted by Nick Burch <ap...@gagravarr.org>.
On Mon, 10 Feb 2014, Rupak Khurana wrote:
> I am trying to parse JIL (Job Information Language) scripts, which happen
> to contain Name:Value pairs. Perhaps Tika is overkill, but I wanted to use its
> parsing ability and SAX event firing to make life easier.

Sounds like you'll want to define / identify a suitable mimetype for
these, add some mime magic so they get detected, then write your own
parser that spots these name/value pairs and emits suitable SAX events
for you to consume.

See http://tika.apache.org/1.4/parser_guide.html for a guide as to how to 
do all of that
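
The "spot the pairs and emit SAX events" step could look roughly like the sketch below. It uses only the JDK's org.xml.sax classes and deliberately leaves out the Tika side (the mimetype definition, detection, and the Parser wrapper covered by the guide linked above). The class name NameValueSaxEmitter, the method emit, and the sample input are all made up for illustration; the real JIL syntax may need more careful handling, for example several pairs on one line or names that are not valid XML element names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: turns "name: value" lines into SAX events.
class NameValueSaxEmitter {

    // Reads the input line by line and reports each pair to the handler
    // as startElement(name), characters(value), endElement(name).
    static void emit(Reader input, ContentHandler handler)
            throws IOException, SAXException {
        BufferedReader reader = new BufferedReader(input);
        handler.startDocument();
        String line;
        while ((line = reader.readLine()) != null) {
            int colon = line.indexOf(':'); // split on the first colon only
            if (colon <= 0) {
                continue; // no name before a colon, so not a pair
            }
            String name = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            if (name.isEmpty()) {
                continue;
            }
            handler.startElement("", name, name, new AttributesImpl());
            handler.characters(value.toCharArray(), 0, value.length());
            handler.endElement("", name, name);
        }
        handler.endDocument();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample input, not taken from a real JIL script.
        String sample = "owner: batch\nstart_times: 10:30\n";
        emit(new StringReader(sample), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                System.out.print(qName + " = ");
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                System.out.println(new String(ch, start, length));
            }
        });
    }
}
```

Splitting on the first colon keeps values such as 10:30 intact. A custom Tika parser would presumably call logic like this from its parse method, with the handler that Tika passes in.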

Nick

Re: Parse out Name:Value pairs

Posted by Rupak Khurana <kh...@gmail.com>.
I am trying to parse JIL (Job Information Language) scripts, which happen
to contain Name:Value pairs. Perhaps Tika is overkill, but I wanted to use its
parsing ability and SAX event firing to make life easier. Could you please
point me to some examples of a custom ContentHandler, if you know of any?

thanks


On Mon, Feb 10, 2014 at 2:27 PM, Ken Krugler <kk...@transpac.com> wrote:

>
> On Feb 10, 2014, at 11:22am, Rupak Khurana <kh...@gmail.com>
> wrote:
>
> Hello
>
> I have a plain text file that has several "Name : Value" pairs that I
> want to parse out. Note that this is not an XML or HTML file. I was hoping
> that a startElement SAX event would be fired whenever a "Name" element is
> encountered. Is there any ContentHandler that can do this? Currently, with
> BodyContentHandler, I just get <body> All Name:Value pairs </body>. I am
> not sure whether ElementMappingContentHandler can do the trick, or how to
> use it. Any pointers, please?
>
>
> If it's just plain text, then why do you want to deal with SAX events? Is
> it that the file is too big?
>
> In any case, I imagine you could get the desired behavior by implementing
> your own ContentHandler.
>
> -- Ken
>
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
>
>
>
>
>
>

Re: Parse out Name:Value pairs

Posted by Ken Krugler <kk...@transpac.com>.
On Feb 10, 2014, at 11:22am, Rupak Khurana <kh...@gmail.com> wrote:

> Hello
> 
> I have a plain text file that has several "Name : Value" pairs that I want to parse out. Note that this is not an XML or HTML file. I was hoping that a startElement SAX event would be fired whenever a "Name" element is encountered. Is there any ContentHandler that can do this? Currently, with BodyContentHandler, I just get <body> All Name:Value pairs </body>. I am not sure whether ElementMappingContentHandler can do the trick, or how to use it. Any pointers, please?

If it's just plain text, then why do you want to deal with SAX events? Is it that the file is too big?

In any case, I imagine you could get the desired behavior by implementing your own ContentHandler.
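
One possible shape for such a handler, as a sketch only: it buffers whatever text arrives through characters() and splits it into pairs once the document ends. It assumes one "Name : Value" pair per line, which may not match the real files. The class name NameValueContentHandler and the method getPairs are invented for this example, and the sample text is made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects body text and extracts "Name : Value" pairs.
class NameValueContentHandler extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();
    private final Map<String, String> pairs = new LinkedHashMap<>();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text can arrive in several chunks, so buffer it before splitting.
        text.append(ch, start, length);
    }

    @Override
    public void endDocument() {
        for (String line : text.toString().split("\\r?\\n")) {
            int colon = line.indexOf(':'); // split on the first colon only
            if (colon <= 0) {
                continue; // not a pair
            }
            String name = line.substring(0, colon).trim();
            if (!name.isEmpty()) {
                pairs.put(name, line.substring(colon + 1).trim());
            }
        }
    }

    // Pairs in the order they appeared; a repeated name keeps its last value.
    public Map<String, String> getPairs() {
        return pairs;
    }

    public static void main(String[] args) {
        NameValueContentHandler handler = new NameValueContentHandler();
        // Simulate a parser delivering the text in two chunks.
        char[] first = "owner : ba".toCharArray();
        char[] second = "tch\nstart_times : 10:30\n".toCharArray();
        handler.characters(first, 0, first.length);
        handler.characters(second, 0, second.length);
        handler.endDocument();
        // prints {owner=batch, start_times=10:30}
        System.out.println(handler.getPairs());
    }
}
```

Because it is an ordinary ContentHandler, an instance could be passed as the handler argument of a parse call in place of BodyContentHandler, and the pairs read back afterwards.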

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr