Posted to users@nifi.apache.org by Sven Davison <sv...@gmail.com> on 2016/05/31 22:32:55 UTC

RegEx not catching all tags

http://prntscr.com/basrzy

The above is a screenshot showing a hashtags variable that contains only the first
instance of a hashtag. I want to get a list of ALL hashtags from
twitter.text, not just the first one. I'm fairly sure my RegEx is wrong...
Here's what I have.

(#{1}[a-zA-Z0-9_]*)

I'm using https://regex101.com/ to simulate traffic and tests, but I can't
get it to recognize more than the first instance of the regex.

Re: RegEx not catching all tags

Posted by Andy LoPresto <al...@apache.org>.
Thanks Sven. Could I ask you to open a Jira [1] requesting a boolean option in the ExtractText processor properties that allows for global results?

[1] https://issues.apache.org/jira/browse/NIFI

Andy LoPresto
alopresto@apache.org
alopresto.apache@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69



Re: RegEx not catching all tags

Posted by Sven Davison <sv...@gmail.com>.
Thanks. I did some more reading, and NiFi's documentation says it only returns the first one. HOWEVER... the JSON object returned already had an element of tags!

$.entities.hashtags.*.text or... Something. I got it working late last night!
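For readers who want to reproduce this, a rough configuration sketch follows. It assumes the flowfile content is a single tweet as JSON with an entities.hashtags array, and that the EvaluateJsonPath processor is used; the property name "hashtags" is only illustrative, and the exact path Sven ended up with is not stated in the thread.

```text
EvaluateJsonPath
  Destination : flowfile-attribute
  Return Type : json
  hashtags    : $.entities.hashtags[*].text   (user-defined property, illustrative name)
```

With a Return Type of json, a path that matches several values should yield them as a JSON array in the "hashtags" attribute rather than only the first match.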



-Sven Davison 
(sent from my iPhone)


Re: RegEx not catching all tags

Posted by Andy LoPresto <al...@apache.org>.
Hi Sven,

Are you using an ExtractText processor [1] here? If so, you can extract multiple capture groups which will be stored in flowfile attributes such as “regexattr.1”, “regexattr.2”, etc. when assigned to the regular expression name “regexattr”.

Try the regular expression I’ve provided here [2] (explanation available on the site). This captures a literal ‘#’, any “word” character one or more times until a word boundary, and does this “globally”, aka does not stop searching after the first result. I didn’t check exhaustively if hashtags can contain special characters like ‘-’, etc. but that should be well-documented by Twitter.

/(#[\w]+\b)/g

[1] https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.ExtractText/index.html
[2] https://regex101.com/r/gV3mO5/1
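
A note on the "global" part: the surrounding slashes and the trailing g in /(#[\w]+\b)/g are regex101/JavaScript notation. Java's regex engine has no such flag; the usual way to collect every match is to call Matcher.find() in a loop. The following is a minimal standalone sketch of that idea, not ExtractText's own code. The class name, method name, and sample tweet are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HashtagExtractor {
    // '#' followed by one or more word characters ([a-zA-Z0-9_]).
    // No slashes and no 'g' flag: those belong to other regex notations.
    private static final Pattern HASHTAG = Pattern.compile("#\\w+");

    /** Returns every hashtag in the text, in order of appearance. */
    public static List<String> extractHashtags(String text) {
        List<String> tags = new ArrayList<>();
        Matcher matcher = HASHTAG.matcher(text);
        // Each call to find() resumes after the previous match,
        // which gives the same effect as a "global" search.
        while (matcher.find()) {
            tags.add(matcher.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        // Hypothetical tweet text, for illustration only.
        String tweet = "Loving #NiFi for #dataflow work #big_data";
        System.out.println(extractHashtags(tweet)); // [#NiFi, #dataflow, #big_data]
    }
}
```

A loop like this could live in a script or custom processor if a processor property alone cannot return more than the first match.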


Andy LoPresto
alopresto@apache.org
alopresto.apache@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69



Re: Which processor to use to cleanly convert xml to json?

Posted by Keith Lim <Ke...@ds-iq.com>.
Thanks Bryan,


Many sites that provide online conversion between the two formats, such as this one:

http://www.utilities-online.info/xmltojson/#.V03_F2grIuU

yield the correct result:


{
  "record": {
    "property1": {
      "#text": [
        "Lake",
        "River",
        "National_State Park"
      ]
    },
    "property2": {
      "#text": [
        "A:Value1",
        "B:Value2",
        "C:Value3",
        "D:Value4",
        "E:Value5"
      ]
    }
  }
}


I am wondering how much is missing from the stylesheet to enable that. I guess I have to do some reading now to familiarize myself with it.


Thanks,
Keith


Re: Which processor to use to cleanly convert xml to json?

Posted by Keith Lim <Ke...@ds-iq.com>.
Thanks for pointing that out. Luckily, the stated assumptions work for my data: as long as the data is captured, I don't need to differentiate between attributes and elements, and ordering is not required. What gets ignored is not pertinent data in my context.


Thanks,
Keith



Re: Which processor to use to cleanly convert xml to json?

Posted by Thad Guidry <th...@gmail.com>.
Keith,

Hopefully you are aware of some of the pitfalls that you might run into
with that approach.  But it might be good enough for your particular use
case :)

From org.json.XML

Convert a well-formed (but not necessarily valid) XML string into a
JSONObject. Some information may be lost in this transformation because
JSON is a data format and XML is a document format. XML uses elements,
attributes, and content text, while JSON uses unordered collections of
name/value pairs and arrays of values. JSON does not like to
distinguish between elements and attributes. Sequences of similar elements
are represented as JSONArrays. Content text may be placed in a "content"
member. Comments, prologs, DTDs, and <[ [ ]]> are ignored.

Thad
+ThadGuidry <https://www.google.com/+ThadGuidry>

Re: Which processor to use to cleanly convert xml to json?

Posted by Keith Lim <Ke...@ds-iq.com>.
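Rather than parsing the structure myself, I have decided to go with the XML library that converts to JSON for me.


// Requires the org.json library on the script's classpath
import org.json.JSONObject
import org.json.XML


// Sample record with <BR/> separators embedded in the element text
def xml = '<record><property1>Lake<BR/>River<BR/>National_State_Park</property1><property2>A:Value1<BR/>B:Value2<BR/>C:Value3<BR/>D:Value4<BR/>E:Value5</property2></record>'

// Number of spaces per indent level in the pretty-printed JSON
def textIndent = 2
// Convert the XML string into a JSONObject
def xmlJSONObj = XML.toJSONObject(xml)
// Last expression: the pretty-printed JSON string is the script's result
xmlJSONObj.toString(textIndent)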


Thanks,
Keith




Re: Which processor to use to cleanly convert xml to json?

Posted by Keith Lim <Ke...@ds-iq.com>.
Thanks Bryan and Thad for the quick response,


I like these more established libraries.  I will go with the Groovy example.


Thanks,
Keith



Re: Which processor to use to cleanly convert xml to json?

Posted by Thad Guidry <th...@gmail.com>.
You can use the ExecuteScript processor with Groovy to easily slurp XML and
then build the Json.

http://stackoverflow.com/questions/23374652/xml-to-json-with-groovy-xmlslurper-and-jsonbuilder

http://funnifi.blogspot.com/2016/02/executescript-explained-split-fields.html

Thad
+ThadGuidry <https://www.google.com/+ThadGuidry>

Re: Which processor to use to cleanly convert xml to json?

Posted by Bryan Bende <bb...@gmail.com>.
Hi Keith,

There is currently no built-in processor that directly transforms XML to
JSON.

TransformXML leverages XSLT to transform an XML document into some other
format.
In that post, the XSLT happens to transform into JSON, but it looks like
maybe it only handles top-level elements and not nesting.

I would say your options would be to modify that stylesheet to support
nested elements, or if you have a specific well-defined XML format you
could write a custom processor that is specific to your format.
For a custom processor you could possibly generate JAXB objects from your XML
schema, unmarshal the XML into those objects, then re-marshal them as JSON.

Others may have additional suggestions of something that could be done
through ExecuteScript.

-Bryan
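
The record in this thread is a case of mixed content: each property holds text segments separated by empty <BR/> elements, so a converter that only looks at child elements sees the BR tags and drops the text. As a rough illustration of what a script or custom processor might do instead, the sketch below uses only the JDK's DOM parser to collect the text segments per property. The class and method names are hypothetical, the JSON writer escapes only backslashes and double quotes, and it assumes a flat record like the one posted.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class MixedContentToJson {

    /** Collects the non-blank text segments of each child element of the root. */
    public static Map<String, List<String>> textSegments(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, List<String>> result = new LinkedHashMap<>();
        NodeList properties = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < properties.getLength(); i++) {
            Node property = properties.item(i);
            if (property.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace between elements
            }
            List<String> segments = new ArrayList<>();
            NodeList children = property.getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                // Keep the text nodes; the <BR/> elements only act as separators.
                if (child.getNodeType() == Node.TEXT_NODE && !child.getNodeValue().trim().isEmpty()) {
                    segments.add(child.getNodeValue().trim());
                }
            }
            result.put(property.getNodeName(), segments);
        }
        return result;
    }

    // Minimal escaping for this sketch: backslash and double quote only.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Renders the map as a compact JSON object of string arrays. */
    public static String toJson(Map<String, List<String>> data) {
        StringBuilder sb = new StringBuilder("{");
        boolean firstKey = true;
        for (Map.Entry<String, List<String>> entry : data.entrySet()) {
            if (!firstKey) {
                sb.append(',');
            }
            firstKey = false;
            sb.append('"').append(escape(entry.getKey())).append("\":[");
            List<String> values = entry.getValue();
            for (int i = 0; i < values.size(); i++) {
                if (i > 0) {
                    sb.append(',');
                }
                sb.append('"').append(escape(values.get(i))).append('"');
            }
            sb.append(']');
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<record><property1>Lake<BR/>River<BR/>National_State Park</property1>"
                + "<property2>A:Value1<BR/>B:Value2</property2></record>";
        System.out.println(toJson(textSegments(xml)));
    }
}
```

This keeps every value but discards attributes, nesting, and the root element name, so it is only a fit when, as with this data, those are not needed.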


>

RE: Which processor to use to cleanly convert xml to json?

Posted by Keith Lim <Ke...@ds-iq.com>.
Any help or guidance would be much appreciated.

Thanks,
Keith

Which processor to use to cleanly convert xml to json?

Posted by Keith Lim <Ke...@ds-iq.com>.
Which processor should I use to cleanly convert from xml to json?

This article illustrates using TransformXML with a stylesheet to convert xml to json.

https://community.hortonworks.com/articles/29474/nifi-converting-xml-to-json.html

However, I am seeing that it does not convert values containing special tags, such as the embedded <BR/> tag below:


XML fragment:

<record>

    <property1>Lake<BR/>River<BR/>National_State Park</property1>
    <property2>A:Value1<BR/>B:Value2<BR/>C:Value3<BR/>D:Value4<BR/>E:Value5</property2>

</record>


Converted Json:


{ "record" : {

        "property1" : { "BR" :["",""] },
        "property2" : { "BR" :["","","",""] }

}}

I may need to read up on stylesheets; however, this is just the one problem I am seeing, and I don't know what other issues may crop up using this script.

Thanks,
Keith