Posted to users@nifi.apache.org by Sven Davison <sv...@gmail.com> on 2017/04/11 16:38:13 UTC

parsing html

I'm looking to parse some HTML. It's not the cleanest, but I know that my
content is always on line 10 of the file. I could use SplitText and then
compare the result to ensure it starts with XYZBeginningString, I suppose,
but I'm looking for something with less overhead, especially knowing the
content is always on line 10.

Anyone have other/cleaner ideas on how to get the content of line 10?

Re: parsing html

Posted by Jeremy Dyer <jd...@gmail.com>.

No problem, Sven. Just curious, which version do you have? If I recall
correctly, I believe GetHTML was included as early as version 0.5.1.

On Thu, Apr 13, 2017 at 10:32 AM, Sven Davison <sv...@gmail.com>
wrote:

> Thanks for the ideas, guys! For reasons beyond my control, I can't update
> to the newest NiFi to get the GetHTML processor at this time. Maybe some
> day. I'll look into ExecuteScript and SplitText more.

Re: parsing html

Posted by Sven Davison <sv...@gmail.com>.

Thanks for the ideas, guys! For reasons beyond my control, I can't update to
the newest NiFi to get the GetHTML processor at this time. Maybe some day.
I'll look into ExecuteScript and SplitText more.

On Tue, Apr 11, 2017 at 1:14 PM, Jeremy Dyer <jd...@gmail.com> wrote:

> Sven,
>
> There is also the GetHTML processor I added a while back. If the input is
> valid HTML, you should always be able to use a CSS selector to extract that
> HTML value. If you can provide a sample of the HTML, I would be glad to make
> a flow for you as an example.
>
> Jeremy
>
> Sent from my iPhone

Re: parsing html

Posted by Jeremy Dyer <jd...@gmail.com>.

Sven,

There is also the GetHTML processor I added a while back. If the input is valid HTML, you should always be able to use a CSS selector to extract that HTML value. If you can provide a sample of the HTML, I would be glad to make a flow for you as an example.
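
To illustrate the selector idea outside NiFi, here is a rough Python sketch using only the standard library. It is not the GetHTML processor and not a real CSS selector engine: it only handles the equivalent of an id selector such as `#content`, and it assumes every non-void element inside the target is explicitly closed. The class name `ElementTextById`, the helper `text_by_id`, and the sample markup are all made up for this example.

```python
from html.parser import HTMLParser


class ElementTextById(HTMLParser):
    """Collect the text inside the first element whose id matches."""

    # Elements that never have a closing tag, so they must not change depth.
    VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.depth = 0       # > 0 while the parser is inside the target element
        self.found = False   # True once the target element has been seen
        self.done = False    # True once the target element has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            if tag not in self.VOID_TAGS:
                self.depth += 1
        elif dict(attrs).get("id") == self.element_id:
            self.found = True
            self.depth = 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> neither open nor close a level.
        pass

    def handle_endtag(self, tag):
        if self.depth and not self.done and tag not in self.VOID_TAGS:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth and not self.done:
            self.parts.append(data)


def text_by_id(html, element_id):
    """Return the text of the element with the given id, or None if absent."""
    parser = ElementTextById(element_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip() if parser.found else None


# Hypothetical sample markup, only for demonstration.
sample_html = (
    '<html><body><div id="content">XYZBeginningString '
    "<b>payload</b><br> tail</div><p>other</p></body></html>"
)
print(text_by_id(sample_html, "content"))
```

Unlike counting lines, this keeps working if the markup around the target element gains or loses a line.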

Jeremy

Sent from my iPhone

> On Apr 11, 2017, at 1:01 PM, Andy LoPresto <al...@apache.org> wrote:
> 
> Sven,
> 
> Currently I would recommend using ExecuteScript and simply streaming & slicing the content bytes at line 10 (a one-line operation in Groovy, I believe the same in Ruby and Python). 
> 
> This isn’t the first time I’ve heard of a similar request though, so I think if you were to open a Jira requesting a “GetLine(s)” or “SliceText” processor, it could be valuable to the community. The current component solution would probably involve SplitText/SplitContent and as you said, decent overhead, especially if the desired content is early in the flowfile. 
> 
> Andy LoPresto
> alopresto@apache.org
> alopresto.apache@gmail.com
> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

Re: parsing html

Posted by Andy LoPresto <al...@apache.org>.

Sven,

Currently I would recommend using ExecuteScript and simply streaming & slicing the content bytes at line 10 (a one-line operation in Groovy, I believe the same in Ruby and Python).

This isn’t the first time I’ve heard of a similar request though, so I think if you were to open a Jira requesting a “GetLine(s)” or “SliceText” processor, it could be valuable to the community. The current component solution would probably involve SplitText/SplitContent and as you said, decent overhead, especially if the desired content is early in the flowfile.
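
For the ExecuteScript route, a minimal Python sketch of the "stream and slice" idea could look like the following. It assumes UTF-8 text with newline-separated lines. The function name `read_line` and its `expected_prefix` check are illustrative and not part of NiFi. Inside ExecuteScript the same logic would need to be adapted to the input stream the session callback provides, rather than the `io.BytesIO` stand-in used here.

```python
import io
from itertools import islice


def read_line(stream, line_number=10, expected_prefix=None, encoding="utf-8"):
    """Return line `line_number` (1-based) from a binary stream, or None.

    Iteration stops once the requested line has been reached, so later
    lines are never iterated over.
    """
    # islice skips the first line_number - 1 lines and yields the next one.
    raw = next(islice(stream, line_number - 1, line_number), None)
    if raw is None:
        return None  # the stream has fewer than line_number lines
    line = raw.decode(encoding).rstrip("\r\n")
    # Optional sanity check, e.g. that the line starts with XYZBeginningString.
    if expected_prefix is not None and not line.startswith(expected_prefix):
        return None
    return line


# In-memory stand-in for the flowfile content: 20 lines, "line 1" .. "line 20".
sample = "\n".join(f"line {i}" for i in range(1, 21)).encode("utf-8")
print(read_line(io.BytesIO(sample)))
```

Because nothing past line 10 is iterated, this avoids the per-line split overhead of a SplitText-based flow.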

Andy LoPresto
alopresto@apache.org
alopresto.apache@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

> On Apr 11, 2017, at 9:38 AM, Sven Davison <sv...@gmail.com> wrote:
> 
> I'm looking to parse some HTML. It's not the cleanest, but I know that my content is always on line 10 of the file. I could use SplitText and then compare the result to ensure it starts with XYZBeginningString, I suppose, but I'm looking for something with less overhead, especially knowing the content is always on line 10.
> 
> Anyone have other/cleaner ideas on how to get the content of line 10?