Posted to users@nifi.apache.org by Sven Davison <sv...@gmail.com> on 2017/04/11 16:38:13 UTC
parsing html
I'm looking to parse some HTML. It's not the cleanest, but I know that my
content is always on line 10 of the file. I could use SplitText and then
compare the line to ensure it starts with XYZBeginningString, I suppose, but
I'm looking for something with less overhead, especially knowing the content
is always on line 10.
Anyone have other/cleaner ideas on how to get the content of line 10?
Re: parsing html
Posted by Jeremy Dyer <jd...@gmail.com>.
No problem, Sven. Just curious, which version do you have? If I recall
correctly, I believe the GetHTML processor was available as early as version 0.5.1.
Re: parsing html
Posted by Sven Davison <sv...@gmail.com>.
Thanks for the ideas, guys! For reasons beyond my control, I can't update to
the newest NiFi to get the GetHTML processor at this time. Maybe some day.
I'll look into ExecuteScript and SplitText more.
Re: parsing html
Posted by Jeremy Dyer <jd...@gmail.com>.
Sven,
There is also the GetHTML processor I added a while back. If the input is valid HTML, you should always be able to use a CSS selector to extract that HTML value. If you can provide a sample of the HTML, I would be glad to make a flow for you doing so as an example.
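To make the selector idea concrete, here is a rough sketch in plain Python of what selecting by an id selector such as `#content` amounts to. It is not the GetHTML processor or its configuration; it uses only the standard library's `html.parser`, and the id `content`, the sample markup, and the names `IdTextExtractor` and `extract_by_id` are all made up for illustration.

```python
from html.parser import HTMLParser


class IdTextExtractor(HTMLParser):
    """Collect the text inside the first element whose id equals target_id."""

    # Void elements never get a closing tag, so they must not affect nesting depth.
    VOID = {"area", "base", "br", "col", "embed", "hr", "img",
            "input", "link", "meta", "source", "track", "wbr"}

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # greater than 0 while inside the target element
        self.done = False   # set once the target element has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in self.VOID:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in self.VOID:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def extract_by_id(html, target_id):
    """Return the stripped text of the matching element, or None if no element matched."""
    parser = IdTextExtractor(target_id)
    parser.feed(html)
    parser.close()
    if not parser.done and not parser.parts:
        return None
    return "".join(parser.parts).strip()


# Hypothetical sample markup.
sample_html = ('<html><body><div id="nav">menu</div>'
               '<div id="content">Hello <b>world</b><br></div>'
               '<p>footer</p></body></html>')
print(extract_by_id(sample_html, "content"))
```

The depth counter assumes tags inside the target element are properly closed, which matches the "valid HTML" condition above; messy markup with unclosed tags could throw the count off.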
Jeremy
Re: parsing html
Posted by Andy LoPresto <al...@apache.org>.
Sven,
Currently I would recommend using ExecuteScript and simply streaming & slicing the content bytes at line 10 (a one-line operation in Groovy, I believe the same in Ruby and Python).
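As a minimal sketch of the streaming idea, assuming plain Python rather than an actual ExecuteScript body: the function below takes any binary file-like object, reads lines lazily, and stops as soon as it reaches the requested line. The names `read_line` and `line_matches` are hypothetical, and the wiring into a NiFi session (reading the flowfile content as a stream) is deliberately left out.

```python
import io
from itertools import islice


def read_line(stream, line_number=10, encoding="utf-8"):
    """Return line `line_number` (1-based) of a binary stream, or None if it is shorter."""
    # islice pulls lines lazily and stops right after the target line,
    # so nothing past that line is consumed.
    raw = next(islice(stream, line_number - 1, line_number), None)
    if raw is None:
        return None
    return raw.decode(encoding).rstrip("\r\n")


def line_matches(stream, prefix, line_number=10):
    """Return the target line only if it starts with `prefix`, else None."""
    line = read_line(stream, line_number)
    if line is not None and line.startswith(prefix):
        return line
    return None


# Hypothetical content: twelve lines, with the interesting one on line 10.
lines = [b"<p>filler</p>\n"] * 9 + [b"XYZBeginningString payload\n", b"</body>\n", b"</html>\n"]
sample = b"".join(lines)
print(line_matches(io.BytesIO(sample), "XYZBeginningString"))
```

Because the content sits early in the file, this touches only the first ten lines, which is the overhead argument against splitting the whole flowfile first.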
This isn’t the first time I’ve heard of a similar request though, so I think if you were to open a Jira requesting a “GetLine(s)” or “SliceText” processor, it could be valuable to the community. The current component solution would probably involve SplitText/SplitContent and, as you said, decent overhead, especially if the desired content is early in the flowfile.
Andy LoPresto
alopresto@apache.org
alopresto.apache@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4 BACE 3C6E F65B 2F7D EF69