Posted to users@nifi.apache.org by "D. Palmatier" <dp...@gmail.com> on 2022/08/31 16:33:59 UTC

XML Record Reading for Large File with Records as 3rd Level

Hello.

I'm trying to query the records within a large, ~15GB, XML file. The format
of the file is:

<xmlfeed version="1" generated="2022-08-11 13:00:00">
    <records>
        <record>
            <field1></field1>
            <field2></field2>
        </record>
        <record>
            <field1></field1>
            <field2></field2>
        </record>
    </records>
</xmlfeed>

Unfortunately the records I want to query are at the third level and the
XMLReader expects records at the second level.

I don't have any control over the format of the source file. Is there a way
I can get to these inner records for my queries without having to load the
entire file?

Thank you for your time.
David
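
For reference, outside NiFi the access pattern asked about here (reaching the third-level `<record>` elements without holding the whole file in memory) can be sketched with a streaming parser. This is a minimal Python sketch using the standard library's `xml.etree.ElementTree.iterparse`. The function name `iter_records` and the sample data are illustrative assumptions, not part of NiFi or of the real feed, and it assumes each record holds only flat, text-valued fields.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, record_tag="record"):
    """Yield one dict per <record> element while streaming through the file.

    `source` may be a filename or a binary file object. Nested fields are
    not handled; this is a sketch for flat records only.
    """
    path = []  # stack of elements that are currently open
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            path.append(elem)
            continue
        path.pop()  # the element that just ended
        if elem.tag == record_tag:
            yield {child.tag: (child.text or "") for child in elem}
            if path:
                # Detach the finished record from its parent so that
                # already-processed records do not accumulate in memory.
                path[-1].remove(elem)


# Hypothetical sample with the same shape as the feed described above.
sample = io.BytesIO(b"""<xmlfeed version="1" generated="2022-08-11 13:00:00">
    <records>
        <record><field1>a</field1><field2>b</field2></record>
        <record><field1>c</field1><field2></field2></record>
    </records>
</xmlfeed>""")

records = list(iter_records(sample))
print(records)
```

Passing a file path instead of the `BytesIO` object streams from disk in the same way; an empty field is reported as an empty string.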

Re: XML Record Reading for Large File with Records as 3rd Level

Posted by Andrew McDonald <am...@ccri.com>.
Avro supplies the `aliases` keyword.

So for the example below, the following schema works for the namespaced attributes:

{
   "type":"record",
   "name":"example",
   "fields": [
     {"name":"attr1","type":"int" },
     {"name":"attr2","type":"int" },
     {"name":"attr3","type":"int" },
     {
       "name":"third",
       "type": {"type":"record","name":"thirdType","fields": [
           {"name":"att1","type":"string","aliases": ["{urn:us:gov:ic:ism:v2}att1" ] },
           {"name":"att2","type":"string","aliases": ["{urn:us:gov:ic:ism:v2}att2" ] },
           {"name":"att3","type":"string","aliases": ["{urn:us:gov:ic:ism:v2}att3" ] }
         ] }
     }
   ]
}
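
The alias names above use the `{namespace-uri}localname` form. As an illustration of where that form comes from, here is a small Python sketch (not NiFi code) showing that the standard library's ElementTree reports namespaced attributes under keys of exactly that shape. The helper name `qualified_attribute_names` is hypothetical, and the `third` element is self-closed so that the sample parses. Whether NiFi's XMLReader resolves attribute names the same way internally is an assumption; the only evidence here is that the schema above works.

```python
import xml.etree.ElementTree as ET

# Sample document modelled on the example in this thread (values are made up).
XML = """<root xmlns:ICISM="urn:us:gov:ic:ism:v2">
  <data attr1="1" attr2="2" attr3="3">
    <third ICISM:att1="a" ICISM:att2="b" ICISM:att3="c"/>
  </data>
</root>"""


def qualified_attribute_names(xml_text, tag):
    """Return the sorted attribute names of the first <tag> element found."""
    root = ET.fromstring(xml_text)
    element = root.find(f".//{tag}")
    if element is None:
        return []
    # Namespaced attributes appear as "{uri}localname" keys.
    return sorted(element.attrib)


names = qualified_attribute_names(XML, "third")
print(names)
```

The printed names match the strings listed under `aliases` in the schema, while the un-namespaced `attr1`, `attr2`, `attr3` on `data` keep their plain names.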

On 9/19/22 17:07, Andrew McDonald wrote:
> Somehow the formatting got squished for my `third` level
>
> <root xmlns:ICISM="urn:us:gov:ic:ism:v2">
>   <data attr1="val" attr2="val" attr3="val">
>       <third ICISM:att1="cannot_get_val" ICISM:att2="cannot_get_val" ICISM:att3="cannot_get_val"/>
>   </data>
> </root>
>
> And sorry, D. Palmatier: judging by the example you provided, you wrote
> 3rd level but meant 4th level. I don't know whether 4th level is possible.
>
> Regards, Andrew

Re: XML Record Reading for Large File with Records as 3rd Level

Posted by Andrew McDonald <am...@ccri.com>.
Somehow the formatting got squished for my `third` level

<root xmlns:ICISM="urn:us:gov:ic:ism:v2">
   <data attr1="val" attr2="val" attr3="val">
       <third ICISM:att1="cannot_get_val" ICISM:att2="cannot_get_val" ICISM:att3="cannot_get_val"/>
   </data>
</root>

And sorry, D. Palmatier: judging by the example you provided, you wrote
3rd level but meant 4th level. I don't know whether 4th level is possible.
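
To make the level counting concrete, this small Python sketch (an illustration only; `tag_depths` is a hypothetical helper, not part of NiFi) computes the nesting depth of each tag in the original example, counting the root element as level 1.

```python
import xml.etree.ElementTree as ET

# The structure from the original question, reduced to a single record.
FEED = """<xmlfeed version="1" generated="2022-08-11 13:00:00">
    <records>
        <record>
            <field1></field1>
            <field2></field2>
        </record>
    </records>
</xmlfeed>"""


def tag_depths(xml_text):
    """Map each tag name to the depth at which it first appears (root = 1)."""
    depths = {}

    def walk(elem, depth):
        depths.setdefault(elem.tag, depth)
        for child in elem:
            walk(child, depth + 1)

    walk(ET.fromstring(xml_text), 1)
    return depths


depths = tag_depths(FEED)
print(depths)
```

With this counting, `record` sits at level 3 and `field1`/`field2` at level 4.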

Regards, Andrew

On 9/19/22 16:56, Andrew McDonald wrote:
> Yes, you can get the 3rd-level fields; at least with 1.12.1 I have
> been able to.
>
> The TestXMLReader uses:
>
> https://github.com/apache/nifi/blob/rel/nifi-1.12.1/nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/test/resources/xml/people.xml 
>
>
> With the schema,
>
> https://github.com/apache/nifi/blob/rel/nifi-1.12.1/nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/test/resources/xml/testschema 
>
>
> What I just ran into, minutes ago, and would like some help with is
> how to deal with namespaced attributes on the 3rd level.
>
> For my situation
>
> <root xmlns:ICISM="urn:us:gov:ic:ism:v2">
>   <data attr1="val" attr2="val" attr3="val">
>       <third ICISM:att1="cannot_get_val" ICISM:att2="cannot_get_val" ICISM:att3="cannot_get_val"/>
>   </data>
> </root>
>
>
> The namespace-decorated attributes in the `third` tag are not being
> populated. In my test XML, if I remove the namespacing from
> att{1,2,3}, then the JSON data is populated.
>
> I do see a people_namespace.xml that is used in the 
> TestXMLRecordReader but that is only for tags.
>
> I'm hoping there is a patch I could apply to 1.12.1 because we are bound
> to this version for a while.
>
> Regards, Andrew

Re: XML Record Reading for Large File with Records as 3rd Level

Posted by Andrew McDonald <am...@ccri.com>.
Yes, you can get the 3rd-level fields; at least with 1.12.1 I have been
able to.

The TestXMLReader uses:

https://github.com/apache/nifi/blob/rel/nifi-1.12.1/nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/test/resources/xml/people.xml

With the schema,

https://github.com/apache/nifi/blob/rel/nifi-1.12.1/nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/test/resources/xml/testschema

What I just ran into, minutes ago, and would like some help with is
how to deal with namespaced attributes on the 3rd level.

For my situation

<root xmlns:ICISM="urn:us:gov:ic:ism:v2" >
   <data attr1="val" attr2="val" attr3="val">
       
<thirdICISM:att1="cannot_get_val"ICISM:att2="cannot_get_val"ICISM:att3="cannot_get_val">
   </data>
</root>


The namespace-decorated attributes in the `third` tag are not being
populated. In my test XML, if I remove the namespacing from att{1,2,3},
then the JSON data is populated.

I do see a people_namespace.xml that is used in the TestXMLRecordReader 
but that is only for tags.

I'm hoping there is a patch I could apply to 1.12.1 because we are bound to
this version for a while.

Regards, Andrew


On 8/31/22 12:33, D. Palmatier wrote:
> Hello.
>
> I'm trying to query the records within a large, ~15GB, XML file. The 
> format of the file is:
>
> <xmlfeed version="1" generated="2022-08-11 13:00:00">
>     <records>
>         <record>
>             <field1></field1>
>             <field2></field2>
>         </record>
>         <record>
>             <field1></field1>
>             <field2></field2>
>         </record>
>     </records>
> </xmlfeed>
>
> Unfortunately the records I want to query are at the third level and 
> the XMLReader expects records at the second level.
>
> I don't have any control over the format of the source file. Is there 
> a way I can get to these inner records for my queries without having 
> to load the entire file?
>
> Thank you for your time.
> David