Posted to dev@nifi.apache.org by Adam Estrada <es...@gmail.com> on 2015/04/30 20:20:56 UTC

Maintain character encoding in workflow

All,

I am coming across an issue where my Unicode characters are being converted
to their Unicode code point representations (as JavaScript escapes) like this:
"\u0432\u0430\u0436\u043d\u0435\u0435". This is happening with Twitter data
that is collected using the Twitter processor. How can I debug my workflow
to figure out where the characters are being converted?

Thanks,
Adam
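
For reference, the string quoted above is the JSON (JavaScript-style)
escaped form of the same text, so a JSON parser restores the original
characters. The following is a minimal Python sketch of that round trip.
It assumes the tweets are handled as JSON somewhere in the flow; the
thread does not confirm which step, if any, performs the escaping.

```python
import json

# The escaped form quoted in the question, as it would appear inside a JSON document.
escaped_json = r'"\u0432\u0430\u0436\u043d\u0435\u0435"'

# Parsing the JSON restores the original Cyrillic characters.
decoded = json.loads(escaped_json)
print(decoded)

# Python's default serializer escapes non-ASCII characters again ...
ascii_form = json.dumps(decoded)
# ... unless it is told to keep them as-is.
utf8_form = json.dumps(decoded, ensure_ascii=False)
print(ascii_form)
print(utf8_form)
```

If a step serializes JSON in ASCII-only mode like this, the text itself is
unchanged; only its representation on disk differs.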

Re: Maintain character encoding in workflow

Posted by Joe Witt <jo...@gmail.com>.
Adam,

If you believe that a process in the flow is manipulating the
characters, you can use the built-in provenance, archive, and data
viewer functions.  We still need to document how to set this up, but
for now, if you configure nifi.properties as follows and restart,
you'll have what you need.  This all assumes you're on the latest
develop branch codebase:

Set the following properties to the following values (these are just examples):

nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
nifi.content.claim.max.appendable.size=10 MB
nifi.content.claim.max.flow.files=100
nifi.content.repository.directory.pub1=/your/path/for/content
nifi.content.repository.archive.max.retention.period=3 hours
nifi.content.repository.archive.max.usage.percentage=30%
nifi.content.repository.archive.enabled=true
nifi.content.repository.always.sync=false
nifi.content.viewer.url=/nifi-content-viewer/

nifi.provenance.repository.directory.prov1=/your/path/for/prov
nifi.provenance.repository.max.storage.time=24 hours
nifi.provenance.repository.max.storage.size=1 GB
nifi.provenance.repository.rollover.time=30 secs
nifi.provenance.repository.rollover.size=100 MB
nifi.provenance.repository.query.threads=6
nifi.provenance.repository.compress.on.rollover=true
nifi.provenance.repository.always.sync=false
nifi.provenance.repository.journal.count=16
# Comma-separated list of fields. Fields that are not indexed will not be searchable. Valid fields are:
# EventType, FlowFileUUID, Filename, TransitURI, ProcessorID, AlternateIdentifierURI, ContentType, Relationship, Details
nifi.provenance.repository.indexed.fields=EventType, FlowFileUUID, Filename, ProcessorID
# FlowFile Attributes that should be indexed and made searchable
nifi.provenance.repository.indexed.attributes=
# Large values for the shard size will result in more Java heap usage when searching the Provenance Repository
# but should provide better performance
nifi.provenance.repository.index.shard.size=500 MB

Basically, the settings that differ from the defaults here are:
nifi.content.viewer.url=/nifi-content-viewer/
nifi.content.repository.archive.max.retention.period=3 hours
nifi.content.repository.archive.max.usage.percentage=30%
nifi.content.repository.archive.enabled=true
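
One way to double-check an edited nifi.properties before restarting is to
compare it against the four values listed above. The Python sketch below
does that with a deliberately simplified key=value parser; the function
names (parse_properties, find_mismatches) and the sample file contents are
hypothetical and not part of NiFi, and the parser does not handle the full
Java properties syntax (line continuations, ':' separators, '!' comments).

```python
# Values Joe lists as differing from the defaults (taken from the message above).
EXPECTED = {
    "nifi.content.viewer.url": "/nifi-content-viewer/",
    "nifi.content.repository.archive.max.retention.period": "3 hours",
    "nifi.content.repository.archive.max.usage.percentage": "30%",
    "nifi.content.repository.archive.enabled": "true",
}


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def find_mismatches(text, expected=EXPECTED):
    """Return {key: (actual, wanted)} for every expected property that differs."""
    props = parse_properties(text)
    return {
        key: (props.get(key), wanted)
        for key, wanted in expected.items()
        if props.get(key) != wanted
    }


# Hypothetical file contents for illustration: archiving is still disabled.
sample = """
# content repository
nifi.content.viewer.url=/nifi-content-viewer/
nifi.content.repository.archive.max.retention.period=3 hours
nifi.content.repository.archive.max.usage.percentage=30%
nifi.content.repository.archive.enabled=false
"""
mismatches = find_mismatches(sample)
print(mismatches)
```

An empty result would mean all four settings match the values in this message.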


What this does is tell NiFi to hang onto the content until it
actually has to delete it from disk.  It then allows you to look at
the provenance trail of any object, and from there you can 'view content'
in our built-in content viewer.  You can use that to review the content
step by step as it goes through the flow.

We must make a nice blog out of this with screenshots.  It is a really
powerful feature.

If that doesn't get you the info you need please let us know.

Thanks
Joe

Re: Maintain character encoding in workflow

Posted by Mark Payne <ma...@hotmail.com>.
Adam,

Joe got his answer out before I sent this, I realize :) I'll try to go
into a bit more detail on some of the things here, in case it's helpful.


The easiest thing to do would be to make the following changes in 
nifi.properties:

nifi.provenance.repository.rollover.time=30 secs
nifi.provenance.repository.indexed.fields=EventType, FlowFileUUID, Filename, ProcessorID
nifi.content.viewer.url=/nifi-content-viewer/

nifi.content.repository.archive.max.retention.period=24 hours
nifi.content.repository.archive.max.usage.percentage=80%


The first property says to index all provenance events after 30 seconds
instead of waiting 5 minutes (the default).
The second property says to index those specific fields for all
provenance events.
The third property enables the Provenance Data content viewer.
The other two properties indicate that content should be kept on the
box for up to 24 hours, but deleted if the disk is 80% full.

After changing those, you'd need to restart your system.


So I'm suggesting that you do that so that you can make use of NiFi's 
data provenance to debug workflows. It's a super powerful feature.

Then, you can click on the Data Provenance icon in the UI (4th icon in
the toolbar, on the far right-hand side). Then click "Search". You can
search by filename or whatever. If you just want to find data coming
from the Twitter processor, you can enter its ID for the "Component ID"
(to get the ID of that processor, right-click on it and choose
Configure; it's in the Settings tab).

Then when you search you can see up to 1000 results. Click the little 
icon on the right-hand side that looks a bit like a propeller (it's 
actually intended to show a graph/tree). From there you can see what 
happened to the data as it went through your flow. For any of those 
events, you can right-click and "View Details". This will show you all 
sorts of info about the event. In the Content tab, you can click "View" 
to see what the content looked like at that point in time. You can then 
go back to the lineage view and look at the next or previous event and 
do the same thing until you know exactly where it changed.
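
If you copy the content out of the viewer at each event, the comparison
described above can also be scripted. The Python sketch below scans the
snapshots in lineage order and reports the first one that contains literal
\uXXXX escapes. The function name (first_escaped_step), the step names, and
the sample contents are hypothetical; it also assumes the content is UTF-8
text that was saved by hand, since it does not talk to NiFi at all.

```python
import re

# Matches a literal backslash-u escape such as \u0432 in the content text.
ESCAPE = re.compile(r"\\u[0-9a-fA-F]{4}")


def first_escaped_step(snapshots):
    """Return the name of the first step whose content contains \\uXXXX escapes.

    snapshots: list of (step_name, content_bytes) in lineage order.
    Returns None if no snapshot contains an escape.
    """
    for step_name, content in snapshots:
        text = content.decode("utf-8", errors="replace")
        if ESCAPE.search(text):
            return step_name
    return None


# Hypothetical snapshots copied by hand from the content viewer.
snapshots = [
    ("twitter-processor", '{"text": "важнее"}'.encode("utf-8")),
    ("step-a", '{"text": "важнее"}'.encode("utf-8")),
    ("step-b", rb'{"text": "\u0432\u0430\u0436\u043d\u0435\u0435"}'),
]
culprit = first_escaped_step(snapshots)
print(culprit)
```

If the very first snapshot already contains escapes, the data arrived that
way and no later step in the flow converted it.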

Hope this helps!

Let us know if you have any further questions.

Thanks
-Mark
