Posted to dev@nifi.apache.org by joemeszaros <gi...@git.apache.org> on 2015/09/23 14:35:33 UTC

[GitHub] nifi pull request: NIFI-988: PutDistributedMapCache processor

GitHub user joemeszaros opened a pull request:

    https://github.com/apache/nifi/pull/92

    NIFI-988: PutDistributedMapCache processor

    There is a standard controller service, called DistributedMapCacheServer, which provides a distributed cache, and an associated DistributedMapCacheClientService to interact with the cache. However, there is no standard processor that puts data into the cache and helps the user leverage the distributed cache capabilities.
    
    The purpose of PutDistributedMapCache is very similar to that of the egress processors: it gets the content of a FlowFile and puts it into a distributed map cache, using a cache key computed from FlowFile attributes. If the cache already contains the entry and the cache update strategy is 'keep original', the entry is not replaced.
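
A minimal sketch of the 'keep original' behaviour described above, using a plain in-memory map in place of the real distributed cache. The class, enum, and method names are hypothetical and are not taken from the pull request; only the idea that an existing entry is left untouched under 'keep original' comes from the description.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; this is not the processor's code.
class CacheUpdateStrategySketch {

    // Assumed names for the two update strategies.
    enum UpdateStrategy { REPLACE, KEEP_ORIGINAL }

    // Stand-in for the distributed map cache: key -> FlowFile content bytes.
    private final Map<String, byte[]> cache = new HashMap<>();

    /** Returns true if the value was written, false if an existing entry was kept. */
    boolean put(String key, byte[] value, UpdateStrategy strategy) {
        if (strategy == UpdateStrategy.KEEP_ORIGINAL) {
            // putIfAbsent returns null only when there was no previous mapping.
            return cache.putIfAbsent(key, value) == null;
        }
        cache.put(key, value);
        return true;
    }

    /** Returns the cached value as a UTF-8 string, or null if the key is absent. */
    String get(String key) {
        byte[] value = cache.get(key);
        return value == null ? null : new String(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        CacheUpdateStrategySketch sketch = new CacheUpdateStrategySketch();
        byte[] first = "first".getBytes(StandardCharsets.UTF_8);
        byte[] second = "second".getBytes(StandardCharsets.UTF_8);
        System.out.println(sketch.put("item.y", first, UpdateStrategy.KEEP_ORIGINAL));
        System.out.println(sketch.put("item.y", second, UpdateStrategy.KEEP_ORIGINAL));
        System.out.println(sketch.get("item.y"));
    }
}
```

With 'keep original', the second write to the same key is rejected and the first value stays in the map; with the replace strategy the second value would win.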


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ImpressTV/nifi NIFI-988

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nifi/pull/92.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #92
    
----
commit 6b1328f3f181a27a5856d26983ed3329ee317522
Author: Joe <jo...@impresstv.com>
Date:   2015-09-23T11:16:02Z

    NIFI-988: PutDisributedMapCache processor implementation

commit ee7d89cb01d4661cfff2c4f0d093e38758680a56
Author: Joe <jo...@impresstv.com>
Date:   2015-09-23T12:32:37Z

    NIFI-988: Test cases for PutDistributedMapCache

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] nifi pull request: NIFI-988: PutDistributedMapCache processor

Posted by joemeszaros <gi...@git.apache.org>.
Github user joemeszaros commented on the pull request:

    https://github.com/apache/nifi/pull/92#issuecomment-142654373
  
    I have several tracking event files, containing user interactions, e.g. user.x liked item.y in the following format:
    
    |UserId  | Action | ItemId |
    | ------------- | ------------- | ------------- |
    | user.x | like  | item.y |
    | user.xx | like  | item.z |
    |...||
    
    I need to enrich these event files e.g. with the title of the associated item from a separate item file, containing the item metadata:
    
    |ItemId  | Title |
    | ------------- | ------------- |
    | item.y | Title for item.y  |
    | item.z | Title for item.z  |
    |...||
    
    and the enriched event file should look like this:
    
    |UserId  | Action | ItemId | Title |
    | ------------- | ------------- | ------------- | ------------- |
    | user.x | like  | item.y | Title for item.y|
    | user.xx | like  | item.z | Title for item.z|
    
    My idea was to cache the item file in a distributed cache, because caching is typical controller service functionality, and to use the same cache to enrich the event files one by one, looking up the title by ItemId. That way I need to read the item file only once. I created a workflow that grabs the item file, creates a flow file for each item (each line) with the ItemId added as a custom flow file attribute, and puts those flow files into the distributed cache using the PutDistributedMapCache processor. The cache key is the custom ItemId attribute, and the metadata is the cache value. During the event file enrichment I use this item catalogue cache to look up an ItemId and get, e.g., the title.
    
    (My workflow is not so simple, because I use JSON conversion, and additional processors as well)
    
    DetectDuplicate was not an appropriate processor for me because, as its name suggests, it is used for duplicate detection and caches a custom flow file attribute, not the flow file content.
    
    I hope I was able to highlight my rationale behind this new processor :-)
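
The enrichment flow described in this comment can be sketched roughly as follows. This is only an illustration under stated assumptions: a plain `HashMap` stands in for the distributed cache, events are assumed to be pipe-delimited `UserId|Action|ItemId` lines, and the class and method names are hypothetical rather than part of NiFi or the pull request.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of "load the item file once, then enrich events by ItemId".
class EventEnrichmentSketch {

    // Stand-in for the distributed map cache: ItemId -> item metadata (here, the title).
    private final Map<String, byte[]> itemCache = new HashMap<>();

    /** Corresponds to putting one item record into the cache, keyed by its ItemId. */
    void cacheItem(String itemId, String title) {
        itemCache.put(itemId, title.getBytes(StandardCharsets.UTF_8));
    }

    /** Appends the cached title to an event line; the title is empty if the item is unknown. */
    String enrich(String eventLine) {
        String[] fields = eventLine.split("\\|");
        if (fields.length < 3) {
            throw new IllegalArgumentException("Expected UserId|Action|ItemId: " + eventLine);
        }
        byte[] cached = itemCache.get(fields[2].trim());
        String title = cached == null ? "" : new String(cached, StandardCharsets.UTF_8);
        return eventLine + "|" + title;
    }

    public static void main(String[] args) {
        EventEnrichmentSketch sketch = new EventEnrichmentSketch();
        sketch.cacheItem("item.y", "Title for item.y");
        sketch.cacheItem("item.z", "Title for item.z");
        System.out.println(sketch.enrich("user.x|like|item.y"));
        System.out.println(sketch.enrich("user.xx|like|item.z"));
    }
}
```

The item metadata is written once and read many times, which is why the item file only has to be parsed a single time regardless of how many event files are enriched.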




[GitHub] nifi pull request: NIFI-988: PutDistributedMapCache processor

Posted by joemeszaros <gi...@git.apache.org>.
Github user joemeszaros commented on the pull request:

    https://github.com/apache/nifi/pull/92#issuecomment-150028829
  
    Thanks for merging!



[GitHub] nifi pull request: NIFI-988: PutDistributedMapCache processor

Posted by markap14 <gi...@git.apache.org>.
Github user markap14 commented on the pull request:

    https://github.com/apache/nifi/pull/92#issuecomment-144783590
  
    Sorry, you are right - I posted to the wrong PR. Will re-post this comment on the other one for clarity. Thanks!



[GitHub] nifi pull request: NIFI-988: PutDistributedMapCache processor

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/nifi/pull/92



[GitHub] nifi pull request: NIFI-988: PutDistributedMapCache processor

Posted by markap14 <gi...@git.apache.org>.
Github user markap14 commented on the pull request:

    https://github.com/apache/nifi/pull/92#issuecomment-149919219
  
    @joemeszaros I very much appreciate the contribution back to the NiFi community. Very sorry it took so long. I have now merged it into master.
    
    Thanks again!
    -Mark



[GitHub] nifi pull request: NIFI-988: PutDistributedMapCache processor

Posted by joemeszaros <gi...@git.apache.org>.
Github user joemeszaros commented on the pull request:

    https://github.com/apache/nifi/pull/92#issuecomment-149458834
  
    @markap14 Do you have any concerns about this new processor? The PR has been open for a while, and I would be grateful if we could finish it by either merging or closing it.



[GitHub] nifi pull request: NIFI-988: PutDistributedMapCache processor

Posted by markap14 <gi...@git.apache.org>.
Github user markap14 commented on the pull request:

    https://github.com/apache/nifi/pull/92#issuecomment-144378199
  
    @joemeszaros the concern that I have with the notion of the ExtendedDistributedMapCacheClient is that once that is released, it will have the same caveats as the DistributedMapCacheClient - others can extend it, so we cannot change the interface in a non-backward-compatible way once we release it.
    
    I think if there are specific methods that we think will be added, then we need to implement those before we release the service. Otherwise, in order to add new methods to the ExtendedDistributedMapCacheClient we would need to create yet another interface that extends that one. This is all done because we consider the Standard Services API to be "public interfaces" so that once they are out there, we have no idea who has implemented them. As a result, if we change them, even if we update all of the code in Apache NiFi, we may be breaking someone else's "private" implementation.
    
    In theory, though, this will become a lot less painful once we move to 1.0.0 because I believe the intent is to move to Java 8, which means that we can include default implementations in the interfaces. As a result, we could potentially add new methods to interfaces as long as they can be implemented using existing methods.
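
To illustrate the Java 8 point, here is a small sketch of how a default method lets an interface grow without breaking existing implementations. The interface and class below are hypothetical and deliberately simplified; they are not the actual DistributedMapCacheClient API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified cache interface (not NiFi's real interface).
interface SimpleCacheClient {
    String get(String key);

    void put(String key, String value);

    // Imagine this method was added in a later release. Because it has a default
    // body built only from existing methods, older implementations still compile.
    default String getOrFallback(String key, String fallback) {
        String value = get(key);
        return value != null ? value : fallback;
    }
}

// A "private" implementation written against the original two-method interface.
class InMemoryCacheClient implements SimpleCacheClient {
    private final Map<String, String> entries = new HashMap<>();

    @Override
    public String get(String key) {
        return entries.get(key);
    }

    @Override
    public void put(String key, String value) {
        entries.put(key, value);
    }
}

class DefaultMethodSketch {
    public static void main(String[] args) {
        SimpleCacheClient client = new InMemoryCacheClient();
        client.put("item.y", "Title for item.y");
        System.out.println(client.getOrFallback("item.y", "unknown"));
        System.out.println(client.getOrFallback("item.q", "unknown"));
    }
}
```

`InMemoryCacheClient` never mentions `getOrFallback`, yet it inherits a working version, which is the backward-compatibility property the comment is relying on.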



[GitHub] nifi pull request: NIFI-988: PutDistributedMapCache processor

Posted by markap14 <gi...@git.apache.org>.
Github user markap14 commented on the pull request:

    https://github.com/apache/nifi/pull/92#issuecomment-142647457
  
    @joemeszaros Can you explain the use case here a little bit more? The DistributedMapCache services were originally developed for use in the DetectDuplicate processor. They have since found a couple of other uses. I'm trying to envision how the PutDistributedMapCache processor might be used.
    
    Thanks
    -Mark



[GitHub] nifi pull request: NIFI-988: PutDistributedMapCache processor

Posted by joemeszaros <gi...@git.apache.org>.
Github user joemeszaros commented on the pull request:

    https://github.com/apache/nifi/pull/92#issuecomment-144332289
  
    @markap14 Did you get a proper answer to your cache-related questions? If you do not have any concerns about this new processor, it would be reasonable to implement the PutDistributedSetCache processor as well. It should be very similar to this new processor, but I do not want to start the implementation until this PR is closed.



[GitHub] nifi pull request: NIFI-988: PutDistributedMapCache processor

Posted by markap14 <gi...@git.apache.org>.
Github user markap14 commented on the pull request:

    https://github.com/apache/nifi/pull/92#issuecomment-142689586
  
    OK, so to make sure that I understand the use case: the idea is just to use this processor to put data into this cache, so that other custom processors can access the data in the cache using the distributed cache controller service, correct?
    
    The only concern that I might have here is the amount of Java heap that would be used up by this cache, holding the contents of FlowFiles in the cache.



[GitHub] nifi pull request: NIFI-988: PutDistributedMapCache processor

Posted by joemeszaros <gi...@git.apache.org>.
Github user joemeszaros commented on the pull request:

    https://github.com/apache/nifi/pull/92#issuecomment-143226518
  
    I started to implement nifi-tools, including a very simple map cache client. You can find the project on my github page [here](https://github.com/joemeszaros/nifi-tools)



[GitHub] nifi pull request: NIFI-988: PutDistributedMapCache processor

Posted by joemeszaros <gi...@git.apache.org>.
Github user joemeszaros commented on the pull request:

    https://github.com/apache/nifi/pull/92#issuecomment-142863157
  
    Yes, you are correct. 
    
    The PutDistributedMapCache processor relies on the existing DistributedMapCacheServer, which internally uses the org.apache.nifi.distributed.cache.server.map.SimpleMapCache class. If you are interested in the implementation details, please take a look at this class. It stores cache entries in a HashMap<ByteBuffer, MapCacheRecord>.
    
    You can control the size of the cache with two factors:
    - The max cache size option in the cache server, which controls the maximum number of cache entries that the cache can hold (default value is 10000)
    - The max cache entry size in the PutDistributedMapCache processor, controlling the maximum amount of data to put into cache (default value is 1 MB)
    
    I hope this answers your question.
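
As a rough worked example of how the two limits above bound heap usage, the sketch below multiplies the default maximum number of entries by the default maximum entry size. It assumes that 1 MB means 1024 * 1024 bytes and that every entry is as large as allowed; it ignores keys and per-object overhead, so it is only an upper-bound estimate for the cached values, not a measurement. The class and method names are hypothetical.

```java
// Hypothetical back-of-the-envelope estimate; not part of NiFi.
class CacheHeapEstimateSketch {

    /** Upper bound on bytes held by cached values if every entry is at the size limit. */
    static long worstCaseValueBytes(long maxEntries, long maxEntrySizeBytes) {
        // multiplyExact throws ArithmeticException instead of silently overflowing.
        return Math.multiplyExact(maxEntries, maxEntrySizeBytes);
    }

    public static void main(String[] args) {
        long maxEntries = 10_000L;          // default max cache size from the comment
        long maxEntrySize = 1024L * 1024L;  // default max entry size, assuming 1 MB = 2^20 bytes
        long bytes = worstCaseValueBytes(maxEntries, maxEntrySize);
        System.out.println(bytes + " bytes in the worst case");
    }
}
```

With the defaults, the bound is 10,485,760,000 bytes, or roughly 10 GB, so lowering either limit is the lever for keeping the cache within the available heap.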




[GitHub] nifi pull request: NIFI-988: PutDistributedMapCache processor

Posted by joemeszaros <gi...@git.apache.org>.
Github user joemeszaros commented on the pull request:

    https://github.com/apache/nifi/pull/92#issuecomment-144703463
  
    I think your answer relates to another pull request, #94 (NIFI-989). Let me share my opinion there rather than in this PutDistributedMapCache PR.

