Posted to dev@datafu.apache.org by Russell Jurney <ru...@gmail.com> on 2014/08/05 18:14:20 UTC

Re: Review Request 24309: Review of DATAFU-65 - Add AhoCorasickMatch UDF

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/24309/
-----------------------------------------------------------

(Updated Aug. 5, 2014, 4:14 p.m.)


Review request for DataFu, Jakob Homan, Matthew Hayes, and Sam Shah.


Repository: datafu


Description
-------

See DATAFU-65


Diffs
-----

  datafu-pig/build.gradle e21a5b1 
  datafu-pig/src/main/java/datafu/pig/text/AhoCorasickMatch.java PRE-CREATION 
  datafu-pig/src/test/java/datafu/test/pig/text/AhoCorasickMatchTest.java PRE-CREATION 
  gradle/dependency-versions.gradle eb24e4a 

Diff: https://reviews.apache.org/r/24309/diff/


Testing
-------


Thanks,

Russell Jurney


Re: Review Request 24309: Review of DATAFU-65 - Add AhoCorasickMatch UDF

Posted by Russell Jurney <ru...@gmail.com>.

> On Aug. 5, 2014, 9:14 p.m., Matthew Hayes wrote:
> > How do you expect this to be used in practice?  Would one large dictionary be applied to a large collection of strings to identify the matches within each string?  Or, do you expect a different dictionary to be applied to each string?  If you expect the same dictionary to be used, then it seems we miss out on the potential with this implementation to build the trie once and reuse it over and over.  Should the dictionary instead be loaded from HDFS via the distributed cache and lazy loaded on the first call to exec()?  This way you only build the trie once.

You make a good point. My plan was to GROUP a relation of match words ALL, then CROSS it with the text to be matched, so the same words would be matched against a large number of strings. Compared to your suggestion, my plan is dumb. I think I will do what you suggest.


- Russell


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/24309/#review49639
-----------------------------------------------------------




Re: Review Request 24309: Review of DATAFU-65 - Add AhoCorasickMatch UDF

Posted by Matthew Hayes <ma...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/24309/#review49639
-----------------------------------------------------------


How do you expect this to be used in practice?  Would one large dictionary be applied to a large collection of strings to identify the matches within each string?  Or, do you expect a different dictionary to be applied to each string?  If you expect the same dictionary to be used, then it seems we miss out on the potential with this implementation to build the trie once and reuse it over and over.  Should the dictionary instead be loaded from HDFS via the distributed cache and lazy loaded on the first call to exec()?  This way you only build the trie once.

- Matthew Hayes
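
The build-once idea suggested above can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not the code from the DATAFU-65 patch: the class name LazyDictionaryMatcher, the Supplier-based loader (standing in for reading a dictionary file shipped through the distributed cache), and the buildCount field are all hypothetical, and a plain word list with String.contains() stands in for the Aho-Corasick trie so that the example has no Pig or Hadoop dependencies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch: none of these names come from the DATAFU-65 patch.
class LazyDictionaryMatcher {
    // Assumption: stands in for reading the dictionary file from the distributed cache.
    private final Supplier<List<String>> dictionaryLoader;
    // Assumption: stands in for the Aho-Corasick trie; stays null until the first exec() call.
    private List<String> dictionary;
    // Counts how many times the dictionary structure has been built.
    private int buildCount = 0;

    LazyDictionaryMatcher(Supplier<List<String>> dictionaryLoader) {
        this.dictionaryLoader = dictionaryLoader;
    }

    // Returns the dictionary words that occur in the text, in dictionary order.
    List<String> exec(String text) {
        if (dictionary == null) {
            // Lazy initialization: the expensive build happens once, on the first call.
            dictionary = new ArrayList<>(dictionaryLoader.get());
            buildCount++;
        }
        List<String> matches = new ArrayList<>();
        for (String word : dictionary) {
            // Naive substring search; a real trie would scan the text in a single pass.
            if (text.contains(word)) {
                matches.add(word);
            }
        }
        return matches;
    }

    int getBuildCount() {
        return buildCount;
    }

    public static void main(String[] args) {
        LazyDictionaryMatcher matcher =
            new LazyDictionaryMatcher(() -> Arrays.asList("he", "she", "hers"));
        System.out.println(matcher.exec("ushers"));
        System.out.println(matcher.exec("hello"));
        // The dictionary was built only once across both calls.
        System.out.println(matcher.getBuildCount());
    }
}
```

In contrast to passing the dictionary in with every tuple (the GROUP ALL plus CROSS approach), the per-call cost here is only the search itself, because the dictionary structure is reused across calls on the same instance.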

