Posted to dev@datafu.apache.org by Russell Jurney <ru...@gmail.com> on 2014/08/05 18:14:20 UTC
Re: Review Request 24309: Review of DATAFU-65 - Add AhoCorasickMatch UDF
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/24309/
-----------------------------------------------------------
(Updated Aug. 5, 2014, 4:14 p.m.)
Review request for DataFu, Jakob Homan, Matthew Hayes, and Sam Shah.
Repository: datafu
Description
-------
See DATAFU-65
Diffs
-----
datafu-pig/build.gradle e21a5b1
datafu-pig/src/main/java/datafu/pig/text/AhoCorasickMatch.java PRE-CREATION
datafu-pig/src/test/java/datafu/test/pig/text/AhoCorasickMatchTest.java PRE-CREATION
gradle/dependency-versions.gradle eb24e4a
Diff: https://reviews.apache.org/r/24309/diff/
Testing
-------
Thanks,
Russell Jurney
Re: Review Request 24309: Review of DATAFU-65 - Add AhoCorasickMatch UDF
Posted by Russell Jurney <ru...@gmail.com>.
> On Aug. 5, 2014, 9:14 p.m., Matthew Hayes wrote:
> > How do you expect this to be used in practice? Would one large dictionary be applied to a large collection of strings to identify the matches within each string? Or, do you expect a different dictionary to be applied to each string? If you expect the same dictionary to be used, then it seems we miss out on the potential with this implementation to build the trie once and reuse it over and over. Should the dictionary instead be loaded from HDFS via the distributed cache and lazy loaded on the first call to exec()? This way you only build the trie once.
You make a good point. My plan was to GROUP the relation of match words ALL, then CROSS the result with the text to be matched, so the same words would be matched against a large number of strings. Compared to your suggestion, my plan is dumb. I think I will do what you suggest.
- Russell
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/24309/#review49639
-----------------------------------------------------------
Re: Review Request 24309: Review of DATAFU-65 - Add AhoCorasickMatch UDF
Posted by Matthew Hayes <ma...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/24309/#review49639
-----------------------------------------------------------
How do you expect this to be used in practice? Would one large dictionary be applied to a large collection of strings to identify the matches within each string? Or, do you expect a different dictionary to be applied to each string? If you expect the same dictionary to be used, then it seems we miss out on the potential with this implementation to build the trie once and reuse it over and over. Should the dictionary instead be loaded from HDFS via the distributed cache and lazy loaded on the first call to exec()? This way you only build the trie once.
- Matthew Hayes
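A minimal sketch of the lazy-loading idea raised in the review comment above, not the code from the patch. It assumes the dictionary is a local text file with one word per line (for example, a file that Pig has shipped to the task through the distributed cache; Pig's EvalFunc has a getCacheFiles() hook for that, which this sketch leaves out so that it runs without Pig on the classpath). The class name LazyDictionaryMatcher, the buildCount counter, and the file format are all hypothetical. For brevity the matcher is a plain trie scanned from every start position, without the failure links that a real Aho-Corasick automaton adds; the point here is only that the trie is built once, on the first call to exec(), and reused afterwards.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the UDF; a real one would extend Pig's EvalFunc.
class LazyDictionaryMatcher {
    // One trie node: children keyed by character, plus the word ending here, if any.
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        String word;
    }

    private final String dictionaryPath; // assumed: local path to a one-word-per-line file
    private Node root;                   // stays null until the first exec() call
    private int buildCount = 0;          // only here to show the trie is built once

    LazyDictionaryMatcher(String dictionaryPath) {
        this.dictionaryPath = dictionaryPath;
    }

    // Reads the dictionary file and builds the trie.
    private Node buildTrie() throws IOException {
        Node newRoot = new Node();
        try (BufferedReader in = Files.newBufferedReader(
                Paths.get(dictionaryPath), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                String word = line.trim();
                if (word.isEmpty()) {
                    continue; // skip blank lines
                }
                Node node = newRoot;
                for (char c : word.toCharArray()) {
                    node = node.children.computeIfAbsent(c, k -> new Node());
                }
                node.word = word;
            }
        }
        buildCount++;
        return newRoot;
    }

    // Returns every dictionary word found in text, ordered by start position.
    List<String> exec(String text) throws IOException {
        if (root == null) {
            root = buildTrie(); // lazy load: happens on the first call only
        }
        List<String> matches = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            for (int i = start; i < text.length(); i++) {
                node = node.children.get(text.charAt(i));
                if (node == null) {
                    break; // no dictionary word continues with this character
                }
                if (node.word != null) {
                    matches.add(node.word);
                }
            }
        }
        return matches;
    }

    int getBuildCount() {
        return buildCount;
    }

    public static void main(String[] args) throws IOException {
        // Write a small dictionary to a temporary file for the demonstration.
        Path dict = Files.createTempFile("dictionary", ".txt");
        Files.write(dict, Arrays.asList("he", "she", "hers"));

        LazyDictionaryMatcher matcher = new LazyDictionaryMatcher(dict.toString());
        System.out.println(matcher.exec("ushers"));   // prints [she, he, hers]
        System.out.println(matcher.exec("the shed")); // reuses the same trie
        System.out.println(matcher.getBuildCount());  // prints 1
    }
}
```

With this shape, the per-record cost is only the scan of the input string; the cost of reading the dictionary and building the trie is paid once per task instead of once per record.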