
Posted to dev@tika.apache.org by Chris Mattmann <ch...@gmail.com> on 2015/04/22 04:48:24 UTC

Re: [memex-jpl] this week action from luke

Thanks Luke.

So I guess all I was asking was could you try it out. Thanks for the
lesson in the RFC.

Cheers,
Chris

------------------------
Chris Mattmann
chris.mattmann@gmail.com




-----Original Message-----
From: Luke <ha...@gmail.com>
Date: Wednesday, April 22, 2015 at 1:46 AM
To: Chris Mattmann <Ch...@jpl.nasa.gov>, Chris Mattmann
<ch...@gmail.com>, "'Totaro, Giuseppe U (3980-Affiliate)'"
<to...@di.uniroma1.it>, <de...@tika.apache.org>
Cc: "'Bryant, Ann C (398G-Affiliate)'" <an...@gmail.com>, "'Zimdars,
Paul A (3980-Affiliate)'" <Pa...@jpl.nasa.gov>, NSF Polar
CyberInfrastructure DR Students <ns...@googlegroups.com>,
<me...@googlegroups.com>
Subject: RE: [memex-jpl] this week action from luke

>Hi professor,
>
>
>I think it depends heavily on the content Tika reads. If the file
>contains a byte sequence that matches the magic pattern of one or more
>mime types defined in tika-mimetypes.xml, I guess that Tika puts those
>types in its estimation list. Please note that magic-byte detection can
>yield multiple estimated mime types. Tika also considers the decision
>made by extension detection: if the type suggested by the extension is
>the first one in the magic estimation list, then that first one will
>certainly be returned (the same applies to the metadata hint approach).
>Tika also prefers the most specialized type.
>
>Let's get back to the following question; here is my guess.
>[Prof]: Also what happens if you tweak the definition of XHTML to not
>scan until 8192, but say 6000 (e.g., 0:6000), does CBOR take over then?
>Consider an extreme case where we only scan 10 bytes or 1 byte. Magic
>bytes will then inevitably detect nothing, and I think detection will
>return something like "application/octet-stream", the most general
>type. As mentioned, Tika favours the most specialized type, and I
>believe almost every type is a subclass of "application/octet-stream".
>So if the extension approach returns a more specialized type, the
>answer in this extreme case may be yes: I think it is very possible that
>the CBOR type detected by the extension approach takes over...
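The selection behaviour guessed at above can be written down as a small model. This is only a sketch of the description in this message, not Tika's actual code: the class name, the `select` method and the list-based inputs are assumptions made for illustration, and the "most specialized type" rule is left out.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model of the guessed selection logic; NOT Tika's implementation.
final class DetectionModel {
    static final String OCTET_STREAM = "application/octet-stream";

    // magicCandidates: types proposed by magic-byte matching, in list order.
    // extensionGuess: type proposed by the file extension, or null if none.
    static String select(List<String> magicCandidates, String extensionGuess) {
        if (magicCandidates.isEmpty()) {
            // Nothing matched by magic: the extension guess takes over,
            // otherwise fall back to the most general type.
            return extensionGuess != null ? extensionGuess : OCTET_STREAM;
        }
        if (extensionGuess != null && magicCandidates.contains(extensionGuess)) {
            // The extension agrees with one magic candidate: it breaks the tie.
            return extensionGuess;
        }
        // Otherwise the first magic candidate wins over the extension.
        return magicCandidates.get(0);
    }

    public static void main(String[] args) {
        String cbor = "application/cbor";
        String xhtml = "application/xhtml+xml";
        System.out.println(select(Collections.<String>emptyList(), cbor));
        System.out.println(select(Arrays.asList(xhtml), cbor));
        System.out.println(select(Arrays.asList(xhtml, cbor), cbor));
    }
}
```

Under this model, a *.cbor file whose content only matches the XHTML magic still comes back as XHTML, while an empty magic list lets the extension decide.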
>
>My idea was and still is that if the CBOR self-describing tag 55799 is
>present in the CBOR file, it can be used to detect the CBOR type.
>Again, the CBOR type will probably be appended to the magic estimation
>list together with another one such as application/xhtml+xml. I guess
>the order in the list probably also matters: the first one is preferred
>over the next one. The decision from the extension detection approach
>also plays a role in breaking the tie,
>e.g. if the extension detection method agrees on CBOR with one of the
>estimated types in the magic list, then CBOR will be returned (again,
>the same applies to the metadata hint method).
>
>I have not taken a closer look at a CBOR file that has the tag 55799, but
>I expect its hex to be something like 0xd9d9f7, i.e. the tag should be
>present in the header as a fixed sequence of bytes
>(https://tools.ietf.org/html/rfc7049#section-2.4.5). If this is
>present in the file, preferably in the header (within a reasonable
>range of bytes), I believe it can probably be used as the magic number
>for the CBOR type.
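A check for the tag at the very start of a file could look like the sketch below. It relies only on the three-byte serialization 0xd9d9f7 given in RFC 7049, section 2.4.5; the class and method names are made up for illustration and are not part of Tika or Nutch.

```java
// Hypothetical helper; not part of Tika or Nutch.
final class CborTagSniffer {
    // Serialization of the self-describe CBOR tag 55799 (RFC 7049, 2.4.5).
    private static final byte[] SELF_DESCRIBE_TAG = {
        (byte) 0xD9, (byte) 0xD9, (byte) 0xF7
    };

    // Returns true if the data begins with the three tag bytes.
    static boolean startsWithSelfDescribeTag(byte[] data) {
        if (data == null || data.length < SELF_DESCRIBE_TAG.length) {
            return false;
        }
        for (int i = 0; i < SELF_DESCRIBE_TAG.length; i++) {
            if (data[i] != SELF_DESCRIBE_TAG[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Tag followed by the one-character CBOR text string "a" (0x61 0x61).
        byte[] tagged = {(byte) 0xD9, (byte) 0xD9, (byte) 0xF7, 0x61, 0x61};
        // Plain JSON text never starts with these bytes.
        byte[] json = {'{', '"', 'a', '"', ':', '1', '}'};
        System.out.println(startsWithSelfDescribeTag(tagged));
        System.out.println(startsWithSelfDescribeTag(json));
    }
}
```

This only helps for files whose encoder actually wrote the tag; a file that starts directly with a data item would still need another signal, such as the extension.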
>
>
>There is another thing I mentioned in the JIRA ticket I opened
>yesterday against the CBOR parser and detection: it is also possible
>that CBOR content is embedded inside a plain JSON file, and the way a
>decoder can distinguish them in that file is again by looking at the tag
>55799. This may rarely happen, but a robust parser might be able to take
>care of it. Tika might need to consider using FasterXML (Jackson), which
>the Nutch tool uses, when developing the CBOR parser...
>Again, let me cite the same paragraph from the RFC:
>
>" a decoder might be able to parse both CBOR and JSON.
>   Such a decoder would need to mechanically distinguish the two
>   formats.  An easy way for an encoder to help the decoder would be to
>   tag the entire CBOR item with tag 55799, the serialization of which
>   will never be found at the beginning of a JSON text."
>
>
>Thanks
>Luke
>
>
>
>-----Original Message-----
>From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
>Sent: Tuesday, April 21, 2015 9:49 PM
>To: Luke; 'Chris Mattmann'; Totaro, Giuseppe U (3980-Affiliate)
>Cc: Bryant, Ann C (398G-Affiliate); Zimdars, Paul A (3980-Affiliate);
>'NSF Polar CyberInfrastructure DR Students'; memex-jpl@googlegroups.com
>Subject: Re: [memex-jpl] this week action from luke
>
>Hi Luke,
>
>Can you post the below conversation to dev@tika and summarize it there.
>Also what happens if you tweak the definition of XHTML to not scan until
>8192, but say 6000 (e.g., 0:6000), does CBOR take over then?
>
>Cheers,
>Chris
>
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Chris Mattmann, Ph.D.
>Chief Architect
>Instrument Software and Science Data Systems Section (398) NASA Jet
>Propulsion Laboratory Pasadena, CA 91109 USA
>Office: 168-519, Mailstop: 168-527
>Email: chris.a.mattmann@nasa.gov
>WWW:  http://sunset.usc.edu/~mattmann/
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Adjunct Associate Professor, Computer Science Department University of
>Southern California, Los Angeles, CA 90089 USA
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>
>
>
>-----Original Message-----
>From: Luke <ha...@gmail.com>
>Date: Wednesday, April 22, 2015 at 12:19 AM
>To: Chris Mattmann <ch...@gmail.com>, "Totaro, Giuseppe U
>(3980-Affiliate)" <to...@di.uniroma1.it>, Chris Mattmann
><Ch...@jpl.nasa.gov>
>Cc: "Bryant, Ann C (398G-Affiliate)" <an...@gmail.com>, "Zimdars,
>Paul A (3980-Affiliate)" <Pa...@jpl.nasa.gov>, NSF Polar
>CyberInfrastructure DR Students <ns...@googlegroups.com>,
>"memex-jpl@googlegroups.com" <me...@googlegroups.com>
>Subject: RE: [memex-jpl] this week action from luke
>
>>Hi Professor,
>>Please see attached jpg for the difference.
>>Thanks
>>Luke
>>
>>-----Original Message-----
>>From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>>Sent: Tuesday, April 21, 2015 5:27 PM
>>To: Luke; 'Giuseppe Totaro'; 'Mattmann, Chris A (3980)'
>>Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A
>>(3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students';
>>memex-jpl@googlegroups.com
>>Subject: Re: [memex-jpl] this week action from luke
>>
>>Hey Luke, what happens if you do
>>  java -jar /path/to/tika-app -m /path/to/cbor/file.cbor
>>compared to
>>  java -jar /path/to/tika-app -m < /path/to/cbor/file.cbor
>>Any difference?
>>
>>------------------------
>>Chris Mattmann
>>chris.mattmann@gmail.com
>>
>>
>>
>>
>>-----Original Message-----
>>From: Luke <ha...@gmail.com>
>>Date: Tuesday, April 21, 2015 at 5:41 PM
>>To: 'Luke' <ha...@gmail.com>, Chris Mattmann
>><ch...@gmail.com>, 'Giuseppe Totaro' <to...@di.uniroma1.it>,
>>Chris Mattmann <Ch...@jpl.nasa.gov>
>>Cc: "'Bryant, Ann C (398G-Affiliate)'" <an...@gmail.com>,
>>"'Zimdars, Paul A (3980-Affiliate)'" <Pa...@jpl.nasa.gov>, NSF
>>Polar CyberInfrastructure DR Students
>><ns...@googlegroups.com>,
>><me...@googlegroups.com>
>>Subject: RE: [memex-jpl] this week action from luke
>>
>>>Hi professor,
>>>I just sent a pull request adding the cbor extension.
>>>The interesting thing is that Tika still identifies the file
>>>dumped by the Nutch dump tool as "application/xhtml+xml" even when I
>>>manually change the file extension to the correct one (i.e. *.cbor).
>>>
>>>The reason is probably that Tika identifies "application/xhtml+xml"
>>>by searching for "<html" in the file content (PFA: xhtml+xml.jpg).
>>>If you take a look at the CBOR file dumped by Nutch, you see that we
>>>do have that element as part of the CBOR content, because the entire
>>>crawled XHTML document seems to be embedded in the CBOR JSON (PFA:
>>>cbor.jpg). Also, in Tika, magic detection seems to have higher
>>>priority than glob detection, so the type is detected incorrectly.
>>>
>>>Therefore, I would like to mention that adding the entry
>>><glob pattern="*.cbor"/> does not resolve the issue for now without
>>>some fixed magic bytes / patterns for CBOR.
>>>I would also like to add that things will be different with our
>>>probabilistic mime detection selector: if we know that the
>>>file extension is more reliable than magic bytes, we can
>>>give more preferential weight to the extension. This also
>>>suggests that the current MimeTypes detection implementation is a
>>>bit stiff, or less flexible, in this scenario. :)
>>>
>>>
>>>Thanks
>>>Luke
>>>
>>>-----Original Message-----
>>>From: Luke [mailto:hanson311biz@gmail.com]
>>>Sent: Tuesday, April 21, 2015 12:14 PM
>>>To: 'Chris Mattmann'; 'Giuseppe Totaro'; 'Mattmann, Chris A (3980)'
>>>Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A
>>>(3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students';
>>>'memex-jpl@googlegroups.com'
>>>Subject: RE: [memex-jpl] this week action from luke
>>>
>>>Yes, let me add the cbor extension entry in tika xml, will send the
>>>pull request soon.
>>>
>>>Thanks
>>>Luke
>>>-----Original Message-----
>>>From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>>>Sent: Tuesday, April 21, 2015 6:51 AM
>>>To: Giuseppe Totaro; Mattmann, Chris A (3980)
>>>Cc: Luke; Bryant, Ann C (398G-Affiliate); Zimdars, Paul A
>>>(3980-Affiliate); NSF Polar CyberInfrastructure DR Students;
>>>memex-jpl@googlegroups.com
>>>Subject: Re: [memex-jpl] this week action from luke
>>>
>>>Giuseppe both of these ideas supporting the CBOR WRITE_TYPE_HEADER and
>>>tag along with adding an -extension command would be fantastic. Can
>>>you file both of those NUTCH issues, wait a day or so, and then based
>>>on feedback use your new Nutch commit karma to get those into Nutch?
>>>
>>>And then when creating the issues, can you link to the TIKA-1610 issue?
>>>At that point, when those two to be defined NUTCH issues are up, Luke,
>>>in parallel can you throw up a pull request/patch in Tika for the
>>>extension along with the MIME detection?
>>>
>>>Cheers,
>>>Chris
>>>
>>>------------------------
>>>Chris Mattmann
>>>chris.mattmann@gmail.com
>>>
>>>
>>>
>>>
>>>-----Original Message-----
>>>From: Giuseppe Totaro <to...@di.uniroma1.it>
>>>Date: Tuesday, April 21, 2015 at 12:33 PM
>>>To: Chris Mattmann <Ch...@jpl.nasa.gov>
>>>Cc: Luke <ha...@gmail.com>, Chris Mattmann
>>><ch...@gmail.com>, "Bryant, Ann C (398G-Affiliate)"
>>><an...@gmail.com>, "Zimdars, Paul A (3980-Affiliate)"
>>><Pa...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR
>>>Students <ns...@googlegroups.com>,
>>>"memex-jpl@googlegroups.com"
>>><me...@googlegroups.com>
>>>Subject: Re: [memex-jpl] this week action from luke
>>>
>>>>Thanks Luke. Great work.
>>>>Chris, we wrap a single string value, representing the JSON text, for
>>>>each file into CBOR (by using the serializeCBORData method). For
>>>>instance, using the Unix hex dump tool, we can see that, as expected,
>>>>the first byte of all files is "0x7F" (the first three bits are
>>>>"011", that is the major type for text strings, and the following 5
>>>>bits are "11010", meaning a uint32_t encodes the length of the
>>>>following text), and the following 4 bytes (a 32-bit unsigned
>>>>integer) encode the right length of the file (as described in RFC7049
>>>><http://tools.ietf.org/html/rfc7049>).
>>>>Therefore, no CBOR tag is currently included in the file (a list of
>>>>CBOR tags is available here
>>>><http://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml>).
>>>>I did not know about the CBOR "magic header". Thanks a lot Luke for
>>>>this great research. Chris, if you agree, I can add support for
>>>>prepending the self-describing CBOR tag 55799 to the
>>>>CommonCrawlDataDumper class. I believe it is very easy because I only
>>>>have to enable the WRITE_TYPE_HEADER feature for the CBORGenerator
>>>>class (the source code is available here
>>>><https://github.com/FasterXML/jackson-dataformat-cbor/blob/master/src/main/java/com/fasterxml/jackson/dataformat/cbor/CBORGenerator.java>).
>>>>Then, I can comment on the TIKA-1610
>>>><https://issues.apache.org/jira/browse/TIKA-1610> issue.
>>>>
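The way a CBOR initial byte splits into a 3-bit major type and 5 bits of additional information (RFC 7049, section 2.1) can be sketched as follows. The class and method names are invented for illustration. For reference, the bit pattern 011 11010 is the byte 0x7A (text string with a 4-byte length), while 0x7F is 011 11111, which RFC 7049 uses for an indefinite-length text string.

```java
// Hypothetical helper for inspecting a CBOR initial byte (RFC 7049, 2.1).
final class CborInitialByte {
    // Major type: the high-order 3 bits.
    static int majorType(int initialByte) {
        return (initialByte & 0xFF) >>> 5;
    }

    // Additional information: the low-order 5 bits.
    static int additionalInfo(int initialByte) {
        return initialByte & 0x1F;
    }

    // Number of bytes after the initial byte that hold the length/argument:
    // 0 when the value is embedded in the initial byte itself,
    // -1 for reserved values (28-30) and the indefinite-length marker (31).
    static int argumentBytes(int initialByte) {
        int info = additionalInfo(initialByte);
        if (info < 24) {
            return 0;
        }
        switch (info) {
            case 24: return 1;
            case 25: return 2;
            case 26: return 4;
            case 27: return 8;
            default: return -1;
        }
    }

    public static void main(String[] args) {
        int[] samples = {0x7A, 0x7F, 0x61};
        for (int b : samples) {
            System.out.printf("0x%02X: major=%d info=%d argBytes=%d%n",
                    b, majorType(b), additionalInfo(b), argumentBytes(b));
        }
    }
}
```

Major type 3 is the text string type in both cases; only the additional information tells whether a fixed length follows or the string is written in chunks.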
>>>>Regarding the file extension, in the Memex CCA format the original
>>>>file extension is used. We could add support for a -extension
>>>>command-line option allowing the user to give a file extension (e.g.,
>>>>cbor) for all files dumped out.
>>>>
>>>>Thanks a lot,
>>>>Giuseppe
>>>>
>>>>
>>>>
>>>>On Tue, Apr 21, 2015 at 7:31 AM, Mattmann, Chris A (3980)
>>>><ch...@jpl.nasa.gov> wrote:
>>>>
>>>>Thanks for this great research, Luke!
>>>>
>>>>Giuseppe, any idea why this tag doesn’t make it into the file?
>>>>
>>>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>Chris Mattmann, Ph.D.
>>>>Chief Architect
>>>>Instrument Software and Science Data Systems Section (398) NASA Jet
>>>>Propulsion Laboratory Pasadena, CA 91109 USA
>>>>Office: 168-519, Mailstop: 168-527
>>>>Email: chris.a.mattmann@nasa.gov
>>>>WWW:  http://sunset.usc.edu/~mattmann/
>>>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>Adjunct Associate Professor, Computer Science Department University
>>>>of Southern California, Los Angeles, CA 90089 USA
>>>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>-----Original Message-----
>>>>From: Luke <ha...@gmail.com>
>>>>Date: Tuesday, April 21, 2015 at 2:55 AM
>>>>To: Chris Mattmann <ch...@gmail.com>, "Totaro, Giuseppe U
>>>>(3980-Affiliate)" <to...@di.uniroma1.it>, Chris Mattmann
>>>><Ch...@jpl.nasa.gov>, "Bryant, Ann C (398G-Affiliate)"
>>>><an...@gmail.com>, "Zimdars, Paul A (3980-Affiliate)"
>>>><Pa...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR
>>>>Students <ns...@googlegroups.com>,
>>>>"memex-jpl@googlegroups.com"
>>>><me...@googlegroups.com>
>>>>Subject: RE: [memex-jpl] this week action from luke
>>>>
>>>>>Thanks professor.
>>>>>Hi professor and all.
>>>>>JIRA issue : CBOR Parser and detection improvement
>>>>>https://issues.apache.org/jira/browse/TIKA-1610
>>>>>
>>>>>I did a bit of research on this CBOR detection.
>>>>>
>>>>>It looks like there is a self-describing tag that needs to be
>>>>>written in the CBOR file, through which other applications might be
>>>>>able to identify the CBOR type....
>>>>>Please refer to http://tools.ietf.org/html/rfc7049#section-2.4.5
>>>>>
>>>>>I don’t see that tag in the CBOR file dumped by the
>>>>>Nutch tool, though I am not very sure.
>>>>>
>>>>>Thanks
>>>>>Luke
>>>>>
>>>>>
>>>>>
>>>>>-----Original Message-----
>>>>>From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>>>>>Sent: Monday, April 20, 2015 4:10 AM
>>>>>To: Luke; 'Giuseppe Totaro'; 'Chris Mattmann'; 'Bryant, Ann C
>>>>>(398G-Affiliate)'; 'Zimdars, Paul A (3980-Affiliate)'; 'NSF Polar
>>>>>CyberInfrastructure DR Students'; memex-jpl@googlegroups.com
>>>>>Subject: Re: [memex-jpl] this week action from luke
>>>>>
>>>>>Nice one, Luke. If you have a second and you can open up an issue in
>>>>>Tika to make it support CBOR, then yes, by all means! :)
>>>>>
>>>>>
>>>>>------------------------
>>>>>Chris Mattmann
>>>>>chris.mattmann@gmail.com
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>-----Original Message-----
>>>>>From: Luke <ha...@gmail.com>
>>>>>Date: Monday, April 20, 2015 at 4:15 AM
>>>>>To: 'Giuseppe Totaro' <to...@di.uniroma1.it>, Chris Mattmann
>>>>><ch...@gmail.com>, Chris Mattmann
>>>>><Ch...@jpl.nasa.gov>, "'Bryant, Ann C (398G-Affiliate)'"
>>>>><an...@gmail.com>, "'Zimdars, Paul A (3980-Affiliate)'"
>>>>><Pa...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR
>>>>>Students <ns...@googlegroups.com>,
>>>>><me...@googlegroups.com>
>>>>>Subject: RE: [memex-jpl] this week action from luke
>>>>>
>>>>>>Thanks a lot Giuseppe for the prompt response clearing up a bit of
>>>>>>my confusion with the Nutch CommonCrawlDataDumper , appreciated.
>>>>>>
>>>>>>BTW, it looks like Tika might need to consider supporting a
>>>>>>CBOR parser and detection.
>>>>>>I checked the RFC; it looks like CBOR does not have magic numbers.
>>>>>>PFA: rfc_cbor.jpg
>>>>>>Actually, I don’t quite understand why the CommonCrawlDataDumper
>>>>>>is not dumping the Nutch segments with the .cbor extension, which
>>>>>>seems to be helpful for type detection.
>>>>>>
>>>>>>To professor Mattmann,
>>>>>>Tika does not support the detection of CBOR. Although the trunk
>>>>>>version has entries for CBOR in tika-mimetypes.xml (PFA:
>>>>>>cbor_tika.mimetypes.xml), those entries do not properly detect
>>>>>>the CBOR files dumped by CommonCrawlDataDumper. Also, CBOR does not
>>>>>>have magic bytes; off the top of my head, the only way we can detect
>>>>>>it is using the extension and the content byte histogram (please
>>>>>>note, this is a locally optimal and data-dependent solution). :)
>>>>>>
>>>>>>I think I am deviating a bit from the main topic of
>>>>>>this thread, i.e. the plan for testing the “probabilistic mime
>>>>>>detector selection” with polar data.
>>>>>>Anyway, I plan to repackage Tika with the probabilistic
>>>>>>selection feature, replace the Tika jar in Nutch with the
>>>>>>repackaged one, then run the CommonCrawlDataDumper and see how
>>>>>>it goes. If you have any specific ideas or thoughts on the
>>>>>>testing, please kindly let me know.
>>>>>>
>>>>>>Thanks
>>>>>>Luke
>>>>>>
>>>>>>From: Giuseppe Totaro [mailto:totaro@di.uniroma1.it]
>>>>>>Sent: Sunday, April 19, 2015 11:17 PM
>>>>>>To: Luke liu
>>>>>>Cc: Chris Mattmann; Chris Mattmann; Bryant, Ann C (398G-Affiliate);
>>>>>>Zimdars, Paul A (3980-Affiliate); Luke; NSF Polar
>>>>>>CyberInfrastructure DR Students; memex-jpl@googlegroups.com
>>>>>>Subject: Re: [memex-jpl] this week action from luke
>>>>>>
>>>>>>
>>>>>>
>>>>>>Hi Luke,
>>>>>>
>>>>>>
>>>>>>my name is Giuseppe and I am a PhD student working under the
>>>>>>supervision of Prof. Chris Mattmann. I worked on
>>>>>>CommonCrawlDataDumper tool, so I can give some feedback on a couple
>>>>>>of your observations. My comments inline below.
>>>>>>
>>>>>>
>>>>>>
>>>>>>Il giorno 19/apr/2015, alle ore 12:11, Luke liu <sh...@usc.edu>
>>>>>>ha
>>>>>>scritto:
>>>>>>
>>>>>>
>>>>>>Thanks a lot professor; sorry for the brief delay, I was spending
>>>>>>some time understanding the code repo, i.e.
>>>>>>http://github.com/chrismattmann/trec-dd-polar/
>>>>>>
>>>>>>From gen-common-crawl.sh, it looks like commoncrawldump is dumping
>>>>>>the crawl segments to JSON files with human-readable and
>>>>>>understandable content.
>>>>>>1) I am trying to run one of the commands shown in
>>>>>>gen-common-crawl.sh on my side, but the generated files all end with
>>>>>>.html or .htm. The commands listed in gen-common-crawl.sh seem to
>>>>>>allude to where the data is located on our nsfpolardata.dyndns.org
>>>>>><http://nsfpolardata.dyndns.org/>. Although the locations are not
>>>>>>exactly correct (they probably need to be updated), part of the
>>>>>>patterns allowed me to locate some similar datasets (e.g.
>>>>>>/data2/crawls/raw/CS572Spring2015). Again, the dumped
>>>>>>files all end with html, but surprisingly, inside those
>>>>>>output html files, the contents are in JSON format.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>The file extension is (almost) always the same as the original file.
>>>>>>More in detail, using the -epochFilename command-line option (as in
>>>>>>gen-common-crawl.sh), the scraped data will be stored with a
>>>>>>filename of the format <epochtime(milliseconds)>.<filetype>, where
>>>>>><filetype> is either the extension of the original file or .html as
>>>>>>default if the original file does not have an extension. This
>>>>>>naming scheme does not depend on the internal
>>>>>>output format (JSON).
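The naming rule described above can be illustrated with a short sketch. This is not the CommonCrawlDataDumper code: `EpochFilename` and `build` are hypothetical names, and the way the extension is extracted from the original name is an assumption.

```java
// Hypothetical illustration of the <epochtime>.<filetype> naming rule;
// not the actual Nutch implementation.
final class EpochFilename {
    // Builds "<epochMillis>.<ext>", using "html" when the original
    // file name carries no extension.
    static String build(long epochMillis, String originalName) {
        String ext = "html";
        int slash = Math.max(originalName.lastIndexOf('/'),
                             originalName.lastIndexOf('\\'));
        int dot = originalName.lastIndexOf('.');
        // Accept the dot only if it sits inside the last path segment,
        // is not its first character, and is followed by something.
        if (dot > slash + 1 && dot < originalName.length() - 1) {
            ext = originalName.substring(dot + 1);
        }
        return epochMillis + "." + ext;
    }

    public static void main(String[] args) {
        System.out.println(build(1423894754000L, "report.pdf"));
        System.out.println(build(1423894754000L, "index"));
    }
}
```

Because the extension comes from the crawled file rather than from the CBOR container, a dumped file can end in .html while holding CBOR-encoded JSON, which matches what Luke observed.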
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>2) Another problem is that the root object is set with some
>>>>>>garbled chars in each of the output JSON files (with the html
>>>>>>extension); PFA: garbled.jpg, and one of the output JSON
>>>>>>files is attached as an example too (PFA:
>>>>>>1423894754000.html). The JSON files cannot be parsed properly by
>>>>>>aggregate.py due to those garbled chars.
>>>>>>Even if I get rid of those garbled chars, there is no mimeTypes
>>>>>>element, which is what aggregate.py reads.
>>>>>>
>>>>>>
>>>>>>
>>>>>>Text content and metadata extracted from the crawled binary data
>>>>>>are stored in a structured document format (JSON). Furthermore,
>>>>>>this document is encoded using CBOR <http://cbor.io/>
>>>>>>serialization. Each non-human-readable character that you notice
>>>>>>at the front and at the end of the JSON data is due to CBOR
>>>>>>encoding. Thus, if you need to read JSON data from a document
>>>>>>dumped out by CommonCrawlDataDumper, you have to deserialize the
>>>>>>CBOR-encoded data structure inside the file.
>>>>>>
>>>>>>
>>>>>>
>>>>>>I hope this short overview can help in your work. I really
>>>>>>appreciate your feedback and, by the way, thanks a lot for your
>>>>>>great job in detection.
>>>>>>
>>>>>>I am available to provide all the support I can give, so do not
>>>>>>hesitate to contact me if you need any further information.
>>>>>>
>>>>>>
>>>>>>
>>>>>>Thanks,
>>>>>>
>>>>>>Giuseppe
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>Finally, after some research, I guess that the statistical
>>>>>>information (present in the readme of the code repo) is not being
>>>>>>collected and computed by aggregate.py from those output json files
>>>>>>but it looks like it is coming from the log.... see the following
>>>>>>as an example:
>>>>>>
>>>>>>2015-04-19 04:55:42,078 INFO  tools.CommonCrawlDataDumper -
>>>>>>CommonsCrawlDataDumper File Stats:
>>>>>>TOTAL Stats:
>>>>>>[
>>>>>>   {"mimeType":"application/x-tika-msoffice","count":"17"}
>>>>>>   {"mimeType":"application/vnd.ms-excel","count":"7"}
>>>>>>   {"mimeType":"application/xhtml+xml","count":"3000"}
>>>>>>   {"mimeType":"application/octet-stream","count":"641"}
>>>>>>   {"mimeType":"application/epub+zip","count":"1"}
>>>>>>   {"mimeType":"application/zip","count":"6"}
>>>>>>   {"mimeType":"application/xml","count":"11"}
>>>>>>   {"mimeType":"image/png","count":"110"}
>>>>>>   {"mimeType":"image/jpeg","count":"70"}
>>>>>>   {"mimeType":"application/atom+xml","count":"213"}
>>>>>>   {"mimeType":"application/rss+xml","count":"43"}
>>>>>>   {"mimeType":"video/mp4","count":"3"}
>>>>>>   {"mimeType":"text/plain","count":"104"}
>>>>>>   {"mimeType":"application/rdf+xml","count":"2"}
>>>>>>   {"mimeType":"image/gif","count":"2"}
>>>>>>   {"mimeType":"text/x-php","count":"1"}
>>>>>>   {"mimeType":"video/x-msvideo","count":"1"}
>>>>>>   {"mimeType":"application/x-tika-ooxml","count":"3"}
>>>>>>   {"mimeType":"text/html","count":"9506"}
>>>>>>   {"mimeType":"application/pdf","count":"280"}
>>>>>>]
>>>>>>
>>>>>>It turns out that aggregate.py is not the one that produces the
>>>>>>statistical information; I am not sure what it does. Anyway, I
>>>>>>think I understand the whole idea and I concur with it. Maybe
>>>>>>we can repackage Tika with the feature (i.e.
>>>>>>probabilistic mime selection) and see if it outputs the same
>>>>>>information in the log as the one without it.
>>>>>>
>>>>>>BTW, regarding the use of the probabilistic mime selection
>>>>>>feature:
>>>>>>in my pull request, I added a simple test case which might tell a
>>>>>>bit more about how the feature is called and used; it is simple
>>>>>>though.
>>>>>>Here is an example snippet:
>>>>>>    ProbabilisticMimeDetectionSelector probSel =
>>>>>>        new ProbabilisticMimeDetectionSelector();
>>>>>>    // input is an InputStream, metadata is a Metadata
>>>>>>    probSel.detect(input, metadata);
>>>>>>It is similar to MimeTypes::detect(...) (more information
>>>>>>on this can be found in
>>>>>>https://issues.apache.org/jira/browse/TIKA-1517).
>>>>>>Now, in order to allow Tika().detect() to call
>>>>>>ProbabilisticMimeDetectionSelector::detect(...) (as Tika().detect()
>>>>>>is being called by commoncrawldump), we need to modify/add some
>>>>>>code in TikaConfig, which initializes a list of default
>>>>>>detectors: we need to remove the detector mimeTypes::MimeTypes
>>>>>>from the list and replace it with
>>>>>>probSel::ProbabilisticMimeDetectionSelector. (I am not sure if I
>>>>>>should create another pull request with this change for
>>>>>>TikaConfig.)
>>>>>>
>>>>>>I think that is all of my initial thoughts, findings and
>>>>>>plan; if you have anything you would like to add or
>>>>>>comment on, please kindly let me know, and then I will start
>>>>>>working on my 'finale'. BTW, don’t worry: graduation is not the
>>>>>>end of my work with Tika and this project. After
>>>>>>that I still can and want to help this polar project and Tika as
>>>>>>much as possible, correct programming faults and bugs,
>>>>>>respond to Tika issues, etc.
>>>>>>
>>>>>>
>>>>>>
>>>>>>Thanks
>>>>>>Luke
>>>>>>
>>>>>>-----Original Message-----
>>>>>>From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>>>>>>Sent: Saturday, April 18, 2015 6:26 AM
>>>>>>To: Luke liu; 'Mattmann, Chris A (3980)'; Bryant, Ann C
>>>>>>(398G-Affiliate); Zimdars, Paul A (3980-Affiliate)
>>>>>>Cc: 'Luke'; NSF Polar CyberInfrastructure DR Students;
>>>>>>memex-jpl@googlegroups.com
>>>>>>Subject: Re: this week action from luke
>>>>>>Importance: High
>>>>>>
>>>>>>Awesome Luke. I am going to work specifically on now benchmarking
>>>>>>your code in real situations. For example, it would be fantastic to
>>>>>>now run your Bayesian MIME detector over the whole NSF TREC Dynamic
>>>>>>Domain data for Polar described here:
>>>>>>
>>>>>>http://github.com/chrismattmann/trec-dd-polar/
>>>>>>
>>>>>>Paul Zimdars, CC’ed, can provide you with access to the data, and
>>>>>>Annie can explain it, also CC’ed.
>>>>>>
>>>>>>Can we make that your goal for the next 2 weeks to actually test it
>>>>>>and produce a real result over the whole TREC-DD data for Polar? My
>>>>>>goal will be to get your code committed and integrated into Tika.
>>>>>>The more you can write me a guide of how to build and test your
>>>>>>code with Tika so I can get it committed the better.
>>>>>>
>>>>>>Also CC’ing the Memex list for context. Note everyone: Luke is
>>>>>>building a Bayesian MIME classifier to evaluate against Tika’s
>>>>>>existing MIME detection approach. If folks have any Memex needs to
>>>>>>try and test more accurate file identification with Tika, Luke is
>>>>>>the guy to talk to and I have him for 2 more weeks.
>>>>>>
>>>>>>Thanks!
>>>>>>
>>>>>>Cheers,
>>>>>>Chris
>>>>>>
>>>>>>------------------------
>>>>>>Chris Mattmann
>>>>>>chris.mattmann@gmail.com
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>-----Original Message-----
>>>>>>From: Luke liu <sh...@usc.edu>
>>>>>>Date: Thursday, April 16, 2015 at 11:29 PM
>>>>>>To: Chris Mattmann <ch...@gmail.com>, Chris Mattmann
>>>>>><Ch...@jpl.nasa.gov>
>>>>>>Cc: 'Luke' <ha...@gmail.com>
>>>>>>Subject: this week action from luke
>>>>>>
>>>>>>
>>>>>>
>>>>>>Hi Professor Mattmann,
>>>>>>
>>>>>>I think I am in the final phase of the research; last week I
>>>>>>finished the last item in the list, and hopefully everything will
>>>>>>be fine.
>>>>>>
>>>>>>For now, I can probably spend some time verifying or optimizing
>>>>>>the code. The majority of the research has been done… and it would
>>>>>>also be great if you could comment on my work (the 2 pull
>>>>>>requests) when you have time.
>>>>>>
>>>>>>If anything in my work is confusing, please do let me
>>>>>>know.
>>>>>>
>>>>>>Thanks, and I am glad to be working with you. For the next couple
>>>>>>of weeks before graduation, I am going to continue revising and
>>>>>>testing the code and features to get rid of flaws (if any)
>>>>>>when I have time.
>>>>>>
>>>>>>I am not sure if I missed something; if I did miss something
>>>>>>important, please do let me know too.
>>>>>>
>>>>>>Thanks
>>>>>>Luke
>>>>>>
>>>>>>
>>>>>><garbled.jpg><1423894754000.html>

RE: [memex-jpl] this week action from luke

Posted by Luke <ha...@gmail.com>.

Hi Prof,
I am actually working on that; it takes a bit of time (around 2 or
3 hours) to run the whole script gen-common-crawl.sh.
A couple of suspicious errors also caused me to run and rerun the script a
couple of times .... I need to be careful when testing with that size of
data.

I will keep you updated on the findings and progress.

Thanks
Luke

-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov] 
Sent: Wednesday, April 22, 2015 3:49 PM
To: Luke
Cc: Chris Mattmann; Totaro, Giuseppe U (3980-Affiliate);
dev@tika.apache.org; Bryant, Ann C (398G-Affiliate); Zimdars, Paul A
(3980-Affiliate); NSF Polar CyberInfrastructure DR Students;
memex-jpl@googlegroups.com
Subject: Re: [memex-jpl] this week action from luke

Thanks Luke. This is probably a good opportunity to test out your Bayesian
mime detector. How does it perform here?

Sent from my iPhone

> On Apr 22, 2015, at 3:29 PM, Luke <ha...@gmail.com> wrote:
> 
> Hi professor,
> 
> Please see the following results.
> <match value="&lt;html xmlns=" type="string" offset="0:1024"/>
> Result: "text/html"
> 
> <match value="&lt;html xmlns=" type="string" offset="0:6000"/>
> Result: "application/xhtml+xml"
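One way to read the two results above: a magic match with an offset range only looks for the pattern at start positions inside that range. The sketch below models that reading; it is not Tika's matcher, `OffsetRangeMatch` and its methods are hypothetical names, the range is assumed to be inclusive, and the 2000-byte padding is an invented example rather than a measurement of the actual CBOR file.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical model of an offset-range magic match; NOT Tika's code.
final class OffsetRangeMatch {
    // True if pattern occurs in data starting at some offset in [from, to].
    static boolean matches(byte[] data, byte[] pattern, int from, int to) {
        int last = Math.min(to, data.length - pattern.length);
        for (int start = Math.max(from, 0); start <= last; start++) {
            int i = 0;
            while (i < pattern.length && data[start + i] == pattern[i]) {
                i++;
            }
            if (i == pattern.length) {
                return true;
            }
        }
        return false;
    }

    // Builds sample data: `padding` space bytes followed by the marker text.
    static byte[] padded(int padding, String marker) {
        byte[] m = marker.getBytes(StandardCharsets.US_ASCII);
        byte[] out = new byte[padding + m.length];
        Arrays.fill(out, 0, padding, (byte) ' ');
        System.arraycopy(m, 0, out, padding, m.length);
        return out;
    }

    public static void main(String[] args) {
        byte[] pattern = "<html xmlns=".getBytes(StandardCharsets.US_ASCII);
        byte[] data = padded(2000, "<html xmlns=");
        System.out.println(matches(data, pattern, 0, 1024));
        System.out.println(matches(data, pattern, 0, 6000));
    }
}
```

If the embedded "<html xmlns=" text in the dumped file starts somewhere between byte 1024 and byte 6000, this model would be consistent with the XHTML magic missing at 0:1024 and matching at 0:6000.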
> 
> 
> Thanks
> Luke
> 
> -----Original Message-----
> From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
> Sent: Wednesday, April 22, 2015 4:21 AM
> To: Luke; 'Mattmann, Chris A (3980)'; 'Totaro, Giuseppe U 
> (3980-Affiliate)'; dev@tika.apache.org
> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A 
> (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; 
> memex-jpl@googlegroups.com
> Subject: Re: [memex-jpl] this week action from luke
> 
> Hi Luke,
> 
> Actually I just meant go into tika-mimetypes.xml and change the magic
> offsets for application/xhtml+xml and see if that works. The code you
> changed below is actually how many bytes Tika will first download to do
> MIME checking.
> 
> Cheers,
> Chris
> 
> ------------------------
> Chris Mattmann
> chris.mattmann@gmail.com
> 
> 
> 
> 
> -----Original Message-----
> From: Luke <ha...@gmail.com>
> Date: Wednesday, April 22, 2015 at 2:25 AM
> To: Chris Mattmann <ch...@gmail.com>, Chris Mattmann
> <Ch...@jpl.nasa.gov>, "'Totaro, Giuseppe U (3980-Affiliate)'"
> <to...@di.uniroma1.it>, <de...@tika.apache.org>
> Cc: "'Bryant, Ann C (398G-Affiliate)'" <an...@gmail.com>, 
> "'Zimdars, Paul A (3980-Affiliate)'" <Pa...@jpl.nasa.gov>, 
> NSF Polar CyberInfrastructure DR Students 
> <ns...@googlegroups.com>,
> <me...@googlegroups.com>
> Subject: RE: [memex-jpl] this week action from luke
> 
>> 
>> Hi professor,
>> 
>> I just tried it with minLength set to 1024, and I get the following:
>> "text/plain"
>> I am a bit surprised....
>> 
>> BTW, a min length of 6000 still gives "application/xhtml+xml"; with
>> any min length below 1024, I am seeing "text/plain". :)
>> 
>> BTW, the min length I am referring to and altering is the following, in
>> MimeTypes.java:
>>    public int getMinLength() {
>>        // This needs to be reasonably large to be able to correctly detect
>>        // things like XML root elements after initial comment and DTDs
>>        return 64 * 1024;
>>    }
>> 
>> 
>> Thanks
>> Luke
>> 
>> -----Original Message-----
>> From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>> Sent: Tuesday, April 21, 2015 7:48 PM
>> To: Luke; 'Mattmann, Chris A (3980)'; 'Totaro, Giuseppe U 
>> (3980-Affiliate)'; dev@tika.apache.org
>> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A 
>> (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; 
>> memex-jpl@googlegroups.com
>> Subject: Re: [memex-jpl] this week action from luke
>> 
>> Thanks Luke.
>> 
>> So I guess all I was asking was could you try it out. Thanks for the 
>> lesson in the RFC.
>> 
>> Cheers,
>> Chris
>> 
>> ------------------------
>> Chris Mattmann
>> chris.mattmann@gmail.com
>> 
>> 
>> 
>> 
>> -----Original Message-----
>> From: Luke <ha...@gmail.com>
>> Date: Wednesday, April 22, 2015 at 1:46 AM
>> To: Chris Mattmann <Ch...@jpl.nasa.gov>, Chris Mattmann 
>> <ch...@gmail.com>, "'Totaro, Giuseppe U (3980-Affiliate)'"
>> <to...@di.uniroma1.it>, <de...@tika.apache.org>
>> Cc: "'Bryant, Ann C (398G-Affiliate)'" <an...@gmail.com>, 
>> "'Zimdars, Paul A (3980-Affiliate)'" <Pa...@jpl.nasa.gov>, 
>> NSF Polar CyberInfrastructure DR Students 
>> <ns...@googlegroups.com>,
>> <me...@googlegroups.com>
>> Subject: RE: [memex-jpl] this week action from luke
>> 
>>> Hi professor,
>>> 
>>> 
>>> I think it depends heavily on the content Tika reads. If the file
>>> contains a byte sequence that matches the magic of one or more MIME
>>> types defined in tika-mimetypes.xml, I guess that Tika puts those types
>>> in its estimation list; note that magic-byte detection can yield
>>> several candidate types. Tika then also considers the result of
>>> extension detection: if the extension points to the first type in the
>>> magic estimation list, that type is returned (the same applies to the
>>> metadata hint approach). Tika also prefers the most specialized type.
>>> 
>>> Let's get back to the following question; this is only my guess.
>>> [Prof]: Also what happens if you tweak the definition of XHTML to
>>> not scan until 8192, but say 6000 (e.g., 0:6000), does CBOR take over
>>> then?
>>> Consider an extreme case where we scan only 10 bytes or 1 byte. Magic
>>> bytes will then inevitably detect nothing, and I think detection will
>>> return something like "application/octet-stream", the most general
>>> type. As mentioned, Tika favours the most specialized type, and almost
>>> every type is a subclass of "application/octet-stream". So if the
>>> extension approach returns a more specialized type, the answer in this
>>> extreme case may be yes: I think it is very possible that the CBOR type
>>> detected by the extension approach takes over.
>>> 
>>> My idea was, and still is, that if the CBOR self-describing tag 55799
>>> is present in the CBOR file, it can be used to detect the CBOR type.
>>> Again, the CBOR type will probably be appended to the magic estimation
>>> list together with another one such as application/xhtml+xml. I guess
>>> the order in the list also matters: the first one is preferred over
>>> the next one. The decision from the extension detection approach also
>>> plays a role in breaking the tie, e.g. if the extension detection
>>> method agrees on CBOR with one of the estimated types in the magic
>>> list, then CBOR will be returned (again, the same applies to the
>>> metadata hint method).
>>> 
>>> I have not taken a closer look at a CBOR file that carries tag 55799,
>>> but I expect its hex to be something like 0xd9d9f7, i.e. the tag should
>>> appear in the header as a fixed sequence of bytes
>>> (https://tools.ietf.org/html/rfc7049#section-2.4.5). If this is
>>> present in the file, preferably in the header (within a reasonable
>>> range of bytes), I believe it can probably be used as the magic number
>>> for the CBOR type.
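The byte check described in the paragraph above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class CborSelfDescribeCheck and its method are hypothetical names, not part of Tika or Nutch. It relies on the RFC 7049 encoding of tag 55799 (0xD9F7) as major type 6 with a 2-byte argument, which gives the three bytes 0xD9 0xD9 0xF7.

```java
// Hypothetical helper for illustration only; not part of Tika or Nutch.
final class CborSelfDescribeCheck {

    // Tag 55799 == 0xD9F7. A tag (major type 6) with a 2-byte argument has
    // the initial byte 0xD9, so the full prefix is D9 D9 F7.
    static final byte[] SELF_DESCRIBE_TAG = {(byte) 0xD9, (byte) 0xD9, (byte) 0xF7};

    /** Returns true if the given leading bytes start with the tag prefix. */
    static boolean startsWithSelfDescribeTag(byte[] head) {
        if (head == null || head.length < SELF_DESCRIBE_TAG.length) {
            return false;
        }
        for (int i = 0; i < SELF_DESCRIBE_TAG.length; i++) {
            if (head[i] != SELF_DESCRIBE_TAG[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A tagged one-character text string "a": D9 D9 F7 61 61.
        byte[] tagged = {(byte) 0xD9, (byte) 0xD9, (byte) 0xF7, 0x61, 0x61};
        // The same string without the tag: 61 61.
        byte[] untagged = {0x61, 0x61};
        System.out.println(startsWithSelfDescribeTag(tagged));
        System.out.println(startsWithSelfDescribeTag(untagged));
    }
}
```

A check like this only helps when the writer actually emits the tag; files written without it still need the extension or another hint.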
>>> 
>>> 
>>> There is another thing I mentioned in the JIRA ticket I opened
>>> yesterday for the CBOR parser and detection: it is also possible that
>>> CBOR content is embedded inside a plain JSON file, and a decoder can
>>> distinguish the two in that file by looking at tag 55799 again. This
>>> may rarely happen, but a robust parser might be able to take care of
>>> it. Tika might need to consider using the FasterXML library, which the
>>> Nutch tool uses, when developing the CBOR parser...
>>> Again, let me cite the same paragraph from the RFC:
>>> 
>>> " a decoder might be able to parse both CBOR and JSON.
>>>  Such a decoder would need to mechanically distinguish the two  
>>> formats.  An easy way for an encoder to help the decoder would be to  
>>> tag the entire CBOR item with tag 55799, the serialization of which  
>>> will never be found at the beginning of a JSON text."
>>> 
>>> 
>>> Thanks
>>> Luke
>>> 
>>> 
>>> 
>>> -----Original Message-----
>>> From: Mattmann, Chris A (3980) 
>>> [mailto:chris.a.mattmann@jpl.nasa.gov]
>>> Sent: Tuesday, April 21, 2015 9:49 PM
>>> To: Luke; 'Chris Mattmann'; Totaro, Giuseppe U (3980-Affiliate)
>>> Cc: Bryant, Ann C (398G-Affiliate); Zimdars, Paul A 
>>> (3980-Affiliate); 'NSF Polar CyberInfrastructure DR Students'; 
>>> memex-jpl@googlegroups.com
>>> Subject: Re: [memex-jpl] this week action from luke
>>> 
>>> Hi Luke,
>>> 
>>> Can you post the below conversation to dev@tika and summarize it there.
>>> Also what happens if you tweak the definition of XHTML to not scan 
>>> until 8192, but say 6000 (e.g., 0:6000), does CBOR take over then?
>>> 
>>> Cheers,
>>> Chris
>>> 
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Chris Mattmann, Ph.D.
>>> Chief Architect
>>> Instrument Software and Science Data Systems Section (398) NASA Jet 
>>> Propulsion Laboratory Pasadena, CA 91109 USA
>>> Office: 168-519, Mailstop: 168-527
>>> Email: chris.a.mattmann@nasa.gov
>>> WWW:  http://sunset.usc.edu/~mattmann/
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Adjunct Associate Professor, Computer Science Department University 
>>> of Southern California, Los Angeles, CA 90089 USA
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> -----Original Message-----
>>> From: Luke <ha...@gmail.com>
>>> Date: Wednesday, April 22, 2015 at 12:19 AM
>>> To: Chris Mattmann <ch...@gmail.com>, "Totaro, Giuseppe U 
>>> (3980-Affiliate)" <to...@di.uniroma1.it>, Chris Mattmann 
>>> <Ch...@jpl.nasa.gov>
>>> Cc: "Bryant, Ann C (398G-Affiliate)" <an...@gmail.com>, 
>>> "Zimdars, Paul A (3980-Affiliate)" <Pa...@jpl.nasa.gov>, 
>>> NSF Polar CyberInfrastructure DR Students 
>>> <ns...@googlegroups.com>,
>>> "memex-jpl@googlegroups.com" <me...@googlegroups.com>
>>> Subject: RE: [memex-jpl] this week action from luke
>>> 
>>>> Hi Professor,
>>>> Please see attached jpg for the difference.
>>>> Thanks
>>>> Luke
>>>> 
>>>> -----Original Message-----
>>>> From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>>>> Sent: Tuesday, April 21, 2015 5:27 PM
>>>> To: Luke; 'Giuseppe Totaro'; 'Mattmann, Chris A (3980)'
>>>> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A 
>>>> (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; 
>>>> memex-jpl@googlegroups.com
>>>> Subject: Re: [memex-jpl] this week action from luke
>>>> 
>>>> Hey Luke what happens if you do java -jar /path/to/tika-app -m 
>>>> /path/to/cbor/file.cbor, compared to: java -jar /path/to/tika-app 
>>>> -m < /path/to/cbor/file.cbor any difference?
>>>> 
>>>> ------------------------
>>>> Chris Mattmann
>>>> chris.mattmann@gmail.com
>>>> 
>>>> 
>>>> 
>>>> 
>>>> -----Original Message-----
>>>> From: Luke <ha...@gmail.com>
>>>> Date: Tuesday, April 21, 2015 at 5:41 PM
>>>> To: 'Luke' <ha...@gmail.com>, Chris Mattmann 
>>>> <ch...@gmail.com>, 'Giuseppe Totaro'
>>>> <to...@di.uniroma1.it>, Chris Mattmann 
>>>> <Ch...@jpl.nasa.gov>
>>>> Cc: "'Bryant, Ann C (398G-Affiliate)'" <an...@gmail.com>, 
>>>> "'Zimdars, Paul A (3980-Affiliate)'" <Pa...@jpl.nasa.gov>, 
>>>> NSF Polar CyberInfrastructure DR Students 
>>>> <ns...@googlegroups.com>,
>>>> <me...@googlegroups.com>
>>>> Subject: RE: [memex-jpl] this week action from luke
>>>> 
>>>>> Hi professor,
>>>>> I just sent a pull request for adding cbor extension.
>>>>> The interesting thing is that Tika still identifies the file dumped
>>>>> by the Nutch dump tool as "application/xhtml+xml" even when I
>>>>> manually change the file extension to the correct one (i.e. *.cbor).
>>>>>
>>>>> The reason is probably that Tika identifies "application/xhtml+xml"
>>>>> by searching for "&lt;html" in the file content (PFA:
>>>>> xhtml+xml.jpg). If you take a look at the CBOR file dumped by Nutch,
>>>>> you see that we do have that element as part of the CBOR content,
>>>>> because the entire crawled XHTML document seems to be embedded in
>>>>> the CBOR JSON (PFA: cbor.jpg). Also, in Tika, magic detection seems
>>>>> to have higher priority than glob detection, so the type is being
>>>>> incorrectly detected.
>>>>> 
>>>>> Therefore, I would like to point out that adding the entry
>>>>> <glob pattern="*.cbor"/> does not resolve the issue for now, without
>>>>> some fixed magic bytes / patterns for CBOR.
>>>>> I would also add that things will be different with our
>>>>> probabilistic MIME detection selector: if we know that the file
>>>>> extension is more reliable than magic bytes, we can add more
>>>>> preferential weight to the extension... This might also show that
>>>>> the current MimeTypes detection implementation is a bit stiff, or
>>>>> less flexible, in this scenario. :)
>>>>> 
>>>>> 
>>>>> Thanks
>>>>> Luke
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Luke [mailto:hanson311biz@gmail.com]
>>>>> Sent: Tuesday, April 21, 2015 12:14 PM
>>>>> To: 'Chris Mattmann'; 'Giuseppe Totaro'; 'Mattmann, Chris A (3980)'
>>>>> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A 
>>>>> (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; 
>>>>> 'memex-jpl@googlegroups.com'
>>>>> Subject: RE: [memex-jpl] this week action from luke
>>>>> 
>>>>> Yes, let me add the cbor extension entry in tika xml, will send 
>>>>> the pull request soon.
>>>>> 
>>>>> Thanks
>>>>> Luke
>>>>> -----Original Message-----
>>>>> From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>>>>> Sent: Tuesday, April 21, 2015 6:51 AM
>>>>> To: Giuseppe Totaro; Mattmann, Chris A (3980)
>>>>> Cc: Luke; Bryant, Ann C (398G-Affiliate); Zimdars, Paul A 
>>>>> (3980-Affiliate); NSF Polar CyberInfrastructure DR Students; 
>>>>> memex-jpl@googlegroups.com
>>>>> Subject: Re: [memex-jpl] this week action from luke
>>>>> 
>>>>> Giuseppe, both of these ideas (supporting the CBOR WRITE_TYPE_HEADER
>>>>> and tag, along with adding an -extension command) would be fantastic.
>>>>> Can you file both of those NUTCH issues, wait a day or so, and
>>>>> then based on feedback use your new Nutch commit karma to get
>>>>> those into Nutch?
>>>>>
>>>>> And then when creating the issues, can you link to the TIKA-1610
>>>>> issue?
>>>>> At that point, when those two to-be-defined NUTCH issues are up,
>>>>> Luke, in parallel can you throw up a pull request/patch in Tika
>>>>> for the extension along with the MIME detection?
>>>>> 
>>>>> Cheers,
>>>>> Chris
>>>>> 
>>>>> ------------------------
>>>>> Chris Mattmann
>>>>> chris.mattmann@gmail.com
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Giuseppe Totaro <to...@di.uniroma1.it>
>>>>> Date: Tuesday, April 21, 2015 at 12:33 PM
>>>>> To: Chris Mattmann <Ch...@jpl.nasa.gov>
>>>>> Cc: Luke <ha...@gmail.com>, Chris Mattmann 
>>>>> <ch...@gmail.com>, "Bryant, Ann C (398G-Affiliate)"
>>>>> <an...@gmail.com>, "Zimdars, Paul A (3980-Affiliate)"
>>>>> <Pa...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR 
>>>>> Students <ns...@googlegroups.com>,
>>>>> "memex-jpl@googlegroups.com"
>>>>> <me...@googlegroups.com>
>>>>> Subject: Re: [memex-jpl] this week action from luke
>>>>> 
>>>>>> Thanks Luke. Great work.
>>>>>> Chris, we wrap a single string value, representing the JSON text, 
>>>>>> for each file into CBOR (by using serializeCBORData method). For 
>>>>>> instance, using the Unix hex dump tool, we can see that, as 
>>>>>> expected, the first byte of all files is "0x7F" (the first three 
>>>>>> bits are "011", that is the major type for strings, and the 
>>>>>> following 5 bits are "11010", meaning a uint32_t encodes the 
>>>>>> length of following text), and the following 4 bytes 
>>>>>> (single-precision
>>>>>> float) encodes the right length of file (as described in RFC7049 
>>>>>> <http://tools.ietf.org/html/rfc7049>).
>>>>>> Therefore, no CBOR tag is currently included in the file (a list
>>>>>> of CBOR tags is available here
>>>>>> <http://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml>).
>>>>>> I did not know about the CBOR "magic header". Thanks a lot Luke for
>>>>>> this great research. Chris, if you agree, I can add support for
>>>>>> prepending the self-describing CBOR tag 55799 to the
>>>>>> CommonCrawlDataDumper class. I believe it is very easy because I
>>>>>> only have to enable the WRITE_TYPE_HEADER feature of the
>>>>>> CBORGenerator class (the source code is available here
>>>>>> <https://github.com/FasterXML/jackson-dataformat-cbor/blob/master/src/main/java/com/fasterxml/jackson/dataformat/cbor/CBORGenerator.java>).
>>>>>> Then, I can comment on the TIKA-1610
>>>>>> <https://issues.apache.org/jira/browse/TIKA-1610> issue.
>>>>>> 
>>>>>> Regarding the file extension, in the Memex CCA format the 
>>>>>> original file extension is used. We could add support for a 
>>>>>> -extension command-line option allowing the user to give a file 
>>>>>> extension (e.g.,
>>>>>> cbor) for all files dumped out.
>>>>>> 
>>>>>> Thanks a lot,
>>>>>> Giuseppe
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Tue, Apr 21, 2015 at 7:31 AM, Mattmann, Chris A (3980) 
>>>>>> <ch...@jpl.nasa.gov> wrote:
>>>>>> 
>>>>>> Thanks for this great research, Luke!
>>>>>> 
>>>>>> Giuseppe, any idea why this tag doesn't make it into the file?
>>>>>> 
>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>> Chris Mattmann, Ph.D.
>>>>>> Chief Architect
>>>>>> Instrument Software and Science Data Systems Section (398) NASA 
>>>>>> Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>>>> Office: 168-519, Mailstop: 168-527
>>>>>> Email: chris.a.mattmann@nasa.gov
>>>>>> WWW:  http://sunset.usc.edu/~mattmann/
>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>> Adjunct Associate Professor, Computer Science Department 
>>>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: Luke <ha...@gmail.com>
>>>>>> Date: Tuesday, April 21, 2015 at 2:55 AM
>>>>>> To: Chris Mattmann <ch...@gmail.com>, "Totaro, Giuseppe 
>>>>>> U (3980-Affiliate)" <to...@di.uniroma1.it>, Chris Mattmann 
>>>>>> <Ch...@jpl.nasa.gov>, "Bryant, Ann C (398G-Affiliate)"
>>>>>> <an...@gmail.com>, "Zimdars, Paul A (3980-Affiliate)"
>>>>>> <Pa...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR 
>>>>>> Students <ns...@googlegroups.com>,
>>>>>> "memex-jpl@googlegroups.com"
>>>>>> <me...@googlegroups.com>
>>>>>> Subject: RE: [memex-jpl] this week action from luke
>>>>>> 
>>>>>>> Thanks professor.
>>>>>>> Hi professor and all.
>>>>>>> JIRA issue : CBOR Parser and detection improvement
>>>>>>> https://issues.apache.org/jira/browse/TIKA-1610
>>>>>>> 
>>>>>>> I tried to do a bit of research on this CBOR detection.
>>>>>>>
>>>>>>> It looks like there is a self-describing tag that needs to be
>>>>>>> written in the CBOR file, through which other applications might
>>>>>>> be able to identify the CBOR type.
>>>>>>> Please refer to http://tools.ietf.org/html/rfc7049#section-2.4.5
>>>>>>>
>>>>>>> I don't see that tag in the CBOR file dumped by the Nutch tool,
>>>>>>> though I am not very sure.
>>>>>>> 
>>>>>>> Thanks
>>>>>>> Luke
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>>>>>>> Sent: Monday, April 20, 2015 4:10 AM
>>>>>>> To: Luke; 'Giuseppe Totaro'; 'Chris Mattmann'; 'Bryant, Ann C 
>>>>>>> (398G-Affiliate)'; 'Zimdars, Paul A (3980-Affiliate)'; 'NSF 
>>>>>>> Polar CyberInfrastructure DR Students'; 
>>>>>>> memex-jpl@googlegroups.com
>>>>>>> Subject: Re: [memex-jpl] this week action from luke
>>>>>>> 
>>>>>>> Nice one, Luke. If you have a second and you can open up an 
>>>>>>> issue in Tika to make it support CBOR, then yes, by all means! 
>>>>>>> :)
>>>>>>> 
>>>>>>> 
>>>>>>> ------------------------
>>>>>>> Chris Mattmann
>>>>>>> chris.mattmann@gmail.com
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Luke <ha...@gmail.com>
>>>>>>> Date: Monday, April 20, 2015 at 4:15 AM
>>>>>>> To: 'Giuseppe Totaro' <to...@di.uniroma1.it>, Chris Mattmann 
>>>>>>> <ch...@gmail.com>, Chris Mattmann 
>>>>>>> <Ch...@jpl.nasa.gov>, "'Bryant, Ann C (398G-Affiliate)'"
>>>>>>> <an...@gmail.com>, "'Zimdars, Paul A (3980-Affiliate)'"
>>>>>>> <Pa...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR 
>>>>>>> Students <ns...@googlegroups.com>,
>>>>>>> <me...@googlegroups.com>
>>>>>>> Subject: RE: [memex-jpl] this week action from luke
>>>>>>> 
>>>>>>>> Thanks a lot Giuseppe for the prompt response clearing up a bit 
>>>>>>>> of my confusion with the Nutch CommonCrawlDataDumper , appreciated.
>>>>>>>> 
>>>>>>>> BTW, it looks like Tika might need to consider supporting CBOR
>>>>>>>> parsing and detection.
>>>>>>>> I checked the RFC, and it looks like CBOR does not have magic
>>>>>>>> numbers. PFA: rfc_cbor.jpg
>>>>>>>> Actually, I don't quite understand why the
>>>>>>>> CommonCrawlDataDumper is not dumping the Nutch segments with
>>>>>>>> the .cbor extension, which would seem helpful for type detection.
>>>>>>>>
>>>>>>>> To professor Mattmann,
>>>>>>>> Tika does not support the detection of CBOR. Although the trunk
>>>>>>>> version has entries for CBOR in tika-mimetypes.xml (PFA:
>>>>>>>> cbor_tika.mimetypes.xml), those entries do not properly detect
>>>>>>>> the CBOR files dumped by CommonCrawlDataDumper. Also, CBOR does
>>>>>>>> not have magic bytes; off the top of my head, the only way we can
>>>>>>>> detect it is by using the extension and a content byte histogram
>>>>>>>> (please note, this is a locally optimal solution and
>>>>>>>> data-dependent). :)
>>>>>>>> 
>>>>>>>> I think I am deviating a bit from the main topic of this thread,
>>>>>>>> i.e. the plan for testing the "probabilistic mime detector
>>>>>>>> selection" with polar data.
>>>>>>>> Anyway, I plan to repackage Tika with the probabilistic
>>>>>>>> selection feature incorporated, replace the Tika jar in Nutch
>>>>>>>> with the repackaged one, and then run the
>>>>>>>> CommonCrawlDataDumper and see how it goes. If you have any
>>>>>>>> specific ideas or thoughts on the testing, please kindly let me
>>>>>>>> know.
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> Luke
>>>>>>>> 
>>>>>>>> From: Giuseppe Totaro [mailto:totaro@di.uniroma1.it]
>>>>>>>> Sent: Sunday, April 19, 2015 11:17 PM
>>>>>>>> To: Luke liu
>>>>>>>> Cc: Chris Mattmann; Chris Mattmann; Bryant, Ann C 
>>>>>>>> (398G-Affiliate); Zimdars, Paul A (3980-Affiliate); Luke; NSF 
>>>>>>>> Polar CyberInfrastructure DR Students; 
>>>>>>>> memex-jpl@googlegroups.com
>>>>>>>> Subject: Re: [memex-jpl] this week action from luke
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Hi Luke,
>>>>>>>> 
>>>>>>>> 
>>>>>>>> my name is Giuseppe and I am a PhD student working under the 
>>>>>>>> supervision of Prof. Chris Mattmann. I worked on 
>>>>>>>> CommonCrawlDataDumper tool, so I can give some feedback on a 
>>>>>>>> couple of your observations. My comments inline below.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Il giorno 19/apr/2015, alle ore 12:11, Luke liu 
>>>>>>>> <sh...@usc.edu> ha
>>>>>>>> scritto:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Thanks a lot professor; Sorry for the brief delay, I was 
>>>>>>>> spending some time in understanding the code repo i.e.
>>>>>>>> http://github.com/chrismattmann/trec-dd-polar/
>>>>>>>> 
>>>>>>>> From gen-common-crawl.sh, it looks like commoncrawldump is 
>>>>>>>> dumping the crawl segments to json files with the human 
>>>>>>>> readable and understandable content.
>>>>>>>> 1) I am trying to run one of the commands shown in
>>>>>>>> gen-common-crawl.sh on my side, but the generated files all end
>>>>>>>> with .html or .htm. The command listed in gen-common-crawl.sh
>>>>>>>> seems to allude to where the data is located on our
>>>>>>>> nsfpolardata.dyndns.org <http://nsfpolardata.dyndns.org/>.
>>>>>>>> Although the locations are not exactly correct (they probably
>>>>>>>> need to be updated), part of the patterns allowed me to locate
>>>>>>>> some similar datasets (e.g. /data2/crawls/raw/CS572Spring2015).
>>>>>>>> Again, the dumped files all end with .html, but surprisingly the
>>>>>>>> contents inside those output html files are in JSON format.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> The file extension is (almost) always the same as the original
>>>>>>>> file.
>>>>>>>> More in detail, using the -epochFilename command-line option
>>>>>>>> (as in gen-common-crawl.sh), the scraped data will be stored
>>>>>>>> with a filename of the format
>>>>>>>> <epochtime(milliseconds)>.<filetype>,
>>>>>>>> where <filetype> is either the extension of the original file
>>>>>>>> or .html as default if the original file does not have an
>>>>>>>> extension.
>>>>>>>> This schema is used for file naming and it does not depend on
>>>>>>>> the internal output format (JSON).
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 2) Another problem is that the root object is set with some
>>>>>>>> garbled chars in each of the output JSON files (with the html
>>>>>>>> extension at the end), PFA: garbled.jpg; one of the output JSON
>>>>>>>> files is also attached as an example (PFA: 1423894754000.html).
>>>>>>>> The JSON files cannot be parsed properly by aggregate.py because
>>>>>>>> of those garbled chars.
>>>>>>>> Even if I get rid of those garbled chars, there is no mimeTypes
>>>>>>>> element, which is what aggregate.py reads.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Text content and metadata extracted from the crawled binary
>>>>>>>> data are stored in a structured document format (JSON).
>>>>>>>> Furthermore, this document is encoded using CBOR
>>>>>>>> <http://cbor.io/> serialization. Each non-human-readable
>>>>>>>> character that you notice at the front and at the end of the JSON
>>>>>>>> data is due to the CBOR encoding.
>>>>>>>> Thus, if you need to read JSON data from a document dumped out by
>>>>>>>> CommonCrawlDataDumper, you have to deserialize the
>>>>>>>> CBOR-encoded data structure inside the file.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> I hope this short overview can help in your work. I really
>>>>>>>> appreciate your feedback and, by the way, thanks a lot for your
>>>>>>>> great job in detection.
>>>>>>>>
>>>>>>>> I am available to provide all the support I can give, so do not
>>>>>>>> hesitate to contact me if you need any further information.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> 
>>>>>>>> Giuseppe
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Finally, after some research, I guess that the statistical 
>>>>>>>> information (present in the readme of the code repo) is not 
>>>>>>>> being collected and computed by aggregate.py from those output 
>>>>>>>> json files but it looks like it is coming from the log.... see 
>>>>>>>> the following as an example:
>>>>>>>> 
>>>>>>>> 2015-04-19 04:55:42,078 INFO  tools.CommonCrawlDataDumper - 
>>>>>>>> CommonsCrawlDataDumper File Stats:
>>>>>>>> TOTAL Stats:
>>>>>>>> [
>>>>>>>>  {"mimeType":"application/x-tika-msoffice","count":"17"}
>>>>>>>>  {"mimeType":"application/vnd.ms-excel","count":"7"}
>>>>>>>>  {"mimeType":"application/xhtml+xml","count":"3000"}
>>>>>>>>  {"mimeType":"application/octet-stream","count":"641"}
>>>>>>>>  {"mimeType":"application/epub+zip","count":"1"}
>>>>>>>>  {"mimeType":"application/zip","count":"6"}
>>>>>>>>  {"mimeType":"application/xml","count":"11"}
>>>>>>>>  {"mimeType":"image/png","count":"110"}
>>>>>>>>  {"mimeType":"image/jpeg","count":"70"}
>>>>>>>>  {"mimeType":"application/atom+xml","count":"213"}
>>>>>>>>  {"mimeType":"application/rss+xml","count":"43"}
>>>>>>>>  {"mimeType":"video/mp4","count":"3"}
>>>>>>>>  {"mimeType":"text/plain","count":"104"}
>>>>>>>>  {"mimeType":"application/rdf+xml","count":"2"}
>>>>>>>>  {"mimeType":"image/gif","count":"2"}
>>>>>>>>  {"mimeType":"text/x-php","count":"1"}
>>>>>>>>  {"mimeType":"video/x-msvideo","count":"1"}
>>>>>>>>  {"mimeType":"application/x-tika-ooxml","count":"3"}
>>>>>>>>  {"mimeType":"text/html","count":"9506"}
>>>>>>>>  {"mimeType":"application/pdf","count":"280"}
>>>>>>>> ]
>>>>>>>> 
>>>>>>>> It turns out that aggregate.py is not the one that produces the
>>>>>>>> statistical information; I am not sure what it does. Anyway, I
>>>>>>>> think I understand the whole idea and I concur with it. Maybe we
>>>>>>>> can repackage Tika with the feature (i.e. probabilistic MIME
>>>>>>>> selection) incorporated and see if it outputs the same
>>>>>>>> information in the log as the one without it.
>>>>>>>> 
>>>>>>>> BTW, regarding the use of the probabilistic MIME selection
>>>>>>>> feature: in my pull request I added a simple test case that might
>>>>>>>> tell a bit more about how the feature is called and used.
>>>>>>>> Here is an example snippet:
>>>>>>>>     ProbabilisticMimeDetectionSelector probSel =
>>>>>>>>         new ProbabilisticMimeDetectionSelector();
>>>>>>>>     probSel.detect(input::InputStream, metadata::Metadata)
>>>>>>>> It is similar to MimeTypes::detect(...) (more information on this
>>>>>>>> can be found in https://issues.apache.org/jira/browse/TIKA-1517).
>>>>>>>> Now, to allow Tika().detect() to call
>>>>>>>> ProbabilisticMimeDetectionSelector::detect(...) (as
>>>>>>>> Tika().detect() is what commoncrawldump calls), we need to
>>>>>>>> modify/add some code in TikaConfig, which initializes a list of
>>>>>>>> default detectors: we need to remove the detector
>>>>>>>> mimeTypes::MimeTypes from the list and replace it with
>>>>>>>> probSel::ProbabilisticMimeDetectionSelector. (Not sure if I should
>>>>>>>> create another pull request with this change for TikaConfig.)
>>>>>>>> 
>>>>>>>> I think that is all of my initial thoughts, findings and plan. If
>>>>>>>> you have anything you would like to add or comment on, please
>>>>>>>> kindly let me know, and then I will start working on my 'finale'.
>>>>>>>> BTW, don't worry: my graduation is not the end of my involvement
>>>>>>>> with Tika and this project. After that I still can and want to
>>>>>>>> help this polar project and Tika as much as possible, correct
>>>>>>>> programming faults and bugs, respond to Tika issues, etc.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> Luke
>>>>>>>> 
>>>>>>>> -----Original Message-----
>>>>>>>> From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>>>>>>>> Sent: Saturday, April 18, 2015 6:26 AM
>>>>>>>> To: Luke liu; 'Mattmann, Chris A (3980)'; Bryant, Ann C 
>>>>>>>> (398G-Affiliate); Zimdars, Paul A (3980-Affiliate)
>>>>>>>> Cc: 'Luke'; NSF Polar CyberInfrastructure DR Students; 
>>>>>>>> memex-jpl@googlegroups.com
>>>>>>>> Subject: Re: this week action from luke
>>>>>>>> Importance: High
>>>>>>>> 
>>>>>>>> Awesome Luke. I am going to work specifically on now 
>>>>>>>> benchmarking your code in real situations. For example, it 
>>>>>>>> would be fantastic to now run your Bayesian MIME detector over 
>>>>>>>> the whole NSF TREC Dynamic Domain data for Polar described here:
>>>>>>>> 
>>>>>>>> http://github.com/chrismattmann/trec-dd-polar/
>>>>>>>> 
>>>>>>>> Paul Zimdars, CC'ed, can provide you with access to the data, 
>>>>>>>> and Annie can explain it, also CC'ed.
>>>>>>>> 
>>>>>>>> Can we make that your goal for the next 2 weeks to actually 
>>>>>>>> test it and produce a real result over the whole TREC-DD data 
>>>>>>>> for Polar? My goal will be to get your code committed and 
>>>>>>>> integrated into Tika.
>>>>>>>> The more you can write me a guide of how to build and test your 
>>>>>>>> code with Tika so I can get it committed the better.
>>>>>>>> 
>>>>>>>> Also CC'ing the Memex list for context. Note everyone: Luke is 
>>>>>>>> building a Bayesian MIME classifier to evaluate against Tika's 
>>>>>>>> existing MIME detection approach. If folks have any Memex needs 
>>>>>>>> to try and test more accurate file identification with Tika, 
>>>>>>>> Luke is the guy to talk to and I have him for 2 more weeks.
>>>>>>>> 
>>>>>>>> Thanks!
>>>>>>>> 
>>>>>>>> Cheers,
>>>>>>>> Chris
>>>>>>>> 
>>>>>>>> ------------------------
>>>>>>>> Chris Mattmann
>>>>>>>> chris.mattmann@gmail.com
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> -----Original Message-----
>>>>>>>> From: Luke liu <sh...@usc.edu>
>>>>>>>> Date: Thursday, April 16, 2015 at 11:29 PM
>>>>>>>> To: Chris Mattmann <ch...@gmail.com>, Chris Mattmann 
>>>>>>>> <Ch...@jpl.nasa.gov>
>>>>>>>> Cc: 'Luke' <ha...@gmail.com>
>>>>>>>> Subject: this week action from luke
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Hi Professor Mattmann,
>>>>>>>> 
>>>>>>>> I think I am in the final phase of the research, and last week 
>>>>>>>> I finished the last item in the list, and hopefully everything 
>>>>>>>> will be fine.
>>>>>>>> 
>>>>>>>> For now, I can probably spend some time verifying or
>>>>>>>> optimizing the code; the majority of the research has been
>>>>>>>> done. It would also be great if you could comment on my work (the
>>>>>>>> 2 pull requests) when you have time.
>>>>>>>>
>>>>>>>> If anything in my work is confusing, please also let me know.
>>>>>>>>
>>>>>>>> Thanks, and I am glad to be working with you. For the next couple
>>>>>>>> of weeks before graduation, I am going to continue revising and
>>>>>>>> testing the code and features to get rid of flaws (if any) when I
>>>>>>>> have time.
>>>>>>>>
>>>>>>>> I am not sure if I have missed something; if I have missed
>>>>>>>> something important, please let me know too.
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> Luke
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> You received this message because you are subscribed to the 
>>>>>>>> Google Groups "JPL-Kitware-Continuum Memex Group" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from 
>>>>>>>> it, send an email to memex-jpl+unsubscribe@googlegroups.com
>>>>>>>> <ma...@googlegroups.com>.
>>>>>>>> To post to this group, send email to memex-jpl@googlegroups.com.
>>>>>>>> Visit this group at http://groups.google.com/group/memex-jpl.
>>>>>>>> To view this discussion on the web visit
>>>>>>>> https://groups.google.com/d/msgid/memex-jpl/000f01d07ad4%24b3510070%2419f30150%24%40edu.
>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>> <garbled.jpg><1423894754000.html>
> 
>

RE: [memex-jpl] this week action from luke

Posted by Luke <ha...@gmail.com>.

Hi Prof,

The test has finished, and the result is as expected.
Both builds (Tika with the prob feature and the one without it) produced the
same "Stats total"; please see the attached matched.txt dumped by a small
program that compares, verbatim, each line in every section of the
"Stats total" between the log produced by the Tika that has the feature and
the one without it.
If string.equals(...) is satisfied, the line is dumped out. If there is a
mismatch (e.g. the count for a particular mime type is different), an error
is dumped out. I don't see any error in the printout, so I think the feature
seems to have passed the test.
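
The line-by-line check described above can be sketched roughly as follows.
This is only a minimal illustration, not the actual program that produced
matched.txt: the class name StatsCompare, the method compare and the ERROR
output format are assumptions, and the mismatching count in main is made up
to show the error path.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StatsCompare {
    // Compares two "Stats total" sections line by line, verbatim.
    // Emits the line itself on a match and an ERROR entry on a mismatch.
    public static List<String> compare(List<String> withFeature, List<String> withoutFeature) {
        List<String> report = new ArrayList<>();
        int n = Math.max(withFeature.size(), withoutFeature.size());
        for (int i = 0; i < n; i++) {
            // A line missing on either side is treated as a mismatch.
            String a = i < withFeature.size() ? withFeature.get(i) : null;
            String b = i < withoutFeature.size() ? withoutFeature.get(i) : null;
            if (a != null && a.equals(b)) {
                report.add(a);
            } else {
                report.add("ERROR line " + (i + 1) + ": [" + a + "] != [" + b + "]");
            }
        }
        return report;
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList(
            "{\"mimeType\":\"text/html\",\"count\":\"9506\"}",
            "{\"mimeType\":\"application/pdf\",\"count\":\"280\"}");
        List<String> b = Arrays.asList(
            "{\"mimeType\":\"text/html\",\"count\":\"9506\"}",
            "{\"mimeType\":\"application/pdf\",\"count\":\"279\"}"); // made-up mismatch
        for (String line : compare(a, b)) {
            System.out.println(line);
        }
    }
}
```

A verbatim comparison like this assumes both logs list the mime types in the
same order; if the order could differ, the lines would need to be keyed by
mime type first.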


The processing time of the 2 tests is as follows.
The following shows the start time and end time for the test where the Nutch
dumper tool used the Tika with the prob selection feature.
from
2015-04-22 15:47:08,330
to
2015-04-22 17:48:28,877

The following shows the start time and end time for the test where the Nutch
dumper tool used the Tika without the feature.
from
2015-04-22 22:41:23,459
to
2015-04-23 00:11:02,767
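
For convenience, the elapsed time of each run can be worked out from the
timestamps above. The sketch below is only an illustration (the class name
ElapsedTime and the method between are assumptions); it assumes both
timestamps of a run are in the same local time zone. Each configuration was
run only once here, so the numbers are a rough indication rather than a
benchmark.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class ElapsedTime {
    // Log timestamps use a comma before the milliseconds, e.g. 2015-04-22 15:47:08,330
    private static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    public static Duration between(String start, String end) {
        return Duration.between(LocalDateTime.parse(start, FMT), LocalDateTime.parse(end, FMT));
    }

    public static void main(String[] args) {
        Duration withFeature = between("2015-04-22 15:47:08,330", "2015-04-22 17:48:28,877");
        Duration withoutFeature = between("2015-04-22 22:41:23,459", "2015-04-23 00:11:02,767");
        System.out.println("with feature:    " + withFeature.toMillis() + " ms");
        System.out.println("without feature: " + withoutFeature.toMillis() + " ms");
    }
}
```

That works out to about 2 h 1 min 20.5 s with the feature and about
1 h 29 min 39.3 s without it.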


BTW, I forgot to mention that the probabilistic mime selector with default
weight settings also gives the following result, because by default I
intentionally assign a higher weight to the magic bytes method so as
to make it work in a way similar to the old strategy. On the other hand, if
I know that the extension is more reliable, I can certainly add more weight
to the extension approach; in this case, the prob mime selector will return
application/cbor with a higher weight value.

> <match value="&lt;html xmlns=" type="string" offset="0:1024"/>
> Result: "text/html"
> 
> <match value="&lt;html xmlns=" type="string" offset="0:6000"/>
> Result: "application/xhtml+xml"
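
To illustrate the weighting idea only: the toy sketch below uses a plain
weighted vote between the magic-bytes result and the extension result. It is
not the logic of ProbabilisticMimeDetectionSelector, which is not reproduced
here; the class name WeightedVote, the method pick and the weight values are
all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class WeightedVote {
    // Each detection method votes for one type with its weight; the type with
    // the largest summed weight wins. On a tie the earlier vote (magic) wins.
    public static String pick(String magicType, double magicWeight,
                              String extensionType, double extensionWeight) {
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.merge(magicType, magicWeight, Double::sum);
        scores.merge(extensionType, extensionWeight, Double::sum);
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (e.getValue() > bestScore) {
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Magic bytes trusted more (hypothetical weights): magic result wins.
        System.out.println(pick("application/xhtml+xml", 0.75, "application/cbor", 0.25));
        // Extension trusted more (hypothetical weights): extension result wins.
        System.out.println(pick("application/xhtml+xml", 0.25, "application/cbor", 0.75));
    }
}
```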


Please kindly let me know if anything about the tests is unclear.


Thanks
Luke

-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov] 
Sent: Wednesday, April 22, 2015 3:49 PM
To: Luke
Cc: Chris Mattmann; Totaro, Giuseppe U (3980-Affiliate);
dev@tika.apache.org; Bryant, Ann C (398G-Affiliate); Zimdars, Paul A
(3980-Affiliate); NSF Polar CyberInfrastructure DR Students;
memex-jpl@googlegroups.com
Subject: Re: [memex-jpl] this week action from luke

Thanks Luke, this is probably a good opportunity to test out your Bayesian
mime detector. How does it perform here?

Sent from my iPhone

> On Apr 22, 2015, at 3:29 PM, Luke <ha...@gmail.com> wrote:
> 
> Hi professor,
> 
> Please see the following results.
> <match value="&lt;html xmlns=" type="string" offset="0:1024"/>
> Result: "text/html"
> 
> <match value="&lt;html xmlns=" type="string" offset="0:6000"/>
> Result: "application/xhtml+xml"
> 
> 
> Thanks
> Luke
> 
> -----Original Message-----
> From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
> Sent: Wednesday, April 22, 2015 4:21 AM
> To: Luke; 'Mattmann, Chris A (3980)'; 'Totaro, Giuseppe U 
> (3980-Affiliate)'; dev@tika.apache.org
> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A 
> (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; 
> memex-jpl@googlegroups.com
> Subject: Re: [memex-jpl] this week action from luke
> 
> Hi Luke,
> 
> Actually I just meant go into tika-mimetypes.xml and change the magic
> offsets for application/xhtml+xml and see if that works. The code you
> changed below is actually how many bytes Tika will first download to do
> MIME checking.
> 
> Cheers,
> Chris
> 
> ------------------------
> Chris Mattmann
> chris.mattmann@gmail.com
> 
> 
> 
> 
> -----Original Message-----
> From: Luke <ha...@gmail.com>
> Date: Wednesday, April 22, 2015 at 2:25 AM
> To: Chris Mattmann <ch...@gmail.com>, Chris Mattmann
> <Ch...@jpl.nasa.gov>, "'Totaro, Giuseppe U (3980-Affiliate)'"
> <to...@di.uniroma1.it>, <de...@tika.apache.org>
> Cc: "'Bryant, Ann C (398G-Affiliate)'" <an...@gmail.com>, 
> "'Zimdars, Paul A (3980-Affiliate)'" <Pa...@jpl.nasa.gov>, 
> NSF Polar CyberInfrastructure DR Students 
> <ns...@googlegroups.com>,
> <me...@googlegroups.com>
> Subject: RE: [memex-jpl] this week action from luke
> 
>> 
>> Hi professor,
>> 
>> I just tried it with minLength set to 1024, and I get the following:
>> "text/plain"
>> I am a bit surprised....
>> 
>> BTW, a min length of 6000 still gives "application/xhtml+xml"; with 
>> any min length below 1024, I am seeing "text/plain". :)
>> 
>> BTW, the min length I am referring to/altering is as follows, in 
>> MimeTypes.java:
>>    public int getMinLength() {
>>       // This needs to be reasonably large to be able to correctly detect
>>       // things like XML root elements after initial comment and DTDs
>>       return 64 * 1024;
>>    }
>> 
>> 
>> Thanks
>> Luke
>> 
>> -----Original Message-----
>> From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>> Sent: Tuesday, April 21, 2015 7:48 PM
>> To: Luke; 'Mattmann, Chris A (3980)'; 'Totaro, Giuseppe U 
>> (3980-Affiliate)'; dev@tika.apache.org
>> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A 
>> (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; 
>> memex-jpl@googlegroups.com
>> Subject: Re: [memex-jpl] this week action from luke
>> 
>> Thanks Luke.
>> 
>> So I guess all I was asking was could you try it out. Thanks for the 
>> lesson in the RFC.
>> 
>> Cheers,
>> Chris
>> 
>> ------------------------
>> Chris Mattmann
>> chris.mattmann@gmail.com
>> 
>> 
>> 
>> 
>> -----Original Message-----
>> From: Luke <ha...@gmail.com>
>> Date: Wednesday, April 22, 2015 at 1:46 AM
>> To: Chris Mattmann <Ch...@jpl.nasa.gov>, Chris Mattmann 
>> <ch...@gmail.com>, "'Totaro, Giuseppe U (3980-Affiliate)'"
>> <to...@di.uniroma1.it>, <de...@tika.apache.org>
>> Cc: "'Bryant, Ann C (398G-Affiliate)'" <an...@gmail.com>, 
>> "'Zimdars, Paul A (3980-Affiliate)'" <Pa...@jpl.nasa.gov>, 
>> NSF Polar CyberInfrastructure DR Students 
>> <ns...@googlegroups.com>,
>> <me...@googlegroups.com>
>> Subject: RE: [memex-jpl] this week action from luke
>> 
>>> Hi professor,
>>> 
>>> 
>>> I think it highly depends on the content being read by Tika. E.g. if 
>>> there is a sequence of bytes in the file being read that matches 
>>> the magic of one or more of the mime types defined in our 
>>> tika-mimetypes.xml, I guess that Tika will put those types in its 
>>> estimation list; please note there could be multiple mime types 
>>> estimated by the magic-byte detection approach. Tika also considers 
>>> the decision made by the extension detection approach: if the 
>>> extension says the file type it believes in is the first one in the 
>>> magic type estimation list, then certainly the first one will be 
>>> returned (the same applies to the metadata hint approach). Of course, 
>>> Tika also prefers the type that is the most specialized.
>>> 
>>> Let's get back to the following question; here is my guess though.
>>> [Prof]: Also what happens if you tweak the definition of XHTML to 
>>> not scan until 8192, but say 6000 (e.g., 0:6000), does CBOR take over
>>> then?
>>> Let's consider an extreme case where we only scan 10 or 1 bytes; 
>>> then it seems that magic bytes will inevitably detect nothing, and I 
>>> think it will return something like "application/octet-stream", 
>>> which is the most general type. As mentioned, Tika favours the type 
>>> that is the most specialized, so if the extension approach returns 
>>> one that is more specialized, it wins; in this extreme case I believe 
>>> almost every type is a subclass of "application/octet-stream".... 
>>> Therefore the answer in this extreme case may be yes; I think it is 
>>> very possible that the CBOR type detected by the extension approach 
>>> takes over in this case...
>>> 
>>> My idea was and still is that if the CBOR self-describing tag 55799 
>>> is present in the cbor file, then that can be used to detect the cbor
>>> type.
>>> Again, the cbor type will probably be appended to the magic 
>>> estimation list together with another one such as application/html; 
>>> I guess the order in the list probably also matters, with the first 
>>> one preferred over the next one. The decision from the extension 
>>> detection approach also plays a role in breaking the tie,
>>> e.g. if the extension detection method agrees on cbor with one of the 
>>> estimated types in the magic list, then cbor will be returned 
>>> (again, the same thing applies to the metadata hint method).
>>> 
>>> I have not taken a closer look at a cbor file that has the tag 
>>> 55799, but I expect its hex to be something like 0xd9d9f7, i.e. the 
>>> tag should be present in the header as a fixed sequence of bytes
>>> (https://tools.ietf.org/html/rfc7049#section-2.4.5). If this 
>>> is present in the file, or preferably in the header (within a 
>>> reasonable range of bytes), I believe it can probably be used as 
>>> the magic number for the cbor type.
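
As a rough sketch of the idea above: RFC 7049 section 2.4.5 gives 0xd9d9f7
as the serialization of tag 55799, so a check at offset 0 could look like
the following. The class name CborSelfDescribe and the method name are
hypothetical, and this is not how Tika implements magic detection (Tika
reads magic definitions from tika-mimetypes.xml).

```java
public class CborSelfDescribe {
    // Serialization of CBOR tag 55799 (RFC 7049, section 2.4.5).
    private static final byte[] MAGIC = {(byte) 0xd9, (byte) 0xd9, (byte) 0xf7};

    // Returns true if the data begins with the self-describe tag bytes.
    public static boolean startsWithSelfDescribeTag(byte[] data) {
        if (data == null || data.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (data[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] tagged = {(byte) 0xd9, (byte) 0xd9, (byte) 0xf7, 0x61, 0x61};
        byte[] untagged = {0x7b, 0x22, 0x61, 0x22};
        System.out.println(startsWithSelfDescribeTag(tagged));
        System.out.println(startsWithSelfDescribeTag(untagged));
    }
}
```

A file written without the tag would not match, so this only helps once the
producer actually emits the tag.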
>>> 
>>> 
>>> There is another thing I mentioned in the jira ticket I opened 
>>> yesterday against the cbor parser and detection: it is also possible 
>>> that cbor content can be embedded inside a plain json file, and the 
>>> way that a decoder can distinguish them in that file is by looking at 
>>> the tag 55799 again. This may rarely happen, but a robust parser 
>>> might be able to take care of that; Tika might need to consider the 
>>> use of FasterXML, which is used by the nutch tool, when developing the
>>> cbor parser...
>>> Again let me cite the same paragraph from the rfc,
>>> 
>>> " a decoder might be able to parse both CBOR and JSON.
>>>  Such a decoder would need to mechanically distinguish the two  
>>> formats.  An easy way for an encoder to help the decoder would be to  
>>> tag the entire CBOR item with tag 55799, the serialization of which  
>>> will never be found at the beginning of a JSON text."
>>> 
>>> 
>>> Thanks
>>> Luke
>>> 
>>> 
>>> 
>>> -----Original Message-----
>>> From: Mattmann, Chris A (3980) 
>>> [mailto:chris.a.mattmann@jpl.nasa.gov]
>>> Sent: Tuesday, April 21, 2015 9:49 PM
>>> To: Luke; 'Chris Mattmann'; Totaro, Giuseppe U (3980-Affiliate)
>>> Cc: Bryant, Ann C (398G-Affiliate); Zimdars, Paul A 
>>> (3980-Affiliate); 'NSF Polar CyberInfrastructure DR Students'; 
>>> memex-jpl@googlegroups.com
>>> Subject: Re: [memex-jpl] this week action from luke
>>> 
>>> Hi Luke,
>>> 
>>> Can you post the below conversation to dev@tika and summarize it there.
>>> Also what happens if you tweak the definition of XHTML to not scan 
>>> until 8192, but say 6000 (e.g., 0:6000), does CBOR take over then?
>>> 
>>> Cheers,
>>> Chris
>>> 
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Chris Mattmann, Ph.D.
>>> Chief Architect
>>> Instrument Software and Science Data Systems Section (398) NASA Jet 
>>> Propulsion Laboratory Pasadena, CA 91109 USA
>>> Office: 168-519, Mailstop: 168-527
>>> Email: chris.a.mattmann@nasa.gov
>>> WWW:  http://sunset.usc.edu/~mattmann/
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Adjunct Associate Professor, Computer Science Department University 
>>> of Southern California, Los Angeles, CA 90089 USA
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> -----Original Message-----
>>> From: Luke <ha...@gmail.com>
>>> Date: Wednesday, April 22, 2015 at 12:19 AM
>>> To: Chris Mattmann <ch...@gmail.com>, "Totaro, Giuseppe U 
>>> (3980-Affiliate)" <to...@di.uniroma1.it>, Chris Mattmann 
>>> <Ch...@jpl.nasa.gov>
>>> Cc: "Bryant, Ann C (398G-Affiliate)" <an...@gmail.com>, 
>>> "Zimdars, Paul A (3980-Affiliate)" <Pa...@jpl.nasa.gov>, 
>>> NSF Polar CyberInfrastructure DR Students 
>>> <ns...@googlegroups.com>,
>>> "memex-jpl@googlegroups.com" <me...@googlegroups.com>
>>> Subject: RE: [memex-jpl] this week action from luke
>>> 
>>>> Hi Professor,
>>>> Please see attached jpg for the difference.
>>>> Thanks
>>>> Luke
>>>> 
>>>> -----Original Message-----
>>>> From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>>>> Sent: Tuesday, April 21, 2015 5:27 PM
>>>> To: Luke; 'Giuseppe Totaro'; 'Mattmann, Chris A (3980)'
>>>> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A 
>>>> (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; 
>>>> memex-jpl@googlegroups.com
>>>> Subject: Re: [memex-jpl] this week action from luke
>>>> 
>>>> Hey Luke, what happens if you do java -jar /path/to/tika-app -m 
>>>> /path/to/cbor/file.cbor, compared to: java -jar /path/to/tika-app 
>>>> -m < /path/to/cbor/file.cbor? Any difference?
>>>> 
>>>> ------------------------
>>>> Chris Mattmann
>>>> chris.mattmann@gmail.com
>>>> 
>>>> 
>>>> 
>>>> 
>>>> -----Original Message-----
>>>> From: Luke <ha...@gmail.com>
>>>> Date: Tuesday, April 21, 2015 at 5:41 PM
>>>> To: 'Luke' <ha...@gmail.com>, Chris Mattmann 
>>>> <ch...@gmail.com>, 'Giuseppe Totaro'
>>>> <to...@di.uniroma1.it>, Chris Mattmann 
>>>> <Ch...@jpl.nasa.gov>
>>>> Cc: "'Bryant, Ann C (398G-Affiliate)'" <an...@gmail.com>, 
>>>> "'Zimdars, Paul A (3980-Affiliate)'" <Pa...@jpl.nasa.gov>, 
>>>> NSF Polar CyberInfrastructure DR Students 
>>>> <ns...@googlegroups.com>,
>>>> <me...@googlegroups.com>
>>>> Subject: RE: [memex-jpl] this week action from luke
>>>> 
>>>>> Hi professor,
>>>>> I just sent a pull request for adding cbor extension.
>>>>> The interesting thing is that Tika is still identifying the file 
>>>>> dumped by the nutch dump tool as "application/xhtml+xml" even 
>>>>> when I manually change the file extension to the correct one (i.e.
>>>>> *.cbor).
>>>>> 
>>>>> The reason is probably that Tika is identifying 
>>>>> "application/xhtml+xml" by searching for "&lt;html" in the file 
>>>>> content (PFA: xhtml+xml.jpg). Now if you take a look at the cbor 
>>>>> file dumped by nutch, you see that we do have that element as part 
>>>>> of the cbor content, because the entire crawled xhtml document 
>>>>> seems to be embedded in the cbor json (PFA: cbor.jpg); also, in 
>>>>> Tika, magic detection seems to have higher priority than glob 
>>>>> detection, thus the type is being incorrectly detected.
>>>>> 
>>>>> Therefore, I would like to mention that adding the entry 
>>>>> <glob pattern="*.cbor"/> does not resolve the issue as of now 
>>>>> without some fixed magic bytes / patterns for cbor.
>>>>> I would also like to add that things will be different with our 
>>>>> probabilistic mime detection selector, because if we know that the 
>>>>> file extension is more reliable than magic bytes, then we can 
>>>>> certainly add more preferential weight to the extension... This 
>>>>> also might show that the current MimeTypes detection 
>>>>> implementation is a bit stiff or less flexible in this scenario. :)
>>>>> 
>>>>> 
>>>>> Thanks
>>>>> Luke
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Luke [mailto:hanson311biz@gmail.com]
>>>>> Sent: Tuesday, April 21, 2015 12:14 PM
>>>>> To: 'Chris Mattmann'; 'Giuseppe Totaro'; 'Mattmann, Chris A (3980)'
>>>>> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A 
>>>>> (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; 
>>>>> 'memex-jpl@googlegroups.com'
>>>>> Subject: RE: [memex-jpl] this week action from luke
>>>>> 
>>>>> Yes, let me add the cbor extension entry in tika xml, will send 
>>>>> the pull request soon.
>>>>> 
>>>>> Thanks
>>>>> Luke
>>>>> -----Original Message-----
>>>>> From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>>>>> Sent: Tuesday, April 21, 2015 6:51 AM
>>>>> To: Giuseppe Totaro; Mattmann, Chris A (3980)
>>>>> Cc: Luke; Bryant, Ann C (398G-Affiliate); Zimdars, Paul A 
>>>>> (3980-Affiliate); NSF Polar CyberInfrastructure DR Students; 
>>>>> memex-jpl@googlegroups.com
>>>>> Subject: Re: [memex-jpl] this week action from luke
>>>>> 
>>>>> Giuseppe, both of these ideas (supporting the CBOR WRITE_TYPE_HEADER 
>>>>> and tag, along with adding an -extension command) would be fantastic.
>>>>> Can you file both of those NUTCH issues, wait a day or so, and 
>>>>> then based on feedback use your new Nutch commit karma to get 
>>>>> those into Nutch?
>>>>> 
>>>>> And then when creating the issues, can you link to the TIKA-1610
>>>>> issue?
>>>>> At that point, when those two to-be-defined NUTCH issues are up, 
>>>>> Luke, in parallel can you throw up a pull request/patch in Tika 
>>>>> for the extension along with the MIME detection?
>>>>> 
>>>>> Cheers,
>>>>> Chris
>>>>> 
>>>>> ------------------------
>>>>> Chris Mattmann
>>>>> chris.mattmann@gmail.com
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Giuseppe Totaro <to...@di.uniroma1.it>
>>>>> Date: Tuesday, April 21, 2015 at 12:33 PM
>>>>> To: Chris Mattmann <Ch...@jpl.nasa.gov>
>>>>> Cc: Luke <ha...@gmail.com>, Chris Mattmann 
>>>>> <ch...@gmail.com>, "Bryant, Ann C (398G-Affiliate)"
>>>>> <an...@gmail.com>, "Zimdars, Paul A (3980-Affiliate)"
>>>>> <Pa...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR 
>>>>> Students <ns...@googlegroups.com>,
>>>>> "memex-jpl@googlegroups.com"
>>>>> <me...@googlegroups.com>
>>>>> Subject: Re: [memex-jpl] this week action from luke
>>>>> 
>>>>>> Thanks Luke. Great work.
>>>>>> Chris, we wrap a single string value, representing the JSON text, 
>>>>>> for each file into CBOR (by using serializeCBORData method). For 
>>>>>> instance, using the Unix hex dump tool, we can see that, as 
>>>>>> expected, the first byte of all files is "0x7F" (the first three 
>>>>>> bits are "011", that is the major type for strings, and the 
>>>>>> following 5 bits are "11010", meaning a uint32_t encodes the 
>>>>>> length of following text), and the following 4 bytes 
>>>>>> (single-precision
>>>>>> float) encodes the right length of file (as described in RFC7049 
>>>>>> <http://tools.ietf.org/html/rfc7049>).
>>>>>> Therefore, a CBOR tag is currently included into the file (a list 
>>>>>> of cbor tags is available here 
>>>>>> <http://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml>).
>>>>>> I did not know about CBOR "magic header". Thanks a lot Luke for 
>>>>>> this great research. Chris, if you agree, I can add support for 
>>>>>> prepending self-describing CBOR tag 55799 to the 
>>>>>> CommonCrawlDataDumper class. I believe it is very easy because I 
>>>>>> only have to enable the WRITE_TYPE_HEADER feature for the 
>>>>>> CBORGenerator class (the source code is available here 
>>>>>> <https://github.com/FasterXML/jackson-dataformat-cbor/blob/master/src/main/java/com/fasterxml/jackson/dataformat/cbor/CBORGenerator.java>).
>>>>>> Then, I can comment on the TIKA-1610 
>>>>>> <https://issues.apache.org/jira/browse/TIKA-1610> issue.
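
The initial-byte layout described in the message above can be checked with a
small sketch. Per RFC 7049, the top three bits of the initial byte are the
major type and the low five bits are the additional information. Note that
the bit pattern 011 11010 corresponds to 0x7A (major type 3, additional
information 26, i.e. a 4-byte length follows), while 0x7F is 011 11111
(major type 3, additional information 31, an indefinite-length text string).
The sketch decodes both and makes no claim about which byte the dumped files
actually start with; the class name CborInitialByte is hypothetical.

```java
public class CborInitialByte {
    // Major type: the top 3 bits of the initial byte.
    public static int majorType(int b) {
        return (b & 0xFF) >>> 5;
    }

    // Additional information: the low 5 bits of the initial byte.
    public static int additionalInfo(int b) {
        return b & 0x1F;
    }

    public static void main(String[] args) {
        int[] samples = {0x7A, 0x7F};
        for (int b : samples) {
            System.out.printf("0x%02X -> major type %d, additional info %d%n",
                b, majorType(b), additionalInfo(b));
        }
    }
}
```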
>>>>>> 
>>>>>> Regarding the file extension, in the Memex CCA format the 
>>>>>> original file extension is used. We could add support for a 
>>>>>> -extension command-line option allowing the user to give a file 
>>>>>> extension (e.g.,
>>>>>> cbor) for all files dumped out.
>>>>>> 
>>>>>> Thanks a lot,
>>>>>> Giuseppe
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Tue, Apr 21, 2015 at 7:31 AM, Mattmann, Chris A (3980) 
>>>>>> <ch...@jpl.nasa.gov> wrote:
>>>>>> 
>>>>>> Thanks for this great research, Luke!
>>>>>> 
>>>>>> Giuseppe, any idea why this tag doesn't make it into the file?
>>>>>> 
>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>> Chris Mattmann, Ph.D.
>>>>>> Chief Architect
>>>>>> Instrument Software and Science Data Systems Section (398) NASA 
>>>>>> Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>>>> Office: 168-519, Mailstop: 168-527
>>>>>> Email: chris.a.mattmann@nasa.gov
>>>>>> WWW:  http://sunset.usc.edu/~mattmann/
>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>> Adjunct Associate Professor, Computer Science Department 
>>>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: Luke <ha...@gmail.com>
>>>>>> Date: Tuesday, April 21, 2015 at 2:55 AM
>>>>>> To: Chris Mattmann <ch...@gmail.com>, "Totaro, Giuseppe 
>>>>>> U (3980-Affiliate)" <to...@di.uniroma1.it>, Chris Mattmann 
>>>>>> <Ch...@jpl.nasa.gov>, "Bryant, Ann C (398G-Affiliate)"
>>>>>> <an...@gmail.com>, "Zimdars, Paul A (3980-Affiliate)"
>>>>>> <Pa...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR 
>>>>>> Students <ns...@googlegroups.com>,
>>>>>> "memex-jpl@googlegroups.com"
>>>>>> <me...@googlegroups.com>
>>>>>> Subject: RE: [memex-jpl] this week action from luke
>>>>>> 
>>>>>>> Thanks professor.
>>>>>>> Hi professor and all.
>>>>>>> JIRA issue : CBOR Parser and detection improvement
>>>>>>> https://issues.apache.org/jira/browse/TIKA-1610
>>>>>>> 
>>>>>>> I tried to do a bit of research on this cbor detection.
>>>>>>> 
>>>>>>> It looks like there is a self-describing tag that needs to be 
>>>>>>> written in the cbor file, through which other applications might 
>>>>>>> be able to identify the cbor type....
>>>>>>> Please refer to http://tools.ietf.org/html/rfc7049#section-2.4.5
>>>>>>> 
>>>>>>> I don't see that tag being present in the cbor file dumped by 
>>>>>>> the nutch tool; I am not very sure though.
>>>>>>> 
>>>>>>> Thanks
>>>>>>> Luke
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>>>>>>> Sent: Monday, April 20, 2015 4:10 AM
>>>>>>> To: Luke; 'Giuseppe Totaro'; 'Chris Mattmann'; 'Bryant, Ann C 
>>>>>>> (398G-Affiliate)'; 'Zimdars, Paul A (3980-Affiliate)'; 'NSF 
>>>>>>> Polar CyberInfrastructure DR Students'; 
>>>>>>> memex-jpl@googlegroups.com
>>>>>>> Subject: Re: [memex-jpl] this week action from luke
>>>>>>> 
>>>>>>> Nice one, Luke. If you have a second and you can open up an 
>>>>>>> issue in Tika to make it support CBOR, then yes, by all means! 
>>>>>>> :)
>>>>>>> 
>>>>>>> 
>>>>>>> ------------------------
>>>>>>> Chris Mattmann
>>>>>>> chris.mattmann@gmail.com
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Luke <ha...@gmail.com>
>>>>>>> Date: Monday, April 20, 2015 at 4:15 AM
>>>>>>> To: 'Giuseppe Totaro' <to...@di.uniroma1.it>, Chris Mattmann 
>>>>>>> <ch...@gmail.com>, Chris Mattmann 
>>>>>>> <Ch...@jpl.nasa.gov>, "'Bryant, Ann C (398G-Affiliate)'"
>>>>>>> <an...@gmail.com>, "'Zimdars, Paul A (3980-Affiliate)'"
>>>>>>> <Pa...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR 
>>>>>>> Students <ns...@googlegroups.com>,
>>>>>>> <me...@googlegroups.com>
>>>>>>> Subject: RE: [memex-jpl] this week action from luke
>>>>>>> 
>>>>>>>> Thanks a lot Giuseppe for the prompt response clearing up a bit 
>>>>>>>> of my confusion with the Nutch CommonCrawlDataDumper, appreciated.
>>>>>>>> 
>>>>>>>> BTW, it looks like Tika might need to consider supporting 
>>>>>>>> CBOR parsing and detection.
>>>>>>>> I checked the rfc, and it looks like CBOR has no magic numbers.
>>>>>>>> PFA:
>>>>>>>> rfc_cbor.jpg
>>>>>>>> Actually, I don't quite understand why the 
>>>>>>>> CommonCrawlDataDumper is not dumping the nutch segments with 
>>>>>>>> the .cbor extension, which seems to be helpful for type detection.
>>>>>>>> 
>>>>>>>> To professor Mattmann,
>>>>>>>> Tika does not support the detection of CBOR. Although the trunk 
>>>>>>>> version has entries (PFA: cbor_tika.mimetypes.xml) for cbor 
>>>>>>>> in tika-mimetypes.xml, those entries do not properly detect 
>>>>>>>> the cbor files dumped by CommonCrawlDataDumper. Also, 
>>>>>>>> CBOR does not have magic bytes; off the top of my head, the only 
>>>>>>>> ways we can detect it are the extension and a content byte 
>>>>>>>> histogram (please note, this is a locally optimal solution and
>>>>>>>> data-dependent). :)
>>>>>>>> 
>>>>>>>> I think I am deviating a bit from the main route and discussion 
>>>>>>>> of this thread.... i.e. the plan for testing the "probabilistic 
>>>>>>>> mime detector selection" with polar data.
>>>>>>>> Anyway, I plan to repackage Tika by incorporating the 
>>>>>>>> probabilistic selection feature, replace the tika jar in 
>>>>>>>> nutch with the repackaged one, and then run the 
>>>>>>>> CommonCrawlDataDumper and see how it goes. If you have any 
>>>>>>>> specific ideas or thoughts on the testing, please kindly let me
>>>>>>>> know.
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> Luke
>>>>>>>> 
>>>>>>>> From: Giuseppe Totaro [mailto:totaro@di.uniroma1.it]
>>>>>>>> Sent: Sunday, April 19, 2015 11:17 PM
>>>>>>>> To: Luke liu
>>>>>>>> Cc: Chris Mattmann; Chris Mattmann; Bryant, Ann C 
>>>>>>>> (398G-Affiliate); Zimdars, Paul A (3980-Affiliate); Luke; NSF 
>>>>>>>> Polar CyberInfrastructure DR Students; 
>>>>>>>> memex-jpl@googlegroups.com
>>>>>>>> Subject: Re: [memex-jpl] this week action from luke
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Hi Luke,
>>>>>>>> 
>>>>>>>> 
>>>>>>>> my name is Giuseppe and I am a PhD student working under the 
>>>>>>>> supervision of Prof. Chris Mattmann. I worked on 
>>>>>>>> CommonCrawlDataDumper tool, so I can give some feedback on a 
>>>>>>>> couple of your observations. My comments inline below.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Il giorno 19/apr/2015, alle ore 12:11, Luke liu 
>>>>>>>> <sh...@usc.edu> ha
>>>>>>>> scritto:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Thanks a lot professor; sorry for the brief delay, I was 
>>>>>>>> spending some time understanding the code repo, i.e.
>>>>>>>> http://github.com/chrismattmann/trec-dd-polar/
>>>>>>>> 
>>>>>>>> From gen-common-crawl.sh, it looks like commoncrawldump is 
>>>>>>>> dumping the crawl segments to json files with human-readable 
>>>>>>>> and understandable content.
>>>>>>>> 1) I am trying to run one of the commands shown in 
>>>>>>>> gen-common-crawl.sh on my side, but the generated files all end 
>>>>>>>> with .html or .htm. The commands listed in gen-common-crawl.sh 
>>>>>>>> seem to allude to where the data is located on our 
>>>>>>>> nsfpolardata.dyndns.org <http://nsfpolardata.dyndns.org/>; 
>>>>>>>> although the locations are not exactly correct (probably they 
>>>>>>>> need to be updated), part of the patterns allowed me to locate 
>>>>>>>> some similar datasets (e.g. /data2/crawls/raw/CS572Spring2015). 
>>>>>>>> Again I am seeing that the dumped files all end with html, but 
>>>>>>>> surprisingly, inside those output html files, the contents are 
>>>>>>>> in json format;
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> The file extension is (almost) always the same as the original
>>>>>>>> file.
>>>>>>>> In more detail, using the -epochFilename command-line option 
>>>>>>>> (as in gen-common-crawl.sh), the scraped data will be stored 
>>>>>>>> with a filename of the format 
>>>>>>>> <epochtime(milliseconds)>.<filetype>,
>>>>>>>> where <filetype> is either the extension of the original file 
>>>>>>>> or .html as default if the original file does not have an
>>>>>>>> extension.
>>>>>>>> This scheme is used for file naming and it does not depend on 
>>>>>>>> the internal output format (JSON).
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 2) Another problem is that the root object is being set with 
>>>>>>>> some garbled chars in each of the output json files (with 
>>>>>>>> extension html at the end), PFA: garbled.jpg; one of the 
>>>>>>>> output json files has also been attached as an example (PFA:
>>>>>>>> 1423894754000.html). The json files cannot be parsed properly 
>>>>>>>> by aggregate.py due to those garbled chars.
>>>>>>>> Even if I get rid of those garbled chars, there is no 
>>>>>>>> mimeTypes element, which is what aggregate.py reads.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Text content and metadata extracted from the crawled binary 
>>>>>>>> data are stored in a structured document format (JSON). 
>>>>>>>> Furthermore, this document is encoded using CBOR 
>>>>>>>> <http://cbor.io/> serialization. Each non-human-readable 
>>>>>>>> character that you notice at the front and at the end of the 
>>>>>>>> JSON data is due to CBOR encoding.
>>>>>>>> Thus, if you need to read JSON data from a document dumped out 
>>>>>>>> by CommonCrawlDataDumper, you have to deserialize the 
>>>>>>>> CBOR-encoded data structure inside the file.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> I hope this short overview can help in your work. I really 
>>>>>>>> appreciate your feedback and, by the way, thanks a lot for your 
>>>>>>>> great job on detection.
>>>>>>>> 
>>>>>>>> I am available to provide all the support I can give, so do 
>>>>>>>> not hesitate to contact me if you need any further information.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> 
>>>>>>>> Giuseppe
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Finally, after some research, I guess that the statistical 
>>>>>>>> information (present in the readme of the code repo) is not 
>>>>>>>> being collected and computed by aggregate.py from those output 
>>>>>>>> json files but it looks like it is coming from the log.... see 
>>>>>>>> the following as an example:
>>>>>>>> 
>>>>>>>> 2015-04-19 04:55:42,078 INFO  tools.CommonCrawlDataDumper - 
>>>>>>>> CommonsCrawlDataDumper File Stats:
>>>>>>>> TOTAL Stats:
>>>>>>>> [
>>>>>>>>  {"mimeType":"application/x-tika-msoffice","count":"17"}
>>>>>>>>  {"mimeType":"application/vnd.ms-excel","count":"7"}
>>>>>>>>  {"mimeType":"application/xhtml+xml","count":"3000"}
>>>>>>>>  {"mimeType":"application/octet-stream","count":"641"}
>>>>>>>>  {"mimeType":"application/epub+zip","count":"1"}
>>>>>>>>  {"mimeType":"application/zip","count":"6"}
>>>>>>>>  {"mimeType":"application/xml","count":"11"}
>>>>>>>>  {"mimeType":"image/png","count":"110"}
>>>>>>>>  {"mimeType":"image/jpeg","count":"70"}
>>>>>>>>  {"mimeType":"application/atom+xml","count":"213"}
>>>>>>>>  {"mimeType":"application/rss+xml","count":"43"}
>>>>>>>>  {"mimeType":"video/mp4","count":"3"}
>>>>>>>>  {"mimeType":"text/plain","count":"104"}
>>>>>>>>  {"mimeType":"application/rdf+xml","count":"2"}
>>>>>>>>  {"mimeType":"image/gif","count":"2"}
>>>>>>>>  {"mimeType":"text/x-php","count":"1"}
>>>>>>>>  {"mimeType":"video/x-msvideo","count":"1"}
>>>>>>>>  {"mimeType":"application/x-tika-ooxml","count":"3"}
>>>>>>>>  {"mimeType":"text/html","count":"9506"}
>>>>>>>>  {"mimeType":"application/pdf","count":"280"}
>>>>>>>> ]
>>>>>>>> 
>>>>>>>> It turns out that aggregate.py is not the one that produces the 
>>>>>>>> statistical information; not sure what it does... but anyway, I 
>>>>>>>> think I understand the whole idea and I do concur with it. 
>>>>>>>> Maybe we can repackage Tika by incorporating the feature (i.e. 
>>>>>>>> probabilistic mime selection) in it and see if it can output 
>>>>>>>> the same information in the log as the one without it.
>>>>>>>> 
>>>>>>>> BTW, regarding the use of the probabilistic mime selection
>>>>>>>> feature:
>>>>>>>> in my pull request, I added a simple test case which might tell 
>>>>>>>> a bit more about how the feature is called and used; it is 
>>>>>>>> simple though.
>>>>>>>> Here is an example snippet:
>>>>>>>>     ProbabilisticMimeDetectionSelector probSel =
>>>>>>>>         new ProbabilisticMimeDetectionSelector();
>>>>>>>>     // input is an InputStream, metadata is a Metadata
>>>>>>>>     probSel.detect(input, metadata);
>>>>>>>> It is similar to MimeTypes::detect(...) (more 
>>>>>>>> information on this can be found in
>>>>>>>> https://issues.apache.org/jira/browse/TIKA-1517).
>>>>>>>> Now, in order to allow Tika().detect() to call
>>>>>>>> ProbabilisticMimeDetectionSelector::detect(...) (as
>>>>>>>> Tika().detect() is being called by commoncrawldump), we need to 
>>>>>>>> modify/add some code in TikaConfig, which initializes a list 
>>>>>>>> of default detectors: we need to remove the detector
>>>>>>>> mimeTypes::MimeTypes from the list and replace it with
>>>>>>>> probSel::ProbabilisticMimeDetectionSelector (not sure if I 
>>>>>>>> should create another pull request with this change for
>>>>>>>> TikaConfig).
>>>>>>>> 
>>>>>>>> I think that is all of my initial thoughts, with some findings
>>>>>>>> and a plan; if there is anything you would like to add or
>>>>>>>> comment on, please kindly let me know, and then I will start
>>>>>>>> working on my 'finale'. BTW, don't worry: graduation is not
>>>>>>>> the end of my involvement with Tika and this project. After
>>>>>>>> graduating I still can and want to help the polar project and
>>>>>>>> Tika as much as possible, correct programming faults and bugs,
>>>>>>>> respond to Tika issues, etc.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> Luke
>>>>>>>> 
>>>>>>>> -----Original Message-----
>>>>>>>> From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>>>>>>>> Sent: Saturday, April 18, 2015 6:26 AM
>>>>>>>> To: Luke liu; 'Mattmann, Chris A (3980)'; Bryant, Ann C 
>>>>>>>> (398G-Affiliate); Zimdars, Paul A (3980-Affiliate)
>>>>>>>> Cc: 'Luke'; NSF Polar CyberInfrastructure DR Students; 
>>>>>>>> memex-jpl@googlegroups.com
>>>>>>>> Subject: Re: this week action from luke
>>>>>>>> Importance: High
>>>>>>>> 
>>>>>>>> Awesome Luke. I am going to work specifically on now 
>>>>>>>> benchmarking your code in real situations. For example, it 
>>>>>>>> would be fantastic to now run your Bayesian MIME detector over 
>>>>>>>> the whole NSF TREC Dynamic Domain data for Polar described here:
>>>>>>>> 
>>>>>>>> http://github.com/chrismattmann/trec-dd-polar/
>>>>>>>> 
>>>>>>>> Paul Zimdars, CC'ed, can provide you with access to the data, 
>>>>>>>> and Annie can explain it, also CC'ed.
>>>>>>>> 
>>>>>>>> Can we make that your goal for the next 2 weeks to actually 
>>>>>>>> test it and produce a real result over the whole TREC-DD data 
>>>>>>>> for Polar? My goal will be to get your code committed and 
>>>>>>>> integrated into Tika.
>>>>>>>> The more you can write me a guide of how to build and test your 
>>>>>>>> code with Tika so I can get it committed the better.
>>>>>>>> 
>>>>>>>> Also CC'ing the Memex list for context. Note everyone: Luke is 
>>>>>>>> building a Bayesian MIME classifier to evaluate against Tika's 
>>>>>>>> existing MIME detection approach. If folks have any Memex needs 
>>>>>>>> to try and test more accurate file identification with Tika, 
>>>>>>>> Luke is the guy to talk to and I have him for 2 more weeks.
>>>>>>>> 
>>>>>>>> Thanks!
>>>>>>>> 
>>>>>>>> Cheers,
>>>>>>>> Chris
>>>>>>>> 
>>>>>>>> ------------------------
>>>>>>>> Chris Mattmann
>>>>>>>> chris.mattmann@gmail.com
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> -----Original Message-----
>>>>>>>> From: Luke liu <sh...@usc.edu>
>>>>>>>> Date: Thursday, April 16, 2015 at 11:29 PM
>>>>>>>> To: Chris Mattmann <ch...@gmail.com>, Chris Mattmann 
>>>>>>>> <Ch...@jpl.nasa.gov>
>>>>>>>> Cc: 'Luke' <ha...@gmail.com>
>>>>>>>> Subject: this week action from luke
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Hi Professor Mattmann,
>>>>>>>> 
>>>>>>>> I think I am in the final phase of the research, and last week 
>>>>>>>> I finished the last item in the list, and hopefully everything 
>>>>>>>> will be fine.
>>>>>>>> 
>>>>>>>> For now, I can probably spend some time verifying or
>>>>>>>> optimizing the code, as the majority of the research has been
>>>>>>>> done... and it would also be great if you could comment on
>>>>>>>> my work (the 2 pull requests) when you have time.
>>>>>>>> 
>>>>>>>> If you do have confusion with any of my work, please also do 
>>>>>>>> let me know.
>>>>>>>> 
>>>>>>>> Thanks, and I am glad to be working with you. For the next couple
>>>>>>>> of weeks before graduation, I am going to continue revising and
>>>>>>>> testing the code and features to get rid of flaws (if any)
>>>>>>>> when I have time.
>>>>>>>> 
>>>>>>>> I am not sure whether I have missed something; if I have missed
>>>>>>>> something important, please do let me know too.
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> Luke
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> You received this message because you are subscribed to the 
>>>>>>>> Google Groups "JPL-Kitware-Continuum Memex Group" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from 
>>>>>>>> it, send an email to memex-jpl+unsubscribe@googlegroups.com
>>>>>>>> <ma...@googlegroups.com>.
>>>>>>>> To post to this group, send email to memex-jpl@googlegroups.com.
>>>>>>>> Visit this group at http://groups.google.com/group/memex-jpl.
>>>>>>>> To view this discussion on the web visit
>>>>>>>> https://groups.google.com/d/msgid/memex-jpl/000f01d07ad4%24b3510070%2419f30150%24%40edu.
>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>> <garbled.jpg><1423894754000.html>
> 
>

Re: [memex-jpl] this week action from luke

Posted by Chris Mattmann <ch...@gmail.com>.

Great work Luke and both of these changes make sense.
Please send the pull request for that thank you!

Great work Giuseppe! Go team!

Cheers,
Chris

------------------------
Chris Mattmann
chris.mattmann@gmail.com




-----Original Message-----
From: Luke <ha...@gmail.com>
Date: Thursday, April 23, 2015 at 3:08 AM
To: 'Luke' <ha...@gmail.com>, Chris Mattmann
<Ch...@jpl.nasa.gov>, Chris Mattmann
<ch...@gmail.com>, "'Totaro, Giuseppe U (3980-Affiliate)'"
<to...@di.uniroma1.it>, <de...@tika.apache.org>, "'Bryant, Ann C
(398G-Affiliate)'" <an...@gmail.com>, "'Zimdars, Paul A
(3980-Affiliate)'" <Pa...@jpl.nasa.gov>, NSF Polar
CyberInfrastructure DR Students <ns...@googlegroups.com>,
<me...@googlegroups.com>
Subject: RE: [memex-jpl] this week action from luke

>Both patches from Giuseppe work based on my tests; from the tests I
>was able to see the magic tag being written at the beginning of the
>file, and the cbor extension being appended too when running the Nutch
>dump tool command with the "-extension cbor" option. Thanks a lot for the
>kind help, Giuseppe, highly appreciated. I want to give a big thumbs
>up to Giuseppe's work; it is thorough and considerate too.
>
>To professor,
>with Giuseppe's two patches, we still need to make a small change in Tika
>mimetypes.xml (BTW, the cbor magic tag can be used as magic bytes in tika,
>as it does not look very common; even if it accidentally appears in some
>other type of file, tika will have the extension and metadata hint as a
>fallback strategy). I am going to send another pull request with that
>change, but before that it would be good to elaborate on what I am going
>to change, to avoid any confusion.
>
>Now we have two problems.
>Problem 1: magic priority 40.
>	application/xhtml+xml has a higher priority (50) than
>application/cbor (40) [I don't know who assigned 40 to cbor, or why]. So
>if xhtml gets read and compared first, cbor will not even be placed in
>the magic estimation list because it has a lower priority. Based on the
>tests, it is true that xhtml gets read and compared first with the
>input file, so any type below priority 50 will be disregarded.
>
>
>Problem 2: again magic priority, this time with 50.
>	In Tika, given a file dumped by the nutch dumper tool, both types
>(xhtml and cbor) will be selected as candidate mime types and put in
>the magic estimation list; since the xhtml type gets read first, it is
>placed above cbor. To break that tie, tika relies on the decision from
>the extension method. If the extension method fails to detect the type
>(for now, let's ignore the metadata hint method for simplicity, but the
>same applies to it too), then xhtml eventually gets returned.
>
>My pull request to be sent: I am going to set the magic priority of the
>cbor type to 50, the same as xhtml, because it would probably be risky
>to discard any one of the estimated types without consulting the
>extension method.
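
The two problems described above can be illustrated with a toy model in Java. This is only a sketch of the behaviour as described in this message, not Tika's actual detection code; the class MagicPrioritySketch, its Match type and the select method are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model (hypothetical, not Tika code) of the selection described above:
// only the magic matches with the highest priority are kept, and the
// extension decision is used to break ties among them.
final class MagicPrioritySketch {
    static final class Match {
        final String type;
        final int priority;
        Match(String type, int priority) { this.type = type; this.priority = priority; }
    }

    // Keep only the types whose magic priority equals the highest one seen,
    // preserving the order in which the matches were found.
    static List<String> topPriorityTypes(List<Match> matches) {
        int best = Integer.MIN_VALUE;
        for (Match m : matches) best = Math.max(best, m.priority);
        List<String> out = new ArrayList<>();
        for (Match m : matches) if (m.priority == best) out.add(m.type);
        return out;
    }

    // If the extension agrees with one of the candidates, return it;
    // otherwise fall back to the first candidate in the list.
    static String select(List<Match> matches, String extensionType) {
        List<String> candidates = topPriorityTypes(matches);
        if (candidates.isEmpty()) {
            return extensionType != null ? extensionType : "application/octet-stream";
        }
        if (extensionType != null && candidates.contains(extensionType)) return extensionType;
        return candidates.get(0);
    }

    public static void main(String[] args) {
        List<Match> problem1 = Arrays.asList(
            new Match("application/xhtml+xml", 50), new Match("application/cbor", 40));
        List<Match> sameLevel = Arrays.asList(
            new Match("application/xhtml+xml", 50), new Match("application/cbor", 50));
        System.out.println(select(problem1, "application/cbor"));
        System.out.println(select(sameLevel, "application/cbor"));
        System.out.println(select(sameLevel, null));
    }
}
```

In this model, cbor at priority 40 never reaches the candidate list, while at priority 50 it is returned only when the extension agrees with it.
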
>
>Any comments, suggestion, thoughts will be welcomed and appreciated.
>
>Thanks
>Luke
>
>-----Original Message-----
>From: Luke [mailto:hanson311biz@gmail.com]
>Sent: Wednesday, April 22, 2015 7:45 PM
>To: 'Mattmann, Chris A (3980)'
>Cc: 'Chris Mattmann'; 'Totaro, Giuseppe U (3980-Affiliate)';
>'dev@tika.apache.org'; 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A
>(3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students';
>'memex-jpl@googlegroups.com'
>Subject: RE: [memex-jpl] this week action from luke
>
>Hi Prof,
>
>The test has finished, and the result is as expected.
>Both builds (tika with the prob feature and the one without it) produced
>the same "stats total"; please see the attached matched.txt, dumped by a
>small program that compares, verbatim, each line in every section of the
>"Stats total" between the log produced by the tika that has the feature
>and the one without it. If string.equals(...) is satisfied, the line is
>dumped out. If there is a mismatch (e.g. the count for a particular mime
>type is different), an error is dumped out. In the end, I don't see any
>error in the printout, so I think the feature seems to have passed the
>test.
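
The line-by-line comparison described above could look roughly like the following Java sketch. The actual program is not shown in this thread, so the class StatsLineCompareSketch, the compare method and the error message format are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a verbatim line-by-line comparison of two
// "Stats total" sections, one per log.
final class StatsLineCompareSketch {
    // Returns the matching lines unchanged and an ERROR entry for each mismatch.
    static List<String> compare(List<String> withFeature, List<String> withoutFeature) {
        List<String> report = new ArrayList<>();
        int n = Math.max(withFeature.size(), withoutFeature.size());
        for (int i = 0; i < n; i++) {
            String a = i < withFeature.size() ? withFeature.get(i) : null;
            String b = i < withoutFeature.size() ? withoutFeature.get(i) : null;
            if (a != null && a.equals(b)) {
                report.add(a);
            } else {
                report.add("ERROR line " + (i + 1) + ": " + a + " != " + b);
            }
        }
        return report;
    }

    public static void main(String[] args) {
        List<String> first = Arrays.asList(
            "{\"mimeType\":\"image/gif\",\"count\":\"2\"}",
            "{\"mimeType\":\"application/pdf\",\"count\":\"280\"}");
        List<String> second = Arrays.asList(
            "{\"mimeType\":\"image/gif\",\"count\":\"2\"}",
            "{\"mimeType\":\"application/pdf\",\"count\":\"280\"}");
        for (String line : compare(first, second)) System.out.println(line);
    }
}
```
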
>
>
>The processing times of the 2 tests are as follows.
>Start time and end time for the test where the Nutch dumper tool ran
>with the prob selection feature:
>from
>2015-04-22 15:47:08,330
>to
>2015-04-22 17:48:28,877
>
>Start time and end time for the test where the Nutch dumper tool ran
>with the tika without the feature:
>from
>2015-04-22 22:41:23,459
>to
>2015-04-23 00:11:02,767
>
>
>BTW, I forgot to mention that the probabilistic mime selector with default
>weight settings also gives the following result, because by default I
>intentionally assign a higher weight to the magic bytes method so as to
>make it work in a way similar to the old strategy. On the other hand, if
>I know that the extension is more reliable, I can certainly add more
>weight to the extension approach; in that case, the prob mime selector
>will return application/cbor with a higher weight value.
>
>> <match value="&lt;html xmlns=" type="string" offset="0:1024"/>
>> Result: "text/html"
>> 
>> <match value="&lt;html xmlns=" type="string" offset="0:6000"/>
>> Result: "application/xhtml+xml"
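
The effect of the weights can be sketched with a simple weighted vote in Java. This is a deliberately simplified stand-in, not the algorithm used by ProbabilisticMimeDetectionSelector; the class WeightedVoteSketch, the select method and the weight values are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified weighted vote (hypothetical): each detection method votes for
// one type with its weight, and the type with the highest total wins.
final class WeightedVoteSketch {
    static String select(String magicType, double magicWeight,
                         String extensionType, double extensionWeight) {
        Map<String, Double> score = new LinkedHashMap<>();
        if (magicType != null) score.merge(magicType, magicWeight, Double::sum);
        if (extensionType != null) score.merge(extensionType, extensionWeight, Double::sum);
        String best = "application/octet-stream"; // most general fallback
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : score.entrySet()) {
            if (e.getValue() > bestScore) {
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Magic weighted higher, similar in spirit to the old strategy.
        System.out.println(select("application/xhtml+xml", 0.7, "application/cbor", 0.3));
        // Extension trusted more.
        System.out.println(select("application/xhtml+xml", 0.3, "application/cbor", 0.7));
    }
}
```
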
>
>
>Please kindly let me know if you have any confusion with the tests;
>
>
>Thanks
>Luke
>
>-----Original Message-----
>From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
>Sent: Wednesday, April 22, 2015 3:49 PM
>To: Luke
>Cc: Chris Mattmann; Totaro, Giuseppe U (3980-Affiliate);
>dev@tika.apache.org; Bryant, Ann C (398G-Affiliate); Zimdars, Paul A
>(3980-Affiliate); NSF Polar CyberInfrastructure DR Students;
>memex-jpl@googlegroups.com
>Subject: Re: [memex-jpl] this week action from luke
>
>Thanks Luke, this is probably a good opportunity to test out your Bayesian
>mime detector. How does it perform here?
>
>Sent from my iPhone
>
>> On Apr 22, 2015, at 3:29 PM, Luke <ha...@gmail.com> wrote:
>> 
>> Hi professor,
>> 
>> Please see the following results.
>> <match value="&lt;html xmlns=" type="string" offset="0:1024"/>
>> Result: "text/html"
>> 
>> <match value="&lt;html xmlns=" type="string" offset="0:6000"/>
>> Result: "application/xhtml+xml"
>> 
>> 
>> Thanks
>> Luke
>> 
>> -----Original Message-----
>> From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>> Sent: Wednesday, April 22, 2015 4:21 AM
>> To: Luke; 'Mattmann, Chris A (3980)'; 'Totaro, Giuseppe U
>> (3980-Affiliate)'; dev@tika.apache.org
>> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A
>> (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students';
>> memex-jpl@googlegroups.com
>> Subject: Re: [memex-jpl] this week action from luke
>> 
>> Hi Luke,
>> 
>> Actually I just meant go into tika-mimetypes.xml and change the magic
>> offsets for application/xhtml+xml and see if that works. The code you
>> changed below is actually how many bytes Tika will first download to do
>> MIME checking.
>> 
>> Cheers,
>> Chris
>> 
>> ------------------------
>> Chris Mattmann
>> chris.mattmann@gmail.com
>> 
>> 
>> 
>> 
>> -----Original Message-----
>> From: Luke <ha...@gmail.com>
>> Date: Wednesday, April 22, 2015 at 2:25 AM
>> To: Chris Mattmann <ch...@gmail.com>, Chris Mattmann
>> <Ch...@jpl.nasa.gov>, "'Totaro, Giuseppe U (3980-Affiliate)'"
>> <to...@di.uniroma1.it>, <de...@tika.apache.org>
>> Cc: "'Bryant, Ann C (398G-Affiliate)'" <an...@gmail.com>,
>> "'Zimdars, Paul A (3980-Affiliate)'" <Pa...@jpl.nasa.gov>,
>> NSF Polar CyberInfrastructure DR Students
>> <ns...@googlegroups.com>,
>> <me...@googlegroups.com>
>> Subject: RE: [memex-jpl] this week action from luke
>> 
>>> 
>>> Hi professor,
>>> 
>>> I just tried it with minLength set to 1024, and I get the following:
>>> "text/plain"
>>> I am a bit surprised....
>>> 
>>> BTW, the 6000 min length still gives "application/xhtml+xml"; with
>>> anything below 1024 min length, I am seeing "text/plain". :)
>>> 
>>> BTW, the min length I am referring to/altering is as follows
>>> MimeTypes.java
>>>    public int getMinLength() {
>>>       // This needs to be reasonably large to be able to correctly detect
>>>       // things like XML root elements after initial comment and DTDs
>>>       return 64 * 1024;
>>>    }
>>> 
>>> 
>>> Thanks
>>> Luke
>>> 
>>> -----Original Message-----
>>> From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>>> Sent: Tuesday, April 21, 2015 7:48 PM
>>> To: Luke; 'Mattmann, Chris A (3980)'; 'Totaro, Giuseppe U
>>> (3980-Affiliate)'; dev@tika.apache.org
>>> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A
>>> (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students';
>>> memex-jpl@googlegroups.com
>>> Subject: Re: [memex-jpl] this week action from luke
>>> 
>>> Thanks Luke.
>>> 
>>> So I guess all I was asking was could you try it out. Thanks for the
>>> lesson in the RFC.
>>> 
>>> Cheers,
>>> Chris
>>> 
>>> ------------------------
>>> Chris Mattmann
>>> chris.mattmann@gmail.com
>>> 
>>> 
>>> 
>>> 
>>> -----Original Message-----
>>> From: Luke <ha...@gmail.com>
>>> Date: Wednesday, April 22, 2015 at 1:46 AM
>>> To: Chris Mattmann <Ch...@jpl.nasa.gov>, Chris Mattmann
>>> <ch...@gmail.com>, "'Totaro, Giuseppe U (3980-Affiliate)'"
>>> <to...@di.uniroma1.it>, <de...@tika.apache.org>
>>> Cc: "'Bryant, Ann C (398G-Affiliate)'" <an...@gmail.com>,
>>> "'Zimdars, Paul A (3980-Affiliate)'" <Pa...@jpl.nasa.gov>,
>>> NSF Polar CyberInfrastructure DR Students
>>> <ns...@googlegroups.com>,
>>> <me...@googlegroups.com>
>>> Subject: RE: [memex-jpl] this week action from luke
>>> 
>>>> Hi professor,
>>>> 
>>>> 
>>>> I think it highly depends on the content being read by tika. E.g. if
>>>> there is a sequence of bytes in the file being read that is the same
>>>> as the magic of one or more of the mime types defined in our
>>>> tika-mimetypes.xml, I guess that tika will put those types in its
>>>> estimation list; please note there could be multiple mime types
>>>> estimated by the magic-byte detection approach. Tika also considers
>>>> the decision made by the extension detection approach: if the
>>>> extension says the file type it believes in is the first one in the
>>>> magic type estimation list, then certainly the first one will be
>>>> returned (the same applies to the metadata hint approach). Of course,
>>>> tika also prefers the type that is the most specialized.
>>>> 
>>>> Let's get back to the following question; this is my guess, though.
>>>> [Prof]: Also what happens if you tweak the definition of XHTML to
>>>> not scan until 8192, but say 6000 (e.g., 0:6000), does CBOR take over
>>>> then?
>>>> Let's consider an extreme case where we only scan 10 or 1 bytes;
>>>> then it seems that magic bytes will inevitably detect nothing, and I
>>>> think it will return something like "application/octet-stream",
>>>> which is the most general type. As mentioned, tika favours the type
>>>> that is the most specialized. If the extension approach returns one
>>>> that is more specialized (in this extreme case I believe almost
>>>> every type is a subclass of "application/octet-stream"), then
>>>> the answer in this extreme case may be yes: I think it is very
>>>> possible that the CBOR type detected by the extension approach takes
>>>> over in this case...
>>>> My idea was and still is that if the cbor self-describing tag 55799
>>>> is present in the cbor file, then it can be used to detect the cbor
>>>> type.
>>>> Again, the cbor type will probably be appended to the magic
>>>> estimation list together with another one such as application/html;
>>>> I guess the order in the list probably also matters, with the first
>>>> one preferred over the next one. The decision from the extension
>>>> detection approach also plays a role in breaking the tie,
>>>> e.g. if the extension detection method agrees on cbor with one of the
>>>> estimated types in the magic list, then cbor will be returned
>>>> (again, the same applies to the metadata hint method).
>>>> 
>>>> I have not taken a closer look at a cbor file that has the tag
>>>> 55799, but I expect its hex to be something like 0xd9d9f7, i.e. the
>>>> tag should be present in the header as a fixed sequence of bytes
>>>> (https://tools.ietf.org/html/rfc7049#section-2.4.5 ). If this
>>>> is present in the file, preferably in the header (within a
>>>> reasonable range of bytes), I believe it can probably be used as
>>>> the magic number for the cbor type.
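
A check for the self-describing tag at the start of a file can be sketched in Java as follows. Tag 55799 serializes as the three bytes 0xd9 0xd9 0xf7 (RFC 7049, section 2.4.5); the class CborSelfDescribeCheck and its method name are hypothetical, and this is not how Tika itself performs magic matching.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: tests whether a buffer begins with the serialized
// CBOR self-describe tag 55799, i.e. the bytes 0xd9 0xd9 0xf7.
final class CborSelfDescribeCheck {
    static final byte[] SELF_DESCRIBE_TAG = {(byte) 0xD9, (byte) 0xD9, (byte) 0xF7};

    static boolean startsWithSelfDescribeTag(byte[] data) {
        if (data == null || data.length < SELF_DESCRIBE_TAG.length) return false;
        byte[] head = Arrays.copyOf(data, SELF_DESCRIBE_TAG.length);
        return Arrays.equals(head, SELF_DESCRIBE_TAG);
    }

    public static void main(String[] args) {
        // Tag followed by the one-character text string "a" (0x61 0x61).
        byte[] tagged = {(byte) 0xD9, (byte) 0xD9, (byte) 0xF7, 0x61, 0x61};
        byte[] json = "{\"a\":1}".getBytes(StandardCharsets.UTF_8);
        System.out.println(startsWithSelfDescribeTag(tagged));
        System.out.println(startsWithSelfDescribeTag(json));
    }
}
```
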
>>>> 
>>>> 
>>>> There is another thing I mentioned in the jira ticket I opened
>>>> yesterday about the cbor parser and detection: it is also possible
>>>> that cbor content is embedded inside a plain json file, and the way
>>>> a decoder can distinguish them in that file is again by looking at
>>>> the tag 55799. This may rarely happen, but a robust parser
>>>> should be able to take care of it; tika might need to consider using
>>>> the FasterXML library used by the nutch tool when developing the
>>>> cbor parser...
>>>> Again, let me cite the same paragraph from the rfc:
>>>> 
>>>> " a decoder might be able to parse both CBOR and JSON.
>>>>  Such a decoder would need to mechanically distinguish the two
>>>> formats.  An easy way for an encoder to help the decoder would be to
>>>> tag the entire CBOR item with tag 55799, the serialization of which
>>>> will never be found at the beginning of a JSON text."
>>>> 
>>>> 
>>>> Thanks
>>>> Luke
>>>> 
>>>> 
>>>> 
>>>> -----Original Message-----
>>>> From: Mattmann, Chris A (3980)
>>>> [mailto:chris.a.mattmann@jpl.nasa.gov]
>>>> Sent: Tuesday, April 21, 2015 9:49 PM
>>>> To: Luke; 'Chris Mattmann'; Totaro, Giuseppe U (3980-Affiliate)
>>>> Cc: Bryant, Ann C (398G-Affiliate); Zimdars, Paul A
>>>> (3980-Affiliate); 'NSF Polar CyberInfrastructure DR Students';
>>>> memex-jpl@googlegroups.com
>>>> Subject: Re: [memex-jpl] this week action from luke
>>>> 
>>>> Hi Luke,
>>>> 
>>>> Can you post the below conversation to dev@tika and summarize it
>>>> there.
>>>> Also what happens if you tweak the definition of XHTML to not scan
>>>> until 8192, but say 6000 (e.g., 0:6000), does CBOR take over then?
>>>> 
>>>> Cheers,
>>>> Chris
>>>> 
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> Chris Mattmann, Ph.D.
>>>> Chief Architect
>>>> Instrument Software and Science Data Systems Section (398) NASA Jet
>>>> Propulsion Laboratory Pasadena, CA 91109 USA
>>>> Office: 168-519, Mailstop: 168-527
>>>> Email: chris.a.mattmann@nasa.gov
>>>> WWW:  http://sunset.usc.edu/~mattmann/
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> Adjunct Associate Professor, Computer Science Department University
>>>> of Southern California, Los Angeles, CA 90089 USA
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> -----Original Message-----
>>>> From: Luke <ha...@gmail.com>
>>>> Date: Wednesday, April 22, 2015 at 12:19 AM
>>>> To: Chris Mattmann <ch...@gmail.com>, "Totaro, Giuseppe U
>>>> (3980-Affiliate)" <to...@di.uniroma1.it>, Chris Mattmann
>>>> <Ch...@jpl.nasa.gov>
>>>> Cc: "Bryant, Ann C (398G-Affiliate)" <an...@gmail.com>,
>>>> "Zimdars, Paul A (3980-Affiliate)" <Pa...@jpl.nasa.gov>,
>>>> NSF Polar CyberInfrastructure DR Students
>>>> <ns...@googlegroups.com>,
>>>> "memex-jpl@googlegroups.com" <me...@googlegroups.com>
>>>> Subject: RE: [memex-jpl] this week action from luke
>>>> 
>>>>> Hi Professor,
>>>>> Please see attached jpg for the difference.
>>>>> Thanks
>>>>> Luke
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>>>>> Sent: Tuesday, April 21, 2015 5:27 PM
>>>>> To: Luke; 'Giuseppe Totaro'; 'Mattmann, Chris A (3980)'
>>>>> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A
>>>>> (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students';
>>>>> memex-jpl@googlegroups.com
>>>>> Subject: Re: [memex-jpl] this week action from luke
>>>>> 
>>>>> Hey Luke what happens if you do java -jar /path/to/tika-app -m
>>>>> /path/to/cbor/file.cbor, compared to: java -jar /path/to/tika-app
>>>>> -m < /path/to/cbor/file.cbor any difference?
>>>>> 
>>>>> ------------------------
>>>>> Chris Mattmann
>>>>> chris.mattmann@gmail.com
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Luke <ha...@gmail.com>
>>>>> Date: Tuesday, April 21, 2015 at 5:41 PM
>>>>> To: 'Luke' <ha...@gmail.com>, Chris Mattmann
>>>>> <ch...@gmail.com>, 'Giuseppe Totaro'
>>>>> <to...@di.uniroma1.it>, Chris Mattmann
>>>>> <Ch...@jpl.nasa.gov>
>>>>> Cc: "'Bryant, Ann C (398G-Affiliate)'" <an...@gmail.com>,
>>>>> "'Zimdars, Paul A (3980-Affiliate)'" <Pa...@jpl.nasa.gov>,
>>>>> NSF Polar CyberInfrastructure DR Students
>>>>> <ns...@googlegroups.com>,
>>>>> <me...@googlegroups.com>
>>>>> Subject: RE: [memex-jpl] this week action from luke
>>>>> 
>>>>>> Hi professor,
>>>>>> I just sent a pull request for adding cbor extension.
>>>>>> The interesting thing is that tika is still identifying the file
>>>>>> dumped by the nutch dump tool as "application/xhtml+xml" even
>>>>>> when I manually change the file extension to the correct one (i.e.
>>>>>> *.cbor).
>>>>>> 
>>>>>> The reason is probably that tika identifies "application/xhtml+xml"
>>>>>> by searching for "&lt;html" in the file content (PFA:
>>>>>> xhtml+xml.jpg). Now if you take a look at the cbor file dumped by
>>>>>> nutch, you see that we do have that element as part of the cbor
>>>>>> content, because the entire crawled xhtml document seems to be
>>>>>> embedded in the cbor json (PFA: cbor.jpg); also, in Tika, magic
>>>>>> detection seems to have higher priority than glob detection, thus
>>>>>> the type is being incorrectly detected.
>>>>>> 
>>>>>> Therefore, I would like to mention that adding the entry
>>>>>> <glob pattern="*.cbor"/> does not resolve the issue as of now
>>>>>> without some fixed magic bytes / patterns for cbor.
>>>>>> I would also like to add that things will be different with our
>>>>>> probabilistic mime detection selector, because if we know that the
>>>>>> file extension is more reliable than magic bytes, then we can
>>>>>> certainly add more preferential weight to the extension... this
>>>>>> also might show that the current MimeTypes detection
>>>>>> implementation is a bit stiff or less flexible in this scenario. :)
>>>>>> 
>>>>>> 
>>>>>> Thanks
>>>>>> Luke
>>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: Luke [mailto:hanson311biz@gmail.com]
>>>>>> Sent: Tuesday, April 21, 2015 12:14 PM
>>>>>> To: 'Chris Mattmann'; 'Giuseppe Totaro'; 'Mattmann, Chris A (3980)'
>>>>>> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A
>>>>>> (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students';
>>>>>> 'memex-jpl@googlegroups.com'
>>>>>> Subject: RE: [memex-jpl] this week action from luke
>>>>>> 
>>>>>> Yes, let me add the cbor extension entry in tika xml, will send
>>>>>> the pull request soon.
>>>>>> 
>>>>>> Thanks
>>>>>> Luke
>>>>>> -----Original Message-----
>>>>>> From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>>>>>> Sent: Tuesday, April 21, 2015 6:51 AM
>>>>>> To: Giuseppe Totaro; Mattmann, Chris A (3980)
>>>>>> Cc: Luke; Bryant, Ann C (398G-Affiliate); Zimdars, Paul A
>>>>>> (3980-Affiliate); NSF Polar CyberInfrastructure DR Students;
>>>>>> memex-jpl@googlegroups.com
>>>>>> Subject: Re: [memex-jpl] this week action from luke
>>>>>> 
>>>>>> Giuseppe both of these ideas supporting the CBOR WRITE_TYPE_HEADER
>>>>>> and tag along with adding an -extension command would be fantastic.
>>>>>> Can you file both of those NUTCH issues, wait a day or so, and
>>>>>> then based on feedback use your new Nutch commit karma to get
>>>>>> those into Nutch?
>>>>>> 
>>>>>> And then when creating the issues, can you link to the TIKA-1610
>>>>>> issue?
>>>>>> At that point, when those two to be defined NUTCH issues are up,
>>>>>> Luke, in parallel can you throw up a pull request/patch in Tika
>>>>>> for the extension along with the MIME detection?
>>>>>> 
>>>>>> Cheers,
>>>>>> Chris
>>>>>> 
>>>>>> ------------------------
>>>>>> Chris Mattmann
>>>>>> chris.mattmann@gmail.com
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: Giuseppe Totaro <to...@di.uniroma1.it>
>>>>>> Date: Tuesday, April 21, 2015 at 12:33 PM
>>>>>> To: Chris Mattmann <Ch...@jpl.nasa.gov>
>>>>>> Cc: Luke <ha...@gmail.com>, Chris Mattmann
>>>>>> <ch...@gmail.com>, "Bryant, Ann C (398G-Affiliate)"
>>>>>> <an...@gmail.com>, "Zimdars, Paul A (3980-Affiliate)"
>>>>>> <Pa...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR
>>>>>> Students <ns...@googlegroups.com>,
>>>>>> "memex-jpl@googlegroups.com"
>>>>>> <me...@googlegroups.com>
>>>>>> Subject: Re: [memex-jpl] this week action from luke
>>>>>> 
>>>>>>> Thanks Luke. Great work.
>>>>>>> Chris, we wrap a single string value, representing the JSON text,
>>>>>>> for each file into CBOR (by using serializeCBORData method). For
>>>>>>> instance, using the Unix hex dump tool, we can see that, as
>>>>>>> expected, the first byte of all files is "0x7F" (the first three
>>>>>>> bits are "011", that is the major type for strings, and the
>>>>>>> following 5 bits are "11010", meaning a uint32_t encodes the
>>>>>>> length of following text), and the following 4 bytes
>>>>>>> (single-precision
>>>>>>> float) encodes the right length of file (as described in RFC7049
>>>>>>> <http://tools.ietf.org/html/rfc7049>).
>>>>>>> Therefore, a CBOR tag is currently included into the file (a list
>>>>>>> of cbor tags is available here
>>>>>>> <http://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml>).
>>>>>>> I did not know about CBOR "magic header". Thanks a lot Luke for
>>>>>>> this great research. Chris, if you agree, I can add support for
>>>>>>> prepending self-describing CBOR tag 55799 to
>>>>>>> CommonCrawldataDumper class. I believe it is very easy because I
>>>>>>> have to enable the WRITE_TYPE_HEADER feature for CBORGenerator
>>>>>>> class (the source code is available here
>>>>>>> <https://github.com/FasterXML/jackson-dataformat-cbor/blob/master/src/main/java/com/fasterxml/jackson/dataformat/cbor/CBORGenerator.java>).
>>>>>>> Then, I can comment the TIKA-1610
>>>>>>> <https://issues.apache.org/jira/browse/TIKA-1610> issue.
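
The split of a CBOR initial byte into a 3-bit major type and 5-bit additional information, as discussed in this message, can be decoded with the Java sketch below (RFC 7049, section 2.1). Note that the bit pattern "011 11010" corresponds to the byte 0x7A, while 0x7F (011 11111) is a text string of indefinite length; the sketch decodes both and makes no claim about which one the dumped files actually start with. The class and method names are hypothetical.

```java
// Hypothetical helper that decodes the initial byte of a CBOR data item.
final class CborInitialByteSketch {
    // The high-order 3 bits hold the major type (3 = text string).
    static int majorType(int initialByte) {
        return (initialByte & 0xFF) >>> 5;
    }

    // The low-order 5 bits hold the additional information.
    static int additionalInfo(int initialByte) {
        return initialByte & 0x1F;
    }

    // Meaning of the additional information for major types 0 to 6.
    // Major type 7 uses these values for simple values and floats instead,
    // and 31 means indefinite length only for major types 2 to 5.
    static String describeArgument(int initialByte) {
        int info = additionalInfo(initialByte);
        if (info < 24) return "value " + info + " is stored in the initial byte";
        switch (info) {
            case 24: return "1-byte unsigned integer follows";
            case 25: return "2-byte unsigned integer follows";
            case 26: return "4-byte unsigned integer follows";
            case 27: return "8-byte unsigned integer follows";
            case 31: return "indefinite length";
            default: return "reserved";
        }
    }

    public static void main(String[] args) {
        for (int b : new int[] {0x7A, 0x7F}) {
            System.out.printf("0x%02X: major type %d, additional info %d (%s)%n",
                b, majorType(b), additionalInfo(b), describeArgument(b));
        }
    }
}
```
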
>>>>>>> 
>>>>>>> Regarding the file extension, in the Memex CCA format the
>>>>>>> original file extension is used. We could add support for a
>>>>>>> -extension command-line option allowing the user to give a file
>>>>>>> extension (e.g.,
>>>>>>> cbor) for all files dumped out.
>>>>>>> 
>>>>>>> Thanks a lot,
>>>>>>> Giuseppe
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, Apr 21, 2015 at 7:31 AM, Mattmann, Chris A (3980)
>>>>>>> <ch...@jpl.nasa.gov> wrote:
>>>>>>> 
>>>>>>> Thanks for this great research, Luke!
>>>>>>> 
>>>>>>> Giuseppe, any idea why this tag doesn't make it into the file?
>>>>>>> 
>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>> Chris Mattmann, Ph.D.
>>>>>>> Chief Architect
>>>>>>> Instrument Software and Science Data Systems Section (398) NASA
>>>>>>> Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>>>>> Office: 168-519, Mailstop: 168-527
>>>>>>> Email: chris.a.mattmann@nasa.gov
>>>>>>> WWW:  http://sunset.usc.edu/~mattmann/
>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>> Adjunct Associate Professor, Computer Science Department
>>>>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Luke <ha...@gmail.com>
>>>>>>> Date: Tuesday, April 21, 2015 at 2:55 AM
>>>>>>> To: Chris Mattmann <ch...@gmail.com>, "Totaro, Giuseppe
>>>>>>> U (3980-Affiliate)" <to...@di.uniroma1.it>, Chris Mattmann
>>>>>>> <Ch...@jpl.nasa.gov>, "Bryant, Ann C (398G-Affiliate)"
>>>>>>> <an...@gmail.com>, "Zimdars, Paul A (3980-Affiliate)"
>>>>>>> <Pa...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR
>>>>>>> Students <ns...@googlegroups.com>,
>>>>>>> "memex-jpl@googlegroups.com"
>>>>>>> <me...@googlegroups.com>
>>>>>>> Subject: RE: [memex-jpl] this week action from luke
>>>>>>> 
>>>>>>>> Thanks professor.
>>>>>>>> Hi professor and all.
>>>>>>>> JIRA issue : CBOR Parser and detection improvement
>>>>>>>> https://issues.apache.org/jira/browse/TIKA-1610
>>>>>>>> 
>>>>>>>> I tried to conduct a bit research with this cbor detection.
>>>>>>>> 
>>>>>>>> It looks like there is a self describing tag that needs to be
>>>>>>>> written in the cbor file thru which other applications might be
>>>>>>>> able to identify the cbor type....
>>>>>>>> Please refer to http://tools.ietf.org/html/rfc7049#section-2.4.5
>>>>>>>> 
>>>>>>>> I don't see that tag being present in the cbor file dumped by
>>>>>>>> the nutch tool, I am not very sure though.
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> Luke
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> -----Original Message-----
>>>>>>>> From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>>>>>>>> Sent: Monday, April 20, 2015 4:10 AM
>>>>>>>> To: Luke; 'Giuseppe Totaro'; 'Chris Mattmann'; 'Bryant, Ann C
>>>>>>>> (398G-Affiliate)'; 'Zimdars, Paul A (3980-Affiliate)'; 'NSF
>>>>>>>> Polar CyberInfrastructure DR Students';
>>>>>>>> memex-jpl@googlegroups.com
>>>>>>>> Subject: Re: [memex-jpl] this week action from luke
>>>>>>>> 
>>>>>>>> Nice one, Luke. If you have a second and you can open up an
>>>>>>>> issue in Tika to make it support CBOR, then yes, by all means!
>>>>>>>> :)
>>>>>>>> 
>>>>>>>> 
>>>>>>>> ------------------------
>>>>>>>> Chris Mattmann
>>>>>>>> chris.mattmann@gmail.com
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> -----Original Message-----
>>>>>>>> From: Luke <ha...@gmail.com>
>>>>>>>> Date: Monday, April 20, 2015 at 4:15 AM
>>>>>>>> To: 'Giuseppe Totaro' <to...@di.uniroma1.it>, Chris Mattmann
>>>>>>>> <ch...@gmail.com>, Chris Mattmann
>>>>>>>> <Ch...@jpl.nasa.gov>, "'Bryant, Ann C
>>>>>>>>(398G-Affiliate)'"
>>>>>>>> <an...@gmail.com>, "'Zimdars, Paul A (3980-Affiliate)'"
>>>>>>>> <Pa...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR
>>>>>>>> Students <ns...@googlegroups.com>,
>>>>>>>> <me...@googlegroups.com>
>>>>>>>> Subject: RE: [memex-jpl] this week action from luke
>>>>>>>> 
>>>>>>>>> Thanks a lot Giuseppe for the prompt response clearing up a bit
>>>>>>>>> of my confusion with the Nutch CommonCrawlDataDumper,
>>>>>>>>> appreciated.
>>>>>>>>> 
>>>>>>>>> BTW, it looks like Tika might need to consider supporting a
>>>>>>>>> CBOR parser and detection.
>>>>>>>>> I checked the rfc, and it looks like CBOR has no magic numbers.
>>>>>>>>> PFA:
>>>>>>>>> rfc_cbor.jpg
>>>>>>>>> Actually, I don't quite understand why the
>>>>>>>>> CommonCrawlDataDumper is not dumping the nutch segments with
>>>>>>>>> the .cbor extension, which would seem to be helpful for type
>>>>>>>>> detection.
>>>>>>>>> 
>>>>>>>>> To Professor Mattmann,
>>>>>>>>> Tika does not support the detection of CBOR. Although the trunk
>>>>>>>>> version has entries for CBOR in tika-mimetypes.xml (PFA:
>>>>>>>>> cbor_tika.mimetypes.xml), those entries do not correctly
>>>>>>>>> detect the CBOR files dumped by CommonCrawlDataDumper. Also,
>>>>>>>>> CBOR does not have magic bytes; off the top of my head, the only
>>>>>>>>> ways we can detect it are the extension and a content byte
>>>>>>>>> histogram (please note, this is a locally optimal and
>>>>>>>>> data-dependent solution). :)
>>>>>>>>> 
>>>>>>>>> I think I am deviating a bit from the main topic of this
>>>>>>>>> thread, i.e. the plan for testing the "probabilistic
>>>>>>>>> mime detector selection" with polar data.
>>>>>>>>> Anyway, I plan to repackage Tika with the probabilistic
>>>>>>>>> selection feature, replace the Tika jar in Nutch with the
>>>>>>>>> repackaged one, and then run the CommonCrawlDataDumper and
>>>>>>>>> see how it goes. If you have any specific ideas or thoughts
>>>>>>>>> about the testing, please kindly let me know.
>>>>>>>>> 
>>>>>>>>> Thanks
>>>>>>>>> Luke
>>>>>>>>> 
>>>>>>>>> From: Giuseppe Totaro [mailto:totaro@di.uniroma1.it]
>>>>>>>>> Sent: Sunday, April 19, 2015 11:17 PM
>>>>>>>>> To: Luke liu
>>>>>>>>> Cc: Chris Mattmann; Chris Mattmann; Bryant, Ann C
>>>>>>>>> (398G-Affiliate); Zimdars, Paul A (3980-Affiliate); Luke; NSF
>>>>>>>>> Polar CyberInfrastructure DR Students;
>>>>>>>>> memex-jpl@googlegroups.com
>>>>>>>>> Subject: Re: [memex-jpl] this week action from luke
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Hi Luke,
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> my name is Giuseppe and I am a PhD student working under the
>>>>>>>>> supervision of Prof. Chris Mattmann. I worked on
>>>>>>>>> CommonCrawlDataDumper tool, so I can give some feedback on a
>>>>>>>>> couple of your observations. My comments inline below.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On 19 Apr 2015, at 12:11, Luke liu
>>>>>>>>> <sh...@usc.edu> wrote:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Thanks a lot, professor. Sorry for the brief delay; I was
>>>>>>>>> spending some time understanding the code repo, i.e.
>>>>>>>>> http://github.com/chrismattmann/trec-dd-polar/
>>>>>>>>> 
>>>>>>>>> From gen-common-crawl.sh, it looks like commoncrawldump
>>>>>>>>> dumps the crawl segments to JSON files with human-readable
>>>>>>>>> content.
>>>>>>>>> 1) I am trying to run one of the commands shown in
>>>>>>>>> gen-common-crawl.sh on my side, but the generated files all
>>>>>>>>> end with .html or .htm. The command listed in
>>>>>>>>> gen-common-crawl.sh seems to allude to where the data is
>>>>>>>>> located on our nsfpolardata.dyndns.org
>>>>>>>>> <http://nsfpolardata.dyndns.org/>. Although the locations are
>>>>>>>>> not exactly correct (they probably need to be updated), part of
>>>>>>>>> the patterns allowed me to locate some similar datasets (e.g.
>>>>>>>>> /data2/crawls/raw/CS572Spring2015). Again, the dumped files
>>>>>>>>> all end with .html, but surprisingly the contents inside
>>>>>>>>> those output .html files are in JSON format.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> The file extension is (almost) always the same as that of the
>>>>>>>>> original file.
>>>>>>>>> In more detail, using the -epochFilename command-line option
>>>>>>>>> (as in gen-common-crawl.sh), the scraped data will be stored
>>>>>>>>> with a filename of the format
>>>>>>>>> <epochtime(milliseconds)>.<filetype>,
>>>>>>>>> where <filetype> is either the extension of the original file
>>>>>>>>> or .html as the default if the original file does not have an
>>>>>>>>> extension.
>>>>>>>>> This scheme is used for file naming and does not depend on
>>>>>>>>> the internal output format (JSON).
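
The naming rule described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name EpochFilenameSketch and the method dumpName are hypothetical, and the actual CommonCrawlDataDumper code may treat edge cases differently.

```java
final class EpochFilenameSketch {
    /**
     * Builds "<epoch millis>.<filetype>". The file type is the extension of the
     * original name, or "html" when the original name has no extension.
     * (Hypothetical helper, not the Nutch implementation.)
     */
    static String dumpName(long epochMillis, String originalName) {
        // Look only at the last path segment so dots in directories are ignored.
        String base = originalName.substring(originalName.lastIndexOf('/') + 1);
        int dot = base.lastIndexOf('.');
        boolean hasExtension = dot > 0 && dot < base.length() - 1;
        String fileType = hasExtension ? base.substring(dot + 1) : "html";
        return epochMillis + "." + fileType;
    }

    public static void main(String[] args) {
        System.out.println(dumpName(1423894754000L, "report.pdf"));
        System.out.println(dumpName(1423894754000L, "index"));
    }
}
```

With this rule, a crawled page whose URL has no extension ends up as <epoch>.html regardless of what is stored inside the file, which matches the observation that the dumped .html files contain JSON.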
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 2) Another problem is that the root object in each of the
>>>>>>>>> output JSON files (those with the .html extension) contains
>>>>>>>>> some garbled characters (PFA: garbled.jpg; one of the output
>>>>>>>>> JSON files is also attached as an example, PFA:
>>>>>>>>> 1423894754000.html). The JSON files cannot be parsed properly
>>>>>>>>> by aggregate.py because of those garbled characters.
>>>>>>>>> Even if I get rid of those garbled characters, there is no
>>>>>>>>> mimeTypes element, which is what aggregate.py reads.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Text content and metadata extracted from the crawled binary
>>>>>>>>> data are stored in a structured document format (JSON).
>>>>>>>>> Furthermore, this document is encoded using CBOR
>>>>>>>>> <http://cbor.io/> serialization. Each non-human-readable
>>>>>>>>> character that you notice at the beginning and at the end of
>>>>>>>>> the JSON data is due to the CBOR encoding.
>>>>>>>>> Thus, if you need to read JSON data from a document dumped by
>>>>>>>>> CommonCrawlDataDumper, you have to deserialize the
>>>>>>>>> CBOR-encoded data structure inside the file.
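
To make the deserialization step concrete, here is a minimal hand-written sketch that unwraps one CBOR text string (major type 3 in RFC 7049) and returns the text inside it. It assumes the dumped file holds exactly one text string, either definite-length or chunked; the class name CborTextSketch is hypothetical, and in practice a CBOR library would be the better choice.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class CborTextSketch {
    /** Decodes one CBOR text string (major type 3), definite-length or chunked. */
    static String readText(ByteBuffer buf) {
        int first = buf.get() & 0xFF;
        if ((first >> 5) != 3) {
            throw new IllegalArgumentException("not a CBOR text string: 0x" + Integer.toHexString(first));
        }
        int info = first & 0x1F; // low 5 bits: how the length is encoded
        if (info != 31) {
            return new String(readBytes(buf, readLength(buf, info)), StandardCharsets.UTF_8);
        }
        // Indefinite length (0x7F): definite-length chunks until the 0xFF break byte.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while (true) {
            int head = buf.get() & 0xFF;
            if (head == 0xFF) {
                break;
            }
            if ((head >> 5) != 3 || (head & 0x1F) == 31) {
                throw new IllegalArgumentException("bad chunk header: 0x" + Integer.toHexString(head));
            }
            byte[] chunk = readBytes(buf, readLength(buf, head & 0x1F));
            out.write(chunk, 0, chunk.length);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Reads the length argument; ByteBuffer is big-endian by default, as CBOR requires. */
    static int readLength(ByteBuffer buf, int info) {
        if (info < 24) return info;                     // length stored in the initial byte
        if (info == 24) return buf.get() & 0xFF;        // 1-byte length
        if (info == 25) return buf.getShort() & 0xFFFF; // 2-byte length
        if (info == 26) {                               // 4-byte length
            int n = buf.getInt();
            if (n < 0) throw new IllegalArgumentException("string too large for this sketch");
            return n;
        }
        throw new IllegalArgumentException("unsupported length encoding: " + info);
    }

    static byte[] readBytes(ByteBuffer buf, int n) {
        byte[] bytes = new byte[n];
        buf.get(bytes);
        return bytes;
    }
}
```

The returned string would then be handed to an ordinary JSON parser. Other CBOR item types (maps, arrays, tags) are deliberately not handled here.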
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I hope this short overview can help in your work. I really
>>>>>>>>> appreciate your feedback and, by the way, thanks a lot for your
>>>>>>>>> great job in detection.
>>>>>>>>> 
>>>>>>>>> I am available to provide all the support I can, so do not
>>>>>>>>> hesitate to contact me if you need any further information.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> 
>>>>>>>>> Giuseppe
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Finally, after some research, I guess that the statistical
>>>>>>>>> information (present in the readme of the code repo) is not
>>>>>>>>> collected and computed by aggregate.py from those output
>>>>>>>>> JSON files; it looks like it comes from the log. See
>>>>>>>>> the following as an example:
>>>>>>>>> 
>>>>>>>>> 2015-04-19 04:55:42,078 INFO  tools.CommonCrawlDataDumper - 
>>>>>>>>> CommonsCrawlDataDumper File Stats:
>>>>>>>>> TOTAL Stats:
>>>>>>>>> [
>>>>>>>>>  {"mimeType":"application/x-tika-msoffice","count":"17"}
>>>>>>>>>  {"mimeType":"application/vnd.ms-excel","count":"7"}
>>>>>>>>>  {"mimeType":"application/xhtml+xml","count":"3000"}
>>>>>>>>>  {"mimeType":"application/octet-stream","count":"641"}
>>>>>>>>>  {"mimeType":"application/epub+zip","count":"1"}
>>>>>>>>>  {"mimeType":"application/zip","count":"6"}
>>>>>>>>>  {"mimeType":"application/xml","count":"11"}
>>>>>>>>>  {"mimeType":"image/png","count":"110"}
>>>>>>>>>  {"mimeType":"image/jpeg","count":"70"}
>>>>>>>>>  {"mimeType":"application/atom+xml","count":"213"}
>>>>>>>>>  {"mimeType":"application/rss+xml","count":"43"}
>>>>>>>>>  {"mimeType":"video/mp4","count":"3"}
>>>>>>>>>  {"mimeType":"text/plain","count":"104"}
>>>>>>>>>  {"mimeType":"application/rdf+xml","count":"2"}
>>>>>>>>>  {"mimeType":"image/gif","count":"2"}
>>>>>>>>>  {"mimeType":"text/x-php","count":"1"}
>>>>>>>>>  {"mimeType":"video/x-msvideo","count":"1"}
>>>>>>>>>  {"mimeType":"application/x-tika-ooxml","count":"3"}
>>>>>>>>>  {"mimeType":"text/html","count":"9506"}
>>>>>>>>>  {"mimeType":"application/pdf","count":"280"}
>>>>>>>>> ]
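
If the per-type counts need to be pulled out of such a log programmatically, one possible sketch is a regular expression over the log lines. The entry layout is taken from the excerpt above; the class name StatsLogSketch is hypothetical, and nothing here is part of Nutch or of aggregate.py.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StatsLogSketch {
    // Matches entries such as {"mimeType":"image/png","count":"110"}
    private static final Pattern ENTRY =
        Pattern.compile("\\{\"mimeType\":\"([^\"]+)\",\"count\":\"(\\d+)\"\\}");

    /** Collects mime type -> count from log lines, keeping the order of first appearance. */
    static Map<String, Integer> parse(Iterable<String> logLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : logLines) {
            Matcher m = ENTRY.matcher(line);
            while (m.find()) {
                // Sum the counts if the same type shows up in more than one section.
                counts.merge(m.group(1), Integer.parseInt(m.group(2)), Integer::sum);
            }
        }
        return counts;
    }
}
```

Lines that do not contain an entry (timestamps, "TOTAL Stats:", brackets) are simply skipped.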
>>>>>>>>> 
>>>>>>>>> It turns out that aggregate.py is not what produces the
>>>>>>>>> statistical information; I am not sure what it does. Anyway, I
>>>>>>>>> think I understand the whole idea and I concur with it.
>>>>>>>>> Maybe we can repackage Tika with the feature (i.e.
>>>>>>>>> probabilistic mime selection) incorporated and see whether it
>>>>>>>>> outputs the same information in the log as the version
>>>>>>>>> without it.
>>>>>>>>> 
>>>>>>>>> BTW, regarding the use of the probabilistic mime selection
>>>>>>>>> feature:
>>>>>>>>> in my pull request, I added a simple test case that shows a
>>>>>>>>> bit more about how the feature is called and used.
>>>>>>>>> Here is an example snippet:
>>>>>>>>>     ProbabilisticMimeDetectionSelector probSel =
>>>>>>>>>         new ProbabilisticMimeDetectionSelector();
>>>>>>>>>     // input is an InputStream, metadata is a Metadata
>>>>>>>>>     probSel.detect(input, metadata);
>>>>>>>>> It is similar to MimeTypes::detect(...) (more information
>>>>>>>>> can be found in
>>>>>>>>> https://issues.apache.org/jira/browse/TIKA-1517).
>>>>>>>>> Now, in order to allow Tika().detect() to call
>>>>>>>>> ProbabilisticMimeDetectionSelector::detect(...) (as
>>>>>>>>> Tika().detect() is what commoncrawldump calls), we need to
>>>>>>>>> modify/add some code in TikaConfig, which initializes a list
>>>>>>>>> of default detectors: we need to remove the detector
>>>>>>>>> mimeTypes::MimeTypes from the list and replace it with
>>>>>>>>> probSel::ProbabilisticMimeDetectionSelector. (I am not sure
>>>>>>>>> whether I should create another pull request with this change
>>>>>>>>> for TikaConfig.)
>>>>>>>>> 
>>>>>>>>> I think that is all of my initial thoughts, findings, and
>>>>>>>>> plans; if there is anything you would like to add or
>>>>>>>>> comment on, please kindly let me know, and then I will start
>>>>>>>>> working on my 'finale'. BTW, don't worry: graduation is not
>>>>>>>>> the end of my work with Tika and this project. After I
>>>>>>>>> graduate, I still can and want to help this polar project and
>>>>>>>>> Tika as much as possible, fix programming faults and bugs,
>>>>>>>>> respond to Tika issues, etc.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Thanks
>>>>>>>>> Luke
>>>>>>>>> 
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>>>>>>>>> Sent: Saturday, April 18, 2015 6:26 AM
>>>>>>>>> To: Luke liu; 'Mattmann, Chris A (3980)'; Bryant, Ann C 
>>>>>>>>> (398G-Affiliate); Zimdars, Paul A (3980-Affiliate)
>>>>>>>>> Cc: 'Luke'; NSF Polar CyberInfrastructure DR Students; 
>>>>>>>>> memex-jpl@googlegroups.com
>>>>>>>>> Subject: Re: this week action from luke
>>>>>>>>> Importance: High
>>>>>>>>> 
>>>>>>>>> Awesome Luke. I am going to work specifically on now 
>>>>>>>>> benchmarking your code in real situations. For example, it 
>>>>>>>>> would be fantastic to now run your Bayesian MIME detector over 
>>>>>>>>> the whole NSF TREC Dynamic Domain data for Polar described here:
>>>>>>>>> 
>>>>>>>>> http://github.com/chrismattmann/trec-dd-polar/
>>>>>>>>> 
>>>>>>>>> Paul Zimdars, CC'ed, can provide you with access to the data, 
>>>>>>>>> and Annie can explain it, also CC'ed.
>>>>>>>>> 
>>>>>>>>> Can we make that your goal for the next 2 weeks to actually 
>>>>>>>>> test it and produce a real result over the whole TREC-DD data 
>>>>>>>>> for Polar? My goal will be to get your code committed and 
>>>>>>>>> integrated into Tika.
>>>>>>>>> The more you can write me a guide of how to build and test your 
>>>>>>>>> code with Tika so I can get it committed the better.
>>>>>>>>> 
>>>>>>>>> Also CC'ing the Memex list for context. Note everyone: Luke is 
>>>>>>>>> building a Bayesian MIME classifier to evaluate against Tika's 
>>>>>>>>> existing MIME detection approach. If folks have any Memex needs 
>>>>>>>>> to try and test more accurate file identification with Tika, 
>>>>>>>>> Luke is the guy to talk to and I have him for 2 more weeks.
>>>>>>>>> 
>>>>>>>>> Thanks!
>>>>>>>>> 
>>>>>>>>> Cheers,
>>>>>>>>> Chris
>>>>>>>>> 
>>>>>>>>> ------------------------
>>>>>>>>> Chris Mattmann
>>>>>>>>> chris.mattmann@gmail.com
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Luke liu <sh...@usc.edu>
>>>>>>>>> Date: Thursday, April 16, 2015 at 11:29 PM
>>>>>>>>> To: Chris Mattmann <ch...@gmail.com>, Chris Mattmann 
>>>>>>>>> <Ch...@jpl.nasa.gov>
>>>>>>>>> Cc: 'Luke' <ha...@gmail.com>
>>>>>>>>> Subject: this week action from luke
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Hi Professor Mattmann,
>>>>>>>>> 
>>>>>>>>> I think I am in the final phase of the research, and last week 
>>>>>>>>> I finished the last item in the list, and hopefully everything 
>>>>>>>>> will be fine.
>>>>>>>>> 
>>>>>>>>> For now, I can probably spend some time verifying or
>>>>>>>>> optimizing the code; the majority of the research has been
>>>>>>>>> done. It would also be great if you could comment on
>>>>>>>>> my work (the 2 pull requests) when you have time.
>>>>>>>>> 
>>>>>>>>> If anything in my work is unclear, please do
>>>>>>>>> let me know.
>>>>>>>>> 
>>>>>>>>> Thanks, and I am glad to be working with you. For the next
>>>>>>>>> couple of weeks before graduation, I am going to continue
>>>>>>>>> revising and testing the code and features to get rid of any
>>>>>>>>> flaws when I have time.
>>>>>>>>> 
>>>>>>>>> I am not sure if I have missed something; if I have missed
>>>>>>>>> anything important, please do let me know too.
>>>>>>>>> 
>>>>>>>>> Thanks
>>>>>>>>> Luke
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> You received this message because you are subscribed to the 
>>>>>>>>> Google Groups "JPL-Kitware-Continuum Memex Group" group.
>>>>>>>>> To unsubscribe from this group and stop receiving emails from 
>>>>>>>>> it, send an email to memex-jpl+unsubscribe@googlegroups.com
>>>>>>>>> <ma...@googlegroups.com>.
>>>>>>>>> To post to this group, send email to memex-jpl@googlegroups.com.
>>>>>>>>> Visit this group at http://groups.google.com/group/memex-jpl.
>>>>>>>>> To view this discussion on the web visit
>>>>>>>>> https://groups.google.com/d/msgid/memex-jpl/000f01d07ad4%24b3510070%2419f30150%24%40edu.
>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>> <garbled.jpg><1423894754000.html>
>> 
>> 
>

RE: [memex-jpl] this week action from luke

Posted by Luke <ha...@gmail.com>.

Both of Giuseppe's patches work, based on my tests. In the tests I
was able to see the magic tag being written at the beginning of the
file, and the cbor extension being appended too, when running the Nutch
dump tool command with the "-extension cbor" option. Thanks a lot for the
kind help, Giuseppe, highly appreciated. I want to give a big thumbs
up to Giuseppe's work; it is thorough and considerate too.

To professor,
with Giuseppe's two patches, we still need to make a small change in Tika's
tika-mimetypes.xml (BTW, the CBOR magic tag can be used as magic bytes in
Tika as it does not look very common; even if it accidentally appears in
some other type of file, Tika will have the extension and metadata hint as a
fallback strategy). I am going to send another pull request with that change.
But before that, I would like to explain what I am going to change, to
avoid any confusion.

Now we have two problems.
Problem 1: magic priority 40.
	application/xhtml+xml has a higher priority (50) than
application/cbor (40) [I don't know who assigned 40 to cbor, or why]. So
if xhtml gets read and compared first, cbor will not even be placed in the
magic estimation list because it has a lower priority. The tests confirm
that xhtml does get read and compared with the input file first, so any
type with a priority below 50 is disregarded.


Problem 2: magic priority again, with both types at 50.
	In Tika, given a file dumped by the Nutch dumper tool, both types
(xhtml and cbor) will be selected as candidate mime types and put in the
magic estimation list; since the xhtml type gets read first, it is
placed above cbor. To break that tie, Tika relies on the
decision from the extension method. If the extension method fails to detect
the type (for now, let's ignore the metadata hint method for simplicity, but
the same applies to it too), then xhtml is eventually returned.

My pull request to be sent: I am going to set the magic priority of the cbor
type to 50, the same as xhtml, because it would probably be risky to discard
any one of the estimated types without consulting the extension method.
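
The two problems and the proposed change can be illustrated with a toy model. This is a sketch of the behaviour as described in this message, not Tika's actual code: the class MagicPrioritySketch, its Match type, and the select method are hypothetical, and the only rules modelled are "lower-priority magic matches are dropped" and "the extension breaks ties among the remaining candidates".

```java
import java.util.ArrayList;
import java.util.List;

final class MagicPrioritySketch {
    /** A type whose magic pattern matched the file, with its magic priority (toy model). */
    static final class Match {
        final String type;
        final int priority;

        Match(String type, int priority) {
            this.type = type;
            this.priority = priority;
        }
    }

    /**
     * Keeps only the matches that share the highest priority, in match order.
     * If the extension-based type is among them it wins; otherwise the first one does.
     */
    static String select(List<Match> matches, String extensionType) {
        int best = Integer.MIN_VALUE;
        for (Match m : matches) {
            best = Math.max(best, m.priority);
        }
        List<String> top = new ArrayList<>();
        for (Match m : matches) {
            if (m.priority == best) {
                top.add(m.type);
            }
        }
        if (top.isEmpty()) {
            // No magic match at all: fall back to the extension, then to the most general type.
            return extensionType != null ? extensionType : "application/octet-stream";
        }
        if (extensionType != null && top.contains(extensionType)) {
            return extensionType;
        }
        return top.get(0);
    }
}
```

In this model, xhtml at 50 and cbor at 40 always yields xhtml even for a .cbor file (Problem 1). With both at 50, a .cbor extension selects cbor, while a file without a usable extension still falls back to xhtml (Problem 2).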

Any comments, suggestions, or thoughts are welcome and appreciated.

Thanks
Luke

-----Original Message-----
From: Luke [mailto:hanson311biz@gmail.com] 
Sent: Wednesday, April 22, 2015 7:45 PM
To: 'Mattmann, Chris A (3980)'
Cc: 'Chris Mattmann'; 'Totaro, Giuseppe U (3980-Affiliate)';
'dev@tika.apache.org'; 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A
(3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students';
'memex-jpl@googlegroups.com'
Subject: RE: [memex-jpl] this week action from luke

Hi Prof,

The test has finished, and the result is as expected.
Both versions (Tika with the prob feature and Tika without it) produced the
same "Stats total"; please see the attached matched.txt, dumped by a small
program that compares, verbatim, each line in every section of the
"Stats total" between the log produced by the Tika that has the feature and
the one without it. If string.equals(...) holds, the line is dumped out.
If there is a mismatch (e.g. the count for a particular mime type is
different), an error is dumped out. In the end,
I don't see any errors in the printout, so I think the feature has
passed the test.
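
The comparison program itself is not included in this thread; a minimal sketch of the check described above could look like the following. The class name StatsCompareSketch and the error format are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class StatsCompareSketch {
    /**
     * Compares two "Stats total" sections line by line. Matching lines are
     * echoed; any difference (including a missing line) is reported as an error.
     */
    static List<String> compare(List<String> withFeature, List<String> withoutFeature) {
        List<String> report = new ArrayList<>();
        int n = Math.max(withFeature.size(), withoutFeature.size());
        for (int i = 0; i < n; i++) {
            String a = i < withFeature.size() ? withFeature.get(i) : null;
            String b = i < withoutFeature.size() ? withoutFeature.get(i) : null;
            if (a != null && a.equals(b)) {
                report.add(a);
            } else {
                report.add("ERROR line " + (i + 1) + ": " + a + " != " + b);
            }
        }
        return report;
    }
}
```

A verbatim comparison like this is strict: it also flags lines that differ only in ordering or whitespace, which is acceptable here because both logs come from the same dumper.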


The processing times of the two tests are as follows.
Start and end times for the test where the Nutch dumper tool ran with the
prob selection feature:
from
2015-04-22 15:47:08,330
to
2015-04-22 17:48:28,877

Start and end times for the test where the Nutch dumper tool ran with
Tika without the feature:
from
2015-04-22 22:41:23,459
to
2015-04-23 00:11:02,767
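
For reference, the elapsed times can be worked out from the log timestamps above with a few lines of java.time code. The class name ElapsedSketch is hypothetical; the timestamp pattern is an assumption based on the log excerpts in this thread.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

final class ElapsedSketch {
    // Timestamp layout as it appears in the log lines, with a comma before the millis.
    private static final DateTimeFormatter LOG_TIME =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    /** Returns the time elapsed between two log timestamps. */
    static Duration elapsed(String from, String to) {
        return Duration.between(
            LocalDateTime.parse(from, LOG_TIME),
            LocalDateTime.parse(to, LOG_TIME));
    }

    public static void main(String[] args) {
        System.out.println(elapsed("2015-04-22 15:47:08,330", "2015-04-22 17:48:28,877"));
        System.out.println(elapsed("2015-04-22 22:41:23,459", "2015-04-23 00:11:02,767"));
    }
}
```

That works out to roughly 2 h 1 min for the run with the feature and roughly 1 h 30 min for the run without it. These are single runs, so the difference should be read with care.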


BTW, I forgot to mention that the probabilistic mime selector with default
weight settings also gives the following result, because by default I
intentionally assign a higher weight to the magic bytes method so as
to make it work in a way similar to the old strategy. On the other hand, if
I know that the extension is more reliable, I can certainly give more weight
to the extension approach; in that case, the prob mime selector will return
application/cbor with a higher weight value.

> <match value="&lt;html xmlns=" type="string" offset="0:1024"/>
> Result: "text/html"
> 
> <match value="&lt;html xmlns=" type="string" offset="0:6000"/>
> Result: "application/xhtml+xml"
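
The effect of the weights can be illustrated with a toy weighted vote. This is not the ProbabilisticMimeDetectionSelector implementation (its formula is not shown in this thread); the class WeightedVoteSketch, the select method, and the weight values used with it are made up for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class WeightedVoteSketch {
    /**
     * Each detection method votes for one type with its weight; the type with
     * the largest total wins. On an exact tie the magic-based type is kept,
     * because it is inserted first and only a strictly larger score replaces it.
     */
    static String select(String magicType, double magicWeight,
                         String extensionType, double extensionWeight) {
        Map<String, Double> score = new LinkedHashMap<>();
        score.merge(magicType, magicWeight, Double::sum);
        score.merge(extensionType, extensionWeight, Double::sum);
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : score.entrySet()) {
            if (e.getValue() > bestScore) {
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return best;
    }
}
```

With more weight on magic, a file whose magic says xhtml and whose extension says cbor comes out as xhtml, mirroring the old strategy; shifting the weight to the extension flips the answer to cbor.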


Please kindly let me know if anything about the tests is unclear.


Thanks
Luke

-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
Sent: Wednesday, April 22, 2015 3:49 PM
To: Luke
Cc: Chris Mattmann; Totaro, Giuseppe U (3980-Affiliate);
dev@tika.apache.org; Bryant, Ann C (398G-Affiliate); Zimdars, Paul A
(3980-Affiliate); NSF Polar CyberInfrastructure DR Students;
memex-jpl@googlegroups.com
Subject: Re: [memex-jpl] this week action from luke

Thanks, Luke. This is probably a good opportunity to test out your Bayesian
mime detector. How does it perform here?

Sent from my iPhone

> On Apr 22, 2015, at 3:29 PM, Luke <ha...@gmail.com> wrote:
> 
> Hi professor,
> 
> Please see the following results.
> <match value="&lt;html xmlns=" type="string" offset="0:1024"/>
> Result: "text/html"
> 
> <match value="&lt;html xmlns=" type="string" offset="0:6000"/>
> Result: "application/xhtml+xml"
> 
> 
> Thanks
> Luke
> 
> -----Original Message-----
> From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
> Sent: Wednesday, April 22, 2015 4:21 AM
> To: Luke; 'Mattmann, Chris A (3980)'; 'Totaro, Giuseppe U 
> (3980-Affiliate)'; dev@tika.apache.org
> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A 
> (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; 
> memex-jpl@googlegroups.com
> Subject: Re: [memex-jpl] this week action from luke
> 
> Hi Luke,
> 
> Actually I just meant go into tika-mimetypes.xml and change the magic
> offsets for application/xhtml+xml and see if that works. The code you
> changed below is actually how many bytes Tika will first download to do
> MIME checking.
> 
> Cheers,
> Chris
> 
> ------------------------
> Chris Mattmann
> chris.mattmann@gmail.com
> 
> 
> 
> 
> -----Original Message-----
> From: Luke <ha...@gmail.com>
> Date: Wednesday, April 22, 2015 at 2:25 AM
> To: Chris Mattmann <ch...@gmail.com>, Chris Mattmann
> <Ch...@jpl.nasa.gov>, "'Totaro, Giuseppe U (3980-Affiliate)'"
> <to...@di.uniroma1.it>, <de...@tika.apache.org>
> Cc: "'Bryant, Ann C (398G-Affiliate)'" <an...@gmail.com>, 
> "'Zimdars, Paul A (3980-Affiliate)'" <Pa...@jpl.nasa.gov>, 
> NSF Polar CyberInfrastructure DR Students 
> <ns...@googlegroups.com>,
> <me...@googlegroups.com>
> Subject: RE: [memex-jpl] this week action from luke
> 
>> 
>> Hi professor,
>> 
>> I just tried it with minLength set to 1024, and I get
>> "text/plain".
>> I am a bit surprised....
>> 
>> BTW, a min length of 6000 still gives "application/xhtml+xml"; with 
>> any min length below 1024, I am seeing "text/plain". :)
>> 
>> BTW, the min length I am referring to and altering is the following,
>> in MimeTypes.java:
>>    public int getMinLength() {
>>       // This needs to be reasonably large to be able to correctly detect
>>       // things like XML root elements after initial comment and DTDs
>>       return 64 * 1024;
>>    }
>> 
>> 
>> Thanks
>> Luke
>> 
>> -----Original Message-----
>> From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>> Sent: Tuesday, April 21, 2015 7:48 PM
>> To: Luke; 'Mattmann, Chris A (3980)'; 'Totaro, Giuseppe U 
>> (3980-Affiliate)'; dev@tika.apache.org
>> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A 
>> (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; 
>> memex-jpl@googlegroups.com
>> Subject: Re: [memex-jpl] this week action from luke
>> 
>> Thanks Luke.
>> 
>> So I guess all I was asking was could you try it out. Thanks for the 
>> lesson in the RFC.
>> 
>> Cheers,
>> Chris
>> 
>> ------------------------
>> Chris Mattmann
>> chris.mattmann@gmail.com
>> 
>> 
>> 
>> 
>> -----Original Message-----
>> From: Luke <ha...@gmail.com>
>> Date: Wednesday, April 22, 2015 at 1:46 AM
>> To: Chris Mattmann <Ch...@jpl.nasa.gov>, Chris Mattmann 
>> <ch...@gmail.com>, "'Totaro, Giuseppe U (3980-Affiliate)'"
>> <to...@di.uniroma1.it>, <de...@tika.apache.org>
>> Cc: "'Bryant, Ann C (398G-Affiliate)'" <an...@gmail.com>, 
>> "'Zimdars, Paul A (3980-Affiliate)'" <Pa...@jpl.nasa.gov>, 
>> NSF Polar CyberInfrastructure DR Students 
>> <ns...@googlegroups.com>,
>> <me...@googlegroups.com>
>> Subject: RE: [memex-jpl] this week action from luke
>> 
>>> Hi professor,
>>> 
>>> 
>>> I think it depends heavily on the content being read by Tika. E.g. if
>>> a sequence of bytes in the file being read matches the magic of one
>>> or more of the mime types defined in our tika-mimetypes.xml, I guess
>>> that Tika will put those types in its estimation list; please note
>>> that the magic-byte detection approach can yield multiple estimated
>>> mime types. Tika also considers the decision made by the extension
>>> detection approach: if the extension says the file type is the first
>>> one in the magic type estimation list, then certainly the first one
>>> will be returned (the same applies to the metadata hint approach).
>>> Of course, Tika also prefers the type that is the most specialized.
>>> 
>>> Let's get back to the following question; this is only my guess, though.
>>> [Prof]: Also what happens if you tweak the definition of XHTML to 
>>> not scan until 8192, but say 6000 (e.g., 0:6000), does CBOR take over
>>> then?
>>> Let's consider an extreme case where we only scan 10 bytes or 1 byte;
>>> then it seems that magic bytes will inevitably detect nothing, and I
>>> think Tika will return something like "application/octet-stream",
>>> which is the most general type. As mentioned, Tika favours the type
>>> that is the most specialized, and in this extreme case I believe
>>> almost every type is a subclass of "application/octet-stream", so
>>> whatever the extension approach returns is more specialized.
>>> Therefore the answer in this extreme case may be yes; I think it is
>>> very possible that the CBOR type detected by the extension approach
>>> takes over in this case.
>>> 
>>> My idea was and still is that if the CBOR self-describing tag 55799
>>> is present in the CBOR file, then it can be used to detect the cbor
>>> type.
>>> Again, the cbor type will probably be appended to the magic
>>> estimation list together with another one such as application/html.
>>> I guess the order in the list probably also matters: the first one
>>> is preferred over the next one. The decision from the extension
>>> detection approach also plays a role in breaking the tie,
>>> e.g. if the extension detection method says cbor, agreeing with one
>>> of the estimated types in the magic list, then cbor will be returned
>>> (again, the same applies to the metadata hint method).
>>> 
>>> I have not taken a closer look at a CBOR file that has the tag
>>> 55799, but I expect its hex to be something like 0xd9d9f7, i.e. the
>>> tag should be present in the header as a fixed sequence of bytes
>>> (https://tools.ietf.org/html/rfc7049#section-2.4.5). If this
>>> is present in the file, or preferably in the header (within a
>>> reasonable range of bytes), I believe it can probably be used as
>>> the magic numbers for the cbor type.
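
The expected bytes follow directly from the CBOR encoding rules: a tag is major type 6, and tag 55799 (0xD9F7) needs a 2-byte argument, so the initial byte is (6 << 5) | 25 = 0xD9, followed by 0xD9 0xF7. A small sketch of that derivation and of a prefix check is below; the class and method names are hypothetical and this is not Tika code.

```java
final class CborTagSketch {
    /** Encodes a CBOR tag number that needs a 16-bit argument (256..65535). */
    static byte[] encodeTag16(int tag) {
        if (tag < 256 || tag > 0xFFFF) {
            throw new IllegalArgumentException("tag out of 16-bit range: " + tag);
        }
        // Major type 6 (tag) in the top 3 bits; additional info 25 means
        // "a 2-byte big-endian argument follows".
        return new byte[] {(byte) ((6 << 5) | 25), (byte) (tag >> 8), (byte) tag};
    }

    /** Returns true if the data begins with the self-describe CBOR tag 55799. */
    static boolean startsWithSelfDescribeTag(byte[] data) {
        byte[] magic = encodeTag16(55799); // 0xD9 0xD9 0xF7
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A check like this only helps for files whose encoder actually wrote the tag; CBOR data without it still has no fixed prefix.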
>>> 
>>> 
>>> There is another thing I mentioned in the JIRA ticket I opened
>>> yesterday for the CBOR parser and detection: it is also possible
>>> that CBOR content is embedded inside a plain JSON file, and the way
>>> a decoder can distinguish them in that file is by looking for
>>> tag 55799 again. This may rarely happen, but a robust parser
>>> should be able to take care of it; Tika might need to consider
>>> using FasterXML, which the Nutch tool uses, when developing the
>>> cbor parser...
>>> Again, let me cite the same paragraph from the RFC:
>>> 
>>> " a decoder might be able to parse both CBOR and JSON.
>>>  Such a decoder would need to mechanically distinguish the two 
>>> formats.  An easy way for an encoder to help the decoder would be to 
>>> tag the entire CBOR item with tag 55799, the serialization of which 
>>> will never be found at the beginning of a JSON text."
>>> 
>>> 
>>> Thanks
>>> Luke
>>> 
>>> 
>>> 
>>> -----Original Message-----
>>> From: Mattmann, Chris A (3980)
>>> [mailto:chris.a.mattmann@jpl.nasa.gov]
>>> Sent: Tuesday, April 21, 2015 9:49 PM
>>> To: Luke; 'Chris Mattmann'; Totaro, Giuseppe U (3980-Affiliate)
>>> Cc: Bryant, Ann C (398G-Affiliate); Zimdars, Paul A 
>>> (3980-Affiliate); 'NSF Polar CyberInfrastructure DR Students'; 
>>> memex-jpl@googlegroups.com
>>> Subject: Re: [memex-jpl] this week action from luke
>>> 
>>> Hi Luke,
>>> 
>>> Can you post the below conversation to dev@tika and summarize it there.
>>> Also what happens if you tweak the definition of XHTML to not scan 
>>> until 8192, but say 6000 (e.g., 0:6000), does CBOR take over then?
>>> 
>>> Cheers,
>>> Chris
>>> 
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Chris Mattmann, Ph.D.
>>> Chief Architect
>>> Instrument Software and Science Data Systems Section (398) NASA Jet 
>>> Propulsion Laboratory Pasadena, CA 91109 USA
>>> Office: 168-519, Mailstop: 168-527
>>> Email: chris.a.mattmann@nasa.gov
>>> WWW:  http://sunset.usc.edu/~mattmann/
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Adjunct Associate Professor, Computer Science Department University 
>>> of Southern California, Los Angeles, CA 90089 USA
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> -----Original Message-----
>>> From: Luke <ha...@gmail.com>
>>> Date: Wednesday, April 22, 2015 at 12:19 AM
>>> To: Chris Mattmann <ch...@gmail.com>, "Totaro, Giuseppe U 
>>> (3980-Affiliate)" <to...@di.uniroma1.it>, Chris Mattmann 
>>> <Ch...@jpl.nasa.gov>
>>> Cc: "Bryant, Ann C (398G-Affiliate)" <an...@gmail.com>, 
>>> "Zimdars, Paul A (3980-Affiliate)" <Pa...@jpl.nasa.gov>, 
>>> NSF Polar CyberInfrastructure DR Students 
>>> <ns...@googlegroups.com>,
>>> "memex-jpl@googlegroups.com" <me...@googlegroups.com>
>>> Subject: RE: [memex-jpl] this week action from luke
>>> 
>>>> Hi Professor,
>>>> Please see attached jpg for the difference.
>>>> Thanks
>>>> Luke
>>>> 
>>>> -----Original Message-----
>>>> From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>>>> Sent: Tuesday, April 21, 2015 5:27 PM
>>>> To: Luke; 'Giuseppe Totaro'; 'Mattmann, Chris A (3980)'
>>>> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A 
>>>> (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; 
>>>> memex-jpl@googlegroups.com
>>>> Subject: Re: [memex-jpl] this week action from luke
>>>> 
>>>> Hey Luke, what happens if you do
>>>>   java -jar /path/to/tika-app -m /path/to/cbor/file.cbor
>>>> compared to
>>>>   java -jar /path/to/tika-app -m < /path/to/cbor/file.cbor
>>>> Any difference?
>>>> 
>>>> ------------------------
>>>> Chris Mattmann
>>>> chris.mattmann@gmail.com
>>>> 
>>>> 
>>>> 
>>>> 
>>>> -----Original Message-----
>>>> From: Luke <ha...@gmail.com>
>>>> Date: Tuesday, April 21, 2015 at 5:41 PM
>>>> To: 'Luke' <ha...@gmail.com>, Chris Mattmann 
>>>> <ch...@gmail.com>, 'Giuseppe Totaro'
>>>> <to...@di.uniroma1.it>, Chris Mattmann 
>>>> <Ch...@jpl.nasa.gov>
>>>> Cc: "'Bryant, Ann C (398G-Affiliate)'" <an...@gmail.com>, 
>>>> "'Zimdars, Paul A (3980-Affiliate)'" <Pa...@jpl.nasa.gov>, 
>>>> NSF Polar CyberInfrastructure DR Students 
>>>> <ns...@googlegroups.com>,
>>>> <me...@googlegroups.com>
>>>> Subject: RE: [memex-jpl] this week action from luke
>>>> 
>>>>> Hi professor,
>>>>> I just sent a pull request for adding the cbor extension.
>>>>> The interesting thing is that Tika still identifies the file
>>>>> dumped by the Nutch dump tool as "application/xhtml+xml" even
>>>>> when I manually change the file extension to the correct one (i.e.
>>>>> *.cbor).
>>>>> 
>>>>> The reason is probably that Tika identifies
>>>>> "application/xhtml+xml" by searching for "&lt;html" in the file
>>>>> content (PFA: xhtml+xml.jpg). Now if you take a look at the CBOR
>>>>> file dumped by Nutch, you see that we do have that element as part
>>>>> of the CBOR content, because the entire crawled xhtml document
>>>>> seems to be embedded in the CBOR JSON (PFA: cbor.jpg). Also, in
>>>>> Tika, magic detection seems to have higher priority than glob
>>>>> detection, so the type is detected incorrectly.
>>>>> 
>>>>> Therefore, I would like to mention that adding the entry
>>>>> <glob pattern="*.cbor"/> does not resolve the issue as of now
>>>>> without some fixed magic bytes / patterns for cbor.
>>>>> I would also like to add that things will be different with our
>>>>> probabilistic mime detection selector, because if we know that the
>>>>> file extension is more reliable than magic bytes, then we can
>>>>> certainly give more preferential weight to the extension... This
>>>>> might also show that the current MimeTypes detection
>>>>> implementation is a bit stiff, or less flexible, in this scenario. :)
>>>>> 
>>>>> 
>>>>> Thanks
>>>>> Luke
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Luke [mailto:hanson311biz@gmail.com]
>>>>> Sent: Tuesday, April 21, 2015 12:14 PM
>>>>> To: 'Chris Mattmann'; 'Giuseppe Totaro'; 'Mattmann, Chris A (3980)'
>>>>> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A 
>>>>> (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; 
>>>>> 'memex-jpl@googlegroups.com'
>>>>> Subject: RE: [memex-jpl] this week action from luke
>>>>> 
>>>>> Yes, let me add the cbor extension entry in tika xml, will send 
>>>>> the pull request soon.
>>>>> 
>>>>> Thanks
>>>>> Luke
>>>>> -----Original Message-----
>>>>> From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>>>>> Sent: Tuesday, April 21, 2015 6:51 AM
>>>>> To: Giuseppe Totaro; Mattmann, Chris A (3980)
>>>>> Cc: Luke; Bryant, Ann C (398G-Affiliate); Zimdars, Paul A 
>>>>> (3980-Affiliate); NSF Polar CyberInfrastructure DR Students; 
>>>>> memex-jpl@googlegroups.com
>>>>> Subject: Re: [memex-jpl] this week action from luke
>>>>> 
>>>>> Giuseppe both of these ideas supporting the CBOR WRITE_TYPE_HEADER 
>>>>> and tag along with adding an -extension command would be fantastic.
>>>>> Can you file both of those NUTCH issues, wait a day or so, and 
>>>>> then based on feedback use your new Nutch commit karma to get 
>>>>> those into Nutch?
>>>>> 
>>>>> And then when creating the issues, can you link to the TIKA-1610 
>>>>> issue?
>>>>> At that point, when those two to be defined NUTCH issues are up, 
>>>>> Luke, in parallel can you throw up a pull request/patch in Tika 
>>>>> for the extension along with the MIME detection?
>>>>> 
>>>>> Cheers,
>>>>> Chris
>>>>> 
>>>>> ------------------------
>>>>> Chris Mattmann
>>>>> chris.mattmann@gmail.com
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Giuseppe Totaro <to...@di.uniroma1.it>
>>>>> Date: Tuesday, April 21, 2015 at 12:33 PM
>>>>> To: Chris Mattmann <Ch...@jpl.nasa.gov>
>>>>> Cc: Luke <ha...@gmail.com>, Chris Mattmann 
>>>>> <ch...@gmail.com>, "Bryant, Ann C (398G-Affiliate)"
>>>>> <an...@gmail.com>, "Zimdars, Paul A (3980-Affiliate)"
>>>>> <Pa...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR 
>>>>> Students <ns...@googlegroups.com>,
>>>>> "memex-jpl@googlegroups.com"
>>>>> <me...@googlegroups.com>
>>>>> Subject: Re: [memex-jpl] this week action from luke
>>>>> 
>>>>>> Thanks Luke. Great work.
>>>>>> Chris, we wrap a single string value, representing the JSON text, 
>>>>>> for each file into CBOR (by using serializeCBORData method). For 
>>>>>> instance, using the Unix hex dump tool, we can see that, as 
>>>>>> expected, the first byte of all files is "0x7F" (the first three 
>>>>>> bits are "011", that is the major type for strings, and the 
>>>>>> following 5 bits are "11010", meaning a uint32_t encodes the 
>>>>>> length of following text), and the following 4 bytes 
>>>>>> (single-precision
>>>>>> float) encodes the right length of file (as described in RFC7049 
>>>>>> <http://tools.ietf.org/html/rfc7049>).
>>>>>> Therefore, no CBOR tag is currently included in the file (a list 
>>>>>> of CBOR tags is available here 
>>>>>> <http://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml>).
>>>>>> I did not know about the CBOR "magic header". Thanks a lot, Luke, 
>>>>>> for this great research. Chris, if you agree, I can add support for 
>>>>>> prepending the self-describing CBOR tag 55799 to the 
>>>>>> CommonCrawlDataDumper class. I believe it is very easy, because I 
>>>>>> only have to enable the WRITE_TYPE_HEADER feature of the 
>>>>>> CBORGenerator class (the source code is available here 
>>>>>> <https://github.com/FasterXML/jackson-dataformat-cbor/blob/master/src/main/java/com/fasterxml/jackson/dataformat/cbor/CBORGenerator.java>).
>>>>>> Then, I can comment on the TIKA-1610 
>>>>>> <https://issues.apache.org/jira/browse/TIKA-1610> issue.
>>>>>> 
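To check a hex dump against RFC 7049, it can help to decode the initial byte mechanically. The sketch below is a minimal illustration, assuming only the header layout from the RFC (high 3 bits are the major type, low 5 bits the additional information); the function name `describe_head` is hypothetical.

```python
MAJOR_TYPES = {
    0: "unsigned integer", 1: "negative integer",
    2: "byte string", 3: "text string",
    4: "array", 5: "map", 6: "tag", 7: "simple/float",
}


def describe_head(data: bytes):
    """Return (major type name, additional info, argument) for a CBOR item head."""
    first = data[0]
    major = first >> 5      # high 3 bits
    info = first & 0x1F     # low 5 bits
    if info < 24:
        argument = info     # the value is stored in the initial byte itself
    elif info <= 27:
        size = 1 << (info - 24)  # 24..27 mean 1, 2, 4 or 8 following bytes
        argument = int.from_bytes(data[1:1 + size], "big")
    else:
        argument = None     # 28-30 are reserved, 31 means indefinite length
    return MAJOR_TYPES[major], info, argument


print(describe_head(bytes.fromhex("7a00001234")))  # ('text string', 26, 4660)
print(describe_head(bytes.fromhex("7f")))          # ('text string', 31, None)
print(describe_head(bytes.fromhex("d9d9f7")))      # ('tag', 25, 55799)
```

For a text string, additional information 26 means the next four bytes are an unsigned 32-bit length, while 0x7F (additional information 31) starts an indefinite-length string that ends with a 0xFF break byte. The last line shows why the self-describing tag 55799 serializes as 0xd9d9f7.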
>>>>>> Regarding the file extension, in the Memex CCA format the 
>>>>>> original file extension is used. We could add support for an 
>>>>>> -extension command-line option that lets the user set a file 
>>>>>> extension (e.g., cbor) for all files dumped out.
>>>>>> 
>>>>>> Thanks a lot,
>>>>>> Giuseppe
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Tue, Apr 21, 2015 at 7:31 AM, Mattmann, Chris A (3980) 
>>>>>> <ch...@jpl.nasa.gov> wrote:
>>>>>> 
>>>>>> Thanks for this great research, Luke!
>>>>>> 
>>>>>> Giuseppe, any idea why this tag doesn't make it into the file?
>>>>>> 
>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>> Chris Mattmann, Ph.D.
>>>>>> Chief Architect
>>>>>> Instrument Software and Science Data Systems Section (398) NASA 
>>>>>> Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>>>> Office: 168-519, Mailstop: 168-527
>>>>>> Email: chris.a.mattmann@nasa.gov
>>>>>> WWW:  http://sunset.usc.edu/~mattmann/
>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>> Adjunct Associate Professor, Computer Science Department 
>>>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: Luke <ha...@gmail.com>
>>>>>> Date: Tuesday, April 21, 2015 at 2:55 AM
>>>>>> To: Chris Mattmann <ch...@gmail.com>, "Totaro, Giuseppe 
>>>>>> U (3980-Affiliate)" <to...@di.uniroma1.it>, Chris Mattmann 
>>>>>> <Ch...@jpl.nasa.gov>, "Bryant, Ann C (398G-Affiliate)"
>>>>>> <an...@gmail.com>, "Zimdars, Paul A (3980-Affiliate)"
>>>>>> <Pa...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR 
>>>>>> Students <ns...@googlegroups.com>,
>>>>>> "memex-jpl@googlegroups.com"
>>>>>> <me...@googlegroups.com>
>>>>>> Subject: RE: [memex-jpl] this week action from luke
>>>>>> 
>>>>>>> Thanks professor.
>>>>>>> Hi professor and all.
>>>>>>> JIRA issue : CBOR Parser and detection improvement
>>>>>>> https://issues.apache.org/jira/browse/TIKA-1610
>>>>>>> 
>>>>>>> I did a bit of research on this CBOR detection.
>>>>>>> 
>>>>>>> It looks like there is a self-describing tag that needs to be 
>>>>>>> written in the CBOR file, through which other applications might 
>>>>>>> be able to identify the CBOR type.
>>>>>>> Please refer to http://tools.ietf.org/html/rfc7049#section-2.4.5
>>>>>>> 
>>>>>>> I don't see that tag in the CBOR file dumped by 
>>>>>>> the Nutch tool, though I am not very sure.
>>>>>>> 
>>>>>>> Thanks
>>>>>>> Luke
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>>>>>>> Sent: Monday, April 20, 2015 4:10 AM
>>>>>>> To: Luke; 'Giuseppe Totaro'; 'Chris Mattmann'; 'Bryant, Ann C 
>>>>>>> (398G-Affiliate)'; 'Zimdars, Paul A (3980-Affiliate)'; 'NSF 
>>>>>>> Polar CyberInfrastructure DR Students'; 
>>>>>>> memex-jpl@googlegroups.com
>>>>>>> Subject: Re: [memex-jpl] this week action from luke
>>>>>>> 
>>>>>>> Nice one, Luke. If you have a second and you can open up an 
>>>>>>> issue in Tika to make it support CBOR, then yes, by all means!
>>>>>>> :)
>>>>>>> 
>>>>>>> 
>>>>>>> ------------------------
>>>>>>> Chris Mattmann
>>>>>>> chris.mattmann@gmail.com
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Luke <ha...@gmail.com>
>>>>>>> Date: Monday, April 20, 2015 at 4:15 AM
>>>>>>> To: 'Giuseppe Totaro' <to...@di.uniroma1.it>, Chris Mattmann 
>>>>>>> <ch...@gmail.com>, Chris Mattmann 
>>>>>>> <Ch...@jpl.nasa.gov>, "'Bryant, Ann C (398G-Affiliate)'"
>>>>>>> <an...@gmail.com>, "'Zimdars, Paul A (3980-Affiliate)'"
>>>>>>> <Pa...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR 
>>>>>>> Students <ns...@googlegroups.com>,
>>>>>>> <me...@googlegroups.com>
>>>>>>> Subject: RE: [memex-jpl] this week action from luke
>>>>>>> 
>>>>>>>> Thanks a lot, Giuseppe, for the prompt response, which cleared up 
>>>>>>>> some of my confusion about the Nutch CommonCrawlDataDumper. 
>>>>>>>> Appreciated.
>>>>>>>> 
>>>>>>>> BTW, it looks like Tika might need to consider supporting a 
>>>>>>>> CBOR parser and CBOR detection.
>>>>>>>> I checked the RFC, and it looks like CBOR has no magic numbers.
>>>>>>>> PFA: rfc_cbor.jpg
>>>>>>>> Actually, I don't quite understand why the 
>>>>>>>> CommonCrawlDataDumper is not dumping the Nutch segments with 
>>>>>>>> the .cbor extension, which would be helpful for type detection.
>>>>>>>> 
>>>>>>>> To professor Mattmann,
>>>>>>>> Tika does not support the detection of CBOR. Although the trunk 
>>>>>>>> version has entries for CBOR (PFA: cbor_tika.mimetypes.xml) 
>>>>>>>> in tika-mimetypes.xml, those entries do not properly detect 
>>>>>>>> the CBOR files dumped by CommonCrawlDataDumper. Also, 
>>>>>>>> CBOR does not have magic bytes; off the top of my head, the only 
>>>>>>>> ways we can detect it are the extension and a content byte 
>>>>>>>> histogram (please note, this is a locally optimal solution and 
>>>>>>>> data-dependent). :)
>>>>>>>> 
>>>>>>>> I think I am deviating a bit from the main route and discussion 
>>>>>>>> of this thread, i.e. the plan for testing the "probabilistic 
>>>>>>>> mime detector selection" with polar data.
>>>>>>>> Anyway, I plan to repackage Tika with the probabilistic 
>>>>>>>> selection feature incorporated, replace the Tika jar in 
>>>>>>>> Nutch with the repackaged one, and then run the 
>>>>>>>> CommonCrawlDataDumper and see how it goes. If you have any 
>>>>>>>> specific ideas or thoughts on the testing, please kindly let me 
>>>>>>>> know.
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> Luke
>>>>>>>> 
>>>>>>>> From: Giuseppe Totaro [mailto:totaro@di.uniroma1.it]
>>>>>>>> Sent: Sunday, April 19, 2015 11:17 PM
>>>>>>>> To: Luke liu
>>>>>>>> Cc: Chris Mattmann; Chris Mattmann; Bryant, Ann C 
>>>>>>>> (398G-Affiliate); Zimdars, Paul A (3980-Affiliate); Luke; NSF 
>>>>>>>> Polar CyberInfrastructure DR Students; 
>>>>>>>> memex-jpl@googlegroups.com
>>>>>>>> Subject: Re: [memex-jpl] this week action from luke
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Hi Luke,
>>>>>>>> 
>>>>>>>> 
>>>>>>>> my name is Giuseppe and I am a PhD student working under the 
>>>>>>>> supervision of Prof. Chris Mattmann. I worked on 
>>>>>>>> CommonCrawlDataDumper tool, so I can give some feedback on a 
>>>>>>>> couple of your observations. My comments inline below.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Il giorno 19/apr/2015, alle ore 12:11, Luke liu 
>>>>>>>> <sh...@usc.edu> ha
>>>>>>>> scritto:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Thanks a lot professor; Sorry for the brief delay, I was 
>>>>>>>> spending some time in understanding the code repo i.e.
>>>>>>>> http://github.com/chrismattmann/trec-dd-polar/
>>>>>>>> 
>>>>>>>> From gen-common-crawl.sh, it looks like commoncrawldump is 
>>>>>>>> dumping the crawl segments to json files with the human 
>>>>>>>> readable and understandable content.
>>>>>>>> 1) I am trying to run one of the commands shown in 
>>>>>>>> gen-common-crawl.sh on my side, but the generated files all end 
>>>>>>>> with .html or .htm. The command listed in gen-common-crawl.sh 
>>>>>>>> seems to allude to where the data is located on our 
>>>>>>>> nsfpolardata.dyndns.org <http://nsfpolardata.dyndns.org/>. 
>>>>>>>> Although the locations are not exactly correct (they probably 
>>>>>>>> need to be updated), part of the patterns allowed me to locate 
>>>>>>>> some similar datasets (e.g. /data2/crawls/raw/CS572Spring2015). 
>>>>>>>> Again, I see that the dumped files all end with html, but 
>>>>>>>> surprisingly, inside those output html files the contents are 
>>>>>>>> in JSON format.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> The file extension is (almost) always the same as that of the 
>>>>>>>> original file. In more detail, using the -epochFilename 
>>>>>>>> command-line option (as in gen-common-crawl.sh), the scraped 
>>>>>>>> data will be stored with a filename of the format 
>>>>>>>> <epochtime(milliseconds)>.<filetype>, where <filetype> is either 
>>>>>>>> the extension of the original file or .html by default if the 
>>>>>>>> original file does not have an extension.
>>>>>>>> This scheme is used for file naming and does not depend on the 
>>>>>>>> internal output format (JSON).
>>>>>>>> 
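The naming scheme described above can be sketched in a few lines. This is an illustration of the described behaviour only, not the actual CommonCrawlDataDumper code; the helper name `dump_filename` and its parameters are hypothetical.

```python
import os


def dump_filename(epoch_millis: int, original_name: str, default_ext: str = "html") -> str:
    """Build <epochtime(milliseconds)>.<filetype> from the original file name.

    Falls back to default_ext when the original name has no extension.
    """
    _, ext = os.path.splitext(original_name)
    ext = ext.lstrip(".") or default_ext
    return f"{epoch_millis}.{ext}"


print(dump_filename(1423894754000, "report.pdf"))  # 1423894754000.pdf
print(dump_filename(1423894754000, "index"))       # 1423894754000.html
```

Because the name is derived from the crawled resource and not from the container format, a CBOR-encoded dump of an HTML page still ends in .html, which is what makes glob-based detection misleading here.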
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 2) Another problem is that the root object starts with 
>>>>>>>> some garbled chars in each of the output JSON files (with the 
>>>>>>>> html extension at the end). PFA: garbled.jpg; one of the 
>>>>>>>> output JSON files is also attached as an example (PFA: 
>>>>>>>> 1423894754000.html). The JSON files cannot be parsed properly 
>>>>>>>> by aggregate.py due to those garbled chars.
>>>>>>>> Even if I get rid of the garbled chars, there is no 
>>>>>>>> mimeTypes element, which is what aggregate.py reads.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Text content and metadata extracted from the crawled binary 
>>>>>>>> data are stored in a structured document format (JSON).
>>>>>>>> Furthermore, this document is encoded using CBOR 
>>>>>>>> <http://cbor.io/> serialization. Each non-human-readable 
>>>>>>>> character that you notice in front of and at the end of the 
>>>>>>>> JSON data is due to the CBOR encoding.
>>>>>>>> Thus, if you need to read JSON data from a document dumped out by 
>>>>>>>> CommonCrawlDataDumper, you have to deserialize the 
>>>>>>>> CBOR-encoded data structure inside the file.
>>>>>>>> 
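As a rough illustration of that deserialization step, the sketch below unwraps a single CBOR text string and hands the result to a JSON parser. It assumes the file holds exactly one text string (definite-length or chunked indefinite-length), optionally preceded by tag 55799; the function names are hypothetical, and a real CBOR library would be the better choice for production use.

```python
import json

SELF_DESCRIBE_TAG = b"\xd9\xd9\xf7"  # tag 55799, optional prefix


def _read_head(data: bytes, pos: int):
    """Return (major type, additional info, argument, next position)."""
    first = data[pos]
    major, info = first >> 5, first & 0x1F
    pos += 1
    if info < 24:
        return major, info, info, pos
    if info <= 27:
        size = 1 << (info - 24)
        return major, info, int.from_bytes(data[pos:pos + size], "big"), pos + size
    if info == 31:
        return major, info, None, pos  # indefinite length
    raise ValueError("reserved additional information value")


def read_text_string(data: bytes) -> str:
    """Decode one CBOR text string (major type 3) into a Python str."""
    pos = 3 if data.startswith(SELF_DESCRIBE_TAG) else 0
    major, info, length, pos = _read_head(data, pos)
    if major != 3:
        raise ValueError("top-level item is not a CBOR text string")
    if info != 31:
        return data[pos:pos + length].decode("utf-8")
    chunks = []
    while data[pos] != 0xFF:  # 0xFF is the break byte ending the chunk list
        major, info, length, pos = _read_head(data, pos)
        if major != 3 or length is None:
            raise ValueError("malformed text string chunk")
        chunks.append(data[pos:pos + length])
        pos += length
    return b"".join(chunks).decode("utf-8")


payload = b'{"url": "http://example.org/"}'
definite = b"\x7a" + len(payload).to_bytes(4, "big") + payload
chunked = (b"\x7f" + bytes([0x60 + 5]) + payload[:5]
           + b"\x78" + bytes([len(payload) - 5]) + payload[5:] + b"\xff")
print(json.loads(read_text_string(definite)))
print(json.loads(read_text_string(SELF_DESCRIBE_TAG + chunked)))
```

The header bytes in front of the text, and the break byte at the end of a chunked string, are the kind of non-readable characters that show up when such a file is opened as plain JSON.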
>>>>>>>> 
>>>>>>>> 
>>>>>>>> I hope this short overview helps in your work. I really 
>>>>>>>> appreciate your feedback and, by the way, thanks a lot for your 
>>>>>>>> great job on detection.
>>>>>>>> 
>>>>>>>> I am available to provide all the support I can give, so do 
>>>>>>>> not hesitate to contact me if you need any further information.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> 
>>>>>>>> Giuseppe
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Finally, after some research, I guess that the statistical 
>>>>>>>> information (present in the readme of the code repo) is not 
>>>>>>>> being collected and computed by aggregate.py from those output 
>>>>>>>> json files but it looks like it is coming from the log.... see 
>>>>>>>> the following as an example:
>>>>>>>> 
>>>>>>>> 2015-04-19 04:55:42,078 INFO  tools.CommonCrawlDataDumper - 
>>>>>>>> CommonsCrawlDataDumper File Stats:
>>>>>>>> TOTAL Stats:
>>>>>>>> [
>>>>>>>>  {"mimeType":"application/x-tika-msoffice","count":"17"}
>>>>>>>>  {"mimeType":"application/vnd.ms-excel","count":"7"}
>>>>>>>>  {"mimeType":"application/xhtml+xml","count":"3000"}
>>>>>>>>  {"mimeType":"application/octet-stream","count":"641"}
>>>>>>>>  {"mimeType":"application/epub+zip","count":"1"}
>>>>>>>>  {"mimeType":"application/zip","count":"6"}
>>>>>>>>  {"mimeType":"application/xml","count":"11"}
>>>>>>>>  {"mimeType":"image/png","count":"110"}
>>>>>>>>  {"mimeType":"image/jpeg","count":"70"}
>>>>>>>>  {"mimeType":"application/atom+xml","count":"213"}
>>>>>>>>  {"mimeType":"application/rss+xml","count":"43"}
>>>>>>>>  {"mimeType":"video/mp4","count":"3"}
>>>>>>>>  {"mimeType":"text/plain","count":"104"}
>>>>>>>>  {"mimeType":"application/rdf+xml","count":"2"}
>>>>>>>>  {"mimeType":"image/gif","count":"2"}
>>>>>>>>  {"mimeType":"text/x-php","count":"1"}
>>>>>>>>  {"mimeType":"video/x-msvideo","count":"1"}
>>>>>>>>  {"mimeType":"application/x-tika-ooxml","count":"3"}
>>>>>>>>  {"mimeType":"text/html","count":"9506"}
>>>>>>>>  {"mimeType":"application/pdf","count":"280"}
>>>>>>>> ]
>>>>>>>> 
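Since the per-type counts come from the log and not from the dumped files, they can be collected straight from the log text. The sketch below assumes the stats lines look exactly like the excerpt above (one JSON object per line, counts stored as strings); the function name `parse_stats` is hypothetical, and this is not what aggregate.py does.

```python
import json
from collections import Counter


def parse_stats(log_text: str) -> Counter:
    """Collect {mimeType: count} from CommonCrawlDataDumper stats lines."""
    counts = Counter()
    for line in log_text.splitlines():
        line = line.strip().rstrip(",")
        if not line.startswith('{"mimeType"'):
            continue  # skip the timestamp line, "TOTAL Stats:", "[" and "]"
        entry = json.loads(line)
        counts[entry["mimeType"]] += int(entry["count"])
    return counts


SAMPLE = """2015-04-19 04:55:42,078 INFO  tools.CommonCrawlDataDumper - CommonsCrawlDataDumper File Stats:
TOTAL Stats:
[
 {"mimeType":"application/xhtml+xml","count":"3000"}
 {"mimeType":"text/html","count":"9506"}
 {"mimeType":"application/pdf","count":"280"}
]"""

stats = parse_stats(SAMPLE)
print(stats.most_common(1))  # [('text/html', 9506)]
```

The block between "[" and "]" is parsed line by line because the objects are not comma-separated, so the whole block is not a valid JSON array. Running the same parser on logs produced with and without the probabilistic selector would give two counters that can be compared directly.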
>>>>>>>> It turns out that aggregate.py is not what produces the 
>>>>>>>> statistical information; I am not sure what it does. Anyway, I 
>>>>>>>> think I understand the whole idea and I concur with it: maybe 
>>>>>>>> we can repackage Tika with the feature (i.e. probabilistic MIME 
>>>>>>>> selection) incorporated and see whether it outputs the same 
>>>>>>>> information in the log as the version without it.
>>>>>>>> 
>>>>>>>> BTW, regarding the use of the probabilistic MIME selection 
>>>>>>>> feature: in my pull request I added a simple test case, which 
>>>>>>>> might tell a bit more about how the feature is called and used.
>>>>>>>> Here is an example snippet:
>>>>>>>>     // input is an InputStream, metadata is a Metadata
>>>>>>>>     ProbabilisticMimeDetectionSelector probSel =
>>>>>>>>         new ProbabilisticMimeDetectionSelector();
>>>>>>>>     probSel.detect(input, metadata);
>>>>>>>> It is similar to MimeTypes.detect(...) (more information 
>>>>>>>> can be found in https://issues.apache.org/jira/browse/TIKA-1517).
>>>>>>>> Now, in order to make Tika().detect() call 
>>>>>>>> ProbabilisticMimeDetectionSelector.detect(...) (as 
>>>>>>>> Tika().detect() is what commoncrawldump calls), we need to 
>>>>>>>> modify/add some code in TikaConfig, which initializes a list 
>>>>>>>> of default detectors: we need to remove the MimeTypes detector 
>>>>>>>> (mimeTypes) from the list and replace it with the 
>>>>>>>> ProbabilisticMimeDetectionSelector (probSel). (I am not sure 
>>>>>>>> whether I should create another pull request with this change 
>>>>>>>> for TikaConfig.)
>>>>>>>> 
>>>>>>>> I think that is all of my initial thoughts, findings and 
>>>>>>>> plans; if you have anything you would like to add or 
>>>>>>>> comment on, please do let me know, and then I will start 
>>>>>>>> working on my 'finale'. BTW, don't worry: graduating 
>>>>>>>> will not end my involvement with Tika and 
>>>>>>>> this project. After that I still can and want to help the 
>>>>>>>> polar project and Tika as much as possible, correct 
>>>>>>>> programming faults and bugs, respond to Tika issues, etc.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> Luke
>>>>>>>> 
>>>>>>>> -----Original Message-----
>>>>>>>> From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>>>>>>>> Sent: Saturday, April 18, 2015 6:26 AM
>>>>>>>> To: Luke liu; 'Mattmann, Chris A (3980)'; Bryant, Ann C 
>>>>>>>> (398G-Affiliate); Zimdars, Paul A (3980-Affiliate)
>>>>>>>> Cc: 'Luke'; NSF Polar CyberInfrastructure DR Students; 
>>>>>>>> memex-jpl@googlegroups.com
>>>>>>>> Subject: Re: this week action from luke
>>>>>>>> Importance: High
>>>>>>>> 
>>>>>>>> Awesome Luke. I am going to work specifically on now 
>>>>>>>> benchmarking your code in real situations. For example, it 
>>>>>>>> would be fantastic to now run your Bayesian MIME detector over 
>>>>>>>> the whole NSF TREC Dynamic Domain data for Polar described here:
>>>>>>>> 
>>>>>>>> http://github.com/chrismattmann/trec-dd-polar/
>>>>>>>> 
>>>>>>>> Paul Zimdars, CC'ed, can provide you with access to the data, 
>>>>>>>> and Annie can explain it, also CC'ed.
>>>>>>>> 
>>>>>>>> Can we make that your goal for the next 2 weeks to actually 
>>>>>>>> test it and produce a real result over the whole TREC-DD data 
>>>>>>>> for Polar? My goal will be to get your code committed and 
>>>>>>>> integrated into Tika.
>>>>>>>> The more you can write me a guide of how to build and test your 
>>>>>>>> code with Tika so I can get it committed the better.
>>>>>>>> 
>>>>>>>> Also CC'ing the Memex list for context. Note everyone: Luke is 
>>>>>>>> building a Bayesian MIME classifier to evaluate against Tika's 
>>>>>>>> existing MIME detection approach. If folks have any Memex needs 
>>>>>>>> to try and test more accurate file identification with Tika, 
>>>>>>>> Luke is the guy to talk to and I have him for 2 more weeks.
>>>>>>>> 
>>>>>>>> Thanks!
>>>>>>>> 
>>>>>>>> Cheers,
>>>>>>>> Chris
>>>>>>>> 
>>>>>>>> ------------------------
>>>>>>>> Chris Mattmann
>>>>>>>> chris.mattmann@gmail.com
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> -----Original Message-----
>>>>>>>> From: Luke liu <sh...@usc.edu>
>>>>>>>> Date: Thursday, April 16, 2015 at 11:29 PM
>>>>>>>> To: Chris Mattmann <ch...@gmail.com>, Chris Mattmann 
>>>>>>>> <Ch...@jpl.nasa.gov>
>>>>>>>> Cc: 'Luke' <ha...@gmail.com>
>>>>>>>> Subject: this week action from luke
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Hi Professor Mattmann,
>>>>>>>> 
>>>>>>>> I think I am in the final phase of the research, and last week 
>>>>>>>> I finished the last item in the list, and hopefully everything 
>>>>>>>> will be fine.
>>>>>>>> 
>>>>>>>> For now, I can probably spend some time verifying or 
>>>>>>>> optimizing the code; the majority of the research has been 
>>>>>>>> done... It would also be great if you could comment on 
>>>>>>>> my work (the 2 pull requests) when you have time.
>>>>>>>> 
>>>>>>>> If anything in my work is confusing, please do 
>>>>>>>> let me know.
>>>>>>>> 
>>>>>>>> Thanks, and I am glad to be working with you. For the next couple 
>>>>>>>> of weeks before graduation, I am going to continue revising and 
>>>>>>>> testing the code and features to get rid of any flaws 
>>>>>>>> when I have time.
>>>>>>>> 
>>>>>>>> I am not sure whether I am missing something; if I have missed 
>>>>>>>> anything important, please do let me know too.
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> Luke
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> You received this message because you are subscribed to the 
>>>>>>>> Google Groups "JPL-Kitware-Continuum Memex Group" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from 
>>>>>>>> it, send an email to memex-jpl+unsubscribe@googlegroups.com
>>>>>>>> <ma...@googlegroups.com>.
>>>>>>>> To post to this group, send email to memex-jpl@googlegroups.com.
>>>>>>>> Visit this group at http://groups.google.com/group/memex-jpl.
>>>>>>>> To view this discussion on the web visit
>>>>>>>> https://groups.google.com/d/msgid/memex-jpl/000f01d07ad4%24b3510070%2419f30150%24%40edu.
>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>> <garbled.jpg><1423894754000.html>
> 
>

Re: [memex-jpl] this week action from luke

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.

Thanks, Luke. This is probably a good opportunity to test out your Bayesian MIME detector. How does it perform here?

Sent from my iPhone

> On Apr 22, 2015, at 3:29 PM, Luke <ha...@gmail.com> wrote:
> 
> Hi professor,
> 
> Please see the following results.
> <match value="&lt;html xmlns=" type="string" offset="0:1024"/>
> Result: "text/html"
> 
> <match value="&lt;html xmlns=" type="string" offset="0:6000"/>
> Result: "application/xhtml+xml"
> 
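One possible reading of these two results is that the `<html xmlns=` string first appears in the CBOR-wrapped file somewhere after byte 1024 but before byte 6000. The sketch below illustrates that reading only; it assumes the offset range bounds the positions where the pattern may begin, and it is not Tika's magic-matching code. The function name and the test data are hypothetical.

```python
def string_magic_matches(data: bytes, pattern: bytes, start: int, end: int) -> bool:
    """True if pattern begins at some offset in [start, end] (inclusive)."""
    idx = data.find(pattern, start)
    return idx != -1 and idx <= end


# Hypothetical dump: 2000 bytes of wrapper/metadata before the embedded page.
data = b"x" * 2000 + b'<html xmlns="http://www.w3.org/1999/xhtml">'
pattern = b"<html xmlns="  # written as &lt;html xmlns= in the XML config

print(string_magic_matches(data, pattern, 0, 1024))  # False
print(string_magic_matches(data, pattern, 0, 6000))  # True
```

Under that reading, narrowing the range stops the XHTML magic from firing, but it does not by itself make anything identify the file as CBOR.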
> 
> Thanks
> Luke
> 
> -----Original Message-----
> From: Chris Mattmann [mailto:chris.mattmann@gmail.com] 
> Sent: Wednesday, April 22, 2015 4:21 AM
> To: Luke; 'Mattmann, Chris A (3980)'; 'Totaro, Giuseppe U (3980-Affiliate)'; dev@tika.apache.org
> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; memex-jpl@googlegroups.com
> Subject: Re: [memex-jpl] this week action from luke
> 
> Hi Luke,
> 
> Actually I just meant go into tika-mimetypes.xml and change the magic offsets for application/xhtml+xml and see if that works. The code you changed below is actually how many bytes Tika will first download to do MIME checking.
> 
> Cheers,
> Chris
> 
> ------------------------
> Chris Mattmann
> chris.mattmann@gmail.com
> 
> 
> 
> 
> -----Original Message-----
> From: Luke <ha...@gmail.com>
> Date: Wednesday, April 22, 2015 at 2:25 AM
> To: Chris Mattmann <ch...@gmail.com>, Chris Mattmann <Ch...@jpl.nasa.gov>, "'Totaro, Giuseppe U (3980-Affiliate)'"
> <to...@di.uniroma1.it>, <de...@tika.apache.org>
> Cc: "'Bryant, Ann C (398G-Affiliate)'" <an...@gmail.com>, "'Zimdars, Paul A (3980-Affiliate)'" <Pa...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR Students <ns...@googlegroups.com>,
> <me...@googlegroups.com>
> Subject: RE: [memex-jpl] this week action from luke
> 
>> 
>> Hi professor,
>> 
>> I just tried it with minLength set to 1024, and I get the following: 
>> "text/plain"
>> I am a bit surprised....
>> 
>> BTW, a min length of 6000 still gives "application/xhtml+xml"; with 
>> any min length below 1024, I am seeing "text/plain". :)
>> 
>> BTW, the min length I am referring to/altering is as follows, in 
>> MimeTypes.java:
>>    public int getMinLength() {
>>        // This needs to be reasonably large to be able to correctly detect
>>        // things like XML root elements after initial comment and DTDs
>>        return 64 * 1024;
>>    }
>> 
>> 
>> Thanks
>> Luke
>> 
>> -----Original Message-----
>> From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>> Sent: Tuesday, April 21, 2015 7:48 PM
>> To: Luke; 'Mattmann, Chris A (3980)'; 'Totaro, Giuseppe U 
>> (3980-Affiliate)'; dev@tika.apache.org
>> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A 
>> (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; 
>> memex-jpl@googlegroups.com
>> Subject: Re: [memex-jpl] this week action from luke
>> 
>> Thanks Luke.
>> 
>> So I guess all I was asking was could you try it out. Thanks for the 
>> lesson in the RFC.
>> 
>> Cheers,
>> Chris
>> 
>> ------------------------
>> Chris Mattmann
>> chris.mattmann@gmail.com
>> 
>> 
>> 
>> 
>> -----Original Message-----
>> From: Luke <ha...@gmail.com>
>> Date: Wednesday, April 22, 2015 at 1:46 AM
>> To: Chris Mattmann <Ch...@jpl.nasa.gov>, Chris Mattmann 
>> <ch...@gmail.com>, "'Totaro, Giuseppe U (3980-Affiliate)'"
>> <to...@di.uniroma1.it>, <de...@tika.apache.org>
>> Cc: "'Bryant, Ann C (398G-Affiliate)'" <an...@gmail.com>, 
>> "'Zimdars, Paul A (3980-Affiliate)'" <Pa...@jpl.nasa.gov>, NSF 
>> Polar CyberInfrastructure DR Students 
>> <ns...@googlegroups.com>,
>> <me...@googlegroups.com>
>> Subject: RE: [memex-jpl] this week action from luke
>> 
>>> Hi professor,
>>> 
>>> 
>>> I think it depends heavily on the content being read by Tika. E.g. if 
>>> there is a sequence of bytes in the file being read that is the 
>>> same as the magic of one or more MIME types defined in our 
>>> tika-mimetypes.xml, I guess that Tika will put those types in its 
>>> estimation list; please note that the magic-byte detection approach 
>>> can produce multiple estimated MIME types. Tika then also considers 
>>> the decision made by the extension detection approach: if the 
>>> extension says the file type is the first one in the magic 
>>> estimation list, then certainly the first one will be returned (the 
>>> same applies to the metadata hint approach). Of course, Tika also 
>>> prefers the type that is the most specialized.
>>> 
>>> Let's get back to the following question; this is only my guess, though.
>>> [Prof]: Also what happens if you tweak the definition of XHTML to not 
>>> scan until 8192, but say 6000 (e.g., 0:6000), does CBOR take over then?
>>> Let's consider an extreme case where we only scan 10 bytes or 1 byte. 
>>> Then it seems that magic bytes will inevitably detect nothing, and I 
>>> think it will return something like "application/octet-stream", which 
>>> is the most general type. As mentioned, Tika favours the type that is 
>>> the most specialized, and in this extreme case I believe almost every 
>>> type is a subclass of "application/octet-stream", so whatever the 
>>> extension approach returns is more specialized. Therefore the answer 
>>> in this extreme case may be yes: I think it is very possible that the 
>>> CBOR type detected by the extension approach takes over...
>>> 
>>> My idea was and still is that if the CBOR self-describing tag 55799 is 
>>> present in the CBOR file, then that can be used to detect the CBOR type.
>>> Again, the CBOR type will probably be appended to the magic 
>>> estimation list together with another one such as application/html. I 
>>> guess the order in the list probably also matters: the first one is 
>>> preferred over the next one. The decision from the extension 
>>> detection approach also plays a role in breaking the tie: 
>>> e.g. if the extension detection method agrees on CBOR with one of the 
>>> estimated types in the magic list, then CBOR will be returned (again, 
>>> the same applies to the metadata hint method).
>>> 
>>> I have not taken a closer look at a CBOR file that has the tag 55799, 
>>> but I expect its hex to be something like 0xd9d9f7, i.e. the tag 
>>> should be present in the header as a fixed sequence of bytes 
>>> (https://tools.ietf.org/html/rfc7049#section-2.4.5). If this is 
>>> present in the file, or preferably in the header (within a reasonable 
>>> range of bytes), I believe it can probably be used as the magic 
>>> number for the CBOR type.
>>> 
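A check for that fixed byte sequence is short. The sketch below is a minimal illustration, assuming the tag sits at the very start of the file as RFC 7049 section 2.4.5 describes; the function name `has_self_describe_tag` is hypothetical and this is not a Tika detector.

```python
SELF_DESCRIBE_CBOR = bytes([0xD9, 0xD9, 0xF7])  # serialization of tag 55799


def has_self_describe_tag(data: bytes) -> bool:
    """True if the data starts with the self-describe CBOR tag."""
    return data[:3] == SELF_DESCRIBE_CBOR


tagged = SELF_DESCRIBE_CBOR + b"\x63abc"  # tag 55799 wrapping the text "abc"
print(has_self_describe_tag(tagged))       # True
print(has_self_describe_tag(b'{"a": 1}'))  # False
```

A file without the tag is still valid CBOR, so a negative result here only means the marker is absent, not that the content is something else.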
>>> 
>>> There is another thing I mentioned in the JIRA ticket I opened 
>>> yesterday for the CBOR parser and detection: it is also possible 
>>> that CBOR content is embedded inside a plain JSON file, and the way 
>>> a decoder can distinguish them in that file is again by looking at 
>>> tag 55799. This may rarely happen, but a robust parser should be 
>>> able to take care of it; Tika might need to consider using the 
>>> FasterXML library used by the Nutch tool when developing the CBOR parser...
>>> Again, let me cite the same paragraph from the RFC:
>>> 
>>> " a decoder might be able to parse both CBOR and JSON.
>>>  Such a decoder would need to mechanically distinguish the two
>>>  formats.  An easy way for an encoder to help the decoder would be to
>>>  tag the entire CBOR item with tag 55799, the serialization of which
>>>  will never be found at the beginning of a JSON text."
>>> 
>>> 
>>> Thanks
>>> Luke
>>> 
>>> 
>>> 
>>> -----Original Message-----
>>> From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
>>> Sent: Tuesday, April 21, 2015 9:49 PM
>>> To: Luke; 'Chris Mattmann'; Totaro, Giuseppe U (3980-Affiliate)
>>> Cc: Bryant, Ann C (398G-Affiliate); Zimdars, Paul A (3980-Affiliate); 
>>> 'NSF Polar CyberInfrastructure DR Students'; 
>>> memex-jpl@googlegroups.com
>>> Subject: Re: [memex-jpl] this week action from luke
>>> 
>>> Hi Luke,
>>> 
>>> Can you post the below conversation to dev@tika and summarize it there.
>>> Also what happens if you tweak the definition of XHTML to not scan 
>>> until 8192, but say 6000 (e.g., 0:6000), does CBOR take over then?
>>> 
>>> Cheers,
>>> Chris
>>> 
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Chris Mattmann, Ph.D.
>>> Chief Architect
>>> Instrument Software and Science Data Systems Section (398) NASA Jet 
>>> Propulsion Laboratory Pasadena, CA 91109 USA
>>> Office: 168-519, Mailstop: 168-527
>>> Email: chris.a.mattmann@nasa.gov
>>> WWW:  http://sunset.usc.edu/~mattmann/
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Adjunct Associate Professor, Computer Science Department University of 
>>> Southern California, Los Angeles, CA 90089 USA
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> -----Original Message-----
>>> From: Luke <ha...@gmail.com>
>>> Date: Wednesday, April 22, 2015 at 12:19 AM
>>> To: Chris Mattmann <ch...@gmail.com>, "Totaro, Giuseppe U 
>>> (3980-Affiliate)" <to...@di.uniroma1.it>, Chris Mattmann 
>>> <Ch...@jpl.nasa.gov>
>>> Cc: "Bryant, Ann C (398G-Affiliate)" <an...@gmail.com>, 
>>> "Zimdars, Paul A (3980-Affiliate)" <Pa...@jpl.nasa.gov>, NSF 
>>> Polar CyberInfrastructure DR Students 
>>> <ns...@googlegroups.com>,
>>> "memex-jpl@googlegroups.com" <me...@googlegroups.com>
>>> Subject: RE: [memex-jpl] this week action from luke
>>> 
>>>> Hi Professor,
>>>> Please see attached jpg for the difference.
>>>> Thanks
>>>> Luke
>>>> 
>>>> -----Original Message-----
>>>> From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>>>> Sent: Tuesday, April 21, 2015 5:27 PM
>>>> To: Luke; 'Giuseppe Totaro'; 'Mattmann, Chris A (3980)'
>>>> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A 
>>>> (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; 
>>>> memex-jpl@googlegroups.com
>>>> Subject: Re: [memex-jpl] this week action from luke
>>>> 
>>>> Hey Luke, what happens if you do
>>>>   java -jar /path/to/tika-app -m /path/to/cbor/file.cbor
>>>> compared to
>>>>   java -jar /path/to/tika-app -m < /path/to/cbor/file.cbor
>>>> Any difference?
>>>> 
>>>> ------------------------
>>>> Chris Mattmann
>>>> chris.mattmann@gmail.com
>>>> 
>>>> 
>>>> 
>>>> 
>>>> -----Original Message-----
>>>> From: Luke <ha...@gmail.com>
>>>> Date: Tuesday, April 21, 2015 at 5:41 PM
>>>> To: 'Luke' <ha...@gmail.com>, Chris Mattmann 
>>>> <ch...@gmail.com>, 'Giuseppe Totaro' 
>>>> <to...@di.uniroma1.it>, Chris Mattmann 
>>>> <Ch...@jpl.nasa.gov>
>>>> Cc: "'Bryant, Ann C (398G-Affiliate)'" <an...@gmail.com>, 
>>>> "'Zimdars, Paul A (3980-Affiliate)'" <Pa...@jpl.nasa.gov>, 
>>>> NSF Polar CyberInfrastructure DR Students 
>>>> <ns...@googlegroups.com>,
>>>> <me...@googlegroups.com>
>>>> Subject: RE: [memex-jpl] this week action from luke
>>>> 
>>>>> Hi professor,
>>>>> I just sent a pull request for adding cbor extension.
>>>>> The interesting thing is that tika is still identifying the file 
>>>>> dumped by the nutch dump tool as a "application/xhtml+xml" even when 
>>>>> I manually change the file extension to the correct one (i.e. *.cbor ).
>>>>> 
>>>>> The reason is probably that tika identifies "application/xhtml+xml"
>>>>> by searching for "&lt;html" in the file content (PFA:
>>>>> xhtml+xml.jpg). Now, if you take a look at the cbor file dumped by 
>>>>> nutch, you see that we do have that element as part of the cbor 
>>>>> content, because the entire crawled xhtml document seems to be 
>>>>> embedded in the cbor json (PFA: cbor.jpg). Also, in Tika the magic 
>>>>> detection seems to have higher priority than the glob detection, 
>>>>> thus the type is being incorrectly detected.
>>>>> 
>>>>> Therefore, I would like to mention that adding the 
>>>>> <glob pattern="*.cbor"/> entry does not resolve the issue as of now 
>>>>> without some fixed magic bytes / patterns for cbor.
>>>>> I would also like to add that things will be different with our 
>>>>> probabilistic mime detection selector, because if we know that the 
>>>>> file extension is more reliable than magic bytes, then we can 
>>>>> certainly add more preferential weight to the extension... this also 
>>>>> might show that the current MimeTypes detection implementation is a 
>>>>> bit stiff or inflexible in this scenario. :)
>>>>> 
>>>>> 
>>>>> Thanks
>>>>> Luke
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Luke [mailto:hanson311biz@gmail.com]
>>>>> Sent: Tuesday, April 21, 2015 12:14 PM
>>>>> To: 'Chris Mattmann'; 'Giuseppe Totaro'; 'Mattmann, Chris A (3980)'
>>>>> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A 
>>>>> (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; 
>>>>> 'memex-jpl@googlegroups.com'
>>>>> Subject: RE: [memex-jpl] this week action from luke
>>>>> 
>>>>> Yes, let me add the cbor extension entry in tika xml, will send the 
>>>>> pull request soon.
>>>>> 
>>>>> Thanks
>>>>> Luke
>>>>> -----Original Message-----
>>>>> From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>>>>> Sent: Tuesday, April 21, 2015 6:51 AM
>>>>> To: Giuseppe Totaro; Mattmann, Chris A (3980)
>>>>> Cc: Luke; Bryant, Ann C (398G-Affiliate); Zimdars, Paul A 
>>>>> (3980-Affiliate); NSF Polar CyberInfrastructure DR Students; 
>>>>> memex-jpl@googlegroups.com
>>>>> Subject: Re: [memex-jpl] this week action from luke
>>>>> 
>>>>> Giuseppe both of these ideas supporting the CBOR WRITE_TYPE_HEADER 
>>>>> and tag along with adding an -extension command would be fantastic.
>>>>> Can you file both of those NUTCH issues, wait a day or so, and then 
>>>>> based on feedback use your new Nutch commit karma to get those into 
>>>>> Nutch?
>>>>> 
>>>>> And then when creating the issues, can you link to the TIKA-1610 issue?
>>>>> At that point, when those two to be defined NUTCH issues are up, 
>>>>> Luke, in parallel can you throw up a pull request/patch in Tika for 
>>>>> the extension along with the MIME detection?
>>>>> 
>>>>> Cheers,
>>>>> Chris
>>>>> 
>>>>> ------------------------
>>>>> Chris Mattmann
>>>>> chris.mattmann@gmail.com
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Giuseppe Totaro <to...@di.uniroma1.it>
>>>>> Date: Tuesday, April 21, 2015 at 12:33 PM
>>>>> To: Chris Mattmann <Ch...@jpl.nasa.gov>
>>>>> Cc: Luke <ha...@gmail.com>, Chris Mattmann 
>>>>> <ch...@gmail.com>, "Bryant, Ann C (398G-Affiliate)"
>>>>> <an...@gmail.com>, "Zimdars, Paul A (3980-Affiliate)"
>>>>> <Pa...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR 
>>>>> Students <ns...@googlegroups.com>,
>>>>> "memex-jpl@googlegroups.com"
>>>>> <me...@googlegroups.com>
>>>>> Subject: Re: [memex-jpl] this week action from luke
>>>>> 
>>>>>> Thanks Luke. Great work.
>>>>>> Chris, we wrap a single string value, representing the JSON text, 
>>>>>> for each file into CBOR (by using serializeCBORData method). For 
>>>>>> instance, using the Unix hex dump tool, we can see that, as 
>>>>>> expected, the first byte of all files is "0x7F" (the first three 
>>>>>> bits are "011", that is the major type for strings, and the 
>>>>>> following 5 bits are "11010", meaning a uint32_t encodes the length 
>>>>>> of following text), and the following 4 bytes (single-precision
>>>>>> float) encodes the right length of file (as described in RFC7049 
>>>>>> <http://tools.ietf.org/html/rfc7049>).
>>>>>> Therefore, a CBOR tag is currently included into the file (a list 
>>>>>> of cbor tags is available here 
>>>>>> <http://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml>).
>>>>>> I did not know about the CBOR "magic header". Thanks a lot Luke for 
>>>>>> this great research. Chris, if you agree, I can add support for 
>>>>>> prepending the self-describing CBOR tag 55799 in the 
>>>>>> CommonCrawlDataDumper class. I believe it is very easy because I 
>>>>>> only have to enable the WRITE_TYPE_HEADER feature of the 
>>>>>> CBORGenerator class (the source code is available here 
>>>>>> <https://github.com/FasterXML/jackson-dataformat-cbor/blob/master/src/main/java/com/fasterxml/jackson/dataformat/cbor/CBORGenerator.java>).
>>>>>> Then, I can comment on the TIKA-1610
>>>>>> <https://issues.apache.org/jira/browse/TIKA-1610> issue.
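For reference when reading a hex dump of these files: the first byte of a CBOR item splits into a 3-bit major type and 5 bits of additional information (RFC 7049, section 2.1). Below is a minimal Java sketch of that split. The class and method names are hypothetical and are not taken from Nutch, Tika, or Jackson. Under this layout the bits 011 followed by 11010 form 0x7A (a text string with a 4-byte length), whereas 0x7F is 011 followed by 11111, the marker for an indefinite-length text string.

```java
public final class CborHead {

    /** Major type: the top three bits of the initial byte (0 to 7). */
    public static int majorType(int initialByte) {
        return (initialByte & 0xFF) >>> 5;
    }

    /** Additional information: the low five bits of the initial byte (0 to 31). */
    public static int additionalInfo(int initialByte) {
        return initialByte & 0x1F;
    }

    /** Says where the argument (for a string, its length) is stored; major types 0 to 6 only. */
    public static String argumentEncoding(int initialByte) {
        if (majorType(initialByte) == 7) {
            return "major type 7 (simple values and floats) is not covered by this sketch";
        }
        int info = additionalInfo(initialByte);
        if (info < 24) {
            return "argument " + info + " is stored in the initial byte";
        }
        switch (info) {
            case 24: return "1-byte unsigned argument follows";
            case 25: return "2-byte unsigned argument follows";
            case 26: return "4-byte unsigned argument follows";
            case 27: return "8-byte unsigned argument follows";
            case 31: return "indefinite length (only valid for major types 2 to 5)";
            default: return "reserved";
        }
    }

    public static void main(String[] args) {
        // 0x7A and 0x7F are text-string heads; 0xD9 is a tag head with a 2-byte tag number.
        int[] samples = {0x7A, 0x7F, 0xD9};
        for (int b : samples) {
            System.out.println(String.format("0x%02X: major type %d, additional info %d, %s",
                    b, majorType(b), additionalInfo(b), argumentEncoding(b)));
        }
    }
}
```

Running the same split on the first byte of a dumped file is a quick way to tell whether the item is a plain text string (major type 3) or starts with a tag (major type 6).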
>>>>>> 
>>>>>> Regarding the file extension, in the Memex CCA format the original 
>>>>>> file extension is used. We could add support for a -extension 
>>>>>> command-line option allowing the user to give a file extension 
>>>>>> (e.g.,
>>>>>> cbor) for all files dumped out.
>>>>>> 
>>>>>> Thanks a lot,
>>>>>> Giuseppe
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Tue, Apr 21, 2015 at 7:31 AM, Mattmann, Chris A (3980) 
>>>>>> <ch...@jpl.nasa.gov> wrote:
>>>>>> 
>>>>>> Thanks for this great research, Luke!
>>>>>> 
>>>>>> Giuseppe, any idea why this tag doesn't make it into the file?
>>>>>> 
>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>> Chris Mattmann, Ph.D.
>>>>>> Chief Architect
>>>>>> Instrument Software and Science Data Systems Section (398) NASA Jet 
>>>>>> Propulsion Laboratory Pasadena, CA 91109 USA
>>>>>> Office: 168-519, Mailstop: 168-527
>>>>>> Email: chris.a.mattmann@nasa.gov
>>>>>> WWW:  http://sunset.usc.edu/~mattmann/
>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>> Adjunct Associate Professor, Computer Science Department University 
>>>>>> of Southern California, Los Angeles, CA 90089 USA
>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: Luke <ha...@gmail.com>
>>>>>> Date: Tuesday, April 21, 2015 at 2:55 AM
>>>>>> To: Chris Mattmann <ch...@gmail.com>, "Totaro, Giuseppe U 
>>>>>> (3980-Affiliate)" <to...@di.uniroma1.it>, Chris Mattmann 
>>>>>> <Ch...@jpl.nasa.gov>, "Bryant, Ann C (398G-Affiliate)"
>>>>>> <an...@gmail.com>, "Zimdars, Paul A (3980-Affiliate)"
>>>>>> <Pa...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR 
>>>>>> Students <ns...@googlegroups.com>,
>>>>>> "memex-jpl@googlegroups.com"
>>>>>> <me...@googlegroups.com>
>>>>>> Subject: RE: [memex-jpl] this week action from luke
>>>>>> 
>>>>>>> Thanks professor.
>>>>>>> Hi professor and all.
>>>>>>> JIRA issue : CBOR Parser and detection improvement
>>>>>>> https://issues.apache.org/jira/browse/TIKA-1610
>>>>>>> 
>>>>>>> I tried to do a bit of research on this cbor detection.
>>>>>>> 
>>>>>>> It looks like there is a self-describing tag that needs to be 
>>>>>>> written in the cbor file, through which other applications might be 
>>>>>>> able to identify the cbor type....
>>>>>>> Please refer to http://tools.ietf.org/html/rfc7049#section-2.4.5
>>>>>>> 
>>>>>>> I don't see that tag being present in the cbor file dumped by the 
>>>>>>> nutch tool, I am not very sure though.
>>>>>>> 
>>>>>>> Thanks
>>>>>>> Luke
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>>>>>>> Sent: Monday, April 20, 2015 4:10 AM
>>>>>>> To: Luke; 'Giuseppe Totaro'; 'Chris Mattmann'; 'Bryant, Ann C 
>>>>>>> (398G-Affiliate)'; 'Zimdars, Paul A (3980-Affiliate)'; 'NSF Polar 
>>>>>>> CyberInfrastructure DR Students'; memex-jpl@googlegroups.com
>>>>>>> Subject: Re: [memex-jpl] this week action from luke
>>>>>>> 
>>>>>>> Nice one, Luke. If you have a second and you can open up an issue 
>>>>>>> in Tika to make it support CBOR, then yes, by all means! :)
>>>>>>> 
>>>>>>> 
>>>>>>> ------------------------
>>>>>>> Chris Mattmann
>>>>>>> chris.mattmann@gmail.com
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Luke <ha...@gmail.com>
>>>>>>> Date: Monday, April 20, 2015 at 4:15 AM
>>>>>>> To: 'Giuseppe Totaro' <to...@di.uniroma1.it>, Chris Mattmann 
>>>>>>> <ch...@gmail.com>, Chris Mattmann 
>>>>>>> <Ch...@jpl.nasa.gov>, "'Bryant, Ann C (398G-Affiliate)'"
>>>>>>> <an...@gmail.com>, "'Zimdars, Paul A (3980-Affiliate)'"
>>>>>>> <Pa...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR 
>>>>>>> Students <ns...@googlegroups.com>,
>>>>>>> <me...@googlegroups.com>
>>>>>>> Subject: RE: [memex-jpl] this week action from luke
>>>>>>> 
>>>>>>>> Thanks a lot Giuseppe for the prompt response clearing up a bit 
>>>>>>>> of my confusion with the Nutch CommonCrawlDataDumper , appreciated.
>>>>>>>> 
>>>>>>>> BTW, it looks like Tika might need to consider supporting a 
>>>>>>>> CBOR parser and detection.
>>>>>>>> I checked the rfc, it looks like CBOR has not got magic numbers.
>>>>>>>> PFA:
>>>>>>>> rfc_cbor.jpg
>>>>>>>> Actually, I don't quite understand why the CommonCrawlDataDumper 
>>>>>>>> is not dumping the nutch segments with the .cbor extension, which 
>>>>>>>> would seem to be helpful for type detection.
>>>>>>>> 
>>>>>>>> To professor Mattmann,
>>>>>>>> Tika does not support the detection of CBOR. Although the trunk 
>>>>>>>> version has entries for cbor in tika-mimetypes.xml (PFA: 
>>>>>>>> cbor_tika.mimetypes.xml), those entries do not properly detect 
>>>>>>>> the cbor files dumped by CommonCrawlDataDumper. Also, CBOR does 
>>>>>>>> not have magic bytes; off the top of my head, the only way we can 
>>>>>>>> detect it is by using the extension and a content byte histogram 
>>>>>>>> (please note, this is a locally optimal, data-dependent 
>>>>>>>> solution). :)
>>>>>>>> 
>>>>>>>> I think I am bit deviating from the main route and discussion of 
>>>>>>>> this thread.... i.e. the plan for testing the "probabilistic mime 
>>>>>>>> detector selection" with polar data.
>>>>>>>> Anyway, I plan to repackage tika by incorporating the 
>>>>>>>> probabilistic selection feature and replace the tika jar in nutch 
>>>>>>>> with the repackaged one, and then run the CommonCrawlDataDumper 
>>>>>>>> and see how it goes. If you have any specific ideas and thought 
>>>>>>>> with the testing, please kindly let me know.
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> Luke
>>>>>>>> 
>>>>>>>> From: Giuseppe Totaro [mailto:totaro@di.uniroma1.it]
>>>>>>>> Sent: Sunday, April 19, 2015 11:17 PM
>>>>>>>> To: Luke liu
>>>>>>>> Cc: Chris Mattmann; Chris Mattmann; Bryant, Ann C 
>>>>>>>> (398G-Affiliate); Zimdars, Paul A (3980-Affiliate); Luke; NSF 
>>>>>>>> Polar CyberInfrastructure DR Students; memex-jpl@googlegroups.com
>>>>>>>> Subject: Re: [memex-jpl] this week action from luke
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Hi Luke,
>>>>>>>> 
>>>>>>>> 
>>>>>>>> my name is Giuseppe and I am a PhD student working under the 
>>>>>>>> supervision of Prof. Chris Mattmann. I worked on 
>>>>>>>> CommonCrawlDataDumper tool, so I can give some feedback on a 
>>>>>>>> couple of your observations. My comments inline below.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Il giorno 19/apr/2015, alle ore 12:11, Luke liu 
>>>>>>>> <sh...@usc.edu> ha
>>>>>>>> scritto:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Thanks a lot professor; Sorry for the brief delay, I was spending 
>>>>>>>> some time in understanding the code repo i.e.
>>>>>>>> http://github.com/chrismattmann/trec-dd-polar/
>>>>>>>> 
>>>>>>>> From gen-common-crawl.sh, it looks like commoncrawldump is 
>>>>>>>> dumping the crawl segments to json files with the human readable 
>>>>>>>> and understandable content.
>>>>>>>> 1) I am trying to run one of the commands on my side as shown in 
>>>>>>>> gen-common-crawl.sh, but the generated files all end with .html 
>>>>>>>> or .htm. The command listed in gen-common-crawl.sh seems to 
>>>>>>>> allude to where the data is located on our 
>>>>>>>> nsfpolardata.dyndns.org <http://nsfpolardata.dyndns.org/>. 
>>>>>>>> Although the locations are not exactly correct (probably they need 
>>>>>>>> to be updated), part of the patterns allowed me to locate some 
>>>>>>>> similar datasets (e.g. /data2/crawls/raw/CS572Spring2015). Again, 
>>>>>>>> I am seeing that the dumped files all end with html, but 
>>>>>>>> surprisingly, inside those output html files, the contents are in 
>>>>>>>> json format;
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> The file extension is (almost) always the same as the original file.
>>>>>>>> More in detail, using the -epochFilename command-line option (as 
>>>>>>>> in gen-common-crawl.sh), the scraped data will be stored with a 
>>>>>>>> filename of the format <epochtime(milliseconds)>.<filetype>, 
>>>>>>>> where <filetype> is either the extension of the original file or 
>>>>>>>> .html as default if the original file does not have an extension. 
>>>>>>>> This schema is used for file naming and it does not depend on 
>>>>>>>> internal output format (JSON).
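The naming scheme described above can be sketched as follows. This is only an illustration of the stated rule (<epochtime(milliseconds)>.<filetype>, falling back to html when the original name has no extension); it is not the CommonCrawlDataDumper source, and the class name, the method name, and the treatment of leading-dot names are assumptions.

```java
public final class EpochFilename {

    /**
     * Builds "<epochMillis>.<extension>" from the original file name,
     * using "html" when the original name carries no extension.
     */
    public static String build(long epochMillis, String originalName) {
        // Look only at the last path segment so dots in directory names are ignored.
        int slash = Math.max(originalName.lastIndexOf('/'), originalName.lastIndexOf('\\'));
        String base = originalName.substring(slash + 1);
        int dot = base.lastIndexOf('.');
        // Assumption: a leading dot (".profile") or a trailing dot does not count as an extension.
        boolean hasExtension = dot > 0 && dot < base.length() - 1;
        String extension = hasExtension ? base.substring(dot + 1) : "html";
        return epochMillis + "." + extension;
    }

    public static void main(String[] args) {
        System.out.println(build(1423894754000L, "report.pdf"));
        System.out.println(build(1423894754000L, "docs/v1.2/index"));
    }
}
```

Because the name is derived from the original file and not from the output encoding, a CBOR-encoded dump of an HTML page still ends in .html, which is why extension-based detection cannot recognise it as CBOR.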
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 2) Another problem is that the root object is being set with some 
>>>>>>>> garbled chars in each of the output json files (with the html 
>>>>>>>> extension), PFA: garbled.jpg; one of the output json files has 
>>>>>>>> also been attached as an example (PFA: 1423894754000.html). The 
>>>>>>>> json files cannot be parsed properly by aggregate.py due to those 
>>>>>>>> garbled chars.
>>>>>>>> Even if I get rid of those garbled chars, there is no mimeTypes 
>>>>>>>> element, which is what aggregate.py reads.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Text content and metadata extracted from the crawled binary data 
>>>>>>>> are stored in a structured document format (JSON). Furthermore, 
>>>>>>>> this document is encoded using CBOR <http://cbor.io/> 
>>>>>>>> serialization. The non-human-readable characters that you notice 
>>>>>>>> at the beginning and at the end of the JSON data are due to the 
>>>>>>>> CBOR encoding.
>>>>>>>> Thus, if you need to read JSON data from a document dumped out by 
>>>>>>>> CommonCrawlDataDumper, you have to deserialize the CBOR-encoded 
>>>>>>>> data structure inside the file.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> I hope this short overview can help in your work. I really 
>>>>>>>> appreciate your feedback and, by the way, thanks a lot for your 
>>>>>>>> great job in detection.
>>>>>>>> 
>>>>>>>> I am available to provide all the support I can give, so do 
>>>>>>>> not hesitate to contact me if you need any further information.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> 
>>>>>>>> Giuseppe
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Finally, after some research, I guess that the statistical 
>>>>>>>> information (present in the readme of the code repo) is not being 
>>>>>>>> collected and computed by aggregate.py from those output json 
>>>>>>>> files but it looks like it is coming from the log.... see the 
>>>>>>>> following as an example:
>>>>>>>> 
>>>>>>>> 2015-04-19 04:55:42,078 INFO  tools.CommonCrawlDataDumper - 
>>>>>>>> CommonsCrawlDataDumper File Stats:
>>>>>>>> TOTAL Stats:
>>>>>>>> [
>>>>>>>>  {"mimeType":"application/x-tika-msoffice","count":"17"}
>>>>>>>>  {"mimeType":"application/vnd.ms-excel","count":"7"}
>>>>>>>>  {"mimeType":"application/xhtml+xml","count":"3000"}
>>>>>>>>  {"mimeType":"application/octet-stream","count":"641"}
>>>>>>>>  {"mimeType":"application/epub+zip","count":"1"}
>>>>>>>>  {"mimeType":"application/zip","count":"6"}
>>>>>>>>  {"mimeType":"application/xml","count":"11"}
>>>>>>>>  {"mimeType":"image/png","count":"110"}
>>>>>>>>  {"mimeType":"image/jpeg","count":"70"}
>>>>>>>>  {"mimeType":"application/atom+xml","count":"213"}
>>>>>>>>  {"mimeType":"application/rss+xml","count":"43"}
>>>>>>>>  {"mimeType":"video/mp4","count":"3"}
>>>>>>>>  {"mimeType":"text/plain","count":"104"}
>>>>>>>>  {"mimeType":"application/rdf+xml","count":"2"}
>>>>>>>>  {"mimeType":"image/gif","count":"2"}
>>>>>>>>  {"mimeType":"text/x-php","count":"1"}
>>>>>>>>  {"mimeType":"video/x-msvideo","count":"1"}
>>>>>>>>  {"mimeType":"application/x-tika-ooxml","count":"3"}
>>>>>>>>  {"mimeType":"text/html","count":"9506"}
>>>>>>>>  {"mimeType":"application/pdf","count":"280"}
>>>>>>>> ]
>>>>>>>> 
>>>>>>>> It turns out that aggregate.py is not the one that produces the 
>>>>>>>> statistical information, not sure what it does... but anyway, I 
>>>>>>>> think I understand the whole idea and I do concur with it; maybe 
>>>>>>>> we can repackage tika by incorporating the feature (i.e. 
>>>>>>>> probabilistic mime selection) in it and see if it outputs the 
>>>>>>>> same information in the log as the one without it.
>>>>>>>> 
>>>>>>>> BTW, regarding the use of the probabilistic mime selection 
>>>>>>>> feature: in my pull request, I added a simple test case which 
>>>>>>>> might tell a bit more about how the feature is called and used; 
>>>>>>>> it is simple though.
>>>>>>>> Here is an example snippet:
>>>>>>>>               ProbabilisticMimeDetectionSelector probSel = new ProbabilisticMimeDetectionSelector();
>>>>>>>>               probSel.detect(input::InputStream, metadata::Metadata)
>>>>>>>> It is similar to MimeTypes::detect(...) (more information on 
>>>>>>>> this can be found in 
>>>>>>>> https://issues.apache.org/jira/browse/TIKA-1517).
>>>>>>>> Now, in order to allow Tika().detect() to call 
>>>>>>>> ProbabilisticMimeDetectionSelector::detect(...) (as 
>>>>>>>> Tika().detect() is being called by commoncrawldump), we need to 
>>>>>>>> modify/add some code in the TikaConfig, which initializes a list 
>>>>>>>> of default detectors: we need to remove the detector 
>>>>>>>> mimeTypes::MimeTypes from the list and replace it with 
>>>>>>>> probSel::ProbabilisticMimeDetectionSelector. (Not sure if I 
>>>>>>>> should create another pull request with this change for 
>>>>>>>> TikaConfig.)
>>>>>>>> 
>>>>>>>> I think that is all of my initial thoughts, with some findings 
>>>>>>>> and a plan; if there is anything you would like to add or 
>>>>>>>> comment on, please kindly let me know, then I will start working 
>>>>>>>> on my 'finale'. BTW, don't worry, even after I graduate, 
>>>>>>>> graduation is not the end of my work with tika and this project; 
>>>>>>>> after that I still can and want to help this polar project and 
>>>>>>>> tika as much as possible, correct programming faults and 
>>>>>>>> bugs, respond to tika issues, etc.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> Luke
>>>>>>>> 
>>>>>>>> -----Original Message-----
>>>>>>>> From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>>>>>>>> Sent: Saturday, April 18, 2015 6:26 AM
>>>>>>>> To: Luke liu; 'Mattmann, Chris A (3980)'; Bryant, Ann C 
>>>>>>>> (398G-Affiliate); Zimdars, Paul A (3980-Affiliate)
>>>>>>>> Cc: 'Luke'; NSF Polar CyberInfrastructure DR Students; 
>>>>>>>> memex-jpl@googlegroups.com
>>>>>>>> Subject: Re: this week action from luke
>>>>>>>> Importance: High
>>>>>>>> 
>>>>>>>> Awesome Luke. I am going to work specifically on now benchmarking 
>>>>>>>> your code in real situations. For example, it would be fantastic 
>>>>>>>> to now run your Bayesian MIME detector over the whole NSF TREC 
>>>>>>>> Dynamic Domain data for Polar described here:
>>>>>>>> 
>>>>>>>> http://github.com/chrismattmann/trec-dd-polar/
>>>>>>>> 
>>>>>>>> Paul Zimdars, CC'ed, can provide you with access to the data, and 
>>>>>>>> Annie can explain it, also CC'ed.
>>>>>>>> 
>>>>>>>> Can we make that your goal for the next 2 weeks to actually test 
>>>>>>>> it and produce a real result over the whole TREC-DD data for 
>>>>>>>> Polar? My goal will be to get your code committed and integrated 
>>>>>>>> into Tika.
>>>>>>>> The more you can write me a guide of how to build and test your 
>>>>>>>> code with Tika so I can get it committed the better.
>>>>>>>> 
>>>>>>>> Also CC'ing the Memex list for context. Note everyone: Luke is 
>>>>>>>> building a Bayesian MIME classifier to evaluate against Tika's 
>>>>>>>> existing MIME detection approach. If folks have any Memex needs 
>>>>>>>> to try and test more accurate file identification with Tika, Luke 
>>>>>>>> is the guy to talk to and I have him for 2 more weeks.
>>>>>>>> 
>>>>>>>> Thanks!
>>>>>>>> 
>>>>>>>> Cheers,
>>>>>>>> Chris
>>>>>>>> 
>>>>>>>> ------------------------
>>>>>>>> Chris Mattmann
>>>>>>>> chris.mattmann@gmail.com
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> -----Original Message-----
>>>>>>>> From: Luke liu <sh...@usc.edu>
>>>>>>>> Date: Thursday, April 16, 2015 at 11:29 PM
>>>>>>>> To: Chris Mattmann <ch...@gmail.com>, Chris Mattmann 
>>>>>>>> <Ch...@jpl.nasa.gov>
>>>>>>>> Cc: 'Luke' <ha...@gmail.com>
>>>>>>>> Subject: this week action from luke
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Hi Professor Mattmann,
>>>>>>>> 
>>>>>>>> I think I am in the final phase of the research, and last week I 
>>>>>>>> finished the last item in the list, and hopefully everything will 
>>>>>>>> be fine.
>>>>>>>> 
>>>>>>>> For now, i probably can spend some time in verifying or 
>>>>>>>> optimizing the codes, the majority of the research has been 
>>>>>>>> done...and it will be also great if you can please comment on my 
>>>>>>>> work (the 2 pull
>>>>>>>> requests) when you have time.
>>>>>>>> 
>>>>>>>> If you do have confusion with any of my work, please also do let 
>>>>>>>> me know.
>>>>>>>> 
>>>>>>>> Thanks, and I am glad to be working with you. For the next couple 
>>>>>>>> of weeks before graduation, I am going to continue revising and 
>>>>>>>> testing the code and features to get rid of any flaws when I 
>>>>>>>> have time.
>>>>>>>> 
>>>>>>>> Not sure if I have missed something; if I have missed something 
>>>>>>>> important, please do let me know too.
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> Luke
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> You received this message because you are subscribed to the 
>>>>>>>> Google Groups "JPL-Kitware-Continuum Memex Group" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>>> send an email to memex-jpl+unsubscribe@googlegroups.com
>>>>>>>> <ma...@googlegroups.com>.
>>>>>>>> To post to this group, send email to memex-jpl@googlegroups.com.
>>>>>>>> Visit this group at http://groups.google.com/group/memex-jpl.
>>>>>>>> To view this discussion on the web visit
>>>>>>>> https://groups.google.com/d/msgid/memex-jpl/000f01d07ad4%24b3510070%2419f30150%24%40edu.
>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>> <garbled.jpg><1423894754000.html>
> 
>

RE: [memex-jpl] this week action from luke

Posted by Luke <ha...@gmail.com>.

Hi professor,

Please see the following results.
<match value="&lt;html xmlns=" type="string" offset="0:1024"/>
Result: "text/html"

<match value="&lt;html xmlns=" type="string" offset="0:6000"/>
Result: "application/xhtml+xml"
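
One possible reading of these two results is that the "<html xmlns=" string in the dumped file starts somewhere after offset 1024 but before offset 6000, so the XHTML rule only fires with the wider range; the actual position in the file has not been verified here. The Java sketch below illustrates range-limited matching under the assumption that offset="from:to" means the pattern may start at any position between from and to, inclusive. It is not Tika's implementation, and the class, the helper names, and the 2000-byte sample are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public final class OffsetRangeMatch {

    /** The byte pattern from the match rule quoted above. */
    public static final byte[] PATTERN = "<html xmlns=".getBytes(StandardCharsets.US_ASCII);

    /** True if the pattern starts at some position p with from <= p <= to. */
    public static boolean matches(byte[] data, byte[] pattern, int from, int to) {
        // The pattern must fit entirely inside the data.
        int last = Math.min(to, data.length - pattern.length);
        for (int p = Math.max(from, 0); p <= last; p++) {
            if (startsAt(data, pattern, p)) {
                return true;
            }
        }
        return false;
    }

    private static boolean startsAt(byte[] data, byte[] pattern, int position) {
        for (int i = 0; i < pattern.length; i++) {
            if (data[position + i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    /** Hypothetical sample: filler bytes followed by an XHTML root element. */
    public static byte[] sample(int fillerBytes) {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < fillerBytes; i++) {
            text.append('x');
        }
        text.append("<html xmlns=\"http://www.w3.org/1999/xhtml\">");
        return text.toString().getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] data = sample(2000);
        System.out.println("0:1024 -> " + matches(data, PATTERN, 0, 1024));
        System.out.println("0:6000 -> " + matches(data, PATTERN, 0, 6000));
    }
}
```

With 2000 filler bytes in front of the root element, the 0:1024 range misses the pattern and the 0:6000 range finds it, which mirrors the switch between the two detected types above.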


Thanks
Luke

-----Original Message-----
From: Chris Mattmann [mailto:chris.mattmann@gmail.com] 
Sent: Wednesday, April 22, 2015 4:21 AM
To: Luke; 'Mattmann, Chris A (3980)'; 'Totaro, Giuseppe U (3980-Affiliate)'; dev@tika.apache.org
Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; memex-jpl@googlegroups.com
Subject: Re: [memex-jpl] this week action from luke

Hi Luke,

Actually I just meant go into tika-mimetypes.xml and change the magic offsets for application/xhtml+xml and see if that works. The code you changed below is actually how many bytes Tika will first download to do MIME checking.

Cheers,
Chris

------------------------
Chris Mattmann
chris.mattmann@gmail.com




-----Original Message-----
From: Luke <ha...@gmail.com>
Date: Wednesday, April 22, 2015 at 2:25 AM
To: Chris Mattmann <ch...@gmail.com>, Chris Mattmann <Ch...@jpl.nasa.gov>, "'Totaro, Giuseppe U (3980-Affiliate)'"
<to...@di.uniroma1.it>, <de...@tika.apache.org>
Cc: "'Bryant, Ann C (398G-Affiliate)'" <an...@gmail.com>, "'Zimdars, Paul A (3980-Affiliate)'" <Pa...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR Students <ns...@googlegroups.com>,
<me...@googlegroups.com>
Subject: RE: [memex-jpl] this week action from luke

>
>Hi professor,
>
>I just tried it with minLength set to 1024, and I get the following: 
>"text/plain"
>I am a bit surprised....
>
>BTW, a min length of 6000 still gives "application/xhtml+xml"; with 
>any min length below 1024, I am seeing "text/plain". :)
>
>BTW, the min length I am referring to/altering is as follows:
>MimeTypes.java
>    public int getMinLength() {
>        // This needs to be reasonably large to be able to correctly detect
>        // things like XML root elements after initial comment and DTDs
>        return 64 * 1024;
>    }
>
>
>Thanks
>Luke
>
>-----Original Message-----
>From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>Sent: Tuesday, April 21, 2015 7:48 PM
>To: Luke; 'Mattmann, Chris A (3980)'; 'Totaro, Giuseppe U 
>(3980-Affiliate)'; dev@tika.apache.org
>Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A 
>(3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; 
>memex-jpl@googlegroups.com
>Subject: Re: [memex-jpl] this week action from luke
>
>Thanks Luke.
>
>So I guess all I was asking was could you try it out. Thanks for the 
>lesson in the RFC.
>
>Cheers,
>Chris
>
>------------------------
>Chris Mattmann
>chris.mattmann@gmail.com
>
>
>
>
>-----Original Message-----
>From: Luke <ha...@gmail.com>
>Date: Wednesday, April 22, 2015 at 1:46 AM
>To: Chris Mattmann <Ch...@jpl.nasa.gov>, Chris Mattmann 
><ch...@gmail.com>, "'Totaro, Giuseppe U (3980-Affiliate)'"
><to...@di.uniroma1.it>, <de...@tika.apache.org>
>Cc: "'Bryant, Ann C (398G-Affiliate)'" <an...@gmail.com>, 
>"'Zimdars, Paul A (3980-Affiliate)'" <Pa...@jpl.nasa.gov>, NSF 
>Polar CyberInfrastructure DR Students 
><ns...@googlegroups.com>,
><me...@googlegroups.com>
>Subject: RE: [memex-jpl] this week action from luke
>
>>Hi professor,
>>
>>
>>I think it highly depends on the content being read by tika, e.g. if 
>>there is a sequence of bytes in the file being read that is the 
>>same as the magic of one or more of the mime types defined in our 
>>tika-mimetypes.xml, I guess that tika will put those types in its 
>>estimation list; please note that the magic-byte detection approach 
>>can yield multiple estimated mime types. Tika also considers the 
>>decision made by the extension detection approach: if the extension 
>>says the file type it believes is the first one in the magic type 
>>estimation list, then certainly the first one will be returned (the 
>>same applies to the metadata hint approach). Of course, tika also 
>>prefers the type that is the most specialized.
>>
>>Let's get back to the following question; here is my guess though.
>>[Prof]: Also what happens if you tweak the definition of XHTML to not 
>>scan until 8192, but say 6000 (e.g., 0:6000), does CBOR take over then?
>>Let's consider an extreme case where we only scan 10 or 1 bytes; then 
>>it seems that magic bytes will inevitably detect nothing, and I think 
>>it will return something like "application/octet-stream", which is the 
>>most general type. As mentioned, tika favours the type that is the 
>>most specialized, and in this extreme case I believe almost every 
>>type is a subclass of "application/octet-stream", so the type returned 
>>by the extension approach is more specialized.... Therefore the answer 
>>in this extreme case may be yes; I think it is very possible that the 
>>CBOR type detected by the extension approach takes over in this case...
>>
>>My idea was and still is that if the cbor self-describing tag 55799 is 
>>present in the cbor file, then that can be used to detect the cbor type.
>>Again, the cbor type will probably be appended to the magic 
>>estimation list together with another one such as application/html; I 
>>guess the order in the list probably also matters, and the first one is 
>>preferred over the next one. The decision from the extension 
>>detection approach also plays the role of breaking the tie,
>>e.g. if the extension detection method agrees on cbor with one of the 
>>estimated types in the magic list, then cbor will be returned (again, 
>>the same thing applies to the metadata hint method).
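
The selection behaviour guessed at above can be sketched as a small Java class. This is a simplified model of the description in this message, not Tika's MimeTypes code: the class name, the combine method, and the three-entry type hierarchy in demo() are all hypothetical. The rule modelled is that the extension (glob) type wins only when it equals, or is a specialization of, one of the magic candidates; otherwise the first magic candidate is kept.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class DetectionCombiner {

    static final String OCTET_STREAM = "application/octet-stream";

    /** Child type -> parent type; assumed to be acyclic. */
    private final Map<String, String> parentOf;

    public DetectionCombiner(Map<String, String> parentOf) {
        this.parentOf = parentOf;
    }

    /** True if 'ancestor' appears somewhere above 'type' in the hierarchy. */
    public boolean isSpecializationOf(String type, String ancestor) {
        String current = parentOf.get(type);
        while (current != null) {
            if (current.equals(ancestor)) {
                return true;
            }
            current = parentOf.get(current);
        }
        return false;
    }

    /** Picks between the magic candidates (in priority order) and the glob type (may be null). */
    public String combine(List<String> magicCandidates, String globType) {
        // No magic hit at all is treated as the most general type.
        List<String> candidates = magicCandidates.isEmpty()
                ? Collections.singletonList(OCTET_STREAM)
                : magicCandidates;
        if (globType != null) {
            for (String candidate : candidates) {
                if (globType.equals(candidate) || isSpecializationOf(globType, candidate)) {
                    return globType;
                }
            }
        }
        return candidates.get(0);
    }

    /** Hypothetical hierarchy, for illustration only. */
    public static DetectionCombiner demo() {
        Map<String, String> parents = new HashMap<>();
        parents.put("application/cbor", OCTET_STREAM);
        parents.put("application/xml", OCTET_STREAM);
        parents.put("application/xhtml+xml", "application/xml");
        return new DetectionCombiner(parents);
    }

    public static void main(String[] args) {
        DetectionCombiner combiner = demo();
        // Magic finds XHTML; the .cbor extension is unrelated to it, so magic wins.
        System.out.println(combiner.combine(
                Arrays.asList("application/xhtml+xml"), "application/cbor"));
        // Magic finds nothing; cbor specializes octet-stream, so the extension wins.
        System.out.println(combiner.combine(
                Collections.<String>emptyList(), "application/cbor"));
    }
}
```

Under this model, renaming the file to .cbor cannot override an XHTML magic hit, while a CBOR magic entry would put application/cbor into the candidate list and let the extension confirm it.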
>>
>>I have not taken a closer look at a cbor file that has the tag 55799, 
>>but I expect its hex to be something like 0xd9d9f7, i.e. the tag 
>>should be present in the header as a fixed sequence of bytes 
>>(https://tools.ietf.org/html/rfc7049#section-2.4.5). If this is 
>>present in the file, or preferably in the header (within a reasonable 
>>range of bytes), I believe it can probably be used as the magic 
>>number for the cbor type.
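
A check for that prefix can be sketched in a few lines of Java. Tag 55799 is 0xD9F7, and as a CBOR tag with a 2-byte tag number it is serialized as the three bytes 0xD9 0xD9 0xF7 (RFC 7049, section 2.4.5). The class and method names below are hypothetical, and the sketch only looks at offset 0, which assumes the encoder tags the whole top-level item.

```java
public final class CborSelfDescribe {

    /** Tag 55799 (0xD9F7): a major-type-6 head (0xD9) followed by the 2-byte tag number. */
    private static final byte[] SELF_DESCRIBE = {(byte) 0xD9, (byte) 0xD9, (byte) 0xF7};

    /** True if the data begins with the self-describe CBOR tag. */
    public static boolean startsWithSelfDescribeTag(byte[] data) {
        if (data == null || data.length < SELF_DESCRIBE.length) {
            return false;
        }
        for (int i = 0; i < SELF_DESCRIBE.length; i++) {
            if (data[i] != SELF_DESCRIBE[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] tagged = {(byte) 0xD9, (byte) 0xD9, (byte) 0xF7, 0x7A};
        byte[] untagged = {0x7A, 0x00, 0x00, 0x10};
        System.out.println(startsWithSelfDescribeTag(tagged));
        System.out.println(startsWithSelfDescribeTag(untagged));
    }
}
```

A file written without the tag, such as one that begins directly with a text-string head, fails this check, which matches the observation that the current dumps carry no usable magic.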
>>
>>
>>There is another thing I mentioned in the jira ticket I opened 
>>yesterday for the cbor parser and detection: it is also possible 
>>that cbor content is embedded inside a plain json file, and the way 
>>that a decoder can distinguish them in that file is by looking at the 
>>tag 55799 again. This may rarely happen, but a robust parser might be 
>>able to take care of that; tika might need to consider using the 
>>FasterXML library used by the nutch tool when developing the cbor parser...
>>Again, let me cite the same paragraph from the rfc:
>>
>>" a decoder might be able to parse both CBOR and JSON.
>>   Such a decoder would need to mechanically distinguish the two
>>   formats.  An easy way for an encoder to help the decoder would be to
>>   tag the entire CBOR item with tag 55799, the serialization of which
>>   will never be found at the beginning of a JSON text."
>>
>>
>>Thanks
>>Luke
>>
>>
>>
>>-----Original Message-----
>>From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
>>Sent: Tuesday, April 21, 2015 9:49 PM
>>To: Luke; 'Chris Mattmann'; Totaro, Giuseppe U (3980-Affiliate)
>>Cc: Bryant, Ann C (398G-Affiliate); Zimdars, Paul A (3980-Affiliate); 
>>'NSF Polar CyberInfrastructure DR Students'; 
>>memex-jpl@googlegroups.com
>>Subject: Re: [memex-jpl] this week action from luke
>>
>>Hi Luke,
>>
>>Can you post the below conversation to dev@tika and summarize it there.
>>Also what happens if you tweak the definition of XHTML to not scan 
>>until 8192, but say 6000 (e.g., 0:6000), does CBOR take over then?
>>
>>Cheers,
>>Chris
>>
>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>Chris Mattmann, Ph.D.
>>Chief Architect
>>Instrument Software and Science Data Systems Section (398) NASA Jet 
>>Propulsion Laboratory Pasadena, CA 91109 USA
>>Office: 168-519, Mailstop: 168-527
>>Email: chris.a.mattmann@nasa.gov
>>WWW:  http://sunset.usc.edu/~mattmann/
>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>Adjunct Associate Professor, Computer Science Department University of 
>>Southern California, Los Angeles, CA 90089 USA
>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>
>>
>>
>>
>>
>>
>>-----Original Message-----
>>From: Luke <ha...@gmail.com>
>>Date: Wednesday, April 22, 2015 at 12:19 AM
>>To: Chris Mattmann <ch...@gmail.com>, "Totaro, Giuseppe U 
>>(3980-Affiliate)" <to...@di.uniroma1.it>, Chris Mattmann 
>><Ch...@jpl.nasa.gov>
>>Cc: "Bryant, Ann C (398G-Affiliate)" <an...@gmail.com>, 
>>"Zimdars, Paul A (3980-Affiliate)" <Pa...@jpl.nasa.gov>, NSF 
>>Polar CyberInfrastructure DR Students 
>><ns...@googlegroups.com>,
>>"memex-jpl@googlegroups.com" <me...@googlegroups.com>
>>Subject: RE: [memex-jpl] this week action from luke
>>
>>>Hi Professor,
>>>Please see attached jpg for the difference.
>>>Thanks
>>>Luke
>>>
>>>-----Original Message-----
>>>From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>>>Sent: Tuesday, April 21, 2015 5:27 PM
>>>To: Luke; 'Giuseppe Totaro'; 'Mattmann, Chris A (3980)'
>>>Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A 
>>>(3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; 
>>>memex-jpl@googlegroups.com
>>>Subject: Re: [memex-jpl] this week action from luke
>>>
>>>Hey Luke what happens if you do java -jar /path/to/tika-app -m 
>>>/path/to/cbor/file.cbor, compared to: java -jar /path/to/tika-app -m 
>>>< /path/to/cbor/file.cbor any difference?
>>>
>>>------------------------
>>>Chris Mattmann
>>>chris.mattmann@gmail.com
>>>
>>>
>>>
>>>
>>>-----Original Message-----
>>>From: Luke <ha...@gmail.com>
>>>Date: Tuesday, April 21, 2015 at 5:41 PM
>>>To: 'Luke' <ha...@gmail.com>, Chris Mattmann 
>>><ch...@gmail.com>, 'Giuseppe Totaro' 
>>><to...@di.uniroma1.it>, Chris Mattmann 
>>><Ch...@jpl.nasa.gov>
>>>Cc: "'Bryant, Ann C (398G-Affiliate)'" <an...@gmail.com>, 
>>>"'Zimdars, Paul A (3980-Affiliate)'" <Pa...@jpl.nasa.gov>, 
>>>NSF Polar CyberInfrastructure DR Students 
>>><ns...@googlegroups.com>,
>>><me...@googlegroups.com>
>>>Subject: RE: [memex-jpl] this week action from luke
>>>
>>>>Hi professor,
>>>>I just sent a pull request for adding cbor extension.
>>>>The interesting thing is that tika is still identifying the file 
>>>>dumped by the nutch dump tool as a "application/xhtml+xml" even when 
>>>>I manually change the file extension to the correct one (i.e. *.cbor ).
>>>>
>>>>The reason is probably that Tika identifies "application/xhtml+xml"
>>>>by searching for "<html" in the file content (PFA: xhtml+xml.jpg).
>>>>Now, if you take a look at the CBOR file dumped by Nutch, you see
>>>>that we do have that element as part of the CBOR content, because
>>>>the entire crawled XHTML document seems to be embedded in the CBOR
>>>>JSON (PFA: cbor.jpg). Also, in Tika, magic detection seems to have
>>>>higher priority than glob detection, so the type is detected
>>>>incorrectly.
>>>>
>>>>Therefore, I would like to mention that adding the entry
>>>><glob pattern="*.cbor"/> does not resolve the issue as of now
>>>>without some fixed magic bytes / patterns for CBOR.
>>>>I would also like to add that things will be different with our
>>>>probabilistic MIME detection selector: if we know that the file
>>>>extension is more reliable than the magic bytes, then we can
>>>>certainly give more weight to the extension... this might also
>>>>show that the current MimeTypes detection implementation is a
>>>>bit stiff, or less flexible, in this scenario. :)
>>>>
>>>>
>>>>Thanks
>>>>Luke
>>>>
>>>>-----Original Message-----
>>>>From: Luke [mailto:hanson311biz@gmail.com]
>>>>Sent: Tuesday, April 21, 2015 12:14 PM
>>>>To: 'Chris Mattmann'; 'Giuseppe Totaro'; 'Mattmann, Chris A (3980)'
>>>>Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A 
>>>>(3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; 
>>>>'memex-jpl@googlegroups.com'
>>>>Subject: RE: [memex-jpl] this week action from luke
>>>>
>>>>Yes, let me add the cbor extension entry in tika xml, will send the 
>>>>pull request soon.
>>>>
>>>>Thanks
>>>>Luke
>>>>-----Original Message-----
>>>>From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>>>>Sent: Tuesday, April 21, 2015 6:51 AM
>>>>To: Giuseppe Totaro; Mattmann, Chris A (3980)
>>>>Cc: Luke; Bryant, Ann C (398G-Affiliate); Zimdars, Paul A 
>>>>(3980-Affiliate); NSF Polar CyberInfrastructure DR Students; 
>>>>memex-jpl@googlegroups.com
>>>>Subject: Re: [memex-jpl] this week action from luke
>>>>
>>>>Giuseppe both of these ideas supporting the CBOR WRITE_TYPE_HEADER 
>>>>and tag along with adding an -extension command would be fantastic.
>>>>Can you file both of those NUTCH issues, wait a day or so, and then 
>>>>based on feedback use your new Nutch commit karma to get those into 
>>>>Nutch?
>>>>
>>>>And then when creating the issues, can you link to the TIKA-1610 issue?
>>>>At that point, when those two to be defined NUTCH issues are up, 
>>>>Luke, in parallel can you throw up a pull request/patch in Tika for 
>>>>the extension along with the MIME detection?
>>>>
>>>>Cheers,
>>>>Chris
>>>>
>>>>------------------------
>>>>Chris Mattmann
>>>>chris.mattmann@gmail.com
>>>>
>>>>
>>>>
>>>>
>>>>-----Original Message-----
>>>>From: Giuseppe Totaro <to...@di.uniroma1.it>
>>>>Date: Tuesday, April 21, 2015 at 12:33 PM
>>>>To: Chris Mattmann <Ch...@jpl.nasa.gov>
>>>>Cc: Luke <ha...@gmail.com>, Chris Mattmann 
>>>><ch...@gmail.com>, "Bryant, Ann C (398G-Affiliate)"
>>>><an...@gmail.com>, "Zimdars, Paul A (3980-Affiliate)"
>>>><Pa...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR 
>>>>Students <ns...@googlegroups.com>,
>>>>"memex-jpl@googlegroups.com"
>>>><me...@googlegroups.com>
>>>>Subject: Re: [memex-jpl] this week action from luke
>>>>
>>>>>Thanks Luke. Great work.
>>>>>Chris, for each file we wrap a single string value, representing the
>>>>>JSON text, into CBOR (using the serializeCBORData method). For
>>>>>instance, using the Unix hex dump tool, we can see that, as
>>>>>expected, the first byte of all files is "0x7A" (the first three
>>>>>bits are "011", the major type for text strings, and the
>>>>>following 5 bits are "11010", meaning a uint32_t encodes the length
>>>>>of the following text), and the following 4 bytes (a 32-bit
>>>>>unsigned integer) encode the right length of the file (as described
>>>>>in RFC7049 <http://tools.ietf.org/html/rfc7049>).
>>>>>Therefore, a CBOR tag is currently not included in the file (a list
>>>>>of CBOR tags is available here
>>>>><http://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml>).
>>>>>I did not know about the CBOR "magic header". Thanks a lot, Luke, for
>>>>>this great research. Chris, if you agree, I can add support for
>>>>>prepending the self-describing CBOR tag 55799 to the
>>>>>CommonCrawlDataDumper class. I believe it is very easy because I
>>>>>only have to enable the WRITE_TYPE_HEADER feature of the
>>>>>CBORGenerator class (the source code is available here
>>>>><https://github.com/FasterXML/jackson-dataformat-cbor/blob/master/src/main/java/com/fasterxml/jackson/dataformat/cbor/CBORGenerator.java>).
>>>>>Then, I can comment on the TIKA-1610
>>>>><https://issues.apache.org/jira/browse/TIKA-1610> issue.
>>>>>
>>>>>Regarding the file extension, in the Memex CCA format the original
>>>>>file extension is used. We could add support for a -extension
>>>>>command-line option allowing the user to give a file extension
>>>>>(e.g., cbor) for all files dumped out.
>>>>>
>>>>>Thanks a lot,
>>>>>Giuseppe
>>>>>
>>>>>
>>>>>
>>>>>On Tue, Apr 21, 2015 at 7:31 AM, Mattmann, Chris A (3980) 
>>>>><ch...@jpl.nasa.gov> wrote:
>>>>>
>>>>>Thanks for this great research, Luke!
>>>>>
>>>>>Giuseppe, any idea why this tag doesn’t make it into the file?
>>>>>
>>>>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>Chris Mattmann, Ph.D.
>>>>>Chief Architect
>>>>>Instrument Software and Science Data Systems Section (398) NASA Jet 
>>>>>Propulsion Laboratory Pasadena, CA 91109 USA
>>>>>Office: 168-519, Mailstop: 168-527
>>>>>Email: chris.a.mattmann@nasa.gov
>>>>>WWW:  http://sunset.usc.edu/~mattmann/
>>>>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>Adjunct Associate Professor, Computer Science Department University 
>>>>>of Southern California, Los Angeles, CA 90089 USA
>>>>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>-----Original Message-----
>>>>>From: Luke <ha...@gmail.com>
>>>>>Date: Tuesday, April 21, 2015 at 2:55 AM
>>>>>To: Chris Mattmann <ch...@gmail.com>, "Totaro, Giuseppe U 
>>>>>(3980-Affiliate)" <to...@di.uniroma1.it>, Chris Mattmann 
>>>>><Ch...@jpl.nasa.gov>, "Bryant, Ann C (398G-Affiliate)"
>>>>><an...@gmail.com>, "Zimdars, Paul A (3980-Affiliate)"
>>>>><Pa...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR 
>>>>>Students <ns...@googlegroups.com>,
>>>>>"memex-jpl@googlegroups.com"
>>>>><me...@googlegroups.com>
>>>>>Subject: RE: [memex-jpl] this week action from luke
>>>>>
>>>>>>Thanks professor.
>>>>>>Hi professor and all.
>>>>>>JIRA issue : CBOR Parser and detection improvement
>>>>>>https://issues.apache.org/jira/browse/TIKA-1610
>>>>>>
>>>>>>I did a bit of research on CBOR detection.
>>>>>>
>>>>>>It looks like there is a self-describing tag that needs to be
>>>>>>written into the CBOR file so that other applications might be
>>>>>>able to identify the CBOR type.
>>>>>>Please refer to http://tools.ietf.org/html/rfc7049#section-2.4.5
>>>>>>
>>>>>>I don’t see that tag present in the CBOR file dumped by the
>>>>>>Nutch tool, though I am not very sure.
>>>>>>
>>>>>>Thanks
>>>>>>Luke
>>>>>>
>>>>>>
>>>>>>
>>>>>>-----Original Message-----
>>>>>>From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>>>>>>Sent: Monday, April 20, 2015 4:10 AM
>>>>>>To: Luke; 'Giuseppe Totaro'; 'Chris Mattmann'; 'Bryant, Ann C 
>>>>>>(398G-Affiliate)'; 'Zimdars, Paul A (3980-Affiliate)'; 'NSF Polar 
>>>>>>CyberInfrastructure DR Students'; memex-jpl@googlegroups.com
>>>>>>Subject: Re: [memex-jpl] this week action from luke
>>>>>>
>>>>>>Nice one, Luke. If you have a second and you can open up an issue 
>>>>>>in Tika to make it support CBOR, then yes, by all means! :)
>>>>>>
>>>>>>
>>>>>>------------------------
>>>>>>Chris Mattmann
>>>>>>chris.mattmann@gmail.com
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>-----Original Message-----
>>>>>>From: Luke <ha...@gmail.com>
>>>>>>Date: Monday, April 20, 2015 at 4:15 AM
>>>>>>To: 'Giuseppe Totaro' <to...@di.uniroma1.it>, Chris Mattmann 
>>>>>><ch...@gmail.com>, Chris Mattmann 
>>>>>><Ch...@jpl.nasa.gov>, "'Bryant, Ann C (398G-Affiliate)'"
>>>>>><an...@gmail.com>, "'Zimdars, Paul A (3980-Affiliate)'"
>>>>>><Pa...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR 
>>>>>>Students <ns...@googlegroups.com>,
>>>>>><me...@googlegroups.com>
>>>>>>Subject: RE: [memex-jpl] this week action from luke
>>>>>>
>>>>>>>Thanks a lot Giuseppe for the prompt response clearing up a bit 
>>>>>>>of my confusion with the Nutch CommonCrawlDataDumper , appreciated.
>>>>>>>
>>>>>>>BTW, it looks like Tika might need to consider supporting a
>>>>>>>CBOR parser and detection.
>>>>>>>I checked the RFC; it looks like CBOR has no magic numbers.
>>>>>>>PFA: rfc_cbor.jpg
>>>>>>>Actually, I don’t quite understand why the CommonCrawlDataDumper
>>>>>>>is not dumping the Nutch segments with the .cbor extension, which
>>>>>>>would be helpful for type detection.
>>>>>>>
>>>>>>>To professor Mattmann,
>>>>>>>Tika does not support the detection of CBOR. Although the trunk
>>>>>>>version has entries for CBOR in tika-mimetypes.xml (PFA:
>>>>>>>cbor_tika.mimetypes.xml), those entries do not properly detect
>>>>>>>the CBOR files dumped by CommonCrawlDataDumper. Also, CBOR does
>>>>>>>not have magic bytes; off the top of my head, the only way we can
>>>>>>>detect it is by using the extension and a content byte histogram
>>>>>>>(please note, this is a locally optimal solution and
>>>>>>>data-dependent.) :)
>>>>>>>
>>>>>>>I think I am bit deviating from the main route and discussion of 
>>>>>>>this thread…. i.e. the plan for testing the “probabilistic mime 
>>>>>>>detector selection” with polar data.
>>>>>>>Anyway, I plan to repackage tika by incorporating the 
>>>>>>>probabilistic selection feature and replace the tika jar in nutch 
>>>>>>>with the repackaged one, and then run the CommonCrawlDataDumper 
>>>>>>>and see how it goes. If you have any specific ideas and thought 
>>>>>>>with the testing, please kindly let me know.
>>>>>>>
>>>>>>>Thanks
>>>>>>>Luke
>>>>>>>
>>>>>>>From: Giuseppe Totaro [mailto:totaro@di.uniroma1.it]
>>>>>>>Sent: Sunday, April 19, 2015 11:17 PM
>>>>>>>To: Luke liu
>>>>>>>Cc: Chris Mattmann; Chris Mattmann; Bryant, Ann C 
>>>>>>>(398G-Affiliate); Zimdars, Paul A (3980-Affiliate); Luke; NSF 
>>>>>>>Polar CyberInfrastructure DR Students; memex-jpl@googlegroups.com
>>>>>>>Subject: Re: [memex-jpl] this week action from luke
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>Hi Luke,
>>>>>>>
>>>>>>>
>>>>>>>my name is Giuseppe and I am a PhD student working under the 
>>>>>>>supervision of Prof. Chris Mattmann. I worked on 
>>>>>>>CommonCrawlDataDumper tool, so I can give some feedback on a 
>>>>>>>couple of your observations. My comments inline below.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>Il giorno 19/apr/2015, alle ore 12:11, Luke liu 
>>>>>>><sh...@usc.edu> ha
>>>>>>>scritto:
>>>>>>>
>>>>>>>
>>>>>>>Thanks a lot professor; Sorry for the brief delay, I was spending 
>>>>>>>some time in understanding the code repo i.e.
>>>>>>>http://github.com/chrismattmann/trec-dd-polar/
>>>>>>>
>>>>>>>From gen-common-crawl.sh, it looks like commoncrawldump is 
>>>>>>>dumping the crawl segments to json files with the human readable 
>>>>>>>and understandable content.
>>>>>>>1) I am trying to run one of the commands on my side as shown in
>>>>>>>gen-common-crawl.sh, but the generated files all end with .html
>>>>>>>or .htm. The command listed in gen-common-crawl.sh seems to
>>>>>>>allude to where the data is located on our
>>>>>>>nsfpolardata.dyndns.org <http://nsfpolardata.dyndns.org/>. Although
>>>>>>>the locations are not exactly correct (probably they need to be
>>>>>>>updated), part of the patterns allowed me to locate some similar
>>>>>>>datasets (e.g. /data2/crawls/raw/CS572Spring2015). Again I am seeing
>>>>>>>that the dumped files all end with html, but surprisingly, inside
>>>>>>>those outputted html files, the contents are in JSON format;
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>The file extension is (almost) always the same as the original file.
>>>>>>>More in detail, using the -epochFilename command-line option (as 
>>>>>>>in gen-common-crawl.sh), the scraped data will be stored with a 
>>>>>>>filename of the format <epochtime(milliseconds)>.<filetype>, 
>>>>>>>where <filetype> is either the extension of the original file or 
>>>>>>>.html as default if the original file does not have an extension. 
>>>>>>>This schema is used for file naming and it does not depend on 
>>>>>>>internal output format (JSON).
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>2) Another problem is that the root object is surrounded by some
>>>>>>>garbled chars in each of the outputted JSON files (with the html
>>>>>>>extension), PFA: garbled.jpg. One of the outputted JSON files
>>>>>>>is also attached as an example (PFA: 1423894754000.html).
>>>>>>>The JSON files cannot be parsed properly by
>>>>>>>aggregate.py due to those garbled chars.
>>>>>>>Even if I get rid of those garbled chars, there is no mimeTypes
>>>>>>>element, which is what aggregate.py reads.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>Text content and metadata extracted from the crawled binary data
>>>>>>>are stored in a structured document format (JSON). Furthermore,
>>>>>>>this document is encoded using CBOR <http://cbor.io/>
>>>>>>>serialization. Each non-human-readable character that you notice
>>>>>>>at the front and at the end of the JSON data is due to CBOR encoding.
>>>>>>>Thus, if you need to read JSON data from a document dumped out by
>>>>>>>CommonCrawlDataDumper, you have to deserialize the CBOR-encoded
>>>>>>>data structure inside the file.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>I hope this short overview can help in your work. I really
>>>>>>>appreciate your feedback and, by the way, thanks a lot for your
>>>>>>>great job on detection.
>>>>>>>
>>>>>>>I am available to provide all the support I can give, so do
>>>>>>>not hesitate to contact me if you need any further information.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>Thanks,
>>>>>>>
>>>>>>>Giuseppe
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>Finally, after some research, I guess that the statistical 
>>>>>>>information (present in the readme of the code repo) is not being 
>>>>>>>collected and computed by aggregate.py from those output json 
>>>>>>>files but it looks like it is coming from the log.... see the 
>>>>>>>following as an example:
>>>>>>>
>>>>>>>2015-04-19 04:55:42,078 INFO  tools.CommonCrawlDataDumper - 
>>>>>>>CommonsCrawlDataDumper File Stats:
>>>>>>>TOTAL Stats:
>>>>>>>[
>>>>>>>   {"mimeType":"application/x-tika-msoffice","count":"17"}
>>>>>>>   {"mimeType":"application/vnd.ms-excel","count":"7"}
>>>>>>>   {"mimeType":"application/xhtml+xml","count":"3000"}
>>>>>>>   {"mimeType":"application/octet-stream","count":"641"}
>>>>>>>   {"mimeType":"application/epub+zip","count":"1"}
>>>>>>>   {"mimeType":"application/zip","count":"6"}
>>>>>>>   {"mimeType":"application/xml","count":"11"}
>>>>>>>   {"mimeType":"image/png","count":"110"}
>>>>>>>   {"mimeType":"image/jpeg","count":"70"}
>>>>>>>   {"mimeType":"application/atom+xml","count":"213"}
>>>>>>>   {"mimeType":"application/rss+xml","count":"43"}
>>>>>>>   {"mimeType":"video/mp4","count":"3"}
>>>>>>>   {"mimeType":"text/plain","count":"104"}
>>>>>>>   {"mimeType":"application/rdf+xml","count":"2"}
>>>>>>>   {"mimeType":"image/gif","count":"2"}
>>>>>>>   {"mimeType":"text/x-php","count":"1"}
>>>>>>>   {"mimeType":"video/x-msvideo","count":"1"}
>>>>>>>   {"mimeType":"application/x-tika-ooxml","count":"3"}
>>>>>>>   {"mimeType":"text/html","count":"9506"}
>>>>>>>   {"mimeType":"application/pdf","count":"280"}
>>>>>>>]
>>>>>>>
>>>>>>>It turns out that aggregate.py is not the one that produces the
>>>>>>>statistical information; I am not sure what it does... but anyway, I
>>>>>>>think I understand the whole idea and I do concur with it. Maybe
>>>>>>>we can repackage Tika with the feature (i.e. probabilistic mime
>>>>>>>selection) incorporated and see if it outputs the same information
>>>>>>>in the log as the version without it.
>>>>>>>
>>>>>>>BTW, regarding the use of the probabilistic mime selection
>>>>>>>feature: in my pull request, I added a simple test case which
>>>>>>>might tell a bit more about how the feature is called and used; it
>>>>>>>is simple though.
>>>>>>>Here is an example snippet:
>>>>>>>    ProbabilisticMimeDetectionSelector probSel =
>>>>>>>        new ProbabilisticMimeDetectionSelector();
>>>>>>>    // input is an InputStream, metadata is a Metadata
>>>>>>>    probSel.detect(input, metadata);
>>>>>>>It is similar to MimeTypes::detect(...) (more information on this
>>>>>>>can be found in https://issues.apache.org/jira/browse/TIKA-1517).
>>>>>>>Now, in order to allow Tika().detect() to call
>>>>>>>ProbabilisticMimeDetectionSelector::detect(...) (as
>>>>>>>Tika().detect() is what commoncrawldump calls), we need to
>>>>>>>modify/add some code in TikaConfig, which initializes a list
>>>>>>>of default detectors: we need to remove the detector
>>>>>>>mimeTypes::MimeTypes from the list and replace it with
>>>>>>>probSel::ProbabilisticMimeDetectionSelector. (Not sure if I should
>>>>>>>create another pull request with this change for TikaConfig.)
>>>>>>>
>>>>>>>I think that is all of my initial thought with some finding and 
>>>>>>>plan; if you have anything you would like to please add and 
>>>>>>>comment, please do kindly let me know, then I will start working 
>>>>>>>on my 'finale'. BTW, don’t worry, even after I am graduated, the 
>>>>>>>graduation is not my termination with tika and this project, 
>>>>>>>after then I still can and want to help this polar project and 
>>>>>>>tika as much as possible, and correct the programming faults and 
>>>>>>>bugs, respond to the tika issues ,etc.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>Thanks
>>>>>>>Luke
>>>>>>>
>>>>>>>-----Original Message-----
>>>>>>>From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>>>>>>>Sent: Saturday, April 18, 2015 6:26 AM
>>>>>>>To: Luke liu; 'Mattmann, Chris A (3980)'; Bryant, Ann C 
>>>>>>>(398G-Affiliate); Zimdars, Paul A (3980-Affiliate)
>>>>>>>Cc: 'Luke'; NSF Polar CyberInfrastructure DR Students; 
>>>>>>>memex-jpl@googlegroups.com
>>>>>>>Subject: Re: this week action from luke
>>>>>>>Importance: High
>>>>>>>
>>>>>>>Awesome Luke. I am going to work specifically on now benchmarking 
>>>>>>>your code in real situations. For example, it would be fantastic 
>>>>>>>to now run your Bayesian MIME detector over the whole NSF TREC 
>>>>>>>Dynamic Domain data for Polar described here:
>>>>>>>
>>>>>>>http://github.com/chrismattmann/trec-dd-polar/
>>>>>>>
>>>>>>>Paul Zimdars, CC’ed, can provide you with access to the data, and 
>>>>>>>Annie can explain it, also CC’ed.
>>>>>>>
>>>>>>>Can we make that your goal for the next 2 weeks to actually test 
>>>>>>>it and produce a real result over the whole TREC-DD data for 
>>>>>>>Polar? My goal will be to get your code committed and integrated 
>>>>>>>into Tika.
>>>>>>>The more you can write me a guide of how to build and test your 
>>>>>>>code with Tika so I can get it committed the better.
>>>>>>>
>>>>>>>Also CC’ing the Memex list for context. Note everyone: Luke is 
>>>>>>>building a Bayesian MIME classifier to evaluate against Tika’s 
>>>>>>>existing MIME detection approach. If folks have any Memex needs 
>>>>>>>to try and test more accurate file identification with Tika, Luke 
>>>>>>>is the guy to talk to and I have him for 2 more weeks.
>>>>>>>
>>>>>>>Thanks!
>>>>>>>
>>>>>>>Cheers,
>>>>>>>Chris
>>>>>>>
>>>>>>>------------------------
>>>>>>>Chris Mattmann
>>>>>>>chris.mattmann@gmail.com
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>-----Original Message-----
>>>>>>>From: Luke liu <sh...@usc.edu>
>>>>>>>Date: Thursday, April 16, 2015 at 11:29 PM
>>>>>>>To: Chris Mattmann <ch...@gmail.com>, Chris Mattmann 
>>>>>>><Ch...@jpl.nasa.gov>
>>>>>>>Cc: 'Luke' <ha...@gmail.com>
>>>>>>>Subject: this week action from luke
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>Hi Professor Mattmann,
>>>>>>>
>>>>>>>I think I am in the final phase of the research, and last week I 
>>>>>>>finished the last item in the list, and hopefully everything will 
>>>>>>>be fine.
>>>>>>>
>>>>>>>For now, i probably can spend some time in verifying or 
>>>>>>>optimizing the codes, the majority of the research has been 
>>>>>>>done…and it will be also great if you can please comment on my 
>>>>>>>work (the 2 pull
>>>>>>>requests) when you have time.
>>>>>>>
>>>>>>>If you do have confusion with any of my work, please also do let 
>>>>>>>me know.
>>>>>>>
>>>>>>>Thanks, and I am glad to be working with you. For the next couple of
>>>>>>>weeks before graduation, I am going to continue revising and
>>>>>>>testing the code and features to get rid of flaws (if any)
>>>>>>>when I have time.
>>>>>>>
>>>>>>>Not sure if I miss out something, and if I do miss some thing 
>>>>>>>important, please do let me know too.
>>>>>>>
>>>>>>>Thanks
>>>>>>>Luke
>>>>>>>
>>>>>>>
>>>>>>><garbled.jpg><1423894754000.html>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>>
>
>
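
As an illustration of the CBOR header details discussed in the quoted thread above, here is a minimal Java sketch. It splits a CBOR initial byte into its major type (top three bits) and additional information (low five bits), and checks whether a byte array starts with the serialized self-describe tag 55799 (0xd9d9f7, RFC 7049 section 2.4.5). The class name CborHeader and its method names are hypothetical and are not part of Tika, Nutch, or Jackson; this is a sketch under those assumptions, not a detector implementation.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative helper (hypothetical names, not part of Tika or Nutch). */
final class CborHeader {
    // Serialized form of self-describe tag 55799 (RFC 7049, section 2.4.5).
    static final byte[] SELF_DESCRIBE_TAG = {(byte) 0xD9, (byte) 0xD9, (byte) 0xF7};

    /** Major type: the top three bits of the initial byte. */
    static int majorType(int initialByte) {
        return (initialByte & 0xFF) >>> 5;
    }

    /** Additional information: the low five bits of the initial byte. */
    static int additionalInfo(int initialByte) {
        return initialByte & 0x1F;
    }

    /** True if the data begins with the three-byte self-describe tag. */
    static boolean startsWithSelfDescribeTag(byte[] data) {
        if (data.length < SELF_DESCRIBE_TAG.length) {
            return false;
        }
        for (int i = 0; i < SELF_DESCRIBE_TAG.length; i++) {
            if (data[i] != SELF_DESCRIBE_TAG[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Bits 011 11010: text string (major type 3) with a 4-byte length (26).
        System.out.println(majorType(0x7A) + " " + additionalInfo(0x7A)); // prints "3 26"

        // Bits 110 11001: tag (major type 6) with a 2-byte tag number (25).
        System.out.println(majorType(0xD9) + " " + additionalInfo(0xD9)); // prints "6 25"

        byte[] tagged = {(byte) 0xD9, (byte) 0xD9, (byte) 0xF7, 0x7A};
        byte[] json = "{\"a\":1}".getBytes(StandardCharsets.US_ASCII);
        System.out.println(startsWithSelfDescribeTag(tagged)); // prints "true"
        System.out.println(startsWithSelfDescribeTag(json));   // prints "false"
    }
}
```

Because the tag bytes can never begin a JSON text, a check like this is what would make a fixed magic pattern at offset 0 possible for files written with the tag.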

Re: [memex-jpl] this week action from luke

Posted by Chris Mattmann <ch...@gmail.com>.

Hi Luke,

Actually I just meant go into tika-mimetypes.xml and change the
magic offsets for application/xhtml+xml and see if that works. The
code you changed below is actually how many bytes Tika will first
download to do MIME checking.
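
To make the offset question concrete, here is a minimal Java sketch of what a magic offset range such as 0:8192 versus 0:6000 would mean, assuming a pattern is allowed to start at any offset inside the range. MagicRangeSketch and matchesInRange are hypothetical names used only for illustration; this is a simplified model and not Tika's actual matching code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Simplified model of an offset-range magic match (hypothetical, not Tika code). */
final class MagicRangeSketch {
    /**
     * Returns true if pattern occurs in data with its first byte at any
     * offset in [startOffset, endOffset] (both inclusive).
     */
    static boolean matchesInRange(byte[] data, byte[] pattern, int startOffset, int endOffset) {
        // The pattern must fit entirely inside the data.
        int last = Math.min(endOffset, data.length - pattern.length);
        for (int i = Math.max(0, startOffset); i <= last; i++) {
            int j = 0;
            while (j < pattern.length && data[i + j] == pattern[j]) {
                j++;
            }
            if (j == pattern.length) {
                return true;
            }
        }
        return false;
    }

    /** Builds filler bytes followed by the given text, for demonstration. */
    static byte[] fillerThen(int fillerLength, String text) {
        byte[] tail = text.getBytes(StandardCharsets.US_ASCII);
        byte[] data = new byte[fillerLength + tail.length];
        Arrays.fill(data, 0, fillerLength, (byte) 'x');
        System.arraycopy(tail, 0, data, fillerLength, tail.length);
        return data;
    }

    public static void main(String[] args) {
        byte[] pattern = "<html".getBytes(StandardCharsets.US_ASCII);
        // "<html" first appears at offset 7000 in this synthetic input.
        byte[] data = fillerThen(7000, "<html");
        System.out.println(matchesInRange(data, pattern, 0, 8192)); // prints "true"
        System.out.println(matchesInRange(data, pattern, 0, 6000)); // prints "false"
    }
}
```

Under this model, narrowing the range only changes the result when the embedded "<html" first appears beyond the new end offset.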

Cheers,
Chris

------------------------
Chris Mattmann
chris.mattmann@gmail.com




-----Original Message-----
From: Luke <ha...@gmail.com>
Date: Wednesday, April 22, 2015 at 2:25 AM
To: Chris Mattmann <ch...@gmail.com>, Chris Mattmann
<Ch...@jpl.nasa.gov>, "'Totaro, Giuseppe U (3980-Affiliate)'"
<to...@di.uniroma1.it>, <de...@tika.apache.org>
Cc: "'Bryant, Ann C (398G-Affiliate)'" <an...@gmail.com>, "'Zimdars,
Paul A (3980-Affiliate)'" <Pa...@jpl.nasa.gov>, NSF Polar
CyberInfrastructure DR Students <ns...@googlegroups.com>,
<me...@googlegroups.com>
Subject: RE: [memex-jpl] this week action from luke

>
>Hi professor,
>
>I just tried it with minLength set to 1024, I get the following
>"text/plain"
>I am a bit surprised....
>
>BTW, the 6000 min length still give "application/xhtml+xml"; with
>anything below 1024 min length, I am seeing "text/plain". :)
>
>BTW, the min length I am referring to/altering is as follows
>(MimeTypes.java):
>    public int getMinLength() {
>        // This needs to be reasonably large to be able to correctly detect
>        // things like XML root elements after initial comment and DTDs
>        return 64 * 1024;
>    }
>
>
>Thanks
>Luke
>
>-----Original Message-----
>From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>Sent: Tuesday, April 21, 2015 7:48 PM
>To: Luke; 'Mattmann, Chris A (3980)'; 'Totaro, Giuseppe U
>(3980-Affiliate)'; dev@tika.apache.org
>Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A (3980-Affiliate)';
>'NSF Polar CyberInfrastructure DR Students'; memex-jpl@googlegroups.com
>Subject: Re: [memex-jpl] this week action from luke
>
>Thanks Luke.
>
>So I guess all I was asking was could you try it out. Thanks for the
>lesson in the RFC.
>
>Cheers,
>Chris
>
>------------------------
>Chris Mattmann
>chris.mattmann@gmail.com
>
>
>
>
>-----Original Message-----
>From: Luke <ha...@gmail.com>
>Date: Wednesday, April 22, 2015 at 1:46 AM
>To: Chris Mattmann <Ch...@jpl.nasa.gov>, Chris Mattmann
><ch...@gmail.com>, "'Totaro, Giuseppe U (3980-Affiliate)'"
><to...@di.uniroma1.it>, <de...@tika.apache.org>
>Cc: "'Bryant, Ann C (398G-Affiliate)'" <an...@gmail.com>,
>"'Zimdars, Paul A (3980-Affiliate)'" <Pa...@jpl.nasa.gov>, NSF
>Polar CyberInfrastructure DR Students
><ns...@googlegroups.com>,
><me...@googlegroups.com>
>Subject: RE: [memex-jpl] this week action from luke
>
>>Hi professor,
>>
>>
>>I think it highly depends on the content being read by Tika. E.g., if
>>there is a sequence of bytes in the file being read that is the
>>same as the magic of one or more MIME types defined in our
>>tika-mimetypes.xml, I guess that Tika will put those types in its
>>estimation list; please note that the magic-byte detection approach
>>can produce multiple estimated MIME types. Tika then also considers
>>the decision made by the extension detection approach: if the
>>extension says the file type is the first one in the magic
>>estimation list, then certainly the first one will be returned (the
>>same applies to the metadata hint approach). Of course, Tika also
>>prefers the type that is the most specialized.
>>
>>Let's get back to the following question; here is my guess though.
>>[Prof]: Also what happens if you tweak the definition of XHTML to not
>>scan until 8192, but say 6000 (e.g., 0:6000), does CBOR take over then?
>>Let's consider an extreme case where we only scan 10 or 1 bytes. Then
>>it seems that magic bytes will inevitably detect nothing, and I think
>>it will return something like "application/octet-stream", which is the
>>most general type. As mentioned, Tika favours the type that is the most
>>specialized, and in this extreme case I believe almost every type is a
>>subclass of "application/octet-stream", so whatever the extension
>>approach returns is more specialized.... Therefore the answer in
>>this extreme case may be yes: I think it is very possible that the CBOR
>>type detected by the extension approach takes over in this case...
>>
>>My idea was and still is that if the CBOR self-describing tag 55799 is
>>present in the CBOR file, then it can be used to detect the CBOR type.
>>Again, the CBOR type will probably be appended to the magic
>>estimation list together with another one such as application/html; I
>>guess the order in the list probably also matters, the first one being
>>preferred over the next one. The decision from the extension
>>detection approach also plays a role in breaking the tie,
>>e.g. if the extension detection method agrees on CBOR with one of the
>>estimated types in the magic list, then CBOR will be returned (again,
>>the same applies to the metadata hint method).
>>
>>I have not taken a closer look at a CBOR file that has the tag 55799,
>>but I expect its hex to be something like 0xd9d9f7, i.e. the tag
>>should be present in the header as a fixed sequence of bytes
>>(https://tools.ietf.org/html/rfc7049#section-2.4.5). If this is
>>present in the file, or preferably in the header (within a reasonable
>>range of bytes), I believe it can probably be used as the magic
>>number for the CBOR type.
>>
>>
>>There is another thing I mentioned in the JIRA ticket I opened
>>yesterday for the CBOR parser and detection: it is also possible
>>that CBOR content is embedded inside a plain JSON file, and the way
>>a decoder can distinguish them in that file is again by looking at
>>tag 55799. This may rarely happen, but a robust parser might be
>>able to take care of it; Tika might need to consider using the
>>FasterXML library used by the Nutch tool when developing the CBOR
>>parser... Again, let me cite the same paragraph from the RFC:
>>
>>" a decoder might be able to parse both CBOR and JSON.
>>   Such a decoder would need to mechanically distinguish the two
>>   formats.  An easy way for an encoder to help the decoder would be to
>>   tag the entire CBOR item with tag 55799, the serialization of which
>>   will never be found at the beginning of a JSON text."
>>
>>
>>Thanks
>>Luke
>>
>>
>>
>>-----Original Message-----
>>From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
>>Sent: Tuesday, April 21, 2015 9:49 PM
>>To: Luke; 'Chris Mattmann'; Totaro, Giuseppe U (3980-Affiliate)
>>Cc: Bryant, Ann C (398G-Affiliate); Zimdars, Paul A (3980-Affiliate);
>>'NSF Polar CyberInfrastructure DR Students'; memex-jpl@googlegroups.com
>>Subject: Re: [memex-jpl] this week action from luke
>>
>>Hi Luke,
>>
>>Can you post the below conversation to dev@tika and summarize it there.
>>Also what happens if you tweak the definition of XHTML to not scan
>>until 8192, but say 6000 (e.g., 0:6000), does CBOR take over then?
>>
>>Cheers,
>>Chris
>>
>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>Chris Mattmann, Ph.D.
>>Chief Architect
>>Instrument Software and Science Data Systems Section (398) NASA Jet
>>Propulsion Laboratory Pasadena, CA 91109 USA
>>Office: 168-519, Mailstop: 168-527
>>Email: chris.a.mattmann@nasa.gov
>>WWW:  http://sunset.usc.edu/~mattmann/
>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>Adjunct Associate Professor, Computer Science Department University of
>>Southern California, Los Angeles, CA 90089 USA
>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>
>>
>>
>>
>>
>>
>>-----Original Message-----
>>From: Luke <ha...@gmail.com>
>>Date: Wednesday, April 22, 2015 at 12:19 AM
>>To: Chris Mattmann <ch...@gmail.com>, "Totaro, Giuseppe U
>>(3980-Affiliate)" <to...@di.uniroma1.it>, Chris Mattmann
>><Ch...@jpl.nasa.gov>
>>Cc: "Bryant, Ann C (398G-Affiliate)" <an...@gmail.com>, "Zimdars,
>>Paul A (3980-Affiliate)" <Pa...@jpl.nasa.gov>, NSF Polar
>>CyberInfrastructure DR Students
>><ns...@googlegroups.com>,
>>"memex-jpl@googlegroups.com" <me...@googlegroups.com>
>>Subject: RE: [memex-jpl] this week action from luke
>>
>>>Hi Professor,
>>>Please see attached jpg for the difference.
>>>Thanks
>>>Luke
>>>
>>>-----Original Message-----
>>>From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>>>Sent: Tuesday, April 21, 2015 5:27 PM
>>>To: Luke; 'Giuseppe Totaro'; 'Mattmann, Chris A (3980)'
>>>Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A
>>>(3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students';
>>>memex-jpl@googlegroups.com
>>>Subject: Re: [memex-jpl] this week action from luke
>>>
>>>Hey Luke what happens if you do java -jar /path/to/tika-app -m
>>>/path/to/cbor/file.cbor, compared to: java -jar /path/to/tika-app -m <
>>>/path/to/cbor/file.cbor any difference?
>>>
>>>------------------------
>>>Chris Mattmann
>>>chris.mattmann@gmail.com
>>>
>>>
>>>
>>>
>>>-----Original Message-----
>>>From: Luke <ha...@gmail.com>
>>>Date: Tuesday, April 21, 2015 at 5:41 PM
>>>To: 'Luke' <ha...@gmail.com>, Chris Mattmann
>>><ch...@gmail.com>, 'Giuseppe Totaro' <to...@di.uniroma1.it>,
>>>Chris Mattmann <Ch...@jpl.nasa.gov>
>>>Cc: "'Bryant, Ann C (398G-Affiliate)'" <an...@gmail.com>,
>>>"'Zimdars, Paul A (3980-Affiliate)'" <Pa...@jpl.nasa.gov>,
>>>NSF Polar CyberInfrastructure DR Students
>>><ns...@googlegroups.com>,
>>><me...@googlegroups.com>
>>>Subject: RE: [memex-jpl] this week action from luke
>>>
>>>>Hi professor,
>>>>I just sent a pull request adding the cbor extension.
>>>>The interesting thing is that tika still identifies the file
>>>>dumped by the nutch dump tool as "application/xhtml+xml", even when
>>>>I manually change the file extension to the correct one (i.e. *.cbor).
>>>>
>>>>The reason is probably that tika identifies "application/xhtml+xml"
>>>>by searching for "<html" in the file content (PFA: xhtml+xml.jpg).
>>>>Now if you take a look at the cbor file dumped by nutch,
>>>>you see that we do have that element as part of the cbor content,
>>>>because the entire crawled xhtml document seems to be embedded in the
>>>>cbor json (PFA: cbor.jpg). Also, in Tika the magic detection seems to
>>>>have higher priority than the glob detection, thus the type is being
>>>>incorrectly detected.
>>>>
>>>>Therefore, I would like to mention that adding the entry
>>>><glob pattern="*.cbor"/> does not resolve the issue as of now without
>>>>some fixed magic bytes / patterns for cbor.
>>>>I would also like to add that things will be different with our
>>>>probabilistic mime detection selector, because if we know that the
>>>>file extension is more reliable than magic bytes, then we can
>>>>certainly give more preferential weight to the extension... this also
>>>>might show that the current implementation of MimeTypes detection is a
>>>>bit stiff or less flexible in this scenario. :)
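
The weighting idea in the message above can be illustrated with a toy sketch. This is not the ProbabilisticMimeDetectionSelector code and not how Tika's MimeTypes works; the class name, the method, and the weights are all hypothetical. It only shows how giving the extension more weight than the magic match would let "application/cbor" win over "application/xhtml+xml".

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy illustration only: hypothetical names and weights, not Tika code.
final class WeightedMimeVote {

    /**
     * Each detection method votes for one type with a weight; the type with
     * the highest total wins. On a tie the magic vote wins because it is
     * inserted first and the comparison below is strict.
     */
    static String pick(String magicType, double magicWeight,
                       String globType, double globWeight) {
        Map<String, Double> score = new LinkedHashMap<>();
        score.merge(magicType, magicWeight, Double::sum);
        score.merge(globType, globWeight, Double::sum);

        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : score.entrySet()) {
            if (e.getValue() > bestScore) {
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Extension trusted more than magic: the glob vote wins.
        System.out.println(pick("application/xhtml+xml", 0.4, "application/cbor", 0.6));
        // Magic trusted more than extension: the magic vote wins.
        System.out.println(pick("application/xhtml+xml", 0.6, "application/cbor", 0.4));
    }
}
```

With a fixed "magic beats glob" rule only the second outcome is possible; a tunable weight is what makes the first one reachable.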
>>>>
>>>>
>>>>Thanks
>>>>Luke
>>>>
>>>>-----Original Message-----
>>>>From: Luke [mailto:hanson311biz@gmail.com]
>>>>Sent: Tuesday, April 21, 2015 12:14 PM
>>>>To: 'Chris Mattmann'; 'Giuseppe Totaro'; 'Mattmann, Chris A (3980)'
>>>>Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A
>>>>(3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students';
>>>>'memex-jpl@googlegroups.com'
>>>>Subject: RE: [memex-jpl] this week action from luke
>>>>
>>>>Yes, let me add the cbor extension entry in tika xml, will send the
>>>>pull request soon.
>>>>
>>>>Thanks
>>>>Luke
>>>>-----Original Message-----
>>>>From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>>>>Sent: Tuesday, April 21, 2015 6:51 AM
>>>>To: Giuseppe Totaro; Mattmann, Chris A (3980)
>>>>Cc: Luke; Bryant, Ann C (398G-Affiliate); Zimdars, Paul A
>>>>(3980-Affiliate); NSF Polar CyberInfrastructure DR Students;
>>>>memex-jpl@googlegroups.com
>>>>Subject: Re: [memex-jpl] this week action from luke
>>>>
>>>>Giuseppe, both of these ideas (supporting the CBOR WRITE_TYPE_HEADER
>>>>tag, along with adding an -extension command) would be fantastic.
>>>>Can you file both of those NUTCH issues, wait a day or so, and then
>>>>based on feedback use your new Nutch commit karma to get those into
>>>>Nutch?
>>>>
>>>>And then when creating the issues, can you link to the TIKA-1610 issue?
>>>>At that point, when those two to-be-defined NUTCH issues are up,
>>>>Luke, in parallel can you throw up a pull request/patch in Tika for
>>>>the extension along with the MIME detection?
>>>>
>>>>Cheers,
>>>>Chris
>>>>
>>>>------------------------
>>>>Chris Mattmann
>>>>chris.mattmann@gmail.com
>>>>
>>>>
>>>>
>>>>
>>>>-----Original Message-----
>>>>From: Giuseppe Totaro <to...@di.uniroma1.it>
>>>>Date: Tuesday, April 21, 2015 at 12:33 PM
>>>>To: Chris Mattmann <Ch...@jpl.nasa.gov>
>>>>Cc: Luke <ha...@gmail.com>, Chris Mattmann
>>>><ch...@gmail.com>, "Bryant, Ann C (398G-Affiliate)"
>>>><an...@gmail.com>, "Zimdars, Paul A (3980-Affiliate)"
>>>><Pa...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR
>>>>Students <ns...@googlegroups.com>,
>>>>"memex-jpl@googlegroups.com"
>>>><me...@googlegroups.com>
>>>>Subject: Re: [memex-jpl] this week action from luke
>>>>
>>>>>Thanks Luke. Great work.
>>>>>Chris, we wrap a single string value, representing the JSON text,
>>>>>for each file into CBOR (by using the serializeCBORData method). For
>>>>>instance, using the Unix hex dump tool, we can see that, as
>>>>>expected, the first byte of all files is "0x7A" (the first three
>>>>>bits are "011", that is the major type for text strings, and the
>>>>>following 5 bits are "11010", meaning a uint32_t encodes the length
>>>>>of the following text), and the following 4 bytes (a 32-bit unsigned
>>>>>integer) encode the length of the text (as described in RFC 7049
>>>>><http://tools.ietf.org/html/rfc7049>).
>>>>>Therefore, a CBOR tag is currently not included in the file (a list
>>>>>of CBOR tags is available here
>>>>><http://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml>).
>>>>>I did not know about the CBOR "magic header". Thanks a lot Luke for
>>>>>this great research. Chris, if you agree, I can add support for
>>>>>prepending the self-describing CBOR tag 55799 to the
>>>>>CommonCrawlDataDumper class. I believe it is very easy because I only
>>>>>have to enable the WRITE_TYPE_HEADER feature of the CBORGenerator
>>>>>class (the source code is available here
>>>>><https://github.com/FasterXML/jackson-dataformat-cbor/blob/master/src/main/java/com/fasterxml/jackson/dataformat/cbor/CBORGenerator.java>).
>>>>>Then, I can comment on the TIKA-1610
>>>>><https://issues.apache.org/jira/browse/TIKA-1610> issue.
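
The initial-byte layout described above (three bits of major type, five bits of additional information, as in RFC 7049) can be checked with a small sketch. The class and method names are hypothetical and this is not code from Nutch, Tika, or Jackson; it only decodes one byte.

```java
// Minimal sketch of decoding a CBOR initial byte (RFC 7049, section 2).
// Hypothetical helper, not part of Nutch, Tika, or Jackson.
final class CborInitialByte {

    /** Major type: the top three bits of the initial byte (0..7). */
    static int majorType(int initialByte) {
        return (initialByte & 0xFF) >>> 5;
    }

    /** Additional information: the low five bits of the initial byte (0..31). */
    static int additionalInfo(int initialByte) {
        return initialByte & 0x1F;
    }

    /**
     * Number of bytes after the initial byte that carry the length or value:
     * 0 when it is held in the initial byte itself (additional info below 24),
     * -1 for reserved values (28..30) and the indefinite-length marker (31).
     */
    static int argumentBytes(int initialByte) {
        int info = additionalInfo(initialByte);
        if (info < 24) {
            return 0;
        }
        switch (info) {
            case 24: return 1;
            case 25: return 2;
            case 26: return 4;
            case 27: return 8;
            default: return -1;
        }
    }

    public static void main(String[] args) {
        int b = 0x7A; // binary 011 11010
        System.out.println("major type      = " + majorType(b));      // 3 (text string)
        System.out.println("additional info = " + additionalInfo(b)); // 26
        System.out.println("length bytes    = " + argumentBytes(b));  // 4
    }
}
```

For comparison, the self-describing tag 55799 starts with 0xD9, which decodes to major type 6 (tag) with a two-byte argument.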
>>>>>
>>>>>Regarding the file extension, in the Memex CCA format the original
>>>>>file extension is used. We could add support for an -extension
>>>>>command-line option allowing the user to give a file extension
>>>>>(e.g., cbor) for all files dumped out.
>>>>>
>>>>>Thanks a lot,
>>>>>Giuseppe
>>>>>
>>>>>
>>>>>
>>>>>On Tue, Apr 21, 2015 at 7:31 AM, Mattmann, Chris A (3980)
>>>>><ch...@jpl.nasa.gov> wrote:
>>>>>
>>>>>Thanks for this great research, Luke!
>>>>>
>>>>>Giuseppe, any idea why this tag doesn’t make it into the file?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>-----Original Message-----
>>>>>From: Luke <ha...@gmail.com>
>>>>>Date: Tuesday, April 21, 2015 at 2:55 AM
>>>>>To: Chris Mattmann <ch...@gmail.com>, "Totaro, Giuseppe U
>>>>>(3980-Affiliate)" <to...@di.uniroma1.it>, Chris Mattmann
>>>>><Ch...@jpl.nasa.gov>, "Bryant, Ann C (398G-Affiliate)"
>>>>><an...@gmail.com>, "Zimdars, Paul A (3980-Affiliate)"
>>>>><Pa...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR
>>>>>Students <ns...@googlegroups.com>,
>>>>>"memex-jpl@googlegroups.com"
>>>>><me...@googlegroups.com>
>>>>>Subject: RE: [memex-jpl] this week action from luke
>>>>>
>>>>>>Thanks professor.
>>>>>>Hi professor and all.
>>>>>>JIRA issue : CBOR Parser and detection improvement
>>>>>>https://issues.apache.org/jira/browse/TIKA-1610
>>>>>>
>>>>>>I tried to do a bit of research on this cbor detection.
>>>>>>
>>>>>>It looks like there is a self-describing tag that needs to be
>>>>>>written in the cbor file, through which other applications might be
>>>>>>able to identify the cbor type....
>>>>>>Please refer to http://tools.ietf.org/html/rfc7049#section-2.4.5
>>>>>>
>>>>>>I don’t see that tag being present in the cbor file dumped by the
>>>>>>nutch tool; I am not very sure though.
>>>>>>
>>>>>>Thanks
>>>>>>Luke
>>>>>>
>>>>>>
>>>>>>
>>>>>>-----Original Message-----
>>>>>>From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>>>>>>Sent: Monday, April 20, 2015 4:10 AM
>>>>>>To: Luke; 'Giuseppe Totaro'; 'Chris Mattmann'; 'Bryant, Ann C
>>>>>>(398G-Affiliate)'; 'Zimdars, Paul A (3980-Affiliate)'; 'NSF Polar
>>>>>>CyberInfrastructure DR Students'; memex-jpl@googlegroups.com
>>>>>>Subject: Re: [memex-jpl] this week action from luke
>>>>>>
>>>>>>Nice one, Luke. If you have a second and you can open up an issue
>>>>>>in Tika to make it support CBOR, then yes, by all means! :)
>>>>>>
>>>>>>
>>>>>>------------------------
>>>>>>Chris Mattmann
>>>>>>chris.mattmann@gmail.com
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>-----Original Message-----
>>>>>>From: Luke <ha...@gmail.com>
>>>>>>Date: Monday, April 20, 2015 at 4:15 AM
>>>>>>To: 'Giuseppe Totaro' <to...@di.uniroma1.it>, Chris Mattmann
>>>>>><ch...@gmail.com>, Chris Mattmann
>>>>>><Ch...@jpl.nasa.gov>, "'Bryant, Ann C (398G-Affiliate)'"
>>>>>><an...@gmail.com>, "'Zimdars, Paul A (3980-Affiliate)'"
>>>>>><Pa...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR
>>>>>>Students <ns...@googlegroups.com>,
>>>>>><me...@googlegroups.com>
>>>>>>Subject: RE: [memex-jpl] this week action from luke
>>>>>>
>>>>>>>Thanks a lot Giuseppe for the prompt response clearing up a bit of
>>>>>>>my confusion with the Nutch CommonCrawlDataDumper , appreciated.
>>>>>>>
>>>>>>>BTW, it looks like Tika might need to consider supporting a
>>>>>>>CBOR parser and detection.
>>>>>>>I checked the rfc, it looks like CBOR does not have magic numbers.
>>>>>>>PFA:
>>>>>>>rfc_cbor.jpg
>>>>>>>Actually, I don’t quite understand why the CommonCrawlDataDumper
>>>>>>>is not dumping the nutch segments with the .cbor extension, which
>>>>>>>seems to be helpful for type detection.
>>>>>>>
>>>>>>>To professor Mattmann,
>>>>>>>Tika does not support the detection of CBOR; although the trunk
>>>>>>>version has entries for cbor in tika-mimetypes.xml (PFA:
>>>>>>>cbor_tika.mimetypes.xml), those entries do not properly detect
>>>>>>>the cbor files dumped by CommonCrawlDataDumper.  Also, CBOR does
>>>>>>>not have magic bytes; off the top of my head the only way we can
>>>>>>>detect it is using the extension and the content byte histogram
>>>>>>>(please note, this is a locally optimal solution and
>>>>>>>data-dependent.)  :)
>>>>>>>
>>>>>>>I think I am deviating a bit from the main route and discussion of
>>>>>>>this thread…. i.e. the plan for testing the “probabilistic mime
>>>>>>>detector selection” with polar data.
>>>>>>>Anyway, I plan to repackage tika by incorporating the
>>>>>>>probabilistic selection feature and replace the tika jar in nutch
>>>>>>>with the repackaged one, and then run the CommonCrawlDataDumper
>>>>>>>and see how it goes. If you have any specific ideas or thoughts
>>>>>>>about the testing, please kindly let me know.
>>>>>>>
>>>>>>>Thanks
>>>>>>>Luke
>>>>>>>
>>>>>>>From: Giuseppe Totaro [mailto:totaro@di.uniroma1.it]
>>>>>>>Sent: Sunday, April 19, 2015 11:17 PM
>>>>>>>To: Luke liu
>>>>>>>Cc: Chris Mattmann; Chris Mattmann; Bryant, Ann C
>>>>>>>(398G-Affiliate); Zimdars, Paul A (3980-Affiliate); Luke; NSF
>>>>>>>Polar CyberInfrastructure DR Students; memex-jpl@googlegroups.com
>>>>>>>Subject: Re: [memex-jpl] this week action from luke
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>Hi Luke,
>>>>>>>
>>>>>>>
>>>>>>>my name is Giuseppe and I am a PhD student working under the
>>>>>>>supervision of Prof. Chris Mattmann. I worked on
>>>>>>>CommonCrawlDataDumper tool, so I can give some feedback on a
>>>>>>>couple of your observations. My comments inline below.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>On 19 Apr 2015, at 12:11, Luke liu <sh...@usc.edu> wrote:
>>>>>>>
>>>>>>>
>>>>>>>Thanks a lot professor; Sorry for the brief delay, I was spending
>>>>>>>some time in understanding the code repo i.e.
>>>>>>>http://github.com/chrismattmann/trec-dd-polar/
>>>>>>>
>>>>>>>From gen-common-crawl.sh, it looks like commoncrawldump is dumping
>>>>>>>the crawl segments to json files with the human readable and
>>>>>>>understandable content.
>>>>>>>1) I am trying to run one of the commands on my side as shown in
>>>>>>>gen-common-crawl.sh, but the generated files all end with .html or
>>>>>>>.htm. The command listed in gen-common-crawl.sh seems to allude
>>>>>>>to where the data is located on our nsfpolardata.dyndns.org
>>>>>>><http://nsfpolardata.dyndns.org/>. Although the locations are not
>>>>>>>exactly correct (probably they need to be updated), part of the
>>>>>>>patterns allowed me to locate some similar datasets (e.g.
>>>>>>>/data2/crawls/raw/CS572Spring2015). Again I am seeing that the
>>>>>>>dumped files all end with html, but surprisingly, inside those
>>>>>>>output html files, the contents are in json format;
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>The file extension is (almost) always the same as the original file.
>>>>>>>More in detail, using the -epochFilename command-line option (as
>>>>>>>in gen-common-crawl.sh), the scraped data will be stored with a
>>>>>>>filename of the format <epochtime(milliseconds)>.<filetype>, where
>>>>>>><filetype> is either the extension of the original file or .html
>>>>>>>as default if the original file does not have an extension. This
>>>>>>>schema is used for file naming and it does not depend on internal
>>>>>>>output format (JSON).
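
The naming scheme described above can be sketched as follows. This is not the CommonCrawlDataDumper code; the class and method are hypothetical and only restate the rule from the message: keep the original extension, fall back to html when there is none.

```java
// Sketch of the <epochtime(milliseconds)>.<filetype> naming rule.
// Hypothetical helper, not the Nutch implementation.
final class EpochFilename {

    /**
     * Builds "<epochMillis>.<ext>", where <ext> is the extension of the
     * original file name, or "html" when the original name has none.
     */
    static String epochFilename(long epochMillis, String originalName) {
        int dot = originalName.lastIndexOf('.');
        // A dot at position 0 or at the very end is not treated as an extension.
        boolean hasExtension = dot > 0 && dot < originalName.length() - 1;
        String ext = hasExtension ? originalName.substring(dot + 1) : "html";
        return epochMillis + "." + ext;
    }

    public static void main(String[] args) {
        System.out.println(epochFilename(1423894754000L, "report.pdf")); // 1423894754000.pdf
        System.out.println(epochFilename(1423894754000L, "index"));      // 1423894754000.html
    }
}
```

The file name therefore says nothing about the CBOR-encoded JSON stored inside, which is why the dumped files end with .html.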
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>2) Another problem is that the root object is being set with some
>>>>>>>garbled chars in each of the output json files (with the html
>>>>>>>extension), PFA: garbled.jpg; one of the output json files
>>>>>>>is also attached as an example (PFA:
>>>>>>>1423894754000.html). The json files cannot be parsed properly by
>>>>>>>aggregate.py due to those garbled chars.
>>>>>>>Even if I get rid of those garbled chars, there is no mimeTypes
>>>>>>>element, which is what aggregate.py reads.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>Text content and metadata extracted from the crawled binary data
>>>>>>>are stored in a structured document format (JSON). Furthermore,
>>>>>>>this document is encoded using CBOR <http://cbor.io/>
>>>>>>>serialization. Each non-human-readable character that you notice
>>>>>>>at the front and at the end of the JSON data is due to the CBOR
>>>>>>>encoding. Thus, if you need to read JSON data from a document
>>>>>>>dumped out by CommonCrawlDataDumper, you have to deserialize the
>>>>>>>CBOR-encoded data structure inside the file.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>I hope this short overview can help in your work. I really
>>>>>>>appreciate your feedback and, by the way, thanks a lot for your
>>>>>>>great job in detection.
>>>>>>>
>>>>>>>I am available to provide all the support I can give, so do
>>>>>>>not hesitate to contact me if you need any further information.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>Thanks,
>>>>>>>
>>>>>>>Giuseppe
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>Finally, after some research, I guess that the statistical
>>>>>>>information (present in the readme of the code repo) is not being
>>>>>>>collected and computed by aggregate.py from those output json
>>>>>>>files but it looks like it is coming from the log.... see the
>>>>>>>following as an example:
>>>>>>>
>>>>>>>2015-04-19 04:55:42,078 INFO  tools.CommonCrawlDataDumper -
>>>>>>>CommonsCrawlDataDumper File Stats:
>>>>>>>TOTAL Stats:
>>>>>>>[
>>>>>>>   {"mimeType":"application/x-tika-msoffice","count":"17"}
>>>>>>>   {"mimeType":"application/vnd.ms-excel","count":"7"}
>>>>>>>   {"mimeType":"application/xhtml+xml","count":"3000"}
>>>>>>>   {"mimeType":"application/octet-stream","count":"641"}
>>>>>>>   {"mimeType":"application/epub+zip","count":"1"}
>>>>>>>   {"mimeType":"application/zip","count":"6"}
>>>>>>>   {"mimeType":"application/xml","count":"11"}
>>>>>>>   {"mimeType":"image/png","count":"110"}
>>>>>>>   {"mimeType":"image/jpeg","count":"70"}
>>>>>>>   {"mimeType":"application/atom+xml","count":"213"}
>>>>>>>   {"mimeType":"application/rss+xml","count":"43"}
>>>>>>>   {"mimeType":"video/mp4","count":"3"}
>>>>>>>   {"mimeType":"text/plain","count":"104"}
>>>>>>>   {"mimeType":"application/rdf+xml","count":"2"}
>>>>>>>   {"mimeType":"image/gif","count":"2"}
>>>>>>>   {"mimeType":"text/x-php","count":"1"}
>>>>>>>   {"mimeType":"video/x-msvideo","count":"1"}
>>>>>>>   {"mimeType":"application/x-tika-ooxml","count":"3"}
>>>>>>>   {"mimeType":"text/html","count":"9506"}
>>>>>>>   {"mimeType":"application/pdf","count":"280"}
>>>>>>>]
>>>>>>>
>>>>>>>It turns out that aggregate.py is not the one that produces the
>>>>>>>statistical information, not sure what it does... but anyway, I
>>>>>>>think I understand the whole idea and I do concur with it; maybe
>>>>>>>we can repackage tika with the feature (i.e. probabilistic mime
>>>>>>>selection) incorporated and see if it outputs the same
>>>>>>>information in the log as the version without it.
>>>>>>>
>>>>>>>BTW, Regarding the use of the feature with probabilistic mime
>>>>>>>selection:
>>>>>>>in my pull request, I added a simple test case which might tell a
>>>>>>>bit more about how the feature is called and used, it is simple
>>>>>>>though.
>>>>>>>Here is an example snippet:
>>>>>>>    ProbabilisticMimeDetectionSelector probSel =
>>>>>>>            new ProbabilisticMimeDetectionSelector();
>>>>>>>    // input is an InputStream, metadata is a Metadata
>>>>>>>    probSel.detect(input, metadata);
>>>>>>>It is similar to MimeTypes::detect(...) (more
>>>>>>>information about this can be found in
>>>>>>>https://issues.apache.org/jira/browse/TIKA-1517)
>>>>>>>Now, in order to allow Tika().detect() to call
>>>>>>>ProbabilisticMimeDetectionSelector::detect(...) (as
>>>>>>>Tika().detect() is being called by commoncrawldump), we need to
>>>>>>>modify/add some code in TikaConfig, which initializes a list of
>>>>>>>default detectors: we need to remove the MimeTypes detector
>>>>>>>(mimeTypes) from the list and replace it with the
>>>>>>>ProbabilisticMimeDetectionSelector (probSel). (not sure if I
>>>>>>>should create another pull request with this change for
>>>>>>>TikaConfig)
>>>>>>>
>>>>>>>I think that is all of my initial thoughts, with some findings and
>>>>>>>a plan; if you have anything you would like to add or
>>>>>>>comment on, please do kindly let me know, then I will start working
>>>>>>>on my 'finale'. BTW, don’t worry, even after I have graduated,
>>>>>>>graduation is not the end of my work with tika and this project;
>>>>>>>after that I still can and want to help this polar project and
>>>>>>>tika as much as possible, correct programming faults and bugs,
>>>>>>>respond to tika issues, etc.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>Thanks
>>>>>>>Luke
>>>>>>>
>>>>>>>-----Original Message-----
>>>>>>>From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>>>>>>>Sent: Saturday, April 18, 2015 6:26 AM
>>>>>>>To: Luke liu; 'Mattmann, Chris A (3980)'; Bryant, Ann C
>>>>>>>(398G-Affiliate); Zimdars, Paul A (3980-Affiliate)
>>>>>>>Cc: 'Luke'; NSF Polar CyberInfrastructure DR Students;
>>>>>>>memex-jpl@googlegroups.com
>>>>>>>Subject: Re: this week action from luke
>>>>>>>Importance: High
>>>>>>>
>>>>>>>Awesome Luke. I am going to work specifically on now benchmarking
>>>>>>>your code in real situations. For example, it would be fantastic
>>>>>>>to now run your Bayesian MIME detector over the whole NSF TREC
>>>>>>>Dynamic Domain data for Polar described here:
>>>>>>>
>>>>>>>http://github.com/chrismattmann/trec-dd-polar/
>>>>>>>
>>>>>>>Paul Zimdars, CC’ed, can provide you with access to the data, and
>>>>>>>Annie can explain it, also CC’ed.
>>>>>>>
>>>>>>>Can we make that your goal for the next 2 weeks to actually test
>>>>>>>it and produce a real result over the whole TREC-DD data for
>>>>>>>Polar? My goal will be to get your code committed and integrated
>>>>>>>into Tika.
>>>>>>>The more you can write me a guide of how to build and test your
>>>>>>>code with Tika so I can get it committed the better.
>>>>>>>
>>>>>>>Also CC’ing the Memex list for context. Note everyone: Luke is
>>>>>>>building a Bayesian MIME classifier to evaluate against Tika’s
>>>>>>>existing MIME detection approach. If folks have any Memex needs to
>>>>>>>try and test more accurate file identification with Tika, Luke is
>>>>>>>the guy to talk to and I have him for 2 more weeks.
>>>>>>>
>>>>>>>Thanks!
>>>>>>>
>>>>>>>Cheers,
>>>>>>>Chris
>>>>>>>
>>>>>>>------------------------
>>>>>>>Chris Mattmann
>>>>>>>chris.mattmann@gmail.com
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>-----Original Message-----
>>>>>>>From: Luke liu <sh...@usc.edu>
>>>>>>>Date: Thursday, April 16, 2015 at 11:29 PM
>>>>>>>To: Chris Mattmann <ch...@gmail.com>, Chris Mattmann
>>>>>>><Ch...@jpl.nasa.gov>
>>>>>>>Cc: 'Luke' <ha...@gmail.com>
>>>>>>>Subject: this week action from luke
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>Hi Professor Mattmann,
>>>>>>>
>>>>>>>I think I am in the final phase of the research, and last week I
>>>>>>>finished the last item in the list, and hopefully everything will
>>>>>>>be fine.
>>>>>>>
>>>>>>>For now, I can probably spend some time verifying or optimizing
>>>>>>>the code; the majority of the research has been done… and it would
>>>>>>>also be great if you could please comment on my work (the 2 pull
>>>>>>>requests) when you have time.
>>>>>>>
>>>>>>>If you have any confusion about any of my work, please also do let
>>>>>>>me know.
>>>>>>>
>>>>>>>Thanks, and I am glad to be working with you. For the next couple
>>>>>>>of weeks before graduation, I am going to continue revising and
>>>>>>>testing the code and features to get rid of any flaws
>>>>>>>when I have time.
>>>>>>>
>>>>>>>Not sure if I missed something, and if I did miss something
>>>>>>>important, please do let me know too.
>>>>>>>
>>>>>>>Thanks
>>>>>>>Luke
>>>>>>>
>>>>>>>
>>>>>>><garbled.jpg><1423894754000.html>

RE: [memex-jpl] this week action from luke

Posted by Luke <ha...@gmail.com>.

Hi professor,

I just tried it with minLength set to 1024, and I get the following:
"text/plain"
I am a bit surprised....

BTW, a min length of 6000 still gives "application/xhtml+xml"; with a min length of 1024 or anything below, I am seeing "text/plain". :)

BTW, the min length I am referring to and altering is as follows:
MimeTypes.java
    public int getMinLength() {
        // This needs to be reasonably large to be able to correctly detect
        // things like XML root elements after initial comment and DTDs
        return 64 * 1024;
    }
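
One possible reading of these results, offered only as a guess, is that the "<html" pattern sits somewhere past the first 1024 bytes of the dumped file, so a shorter read-ahead buffer never sees it. The sketch below is not Tika's matching code; the class, the method, and the offset of 2000 are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration: a pattern is only found if it lies entirely
// inside the prefix that the detector is allowed to read.
final class PrefixScan {

    /** True if marker occurs entirely within the first `limit` bytes of data. */
    static boolean foundWithin(byte[] data, byte[] marker, int limit) {
        int end = Math.min(limit, data.length) - marker.length;
        for (int i = 0; i <= end; i++) {
            int j = 0;
            while (j < marker.length && data[i + j] == marker[j]) {
                j++;
            }
            if (j == marker.length) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] marker = "<html".getBytes(StandardCharsets.US_ASCII);
        // Made-up file: 2000 bytes of filler, then the marker.
        byte[] data = new byte[2000 + marker.length];
        Arrays.fill(data, (byte) ' ');
        System.arraycopy(marker, 0, data, 2000, marker.length);

        System.out.println(foundWithin(data, marker, 1024)); // false
        System.out.println(foundWithin(data, marker, 6000)); // true
    }
}
```

If this reading is right, the offset of "<html" in a real dumped file (visible in a hex dump) would fall between the two thresholds tried above.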


Thanks
Luke

-----Original Message-----
From: Chris Mattmann [mailto:chris.mattmann@gmail.com] 
Sent: Tuesday, April 21, 2015 7:48 PM
To: Luke; 'Mattmann, Chris A (3980)'; 'Totaro, Giuseppe U (3980-Affiliate)'; dev@tika.apache.org
Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; memex-jpl@googlegroups.com
Subject: Re: [memex-jpl] this week action from luke

Thanks Luke.

So I guess all I was asking was could you try it out. Thanks for the lesson in the RFC.

Cheers,
Chris

------------------------
Chris Mattmann
chris.mattmann@gmail.com




-----Original Message-----
From: Luke <ha...@gmail.com>
Date: Wednesday, April 22, 2015 at 1:46 AM
To: Chris Mattmann <Ch...@jpl.nasa.gov>, Chris Mattmann <ch...@gmail.com>, "'Totaro, Giuseppe U (3980-Affiliate)'"
<to...@di.uniroma1.it>, <de...@tika.apache.org>
Cc: "'Bryant, Ann C (398G-Affiliate)'" <an...@gmail.com>, "'Zimdars, Paul A (3980-Affiliate)'" <Pa...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR Students <ns...@googlegroups.com>,
<me...@googlegroups.com>
Subject: RE: [memex-jpl] this week action from luke

>Hi professor,
>
>
>I think it highly depends on the content being read by tika, e.g. if
>there is a sequence of bytes in the file being read that is the
>same as the magic of one or more of the mime types defined in our
>tika-mimetypes.xml, I guess that tika will put those types in its
>estimation list; please note there could be multiple mime types
>estimated by the magic-byte detection approach. Now tika also considers
>the decision made by the extension detection approach: if the extension
>says the file type it believes is the first one in the magic type
>estimation list, then certainly the first one will be returned (the
>same applies to the metadata hint approach). Of course, tika also
>prefers the type that is the most specialized.
>
>Let's get back to the following question; here is my guess though.
>[Prof]: Also what happens if you tweak the definition of XHTML to not
>scan until 8192, but say 6000 (e.g., 0:6000), does CBOR take over then?
>Let's consider an extreme case where we only scan 10 or 1 bytes; then
>it seems that magic bytes will inevitably detect nothing, and I think
>it will return something like "application/octet-stream", which is the
>most general type. As mentioned, tika favours the type that is the most
>specialized, so if the extension approach returns one that is more
>specialized, it wins; in this extreme case I believe almost every type
>is a subclass of "application/octet-stream". Therefore the answer in
>this extreme case may be yes: I think it is very possible that the CBOR
>type detected by the extension approach takes over in this case...
>
>My idea was and still is that if the cbor self-describing tag 55799 is
>present in the cbor file, then that can be used to detect the cbor type.
>Again, the cbor type will probably be appended to the magic
>estimation list together with another one such as
>application/xhtml+xml; I guess the order in the list probably also
>matters, the first one being preferred over the next one. The decision
>from the extension detection approach also plays a role in breaking
>the tie, e.g. if the extension detection method agrees on cbor with one
>of the estimated types in the magic list, then cbor will be returned
>(again, the same thing applies to the metadata hint method).
>
>I have not taken a closer look at a cbor file that has the tag 55799,
>but I expect its hex to be something like 0xd9d9f7, i.e. the tag
>should be present in the header as a fixed sequence of bytes
>(https://tools.ietf.org/html/rfc7049#section-2.4.5). If this is
>present in the file, or preferably in the header (within a reasonable
>range of bytes), I believe it can probably be used as the magic
>numbers for the cbor type.
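
As a rough sketch of the check proposed above, the following tests whether a stream begins with the three bytes 0xD9 0xD9 0xF7, the encoding of tag 55799 given in RFC 7049 section 2.4.5. The class and method are hypothetical; this is not a Tika detector, and in Tika itself such a pattern would more likely be declared as a magic entry in tika-mimetypes.xml.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper, not Tika code: checks for the self-describe CBOR tag.
final class CborSelfDescribeCheck {

    // Tag 55799 = 0xD9F7, written as a tag head with a two-byte argument:
    // 0xD9 (major type 6, additional info 25) followed by 0xD9 0xF7.
    private static final byte[] SELF_DESCRIBE = {(byte) 0xD9, (byte) 0xD9, (byte) 0xF7};

    /** True if the stream starts with the self-describe CBOR tag bytes. */
    static boolean startsWithSelfDescribeTag(InputStream in) throws IOException {
        byte[] head = new byte[SELF_DESCRIBE.length];
        int read = 0;
        while (read < head.length) {
            int n = in.read(head, read, head.length - read);
            if (n < 0) {
                return false; // stream shorter than the tag
            }
            read += n;
        }
        return Arrays.equals(head, SELF_DESCRIBE);
    }

    public static void main(String[] args) throws IOException {
        // Tagged item: the tag followed by the one-byte text string "a" (0x61 0x61).
        byte[] tagged = {(byte) 0xD9, (byte) 0xD9, (byte) 0xF7, 0x61, 0x61};
        // Untagged item starting with 0x7A, as in the files discussed in this thread.
        byte[] untagged = {0x7A, 0x00, 0x00, 0x00, 0x01, 0x61};

        System.out.println(startsWithSelfDescribeTag(new ByteArrayInputStream(tagged)));   // true
        System.out.println(startsWithSelfDescribeTag(new ByteArrayInputStream(untagged))); // false
    }
}
```

Because the tag sits at offset 0, a check like this would not depend on how far into the file the detector reads.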
>
>
>There is another thing I mentioned in the jira ticket I opened
>yesterday about the cbor parser and detection: it is also possible
>that cbor content is embedded inside a plain json file, and the way
>that a decoder can distinguish them in that file is by looking at the
>tag 55799 again. This may rarely happen, but a robust parser might be
>able to take care of it; tika might need to consider using the
>FasterXML library used by the nutch tool when developing the cbor parser...
>Again let me cite the same paragraph from the rfc,
>
>" a decoder might be able to parse both CBOR and JSON.
>   Such a decoder would need to mechanically distinguish the two
>   formats.  An easy way for an encoder to help the decoder would be to
>   tag the entire CBOR item with tag 55799, the serialization of which
>   will never be found at the beginning of a JSON text."
>
>
>Thanks
>Luke
>
>
>
>-----Original Message-----
>From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
>Sent: Tuesday, April 21, 2015 9:49 PM
>To: Luke; 'Chris Mattmann'; Totaro, Giuseppe U (3980-Affiliate)
>Cc: Bryant, Ann C (398G-Affiliate); Zimdars, Paul A (3980-Affiliate); 
>'NSF Polar CyberInfrastructure DR Students'; memex-jpl@googlegroups.com
>Subject: Re: [memex-jpl] this week action from luke
>
>Hi Luke,
>
>Can you post the below conversation to dev@tika and summarize it there.
>Also what happens if you tweak the definition of XHTML to not scan 
>until 8192, but say 6000 (e.g., 0:6000), does CBOR take over then?
>
>Cheers,
>Chris
>
>
>
>
>
>
>
>-----Original Message-----
>From: Luke <ha...@gmail.com>
>Date: Wednesday, April 22, 2015 at 12:19 AM
>To: Chris Mattmann <ch...@gmail.com>, "Totaro, Giuseppe U 
>(3980-Affiliate)" <to...@di.uniroma1.it>, Chris Mattmann 
><Ch...@jpl.nasa.gov>
>Cc: "Bryant, Ann C (398G-Affiliate)" <an...@gmail.com>, "Zimdars, 
>Paul A (3980-Affiliate)" <Pa...@jpl.nasa.gov>, NSF Polar 
>CyberInfrastructure DR Students 
><ns...@googlegroups.com>,
>"memex-jpl@googlegroups.com" <me...@googlegroups.com>
>Subject: RE: [memex-jpl] this week action from luke
>
>>Hi Professor,
>>Please see attached jpg for the difference.
>>Thanks
>>Luke
>>
>>-----Original Message-----
>>From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>>Sent: Tuesday, April 21, 2015 5:27 PM
>>To: Luke; 'Giuseppe Totaro'; 'Mattmann, Chris A (3980)'
>>Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A 
>>(3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; 
>>memex-jpl@googlegroups.com
>>Subject: Re: [memex-jpl] this week action from luke
>>
>>Hey Luke what happens if you do java -jar /path/to/tika-app -m 
>>/path/to/cbor/file.cbor, compared to: java -jar /path/to/tika-app -m < 
>>/path/to/cbor/file.cbor any difference?
>>
>>------------------------
>>Chris Mattmann
>>chris.mattmann@gmail.com
>>
>>
>>
>>
>>-----Original Message-----
>>From: Luke <ha...@gmail.com>
>>Date: Tuesday, April 21, 2015 at 5:41 PM
>>To: 'Luke' <ha...@gmail.com>, Chris Mattmann 
>><ch...@gmail.com>, 'Giuseppe Totaro' <to...@di.uniroma1.it>, 
>>Chris Mattmann <Ch...@jpl.nasa.gov>
>>Cc: "'Bryant, Ann C (398G-Affiliate)'" <an...@gmail.com>, 
>>"'Zimdars, Paul A (3980-Affiliate)'" <Pa...@jpl.nasa.gov>, 
>>NSF Polar CyberInfrastructure DR Students 
>><ns...@googlegroups.com>,
>><me...@googlegroups.com>
>>Subject: RE: [memex-jpl] this week action from luke
>>
>>>Hi professor,
>>>I just sent a pull request for adding cbor extension.
>>>The interesting thing is that Tika still identifies the file dumped 
>>>by the Nutch dump tool as "application/xhtml+xml", even when I 
>>>manually change the file extension to the correct one (i.e. *.cbor).
>>>
>>>The reason is probably that Tika identifies "application/xhtml+xml" 
>>>by searching for "<html" in the file content (PFA: xhtml+xml.jpg). 
>>>If you take a look at the CBOR file dumped by Nutch, you will see 
>>>that we do have that element as part of the CBOR content, because 
>>>the entire crawled XHTML document seems to be embedded in the CBOR 
>>>JSON (PFA: cbor.jpg). Also, in Tika the magic detection seems to 
>>>have higher priority than the glob detection, so the type is 
>>>detected incorrectly.
>>>
>>>Therefore, I would like to mention that adding the entry 
>>><glob pattern="*.cbor"/> does not resolve the issue for now, without 
>>>some fixed magic bytes / patterns for CBOR.
>>>I would also like to add that things will be different with our 
>>>probabilistic mime detection selector, because if we know that the 
>>>file extension is more reliable than magic bytes, then we can 
>>>certainly add more preferential weight to the extension. This also 
>>>might show that the current implementation of MimeTypes detection is 
>>>a bit stiff or inflexible in this scenario. :)
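
The weighting idea above can be illustrated with a toy Python sketch. This is not Tika's actual algorithm or the probabilistic selector's implementation; the function name, the weight values, and the simple "sum the weights per type" rule are all assumptions made only for illustration.

```python
def pick_type(candidates, weights):
    """Return the MIME type with the highest total weight.

    candidates maps a detection source ("magic", "glob") to the type it
    proposes; weights maps the same sources to a trust value.
    Hypothetical helper, not part of Tika.
    """
    totals = {}
    for source, mime in candidates.items():
        totals[mime] = totals.get(mime, 0.0) + weights.get(source, 0.0)
    return max(totals, key=totals.get)


# The situation from the thread: magic sees "<html", the glob sees *.cbor.
candidates = {"magic": "application/xhtml+xml", "glob": "application/cbor"}

# Trusting magic more reproduces the misdetection ...
magic_first = pick_type(candidates, {"magic": 0.7, "glob": 0.3})
# ... while trusting the extension more yields the CBOR type.
glob_first = pick_type(candidates, {"magic": 0.3, "glob": 0.7})
```

With a fixed priority order only the first outcome is possible; adjustable weights make the second one reachable when the extension is known to be more reliable.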
>>>
>>>
>>>Thanks
>>>Luke
>>>
>>>-----Original Message-----
>>>From: Luke [mailto:hanson311biz@gmail.com]
>>>Sent: Tuesday, April 21, 2015 12:14 PM
>>>To: 'Chris Mattmann'; 'Giuseppe Totaro'; 'Mattmann, Chris A (3980)'
>>>Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A 
>>>(3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; 
>>>'memex-jpl@googlegroups.com'
>>>Subject: RE: [memex-jpl] this week action from luke
>>>
>>>Yes, let me add the cbor extension entry in tika xml, will send the 
>>>pull request soon.
>>>
>>>Thanks
>>>Luke
>>>-----Original Message-----
>>>From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>>>Sent: Tuesday, April 21, 2015 6:51 AM
>>>To: Giuseppe Totaro; Mattmann, Chris A (3980)
>>>Cc: Luke; Bryant, Ann C (398G-Affiliate); Zimdars, Paul A 
>>>(3980-Affiliate); NSF Polar CyberInfrastructure DR Students; 
>>>memex-jpl@googlegroups.com
>>>Subject: Re: [memex-jpl] this week action from luke
>>>
>>>Giuseppe both of these ideas supporting the CBOR WRITE_TYPE_HEADER 
>>>and tag along with adding an -extension command would be fantastic. 
>>>Can you file both of those NUTCH issues, wait a day or so, and then 
>>>based on feedback use your new Nutch commit karma to get those into Nutch?
>>>
>>>And then when creating the issues, can you link to the TIKA-1610 issue?
>>>At that point, when those two to be defined NUTCH issues are up, 
>>>Luke, in parallel can you throw up a pull request/patch in Tika for 
>>>the extension along with the MIME detection?
>>>
>>>Cheers,
>>>Chris
>>>
>>>------------------------
>>>Chris Mattmann
>>>chris.mattmann@gmail.com
>>>
>>>
>>>
>>>
>>>-----Original Message-----
>>>From: Giuseppe Totaro <to...@di.uniroma1.it>
>>>Date: Tuesday, April 21, 2015 at 12:33 PM
>>>To: Chris Mattmann <Ch...@jpl.nasa.gov>
>>>Cc: Luke <ha...@gmail.com>, Chris Mattmann 
>>><ch...@gmail.com>, "Bryant, Ann C (398G-Affiliate)"
>>><an...@gmail.com>, "Zimdars, Paul A (3980-Affiliate)"
>>><Pa...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR 
>>>Students <ns...@googlegroups.com>,
>>>"memex-jpl@googlegroups.com"
>>><me...@googlegroups.com>
>>>Subject: Re: [memex-jpl] this week action from luke
>>>
>>>>Thanks Luke. Great work.
>>>>Chris, we wrap a single string value, representing the JSON text, 
>>>>for each file into CBOR (by using the serializeCBORData method). For 
>>>>instance, using the Unix hex dump tool, we can see that, as 
>>>>expected, the first byte of all files is "0x7F" (the first three 
>>>>bits are "011", that is the major type for text strings, and the 
>>>>following 5 bits are "11010", meaning a uint32_t encodes the length 
>>>>of the following text), and the following 4 bytes (a 32-bit unsigned 
>>>>integer) encode the correct length of the file (as described in 
>>>>RFC7049 <http://tools.ietf.org/html/rfc7049>).
>>>>Therefore, no CBOR tag is currently included in the file (a list of 
>>>>CBOR tags is available here 
>>>><http://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml>).
>>>>I did not know about the CBOR "magic header". Thanks a lot, Luke, for 
>>>>this great research. Chris, if you agree, I can add support for 
>>>>prepending the self-describing CBOR tag 55799 to the 
>>>>CommonCrawlDataDumper class. I believe it is very easy, because I 
>>>>only have to enable the WRITE_TYPE_HEADER feature of the 
>>>>CBORGenerator class (the source code is available here 
>>>><https://github.com/FasterXML/jackson-dataformat-cbor/blob/master/src/main/java/com/fasterxml/jackson/dataformat/cbor/CBORGenerator.java>).
>>>>Then, I can comment on the TIKA-1610
>>>><https://issues.apache.org/jira/browse/TIKA-1610> issue.
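
As a quick check of the header arithmetic discussed in this message, the following Python sketch splits a CBOR initial byte into its 3-bit major type and 5-bit additional information, following RFC 7049. The function name is only illustrative. Note that the bit pattern 011 11010 corresponds to the byte 0x7A (text string with a 4-byte length), whereas 0x7F is 011 11111, which RFC 7049 defines as the start of an indefinite-length text string.

```python
MAJOR_TYPES = {
    0: "unsigned integer",
    1: "negative integer",
    2: "byte string",
    3: "text string",
    4: "array",
    5: "map",
    6: "tag",
    7: "simple value / float",
}


def split_initial_byte(byte):
    """Return (major_type, additional_info) for a CBOR initial byte."""
    return byte >> 5, byte & 0x1F


for value in (0x7A, 0x7F, 0xD9):
    major, info = split_initial_byte(value)
    print(f"0x{value:02X}: major type {major} ({MAJOR_TYPES[major]}), "
          f"additional info {info}")
```

Additional information 26 means that a 4-byte unsigned length follows; 31 marks an indefinite-length item; 0xD9 is the initial byte of a tag whose number is stored in the next 2 bytes.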
>>>>
>>>>Regarding the file extension, in the Memex CCA format the original 
>>>>file extension is used. We could add support for a -extension 
>>>>command-line option allowing the user to give a file extension 
>>>>(e.g.,
>>>>cbor) for all files dumped out.
>>>>
>>>>Thanks a lot,
>>>>Giuseppe
>>>>
>>>>
>>>>
>>>>On Tue, Apr 21, 2015 at 7:31 AM, Mattmann, Chris A (3980) 
>>>><ch...@jpl.nasa.gov> wrote:
>>>>
>>>>Thanks for this great research, Luke!
>>>>
>>>>Giuseppe, any idea why this tag doesn’t make it into the file?
>>>>
>>>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>Chris Mattmann, Ph.D.
>>>>Chief Architect
>>>>Instrument Software and Science Data Systems Section (398) NASA Jet 
>>>>Propulsion Laboratory Pasadena, CA 91109 USA
>>>>Office: 168-519, Mailstop: 168-527
>>>>Email: chris.a.mattmann@nasa.gov
>>>>WWW:  http://sunset.usc.edu/~mattmann/
>>>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>Adjunct Associate Professor, Computer Science Department University 
>>>>of Southern California, Los Angeles, CA 90089 USA
>>>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>-----Original Message-----
>>>>From: Luke <ha...@gmail.com>
>>>>Date: Tuesday, April 21, 2015 at 2:55 AM
>>>>To: Chris Mattmann <ch...@gmail.com>, "Totaro, Giuseppe U 
>>>>(3980-Affiliate)" <to...@di.uniroma1.it>, Chris Mattmann 
>>>><Ch...@jpl.nasa.gov>, "Bryant, Ann C (398G-Affiliate)"
>>>><an...@gmail.com>, "Zimdars, Paul A (3980-Affiliate)"
>>>><Pa...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR 
>>>>Students <ns...@googlegroups.com>,
>>>>"memex-jpl@googlegroups.com"
>>>><me...@googlegroups.com>
>>>>Subject: RE: [memex-jpl] this week action from luke
>>>>
>>>>>Thanks professor.
>>>>>Hi professor and all.
>>>>>JIRA issue : CBOR Parser and detection improvement
>>>>>https://issues.apache.org/jira/browse/TIKA-1610
>>>>>
>>>>>I did a bit of research on CBOR detection.
>>>>>
>>>>>It looks like there is a self-describing tag that needs to be 
>>>>>written in the CBOR file, through which other applications might be 
>>>>>able to identify the CBOR type.
>>>>>Please refer to http://tools.ietf.org/html/rfc7049#section-2.4.5
>>>>>
>>>>>I don’t see that tag in the CBOR file dumped by the Nutch tool, 
>>>>>although I am not entirely sure.
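
For reference, RFC 7049 section 2.4.5 assigns tag 55799 to self-described CBOR, which serializes to the three bytes 0xD9 0xD9 0xF7. A minimal Python sketch of checking a dumped file for that prefix might look like this; the function name is an assumption, not something taken from Tika or Nutch.

```python
# Tag 55799: major type 6 (tag) with additional info 25 (2-byte value follows).
SELF_DESCRIBE_TAG = bytes([(6 << 5) | 25]) + (55799).to_bytes(2, "big")


def has_self_describe_tag(data):
    """Return True if the data starts with the self-describe CBOR tag."""
    return data[:3] == SELF_DESCRIBE_TAG


tagged = SELF_DESCRIBE_TAG + b"\x61a"   # tag followed by the text string "a"
untagged = b"\x7a\x00\x00\x00\x01a"     # the same string with a 4-byte length
print(has_self_describe_tag(tagged), has_self_describe_tag(untagged))
```

To inspect a real dump, the first three bytes can be read with `open(path, "rb").read(3)` and passed to the same function.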
>>>>>
>>>>>Thanks
>>>>>Luke
>>>>>
>>>>>
>>>>>
>>>>>-----Original Message-----
>>>>>From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>>>>>Sent: Monday, April 20, 2015 4:10 AM
>>>>>To: Luke; 'Giuseppe Totaro'; 'Chris Mattmann'; 'Bryant, Ann C 
>>>>>(398G-Affiliate)'; 'Zimdars, Paul A (3980-Affiliate)'; 'NSF Polar 
>>>>>CyberInfrastructure DR Students'; memex-jpl@googlegroups.com
>>>>>Subject: Re: [memex-jpl] this week action from luke
>>>>>
>>>>>Nice one, Luke. If you have a second and you can open up an issue 
>>>>>in Tika to make it support CBOR, then yes, by all means! :)
>>>>>
>>>>>
>>>>>------------------------
>>>>>Chris Mattmann
>>>>>chris.mattmann@gmail.com
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>-----Original Message-----
>>>>>From: Luke <ha...@gmail.com>
>>>>>Date: Monday, April 20, 2015 at 4:15 AM
>>>>>To: 'Giuseppe Totaro' <to...@di.uniroma1.it>, Chris Mattmann 
>>>>><ch...@gmail.com>, Chris Mattmann 
>>>>><Ch...@jpl.nasa.gov>, "'Bryant, Ann C (398G-Affiliate)'"
>>>>><an...@gmail.com>, "'Zimdars, Paul A (3980-Affiliate)'"
>>>>><Pa...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR 
>>>>>Students <ns...@googlegroups.com>,
>>>>><me...@googlegroups.com>
>>>>>Subject: RE: [memex-jpl] this week action from luke
>>>>>
>>>>>>Thanks a lot Giuseppe for the prompt response clearing up a bit of 
>>>>>>my confusion with the Nutch CommonCrawlDataDumper , appreciated.
>>>>>>
>>>>>>BTW, it looks like Tika might need to consider supporting a CBOR 
>>>>>>parser and detection.
>>>>>>I checked the RFC; it looks like CBOR does not have magic numbers. PFA:
>>>>>>rfc_cbor.jpg
>>>>>>Actually, I don’t quite understand why the CommonCrawlDataDumper 
>>>>>>is not dumping the nutch segments with the .cbor extension, which 
>>>>>>would seem to be helpful for type detection.
>>>>>>
>>>>>>To professor Mattmann,
>>>>>>Tika does not support the detection of CBOR. Although the trunk 
>>>>>>version has entries for cbor in tika-mimetypes.xml (PFA: 
>>>>>>cbor_tika.mimetypes.xml), those entries do not properly detect 
>>>>>>the cbor files dumped by CommonCrawlDataDumper. Also, CBOR does 
>>>>>>not have magic bytes; off the top of my head, the only way we can 
>>>>>>detect it is by using the extension and the content byte histogram 
>>>>>>(please note, this is a locally optimal solution and 
>>>>>>data-dependent.) :)
>>>>>>
>>>>>>I think I am deviating a bit from the main discussion of 
>>>>>>this thread, i.e. the plan for testing the “probabilistic mime 
>>>>>>detector selection” with polar data.
>>>>>>Anyway, I plan to repackage Tika with the probabilistic selection 
>>>>>>feature incorporated, replace the Tika jar in Nutch 
>>>>>>with the repackaged one, and then run the CommonCrawlDataDumper 
>>>>>>and see how it goes. If you have any specific ideas or thoughts 
>>>>>>about the testing, please kindly let me know.
>>>>>>
>>>>>>Thanks
>>>>>>Luke
>>>>>>
>>>>>>From: Giuseppe Totaro [mailto:totaro@di.uniroma1.it]
>>>>>>Sent: Sunday, April 19, 2015 11:17 PM
>>>>>>To: Luke liu
>>>>>>Cc: Chris Mattmann; Chris Mattmann; Bryant, Ann C 
>>>>>>(398G-Affiliate); Zimdars, Paul A (3980-Affiliate); Luke; NSF 
>>>>>>Polar CyberInfrastructure DR Students; memex-jpl@googlegroups.com
>>>>>>Subject: Re: [memex-jpl] this week action from luke
>>>>>>
>>>>>>
>>>>>>
>>>>>>Hi Luke,
>>>>>>
>>>>>>
>>>>>>my name is Giuseppe and I am a PhD student working under the 
>>>>>>supervision of Prof. Chris Mattmann. I worked on 
>>>>>>CommonCrawlDataDumper tool, so I can give some feedback on a 
>>>>>>couple of your observations. My comments inline below.
>>>>>>
>>>>>>
>>>>>>
>>>>>>Il giorno 19/apr/2015, alle ore 12:11, Luke liu <sh...@usc.edu> 
>>>>>>ha
>>>>>>scritto:
>>>>>>
>>>>>>
>>>>>>Thanks a lot professor; Sorry for the brief delay, I was spending 
>>>>>>some time in understanding the code repo i.e.
>>>>>>http://github.com/chrismattmann/trec-dd-polar/
>>>>>>
>>>>>>From gen-common-crawl.sh, it looks like commoncrawldump is dumping 
>>>>>>the crawl segments to json files with the human readable and 
>>>>>>understandable content.
>>>>>>1) I am trying to run one of the commands on my side as shown in 
>>>>>>gen-common-crawl.sh, but the generated files all end with .html or 
>>>>>>.htm. The command listed in gen-common-crawl.sh seems to allude 
>>>>>>to where the data is located on our nsfpolardata.dyndns.org 
>>>>>><http://nsfpolardata.dyndns.org/>. Although the locations are not 
>>>>>>exactly correct (probably they need to be updated), part of the 
>>>>>>patterns allowed me to locate some similar datasets (e.g. 
>>>>>>/data2/crawls/raw/CS572Spring2015). Again, I am seeing that the 
>>>>>>dumped files all end with html, but surprisingly, inside those 
>>>>>>output html files, the contents are in JSON format;
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>The file extension is (almost) always the same as that of the 
>>>>>>original file. In more detail, using the -epochFilename 
>>>>>>command-line option (as in gen-common-crawl.sh), the scraped data 
>>>>>>will be stored with a filename of the format 
>>>>>><epochtime(milliseconds)>.<filetype>, where <filetype> is either 
>>>>>>the extension of the original file or .html as default if the 
>>>>>>original file does not have an extension. This scheme is used for 
>>>>>>file naming and it does not depend on the internal output format 
>>>>>>(JSON).
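
The naming scheme described above can be sketched in Python as follows. This only mirrors the rule as stated in the message (epoch milliseconds plus the original extension, with html as the fallback); the function name and the way the extension is taken from the URL path are assumptions, and the actual CommonCrawlDataDumper logic may differ.

```python
import posixpath
from urllib.parse import urlparse


def dump_filename(epoch_millis, url, default_ext="html"):
    """Build '<epochtime(milliseconds)>.<filetype>' for a crawled URL."""
    path = urlparse(url).path
    ext = posixpath.splitext(path)[1].lstrip(".")
    return f"{epoch_millis}.{ext or default_ext}"


print(dump_filename(1423894754000, "http://example.org/docs/report.pdf"))
print(dump_filename(1423894754000, "http://example.org/about"))
```

Under this rule a page without an extension ends up as a .html file even though its content is CBOR-encoded JSON, which matches what was observed in the dump directory.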
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>2) Another problem is that the root object is preceded by some 
>>>>>>garbled chars in each of the output json files (with the html 
>>>>>>extension at the end), PFA: garbled.jpg; one of the output json 
>>>>>>files has also been attached as an example (PFA: 
>>>>>>1423894754000.html). The json files cannot be parsed properly by 
>>>>>>aggregate.py due to those garbled chars.
>>>>>>Even if I get rid of those garbled chars, there is no mimeTypes 
>>>>>>element, which is what aggregate.py reads.
>>>>>>
>>>>>>
>>>>>>
>>>>>>Text content and metadata extracted from the crawled binary data 
>>>>>>are stored in a structured document format (JSON). Furthermore, 
>>>>>>this document is encoded using CBOR <http://cbor.io/> 
>>>>>>serialization. Each non-human-readable character that you notice 
>>>>>>at the start and at the end of the JSON data is due to the CBOR 
>>>>>>encoding. Thus, if you need to read JSON data from a document 
>>>>>>dumped out by CommonCrawlDataDumper, you have to deserialize the 
>>>>>>CBOR-encoded data structure inside the file.
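
To make the deserialization step concrete, here is a minimal Python sketch that unwraps a single definite-length CBOR text string and parses the JSON inside it. It is a hand-rolled illustration under stated assumptions (one top-level text string, optionally preceded by tag 55799, no indefinite-length chunks), not how CommonCrawlDataDumper or any CBOR library does it; a full CBOR library would normally be preferable.

```python
import json


def read_cbor_text(data):
    """Decode one definite-length CBOR text string (RFC 7049 major type 3)."""
    pos = 0
    if data[:3] == b"\xd9\xd9\xf7":      # optional self-describe tag 55799
        pos = 3
    major, info = data[pos] >> 5, data[pos] & 0x1F
    pos += 1
    if major != 3:
        raise ValueError("top-level item is not a text string")
    if info < 24:                         # length stored in the initial byte
        length = info
    elif info <= 27:                      # 1, 2, 4 or 8 length bytes follow
        size = 1 << (info - 24)
        length = int.from_bytes(data[pos:pos + size], "big")
        pos += size
    else:
        raise ValueError("indefinite-length or reserved encoding not handled")
    return data[pos:pos + length].decode("utf-8")


# Build a small sample: 0x7A header, 4-byte length, then the JSON text.
payload = '{"url": "http://example.org/", "content": "<html>...</html>"}'.encode("utf-8")
encoded = b"\x7a" + len(payload).to_bytes(4, "big") + payload
document = json.loads(read_cbor_text(encoded))
print(document["url"])
```

The leading header bytes are what show up as unreadable characters in front of the JSON when the file is opened in a text editor.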
>>>>>>
>>>>>>
>>>>>>
>>>>>>I hope this short overview can help you in your work. I really 
>>>>>>appreciate your feedback and, by the way, thanks a lot for your 
>>>>>>great job in detection.
>>>>>>
>>>>>>I am available to provide all the support I can give, so do 
>>>>>>not hesitate to contact me if you need any further information.
>>>>>>
>>>>>>
>>>>>>
>>>>>>Thanks,
>>>>>>
>>>>>>Giuseppe
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>Finally, after some research, I guess that the statistical 
>>>>>>information (present in the readme of the code repo) is not being 
>>>>>>collected and computed by aggregate.py from those output json 
>>>>>>files but it looks like it is coming from the log.... see the 
>>>>>>following as an example:
>>>>>>
>>>>>>2015-04-19 04:55:42,078 INFO  tools.CommonCrawlDataDumper - 
>>>>>>CommonsCrawlDataDumper File Stats:
>>>>>>TOTAL Stats:
>>>>>>[
>>>>>>   {"mimeType":"application/x-tika-msoffice","count":"17"}
>>>>>>   {"mimeType":"application/vnd.ms-excel","count":"7"}
>>>>>>   {"mimeType":"application/xhtml+xml","count":"3000"}
>>>>>>   {"mimeType":"application/octet-stream","count":"641"}
>>>>>>   {"mimeType":"application/epub+zip","count":"1"}
>>>>>>   {"mimeType":"application/zip","count":"6"}
>>>>>>   {"mimeType":"application/xml","count":"11"}
>>>>>>   {"mimeType":"image/png","count":"110"}
>>>>>>   {"mimeType":"image/jpeg","count":"70"}
>>>>>>   {"mimeType":"application/atom+xml","count":"213"}
>>>>>>   {"mimeType":"application/rss+xml","count":"43"}
>>>>>>   {"mimeType":"video/mp4","count":"3"}
>>>>>>   {"mimeType":"text/plain","count":"104"}
>>>>>>   {"mimeType":"application/rdf+xml","count":"2"}
>>>>>>   {"mimeType":"image/gif","count":"2"}
>>>>>>   {"mimeType":"text/x-php","count":"1"}
>>>>>>   {"mimeType":"video/x-msvideo","count":"1"}
>>>>>>   {"mimeType":"application/x-tika-ooxml","count":"3"}
>>>>>>   {"mimeType":"text/html","count":"9506"}
>>>>>>   {"mimeType":"application/pdf","count":"280"}
>>>>>>]
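
Since the statistics appear in the log rather than in the dumped files, one way to collect them is to parse the log lines directly. The Python sketch below assumes the layout shown above (one JSON object per line, no commas between entries, so the block as a whole is not valid JSON); the function name is hypothetical and the sample contains only a few of the lines from the log.

```python
import json


def parse_mime_stats(log_text):
    """Collect {"mimeType": count} pairs from CommonCrawlDataDumper log lines."""
    counts = {}
    for line in log_text.splitlines():
        line = line.strip().rstrip(",")
        if line.startswith("{") and line.endswith("}"):
            entry = json.loads(line)
            counts[entry["mimeType"]] = int(entry["count"])
    return counts


sample_log = """
TOTAL Stats:
[
   {"mimeType":"application/xhtml+xml","count":"3000"}
   {"mimeType":"application/octet-stream","count":"641"}
   {"mimeType":"text/html","count":"9506"}
   {"mimeType":"application/pdf","count":"280"}
]
"""
stats = parse_mime_stats(sample_log)
print(sorted(stats.items(), key=lambda item: -item[1]))
```

Comparing two such dictionaries, one from a run with the stock detector and one with the probabilistic selector, would show where the two approaches disagree.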
>>>>>>
>>>>>>It turns out that aggregate.py is not the one that produces the 
>>>>>>statistical information; I am not sure what it does. Anyway, I 
>>>>>>think I understand the whole idea and I concur with it. Maybe 
>>>>>>we can repackage Tika with the feature (i.e. probabilistic mime 
>>>>>>selection) incorporated and see whether it outputs the same 
>>>>>>information in the log as the version without it.
>>>>>>
>>>>>>BTW, regarding the use of the probabilistic mime selection 
>>>>>>feature: in my pull request, I added a simple test case which 
>>>>>>might tell a bit more about how the feature is called and used, 
>>>>>>although it is simple.
>>>>>>Here is an example snippet:
>>>>>>                ProbabilisticMimeDetectionSelector probSel =
>>>>>>                    new ProbabilisticMimeDetectionSelector();
>>>>>>                // input is an InputStream, metadata is a Metadata
>>>>>>                probSel.detect(input, metadata);
>>>>>>It is similar to MimeTypes::detect(...) (more information can be 
>>>>>>found in https://issues.apache.org/jira/browse/TIKA-1517).
>>>>>>Now, in order to allow Tika().detect() to call
>>>>>>ProbabilisticMimeDetectionSelector::detect(...) (as 
>>>>>>Tika().detect() is being called by commoncrawldump), we need to 
>>>>>>modify/add some code in TikaConfig, which initializes a list of 
>>>>>>default detectors: we need to remove the MimeTypes detector 
>>>>>>(mimeTypes) from the list and replace it with the 
>>>>>>ProbabilisticMimeDetectionSelector (probSel). (Not sure if I 
>>>>>>should create another pull request with this change for 
>>>>>>TikaConfig.)
>>>>>>
>>>>>>I think that is all of my initial thoughts, findings, and 
>>>>>>plans; if you have anything you would like to add or 
>>>>>>comment on, please kindly let me know, and then I will start 
>>>>>>working on my 'finale'. BTW, don’t worry: even after I graduate, 
>>>>>>graduation is not the end of my work with Tika and this project. 
>>>>>>After that I still can and want to help this polar project and 
>>>>>>Tika as much as possible, correct programming faults and bugs, 
>>>>>>respond to Tika issues, etc.
>>>>>>
>>>>>>
>>>>>>
>>>>>>Thanks
>>>>>>Luke
>>>>>>
>>>>>>-----Original Message-----
>>>>>>From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>>>>>>Sent: Saturday, April 18, 2015 6:26 AM
>>>>>>To: Luke liu; 'Mattmann, Chris A (3980)'; Bryant, Ann C 
>>>>>>(398G-Affiliate); Zimdars, Paul A (3980-Affiliate)
>>>>>>Cc: 'Luke'; NSF Polar CyberInfrastructure DR Students; 
>>>>>>memex-jpl@googlegroups.com
>>>>>>Subject: Re: this week action from luke
>>>>>>Importance: High
>>>>>>
>>>>>>Awesome Luke. I am going to work specifically on now benchmarking 
>>>>>>your code in real situations. For example, it would be fantastic 
>>>>>>to now run your Bayesian MIME detector over the whole NSF TREC 
>>>>>>Dynamic Domain data for Polar described here:
>>>>>>
>>>>>>http://github.com/chrismattmann/trec-dd-polar/
>>>>>>
>>>>>>Paul Zimdars, CC’ed, can provide you with access to the data, and 
>>>>>>Annie can explain it, also CC’ed.
>>>>>>
>>>>>>Can we make that your goal for the next 2 weeks to actually test 
>>>>>>it and produce a real result over the whole TREC-DD data for 
>>>>>>Polar? My goal will be to get your code committed and integrated into Tika.
>>>>>>The more you can write me a guide of how to build and test your 
>>>>>>code with Tika so I can get it committed the better.
>>>>>>
>>>>>>Also CC’ing the Memex list for context. Note everyone: Luke is 
>>>>>>building a Bayesian MIME classifier to evaluate against Tika’s 
>>>>>>existing MIME detection approach. If folks have any Memex needs to 
>>>>>>try and test more accurate file identification with Tika, Luke is 
>>>>>>the guy to talk to and I have him for 2 more weeks.
>>>>>>
>>>>>>Thanks!
>>>>>>
>>>>>>Cheers,
>>>>>>Chris
>>>>>>
>>>>>>------------------------
>>>>>>Chris Mattmann
>>>>>>chris.mattmann@gmail.com
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>-----Original Message-----
>>>>>>From: Luke liu <sh...@usc.edu>
>>>>>>Date: Thursday, April 16, 2015 at 11:29 PM
>>>>>>To: Chris Mattmann <ch...@gmail.com>, Chris Mattmann 
>>>>>><Ch...@jpl.nasa.gov>
>>>>>>Cc: 'Luke' <ha...@gmail.com>
>>>>>>Subject: this week action from luke
>>>>>>
>>>>>>
>>>>>>
>>>>>>Hi Professor Mattmann,
>>>>>>
>>>>>>I think I am in the final phase of the research, and last week I 
>>>>>>finished the last item in the list, and hopefully everything will 
>>>>>>be fine.
>>>>>>
>>>>>>For now, I can probably spend some time verifying or optimizing 
>>>>>>the code; the majority of the research has been done. It would 
>>>>>>also be great if you could comment on my work (the 2 pull
>>>>>>requests) when you have time.
>>>>>>
>>>>>>If anything in my work is unclear, please do let 
>>>>>>me know.
>>>>>>
>>>>>>Thanks, and I am glad to be working with you. For the next couple 
>>>>>>of weeks before graduation, I am going to continue revising and 
>>>>>>testing the code and features to get rid of any flaws 
>>>>>>when I have time.
>>>>>>
>>>>>>I am not sure if I missed something; if I did miss something 
>>>>>>important, please do let me know too.
>>>>>>
>>>>>>Thanks
>>>>>>Luke
>>>>>>
>>>>>>
>>>>>>--
>>>>>>You received this message because you are subscribed to the Google 
>>>>>>Groups "JPL-Kitware-Continuum Memex Group" group.
>>>>>>To unsubscribe from this group and stop receiving emails from it, 
>>>>>>send an email to memex-jpl+unsubscribe@googlegroups.com
>>>>>><ma...@googlegroups.com>.
>>>>>>To post to this group, send email to memex-jpl@googlegroups.com.
>>>>>>Visit this group at http://groups.google.com/group/memex-jpl.
>>>>>>To view this discussion on the web visit
>>>>>>https://groups.google.com/d/msgid/memex-jpl/000f01d07ad4%24b3510070%2419f30150%24%40edu.
>>>>>>For more options, visit https://groups.google.com/d/optout.
>>>>>><garbled.jpg><1423894754000.html>