Posted to dev@lucene.apache.org by Chris Mattmann <ma...@apache.org> on 2014/01/31 07:02:22 UTC

Submission to ApacheCon on Tika

Hey Guys,

I submitted the below talk on Apache Tika, Nutch and Solr to ApacheCon NA
2014:

Real Data Science: Exploring the FBI's Vault dataset with Apache Tika,
Nutch and Solr
Event ApacheCon North America
Submission Type Lightning Talk
Category Developer
Biography Chris Mattmann has a wealth of experience in software design,
and in the construction of large-scale data-intensive systems. His work
has infected a broad set of communities, ranging from helping NASA unlock
data from its next generation of earth science system satellites, to
assisting graduate students at the University of Southern California (his
alma mater) in the study of software architecture, all the way to helping
industry and open source as a member of the Apache Software Foundation.
When he's not busy being busy, he's spending time with his lovely wife and
son braving the mean streets of Southern California.
Abstract Apache Tika is a content detection and analysis toolkit allowing
automated MIME type identification and rapid parsing of text and metadata
from over 1200 types of files including all major file types from the
Internet Assigned Numbers Authority's MIME database. In this talk I'll show
you how to practically use Apache Tika to explore the FBI's vault of
declassified PDF documents, and to use Apache Nutch to pull down the
dataset, and how to use Solr to ingest and geoclassify the documents so
that you can build a map of FBI PDF documents corresponding to your favorite
conspiracies throughout the USA. I've taught this material in my CSCI 572
Search Engines class at USC and it's a big hit. These are normally three
assignments, so I will do my best to boil down their essence into a
45-60 minute talk replete with danger and excitement.
Audience Developers interested in using Tika, Nutch and Solr. Folks
interested in the FBI vault dataset. GIS wonks. The like.
Experience Level Intermediate
Benefits to the Ecosystem The core of the talk will be Tika, but there
will be some Nutch magic, and some Solr magic at very basic levels. The
benefit to the ecosystem will be a real display of data science on a
real dataset.
Technical Requirements I need an internet connection, and a projector.
Status New
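
As a rough illustration of the MIME type identification mentioned in the
abstract, here is a minimal Python sketch of magic-byte detection. It is
not Tika's implementation or API: the signature table, the detect_mime
name and the fallback type are assumptions made for this example, and a
real detector also looks at file names, container contents and much
more.

```python
# Illustrative only: a tiny magic-byte sniffer, not Tika's actual detector.
# The table below is a hand-picked assumption covering a few common formats.
MAGIC_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"PK\x03\x04", "application/zip"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
]


def detect_mime(header: bytes) -> str:
    """Guess a MIME type from the leading bytes of a file."""
    for magic, mime in MAGIC_SIGNATURES:
        if header.startswith(magic):
            return mime
    # Generic fallback when no known signature matches.
    return "application/octet-stream"


if __name__ == "__main__":
    # A declassified PDF would typically start with the "%PDF-" marker.
    print(detect_mime(b"%PDF-1.4\n"))
```

In practice you would read only the first few bytes of each crawled file
and pass them to detect_mime before deciding which parser to run.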




Cheers,
Chris





Re: Submission to ApacheCon on Tika

Posted by Chris Mattmann <ma...@apache.org>.
Thanks Jukka!

My Tika talk had to be moved to Wednesday since I wasn't sure
I would be at ApacheCon the whole time, and co-locating
my talks around the same day was advantageous, so I asked Rich
to move me. Annie's talk was, I believe, originally set for Wednesday
too; however, I am not sure if she has the same constraints.

Cheers,
Chris




-----Original Message-----
From: Jukka Zitting <ju...@gmail.com>
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Sunday, March 2, 2014 7:59 AM
To: Tika Development <de...@tika.apache.org>
Subject: Re: Submission to ApacheCon on Tika

>Hi,
>
>On Fri, Jan 31, 2014 at 10:44 AM, Jukka Zitting <ju...@gmail.com>
>wrote:
>> OK, good! I'll adjust my submission so that it would work well as a
>> possible followup to your talk, and we can coordinate the details if
>> both get accepted.
>
>Looks like all the Tika talks got accepted!
>
>See http://apacheconnorthamerica2014.sched.org/event/128f3cc50234ff7be822feea58870ca7
>for my submission. As mentioned, I adjusted it to more explicitly
>cover the odds and ends that won't fit (or won't be covered in much
>detail) in the other presentations. So far I identified structured
>text, language detection, type inference, XMP and JVM forking as such
>topics. I can also cover things like details of type detection,
>container parsing and the various ParseContext tricks we have,
>depending on how much or little of those topics are already included
>in the other presentations. Let's sync up on the details over the next
>few weeks as we work on the presentations.
>
>The schedule for Tika-related talks
>(http://apacheconnorthamerica2014.sched.org/?s=tika) looks a bit
>awkward. My talk is scheduled for Wednesday morning before Nick's
>afternoon slot, and Chris' and Annie's case studies overlap at 10am on
>Wednesday. I guess we should ask the organizers to consider
>rescheduling the talks.
>
>BR,
>
>Jukka Zitting



Re: Submission to ApacheCon on Tika

Posted by Nick Burch <ap...@gagravarr.org>.
On Sun, 2 Mar 2014, Jukka Zitting wrote:
> The schedule for Tika-related talks 
> (http://apacheconnorthamerica2014.sched.org/?s=tika) looks a bit 
> awkward. My talk is scheduled for Wednesday morning before Nick's 
> afternoon slot, and Chris' and Annie's case studies overlap at 10am on 
> Wednesday. I guess we should ask the organizers to consider rescheduling 
> the talks.

Can you give Rich a prod about that? I'd asked him when the draft schedule 
came out to re-order them, but it looks like that has got lost / forgotten 
:(

(I'm also talking in the community track on Wednesday, so that does give a 
little bit of a constraint, but Rich had agreed to a tweak that would've 
kept a sensible order and not had me talking twice at the same time!)

Nick

Re: Submission to ApacheCon on Tika

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Fri, Jan 31, 2014 at 10:44 AM, Jukka Zitting <ju...@gmail.com> wrote:
> OK, good! I'll adjust my submission so that it would work well as a
> possible followup to your talk, and we can coordinate the details if
> both get accepted.

Looks like all the Tika talks got accepted!

See http://apacheconnorthamerica2014.sched.org/event/128f3cc50234ff7be822feea58870ca7
for my submission. As mentioned, I adjusted it to more explicitly
cover the odds and ends that won't fit (or won't be covered in much
detail) in the other presentations. So far I identified structured
text, language detection, type inference, XMP and JVM forking as such
topics. I can also cover things like details of type detection,
container parsing and the various ParseContext tricks we have,
depending on how much or little of those topics are already included
in the other presentations. Let's sync up on the details over the next
few weeks as we work on the presentations.

The schedule for Tika-related talks
(http://apacheconnorthamerica2014.sched.org/?s=tika) looks a bit
awkward. My talk is scheduled for Wednesday morning before Nick's
afternoon slot, and Chris' and Annie's case studies overlap at 10am on
Wednesday. I guess we should ask the organizers to consider
rescheduling the talks.

BR,

Jukka Zitting

Re: Submission to ApacheCon on Tika

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Fri, Jan 31, 2014 at 10:37 AM, Nick Burch <ap...@gagravarr.org> wrote:
> I've proposed a general intro to Tika one, with a bit of a focus on scaling
> it out, suggested title is
>
>    What's with the 1s and 0s? Making sense of binary data at scale
>    with Tika and friends
>
> If you're doing a mime magic one, then I can cut back on the detection bits
> and focus more on the other parts!

OK, good! I'll adjust my submission so that it would work well as a
possible followup to your talk, and we can coordinate the details if
both get accepted.

If I understood correctly, there will be more tracks than usual, so
we might have a chance to dig deeper over more than just one Tika
talk.

BR,

Jukka Zitting

Re: Submission to ApacheCon on Tika

Posted by Nick Burch <ap...@gagravarr.org>.
On Fri, 31 Jan 2014, Jukka Zitting wrote:
> [to: only tika]
> On Fri, Jan 31, 2014 at 1:02 AM, Chris Mattmann <ma...@apache.org> wrote:
>> I submitted the below talk on Apache Tika, Nutch and Solr to ApacheCon NA
>> 2014:
>
> Nice! Looking forward to seeing you there. :-)
>
> I'm considering submitting an updated version of my mime magic talk
> for more Tika coverage. Do we have others coming in who're planning to
> present Tika?

I've proposed a general intro to Tika one, with a bit of a focus on 
scaling it out, suggested title is

    What's with the 1s and 0s? Making sense of binary data at scale
    with Tika and friends

If you're doing a mime magic one, then I can cut back on the detection 
bits and focus more on the other parts!

Nick

Re: Submission to ApacheCon on Tika

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

[to: only tika]

On Fri, Jan 31, 2014 at 1:02 AM, Chris Mattmann <ma...@apache.org> wrote:
> I submitted the below talk on Apache Tika, Nutch and Solr to ApacheCon NA
> 2014:

Nice! Looking forward to seeing you there. :-)

I'm considering submitting an updated version of my mime magic talk
for more Tika coverage. Do we have others coming in who're planning to
present Tika?

BR,

Jukka Zitting