Posted to dev@joshua.apache.org by lewis john mcgibbney <le...@apache.org> on 2018/06/21 01:37:20 UTC
Fwd: FW: June 2018 Newsletter - LDC

---------- Forwarded message ----------
From: Mcgibbney, Lewis J (398M) <Le...@jpl.nasa.gov>
Date: Tue, Jun 19, 2018 at 3:34 PM
Subject: FW: June 2018 Newsletter - LDC
To: lewis john mcgibbney <le...@apache.org>






Dr. Lewis John McGibbney Ph.D., B.Sc.

Data Scientist II

Computer Science for Data Intensive Applications Group (398M)

Instrument Software and Science Data Systems Section (398)

Jet Propulsion Laboratory

California Institute of Technology

4800 Oak Grove Drive

Pasadena, California 91109-8099

Mail Stop : 158-256C

Tel:  (+1) (818)-393-7402

Cell: (+1) (626)-487-3476

Fax:  (+1) (818)-393-1190

Email: lewis.j.mcgibbney@jpl.nasa.gov

ORCID: orcid.org/0000-0003-2185-928X






 Dare Mighty Things



*From: *Ldc-customers1 <ld...@ldc.upenn.edu> on behalf of
Penn LDC <ld...@ldc.upenn.edu>
*Date: *Monday, June 18, 2018 at 8:09 AM
*To: *Penn LDC <ld...@ldc.upenn.edu>
*Subject: *June 2018 Newsletter - LDC



*In this newsletter: *



*LDC Catalog certified as CoreTrustSeal data repository *


*LDC data and commercial technology development *
*New Publications:*

*BOLT Chinese SMS/Chat* <https://catalog.ldc.upenn.edu/LDC2018T15>

*Multi-Language Conversational Telephone Speech 2011 -- Central European*
<https://catalog.ldc.upenn.edu/LDC2018S08>

*TAC KBP English Entity Linking - Comprehensive Training and Evaluation
Data 2009-2013* <https://catalog.ldc.upenn.edu/LDC2018T16>

*IARPA Babel Cebuano Language Pack IARPA-babel301b-v2.0b*
<https://catalog.ldc.upenn.edu/LDC2018S07>
______________________________________________________________________________

*LDC Catalog certified as CoreTrustSeal data repository *

LDC is pleased to announce that the Catalog <https://catalog.ldc.upenn.edu/>
has been awarded the CoreTrustSeal <https://www.coretrustseal.org/> for
recognition as a trustworthy data repository. This means that the Catalog
meets a series of standards covering data access, rights management,
curation, and storage developed by the ICSU World Data System and the Data
Seal of Approval. LDC joins the other 136 certified repositories around the
globe in the commitment to promote sustainable and trustworthy data
infrastructures.

*LDC data and commercial technology development*

For-profit organizations are reminded that an LDC membership is a
prerequisite for obtaining a commercial license to almost all LDC
databases. Non-member organizations, including non-member for-profit
organizations, cannot use LDC data to develop or test products for
commercialization, nor can they use LDC data in any commercial product or
for any commercial purpose. LDC data users should consult corpus-specific
license agreements for limitations on the use of certain corpora. Visit the
Licensing <https://www.ldc.upenn.edu/data-management/using/licensing> page
for further information.

______________________________________________________________________________


*New publications:*

(1) *BOLT Chinese SMS/Chat* <https://catalog.ldc.upenn.edu/LDC2018T15> was
developed by LDC and consists of naturally-occurring Short Message Service
(SMS) and Chat (CHT) data collected through data donations and live
collection involving native speakers of Chinese. The corpus contains 14,877
conversations totaling 3,005,810 words across 497,543 messages.

The BOLT <https://www.ldc.upenn.edu/collaborations/current-projects/bolt> (Broad
Operational Language Translation) program developed machine translation and
information retrieval for less formal genres, focusing particularly on
user-generated content. LDC supported the BOLT program by collecting
informal data sources – discussion forums, text messaging, and chat – in
Chinese, Egyptian Arabic, and English. The collected data was translated
and annotated for various tasks including word alignment, treebanking,
propbanking, and co-reference. The data in this release was collected using
two methods: new collection via LDC's collection platform, and donation of
SMS or chat archives from BOLT collection participants.

BOLT Chinese SMS/Chat is distributed via web download.

2018 Subscription Members will automatically receive copies of this corpus.
2018 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for US $1750.



*



(2) *Multi-Language Conversational Telephone Speech 2011 -- Central
European* <https://catalog.ldc.upenn.edu/LDC2018S08> was developed by LDC
and comprises approximately 44 hours of telephone speech in two
distinct language varieties of Central Europe: Czech and Slovak.

The data were collected primarily to support research and technology
evaluation in automatic language identification, and portions of these
telephone calls were used in the NIST 2011 Language Recognition Evaluation
(LRE <https://www.nist.gov/itl/iad/mig/2011-language-recognition-evaluation>).
Participants were recruited by native speakers who contacted acquaintances
in their social network. Those native speakers made one call, up to 15
minutes, to each acquaintance. Human auditors labeled the calls for callee
gender, dialect type, and noise.

LDC has also released the following as part of the Multi-Language
Conversational Telephone Speech 2011 series:

·         Slavic Group (LDC2016S11 <https://catalog.ldc.upenn.edu/LDC2016S11>)

·         Turkish (LDC2017S09 <https://catalog.ldc.upenn.edu/LDC2017S09>)

·         South Asian (LDC2017S14 <https://catalog.ldc.upenn.edu/LDC2017S14>)

·         Central Asian (LDC2018S03 <https://catalog.ldc.upenn.edu/LDC2018S03>)

Multi-Language Conversational Telephone Speech 2011 -- Central European is
distributed via web download.



2018 Subscription Members will automatically receive copies of this corpus.
2018 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for US $2000.



*



(3) *TAC KBP English Entity Linking - Comprehensive Training and Evaluation
Data 2009-2013* <https://catalog.ldc.upenn.edu/LDC2018T16> was developed by
LDC and contains training and evaluation data produced in support of the
TAC KBP English Entity Linking tasks in 2009 <http://pmcnamee.net/kbp.html>,
2010 <https://tac.nist.gov/2010/KBP/index.html>, 2011
<http://tac.nist.gov/2011/KBP/index.html>, 2012
<http://tac.nist.gov/2012/KBP/index.html>, and 2013
<http://tac.nist.gov/2013/KBP/index.html>. It includes queries and gold
standard entity type information, Knowledge Base links, and equivalence
class clusters for NIL entities. Also included are the source documents for
the queries, specifically, English newswire, discussion forum, and web
data. The corresponding knowledge base is available as TAC KBP Reference
Knowledge Base (LDC2014T16 <https://catalog.ldc.upenn.edu/LDC2014T16>).
Also included in this package are the results of an Entity Linking IAA
(Inter-Annotator Agreement) study conducted in 2010.

TAC KBP encourages the development of systems that can match entities
mentioned in natural texts with those appearing in a knowledge base and
extract novel information about entities from a document collection and add
it to a new or existing knowledge base. English Entity Linking was first
conducted as part of the 2009 TAC KBP evaluations. Its goal is to measure
systems' ability to determine whether an entity, specified by a query, has
a matching node in a reference knowledge base (KB) and, if so, to create a
link between the two. If there is no matching node for a query entity in
the KB, EL systems are required to cluster the mention together with others
referencing the same entity.
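
As a rough illustration of the task described above, here is a minimal
Python sketch. It is not the TAC KBP evaluation code or any participant's
system. The toy knowledge base, the link_queries helper, the exact
string-match rule, and the NIL identifier format are all assumptions made
up for this example; real systems use far richer matching and clustering.

```python
# Hypothetical toy knowledge base: node id -> known names.
# This is illustrative data, not the TAC KBP Reference Knowledge Base.
KB = {
    "E001": {"paris", "city of paris"},
    "E002": {"paris hilton"},
}


def link_queries(queries, kb):
    """Map each (query_id, mention) to a KB node id or a NIL cluster id.

    Assumption: exact match on the lowercased mention stands in for
    real entity disambiguation.
    """
    name_to_node = {name: node for node, names in kb.items() for name in names}
    nil_ids = {}  # normalized mention -> NIL cluster id
    results = {}
    for query_id, mention in queries:
        key = mention.strip().lower()
        if key in name_to_node:
            # A matching node exists: link the query to it.
            results[query_id] = name_to_node[key]
        else:
            # No matching node: cluster with earlier mentions of the
            # same normalized string under one NIL id.
            if key not in nil_ids:
                nil_ids[key] = f"NIL{len(nil_ids) + 1:03d}"
            results[query_id] = nil_ids[key]
    return results


if __name__ == "__main__":
    queries = [
        ("Q1", "Paris"),
        ("Q2", "Springfield"),
        ("Q3", "springfield "),
        ("Q4", "Paris Hilton"),
    ]
    print(link_queries(queries, KB))
```

In this sketch Q2 and Q3 have no KB node, so they end up in the same NIL
cluster, which mirrors the requirement that unlinked mentions of the same
entity be grouped together.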

TAC KBP English Entity Linking - Comprehensive Training and Evaluation Data
2009-2013 is distributed via web download.



2018 Subscription Members will automatically receive copies of this corpus.
2018 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for US $1000.

*



(4) *IARPA Babel Cebuano Language Pack IARPA-babel301b-v2.0b*
<https://catalog.ldc.upenn.edu/LDC2018S07> was developed by Appen
<http://www.appen.com/> for the IARPA (Intelligence Advanced Research
Projects Activity) Babel
<http://www.iarpa.gov/index.php/research-programs/babel> program. It
contains approximately 191 hours of Cebuano conversational and scripted
telephone speech collected in 2013 and 2014 along with corresponding
transcripts.



The Cebuano speech in this release represents that spoken in the Cebu-North
Kana, Sialo, and Mindanao dialect regions of the Philippines. The gender
distribution among speakers is approximately equal; speakers' ages range
from 16 years to 75 years. Calls were made using different telephones
(e.g., mobile, landline) from a variety of environments including the
street, a home or office, a public place, and inside a vehicle.



IARPA Babel Cebuano Language Pack IARPA-babel301b-v2.0b is available via
web download.



2018 Subscription Members will receive copies of this corpus provided they
have submitted a completed copy of the special license agreement. 2018
Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for US $25.



Membership Office

Linguistic Data Consortium <http://ldc.upenn.edu>

University of Pennsylvania

T: +1-215-573-1275

E: ldc@ldc.upenn.edu

M: 3600 Market St. Suite 810

      Philadelphia, PA 19104







-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc