Posted to user@joshua.apache.org by lewis john mcgibbney <le...@apache.org> on 2018/03/15 17:04:59 UTC
Fwd: FW: March 2018 Newsletter - LDC

---------- Forwarded message ---------
From: Mcgibbney, Lewis J (398M) <Le...@jpl.nasa.gov>
Date: Thu, Mar 15, 2018 at 08:29
Subject: FW: March 2018 Newsletter - LDC
To: lewis john mcgibbney <le...@apache.org>






Dr. Lewis John McGibbney Ph.D., B.Sc.

Data Scientist II

Computer Science for Data Intensive Applications Group (398M)

Instrument Software and Science Data Systems Section (398)

Jet Propulsion Laboratory

California Institute of Technology

4800 Oak Grove Drive

Pasadena, California 91109-8099

Mail Stop : 158-256C

Tel:  (+1) (818)-393-7402

Cell: (+1) (626)-487-3476

Fax:  (+1) (818)-393-1190

Email: lewis.j.mcgibbney@jpl.nasa.gov

ORCID: orcid.org/0000-0003-2185-928X






 Dare Mighty Things

*From: *Ldc-customers1 <ld...@ldc.upenn.edu> on behalf of
Penn LDC <ld...@ldc.upenn.edu>
*Date: *Thursday, March 15, 2018 at 7:14 AM
*To: *Penn LDC <ld...@ldc.upenn.edu>
*Subject: *March 2018 Newsletter - LDC




*In this newsletter: *
*New Publications:*

*BOLT Arabic Discussion Forums* <https://catalog.ldc.upenn.edu/LDC2018T10>

*LORELEI Somali Representative Language Pack - Monolingual and Parallel
Text* <https://catalog.ldc.upenn.edu/LDC2018T11>

*SPADE (Syntactic Phrase Alignment Dataset for Evaluation)*
<https://catalog.ldc.upenn.edu/LDC2018T09>

_______________________________________________________________________________


*New publications:*



(1) BOLT Arabic Discussion Forums <https://catalog.ldc.upenn.edu/LDC2018T10>
was developed by LDC and consists of 813,080 discussion forum threads in
Egyptian Arabic harvested from the Internet using a combination of manual
and automatic processes. The DARPA BOLT
<https://www.ldc.upenn.edu/collaborations/current-projects/bolt> (Broad
Operational Language Translation) program developed machine translation and
information retrieval for less formal genres, focusing particularly on
user-generated content. The material in this release represents the
unannotated Arabic source data in the discussion forum genre.



Collection was seeded based on the results of manual data scouting by
native speaker annotators. Scouts were instructed to seek content in
Egyptian Arabic that was original, interactive and informal. Upon locating
an appropriate thread, scouts submitted the URL and some simple judgments
about it to a database, via a web browser plug-in. The scale of the
collection precluded manual review of all data. Only a small portion of the
threads included in this release were manually reviewed, and it is expected
that there may be some offensive or otherwise undesired content as well as
some threads that contain a large amount of non-Arabic content. It should
also be noted that many threads may contain a mixture of Egyptian and other
varieties of Arabic, even among the threads that are primarily Arabic.



BOLT Arabic Discussion Forums is distributed via web download.



2018 Subscription Members will receive copies of this corpus. 2018 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for $3,500.



*



(2) LORELEI Somali Representative Language Pack - Monolingual and Parallel
Text <https://catalog.ldc.upenn.edu/LDC2018T11> was developed by LDC and
comprises approximately 13 million words of monolingual Somali text,
approximately 800,000 of which are translated into English. Another 100,000
words are also translated from English into Somali. The LORELEI (Low
Resource Languages for Emergent Incidents) Program is concerned with
building Human Language Technology for low resource languages in the
context of emergent situations like natural disasters or disease outbreaks.



Data was collected in the following genres: discussion forums, news,
reference, social network and weblog. Both monolingual text collection and
parallel text creation involved a combination of manual and automatic
methods, which are detailed in the included documentation. All harvested
content was initially converted from its original HTML form into a
relatively uniform XML format. Also included in this release are two tools:
one to recreate original source data from the processed XML material and
the other to condition text data that users download from Twitter.



LORELEI Somali Representative Language Pack - Monolingual and Parallel Text
is distributed via web download.



2018 Subscription Members will receive copies of this corpus. 2018 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for $1,500.



*



(3) SPADE (Syntactic Phrase Alignment Dataset for Evaluation)
<https://catalog.ldc.upenn.edu/LDC2018T09> consists of annotated parse
trees and alignment on English sentential paraphrases extracted from
machine translation evaluation corpora and separated into development and
test sets.



Reference translations from machine translation evaluation corpora were
used as sentential paraphrases. They were sourced from the following data
sets released by LDC from the NIST (National Institute of Standards and
Technology) open machine translation evaluation series (OpenMT
<https://www.nist.gov/itl/iad/mig/open-machine-translation-evaluation>):
LDC2010T14 <https://catalog.ldc.upenn.edu/LDC2010T14>, LDC2010T17
<https://catalog.ldc.upenn.edu/LDC2010T17>, LDC2010T21
<https://catalog.ldc.upenn.edu/LDC2010T21>, and LDC2013T03
<https://catalog.ldc.upenn.edu/LDC2013T03>.



Reference translations of 10 to 30 words were randomly extracted for
annotation from NIST OpenMT corpora. Gold standard annotations of HPSG
(head-driven phrase structure grammar) trees and phrase alignments were
produced, resulting in 20,276 phrases extracted from 201 sentential
paraphrases and 15,721 paraphrase alignments.



SPADE is distributed via web download.



2018 Subscription Members will receive copies of this corpus. 2018 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for $250.






Membership Office

Linguistic Data Consortium <http://ldc.upenn.edu>

University of Pennsylvania

T: +1-215-573-1275

E: ldc@ldc.upenn.edu

M: 3600 Market St. Suite 810

      Philadelphia, PA 19104






-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc