Projects of the summer school “Coding for Language Communities 2014”
Here is a list of the projects that were implemented during the summer school:
- Roma dictionary
- Finding near-matches for spellchecker improvement
- Real time tweet map: A visualization of lesser-resourced languages
- Speech recognition for Icelandic
- A Fala dictionary
Project: Roma dictionary
Team Members: Bruce Birch, Ricardo Filipe, Benoit Legouy, Pedro Manha, Eleni Ntalambyra, Elodie Paquette, Claire Ulrich
With graphic design assistance from Rosana Margarida
Brief Project Description
ROMA Dictionary (working title) is an app (iOS, Android, Web) which is aimed at crowdsourcing a corpus (text and audio) of Roma as spoken and written by Roma teenagers on Facebook. In a recent interview for the website “Czech Books”, writer Lukáš Houdek spoke about the upsurge in the use of Romani online:
“I think Facebook is very important in the life of Roma, because part of their families live abroad: in England, Canada or the US. That’s how they keep in touch and how they get information as well. It’s very interesting that thanks to Facebook many Roma have started to write in Romany, because until now there have not been so many opportunities to write Romany. It was always a language of verbal communication, but now they have to communicate in writing. In this way they are becoming more used to writing and reading Romany.”
An important inspiration for this project is the “Urban Dictionary” project, a crowdsourcing site originally set up as a repository for slang or cultural words or phrases in (American) English which couldn’t easily be found elsewhere. One of the first definitions on the site was “the man”, defined as “the faces of ‘the establishment’ put in place to ‘bring us down’.” The intention was clearly to give a voice to “non-establishment”, or “anti-establishment” sectors of society. This remains a major feature of the site, which has now broadened out into a general dictionary containing over 8 million entries.
The position of Roma throughout the world as outsiders means that they are by definition a “non-establishment” and marginalized group who have suffered from discrimination over a period of centuries.
The idea is therefore to take advantage of the increasing usage of written Romany online, by providing a safe place to deposit risky and funny language, which at the same time can grow into a useful resource both for the Romany-writing community and potentially for linguistic (and other) research.
The app will be monolingual, a space for interaction between Romani speakers. Definitions are not standard dictionary definitions containing information such as part of speech, IPA transcription, etymology, etc. Rather they are impressionistic depictions of terms and their usage with an emphasis on example sentences and mini-dialogues which are highly evocative of cultural context.
The Roma dictionary was the winning project of our internal vote for the best project!
Project: Finding near-matches for spellchecker improvement
The goal of the software is to identify alternative spellings of the same word, to facilitate the development and refinement of spell-checking tools for languages with widely disregarded spelling standards, or even to guide the development of orthography standards for languages that lack them. This overcomes a shortcoming of corpus-based approaches to these tasks, in which the words in the corpus must be accepted as correctly spelled, even when they contain typos or are not consistent with each other. The software, written in Python, can generate clusters of spellings which are likely to represent the same word and can also generate a list of the most common spelling alternations. The project, which could grow to include other related tools, is dubbed “koreksyon” (“proofreading” in Haitian Creole). Some results for Haitian Creole (exploring alternations of pro/pwo and mp/np) are given in https://docs.google.com/spreadsheets/d/1F2empVjzKZMnm9_fAaXUeEWDtnyJPy-CB3X0tFUREsw/edit?usp=sharing.
The code is published on Github at https://github.com/reokatoa/koreksyon.
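The idea of clustering spellings and listing common alternations can be illustrated with a minimal sketch. This is not the koreksyon code: the function names, the `max_len` and `min_count` thresholds, and the toy word list are all assumptions made for illustration. Two words are treated as near-matches when they differ in a single short span, and that span is recorded as an alternation (for example r/w or m/n).

```python
from collections import Counter
from itertools import combinations


def alternation(w1, w2, max_len=2):
    """Return the differing substrings (a, b) if the words differ in one short span."""
    if w1 == w2:
        return None
    # Length of the longest common prefix.
    p = 0
    while p < min(len(w1), len(w2)) and w1[p] == w2[p]:
        p += 1
    # Length of the longest common suffix that does not overlap the prefix.
    s = 0
    while s < min(len(w1), len(w2)) - p and w1[len(w1) - 1 - s] == w2[len(w2) - 1 - s]:
        s += 1
    a, b = w1[p:len(w1) - s], w2[p:len(w2) - s]
    if len(a) > max_len or len(b) > max_len:
        return None
    return tuple(sorted((a, b)))


def common_alternations(words, min_count=2):
    """Count alternations over all word pairs (quadratic, fine for a small sketch)."""
    counts = Counter()
    for w1, w2 in combinations(sorted(set(words)), 2):
        alt = alternation(w1, w2)
        if alt is not None:
            counts[alt] += 1
    return [(alt, n) for alt, n in counts.most_common() if n >= min_count]


def cluster(words, alternations):
    """Group words linked by any of the given alternations, using union-find."""
    vocab = sorted(set(words))
    parent = {w: w for w in vocab}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    allowed = set(alternations)
    for w1, w2 in combinations(vocab, 2):
        if alternation(w1, w2) in allowed:
            parent[find(w1)] = find(w2)
    groups = {}
    for w in vocab:
        groups.setdefault(find(w), []).append(w)
    return sorted(g for g in groups.values() if len(g) > 1)


# Toy word list invented for illustration; not taken from the project's results.
words = ["pwogram", "program", "pwopoze", "propoze",
         "kanpe", "kampe", "tanpri", "tampri", "liv"]
frequent = [alt for alt, _ in common_alternations(words)]
clusters = cluster(words, frequent)
```

Keeping only alternations that recur across several word pairs is one simple way to separate systematic spelling variation from one-off typos. A real tool would probably also weight the pairs by corpus frequency.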
Project: Real time tweet map: A visualization of lesser-resourced languages
Team members: Netta Ben-Meir, Caroline Borowski, Nadia Paraskevoudi
About the project
Given the focus of the course, we wanted to create a project that would provide information about the use of specific languages in social media. Since social media is today a meeting point for people all over the world, we decided to use data from Twitter to document languages considered to be under-resourced. Social media research should be regarded as an interdisciplinary area, and there is a clear need for new ways to access and present useful information for researchers and students in the fields of linguistics, language documentation, and anthropology. This map can be used as a tool to capture data that may reveal the community network of people using under-resourced languages in social media.
The idea of creating a map that displays the location of every user tweeting in Basque is an effort to capture this data and visualize it. The project should not, however, be seen as a tool designed especially for Basque. On the contrary, it could be used for a wide variety of under-resourced languages to provide information about the existing networks among the people using them.
Visualization of data is part of the interdisciplinary field of Digital Humanities, which flourished at the beginning of the 21st century as a consequence of the culture of digitization. The abstract visual representations used in various projects in this field are a useful tool for analyzing textual content as well as related metadata such as spatiotemporal information. This project should therefore be regarded as an interdisciplinary work, made possible by combining knowledge, methods and tools from three different fields: linguistics, computer science, and digital humanities.
Methods and Tools
The project consisted of three parts:
- First, we created a Twitter application (dev.twitter.com) in order to get streaming data from Twitter. All the data was collected using the Twitter streaming application programming interface (Streaming API), which enabled us to filter the streaming data by specific keywords, user lists, etc. For this we used the Python package tweepy, which supports all three Twitter APIs (http://tweepy.readthedocs.org/en/v2.3.0/).
- Second, before we started working on our code, we obtained two corpora that proved very useful for this project. The first corpus consisted of a wordlist of the most frequent words, affixes and suffixes in Basque, and made it possible for us to extract Basque tweets. The second corpus included a list of users who are known to tweet in Basque regularly. In addition to the username, this list included the latitude and longitude coordinates of each user, which we used to update our map every time someone posted a new tweet. This list proved to be very important because the option to add location information to tweets is off by default, so the coordinates of many users are not available when they post.
- Finally, we created a map using Flask, a web application framework written in Python which uses Jinja2, a full-featured template engine for Python. We created an HTML page (index.html), which the Flask web app serves to the user, and in that page we used the Google Maps API to display the map and create the points. The webpage polls the Flask web app every second to check for and fetch new data.
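The way these parts fit together can be sketched with the standard library alone. This is not the project's code and deliberately leaves out tweepy, Flask and the Google Maps API. The wordlist, the usernames, the coordinates, the `min_hits` threshold and the `PointStore` class are all hypothetical stand-ins for the two corpora and the web app described above.

```python
import re

# Hypothetical stand-ins for the two corpora: a tiny Basque wordlist and a
# user list with made-up usernames and approximate coordinates.
BASQUE_WORDS = {"eta", "da", "ez", "bat", "dut", "egun", "kaixo"}
USER_COORDS = {"user_a": (43.26, -2.93), "user_b": (43.32, -1.98)}

TOKEN = re.compile(r"\w+")


def looks_basque(text, wordlist=BASQUE_WORDS, min_hits=2):
    """Crude language filter: count how many tokens appear in the wordlist."""
    tokens = TOKEN.findall(text.lower())
    hits = sum(1 for t in tokens if t in wordlist)
    return hits >= min_hits


class PointStore:
    """Keeps the map points; a polled web route could return `since(...)` as JSON."""

    def __init__(self):
        self.points = []

    def add_tweet(self, user, text):
        """Store a point if the user has known coordinates and the text passes the filter."""
        if user not in USER_COORDS or not looks_basque(text):
            return False
        lat, lng = USER_COORDS[user]
        self.points.append({"user": user, "lat": lat, "lng": lng, "text": text})
        return True

    def since(self, index):
        """Return the points added from `index` on, plus the index to poll from next."""
        return self.points[index:], len(self.points)


store = PointStore()
store.add_tweet("user_a", "Kaixo, egun on denoi")   # passes the filter
store.add_tweet("user_b", "Hello world")            # rejected: no wordlist hits
new_points, next_index = store.since(0)
```

Returning a cursor alongside the new points means each poll from the page only has to draw the markers it has not seen yet, rather than redrawing the whole map every second.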
The whole project is available on Github: https://github.com/nadje/Real-time-map-for-tweets-written-in-Basque
This project was made possible through the help and support of everyone participating in the summer school “Coding for language communities”. First and foremost, we would like to thank Vera Ferreira, Peter Bouda, Rita Pedro, Felix Rau, Kevin Scannell and Dorothee Beermann for organizing this summer school. They gave us the opportunity to enhance our knowledge about projects working on under-resourced languages by collaborating with people from different disciplines. Especially, we would like to thank Kevin Scannell for contributing to our project by providing us with the Basque user corpus mentioned above. Secondly, we would like to thank José Ramom Pichel for his feedback. Finally, we would like to thank Ricardo Felipe. His help and support have proven invaluable throughout this process.
Project: Speech recognition for Icelandic
The speech recognition team worked on acoustic modelling of under-resourced languages for automatic speech recognition.
After discussing the process and experimenting with a pre-prepared corpus of US English, we turned our attention to a more challenging setup: by pure coincidence, two of the students in the group were studying in Iceland, and one of them had brought an annotated Icelandic speech corpus with him. We set up a training recipe for this corpus and created an initial prototype acoustic model.
While our model for Icelandic telephony speech was certainly nowhere near accurate enough for practical usage, we were able to successfully show fundamental decoding capabilities on both pre-recorded audio samples and live microphone input.
Project: A Fala dictionary
The original idea of creating a spellchecker for A Fala ran into the problem that there is no dictionary or database of words that could serve as a reference. This is how the plan for an A Fala dictionary came about. However, starting a dictionary requires a convention about orthography. For this reason the dictionary plan has several stages:
- Agreement on a standard sound-grapheme correspondence, which will be the basis of a standardized orthography.
- Various texts are available in electronic form: Andiriña (a magazine), Sierra de Gata Digital (web page), etc. These will form the text database for the basic version of the dictionary. The texts will have to be slightly edited so that they conform to the standard orthography. It will also be necessary to sort the texts according to the three main varieties: Valverdeiru, Lagarteiru, Mañegu.
- The program we created in Python counts the frequency of the words and at the same time produces a word list. In the test run we used 8 texts written in Lagarteiru (approx. 8,000 words), which resulted in 2,400 different entries.
- Editing. The word list can be exported to Excel, where the entries are subject to further editing. At this stage it is necessary to make various strategic decisions regarding the information that will describe the individual word entries.
- Further processing and printing. The basic dictionary in Excel format needs to be converted to Toolbox format and printed or published online.
The basic version of the dictionary will be open to public participation through a mobile application. In this way it will be expanded, especially with regard to patrimonial and rarely used words (plant names, tools, etc.).
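The word-counting stage described above can be sketched as follows. This is a minimal illustration rather than the program written during the summer school: the tokenization rule, the function names and the choice of CSV as a format that Excel can open are all assumptions.

```python
import csv
import io
import re
from collections import Counter

# Runs of letters, optionally joined by an internal hyphen or apostrophe.
# Digits and punctuation are ignored. This rule is an assumption, not the
# project's actual tokenizer.
WORD = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")


def word_frequencies(texts):
    """Count lowercased word forms over a collection of texts."""
    counts = Counter()
    for text in texts:
        counts.update(w.lower() for w in WORD.findall(text))
    return counts


def to_csv(counts):
    """Render the word list as CSV, most frequent first, ties in alphabetical order."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["word", "frequency"])
    for word, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        writer.writerow([word, n])
    return buf.getvalue()
```

Calling `to_csv(word_frequencies(texts))` on a list of text strings gives a table that can be saved to a file and opened in Excel for the editing stage. Counting each variety (Valverdeiru, Lagarteiru, Mañegu) separately would only require calling `word_frequencies` once per group of texts.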