The Isle
of Man branch of the British Computer Society had a fascinating
presentation on open data and
mash-ups on Friday. The talk was given by Prof. Robert Barr
OBE, and the gist of the session was that data should flow freely
to the people in a useful data structure, yet also that this
openness should take account of commercial
considerations such as intellectual property and the benefits to
the wider economy.
While listening to Robert, it struck me that I am in my very own
battle for the extraction of data that should be more readily
available. As you may know, I am learning Manx. As part of this, I
am generating my own revision notes, references, blog posts and the
like that may someday see the light of day. Part of this work is
the development of a Manx language dictionary for Windows
Phone 7.
To achieve my goal, I needed a copy of the Manx dictionary.
Having asked around and done some research myself, I gathered a number of
links to existing on-line resources. These ranged from PDF
formatted documents to fully indexed dictionaries. The PDF version
(English to
Manx, Manx to
English) was unsuitable because it would be difficult to
accurately extract the words from the PDF "printed page". The RoadLingua
and FreeLang
dictionaries looked promising, and their content appeared to
be out of copyright. But these were encoded in proprietary
dictionary file formats. So, ironically, even though the dictionary
was "open", accessing it meant reverse engineering the
software, itself a violation of copyright. That
left me with two remaining options that might prove
useful: the Phil Kelly
dictionary and Faragher's. These
were, however, only HTML sites. Between the two, Faragher's seemed
the better, as it provided value-added content such as use of the
words within sentences and Manx phrases - ideal if you are
interested in the many idioms in use in Manx Gaelic.
So it seemed that I would need to use the Faragher's site as a
"back end" to my application, essentially screen-scraping the site
for translations. And indeed, to accomplish this, I would be best
served if I wrote my own web site, which acted as a bridge between
my Windows Phone 7 application and the dictionary itself. This
would double my work, but there were good reasons for it: the extended
platform on a server would allow me to parse the HTML from the site
more reliably and, by caching words as they were requested, I could
- over time - build a reliability buffer in case the original site
were to fail. I set about the task and have just launched the site
in a very early form for initial testing (take a look at http://taggloo.im). This was
particularly challenging, as the HTML from the Faragher's
dictionary is flaky at best. However, by inserting that middle
layer, I could hide this trickery from the user.
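
To make the caching idea concrete, here is a minimal sketch in Python. It is not the code behind Taggloo: the class name `CachingBridge`, the `fetch_upstream` callable (standing in for the screen-scraping step) and the sample word are all assumptions made for illustration. It shows only the read-through pattern described above, where a word is fetched from the original site once and afterwards served from the local cache, even if the original site is down.

```python
from typing import Callable, Dict, Optional


class CachingBridge:
    """Hypothetical read-through cache in front of an unreliable dictionary site."""

    def __init__(self, fetch_upstream: Callable[[str], Optional[str]]) -> None:
        # fetch_upstream is an assumed stand-in for "request the page and
        # parse the translation out of its HTML"; it may raise on failure.
        self._fetch_upstream = fetch_upstream
        self._cache: Dict[str, str] = {}

    def translate(self, word: str) -> Optional[str]:
        key = word.strip().lower()
        # Serve from the cache first, so repeat requests never touch the site.
        if key in self._cache:
            return self._cache[key]
        try:
            result = self._fetch_upstream(key)
        except Exception:
            # The site is unreachable or its HTML could not be parsed.
            return None
        # Only successful lookups are stored, so a later retry is still possible.
        if result is not None:
            self._cache[key] = result
        return result
```

In a real bridge the cache would live in a database rather than in memory, so that it survives restarts and keeps growing over time.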
All this, because the dictionary was not available
electronically in an indexed form. And this resonates with Robert
Barr's point about open data. Open data should not only be open,
but also be usefully formatted to allow for its use. An unindexed
dictionary is hardly a dictionary! A further frustration was the
encapsulation of the indexed dictionaries within copyrighted software,
which was quite closed! I approached RoadLingua to ask how they
would feel about releasing the file format of their dictionary, but
I received no response.
So it was with great surprise and relief that I realised that by
navigating to an unpublished URL (that should have been concealed
from internet users) I could extract the entire Faragher's
dictionary from the site, and put it to my own use! So, after
playing with the MySQL scripts in order to convert them to T-SQL, I
now have two 50,000-word dictionaries, one for each direction (Manx
to English, English to Manx). Am I going to keep this to
myself?
No. I've checked about copyright, and I'm informed that this is
not an issue, certainly in the spirit of expanding the availability
of Manx learning resources. So, as part of my Taggloo project,
which already has an effective and reliable API for XML and JSON
consumers, I'm going to make the entire database available for use
by other applications (maybe mobile phone applications, competing
with my own) and web-sites (it becomes possible to "embed" Manx
dictionaries on even the simplest of sites). The final API
has yet to be defined, and there will likely be changes to it in
the coming weeks. The data will be free for use by
anyone and everyone (subject to fair use, i.e. not crashing my
server), and the API will ask for just one thing in return: the opportunity to record
the words being requested. This itself, over time, will create a
second rich data-set. What words are people regularly using? Do
these correlate to students' progress in classes, or do the
translations point to any cultural significance such as house
names, which are regularly seen in Manx, yet seldom understood?
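
As a rough illustration of how that second data-set could accumulate, the sketch below counts each word as it is requested. It is written in Python under stated assumptions: `LookupLog` and its methods are hypothetical names, not part of the Taggloo API, which has yet to be defined.

```python
from collections import Counter
from typing import List, Tuple


class LookupLog:
    """Hypothetical tally of which words are requested, and how often."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def record(self, word: str) -> None:
        # Normalise so that "Thie" and "thie " count as the same word.
        self._counts[word.strip().lower()] += 1

    def most_common(self, n: int) -> List[Tuple[str, int]]:
        # The n most frequently requested words, most frequent first.
        return self._counts.most_common(n)
```

Calling `record` on every request and reading `most_common` periodically would be enough to start answering the questions above about which words people regularly use.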
I have many plans around this project, with further data-sets
springing from them and adding further depth to what will
hopefully become a reliable and rich data-set containing both formal
dictionary content and community contributions. This complements
the already available learning resources for the user, particularly
those found at LearnManx.com. I'll be
blogging about them very soon, hopefully in line with an exciting
new blog design.