FIDA
FIDA, the Corpus of Slovene Language, is a reference corpus of the
Slovene language. It was compiled within the framework of a joint project
involving four partners, two from the academic/research sphere and two
commercial: the Faculty of Arts (University of Ljubljana); the Jožef Stefan
Institute; DZS, General Publishing; and the Amebis software company.
Corpus compilation started in spring 1997 and was completed by the end
of 2000. The project was funded by the two commercial partners.
The FIDA Corpus of Slovene Language is typologically categorised
as follows:
reference corpus: A reference corpus is an extensive electronic
text collection composed of proportional samples of texts in a language.
Its primary aim is to give an insight into language use within the widest
possible range of levels and fields, thus representing a fundamental resource
for applied and theoretical linguistics, e.g. lexicography in all forms
(mono- and multilingual dictionaries, specialised and terminological dictionaries,
other reference works), language teaching (textbooks and teaching aids),
language technologies (spelling and grammar aids, speech analysis and generation)
as well as other social and human sciences such as literary studies, psychology
and sociology.
monolingual corpus: The corpus is composed of contemporary Slovene
texts. Foreign-language passages may appear only as parts of Slovene texts,
while all texts composed entirely in a foreign language (e.g. Italian
texts from bilingual publications in the Coastal region) were excluded.
synchronic corpus: The corpus represents contemporary Slovene from
the second half of the 20th century, with the majority of texts having
been produced in the 1990s.
corpus of (originally) written texts: The corpus is composed of
written texts and texts written to be spoken; speech transcripts
(parliamentary proceedings) are the only spoken component of the corpus.
The FIDA Corpus of Slovene Language contains just over 100 million
words of contemporary Slovene texts, encompassing a broad range of Slovene
language varieties and registers as found in the Slovene press, complemented
by some texts from the Internet and speech transcripts. For a detailed
description see also the overview of corpus texts categorised by text types.
A text collection alone is not a sufficient resource for linguistic
research, corpus querying and the processing of results. These tasks
require an appropriate corpus toolkit, commonly known as concordance
software, which enables the user to query the corpus according to specific
criteria, sort the hits and process them with statistical measures. The
FIDA project included the development of the ASP32 concordance software,
which serves as the web interface for searching the FIDA corpus and is
available for testing.
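The three tasks attributed to concordance software above (querying, sorting the hits and applying a simple statistic) can be illustrated with a minimal keyword-in-context sketch in Python. This is not the ASP32 implementation, whose internals are not described here; the function names (`tokenize`, `concordance`, `sort_by_right_context`, `right_collocates`) and the sample sentences are invented for illustration only and are not taken from the FIDA corpus.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercased word tokens; \w also matches letters such as č, š and ž.
    return re.findall(r"\w+", text.lower())


def concordance(tokens, keyword, window=3):
    """Return (left context, hit, right context) for each occurrence of keyword."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((left, tok, right))
    return hits


def sort_by_right_context(hits):
    # Sort hits alphabetically by the words that follow the keyword.
    return sorted(hits, key=lambda hit: hit[2])


def right_collocates(hits):
    # Frequency of the word immediately following the keyword.
    return Counter(hit[2][0] for hit in hits if hit[2])


# Invented sample text (an assumption for the example, not corpus data).
sample = (
    "Korpus je zbirka besedil. Korpus omogoča iskanje. "
    "Referenčni korpus je obsežen."
)

tokens = tokenize(sample)
hits = sort_by_right_context(concordance(tokens, "korpus", window=2))

for left, hit, right in hits:
    print(f"{' '.join(left):>25} [{hit}] {' '.join(right)}")

print(right_collocates(hits).most_common(2))
```

A real concordancer for a corpus of this size would rely on a prebuilt index rather than a linear scan over the tokens; the sketch only shows the shape of the query, sort and count steps.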
The team involved in building the FIDA corpus wishes to thank all who
supported the project through text contributions or otherwise. We also
welcome further contributions towards corpus expansion and invite you to
share your ideas and opinions about the FIDA corpus by sending an e-mail
to fida@dzs.si.