FIDA

FIDA, the Corpus of Slovene Language, is a reference corpus of the Slovene language. It was compiled in a joint project involving four partners, two from the academic/research sphere and two commercial: the Faculty of Arts (University of Ljubljana), the Jožef Stefan Institute, the DZS, General Publishing and the Amebis software company. Corpus compilation started in spring 1997 and was concluded by the end of 2000. The project was funded by the two commercial partners.

The FIDA Corpus of Slovene Language is typologically categorised as follows:
 

  • reference corpus: A reference corpus is an extensive electronic text collection composed of proportional samples of texts in a language. Its primary aim is to give an insight into language use within the widest possible range of levels and fields, thus representing a fundamental resource for applied and theoretical linguistics, e.g. lexicography in all forms (mono- and multilingual dictionaries, specialised and terminological dictionaries, other reference works), language teaching (textbooks and teaching aids), language technologies (spelling and grammar aids, speech analysis and generation) as well as other social and human sciences such as literary studies, psychology and sociology.

  • monolingual corpus: The corpus is composed of contemporary Slovene texts. Foreign-language passages may appear only as parts of Slovene texts, while all texts composed entirely in a foreign language (e.g. Italian texts from bilingual publications in the Coastal region) were excluded.

  • synchronic corpus: The corpus represents contemporary Slovene from the second half of the 20th century, with the majority of texts having been produced in the 1990s.

  • corpus of (originally) written texts: The corpus is composed of written texts and of texts originally written to be spoken. Speech transcripts (parliamentary proceedings) are the only spoken component of the corpus.

    The FIDA Corpus of Slovene Language contains just over 100 million words of contemporary Slovene texts, encompassing a broad range of Slovene language variants and registers as found in the Slovene press, complemented by some texts from the Internet and speech transcripts. For a detailed description, see the overview of corpus texts categorised by text types.

    A text collection alone is not a sufficient resource for linguistic research, corpus querying and processing of results. These tasks require an appropriate corpus toolkit, commonly known as concordance software, which enables the user to query the corpus according to special criteria, sort the hits and process them with statistical measures. The FIDA project involved the development of the ASP32 concordance software, which serves as the web interface for searching the FIDA corpus. You can test it here.
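
To illustrate what concordance software does in general, the following is a minimal sketch of the three steps named above: querying for a keyword, sorting the hits, and computing a simple statistic over them. It is not the ASP32 implementation and does not use its interface; the function names (tokenize, concordance, sort_by_right_context, collocates), the context width and the English sample text are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (Unicode-aware, so č, š, ž work)."""
    return re.findall(r"\w+", text.lower())


def concordance(tokens, query, width=3):
    """Return (left, keyword, right) context triples for every hit of `query`."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == query:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((left, tok, right))
    return hits


def sort_by_right_context(hits):
    """Sort hits alphabetically by the words that follow the keyword."""
    return sorted(hits, key=lambda hit: hit[2])


def collocates(hits):
    """Count how often each word occurs in the context windows of the hits."""
    counts = Counter()
    for left, _, right in hits:
        counts.update(left)
        counts.update(right)
    return counts


# Hypothetical sample text; a real corpus would be read from files.
sample = "The corpus is large. A reference corpus is balanced. The corpus grows."
tokens = tokenize(sample)
hits = sort_by_right_context(concordance(tokens, "corpus", width=2))

# Print the hits in keyword-in-context (KWIC) layout.
for left, keyword, right in hits:
    print(f"{' '.join(left):>20} [{keyword}] {' '.join(right)}")

# Most frequent context words, a very simple stand-in for collocation statistics.
print(collocates(hits).most_common(3))
```

Real concordancers over a corpus of this size typically rely on a prebuilt index rather than a linear scan, and use association measures rather than raw counts; the sketch only shows the shape of the workflow.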

    The team involved in building the FIDA corpus wishes to thank all who supported the project through text contributions or otherwise. We welcome further contributions towards corpus expansion, and we invite you to share your ideas and opinions about the FIDA corpus by sending an e-mail to fida@dzs.si.