N-gramu modeļi sabalansētam latviešu valodas teksta korpusam [N-gram models for a balanced Latvian text corpus]

Pole, Sandra

Author

Pole, Sandra

Co-author

Latvijas Universitāte. Fizikas un matemātikas fakultāte

Advisor

Siņenko, Nadežda

Date

2008

Abstract

This bachelor's thesis examines and compares the most commonly used n-gram language models and identifies the one best suited to calculating n-gram probabilities. The practical part determines the most frequent n-grams in Latvian (n = 1, 2, 3), given that the text resources used were compiled so that the text covers the Latvian language as a whole. The thesis consists of two parts and an appendix containing the programs used: the first part gives the theoretical basis for each model, and the practical part applies these models to the chosen text file. The required information was extracted from the text files using the Turbo Pascal Version 7.0 programming language, and the calculations themselves were carried out in Microsoft Excel.

URI

https://dspace.lu.lv/dspace/handle/7/21811

Collections

Bakalaura un maģistra darbi (FMOF) / Bachelor's and Master's theses [2730]