Integrēta sistēma sintaktiski anotēta latviešu valodas teksta korpusa izveidei (An integrated system for the development of Latvian Treebank)

Pretkalniņa, Lauma

dc.contributor.advisor	Grūzītis, Normunds	en_US
dc.contributor.author	Pretkalniņa, Lauma	en_US
dc.contributor.other	Latvijas Universitāte. Datorikas fakultāte	en_US
dc.date.accessioned	2015-03-24T07:05:11Z
dc.date.available	2015-03-24T07:05:11Z
dc.date.issued	2011	en_US
dc.identifier.other	18370	en_US
dc.identifier.uri	https://dspace.lu.lv/dspace/handle/7/16986
dc.description.abstract	This thesis examines the problems of developing syntactically annotated corpora (treebanks), with the aim of creating a stable technological foundation for the development of a syntactically annotated corpus of Latvian. It reviews the classical models of syntactic analysis (representation), namely phrase structure grammars and dependency grammars, and the SemTi-Kamols hybrid grammar model for languages with relatively free word order. It analyses the experience and formats of the world's largest syntactically annotated corpora, paying particular attention to the multi-level annotation structure of the leading dependency-based corpus, the Prague Dependency Treebank (PDT). An extension of the SemTi-Kamols grammar model has been developed that supports the annotation of syntactically unrestricted sentences. A PML (Prague Markup Language) profile has been created for describing SemTi-Kamols data in an internationally recognised machine-readable form. The resulting XML-based data format is integrated with the SemTi-Kamols automatic syntactic analysis tools and with TrEd, the visual editor for tree-shaped data structures that was used in building the PDT. This establishes the technological and methodological foundation for creating a syntactically annotated corpus of Latvian: an environment (a set of integrated tools and formats) that lets texts be formally annotated according to the SemTi-Kamols model without requiring specific technological knowledge from the user (a linguist). The environment has been applied successfully in practice: annotations have been produced for approximately 200 sentences.	en_US
dc.description.abstract	This work considers the problem of developing a syntactically annotated text corpus (treebank). Its aim is to develop a sound technological base for developing a Latvian Treebank. The general approaches to syntactic analysis are described: the phrase structure approach and the dependency approach. The SemTi-Kamols hybrid dependency-based grammar for languages with rather free word order is also described. The experience of the world's largest treebanks, particularly the Prague Dependency Treebank (PDT) and its multi-level annotation structure, is analysed as well. An extension of the SemTi-Kamols model has been developed to cover syntactically unrestricted sentences of the Latvian language. A PML (Prague Markup Language) profile for describing SemTi-Kamols data in an internationally acknowledged machine-readable form has been developed. This XML-based format is integrated with the SemTi-Kamols parser and the visual tree editor TrEd, originally developed for PDT. The main result of this work is the technological and methodological base for creating a Latvian Treebank: a framework of integrated tools and formats that allows treebank data to be annotated according to the SemTi-Kamols model without requiring deep technological knowledge from the end user (a linguist). Approximately 200 sentences have been annotated using the developed framework.	en_US
dc.language.iso	N/A	en_US
dc.publisher	Latvijas Universitāte	en_US
dc.rights	info:eu-repo/semantics/openAccess	en_US
dc.subject	Computer science	en_US
dc.title	Integrēta sistēma sintaktiski anotēta latviešu valodas teksta korpusa izveidei	en_US
dc.title.alternative	An integrated system for the development of Latvian Treebank	en_US
dc.type	info:eu-repo/semantics/masterThesis	en_US

Files in this item

Name: 302-18370-Pretkalnina_Lauma_lp ...
Size: 1.182Mb
Format: PDF

This item appears in the following Collection(s)

Bakalaura un maģistra darbi (DF) / Bachelor's and Master's theses [3177]