Text sentiment classification in Latvian using large language model prompts and a Reddit dataset

This bachelor's thesis examines the use of large language models (LLMs) for sentiment analysis and proposes a new approach to building Latvian-language datasets from Reddit forum data with a systematic prompt engineering method. On the validation dataset, zero-shot prompts developed for the GPT-3.5-turbo model reached an accuracy of over 82%, more than doubling the previous accuracy for three-class sentiment analysis on this dataset. The study shows that LLMs and prompt engineering can partially replace human annotators, making the creation of large datasets more cost-effective. Further research could consider other models for sentiment analysis in Latvian, analyze how different linguistic features affect prompts, explore the use of LLMs to generate datasets for fine-tuning existing models, and use the dataset obtained in this work to develop a new sentiment analysis model. This work contributes to the advancement of sentiment analysis using the capabilities of LLMs and examines the task of prompt engineering for Latvian-language data processing. The resulting LVReddit dataset contains over 90,000 samples and has been published in a GitHub repository, making it the largest openly available dataset for sentiment analysis in Latvian. The scientific article "Using Large Language Models to Improve Sentiment Analysis in Latvian Language," based on the results of this work, has been accepted for presentation at the "7th International Conference on Innovations and Creativity".
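To make the zero-shot, three-class setup concrete, the following is a minimal Python sketch of how such an evaluation could be structured. It is not the prompt or code used in the thesis: the prompt wording, the English label names, the fallback to "neutral" for unparseable replies, and the `ask_model` callable (a stand-in for a call to a model such as GPT-3.5-turbo) are all assumptions made for illustration.

```python
from typing import Callable, Iterable, Tuple

# Assumed label names; the thesis only states that there are three classes.
LABELS = ("positive", "neutral", "negative")


def build_prompt(text: str) -> str:
    """Build a zero-shot prompt: instructions only, no labelled examples.

    The wording is hypothetical, not the prompt developed in the thesis.
    """
    return (
        "Classify the sentiment of the following Latvian text as "
        "positive, neutral, or negative. Answer with one word.\n\n"
        f"Text: {text}\nSentiment:"
    )


def parse_label(reply: str) -> str:
    """Map a free-form model reply onto one of LABELS.

    Falling back to 'neutral' for unrecognised replies is an assumption.
    """
    cleaned = reply.strip().lower()
    for label in LABELS:
        if label in cleaned:
            return label
    return "neutral"


def accuracy(
    samples: Iterable[Tuple[str, str]],
    ask_model: Callable[[str], str],
) -> float:
    """Share of (text, gold_label) pairs that the model labels correctly."""
    pairs = list(samples)
    if not pairs:
        raise ValueError("samples must not be empty")
    correct = sum(
        parse_label(ask_model(build_prompt(text))) == gold
        for text, gold in pairs
    )
    return correct / len(pairs)
```

In practice `ask_model` would wrap a request to the chosen model's API. Keeping it as a plain callable lets the same evaluation loop compare different prompts or models on the validation set.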

Keywords

Computer science, large language models, natural language processing, prompt engineering, sentiment analysis, GPT

URI

https://dspace.lu.lv/handle/7/64303

Collections

Bakalaura un maģistra darbi (EZTF) / Bachelor's and Master's theses
