Comparison of Methods for Automated Survey Text Response Analysis in Education Sector

Gulbe, Elīza

dc.contributor.advisor	Paikens, Pēteris
dc.contributor.author	Gulbe, Elīza
dc.contributor.other	Latvijas Universitāte. Datorikas fakultāte
dc.date.accessioned	2024-06-01T01:02:40Z
dc.date.available	2024-06-01T01:02:40Z
dc.date.issued	2024
dc.identifier.other	100836
dc.identifier.uri	https://dspace.lu.lv/dspace/handle/7/65596
dc.description.abstract	Feedback analysis in the education sector is important for improving educational processes. Compared to Likert-scale questions, open-ended text questions can provide additional details that are not covered in the survey. At the same time, manual work is needed to analyze open-ended text responses. The aim of this study is to understand whether text clustering can be used as a tool for analyzing open-ended text responses in the education sector. Text clustering comprises five steps: data pre-processing, converting each response into a numeric vector, measuring the distance between vectors, the clustering algorithm, and evaluation. First, various text vectorization methods, distance measures, and clustering algorithms were compared using the Normalized Mutual Information and Adjusted Rand Index scores. To understand whether text clustering is a suitable solution for analyzing open-ended survey responses, once the best methods for text vectorization, distance measurement, and clustering had been determined, the time needed to categorize open-ended responses was measured for unsorted data and for clustered data, with 20 people taking part. Finally, a qualitative analysis was carried out to identify practical challenges and successes related to the results of the clustering process. The dataset selected for the study consists of two questions, "Thinking about the last six months, what has made you feel dissatisfied in your workplace?" and "Thinking about the last six months, what has made you feel good in your workplace?", and contains approximately 200 respondents' answers for each question. The results of the study show that large language models give better results than traditional text vectorization methods such as Bag-Of-Words and Word2Vec. 
The voyage-lite-01-instruct vectorization model combined with the Expectation-Maximization or Agglomerative clustering algorithm, using Euclidean distance, gave the best results. For the human evaluation experiment, the voyage-lite-01-instruct vectorization model, Euclidean distance, and the Expectation-Maximization clustering algorithm were selected. The human evaluation results showed a statistically significant difference in the time needed to categorize feedback for one of the datasets, the one answering the question "Thinking about the last six months, what has made you feel good in your workplace?". The qualitative analysis concluded that, if the data does not need pre-processing, text clustering can group the data by the topic of the responses. The qualitative analysis also concluded that the existing challenges are incorrect classification based on the initial phrasing, the inclusion of outliers in a cluster, and similar responses ending up in several clusters. This study is a first step toward building an automated solution for feedback analysis in the education sector. This study proves that it is possible to reduce, with statistical significance, the time spent analyzing open-ended responses by using text clustering. The limitations of this study are the absence of text segmentation in data pre-processing, the absence of topic modeling for naming the clusters, and the fact that testing was carried out with only two questions. To fully automate the text analysis of feedback, existing text segmentation and topic modeling methods need to be compared, and improvements are needed in determining the number of clusters. The solution also needs to be tested on more datasets. This bachelor's thesis is written in English and consists of 84 pages, including 6 figures and 22 tables. Keywords: feedback analysis, text clustering, education sector
dc.description.abstract	Feedback analysis in the Education sector is important for the development of education processes. Compared to Likert-scale questions, open-ended text responses can provide additional insights that might not be covered in the questionnaire. Yet it takes a lot of manual effort to analyze open-ended text responses. This research aims to understand whether text clustering can be used as a solution for analyzing survey responses in the Education sector. The text clustering process consists of five steps: data pre-processing, embedding, distance measurement, clustering, and evaluation. To understand whether a text clustering solution is applicable to open-ended feedback analysis in the Education sector, various embedding models, distance measures, and clustering algorithms were compared using Normalized Mutual Information and Adjusted Rand Index scores. After the top-performing combination of embedding model, distance measure, and clustering algorithm was determined, the time taken to categorize raw feedback was compared with the time taken to categorize clustered feedback among 20 people, to determine whether text clustering has a practical application in analyzing survey text responses. To identify practical challenges and successes of the clustered output, an additional qualitative analysis was performed. The selected dataset had two questions, “Thinking about the last six months, what have made you feel dissatisfied in your workplace?” and “Thinking about the last six months, what have made you feel good in your workplace?”, with approximately 200 respondents for each question. The results showed that Large Language Model embeddings perform better than traditional embedding methods such as Bag-Of-Words and Word2Vec. The voyage-lite-01-instruct embedding model in combination with the Expectation-Maximization (EM) algorithm or Agglomerative clustering, using Euclidean distance, showed the best results. 
For the human evaluation of the clustering algorithm, the voyage-lite-01-instruct embedding model, Euclidean distance, and the EM clustering algorithm were selected. The human evaluation results showed a statistically significant difference in the time taken to categorize the feedback responses for one of the datasets, the one responding to the question “Thinking about the last six months, what have made you feel good in your workplace?”. The qualitative analysis of the clustering output showed that, if the data does not require any pre-processing, the clustering algorithm is able to distinguish between different topics and cluster responses based on the theme of the response. The challenges identified with text clustering in the qualitative analysis were incorrect clustering based on initial phrasing, inclusion of outliers within a cluster, and fragmentation of similar topics across multiple clusters. This research serves as an initial step toward creating an automated solution for feedback analysis in the Education sector. This research proves that it is possible to significantly decrease the time spent on categorizing open-ended text feedback using text clustering. The limitations of this research are the absence of text segmentation in the pre-processing phase, the absence of topic modeling to name the clusters, and the fact that the testing was done on only two questions. In future work, to improve the quality of clustering and further reduce the time spent analyzing text data, different text segmentation methods, topic modeling methods, and methods for determining the optimal number of clusters should be compared. Additionally, the solution needs to be tested on more datasets. This bachelor thesis consists of 86 pages, including 6 figures, 22 tables and 64 references to information sources. Keywords: feedback analysis, text clustering, Education sector
dc.language.iso	lav
dc.publisher	Latvijas Universitāte
dc.rights	info:eu-repo/semantics/openAccess
dc.subject	Computer Science
dc.title	Metožu salīdzinājums automatizētu aptauju atvērto teksta atbilžu analīzei izglītības sektorā
dc.title.alternative	Comparison of Methods for Automated Survey Text Response Analysis in Education Sector
dc.type	info:eu-repo/semantics/bachelorThesis
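
The abstract names Normalized Mutual Information (NMI) and Adjusted Rand Index (ARI) as the scores used to compare clustering configurations. The following is a minimal sketch of how the two scores can be computed from reference labels and cluster assignments, using only the Python standard library. The function names and the toy labels are hypothetical, and normalizing NMI by the arithmetic mean of the two entropies is an assumption, since the record does not say which normalization the thesis used.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(reference, predicted):
    """Adjusted Rand Index between two labelings of the same items."""
    if len(reference) != len(predicted):
        raise ValueError("labelings must have the same length")
    n = len(reference)
    if n < 2:
        return 1.0  # no pairs of items to disagree on
    cells = Counter(zip(reference, predicted))  # contingency table counts
    rows = Counter(reference)
    cols = Counter(predicted)
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)  # index expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Both labelings are one single cluster, or both are all singletons.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def _entropy(counts, n):
    """Shannon entropy (natural log) of a labeling given its cluster sizes."""
    return -sum((c / n) * log(c / n) for c in counts)


def normalized_mutual_information(reference, predicted):
    """NMI, normalized by the arithmetic mean of the two entropies (assumed)."""
    if len(reference) != len(predicted):
        raise ValueError("labelings must have the same length")
    n = len(reference)
    cells = Counter(zip(reference, predicted))
    rows = Counter(reference)
    cols = Counter(predicted)
    h_ref = _entropy(rows.values(), n)
    h_pred = _entropy(cols.values(), n)
    if h_ref == 0 and h_pred == 0:
        return 1.0  # both labelings put every item in one cluster
    mutual_info = sum(
        (c / n) * log(n * c / (rows[r] * cols[p]))
        for (r, p), c in cells.items()
    )
    return mutual_info / ((h_ref + h_pred) / 2)


if __name__ == "__main__":
    # Hypothetical example: manual topic labels vs. cluster ids for six responses.
    manual = ["pay", "pay", "pay", "team", "team", "team"]
    clusters = [0, 0, 1, 1, 2, 2]
    print("ARI:", adjusted_rand_index(manual, clusters))
    print("NMI:", normalized_mutual_information(manual, clusters))
```

Both scores depend only on how items are grouped, not on the label names, so cluster ids do not need to match the manual category names. ARI is corrected for chance agreement, which makes it easier to compare configurations that produce different numbers of clusters.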

Files in this item

Name: 302-100836-Gulbe_Eliza_eg20091.pdf
Size: 1.392Mb
Format: PDF

This item appears in the following Collection(s)

Bakalaura un maģistra darbi (EZTF) / Bachelor's and Master's theses [5688]
