Language Model Plugin for Malware Analysis (Valodu modeļa spraudnis ļaunprogrammatūras analīzei)

This thesis studies the integration of language models (LLMs) with decompilers in order to improve malware reverse engineering. Malicious software, which takes various forms, poses considerable problems in cybersecurity, and modern tools are needed to analyze and mitigate it. Decompilation, the reconstruction of source code from compiled software, is essential for understanding malware functionality, which is a prerequisite for combating it. This process faces partial information loss and obfuscated elements. The thesis analyzes the ability of LLMs to improve the decompilation process. Previous studies examined LLM capabilities in tasks such as data type and name recovery and abstract syntax tree generation, which are essential for making decompiled code readable and logically and correctly structured. Candidate LLM types are examined in connection with Ghidra, a tool for software analysis and code recovery. The aim of the research is to advance the field of malware analysis by using the capabilities of artificial intelligence, specifically LLMs, to develop more efficient and effective software analysis tools.
Keywords: Ghidra, software reverse engineering, decompilation, language model, plugin
The thesis is written in English, consists of 51 pages (of which 10 pages are annexes), and includes 32 sources, 9 figures and DD tables.
The thesis addresses the growing challenge of understanding and mitigating the threats posed by malicious software. Malware, in its many forms, continues to pose significant risks to information security and demands effective tools for analysis and containment. Decompilation is critical to understanding malware and subsequently containing it, but it is complicated by information loss and obfuscation. This thesis investigates the potential of Large Language Models (LLMs) to improve the decompilation process in malware analysis. The methodology integrates LLMs with Ghidra, a widely used decompiler, to enhance the decompilation output. The integration aims to improve quantitative aspects such as abstract syntax tree (AST) recognition and data type recovery, as well as qualitative aspects such as accuracy, understandability, and readability. Using a Likert scale, the thesis evaluates the quality of the decompiled code through automated and manual benchmarks and human surveys. The results are promising. Integrating LLMs with the decompiler improved data type recovery compared with Ghidra's base performance: the LLM CodeLlama:Phind34Bv2 achieved a 95.76% match rate, marginally higher than the baseline, which indicates the potential benefit of this approach. Variable name recovery was found to be generally successful. The plugin with LLMs did not affect accuracy, but it had a significant effect on understandability, readability, variable name comprehensibility, and appropriateness.
In conclusion, the results provide preliminary evidence that LLMs, when integrated with tools like Ghidra, can enhance most of the evaluated aspects of the decompilation process. However, the thesis acknowledges limitations in the current state of the technology. Integrating LLMs with decompilers, particularly in local environments, poses significant challenges, including computational resource demands and the complexity of the integration process. Future work should further refine the integration of LLMs and decompilers. Potential areas of improvement include better recognition of software fragments, de-obfuscation capabilities, and the ability to use user-provided context to improve decompilation accuracy. This thesis provides a foundation for future research in malware reverse engineering. The results could benefit cybersecurity analysts by enabling more efficient and accurate malware analysis. Continued exploration of the synergy between artificial intelligence and software reverse engineering may lead to more robust defenses against cyber threats.
Keywords: Ghidra, software reverse engineering, decompilation, large language model, plugin

Keywords

Computer Science

URI

https://dspace.lu.lv/handle/7/65599

Collections

Bakalaura un maģistra darbi (EZTF) / Bachelor's and Master's theses
