Automatizēta semistrukturētas informācijas atpazīšana un analīze no WWW lapām (Automated Recognition and Analysis of Semi-Structured Information from WWW Pages)

Jirgens, Krists

304-31259-Jirgens_Krists_DatZ020052.pdf (1.037Mb)

Author

Jirgens, Krists

Co-author

Latvijas Universitāte. Fizikas un matemātikas fakultāte

Advisor

Karnītis, Ģirts

Date

2008

Abstract

The Internet is currently the largest publicly available knowledge repository ever created by humankind. Unfortunately, HTML, the format in which most web data is published, is in some sense a modern legacy format: it is designed to describe the visual presentation of data to web browsers and carries very limited semantic information about the data and the relations between them. Automated recognition of data and its integration with data from other web resources therefore remain serious problems. This thesis reviews the data extraction approaches proposed in the literature and puts forward new ideas for solving these problems. A method is developed that recognizes and extracts classifiers from web pages and then uses those classifiers to generate data extraction templates for web sites. In the course of the work, a tool was built to determine experimentally which of the proposed ideas are the most successful and give the best results in practice. The resulting tool combines the most successful methods and can extract data from semi-structured web pages in a largely automated way, making the data available for arbitrary further programmatic processing.

URI

https://dspace.lu.lv/dspace/handle/7/17667

Collections

Bakalaura un maģistra darbi (FMOF) / Bachelor's and Master's theses [2775]