Scanning food ingredients using optical character recognition to determine a product's nutritional profile
Author
Greiliha, Klinta Madara
Co-author
Latvijas Universitāte. Datorikas fakultāte
Advisor
Cinks, Ronalds
Date
2023
Abstract
The production and consumption of processed food products are growing. More than 3000 food ingredients exist, and it is difficult to know the origin of every possible ingredient, even though this can matter for various reasons, such as a chosen diet or allergies (International Food Information Council & U.S. Food and Drug Administration, 2004). In particular, interest has grown worldwide in plant-based diets and in reducing the consumption of products that contain animal-derived ingredients, for ethical, health, or environmental reasons (World Health Organization & Regional Office for Europe, 2021). However, animal-derived ingredients are not always easy to recognize on food labels, because no regulations require producers to indicate their presence in a product. Image processing is a solution that can help collect and evaluate the information given on food product labels, both the ingredient list and the nutritional values. Various machine learning algorithms and processing methods have been developed to increase text recognition accuracy. Studies show that not only the optical character recognition (OCR) process matters, but also image pre-processing and post-processing of the text extracted by the OCR tool. This work presents a solution that recognizes a food product's ingredient list (in Latvian) from an image of the product label using OCR technology. The project covers OCR as well as pre-processing and post-processing techniques. The solution classifies a product as plant-based or not and, if it is not, indicates which ingredients are of animal origin. The bachelor's thesis is written in English. It consists of 60 pages. It uses 37 references and includes 10 tables and 5 figures (not counting the appendices), as well as 6 appendices.
Keywords: image processing, food ingredients, OCR

There is a growing interest in consuming plant-based products and limiting products that contain animal-based ingredients, for ethical, environmental, or health reasons (World Health Organization & Regional Office for Europe, 2021). However, no regulations require producers to indicate which ingredients are derived from animals. While it is not too hard to recognize “milk” or “chicken”, there are many lesser-known ingredients to look out for in a product’s ingredient list. More than 3000 ingredients can be included in a food product (International Food Information Council & U.S. Food and Drug Administration, 2004), so it is hard to be knowledgeable about the origin of every possible ingredient. This is where technology can assist. This thesis project aims to develop a solution that can extract an ingredient list from an image of a food label (in Latvian) and determine whether the product is plant-based or contains animal-based ingredients. A sample of 200 images covering a wide variety of food product labels was gathered, 100 of which contained at least one animal-based ingredient. The image pre-processing techniques tested included converting the image to greyscale, converting it to black and white (binary), and removing background or noise. Tesseract was the chosen optical character recognition (OCR) engine, and five of the page segmentation modes available for the engine were tested. Post-processing included redundant text removal, exception handling (for cases where ingredients should not be flagged as animal-based), and handling of common OCR spelling errors. A spell-check method was tested, too: the measure used was the Levenshtein distance between the identified ingredient and the items on the prepared list of undesirable ingredients; if the distance was below the threshold, the ingredients were assumed to be the same.
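The spell-check step described above can be illustrated with a minimal sketch. This is not the thesis implementation: the keyword list `ANIMAL_BASED`, the helper `flag_ingredient`, and the default threshold `max_distance=1` are hypothetical, and "within the threshold" is assumed here to mean a distance less than or equal to `max_distance`.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


# Hypothetical keyword list (assumption): milk, eggs, honey, gelatine in Latvian.
ANIMAL_BASED = ["piens", "olas", "medus", "želatīns"]


def flag_ingredient(word, keywords=ANIMAL_BASED, max_distance=1):
    """Return the closest keyword if it lies within max_distance, else None."""
    word = word.lower()
    best = min(keywords, key=lambda k: levenshtein(word, k))
    return best if levenshtein(word, best) <= max_distance else None
```

With these assumptions, an OCR misread such as `plens` is still mapped to `piens`, while `zelatins` (both diacritics lost) is only matched to `želatīns` when `max_distance` is raised to 2. A looser threshold also raises the risk of false matches on short keywords, which is one possible reason an exact-match comparison could score better.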
The solution achieved 80% ingredient retrieval accuracy (F1 score) and correctly classified 87.5% of food products as plant-based or not. Both pre-processing and post-processing significantly improved the accuracy. The biggest improvement was achieved when the input image was converted to greyscale and had noise removed. The choice of segmentation mode significantly influenced the results, too. Identical match post-processing (where a keyword must match exactly to be flagged) performed the best. Incorporating Levenshtein distance also improved accuracy; however, the results were worse than for identical match comparison. An important limitation is that the solution does not identify ingredients precisely as they appear in the ingredient list, but rather finds keywords within the text that best describe the ingredient. Further research could explore ways to identify full ingredients, not just the keywords. Other limitations include high complexity, rotation inflexibility (text orientation must be approximately 0°), and the engine’s sensitivity to image quality, combined with the fact that food labels can be very diverse in terms of font, colors, composition, printing quality, and packaging. Not every pre-processing or post-processing technique was tried, so further research could explore other techniques to find the combination that achieves the highest accuracy. More sophisticated spell-check methods could be implemented to improve accuracy, paying particular attention to Latvian-specific letters, which the OCR engine often misinterprets because their diacritical marks (macrons, carons, cedillas) are hard to detect, so the letters look like basic Latin alphabet letters. Machine learning algorithms could be included in the post-processing to analyze text, e.g., to recognize the possible ingredients and the most likely word combinations (such as “produktā var atrasties xxx daļiņas”). The paper is written in English. It consists of 60 pages, 9 of which contain 6 appendices.
It contains 37 references, 10 tables, and 5 figures (not counting the tables and figures within the appendices). Keywords: image processin
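
The two reported measures, ingredient retrieval F1 and plant-based classification, can be sketched for a single label as follows. This is an illustrative sketch only: the abstract does not state how scores were aggregated across the 200 images, and the function names `retrieval_scores` and `is_plant_based` and the example keyword sets are hypothetical.

```python
def retrieval_scores(expected: set[str], found: set[str]) -> tuple[float, float, float]:
    """Precision, recall, and F1 for flagged keywords on one label."""
    true_positives = len(expected & found)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def is_plant_based(found: set[str]) -> bool:
    """Assumed rule: a product is plant-based when no keyword was flagged."""
    return not found


# Hypothetical example: two of three animal-based keywords were retrieved,
# plus one false positive.
expected = {"piens", "olas", "medus"}
found = {"piens", "olas", "sviests"}
precision, recall, f1 = retrieval_scores(expected, found)
```

Under this rule a single false positive flips a plant-based product to "not plant-based", so classification accuracy and retrieval F1 can diverge.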