Project No. AP09259309. Information model and software for a system of automatic search and analysis of multilingual illegal web content based on an ontological approach – Institute of Information and Computational Technologies

Research group:

1) Mamyrbaev O.Zh. – Head of Research, Deputy General Director, PhD, Senior Researcher (https://orcid.org/0000-0001-8318-3794)

2) Khairova N.F. – Doctor of Technical Sciences, Professor, Chief Researcher (https://orcid.org/0000-0002-9826-0286)

3) Sharonova N.V. – Doctor of Technical Sciences, Professor, Chief Researcher (https://orcid.org/0000-0002-7555-1507)

4) Mukhsina K.Zh. – PhD, Senior Researcher (https://orcid.org/0000-0002-8627-1949)

5) Ybytaeva G.S. – Junior Researcher (https://orcid.org/0000-0002-4243-0928)

6) Kartbaev A.Zh. – PhD, Senior Researcher (https://orcid.org/0000-0003-0592-5865)

Goal of the project:

To develop an information model for the automatic identification of illegal texts in Kazakh, Russian and English on the Internet. The information model includes the “Illegal Internet Content” ontology, specialized text corpora, and software tools designed to support state service analysts in identifying texts with illegal content.

Project objectives:

1) Create a primary basic meta-ontology, “Illegal Internet Content”, of limited size and structure. This task includes building a terminological thesaurus for Kazakh, Russian, and English from the existing corpora on this subject in the three languages, and defining the ontology classes, their properties, and the relations between classes.
2) Develop a method for automatically populating and extending the base ontology “Illegal Internet Content” from the existing corpus of criminalized text information from the Web. The method should use statistical approaches and the previously created information and linguistic Open Information Extraction model, designed to extract fact triplets from unstructured texts.
3) Implement automatic population of the “Illegal Internet Content” ontology from the existing corpus of texts containing criminalized information. The ontology should include Kazakh, Russian, and English vocabulary and be large enough for practical use in information retrieval models. At this stage, it is necessary to carry out preliminary linguistic processing of the corpus texts and to identify formal legal dimensions between certain linguistic formalisms in the texts and real entities, entity classes, and relationships for each of the three languages.
4) Develop a method and tools for semantic markup of Kazakh, Russian, and English text corpora of criminally significant Internet content. The method should be based on the created ontology and an aligned parallel Kazakh-Russian corpus of criminalized texts. To form the set of semantic labels, it is proposed to use both the developed ontology classes and existing approaches to the Entity Recognition problem in NLP. At this stage, it is necessary to carry out semantic markup of the existing corpus of criminally colored texts.
5) Develop an integrated technology for searching and analyzing illegal content in social networks and other Internet sources in three languages, combining supervised machine learning methods with additional differentiating semantic features of the criminal tinge of texts obtained through an ontological approach.
6) Create an effective algorithm and software for automatic monitoring of Internet resources, enabling automatic search and analysis of multilingual illegal Internet content. At this stage, the effectiveness of the developed technology for identifying illegal text information, based on an ontological approach, should be demonstrated.
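
To make the first objective more concrete, the following is a minimal sketch of how a small trilingual ontology with classes, properties, and relations between classes might be represented in Python. It is not the project's actual data model: the names `OntologyClass`, `Ontology`, `build_demo_ontology`, the class names `Fraud` and `Offence`, the `severity` property, and the sample terms are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class OntologyClass:
    """One ontology class with per-language thesaurus terms (hypothetical layout)."""
    name: str
    # Language code ("kk", "ru", "en") -> set of terms that denote this class.
    labels: dict[str, set[str]] = field(default_factory=dict)
    # Free-form class properties, e.g. {"severity": "unspecified"}.
    properties: dict[str, str] = field(default_factory=dict)


@dataclass
class Ontology:
    classes: dict[str, OntologyClass] = field(default_factory=dict)
    # Relations between classes as (source, relation, target) triples.
    relations: set[tuple[str, str, str]] = field(default_factory=set)

    def add_class(self, name, labels=None, properties=None):
        cls = OntologyClass(
            name,
            {lang: set(terms) for lang, terms in (labels or {}).items()},
            dict(properties or {}),
        )
        self.classes[name] = cls
        return cls

    def add_relation(self, source, relation, target):
        # Both ends of a relation must already be defined as classes.
        if source not in self.classes or target not in self.classes:
            raise KeyError("both classes must be defined before relating them")
        self.relations.add((source, relation, target))

    def lookup(self, term, lang):
        """Return the sorted names of classes whose thesaurus lists `term` in `lang`."""
        term = term.casefold()
        return sorted(
            name
            for name, cls in self.classes.items()
            if term in {t.casefold() for t in cls.labels.get(lang, set())}
        )


def build_demo_ontology():
    # Illustrative content only; the real ontology is built from the project corpora.
    onto = Ontology()
    onto.add_class("Offence")
    onto.add_class(
        "Fraud",
        labels={"en": ["fraud"], "ru": ["мошенничество"], "kk": ["алаяқтық"]},
        properties={"severity": "unspecified"},
    )
    onto.add_relation("Fraud", "is_a", "Offence")
    return onto
```

Keeping the thesaurus terms on the class itself, keyed by language code, is one simple way to let the same class be found from Kazakh, Russian, or English text; a production system would more likely use a standard ontology format.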

Scientific novelty of the project:

The scientific novelty lies in a new integrated approach to the semantic analysis of Internet text content, based on the simultaneous use of machine learning methods and reinforcing differentiating features obtained from the subject-area ontology.
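
As a rough illustration of combining machine learning features with ontology-derived features, the sketch below builds a feature vector that concatenates plain word counts with counts of ontology classes matched in the text. It assumes a simple lexicon lookup; the function names, the `class_lexicon` mapping, and the class names are hypothetical, and the project's actual feature set is not described in this summary.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    # Lowercased word tokens; \w matches Cyrillic and Kazakh letters as well.
    return [t.casefold() for t in TOKEN_RE.findall(text)]


def lexical_features(tokens, vocabulary):
    # Bag-of-words counts over a fixed vocabulary order.
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def ontology_features(tokens, class_lexicon, class_order):
    # Count how many tokens map to each ontology class.
    counts = Counter()
    for tok in tokens:
        for cls in class_lexicon.get(tok, ()):
            counts[cls] += 1
    return [counts[cls] for cls in class_order]


def combined_features(text, vocabulary, class_lexicon, class_order):
    # Final vector = lexical part followed by ontology part.
    tokens = tokenize(text)
    return lexical_features(tokens, vocabulary) + ontology_features(
        tokens, class_lexicon, class_order
    )


# Hypothetical vocabulary and term-to-class lexicon, for illustration only.
VOCABULARY = ["send", "money", "account"]
CLASS_LEXICON = {
    "fraud": ["Fraud"],
    "scam": ["Fraud"],
    "account": ["FinancialInstrument"],
}
CLASS_ORDER = ["Fraud", "FinancialInstrument"]
```

The resulting vector can be passed to any supervised classifier; the ontology part adds a small number of dense, interpretable dimensions to the sparse lexical ones.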

The project also includes the development of a method for automatically generating a linguistic ontology “Illegal Internet Content” based on a logical-linguistic model for extracting facts from unstructured documents.

This model makes it possible to automate populating the ontology with entities and the relationships between them, extracted from the created text corpora of criminally colored texts.
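
One way to picture this population step is sketched below. It assumes the fact extraction model has already produced (subject, relation, object) triplets, and it uses a plain frequency threshold as a stand-in for the statistical filtering; both the `populate_from_triples` function and the `min_count` parameter are hypothetical and not taken from the project.

```python
from collections import Counter


def populate_from_triples(triples, min_count=2):
    """Keep entities and relations from triplets seen at least `min_count` times.

    `triples` is an iterable of (subject, relation, object) strings, assumed to
    come from an upstream fact extraction step that is not shown here.
    """
    # Normalize case so that surface variants of the same triplet are merged.
    counts = Counter(
        (s.casefold(), r.casefold(), o.casefold()) for s, r, o in triples
    )
    entities, relations = set(), set()
    for (s, r, o), n in counts.items():
        if n >= min_count:
            entities.update((s, o))
            relations.add((s, r, o))
    return entities, relations
```

A frequency cut-off is only the simplest possible filter; it discards triplets that occur once in the corpus, which reduces noise at the cost of missing rare but valid facts.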

In the course of the project, an ontology of the subject area of illegal Internet text content is to be developed for three languages (Kazakh, Russian and English) for the first time in the Republic of Kazakhstan. It should be noted that open sources worldwide contain no available information about such ontologies that is sufficient for practical application.

Object of research:

Models and methods of automatic search and analysis of illegal textual information in the Kazakh, Russian and English languages based on the ontological approach.

Main design, technical, and economic indicators; efficiency:

Implementing this project makes it possible to increase the efficiency of semantic processing of texts in Kazakh, Russian and English. The highly specialized ontology “Illegal Internet Content” is a new linguistic resource for the Kazakh language, which increases the scientific potential of subsequent developments.

Scope:

Law enforcement and special government organizations; social services; educational institutions and other government institutions.

Expected results:

The main results expected in the course of the project:

1) A method for automatic generation of the “Illegal Internet Content” ontology for Kazakh, Russian and English will be developed and implemented.
2) Corpora of criminally significant information found on the Internet in Kazakh, Russian and English will be extended.
3) A method of semantic analysis and semantic markup of the created, dynamically populated multilingual text corpora will be developed, with an emphasis on identifying linguistic and lexical markers of illegal content.
4) An integrated technology for searching and analyzing illegal content in social networks and other Internet sources in Kazakh, Russian and English will be created, combining machine learning methods and an ontological approach. On the basis of this technology, an effective algorithm and software for a system of automatic monitoring of Internet resources will be developed, enabling automatic search and analysis of multilingual illegal Internet content. The effectiveness of the created models, methods and algorithms will be demonstrated by practical experiments.
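
The semantic markup result listed above can be illustrated with a very small lexicon-based tagger. This is only a sketch under the assumption that markup assigns an ontology class label to each matching token; the `annotate` function, the `"O"` label for unmatched tokens, and the sample lexicon entry are hypothetical, and the project's own method also draws on entity recognition approaches that are not shown here.

```python
import re

TOKEN_RE = re.compile(r"\w+")


def annotate(text, lexicon):
    """Return (token, label) pairs; tokens not in the lexicon get the label "O".

    `lexicon` maps a lowercased term to an ontology class name.
    """
    result = []
    for match in TOKEN_RE.finditer(text):
        token = match.group()
        label = lexicon.get(token.casefold(), "O")
        result.append((token, label))
    return result
```

For example, `annotate("Обычное мошенничество", {"мошенничество": "Fraud"})` labels the second token with the class `Fraud` and leaves the first as `"O"`. Single-token lookup does not handle multiword terms or inflected forms, which matter a great deal for Kazakh and Russian.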

The results obtained:

1) a basic terminological thesaurus of the illegal vocabulary of the Kazakh, Russian and English languages, representing a meta-ontology of a limited size and structure;

2) extended corpora of criminally significant texts of group online discussion communities;

3) a method of automatic ontology generation based on the available corpora and the developed approach to extracting events from text (OdEE).

List of publications:

1) Khairova N., Kolesnyk A., Mamyrbayev O., Ybytayeva G., Lytvynenko Y. Automatic Multilingual Ontology Generation Based on Texts Focused on Criminal Topic // Proceedings of the 5th International Conference on Computational Linguistics and Intelligent Systems. – 2021. – Vol. 1. – P. 108-117.
2) Mamyrbayev O., Kydyrbekova A., Alimhan K., Oralbekova D., Zhumazhanov B., Nuranbayeva B. Development of security systems using DNN and i & x-vector classifiers // Eastern-European Journal of Enterprise Technologies. – 2021. – Vol. 4/9 (112). – P. 32-45 // https://doi.org/10.15587/1729-4061.2021.239186.
3) Ybytayeva G.S., Mamyrbayev O.Zh., Khairova N.F., Zhumazhanov B.Zh. Қазақ тіліндегі мәтіндерде коллокацияларды анықтаудың статистикалық әдістерін талдау [Analysis of statistical methods for identifying collocations in Kazakh-language texts] // Proceedings of the VI International Scientific Conference “Informatics and Applied Mathematics”. – Almaty, Kazakhstan, 2021. – P. 256-262.
4) Kartbayev A., Mamyrbayev O., Khairova N., Ybytayeva G., Abilkaiyr N., Mussayeva D. Correction of Kazakh synthetic text using finite state automata // Journal of Theoretical and Applied Information Technology. – 2021. – Vol. 99, Issue 23 (in press).
5) Ybytayeva G.S., Khairova N.F., Mukhsina K.Zh., Zhumazhanov B.Zh. Лингвистикалық онтологияны қолдану және қалыптастыру мәселелеріне шолу [A review of issues in applying and building a linguistic ontology] // News of the National Academy of Sciences of the Republic of Kazakhstan. Physics and information technology series. – 2022. – Vol. 1, No. 341. – P. 96-106 // https://doi.org/10.32014/2022.2518-1726.121.