Project No. АР05131073. Methods and models for retrieving and analyzing criminally significant information in semi-structured and unstructured text arrays.

The project is aimed at solving the general fundamental problem of forming the methodological foundations for logical and linguistic models that identify cognitive and semantic meaning identifiers in natural language texts. Within this general problem, the project solved a concrete applied problem: it created an information-linguistic technology for the automatic identification, selection, search and analysis of the criminally significant component in unstructured and weakly structured text arrays in Kazakh, Russian and English, based on modeling the understanding function of human intellect.

To achieve this goal, the project completed the following tasks.

  1. An analytical review of the major problems in technologies for searching for illegal information in text data was carried out:
  • the status and development prospects of formalization and information retrieval methods for unstructured text arrays were analyzed;
  • a general approach to the formalization and identification of criminally relevant information was developed;
  • existing opportunities for using Information Extraction methods to extract criminally relevant information were reviewed.
  2. A logical and linguistic model for fact extraction from natural language text arrays was developed:
  • the use of finite predicate algebra as a mathematical tool for modeling the semantics of unstructured and weakly structured texts was justified;
  • a logical and linguistic model for fact extraction from weakly structured Russian texts was developed;
  • an information technology for fact extraction from weakly structured English texts was created;
  • existing problems in the formalization and automation of the Kazakh language were analyzed;
  • a logical and linguistic model of Open Information Extraction for Kazakh texts was created.
  3. A corpus of modern web content in Kazakh, Russian and English was developed:
  • the specifics of building a Kazakh-Russian parallel corpus of criminal texts were considered;
  • an information technology for identifying and analyzing criminally significant information in a text corpus was developed;
  • an information technology for aligning the created Kazakh-Russian parallel corpus of texts on criminal topics was created;
  • the practical results of applying the developed Open IE model to three corpora of Russian, Kazakh and English texts were shown.
  4. The correlation between linguistic formalisms in web content texts and the real essence of a socially significant event was investigated:
  • existing approaches to generating structured machine-readable information from unstructured texts were reviewed;
  • a formal model of the grammatical means of expressing the fact of inducement to action in English was developed;
  • epistemological aspects of the information processes for identifying semantic/lexical and grammatical identifiers of criminality were considered;
  • a method for detecting semantic identifiers of criminality in a corpus of texts was developed;
  • a technology for searching for semantically close short text fragments was developed [2, 8].
  5. The effectiveness of the developed technologies for identifying criminally significant information was evaluated on the basis of the created corpus:
  • a comparative analysis of metrics for evaluating the effectiveness of machine learning models was carried out; the use of a numerical evaluation metric in the form of a tuple of the recall (completeness) coefficient, the precision (accuracy) coefficient and the van Rijsbergen measure (F-measure) as objectively measurable characteristics of model effectiveness was justified;
  • the implementation and experimental results of the Open IE model were considered;
  • a methodology for expert quality assessment of the technology for determining the semantic proximity of texts to illegal topics was created;
  • a model for assessing the quality of the technology for determining the semantic proximity of a document to a highly specialized subject was built;
  • recommendations were developed for creating an information technology that identifies knowledge in text arrays in order to select information important for preventing illegal actions.
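
The evaluation tuple named in the last task (completeness, accuracy and the van Rijsbergen measure, more commonly called recall, precision and F-measure) can be illustrated with a minimal sketch. It assumes that extracted facts are compared with a manually annotated gold set by exact match; the function name `precision_recall_f` and the sample triples are hypothetical and are not taken from the project. The F-measure is computed as F_β = (1 + β²)·P·R / (β²·P + R).

```python
def precision_recall_f(extracted, gold, beta=1.0):
    """Return (precision, recall, F_beta) for extracted items against a gold set.

    Illustrative helper only: items are matched by exact equality,
    which is an assumption rather than the project's actual procedure.
    """
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)

    # Precision: share of extracted items that are correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: share of gold items that were found.
    recall = true_positives / len(gold) if gold else 0.0

    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0

    # van Rijsbergen's F-measure; beta > 1 weights recall more heavily.
    b2 = beta * beta
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta


# Hypothetical (subject, predicate, object) facts for illustration.
gold_facts = {
    ("police", "detained", "suspect"),
    ("court", "sentenced", "defendant"),
    ("suspect", "stole", "car"),
    ("witness", "identified", "suspect"),
    ("group", "smuggled", "goods"),
    ("officer", "seized", "weapon"),
}
extracted_facts = {
    ("police", "detained", "suspect"),
    ("court", "sentenced", "defendant"),
    ("suspect", "stole", "car"),
    ("police", "visited", "school"),  # not in the gold set
}

p, r, f1 = precision_recall_f(extracted_facts, gold_facts)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
# prints: precision=0.75 recall=0.50 F1=0.60
```

Reporting the three values together, rather than a single score, makes it visible whether a model gains F-measure by extracting more facts or by extracting cleaner ones.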


  1. Khairova N., Lewoniewski W., Węcel K., Mamyrbayev O., Mukhsina K. Comparative Analysis of the Informativeness and Encyclopedic Style of the Popular Web Information Sources // Business Information Systems. Lecture Notes in Business Information Processing. – Springer, Cham, 2018. – Vol. 320. – P. 333–347.
  2. Khairova N., Petrasova S., Lewoniewski W., Mamyrbayev O., Mukhsina K. Automatic Extraction of Synonymous Collocation Pairs from a Text Corpus // Proceedings of the Federated Conference on Computer Science and Information Systems. – 2018. – Vol. 15. – P. 485–488.
  3. Khairova N. F., Mamyrbayev O. Zh., Mukhsina K. Zh., Pilipenko A. A. Modeling the grammatical means of expressing the semantics of a fact in an English sentence // Proceedings of the III International Scientific Conference "Informatics and Applied Mathematics", dedicated to the 80th anniversary of Prof. R. G. Biyashev and the 70th anniversary of Prof. M. B. Aidarkhanov. – Almaty, 2018. – Vol. 2. – P. 136–144. (In Russian.)
  4. Khairova N. F., Mamyrbayev O. Zh., Izbasarov E. Zh., Mukhsina K. Zh. A formal model for evaluating the quality of knowledge extraction and identification from weakly structured text information // Proceedings of the scientific conference of the Institute of Information and Computational Technologies of the MES RK "Modern Problems of Informatics and Computational Technologies". – Almaty, 2018. – P. 306–310. (In Russian.)
  5. Mamyrbayev O. Zh., Mukhsina K. Zh., Khairova N. F., Kolesnyk A. S. Linguistic tools for detecting criminally colored text information in web content // Herald of the Kazakh-British Technical University. – 2018. – No. 3(46). – P. 112–117. (In Russian.)
  6. Khairova N., Kolesnyk A., Mamyrbayev O., Mukhsina K. The Influence of Various Text Characteristics on the Readability and Content Informativeness // Proceedings of the 21st International Conference on Enterprise Information Systems (ICEIS). – 2019. – Vol. 1. – P. 462–469. – ISBN 978-989-758-372-8. – DOI: 10.5220/0007755004620469.
  7. Mamyrbayev O. Zh., Khairova N. F., Mukhsina K. Zh. Identifying criminally significant collocations in Kazakh-language texts // Bulletin of KazATK named after M. Tynyshpaev (recommended by the Committee for Control in Education and Science of the MES RK). – 2019. – No. 3(110). – P. 170–175. (In Kazakh.)
  8. Khairova N., Petrasova S., Mamyrbayev O., Mukhsina K. Detecting Collocations Similarity via Logical-Linguistic Model // Proceedings of the Workshop on Meaning Relations between Phrases and Sentences, Gothenburg, Sweden, May 23, 2019. – P. 15–22.
  9. Khairova N., Mamyrbayev O., Mukhsina K., Kolesnyk A. Logical-Linguistic Model for Multilingual Open Information Extraction // Cogent Engineering. – 2020. – Vol. 7, No. 1.
  10. Khairova N., Petrasova S., Mamyrbayev O., Mukhsina K. Open Information Extraction as Additional Source for Kazakh Ontology Generation // Proceedings of the Asian Conference on Intelligent Information and Database Systems (ACIIDS 2020), Phuket, Thailand, March 23–26, 2020. – Cham, 2020. – Part I. – P. 86–96.
  11. Khairova N., Kolesnyk A., Mamyrbayev O., Mukhsina K. The Aligned Kazakh-Russian Parallel Corpus Focused on the Criminal Theme // Proceedings of the Conference Computational Linguistics and Intelligent Systems (CoLInS 2019). – 2019. – P. 116–125.
  12. Khairova N., Kolesnyk A., Mamyrbayev O., Mukhsina K. An aligned Kazakh-Russian parallel corpus focused on the criminal theme // Bulletin of the Almaty University of Power Engineering and Telecommunications. – 2020. – No. 1(48). – P. 84–92. (In Russian.)
  13. Khairova N., Kolesnyk A., Mamyrbayev O., Petrasova S. Applying VSM to Identify the Criminal Meaning of Texts // COLINS 2020. – P. 20–31.
  14. Petrasova S., Khairova N., Lewoniewski W., Mamyrbayev O., Mukhsina K. Similar Text Fragments Extraction for Identifying Common Wikipedia Communities // Data. – 2018. – Vol. 3, No. 4. – P. 66. – DOI: 10.3390/data3040066.

Author's certificates:


Khairova N. F. Some aspects of the technology for identifying criminally significant information in multilingual text arrays / Khairova N. F., Mamyrbayev O. Zh., Mukhsina K. Zh. – Almaty: Institute of Information and Computational Technologies, 2020. – 92 p. (In Russian.)




Khairova N. F., Mamyrbayev O. Zh., Petrasova S. V., Mukhsina K. Zh. Modern technologies for processing text data based on the NLTK Python package: textbook / N. F. Khairova, O. Zh. Mamyrbayev, S. V. Petrasova, K. Zh. Mukhsina. – Kharkiv: ООО «В деле», 2020. – 134 p. (In Russian.)