
Project: №AP09259556 Development of methods and systems for integrated learning and natural language processing, based on artificial intelligence technologies

Project manager and members:

Project Manager – PhD Rakhimova Diana Ramazanovna

Senior Research Fellow, PhD А.С. Karibayeva

Senior Research Fellow, PhD М. Turdalyuly

Senior Research Fellow, Candidate of Technical Sciences Y.R. Suleimenov

Junior Research Fellow A.О. Turganbayeva

Junior Research Fellow А. Suleimenova

Software Engineer N. Lonovenko

Software Engineer D. Suleimenov

Project goal:

The goal of the project is to create technology (algorithms, methods, electronic resources) for a system for processing and studying the state language using modern methods and approaches of artificial intelligence, adapted to the peculiarities of the Kazakh language.

Project tasks:

To achieve this goal, it is necessary to solve the following main tasks:

– Creation of large data sets both for user training tasks and for artificial intelligence tasks such as machine translation, speech recognition and deep learning.

– Development of an intelligent “alignment” algorithm for identifying parallel sentence pairs in parallel texts.

– Development of an automated morphological analyzer for text processing.

– Development and integration of services and modules for studying the Kazakh language with machine translation and speech recognition systems.

– Creation of Internet services and applications for the practical use of the obtained tools and algorithms in real life.

Results:

The following scientific and technical results were obtained:

  • Text data was collected using a web scraping system that automatically acquires material on a topic of interest from the Internet.
  • A method for aligning a parallel corpus has been developed. The method performs alignment in two stages: the first stage uses the Hunalign tool, and the second stage is based on a dictionary.

As a result of this work, the following linguistic data were collected and processed:

– over 100 thousand small texts in the Kazakh language: news, materials from magazines, etc.

– over 300 books in the Kazakh language, Kazakh and foreign authors, including fiction, collections of songs, books on self-development, business, etc.

– more than 2 million Kazakh-Russian parallel sentences

– 200 thousand Kazakh-Russian dictionary entries.
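The second, dictionary-based alignment stage mentioned above could look roughly like the following. This is only a minimal sketch under stated assumptions, not the project's actual implementation: the function names, the scoring rule (share of known source words whose translation appears in the target sentence), the threshold value and the toy dictionary are all hypothetical. It assumes the first stage (for example, Hunalign) has already produced candidate sentence pairs.

```python
import re


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", sentence.lower())


def dictionary_score(src_sentence, tgt_sentence, dictionary):
    """Return the share of dictionary-known source tokens that have
    at least one listed translation present in the target sentence.

    `dictionary` maps a source word to a set of target words (hypothetical format).
    """
    src_tokens = tokenize(src_sentence)
    tgt_tokens = set(tokenize(tgt_sentence))
    known = [tok for tok in src_tokens if tok in dictionary]
    if not known:
        return 0.0
    hits = sum(1 for tok in known if dictionary[tok] & tgt_tokens)
    return hits / len(known)


def filter_pairs(candidate_pairs, dictionary, threshold=0.5):
    """Keep only candidate pairs whose dictionary score reaches the threshold."""
    return [
        (src, tgt)
        for src, tgt in candidate_pairs
        if dictionary_score(src, tgt, dictionary) >= threshold
    ]


# Toy Kazakh-Russian data, invented purely for illustration.
toy_dictionary = {"мен": {"я"}, "кітап": {"книга", "книгу"}}
toy_pairs = [
    ("Мен кітап оқимын", "Я читаю книгу"),
    ("Мен кітап оқимын", "Сегодня хорошая погода"),
]
kept = filter_pairs(toy_pairs, toy_dictionary)
```

Exact word matching is a simplification: Kazakh is agglutinative, so a practical version would probably compare stems or lemmas rather than surface forms.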

For Kazakh language processing tools, approaches based on neural networks and deep learning were developed, and the following work was carried out:

  • A morphological analyzer has been developed for the Kazakh language based on machine learning;
  • Neural machine translation has been developed for the English-Kazakh and Russian-Kazakh language pairs, based on RNN, BRNN and Transformer models;
  • An approach has been developed for recognition and synthesis of speech in the state language, based on machine learning (BLSTM, ResNet).

The research was accompanied by software implementation of the approaches and testing of the algorithms. The results were evaluated using standard metrics such as BLEU and TER for machine translation and WER for speech recognition.
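As an illustration of one of these metrics, word error rate (WER) is commonly defined as the word-level edit distance between a reference transcript and a system hypothesis, divided by the number of reference words. The sketch below follows that standard definition; the function name and the whitespace tokenization are assumptions made for the example, not details taken from the project.

```python
def word_error_rate(reference, hypothesis):
    """Compute WER = (substitutions + deletions + insertions) / reference length.

    Words are obtained by whitespace splitting (an assumption of this sketch).
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1] / len(ref_words)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.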

The practical result of the project is a web application called “Oqulyq”. The research results obtained within the project were tested and introduced into the educational process of the disciplines “Language Resources”, “Machine Translation Technologies” and “Machine Learning in Natural Language Processing” of the master’s educational program 7M06101-“Computational Linguistics” at Al-Farabi KazNU. They were also introduced into the educational process of the discipline “Foreign Language” (professional) for 1st-year master’s students in the educational programs 7M06101-“Software Engineering” and 7M07204-“Technology and Engineering of Food Production” of the International University of Engineering and Technology.

Based on the results of the project for 2021-2023, 26 publications were produced, including: 6 in foreign publications indexed in the WoS and/or Scopus databases; 2 in domestic publications recommended by CQASES MES RK (Committee for Quality Assurance in the Sphere of Education and Science of the Ministry of Education and Science of the Republic of Kazakhstan); one monograph in a domestic publication; and one collective monograph in a foreign publication. Three copyright certificates were received for the developed computer programs. The results of the study were presented at international conferences and scientific seminars.

Video description of the system "Oqulyq"