Project No. AP05132950. Development of an information and analytical data retrieval system in the Kazakh language – Institute of Information and Computational Technologies

Project manager and members:

The project manager is PhD Rakhimova Diana Ramazanovna.

Key members of the study group:

Doctor of Technical Sciences, Professor Tukeyev Ualsher Anuarbekovich,

Junior researcher Zhumanov Zh.M.,

Junior researcher Shormakova A.N.,

Engineer Turganbayeva A.O.,

Engineer Abduali B.,

Engineer Amirova D.

The aim of the project:

To develop effective algorithms and models for text data processing, based on modern natural language processing technologies and the latest advances in computational linguistics, in order to obtain new information and knowledge from unstructured sources, large data sets, and texts in the Kazakh language.

To achieve this goal, the project accomplished the following tasks:

A complete classification system for the endings and suffixes of the Kazakh language was developed. A lexicon-free algorithm was then built on this classification system of Kazakh endings. The distinctive features of the algorithm are its speed and its relative ease of reproduction.
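The lexicon-free idea can be illustrated with a minimal sketch, assuming it takes the form of the suffix-stripping (stemming) approach named in the project's publication list. The names `PLURAL`, `CASE`, `ENDINGS`, `MIN_STEM_LEN`, and `stem` are hypothetical, and the endings are a tiny hand-picked subset of Kazakh plural and case endings, not the project's classification system. The sketch only shows that a word can be reduced to a stem by matching it against a precomputed set of endings, without consulting a dictionary of stems.

```python
# Hypothetical, tiny subset of Kazakh endings, for illustration only.
# The project's full classification system is not reproduced here.
PLURAL = ["лар", "лер", "дар", "дер", "тар", "тер"]
CASE = ["да", "де", "та", "те",   # locative
        "ға", "ге", "қа", "ке",   # dative
        "ны", "ні", "ды", "ді", "ты", "ті"]  # accusative

# Precompute every ending the sketch knows about:
# plural alone, case alone, and plural followed by case.
ENDINGS = set(PLURAL) | set(CASE) | {p + c for p in PLURAL for c in CASE}

# Assumed guard against over-stripping very short words.
MIN_STEM_LEN = 2


def stem(word):
    """Strip the longest known ending; no stem dictionary is consulted."""
    word = word.lower()
    # Try longer endings first so that "терде" wins over "де".
    for ending in sorted(ENDINGS, key=len, reverse=True):
        if word.endswith(ending) and len(word) - len(ending) >= MIN_STEM_LEN:
            return word[:-len(ending)]
    return word


if __name__ == "__main__":
    for w in ["кітаптар", "мектептерде", "балалар", "ата"]:
        print(w, "->", stem(w))
```

Here `stem("мектептерде")` strips the combined plural-plus-locative ending in one step, while a short word such as `"ата"` is left unchanged because stripping would leave a one-letter stem. Matching against a precomputed set is one plausible reason such an approach can be fast and easy to reproduce.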

A model and system for an annotated corpus of the Kazakh language were developed; their distinctive feature is the set of data processing modules (tokenization, lemmatization, morphological analysis) designed around the features of the Kazakh language;

An algorithm for the automatic replenishment of texts in the Kazakh language and an algorithm for indexing documents by attributes were developed;

A knowledge base of synonyms and phrases for the Kazakh language was developed, with phrases classified by structural formation and by type of purpose; it improves the quality of the information-analytical search system;

An information-analytical processing module was developed as an application software solution for various purposes, using artificial intelligence to process and analyze both structured and unstructured big data. The algorithms and methods of this module can be applied individually or in combination to analyze large volumes of text data:

– An algorithm for extracting keywords (key phrases) from documents in the Kazakh language;

– An algorithm for semantic analysis of text using machine learning;

– A method for summarizing text in the Kazakh language;

The architecture was designed and a prototype of the information-analytical retrieval system was developed, taking into account modern technologies and methods in information retrieval and semantic natural language processing. Sub-modules of the search system's information retrieval module were developed. A flexible information system architecture was chosen as the technological solution. All program modules of the system are connected through integration modules (intermediate data stores), which serve as connecting links and yield a loosely coupled architecture. This design makes the modules relatively easy to scale and upgrade.
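The loosely coupled design described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: `IntermediateStore`, `tokenize_module`, `index_module`, and `run_pipeline` are hypothetical names, the store is a plain in-memory dictionary standing in for an intermediate data store, and tokenization is a naive whitespace split. The point shown is only that modules never call each other directly; each reads its input from the store and writes its output back.

```python
class IntermediateStore:
    """Hypothetical in-memory stand-in for an intermediate data store."""

    def __init__(self):
        self._data = {}

    def write(self, name, records):
        self._data[name] = records

    def read(self, name):
        return self._data[name]


def tokenize_module(store):
    """Read raw documents and write lowercase whitespace tokens."""
    docs = store.read("raw")
    store.write("tokens", [doc.lower().split() for doc in docs])


def index_module(store):
    """Read token lists and write an inverted index: token -> sorted doc ids."""
    index = {}
    for doc_id, tokens in enumerate(store.read("tokens")):
        for token in tokens:
            index.setdefault(token, set()).add(doc_id)
    store.write("index", {t: sorted(ids) for t, ids in index.items()})


def run_pipeline(docs):
    """Run the modules in order; they share data only through the store."""
    store = IntermediateStore()
    store.write("raw", docs)
    for module in (tokenize_module, index_module):
        module(store)
    return store.read("index")


if __name__ == "__main__":
    print(run_pipeline(["Қазақ тілі", "қазақ әдебиеті"]))
```

Because each module depends only on the store's `read` and `write` methods and on agreed record names, one module can be replaced or scaled without changing the others, which is the property the paragraph above attributes to the integration modules.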

Publications:

The main results of the project’s research and technical activities are presented in the following publications:

Publications indexed in the Web of Science and/or Scopus databases:

Rakhimova, D., Turganbayeva, A. Auto-abstracting of texts in the Kazakh language // Proceedings of the 6th International Conference on Engineering & MIS. – 2020. – P. 1-5 // https://doi.org/10.1145/3410352.3410832.
Shormakova A., Zhumanov Zh., Rakhimova D. Post-editing of words in Kazakh sentences for information retrieval // Journal of Theoretical and Applied Information Technology. – 2019. – Vol. 96, №6. – P. 1896-1908
Rakhimova D., Turganbayeva A. Lemmatization of big data in the Kazakh language // Proceedings of the 5th International Conference on Engineering and MIS (ICEMIS2019). – 2019. – P. 73-77.
Shormakova A., Zhumanov Zh., Abduali B., Rakhimova D., Amirova D. Analytical Processing of Textual Resources and Documents in the Kazakh Language // Journal of Engineering and Applied Sciences. – 2019. – Vol. 14, Issue 20. – P. 7714-7721. // DOI: 10.36478/jeasci.2019.7714.7721
Rakhimova D., Shormakova A. Problems of semantics of words of the Kazakh Language in the information retrieval // Lecture Notes in Artificial Intelligence, Computational Collective Intelligence. – Springer, 2019. – Vol. 11684, Part II. – P. 70-81. // https://doi.org/10.1007/978-3-030-28374-2_7
Ualsher Tukeyev, Diana Rakhimova, Aliya Turganbayeva, Dina Amirova, Balzhan Abduali, Aidana Karibayeva. Lexicon-free stemming for Kazakh language information retrieval // IEEE 12th International Conference on Application of Information and Communication Technologies. – Almaty, 2018. – P. 95-98.

Publications recommended by CCES of RK:

Rakhimova D., Turganbayeva A. Semantic analysis of the Kazakh language based on the approach of neural networks // News of the national academy of sciences of the Republic of Kazakhstan, Physico-mathematical series. – 2020. – Vol. 5, No 333. – P. 68-75 // https://doi.org/10.32014/2020.2518-1726.84.
Рахимова Д.Р., Турганбаева А.О. Задача нормализации слов казахского языка // Научно-технический вестник информационных технологий, механики и оптики. – 2020. – Т. 20. – № 4. – С. 545-551 Санкт -Петербург, Россия // doi: 10.17586/2226-1494-2020-20-4-545-551.
Рахимова Д.Р., Сатыбалдиев А.Р. Алгоритм сбора текстовых данных на казахском языке // Вестник КазНПУ им. Абая. Серия «Физико-математические науки». – 2020. – № 2 (70). – С. 261-267.
Абдуали Б.А., Әмірова Д.Т., Рахимова Д.Р., Кәрібаева А.С. Аналитическая обработка текстовых ресурсов и документов на казахском языке // Вестник КазНИТУ. – 2019. – №2 (132). – C. 356-362.
Рахимова Д.Р., Шормакова А.Н., Тұрғанбаева Ә.О. Разработка электронных ресурсов для казахского языка // Вестник КазНИТУ. – 2019. – №3 (133). – C. 161-166.
А.Н. Шормакова. Екі табиғи тілдегі аударылған мәтінді туралау // Вестник КазНИТУ. – 2018. –№4(128). –C. 344-349.

Proceedings of international conferences:

Рахимова Д.Р., Турганбаева А.О., Сатыбалдиев А. Исследование подходов по извлечению ключевых слов из текста // Матер. V Межд. науч. конф. “Информатика и прикладная математика”. – Алматы, 2020. – С. 252-258.
Рахимова Д.Р., Аблатип А.Ж., Мәтіндердегі террористік бағыттағы сөздерді анықтау // Сб. ст. по матер. CLXVI междунар. науч.-практ. конф. «Молодой исследователь: вызовы и перспективы». – М., Изд. «Интернаука», 2020. – № 19 (166). – С. 439-444.
Рахимова Д.Р., Жуманов Ж.М. Разработка архитектуры информационно-аналитической поисковой системы обработки данных на казахском языке // Матер. науч. конф. «Современные проблемы информатики и вычислительных технологий». – Алматы: ИИВТ МОН РК, 2020. – С. 287-291.
Abduali B., Karibayeva A., Amirova D. Formation of the synthetic corpora for Kazakh on the base of endings complete system // Proceedings of the Sixth International Conference on Computer Processing of Turkic Languages «TurkLang-2018». – Tashkent, Uzbekistan, 2018. – P. 114-120 (owing to late printing, the publication was not included in 2018).
Рахимова Д.Р., Нурхан А.К. Исследование и создание размеченного корпуса текстов для казахского языка // Сборник матер. Шестой Междунар. конф. по компьютерной обработке тюркских языков «TurkLang-2018». – Ташкент, Узбекистан, 2018. – C. 127-133 (owing to late printing, the publication was not included in 2018).
Рахимова Д.Р., Сейтжаппар М.А. Қазақ тілінің автоматтандырылған маркерлік корпусын әзірлеу // Матер. науч. конф. ИИВТ МОН РК «Cовременные проблемы информатики и вычислительных технологий». – Алматы, 2019. – C. 66-74.
Amirova D., Karibayeva A., Rakhimova D. Problems of lexical polysemy for the Kazakh language // Proceedings of the 3rd International Scientific Conference «Informatics and Applied Mathematics», dedicated to the 80th anniversary of Prof. R.G. Biyashev and the 70th anniversary of Prof. M.B. Aidarkhanov. – Almaty, 2018. – Part 2. – P. 18-28.
Рахимова Д.Р. Жомартова Л.М., Мусаев М.С., Семантический поиск на основе модели векторного представления слов // Матер. 3-й междунар. науч. конф. «Информатика и прикладная математика» посв. 80-летию проф. Бияшева Р.Г. и 70-летию проф. Айдарханова М.Б. – Алматы, 2018. –Ч.2– C. 95-103.
Рахимова Д., Жуманов Ж., Давлетова С. Экономическая эффективность комплексной технологии расширения ресурсов для казахского языка. // Матер. 14-й междунар. азиатской школы-семинара «Проблемы оптимизации сложных систем».- Кыргызская республика, 2018. –Ч. 2 , – C. 151-159.
Рахимова Д.Р. Жомартова Л.М., Исследование реккурентных нейронных сетей для моделирования естественных языков флективных классов // Матер. науч. конф. Института информационных и вычислительных технологий МОН РК «Современные проблемы информатики и вычислительной технологий». – Алматы,2018. – C. 103-107.
Рахимова Д.Р. Мусаев М.С., Особенности обработки текстов естественного языка в разработке интеллектуальной поисковой системы. // Матер. науч. конф. Института информационных и вычислительных технологий МОН РК «Современные проблемы информатики и вычислительной технологий». –Алматы, 2018. – C. 185-189.

Books:

«Вычислительная обработка казахского языка» ("Computational Processing of the Kazakh Language")

Expert opinion

Expert opinion on the information-analytical data retrieval system in the Kazakh language, carried out under project AP05132950

Practical results

Kazakh ASR

As a result of this research, a speech recognition mobile application for teaching the Kazakh language has been implemented. This mobile application, developed by IICT, is made by KazVoice and is available to users in test mode. To use the application, go online to https://t.me/kazakhASRB.t. To record speech, press the microphone button; speech signals are then received from the microphone. The signals are read automatically, after which the user can see the result as text.