TY - JOUR
T1 - Search and classify topics in a corpus of text using the latent Dirichlet allocation model
AU - Iparraguirre-Villanueva, Orlando
AU - Sierra-Liñan, Fernando
AU - Salazar, Jose Luis Herrera
AU - Beltozar-Clemente, Saul
AU - Pucuhuayla-Revatta, Félix
AU - Zapata-Paulini, Joselyn
AU - Cabanillas-Carbonell, Michael
N1 - Publisher Copyright: © 2023 Institute of Advanced Engineering and Science. All rights reserved.
PY - 2023/4
Y1 - 2023/4
N2 - This work aims to discover topics in a text corpus and to classify the most relevant terms for each discovered topic. The process was performed in four steps: first, document extraction and data processing; second, labeling and training of the data; third, labeling of the unseen data; and fourth, evaluation of model performance. For processing, a total of 10,322 "curriculum" documents related to data science were collected from the web during 2018-2022. The latent Dirichlet allocation (LDA) model was used to analyze and structure the topics. After processing, 12 topics were generated, which allowed the most relevant terms to be ranked in order to identify the skills of each candidate. This work concludes that candidates interested in data science must, first of all, have technical skills: mastery of Structured Query Language (SQL), of programming languages such as R, Python, and Java, and of data management, among other tools associated with the technology.
AB - This work aims to discover topics in a text corpus and to classify the most relevant terms for each discovered topic. The process was performed in four steps: first, document extraction and data processing; second, labeling and training of the data; third, labeling of the unseen data; and fourth, evaluation of model performance. For processing, a total of 10,322 "curriculum" documents related to data science were collected from the web during 2018-2022. The latent Dirichlet allocation (LDA) model was used to analyze and structure the topics. After processing, 12 topics were generated, which allowed the most relevant terms to be ranked in order to identify the skills of each candidate. This work concludes that candidates interested in data science must, first of all, have technical skills: mastery of Structured Query Language (SQL), of programming languages such as R, Python, and Java, and of data management, among other tools associated with the technology.
KW - Classify
KW - Discovering
KW - Latent Dirichlet allocation
KW - Text corpus
KW - Topics
UR - http://www.scopus.com/inward/record.url?scp=85147159751&partnerID=8YFLogxK
U2 - 10.11591/ijeecs.v30.i1.pp246-256
DO - 10.11591/ijeecs.v30.i1.pp246-256
M3 - Original Article
AN - SCOPUS:85147159751
SN - 2502-4752
VL - 30
SP - 246
EP - 256
JO - Indonesian Journal of Electrical Engineering and Computer Science
JF - Indonesian Journal of Electrical Engineering and Computer Science
IS - 1
ER -