Analysis and Implementation of Similarity Measurement in Documents Using Semantic Methods

Authors

  • Satria Yudha Prayogi, Universitas Islam Sumatera Utara
  • Sony Bahagia Sinaga, STMIK Mulia Darma

Keywords:

Similarity, Document, Text Mining, Semantic

Abstract

The number of documents available in digital form keeps increasing. Documents may be related to one another, but one must not copy another without citing the source. A mechanism for detecting similarity is therefore needed. This research is limited to similarity between documents. It uses text mining techniques to categorize documents according to search keywords, and an indexing process to retrieve the documents that match those keywords. Semantics is a technique used by search engines to match keywords on one page with those on another page. The method has been widely used because it is precise and easy to apply. The weight values (W) of D1 and D2 are the same. When documents cannot be ranked by weight alone because their W values are equal, a further calculation using the vector-space model algorithm is needed. The idea of this method is to calculate the cosine of the angle between two vectors, namely the W of each document and the W of the keywords. The results show that document 3 (D3) has the highest level of similarity to the keywords, followed by D2 and D1.
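
The vector-space step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, and the function names (tf_idf_vectors, cosine, rank) are assumptions made for illustration, and the sample texts in the usage note are invented rather than taken from the paper. Each document and the keyword query becomes a vector of term weights W, and documents are ranked by the cosine of the angle between their vector and the query vector.

```python
import math
from collections import Counter


def tf_idf_vectors(docs, query):
    """Build TF-IDF weight vectors for each document and for the query.

    Assumption: tokens are lowercase words split on whitespace.
    """
    tokenized = [d.lower().split() for d in docs]
    query_tokens = query.lower().split()
    vocab = sorted(set(query_tokens).union(*tokenized))
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    df = {t: sum(1 for toks in tokenized if t in toks) for t in vocab}
    # Smoothed IDF (an assumed variant) so that no weight collapses to zero
    # merely because a term occurs in every document.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}

    def weigh(tokens):
        counts = Counter(tokens)
        return [counts[t] * idf[t] for t in vocab]

    return [weigh(toks) for toks in tokenized], weigh(query_tokens)


def cosine(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector shares no terms with anything
    return dot / (norm_u * norm_v)


def rank(docs, query):
    """Return (label, score) pairs sorted from most to least similar."""
    doc_vecs, q_vec = tf_idf_vectors(docs, query)
    scores = [(f"D{i + 1}", cosine(vec, q_vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

For example, rank(["apple banana", "cherry date", "apple apple"], "apple") returns the three labels D1 to D3 ordered by their cosine score against the query. Because the cosine normalizes by vector length, it can separate documents whose raw weight totals are equal.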

Published

2024-07-05

How to Cite

Prayogi, S. Y., & Sinaga, S. B. (2024). Analysis and Implementation of Similarity Measurement in Documents Using Semantic Methods. Pascal: Journal of Computer Science and Informatics, 1(02), 69-72. https://jurnal.devitara.or.id/index.php/komputer/article/view/87