A RUSSIAN-LANGUAGE TEXT CORPUS FOR TESTING TOPIC MODELING ALGORITHMS
Rubrics: ARTICLES
Abstract and keywords
Abstract (English):
This paper proposes SCTM-ru, a special corpus for testing topic modeling algorithms. Given the rapid growth in the amount of available data, the development of tools and systems for its automatic processing has become a pressing problem. Building such systems and testing the underlying algorithms requires suitable datasets. Freely available document collections, in particular Russian-language text corpora, are necessary for research on natural language processing methods that take the linguistic features of the language into account. Requirements for the special corpus are formulated: it must be distributed under a free license, the number of documents must be sufficient for research, it must include the full text of the documents in natural language, and it must contain the information required by topic modeling algorithms. A comparative analysis of existing Russian and foreign corpora is carried out, showing that their characteristics do not meet the stated requirements.
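
The requirements above imply that such a corpus must supply the full natural-language text of every document in a form a topic modeling algorithm can consume. Purely as an illustrative sketch (not code from the paper), the following Python fragment shows how a test corpus of this kind could be fed to an LDA model, one of the algorithms referenced below; the toy documents, the choice of scikit-learn, and the number of topics are assumptions made for the example.

# A minimal sketch (not the authors' code): fitting LDA, the kind of topic
# model the proposed SCTM-ru corpus is intended to test, on a toy stand-in corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Stand-in documents: full natural-language texts, as the corpus requirements
# demand. A real experiment would load SCTM-ru documents here instead.
documents = [
    "the corpus contains news articles about politics and economics",
    "topic models discover latent themes in large document collections",
    "lemmatization and morphological analysis help process russian text",
    "the economic news discussed markets, prices and inflation",
]

# Bag-of-words counts are the input representation LDA expects.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(documents)

# Fit a small LDA model; 2 topics is an arbitrary choice for the toy data.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)
print("document-topic distribution of the first document:", doc_topics[0])

# Inspect the top words of each topic, a typical qualitative check when
# evaluating a topic model against a test corpus.
terms = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-5:][::-1]]
    print(f"topic {topic_idx}: {', '.join(top)}")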

Keywords:
text corpora, topic model, natural language processing, Russian language
References

1. Papadimitriou C. H., Raghavan P., Tamaki H., Vempala S. Latent semantic indexing: A probabilistic analysis. 1998.

2. Hofmann T. Probabilistic Latent Semantic Indexing // Proc. 22nd Annual Int. SIGIR Conf. on Research and Development in Information Retrieval. 1999.

3. Blei D. M., Ng A. Y., Jordan M. I. Latent Dirichlet Allocation // J. Mach. Learn. Res. 2003.

4. Daud A., Li J., Zhou L., Muhammad F. Knowledge discovery through directed probabilistic topic models: a survey // Front. Comput. Sci. China. 2010. P. 280-301.

5. Nacional'nyy korpus russkogo yazyka (NKRYa) [Russian National Corpus]. URL: www.ruscorpora.ru (accessed 12.01.2015).

6. Zaharov V. P. Mezhdunarodnye standarty v oblasti korpusnoy lingvistiki [International standards in corpus linguistics] // Strukturnaya i prikladnaya lingvistika. 2012. No. 9. P. 201-221.

7. Granovskiy D. V., Bocharov V. V., Bichineva S. V. Otkrytyy korpus: principy raboty i perspektivy [The Open Corpus: working principles and prospects] // Komp'yuternaya lingvistika i razvitie semanticheskogo poiska v Internete: Proc. sci. seminar of the XIII All-Russian joint conf. "Internet i sovremennoe obschestvo", St. Petersburg, 19-22 Oct. 2010 / ed. V. Sh. Rubashkin. SPb., 2010. 94 p.

8. Otkrytyy korpus [OpenCorpora]. URL: opencorpora.org (accessed 10.01.2015).

9. Small corpus of Associated Press. URL: www.cs.princeton.edu/~blei/lda-c (accessed 6.01.2015).

10. The New York Times Annotated Corpus. URL: catalog.ldc.upenn.edu/LDC2008T19 (accessed 14.01.2015).

11. The 20 Newsgroups data set. URL: qwone.com/~jason/20Newsgroups (accessed 24.01.2015).

12. Reuters Corpora. URL: trec.nist.gov/data/reuters/reuters.html (accessed 24.01.2015).

13. Reuters-21578 Text Categorization Collection Data Set. URL: archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection (accessed 24.01.2015).

14. Vinogradova V. B., Kukushkina O. V., Polikarpov A. A., Savchuk S. O. Komp'yuternyy korpus tekstov russkih gazet konca 20-go veka: sozdanie, kategorizaciya, avtomatizirovannyy analiz yazykovyh osobennostey [A computer corpus of late-20th-century Russian newspaper texts: creation, categorization, automated analysis of language features] // Russkiy yazyk: istoricheskie sud'by i sovremennost': Int. congress of researchers of the Russian language, Moscow, Faculty of Philology, Lomonosov Moscow State Univ., 13-16 March 2001: Proc. and materials. M.: Izd-vo Moskov. un-ta, 2001. P. 398.

15. Komp'yuternyy korpus tekstov russkih gazet konca XX veka [Computer corpus of late-20th-century Russian newspaper texts]. URL: www.philol.msu.ru/~lex/corpus/corp_descr.html (accessed 24.01.2015).

16. Vencov A. V., Grudeva E. V. O korpuse russkogo literaturnogo yazyka (narusco.ru) [On the corpus of the Russian literary language] // Rus. lingvistika. 2009. Vol. 33, No. 2. P. 195-209.

17. Korpus russkogo literaturnogo yazyka [Corpus of the Russian literary language]. URL: www.narusco.ru (accessed 24.01.2015).

18. Hel'sinkskiy annotirovannyy korpus russkih tekstov HANKO [Helsinki Annotated Corpus of Russian texts, HANCO]. URL: www.helsinki.fi/venaja/russian/e-material/hanco/index.htm (accessed 24.01.2015).

19. Krizhanovsky A. A., Smirnov A. V. An approach to automated construction of a general-purpose lexical ontology based on Wiktionary // J. Comput. Syst. Sci. Int. 2013. Vol. 52, No. 2. P. 215-225.

20. Smirnov A. V., Kruglov V. M., Krizhanovskiy A. A., Lugovaya N. B., Karpov A. A., Kipyatkova I. S. Kolichestvennyy analiz leksiki russkogo WordNet i vikislovarey [Quantitative analysis of the lexicon of the Russian WordNet and wiktionaries] // Tr. SPIIRAN. 2012. Issue 23. P. 231-253.

21. Programma morfologicheskogo analiza tekstov na russkom yazyke MyStem [MyStem, a morphological analyzer for Russian texts]. URL: api.yandex.ru/mystem (accessed 12.12.2014).

22. Xu S., Shi Q., Qiao X. et al. Author-Topic over Time (AToT): a dynamic users' interest model // Mobile, Ubiquitous, and Intelligent Computing. Berlin: Springer, 2014. P. 239-245.

23. Ramage D., Hall D., Nallapati R., Manning C. D. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora // Proc. Conf. on Empirical Methods in Natural Language Processing. 2009. P. 248-256.

24. Wang X., McCallum A. Topics over Time: A Non-Markov Continuous-Time Model of Topical Trends // Proc. 12th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, Philadelphia, USA, Aug. 20-23, 2006.

25. Gruber A., Rosen-Zvi M., Weiss Y. Hidden Topic Markov Models // Proc. Artificial Intelligence and Statistics (AISTATS), San Juan, Puerto Rico, USA, March 21-24, 2007.

26. Zaharov V. P., Azarova I. V. Parametrizaciya special'nyh korpusov tekstov [Parametrization of special text corpora] // Strukturnaya i prikladnaya lingvistika: mezhvuz. sb. Issue 9. SPb.: SPbGU, 2012. P. 176-184.
