Sistema de Consulta Abierta
Open query system with a semantic analysis module
Functions
- def export_corpus
- def import_corpus
- def import_model
- def export_model
vsm.extensions.interop.ldac

Module containing functions for import/export between VSM and lda-c, the original LDA implementation referenced in Blei, Ng, and Jordan (2003). lda-c is available at: http://www.cs.princeton.edu/~blei/lda-c/
def vsm.extensions.interop.ldac.export_corpus(corpus, outfolder, context_type='document')
Converts a vsm.corpus.Corpus object into an lda-c compatible data file.
Creates two files:
1. "vocab.txt" - contains the integer-word mappings
2. "corpus.dat" - contains the corpus object in the format described in
[lda-c documentation](http://www.cs.princeton.edu/~blei/lda-c/readme.txt):
> Under LDA, the words of each document are assumed exchangeable. Thus,
> each document is succinctly represented as a sparse vector of word
> counts. The data is a file where each line is of the form:
>
> `[M] [term_1]:[count] [term_2]:[count] ... [term_N]:[count]`
>
> where [M] is the number of unique terms in the document, and the
> [count] associated with each term is how many times that term appeared
> in the document. Note that [term_1] is an integer which indexes the
> term; it is not a string.
:param corpus: VSM Corpus object to convert to an lda-c file
:type corpus: vsm.corpus.Corpus
:param outfolder: Directory in which to write "vocab.txt" and "corpus.dat"
:type outfolder: string
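The line format quoted above can be illustrated with a minimal, standalone sketch. It does not use VSM or reproduce how export_corpus is implemented; the helper names `build_vocab` and `format_ldac_line` are hypothetical, and assigning term ids in order of first appearance is an assumption made only for this example.

```python
from collections import Counter


def build_vocab(documents):
    """Map each distinct word to an integer id, in order of first appearance."""
    vocab = {}
    for doc in documents:
        for word in doc:
            vocab.setdefault(word, len(vocab))
    return vocab


def format_ldac_line(doc, vocab):
    """Render one tokenized document as '[M] [term]:[count] ...'."""
    counts = Counter(vocab[word] for word in doc)
    pairs = " ".join(f"{term}:{count}" for term, count in sorted(counts.items()))
    # M is the number of unique terms, not the total token count.
    return f"{len(counts)} {pairs}".rstrip()


# Toy corpus (hypothetical data, for illustration only).
documents = [["cat", "sat", "cat"], ["dog", "sat"]]
vocab = build_vocab(documents)
lines = [format_ldac_line(doc, vocab) for doc in documents]
print("\n".join(lines))
```

In this sketch the first document has two unique terms, so its line starts with 2 even though it contains three tokens.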
def vsm.extensions.interop.ldac.import_corpus(corpusfilename, vocabfilename, context_type='document', path=None)
Converts an lda-c compatible data file into a VSM Corpus object.
:param corpusfilename: Path to the corpus file, as defined in the lda-c documentation
:type corpusfilename: string
:param vocabfilename: Path to the vocabulary file, one word per line
:type vocabfilename: string
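Reading the format back can be sketched in the same spirit. This is a standalone illustration of the lda-c line format, not the implementation of import_corpus; `parse_ldac_line` and `expand_document` are hypothetical helpers, and the vocabulary list stands in for a vocabulary file with one word per line.

```python
def parse_ldac_line(line):
    """Parse '[M] [term]:[count] ...' into a {term_id: count} dict."""
    fields = line.split()
    if not fields:
        raise ValueError("empty line")
    n_unique = int(fields[0])
    counts = {}
    for field in fields[1:]:
        term, count = field.split(":")
        counts[int(term)] = int(count)
    # The leading [M] must match the number of term:count pairs.
    if len(counts) != n_unique:
        raise ValueError(f"expected {n_unique} unique terms, found {len(counts)}")
    return counts


def expand_document(counts, vocab_words):
    """Rebuild a bag of words; the original word order is not recoverable."""
    tokens = []
    for term in sorted(counts):
        tokens.extend([vocab_words[term]] * counts[term])
    return tokens


# Hypothetical vocabulary, as if read from a file with one word per line.
vocab_words = ["cat", "sat", "dog"]
counts = parse_ldac_line("2 0:2 1:1")
tokens = expand_document(counts, vocab_words)
print(tokens)
```

Because the format stores only counts, a round trip through it preserves which words occur and how often, but not their order within the document.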