Talk:Topic-based vector space model

On 10 June 2005, this article was nominated for deletion. See Wikipedia:Votes for deletion/Topic-based vector space model for a record of the discussion.


Plagiarism?

The second reference (http://kuropka.net/files/HPI_Evaluation_of_eTVSM.pdf) contains portions of Wikipedia's LSA article word for word:

Wikipedia:

Some of LSA's drawbacks include:

  • The resulting dimensions might be difficult to interpret. For instance, in {(car), (truck), (flower)} --> {(1.3452 * car + 0.2828 * truck), (flower)} the (1.3452 * car + 0.2828 * truck) component could be interpreted as "vehicle". However, it is very likely that cases close to {(car), (bottle), (flower)} --> {(1.3452 * car + 0.2828 * bottle), (flower)} will occur. This leads to results which can be justified on the mathematical level, but have no interpretable meaning in natural language.
  • LSA cannot capture Polysemy (i.e., multiple meanings of a word), because it represents each word as a single point in space.
  • The probabilistic model of LSA does not match observed data: LSA assumes that words and documents form a joint Gaussian model (ergodic hypothesis), while a Poisson distribution has been observed. Thus, a newer alternative is probabilistic latent semantic analysis, based on a multinomial model, which is reported to give better results than standard LSA.[4]

The eTVSM technical report:

Some general LSI drawbacks are:

  • The resulting dimensions might be difficult to interpret. This leads to results which can be justified on the mathematical level, but have no interpretable meaning in natural language;
  • LSA, in general, assumes that words and documents form a joint Gaussian model (a Poisson distribution is observed). A newer alternative is a probabilistic Latent Semantic Analysis [29] based on a multinomial model. It is reported to give better results than standard LSA.

This, together with the fact that the model does not appear to have been peer-reviewed in any established IR literature (only Business Information Systems 2003), weakens the article to the point that I do not feel it meets Wikipedia standards. Preceding unsigned comment added by 77.193.224.9 (talk) 23:55, 19 February 2010 (UTC)