Talk:Vector space model

Overcoming the challenge of self-promotion

In this article, there is a strong tendency to highlight very specific models which perform "better than the standard vector space model".

For example, I removed the link to the "topic-based" vector space model, which appears to be self-promotion, plagiarism, or both. When you follow the links cited here, some of those articles are not strongly referenced in other literature. For whatever reason, this happens more in this article than in most other articles on machine learning. — Preceding unsigned comment added by 98.207.93.61 (talk) 21:52, 31 May 2013 (UTC)

Limitations

This article talks about the limitations of the vector space model but doesn't discuss alternative approaches that might fare better on these aspects. I think the article would be much improved if someone with relevant expertise added an "Alternative Approaches" section, or at least a list of links to articles about other ways of tackling the text-indexing problem. I take it that an Inverted Index (already linked in the See Also section) is one such alternative, but it would still be informative to know how those two compare (and what other approaches there are). Brianwc (talk) 19:41, 23 January 2010 (UTC)

I agree that the limitations section could use some work - in addition to there being alternative approaches, there are also many ways to work around these limitations while still taking advantage of a vector space model. The S-Space Package, for example, makes heavy use of vector space models, but its algorithms manage to overcome some of these limitations (all of them, I believe, except for the loss of the order in which terms appear) using clever mathematical and linguistic tricks. I don't have time to work on it now (because I'm too busy with a thesis heavily employing vector space models!), but for the time being I have put a little note clarifying that these limitations are not impossible to overcome. Sir Tobek (talk) 01:06, 17 August 2011 (UTC)

"using clever mathematical and linguistic tricks". It is rather stunning that terms "kernel" and "eigen" do not appear in this article once. In the case of kernel methods, I'm assuming this is just because kernel methods are a challenging topic and will be difficult to explain to a novice reader. Nonetheless, kernel methods and eigenvectors are standard machine learning techniques that can overcome limitations in generalized Euclidean vector space. Also the part of speech problem is not addressed, which is commonly handled by CRF, SVM, or a combination of both. Again these are standard in similarity search systems.

While I appreciate that this Wikipedia page is concise, it is also too terse for a motivated reader to explore the topic further. At present, there is little guidance on how someone would deal with (1) categorical attributes, (2) "n-grams", or (3) "part of speech". Many web documents will contain the independent tokens "I ran contra" with high term frequency, as opposed to "Iran contra" as the user intended. Tokens in sequence -- with respect to part of speech and sequential co-occurrence -- are commonplace in NLP search systems.
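On point (2), here is a minimal sketch of what adding word n-grams to the bag of words buys you; the function names are illustrative rather than from any particular library:

    def ngrams(tokens, n):
        """Return the n-grams (as tuples) of a token sequence."""
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def features(text, max_n=2):
        """Unigram and bigram features of a lowercased, whitespace-tokenized text."""
        tokens = text.lower().split()
        feats = []
        for n in range(1, max_n + 1):
            feats.extend(ngrams(tokens, n))
        return set(feats)

    a = features("I ran contra to expectations")
    b = features("the Iran contra affair")

    # Unigrams alone match both documents on "contra"; the bigram
    # ("iran", "contra") occurs only in b, so order disambiguates them.
    print(("iran", "contra") in a, ("iran", "contra") in b)  # False True

Part of speech and longer-range dependencies need the sequence models mentioned above (CRFs and the like), but even bigrams recover some of the ordering that the plain model throws away.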

Is this page intended to cover all types of vector space models, text-based searching (the common use case), or a series of pointers to the more advanced techniques? — Preceding unsigned comment added by 98.207.93.61 (talk) 23:23, 31 May 2013 (UTC)