Talk:Latent semantic analysis
WikiProject Linguistics / Applied Linguistics (Rated C-class)


WikiProject Statistics (Rated C-class, Low-importance)


The contents of the Latent semantic indexing page were merged into Latent semantic analysis. For the contribution history and old versions of the redirected page, please see its history; for the discussion at that location, see its talk page. 
Orthogonal matrices
It is not possible that U and V are orthogonal matrices, since they are not square matrices. Nulli (talk) 14:53, 23 November 2017 (UTC)
Rank
The section on the dimension reduction ("rank") seems to underemphasize the importance of this step. The dimension reduction isn't just some noise-reduction or cleanup step; it is critical to the induction of meaning from the text. The full explanation is given in Landauer & Dumais (1997): A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge, published in Psychological Review. Any thoughts? Adono 02:54, 3 May 2006 (UTC)
Derivation
I added a section describing LSA using SVD. If LSA is viewed as something broader, and does not necessarily use SVD, I can move this elsewhere.  Nils Grimsmo 18:16, 5 June 2006 (UTC)
A few notes about the LSA-using-SVD description: the symbol capital sigma is incredibly confusing because of the ambiguity between sigma and the summation symbol. Many sources use the capital letter S instead. Also, I believe the part about comparing documents by their concepts is wrong. You cannot simply use cosine on the d-hat vectors; instead you need to multiply the S (sigma) matrix by the matrix V (non-transposed) and compare the columns of the resulting matrix. This weights the concepts appropriately. Otherwise, you are weighting the rth concept row as much as the first concept row, even though the algorithm clearly states that the first concept row is much more important. Schmmd (talk) 17:28, 13 February 2009 (UTC)
It would also be cool to have discussion about the pros/cons of different vector comparisons (i.e. for document similarity) such as cosine, dot product, and Euclidean distance. Also, in practice it seems that the sigma matrix has dimension l = n, where n is the number of documents. Is this always true? Why is this not mentioned? Schmmd (talk) 17:43, 13 February 2009 (UTC)
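To make the weighting point above concrete, here is a minimal numpy sketch; the term-document matrix is made up purely for illustration, and k = 2 is an arbitrary user choice:

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up 5-term x 4-document count matrix, purely illustrative.
X = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 2, 3, 1],
              [0, 0, 1, 2],
              [3, 1, 0, 1]], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # rank chosen by the user
Uk, Sk, Vtk = U[:, :k], np.diag(s[:k]), Vt[:k, :]

# Weight the concept coordinates by the singular values before comparing:
docs = Sk @ Vtk                       # each column is one document
sim = cosine(docs[:, 0], docs[:, 2])  # similarity of documents 0 and 2

# Sanity check: the truncated factors give the best rank-k approximation,
# with Frobenius error sqrt(s_{k+1}^2 + ...) (Eckart-Young).
err = np.linalg.norm(X - Uk @ docs)
```

Comparing columns of S_k V_k^T scales each concept by its singular value, which is exactly the weighting being discussed; comparing raw rows of V^T would treat all concepts as equally important.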
I'm thinking of removing the comment that t_i^T t_p gives the "correlation" between two terms, and the corresponding remark for documents. The most frequently used definition of correlation is Pearson's correlation coefficient, and after taking the inner product you'd need to subtract mean(t_i)*mean(t_p) and then divide the whole thing by sd(t_i)*sd(t_p), thus rescaling to be in [-1, 1], to get the correlation. From what I've read, the reason for using the SVD on the term-document (or document-term) matrix is to find low-rank approximations to the original matrix, where the closeness of the approximation is measured by the Frobenius norm on matrix space; this is actually referenced in the derivation. Any comments? Loutal7 (talk) 01:58, 17 December 2014 (UTC)
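For what it's worth, the difference between the raw inner product and Pearson's correlation is easy to demonstrate; a small sketch with made-up term vectors:

```python
import numpy as np

# Two made-up term vectors (rows of a term-document matrix), illustrative only.
t_i = np.array([3.0, 0.0, 1.0, 2.0])
t_p = np.array([1.0, 2.0, 0.0, 1.0])

inner = t_i @ t_p  # what the article currently calls the "correlation"

def pearson(a, b):
    # Centre each vector, then normalise by the (scaled) standard deviations.
    a_c, b_c = a - a.mean(), b - b.mean()
    return (a_c @ b_c) / (np.linalg.norm(a_c) * np.linalg.norm(b_c))

r = pearson(t_i, t_p)  # always falls in [-1, 1]; the raw inner product need not
```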
Removing Spam Backlinks
Removed link to 'Google Uses LSA in Keyword Algorithms: Discussion' (which pointed to www.singularmarketing.com/latent_semantic_analysis ), as this seems to be an out of date link. —Preceding unsigned comment added by 88.108.193.212 (talk • contribs) 08:15, 4 September 2006
 I just removed two more obvious spam backlinks. One was an unsigned link to a Google Books page (no IP, no nothing, just a mass of text and a link to a book for sale), and the other was an 8-year-old link to a dead and irrelevant Blogger page. I'm going to change this section's title to "Removing Spam Backlinks", as that seems to be something that people should regularly do here. Jonny Quick (talk) 04:23, 8 August 2015 (UTC)
Intuitive interpretation of transformed document-term space
Many online and printed text resources describe the mechanics of the LSI via SVD, and the computational benefits of dimension reduction (reducing the rank of the transformed document-term space).
What these resources are not good at is talking about the intuitive interpretation of the transformed document-term space, and furthermore the dimension-reduced space.
It is not good enough to say that the procedure produces better results than pure term-matching information retrieval; it would be helpful to understand why.
Perhaps a section called "intuitive interpretation" would help. Bing Liu is the author who comes closest, but doesn't succeed in my opinion [1]. —Preceding unsigned comment added by 82.68.244.150 (talk) 03:22, 4 October 2007 (UTC)
"After the construction of the occurrence matrix, LSA finds a lowrank approximation "[edit]
The current article states "After the construction of the occurrence matrix, LSA finds a lowrank approximation "
I'm not sure this is true. The LSA simply transforms the document-term matrix to a new set of basis axes. The SVD decomposition of the original matrix into 3 matrices is the LSA. The rank lowering is done afterwards by truncating the 3 SVD matrices. The LSA does not "find a low-rank approximation". The rank to which the matrices are lowered is set by the user, not by the LSA algorithm. —Preceding unsigned comment added by 82.68.244.150 (talk) 03:54, 11 October 2007 (UTC)
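A quick numerical illustration of this point (the matrix is random and made up): the SVD itself is an exact factorization, and the approximation only enters when the user truncates to a chosen rank k:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 4))  # made-up term-document matrix

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# The SVD itself is an exact factorisation of X; nothing is approximated yet.
exact = np.allclose(U @ np.diag(s) @ Vt, X)

# The low-rank approximation appears only when the user truncates to k.
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
residual = np.linalg.norm(X - X_k)
```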
Exact solution?
The "implementation" section currently says: "A fast, incremental, low-memory, large-matrix SVD algorithm has recently been developed (Brand, 2006). Unlike Gorrell and Webb's (2005) stochastic approximation, Brand's (2006) algorithm provides an exact solution."
I don't think it's possible to have a polynomial-time exact solver for an SVD, because it reduces to root-finding, as with eigenvalue solvers. So SVD solvers must iterate until a tolerance is reached, in which case I'm not sure about this "exact solution" claim. Glancing at the referred article, it looks like it's solving a much different problem than a general-purpose SVD algorithm. Maybe I'm confused about the context of this "implementation" section.
As Latent Semantic Analysis is very nearly equivalent to Principal Components Analysis (PCA), and they both usually rely almost in whole on an SVD computation, maybe we could find some way to merge and/or link these articles together in some coherent fashion. Any thoughts? —Preceding unsigned comment added by Infzy (talk • contribs) 17:02, 17 August 2008 (UTC)
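On the PCA connection: as I understand it, the main mechanical difference is that PCA applies the SVD after centring each term's counts across the documents, while LSA factorises the raw matrix; a sketch with a made-up matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((8, 5))  # made-up term-document matrix (terms x documents)

# PCA centres each term's counts across the documents before factorising;
# LSA applies the SVD to X directly. The machinery is otherwise the same.
Xc = X - X.mean(axis=1, keepdims=True)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# The squared singular values of the centred matrix, divided by (n - 1),
# are the principal-component variances.
n = X.shape[1]
pc_var = s**2 / (n - 1)

# Cross-check against an explicit covariance-matrix eigendecomposition.
cov = Xc @ Xc.T / (n - 1)
eig = np.linalg.eigvalsh(cov)[::-1]  # descending order
```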
Relation to Latent Semantic Indexing page?
This page overlaps a lot with the Latent semantic indexing page; it even links there. Should they perhaps be merged? gromgull (talk) 17:44, 23 June 2009 (UTC)
 Yes, they should. They're actually near-synonyms; I believe the difference is that in LSI, one stores an LSA model in an index. Qwertyus (talk) 15:15, 13 July 2012 (UTC)
 I agree. LSI was the term coined first as the SVD technique was used for document indexing and retrieval. LSA became used as the technique was applied to other textual analysis challenges such as recognizing synonyms. Dominic Widdows (talk) 16:43, 11 September 2012 (UTC).
 It absolutely should. The Latent Semantic Indexing page explains the critical weighting methodology that is essential before computing the SVD. LSA is incomplete without this invaluable information. Also, rank-reduction is well explained on the LSI page. Jobonki (talk) 19:38, 16 January 2014 (UTC)
Reference LSA implementations
The user Ronz recently cleaned up many of the external links on the page. However, they removed several links to LSA implementations. After discussing with Ronz, I would like to raise the issue of whether such links to quality implementations should be re-added, and if so what implementations to include. The two implementations that were removed were
 The Sense Clusters package. This is a mature implementation that incorporates LSA into a larger Natural Language Processing framework. The implementation has also been used for several academic papers, which provide further detail into how LSA can be applied.
 The S-Space Package, which is a newer implementation based on Statistical Semantics. The package has several other LSA-like implementations and support for comparing them on standard test cases. In contrast to the Sense Clusters package, this shows how LSA can be applied to Linguistics and Cognitive Science research.
I argue that both links should be included for three reasons:
 they provide an open source implementation of the algorithm
 each highlights a specific, different application of LSA to a research area
 both provide further references to related literature in their areas
Although Wikipedia is not a how-to, including links to quality implementations is consistent with algorithmic descriptions and is done for many algorithms, especially those that are nontrivial to implement. See Latent Dirichlet allocation or Singular value decomposition for examples.
If there are further implementations that would be suitable, let us discuss them before any are added.
Juggernaut the (talk) 06:22, 2 November 2009 (UTC)
 Sorry I missed this comment.
 Thanks for addressing my concerns discussed on my talk page. I still think the links are too far off topic (WP:ELNO #13) and too much just how-tos.
 "both provide further references to related literature in their areas" I'm not finding this in either of them. What am I overlooking? Ronz (talk) 22:09, 5 January 2010 (UTC)
 Thanks for the response. Here are pointers to the related literature pages on each. For Sense Clusters, they have a list of LSA-using publications, which illustrate how LSA can be applied to Natural Language problems. In addition their work on sense clustering can be contrasted with the fundamental limitation of LSA in handling polysemy. For the S-Space Package, they have a list of algorithms and papers related to LSA. For researchers unfamiliar with other work being done in the area, I think this is a good starting point for finding alternatives to LSA.
 Also, I was wondering if you could clarify your stance on the how-tos. Given that other algorithmic pages (see comments above for LDA and SVD) have similar links, is there something different about the links I am proposing that makes them unacceptable? Ideally, I would like to ensure that LSA has links (just as the other pages do) for those interested in the technical details of implementing them. If there is something specific you are looking for that these links do not satisfy, perhaps I can identify a more suitable alternate. Juggernaut the (talk) 02:31, 6 January 2010 (UTC)
 Thanks for the links to the literature. Maybe we could use some of the listed publications as references?
 I have tagged the External links section of each article. It's a common problem. Ronz (talk) 18:14, 6 January 2010 (UTC)
 Perhaps to clarify my position, I think there is a fundamental difference between the how-to for most articles and those for algorithms; an algorithm is in essence an abstract how-to. Computer scientists are often interested in how the algorithms might be reified as code, as different programming languages have different properties and often the algorithms are complex with many subtle details. For example, if you look at the links for any of the Tree data structures, almost all of them have links to numerous different ways the algorithms could be implemented. I don't think these kinds of links should fall into the how-to due to the nature of their respective Wikipedia page's content. Juggernaut the (talk) 23:31, 6 January 2010 (UTC)
Response to Third Opinion Request: 
Disclaimers: I am responding to a third opinion request made at WP:3O. I have made no previous edits on Latent semantic analysis and have no known association with the editors involved in this discussion. The third opinion process (FAQ) is informal and I have no special powers or authority apart from being a fresh pair of eyes. Third opinions are not tiebreakers and should not be "counted" in determining whether or not consensus has been reached. My personal standards for issuing third opinions can be viewed here. 
Opinion: I'd first like to compliment both Ronz and Juggernaut the for the exemplary good faith and civility both have shown here. My opinion is that the inclusion of Sense Clusters and S-Space Package in the article is acceptable, provided it is made clear that they are being included as examples of implementation of LSA. That kind of clarification is explicit at Latent Dirichlet allocation and Singular value decomposition and needs to be made explicit here if the links are to be re-added. A reasonable number of particularly notable and/or particularly illustrative implementation ELs (or examples in the text) is acceptable, especially (though not necessarily only) in highly technical articles such as this, in light of the fact that WP is an encyclopedia for general readers who may be in need of some additional illumination. 
What's next: Once you've considered this opinion click here to see what happens next.—TRANSPORTERMAN (TALK) 19:00, 6 January 2010 (UTC) 
Thanks for the 3PO. Some good arguments have been made, to the point where something on the lines of WP:ELYES #2 might apply. I'm thinking it could be acceptable if it were an external link to a quality implementation, with source code and a description of the algorithm's implementation, from an expert on the topic. I've asked for others' opinions at Wikipedia:External_links/Noticeboard#Links_to_implementations_of_algorithms. Ronz (talk) 18:11, 7 January 2010 (UTC)
 I can see the arguments of both sides. WP:ELNO#13 could be invoked by claiming that the sites are only indirectly related to the article's subject (they do not discuss the subject in the manner that the article does). However, given that this kind of topic is often best dealt with by a concrete implementation, and given that Juggernaut the has eloquently defended the links and has no apparent interest in spamming the sites, I would say that the links should be added, with a few words of explanation as per TransporterMan's comment. The only problem is what to do when a new editor turns up in a month and adds their link – I think that could be handled by applying WP:NOTDIR. Johnuniq (talk) 06:21, 8 January 2010 (UTC)
Derivation section: "mapping to lower dimensional space"
The original text is
 The vector then has entries mapping it to a lower dimensional space dimensions. These new dimensions do not relate to any comprehensible concepts. They are a lower dimensional approximation of the higher dimensional space. Likewise, the vector is an approximation in this lower dimensional space.
It sounds like you're trying to say that d gets mapped to a lower dimensional space via the mapping d-hat = Sigma_k^{-1} U_k^T d, right?
If so, then this paragraph should be cleaned up. What caught my attention at first was the redundant "dimensions" at the end of the first sentence. That should be fixed as well.
Justin Mauger (talk) 04:46, 21 October 2012 (UTC)
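For reference, if the article's derivation is the standard fold-in, the mapping in question can be sketched as follows (matrix values made up for illustration):

```python
import numpy as np

# Made-up 4-term x 3-document matrix, illustrative only.
X = np.array([[1., 0., 2.],
              [0., 1., 1.],
              [3., 1., 0.],
              [0., 2., 1.]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
Uk = U[:, :k]
Sk_inv = np.diag(1.0 / s[:k])

# Map a document vector d (a column of term counts) into the k-dimensional
# concept space: d_hat = Sigma_k^{-1} U_k^T d.
d = X[:, 0]
d_hat = Sk_inv @ Uk.T @ d

# For a document already in X, this recovers its column of the truncated V^T.
```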
Polysemy?
The limitation "LSA cannot capture polysemy" contradicts "LSI overcomes two of the most problematic constraints of Boolean keyword queries: multiple words that have similar meanings (synonymy) and words that have more than one meaning (polysemy)." on the Latent_semantic_indexing page. Nicolamr (talk) 20:05, 31 July 2014 (UTC)
Use of the word "Document" in Lede
Latent semantic analysis (LSA) is a technique in natural language processing, in particular in vectorial semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms
Unless the word "document" is a term of art that applies to more than just "documents", the lede seems outdated to me as (for example) it's well known that Google uses LSA for its search algorithm, and I very much doubt that there are any "documents" (paper, MS Word, etc...) involved. I think of it as a "word cloud" personally. I'd like someone else to check my reaction to the use of this word as to whether it, or myself, is wrong. Jonny Quick (talk) 04:17, 8 August 2015 (UTC)
 I think the term "document" is OK in the lede, and I am curious to know why anyone would disagree.
 Do you agree that Google searches "web pages"? If not, please tell us how to fix the "Google Search" article (or fix it yourself).
 Do you agree that "web pages" are a kind of document in the ordinary sense of the word? If not, please tell us how to fix the "web page" article (or fix it yourself).
 DavidCary (talk) 02:02, 2 June 2016 (UTC)
Semantic hashing
This topic is in need of attention from an expert on the subject. The section or sections that need attention may be noted in a message below. 
This paragraph is poorly written and confusing. "In semantic hashing [14] documents are mapped to memory addresses by means of a neural network in such a way that semantically similar documents are located at nearby addresses." Ok, makes sense. "Deep neural network essentially builds a graphical model of the word-count vectors obtained from a large set of documents." What is a graphical model? What are word-count vectors? This terminology has not been used in the preceding part of the article. Are those just the counts of the words in those documents? Then in what sense are they obtained FROM a set of documents? And is it important that the set is large? "Documents similar to a query document can then be found by simply accessing all the addresses that differ by only a few bits from the address of the query document." I get it, but the explanation is convoluted. It has to highlight that it's a faster/easier way to look for neighboring documents.
"This way of extending the efficiency of hashcoding to approximate matching is much faster than locality sensitive hashing, which is the fastest current method."
This is a contradiction. How can semantic hashing be faster than locality sensitive hashing, if the latter is the fastest current method ?! 184.75.115.98 (talk) 17:00, 9 January 2017 (UTC)
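Setting the speed claim aside, the lookup idea in the quoted paragraph is straightforward; a toy sketch with made-up 8-bit codes (the interesting part, learning the codes with a neural network, is not shown):

```python
# Made-up 8-bit "semantic addresses" for five documents, illustrative only;
# a real system would learn these codes with a neural network.
codes = {
    "doc_a": 0b10110010,
    "doc_b": 0b10110011,  # 1 bit away from doc_a: a "nearby address"
    "doc_c": 0b01001100,
    "doc_d": 0b10100010,  # also 1 bit away from doc_a
    "doc_e": 0b01001101,
}

def hamming(a, b):
    """Number of bit positions in which two addresses differ."""
    return bin(a ^ b).count("1")

def neighbours(query, radius=2):
    """Documents whose address is within `radius` bits of the query's."""
    q = codes[query]
    return sorted(d for d, c in codes.items()
                  if d != query and hamming(q, c) <= radius)

near = neighbours("doc_a")
```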
External links modified
Hello fellow Wikipedians,
I have just modified one external link on Latent semantic analysis. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:
 Added archive https://web.archive.org/web/20120717020428/http://lsi.research.telcordia.com/lsi/papers/JASIS90.pdf to http://lsi.research.telcordia.com/lsi/papers/JASIS90.pdf
When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.
As of February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete the "External links modified" sections if they want, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{sourcecheck}}
(last update: 15 July 2018).
 If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
 If you found an error with any archives or the URLs themselves, you can fix them with this tool.
Cheers.—InternetArchiveBot (Report bug) 02:42, 12 May 2017 (UTC)
External links modified
Hello fellow Wikipedians,
I have just modified 2 external links on Latent semantic analysis. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:
 Added archive https://web.archive.org/web/20081221063926/http://www.dcs.shef.ac.uk/~genevieve/gorrell_webb.pdf to http://www.dcs.shef.ac.uk/~genevieve/gorrell_webb.pdf
 Added archive https://archive.is/20121214113904/http://www.cs.brown.edu/~th/ to http://www.cs.brown.edu/~th/
Cheers.—InternetArchiveBot (Report bug) 21:28, 17 December 2017 (UTC)
Needs cleanup after merge with LSI page
It's good and appropriate that the LSA and LSI pages have been merged, but the current result is quite confusing. Notably:
 LSI is listed in the Alternative Methods section, even though it should now be framed as a synonym (as in the introduction)
 after introducing LSI as an alternative method, the article proceeds to use LSI as the default term for its duration
As far as I can tell, the merge was basically a pure copy-paste job. This sort of thing predictably results in an article with confusing inconsistencies of terminology and redundancies of content. It badly needs a proper smoothing and revision pass by some generous soul with the requisite domain knowledge. — Preceding unsigned comment added by Gfiddler (talk • contribs) 17:13, 17 October 2018 (UTC)