Latent Semantic Indexing is a method for discovering patterns in which words or phrases occur together across a group of documents or web pages. We all know that certain words and phrases tend to occur with other words and phrases. For example, “cowboy boots” tends to occur a lot with “western boots” because cowboys and the West are closely linked in our minds.
This relationship, in which web pages that contain “cowboy boots” also tend to contain “western boots”, is a simple pattern that can be described as co-occurrence. That is, if phrase A appears in a web page, then phrase B is more likely to appear there as well. This simple co-occurrence relationship is sometimes mistaken for Latent Semantic Indexing, but the two are not the same.
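To make the co-occurrence idea concrete, here is a minimal sketch in Python. The pages are made-up toy data, and the function names `probability` and `conditional_probability` are hypothetical helpers chosen for illustration. It compares how often phrase B appears overall with how often it appears on pages that already contain phrase A.

```python
# Toy pages, each reduced to the set of phrases it contains (made-up data).
pages = [
    {"cowboy boots", "western boots", "leather"},
    {"cowboy boots", "western boots"},
    {"cowboy boots", "hats"},
    {"running shoes", "sneakers"},
    {"western boots", "belts"},
    {"running shoes", "socks"},
]


def probability(phrase, pages):
    """Fraction of all pages that contain the phrase."""
    return sum(phrase in page for page in pages) / len(pages)


def conditional_probability(b, a, pages):
    """Fraction of the pages containing phrase a that also contain phrase b."""
    with_a = [page for page in pages if a in page]
    if not with_a:
        return 0.0
    return sum(b in page for page in with_a) / len(with_a)


# How likely is phrase B on any page, and on pages that contain phrase A?
p_b = probability("western boots", pages)
p_b_given_a = conditional_probability("western boots", "cowboy boots", pages)

print(f"P(western boots) = {p_b:.2f}")
print(f"P(western boots | cowboy boots) = {p_b_given_a:.2f}")
```

In this toy collection the second number is larger than the first, which is all that a first-order co-occurrence relationship says: seeing phrase A raises the chance of seeing phrase B.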
Latent Semantic Indexing goes much deeper than that. For example, suppose we have a group of documents in which “car” usually appears with engines, tires, brakes, and gas tanks. Suppose that “automobile” also tends to occur with those same words; however, the words “car” and “automobile” never occur in the same document. Latent Semantic Indexing will be able to tell that “car” and “automobile” are very similar to each other even though their co-occurrence is zero.
This is because Latent Semantic Indexing looks not only at co-occurrence but also at second-order co-occurrence, where “car” and “automobile” do not appear together directly but share other words that tend to appear with each of them. In fact, Latent Semantic Indexing looks at all orders of co-occurrence and blends them together, resulting in a very sophisticated model that shows which words are similar to which other words based on the documents or web pages they appear in.
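The “car” and “automobile” case can be sketched with a few lines of Python. The text above does not name the underlying math; this sketch assumes the usual formulation, in which a term-document matrix is reduced with a truncated singular value decomposition (SVD). The documents, the choice of `k = 2` latent dimensions, and all variable names are illustrative assumptions, not a definitive implementation.

```python
import numpy as np

# Made-up toy collection: "car" and "automobile" never share a document.
terms = ["car", "automobile", "engine", "tires", "brakes", "banana", "apple"]
docs = [
    ["car", "engine", "tires"],
    ["car", "engine", "brakes"],
    ["automobile", "engine", "tires"],
    ["automobile", "tires", "brakes"],
    ["banana", "apple"],
    ["banana", "apple"],
]

# Term-document matrix: rows are terms, columns are documents (1 = present).
A = np.array([[1.0 if t in d else 0.0 for d in docs] for t in terms])


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


car, auto, banana = (terms.index(t) for t in ("car", "automobile", "banana"))

# Direct co-occurrence: number of documents that contain both words.
direct = int(A[car] @ A[auto])

# Truncated SVD: keep only the k strongest latent dimensions (k is an assumption).
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
term_vectors = U[:, :k] * s[:k]  # one row of latent coordinates per term

sim_car_auto = cosine(term_vectors[car], term_vectors[auto])
sim_car_banana = cosine(term_vectors[car], term_vectors[banana])

print("documents containing both car and automobile:", direct)
print("latent similarity car/automobile:", round(sim_car_auto, 3))
print("latent similarity car/banana:", round(sim_car_banana, 3))
```

On this toy data the direct co-occurrence count is zero, yet the two words land very close together in the reduced space because they share the same neighbours, while an unrelated word such as “banana” does not. Real collections are far noisier, and the number of dimensions kept is typically much larger and has to be tuned.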
Computers using Latent Semantic Indexing do not have any idea what the words or phrases mean. They do not know that cowboys and the West are linked together because cowboys worked in the West and had a significant role in its development. The only information they have is which words or phrases appear in which documents. If cowboys and the West tend to occur in the same documents more often than chance would indicate, then Latent Semantic Indexing can start inferring that there is some deeper “semantic” relationship between the words.
Latent means hidden or buried in the documents, semantic means having to do with the meaning of words, and indexing means finding documents based on a word. Latent Semantic Indexing literally means finding documents based on the meaning of words as derived from how those words are used in the documents. Of course, this is not semantics in the full sense that we usually intend by the word, but there is a surprising amount of information that can be obtained just by looking at the distribution of words across a collection of documents such as a group of web pages.
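The “indexing” part, finding documents based on a word, can be sketched in the same way. The example below assumes the common SVD-based formulation, in which a query is projected (“folded in”) into the latent space and compared with each document there. The toy documents, `k = 2`, and the hypothetical `search` helper are illustrative assumptions only.

```python
import numpy as np

# Made-up toy collection: documents 2 and 3 never mention "car".
terms = ["car", "automobile", "engine", "tires", "brakes", "banana", "apple"]
docs = [
    ["car", "engine", "tires"],
    ["car", "engine", "brakes"],
    ["automobile", "engine", "tires"],
    ["automobile", "tires", "brakes"],
    ["banana", "apple"],
    ["banana", "apple"],
]

# Term-document matrix: rows are terms, columns are documents (1 = present).
A = np.array([[1.0 if t in d else 0.0 for d in docs] for t in terms])

# Truncated SVD with k latent dimensions (k is an assumption for this sketch).
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k = U[:, :k], s[:k]
doc_vectors = Vt[:k].T  # one row of latent coordinates per document


def search(word):
    """Return the cosine similarity of a one-word query to every document."""
    q = np.array([1.0 if t == word else 0.0 for t in terms])
    q_k = (q @ U_k) / s_k  # fold the query into the latent space
    norms = np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q_k)
    return doc_vectors @ q_k / norms


scores = search("car")
for i, score in enumerate(scores):
    print(f"doc {i} {docs[i]}: {score:.3f}")
```

In this sketch a search for “car” also scores the documents that contain only “automobile” highly, while the unrelated fruit documents score near zero. A plain keyword index would have missed the “automobile” documents entirely, since the query word never appears in them.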





