Page 4

EETE FEB 2015

big data Out Googling Google on big data searches By R. Colin Johnson Almost every search algorithm through unstructured Big Data uses a technique called latent Dirichlet allocation (LDA). Northwestern University professor Luis Amaral became curious as to why LDA-based searches appear to be 90 percent inaccurate and unrepeatable 80 percent of the time, often delivering different “hit lists” for the same search string. To solve the conundrum Amaral took apart LDA, found its flaws, and fixed them. Now he is offering the improved version, which not only returns more accurate results but returns exactly the same list every time it is used on the same database. He’s offering all this for free to Google, Yahoo, Watson, and any other search engine makers — from recommendation systems to spam filtering to digital image processing and scientific investigation. “The common algorithmic implementation of the LDA model is incredibly naive,” Amaral told EE Times. “First, there is this unrealistic belief that one is able to detect topics when documents have a significant mixture of topics. Our systematic analysis reveals that as soon as the corpus is generated with a large value of alpha (which in LDA controls the amount of mixing of topics in documents), its algorithms fail miserably.” The other big problem with LDA is that it uses a technique that more often than not gets stuck in what are called local maximums. For instance, if looking for the highest mountain in the U.S., if it starts on the east coast it will get stuck in the Appalachia’s and never make it to the Rockies. Since there is no path that goes uphill from the Appalachia’s to the Rockies it never finds the correct peak. If it had started from the west coast moving east, it might have found the highest peak, making the algorithm unreliable and subject to giving different results each time it is run. “The common algorithm assumes that by pretty much using steepest ascent it can find the global maximum in the likelihood function landscape. Physicists know from the study of disordered systems that when the landscape is rough, one gets trapped in local maxima and that the specific local maxima found depends on the initial state. In the specific case of LDA, what this means is that depending on the initial guess of the parameter values one is estimating, one gets a different Topic mapping creates a network of related words, shown here in a graph that consistently reveals the most important topics. (Source: Northwestern University) estimate of the parameters,” Amaral told us. Instead, Amaral’s version of LDA makes an initial scan to determine the “roughness” of the landscape, allowing it to jump from a peak in the Appalachia’s to one in the Rockies, comparing the two, in order to come up with the correct answer every time. “We overcome the difficulty by using methods that we know are able to scan the likelihood function landscape effectively and get to a good maximum,” Amaral told us. Next, Amaral and his colleagues plan to optimize their search algorithm as well as use it to check all cases where the answer is already known, to make sure it is bullet proof — a technique he claims the search community has never even tried, even though it is so obviously necessary. “On a broader note, it is astonishing to me that no one had carefully checked the accuracy of the current LDA algorithms. Only applying the algorithm to cases where one does not know the correct answer is incredibly silly. Especially when the checking of the answer is so susceptible to confirmatory bias,” Amaral told us. Assisting Amaral was professor colleague Konrad Kording. Together they added Topic Mapping which replaces all search string words with their stems (for instance, treating “star and “stars” as the same word) then builds a network of connected words — a community that defines the topics in a document. In tests, the algorithm produced accurate repeatable results when separating 23,000 scientific papers and 1.2 million Wikipedia R. Colin Johnson is Advanced Technology Editor at EE Times articles by topic. 4 Electronic Engineering Times Europe February 2015 www.electronics-eetimes.com


EETE FEB 2015
To see the actual publication please follow the link above