Tuesday, October 20, 2015

Leximancer: A Software tool for Latent Semantic Analysis

Arun Aryal
Doctoral Candidate/Adjunct Instructor
Computer Information Systems
Georgia State University

     Over the past decade, various text-mining approaches such as Latent Semantic Analysis (LSA) have been adopted to identify patterns and themes in text documents. While many software tools are available for performing LSA, one that has been popular in the IS literature is Leximancer. Leximancer divides content analysis into two parts: thematic analysis and semantic analysis. The primary objective of thematic analysis is to detect, uncover, and quantify predefined concepts within the text. Semantic analysis augments the thematic analysis by quantifying the relationships between the identified concepts within the data corpus.

     Thematic analysis begins with Leximancer scanning the text to identify frequently used words (word counts), also known as seed words. Leximancer excludes common stop words such as am, as, a, and an. The seed words are weighted according to how frequently they occur near a focal concept. A focal concept is a word around which other words tend to cluster, or travel together. For example, words such as “students”, “professors”, “class”, and “campus” might cluster around “university”.
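The seed-word step can be illustrated as a simple frequency count over the text with stop words removed. This is only a minimal sketch of the idea, not Leximancer's actual implementation, and the stop-word list here is a small made-up sample:

```python
from collections import Counter
import re

# A tiny illustrative stop-word list; real tools use much larger lists.
STOP_WORDS = {"am", "as", "a", "an", "and", "the", "is", "of", "to", "in", "on", "for"}

def seed_word_counts(text):
    """Count candidate seed words, skipping common stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

counts = seed_word_counts(
    "Students and professors meet on the university campus. "
    "The university offers a class for students."
)
```

High-count survivors of this filtering ("students", "university", and so on) are the seed words around which concepts are then built.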

     Semantic analysis measures the co-occurrence of concepts within the text. Concepts are measured according to how frequently they occur within two-sentence “chunks” of text, often referred to as a window. Leximancer moves this window, two sentences at a time, through the entire text corpus, measuring the co-occurrence of concepts. Leximancer stores the resulting co-occurrence matrix of all concepts, which can be downloaded into a spreadsheet for further analysis.
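The windowed co-occurrence count can be sketched as follows. This is an illustrative simplification, assuming (as described above) a non-overlapping two-sentence window and simple substring matching for concepts; the sentences and concept list are invented examples:

```python
import itertools
from collections import defaultdict

def cooccurrence_matrix(sentences, concepts, window=2):
    """Move a non-overlapping window of `window` sentences through the text
    and count how often each pair of concepts appears in the same window."""
    matrix = defaultdict(int)
    for i in range(0, len(sentences) - window + 1, window):
        chunk = " ".join(sentences[i:i + window]).lower()
        present = [c for c in concepts if c in chunk]
        for a, b in itertools.combinations(sorted(present), 2):
            matrix[(a, b)] += 1
    return dict(matrix)

sentences = [
    "Students attend lectures at the university.",
    "Professors teach each class on campus.",
    "The university funds campus research.",
    "Students use the campus library.",
]
m = cooccurrence_matrix(sentences, ["students", "professors", "university", "campus"])
```

The resulting pair counts (e.g. how often "students" and "university" fall in the same window) are exactly the kind of matrix that can be exported to a spreadsheet for further analysis.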

     Leximancer presents the results of these thematic and semantic analyses as “overall” visual maps, on which the analyst can view the concepts, sub-concepts (keywords used in creating a concept), and themes. Each circle in the concept map represents a theme. Theme names are derived from the most dominant concepts, i.e., the concepts that occur most often within that theme. Once the initial overall map is created, the analyst can change the theme size to adjust the grouping of concepts on the map.

Short Illustration:

Assume a research question:  How do the research topics change over time, across discrete time periods (1985-1998, 1999-2006)?

     To answer this question, researchers need to create a consolidated file for each of the two time periods (1985-1998, 1999-2006). Each consolidated file contains the titles and abstracts of all papers published during that time period. Since the aim is to understand how the research changed over time, researchers need to separate the papers for each of these time periods. Next, common “stop words” (and, not, with, or, etc.) should be excluded and word variants (e.g., organize, organization, and organizations; also project, projects, and projected) should be merged. Once these parameters for the stop words and merged words are established, researchers can let Leximancer analyze each consolidated file, which contains all words from its time period.
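The stop-word exclusion and variant merging described above can be sketched as a preprocessing pass. This is a hand-rolled illustration (in practice these parameters are set inside Leximancer); the stop-word list and the variant map below are small made-up samples:

```python
import re

# Illustrative stop-word sample, following the examples in the text.
STOP_WORDS = {"and", "not", "with", "or", "the", "a", "an", "is", "to"}

# Hypothetical merge map: each surface form collapses to one canonical concept.
MERGE_MAP = {
    "organize": "organization", "organizations": "organization",
    "projects": "project", "projected": "project",
}

def preprocess(text):
    """Drop stop words and merge word variants before concept extraction."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [MERGE_MAP.get(t, t) for t in tokens if t not in STOP_WORDS]

tokens = preprocess(
    "Organizations organize projects, and the project is projected to grow."
)
```

After this pass, all variants count toward a single concept, so frequencies in the two consolidated files are comparable across time periods.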

     Figure 1 shows the results of the analysis for each of these time periods. It shows themes (i.e., words enclosed in circles) that reflect the major concepts. For example, in this preliminary analysis, the main concepts were information (in Time Period 1) and technology (in Time Period 2), given that Leximancer displays these terms using “hot” colors (red, orange, yellow).

Figure 1:  Sample results with key terms for Time Period 1 (1985-1998) and Time Period 2 (1999-2006)

Leximancer is useful for identifying concepts within given texts. Once the concepts are identified, factor analysis of these concepts offers further techniques for analyzing the text in greater depth.
