In this article, I detail a method for investigating a collection of text documents (a corpus) and finding the words (entities) that best represent the collection of words in that corpus. I will use an example where R, together with Natural Language Processing (NLP) techniques, is used to find the component of the system under test with the most reported issues.
Operation Buggy
Say you are a tester who has been called in to help a DevOps team with their issue management system. The only thing you have been given is a set of text documents made by the testers: exports of the JIRA issues they reported. They are large documents, and no one (including you) has time to read through them manually.
As a data scientist and QA expert, it’s your job to make sense of the data in the text documents. What parts of the system were tested, and which system components had the most reported issues? This is where Natural Language Processing (NLP) can enter to tackle the problem, and R, the statistical computing environment, can be used with various R packages to perform NLP methods on your data. (Some packages include tm, textreuse, openNLP, etc.) The choice of package depends on what you want to analyze in your data.
In this example, the immediate objective is to turn a large library of text into actionable data to:
- Find the issues with the highest risks (not simply the most buggy component of the system, because that component can also contain a lot of trivial issues).
- Fix the component of the system with the most issues.
To tackle the problem, we need statistics. By using the statistical programming language R, we can build statistical algorithms to find the most buggy component of the system under test.
Retrieval of the data
First, we have to retrieve and preprocess the files to enable the search for the most buggy component. What R packages do we actually need?
These are listed in Table 1, together with their functions.
Table 1: R packages used
- tm: text mining framework (corpus creation, preprocessing, document-term matrix)
- SnowballC: word stemming
- topicmodels: topic modeling, including latent Dirichlet allocation (LDA)
The functions of these R packages will be explained when the packages are addressed in the script.
Before you start to build the algorithm in R, you first have to install and load the libraries of the R packages.
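If the packages are not installed yet, you can install them once from CRAN before loading them (a minimal sketch using the packages from Table 1):
#one-time installation of the required R packages from CRAN
install.packages(c("tm", "SnowballC", "topicmodels"))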
After installation, every R script starts with loading the R libraries, as shown below.
library(tm)
library(SnowballC)
library(topicmodels)
You can start with retrieving the dataset (or corpus for NLP).
For this experiment, we saved three text files with bug reports from three testers in a separate directory, which is also our working directory (use setwd("directory") to set the working directory).
#set working directory (modify path as needed)
setwd(directory)
You can load the files from this directory in the corpus:
#load files into corpus
#get listing of .txt files in directory
filenames <- list.files(getwd(), pattern="*.txt")   #getwd() represents the working directory
Read the files into a character vector, a basic data structure that R can work with.
#read files into a character vector
files <- lapply(filenames, readLines)
We now have to create a corpus from the vector.
#create corpus from vector
articles.corpus <- Corpus(VectorSource(files))
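As a quick sanity check (optional, not part of the original steps), you can verify that the corpus now contains one document per text file:
#sanity check: the corpus should contain one document per text file
length(articles.corpus)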
Preprocessing the data
Next, we need to preprocess the text to convert it into a format that can be processed for extracting information. An essential aspect is reducing the size of the feature space before analyzing the text, i.e., normalization. (Several preprocessing methods are available, such as case-folding, stop word removal, stemming, lemmatization, contraction simplification, etc.) Which preprocessing methods are necessary depends on the data we retrieve and the kind of analysis to be performed.
Here, we use case-folding and stemming.
Case-folding matches all possible instances of a word (Auto and auto, for instance).
Stemming reduces modified or derived words to their root form, so that we also match the resulting root forms.
# make each letter lowercase (content_transformer keeps the corpus structure intact in recent versions of tm)
articles.corpus <- tm_map(articles.corpus, content_transformer(tolower))
#stemming
articles.corpus <- tm_map(articles.corpus, stemDocument)
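Depending on your data, you could add more of the preprocessing steps mentioned above. As an optional sketch (not used in this example), stop word and punctuation removal with tm would look like this:
#optional: remove English stop words and punctuation (not used in this example)
articles.corpus <- tm_map(articles.corpus, removeWords, stopwords("english"))
articles.corpus <- tm_map(articles.corpus, removePunctuation)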
Create the DTM
The next step is to create a document-term matrix (DTM). This is critical, because to interpret and analyze the text files, they must ultimately be converted into a document-term matrix.
The DTM holds the number of term occurrences per document. The rows in a DTM represent the documents, and each term in a document is represented as a column. We’ll also remove the low-frequency words (or sparse terms) after converting the corpus into the DTM.
articleDtm <- DocumentTermMatrix(articles.corpus, control = list(minWordLength = 3))
articleDtm2 <- removeSparseTerms(articleDtm, sparse=0.98)
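If you want to see what the resulting DTM looks like before modeling, tm offers a few inspection helpers; a short, optional sketch (the frequency threshold of 5 is arbitrary):
#inspect the DTM: dimensions, frequent terms, and (part of) the matrix itself
dim(articleDtm2)
findFreqTerms(articleDtm2, lowfreq = 5)
inspect(articleDtm2)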
Topic modeling
We are now ready to find the words in the corpus that represent the collection of words used in the corpus: the essentials.
This is also called topic modeling.
The topic modeling technique we will use here is latent Dirichlet allocation (LDA). LDA learns a representation of a fixed number of topics and, given this number of topics, learns the topic distribution of each document in the collection.
Explaining LDA goes far beyond the scope of this article. For now, just follow the code as written below.
#LDA
k = 5
SEED = 1234
article.lda <- LDA(articleDtm2, k, method="Gibbs", control=list(seed = SEED))
lda.topics <- as.matrix(topics(article.lda))
lda.topics
lda.terms <- terms(article.lda)
lda.terms
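terms(article.lda) returns only the single most likely term per topic. As an optional extension, topicmodels can also return several terms per topic and the underlying probability distributions (article.posterior is just a variable name used in this sketch):
#top 5 terms per topic instead of just one
terms(article.lda, 5)
#per-topic term probabilities and per-document topic probabilities
article.posterior <- posterior(article.lda)
round(article.posterior$terms, 3)
round(article.posterior$topics, 3)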
If you now run the full code in R as explained above, you will obtain the essentials: the words in the corpus that represent the collection of words used in the corpus.
For this experiment, the results were:
> lda.terms
   Topic 1      Topic 2     Topic 3    Topic 4    Topic 5
    "theo"  "customers"    "angela"      "crm"     "paul"
Topics 1 and 3 can be explained: theo and angela are testers.
Topic 5 is also easily explained: paul is a fixer.
Topic 4, crm, is the system under test, so it’s not surprising it shows up as a term in the LDA, because it is mentioned in every issue by every tester.
Now, we still have topic 2: customers.
Customers is a component of the system under test: crm.
Customers is the component mentioned most often in the issues reported by all the testers involved.
Finally, we have found our most buggy component.
Wrap-up
This article described a method we can use to investigate a collection of text documents (corpus) and find the words that represent the collection of words in this corpus. For this article’s example, R (together with NLP techniques) was used to find the component of the system under test with the most issues found.
R code
library(tm)
library(SnowballC)
library(topicmodels)

# TEXT RETRIEVAL
#set working directory (modify path as needed)
setwd(directory)
#load files into corpus
#get listing of .txt files in directory
filenames <- list.files(getwd(), pattern="*.txt")
#read files into a character vector
files <- lapply(filenames, readLines)
#create corpus from vector
articles.corpus <- Corpus(VectorSource(files))

# TEXT PROCESSING
# make each letter lowercase
articles.corpus <- tm_map(articles.corpus, content_transformer(tolower))
# stemming
articles.corpus <- tm_map(articles.corpus, stemDocument)

# Create the Document Term Matrix (DTM)
articleDtm <- DocumentTermMatrix(articles.corpus, control = list(minWordLength = 3))
articleDtm2 <- removeSparseTerms(articleDtm, sparse=0.98)

# TOPIC MODELING
k = 5
SEED = 1234
article.lda <- LDA(articleDtm2, k, method="Gibbs", control=list(seed = SEED))
lda.topics <- as.matrix(topics(article.lda))
lda.topics
lda.terms <- terms(article.lda)
lda.terms