Open Source Tools

Indexing a Book Using Open Source Tools on Linux



If you’ve ever indexed a book, you know that it’s not exactly a lot of fun, unless indexing is your thing. Creating an index requires a fair amount of tedium and manual processing. As far as I’m concerned, it’s way less fun than actually researching and writing a book.

Fortunately, using some basic open source tools like grep and sort, you can streamline a lot of the hard work that goes into making an index. Below, I’ll show you how by drawing on my experience indexing my latest book, For Fun and Profit: A History of the Free and Open Source Software Revolution, which is forthcoming with MIT Press in July, and which you should—you know, buy. (OK, that’s the end of my self-promoting pitch.)


To create your index using the tools discussed in this guide, you should have:

  1. A Linux system with the following utilities installed: grep, sort, cut, awk, sed, uniq and pdftotext. All of the tools except the last one should come preinstalled on most Linux distributions. The pdftotext utility is part of the Poppler package. You can install it from source, or by using a package from your distribution’s repository. (On Ubuntu, the package you want is called poppler-utils.)
  2. A PDF of your book manuscript, with page numbers set as they’ll appear in the final book. (It’s OK if page 1 of the PDF file is not page 1 of your actual book; we’ll deal with that below.)
  3. A list of words or terms that you want to include in your index. I can’t tell you what to include; only you know your book. I can tell you that I find it helpful to create a list of terms for the index as I am reading through the page proofs of my manuscript. To work with the scripts we’ll use below, your list should include each term on a separate line. The list doesn’t have to be alphabetized.
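To make the format concrete, here is what a (hypothetical) word list might look like, saved as a file named words; the terms are just examples:

```shell
# Create a sample "words" file; one term per line, in any order.
cat > words <<'EOF'
Red Hat
ACC Corporation
Alexis de Tocqueville Institution
EOF
wc -l < words
```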

With those prerequisites in place, you can get started building your index using the following steps.

Step 1: Convert the PDF to Text

Since we can’t use grep or other tools directly on a PDF file, we need to convert the PDF file to text.

I did that using the following Bash script:


pdf=book.pdf              # path to your manuscript PDF
for ((i=1; i<=350; i++)); do
    n=$((i - 32))         # page offset
    pdftotext -f $i -l $i "$pdf" "$n.txt"
done

This script takes the PDF file and exports each page as a separate text file. Each text file is named [page number].txt, so the first page of your book is 1.txt, the second is 2.txt, and so on.

The variables in the script work as follows:

  1. $pdf is the name of your PDF file. (Remember to include the full path if the PDF is in a different directory than the script.)
  2. $i refers to the page of the PDF file that we are converting to text. We increment $i with each loop. Note that $i is the page of the PDF file itself, not necessarily the page of your book manuscript (since page 1 of your manuscript may not be page 1 of the PDF file.)
  3. $n is the page number of the page in your actual book. In my case, page 1 of my book is page 33 of the PDF file (the front matter fills the first 32 pages), which is why in the script above, I set $n equal to $i-32.

In the script above, I stop the loop when $i reaches 350, which is enough to cover all of the pages in my book. If your book is longer or shorter, you can adjust this number accordingly.
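If you'd rather not guess at the upper bound, pdfinfo (also part of the Poppler package) reports the page count; the awk snippet below just extracts the number from its Pages: line. The book.pdf name here is a placeholder:

```shell
# With a real manuscript you would run (book.pdf is a placeholder name):
#   pages=$(pdfinfo book.pdf | awk '/^Pages:/ {print $2}')
# The awk step simply pulls the count out of pdfinfo's "Pages:" line:
printf 'Pages:          350\n' | awk '/^Pages:/ {print $2}'
```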

After your script is written, execute it. It will generate a series of text files, with each text file corresponding to an individual page inside your book.

Step 2: Build Your Index

Now that we have the book split into individual text files, we can use grep to search each one and tell us whenever it finds a page that matches a term we want to include in our index.

I did this using the following script:


filename=words            # list of index terms, one per line
while read -r line; do
    grep -H -o -i "$line" *.txt >> index
done < "$filename"

This script reads a file (in this case, it’s called words) that contains a list of the terms I want to include in my index. Then, for each term, it greps the page files to see which ones contain it, and appends any matches to the file index. The specific options I pass to grep tell it to include the filename (-H) and the matched term (-o) in its output, and to be case-insensitive (-i).
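To see the loop in action, here is a toy run with invented page files standing in for the Step 1 output:

```shell
# Fake page files (contents made up for the demo):
printf 'Red Hat shipped a new release\n'   > 130.txt
printf 'the founding of ACC Corporation\n' > 145.txt
printf 'ACC Corporation, later renamed\n'  > 178.txt
printf 'Red Hat\nACC Corporation\n'        > words

# The Step 2 loop, restricted to the page files:
rm -f index
while read -r line; do
    grep -H -o -i "$line" *.txt >> index
done < words

cat index
```

Each line of index pairs a page file with the matched term, which is exactly the raw material Step 3 cleans up.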

Step 3: Clean Up Your Results

If you run the script above, you’ll get a file named index, full of entries that look something like this:

130.txt:Red Hat
213.txt:Alexis de Tocqueville Institution
145.txt:ACC Corporation
178.txt:ACC Corporation

We can clean this up a bit using a command like the following:

awk 'BEGIN {FS=OFS=":"} {print $2,$1}' index | sed 's/\.txt//' | sort -t: -k1,1 -k2,2n | uniq

This reformats the list, alphabetizes it, and removes duplicate entries so that it looks more like this:

ACC Corporation:145
ACC Corporation:178
Alexis de Tocqueville Institution:213
Red Hat:130

What this list tells us is that ACC Corporation is mentioned on pages 145 and 178, the Alexis de Tocqueville Institution on page 213, Red Hat on page 130, and so on. With this data in hand, you can move ahead with creating your final index.

Step 4: The Hard Part

Unfortunately, creating the actual index still requires some manual work. You’ll need to consolidate multiple entries for a given term into a single line. You’ll also need to do some manual work to break long entries down into subentries. And you’ll need to think hard about whether many terms should actually be a part of the index or not.

But that’s just the nature of this type of work. A good index requires a lot of careful thought, and you can’t automate that.

Still, I think that using scripts on Linux to help create your index saves you from a lot of the most boring work. Instead of searching for terms manually and adding them to a list by hand, you can generate a long list of terms and relevant page numbers quickly. In my case, I ended up with something like 1,300 lines in my raw index file. Those were 1,300 manual searches that I was glad I did not have to perform.


Last but not least, I should point out a few obvious limitations with the approach described above:

  1. This won’t work well if you have words that span multiple pages (that is, if a word starts on one page and is hyphenated into the next page).
  2. As my scripts are written, you can end up with erroneous entries because, for example, a search for Evolution (the email program) will also match Revolution.
  3. If your text contains non-English characters, or has strange encoding issues, this may not work well because grep may have problems.
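The second limitation has at least a partial fix: adding grep’s -w flag (a tweak of mine, not used in the scripts above) restricts matches to whole words:

```shell
# Two fake page files (names and contents invented for the demo):
printf 'the Revolution began\n'      > 201.txt
printf 'the Evolution mail client\n' > 202.txt

# Without -w, "Revolution" also yields a (lowercase) match for Evolution:
grep -H -o -i 'Evolution' 201.txt 202.txt
# With -w, only the genuine whole-word mention survives:
grep -H -o -i -w 'Evolution' 201.txt 202.txt
```

Multi-word terms like “Red Hat” still work with -w, since the word boundaries only apply at the edges of the match.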

If I were a better programmer, I’m sure I could find ways to work around these problems. But the scripts I wrote did exactly what Linux tools are supposed to do: they provided a quick, automated solution to a real-world problem I was facing, and saved me lots of time.

On that note, if I were looking for perfection, I’d write the scripts in Python or some other more sophisticated language than Bash. But since I just wanted to get the job done and had limited needs, I used Bash to create the scripts, and it worked for me.

Chris Tozzi has worked as a journalist and Linux systems administrator. He has particular interests in open source, agile infrastructure and networking. He is Senior Editor of content and a DevOps Analyst at Fixate IO.

