Karakas Online

5.20.1. Automatic Index generation

LyX provides an easy way to insert an Index entry (see Section 5.20): from the menu, choose either Insert-->“Index entry of preceding word” (which I personally find easier), or Insert-->“Index entry”, then enter the required word. This method works fine - if you have a small document, with only a few keywords to insert. But what if your document has grown to hundreds of pages, with hundreds (or even thousands) of index entries to insert? See the Index of the PHP-Nuke HOWTO for an example of an Index that cannot be generated manually - unless you want to drive yourself crazy!

Clearly, for a comprehensive Index of large documents, an automatic procedure is necessary. However, the general problem of automatic Index generation is the subject of extensive (and still inconclusive) research and I am not going to address it in its full generality here. For our purposes, even a semi-automatic procedure would be very helpful. To this end, I have created the following 4 scripts: sedscr_list_index_items, sedscr_delete_index_items, awkscr_create_index_items and awkscr_insert_index_items.

They can be used in the following semi-automatic Index generation procedure:

  1. Optional: create a list of all existing index entries in your document. This is useful not only because you are going to eliminate all index entries from the document in the next step, but also as a backup of the index entries that are currently in use - you might want to reuse them in some later step.

    To create a list of all existing index entries in your document, type:

    sedscr_list_index_items document.lyx > indexitems
    

    The generated indexitems file will contain a list of all index entries in document.lyx, one index entry per line, with a semicolon at the end of each line. The semicolon will be used later as a record delimiter in the awk scripts that follow, so don't let it irritate you.

    To get an alphabetically sorted list of index items, without duplicate entries and with all symbols at the beginning of the list, use the sort and uniq utilities as follows:

    sort -u indexitems > indexitems.sorted
    mv indexitems.sorted indexitems
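
As a rough idea of what the listing step involves, here is a minimal sketch. It is not the original sedscr_list_index_items: the function name list_index_items is made up for illustration, and the sketch assumes that every index entry is stored in the .lyx file as a \index{term} command on a single line (as in the LyX 1.x file format).

```shell
# Hypothetical sketch, not the original sedscr_list_index_items.
# Assumption: each index entry is stored as \index{term} on one line.
list_index_items() {
    # -n suppresses the default output; the p flag prints only the lines
    # on which the substitution succeeded. The whole line is replaced by
    # the entry text followed by the ";" record delimiter.
    sed -n 's/.*\\index{\([^}]*\)}.*/\1;/p' "$1"
}

# Usage: list_index_items document.lyx > indexitems
```

If a line held more than one \index command, this sketch would report only the last one, so check the result against your own file format before relying on it.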
    
  2. Remove all previous index entries from the LyX document. You need this preliminary step because, if you forget to remove already existing index entries, a subsequent run of the awkscr_insert_index_items script may replace even the existing index terms (those already inside the LyX \index commands) with LyX \index commands. This may or may not happen, depending on the regular expressions used in the current implementation of awkscr_insert_index_items, but it is better to err on the side of caution. What will definitely happen, however, is that repeated invocations of awkscr_insert_index_items will add index entries next to the already existing ones. You will thus end up with a document that contains duplicate index entries for each index term in your indexitems file.

    Besides, there is another reason why you might want to remove all index entries from your LyX document: a LyX text cluttered with index entries may still be a breeze to read for a computer, but quite a headache to read for humans.

    To remove all index entries from a LyX document, type:

    sedscr_delete_index_items document.lyx > document-noindexitems.lyx
    

    The generated document-noindexitems.lyx will contain everything from document.lyx - except the index entries.
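
The removal step can be sketched along the same lines. Again, this is not the original sedscr_delete_index_items: the function name is hypothetical, and the sketch assumes that an index entry is an inset that opens on a line containing "\begin_inset LatexCommand \index{" and closes on the next "\end_inset" line.

```shell
# Hypothetical sketch, not the original sedscr_delete_index_items.
# Assumption: an index inset starts on a line containing
# "\begin_inset LatexCommand \index{" and ends at the next "\end_inset".
delete_index_items() {
    # Delete every line range from the opening line to the closing line.
    # Insets of other kinds do not match the opening pattern and are kept.
    sed '/\\begin_inset LatexCommand \\index{/,/\\end_inset/d' "$1"
}

# Usage: delete_index_items document.lyx > document-noindexitems.lyx
```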

  3. Create a list of all index entries to be used in the LyX document. This is the most difficult part: as said above, this problem is not trivial. We will thus content ourselves with a list of all words used in the document. Once we have all words, we can still edit the list manually and delete all unwanted entries. This is what makes this procedure semi-automatic and not automatic. The idea is that it is still better to delete 10000 lines from a 12000-line file than to insert 2000 index entries from the LyX Insert menu.

    To create a list of all words used in a LyX document, type:

    awkscr_create_index_items document.lyx > words
    

    There is even some code in awkscr_create_index_items that checks whether the current word is in some “trivia” list of trivial words and discards it. In such a case, you would call the script with two arguments, as follows:

    awkscr_create_index_items trivia document.lyx > words
    

    However, this part of the code is either too slow, or buggy, so it is commented out for the moment (feel free to send corrections or suggestions).
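
To illustrate the idea of word extraction with an optional trivia list, here is a minimal sketch. It is not the original awkscr_create_index_items: the function name create_words is made up, the sketch assumes that lines starting with a backslash or a # are LyX markup rather than text, and it only recognises ASCII letters, digits and hyphens as parts of a word. The trivia words are kept in an awk array, so each check is a single lookup.

```shell
# Hypothetical sketch, not the original awkscr_create_index_items.
# Usage: create_words document.lyx            (no trivia list)
#        create_words trivia document.lyx     (skip words listed in trivia)
create_words() {
    if [ "$#" -eq 2 ]; then trivia=$1; doc=$2; else trivia=/dev/null; doc=$1; fi
    awk '
        # First file: remember each trivial word (one word per line).
        FILENAME == ARGV[1] { skip[$0] = 1; next }
        # Assumption: lines starting with a backslash or # are LyX markup.
        /^[\\#]/ { next }
        {
            # Split the line on anything that is not an ASCII letter,
            # a digit or a hyphen, then print every non-trivial word.
            n = split($0, w, /[^A-Za-z0-9-]+/)
            for (i = 1; i <= n; i++)
                if (w[i] != "" && !(w[i] in skip)) print w[i]
        }
    ' "$trivia" "$doc"
}
```

The comparison with the trivia list is case-sensitive in this sketch, so "The" survives even if "the" is listed.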

    It is a good idea to sort your words alphabetically and delete duplicate entries, so run:

    sort -u words > words-unique
    mv words-unique words
    

    Once the list of all words of your document is created, all you have to do is open it with a text editor and delete all unwanted words, or correct the ones that are in plural, have some punctuation at the end, and so on. This is still hard if your document is large, but it is still faster than targeting the Insert menu with the mouse 8000 times (I guess each one of my 2000 index entries appears 4 times in my document, which gives me an estimate of 8000 menu selections with the mouse - unfortunately no keyboard bindings were found to work on my system).

    You should delete all lines containing characters that could be interpreted as metacharacters of regular expressions: *, +, ?, $, &, ^, \ - and probably many others. Don't try to escape them, it will not work: awkscr_create_index_items will replace the correct string with the escaped string, adding an index entry for the escaped string too! This is not what you want. What is needed here instead is a mechanism that searches for the escaped string, but replaces it with the verbatim one (i.e. the string without the escaping backslashes). This is still work to be done (FIXME).

    Practically, this restriction means that you will have to add your index entries for symbols like *, +, ?, $, &, ^, \ manually, each time after you run awkscr_create_index_items.
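
Deleting the problematic lines need not be done by hand. The following sketch separates them with grep; the function name and the .safe and .symbols file suffixes are made up for illustration, and only the seven characters named above are covered.

```shell
# Sketch: split a word list into "safe" lines and lines that contain one
# of the characters * + ? $ & ^ \ . File suffixes are illustrative only.
split_metachar_lines() {
    # Inside a bracket expression these characters stand for themselves.
    grep -v '[*+?$&^\\]' "$1" > "$1.safe"      # keep for the automatic run
    grep    '[*+?$&^\\]' "$1" > "$1.symbols"   # add these entries by hand
}

# Usage: split_metachar_lines words
```

The .symbols file then doubles as a checklist of the entries that have to be added manually.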

  4. Once you have a file, say indexitems, with all words that should appear in the Index of a LyX file, type:

    awkscr_insert_index_items indexitems - < document-noindexitems.lyx > document-indexitems.lyx
    

    to create from document-noindexitems.lyx a document with index entries (document-indexitems.lyx) for all words in indexitems.
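
For readers who want to experiment, here is a minimal sketch of an insertion pass. It is not the original awkscr_insert_index_items and makes several assumptions: the function name is made up, index terms are single words, each line of indexitems ends with a semicolon, lines starting with a backslash or # are LyX markup, and an index entry can be written as a "\begin_inset LatexCommand \index{term}" inset placed after the text line that contains the term. Words are compared as literal strings, not as regular expressions.

```shell
# Hypothetical sketch, not the original awkscr_insert_index_items.
# Usage: insert_index_items indexitems < in.lyx > out.lyx
insert_index_items() {
    awk '
        # First file: load the index terms, dropping the trailing ";".
        FILENAME == ARGV[1] {
            sub(/;[ \t]*$/, "")
            if ($0 != "") terms[$0] = 1
            next
        }
        # Assumption: lines starting with a backslash or # are LyX markup.
        /^[\\#]/ { print; next }
        {
            print
            split("", seen)                     # forget the previous line
            n = split($0, words, /[^A-Za-z0-9_-]+/)
            for (i = 1; i <= n; i++) {
                w = words[i]
                # One literal lookup per word; at most one entry per line.
                if ((w in terms) && !(w in seen)) {
                    seen[w] = 1
                    printf "\\begin_inset LatexCommand \\index{%s}\n\n\\end_inset\n", w
                }
            }
        }
    ' "$1" -
}
```

The sketch reads the document once and looks each word up in an array, so its running time does not multiply with the number of index terms. Placing the inset after the whole line rather than directly after the word is a simplification.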

Warning: Long execution time!

The current implementation of awkscr_insert_index_items takes a really long time to execute if the indexitems file is large: for 3000 words in indexitems, producing about 9000 index entries in the final document (of which 3000 are duplicates), the script may well need 1-2 hours on a Pentium 3.4 GHz - go get a cup of coffee!

Some notes on awkscr_insert_index_items's mode of operation:

Some tips regarding the (necessary) manual editing of the words file, the file output by awkscr_create_index_items above:

Finally, there are a few known limitations of the collateindex script that creates the index (see Automatic Indexing with the DocBook DSSSL Stylesheets):

Last updated Mon Sep 24 01:19:25 CEST 2007 Permalink: http://www.karakas-online.de/mySGML/lyx-automatic-index-generation.html All contents © 2002-2007 Chris Karakas