Karakas Online

6.2. Automatic Index generation

LyX provides an easy way to insert an Index entry: from the menu, choose either Insert-->"Index entry of preceding word" (which I personally find easier), or Insert-->"Index entry", then enter the required word. This method works fine - if you have a small document, with only a few keywords to insert. But what if your document has grown to hundreds of pages, with hundreds (or even thousands) of index entries to insert? See the Index of the PHP-Nuke HOWTO for an example of an Index that cannot be generated manually - unless you want to drive yourself crazy!

Clearly, for a comprehensive Index of large documents, an automatic procedure is necessary. However, the general problem of automatic Index generation is the subject of extensive (and still inconclusive) research, and I am not going to address it in its full generality here. For our purposes, even a semi-automatic procedure would be very helpful. To this end, I have created the following 4 scripts: sedscr_list_index_items, sedscr_delete_index_items, awkscr_create_index_items and awkscr_insert_index_items.

They can be used in the following semi-automatic Index generation procedure:

  1. Optional: create a list of all existing index entries in your document. This is useful because you are going to eliminate all index entries from the document in the next step: the list serves as a backup of the index entries that are currently in use, and you might want to reuse them in some later step.

    To create a list of all existing index entries in your document, type:

    sedscr_list_index_items document.lyx > indexitems

    The generated indexitems file will contain a list of all index entries in document.lyx, one index entry per line.

  2. Remove all previous index entries from the LyX document. You need this preliminary step because, if you forget to remove already existing index entries, a subsequent run of the awkscr_insert_index_items script may substitute even the existing index terms (those already inside the LyX \index commands) with LyX \index commands. This may or may not happen, depending on the regular expressions used in the current implementation of awkscr_insert_index_items, but it is better to err on the side of caution. Besides, a LyX text cluttered with index entries may still be a breeze to read for a computer, but quite a headache to read for humans.

    To remove all index entries from a LyX document, type:

    sedscr_delete_index_items document.lyx > document-noindexitems.lyx

    The generated document-noindexitems.lyx will contain everything from document.lyx - except the index entries.

  3. Create a list of all index entries to be used in the LyX document. This is the most difficult part: as said above, this problem is not trivial. We will thus content ourselves with a list of all words used in the document. Once we have all words, we can edit the list manually and delete all unwanted entries. This is what makes this procedure semi-automatic and not automatic. The idea is that it is still better to delete 10000 lines from a 12000-line word list than to insert 2000 index entries from the LyX Insert menu.

    To create a list of all words used in a LyX document, type:

    awkscr_create_index_items document.lyx > words

    There is even some code in awkscr_create_index_items that checks whether the current word is in some "trivia" list of trivial words and discards it. In such a case, you would call the script with two arguments, as follows:

    awkscr_create_index_items trivia document.lyx > words

    However, this part of the code is either too slow or buggy, so it is commented out for the moment (feel free to send corrections or suggestions).

    Once the list of all words in your document is created, all you have to do is open it with a text editor, delete all unwanted words, and correct those that are in the plural, have punctuation at the end, and so on. This is still hard work if your document is large, but it is faster than targeting the Insert menu with the mouse 8000 times. (I guess each one of my 2000 index entries appears 4 times in my document, which gives an estimate of 8000 menu selections with the mouse. Unfortunately, I found no keyboard bindings that worked on my system.)

  4. Once you have a file, say indexitems, with all words that should appear in the Index of a LyX file, type:

    awkscr_insert_index_items indexitems - < document-noindexitems.lyx > document-indexitems.lyx

    to create from document-noindexitems.lyx a document with index entries (document-indexitems.lyx) for all words in indexitems.
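To make steps 1 to 3 more concrete, here is a minimal sketch of what the three preparatory scripts could look like. It is not the code of the actual sedscr_* and awkscr_* scripts: the function names are hypothetical, and the sketch assumes that every index entry is stored as a "\begin_inset LatexCommand \index{...}" line that is closed by a later "\end_inset" line, and that every line starting with a backslash is a LyX command rather than text.

```shell
# Step 1 (sketch): print the term of every index inset, one per line.
list_index_items() {
    sed -n 's/.*\\index{\([^}]*\)}.*/\1/p' "$@"
}

# Step 2 (sketch): drop every index inset, from its \begin_inset line
# up to and including the matching \end_inset line.
delete_index_items() {
    awk '
        /^\\begin_inset LatexCommand \\index[{]/ { skipping = 1; next }
        skipping && /^\\end_inset/               { skipping = 0; next }
        !skipping                                { print }
    ' "$@"
}

# Step 3 (sketch): print every distinct word once, followed by a semicolon.
create_word_list() {
    awk '
        /^\\/ { next }                          # assumed: LyX command, not text
        {
            for (i = 1; i <= NF; i++) {
                word = $i
                sub(/^[^A-Za-z0-9]+/, "", word) # trim leading punctuation
                sub(/[^A-Za-z0-9]+$/, "", word) # trim trailing punctuation
                if (word != "" && !(word in seen)) {
                    seen[word] = 1
                    print word ";"
                }
            }
        }
    ' "$@"
}
```

Each function reads the files named as arguments, or standard input if there are none, so "delete_index_items document.lyx > document-noindexitems.lyx" mirrors the invocation in step 2. The word list is case-sensitive and keeps plurals, so the manual cleanup described in step 3 is still needed.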

Some notes on awkscr_insert_index_items's mode of operation:

The "-" in the above invocation is important: it forces the awk script to continue reading from standard input, after it has read indexitems. This, together with the code

FILENAME == "indexitems" {
        indexentry[$1] = $1
}

in awkscr_insert_index_items, causes the words in indexitems to be imported into the indexentry[] associative array.

The field separator in awkscr_insert_index_items is set to the semicolon ";", instead of the default, which is whitespace. This makes it possible to enter index entries consisting of more than one word. Accordingly, the awkscr_create_index_items script appends a semicolon at the end of each word it prints.
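The two points above (the "-" argument and the semicolon separator) can be tried out in isolation. The following is only a sketch under stated assumptions, not an excerpt from awkscr_insert_index_items: the function name find_term_lines is hypothetical, and the test "FILENAME == ARGV[1]" is a generalisation of the hard-coded "indexitems" name so that any file name works. The sketch loads the terms from the file given as first argument, then reads the document from standard input and prints only those lines that contain at least one term.

```shell
# Usage: find_term_lines TERMFILE < document
find_term_lines() {
    awk -F';' '
        # First input: the term file. With ";" as field separator,
        # $1 is the whole (possibly multi-word) term.
        FILENAME == ARGV[1] {
            if ($1 != "") indexentry[$1] = $1
            next
        }
        # Second input: standard input, selected by the "-" argument.
        {
            found = 0
            for (item in indexentry)
                if (index($0, item) > 0) found = 1
            if (found) print
        }
    ' "$1" -
}
```

Without the "-", awk would stop after the term file and never look at standard input. With the default whitespace separator, $1 of the line "index entry;" would be just "index".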

awkscr_insert_index_items follows a simple algorithm to insert the index entries at the right places in the document. To insert an index entry, we have to know what LyX environment we are in. In essence, this means we have to parse the LyX document. Since the \layout commands in the LyX file do NOT have what we would call "closing tags" in other markup languages, we cannot tell awk "if you are between the start and the end of the Paragraph environment, do the following", or anything like that: there is no easy way to find the "end" of an environment, given all the environment nestings that are possible. Luckily, another easy way exists: whenever a \layout command is encountered, we are in the environment specified by that \layout command until the next \layout command appears, so we only need to set a variable, call it layout, accordingly:

/\\layout SGML/ { layout = "SGML"; print; next }
/\\layout Chapter/ { layout = "Chapter"; print; next }
/\\layout Section/ { layout = "Section"; print; next }
/\\layout Subsection/ { layout = "Subsection"; print; next }
/\\layout Subsubsection/ { layout = "Subsubsection"; print; next }
/\\layout Standard/ { layout = "Standard"; print; next }
...and so on
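The effect of this state variable is easy to observe. The sketch below is a compact variant, not the script's own code: instead of one rule per environment it takes the layout name from the second word of the \layout line (an assumption that only holds for one-word layout names), and the function name tag_lines_with_layout is hypothetical. It prints every text line prefixed with the layout that is in force at that point.

```shell
# Usage: tag_lines_with_layout < document.lyx
tag_lines_with_layout() {
    awk '
        /^\\layout /     { layout = $2; next }   # remember the current layout
        /^\\/ || NF == 0 { next }                # skip other commands, blank lines
                         { print layout ": " $0 }
    '
}
```

Unlike the rules shown above, this sketch does not print the \layout lines themselves, because it is only meant for inspecting a document, not for rewriting it.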

Clearly, we should not insert index entries everywhere, e.g. in the "Code" environment. That's why we check if we are in the "Standard", "Itemize", "Quotation" or "Description" environment (warning: the way sedscr works currently, you should not insert index entries in the "Caption" environment) and, if we are (and only then), we substitute every word in the indexentry[] array with the word followed by the LyX "insert index entry" command:

        if (layout == "Standard" || layout == "Itemize" || layout == "Quotation" || layout == "Description") {
                for (item in indexentry) {
                        # a term at the end of the line gets an index inset appended
                        if (gsub(item "$", item "\n\\begin_inset LatexCommand \\index{" indexentry[item] "}\n\n\\end_inset \n")) { print; next }
                        # ...other substitutions here
                }
        }
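Putting the pieces together, the whole insertion step can be sketched as a single small function. This is a simplified illustration and not the real awkscr_insert_index_items: the function name is hypothetical, it uses plain string search (index and substr) instead of the gsub regular expressions shown above, and it inserts at most one index entry per line. If several terms occur in the same line, which one is used is unspecified, because awk does not define the order of "for (item in indexentry)".

```shell
# Usage: insert_index_items TERMFILE < document-noindexitems.lyx
insert_index_items() {
    awk -F';' '
        # Load the terms (one "term;" per line) from the first argument.
        FILENAME == ARGV[1] { if ($1 != "") indexentry[$1] = $1; next }

        # Track the current layout; pass all LyX commands through unchanged.
        /^\\layout / { split($0, parts, " "); layout = parts[2]; print; next }
        /^\\/        { print; next }

        # Only text lines in these layouts may receive index entries.
        layout == "Standard" || layout == "Itemize" || layout == "Quotation" || layout == "Description" {
            for (item in indexentry) {
                pos = index($0, item)
                if (pos > 0) {
                    cut = pos + length(item) - 1
                    print substr($0, 1, cut)
                    print "\\begin_inset LatexCommand \\index{" indexentry[item] "}"
                    print ""
                    print "\\end_inset "
                    rest = substr($0, cut + 1)
                    if (rest != "") print rest
                    next
                }
            }
        }

        # Everything else is copied as is.
        { print }
    ' "$1" -
}
```

Plain string search avoids surprises with terms that contain regular expression characters, but it also matches inside longer words ("module" inside "modules"), so the sketch is best read as a starting point for experiments.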

Some tips regarding the (necessary) manual editing of the words file, the file output by awkscr_create_index_items above: