# 6.2. Automatic Index generation

LyX provides an easy way to insert an Index entry: from the menu, choose either Insert-->"Index entry of preceding word" (which I personally find easier), or Insert-->"Index entry", then enter the required word. This method works fine - if you have a small document, with only a few keywords to insert. But what if your document has grown to hundreds of pages, with hundreds (or even thousands) of index entries to insert? See the Index of the PHP-Nuke HOWTO for an example of an Index that cannot be generated manually - unless you want to drive yourself crazy!

Clearly, for a comprehensive Index of large documents, an automatic procedure is necessary. However, the general problem of automatic Index generation is the subject of extensive (and still inconclusive) research, and I am not going to address it in its full generality here. For our purposes, even a semi-automatic procedure would be very helpful. To this end, I have created the following 4 scripts:

• sedscr_list_index_items: lists all index entries contained in a LyX document.

• sedscr_delete_index_items: deletes all index entries from a LyX document.

• awkscr_create_index_items: creates a list of words used in a LyX document. The list can be subsequently edited manually, mostly deleting unwanted or uninteresting words, to yield a list of words that are used in the document and are interesting enough to be part of its Index.

• awkscr_insert_index_items: uses an externally supplied document containing a list of index entries to insert an index entry in a LyX document for every word appearing in that list.

They can be used in the following semi-automatic Index generation procedure:

1. Optional: create a list of all existing index entries in your document. This is useful not only because you are going to eliminate all index entries from the document in the next step, but also as a backup of the index entries that are currently in use - you might want to reuse them in some later step.

To create a list of all existing index entries in your document, type:

 sedscr_list_index_items document.lyx > indexitems 

The generated indexitems file will contain a list of all index entries in document.lyx, one index entry per line.
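For orientation, here is a minimal sketch of what such a listing step could look like. It is not the actual sedscr_list_index_items script. It assumes index entries are stored as the `\begin_inset LatexCommand \index{...}` lines quoted later in this section, and the sample document is made up for illustration.

```sh
# Hypothetical sample: a small fragment in the LyX format quoted in this section.
tmp=$(mktemp -d)
printf '%s\n' \
  '\layout Standard' \
  'Edit config.php' \
  '\begin_inset LatexCommand \index{config.php}' \
  '' \
  '\end_inset ' \
  'and set the prefix' \
  '\begin_inset LatexCommand \index{prefix}' \
  '' \
  '\end_inset ' > "$tmp/document.lyx"

# Print only the text between "\index{" and the closing brace, one entry per line.
listed=$(sed -n 's/^\\begin_inset LatexCommand \\index{\(.*\)}[[:space:]]*$/\1/p' "$tmp/document.lyx")
printf '%s\n' "$listed"
```

The `-n` option together with the `p` flag makes sed print only the lines on which the substitution succeeded, so everything that is not an index entry is dropped.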

2. Remove all previous index entries from the LyX document. You need this preliminary step because, if you forget to remove already existing index entries, a subsequent run of the awkscr_insert_index_items script may substitute even the existing index terms (those already inside the LyX \index commands) with LyX \index commands. This may or may not happen, depending on the regular expressions used in the current implementation of awkscr_insert_index_items, but it is better to err on the side of caution. Besides, a LyX text cluttered with index entries may still be a breeze to read for a computer, but quite a headache to read for humans.

To remove all index entries from a LyX document, type:

 sedscr_delete_index_items document.lyx > document-noindexitems.lyx 

The generated document-noindexitems.lyx will contain everything from document.lyx - except the index entries.
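The deletion step can be sketched in a similar way. The real script is a sed script. The version below uses awk instead and is only an illustration, under the assumption that every index entry occupies the lines from `\begin_inset LatexCommand \index{...}` up to the next `\end_inset`. The sample document is hypothetical.

```sh
# Hypothetical sample document with two index entries.
tmp=$(mktemp -d)
printf '%s\n' \
  '\layout Standard' \
  'Edit config.php' \
  '\begin_inset LatexCommand \index{config.php}' \
  '' \
  '\end_inset ' \
  'and set the prefix' \
  '\begin_inset LatexCommand \index{prefix}' \
  '' \
  '\end_inset ' > "$tmp/document.lyx"

# Drop every line from an index inset's opening line through its \end_inset.
awk '
  /^\\begin_inset LatexCommand \\index[{]/ { skipping = 1; next }
  skipping && /^\\end_inset/               { skipping = 0; next }
  !skipping                                { print }
' "$tmp/document.lyx" > "$tmp/document-noindexitems.lyx"

cat "$tmp/document-noindexitems.lyx"
```

Other insets (figures, references and so on) are left alone, because the skip flag is only set by a line that opens an `\index` inset.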

3. Create a list of all index entries to be used in the LyX document. This is the most difficult part: as said above, this problem is not trivial. We will thus content ourselves with a list of all words used in the document. Once we have all words, we can still edit the list manually and delete all unwanted entries. This is what makes this procedure semi-automatic and not automatic. The idea is that it is still better having to delete 10000 lines from a 12000 line document, than having to insert 2000 index entries from the LyX Insert menu.

To create a list of all words used in a LyX document, type:

 awkscr_create_index_items document.lyx > words 

There is even some code in awkscr_create_index_items that checks whether the current word is in some "trivia" list of trivial words and discards it. In such a case, you would call the script with two arguments, as follows:

 awkscr_create_index_items trivia document.lyx > words 

However, this part of the code is either too slow or buggy, so it is commented out for the moment (feel free to send corrections or suggestions).

Once the list of all words of your document is created, all you have to do is open it with a text editor, delete all unwanted words, and correct the ones that are in plural, have some punctuation at the end, and so on. This is still hard work if your document is large, but it is faster than targeting the Insert menu with the mouse 8000 times (I guess each one of my 2000 index entries appears 4 times in my document, which gives me an estimate of 8000 menu selections with the mouse - unfortunately no keyboard bindings were found to work on my system).
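To make the word-list step more concrete, here is a rough sketch of how a list of distinct words could be produced. It is not the actual awkscr_create_index_items script. It assumes that lines starting with a backslash are LyX commands to be skipped and that a word is simply a run of ASCII letters. The trailing semicolon follows the convention described for the scripts in this section.

```sh
# Hypothetical two-line input: one LyX command line and one line of text.
words=$(printf '%s\n' '\layout Standard' 'Modules and blocks: the modules, again.' |
  awk '
    /^\\/ { next }                     # skip LyX command lines
    {
      gsub(/[^A-Za-z]+/, " ")          # turn runs of non-letters into spaces
      n = split($0, w, " ")            # split the line into words
      for (i = 1; i <= n; i++) seen[w[i]] = 1
    }
    END { for (word in seen) print word ";" }   # one distinct word per line
  ' | LC_ALL=C sort)

printf '%s\n' "$words"
```

The comparison is case-sensitive, so "Modules" and "modules" both end up in the list. That is exactly the kind of duplicate you would remove during the manual editing pass.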

4. Once you have a file, say indexitems, with all words that should appear in the Index of a LyX file, type:

 awkscr_insert_index_items indexitems - < document-noindexitems.lyx > document-indexitems.lyx 

to create from document-noindexitems.lyx a document with index entries (document-indexitems.lyx) for all words in indexitems.

Some notes on awkscr_insert_index_items's mode of operation:

The "-" in the above invocation is important: it forces the awk script to continue reading from standard input, after it has read indexitems. This, together with the code

    FILENAME == "indexitems" {
        n++
        indexentry[$1] = $1
        next
    }

in awkscr_insert_index_items, causes the words in indexitems to be imported into the indexentry[] associative array.
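The following self-contained sketch shows this "word list first, then standard input" idiom in isolation. The sample entries, the `lines` counter and the summary printed at the end are made up for illustration. Only the `FILENAME == "indexitems"` rule follows the script.

```sh
# Hypothetical word list, one entry per line, each terminated by a semicolon.
tmp=$(mktemp -d)
printf '%s\n' 'module;' 'admin panel;' > "$tmp/indexitems"

# awk reads indexitems first, then "-" (standard input, here one line of text).
loaded=$(cd "$tmp" && printf '%s\n' 'some document text' |
  awk -F';' '
    FILENAME == "indexitems" { n++; indexentry[$1] = $1; next }  # word list only
    { lines++ }                                                   # document lines
    END { print n " entries, " lines " document lines" }
  ' indexitems -)

printf '%s\n' "$loaded"
```

Because of the `next` in the first rule, lines of the word list never reach the rules meant for the document. Note that the comparison is against the literal name "indexitems", so the file has to be passed under exactly that name.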

The field separator in awkscr_insert_index_items is set to the semicolon ";", instead of the default, which is whitespace. This makes it possible to enter index entries consisting of more than one word. Accordingly, the awkscr_create_index_items script appends a semicolon at the end of each word it prints.

awkscr_insert_index_items follows a simple algorithm to insert the index entries at the right places in the document: to insert an index entry, we have to know what LyX environment we are in. In essence, this means we have to parse the LyX document. Since the \layout commands in the LyX file do NOT have what we would call "closing tags" in other markup languages, we cannot tell awk "if you are between the start and the end of the Paragraph environment, do the following", or anything like that - there is no easy way to find the "end" of an environment, given all the environment nestings that are possible. Luckily, another easy way exists: whenever a \layout command is encountered, we are in the environment specified by that \layout command, so we only need to set a variable, call it layout, accordingly:

    /\\layout SGML/          { layout = "SGML"; print; next }
    /\\layout Chapter/       { layout = "Chapter"; print; next }
    /\\layout Section/       { layout = "Section"; print; next }
    /\\layout Subsection/    { layout = "Subsection"; print; next }
    /\\layout Subsubsection/ { layout = "Subsubsection"; print; next }
    /\\layout Standard/      { layout = "Standard"; print; next }
    # ...and so on

Clearly, we should not insert index entries everywhere, e.g. in the "Code" environment. That's why we check if we are in the "Standard", "Itemize", "Quotation" or "Description" environment (warning: the way sedscr works currently, you should not insert index entries in the "Caption" environment) and, if we are (and only then), we substitute every word in the indexentry[] array with the LyX "insert index entry" command:

    {
        if (layout == "Standard" || layout == "Itemize" ||
            layout == "Quotation" || layout == "Description") {
            for (item in indexentry) {
                if (gsub(item "\$", item "\n\\begin_inset LatexCommand \\index{" indexentry[item] "}\n\n\\end_inset \n")) {
                    print; next
                }
                # ...other substitutions here
            }
        }
    }
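Putting the pieces together, here is a much reduced sketch of the layout-tracking idea. It is not the real awkscr_insert_index_items. It assumes a single generic `\layout` rule instead of one rule per environment, treats only "Standard" as an indexable environment, matches a word only at the end of a line, and emits the inset with plain print statements instead of gsub. The sample files are hypothetical.

```sh
tmp=$(mktemp -d)
printf '%s\n' 'module;' > "$tmp/indexitems"
printf '%s\n' '\layout Code' 'load module' \
              '\layout Standard' 'Activate the module' > "$tmp/in.lyx"

result=$(cd "$tmp" && awk -F';' '
  # 1. Load the word list into the indexentry[] array.
  FILENAME == "indexitems" { indexentry[$1] = $1; next }

  # 2. Remember the environment named by the most recent \layout command.
  /^\\layout / { layout = $0; sub(/^\\layout /, "", layout); print; next }

  # 3. Only inside an indexable environment, append an index inset after
  #    every line that ends with one of the listed words.
  layout == "Standard" {
    print
    for (item in indexentry)
      if ($0 ~ (item "$")) {
        print "\\begin_inset LatexCommand \\index{" indexentry[item] "}"
        print ""
        print "\\end_inset "
      }
    next
  }

  # 4. Everything else is copied through unchanged.
  { print }
' indexitems - < in.lyx)

printf '%s\n' "$result"
```

In this sample the word "module" occurs in both environments, but only the occurrence under `\layout Standard` receives an index inset. The line under `\layout Code` is copied through untouched.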

Some tips regarding the (necessary) manual editing of the words file, the file output by awkscr_create_index_items above:

• You will see a lot of words (or their inflected forms) that are not useful. It is one thing to have a lot of words and another to have a set of really useful words and phrases. That's the price we pay for the simplicity of our method.

• You may need to supply some extra terms you feel are missing from that file. Feel free to do this: awkscr_insert_index_items does not know how you created the indexitems file you give it.

• Keep backups of your word lists from subsequent runs of the scripts. Combine word lists from other projects. No matter how long your word list is, only the terms that really appear somewhere will make it to the Index, so don't worry if your list is too long - given enough computing time, that is.

• Take care to delete everything in your word list that looks like a regular expression with metacharacters - because it will be interpreted as such, with unpredictable results (unless you really know what you are doing). I once had ".*" on one line and I forgot to delete it. I then wondered why my document was full of index entries for ".*" while the text was almost gone! See regular expressions for a brief introduction.

• Take out any ":", ";", "?" from the end of the words, as well as enclosing double quotes. Those characters are already taken care of when it comes to inserting the entries, i.e. the indexitems file should contain only the "pure" words, without any punctuation marks.

• Don't leave in "config" if your LyX file contains "config.php". If you do, the latter will look ugly in the LyX editor, as it will contain an index entry for "config" just in the middle of it. This will not affect the rendered formats, however.

• Don't leave in words that might form parts of a LyX command. I once left "Enumerate" in my word list. The resulting LyX file contained an index entry for "Enumerate" in front of every item in every enumeration list! Clearly, the awk script "sees" the LyX commands in the file that are invisible to you.
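Some of these checks can be partly mechanized. The sketch below is not part of the four scripts. It is a suggestion that assumes one entry per line terminated by a semicolon, and it flags the entries that contain characters with a special meaning in regular expressions, so that you can review them by hand. The sample entries are hypothetical.

```sh
# Hypothetical word list containing a few risky entries.
tmp=$(mktemp -d)
printf '%s\n' 'module;' '.*;' 'config.php;' 'why?;' 'admin panel;' > "$tmp/indexitems"

# Strip the trailing field separator, then report (with line numbers) every
# entry containing a regular-expression metacharacter.
suspicious=$(sed 's/;$//' "$tmp/indexitems" | grep -n '[].*+?^$(){}|\\[]')

printf '%s\n' "$suspicious"
```

Note that an entry such as "config.php" is flagged too, because the dot matches any character when the entry is used as a regular expression. Whether that matters for your document is a judgment call, so the check only reports and does not delete anything.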