LyX provides an easy way to insert an Index entry (see Section 5.20): from the menu, choose either Insert-->“Index entry of preceding word” (which I personally find easier), or Insert-->“Index entry”, then enter the required word. This method works fine if you have a small document with only a few keywords to insert. But what if your document has grown to hundreds of pages, with hundreds (or even thousands) of index entries to insert? See the Index of the PHP-Nuke HOWTO for an example of an Index that cannot be generated manually - unless you want to drive yourself crazy!
Clearly, for a comprehensive Index of a large document, an automatic procedure is necessary. However, the general problem of automatic Index generation is the subject of extensive (and still inconclusive) research and I am not going to address it in its full generality here. For our purposes, even a semi-automatic procedure would be very helpful. To this end, I have created the following four scripts:
sedscr_list_index_items: lists all index entries contained in a LyX document.
sedscr_delete_index_items: deletes all index entries from a LyX document.
awkscr_create_index_items: creates a list of words used in a LyX document. The list can be subsequently edited manually, mostly deleting unwanted or uninteresting words, to yield a list of words that are used in the document and are interesting enough to be part of its Index.
awkscr_insert_index_items: uses an externally supplied document containing a list of index entries to insert an index entry in a LyX document for every word appearing in that list.
They can be used in the following semi-automatic Index generation procedure:
Optional: create a list of all existing index entries in your document. This is useful not only because you are going to eliminate all index entries from the document in the next step, but also as a backup of the index entries that were currently in use - you might want to reuse them in some later step.
To create a list of all existing index entries in your document, type:
sedscr_list_index_items document.lyx > indexitems
The generated indexitems file will contain a list of all index entries in document.lyx, one index entry per line, with a semicolon at the end of each line. The semicolon will be used later as a field separator in the awk scripts that follow, so don't let it irritate you.
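To give a rough idea of what such a listing step involves, here is a minimal sketch. It assumes the old LyX file format, in which an index entry sits on its own line as \begin_inset LatexCommand \index{term} (the form that awkscr_insert_index_items writes). The function name list_index_items is made up for this illustration; it is not the sedscr_list_index_items script itself.

```shell
# Hypothetical stand-in for sedscr_list_index_items (NOT the original script).
# Assumption: every index entry is stored on its own line as
#   \begin_inset LatexCommand \index{term}
list_index_items() {
    # Print only the term of each matching line, followed by a semicolon.
    sed -n 's/^\\begin_inset LatexCommand \\index{\(.*\)}$/\1;/p' "$1"
}

# Small demonstration on a throwaway file.
demo=$(mktemp)
printf '%s\n' \
    'Some text about awk' \
    '\begin_inset LatexCommand \index{awk}' \
    '' \
    '\end_inset ' \
    'and about regular expressions' \
    '\begin_inset LatexCommand \index{regular expressions}' \
    '' \
    '\end_inset ' > "$demo"
list_index_items "$demo"
rm -f "$demo"
```

The demonstration prints one line per entry, each ending in a semicolon, which is the layout described above.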
To get an alphabetically sorted list of index items, without duplicate entries and with all symbols at the beginning of the list, use the sort and uniq utilities as follows:
cat indexitems | sort | uniq > indexitems.sorted
mv indexitems.sorted indexitems
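The same result can be had in one step. This is a minimal sketch, assuming a POSIX sort: -u drops duplicate lines, and -o may name the input file as the output file, so no temporary file is needed. The LC_ALL=C setting and the function name sort_unique are my additions; LC_ALL=C only makes the ordering byte-wise and reproducible, since where symbols end up depends on your locale.

```shell
# Sort a word list in place and drop duplicate lines.
# Assumption: "sort" is a POSIX sort, where -o may overwrite the input file.
sort_unique() {
    LC_ALL=C sort -u -o "$1" "$1"
}

# Small demonstration on a throwaway file.
list=$(mktemp)
printf '%s\n' 'sed;' 'awk;' 'sed;' 'lyx;' > "$list"
sort_unique "$list"
cat "$list"
rm -f "$list"
```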
Remove all previous index entries from the LyX document. You need this preliminary step because, if you forget to remove already existing index entries, a subsequent run of the awkscr_insert_index_items script may substitute even the existing index terms (those already inside the LyX \index commands) with LyX \index commands. This may or may not happen, depending on the regular expressions used in the current implementation of awkscr_insert_index_items, but it is better to err on the side of caution. What will happen, however, is that repeated invocations of awkscr_insert_index_items will add index entries besides already existing ones. You will thus end up with a document that contains double index entries for each index term in your indexitems file.
Besides, there is another reason why you might want to remove all index entries from your LyX document: a LyX text cluttered with index entries may still be a breeze to read for a computer, but quite a headache to read for humans.
To remove all index entries from a LyX document, type:
sedscr_delete_index_items document.lyx > document-noindexitems.lyx
The generated document-noindexitems.lyx will contain everything from document.lyx - except the index entries.
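As an illustration of the idea only, the deletion can be sketched with a sed address range. This assumes, as before, that each index entry starts with a line \begin_inset LatexCommand \index{...} and ends at the next line beginning with \end_inset, with no other inset nested in between. The function name delete_index_items is hypothetical and this is not the sedscr_delete_index_items script.

```shell
# Hypothetical stand-in for sedscr_delete_index_items (NOT the original script).
# Assumption: an index inset runs from "\begin_inset LatexCommand \index{"
# to the next line that starts with "\end_inset", with nothing nested inside.
delete_index_items() {
    sed '/^\\begin_inset LatexCommand \\index{/,/^\\end_inset/d' "$1"
}

# Small demonstration on a throwaway file.
demo=$(mktemp)
printf '%s\n' \
    'Some text about awk' \
    '\begin_inset LatexCommand \index{awk}' \
    '' \
    '\end_inset ' \
    'and some more text' > "$demo"
delete_index_items "$demo"
rm -f "$demo"
```

Other insets (figures, footnotes and so on) do not match the first address and are left alone.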
Create a list of all index entries to be used in the LyX document. This is the most difficult part: as said above, this problem is not trivial. We will thus content ourselves with a list of all words used in the document. Once we have all words, we can still edit the list manually and delete all unwanted entries. This is what makes this procedure semi-automatic and not automatic. The idea is that it is still better having to delete 10000 lines from a 12000 line document, than having to insert 2000 index entries from the LyX Insert menu.
To create a list of all words used in a LyX document, type:
awkscr_create_index_items document.lyx > words
There is even some code in awkscr_create_index_items that checks whether the current word is in some “trivia” list of trivial words and discards it. In such a case, you would call the script with two arguments, as follows:
awkscr_create_index_items trivia document.lyx > words
However, this part of the code is either too slow or buggy, so it is commented out for the moment (feel free to send corrections or suggestions).
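One possible way to do the trivia check cheaply is to load the trivial words into an awk associative array and test membership with the "in" operator, which is a hash lookup rather than a scan of the whole list. The sketch below is only that, a sketch: the function name create_index_items, the rule that skips lines starting with a backslash, and the punctuation trimming are my assumptions, not the code of awkscr_create_index_items.

```shell
# Hypothetical word lister with a stop-word ("trivia") filter.
# NOT the original awkscr_create_index_items; all rules below are assumptions.
# usage: create_index_items TRIVIA_FILE DOCUMENT
create_index_items() {
    awk '
        # First file: remember every trivial word as an array key.
        FILENAME == ARGV[1] { trivia[$1] = 1; next }
        # Second file: skip LyX command lines, which start with a backslash.
        /^\\/ { next }
        {
            for (i = 1; i <= NF; i++) {
                word = $i
                # Trim punctuation from both ends of the word.
                gsub(/^[^A-Za-z0-9]+|[^A-Za-z0-9]+$/, "", word)
                # Hash lookup: no loop over the trivia list is needed.
                if (word != "" && !(word in trivia))
                    print word ";"
            }
        }
    ' "$1" "$2"
}

# Small demonstration on throwaway files.
trivia=$(mktemp); doc=$(mktemp)
printf '%s\n' 'the' 'and' 'a' > "$trivia"
printf '%s\n' '\layout Standard' 'the index and the awk script.' > "$doc"
create_index_items "$trivia" "$doc"
rm -f "$trivia" "$doc"
```

The comparison is case-sensitive, so "The" would not be filtered by a trivia entry "the"; the output still needs the sorting and manual editing described in this step.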
It is a good idea to sort your words alphabetically and delete double entries, so do:
cat words | sort | uniq > words-unique
mv words-unique words
Once the list of all words of your document is created, all you have to do is open it with a text editor and delete all unwanted words, or correct the ones that are in the plural, have some punctuation at the end, and so on. This is hard work if your document is large, but it is still faster than targeting the Insert menu with the mouse 8000 times (I guess each one of my 2000 index entries appears 4 times in my document, which gives me an estimate of 8000 menu selections with the mouse - unfortunately no keyboard bindings were found to work on my system).
You should delete all lines containing characters that could be interpreted as metacharacters of regular expressions: *, +, ?, $, &, ^, \ - and probably many others. Don't try to escape them, it will not work: awkscr_create_index_items will replace the correct string with the escaped string, adding an index entry for the escaped string too! This is not what you want. What is needed here is a mechanism that searches for the escaped string, but replaces it with the verbatim one (i.e. the string without the escaping backslashes). This is still work to be done (FIXME).
Practically, this restriction means that you will have to add your index entries for symbols like *, +, ?, $, &, ^, \ manually, each time after you run awkscr_create_index_items.
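One way around the escaping problem, sketched here under my own assumptions and not taken from the scripts, is to avoid regular expressions for the search altogether: awk's index() and substr() functions work on literal strings, so a term like “.*” or “+” is found verbatim and inserted verbatim. The function name insert_literal and the environment variable INDEX_TERM are hypothetical. Note that, unlike awkscr_insert_index_items, this sketch does not check for word boundaries: it marks every occurrence of the term, even inside a longer word.

```shell
# Hypothetical literal (non-regex) insertion of an index inset after a term.
# NOT part of the original scripts. The term is passed through the environment
# because "awk -v" would interpret backslash escapes in the value.
# usage: insert_literal TERM < input
insert_literal() {
    INDEX_TERM="$1" awk '
        function mark(line, term,    out, pos, tag) {
            if (term == "") return line
            # Same inset text that awkscr_insert_index_items writes.
            tag = "\n\\begin_inset LatexCommand \\index{" term "}\n\n\\end_inset \n"
            out = ""
            # index() searches for a plain string, so no character is special.
            while ((pos = index(line, term)) > 0) {
                out = out substr(line, 1, pos + length(term) - 1) tag
                line = substr(line, pos + length(term))
            }
            return out line
        }
        { print mark($0, ENVIRON["INDEX_TERM"]) }
    '
}

# Small demonstration: the term ".*" is treated as two ordinary characters.
printf '%s\n' 'the pattern .* matches everything' | insert_literal '.*'
```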
Once you have a file, say indexitems, with all words that should appear in the Index of a LyX file, type:
awkscr_insert_index_items indexitems - < document-noindexitems.lyx > document-indexitems.lyx
to create from document-noindexitems.lyx a document with index entries (document-indexitems.lyx) for all words in indexitems.
Long execution time!
The current implementation of awkscr_insert_index_items takes a really long time to execute if the indexitems file is large: for 3000 words in indexitems, producing about 9000 index entries in the final document (of which 3000 are duplicates), the script may well need 1-2 hours on a 3.4 GHz Pentium - go get a cup of coffee!
Some notes on awkscr_insert_index_items's mode of operation:
The “-” in the above invocation is important: it forces the awk script to continue reading from standard input, after it has read indexitems. This, together with the code
FILENAME == "indexitems" {
    n++
    indexentry[$1] = $1
    next
}
in awkscr_insert_index_items, causes the words in indexitems to be imported into the indexentry[] associative array.
The field separator in awkscr_insert_index_items is set to the semicolon “;”, instead of the default, which is space. This makes it possible to enter index entries with more than one word. Accordingly, the awkscr_create_index_items script appends a semicolon at the end of each word it prints - you should leave these untouched!
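The two notes above can be tried out in isolation. The following is a minimal sketch, not the script itself: it reuses the FILENAME rule quoted above, sets the field separator with -F';', and reads the document from standard input through the “-” argument. The function name demo_two_inputs, the line counter and the END report are my additions for demonstration purposes. Because the rule compares FILENAME with the literal string "indexitems", awk has to be started in the directory that contains that file.

```shell
# Hypothetical demonstration of reading a word list first, then standard input.
# usage: demo_two_inputs DIR < document   (DIR must contain a file "indexitems")
demo_two_inputs() {
    (
        cd "$1" && awk -F';' '
            # Lines of the word list: store field 1 (the text before the ";").
            FILENAME == "indexitems" { n++; indexentry[$1] = $1; next }
            # Everything else comes from standard input, because of the "-".
            { lines++ }
            END {
                printf "%d entries, %d lines from stdin\n", n, lines
                if ("regular expressions" in indexentry)
                    print "multi-word entry kept whole"
            }
        ' indexitems -
    )
}

# Small demonstration in a throwaway directory.
workdir=$(mktemp -d)
printf '%s\n' 'awk;' 'regular expressions;' > "$workdir/indexitems"
printf '%s\n' 'first line' 'second line' | demo_two_inputs "$workdir"
rm -rf "$workdir"
```

With the default field separator, $1 for the second entry would be just “regular”; with the semicolon it is the whole phrase.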
awkscr_insert_index_items follows a simple algorithm to insert the index entries at the right places in the document: to insert an index entry, we have to know what LyX environment (Section 5.1) we are in. In essence, this means we have to parse the LyX document. Since the \layout commands in the LyX file do NOT have what we would call “closing tags” in other markup languages, we cannot tell awk “if you are between the start and the end of the Paragraph environment, do the following”, or anything like that - there is no easy way to find the “end” of an environment, given all the environment nestings that are possible. Luckily, another easy way exists: whenever a \layout command is encountered, we are in the environment specified by that \layout command, so we only need to set a variable, call it layout, accordingly:
/\\layout SGML/ { layout = "SGML"; print; next }
/\\layout Chapter/ { layout = "Chapter"; print; next }
/\\layout Section/ { layout = "Section"; print; next }
/\\layout Subsection/ { layout = "Subsection"; print; next }
/\\layout Subsubsection/ { layout = "Subsubsection"; print; next }
/\\layout Standard/ { layout = "Standard"; print; next }
...and so on
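The effect of this state variable is easy to see on a toy file. The sketch below is a variation, not the script's code: instead of one rule per environment, it takes the environment name from the second field of any line that starts with \layout (my assumption about the line format), and it prints each text line prefixed with the environment it belongs to. The function name show_layouts is made up.

```shell
# Hypothetical demonstration of tracking the current \layout in a variable.
# Assumption: the environment name is the second field of a "\layout" line.
show_layouts() {
    awk '
        BEGIN { layout = "none" }
        # A \layout line switches the current environment.
        /^\\layout / { layout = $2; next }
        # Every other non-empty line belongs to the environment last seen.
        NF { print layout ": " $0 }
    ' "$1"
}

# Small demonstration on a throwaway file.
demo=$(mktemp)
printf '%s\n' \
    '\layout Standard' \
    'Plain text here.' \
    '\layout Code' \
    'x = 1' > "$demo"
show_layouts "$demo"
rm -f "$demo"
```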
Clearly, we should not insert index entries everywhere, e.g. in the “Code” environment. That's why we check if we are in the “Standard”, “Itemize”, “Enumerate”, “Quotation” or “Description” environment (warning: the way sedscr works currently, you should not insert index entries in the “Caption” environment) and, if we are (and only then), we substitute every word in the indexentry[] array with the LyX “insert index entry” command:
{
    if ( ( layout == "Standard" ||
           layout == "Itemize" ||
           layout == "Enumerate" ||
           layout == "Quotation" ||
           layout == "Description" ) && ( inset == 0 ) ) {
        for (item in indexentry) {
            if (gsub(" " item " "," " item " \n\\begin_inset LatexCommand \\index{" indexentry[item] "}\n\n\\end_inset \n")) { continue }
            else if (gsub("^" item " "," " item " \n\\begin_inset LatexCommand \\index{" indexentry[item] "}\n\n\\end_inset \n")) { continue }
            else if (gsub(" " item "$"," " item "\n\\begin_inset LatexCommand \\index{" indexentry[item] "}\n\n\\end_inset \n")) { continue }
            else if (gsub(" " item ":"," " item ":\n\\begin_inset LatexCommand \\index{" indexentry[item] "}\n\n\\end_inset \n")) { continue }
            else if (gsub(" " item "\\."," " item ".\n\\begin_inset LatexCommand \\index{" indexentry[item] "}\n\n\\end_inset \n")) { continue }
            else if (gsub(" " item ","," " item ",\n\\begin_inset LatexCommand \\index{" indexentry[item] "}\n\n\\end_inset \n")) { continue }
            else if (gsub(" " item "\\?"," " item "?\n\\begin_inset LatexCommand \\index{" indexentry[item] "}\n\n\\end_inset \n")) { continue }
            else if (gsub(" " item ";"," " item ";\n\\begin_inset LatexCommand \\index{" indexentry[item] "}\n\n\\end_inset \n")) { continue }
            else if (gsub(" " item "\n"," " item "\n\n\\begin_inset LatexCommand \\index{" indexentry[item] "}\n\n\\end_inset \n")) { continue }
            else if (gsub(" " "\"" item "\""," " "\"" item "\"\n\\begin_inset LatexCommand \\index{" indexentry[item] "}\n\n\\end_inset \n")) { continue }
        }
        { print; next }
    }
}
Some tips regarding the (necessary) manual editing of the words file, the file output by awkscr_create_index_items above:
Leave the semicolons at the end of each line untouched! They are needed as field separators in the awk scripts.
You will see a lot of words (or their inflected forms) that are not useful. It is one thing to have a lot of words and another to have a set of really useful words and phrases. That's the price we pay for the simplicity of our method.
You may need to supply some extra terms you feel are missing from that file. Feel free to do this: awkscr_insert_index_items does not know how you created the indexitems file you give it.
Keep backups of your word lists from subsequent runs of the scripts. Combine word lists from other projects. No matter how long your word list is, only the terms that really appear somewhere will make it to the Index, so don't worry if your list is too long - given enough computing time, that is.
Take care to delete everything in your word list that looks like a regular expression with metacharacters - because it will be interpreted as such, with unpredictable results (unless you really know what you are doing). I once had “.*” on one line and I forgot to delete it. I then wondered why my document was full of index entries to “.*” while the text was almost gone! See regular expressions for a brief introduction to regular expressions.
Take out any “:”, “;”, “?” from the end of the words, as well as enclosing double quotes. Those characters are already taken care of when it comes to inserting the entries, i.e. the indexitems file should contain only the “pure” words, without any punctuation marks.
Don't leave in “config” if your LyX file contains “config.php”. If you do, the latter will look ugly in the LyX editor, as it will contain an index entry for “config” just in the middle of it. This will not affect the rendered formats, however.
Don't leave in words that might form parts of a LyX command. I once left “Enumerate” in my word list. The resulting LyX file contained an index entry for “Enumerate” in front of every item in every enumeration list! Clearly, the awk script awkscr_insert_index_items “sees” the LyX commands in the file that are invisible to you. This bug has been fixed in the current version of the script, but there may be others lurking around.
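Two of the tips above (no trailing punctuation or enclosing quotes, no regular expression metacharacters) lend themselves to a mechanical first pass before you start editing by hand. The following is only a sketch under my own assumptions: the function name clean_word_list is made up, and the set of characters treated as metacharacters is the list given earlier (*, +, ?, $, &, ^, \) plus brackets, parentheses, braces and the vertical bar, which I added. The dot is deliberately left out so that entries like “config.php” survive; a line like “.*” is still dropped because of its asterisk.

```shell
# Hypothetical pre-cleaning of a word list (one entry per line, ending in ";").
# The metacharacter set is an assumption; extend or shrink it as needed.
# usage: clean_word_list WORDS > cleaned
clean_word_list() {
    # 1. Strip a leading double quote and any trailing ":", ";", "?", ","
    #    or double quote that sits before the final semicolon.
    # 2. Drop every line that still contains a regex metacharacter.
    sed -e 's/^"//' -e 's/[":;?,]*;$/;/' "$1" |
    grep -v '[][*+?$&^\\|(){}]'
}

# Small demonstration on a throwaway file.
words=$(mktemp)
printf '%s\n' 'plain;' '"quoted";' 'what?;' '.*;' 'a+b;' 'two words:;' > "$words"
clean_word_list "$words"
rm -f "$words"
```

The stripping runs before the filter on purpose: an entry like “what?;” is repaired rather than thrown away. The dropped lines are simply gone, so keep the original list if you still want to add those symbols manually.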
Finally, there are a few known limitations of the collateindex script that creates the index (see Automatic Indexing with the DocBook DSSSL Stylesheets):
Duplicate page numbers are not suppressed in the index. If the document contains three indexing hits on page 4, the generated index will contain 4, 4, 4.
Ranges are not automatically constructed. If the document contains indexing hits on pages 4, 5, 6, and 7, the generated index will contain 4, 5, 6, 7 instead of 4-7.
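To make the two limitations concrete, here is what the missing post-processing would have to do with a list of page numbers. This is purely an illustration of the logic in awk, with a made-up function name collapse_pages; it is not part of collateindex and does not plug into it. It assumes the page numbers arrive sorted, on one line, separated by commas.

```shell
# Hypothetical illustration: suppress duplicate page numbers and build ranges.
# NOT part of collateindex. Input: sorted page numbers, comma-separated.
# usage: echo "4, 4, 5, 6, 9" | collapse_pages
collapse_pages() {
    awk -F', *' '
        NF {
            out = ""
            start = prev = $1
            # One extra pass (i == NF + 1) flushes the last range.
            for (i = 2; i <= NF + 1; i++) {
                cur = (i <= NF) ? $i : ""
                # Same page again: a duplicate, skip it.
                if (cur != "" && cur + 0 == prev + 0) continue
                # Next consecutive page: the current range grows.
                if (cur != "" && cur + 0 == prev + 1) { prev = cur; continue }
                # Otherwise the range has ended: emit it and start a new one.
                sep = (out == "") ? "" : ", "
                piece = (start == prev) ? start : start "-" prev
                out = out sep piece
                start = prev = cur
            }
            print out
        }
    '
}

# Small demonstration with the two cases described above.
echo "4, 4, 4" | collapse_pages
echo "4, 5, 6, 7" | collapse_pages
```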
| Last updated Mon Sep 24 01:19:25 CEST 2007 | Permalink: http://www.karakas-online.de/mySGML/lyx-automatic-index-generation.html | All contents © 2002-2007 Chris Karakas |