[article AutoIndex
[quickbook 1.4]
[copyright 2008 John Maddock]
[license
Distributed under the Boost Software License, Version 1.0.
(See accompanying file LICENSE_1_0.txt or copy at
[@http://www.boost.org/LICENSE_1_0.txt])
]
[authors [Maddock, John]]
[/last-revision $Date: 2008-11-04 17:11:53 +0000 (Tue, 04 Nov 2008) $]
]
[section:overview Overview]
AutoIndex is a tool for taking the grunt work out of indexing a
Quickbook\/Boostbook\/Docbook document that describes C\/C++ code.
Traditionally, in order to index a Docbook document you would
have to manually add a large amount of `<indexterm>` markup:
in fact one `<indexterm>` for each occurrence of each term to be
indexed.
Instead AutoIndex will scan one or more C\/C++ header files
and extract all the ['function], ['class], ['macro] and ['typedef]
names that are defined by those headers, and then insert the
`<indexterm>`'s into the XML document for you.
AutoIndex creates index entries as follows - for each occurrence of
each search term, it creates two index entries - one has the search term
as the primary index key and the title of the section it appears in as
a subterm, the other has the section title as the main index entry and the
search term as the subentry. Thus the user has two chances to find what they're
looking for, based upon either the section name or the ['function], ['class], ['macro]
or ['typedef] name. Note that this behaviour can be changed so that only one
index entry is created - using the search term as the key - and not using the
section name except as a sub-entry of the search term.
So for example in Boost.Math the class name `students_t_distribution` has a primary
entry that lists all sections it appears in:
[$../students_t_eg_1.png]
Then those sections also have primary entries, which list all the search terms those
sections contain:
[$../students_t_eg_2.png]
Of course these automated index entries may not be quite
what you're looking for: often you'll get a few spurious entries, a few missing entries,
and a few entries where the section name used as an index entry is less than ideal.
So AutoIndex provides some powerful regular expression based rules that allow you
to add, remove, constrain, or rewrite entries. Normally just a few lines in
AutoIndex's script file are enough to tailor the output to match the author's
expectations.
AutoIndex also supports multiple indexes (as does Docbook), and since it knows
which search terms are ['function], ['class], ['macro] or ['typedef] names, it
can add the necessary attributes to the XML so that you can have separate
indexes for each of these different types. These specialised indexes only contain
entries for the ['function], ['class], ['macro] or ['typedef] names: ['section
names] are never used as primary index terms here, unlike the main "include everything"
index.
While the Docbook XSL stylesheets create nice indexes complete with page
numbers for PDF output, the HTML indexes are much less satisfactory, as these use
section titles in place of page numbers... but as AutoIndex uses section titles
as index entries this leads to a lot of repetition, so as an alternative AutoIndex
can be instructed to construct the index itself. This is faster than using
the XSL stylesheets, and now each index entry is a hyperlink to the
appropriate section:
[$../students_t_eg_3.png]
With internal index generation there is also a helpful navigation bar
at the start of each Index:
[$../students_t_eg_4.png]
Finally, you can choose what kind of XML container wraps an internally generated index -
this defaults to `<section>...</section>` but you can use either command line options
or Boost.Build Jamfile features, to select an alternative wrapper - for example "appendix"
or "chapter" would be good choices, whatever fits best into the flow of the
document. You can even set the container wrapper to type "index" provided you turn
off index generation by the XSL stylesheets, for example by setting the following
build requirements in the Jamfile:
[pre
<format>html:<auto-index-internal>on # Use internally generated indexes
<auto-index-type>index # Use <index>...</index> as the XML wrapper
<format>html:<xsl:param>generate.index=0 # Don't let the XSL stylesheets generate indexes.
]
[endsect]
[section:tut Getting Started and Tutorial]
[h4 Step 1: Build the tool]
[/ [note This step is strictly optional, but can speed up build times.]]
cd into `tools/auto_index/build` and invoke bjam as:
bjam release
Optionally pass the name of the compiler toolset you want to use to bjam as well:
bjam release gcc
Now open up your user-config.jam file and at the end add the line:
[pre
using auto-index : ['full-path-of-executable] ;
]
[note
This declaration must go towards the end of user-config.jam, or in any case after the Boostbook initialisation.
Also note that Windows users must use forward slashes in the paths in user-config.jam.]
Finally note that `tools/auto_index/auto-index.jam` gets copied into the same directory as the rest of the Boost.Build tools
(under `tools/build/v2/tools` in your main Boost tree): this is a temporary fix that will go away
if the tool is accepted into Boost.
[h4 Step 2: Configure Boost.Build]
Assuming you have a Jamfile for building your documentation that looks
something like:
[pre
boostbook standalone
:
type_traits
:
# build requirements go here:
;
]
Then add the line:
[pre using auto-index ; ]
to the start of the Jamfile, and then add whatever auto-index options
you want to the build requirements section, for example:
[pre
boostbook standalone
:
type_traits
:
# build requirements go here:
# this one turns on indexing:
<auto-index>on
# choose indexing method for pdf's:
<format>pdf:<auto-index-internal>off
# choose indexing method for html:
<format>html:<auto-index-internal>on
# set the name of the script file to use:
<auto-index-script>index.idx
;
]
The available options are:
[variablelist
[[<auto-index>off/on][Turns indexing of the document on, defaults to
"off", so be sure to set this if you want AutoIndex invoked!]]
[[<auto-index-internal>off/on][Chooses whether AutoIndex creates the index
itself (feature on), or whether it simply inserts the necessary DocBook
markup so that the DocBook XSL stylesheets can create the index. Defaults to "off".]]
[[<auto-index-script>filename][Specifies the name of the script to load.]]
[[<auto-index-no-duplicates>off/on][When "on" AutoIndex will only index a term
once in any given section, otherwise (the default) multiple index entries per
term may be created if the term occurs more than once in the section.]]
[[<auto-index-section-names>off/on][When "on" AutoIndex will create two
index entries for each term found - one uses the term itself as the primary
index key, the other uses the enclosing section name. When "off" the index
entry that uses the section title is not created. Defaults to "on".]]
[[<auto-index-verbose>off/on][Defaults to "off". When turned on AutoIndex
prints progress information - generally useful only for debugging purposes.]]
[[<auto-index-prefix>filename][Specifies a directory to apply as a prefix to all relative file paths in the script file.]]
[[<auto-index-type>element-name][Specifies the name of the XML element to enclose internally generated indexes in:
defaults to "section", but could equally be "appendix" or "chapter" or some other block level element that has a formal title.
The actual list of available options depends upon the document type: the following table gives the available options.]]
]
[table
[[Document Type][Available Index Types]]
[[book][appendix index article chapter reference part]]
[[article][section appendix index sect1]]
[[library][See Chapter]]
[[chapter][section index sect1]]
[[part][appendix index article chapter reference]]
[[appendix][section index sect1]]
[[preface][section index sect1]]
[[qandadiv][N/A: an index would have to be placed within a subsection of the document.]]
[[qandaset][N/A: an index would have to be placed within a subsection of the document.]]
[[reference][N/A: an index would have to be placed within a subsection of the document.]]
[[set][N/A: an index would have to be placed within a subsection of the document.]]
]
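
As a rough sketch of how several of these options might be combined, the build requirements below index each term at most once per section, suppress the entries keyed on section titles, and wrap internally generated HTML indexes in an appendix. The script name `index.idx` and the target name `type_traits` are simply carried over from the example above; the particular combination of features is an illustrative assumption rather than a recommended configuration:

[pre
boostbook standalone
:
type_traits
:
<auto-index>on
<auto-index-script>index.idx
# index each term only once per section:
<auto-index-no-duplicates>on
# do not create the extra entries keyed on section titles:
<auto-index-section-names>off
# wrap internally generated HTML indexes in an appendix:
<format>html:<auto-index-internal>on
<format>html:<auto-index-type>appendix
;
]
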
It is possible to make the use of auto-index optional in Boost.Build, to allow
users who do not have auto-index installed to build your documentation. One
method of setting up optional auto-index support is to place all auto-index
configuration in the body of a bjam if statement:
if --enable-index in [ modules.peek : ARGV ]
{
using auto-index ;
project : requirements
<auto-index>on
<auto-index-script>index.idx
#... other auto-index options here...
;
}
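
With a guard like this in place, indexing would be requested by passing the option tested in the if statement on the bjam command line; without it the documentation builds with no index. A minimal sketch, assuming the Jamfile fragment above:

    bjam --enable-index release
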
[h4 Step 3: Add indexes to your documentation]
To add a single index to a BoostBook\/Docbook document, add
`<index/>` at the location where you want the index to appear. The
index will be rendered as a separate section when the documentation
is built.
To add multiple indexes, give each one a title and set its
`type` attribute to specify which terms will be included, for example
to place the ['function], ['class], ['macro] or ['typedef] names
indexed by ['auto_index] in separate indexes along with a main
"include everything" index as well, one could add:
[pre
<index type\="class_name">
<title>Class Index<\/title>
<\/index>
<index type\="typedef_name">
<title>Typedef Index<\/title>
<\/index>
<index type\="function_name">
<title>Function Index<\/title>
<\/index>
<index type\="macro_name">
<title>Macro Index<\/title>
<\/index>
<index\/>
]
[note Multiple indexes like this only work correctly if you tell the XSL stylesheets
to honor the "type" attribute on each index as by default ['[*they do not do this]].
You can turn the feature on by adding `<xsl:param>index.on.type=1` to your project's
requirements in the Jamfile.]
In quickbook, you add the same markup but enclose it in an escape:
'''<index/>'''
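
The same applies to the multiple index markup. A minimal sketch, assuming the escape is allowed to span several lines and using just one of the index types shown above alongside the main index, might be:

    '''
    <index type="class_name">
    <title>Class Index</title>
    </index>
    <index/>
    '''
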
If you are using auto-index's internal index generation (usually recommended for HTML output)
then you can also decide what kind of XML wrapper the generated index is placed in.
By default this is a `<section>...</section>` XML block (this replaces the original
`<index>...</index>` block). However, depending upon the structure of the document
and whether or not you want the index on a separate page - or else on the front page after
the TOC - you may want to place the index inside a different type of XML block. For example
if your document uses `<chapter>` top level content rather than `<section>`'s then
it may be preferable to place the index in a `<chapter>` or `<appendix>` block.
You can also place the index inside an `<index>` block if you prefer, in which case the index
does not appear on a page of its own, but after the TOC in the HTML output.
You control the type of XML block used by setting the =<auto-index-type>element-name=
attribute in the Jamfile, or via the `index-type=element-name` command line option to
auto-index itself. For example, to place the index in an appendix your Jamfile might
look like:
[pre
using quickbook ;
using auto-index ;
xml type_traits : type_traits.qbk ;
boostbook standalone
:
type_traits
:
# indexing is on:
<auto-index>on
# PDF's rely on the XSL stylesheets to generate the index:
<format>pdf:<auto-index-internal>off
# HTML output uses auto-index to generate the index:
<format>html:<auto-index-internal>on
# Name of script file to use:
<auto-index-script>index.idx
# Set the XML wrapper for HTML indexes to "appendix":
<format>html:<auto-index-type>appendix
# Turn on multiple index support:
<xsl:param>index.on.type=1
;
]
[h4 Step 4: Create the script file]
AutoIndex works by reading a script file that tells it what to index.
At its simplest it will scan one or more headers for terms that
should be indexed in the documentation. So for example to scan
"myheader.hpp" the script file would just contain:
!scan myheader.hpp
Or we can recursively scan through directories looking for all
the files to scan whose name matches a particular regular expression:
[pre !scan-path "..\/..\/..\/..\/boost\/math" ".*\.hpp" true ]
Note how each argument is whitespace separated and can be optionally
enclosed in "double quotes". The final ['true] argument indicates
that subdirectories in `../../../../boost/math` should be searched
in addition to that directory.
Often the ['scan] or ['scan-path] rules will bring in too many terms
to search for, so we need to be able to exclude terms as well:
!exclude type
Which excludes the term "type" from being indexed.
We can also add terms manually:
foobar
will index occurrences of "foobar" and:
foobar \<\w*(foo|bar)\w*\>
will index any whole word containing either "foo" or "bar" within it,
this is useful when you want to index a lot of similar or related
words under one entry, for example:
reflex
Will only index occurrences of "reflex" as a whole word, but:
reflex \<reflex\w*\>
will index occurrences of "reflex", "reflexing" and
"reflexed" all under the same entry ['reflex].
This inclusion rule can also restrict the term to
certain sections, and add an index category that
the term should belong to (so it only appears in certain
indexes).
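
For instance, here is a sketch of such a rule, with the fields in the order term, search expression, section constraint, category. The section ID prefix `mylib.examples` and the category name `examples_index` are hypothetical:

    myclass "" "mylib.examples.*" examples_index

This would index whole-word occurrences of "myclass" only in sections whose ID begins "mylib.examples", and would assign those entries to the index category `examples_index`.
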
Finally the script can add rewrite rules, that rename section names
that are automatically used as index entries. For example we might
want to remove leading "A" or "The" prefixes from section titles
when AutoIndex uses them as an index entry:
!rewrite-name "(?i)(?:A|The)\s+(.*)" "\1"
[h4 Step 5: Add Manual Index Entries - Optional]
If you add manual `<indexterm>` markup to your Docbook XML then these will be
passed through unchanged. Please note however, that if you are using
auto-index's internal index generation then it only recognises
`<primary>` and `<secondary>` elements within the `<indexterm>`.
`<tertiary>`, `<see>` and `<seealso>` elements are not currently recognised
and auto-index will emit a warning if these are used. Likewise none of the
attributes which can be applied to these elements are used when
auto-index generates the index itself, with the exception of the "type" attribute.
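
As an illustration, a manual entry of the kind that internal index generation can handle might look like the following sketch. The term names are hypothetical, and the `type` value assumes that an index with a matching `type` attribute exists in the document:

    <indexterm type="class_name">
    <primary>my_class</primary>
    <secondary>construction</secondary>
    </indexterm>

In Quickbook source this markup would need to be enclosed in an escape, in the same way as the `<index/>` element in Step 3.
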
[h4 Step 6: Build the Docs]
Make sure that auto-index.jam is in your BOOST_BUILD_PATH, by either
setting the environment variable BOOST_BUILD_PATH to point to the directory
containing it, or by copying the file into
`boost-root/tools/build/v2/tools`. Then you build the docs with either:
bjam release
To build the html docs or:
bjam pdf release
To build the pdf.
During the build process you should see AutoIndex emit a message
such as:
[pre Indexing 990 terms... ]
If you don't see that, or if it's indexing 0 terms then something is wrong!
[h4 Step 7: Iterate]
Creating a good index is an iterative process: often the first step is
just to add a header scanning rule to the script file and then generate
the documentation and see:
* What's missing.
* What's been included that shouldn't be.
* What's been included under a poor name.
Further rules can then be added to the script to handle these cases
and the next iteration examined, and so on.
[tip If you don't understand why a particular term is present in the index, try adding a ['!debug regular-expression]
directive to the [link autoindex.script_ref script file].]
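
To make this concrete, a script file after a couple of iterations might look something like the sketch below. The header name, the excluded terms and the debugged term are all hypothetical; only the directives themselves come from this documentation:

    # Start by scanning the library's header for terms:
    !scan myheader.hpp
    # "type" and "value" turned out to be too generic to index:
    !exclude type value
    # Drop leading articles from section titles used as index entries:
    !rewrite-name "(?i)(?:A|An|The)\s+(.*)" "\1"
    # Temporarily report why "foobar" is being indexed:
    !debug foobar
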
[endsect]
[section:script_ref Script File Reference]
The following elements can occur in a script:
[h4 Comments and blank lines]
Blank lines consisting of only whitespace are ignored, so are lines that start with a '#'.
[h4 Simple Inclusions]
term [regular-expression1 [regular-expression2 [category]]]
[variablelist
[[term][The term to index: this will form a primary entry in the Index
with the section title(s) containing the term as secondary entries, and
also will be used as a secondary entry beneath each of the section
titles that the term occurs in.]]
[[regular-expression1][An optional regular expression: each occurrence
of the regular expression in the text of the document will result
in one index term being emitted.
If the regular expression is omitted or is "", then the ['term] itself
will be used as the search text - and only occurrences of whole words matching
['term] will be indexed.]]
[[regular-expression2][A constraint that specifies which sections are
indexed for ['term]: only if the ID of the section matches
['regular-expression2] exactly will that section be indexed for occurrences
of ['term].
For example:
`myclass "" "mylib.examples.*"`
Will index occurrences of "myclass" as a whole word only in sections
whose ID begins "mylib.examples", while:
`myclass "" "(?!mylib.introduction.*).*"`
will index occurrences of "myclass" in any section, except those whose
ID's begin "mylib.introduction".
If this field is omitted or is "", then all sections are indexed for this term.]]
[[category][Optionally an index category to place occurrences of
['term] in. If you have multiple indexes then this is the name
assigned to the index's "type" attribute.
]]
]
[h4 Source File Scanning]
!scan source-file-name
Scans the C\/C++ source file ['source-file-name] for definitions of
['function]'s, ['class]'s, ['macro]'s or ['typedef]'s and makes each of
these a term to be indexed. Terms found are assigned to the index category
"function_name", "class_name", "macro_name" or "typedef_name" depending
on how they were seen in the source file. These may then be included
in a specialised index whose "type" attribute has the same category name.
[important
When actually indexing a document, the scanner will not index just any old occurrence of the
terms found in the source files. Instead it searches for class definitions or function or
typedef declarations. This reduces the number of spurious matches placed in the index, but
may also miss some legitimate terms: refer to the /define-scanner/ command for information on how to
change this.
]
[h4 Directory and Source File Scanning]
!scan-path directory-name file-name-regex [recurse]
[variablelist
[[directory-name][The directory to scan: this should be a path relative
to the script file (or to the path specified with the prefix=path option on the command line)
and should use all forward slashes in its file name.]]
[[file-name-regex][A regular expression: any file in the directory whose name
matches the regular expression will be scanned for terms to index.]]
[[recurse][An optional boolean value - either "true" or "false" - that
indicates whether to recurse into subdirectories. This defaults to "false"]]
]
[h4 Excluding Terms]
!exclude term-list
Excludes all the terms in whitespace separated ['term-list] from being indexed.
This should be placed /after/ any ['!scan] or ['!scan-path] rules which may
result in the terms becoming included. In other words this removes terms from
the scanner's internal list of things to index.
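
For example, the following sketch scans a directory tree and then removes three terms from the resulting list. The directory and the term names are hypothetical:

    !scan-path "../include/mylib" ".*\.hpp" true
    !exclude type value_type detail
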
[h4 Rewriting Section Names]
[pre !rewrite-id regular-expression new-name]
[variablelist
[[regular-expression][A regular expression: all section ID's that match
the expression exactly will have index entries ['new-name] instead of
their title(s).]]
[[new-name][The name that the section will appear under in the index.]]
]
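
For example, assuming a document whose example sections all have IDs beginning `mylib.examples.` (a hypothetical naming scheme), the following sketch would list every such section under the single name "Examples":

    !rewrite-id "mylib\.examples\..*" "Examples"
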
!rewrite-name regular-expression format-text
[variablelist
[[regular-expression][A regular expression: all sections whose titles
match the regular expression exactly, will have index entries composed
of the regular expression match combined with the regex format string
['format-text].]]
[[format-text][The Perl-style format string used to reformat the title.]]
]
For example:
[pre
!rewrite-name "(?:A|An|The)\s+(.*)" "\1"
]
Will remove any leading "A", "An" or "The" from all index entries - thus preventing lots of
entries under "The" etc!
[h4 Defining or Changing the File Scanners]
!define-scanner type file-search-expression xml-regex-formatter term-formatter id-filter filename-filter
When a source file is scanned using the =!scan= or =!scan-path= rules, then the file is searched using
a series of regular expressions to look for classes, functions, macros or typedefs that should be indexed.
A set of default regular expressions are provided for this (see below), but sometimes you may want to replace
the defaults, or add new scanners. The arguments to this rule are:
[variablelist
[[type][The ['type] to which items found using this rule will be assigned. Index terms created from the
source file and then found in the XML will have the type attribute set to this value, and may then appear in a
specialized index with the same type attribute.]]
[[file-search-expression][A regular expression that is used to scan the source file for index terms, the result of
a match against this expression will be transformed by the next two arguments.]]
[[xml-regex-formatter][A regular expression format string that extracts the salient information from whatever
matched the ['file-search-expression] in the source file, and creates ['a new regular expression] that will
be used to search the document being indexed for occurrences of this index term.]]
[[term-formatter][A regular expression format string that extracts the salient information from whatever
matched the ['file-search-expression] in the source file, and creates the index term that will appear in
the index.]]
[[id-filter][Optional. A regular expression that restricts the section-id's that are searched in the document being indexed:
only sections whose ID attribute matches this expression exactly will be considered for indexing terms found by this scanner.]]
[[filename-filter][Optional. A regular expression that restricts which files are scanned by this scanner: only files whose file name
matches this expression exactly will be scanned for index terms to use. Note that the filename matched against this may
well be an absolute path, and contain either forward or backward slash path separators.]]
]
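
As an illustration of how these arguments fit together, here is an untested sketch of an additional scanner modelled on the default ['macro_name] scanner listed below. The type name `namespace_name` is an assumption made for this example and is not one of the built-in types:

    !define-scanner namespace_name "^\s*namespace\s+(\w+)" "\\<\1\\>" "\1"

Here the first expression finds namespace declarations in the scanned source, the second argument turns each captured name into a whole-word search of the document, and the third uses the captured name itself as the index term. Terms found this way could then be collected in an index declared with `type="namespace_name"`. Because this type differs from the four default type names, the default scanners would still be installed.
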
If, when the first file is scanned, there are no scanners whose ['type] is "class_name", "typedef_name", "macro_name" or
"function_name", then the defaults are installed. These are equivalent to:
!define-scanner class_name "^[[:space:]]*(template[[:space:]]*<[^;:{]+>[[:space:]]*)?(class|struct)[[:space:]]*(\<\w+\>([[:blank:]]*\([^)]*\))?[[:space:]]*)*(\<\w*\>)[[:space:]]*(<[^;:{]+>)?[[:space:]]*(\{|:[^;\{()]*\{)" "(?:class|struct)[^;{]+\\<\5\\>[^;{]+\\{" \5
!define-scanner typedef_name "typedef[^;{}#]+?(\w+)\s*;" "typedef[^;]+\\<\1\\>\\s*;" "\1"
!define-scanner "macro_name" "^\s*#\s*define\s+(\w+)" "\\<\1\\>" "\1"
!define-scanner "function_name" "\w+\s+(\w+)\s*\([^\)]*\)\s*[;{]" "\\<\\w+\\>\\s+\\<\1\\>\\s*\\([^;{]*\\)\\s*[;{]" "\1"
Note that these defaults are not installed if you have provided your own versions with these ['type] names. In this case if
you want the default scanners to be in effect as well as your own, you should include the above in your script file.
It is also perfectly allowable to have multiple scanners with the same ['type], but with the other fields differing.
Finally you should note that the default scanners are quite strict in what they will find, for example the class
scanner will only create index entries for classes that have class definitions of the form:
class my_class : public base_classes
{
// etc
in the documentation, so that simple mentions of the class name will ['not] get indexed, only the class synopsis if there is one.
If this isn't how you want things, then include the ['class_name] scanner definition above in your script file, and change
the ['xml-regex-formatter] field to something more permissive, for example:
!define-scanner class_name "^[[:space:]]*(template[[:space:]]*<[^;:{]+>[[:space:]]*)?(class|struct)[[:space:]]*(\<\w+\>([[:blank:]]*\([^)]*\))?[[:space:]]*)*(\<\w*\>)[[:space:]]*(<[^;:{]+>)?[[:space:]]*(\{|:[^;\{()]*\{)" "\\<\5\\>" \5
Will look for ['any] occurrence of whatever class names the scanner may find in the documentation.
[h4 Debugging]
If you see a term in the index, and you don't understand why it's there, add a ['debug] directive:
[pre
!debug regular-expression
]
Now, whenever ['regular-expression] matches either the found index term, or the section title it appears in,
or the ['type] field of a scanner, then
some diagnostic information will be printed that will look something like:
[pre
Debug term found, in block with ID: spirit.qi.reference.parser_concepts.parser
Current section title is: Notation
The main index entry will be : Notation
The indexed term is: parser
The search regex is: \[P\|p\]arser
The section constraint is: .*qi.reference.parser_concepts.*
The index type for this entry is: qi_index
]
[endsect]
[section:xml XML Handling]
Auto-index is rather simplistic in its handling of XML:
* When indexing a document, all block content at the paragraph level gets collapsed into a single
string for matching against the regular expressions representing each index term. In other words,
for the most part, you can assume that you're indexing plain text when writing regular expressions.
* Named XML entities for &, ", ', < or > are converted to their corresponding characters before indexing
a section of text. However, decimal or hex escape sequences are not currently converted.
* Index terms are inserted into the XML sequence just as they are, and no attempt is made to
escape them to valid XML. Normally these are C++ identifiers anyway so that's not an issue, but
you should take care not to define scanners that create index terms containing &, ", ', < or >.
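
For example, text that appears in the XML as `std::vector&lt;int&gt;` is seen by the regular expressions as `std::vector<int>`, so a search expression should be written against the unescaped form, while the index term itself is best kept free of those characters. A sketch, using a hypothetical term name:

    vector_of_int "std::vector<int>"
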
[endsect]
[section:comm_ref Command Line Reference]
The following command line options are supported by auto_index:
[variablelist
[[in=infilename][Specifies the name of the XML input file to be indexed.]]
[[out=outfilename][Specifies the name of the new XML file to create.]]
[[scan=source-filename][Specifies that ['source-filename] should be scanned
for terms to index.]]
[[script=script-filename][Specifies the name of the script file to process.]]
[[--no-duplicates][If a term occurs more than once in the same section, then
include only one index entry.]]
[[--internal-index][Specifies that auto_index should generate the actual
indexes rather than inserting `<indexterm>`'s and leaving index generation
to the XSL stylesheets.]]
[[--no-section-names][Prevents auto_index from using section names as index entries.]]
[[prefix=pathname][Specifies a directory to apply as a prefix to all relative file paths in the script file.]]
[[index-type=element-name][Specifies the name of the XML element to enclose internally generated indexes in:
defaults to "section", but could equally be "appendix" or "chapter" or some other block level element that has a formal title.]]
]
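
Putting these together, a direct invocation might look like the following sketch, in which the XML and script file names are hypothetical:

    auto_index in=mydoc.xml out=mydoc_indexed.xml script=index.idx --internal-index index-type=appendix
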
[endsect]