extract++(1) General Commands Manual extract++(1)
NAME
extract++ - SWISH++ text extractor
SYNOPSIS
extract++ [ options ] directory... file...
DESCRIPTION
extract++ is the SWISH++ text extractor, a utility to extract what text
there is from a (mostly) binary file (similar to the strings(1) com-
mand) prior to indexing. Original files are untouched.
Text is extracted from the specified files and files in the specified
directories; text from files in subdirectories of specified directories
is also extracted by default (unless the -r, --no-recurse, -f, or
--filter option or the RecurseSubdirs or ExtractFilter variable is
given).
Ordinarily, text is extracted from a file only if its filename either
matches one of the patterns in the set specified with either the -e or
--pattern option or the IncludeFile variable (unless standard input is
used; see next paragraph) or is not among the set specified with either
the -E or --no-pattern option or the ExcludeFile variable.
If there is a single filename of `-', the list of directories and files
to extract is instead taken from standard input (one per line). In
this case, filename patterns of files to extract need not be specified
explicitly: all files, regardless of whether they match a pattern (un-
less they are among the set not to extract specified with either the -E
or --no-pattern option or the ExcludeFile variable), are extracted,
i.e., extract++ assumes you know what you're doing when specifying
filenames in this manner.
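As a hedged illustration of the standard-input form, the sketch below
builds the file list with find(1) and pipes it to extract++. The helper
name list_recent_office_files, the directory, and the seven-day window
are assumptions made for this example; they are not part of SWISH++.

```shell
# Hypothetical helper (not part of SWISH++): print, one per line, the
# Office files under directory $1 modified within the last $2 days.
list_recent_office_files() {
    find "$1" -type f \( -name '*.doc' -o -name '*.ppt' -o -name '*.xls' \) \
        -mtime "-$2"
}

# The single `-' argument makes extract++ read the list from standard
# input.  Guarded so the sketch does nothing where extract++ or the
# example directory is absent.
if command -v extract++ >/dev/null 2>&1 && [ -d /home/www/htdocs ]; then
    list_recent_office_files /home/www/htdocs 7 | extract++ -v3 -
fi
```

Because the list is given explicitly, no -e or --pattern option is
needed here; only the exclusion set, if any, still applies.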
Ordinarily, the text extracted from a file is written to another file
in the same directory having the same filename but with the ``.txt''
extension appended by default, e.g., ``foo.doc'' becomes
``foo.doc.txt'' after extraction. (See also the -x or --extension op-
tion or the ExtractExtension variable.) However, extraction is not
performed if the extracted text file exists.
If either the -f or --filter option or the ExtractFilter variable is
given, then only a single file specified on the command line is ex-
tracted to standard output. In this case, filename patterns are not
used and the existence of an extracted text file is irrelevant.
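A minimal sketch of using the filter form in a pipeline follows. The
helper name count_extracted_words and the EXTRACT variable are
assumptions introduced for this example; only the -f option itself
comes from this manual.

```shell
# Assumption: EXTRACT names the extract++ executable; it is a variable
# here only so that the command can be substituted.
EXTRACT=${EXTRACT:-extract++}

# Hypothetical helper: count the words extract++ finds in one file by
# extracting it to standard output with -f instead of to a .txt file.
count_extracted_words() {
    "$EXTRACT" -f "$1" | wc -w | tr -d ' '
}

# Usage: count_extracted_words report.doc
```

Since -f writes to standard output, no ``.txt'' file is created and an
existing one does not prevent the extraction.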
Filters
Via the FilterFile configuration file variable, files having particular
patterns can be filtered prior to extraction. (See the examples in
swish++.conf(5).)
Character Mapping and Word Determination
extract++ performs the same character mapping, character entity conver-
sions, and word determination heuristics used by index++(1) but also
additionally:
1. Considers all PostScript Level 2 operators that are not also English
words to be stop words. Such words in a file usually indicate an
encapsulated PostScript (EPS) file and as such should not be indexed.
2. Looks specifically for encapsulated PostScript (EPS) data, that is,
everything between one of %%BeginSetup, %%BoundingBox, %%Creator,
%%EndComments, or %%Title and %%Trailer, and discards it.
3. Discards strings of ASCII hex data Word_Hex_Min_Size characters or
longer, e.g., ``7F454C46''. (The default is 5.)
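To get a feel for what the hex-data rule targets, the following rough
approximation lists candidate strings in a stream. It is not the
heuristic extract++ actually uses, which is not described here; the
helper name hex_runs is an assumption, and the -o and -w options of
grep(1), while widely available, are not required by POSIX.

```shell
# Rough approximation only: print whole words made up solely of hex
# digits that are at least $1 characters long (default 5, matching the
# documented default of Word_Hex_Min_Size).  Reads standard input.
hex_runs() {
    grep -Eow "[0-9A-Fa-f]{${1:-5},}"
}

# Usage: strings somefile | hex_runs 5
```

Note that this sketch also reports English words spelled entirely with
the letters a-f, such as ``decade''.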
Motivation
extract++ was developed to be able to index non-text files in propri-
etary formats such as Microsoft Office documents. There are a couple
of reasons why the functionality of extract++ isn't simply built into
index++(1):
1. Users who do not need to index such documents shouldn't have to pay
the performance penalty for doing the extra checks for PostScript
and hex data.
2. While index++(1) can also uncompress files on the fly using filters,
uncompressing them every time indexing is performed is excessive.
Text extraction, on the other hand, is done only once per file; if the
file is updated, the text-extracted version should be deleted and
recreated.
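Because extraction is skipped when the extracted text file already
exists, an updated source file keeps its stale text until that file is
deleted. The sketch below is one possible way to do the deletion; the
helper name remove_stale_extracts is an assumption, and the -nt test,
though supported by common shells, is not required by POSIX.

```shell
# Hypothetical helper: under directory $1, delete every extracted text
# file (extension $2, default txt) that is older than the source file
# it was extracted from, printing the name of each file removed.
remove_stale_extracts() {
    dir=$1 ext=${2:-txt}
    find "$dir" -type f -name "*.$ext" | while IFS= read -r txt; do
        src=${txt%."$ext"}          # foo.doc.txt -> foo.doc
        if [ -f "$src" ] && [ "$src" -nt "$txt" ]; then
            rm -f -- "$txt"
            printf '%s\n' "$txt"
        fi
    done
}

# Usage: remove_stale_extracts /home/www/htdocs && extract++ -e '*.doc' /home/www/htdocs
```

Files ending in the extension that have no corresponding source file
(ordinary text files, for example) are left alone.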
OPTIONS
Options begin with either a `-' for short options or a ``--'' for long
options. Either a `-' or ``--'' by itself explicitly ends the options;
however, the difference is that `-' is returned as the first non-option
whereas ``--'' is skipped entirely. Long option names may be abbrevi-
ated so long as the abbreviation is unambiguous.
For a short option that takes an argument, the argument is either taken
to be the remaining characters of the same option, if any, or, if not,
is taken from the next option unless said option begins with a `-'.
Short options that take no arguments can be grouped (but the last op-
tion in the group can take an argument), e.g., -lrv4 is equivalent to
-l -r -v4.
For a long option that takes an argument, the argument is either taken
to be the characters after a `=', if any, or, if not, is taken from the
next option unless said option begins with a `-'.
-?
--help Print the usage (``help'') message and exit.
-cc
--config-file=c The name of the configuration file, c, to use. (De-
fault is swish++.conf in the current directory.) A
configuration file is not required: if none is speci-
fied and the default does not exist, none is used;
however, if one is specified and it does not exist,
then this is an error.
-ep[,p...]
--pattern=p[,p...]
A filename pattern (or set of patterns separated by
commas), p, of files to extract text from. Case is
significant. Multiple -e or --pattern options may be
specified.
-Ep[,p...]
--no-pattern=p[,p...]
A filename pattern or patterns, p, of files not to
extract text from. Case is significant. Multiple -E
or --no-pattern options may be specified.
-f
--filter Extract a single file to standard output and exit.
-l
--follow-links Follow symbolic links during extraction. The default
is not to follow them. (This option is not available
under Microsoft Windows since it doesn't support sym-
bolic links.)
-r
--no-recurse Do not recursively extract the files in subdirecto-
ries, that is: when a directory is encountered, all
the files in that directory are extracted (modulo the
filename patterns specified via the -e, --pattern,
-E, or --no-pattern options or the IncludeFile or Ex-
cludeFile variables) but subdirectories encountered
are ignored and therefore the files contained in them
are not extracted. (This option is most useful when
specifying the directories and files to extract via
standard input.) The default is to extract the files
in subdirectories recursively.
-sf
--stop-file=f The name of a file, f, containing the set of stop-words
to use instead of the built-in set. Whitespace, including blank
lines, and characters starting with # and continuing to the end of
the line (comments) are ignored.
-S
--dump-stop Dump the built-in set of stop-words to standard out-
put and exit.
-vv
--verbosity=v The verbosity level, v, for printing additional
information to standard output during extraction. The
verbosity levels, 0-4, are:
0 No output is generated (except for errors).
1 Only run statistics (elapsed time, number of
files, word count) are printed.
2 Directories are printed as extraction progresses.
3 Directories and files are printed with a word count
for each file.
4 Same as 3 but also prints all files that are not
extracted and why.
-V
--version Print the version number of SWISH++ and exit.
-xe
--extension=e The extension to append to filenames during extrac-
tion. (It can be specified with or without the dot;
default is txt.)
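As an illustration of the stop-word file format described under -s,
the sketch below prints the words such a file would contribute once
comments and whitespace are ignored. The helper name
effective_stop_words and the file name my-stop-words are assumptions
made for this example.

```shell
# Hypothetical helper: print the words a stop-word file would
# contribute, one per line, after dropping # comments, blank lines,
# and surrounding whitespace.
effective_stop_words() {
    sed 's/#.*//' "$1" | tr -s ' \t' '\n\n' | sed '/^$/d'
}

# A possible workflow using the documented options:
#   extract++ -S > my-stop-words        # start from the built-in set
#   (edit my-stop-words)
#   extract++ -s my-stop-words -e '*.doc' .
```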
CONFIGURATION FILE
The following variables can be set in a configuration file. Variables
and command-line options can be mixed.
ExcludeFile Same as -E or --no-pattern
ExtractExtension Same as -x or --extension
ExtractFilter Same as -f or --filter
FilterAttachment (See FILTERS in swish++.conf(5).)
FilterFile (See FILTERS in swish++.conf(5).)
FollowLinks Same as -l or --follow-links
IncludeFile Same as -e or --pattern
RecurseSubdirs Same as -r or --no-recurse
StopWordFile Same as -s or --stop-file
Verbosity Same as -v or --verbosity
EXAMPLES
Extraction
To extract text from all Microsoft Office files on a web server:
cd /home/www/htdocs
extract++ -v3 -e '*.doc' -e '*.ppt' -e '*.xls' .
Filters
(See the examples in swish++.conf(5).)
EXIT STATUS
Exits with one of the values given below:
0 Success.
1 Error in configuration file.
2 Error in command-line options.
20 File to extract does not exist.
30 Unable to read stop-word file.
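A script that runs extract++ can branch on these values. The following
is a minimal sketch; the helper name describe_extract_status is an
assumption, and the messages simply restate the table above.

```shell
# Hypothetical helper: translate an extract++ exit status ($1) into
# the description given in this manual.
describe_extract_status() {
    case $1 in
        0)  echo "success" ;;
        1)  echo "error in configuration file" ;;
        2)  echo "error in command-line options" ;;
        20) echo "file to extract does not exist" ;;
        30) echo "unable to read stop-word file" ;;
        *)  echo "unrecognized exit status $1" ;;
    esac
}

# Usage: extract++ -e '*.doc' . ; describe_extract_status $?
```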
CAVEATS
1. Text extraction is not perfect, nor can it be.
2. As with index++(1), the word-determination heuristics employed are
heavily geared for English. Using SWISH++ as-is to extract files
in non-English languages is not recommended.
FILES
swish++.conf default configuration file name
SEE ALSO
index++(1), search++(1), strings(1), swish++.conf(5), glob(7)
Adobe Systems Incorporated. PostScript Language Reference Manual, 2nd
ed. Addison-Wesley, Reading, MA. pp. 346-359.
International Standards Organization. ``ISO/IEC 9945-2: Information
Technology -- Portable Operating System Interface (POSIX) -- Part 2:
Shell and Utilities,'' 1993.
AUTHOR
Paul J. Lucas <pauljlucas@mac.com>
SWISH++ November 1, 2002 extract++(1)