1 <!-- Creator : groff version 1.18.1 -->
2 <!-- CreationDate: Fri Dec 9 14:11:37 2005 -->
5 <meta name="generator" content="groff -Thtml, see www.gnu.org">
6 <meta name="Content-Style" content="text/css">
11 <h1 align=center>catdoc</h1>
12 <a href="#NAME">NAME</a><br>
13 <a href="#SYNOPSIS">SYNOPSIS</a><br>
14 <a href="#DESCRIPTION">DESCRIPTION</a><br>
15 <a href="#OPTIONS">OPTIONS</a><br>
16 <a href="#CHARACTER SETS">CHARACTER SETS</a><br>
17 <a href="#CHARACTER SUBSTITUTION">CHARACTER SUBSTITUTION</a><br>
18 <a href="#RUNTIME CONFIGURATION">RUNTIME CONFIGURATION</a><br>
19 <a href="#BUGS">BUGS</a><br>
20 <a href="#SEE ALSO">SEE ALSO</a><br>
21 <a href="#AUTHOR">AUTHOR</a><br>
27 <table width="100%" border=0 rules="none" frame="void"
28 cols="2" cellspacing="0" cellpadding="0">
29 <tr valign="top" align="left">
32 <p>catdoc − reads an MS-Word file and writes its content as
33 plain text to standard output</p>
36 <a name="SYNOPSIS"></a>
39 <table width="100%" border=0 rules="none" frame="void"
40 cols="2" cellspacing="0" cellpadding="0">
41 <tr valign="top" align="left">
44 <p><b>catdoc</b> [<b>-vlu8btawxV</b>] [<b>-m</b>
45 <i>number</i>] [ <b>-s</b> <i>charset</i>] [ <b>-d</b>
46 <i>charset</i>] [ <b>-f</b> <i>output-format</i>]
50 <a name="DESCRIPTION"></a>
53 <table width="100%" border=0 rules="none" frame="void"
54 cols="2" cellspacing="0" cellpadding="0">
55 <tr valign="top" align="left">
58 <p><b>catdoc</b> behaves much like <b>cat</b>(1), but it
59 reads an MS-Word file and produces human-readable text on
60 standard output. Optionally it can use <b>latex</b>(1)
61 escape sequences for characters which have special meaning
62 for LaTeX. It also makes some effort to recognize MS-Word
63 tables, although it never tries to write correct headers for
64 the LaTeX tabular environment. Additional output formats, such
65 as HTML, can be easily defined.</p>
67 <p><b>catdoc</b> doesn’t attempt to extract formatting
68 information other than tables from an MS-Word document, so
69 the different output modes differ mainly in which
70 characters are escaped and in how characters missing
71 from the output charset are represented. See
72 CHARACTER SUBSTITUTION below.</p>
74 <p><b>catdoc</b> uses an internal <b>unicode</b>(4)
75 representation of text, so it can convert text when the
76 charset of the source document doesn’t match the charset of
77 the target system. See CHARACTER SETS below.</p>
79 <p>If no file names are supplied, <b>catdoc</b> processes its
80 standard input unless it is a terminal. It is unlikely that
81 somebody could type a Word document from the keyboard, so if
82 <b>catdoc</b> is invoked without arguments and stdin is not
83 redirected, it prints a brief usage message and exits.
84 Processing of standard input (even among other files) can be
85 forced by using a dash ’-’ as a file name.</p>
87 <p>By default, <b>catdoc</b> wraps lines which are more than
88 72 chars long and separates paragraphs by blank lines. This
89 behavior can be turned off by the <b>-w</b> switch. In
90 <i>wide</i> mode <b>catdoc</b> prints each paragraph as one long
91 line, suitable for import into word processors which
92 perform word wrapping themselves.</p>
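<p>The default and wide modes described above can be pictured with the
following Python sketch. It is only an illustration built on the standard
<b>textwrap</b> module, not catdoc's own wrapping code; the function name
<i>wrap_paragraphs</i> is hypothetical, and treating a margin of 0 as wide
mode is an assumption taken from the description of <b>-m 0</b>.</p>

```python
import textwrap


def wrap_paragraphs(paragraphs, margin=72):
    """Illustrative model of the two output modes (not catdoc's code).

    margin == 0 stands for wide mode: one long line per paragraph.
    Any other margin wraps each paragraph and puts a blank line
    between paragraphs.
    """
    if margin == 0:
        # Wide mode: leave every paragraph on a single line.
        return "\n".join(paragraphs)
    wrapped = [textwrap.fill(p, width=margin) for p in paragraphs]
    return "\n\n".join(wrapped)


if __name__ == "__main__":
    sample = ["word " * 40, "A short second paragraph."]
    print(wrap_paragraphs(sample))
    print(wrap_paragraphs(sample, margin=0))
```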
95 <a name="OPTIONS"></a>
98 <table width="100%" border=0 rules="none" frame="void"
99 cols="4" cellspacing="0" cellpadding="0">
100 <tr valign="top" align="left">
101 <td width="11%"></td>
109 <p>- shortcut for -f ascii. Produces ASCII text as output.
110 Separates table columns with TAB</p>
112 <tr valign="top" align="left">
113 <td width="11%"></td>
121 <p>- process broken MS-Word file. Normally, <b>catdoc</b>
122 checks whether the first 8 bytes of the file are the Microsoft OLE
123 signature. If so, it processes the file; otherwise it just
124 copies it to stdout. This is intended for using <b>catdoc</b> as a
125 filter for viewing all files with the <i>.doc</i> extension.</p>
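<p>The signature test described above can be sketched in Python as follows.
This is a minimal illustration, not catdoc's source; the function name
<i>looks_like_ole</i> is hypothetical. The eight byte values are the
well-known header signature of OLE compound files.</p>

```python
# Header signature that starts every OLE compound file.
OLE_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_ole(data):
    """Return True if the first 8 bytes match the OLE signature."""
    return data[:8] == OLE_SIGNATURE


if __name__ == "__main__":
    print(looks_like_ole(OLE_SIGNATURE + b"rest of a Word file"))  # True
    print(looks_like_ole(b"just a plain text file"))               # False
```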
129 <table width="100%" border=0 rules="none" frame="void"
130 cols="2" cellspacing="0" cellpadding="0">
131 <tr valign="top" align="left">
132 <td width="10%"></td>
134 <p><b>-d</b><i>charset</i></p></td>
137 <table width="100%" border=0 rules="none" frame="void"
138 cols="2" cellspacing="0" cellpadding="0">
139 <tr valign="top" align="left">
140 <td width="23%"></td>
142 <p>- specifies destination charset name. The charset file has the
143 format described in CHARACTER SETS below and should have a
144 <b>.txt</b> extension and reside in the <b>catdoc</b> library
145 directory (${exec_prefix}/lib/catdoc). By default, the
146 current locale charset is used if langinfo support
151 <table width="100%" border=0 rules="none" frame="void"
152 cols="2" cellspacing="0" cellpadding="0">
153 <tr valign="top" align="left">
154 <td width="10%"></td>
156 <p><b>-f</b><i>format</i></p></td>
159 <table width="100%" border=0 rules="none" frame="void"
160 cols="2" cellspacing="0" cellpadding="0">
161 <tr valign="top" align="left">
162 <td width="23%"></td>
164 <p>- specifies output format as described in CHARACTER
165 SUBSTITUTION below. <b>catdoc</b> comes with two output
166 formats - ascii and tex. You can add your own if you
171 <table width="100%" border=0 rules="none" frame="void"
172 cols="4" cellspacing="0" cellpadding="0">
173 <tr valign="top" align="left">
174 <td width="11%"></td>
182 <p>Causes <b>catdoc</b> to list the names of available charsets
183 on stdout and exit successfully.</p>
187 <table width="100%" border=0 rules="none" frame="void"
188 cols="2" cellspacing="0" cellpadding="0">
189 <tr valign="top" align="left">
190 <td width="10%"></td>
192 <p><b>-m</b><i>number</i></p></td>
195 <table width="100%" border=0 rules="none" frame="void"
196 cols="2" cellspacing="0" cellpadding="0">
197 <tr valign="top" align="left">
198 <td width="23%"></td>
200 <p>Specifies right margin for text (default 72). <b>-m 0</b>
201 is equivalent to <b>-w</b></p>
205 <table width="100%" border=0 rules="none" frame="void"
206 cols="2" cellspacing="0" cellpadding="0">
207 <tr valign="top" align="left">
208 <td width="10%"></td>
210 <p><b>-s</b><i>charset</i></p></td>
213 <table width="100%" border=0 rules="none" frame="void"
214 cols="2" cellspacing="0" cellpadding="0">
215 <tr valign="top" align="left">
216 <td width="23%"></td>
218 <p>Specifies the source charset (the one used in the Word document) if
219 the Word document doesn’t contain UTF-16 text. When
220 reading RTF documents, it is typically not necessary,
221 because RTF documents contain an ansicpg specification. But it
222 can be set wrong by Word (I’ve seen RTF documents in
223 Russian where cp1252 was specified). In this case this
224 option takes precedence over the charset specified in the
225 document. But the source_charset statement in the configuration
226 file has lower priority than the charset in the document.</p>
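<p>The priority order described above (the <b>-s</b> option first, then the
charset named in the document, then the configuration file) can be written
down as a short Python sketch. The function and parameter names are
hypothetical and only model the rule; this is not how catdoc itself is
structured.</p>

```python
def pick_source_charset(option_s=None, document_charset=None,
                        config_charset=None):
    """Return the first charset that is set, in priority order.

    1. value of the -s command-line option
    2. charset recorded in the document (e.g. RTF ansicpg)
    3. source_charset from the configuration file
    """
    for candidate in (option_s, document_charset, config_charset):
        if candidate:
            return candidate
    return None


if __name__ == "__main__":
    # The document wrongly claims cp1252, so -s overrides it.
    print(pick_source_charset("cp1251", "cp1252", "koi8-r"))
    # Without -s, the document charset beats the configuration file.
    print(pick_source_charset(None, "cp1252", "koi8-r"))
```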
230 <table width="100%" border=0 rules="none" frame="void"
231 cols="4" cellspacing="0" cellpadding="0">
232 <tr valign="top" align="left">
233 <td width="11%"></td>
241 <p>- shortcut for <b>-f tex</b>. Converts all printable
242 chars which have special meaning for <b>LaTeX</b>(1) into
243 appropriate control sequences. Separates table columns by
246 <tr valign="top" align="left">
247 <td width="11%"></td>
255 <p>- declares that the Word document contains a UNICODE (UTF-16)
256 representation of text (as some Word-97 documents do). If
257 catdoc fails to correct Word document with default charset,
260 <tr valign="top" align="left">
261 <td width="11%"></td>
269 <p>- declares that the Word document is 8-bit. Useful in case
270 catdoc recognizes the file format incorrectly.</p>
272 <tr valign="top" align="left">
273 <td width="11%"></td>
281 <p>disables word wrapping. By default <b>catdoc</b> output
282 is split into lines not longer than 72 (or the number
283 specified by the -m option) characters and paragraphs are
284 separated by a blank line. With this option each paragraph is
287 <tr valign="top" align="left">
288 <td width="11%"></td>
296 <p>causes catdoc to output unknown UNICODE characters as
297 \xNNNN instead of question marks.</p>
299 <tr valign="top" align="left">
300 <td width="11%"></td>
308 <p>causes catdoc to print some useless information about
309 word document structure to stdout before actual start of
312 <tr valign="top" align="left">
313 <td width="11%"></td>
321 <p>outputs catdoc version</p>
324 <a name="CHARACTER SETS"></a>
325 <h2>CHARACTER SETS</h2>
327 <table width="100%" border=0 rules="none" frame="void"
328 cols="2" cellspacing="0" cellpadding="0">
329 <tr valign="top" align="left">
330 <td width="10%"></td>
332 <p>When processing an MS-Word file <b>catdoc</b> uses
333 information about two character sets, typically different -
334 input and output. They are stored in plain text files in the
335 <b>catdoc</b> library directory. Each line of a character set file should
336 contain two whitespace-separated hexadecimal numbers - the 8-bit
337 code in the character set and the 16-bit Unicode code. Anything from
338 a hash mark to the end of the line is ignored, as well as blank
341 <p>The <b>catdoc</b> distribution includes some of these
342 character sets. Additional character set definitions,
343 directly usable by <b>catdoc</b>, can be obtained from
344 ftp.unicode.org. Charset files have a <b>.txt</b> suffix,
345 which shouldn’t be specified on the command line or in
346 configuration files.</p>
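<p>To make the charset file format described above concrete, here is a
minimal Python sketch of a reader for it. It is an illustration under
stated assumptions, not catdoc's parser: the name <i>parse_charset</i> is
hypothetical, and lines that do not carry exactly two numbers (such as
undefined positions in some mapping files) are simply skipped.</p>

```python
def parse_charset(text):
    """Map 8-bit codes to Unicode code points from charset file text.

    Everything from a hash mark to the end of the line is ignored,
    and so are blank lines.
    """
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) == 2:
            code, unicode_value = fields
            # int(..., 16) accepts both "80" and "0x80".
            table[int(code, 16)] = int(unicode_value, 16)
    return table


if __name__ == "__main__":
    sample = (
        "# sample mapping, made up for this example\n"
        "0x41\t0x0041\t# LATIN CAPITAL LETTER A\n"
        "\n"
        "0x80 0x20AC # EURO SIGN\n"
    )
    print(parse_charset(sample))
```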
348 <p>Note that <b>catdoc</b> is distributed with Cyrillic
349 charsets as default. If you are not Russian, you probably
350 don’t want them, and should reconfigure catdoc at compile
351 time or in the runtime configuration file.</p>
353 <p>When dealing with documents with charsets other than the
354 default, remember that Microsoft never uses ISO charsets.
355 While letters in, say, cp1252 are at the same positions as in
356 ISO-8859-1, some punctuation signs would be lost if you
357 specify ISO-8859-1 as the input charset. If you use cp1252,
358 catdoc deals with those signs as described in CHARACTER
359 SUBSTITUTION below.</p>
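<p>The difference between cp1252 and ISO-8859-1 mentioned above is easy to
see with Python's built-in codecs. The byte string below is made up for the
example; the sketch shows only how the two charsets interpret the same
bytes and says nothing about catdoc's internals.</p>

```python
# Bytes as a Windows application might write them:
# 0x93 and 0x94 are curly quotes in cp1252, 0xE9 is e-acute in both.
raw = b"\x93caf\xe9\x94"

as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("iso-8859-1")

# The letter decodes identically under both charsets.
print(as_cp1252[4] == as_latin1[4])

# The quote bytes become real punctuation only under cp1252;
# ISO-8859-1 maps them to C1 control characters instead.
print(hex(ord(as_cp1252[0])), hex(ord(as_latin1[0])))
```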
362 <a name="CHARACTER SUBSTITUTION"></a>
363 <h2>CHARACTER SUBSTITUTION</h2>
365 <table width="100%" border=0 rules="none" frame="void"
366 cols="2" cellspacing="0" cellpadding="0">
367 <tr valign="top" align="left">
368 <td width="10%"></td>
370 <p><b>catdoc</b> converts an MS-Word file into the following
371 internal Unicode representation:</p>
375 <table width="100%" border=0 rules="none" frame="void"
376 cols="2" cellspacing="0" cellpadding="0">
377 <tr valign="top" align="left">
378 <td width="10%"></td>
380 <p>1. Paragraphs are separated by ASCII Line Feed symbol
384 <table width="100%" border=0 rules="none" frame="void"
385 cols="2" cellspacing="0" cellpadding="0">
386 <tr valign="top" align="left">
387 <td width="10%"></td>
389 <p>2. Table cells within row are separated by ASCII Field
390 Separator symbol</p></td>
393 <table width="100%" border=0 rules="none" frame="void"
394 cols="2" cellspacing="0" cellpadding="0">
395 <tr valign="top" align="left">
396 <td width="17%"></td>
402 <table width="100%" border=0 rules="none" frame="void"
403 cols="2" cellspacing="0" cellpadding="0">
404 <tr valign="top" align="left">
405 <td width="10%"></td>
407 <p>3. Table rows are separated by ASCII Record Separator
411 <table width="100%" border=0 rules="none" frame="void"
412 cols="2" cellspacing="0" cellpadding="0">
413 <tr valign="top" align="left">
414 <td width="10%"></td>
416 <p>4. All printable characters, including whitespace are
417 represented with their</p></td>
420 <table width="100%" border=0 rules="none" frame="void"
421 cols="2" cellspacing="0" cellpadding="0">
422 <tr valign="top" align="left">
423 <td width="17%"></td>
425 <p>respective UNICODE codes.</p>
429 <table width="100%" border=0 rules="none" frame="void"
430 cols="2" cellspacing="0" cellpadding="0">
431 <tr valign="top" align="left">
432 <td width="10%"></td>
434 <p>This UNICODE representation is subsequently converted
435 into 8-bit text in the target character set using the following
436 four-step algorithm:</p>
440 <table width="100%" border=0 rules="none" frame="void"
441 cols="2" cellspacing="0" cellpadding="0">
442 <tr valign="top" align="left">
443 <td width="10%"></td>
445 <p>1. List of special characters is searched for given
446 Unicode character.</p></td>
449 <table width="100%" border=0 rules="none" frame="void"
450 cols="2" cellspacing="0" cellpadding="0">
451 <tr valign="top" align="left">
452 <td width="17%"></td>
454 <p>If found, then appropriate multi-character sequence is
455 output instead of character.</p>
459 <table width="100%" border=0 rules="none" frame="void"
460 cols="2" cellspacing="0" cellpadding="0">
461 <tr valign="top" align="left">
462 <td width="10%"></td>
464 <p>2. If there is an equivalent in target character set, it
468 <table width="100%" border=0 rules="none" frame="void"
469 cols="2" cellspacing="0" cellpadding="0">
470 <tr valign="top" align="left">
471 <td width="10%"></td>
473 <p>3. Otherwise, replacement list is searched and, if there
474 is multi-character</p></td>
477 <table width="100%" border=0 rules="none" frame="void"
478 cols="2" cellspacing="0" cellpadding="0">
479 <tr valign="top" align="left">
480 <td width="17%"></td>
482 <p>substitution for this UNICODE char, it is output.</p>
486 <table width="100%" border=0 rules="none" frame="void"
487 cols="2" cellspacing="0" cellpadding="0">
488 <tr valign="top" align="left">
489 <td width="10%"></td>
491 <p>4. If all above fails, "Unknown char" symbol
492 (question mark) is output.</p></td>
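<p>The four steps above can be sketched in Python as follows. This is an
illustration of the lookup order only, not catdoc's implementation: the
function name <i>convert_char</i> is hypothetical, the two lists are
modelled as dictionaries, the target character set is modelled by a Python
codec name, and the sample map entries are made up.</p>

```python
def convert_char(ch, special, replacement, codec, unknown="?"):
    """Convert one Unicode character to bytes in the target charset."""
    # Step 1: special characters are always escaped.
    if ch in special:
        return special[ch].encode("ascii")
    # Step 2: use the equivalent in the target charset, if any.
    try:
        return ch.encode(codec)
    except UnicodeEncodeError:
        pass
    # Step 3: fall back to a multi-character substitution.
    if ch in replacement:
        return replacement[ch].encode("ascii")
    # Step 4: nothing matched, output the "unknown char" symbol.
    return unknown.encode("ascii")


if __name__ == "__main__":
    special = {"%": "\\%"}            # made-up TeX-like escape
    replacement = {"\u00a9": "(c)"}   # made-up substitution
    for ch in ["%", "A", "\u00a9", "\u263a"]:
        print(convert_char(ch, special, replacement, "ascii"))
```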
495 <table width="100%" border=0 rules="none" frame="void"
496 cols="2" cellspacing="0" cellpadding="0">
497 <tr valign="top" align="left">
498 <td width="10%"></td>
500 <p>Lists of special characters and list of substitution are
501 character set-independent, because special chars should be
502 escaped regardless of their existence in target character
503 set (usually, they are parts of US-ASCII, and therefore
504 exist in any character set) and replacement list is searched
505 only for those characters, which are not found in target
508 <p>These lists are stored in the <b>catdoc</b> library directory
509 in files whose names are prefixed with the format name. These files have
510 the following format:</p>
512 <p>Each line can be either a comment (starting with a hash mark)
513 or contain a hexadecimal UNICODE value, separated by
514 whitespace from the string which is substituted for
515 it. If the string contains no whitespace it can be used as is;
516 otherwise it should be enclosed in single or double quotes.
517 Usual backslash sequences like
518 <i>’\n’</i>,<i>’\t’</i> can be used
522 <a name="RUNTIME CONFIGURATION"></a>
523 <h2>RUNTIME CONFIGURATION</h2>
525 <table width="100%" border=0 rules="none" frame="void"
526 cols="2" cellspacing="0" cellpadding="0">
527 <tr valign="top" align="left">
528 <td width="10%"></td>
530 <p>Upon startup catdoc reads its system-wide configuration
531 file (<b>catdocrc</b> in the <b>catdoc</b> library directory) and then
532 the user-specific configuration file
533 <b>${HOME}/.catdocrc</b>.</p>
535 <p>These files can contain the following directives:</p>
539 <table width="100%" border=0 rules="none" frame="void"
540 cols="2" cellspacing="0" cellpadding="0">
541 <tr valign="top" align="left">
542 <td width="10%"></td>
544 <p><b>source_charset =</b> <i>charset-name</i></p></td>
547 <table width="100%" border=0 rules="none" frame="void"
548 cols="2" cellspacing="0" cellpadding="0">
549 <tr valign="top" align="left">
550 <td width="23%"></td>
552 <p>Sets the default source charset, which is used if no
553 <b>-s</b> option is specified. Consult the configuration of a nearby
554 Windows workstation to find the one you need.</p>
558 <table width="100%" border=0 rules="none" frame="void"
559 cols="2" cellspacing="0" cellpadding="0">
560 <tr valign="top" align="left">
561 <td width="10%"></td>
563 <p><b>target_charset =</b> <i>charset-name</i></p></td>
566 <table width="100%" border=0 rules="none" frame="void"
567 cols="2" cellspacing="0" cellpadding="0">
568 <tr valign="top" align="left">
569 <td width="23%"></td>
571 <p>Sets default output charset. You probably know, which one
576 <table width="100%" border=0 rules="none" frame="void"
577 cols="2" cellspacing="0" cellpadding="0">
578 <tr valign="top" align="left">
579 <td width="10%"></td>
581 <p><b>charset_path =</b> <i>directory-list</i></p></td>
584 <table width="100%" border=0 rules="none" frame="void"
585 cols="2" cellspacing="0" cellpadding="0">
586 <tr valign="top" align="left">
587 <td width="23%"></td>
589 <p>colon-separated list of directories, which are searched
590 for charset files. This allows you to install additional
591 charsets in your home directory.</p>
595 <table width="100%" border=0 rules="none" frame="void"
596 cols="2" cellspacing="0" cellpadding="0">
597 <tr valign="top" align="left">
598 <td width="10%"></td>
600 <p><b>map_path =</b> <i>directory-list</i></p></td>
603 <table width="100%" border=0 rules="none" frame="void"
604 cols="2" cellspacing="0" cellpadding="0">
605 <tr valign="top" align="left">
606 <td width="23%"></td>
608 <p>colon-separated list of directories, which are searched
609 for special character map and replacement map.</p>
613 <table width="100%" border=0 rules="none" frame="void"
614 cols="2" cellspacing="0" cellpadding="0">
615 <tr valign="top" align="left">
616 <td width="10%"></td>
618 <p><b>format =</b> <i>format name</i></p></td>
621 <table width="100%" border=0 rules="none" frame="void"
622 cols="2" cellspacing="0" cellpadding="0">
623 <tr valign="top" align="left">
624 <td width="23%"></td>
626 <p>Output format which would be used by default.
627 <b>catdoc</b> comes with two formats - <b>ascii</b> and
628 <b>tex</b> but nothing prevents you from writing your own
629 format (set two map files - special character map and
630 replacement map).</p>
634 <table width="100%" border=0 rules="none" frame="void"
635 cols="2" cellspacing="0" cellpadding="0">
636 <tr valign="top" align="left">
637 <td width="10%"></td>
639 <p><b>unknown_char =</b> <i>character
640 specification</i></p></td>
643 <table width="100%" border=0 rules="none" frame="void"
644 cols="2" cellspacing="0" cellpadding="0">
645 <tr valign="top" align="left">
646 <td width="23%"></td>
648 <p>sets the character to output instead of an unknown Unicode
649 character (default ’?’). The character specification
650 can have one of two forms - a character enclosed in single
651 quotes or a hexadecimal code.</p>
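<p>The two forms of the character specification can be modelled with a few
lines of Python. This is a sketch under assumptions: the name
<i>parse_char_spec</i> is hypothetical, and since the exact hexadecimal
syntax catdoc accepts is not stated here, the sketch accepts digits with or
without a <i>0x</i> prefix.</p>

```python
def parse_char_spec(spec):
    """Return the character named by a specification string.

    Accepts either a single character in single quotes ('?')
    or a hexadecimal code (3F or 0x3F, by assumption).
    """
    spec = spec.strip()
    if len(spec) == 3 and spec[0] == "'" and spec[2] == "'":
        return spec[1]
    return chr(int(spec, 16))


if __name__ == "__main__":
    print(parse_char_spec("'?'"))
    print(parse_char_spec("0x3F"))
```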
655 <table width="100%" border=0 rules="none" frame="void"
656 cols="2" cellspacing="0" cellpadding="0">
657 <tr valign="top" align="left">
658 <td width="10%"></td>
660 <p><b>use_locale =</b><i>(yes|no)</i></p></td>
663 <table width="100%" border=0 rules="none" frame="void"
664 cols="2" cellspacing="0" cellpadding="0">
665 <tr valign="top" align="left">
666 <td width="23%"></td>
668 <p>Enables or disables automatic selection of output charset
669 (default <b>yes</b>), based on system locale settings (if
670 enabled at compile time). If automatic detection is enabled,
671 then output charset settings in the configuration files (but
672 not on the command line) are ignored, and the current system
673 locale charset is used instead. There is no automatic
674 choice of input charset based on the locale language, because
675 most modern Word files (since Word 97) are Unicode
682 <table width="100%" border=0 rules="none" frame="void"
683 cols="2" cellspacing="0" cellpadding="0">
684 <tr valign="top" align="left">
685 <td width="10%"></td>
687 <p>Doesn’t handle fast-saves properly. Prints
688 footnotes as separate paragraphs at the end of file, instead
689 of producing correct LaTeX commands. Cannot distinguish
690 between empty table cell and end of table row.</p>
693 <a name="SEE ALSO"></a>
696 <table width="100%" border=0 rules="none" frame="void"
697 cols="2" cellspacing="0" cellpadding="0">
698 <tr valign="top" align="left">
699 <td width="10%"></td>
701 <p><b>xls2csv</b>(1), <b>cat</b>(1), <b>strings</b>(1),
702 <b>utf</b>(4), <b>unicode</b>(4)</p>
705 <a name="AUTHOR"></a>
708 <table width="100%" border=0 rules="none" frame="void"
709 cols="2" cellspacing="0" cellpadding="0">
710 <tr valign="top" align="left">
711 <td width="10%"></td>
713 <p>V.B.Wagner &lt;vitus@wagner.pp.ru&gt;</p>