catdoc & xls2csv

Overview

catdoc is a program that reads one or more Microsoft Word files and writes the text contained in them to standard output. It therefore does for .doc files what the Unix cat command does for plain ASCII files.

It is now accompanied by xls2csv, a program that converts an Excel spreadsheet into a comma-separated values file, and catppt, a utility that extracts text from PowerPoint files.
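Because the tools write to standard output, they are easy to call from scripts. The following is a minimal Python sketch, assuming catdoc (or xls2csv) is installed and on PATH. The helper names build_command and extract_text are hypothetical; -d is the destination charset option documented in the man pages.

```python
import subprocess


def build_command(path, charset="utf-8", tool="catdoc"):
    """Return the argument list for one catdoc or xls2csv call."""
    # -d selects the destination charset (see catdoc(1) and xls2csv(1)).
    return [tool, "-d", charset, path]


def extract_text(path, charset="utf-8", tool="catdoc"):
    """Run the tool and return what it wrote to standard output."""
    result = subprocess.run(
        build_command(path, charset, tool),
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.decode(charset)
```

With a Word file at hand, extract_text("report.doc") would return its text, and extract_text("table.xls", tool="xls2csv") would return the CSV data; both file names are placeholders.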

Optionally, catdoc can translate some non-ASCII characters into the corresponding TeX escape sequences and convert text from the Windows ANSI codepage to the local codepage of the target machine. (Because catdoc is a Russian program, by default it converts cp1251 to koi8-r when running under Unix and to cp866 when running under DOS.)
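To illustrate what the default conversion amounts to, here is a small Python sketch. It uses Python's built-in codecs rather than catdoc's own tables, so it only shows the idea; the function name recode is hypothetical.

```python
def recode(data, source="cp1251", target="koi8_r"):
    """Re-encode bytes from the Windows codepage to the local one."""
    # Decode to Unicode first, then encode into the target charset.
    return data.decode(source).encode(target)


windows_bytes = "Привет".encode("cp1251")

unix_bytes = recode(windows_bytes)                  # default under Unix
dos_bytes = recode(windows_bytes, target="cp866")   # default under DOS
```

The same Russian word ends up as three different byte sequences, which is why the conversion step is needed at all.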

Catdoc has rudimentary table handling. In TeX mode it inserts & when it encounters a field delimiter and \\ when it encounters the end of a table row. No table headers are produced, though.
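The shape of that output can be sketched in a few lines of Python. This is an illustration of the rule described above, not catdoc's actual code; the function name rows_to_tex and the exact spacing around the separators are assumptions.

```python
def rows_to_tex(rows):
    """Render table rows the way TeX tabular bodies are written."""
    # Cells are separated by '&'; every row ends with a double backslash.
    return "\n".join(" & ".join(row) + r" \\" for row in rows)


table = [["Name", "Qty"], ["Apples", "3"]]
tex_body = rows_to_tex(table)
```

The result still has to be wrapped in a tabular environment with a column specification by hand, since no table header is generated.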

Catdoc doesn't even try to preserve MS Word character formatting. Its goal is to extract plain text and let you read it and, perhaps, reformat it with TeX according to TeXnical rules that most Word users haven't even heard of.

If you are looking for a tool that preserves Word formatting, look at wvWare or a portable office suite such as OpenOffice.org.

xls2csv does roughly the same for Excel files. It extracts the data and leaves out formatting information and formulas. The idea is that you want to see the data, not the way it was created. Since version 0.94, catppt, a program that prints the text from PowerPoint files, is also included.

Supported platforms

  • Unix. Catdoc was initially developed for Linux and SPARC Solaris. It also runs on a variety of other Unices; for instance, it is included in the FreeBSD ports collection.
  • MS-DOS. Catdoc also runs on MS-DOS, even on XT machines. MS-DOS is the only platform for which compiled executables are provided. These executables are 16-bit real-mode programs. I think a protected-mode version of xls2csv might be useful, but I don't have time to support it.

There is no support for catdoc under Windows

This is not because I hate Windows; I just don't use it. Note that the DOS version of catdoc is not intended to be used under Windows. For example, it doesn't support long file names.

Character encodings conversion

Catdoc doesn't use system-provided charset conversion libraries. This might be considered a bug, but Oracle, Tcl and Perl do the same. Portable software really has no other choice, because some operating systems that claim to be POSIX-compatible do not support all the necessary encodings through their iconv(3) function.

Catdoc doesn't introduce its own incompatible format for charset descriptions. Instead it uses the encoding description files available from the Unicode Consortium FTP site.
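Those description files are plain text tables: each line gives a byte value and a Unicode code point in hexadecimal, followed by a comment. A minimal Python sketch of reading such a table follows. The parser is my illustration, not catdoc's code, and the sample lines only imitate the style of CP1251.TXT.

```python
def parse_mapping(text):
    """Return {byte value: Unicode code point} from a mapping table."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0]   # drop the trailing comment
        fields = line.split()
        if len(fields) < 2:
            continue                   # blank, comment-only or undefined entry
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table


SAMPLE = """\
# Sample lines in the style of CP1251.TXT
0x41\t0x0041\t#LATIN CAPITAL LETTER A
0x98\t      \t#UNDEFINED
0xC0\t0x0410\t#CYRILLIC CAPITAL LETTER A
"""

mapping = parse_mapping(SAMPLE)
```

Entries without a code point, such as the undefined byte 0x98 in the sample, are simply skipped.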

Catdoc's encoding conversion system has a unique feature: it can replace a character that is not available in the target encoding with a multicharacter sequence. So catdoc can sometimes be used as a charset converter for plain text files.
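The idea can be sketched in Python as follows. The replacement table and the function name convert are hypothetical and are not taken from catdoc's own replacement files; the sketch only shows the fallback mechanism.

```python
# Hypothetical replacements for characters missing from koi8-r.
REPLACEMENTS = {
    "\u2026": "...",    # horizontal ellipsis
    "\u2122": "(tm)",   # trade mark sign
    "\u00ab": "<<",     # left-pointing double angle quotation mark
    "\u00bb": ">>",     # right-pointing double angle quotation mark
}


def convert(text, target="koi8_r", replacements=REPLACEMENTS):
    """Encode text, substituting a sequence for unencodable characters."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(target)
        except UnicodeEncodeError:
            # Not representable: fall back to the multicharacter sequence,
            # or to '?' when no replacement is listed.
            out += replacements.get(ch, "?").encode(target)
    return bytes(out)


encoded = convert("\u00abДа\u2026\u00bb")
```

A plain encoder would have to drop such characters or fail on them; here the text stays readable in the target charset.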

Catdoc currently doesn't support any multibyte encodings except utf-8 (Word's internal UCS-2 representation is, of course, supported). The problem is that no one has contributed code for such support that can be compiled as an MS-DOS real-mode program, and I refuse to add any patches that require a 32-bit system.

License

catdoc and xls2csv are distributed under the GNU General Public License.

Current status

The current version of catdoc is 0.95.

Versions 0.94.3 and 0.94.4, distributed with Debian, are broken and not approved by me. They don't handle Russian RTF properly.

See Changelog for details.

Download

catdoc-0.95.tar.gz
Source-only distribution for all platforms
(SHA1 hash sum 58afc3f64d43c13d07070103b36cd83b81c94616)
catdoc-0.94.2.zip
Sources + DOS real-mode executables. THESE ARE NOT WINDOWS PROGRAMS
(SHA1 hash sum 4b75f3a511fe3ec5304883931937eb1db73a4b70)
Previous versions can be found on the archive page.
Git repository
git clone http://www.wagner.pp.ru/git/oss/catdoc.git
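
The SHA1 sums listed above can be checked after downloading. Here is a minimal Python sketch using the standard hashlib module; the function name sha1_of_file is hypothetical, and the archive is assumed to be in the current directory.

```python
import hashlib


def sha1_of_file(path, chunk_size=65536):
    """Return the hexadecimal SHA1 digest of a file."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read in chunks so large archives need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Hash sum published above for catdoc-0.95.tar.gz.
EXPECTED = "58afc3f64d43c13d07070103b36cd83b81c94616"

# After downloading the archive:
#     if sha1_of_file("catdoc-0.95.tar.gz") != EXPECTED:
#         raise SystemExit("checksum mismatch, do not use this file")
```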

Documentation

Catdoc is documented in traditional Unix man pages. For MS-DOS users, plain-text and PostScript versions of the man pages are included in the distribution.

HTML-formatted versions of the man pages are available here: catdoc(1), catppt(1), xls2csv(1).

Support

Catdoc has a web-based bug tracking system. To keep me from accidentally logging in over an insecure connection, access is allowed via https only.

If you don't already have the CA certificate for my personal CA, visit my CVStrac page and install the certificate into your browser. Otherwise, go directly to the login page.

There is also a wiki and a FAQ in the BTS. Anonymous users are allowed to ask questions in the FAQ.