diff options
Diffstat (limited to 'lib/enca/man/enca.1')
-rw-r--r-- | lib/enca/man/enca.1 | 867 |
1 files changed, 0 insertions, 867 deletions
diff --git a/lib/enca/man/enca.1 b/lib/enca/man/enca.1 deleted file mode 100644 index 5eb8e11756..0000000000 --- a/lib/enca/man/enca.1 +++ /dev/null @@ -1,867 +0,0 @@ -.de XA -.RS -.PP -\\$1 -.RE -.PP -.. -.TH "enca" "1" "Sep 2009" "enca 1.11" " " -.SH "NAME" -.PP -enca \-\- detect and convert encoding of text files -. -. -.SH "SYNOPSIS" -.PP -\fBenca\fR [\fB\-L\fR \fILANGUAGE\fR] [\fIOPTION\fR]... [\fIFILE\fR]... -.br -\fBenconv\fR [\fB\-L\fR \fILANGUAGE\fR] [\fIOPTION\fR]... [\fIFILE\fR]... -. -.SH "INTRODUCTION AND EXAMPLES" -.PP -If you are lucky enough, the only two things you will ever need to know are: -command -.XA "enca \fIFILE\fR" -will tell you which encoding file \fIFILE\fR uses (without changing it), and -.XA "enconv \fIFILE\fR" -will convert file \fIFILE\fR to your locale native encoding. -To convert the file to some other encoding use the \fB-x\fR option -(see \fB\-x\fR entry in section \fBOPTIONS\fR and sections \fBCONVERSION\fR -and \fBENCODINGS\fR for details). -.PP -Both work with multiple files and standard input (output) too. -E.g. -.XA "enca \-x latin2 <sometext | lpr" -assures file `sometext' is in ISO Latin\~2 when it's sent to printer. -.PP -The main reason why these command will fail and turn your files into -garbage is that Enca needs to know their language to detect the encoding. -It tries to determine your language and preferred charset from locale -settings, which might not be what you want. -.PP -You can (or have to) use \fB\-L\fR option to tell it the right language. -Suppose, you downloaded some Russian HTML file, -`file.htm', it claims it's windows-1251 but it isn't. -So you run -.XA "enca \-L ru file.htm" -and find out it's KOI8\-R (for example). -Be warned, currently there are not many supported languages (see section -\fBLANGUAGES\fR). -.PP -Another warning concerns the fact several Enca's features, namely its -charset conversion capabilities, strongly depend on what other tools -are installed on your system (see section \fBCONVERSION)\fR\-\-run -.XA "enca \-\-version" -to get list of features (see section \fBFEATURES\fR). -Also try -.XA "enca \-\-help" -to get description of all other Enca options (and to find the rest of this -manual page redundant). -. -. -.SH "DESCRIPTION" -.PP -Enca reads given text files, or standard input when none are given, -and uses knowledge about their language (must be supported by you) -and a mixture of parsing, statistical analysis, guessing and -black magic to determine their encodings, which it then prints to standard -output (or it confesses it doesn't have any idea what the encoding could be). -By default, Enca presents results as a multiline human-readable descriptions, -several other formats are available\-\-see Output type selectors below. -.PP -Enca can also convert files to some other encoding \fIENC\fR -when you ask for it\-\-either using a built\-in converter, -some conversion library, or by calling an external converter. -.PP -Enca's primary goal is to be usable unattended, as an automatic conversion -tool, though it perhaps have not reached this point yet (please see section -\fBSECURITY\fR). -.PP -Please note except rare cases Enca really has to know the language of input -files to give you a reliable answer. -On the other hand, it can then cope quite well with files that are not purely -textual or even detect charset of text strings inside some binary file; -of course, it depends on the character of the non-text component. -.PP -Enca doesn't care about structure of input files, it views them as a uniform -piece of text/data. In case of multipart files (e.g. mailboxes), you have to -use some tool knowing the structure to extract the individual parts first. -It's the cost of ability to detect encodings of any damaged, incomplete or -otherwise incorrect files. -. -. -.SH "OPTIONS" -.PP -There are several categories of options: operation mode options, output type -selectors, guessing parameters, conversion parameters, general options and -listings. -.PP -All long options can be abbreviated as long as they are unambiguous, -mandatory parameters of long options are mandatory for short options too. -.PP -. -.SS Operation modes -.PP -are following: -.TP -\fB\-c\fR, \fB\-\-auto\-convert\fR -Equivalent to calling Enca as \fBenconv\fR. -.sp -If no output type selector is specified, detect file encodings, guess your -preferred charset from locales, and convert files to it (only available with -+target\-charset\-auto feature). -.TP -\fB\-g\fR, \fB\-\-guess\fR -Equivalent to calling Enca as \fBenca\fR. -.sp -If no output type selector is specified, detect file encodings and report -them. -.PP -. -.SS Output type selectors -.PP -select what action Enca will take when it determines the encoding; -most of them just choose between different names, formats and conventions -how encodings can be printed, but one of them (\fB\-x\fR) -is special: it tells Enca to recode files to some other -encoding \fIENC\fR. -These options are mutually exclusive; if you specify more than one output -type selector the last one takes precedence. -.sp -Several output types represent charset name used by some other program, -but not all these programs know all the charsets which Enca recognises. -Be warned, Enca makes no difference between unrecognised charset and -charset having no name in given namespace in such situations. -.TP -\fB\-d\fR, \fB\-\-details\fR -It used to print a few pages of details about the guessing process, but since -Enca is just a program linked against Enca library, this is not possible and -this option is roughly equivalent to \fB\-\-human\-readable\fR, -except it reports failure reason when Enca doesn't recoginize the encoding. -.TP -\fB\-e\fR, \fB\-\-enca\-name\fR -Prints Enca's nice name of the charset, i.e., perhaps the most generally -accepted and more or less human-readable charset identifier, -with surfaces appended. -.sp -This name is used when calling an external converter, too. -.TP -\fB\-f\fR, \fB\-\-human\-readable\fR -Prints verbal description of the detected charset and surfaces\-\-something -a human understands best. -This is the default behaviour. -.sp -The precise format is following: the first line contains charset name alone, -and it's followed by zero or more indented lines containing names of detected -surfaces. -This format is not, however, suitable or intended for further -machine-processing, and the verbal charset descriptions are like to change -in the future. -.TP -\fB\-i\fR, \fB\-\-iconv\-name\fR -Prints how \fIiconv\fR(3) (and/or \fIiconv\fR(1)) calls the detected charset. -More precisely, it prints one, more or less arbitrarily chosen, alias -accepted by iconv. -A charset unknown to iconv counts as unknown. -.sp -This output type makes sense only when Enca is compiled with iconv support -(feature +iconv\-interface). -.TP -\fB\-r\fR, \fB\-\-rfc1345\-name\fR -Prints RFC\~1345 charset name. -When such a name doesn't exist because RFC\~1345 doesn't define given -encoding, some other name defined in some other RFC or just the name which -author consideres `the most canonical', is printed. -.sp -Since RFC\~1345 doesn't define surfaces, no surface info is appended. -.TP -\fB\-m\fR, \fB\-\-mime\-name\fR -Prints preferred MIME name of detected charset. This is the name you should -normally use when fixing e-mails or web pages. -.sp -A charset not present in http://www.iana.org/assignments/character-sets -counts as unknown. -.TP -\fB\-s\fR, \fB\-\-cstocs\-name\fR -Prints how \fIcstocs\fR(1) calls the detected charset. -A charset unknown to cstocs counts as unknown. -.TP -\fB\-n\fR, \fB\-\-name=\fR\fIWORD\fR -Prints charset (encoding) name selected by \fIWORD\fR (can be abbreviated as -long as is unambiguous). -For names listed above, \fB\-\-name=\fR\fIWORD\fR is equivalent to -\fB\-\-\fR\fIWORD\fR. -.sp -Using \fBaliases\fR as the output type causes Enca to print list of all -accepted aliases of detected charset. -.TP -\fB\-x\fR, \fB\-\-convert\-to=\fR[\fB..\fR]\fIENC\fR -Converts file to encoding \fIENC\fR. -.sp -The optional `..' before encoding name has no special meaning, except you can -use it to remind yourself that, unlike in \fIrecode\fR(1), you should specify -\fIdesired\fR encoding, instead of current. -.sp -You can use \fIrecode\fR(1) recoding chains or any other kind of braindead -recoding specification for \fIENC\fR, provided that you tell Enca to use some -tool understanding it for conversion (see section \fBCONVERSION\fR). -.sp -When Enca fails to determine the encoding, it prints a warning and leaves the -the file as is; when it is run as a filter it tries to do its best to copy -standard input to standard output unchanged. -Nevertheless, you should not rely on it and do backup. -.PP -. -.SS Guessing parameters -.PP -There's only one: \fB\-L\fR setting language of input files. This option is -mandatory (but see below). -.TP -\fB\-L\fR, \fB\-\-language=\fR\fILANG\fR -Sets language of input files to \fILANG\fR. -.sp -More precisely, \fILANG\fR can be any valid locale name (or alias with -+locale\-alias feature) of some supported language. -You can also specify `none' as language name, only multibyte encodings are -recognised then. -Run -.sp -enca \-\-list languages -.sp -to get list of supported languages. -When you don't specify any language Enca tries to guess your language from -locale settings and assumes input files use this language. -See section \fBLANGUAGES\fR for details. -.PP -. -.SS Conversion parameters -.PP -give you finer control of how charset conversion will be performed. -They don't affect anything when \fB\-x\fR is not specified as output type. -Please see section \fBCONVERSION\fR for the gory conversion details. -.TP -\fB\-C\fR, \fB\-\-try\-converters=\fR\fILIST\fR -Appends comma separated \fILIST\fR to the list of converters that will -be tried when you ask for conversion. -Their names can be abbreviated as long as they are unambiguous. -Run -.sp -enca \-\-list converters -.sp -to get list of all valid converter names (and see section \fBCONVERSION\fR -for their description). -.sp -The default list depends on how Enca has been compiled, run -.sp -enca \-\-help -.sp -to find out default converter list. -.sp -Note the default list is used only when you don't specify \fB\-C\fR at all. -Otherwise, the list is built as if it were initially empty and every -\fB\-C\fR adds new converter(s) to it. Moreover, specifying -\fBnone\fR as converter name causes clearing the converter list. -.TP -\fB\-E\fR, \fB\-\-external\-converter\-program=\fR\fIPATH\fR -Sets external converter program name to \fIPATH\fR. -Default external converter depends on how enca has been complied, and the -possibility to use external converters may not be available at all. -Run -.sp -enca \-\-help -.sp -to find out default converter program in your enca build. -.PP -. -.SS General options -.PP -don't fit to other option categories... -.TP -\fB\-p\fR, \fB\-\-with\-filename\fR -Forces Enca to prefix each result with corresponding file name. -By default, Enca prefixes results with filenames when run on multiple files. -.sp -Standard input is printed as \fBSTDIN\fR -and standard output as \fBSTDOUT\fR -(the latter can be probably seen in error messages only). -.TP -\fB\-P\fR, \fB\-\-no\-filename\fR -Forces Enca to not prefix results with file names. -By default, Enca doesn't prefix result with file name when -run on a single file (including standard input). -.TP -\fB\-V\fR, \fB\-\-verbose\fR -Increases verbosity level (each use increases it by one). -.sp -Currently this option in not very useful because different parts of Enca -respond differently to the same verbosity level, mostly not at all. -.PP -. -.SS Listings -.PP -are all terminal, i.e. when Enca encounters some of them it prints -the required listing and terminates without processing any following options. -.TP -\fB\-h\fR, \fB\-\-help\fR -Prints brief usage help. -.TP -\fB\-G\fR, \fB\-\-license\fR -Prints full Enca license (through a pager, if possible). -.TP -\fB\-l\fR, \fB\-\-list=\fR\fIWORD\fR -Prints list specified by \fIWORD\fR (can be abbreviated as long as it is -unambiguous). -Available lists include: -.sp -\fBbuilt\-in\-charsets\fR. -All encodings convertible by built\-in converter, by group -(both input and output encoding must be from this list and belong to the same -group for internal conversion). -.sp -\fBbuilt\-in\-encodings\fR. -Equivalent to \fBbuilt\-in\-charsets\fR, but considered obsolete; will -be accepted with a warning, for a while. -.sp -\fBconverters\fR. -All valid converter names (to be used with \fB\-C\fR). -.sp -\fBcharsets\fR. -All encodings (charsets). -You can select what names will be printed with \fB\-\-name\fR or any -name output type selector (of course, only encodings having a name in given -namespace will be printed then), the selector must be specified \fIbefore\fR -\fB\-\-list\fR. -.sp -\fBencodings\fR. -Equivalent to \fBcharsets\fR, but considered obsolete; will -be accepted with a warning, for a while. -.sp -\fBlanguages\fR. -All supported languages together with charsets belonging to them. -Note output type selects language name style, not charset name style here. -.sp -\fBnames\fR. -All possible values of \fB\-\-name\fR option. -.sp -\fBlists\fR. -All possible values of this option. -(Crazy?) -.sp -\fBsurfaces\fR. -All surfaces Enca recognises. -.TP -\fB\-v\fR, \fB\-\-version\fR -Prints program version and list of features (see section \fBFEATURES\fR). -. -. -.SH "CONVERSION" -.PP -Though Enca has been originally designed as a tool for guessing encoding -only, it now features several methods of charset conversion. -You can control which of them will be used with \fB\-C\fR. -.PP -Enca sequentially tries converters from the list specified by \fB\-C\fR -until it finds some that -is able to perform required conversion or until it exhausts the list. -You should specify preferred converters first, less preferred later. -External converter (\fBextern\fR) -should be always specified last, only as last resort, since it's usually not -possible to recover when it fails. -The default list of converters always starts with \fBbuilt\-in\fR and then -continues with the first one available from: \fBlibrecode\fR, \fBiconv\fR, -nothing. -.PP -It should be noted when Enca says it is not able to perform the -conversion it only means none of the converters is able to perform it. -It can be still possible to perform the required conversion in several steps, -using several converters, but to figure out how, human intelligence is -probably needed. -.PP -. -.SS Built\-in converter -.PP -is the simplest and far the fastest of all, can perform only -a few byte-to-byte conversions and modifies files directly in place (may -be considered dangerous, but is pretty efficient). You can get list of -all encodings it can convert with -.XA "enca \-\-list built\-in" -Beside speed, its main advantage (and also disadvantage) is that it doesn't -care: it simply converts characters having a representation in target -encoding, doesn't touch anything else and never prints any error message. -.sp -This converter can be specified as \fBbuilt\-in\fR with \fB\-C\fR. -.PP -. -.SS Librecode converter -.PP -is an interface to GNU recode library, that does the actual recoding job. -It may or may not be compiled in; run -.XA "enca \-\-version" -to find out its availability in your enca build -(feature +librecode\-interface). -.sp -You should be familiar with \fIrecode\fR(1) before using it, -since recode is a quite sophisticated and powerful charset conversion tool. -You may run into problems using it together with Enca -particularly because Enca's support for surfaces not 100% compatible, -because recode tries too hard to make the transformation reversible, -because it sometimes silently ignores I/O errors, -and because it's incredibily buggy. -Please see GNU recode info pages for details about recode library. -.sp -This converter can be specified as \fBlibrecode\fR with \fB\-C\fR. -.PP -. -.SS Iconv converter -.PP -is an interface to the UNIX98 \fIiconv\fR(3) -conversion functions, that do the actual recoding job. -It may or may not be compiled in; run -.XA "enca \-\-version" -to find out its availability in your enca build -(feature +iconv\-interface). -.sp -While iconv is present on most today systems it only rarely -offer some useful set of available conversions, the only notable exception -being iconv from GNU libc. -It is usually quite picky about surfaces, too (while, at the same time, -not implementing surface conversion). -It however probably represents the only standard(ized) tool -able to perform conversion from/to Unicode. -Please see iconv documentation about for details about its capabilities on -your particular system. -.sp -This converter can be specified as \fBiconv\fR with \fB\-C\fR. -.PP -. -.SS External converter -.PP -is an arbitrary external conversion tool that can be specified with -\fB\-E\fR option (at most one can be defined simultaneously). -There are some standard, provided together with enca: \fBcstocs\fR, -\fBrecode\fR, \fBmap\fR, \fBumap\fR, and \fBpiconv\fR. -All are wrapper scripts: for \fIcstocs\fR(1), \fIrecode\fR(1), -\fImap\fR(1), \fIumap\fR(1), and \fIpiconv\fR(1). -.sp -Please note enca has little control what the external converter really does. -If you set it to \fB/bin/rm\fR -you are fully responsible for the consequences. -.sp -If you want to make your own converter to use with enca, -you should know it is always called -.XA "\fICONVERTER\fR \fIENC_CURRENT\fR \fIENC\fR \fIFILE\fR [\fB\-\fR]" -where \fICONVERTER\fR is what has been set by \fB\-E\fR, -\fIENC_CURRENT\fR is detected encoding, -\fIENC\fR is what has been specified with \fB\-x\fR, -and \fIFILE\fR is the file to convert, i.e. it is called for each file -separately. -The optional fourth parameter, \fB\-\fR, should cause (when present) -sending result of conversion to standard output instead of overwriting -the file \fIFILE\fR. -The converter should also take care of not changing file permissions, -returning error code\~1 when it fails and cleaning its temporary files. -Please see the standard external converters for examples. -.sp -This converter can be specified as \fBextern\fR with \fI\-C\fR. -.PP -. -.SS Default target charset -.PP -The starightforward way of specifying target charset is the \fB\-x\fR -option, which overrides any defaults. -When Enca is called as \fBenconv\fR, default target charset is selected -exactly the same way as \fIrecode\fR(1) does it. -.PP -If the \fBDEFAULT_CHARSET\fR environment variable is set, it's used as the -target charset. -.PP -Otherwise, if you system provides the \fInl_langinfo\fR(3) function, current -locale's native charset is used as the target charset. -.PP -When both methods fail, Enca complains and terminates. -.PP -. -.SS Reversibility notes -.PP -If reversibility is crucial for you, you shouldn't use enca as converter -at all (or maybe you can, with very specifically designed \fIrecode\fR(1) -wrapper). -Otherwise you should at least know that there four -basic means of handling inconvertible character entities: -.sp -fail\-\-this is a possibility, too, and incidentally it's exactly what current -GNU libc iconv implementation does (recode can be also told to do it) -.sp -don't touch them\-\-this is what enca internal converter always does and -recode can do; though it is not reversible, a human being is usually able to -reconstruct the original (at least in principle) -.sp -approximate them\-\-this is what cstocs can do, and recode too, though -differently; and the best choice if you -just want to make the accursed text readable -.sp -drop them out\-\-this is what both recode and cstocs can do (cstocs can also -replace these characters by some fixed character instead of mere ignoring); -useful when the to\-be\-omitted characters contain only noise. -.sp -Please consult your favourite converter manual for details of this issue. -Generally, if you are not lucky enough to have all convertible characters -in you file, manual intervention is needed anyway. -.PP -. -.SS Performance notes -.PP -Poor performance of available converters has been one of main reasons for -including built\-in converter in enca. -Try to use it whenever possible, i.e. when files in consideration are -charset-clean enough or charset-messy enough so that its zero built\-in -intelligence doesn't matter. -It requires no extra disk space nor extra memory and can outperform -\fIrecode\fR(1) more than 10 times on large files and Perl -version (i.e. the faster one) of \fIcstocs\fR(1) more than 400 times on small -files (in fact it's almost as fast as mere \fIcp\fR(1)). -.PP -Try to avoid external converters when it's not absolutely necessary since -all the forking and moving stuff around is incredibily slow. -. -. -.SH "ENCODINGS" -.PP -You can get list of recognised character sets with -.XA "enca \-\-list charsets" -and using \fB\-\-name\fR parameter you can select any name you want to be -used in the listing. -You can also list all surfaces with -.XA "enca \-\-list surfaces" -Encoding and surface names are case insensitive and non-alphanumeric -characters are not taken into account. -However, non-alphanumeric characters are mostly -not allowed at all. The only allowed are: `\-', `_', `.', `:', and\~`/' -(as charset/surface separator). -So `ibm852' and `IBM-852' are the same, while `IBM 852' is not accepted. -.PP -. -.SS Charsets -.PP -Following list of recognised charsets uses Enca's names (\fB\-e\fR) and -verbal descriptions as reported by Enca (\fB\-f\fR): -.PP -.TS -tab (@); -l l. -ASCII@7bit ASCII characters -ISO-8859-2@ISO 8859-2 standard; ISO Latin 2 -ISO-8859-4@ISO 8859-4 standard; Latin 4 -ISO-8859-5@ISO 8859-5 standard; ISO Cyrillic -ISO-8859-13@ISO 8859-13 standard; ISO Baltic; Latin 7 -ISO-8859-16@ISO 8859-16 standard -CP1125@MS-Windows code page 1125 -CP1250@MS-Windows code page 1250 -CP1251@MS-Windows code page 1251 -CP1257@MS-Windows code page 1257; WinBaltRim -IBM852@IBM/MS code page 852; PC (DOS) Latin 2 -IBM855@IBM/MS code page 855 -IBM775@IBM/MS code page 775 -IBM866@IBM/MS code page 866 -baltic@ISO-IR-179; Baltic -KEYBCS2@Kamenicky encoding; KEYBCS2 -macce@Macintosh Central European -maccyr@Macintosh Cyrillic -ECMA-113@Ecma Cyrillic; ECMA-113 -KOI-8_CS_2@KOI8-CS2 code (`T602') -KOI8-R@KOI8-R Cyrillic -KOI8-U@KOI8-U Cyrillic -KOI8-UNI@KOI8-Unified Cyrillic -TeX@(La)TeX control sequences -UCS-2@Universal character set 2 bytes; UCS-2; BMP -UCS-4@Universal character set 4 bytes; UCS-4; ISO-10646 -UTF-7@Universal transformation format 7 bits; UTF-7 -UTF-8@Universal transformation format 8 bits; UTF-8 -CORK@Cork encoding; T1 -GBK@Simplified Chinese National Standard; GB2312 -BIG5@Traditional Chinese Industrial Standard; Big5 -HZ@HZ encoded GB2312 -unknown@Unrecognized encoding -.TE -.PP -where \fBunknown\fR is not any real encoding, -it's reported when Enca is not able to give a reliable answer. -.PP -. -.SS Surfaces -.PP -Enca has some experimental support for so-called surfaces (see below). -It detects following surfaces (not all can be applied to all charsets): -.PP -.TS -tab (@); -l l. -/CR@CR line terminators -/LF@LF line terminators -/CRLF@CRLF line terminators -N.A.@Mixed line terminators -N.A.@Surrounded by/intermixed with non-text data -/21@Byte order reversed in pairs (1,2 -> 2,1) -/4321@Byte order reversed in quadruples (1,2,3,4 -> 4,3,2,1) -N.A.@Both little and big endian chunks, concatenated -/qp@Quoted-printable encoded -.TE -.PP -Note some surfaces have N.A. in place of identifier\-\-they -cannot be specified on command line, they can only be reported by Enca. -This is intentional because they only inform you why the file cannot be -considered surface-consistent instead of representing a real surface. -.PP -Each charset has its natural surface (called `implied' in recode) which is not -reported, e.g., for IBM 852 charset it's `CRLF line terminators'. -For UCS encodings, big endian is considered as natural surface; -unusual byte orders are constructed from 21 and 4321 permutations: -2143 is reported simply as 21, -while 3412 is reported as combination of 4321 and 21. -.PP -Doubly-encoded UTF-8 is neither charset nor surface, it's just reported. -.PP -. -.SS About charsets, encodings and surfaces -.PP -Charset is a set of character entities while encoding is its representation -in the terms of bytes and bits. -In Enca, the word \fIencoding\fR means the same as `representation of text', -i.e. the relation between sequence of character entities constituting the -text and sequence of bytes (bits) constituting the file. -.PP -So, encoding is both character set and so-called surface -(line terminators, byte order, combining, Base64 transformation, etc.). -Nevertheless, it proves convenient to work with some {charset,surface} pairs -as with genuine charsets. -So, as in \fIrecode\fR(1), all UCS- and UTF- encodings of Universal character -set are called charsets. -Please see recode documentation for more details of this issue. -.PP -The only good thing about surfaces is: when you don't start playing with -them, neither Enca won't start and it will try to behave as much as -possible as a surface-unaware program, even when talking to recode. -.PP -. -. -.SH "LANGUAGES" -.PP -Enca needs to know the language of input files to work reliably, at least -in case of regular 8bit encoding. -Multibyte encodings should be recognised for any Latin, Cyrillic or Greek -language. -.PP -You can (or have to) use \fB\-L\fR option to tell Enca the language. -Since people most often work with files in the same language for which they -have configured locales, Enca tries tries to guess the language by examining -value of \fBLC_CTYPE\fR and other locale categories -(please see \fIlocale\fR(7)) and using it for the -language when you don't specify any. -Of course, it may be completely wrong and will give you nonsense answers and -damage your files, so please don't forget to use the \fB\-L\fR option. -You can also use \fBENCAOPT\fR environment variable to set a default language -(see section \fBENVIRONMENT\fR). -.PP -Following languages are supported by Enca (each language is listed together -with supported 8bit encodings). -.PP -.TS -tab (@); -l l. -Belarussian@CP1251 IBM866 ISO\-8859\-5 KOI8\-UNI maccyr IBM855 -Bulgarian @CP1251 ISO\-8859\-5 IBM855 maccyr ECMA\-113 -Czech @ISO\-8859\-2 CP1250 IBM852 KEYBCS2 macce KOI\-8_CS_2 CORK -Estonian @ISO\-8859\-4 CP1257 IBM775 ISO\-8859\-13 macce baltic -Croatian @CP1250 ISO\-8859\-2 IBM852 macce CORK -Hungarian @ISO\-8859\-2 CP1250 IBM852 macce CORK -Lithuanian @CP1257 ISO\-8859\-4 IBM775 ISO\-8859\-13 macce baltic -Latvian @CP1257 ISO\-8859\-4 IBM775 ISO\-8859\-13 macce baltic -Polish @ISO\-8859\-2 CP1250 IBM852 macce ISO\-8859\-13 ISO\-8859\-16 baltic CORK -Russian @KOI8\-R CP1251 ISO\-8859\-5 IBM866 maccyr -Slovak @CP1250 ISO\-8859\-2 IBM852 KEYBCS2 macce KOI\-8_CS_2 CORK -Slovene @ISO\-8859\-2 CP1250 IBM852 macce CORK -Ukrainian @CP1251 IBM855 ISO\-8859\-5 CP1125 KOI8\-U maccyr -Chinese @GBK BIG5 HZ -none @ -.TE -.PP -The special language \fBnone\fR can be shortened to \fB__\fR, it -contains no 8bit encodings, so only multibyte encodings are detected. -.PP -. -. -.SH "FEATURES" -.PP -Several Enca's features depend on what is available on your system and how -it was compiled. -You can get their list with -.XA "enca \-\-version" -Plus sign before a feature name means it's available, minus sign means -this build lacks the particular feature. -.PP -\fBlibrecode\-interface\fR. -Enca has interface to GNU recode library charset conversion functions. -.sp -\fBiconv\-interface\fR. -Enca has interface to UNIX98 iconv charset conversion functions. -.sp -\fBexternal\-converter\fR. -Enca can use external conversion programs (if you have some suitable -installed). -.sp -\fBlanguage\-detection\fR. -Enca tries to guess language (\fB\-L\fR) from locales. You don't need the -\fB\-\-language\fR option, at least in principle. -.sp -\fBlocale\-alias\fR. -Enca is able to decrypt locale aliases used for language names. -.sp -\fBtarget\-charset\-auto\fR. -Enca tries to detect your preferred charset from locales. -Option \fB\-\-auto\-convert\fR and calling Enca as \fBenconv\fR works, at -least in principle. -.sp -\fBENCAOPT\fR. -Enca is able to correctly parse this environment variable before command line -parameters. Simple stuff like \fBENCAOPT="\-L uk"\fR will work even without -this feature. -.PP -. -. -.SH "ENVIRONMENT" -.PP -The variable \fBENCAOPT\fR can hold set of default Enca options. -Its content is interpreted before command line arguments. -Unfortunately, this doesn't work everywhere (must have +ENCAOPT -feature). -.PP -\fBLC_CTYPE\fR, \fBLC_COLLATE\fR, \fBLC_MESSAGES\fR -(possibly inherited from \fBLC_ALL\fR or \fBLANG\fR) is used -for guessing your language (must have +language-detection feature). -.PP -The variable \fBDEFAULT_CHARSET\fR can be used by \fBenconv\fR as the default -target charset. -.PP -. -. -.SH "DIAGNOSTICS" -.PP -Enca returns exit code\~0 when all input files were successfully proceeded -(i.e. all encodings were detected and all files were converted to required -encoding, if conversion was asked for). -Exit code\~1 is returned when Enca wasn't able to either guess encoding or -perform conversion on any input file becuase it's not clever enough. -Exit code\~2 is returned in case of serious (e.g. I/O) troubles. -.PP -. -. -.SH "SECURITY" -.PP -It should be possible to let Enca work unattended, it's its goal. However: -.PP -There's no warranty the detection works 100%. Don't bet on it, you can easily -lose valuable data. -.PP -Don't use enca (the program), link to libenca instead if you want anything -resembling security. You have to perform the eventual conversion yourself -then. -.PP -Don't use external converters. Ideally, disable them compile-time. -.PP -Be aware of \fBENCAOPT\fR and all the built-in automagic guessing various -things from environment, namely locales. -.PP -. -. -.SH "SEE ALSO" -.PP -\fIautoconvert\fR(1), -\fIcstocs\fR(1), -\fIfile\fR(1), -\fIiconv\fR(1), -\fIiconv\fR(3), -\fInl_langinfo\fR(3), -\fImap\fR(1), -\fIpiconv\fR(1), -\fIrecode\fR(1), -\fIlocale\fR(5), -\fIlocale\fR(7), -\fIltt\fR(1), -\fIumap\fR(1), -\fIunicode\fR(7), -\fIutf-8\fR(7), -\fIxcode\fR(1) -.PP -. -. -.SH "KNOWN BUGS" -.PP -It has too many \fIunknown\fR bugs. -.PP -The idea of using \fBLC_*\fR value for language is certainly braindead. -However I like it. -.PP -It can't backup files before mangling them. -.PP -In certain situations, it may behave incorrectly on >31bit file systems -and/or over NFS (both untested but shouldn't cause problems in practice). -.PP -Built\-in converter does not convert character `ch' from \fIKOI8-CS2\fR, -and possibly some other characters you've probably never heard about anyway. -.PP -EOL type recognition works poorly on Quoted-printable encoded files. -This should be fixed someday. -.PP -There are no command line options to tune libenca parameters. -This is intentional (Enca should DWIM) but sometimes this is a nuisance. -.PP -The manual page is too long, especially this section. -This doesn't matter since nobody does read it. -.PP -Send bug reports to <http://bugs.cihar.com/>. -. -. -.SH "TRIVIA" -.PP -Enca is Extremely Naive Charset Analyser. -Nevertheless, the `enc' originally comes from `encoding' -so the leading\~`e' should be read as in -`encoding' not as in `extreme'. -. -. -.SH "AUTHORS" -.PP -David Necas (Yeti) <yeti@physics.muni.cz> -.PP -Michal Cihar <michal@cihar.com> -.sp -Unicode data has been generated from various (free) on\-line resources or -using GNU recode. -Statistical data has been generated from various texts on the Net, I hope -character counting doesn't break anyone's copyright. -. -. -.SH "ACKNOWLEDGEMENTS" -.PP -Please see the file THANKS in distribution. -. -. -.SH "COPYRIGHT" -.PP -Copyright (C) 2000-2003 David Necas (Yeti). -.PP -Copyright (C) 2009 Michal Cihar <michal@cihar.com>. -.sp -Enca is free software; you can redistribute it and/or modify it -under the terms of version 2 of the GNU General Public License -as published by the Free Software Foundation. -.sp -Enca is distributed in the hope that it will be useful, -but WITHOUT ANY WARRANTY; without even the implied warranty -of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. -See the GNU General Public License for more details. -.sp -You should have received a copy of the GNU General Public License -along with Enca; if not, write to the Free Software Foundation, -Inc., 675 Mass Ave, Cambridge, MA 02139, USA. -. |