#============================================================================
# Enca v1.12 (2009-10-29)  guess and convert encoding of text files
# Copyright (C) 2000-2003 David Necas (Yeti) <yeti@physics.muni.cz>
# Copyright (C) 2009 Michal Cihar <michal@cihar.com>
#============================================================================

Contents

  0. Developing programs utilizing libenca
  1. How to add a new charset/encoding to libenca
  2. How to add a new surface to libenca
  3. How to add a new language to libenca
  4. Automake, autoconf, libtool, ... note


0. Developing programs utilizing libenca
****************************************

  * Look at the libenca API documentation in devel-docs/html.
  * Look at the enca source to see how it uses libenca.
    Note that enca is quite a simple application (practically all libenca
    interaction is in src/enca.c).  It is single-threaded and uses one
    language and one analyser all the time.  Provided each thread has its own
    analyser, libenca should be thread-safe (untested).
  * Treat names starting with ENCA, Enca, enca, _ENCA, _Enca, and _enca
    as reserved.
  * pkg-config is supported; you can use PKG_CHECK_MODULES to check for
    libenca in your configure scripts.


1. How to add a new charset/encoding
************************************

For every new encoding (optional steps are marked `[optional]'):

  iconvcap.c:
    * Add a new test (even if you are 100% sure iconv will never support it).
      See the top of iconvcap.c for some documentation on how it works.
  tools/encodings.dat:
    * Add a new entry.
    * Use @ICONV_NAME_<name>@ (as it will appear in iconvcap output) for
      iconv names.
  tools/iconvenc.null:
    * Add the new encoding (with NULL).


Specifically, for regular 8bit (language dependent) charsets:

  lib/unicodemap.c:
    * Add a new map to Unicode (UCS-2) unicode_map_...[].
    * Add a new UNICODE_MAP[] entry.
  lib/filters.c: [optional]
    * Create a new filter or make an alias of an existing filter.
  lib/lang_??.c:
    * Add the new encoding to some existing language(s).
    * Add appropriate filters or hooks [optional].
  data/maps/??.map:
    * Add a new map to Unicode (UCS-2).
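
To make the idea of a map to Unicode concrete, here is a minimal,
self-contained sketch.  Everything in it is hypothetical: the charset `toy8',
the names TOY_NOT_A_CHAR, TOY8_UPPER[] and toy8_fill_map(), and the table
layout are made up for illustration and are not the actual layout used in
lib/unicodemap.c.  The three upper-half entries happen to sit where
ISO-8859-2 puts them.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed marker for "this byte has no character assigned". */
#define TOY_NOT_A_CHAR 0xFFFFu

/* One upper-half assignment of the hypothetical charset "toy8". */
struct toy_entry {
    unsigned char byte;     /* byte value in the 8bit charset */
    unsigned short ucs2;    /* corresponding UCS-2 code point */
};

static const struct toy_entry TOY8_UPPER[] = {
    { 0xE1, 0x00E1 },   /* LATIN SMALL LETTER A WITH ACUTE */
    { 0xE8, 0x010D },   /* LATIN SMALL LETTER C WITH CARON */
    { 0xF8, 0x0159 },   /* LATIN SMALL LETTER R WITH CARON */
};

/* Fill a complete 256-entry byte -> UCS-2 table for "toy8":
 * the lower half is ASCII, unassigned bytes get TOY_NOT_A_CHAR. */
static void toy8_fill_map(unsigned short map[256])
{
    size_t i;

    assert(map != NULL);
    for (i = 0; i < 0x80; i++)
        map[i] = (unsigned short)i;
    for (i = 0x80; i < 0x100; i++)
        map[i] = (unsigned short)TOY_NOT_A_CHAR;
    for (i = 0; i < sizeof TOY8_UPPER / sizeof TOY8_UPPER[0]; i++)
        map[TOY8_UPPER[i].byte] = TOY8_UPPER[i].ucs2;
}
```

After toy8_fill_map(map), map[0xE8] is 0x010D and any unassigned byte such as
0x80 maps to TOY_NOT_A_CHAR.  The real work when adding a charset is getting
all the upper-half entries right.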


Specifically, for multibyte encodings:

  lib/multibyte.c:
    * Create a new check function.
    * Put it into the appropriate ascii/8bit/binary test group:
      ENCA_MULTIBYTE_TESTS_ASCII[], ENCA_MULTIBYTE_TESTS_8BIT[], or
      ENCA_MULTIBYTE_TESTS_BINARY[].
    * Put strict tests (i.e. tests that may fail) first, looks-like tests
      last.
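
The ordering rule can be illustrated with a small, self-contained sketch.  It
does not use the real check function signature or the real test arrays from
lib/multibyte.c; the type toy_check_func, the two checks, the array
TOY_TESTS_8BIT[] and the runner toy_run_tests() are all hypothetical.  The
sketch assumes the first accepting test wins, which is why a strict test has
to come before a permissive one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical check signature: returns 1 if the buffer is accepted. */
typedef int (*toy_check_func)(const unsigned char *buf, size_t size);

/* Strict test: every non-ASCII byte must belong to a structurally
 * well-formed UTF-8 sequence, and at least one such sequence must occur.
 * Simplified: overlong forms and surrogates are not rejected. */
static int toy_is_valid_utf8(const unsigned char *buf, size_t size)
{
    size_t i = 0, multibyte = 0;

    while (i < size) {
        unsigned char c = buf[i++];
        size_t need;

        if (c < 0x80)
            continue;
        else if ((c & 0xE0) == 0xC0)
            need = 1;
        else if ((c & 0xF0) == 0xE0)
            need = 2;
        else if ((c & 0xF8) == 0xF0)
            need = 3;
        else
            return 0;   /* stray continuation byte or invalid lead byte */

        for (; need > 0; need--) {
            if (i >= size || (buf[i] & 0xC0) != 0x80)
                return 0;   /* truncated or broken sequence */
            i++;
        }
        multibyte++;
    }
    return multibyte > 0;
}

/* Looks-like test: accepts as soon as one plausible two-byte UTF-8
 * sequence is seen, even if the rest of the buffer is damaged. */
static int toy_looks_like_utf8(const unsigned char *buf, size_t size)
{
    size_t i;

    for (i = 0; i + 1 < size; i++) {
        if (buf[i] >= 0xC2 && buf[i] <= 0xDF && (buf[i + 1] & 0xC0) == 0x80)
            return 1;
    }
    return 0;
}

/* Strict test first, looks-like test last. */
static const toy_check_func TOY_TESTS_8BIT[] = {
    toy_is_valid_utf8,
    toy_looks_like_utf8,
};

/* Run the tests in order; return the index of the first one that
 * accepts the buffer, or -1 if none does. */
static int toy_run_tests(const toy_check_func *tests, size_t ntests,
                         const unsigned char *buf, size_t size)
{
    size_t i;

    assert(tests != NULL && buf != NULL);
    for (i = 0; i < ntests; i++) {
        if (tests[i](buf, size))
            return (int)i;
    }
    return -1;
}
```

With this ordering, clean input is decided by the strict test, and the
looks-like test is consulted only for input the strict test has rejected.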


2. How to add a new surface
***************************

  * Try asking the author what to do, since this may be complicated, or
  * Hack.  Basically, the new surface must be added to the EncaSurface enum
    in lib/enca.h and to SURFACE_INFO[] in lib/encnames.c, and a detection
    method must be added to lib/guess.c.  Now the most complicated part:
    this new method must be used ``in the right places'' in make_guess()
    in lib/guess.c.
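
The three pieces (an enum value, an info table entry, a detection method) can
be sketched in isolation as follows.  This is a toy under stated assumptions,
not the code in lib/: ToySurface, TOY_SURFACE_INFO[], toy_detect_eol() and
toy_surface_name() are hypothetical names, the one-bit-per-surface layout is
an assumption of the sketch, and end-of-line detection serves only as a
convenient, easy-to-follow example.  Where to call the detection from
make_guess() is exactly the part the sketch does not cover.

```c
#include <assert.h>
#include <stddef.h>

/* Piece 1: one bit per surface, so several can be combined. */
typedef enum {
    TOY_SURFACE_EOL_CR   = 1 << 0,
    TOY_SURFACE_EOL_LF   = 1 << 1,
    TOY_SURFACE_EOL_CRLF = 1 << 2,
    TOY_SURFACE_EOL_MIX  = 1 << 3
} ToySurface;

/* Piece 2: a name for every surface bit. */
struct toy_surface_info {
    ToySurface bit;
    const char *name;
};

static const struct toy_surface_info TOY_SURFACE_INFO[] = {
    { TOY_SURFACE_EOL_CR,   "CR"    },
    { TOY_SURFACE_EOL_LF,   "LF"    },
    { TOY_SURFACE_EOL_CRLF, "CRLF"  },
    { TOY_SURFACE_EOL_MIX,  "mixed" },
};

/* Piece 3: the detection method.  Counts each kind of line terminator
 * and reports the single kind found, "mixed" if more than one kind
 * occurs, or 0 if the buffer has no line terminator at all. */
static unsigned int toy_detect_eol(const unsigned char *buf, size_t size)
{
    size_t i, cr = 0, lf = 0, crlf = 0;

    assert(buf != NULL || size == 0);
    for (i = 0; i < size; i++) {
        if (buf[i] == '\r') {
            if (i + 1 < size && buf[i + 1] == '\n') {
                crlf++;
                i++;    /* skip the LF of the CRLF pair */
            } else {
                cr++;
            }
        } else if (buf[i] == '\n') {
            lf++;
        }
    }

    if ((cr > 0) + (lf > 0) + (crlf > 0) > 1)
        return TOY_SURFACE_EOL_MIX;
    if (cr > 0)
        return TOY_SURFACE_EOL_CR;
    if (lf > 0)
        return TOY_SURFACE_EOL_LF;
    if (crlf > 0)
        return TOY_SURFACE_EOL_CRLF;
    return 0;
}

/* Look up the name of a single surface bit; NULL if it is unknown. */
static const char *toy_surface_name(unsigned int surface)
{
    size_t i;

    for (i = 0; i < sizeof TOY_SURFACE_INFO / sizeof TOY_SURFACE_INFO[0]; i++) {
        if ((unsigned int)TOY_SURFACE_INFO[i].bit == surface)
            return TOY_SURFACE_INFO[i].name;
    }
    return NULL;
}
```

A new surface would need the same three additions, plus the integration into
the guessing logic described above.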


3. How to add a new language
****************************

  Create a new language file:
    * Create a new lib/lang_....c file by copying an existing one (use the
      locale code for the name).
    * Fill in all encoding and occurrence data, create filters and hooks (see
      filters.c too).  You can do it manually, but look at how it's done for
      existing languages in data/* and read data/README.
  lib/internal.h:
    * Add new ENCA_LANGUAGE_....
  src/lang.c:
    * Add a new LANGUAGE_LIST[] entry pointing to the ENCA_LANGUAGE_....
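
How a language description and a list of pointers to such descriptions fit
together can be sketched as below.  The sketch is deliberately simplified and
entirely hypothetical: ToyLanguageInfo is not the real language structure
(occurrence data, filters and hooks are left out), and the locale codes `xx'
and `yy', TOY_LANGUAGE_LIST[] and toy_find_language() exist only in this
example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for a language description. */
typedef struct {
    const char *name;               /* locale code, e.g. "xx" */
    size_t ncharsets;               /* number of entries in csnames */
    const char *const *csnames;     /* charsets used by this language */
} ToyLanguageInfo;

/* What a hypothetical lang_xx.c would define. */
static const char *const TOY_CSNAMES_XX[] = { "ISO-8859-2", "CP1250" };
static const ToyLanguageInfo TOY_LANGUAGE_XX = {
    "xx",
    sizeof TOY_CSNAMES_XX / sizeof TOY_CSNAMES_XX[0],
    TOY_CSNAMES_XX
};

/* What a hypothetical lang_yy.c would define. */
static const char *const TOY_CSNAMES_YY[] = { "KOI8-R" };
static const ToyLanguageInfo TOY_LANGUAGE_YY = {
    "yy",
    sizeof TOY_CSNAMES_YY / sizeof TOY_CSNAMES_YY[0],
    TOY_CSNAMES_YY
};

/* The list of known languages: adding a language means adding one
 * pointer here. */
static const ToyLanguageInfo *const TOY_LANGUAGE_LIST[] = {
    &TOY_LANGUAGE_XX,
    &TOY_LANGUAGE_YY,
};

/* Find a language by locale code; NULL if it is not in the list. */
static const ToyLanguageInfo *toy_find_language(const char *code)
{
    size_t i;

    assert(code != NULL);
    for (i = 0; i < sizeof TOY_LANGUAGE_LIST / sizeof TOY_LANGUAGE_LIST[0]; i++) {
        if (strcmp(TOY_LANGUAGE_LIST[i]->name, code) == 0)
            return TOY_LANGUAGE_LIST[i];
    }
    return NULL;
}
```

A language that is defined but never added to the list cannot be found by
code, which is why the list entry is a separate step.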


4. Automake, autoconf, libtool, ... note
****************************************

If you run ./autogen.sh and it finishes OK, you are lucky and can expect
things to work.

You have to pass --enable-maintainer-mode to ./configure (or ./autogen.sh) to
build dists and/or the strange stuff in tools/, data/, tests/, and
devel-docs/.