diff options
Diffstat (limited to 'misc/stardict-tools/README.StarDict')
-rw-r--r-- | misc/stardict-tools/README.StarDict | 380 |
1 files changed, 380 insertions, 0 deletions
diff --git a/misc/stardict-tools/README.StarDict b/misc/stardict-tools/README.StarDict new file mode 100644 index 000000000000..352bf18dd0b8 --- /dev/null +++ b/misc/stardict-tools/README.StarDict @@ -0,0 +1,380 @@ + +Format for StarDict dictionary files +------------------------------------ + +StarDict homepage: http://stardict.sourceforge.net + +{0}. Number and Byte-order Conventions +When you record the numbers that identify sizes, offsets, etc., you +should use 32-bit numbers, such as you might represent with a glong. + +In order to make StarDict work on different platforms, these numbers +must be in network byte order. You can ensure the correct byte order +by using the g_htonl() function when creating dictionary files. +Conversely, you should use g_ntohl() when reading dictionary files. + +Strings should be encoded in UTF-8. + + + +{1}. Files + +Every dictionary consists of three files: + +(1). somedict.ifo +(2). somedict.idx or somedict.idx.gz +(3). somedict.dict or somedict.dict.dz + +You can use gzip -9 to compress the .idx file. If the .idx file are not +compressed, the loading can be fast and save memory when using, compress it +will make the .idx file load into memory and make the quering fast when using. + +You can use dictzip to compress the .dict file. +"dictzip" uses the same compression algorithm and file format as does gzip, +but provides a table that can be used to randomly access compressed blocks +in the file. The use of 50-64kB blocks for compression typically degrades +compression by less than 10%, while maintaining acceptable random access +capabilities for all data in the file. As an added benefit, files +compressed with dictzip can be decompressed with gunzip. +For more information about dictzip, refer to DICT project, please see: +http://www.dict.org + +Stardict will search for the .ifo file, then open the .idx or +.idx.gz file and the .dict.dz or .dict file which is in the same directory and +has the same base name. + + + +{2}. The ".ifo" file's format. + +The .ifo file has the following format: + +StarDict's dict ifo file +version=2.4.2 +[options] + +Note that the current "version" string must be "2.4.2". If it's not, +then StarDict will refuse to read the file. + +[options] +--------- + +In the example above, [options] expands to any of the following lines +specifying information about the dictionary. Each option is a keyword +followed by an equal sign, then the value of that option, then a +newline. The options may be appear in any order. + +Note that the dictionary must have at least a bookname, a wordcount and a +idxfilesize, or the load will fail. All other information is optional. All +strings should be encoded in UTF-8. + +Available options: + +bookname= // required +wordcount= // required +idxfilesize= // required +author= +email= +website= +description= +date= +sametypesequence= // very important. + + +wordcount is the count of word entries in .idx file, it must be right. + +idxfilesize is the size(in bytes) of the .idx file, even the .idx is compressed +to a .idx.gz file, this entry must record the original .idx file's size, and it +must be right too. The .gz file don't contain its original size information, +but knowing the original size can speed up the extraction to memory, as you +don't need to call realloc() for many times. + + +The "sametypesequence" option is described in further detail below. + +*** + +sametypesequence + +You should first familiarize yourself with the .dict file format +described in the next section so that you can understand what effect +this option has on the .dict file. + +If the sametypesequence option is set, it tells StarDict that each +word's data in the .dict file will have the same sequence of datatypes. +In this case, we expect a .dict file that's been optimized in two +ways: the type identifiers should be omitted, and the size marker for +the last data entry of each word should be omitted. + +Let's consider some concrete examples of the sametypesequence option. + +Suppose that a dictionary records many .wav files, and so sets: +sametypesequence=W +In this case, each word's entry in the .dict file consists solely of a +wav file. In the .dict file, you would leave out the 'W' character +before each entry, and you would also omit the 32-bit integer at the +front of each .wav entry that would normally give the entry's length. +You can do this since the length is known from the information in the +idx file. + +As another example, suppose a dictionary contains phonetic information +and a meaning for each word. The sametypesequence option for this +dictionary would be: +sametypesequence=tm +Once again, you can omit the 't' and 'm' characters before each data +entry in the .dict file. In addition, you should omit the terminating +'\0' for the 'm' entry for each word in the .dict file, as the length +of the meaning string can be inferred from the length of the phonetic +string (still indicated by a terminating '\0') and the length of the +entire word entry (listed in the .idx file). + +So for cases where the last data entry for each word normally requires +a terminating '\0' character, you should omit this character in the +dict file. And for cases where the last data entry for each word +normally requires an initial 32-bit number giving the length of the +field (such as WAV and PNG entries), you must omit this number in the +dictionary. + +Every dictionary should try to use the sametypesequence feature to +save disk space. + +*** + + + +{3}. The ".idx" file's format. + +The .idx file is just a word list. + +The word list is a sorted list of word entries. + +Each entry in the word list contains three fields, one after the other: + +word_str; // a utf-8 string terminated by '\0'. +word_data_offset; // word data's offset in .dict file +word_data_size; // word data's total size in .dict file + +word_str gives the string representing this word. It's the string +that is "looked up" by the StarDict. + +word_data_offset and word_data_size should both be 32-bit numbers in +network byte order. + +No two entries should have the same "word_str". In other words, +(strcmp(s1, s2) != 0). + +The length of "word_str" should be less than 256. In other words, +(strlen(word) < 256). + +The word list must be sorted by calling stardict_strcmp() on the "word_str" +fields. If the word list order is wrong, StarDict will fail to function +correctly! + +============ +gint stardict_strcmp(const gchar *s1, const gchar *s2) +{ +gint a; +a = g_ascii_strcasecmp(s1, s2); +if (a == 0) +return strcmp(s1, s2); +else +return a; +} +============ + +g_ascii_strcasecmp() is a glib function: + +Unlike the BSD strcasecmp() function, this only recognizes standard +ASCII letters and ignores the locale, treating all non-ASCII characters +as if they are not letters. + +stardict_strcmp() works fine with English characters, but the other +locale characters' sorting is not so good. There should be a _strcmp +function which handles the utf-8 string sorting better. If you know +one, email me :) + +g_utf8_collate()? This is a locale-dependent funcition. So if you look +up Chinese characters while in the Chinese locale, it works fine. But +if you are in some other locale then the lookup will fail, as the +order is not the same as in the Chinese locale (which was used when +creating the dictionary). + +g_utf8_to_ucs4() then do comparing? This sounds like a good solution, but.. + +The complete solution can be found in "Unicode Technical Standard #10: Unicode +Collation Algorithm", http://www.unicode.org/reports/tr10/ + +I hope glib will provide a locale-independent g_utf8_collate() soon. +http://bugzilla.gnome.org/show_bug.cgi?id=112798 + + + +{4}. The ".dict" file's format. + +The .dict file is a pure data sequence, as the offset and size of each +word is recorded in the corresponding .idx file. + +If the "sametypesequence" option is not used in the .ifo file, then +the .dict file has fields in the following order: + +============== +word_1_data_1_type; // a single char identifying the data type +word_1_data_1_data; // the data +word_1_data_2_type; +word_1_data_2_data; +...... // the number of data entries for each word is determined by +// word_data_size in .idx file +word_2_data_1_type; +word_2_data_1_data; +...... +============== + +It's important to note that each field in each word indicates its +own length, as described below. The number of possible fields per +word is also not fixed, and is determined by simply reading data until +you've read word_data_size bytes for that word. + + +Suppose the "sametypesequence" option is used in the .idx file, and +the option is set like this: + +sametypesequence=tm + +Then the .dict file will look like this: + +============== +word_1_data_1_data +word_1_data_2_data +word_2_data_1_data +word_2_data_2_data +...... +============== + +The first data entry for each word will have a terminating '\0', but +the second entry will not have a terminating '\0'. The omissions of +the type chars and of the last field's size information are the +optimizations required by the "sametypesequence" option described +above. + + +Type identifiers +---------------- + +Here are the single-character type identifiers that may be used with +the "sametypesequence" option in the .idx file, or may appear in the +dict file itself if the "sametypesequence" option is not used. + +Lower-case characters signify that a field's size is determined by a +terminating '\0', while upper-case characters indicate that the data +begins with a 32-bit integer that gives the length of the data field. + +'m' +Word's pure text meaning. +The data should be a utf-8 string ending with '\0'. + +'l' +Word's pure text meaning. +The data is NOT a utf-8 string, but is instead a string in locale +encoding, ending with '\0'. Sometimes using this type will save disk +space, but its use is discouraged. + +'g' +A utf-8 string which is marked up with the Pango text markup language. +For more information about this markup language, See the "Pango +Reference Manual." +You might have it installed locally at: +file:///usr/share/gtk-doc/html/pango/PangoMarkupFormat.html + +'t' +English phonetic string. +The data should be a utf-8 string ending with '\0'. + +Here are some utf-8 phonetic characters: +θʃŋʧðʒæıʌʊɒɛəɑɜɔˌˈːˑ +æɑɒʌәєŋvθðʃʒːɡˏˊˋ + +'y' +Chinese YinBiao. +The data should be a utf-8 string ending with '\0'. + + +'W' +wav file. +The data begins with a network byte-ordered glong to identify the wav +file's size, immediately followed by the file's content. + +'P' +png file. +The data begins with a network byte-ordered glong to identify the png +file's size, immediately followed by the file's content. + +'X' +this type identifier is reserved for experimental extensions. + + + +{5}. Tree Dictionary + +The tree dictionary support is used for information viewing, etc. + +A tree dictionary contains three file: sometreedict.ifo, sometreedict.tdx.gz +and sometreedict.dict.dz. + +It is better to compress the .tdx file, as it is always load into memory. + +The .ifo file has the following format: + +StarDict's treedict ifo file +version=2.4.2 +[options] + +Available options: + +bookname= // required +tdxfilesize= // required +wordcount= +author= +email= +website= +description= +date= +sametypesequence= + +wordcount is only used for info view in the dict manage dialog, so it is not +important in tree dictionary. + +The .tdx file is just the word list. + +----------- + +The word list is a tree list of word entries. + +Each entry in the word list contains four fields, one after the other: +word_str; // a utf-8 string terminated by '\0'. +word_data_offset; // word data's offset in .dict file +word_data_size; // word data's total size in .dict file. it can be 0. +word_subentry_count; //have many sub word this entry has, 0 means none. + +Subentry is immidiately followed by its parent entry. This make the order is +just as when a tree list with all its nodes extended, then sort from top to +bottom. + +The .dict file's format is the same as the normal dictionary. + + + +{6}. More information. + +You can read "src/lib.cpp", "src/dictmanagedlg.cpp" and +"src/tools/*.cpp" for more information. + +If you have any questions, email me. :) + +Thanks to Will Robinson <wsr23@stanford.edu> for cleaning up this file's +English. + +Hu Zheng <huzheng_001@163.com> +http://forlinux.yeah.net + +2003.11.11 + |