Every human-readable text file is stored as a sequence of “bytes” or “octets”. How those bytes are interpreted can vary widely, depending on the system that displays them and the language the text is written in.
Since a “byte” is usually the smallest common unit of data on computers, the traditional approach is to represent each “character” you see on the screen by exactly one byte. A byte can hold a value from 0 to 255, so a file stored this way can contain at most 256 different characters.
Usually these include the “ASCII” characters, e.g. a through z, A through Z and 0 through 9, plus additional language-dependent characters: for example, the German umlauts ä or ü, or the many characters used in Russian, Greek, Hebrew or even Chinese. All of this adds up to far more than 256 different symbols.
At this point it should be obvious that, with one byte per character, you have to take care to select the correct “charset”, which many computer systems need in order to display the bytes as they were intended.
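To illustrate why the charset matters, here is a minimal Python sketch that interprets one and the same byte under three different 8-bit charsets. The byte value 0xE4 and the three charsets are arbitrary choices for this illustration and have nothing to do with how KVocTrain itself reads files.

```python
# One byte value, interpreted under three different 8-bit charsets.
# The byte and the charsets are picked only for illustration.
raw = bytes([0xE4])


def decode_with(data: bytes, charsets) -> dict:
    """Return a mapping of charset name to the text it yields for `data`."""
    return {charset: data.decode(charset) for charset in charsets}


results = decode_with(raw, ("latin-1", "iso8859-7", "koi8-r"))

for charset, text in results.items():
    # The stored byte never changes; only its interpretation does.
    print(f"{charset:10} byte 0x{raw[0]:02X} is shown as {text}")
```

With a Western European charset the byte is shown as the German ä, with a Greek charset as δ, and with a Russian charset as Д.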
A better approach is to combine two or more bytes into a representation for a single character on the screen. This is, for example, what “Unicode” does. Unicode is a standard that assigns numbers from 0 to 65535 and beyond (currently up to 1,114,111) to designate symbols. Almost every symbol of almost any language on earth (and a lot more, e.g. well-known icons) is assigned a unique and unambiguous number.
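The idea of one unambiguous number per symbol can be tried out in Python, whose built-in chr() and ord() convert between Unicode numbers and symbols. The numbers below are sample values chosen for this sketch, and describe() is a hypothetical helper name.

```python
# Sample Unicode numbers, chosen only for illustration:
# a Latin letter, a German umlaut, a Greek letter, a Chinese
# character, and a symbol whose number lies above 65535.
SAMPLES = (97, 228, 937, 20013, 128512)


def describe(number: int) -> str:
    """Format a Unicode number in the usual U+XXXX notation with its symbol."""
    return f"U+{number:04X} {chr(number)}"


for number in SAMPLES:
    print(f"{number:>7}  {describe(number)}")

# The mapping works in both directions.
assert ord(chr(228)) == 228
```

The last sample shows that the numbering does not stop at 65535.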
Unfortunately, handling Unicode is a bit more complicated and does not work with every tool. Text stored in Unicode may also take up more space. A compromise is UTF-8, which uses a single byte (with 7 of its 8 bits carrying the value) for the most common characters from the ASCII set and switches to 2, 3 or 4 bytes where needed.
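The variable length of UTF-8 can be observed with a short Python sketch that encodes a few sample symbols and counts the resulting bytes. The symbols are given by their Unicode numbers and are chosen only for illustration; utf8_length() is a hypothetical helper name.

```python
def utf8_length(symbol: str) -> int:
    """Return how many bytes UTF-8 needs to store `symbol`."""
    return len(symbol.encode("utf-8"))


# Sample symbols: an ASCII letter, a German umlaut, a Chinese
# character, and a symbol whose number lies above 65535.
for number in (0x61, 0xE4, 0x4E2D, 0x1F600):
    symbol = chr(number)
    encoded = symbol.encode("utf-8")
    print(f"{symbol}  {utf8_length(symbol)} byte(s): {encoded.hex()}")
```

Plain ASCII text therefore stays exactly as long as before, while text with many non-ASCII characters grows.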
Every vocabulary file for KVocTrain is basically such a simple text file using Unicode.
To support as many languages as possible, KVocTrain version 0.7 offered the possibility to choose a special charset for each language. If you saved your files in the former “8Bit-Mode”, you might see the wrong characters when you load them with version 0.8 or higher. Contact me in this case.
If you want to learn more about this issue you should visit the following links: