Heiner Eichmann's GEDCOM 5.5 Sample Page: Different Character Sets

Note: All Unicode files on this page were updated on 13 May 1999!

If you are familiar with ANSEL, please visit my open questions page!

Characters (letters) with a value between 20 and 7E (hexadecimal) are no problem in GEDCOM files. They are defined in the ASCII code and are available worldwide. All other characters, which are not defined in ASCII, are a big problem. Windows (and DOS) define code pages, each of which contains the important special characters of one part of the world (example: code page 1252 in Windows and code page 850 in DOS define most of the characters of Western Europe). The disadvantage of using code pages is obvious: the meaning of a byte depends on the operating system and on the country where it is read. Example: character D0 (hex) is a capital D with a horizontal line (Eth) on code page 1252 (Windows, Western Europe), but it is a small eth on code page 850 (DOS, Western Europe), a capital Pi on code page 1253 (Windows, Greek), a capital Russian R on code page 1251 (Windows, Cyrillic) and so on. Code pages are therefore NOT useful for the transmission of data.
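The D0 example can be reproduced with a few lines of Python. This is only a small sketch: it assumes the codec names cp1252, cp850, cp1253 and cp1251 from the Python standard library, and the helper name decode_under_code_pages is made up for this illustration.

```python
import unicodedata

# Code pages named in the text, as spelled by Python's standard codecs.
CODE_PAGES = ["cp1252", "cp850", "cp1253", "cp1251"]


def decode_under_code_pages(raw: bytes, code_pages=CODE_PAGES) -> dict:
    """Decode the same bytes under several code pages (hypothetical helper)."""
    return {name: raw.decode(name) for name in code_pages}


if __name__ == "__main__":
    # One and the same byte, D0 (hex), read on four different systems.
    for name, char in decode_under_code_pages(b"\xd0").items():
        print(name, char, unicodedata.name(char))
```

The same byte comes out as four different letters, which is exactly why a file written with a code page cannot be exchanged safely.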

The GEDCOM standard allows only three character sets for the transmission of genealogical data: ASCII, ANSEL and UNICODE. ASCII is very easy to implement but contains no special characters. UNICODE is clearly the future, but right now it is not widely supported. It uses two bytes for each character, allowing 65536 different values. Therefore UNICODE contains most of the characters of most of the languages worldwide.

ANSEL (ANSI Z39.47-1985) is somewhere in between. All standard ASCII characters have the same value in ASCII and ANSEL. Therefore ANSEL and ASCII transmissions cannot be distinguished if only ASCII characters are used. Non-ASCII characters can be one or two bytes long, which makes ANSEL decoders complicated and unpopular. Nevertheless, in true GEDCOM transmissions only one of these three character sets is allowed.
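The following Python sketch shows why the variable length makes decoding awkward. It rests on two assumptions that are not taken from this page: that an ANSEL non-spacing character is stored BEFORE the letter it modifies (Unicode stores combining marks AFTER the letter), and that the five byte values in the table are correct; they are taken from commonly published ANSEL tables, not from the original specification. The table is a tiny subset and the function name is made up for this illustration.

```python
import unicodedata

# ASSUMPTION: small illustrative subset of ANSEL non-spacing characters,
# mapped to Unicode combining marks. Not a complete or verified table.
ANSEL_NON_SPACING = {
    0xE1: "\u0300",  # grave accent
    0xE2: "\u0301",  # acute accent
    0xE3: "\u0302",  # circumflex
    0xE4: "\u0303",  # tilde
    0xE8: "\u0308",  # umlaut / diaeresis
}


def decode_ansel_subset(data: bytes) -> str:
    """Decode ASCII plus the non-spacing characters in the table above."""
    out = []
    pending = []  # non-spacing marks waiting for their base letter
    for byte in data:
        if byte in ANSEL_NON_SPACING:
            pending.append(ANSEL_NON_SPACING[byte])
        elif byte < 0x80:
            # ASCII has the same value in ANSEL: emit the letter first,
            # then the marks that preceded it in the byte stream.
            out.append(chr(byte))
            out.extend(pending)
            pending.clear()
        else:
            raise ValueError(f"byte {byte:#04x} is not in this sketch's table")
    if pending:
        raise ValueError("non-spacing character without a following letter")
    # Compose letter + mark into one code point where Unicode has one.
    return unicodedata.normalize("NFC", "".join(out))
```

Under these assumptions the bytes 4D E8 75 6C 6C 65 72 (hex) decode to "Müller". A combination for which Unicode has no precomposed code point stays as a letter followed by a combining mark.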

This page contains two GEDCOM files: one in ANSEL and one in UNICODE. The structure is the same in both: a family consisting of two parents and several children. Only the NAME, BIRT.PLAC and DEAT.PLAC tags are used. Every GEDCOM reader should be able to read them. The ANSEL version contains all special and all non-spacing characters defined in ANSEL. Every combination of a non-spacing character and a Latin letter is included, regardless of whether such a character actually exists in any language. The UNICODE version contains only those character combinations which exist as UNICODE code points and which can (in principle) be converted to ANSEL. Furthermore, the UNICODE file contains some Greek and Cyrillic letters.

The ANSEL version was created using my ANSEL to Unicode conversion table. I did not find the true ANSEL specification on the web, so the encoding scheme in this file may be wrong!

ANSEL encoded GEDCOM file

The UNICODE file was created using the UNICODE homepage: www.unicode.org. It exists in several versions, because:
  • the byte order can be Lo-Hi or Hi-Lo,
  • a special byte order mark (BOM) at the beginning can indicate the byte order,
  • the line terminator can be CR, LF, CR+LF or LF+CR

    Byte order: Lo-Hi, no BOM, Line terminator: CR+LF
    Byte order: Hi-Lo, no BOM, Line terminator: CR+LF
    Byte order: Lo-Hi, BOM, Line terminator: CR+LF
    Byte order: Hi-Lo, BOM, Line terminator: CR+LF
    Byte order: Lo-Hi, no BOM, Line terminator: LF
    Byte order: Lo-Hi, no BOM, Line terminator: CR
    Byte order: Lo-Hi, no BOM, Line terminator: LF+CR
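
A reader that wants to accept all of the variants listed above has to find out the byte order and the line terminator by itself. The Python sketch below is one possible way to do this, not a definitive implementation. It assumes that a GEDCOM file always starts with the line "0 HEAD", so that the first character "0" reveals the byte order when there is no BOM. The function name read_unicode_gedcom is made up for this illustration.

```python
import codecs
import re

# Any of the four line terminators; two-character forms are tried first.
LINE_TERMINATOR = re.compile(r"\r\n|\n\r|\r|\n")


def read_unicode_gedcom(data: bytes) -> list:
    """Return the lines of a two-byte UNICODE GEDCOM file (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF16_LE):    # FF FE: Lo-Hi with BOM
        text = data[2:].decode("utf-16-le")
    elif data.startswith(codecs.BOM_UTF16_BE):  # FE FF: Hi-Lo with BOM
        text = data[2:].decode("utf-16-be")
    elif data[:2] == b"0\x00":                  # "0" stored Lo-Hi, no BOM
        text = data.decode("utf-16-le")
    elif data[:2] == b"\x000":                  # "0" stored Hi-Lo, no BOM
        text = data.decode("utf-16-be")
    else:
        raise ValueError("cannot determine the byte order")
    # Split on every terminator variant and drop empty lines.
    return [line for line in LINE_TERMINATOR.split(text) if line]
```

Usage: read the file in binary mode, for example with open(path, "rb").read(), and pass the bytes to read_unicode_gedcom. All seven files above should then yield the same list of lines.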

    Last modification: 1999-05-13