This project can currently dump and (partially) normalize white pages from Deutsche Telekom's CDs and DVDs.

The on-disc data currently comes in four flavours (see https://erdgeist.org/posts/2008/datenmessie.html):

version 1) Teleauskunft 1188 from 1992 (April-June)
version 2) Teleauskunft 1188 Telefon-Teilnehmer, Oktober 1995 / Telefon-Teilnehmer Gesamtausgabe from 1995/1996
version 3) Telefonbuch für Deutschland, Version 1.0 1996 through DasTelefonbuch, Deutschland, Herbst 2003
version 4) DasTelefonbuch, Map&Route, Frühjahr 2004 until now

version 1
=========

Notes: Strings are encoded in cp437; those inside records are stored in a 7-bit packed encoding. Only the .001 files on each CD are interesting.

Each file consists of a standard header and a number of pages. The first page starts at 0x800 and subsequent pages follow in 0x2000 steps.
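
The page layout above can be sketched as follows. This is only an illustration: the names FIRST_PAGE_OFFSET, PAGE_STRIDE, page_offset and iter_pages are made up for this example and do not come from the project's sources, and the sketch assumes that a page never extends past the start of the next one.

```python
FIRST_PAGE_OFFSET = 0x800  # where page 0 starts, per the description above
PAGE_STRIDE = 0x2000       # distance between the starts of consecutive pages


def page_offset(index):
    """Return the file offset at which page number `index` (0-based) starts."""
    if index < 0:
        raise ValueError("page index must be non-negative")
    return FIRST_PAGE_OFFSET + index * PAGE_STRIDE


def iter_pages(data, page_count):
    """Yield each page of a .001 file as a bytes slice.

    Assumption: a page is at most PAGE_STRIDE bytes long.
    """
    for index in range(page_count):
        start = page_offset(index)
        yield data[start:start + PAGE_STRIDE]
```

The page count passed to iter_pages would come from the file header.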

The header's important values are the number of pages at (uint16_t*)0x40, the total number of records in the file at (uint32_t*)0x42 and a \0-separated list of gasse, city, zip and prefix, starting at 0xe8.
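
A minimal sketch of reading these header values, under two assumptions that are not stated above: that the integers are little-endian (plausible for DOS-era data, but unverified) and that the list at 0xe8 consists of exactly four \0-terminated strings. The function name parse_file_header and the dictionary keys are hypothetical.

```python
import struct


def parse_file_header(data):
    """Extract the documented header fields from the start of a .001 file."""
    # Assumption: little-endian integers ("<" in the struct format).
    (page_count,) = struct.unpack_from("<H", data, 0x40)
    (record_count,) = struct.unpack_from("<I", data, 0x42)

    # Assumption: four NUL-terminated cp437 strings follow at 0xe8.
    strings = []
    pos = 0xE8
    for _ in range(4):
        end = data.index(b"\x00", pos)  # raises ValueError if no NUL is found
        strings.append(data[pos:end].decode("cp437"))
        pos = end + 1

    return {
        "page_count": page_count,
        "record_count": record_count,
        "strings": strings,
    }
```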

Each page is either a "normal" page holding phone entries or a "blob" page holding multi-line records that are referenced from "normal" pages inside the same file. A page starts with a flag at (uint8_t*)0x00, the size of the blob's contents at (uint16_t*)0x02 (if != 0, this is a blob page), the count of records in that page at (uint16_t*)0x04 and, for each record, an offset into this page's records (relative to where they start), at 0x0e. Should such an offset be >0x1fff, it refers to a "blob" page whose record should be substituted here.
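
The page header can be read along these lines. The sketch rests on assumptions that go beyond the description: integers are taken to be little-endian, and the per-record offsets are taken to be an array of uint16_t values beginning at 0x0e (one possible reading of the text). All names here (parse_page_header, is_blob_reference) are hypothetical.

```python
import struct

MAX_IN_PAGE_OFFSET = 0x1FFF  # larger offsets refer to a "blob" page


def parse_page_header(page):
    """Read flag, blob size, record count and record offsets of one page."""
    flag = page[0x00]
    # Assumption: little-endian uint16_t fields.
    (blob_size,) = struct.unpack_from("<H", page, 0x02)
    (record_count,) = struct.unpack_from("<H", page, 0x04)
    # Assumption: the offsets form a uint16_t array starting at 0x0e.
    offsets = list(struct.unpack_from("<%dH" % record_count, page, 0x0E))
    return {
        "flag": flag,
        "is_blob_page": blob_size != 0,
        "blob_size": blob_size,
        "record_count": record_count,
        "offsets": offsets,
    }


def is_blob_reference(offset):
    """True if a record offset points at a "blob" page rather than this page."""
    return offset > MAX_IN_PAGE_OFFSET
```

How an offset above 0x1fff maps to a specific blob page is not described here, so the sketch only classifies the offsets.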

Each record starts with an entry count at (uint16_t*)0x00. If a record consists of multiple entries, think of the later lines as continuations of the lines above them; usually they describe multiple extensions of a larger subscriber. If the page's flag is zero, each record also has a "prefix" offset at (uint16_t*)0x02. If that offset is non-zero, the first entry is prepended with a shared prefix (think of multiple instances of the same or similar family names, which are "compressed" by this hack). The prefix is unpacked the same way as records are. "Blob" pages are not compressed this way.

The entries are then packed into a 7-bit stream, with the value 0000001b separating the columns: "Nachname", "Vorname", "Adresszusatz", "Ortszusatz", "Zustellamt oder PLZ Ost", "Strassenname", "Hausnummer", "Namenszusatz", "Verweise", "Vorwahl", "Rufnummer". An entry ends with the value 0000000b.
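
Unpacking one entry might look like the sketch below. The bit order is not specified above, so the code assumes that the 7-bit values are packed most-significant-bit first and that an entry begins on a byte boundary; both are guesses that would need checking against real data. The 7-bit values are decoded as cp437, which coincides with ASCII in this range; how umlauts are represented is not covered. The function names are hypothetical.

```python
COLUMNS = [
    "Nachname", "Vorname", "Adresszusatz", "Ortszusatz",
    "Zustellamt oder PLZ Ost", "Strassenname", "Hausnummer",
    "Namenszusatz", "Verweise", "Vorwahl", "Rufnummer",
]


def unpack_7bit(data):
    """Split a byte string into consecutive 7-bit values (assumed MSB first)."""
    values = []
    acc = 0    # bit accumulator
    nbits = 0  # number of not yet consumed bits in acc
    for byte in data:
        acc = (acc << 8) | byte
        nbits += 8
        while nbits >= 7:
            nbits -= 7
            values.append((acc >> nbits) & 0x7F)
        acc &= (1 << nbits) - 1  # keep only the leftover bits
    return values


def decode_entry(data):
    """Decode one packed entry into a dict keyed by column name."""
    columns = [[]]
    for value in unpack_7bit(data):
        if value == 0:    # 0000000b ends the entry
            break
        if value == 1:    # 0000001b starts the next column
            columns.append([])
        else:
            columns[-1].append(value)
    texts = [bytes(col).decode("cp437") for col in columns]
    # Trailing columns that are absent from the stream are simply omitted.
    return dict(zip(COLUMNS, texts))
```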

version 2
=========

TBD.