1 files changed, 123 insertions, 0 deletions
diff --git a/docs/cronos-research.md b/docs/cronos-research.md
new file mode 100644
index 0000000..517137c
--- /dev/null
+++ b/docs/cronos-research.md
@@ -0,0 +1,123 @@
+# About Cronos databases.
+A _cronos database_ consists of the following files:
+    CroBank.dat
+    CroBank.tad
+    CroIndex.dat
+    CroIndex.tad
+    CroStru.dat
+    CroStru.tad
+and a vocabulary database with another set of these files in the subdirectory `Voc/`.
+`CroIndex.*` can be ignored unless we suspect it contains residues of old data. All words are serialized in little-endian byte order.
+On a default Windows installation, the CronosPro app shows several encoding issues, which can be fixed like this:
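+    reg add "HKLM\SYSTEM\CurrentControlSet\Control\Nls\CodePage" /v 1250 /t REG_SZ /d c_1251.nls /f
+    reg add "HKLM\SYSTEM\CurrentControlSet\Control\Nls\CodePage" /v 1252 /t REG_SZ /d c_1251.nls /f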
+[from](https://ixnfo.com/en/question-marks-instead-of-russian-letters-a-solution-to-the-problem-with-windows-encoding.html)
+## Files ending in .dat
+All .dat files start with the string `"CroFile\0"`, followed by 8 more header bytes.
+`CroStru.dat` has
+    xx yy 30 31 2e 30 32 01 == ? ? 0 1 . 0 2 ?
+CroBank.dat and CroIndex.dat have (as found in the big dump)
+    xx yy 30 31 2e 30 32 0[0123] == ? ? 0 1 . 0 2 ?
+    xx yy 30 31 2e 30 33 0[023]  == ? ? 0 1 . 0 3 ?
+    xx yy 30 31 2e 30 34 03      == ? ? 0 1 . 0 4 ?
+which seems to be the version identifier. The meaning of the `xx yy` part is unclear, but it does not seem to be random; it might be a checksum.
+In `CroBank.dat`, the most frequent `xx yy` values are c8 05 (313 times), b8 00 (196 times), 4e 13 (116 times), 00 00 (95 times) and 98 00 (81 times) out of 1964 databases.
+In `CroStru.dat`, the most frequent values are c8 05 (351 times), b8 00 (224 times), 4e 13 (119 times), 00 00 (103 times) and 98 00 (83 times) out of 1964 databases.
+In `CroIndex.dat`, the most frequent values are c8 05 (312 times), b8 00 (194 times), 4e 13 (107 times), 00 00 (107 times) and 98 00 (82 times) out of 1964 databases.
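The header layout described in this section can be checked with a few lines of Python. This is a minimal sketch, not the original tooling: the names `parse_dat_header` and `DatHeader` are made up for illustration, and reading `xx yy` as one little-endian `uint16` is an assumption, since its meaning is unknown.

```python
import struct
from typing import NamedTuple

MAGIC = b"CroFile\x00"


class DatHeader(NamedTuple):
    unknown: int   # the "xx yy" bytes, read as a little-endian uint16 (assumption)
    version: str   # ASCII version string, e.g. "01.02"
    trailer: int   # last header byte, 0x00..0x03 in the samples above


def parse_dat_header(data: bytes) -> DatHeader:
    """Parse the 16-byte header found at the start of a .dat file."""
    if len(data) < 16:
        raise ValueError("need at least 16 bytes")
    if data[:8] != MAGIC:
        raise ValueError("missing CroFile magic")
    # "<" = little-endian without padding: uint16, 5 raw bytes, uint8
    unknown, version, trailer = struct.unpack_from("<H5sB", data, 8)
    return DatHeader(unknown, version.decode("ascii"), trailer)
```

Feeding it the first 16 bytes of each `.dat` file and counting the `unknown` field would be one way to reproduce the frequency counts listed above.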
+## Files ending in .tad
+The first two `uint32_t` seem to be the amount and the offset to the first free block.
+The original description made it look like there were different formats for the block references, but all entries in the `.tad` files appear to follow the scheme:
+    uint32_t offset
+    uint32_t size       // with flag in upper bit, 0 -> large record
+    uint32_t checksum   // but sometimes just 0x00000000, 0x00000001 or 0x00000002
+where size can be 0xffffffff (probably to indicate a free/deleted block). Some size entries have their top bits set. In some files the offset looks garbled, but the top bit of the size is then usually set.
+Large records start with plaintext: { uint32 offset, uint32 size? }
+followed by data obfuscated with 'shift==0'.
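The 12-byte entry scheme above can be read as follows. This is a sketch under stated assumptions: the names `parse_tad` and `TadEntry` are hypothetical, entries are assumed to start directly after the two header words, only the single top bit of the size is treated as a flag, and entries are numbered from 0.

```python
import struct
from typing import List, NamedTuple, Tuple

HEADER = struct.Struct("<II")    # two little-endian uint32 header words
ENTRY = struct.Struct("<III")    # offset, size, checksum


class TadEntry(NamedTuple):
    index: int       # position in the file, counted from 0 (assumption)
    offset: int
    size: int        # raw size with the top bit masked off
    flag: bool       # top bit of the raw size field
    checksum: int
    free: bool       # raw size was 0xffffffff


def parse_tad(data: bytes) -> Tuple[Tuple[int, int], List[TadEntry]]:
    """Return the two header words and all complete 12-byte entries."""
    if len(data) < HEADER.size:
        raise ValueError("file too short for a .tad header")
    header = HEADER.unpack_from(data, 0)
    entries: List[TadEntry] = []
    pos = HEADER.size
    while pos + ENTRY.size <= len(data):
        offset, raw_size, checksum = ENTRY.unpack_from(data, pos)
        entries.append(TadEntry(
            index=len(entries),
            offset=offset,
            size=raw_size & 0x7FFFFFFF,
            flag=bool(raw_size & 0x80000000),
            checksum=checksum,
            free=raw_size == 0xFFFFFFFF,
        ))
        pos += ENTRY.size
    return header, entries
```

Keeping `flag` and `free` as separate fields leaves the interpretation of the top bit open, which seems prudent while its meaning is still uncertain.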
+The old description also assumed 12-byte reference blocks, but laid out as a packed struct
+    uint32_t offset1
+    uint16_t size1
+    uint32_t offset2
+    uint16_t size2
+The first chunk is read from offset1 with length size1. Further parts, with a total length of size2, may follow starting at file offset offset2; each of these is a 256-byte chunk whose first `uint32_t` is the next chunk's offset, leaving at most 252 bytes of actual data.
+However, I never found files with a .tad like that. Also, the original description insisted that those chunks need the decode magic outlined below, but the Python implementation only does that for CroStru files and still seems to produce results.
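For reference, the older packed layout could be followed like this. It is only a sketch of the old description, which was never observed in real files; `read_legacy_record` is a hypothetical name, and the assumption is that every chained chunk except the last carries a full 252 bytes of data.

```python
import struct

# Packed, little-endian: uint32 offset1, uint16 size1, uint32 offset2, uint16 size2
LEGACY_REF = struct.Struct("<IHIH")


def read_legacy_record(dat: bytes, ref: bytes) -> bytes:
    """Re-assemble one record from a 12-byte legacy reference block."""
    offset1, size1, offset2, size2 = LEGACY_REF.unpack(ref)
    out = bytearray(dat[offset1:offset1 + size1])

    remaining = size2          # total payload still to read from the chain
    chunk_offset = offset2
    while remaining > 0:
        chunk = dat[chunk_offset:chunk_offset + 256]
        take = min(remaining, 252)
        if len(chunk) < 4 + take:
            raise ValueError("chunk chain runs past the end of the file")
        # The first uint32 of each chunk points at the next chunk.
        (next_offset,) = struct.unpack_from("<I", chunk, 0)
        out += chunk[4:4 + take]
        remaining -= take
        chunk_offset = next_offset
    return bytes(out)
```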
+## CroStru
+The interesting file is CroStru.dat, which contains metadata on the database in blocks whose offset and size are found in CroStru.tad. These blocks are substituted byte-wise through an sbox found in the cro2sql sources, and then each byte is incremented by a one-byte counter that is initialised with a per-block offset. The sbox looks like this:
+    unsigned char kod[256] = {
+      0x08, 0x63, 0x81, 0x38, 0xa3, 0x6b, 0x82, 0xa6,
+      0x18, 0x0d, 0xac, 0xd5, 0xfe, 0xbe, 0x15, 0xf6,
+      0xa5, 0x36, 0x76, 0xe2, 0x2d, 0x41, 0xb5, 0x12,
+      0x4b, 0xd8, 0x3c, 0x56, 0x34, 0x46, 0x4f, 0xa4,
+      0xd0, 0x01, 0x8b, 0x60, 0x0f, 0x70, 0x57, 0x3e,
+      0x06, 0x67, 0x02, 0x7a, 0xf8, 0x8c, 0x80, 0xe8,
+      0xc3, 0xfd, 0x0a, 0x3a, 0xa7, 0x73, 0xb0, 0x4d,
+      0x99, 0xa2, 0xf1, 0xfb, 0x5a, 0xc7, 0xc2, 0x17,
+      0x96, 0x71, 0xba, 0x2a, 0xa9, 0x9a, 0xf3, 0x87,
+      0xea, 0x8e, 0x09, 0x9e, 0xb9, 0x47, 0xd4, 0x97,
+      0xe4, 0xb3, 0xbc, 0x58, 0x53, 0x5f, 0x2e, 0x21,
+      0xd1, 0x1a, 0xee, 0x2c, 0x64, 0x95, 0xf2, 0xb8,
+      0xc6, 0x33, 0x8d, 0x2b, 0x1f, 0xf7, 0x25, 0xad,
+      0xff, 0x7f, 0x39, 0xa8, 0xbf, 0x6a, 0x91, 0x79,
+      0xed, 0x20, 0x7b, 0xa1, 0xbb, 0x45, 0x69, 0xcd,
+      0xdc, 0xe7, 0x31, 0xaa, 0xf0, 0x65, 0xd7, 0xa0,
+      0x32, 0x93, 0xb1, 0x24, 0xd6, 0x5b, 0x9f, 0x27,
+      0x42, 0x85, 0x07, 0x44, 0x3f, 0xb4, 0x11, 0x68,
+      0x5e, 0x49, 0x29, 0x13, 0x94, 0xe6, 0x1b, 0xe1,
+      0x7d, 0xc8, 0x2f, 0xfa, 0x78, 0x1d, 0xe3, 0xde,
+      0x50, 0x4e, 0x89, 0xb6, 0x30, 0x48, 0x0c, 0x10,
+      0x05, 0x43, 0xce, 0xd3, 0x61, 0x51, 0x83, 0xda,
+      0x77, 0x6f, 0x92, 0x9d, 0x74, 0x7c, 0x04, 0x88,
+      0x86, 0x55, 0xca, 0xf4, 0xc1, 0x62, 0x0e, 0x28,
+      0xb7, 0x0b, 0xc0, 0xf5, 0xcf, 0x35, 0xc5, 0x4c,
+      0x16, 0xe0, 0x98, 0x00, 0x9b, 0xd9, 0xae, 0x03,
+      0xaf, 0xec, 0xc9, 0xdb, 0x6d, 0x3b, 0x26, 0x75,
+      0x3d, 0xbd, 0xb2, 0x4a, 0x5d, 0x6c, 0x72, 0x40,
+      0x7e, 0xab, 0x59, 0x52, 0x54, 0x9c, 0xd2, 0xe9,
+      0xef, 0xdd, 0x37, 0x1e, 0x8f, 0xcb, 0x8a, 0x90,
+      0xfc, 0x84, 0xe5, 0xf9, 0x14, 0x19, 0xdf, 0x6e,
+      0x23, 0xc4, 0x66, 0xeb, 0xcc, 0x22, 0x1c, 0x5c };
+The original description of an older database format called the per-block counter start offset 'sistN', which seems to imply that it is constant for certain entries. These offsets correspond to a "system number" of meta entries visible in the database software. Where they come from is currently unknown; the existing code just brute-forces through all offsets and looks for certain sentinels.
+I noticed that the first 256 bytes of CroStru.dat look nearly identical to those of CroBank.dat (except for the first 16 bytes).
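The decoding step and the brute-force search over start offsets might look roughly like this. It is a sketch, not the existing implementation: `decode` and `brute_force_shift` are hypothetical names, the sbox is passed in as `kod` rather than repeated here, and the sentinel is a caller-supplied parameter because the text does not name one. Whether the counter has to be added or subtracted when decoding is also an assumption, so the direction is exposed as `sign`.

```python
from typing import Iterator, Sequence, Tuple


def decode(data: bytes, shift: int, kod: Sequence[int], sign: int = 1) -> bytes:
    """Substitute each byte through kod, then apply a one-byte running counter.

    The counter starts at `shift` and grows by one per byte. sign=+1 adds it
    (as the text reads literally), sign=-1 subtracts it (assumption: one of
    the two directions undoes the obfuscation).
    """
    return bytes((kod[b] + sign * (shift + i)) & 0xFF for i, b in enumerate(data))


def brute_force_shift(
    data: bytes, kod: Sequence[int], sentinel: bytes, sign: int = 1
) -> Iterator[Tuple[int, bytes]]:
    """Try all 256 start offsets and yield those whose output contains sentinel."""
    for shift in range(256):
        plain = decode(data, shift, kod, sign)
        if sentinel in plain:
            yield shift, plain
```

With the table above loaded as a Python sequence of 256 integers, `brute_force_shift(block, kod, sentinel)` would list every candidate start offset for a block. If nothing turns up, trying `sign=-1` would be the obvious next step.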
+## CroBank
+CroBank.dat contains the actual database entries for multiple tables as described in the CroStru file. Each chunk is first re-assembled (and potentially decoded, with the per-block offset being the record number in the .tad file).
+The first byte of a record then defines which table it belongs to. The record is encoded in cp1251 (or possibly IBM866), with the actual column data separated by 0xfe. There is an extra concept of sub-fields in those columns, indicated by a 0xfd byte.
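Splitting a re-assembled record along these lines could be sketched as follows. The name `parse_record` is hypothetical, and treating 0xfd as a plain separator between sub-fields (rather than some other kind of marker) is an assumption. The split is done on raw bytes before any text decoding, so the choice of code page does not affect where fields end.

```python
from typing import List, Tuple

FIELD_SEP = b"\xfe"      # separates columns
SUBFIELD_SEP = b"\xfd"   # separates sub-fields inside a column (assumption)


def parse_record(record: bytes, encoding: str = "cp1251") -> Tuple[int, List[List[str]]]:
    """Return (table id, columns), each column being a list of sub-field strings."""
    if not record:
        raise ValueError("empty record")
    table_id = record[0]
    columns = []
    for field in record[1:].split(FIELD_SEP):
        columns.append([
            sub.decode(encoding, errors="replace")
            for sub in field.split(SUBFIELD_SEP)
        ])
    return table_id, columns
```

Passing `encoding="cp866"` covers the IBM866 possibility mentioned above without changing anything else.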
