I have been working on a data structure that aims to provide a unique index of values with low storage usage and fast unordered writes.
What I have been playing around with (which may already have a proper name, and related terms I could read up on) comes down to two things: a list of characters (starting with a null byte), and a list of lists of the characters that can come after each one. Sort of like a trie.
For example, given an insertion of the string "Hello World":
R is a list of the form [0,H,e,l,o, ,W,r,d], and for each R value there is an associated A list. The index of an A list is the index of its R character, and its values are the R indices of the valid next characters, e.g.
0: 1 // implies H can be the initial character
1: 2
2: 3
3: 3, 4, 8
4: 5, 7
5: 6
6: 4
7: 3
I then return the "path" required to traverse the options, as offsets into each successive A row, e.g.
0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 2
Starting from the null byte (R index 0), each offset selects an entry in the current A row, and that entry is the R index of the next character. The path ends when there are no further choices to take.
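To make the structure concrete, here is a minimal Python sketch of the insert and lookup just described. It is an illustration rather than the actual implementation: the names `PathIndex`, `insert`, `lookup` and the `_pos` helper dictionary are assumptions, and `lookup` simply stops when the path runs out instead of checking for remaining choices.

```python
class PathIndex:
    """Illustrative sketch of the R / A structure (hypothetical names)."""

    def __init__(self):
        self.R = ["\0"]         # unique characters, null byte first
        self.A = [[]]           # A[i] holds the R indices allowed after R[i]
        self._pos = {"\0": 0}   # character -> index in R (lookup helper)

    def insert(self, text):
        """Add text and return its path as a list of offsets into A rows."""
        path = []
        cur = 0                              # start at the null byte
        for ch in text:
            nxt = self._pos.get(ch)
            if nxt is None:                  # first time this character is seen
                nxt = len(self.R)
                self.R.append(ch)
                self.A.append([])
                self._pos[ch] = nxt
            row = self.A[cur]
            if nxt not in row:               # new transition: append to the row
                row.append(nxt)
            path.append(row.index(nxt))      # offset of the choice in this row
            cur = nxt
        return path

    def lookup(self, path):
        """Rebuild the string by following the offsets from the null byte."""
        cur = 0
        chars = []
        for offset in path:
            cur = self.A[cur][offset]
            chars.append(self.R[cur])
        return "".join(chars)
```

Inserting "Hello World" into an empty `PathIndex` yields the R list, the A rows and the path shown above, and `lookup` on that path returns the original string.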
I have an implementation of this which is able to take 500,000 unique (English) strings, around 4MB on disk, and write an index or tree of 35KB which can be used to return all the input strings, given the correct paths or Ids. On disk, the R and A rows have a limited capacity, but the number of A rows can grow with the available space. Given the low space required, it can persist this to disk and keep the index in memory for a fairly huge number of values (granted, given limited entropy). Have I just offloaded storage of a value to the Id via some form of dictionary?
I would like to find an efficient mechanism to compress the "path" into bytes. As a future complement to this, I would also like to index the "path" ids with an incremental index, e.g. value 1 gets id 1, value 2 gets id 2, and so on, where the actual path is not the Id. Doing so, I believe, would allow me to update the tree/dictionary thing (for example, to reorder it based on probability) without changing the user-facing Id values.
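The incremental index could be as simple as a table that hands out sequential ids and remembers which path each one currently points to. The sketch below is only an assumption about how that indirection might look; `IdTable`, `assign`, `path_for` and `rewrite` are hypothetical names.

```python
class IdTable:
    """Maps stable, incremental ids (1, 2, 3, ...) to paths."""

    def __init__(self):
        self._paths = []   # position i holds the path for id i + 1
        self._ids = {}     # path -> id, so a repeated value keeps its id

    def assign(self, path):
        """Return the id for path, allocating the next id if it is new."""
        key = tuple(path)
        if key not in self._ids:
            self._paths.append(key)
            self._ids[key] = len(self._paths)
        return self._ids[key]

    def path_for(self, ident):
        """Return the path currently stored for a user-facing id."""
        return list(self._paths[ident - 1])

    def rewrite(self, translate):
        """Replace every stored path with translate(path), for example
        after the tree has been reordered. The ids do not change."""
        self._paths = [tuple(translate(list(p))) for p in self._paths]
        self._ids = {p: i + 1 for i, p in enumerate(self._paths)}
```

The obvious cost is that this table stores one path per value, so it only makes sense if the paths themselves are encoded compactly.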
This thing has some unusual properties. The insert order matters: if you insert the same two strings in two different orders, you get two different structures. I am looking to use this for limited datasets, for example, the details of things in English, where the complexity of possibilities is limited and operations are write and then read only. The order dependence does not matter for this use, as the insert order is consistent.
How can I best encode the "path"?
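One candidate, sketched here only as a starting point: most offsets are 0 or very small, so a bit-level universal code such as Elias gamma (applied to offset + 1) spends one bit on each 0 offset and three bits on an offset of 1 or 2. Unlike a code whose width depends on the current length of an A row, it stays decodable when rows grow later. The function names `encode_path` and `decode_path` are made up for this example, and the zero padding at the end relies on the fact that a gamma code never consists of zeros alone.

```python
def encode_path(path):
    """Pack non-negative offsets into bytes using Elias gamma codes."""
    bits = []
    for offset in path:
        n = offset + 1                       # gamma codes cover integers >= 1
        width = n.bit_length()
        bits.extend([0] * (width - 1))       # unary prefix: width - 1 zeros
        bits.extend((n >> i) & 1 for i in range(width - 1, -1, -1))
    bits.extend([0] * (-len(bits) % 8))      # zero-pad to a whole byte
    out = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for b in bits[i:i + 8]:
            byte = (byte << 1) | b
        out.append(byte)
    return bytes(out)


def decode_path(data):
    """Inverse of encode_path; trailing zero padding is ignored."""
    bits = [(byte >> i) & 1 for byte in data for i in range(7, -1, -1)]
    path = []
    zeros = 0
    pos = 0
    while pos < len(bits):
        if bits[pos] == 0:                   # still inside the unary prefix
            zeros += 1
            pos += 1
            continue
        end = pos + zeros + 1                # value is zeros + 1 bits wide
        if end > len(bits):
            raise ValueError("truncated gamma code")
        n = 0
        for b in bits[pos:end]:
            n = (n << 1) | b
        path.append(n - 1)
        pos = end
        zeros = 0
    return path
```

For the "Hello World" path above (11 offsets) this produces 17 bits, which pad out to 3 bytes. Whether something like arithmetic coding driven by per-row probabilities would do meaningfully better is exactly what I am unsure about.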