Trying Tries (C# IDictionary Generic Trie)

Tries are a classic implementation of the symbol-table (or associative array) abstraction. They allow for quick retrieval of values based on (typically string) keys and optimize lookup performance. They are a useful way of implementing a dictionary and excel for cases where you would want to find non-exact matches (ie longest prefix match), or all possible suffixes given a starting key (ie a drop-down that shows you all of the different options as you start typing in text).

There are a few good C# Trie implementations available online (see here, and here), but I did not see any that implement the C# Generic Dictionary interface. My goal was to create a Trie implementation that could be used interchangeably with the default C# dictionary, and could also be used with non-string keys.

All of the code is available here. One significant caveat – the overhead I added to make the structure work with the IDictionary interface has a performance impact. In many cases, the default Dictionary implementation outperforms this Trie implementation (more details below). However, even with a relatively inefficient implementation, the Trie outperforms the hashtable-based .NET Dictionary for the longestPrefix method.

How Do Tries Work

Trie Example - Right Oriented
Image 1: Example of a string Trie for ‘A’, ‘AD’, ‘ADD’, ‘ADDRESS’, ‘BAT’

A Trie is a type of tree where every arc represents a character and the path from the root to a value node represents a valid key.

Above is an example of a simple Trie with the nodes marked in green being accepting ‘value nodes’. The Trie data structure maintains a reference to the root node.

Finding a key means traversing down the root node character by character. If we end up with a null reference or a non-value node, the key could not be found.

The Trie structure is constructed whenever adding a key, and when a key is removed the appropriate nodes are removed from the tree (for example, removing ‘ADDRESS’ from this Trie would remove 4 nodes):

Trie Example - post removal
The Trie in Image 1 after removing the ‘ADDRESS’ key

The strength of the Trie data structure is that lookup time would be proportional to the size of the key, rather than to the number of elements in the data structure. For small keys (ie words in the English language) this is O(1) vs. O(logn) for binary search. It puts the Trie on par with a hash table for average time complexity.

Performance vs. .NET Dictionary

The .NET Dictionary implementation is an efficient hash table implementation with O(1) lookups. It has a lot less overhead than this Trie implementation, and would easily outperform it for general CRUD operations. However, for the Longest Prefix operation – the Trie wins every time. Since a scan of the entire dataset is not necessary, the Trie is over 100 times more efficient than the basic Dictionary implementation:

Trie v Dictionary Longest Prefix Performancel

However, for finding all suffixes, both exhibit O(1) and the .NET Dictionary outperforms the Trie. This is likely due to the efficient implementation of string operations in .NET:

Trie v Dictionary Find All Suffixes Performancel

Next Steps

  • This Trie implementation stores the entire string key in each node. This is very space inefficient, but was necessary to fit the Trie into the .NET dictionary interface. I would like to find a way to store each letter only once.
  • The internal tree data structure relies on a Dictionary to connect each node to its children. With more knowledge of the alphabet, this can be optimized.
  • Implementation of ternary search tries
  • Hybrid dictionary that uses both a Trie and a hashtable to optimize for all operations without sacrificing too much space.
  • Explore other, non-string use cases of Tries.
  • Since Tries are a specific type (non-cyclical, tree vs. graph) of a finite state automaton look to see if there are more general solutions that can bridge both problem sets.

 

2 thoughts on “Trying Tries (C# IDictionary Generic Trie)

  1. Great post! I definitely need to try Tries in practice. Actually, it is strange that on the graph comparing the longest prefix performance for Tries and the .NET dictionary, the line for the latter becomes parallel to x axis. It should grow linearly since the number of scans in order to find the longest prefix is proportional to the number of elements in a dictionary, while for Tries it should be constant. Given these points, it is not safe to say that Tries is over 100 more efficient than the .NET dictionary since the difference vary depending on the number of input elements.

    Like

    1. Thats right Kirill, it was over 100 times more efficient in my tests which were for inputs of 1k-7k. For larger inputs it should be even more efficient for the reasons you cited. The slope does decrease for the .net dictionary between 6k and 7k – but i suspect that is because of some of the array resizing the .NET dictionary needs to perform (which means marginal increases in execution time would not be smooth, but would depend on the default array limit sizes). If you look into the code – there is no way this is going to be constant, since we need to scan through the entire collection of keys per your point.

      Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s