Monday, August 2, 2010

Multiple Keyword Search

Some time ago I worked on a huge data mining project. One of the tasks was extracting keyword lists from multiple documents. My initial implementation used multiple IndexOf calls, but my boss demanded near real-time results, and with the amount of data passing through the program, even a multi-core machine wasn't enough.


I started digging and found a little algorithm called Aho-Corasick. It's a nice algorithm that finds all occurrences of multiple keywords in a document in a single pass over the text, instead of scanning the text once per keyword.
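
To give a feel for how Aho-Corasick works, here is a minimal sketch in Python. It is not Tomas Petricek's C# implementation, only an illustration of the general idea: build a trie of the keywords, add failure links with a breadth-first pass, then walk the text once. The names build_automaton and find_keywords are made up for this example.

```python
from collections import deque


def build_automaton(keywords):
    """Build the trie, failure links, and output lists for the keywords."""
    goto = [{}]   # goto[node][char] -> child node
    fail = [0]    # fail[node] -> node to fall back to on a mismatch
    out = [[]]    # out[node] -> keywords that end at this node
    for word in keywords:
        if not word:
            continue  # skip empty keywords in this sketch
        node = 0
        for ch in word:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(word)

    # Breadth-first pass: depth-1 nodes keep fail = 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit keywords that end at the fallback node.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def find_keywords(text, keywords):
    """Return (start_index, keyword) for every match, in one pass over text."""
    goto, fail, out = build_automaton(keywords)
    node = 0
    matches = []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for word in out[node]:
            matches.append((i - len(word) + 1, word))
    return matches


if __name__ == "__main__":
    print(find_keywords("ushers", ["he", "she", "his", "hers"]))
```

In a real program you would build the automaton once and reuse it for every document, since the keyword list usually changes far less often than the text.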


Soon after, I found out that Tomas Petricek had already implemented it in C#, and the performance exceeded our requirements, leaving room for expansion.


So thanks Tomas!



Taken from CodeProject