This blog post describes the transition of technologies from academia to industry, specifically in the semantic search realm. This first post discusses what the differences between academia and industry are and why they exist, while the second post will more fully develop the pros and cons of various semantic technologies, in and out of academia.
To sum up the difference between the two paths, it comes down to a single word: utility. The goal of most enterprises is to monetize solutions to problems that people face, thereby creating free time and/or reducing the effort needed to complete a goal.
In the case of search solutions, the International Data Corporation (IDC) has recently conducted studies which clearly indicate the problem of information access:
1. Knowledge workers spend from 15% to 35% of their time searching for information.
2. Only 21% of respondents said they found the information they needed 85% to 100% of the time.
3. 40% of corporate users reported that they cannot find the information they need to do their jobs on their intranets.
4. 90% of the time that knowledge workers spend in creating new reports or other products is spent in recreating information that already exists.
These figures provide strong motivation for improved search capabilities, and even leave considerable room for a significant paradigm shift in how users access their information.
While there already exist a large number of papers published in the information retrieval realm, it is safe to say that not all of them take into account the challenges and needs faced by industry implementation.
The main points an “industry-friendly” algorithm must address are:
1. Scalability: How well can the algorithm handle large datasets, how well does it support the addition of new documents, and can it work on industry-standard hardware configurations (e.g. within typical memory limits)?
2. Efficiency: Is the response time acceptable for all use cases, especially the most common ones (e.g. search should be fastest, while adding a new document can be a bit slower)? Can it run for months without painful maintenance or upkeep?
3. Robustness to user error: If improperly used, what are the negative side effects? Traditionally this is most concerning in supervised systems, which ask for a training set and then require hours of computation. If a layman were to train the algorithm, would it still operate close to expectations?
It is not too difficult to imagine a fresh, cutting-edge algorithm violating some of the above requirements. In many algorithmic competitions, we see computation time reaching into days and weeks in an attempt to squeeze out a few more percentage points, yet any change to the data (such as adding a document) requires the algorithm to start from scratch. We also see carefully crafted training sets, created by experts, used to bootstrap supervised algorithms; these would be expensive in both time and money to create on a per-client basis.
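To make the document-addition point concrete, here is a minimal sketch of the opposite behavior: a toy inverted index where adding a document touches only that document's terms, so nothing has to be recomputed from scratch. The class name IncrementalIndex and its methods are hypothetical, chosen only for illustration, and whitespace tokenization stands in for real text processing. It is not meant to represent any particular product or published algorithm.

```python
from collections import defaultdict


class IncrementalIndex:
    """Toy inverted index (hypothetical): term -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add_document(self, doc_id, text):
        # Only the terms of the new document are touched; existing
        # postings are left alone, so no full rebuild is needed.
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, query):
        # Return the ids of documents that contain every query term.
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


index = IncrementalIndex()
index.add_document(1, "semantic search in industry")
index.add_document(2, "academic search competitions")
before = index.search("search")

# A new document arrives later; the index is updated in place.
index.add_document(3, "industry search at scale")
after = index.search("industry search")
```

The cost of add_document here grows with the size of the new document rather than with the size of the collection, which is the property the scalability requirement asks about. Many competition-style pipelines trade this property away for accuracy.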
While many ideas are proposed which are intellectually stimulating and provocative, only the algorithms which work, and work well, make it into an industry setting.
