Research Interests
This page describes the next steps in research required for the SALSA technology. The technology is a semantically aware Information System that provides the ability to index, retrieve and compare documents based on their sense.
Unlike common information-foraging technologies (such as search engines and data mining), SALSA is based on Language Acquisition. The technology emerged from research at Rensselaer Polytechnic Institute and the CogWorks Lab.
This document presents three research interests, each in a specific field:
- Cognitive Science, Assessment of Semantic Performance
- Machine Learning, Creation and Specialization of Semantic Spaces
- Computer Science, Distributed Objects and Vectorial Indexes
Assessment of Semantic Performance
The goal is to create a quantitative measure of semantic awareness and performance for Information Systems. The baseline for the measure should be related to human performance. We have created a set of tests for term-to-term comparison, but a comprehensive test should also cover the processing of passages and whole documents.
The test must be narrow enough not to require any syntactic or pragmatic processing, yet complicated enough to discriminate strongly between performance levels. Some parts of the TOEFL exam provide good leads toward a measure of semantic awareness. Another possible lead could be based on Google Query Re-injection assertions.
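As an illustration of what a term-to-term test might look like, here is a minimal sketch of a multiple-choice synonym item in the style of the TOEFL vocabulary questions. It assumes that each term already has a vector and that cosine similarity is the comparison measure; the names `cosine`, `answer_item` and `score`, as well as the toy vectors, are hypothetical and not part of SALSA.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def answer_item(vectors, stem, choices):
    """Pick the choice whose vector is closest to the stem's vector."""
    return max(choices, key=lambda c: cosine(vectors[stem], vectors[c]))


def score(vectors, items):
    """Fraction of items answered correctly; each item is (stem, choices, key)."""
    correct = sum(answer_item(vectors, s, ch) == key for s, ch, key in items)
    return correct / len(items)


# Toy, hand-made vectors (hypothetical; real ones would come from a semantic space).
toy_vectors = {
    "big": [0.9, 0.1, 0.0],
    "large": [0.8, 0.2, 0.1],
    "green": [0.0, 0.9, 0.3],
    "fast": [0.1, 0.0, 0.9],
}
toy_items = [("big", ["large", "green", "fast"], "large")]
```

The resulting fraction can be compared directly with the proportion of correct answers obtained by human test takers on the same items, which is one way to relate the baseline to human performance.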
Creation and Specialization of Semantic Spaces
The SALSA technology is based on parallel vectorial spaces called “Semantic Spaces”. These spaces are intended to capture global contexts. As of today, they are created either randomly or by term priming. There is a need to create contexts based on specific information analysis, for example using Dumais’ Information-Entropy theories.
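One possible reading of an entropy-based analysis is the global entropy weight commonly used in latent semantic analysis work. The sketch below assumes that reading; it is not a description of how SALSA creates its spaces. The function name `entropy_weights` and the whitespace tokenization are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def entropy_weights(documents):
    """Return {term: weight}, where weight = 1 + sum_j(p_j * log p_j) / log(n).

    p_j is the share of the term's occurrences that fall in document j and n is
    the number of documents. A weight near 1 means the term is concentrated in
    few documents; a weight near 0 means it is spread evenly over all of them.
    """
    n = len(documents)
    counts = [Counter(doc.split()) for doc in documents]  # naive tokenization (assumption)
    totals = Counter()
    for c in counts:
        totals.update(c)
    weights = {}
    for term, global_freq in totals.items():
        h = 0.0
        for c in counts:
            if c[term]:
                p = c[term] / global_freq
                h += p * math.log(p)
        weights[term] = 1.0 + h / math.log(n) if n > 1 else 1.0
    return weights
```

Under this assumption, terms with a high weight would be natural candidates for priming a specialized context, while evenly spread terms carry little information about any particular context.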
Another research area related to Semantic Spaces is their population and their evolution and modification over time. What is the adequate population size for any given task? What is the effect of dimensionality within Semantic Spaces (current spaces have between 500 and 2500 dimensions)? How do spaces specialize given their results?
Distributed Objects and Vectorial Indexes
SALSA’s language acquisition phase is simple but very computationally intensive. With a few well-designed distributed objects, the efficiency of the acquisition phase would increase greatly. The overall design of the SALSA framework is heavily parallel, but no existing low-level objects satisfy the framework’s requirements. Low-level objects must:
- be able to be transparently distributed
- allow for queued writing
- resolve race conditions while avoiding deadlocks
- be memory resident
The development of such objects is crucial to sustain SALSA’s performance and allow for transparent scaling.
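To make some of these requirements concrete, here is a minimal single-process sketch of a memory-resident vector store with queued writing. All updates pass through one queue and one writer thread, so concurrent producers cannot race on a vector, and the single lock rules out lock-ordering deadlocks. Distribution across machines is deliberately left out, and the class name `QueuedVectorStore` and its methods are hypothetical.

```python
import queue
import threading


class QueuedVectorStore:
    """In-memory vector store; every write goes through one queue and one writer thread."""

    def __init__(self, dimensions):
        self._dimensions = dimensions
        self._vectors = {}
        self._lock = threading.Lock()  # the only lock, so no lock-ordering deadlocks
        self._pending = queue.Queue()
        self._writer = threading.Thread(target=self._drain, daemon=True)
        self._writer.start()

    def add(self, term, delta):
        """Queue an additive update to a term's vector and return immediately."""
        if len(delta) != self._dimensions:
            raise ValueError("delta has the wrong number of dimensions")
        self._pending.put((term, list(delta)))

    def _drain(self):
        """Writer loop: apply queued updates one at a time."""
        while True:
            term, delta = self._pending.get()
            with self._lock:
                current = self._vectors.setdefault(term, [0.0] * self._dimensions)
                for i, x in enumerate(delta):
                    current[i] += x
            self._pending.task_done()

    def flush(self):
        """Block until every queued write has been applied."""
        self._pending.join()

    def get(self, term):
        """Return a copy of the term's vector (zeros if the term is unknown)."""
        with self._lock:
            return list(self._vectors.get(term, [0.0] * self._dimensions))
```

A caller would issue `add` from any number of worker threads and call `flush` before reading results. A distributed version would need to replace the in-process queue with a networked one, which this sketch does not attempt.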
SALSA’s indexes are vector based. The goal is to create a framework based on vectorial indexes to retrieve documents, without the overhead and performance trade-offs of Oracle’s Custom Java Index or related tools. The framework should be efficient and allow for:
- index clustering
- distributed indexing and retrieval
- cached retrieval
Cached retrieval is especially tricky because queries are represented as vectors as well. The idea is to provide a retrieval-level cache for similar queries (slightly different vectors) without having to scan the whole vectorial space.
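One simple way to picture such a cache is a list of previously answered query vectors that is checked before the full index is consulted. The sketch below assumes cosine similarity and a fixed threshold; `SimilarQueryCache`, the `search` callback and the default threshold of 0.99 are hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class SimilarQueryCache:
    """Reuse the results of an earlier query when a new query vector is close enough."""

    def __init__(self, search, threshold=0.99):
        self._search = search        # full retrieval function (hypothetical callback)
        self._threshold = threshold  # minimum cosine similarity for a cache hit
        self._entries = []           # list of (query_vector, results)
        self.hits = 0
        self.misses = 0

    def retrieve(self, query):
        """Return cached results for a similar query, else run the full search."""
        for cached_query, results in self._entries:
            if cosine(query, cached_query) >= self._threshold:
                self.hits += 1
                return results
        self.misses += 1
        results = self._search(query)
        self._entries.append((list(query), results))
        return results
```

This version scans the cache linearly, which is only cheaper than a full search while the cache stays small. Avoiding that scan, for instance by bucketing query vectors, is part of what makes the problem tricky.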
Conclusion
For any information regarding the technology or any of its research interests, please feel free to contact me. This document purposefully does not go into detail, as it serves as a general introduction, but I would gladly answer any questions you might have.
Thank you for your time and consideration,
_Stéphane

