Barney Pell has written a most interesting post about Natural Language Search, which is the problem his company Powerset is attempting to crack. The post is too long to include in full, but here's a choice quote:
"Keywordese" [the strings of individual keywords we have come to use for web search] is a really impoverished language. It is much less expressive than even first year French. Normal people have learned human languages all their lives, and that language learning ability is based on aeons of biological and cultural evolution. We are all masters of communicating our intent to other people. But when it comes to search engines, we have to revert to an impoverished foreign language in which it is impossible to express anything but the most basic thoughts. It is akin to using a pidgin language, the kind invented by two groups of people who speak different languages so that they can communicate through a combination of individual words and gestures without any real syntactic structure.
This motivates the idea of true natural language search. Instead of keywordese or even advanced keywordese (which few people can remember how to use), true natural language queries have linguistic structure. This includes queries where the function words matter, where word order means something, and where relationships that should be explicitly stated easily are stated. Instead of ignoring the function words, a natural language search engine respects their meaning and uses it to give better results. Instead of being a waste of time for a user to add stopwords [words like "by", "for", "about", "of", and "in"] in a query, each little word added has a profound effect on the search quality.
This is a great post, and well worth a read in full.
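To make the stopword point concrete, here is a minimal Python sketch of what a keyword-only engine effectively does to a query. The `keywordese` helper and the three sample queries are my own illustrative assumptions, not anything from Powerset; the stopword list is just the five words named in the quote.

```python
# Hypothetical illustration: the stopword list is taken from the quote above,
# everything else (function name, sample queries) is assumed for the example.
STOPWORDS = {"by", "for", "about", "of", "in"}


def keywordese(query):
    """Reduce a query to an unordered set of keywords, dropping stopwords."""
    return frozenset(word for word in query.lower().split() if word not in STOPWORDS)


queries = ["books by children", "books for children", "books about children"]
reduced = {query: keywordese(query) for query in queries}

for query, keywords in reduced.items():
    print(query, "->", sorted(keywords))
```

All three queries collapse to the same pair of keywords, "books" and "children", even though a person would read them as three quite different requests.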
However, it seems to me that a central challenge for anyone attempting to crack the Natural Language Search problem is to understand the deep structures of our natural language in a systematic and comprehensive manner, as opposed to tapping into one aspect of it, such as predicate relationships ("stopwords"), in isolation. This piecemeal approach can create as many false leads as useful ones. For example, "about" can mean "approximately" as well as "pertaining to", and without a deeper understanding of semantic context we would be unable to pinpoint which meaning is intended in the following sentences:
The man was about fifty years old.
The book was about a fifty-year-old man.
There is some very interesting academic work being done in the field of natural language syntax, and from a skim read of an overview of recent developments (the jargon is even worse than techies'!), I have gained the impression that there may well be clear rules and structures underlying our natural language, but equally that seemingly simple natural language phrases can conceal highly complex combinations and superpositions of those rules. And linguistics scholars are only beginning to develop explicit part-of-speech categorisations and syntactic rule sets that consistently generate sentences, most of which a four-year-old would not laugh at.
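Returning to the two "about" sentences, a toy Python sketch shows how far a surface rule gets and where it misleads. The rule here (treat "about" as "approximately" when the next word is a number) is my own assumption for illustration, as are the `guess_about_sense` name and the short number-word list; it is not a claim about how any real engine works.

```python
import re

# Assumed, deliberately tiny list of number words for the illustration.
NUMBER_WORDS = {
    "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
    "ten", "twenty", "thirty", "forty", "fifty", "hundred", "thousand",
}


def guess_about_sense(sentence):
    """Guess the sense of the first 'about' from the word that follows it."""
    tokens = re.findall(r"[a-z0-9']+", sentence.lower())
    if "about" not in tokens:
        return None
    position = tokens.index("about")
    following = tokens[position + 1] if position + 1 < len(tokens) else ""
    # Surface rule: a number right after "about" suggests "approximately".
    if following.isdigit() or following in NUMBER_WORDS:
        return "approximately"
    return "pertaining to"


for sentence in (
    "The man was about fifty years old.",
    "The book was about a fifty-year-old man.",
    "The book was about fifty years of war.",
):
    print(sentence, "->", guess_about_sense(sentence))
```

The rule handles the two original sentences, but it labels the third "approximately" even though most readers would arguably take it to describe the book's subject. That is the kind of false lead a piecemeal approach invites.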
What a great challenge Natural Language Search is, though: the challenge of understanding human language itself.