Optimization of Natural Language Processing

	The human brain has often been called the world's most extraordinary 
computer. It is capable of quickly and efficiently processing huge amounts of 
information. The most frequently cited example of this is natural language 
processing, a uniquely human phenomenon into which countless hours and 
resources have been poured. This paper examines a technique for efficiently 
processing sentences based on a top-down approach and then considers how that 
technique can be improved.
	Humans process language using a kind of top-down approach. They 
examine a sentence or phrase as a whole and, if they do not recognize it, break 
it down into its component parts until they recognize the parts. If a sentence or 
phrase is used often enough, it becomes as much a part of a person's lexicon as a 
word. With this in mind, a simple LISP parser (Appendix A) has been written to 
process sentences more efficiently. The parser operates by looking at the entire 
phrase passed to it and checking whether that phrase is in the lexicon (Appendix B). 
If it is, the problem is solved: the phrase is understood. Otherwise, the sentence is 
broken down into its component parts, and each of these parts is examined to see 
whether it exists in the lexicon.
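	A minimal sketch of this lookup in Common Lisp is shown below. The toy 
lexicon and the longest-prefix strategy are assumptions made for illustration; the 
actual parser appears in Appendix A and may differ in its details:

;;; A minimal sketch of the top-down lookup described above.  The toy
;;; lexicon is an association list keyed on lists of symbols; the real
;;; lexicon appears in Appendix B.

(defparameter *lexicon*
  '(((how is it going)      . (a question often used as a greeting))
    ((her)                  . (pronoun referring to a female subject))
    ((long flowing hair)    . (hair of above average length left unconstrained))
    ((swung back and forth) . (action of periodic movement as in a pendulum))))

(defun lookup (phrase)
  "Return the meaning of PHRASE if it is in the lexicon, else NIL."
  (cdr (assoc phrase *lexicon* :test #'equal)))

(defun engine (phrase)
  "Understand PHRASE as a whole if possible; otherwise find the longest
recognized prefix and recurse on whatever remains."
  (if (null phrase)
      nil
      (loop for len downfrom (length phrase) to 1
            for head = (subseq phrase 0 len)
            for meaning = (lookup head)
            when meaning
              return (cons meaning (engine (subseq phrase len)))
            finally
              ;; nothing recognized: note the unknown word and move on
              (return (cons (list 'unknown (first phrase))
                            (engine (rest phrase)))))))

With this toy lexicon, (engine '(her long flowing hair swung back and forth)) 
returns three meanings, one per recognized phrase, much like the second trace 
shown below.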
	For example, if we pass it the phrase "How is it going?", a phrase 
commonly used in everyday conversation, the parser produces the output:

USER(4):  0: (ENGINE (HOW IS IT GOING)
 0: returned
      ((A QUESTION OFTEN USED AS A GREETING REQUESTING INFORMATION AS TO THE
        SUBJECTS STATE))

Whereas if we pass it the phrase "Her long flowing hair swung back and forth," 
which is composed of a couple of key phrases, the parser produces the output:

USER(6):  0: (ENGINE (HER LONG FLOWING HAIR SWUNG BACK AND FORTH)
 0: returned
      ((PRONOUN REFERING TO FEMALE SUBJECT)
       (PROTEEN STRAINS FROM A SUBJECT 'S HEAD OF ABOVE AVERAGE LENGTH AND
        LEFT UNCONSTRAINED)
       (ACTION OF PERIODIC MOVEMENT THROUGH A GIVEN PATH AS IN A PENDULUM))

However, the sentence "Long sentences can be difficult to understand and may 
not break down easily," which largely speaks for itself, produces the 
output:

USER(9):  0: (ENGINE (LONG SENTENCES CAN BE DIFFICULT TO UNDERSTAND AND MAY NOT BREAK
             DOWN EASILY)
 0: returned
      ((PARTS OF LANGUAGE CONVEYING AN ENTIRE THOUGHT WHICH ARE OF ABOVE
        AVERAGE LENGTH)
       (THERE IS THE POSSIBILITY OF A STATE OF EXISTANCE) (HARD NOT EASY)
       (THE ACTION OF KNOWING THE SEMANTIC MEANING OF A PIECE OF INFORMATION)
       (CONJUNCTION CONNECTION TWO OR MORE THINGS)
       (EXPRESSES THAT THERE IS A POSSIBILITY NOT A CERTANTY)
       (LOGICALLY NEGATES THE FOLLOWING)
       (BE DECOMPOSED INTO ITS COMPONENT PARTS)
       (ABLE TO BE DONE WITHOUT MUCH EFFORT))

All test sentences can be viewed in Appendix C.

The parser presented here is very rudimentary. With some relatively easy 
improvements, it may prove to be a very useful tool.
	The first major improvement concerns the information the parser returns. 
As presented, it returns only a description of the sentence based on its 
component parts; it does not really do anything with that description. For the 
parser to be useful, it would have to provide the deeper meaning stored in the 
lexicon to a calling program. The calling program would then "understand" the 
sentence passed to it (Guthrie, Pustejovsky, Wilks, & Slator, 1996).
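	One way this might look is sketched below. The structured :act and :topic 
keys, and the deep lexicon itself, are invented purely for illustration; the point is 
only that the caller receives something it can act on rather than a printed gloss:

;;; A sketch of returning structured meaning to a calling program.
;;; The :act and :topic keys are illustrative assumptions.

(defparameter *deep-lexicon*
  '(((how is it going) . (:act query :topic subject-state))
    ((hello)           . (:act greeting))))

(defun deep-lookup (phrase)
  (cdr (assoc phrase *deep-lexicon* :test #'equal)))

(defun respond (phrase)
  "A toy calling program that acts on the returned meaning."
  (case (getf (deep-lookup phrase) :act)
    (query    '(doing fine thanks))
    (greeting '(hello to you too))
    (t        '(i do not understand))))

Here (respond '(how is it going)) answers (DOING FINE THANKS), because the 
calling program sees a structured meaning rather than a description of the 
sentence.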
	The second area for improvement is the size of the lexicon. The 
dictionary used for this program was a small sample built only for testing 
(Guthrie, et al., 1996). A useful lexicon would require a large number of entries, 
each made by hand, and the size of dictionaries has often been a limiting factor 
in parsers (Guthrie, et al., 1996). The simple parsers used in older text-based 
adventure games recognized only a limited number of words -- verbs such as 
walk, talk and look, and nouns such as sword, light and rope. The failure to 
recognize words often dismayed users. Hence, for a parser to be user friendly, it 
will require a large lexicon.
	The way the information is entered could also use improvement. The 
parser would probably be used as part of a larger program that needed to 
understand natural language, and that larger program would handle the user 
input and output. The parser would use clues such as punctuation and case to 
decide how to handle the sentence (Cole, Hirschman, Atlas, Beckman, et al., 
1995). For instance, commas would help delimit phrases, and words that include 
apostrophes (contractions) could be expanded and disambiguated into their 
component parts. With these small improvements, the parser would handle 
natural input far more gracefully.
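	A minimal sketch of such a preprocessing step appears below. The handful 
of contractions listed is only a sample assumed for illustration:

;;; A sketch of the preprocessing suggested above: split the raw input
;;; at commas and expand known contractions before parsing.

(defparameter *contractions*
  '(("don't" do not) ("can't" can not) ("it's" it is)))

(defun split-at-commas (sentence)
  "Split a sentence string into comma-delimited phrase strings."
  (loop with start = 0
        for pos = (position #\, sentence :start start)
        collect (string-trim " " (subseq sentence start pos))
        while pos
        do (setf start (1+ pos))))

(defun expand-word (word)
  "Turn one word string into a list of symbols, expanding contractions."
  (or (cdr (assoc word *contractions* :test #'string-equal))
      (list (intern (string-upcase word)))))

With these, "It's cold, don't you think" would first be split into the phrases 
"It's cold" and "don't you think", and the words it's and don't would reach the 
parser as (IT IS) and (DO NOT).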
	The most significant improvement that can be made to this program is 
the ability to learn (Guthrie, et al., 1996). The parser would be much more 
useful if it could increase the size of its lexicon of its own accord. There are two 
ways this could be done; however, the most realistic approach is a combination 
of the two. The first approach is quite simple: whenever the parser comes across 
a word it does not recognize, it prompts the user to define it. This would be 
simple, user-controlled learning. It could be expanded to include user checking 
of the final meaning, where the definitions of words would be expanded to 
account for their use in the given sentence; the user could then correct the 
meanings of words that were misunderstood. The second approach is the 
complete opposite: the computer would infer the meaning of an unknown word 
from the context in which it appears. Obviously, this would be much more 
difficult to implement, for there is no simple way to infer meanings. The most 
logical approach is a combination of the two: the computer would choose some 
aspects of the meaning of an unknown word, such as its part of speech and the 
context it appears in, and would prompt the user to fill in the blanks (Cole, et 
al., 1995).
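	The user-controlled half of this idea is simple enough to sketch. The 
fragment below assumes the association-list lexicon from the earlier sketch and 
asks the user to type a definition as a Lisp list:

;;; A sketch of user-controlled learning: when a single word is unknown,
;;; ask the user for its definition and remember the answer.

(defun learn-word (word)
  "Prompt the user to define WORD and add the definition to the lexicon."
  (format t "~&I do not know ~A.  Please define it: " word)
  (finish-output)
  (let ((definition (read)))
    (push (cons (list word) definition) *lexicon*)
    definition))

(defun lookup-or-learn (phrase)
  "Look PHRASE up; if it is a single unknown word, learn it from the user."
  (or (lookup phrase)
      (when (null (rest phrase))
        (learn-word (first phrase)))))

Typing (A LARGE GREY ANIMAL) in response to the prompt for ELEPHANT would add 
((ELEPHANT) . (A LARGE GREY ANIMAL)) to the lexicon, so the word is recognized 
from then on.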
	There is a fairly reliable way to pick up on similarities throughout the 
text that would help such inference: neural networks. Neural networks are 
particularly adept at finding similarities once given the parameters to look for. 
Besides aiding inference, they would help in constructing the lexicon by 
identifying commonly used sentences and phrases, allowing those often-used 
items to be placed into the lexicon as single entries. Likewise, they would 
identify words that are spelled similarly, which would help account for users' 
spelling errors and for words derived from the same base word as other words.
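	The spelling-similarity part of this idea can be illustrated even without a 
neural network; the sketch below uses plain edit distance as a simple stand-in 
for the similarity measure:

;;; A sketch of spelling similarity using edit distance, a simple
;;; stand-in here for the neural-network similarity measure.

(defun edit-distance (a b)
  "Number of single-character edits needed to turn string A into B."
  (let* ((lb (length b))
         (row (loop for j from 0 to lb collect j)))
    (loop for i from 1 to (length a)
          do (let ((new (list i)))
               (loop for j from 1 to lb
                     for cost = (if (char-equal (char a (1- i)) (char b (1- j)))
                                    0 1)
                     do (push (min (1+ (first new))            ; insertion
                                   (1+ (nth j row))            ; deletion
                                   (+ cost (nth (1- j) row)))  ; substitution
                              new))
               (setf row (nreverse new))))
    (car (last row))))

(defun similarly-spelled-p (word1 word2)
  "Treat two words as similar if they differ by at most two edits."
  (<= (edit-distance (string word1) (string word2)) 2))

Here (similarly-spelled-p 'walk 'walked) is true, so a misspelling or a derived 
form could be matched back to its base word.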
	There is some debate as to how the lexicon should be constructed in 
artificial natural language processing systems. It would seem that many humans 
manage this with links between words of similar meaning in their lexicons. The 
lexicon for this parser might therefore be more useful if it were implemented in 
two parts: the first part would contain the words and phrases, the second the 
meanings. One word might then link to numerous meanings, and each meaning 
could be linked from numerous words. In this regard, it would be similar to a 
thesaurus (Guthrie, et al., 1996). This would reduce the size of the lexicon by 
eliminating redundancy while not affecting its effectiveness. The more important 
implication, if it works, is that it raises the possibility that the human lexicon is 
similarly separated into two parts.
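	A sketch of this two-part arrangement is given below. The meaning 
identifiers (M1, M2, ...) and the two hash tables are assumptions made for 
illustration:

;;; A sketch of the two-part lexicon: one table maps words and phrases
;;; to meaning identifiers, a second maps identifiers to the meanings,
;;; so several words can share a single meaning entry.

(defparameter *word-table* (make-hash-table :test #'equal))
(defparameter *meaning-table* (make-hash-table))

(defun add-meaning (id meaning)
  (setf (gethash id *meaning-table*) meaning))

(defun add-word (phrase &rest meaning-ids)
  "Link PHRASE (a list of symbols) to one or more meaning identifiers."
  (setf (gethash phrase *word-table*) meaning-ids))

(defun meanings-of (phrase)
  "All meanings linked to PHRASE, following the links."
  (mapcar (lambda (id) (gethash id *meaning-table*))
          (gethash phrase *word-table*)))

(add-meaning 'm1 '(institution that holds money))
(add-meaning 'm2 '(land along the edge of a river))
(add-word '(bank) 'm1 'm2)   ; one word linked to two meanings
(add-word '(shore) 'm2)      ; two words sharing one meaning

Here (meanings-of '(bank)) returns both meanings, while BANK and SHORE share 
the single entry M2 rather than duplicating it.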
	One of the main areas where parsers need to be improved is the 
elimination of ambiguity in sentences. This is particularly important in 
applications such as translators. A key example of something that needs to be 
disambiguated is the pronoun. When a pronoun is encountered, the rest of the 
sentence should be examined to find which subject is being referred to, and that 
subject should be substituted in place of the pronoun. As discussed previously, 
the parser should use clues from context for a better "understanding" of the text. 
This will help disambiguate words with multiple meanings. For example, if the 
word bank is surrounded by words such as money and teller, the chosen 
meaning is different from when it is surrounded by words such as river and boat 
(Guthrie, et al., 1996).
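	The bank example might be handled along the lines sketched below, where 
each sense carries a small list of clue words and the sense sharing the most clue 
words with the sentence wins. The clue lists are, of course, illustrative 
assumptions:

;;; A sketch of context-based word sense selection: pick the sense
;;; whose clue words overlap the rest of the sentence the most.

(defparameter *senses*
  '((bank ((money teller loan) (institution that holds money))
          ((river boat water)  (land along the edge of a river)))))

(defun disambiguate (word sentence)
  "Choose the sense of WORD best supported by the words of SENTENCE."
  (let ((best nil) (best-score -1))
    (dolist (sense (cdr (assoc word *senses*)) best)
      (let ((score (length (intersection (first sense) sentence))))
        (when (> score best-score)
          (setf best (second sense) best-score score))))))

Here (disambiguate 'bank '(the teller counted the money at the bank)) picks the 
financial sense, while a sentence mentioning a river and a boat picks the other.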
	Finally, the efficiency of the parser can be improved by parsing based on 
the part of speech (Cole, et al., 1995). Right now, the parser slowly breaks down 
the sentence or phrase from top to bottom using no clues from the parts of 
speech. For instance, if the sentence is eight words long and that eight-word 
phrase is not in the lexicon, then the first seven and then the last seven words are 
examined to see if those seven-word phrases are in the dictionary. If neither of 
those is (and it is likely neither will be), then the first six, middle six and last 
six words are examined. This process continues until a match is found. It would 
be much more efficient for the phrases of the sentence to be predicted by trying 
standard patterns, such as subject, verb, object. These patterns would of course 
differ for each language on which the parser is used. However, languages such 
as Japanese, which rely on particles to separate the parts of speech, should make 
this easier to accomplish.
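	The sketch below illustrates the idea for the subject-verb-object pattern: 
tag each word with its part of speech and split the sentence at the first verb, so 
only two phrases need to be looked up instead of every possible sub-phrase. The 
small tag table is an assumption for illustration:

;;; A sketch of part-of-speech prediction: split the sentence at the
;;; first verb into a subject phrase and a predicate phrase.

(defparameter *pos-table*
  '((her . pronoun) (long . adjective) (flowing . adjective)
    (hair . noun) (swung . verb)))

(defun pos-of (word)
  (or (cdr (assoc word *pos-table*)) 'unknown))

(defun split-at-verb (sentence)
  "Return (SUBJECT-PHRASE PREDICATE-PHRASE), split at the first verb."
  (let ((pos (position 'verb sentence :key #'pos-of)))
    (if pos
        (list (subseq sentence 0 pos) (subseq sentence pos))
        (list sentence nil))))

Here (split-at-verb '(her long flowing hair swung back and forth)) yields 
(HER LONG FLOWING HAIR) and (SWUNG BACK AND FORTH) in a single pass, and the 
parser can then look up those two phrases directly.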
	While many possible improvements have been listed here, the basic 
idea remains unchanged: natural language can be efficiently parsed by a top-
down approach, breaking a sentence or phrase down into its component parts 
until a part is recognized. The parser can be made more useful by making it a 
component of a larger program that draws a deeper meaning from the given 
sentence. This will also make it more user friendly, accepting input that includes 
case and punctuation. The size of the dictionary needs to be drastically 
increased, which can be aided by techniques through which the parser learns 
new words and adds them to its lexicon; a neural network could assist here and 
add further functionality. The lexicon could be split into two parts, which would 
reduce its size and present interesting ramifications for the construction of the 
human lexicon. The parser also needs some routine for disambiguating pronouns 
and specific word meanings in sentences. Finally, the top-down efficiency of the 
parser could be improved through part-of-speech prediction. If all of these 
improvements could be made, the parser would do a reasonably good job of 
understanding natural language, and would do it efficiently as well.

References
Guthrie, L., Pustejovsky, J., Wilks, Y., & Slator, B. M. (1996, January). The 
role of lexicons in natural language processing. Communications of the 
ACM, 39(1), 63-72.
Cole, R., Hirschman, L., Atlas, L., Beckman, M., Biermann, A., Bush, M., 
Clements, M., Cohen, J., Garcia, O., Hanson, B., Hermansky, H., 
Levinson, S., McKeown, K., Morgan, N., Novick, D. G., Ostendorf, M., 
Oviatt, S., Price, P., Silverman, H., Spitz, J., Waibel, A., Weinstein, C., 
Zahorian, S., & Zue, V. (1995, January). The challenge of spoken 
language systems: Research directions for the nineties. IEEE Transactions 
on Speech and Audio Processing, 3(1), 1-21.
	

An update to the parser discussed here is available here.

Appendix A

Appendix B

Appendix C

