The human brain has often been called the world's most extraordinary computer. It is capable of quickly and efficiently processing huge amounts of information, and the most frequently cited example of this is natural language processing. Enormous time and resources have gone into the study of this uniquely human phenomenon. This paper will examine a technique for efficiently processing sentences based on a top-down approach, and will then examine how this technique can be improved.

Humans process language with a kind of top-down approach. They examine a sentence or phrase as a whole, and if they do not recognize it, they break it down into its component parts until they recognize the parts. If a sentence or phrase is used often enough, it becomes as much a part of a person's lexicon as a word. With this in mind, a simple LISP parser (Appendix A) has been written to process sentences more efficiently. The parser operates by looking at the entire phrase passed to it and checking whether that phrase is in the lexicon (Appendix B). If it is, the problem is solved: the phrase is understood. Otherwise, the sentence is broken down into its component parts, and each part is examined to see whether it exists in the lexicon. For example, suppose we pass it the phrase "How is it going?", a commonly used phrase in everyday conversation.
The parser produces the output:

    USER(4): 0: (ENGINE (HOW IS IT GOING))
    0: returned ((A QUESTION OFTEN USED AS A GREETING REQUESTING INFORMATION AS TO THE SUBJECTS STATE))

If instead we pass it the phrase "Her long flowing hair swung back and forth", which is composed of a couple of key phrases, the parser produces the output:

    USER(6): 0: (ENGINE (HER LONG FLOWING HAIR SWUNG BACK AND FORTH))
    0: returned ((PRONOUN REFERING TO FEMALE SUBJECT) (PROTEEN STRAINS FROM A SUBJECT 'S HEAD OF ABOVE AVERAGE LENGTH AND LEFT UNCONSTRAINED) (ACTION OF PERIODIC MOVEMENT THROUGH A GIVEN PATH AS IN A PENDULUM))

However, the sentence "Long sentences can be difficult to understand and may not break down easily," which pretty much speaks for itself, produces the output:

    USER(9): 0: (ENGINE (LONG SENTENCES CAN BE DIFFICULT TO UNDERSTAND AND MAY NOT BREAK DOWN EASILY))
    0: returned ((PARTS OF LANGUAGE CONVEYING AN ENTIRE THOUGHT WHICH ARE OF ABOVE AVERAGE LENGTH) (THERE IS THE POSSIBILITY OF A STATE OF EXISTANCE) (HARD NOT EASY) (THE ACTION OF KNOWING THE SEMANTIC MEANING OF A PIECE OF INFORMATION) (CONJUNCTION CONNECTION TWO OR MORE THINGS) (EXPRESSES THAT THERE IS A POSSIBILITY NOT A CERTANTY) (LOGICALLY NEGATES THE FOLLOWING) (BE DECOMPOSED INTO ITS COMPONENT PARTS) (ABLE TO BE DONE WITHOUT MUCH EFFORT))

All test sentences can be viewed in Appendix C.

The parser presented here is very rudimentary, but with some relatively easy improvements it may prove to be a very useful tool. The first major improvement concerns the information the parser returns. As presented, it returns only a description of the sentence based on its component parts; it does not really do anything with it. To be useful, the parser would have to provide the deeper meaning in the lexicon to a calling program, which would then "understand" the sentence passed to it (Guthrie, Pustejovsky, Wilks, & Slator, 1996). The second area for improvement is the size of the lexicon.
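As a rough illustration, the top-down lookup just described can be sketched in Python (the actual parser is the LISP program in Appendix A, and the lexicon entries below are invented stand-ins for Appendix B): try the whole phrase first, and on failure scan progressively shorter subphrases, longest first.

```python
# Sketch of the top-down engine: look up the whole phrase, and if it is
# not in the lexicon, scan progressively shorter subphrases, longest first.
def engine(words, lexicon):
    if not words:
        return []
    for length in range(len(words), 0, -1):          # longest spans first
        for start in range(len(words) - length + 1):
            chunk = tuple(words[start:start + length])
            if chunk in lexicon:
                # Recurse on the unmatched words to either side of the match.
                return (engine(words[:start], lexicon)
                        + [lexicon[chunk]]
                        + engine(words[start + length:], lexicon))
    return []                                        # nothing recognized

# Invented stand-in lexicon (the real one is in Appendix B).
LEX = {
    ("how", "is", "it", "going"): "a question often used as a greeting",
    ("her",): "pronoun referring to a female subject",
    ("long", "flowing", "hair"): "hair of above-average length",
}

print(engine("how is it going".split(), LEX))
# → ['a question often used as a greeting']
print(engine("her long flowing hair".split(), LEX))
# → ['pronoun referring to a female subject', 'hair of above-average length']
```

As in the traces above, a phrase found whole returns a single meaning, while an unrecognized sentence decomposes into the meanings of its recognized parts.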
The dictionary used for this program was a small sample dictionary intended only for testing it (Guthrie, et al., 1996). A useful lexicon would require a large number of entries, most of them entered by hand, and the size of dictionaries has often been a limiting factor in parsers (Guthrie, et al., 1996). The simple parsers used in older text-based adventure games recognize only a limited number of words: verbs such as walk, talk, and look, and nouns such as sword, light, and rope. The failure to recognize words often dismayed users. Hence, for a parser to be user friendly, it will require a large lexicon.

The way information is entered could also use improvement. The parser would probably be used as part of a larger program that needed to understand natural language; that program would handle the user input and output. The parser would use punctuation and case as clues to how to handle the sentence (Cole, Hirschman, Atlas, Beckman, et al., 1995). For instance, commas would help delimit phrases, and words containing apostrophes (contractions) could be expanded and disambiguated into their component parts. With these small improvements, the parser would handle natural language more gracefully.

The most monumental improvement that can be made to this program is the ability to learn (Guthrie, et al., 1996). The parser would be much more useful if it could increase the size of its lexicon of its own accord. There are two ways this could be done; the most realistic approach, however, is a combination of the two. The first approach is quite simple: whenever the parser comes across a word it does not recognize, it prompts the user to define it. This would be simple, user-controlled learning. It could be expanded to include user checking of the final meaning, where the definitions of words would be expanded to account for their use in the given sentence, and the user could then correct the meanings of words that were misunderstood.
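The first, user-controlled approach might be sketched as follows (a minimal illustration; the function name and the `ask` callback are invented, standing in for real user I/O such as `input` in an interactive session):

```python
# Sketch of user-controlled learning: when a word is not in the lexicon,
# ask the user for a definition and add it as a new entry.
def learn_unknown(words, lexicon, ask):
    for word in words:
        if (word,) not in lexicon:
            lexicon[(word,)] = ask(word)   # `ask` would be `input` interactively
    return lexicon

lex = {("hello",): "a greeting"}
learn_unknown(["hello", "there"], lex,
              ask=lambda w: "user-supplied definition of " + w)
print(lex[("there",)])
# → user-supplied definition of there
```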
The second approach is the complete opposite: the computer infers the meaning of an unknown word from the context in which it appears. Obviously, this will be much more difficult to implement, for there is no simple way to infer meanings. The most logical approach is a combination of the two: the computer chooses some aspects of the meaning of an unknown word, such as its part of speech and the context it appears in, and prompts the user to fill in the blanks (Cole, et al., 1995). There is a fairly reliable way to pick up on the similarities throughout a text that aid inference: neural networks. Neural networks are particularly adept at finding similarities given the parameters they are looking for. Besides aiding inference, they would help in constructing the lexicon by identifying commonly used sentences and phrases, allowing those often-used items to be placed as a whole into the lexicon. Likewise, they would identify similarly spelled words, which would help account for users' spelling errors and for words derived from the same base word as other words.

There is some debate as to how the lexicon should be constructed in artificial natural language processing systems. It would seem that many humans do this with links between words of similar meaning in their lexicons. Hence the lexicon for this parser might be more useful were it implemented in two parts: the first part would contain the words and phrases, and the second would contain the meanings. One word might then contain links to numerous meanings, and each meaning could have links from numerous words; in this regard it would be similar to a thesaurus (Guthrie, et al., 1996). This would reduce the size of the lexicon by eliminating redundancy while not affecting its effectiveness. The more important implication, if it works, is the possibility that the human lexicon is similarly separated into two parts.
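The two-part lexicon proposed above might look like the following sketch (all entries invented): words link to shared meaning records, so two words with the same sense store that sense only once.

```python
# Part one: words and phrases, each linking to meaning ids.
WORD_LINKS = {
    "bank":  [1, 2],   # one word linked to several meanings
    "shore": [2],      # two words sharing meaning 2, stored only once
}
# Part two: the meanings themselves.
MEANINGS = {
    1: "an institution that holds money",
    2: "the land at the edge of a body of water",
}

def senses(word):
    """All meanings linked from a word, thesaurus-style."""
    return [MEANINGS[m] for m in WORD_LINKS.get(word, [])]

print(senses("bank"))    # two senses
print(senses("shore"))   # shares meaning 2 with "bank"
```

Because "bank" and "shore" both point at meaning 2, the gloss is stored once rather than duplicated under each word.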
One of the main areas where parsers need improvement is the elimination of ambiguity in sentences. This is particularly important in applications such as translators. A key example of something that needs to be disambiguated is pronouns: when a pronoun is encountered, the rest of the sentence should be examined to find which subject is being referred to, and that subject should be substituted in place of the pronoun. As discussed previously, the parser should use clues from context for a better "understanding" of the text. This will help disambiguate words with multiple meanings. For example, if the word bank is surrounded by words such as money and teller, the chosen meaning is different from when it is surrounded by words such as river and boat (Guthrie, et al., 1996).

Finally, the efficiency of the parser can be improved by parsing based on the parts of speech (Cole, et al., 1995). At present, the parser slowly breaks down the sentence or phrase from top to bottom using no clues from the parts of speech. For instance, if the sentence is eight words long and that eight-word phrase is not in the lexicon, then the first seven and then the last seven words are examined to see whether those seven-word phrases are in the dictionary. If neither of those is (and it is likely they will not be), then the first six, middle six, and last six words are examined. This process continues until a match is found. It would be much more efficient for the phrases of the sentence to be predicted by trying standard patterns, such as subject, verb, object. These patterns would of course be unique to the different languages on which the parser is used; however, languages such as Japanese, which rely on particles to separate the parts of speech, should make this easier to accomplish.
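The bank example suggests one simple (invented) scoring scheme for context-based disambiguation: attach a set of cue words to each sense and pick the sense whose cues overlap the surrounding words the most.

```python
# Sketch of context-based disambiguation: each sense carries cue words,
# and the sense with the most cues present in the sentence wins.
SENSES = {
    "bank": [({"money", "teller"}, "an institution that holds money"),
             ({"river", "boat"},  "the land at the edge of a river")],
}

def disambiguate(word, context_words):
    # max by the number of cue words found in the surrounding context
    return max(SENSES[word],
               key=lambda sense: len(sense[0] & set(context_words)))[1]

print(disambiguate("bank",
                   ["the", "teller", "at", "the", "bank", "counted", "money"]))
# → an institution that holds money
print(disambiguate("bank",
                   ["the", "boat", "drifted", "to", "the", "river", "bank"]))
# → the land at the edge of a river
```

A real system would need far richer cues and some tie-breaking policy, but the overlap count captures the intuition that money and teller pull toward one sense while river and boat pull toward the other.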
While many possible improvements have been listed here, the basic idea remains unchanged: natural language can be efficiently parsed by a top-down approach, breaking a sentence or phrase down into its component parts until a part is recognized. The parser can be made more useful by making it a component of a program that extracts a deeper meaning from the given sentence. This will also make it more user friendly, accepting input that includes case and punctuation. The size of the dictionary needs to be drastically increased, which can be aided by techniques through which the parser learns new words and adds them to its lexicon; a neural network could support this and add further functionality. The lexicon could be split into two parts, which would reduce its size and presents interesting ramifications for the construction of the human lexicon. The parser needs some routine for disambiguating pronouns and specific word meanings in sentences. Finally, the top-down efficiency of the parser could be improved through part-of-speech prediction. If all of these improvements could be made, the parser would do a reasonably good job of understanding natural language, and would do it efficiently as well.

References

Guthrie, L., Pustejovsky, J., Wilks, Y., & Slator, B. M. (1996, January). The role of lexicons in natural language processing. Communications of the ACM, 39(1), 63-72.

Cole, R., Hirschman, L., Atlas, L., Beckman, M., Biermann, A., Bush, M., Clements, M., Cohen, J., Garcia, O., Hanson, B., Hermansky, H., Levinson, S., McKeown, K., Morgan, N., Novick, D. G., Ostendorf, M., Oviatt, S., Price, P., Silverman, H., Spitz, J., Waibel, A., Weinstein, C., Zahorian, S., & Zue, V. (1995, January). The challenge of spoken language systems: Research directions for the nineties. IEEE Transactions on Speech and Audio Processing, 3(1), 1-21.
An update to the parser discussed here is available here.
Last Modified: 5/22/2000