FLC News

Method of Identifying Topic of Text Using Nouns

DConT2

Laboratory: National Security Agency

Technology: Invention that enables automatic identification of topics in machine-readable text in any language and from any source, including voice transcripts and material scanned for optical character recognition (OCR).

Opportunity: Available for licensing

Details: Even keyword systems that use dictionaries or thesauruses have difficulty with words that appear in variant spellings without significant changes in meaning, and with words whose meanings change in the presence of similarly spelled words. This invention overcomes those limitations by assigning frequency scores to all nouns in the text.

To identify the topic, the invention identifies each noun in the text, takes its singular form, and then creates combinations of those singular nouns; the user defines how many nouns go into each combination. Each singular noun receives a frequency score, and each noun combination receives a score equal to the sum of the scores of its constituent nouns. The topic is identified as the singular nouns and noun combinations with the highest scores.

For example, if a text contains the singular noun "desk" three times, "chair" twice, and "lamp" once, the system, using two-noun combinations, scores "desk chair" at 5 (3 + 2) and "desk lamp" at 4 (3 + 1), and identifies the topic as "desk; chair; desk chair; and desk lamp."
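
The scoring in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: the function name identify_topic and its parameters combo_size and top_k are hypothetical, and the input is assumed to be a list of nouns that have already been identified and reduced to singular form. The description does not say how many top-scoring items are kept, so the sketch ranks single nouns and combinations separately and keeps the top two of each, a choice made because it reproduces the example above.

```python
from collections import Counter
from itertools import combinations


def identify_topic(singular_nouns, combo_size=2, top_k=2):
    """Return the top-scoring single nouns and noun combinations.

    Hypothetical helper: ranking singles and combinations separately and
    keeping top_k of each is an assumption, not stated in the source.
    """
    # Frequency score for each singular noun.
    freq = Counter(singular_nouns)

    # Order nouns by descending score (ties broken alphabetically) so the
    # result is deterministic and combinations list the stronger noun first.
    ordered = sorted(freq, key=lambda noun: (-freq[noun], noun))

    # A combination's score is the sum of its constituent nouns' scores.
    combo_scores = {
        combo: sum(freq[noun] for noun in combo)
        for combo in combinations(ordered, combo_size)
    }

    top_nouns = ordered[:top_k]
    top_combos = sorted(combo_scores, key=lambda c: (-combo_scores[c], c))[:top_k]
    return top_nouns + [" ".join(combo) for combo in top_combos]


nouns = ["desk", "chair", "desk", "lamp", "chair", "desk"]
print("; ".join(identify_topic(nouns)))  # desk; chair; desk chair; desk lamp
```

Here "desk chair" scores 5 and "desk lamp" scores 4, while "chair lamp" scores 3 and is dropped by the assumed cutoff.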

Applications:

  • Search engine enhancement
  • Search tools for mobile devices
  • Document storage and retrieval
  • Automated tools for information management and maintenance

Benefits:

  • Does not rely on keywords - Improves performance over dictionary keyword searches by operating on the nouns actually in the text, avoiding the performance problems associated with keyword identification
  • Operates independently of language and text source - Can support multiple Big Data applications requiring automatic topic identification

Contact: Email NSA Technology Transfer to discuss licensing opportunities. Cite NSA Reference No. 1397.
