Thread

  1. text search: restricting the number of parsed words in headline generation

    Sushant Sinha <sushant354@gmail.com> — 2011-08-23T16:40:20Z

    Given a document and a query, the goal of headline generation is to
    produce text excerpts in which the query appears. Currently, headline
    generation in Postgres proceeds in the following steps:
    
    1. Tokenize the document and obtain the lexemes
    2. Decide which lexemes should be part of the headline
    3. Generate the headline
    
    So the time taken by headline generation depends directly on the size
    of the document. The longer the document, the more time it takes to
    tokenize and the more lexemes there are to operate on.
    
    Most of the time is spent in the tokenization phase, and for very big
    documents headline generation is very expensive.
    
    Here is a simple patch that limits the number of words parsed during
    the tokenization phase and so puts an upper bound on the cost of
    headline generation. The headline function takes a parameter
    MaxParsedWords. If this parameter is negative or not supplied, then the
    entire document is tokenized and operated on (the current behavior).
    However, if the supplied MaxParsedWords is a positive number, then
    tokenization stops after MaxParsedWords words have been obtained. The
    remaining headline generation steps operate on the tokens obtained up
    to that point.
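
    To illustrate the idea (this is not the patch itself, only a minimal
    Python sketch under stated assumptions), the cap can be modeled as
    cutting off a lazy tokenizer. The names tokenize, parse_words, and
    headline are hypothetical, the regex tokenizer is a stand-in for the
    real text search parser, and treating 0 like a negative value is an
    assumption, since only the negative and positive cases are described
    above.

```python
import re
from itertools import islice


def tokenize(document):
    """Lazily yield lowercased words; a stand-in for the real parser."""
    for match in re.finditer(r"\w+", document):
        yield match.group(0).lower()


def parse_words(document, max_parsed_words=-1):
    """Tokenize the document, stopping early when a positive cap is given.

    A non-positive cap means "parse everything". Treating 0 this way is an
    assumption of this sketch, not something the proposal specifies.
    """
    tokens = tokenize(document)
    if max_parsed_words > 0:
        # islice stops pulling from the generator once the cap is reached,
        # so the rest of the document is never tokenized.
        tokens = islice(tokens, max_parsed_words)
    return list(tokens)


def headline(document, query_words, max_parsed_words=-1, context=2):
    """Return a short excerpt around the first query hit, or "" if none.

    The excerpt selection here is deliberately naive; it only shows that
    the later steps see nothing beyond the parsed prefix.
    """
    words = parse_words(document, max_parsed_words)
    query = {w.lower() for w in query_words}
    hits = [i for i, w in enumerate(words) if w in query]
    if not hits:
        return ""
    start = max(hits[0] - context, 0)
    end = min(hits[0] + context + 1, len(words))
    return " ".join(words[start:end])
```

    One consequence visible in the sketch: a query term that first occurs
    after the cutoff cannot contribute to the headline, so the cap trades
    completeness for a bounded tokenization cost.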
    
    The current patch can be applied to 9.1rc1. It lacks changes to the
    documentation and test cases. I will add them if you folks agree on the
    functionality.
    
    -Sushant.