May 21st, 2012

We read in the introduction to Mr. Addad’s jGnoetry:

[…Q]uotation marks, parenthesis, and brackets, […] are tricky to handle in bigram generation systems because you can’t be guaranteed that an open-bracket will have a matching close-bracket.

No argument here. Both the opening and closing tokens of any pair are unlikely to fall within the short range of any interesting n-gram model.

But what if we modify the model to make it want to close off the pair?
That is, translate “want” into some sort of algorithmic rule that increases the chance of closing off the pair.

pair: any of a set of matched bracketing punctuation marks, e.g. {}, [], (), "", '', ``, etc.
token: any item in the set of matched pairs, e.g. {, ], “, ‘, `, etc.
open-token: any of the first elements in the set of pairs, e.g. {, [, (, ", ', `, etc.
close-token: any of the second elements in the set of pairs, e.g. }, ], ), ", ', `, etc.
active-open-token: the most recently inserted open-token that has not had its matched-close-token inserted.
matched-close-token: the second element of the pair corresponding to the active-open-token.

NOTE: these terms indulge in a bit of self-reference and need some clean-up.
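
As a rough illustration, the definitions above could be written down as a small lookup table. This is only a sketch: the names PAIRS, OPEN_TOKENS, CLOSE_TOKENS and matched_close_token are made up here, and the particular set of pairs is an assumption.

```python
# Hypothetical pair table: open-token -> close-token.
# The exact set of pairs is an assumption for illustration.
PAIRS = {"{": "}", "[": "]", "(": ")", '"': '"', "'": "'", "`": "`"}

OPEN_TOKENS = set(PAIRS)              # first element of each pair
CLOSE_TOKENS = set(PAIRS.values())    # second element of each pair
TOKENS = OPEN_TOKENS | CLOSE_TOKENS   # any item in the set of pairs


def matched_close_token(open_token):
    """Return the close-token that pairs with the given open-token."""
    return PAIRS[open_token]
```

Note that symmetric marks such as " and ' land in both OPEN_TOKENS and CLOSE_TOKENS, so the token alone cannot say whether it opens or closes; only the active-open-token can settle that.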

  • A new open-token can be inserted at any point.
  • On insertion of a new open-token,
    • If there is an active-open-token,
      • it will be considered the previous-active-open-token.
      • Its matched-close-token will be considered the previous-matched-close-token.
    • The new open-token will be considered the active-open-token.
    • The matched-close-token will correspond to the new active-open-token.
  • A close-token cannot be inserted unless there is an active-open-token.
  • An inserted close-token will be a matched-close-token.
  • Once an open-token is active, the chance of inserting its matched-close-token increases.
  • A matched-close-token can be inserted once its chance is greater than 0.
  • When the chance of insertion of the matched-close-token reaches 100%, it will be inserted.
  • When a matched-close-token is inserted,
    • The current matched-close-token is no longer considered active.
    • The current active-open-token is no longer considered active.
    • If there is a previous-active-open-token,
      • The previous-active-open-token will be considered the active-open-token.
      • The previous-matched-close-token will be considered the matched-close-token.

An obvious model for implementing the storage of current and previous matched-tokens is the stack.
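
A minimal sketch of that stack, assuming a fixed amount of chance is added for every token emitted while a pair is open. The class name PairCloser, the step parameter and the tick method are all hypothetical, and only the active pair's chance grows here, which is one reading of the rules.

```python
import random

# Assumed pair set, open-token -> close-token.
PAIRS = {"{": "}", "[": "]", "(": ")", '"': '"'}


class PairCloser:
    def __init__(self, step=0.25, rng=None):
        self.step = step  # hypothetical: chance added per emitted token
        self.rng = rng if rng is not None else random.Random()
        self.stack = []   # entries: [matched_close_token, chance]

    def open(self, open_token):
        """Push a new active-open-token; the previous one waits beneath it."""
        self.stack.append([PAIRS[open_token], 0.0])

    def tick(self):
        """Call once per emitted token; return a close-token to insert, or None."""
        if not self.stack:
            return None  # no active-open-token, so nothing may close
        top = self.stack[-1]
        top[1] = min(1.0, top[1] + self.step)
        # random() is in [0, 1), so a chance of 1.0 always closes.
        if self.rng.random() < top[1]:
            return self.stack.pop()[0]  # the previous pair becomes active again
        return None
```

Popping the stack is what restores the previous-active-open-token, so nested pairs close in the right order without any extra bookkeeping.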

Note that if the n-gram model tokenizes punctuation, an inserted close-token may or may not need to count toward the output length or modify the key used for the next lookup.

Big Question: WHY would anybody want to do this in the first place?

1 – For the sheer joy in solving a problem (although this is a sideways solution, as the perceived problem lies within the n-gram model, and my proposed solution is not via n-grams).
2 – Because proper bracketing makes the text more grammatically familiar and readable. This reduces the distance to the reader, increasing the likelihood of engagement. Or, put simply, the text is easier to read.
