WordSalad.TextMunger History


 
 
June 11, 2012, at 11:09 AM by OtherMichael -
Changed line 114 from:
!!!! Markov tokenizers
to:
!!!! [[#tokenizers]] Markov tokenizers
Changed line 130 from:
!!! matching brackets and other punctuation consideration
to:
!!! [[#matchingbrackets]] matching brackets and other punctuation consideration
Changed line 146 from:
!!! granularity
to:
!!! [[#granularity]] granularity
Changed line 154 from:
!!! GUI and editor
to:
!!! [[#gui]] GUI and editor
Changed line 213 from:
TODO: Translator sub-interface of the ITransformation (which are really rule-based pseudo-translators)
to:
TODO: Translator sub-interface of the [@ITransformation@] (which are really rule-based pseudo-translators)
Changed line 225 from:
!!!! Density
to:
!!!! [[#density]] Density
Changed lines 259-260 from:
ie, 5,5,4,7,3,5,15,17,14,15,25,24,12,13,11,12
to:
i.e., 5,5,4,7,3,5,15,17,14,15,25,24,12,13,11,12
Changed line 319 from:
!!! Invisible Literature
to:
!!! [[#invisiblelit]] Invisible Literature
Changed line 322 from:
!!! Scraping Project Gutenberg
to:
!!! [[#scrapegutenberg]] Scraping Project Gutenberg
Changed lines 334-335 from:
!!! Scraping wikipedia
to:
!!! [[#scrapewikipedia]] Scraping wikipedia
Changed line 342 from:
!!! Scraping pmwiki (this site)
to:
!!! [[#scrapepmwiki]] Scraping pmwiki (this site)
Changed line 360 from:
!!! Library
to:
!!! [[#library]] Library
Changed line 417 from:
!! Documentation/PR
to:
!! [[#documentation]] Documentation/PR
 
 
June 11, 2012, at 11:05 AM by OtherMichael -
Added line 402:
Programming/Tech articles, such as [[http://javascript.crockford.com/|Douglas Crockford on JavaScript]]
 
 
June 11, 2012, at 09:12 AM by OtherMichael - cryptome as source
Added lines 399-401:

[[http://cryptome.org/|Cryptome.org]] - I have a note from 2007 saying to use this as source material. Sheesh, I need to catch up on my TODO lists....

 
 
June 04, 2012, at 09:07 AM by OtherMichael -
Changed line 432 from:
!!! Dada Engine
to:
!!! [[#dada]] Dada Engine
Changed line 437 from:
!!! Waffle Generator
to:
!!! [[#waffle]] Waffle Generator
Changed line 440 from:
!!! rmutt
to:
!!! [[#rmutt]] rmutt
Changed line 443 from:
!!! Infinite Monkeys
to:
!!! [[#infinitemonkeys]] Infinite Monkeys
Changed line 449 from:
!!! GTR Language Workbench
to:
!!! [[#gtr]] GTR Language Workbench
Changed line 454 from:
!!! reading and writing electronic text
to:
!!! [[#rwetext]] reading and writing electronic text
Changed line 458 from:
!!! [=JanusNode=]
to:
!!! [[#janus]] [=JanusNode=]
Changed line 464 from:
!!! Web-Based (javascript) Computer Poetry Generation Programs
to:
!!! [[#webjs]] Web-Based (javascript) Computer Poetry Generation Programs
Changed line 468 from:
!!! Gibberizer
to:
!!! [[#gibber]] Gibberizer
Changed line 473 from:
!!! various tools
to:
!!! [[#various]] various tools
Changed line 479 from:
!!! Other broad avenues of research
to:
!!! [[#otherresearch]] Other broad avenues of research
Changed line 503 from:
!!! auto-imported generated wiki pages
to:
!!! [[#wikiimport]] auto-imported generated wiki pages
Changed line 514 from:
!! See Also
to:
!! [[#seeAlso]] See Also
 
 
May 23, 2012, at 09:25 AM by OtherMichael - notes on storing n-gram models externally
Added line 34:
Changed lines 36-38 from:
   Willard: They told me that you had gone totally insane, and that your methods were unsound.
    Kurtz: Are my methods unsound?
    Willard: I don't see any method at all, sir.
to:
Willard: They told me that you had gone totally insane, and that your methods were unsound.

Kurtz: Are my methods unsound?

Willard: I don't see any method at all, sir.

->([[https://en.wikiquote.org/wiki/Apocalypse_Now#Dialogue|from Apocalypse Now]])
Changed lines 44-45 from:
->([[https://en.wikiquote.org/wiki/Apocalypse_Now#Dialogue|source]])
to:

Added lines 65-67:
Store the n-gram model externally, although not necessarily [[http://www.monlp.com/2012/05/17/using-berkeleydb-to-create-a-large-n-gram-table/|in a database]]. This would allow for (re)generation from a large-corpus without length re-processing, continue-reprocessing at a given point (i.e., start from an arbitrary seed within an extant output), etc.

Changed line 100 from:
[http://blog.figmentengine.com/2008/10/markov-chain-code.html]]
to:
[[http://blog.figmentengine.com/2008/10/markov-chain-code.html]]
Changed lines 120-121 from:
The source algorithm was modified, as it stored tokens in a string, using a space [@" "@] as token-delimeter. I've replaced the space with a non-printing control-character, and eliminated other uses of the space in the output.
to:
The source algorithm was modified, as it stored tokens in a string, using a space [@" "@] as token-delimiter. I've replaced the space with a non-printing control-character, and eliminated other uses of the space in the output.
Added lines 128-134:


!!! matching brackets and other punctuation consideration
see Interference:2012/05/21/punctuation-art/
[[https://docs.google.com/viewer?a=v&q=cache:gMQaTui1oRkJ:wing.comp.nus.edu.sg/~luwei/publications/emnlp10ln.ppt|Better Punctuation Prediction with Dynamic Conditional Random Fields]]
[[http://www.comp.nus.edu.sg/~nght/pubs/emnlp10_punct.pdf]]
http://english.stackexchange.com/questions/55855/what-rules-determine-the-apostrophe-placement-in-ham-n-eggs-and-similar-expre
 
 
May 21, 2012, at 10:22 AM by OtherMichael -
Added line 510:
[[Interference:tag/textmunger/|Interference Pattern posts tagged ''textmunger'']]
 
 
May 21, 2012, at 10:19 AM by OtherMichael - major (?) re-org of layout. prior-art pulled into a section not under Core-code
Changed line 24 from:
!! no more line ears
to:
!! [[#nonlinear]] no more line ears
Changed line 32 from:
!! Hermetic Encoder
to:
!! [[#hermetic]] Hermetic Encoder
Changed line 46 from:
!! Crazy Thoughts
to:
!! [[#thoughts]] Crazy Thoughts
Changed line 60 from:
!! Core code
to:
!! [[#core]] Core code
Changed line 72 from:
!!! Markov processor
to:
!!! [[#markov]] Markov processor
Deleted lines 120-206:
!!! similar projects to investigate / Prior Art
!!!! [=SCIgen=] - An Automatic CS Paper Generator
http://pdos.csail.mit.edu/scigen/

I forgot about this -- and the code is available!

!!!! Dada Engine
http://dev.null.org/dadaengine/
http://dev.null.org/dadaengine/manual-1.0/dada_toc.html
[[http://herbert.the-little-red-haired-girl.org/en/dada/index.html|Dada Engine web interface]]

!!!! Waffle Generator
http://www.simple-talk.com/dotnet/.net-tools/the-waffle-generator/

!!!! rmutt
[[http://sourceforge.net/scm/?type=svn&group_id=251485|rmutt]] http://www.schneertz.com/rmutt/

!!!! Infinite Monkeys
[[https://code.google.com/p/infinitemonkeys/|InfiniteMonkeys]] - ''is an open source random poetry generator written in [=FreeBASIC.=] It is largely considered the Industry Standard in SPAM generation.'' No variables, no loops.
https://gnoetrydaily.wordpress.com

See also : Wikipedia:Snowclone for some interesting script ideas

!!!! GTR Language Workbench
http://web.njit.edu/~newrev/3.0//workbench/Workbench.html
wait. is that what I'm trying to build, here?!??!
It's a huge program, because it's built around Eclipse. yipes.

!!!! reading and writing electronic text
http://www.decontextualize.com/teaching/rwet/
Notes from a course; code in Python.

!!!! [=JanusNode=]
http://janusnode.com/
[[https://gnoetrydaily.wordpress.com/2010/06/13/other-tools-ee-wittgenstein-with-janusnode/|review of JN @ GnoetryDaily]]

NO SOURCE AVAILABLE

!!!! Web-Based (javascript) Computer Poetry Generation Programs
http://sourceforge.net/projects/poetrygen/
->by [[WordSalad.AntonioRoque|Edde Addad]]

!!!! Gibberizer
[[https://code.google.com/p/gibberizer/|the Gibberizer]] is on the JRE
Has some good documentation, compared to other projects.


!!!! various tools
https://gnoetrydaily.wordpress.com/2011/08/02/other-tools-gtr-language-workbench-rhyming-robot-seuss-infinite-monkeys/
I've just gotten in-touch with some of the gnoetry-daily guys.
http://www.cutnmix.com/


!!!! Other broad avenues of research
http://stackoverflow.com/questions/1670867/libraries-or-tools-for-generating-random-but-realistic-text
->http://wordnet.princeton.edu/
http://stackoverflow.com/search?q=text%20generator
http://homepages.inf.ed.ac.uk/jbos/comsem/
http://www.statmt.org/
[[http://www.planet-source-code.com/vb/scripts/ShowCode.asp?txtCodeId=6285&lngWId=2|poetry generator applet (java)]]

what??? http://www.nictoglobe.com/new/notities/text.list.html

[[http://bionicspirit.com/blog/2012/02/09/howto-build-naive-bayes-classifier.html|How To Build a Naive Bayes Classifier]] - with some discussion of "stop words", a link to one catalog, and ideas for pulling out of Gutenberg texts.

http://apocryph.org/2006/06/23/weekend_project_parody_generator_using_rss_pos_tagging_markov_text_generation/
https://sites.google.com/site/texttotext2011/#data
https://en.wikipedia.org/wiki/Natural_language_generation
http://apocryph.org/tag/markov/

[[http://www.perlmonks.org/?node=Acme%3A%3ATranslator|pseudo-translation (in Perl)]] - like the Shizzolator.
[[http://speeves.erikin.com/2007/01/perl-random-string-generator.html|random characters]]
[[http://www.ruf.rice.edu/~pound/|Chris Pound's language machines]]
[[http://saizai.livejournal.com/657391.html|Non-Linear Fully Two-Dimensional Writing System Design]] - some interesting ideas, well articulated. doesn't like the grid. the small example shown is still more linear than I prefer. '''update:''' on reading a follow-up, seems like this is more of designing a constructed-language writing system. You know, writing, as concept. not writing writing. But, some interesting articulations on non-linearity in there....

[[http://netpoetic.com/2010/10/interactive-poetry-generation-systems-an-illustrated-overview/|Interactive Poetry Generation Systems - an illustrated overview]] - several projects covered above. As the title indicates, an nicely illustrated overview. Good looking GUI ideas....

!!!! auto-imported generated wiki pages
http://www.pmwiki.org/wiki/Cookbook/ImportText

Hrm. That could be.... interesting....

Pushing output of the app back into the wiki (my website, and XRML home)

sounds more like a standalone app that the output is pushed into, though....

Added lines 408-495:



!! [[#priorart]] similar projects to investigate / Prior Art
!!! [=SCIgen=] - An Automatic CS Paper Generator
http://pdos.csail.mit.edu/scigen/

I forgot about this -- and the code is available!

!!! Dada Engine
http://dev.null.org/dadaengine/
http://dev.null.org/dadaengine/manual-1.0/dada_toc.html
[[http://herbert.the-little-red-haired-girl.org/en/dada/index.html|Dada Engine web interface]]

!!! Waffle Generator
http://www.simple-talk.com/dotnet/.net-tools/the-waffle-generator/

!!! rmutt
[[http://sourceforge.net/scm/?type=svn&group_id=251485|rmutt]] http://www.schneertz.com/rmutt/

!!! Infinite Monkeys
[[https://code.google.com/p/infinitemonkeys/|InfiniteMonkeys]] - ''is an open source random poetry generator written in [=FreeBASIC.=] It is largely considered the Industry Standard in SPAM generation.'' No variables, no loops.
https://gnoetrydaily.wordpress.com

See also : Wikipedia:Snowclone for some interesting script ideas

!!! GTR Language Workbench
http://web.njit.edu/~newrev/3.0//workbench/Workbench.html
wait. is that what I'm trying to build, here?!??!
It's a huge program, because it's built around Eclipse. yipes.

!!! reading and writing electronic text
http://www.decontextualize.com/teaching/rwet/
Notes from a course; code in Python.

!!! [=JanusNode=]
http://janusnode.com/
[[https://gnoetrydaily.wordpress.com/2010/06/13/other-tools-ee-wittgenstein-with-janusnode/|review of JN @ GnoetryDaily]]

NO SOURCE AVAILABLE

!!! Web-Based (javascript) Computer Poetry Generation Programs
See [[WordSalad/AntonioRoque#software|software by EddeAddad]]


!!! Gibberizer
[[https://code.google.com/p/gibberizer/|the Gibberizer]] is on the JRE
Has some good documentation, compared to other projects.


!!! various tools
https://gnoetrydaily.wordpress.com/2011/08/02/other-tools-gtr-language-workbench-rhyming-robot-seuss-infinite-monkeys/
I've just gotten in-touch with some of the gnoetry-daily guys.
http://www.cutnmix.com/


!!! Other broad avenues of research
http://stackoverflow.com/questions/1670867/libraries-or-tools-for-generating-random-but-realistic-text
->http://wordnet.princeton.edu/
http://stackoverflow.com/search?q=text%20generator
http://homepages.inf.ed.ac.uk/jbos/comsem/
http://www.statmt.org/
[[http://www.planet-source-code.com/vb/scripts/ShowCode.asp?txtCodeId=6285&lngWId=2|poetry generator applet (java)]]

what??? http://www.nictoglobe.com/new/notities/text.list.html

[[http://bionicspirit.com/blog/2012/02/09/howto-build-naive-bayes-classifier.html|How To Build a Naive Bayes Classifier]] - with some discussion of "stop words", a link to one catalog, and ideas for pulling out of Gutenberg texts.

http://apocryph.org/2006/06/23/weekend_project_parody_generator_using_rss_pos_tagging_markov_text_generation/
https://sites.google.com/site/texttotext2011/#data
https://en.wikipedia.org/wiki/Natural_language_generation
http://apocryph.org/tag/markov/

[[http://www.perlmonks.org/?node=Acme%3A%3ATranslator|pseudo-translation (in Perl)]] - like the Shizzolator.
[[http://speeves.erikin.com/2007/01/perl-random-string-generator.html|random characters]]
[[http://www.ruf.rice.edu/~pound/|Chris Pound's language machines]]
[[http://saizai.livejournal.com/657391.html|Non-Linear Fully Two-Dimensional Writing System Design]] - some interesting ideas, well articulated. doesn't like the grid. the small example shown is still more linear than I prefer. '''update:''' on reading a follow-up, seems like this is more of designing a constructed-language writing system. You know, writing, as concept. not writing writing. But, some interesting articulations on non-linearity in there....

[[http://netpoetic.com/2010/10/interactive-poetry-generation-systems-an-illustrated-overview/|Interactive Poetry Generation Systems - an illustrated overview]] - several projects covered above. As the title indicates, a nicely illustrated overview. Good looking GUI ideas....

!!! auto-imported generated wiki pages
http://www.pmwiki.org/wiki/Cookbook/ImportText

Hrm. That could be.... interesting....

Pushing output of the app back into the wiki (my website, and XRML home)

sounds more like a standalone app that the output is pushed into, though....
 
 
May 17, 2012, at 09:01 AM by OtherMichael -
Changed line 161 from:
to:
->by [[WordSalad.AntonioRoque|Edde Addad]]
 
 
May 08, 2012, at 10:41 PM by OtherMichael -
Deleted line 47:
for XRML, need to split into the grid-size
 
 
May 08, 2012, at 01:27 PM by OtherMichael - or just use a rich-text-box
Added lines 377-380:

!!!! [[#richTextBox]] Dissenting Voice
co-worker [[http://www.jonlangdon.com/|Jon Langdon]] suggested just using a [@RichTextBox@] control. Click point can be found, word-beginning and end determined (if not selected), and background-color modified to show that it has been selected.
Probably much easier to integrate than re-tooling everything for HTML output
 
 
May 08, 2012, at 01:12 PM by OtherMichael - web interactivity?
Changed line 257 from:
!!! potential transform rules
to:
!!! [[#transformations]] potential transform rules
Changed lines 357-358 from:

!! Getting Source material
to:
!!! [[#interactive]] Interactivity
Some think that [[http://quod.lib.umich.edu/cgi/t/text/text-idx?c=jep;view=text;rgn=main;idno=3336451.0014.209|interactivity is a highlight of contemporary epoetry/e-text generation.]]

I'm not sure where I stand on that, but more interactivity would be nice.
See [[http://www.eddeaddad.net/jGnoetry/|jGnoetry]] for an example.

not sure how the method would be used, but the current implementation would not support it, at this point.
WebRendering is great for allowing text to be "object" with styling, links, etc.
Text controls in C#, not so much.

So, what about embedding a web-renderer inside of a Winform?
!!!! linkdump
http://stackoverflow.com/questions/26147/is-it-possible-to-embed-gecko-or-webkit-in-a-windows-form-just-like-a-webview
http://stackoverflow.com/questions/790542/replacing-net-webbrowser-control-with-a-better-browser-like-chrome
http://stackoverflow.com/questions/tagged/webbrowser-control
http://stackoverflow.com/questions/153748/how-to-inject-javascript-in-webbrowser-control
http://stackoverflow.com/questions/1218325/best-way-to-render-html-in-winforms-application
http://www.codeproject.com/Articles/101403/Show-Dynamic-HTML-in-WinForm-Applications
http://www.codeproject.com/Articles/32376/A-Professional-HTML-Renderer-You-Will-Use
http://mono-project.com/WebBrowser


!! [[#sourcMaterial]] Getting Source material
Changed line 489 from:
The "Shannonier" directly links its pedigree to Claude Shannon's information theory, and boats a suite of "editors" that, as far as I can tell, pre-seeded markov banks. But the coder ''sells' it so well....
to:
The "Shannonizer" directly links its pedigree to Claude Shannon's information theory, and boasts a suite of "editors" that, as far as I can tell, are pre-seeded markov banks. But the coder ''sells'' it so well....
 
 
April 16, 2012, at 11:04 PM by OtherMichael -
Added lines 258-259:
re-emvowell
->[[https://github.com/darius/languagetoys/blob/C/emvowel.py|python script]]
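The linked emvowel.py is someone else's script; as a stand-in illustration only (my own hypothetical sketch in Python, not the project's C#), the disemvowel/re-emvowel pair can be read as: strip the vowels out, or swap each vowel for a randomly chosen one.

```python
import random
import re

VOWELS = "aeiou"

def disemvowel(text):
    """Drop every vowel: 'hello world' -> 'hll wrld'."""
    return re.sub(r"[aeiouAEIOU]", "", text)

def re_emvowel(text, rng=random):
    """Swap each vowel for a random one, keeping length and consonants intact."""
    return "".join(rng.choice(VOWELS) if c.lower() in VOWELS else c
                   for c in text)
```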
 
 
April 16, 2012, at 02:22 PM by OtherMichael - docs and pr
Added lines 459-468:
!! Documentation/PR
Other than the jumble above, and a roadmap on the google-code-wiki, TM has no documentation.
Other high-powered projects I've looked at have scattershot documentation, or are only available in PDF form.
And then, some projects have TONS of documentation:

http://www.cutnmix.com/esoteric/finnegans_wake.html - this is a great promo page for the tool, that basically "rebrands" the idea of character n-grams.

The "Shannonier" directly links its pedigree to Claude Shannon's information theory, and boats a suite of "editors" that, as far as I can tell, pre-seeded markov banks. But the coder ''sells' it so well....

[[https://code.google.com/p/gibberizer/|The Gibberizer]] has a ton of documentation. I haven't actually run the thing yet, so I can't say what the doc:performance ratio is.
 
 
April 16, 2012, at 02:16 PM by OtherMichael -
Changed line 122 from:
!!! similar projects to investigate
to:
!!! similar projects to investigate / Prior Art
Added lines 172-173:
http://www.cutnmix.com/
 
 
April 16, 2012, at 11:53 AM by OtherMichael - notes on sources, also lyrics and scraperwiki
Added lines 432-438:
[[http://www.asklyrics.com/topLyrics.php|lyrics]] - scrape links from this page, for an online sourcer?
->for all online sourcers, I suggest caching content locally

http://archive.org/details/texts
https://scraperwiki.com/

Changed line 457 from:
http://archive.org/details/texts
to:
 
 
April 10, 2012, at 12:22 PM by OtherMichael -
Added lines 256-258:
letter-position shifter, with first and last letters intact
->[[http://www.rubyquiz.com/quiz76.html|ruby quiz]] where this is referred to as a "Text Munger".
[[http://dharmadevil.com/mungeutf8.html|various "mungers"]]
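That Ruby-quiz "Text Munger" (shuffle a word's interior letters, keep the first and last in place) might look like this in Python; `munge_word`/`munge_text` are my names, not the project's:

```python
import random
import re

def munge_word(word, rng=random):
    """Shuffle a word's interior letters, leaving first and last letters intact."""
    if len(word) <= 3:
        return word  # nothing to shuffle
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

def munge_text(text, rng=random):
    """Apply munge_word to each alphabetic run, preserving punctuation and spacing."""
    return re.sub(r"[A-Za-z]+", lambda m: munge_word(m.group(0), rng), text)
```

Readability survives surprisingly well, which is the whole point of the original quiz.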
 
 
April 10, 2012, at 10:02 AM by OtherMichael - added gibberizer
Added lines 162-167:


!!!! Gibberizer
[[https://code.google.com/p/gibberizer/|the Gibberizer]] is on the JRE
Has some good documentation, compared to other projects.

 
 
April 04, 2012, at 09:04 AM by OtherMichael -
Changed lines 4-5 from:
I'm (finally) building a (c#) application to do a variety of processing on a variety of inputs.
to:
I'm (finally) building a (c#) application to do a variety of processing on a variety of inputs. 
Added lines 21-23:
The source is hosted online at [[https://code.google.com/p/text-munger/]].

Changed lines 225-228 from:
The GUI is not yet operable, and has a lot of rabbit-holes to run-down, {-but I hope to get the code online and under version-control in the next couple of days (2012.03.06).-}
Initial project pages and SVN commit @ https://code.google.com/p/text-munger/
Hopefully, the GUI will be roughly useable within a week.
UPDATE: (2012.03.15) it's useable, and been refactored a bit into tabs.
to:

The GUI is operable, if funny-looking.
 
 
April 02, 2012, at 09:07 AM by OtherMichael -
Added lines 435-436:
THE CONTENTS OF MY SPAM FOLDER. How could I have ignored this trove for so long?
Added lines 438-440:


http://archive.org/details/texts
 
 
March 28, 2012, at 09:50 AM by OtherMichael - some links for sources (BCP, POMO), and more notes on encoding meaning and Markov shortcomings
Changed lines 6-11 from:
Oh, yeah, there are [[WordSalad/Generators|plenty of those about.]]

I wanted one of my
own, with the ability to set up a set of inputs, apply my own twiddling to the output, and some other things.

Instead of taking only from a defined source of text -- ie, file or textarea, I want to be able to pull from a number of online resources.
to:
Oh, yeah, there are [[WordSalad/Generators|plenty of those about,]] but I want one of my own, with the ability to set up a set of inputs, apply my own twiddling to the output, and some other things.

Instead of taking only from a defined source of text -- i.e., file or text-area, I want to be able to pull from a number of online resources.
Added line 20:
Added lines 79-80:
And therein lies the [w|r]ub. If I want an algorithm to encode ''meaning'' there has to be some meaning behind it. A statistically-significant-series [of words] doesn't mean much. And it only produces other series. To get something planar, with levels of reference, there has to be something beyond word-sequence. There has to be some analysis of the words, and working from that to related words/concepts. Which is the Big Bugbear of NLP, isn't it? We're getting close to AI territory. I don't want to go there, but ''my'' goal is more complicated than a linear Markov series.
Added lines 424-426:
[[http://www.streettech.com/bcp/BCPtext/manTOC.html|Beyond Cyberpunk Manifestos (TOC)]]
Random POMO stuff (like the [[http://www.streettech.com/bcp/BCPtext/Manifestos/PanicEncyclopedia.html|Panic Encyclopedia]])

Added line 448:
WordSalad.ElectroPoetics
 
 
March 26, 2012, at 03:07 PM by OtherMichael -
Added line 2:
(:*toc-float:)
 
 
March 25, 2012, at 12:33 PM by OtherMichael -
Added line 249:
-> suggested by [[http://ex-ex-lit.blogspot.com/2012/03/pome-billy-bob-beamer_24.html|this pome]]
 
 
March 25, 2012, at 12:31 PM by OtherMichael -
Added lines 247-248:
re-spacing - take existing word breaks (spaces, punctuation) and move them about.
->"this is a text" := "thi sisat ext"
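A hypothetical sketch of that re-spacing rule (Python, though the project itself is C#): keep the character stream in order, discard the original break positions, and deal the same number of spaces back out at random.

```python
import random

def respace(text, rng=random):
    """Move existing spaces to new random positions, keeping letters in order."""
    letters = [c for c in text if c != " "]
    n_spaces = len(text) - len(letters)
    # choose new break positions between letters (no leading/trailing space)
    positions = set(rng.sample(range(1, len(letters)), n_spaces))
    out = []
    for i, c in enumerate(letters):
        if i in positions:
            out.append(" ")
        out.append(c)
    return "".join(out)
```

So "this is a text" might come back as "thi sisat ext" — same letters, same number of breaks, different cuts.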
Deleted line 249:
replace letters/vowels with punctuation or other mark -- "x" or "-"
Changed line 334 from:
to:
replace letters/vowels with punctuation or other mark -- "x" or "-"
 
 
March 16, 2012, at 01:55 PM by OtherMichael - gooey thoughts
Added lines 183-184:

[[http://netpoetic.com/2010/10/interactive-poetry-generation-systems-an-illustrated-overview/|Interactive Poetry Generation Systems - an illustrated overview]] - several projects covered above. As the title indicates, an nicely illustrated overview. Good looking GUI ideas....
 
 
March 16, 2012, at 12:39 PM by OtherMichael -
Changed line 217 from:
As one of the many rabbit-holes I've chased (am chasing) down on this project, I ended up having to create a [[https://code.google.com/p/winforms-custom-select-control/|winforms custom-control]]
to:
As one of the many [[WordSalad/RabbitWholes|rabbit-holes]] I've chased (am chasing) down on this project, I ended up having to create a [[https://code.google.com/p/winforms-custom-select-control/|winforms custom-control]]
 
 
March 15, 2012, at 05:01 PM by OtherMichael - searching for Cowboy Fischer
Changed lines 411-412 from:
[[http://www.gutenberg.org/wiki/Western_%28Bookshelf%29|westerns]]
to:
[[http://www.gutenberg.org/wiki/Western_%28Bookshelf%29|western bookshelf]] - which is surprisingly sparse, compared to actually searching for Westerns
[[http://www.gutenberg.org/ebooks/search.html/?default_prefix=subject_id&sort_order=downloads&query=293|actually searching for Westerns]]

Webapps: [[http://webapps.stackexchange.com/questions/12311/how-to-download-all-english-books-from-gutenberg|How do I download all English-language books from Project Gutenberg?]]
 
 
March 15, 2012, at 04:51 PM by OtherMichael -
Changed line 409 from:
[[http://www.gutenberg.org/ebooks/5740|Tractatus Logico-Philosophicus by Ludwig Wittgenstein]] - unfortunately, only in PDF or TEX
to:
[[http://www.gutenberg.org/ebooks/5740|Tractatus Logico-Philosophicus by Ludwig Wittgenstein]] - unfortunately, only in PDF or TEX. Which makes sense for all of the logical equations.
Added line 411:
[[http://www.gutenberg.org/wiki/Western_%28Bookshelf%29|westerns]]
 
 
March 15, 2012, at 11:41 AM by OtherMichael - fixed some markup
Changed lines 88-92 from:
http://blog.figmentengine.com/2008/10/markov-chain-code.html
http://phalanx.spartansoft.org/2010/03/30/markov-chain-generator-in-c/
http://2kittymafiasoftware.blogspot.com/2011/03/pseudo-random-tex-generator-using.html
https://github.com/pjbss/Pseudo-Random-Text-Generator/blob/master/PseudoRandomTextGenerator/TextGenerator.cs
to:
[http://blog.figmentengine.com/2008/10/markov-chain-code.html]]
[[http://phalanx.spartansoft.org/2010/03/30/markov-chain-generator-in-c/]]
[[http://2kittymafiasoftware.blogspot.com/2011/03/pseudo-random-tex-generator-using.html]]
[[https://github.com/pjbss/Pseudo-Random-Text-Generator/blob/master/PseudoRandomTextGenerator/TextGenerator.cs]]
Changed lines 98-99 from:
However, the big thing for me is parameterizing/semi-automating the source input, editing the output, and programmatically editing the output -- ie, First word is capitalized, sentences, paragraphs, etc.
to:
However, the big thing for me is parameterizing/semi-automating the source input, editing the output, and programmatically pro-grammatically editing the output -- ie, First word is capitalized, sentences, paragraphs, etc. [if the target is not [[XraysMonaLisa|XRML]]]
Changed line 103 from:
Breaking apart the sources (texts) into tokens is, for me at least, not a simple issue. Lots of tokenizers discard punctuation and whitespace. My own quirks mean, that unless I'm trying to disparage Python, I'm interested in whitespace and punctation as semantic elements. However, I'm not positive wether they should be considered as a block -- ie, [@{",-;....@}] or broken into pieces [@{ {"}, {,}, {-}, {;}, {.}, {.}, {.}, {.}, {@} }].
to:
Breaking apart the sources (texts) into tokens is, for me at least, not a simple issue. Lots of tokenizers discard punctuation and whitespace. My own quirks mean that, unless I'm trying to disparage Python, I'm interested in whitespace and punctuation as semantic elements. However, I'm not positive whether they should be considered as a block -- ie, [@{",-;....@}@] or broken into pieces [@{ {"}, {,}, {-}, {;}, {.}, {.}, {.}, {.}, {@} }@].
Changed line 113 from:
->http://nlpdotnet.com/SampleCode/ImproveGreedyTokenizer.aspx
to:
->[[http://nlpdotnet.com/SampleCode/ImproveGreedyTokenizer.aspx]]
 
 
March 15, 2012, at 10:55 AM by OtherMichael -
Changed lines 1-2 from:
(:description notes notes notes :)
to:
(:description notes stones etons snote:)
Changed lines 55-56 from:

to:
But it would allow programmatic-scripting of.... something.

Changed lines 102-103 from:

!!! [=SCIgen=] - An Automatic CS Paper Generator
to:
!!!! Markov tokenizers
Breaking apart the sources (texts) into tokens is, for me at least, not a simple issue. Lots of tokenizers discard punctuation and whitespace. My own quirks mean, that unless I'm trying to disparage Python, I'm interested in whitespace and punctation as semantic elements. However, I'm not positive wether they should be considered as a block -- ie, [@{",-;....@}] or broken into pieces [@{ {"}, {,}, {-}, {;}, {.}, {.}, {.}, {.}, {@} }].
Plus, sometimes I like processing on the character-level, instead of the word level. Sentence-level would seem to be a non-starter for generating a "new" text as repeating sentences are not common. In most texts. (Even Gertrude Stein. Right? Hrm....)

So, My Markov transformer takes a tokenizer as a parameter, along with some other rules.

The source algorithm was modified, as it stored tokens in a string, using a space [@" "@] as token-delimeter. I've replaced the space with a non-printing control-character, and eliminated other uses of the space in the output.

I also want to revisit the storage model, as I've hand-built my own Markov-generators in the past using real data-structures. String concatenation seems slow, but it might be using a string-builder for all I remember at the moment...

I need to continue look into how I'm building my tokenizers.
->http://nlpdotnet.com/SampleCode/ImproveGreedyTokenizer.aspx

NOTE: Word-based rule-application relies upon breaking apart the source-text into words. This is currently discrete code, and puts things back together with spaces, with less-than-perfect results. The same tokenizers and combiners [?!?!] should be used
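To make those notes concrete, here is a hypothetical sketch (Python, not the project's C#) of both ideas at once: the Markov transformer takes a tokenizer as a parameter (word-level with punctuation kept as tokens, or character-level), and the model stores follower lists in a real data-structure rather than a delimited string, so no space-or-control-character sentinel is needed at all.

```python
import random
import re

def word_tokens(text):
    """Words and punctuation as separate tokens (punctuation kept as semantic elements)."""
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokens(text):
    """Character-level granularity."""
    return list(text)

class Markov:
    def __init__(self, tokenize=word_tokens):
        self.tokenize = tokenize
        self.followers = {}  # token -> list of tokens seen after it

    def feed(self, text):
        toks = self.tokenize(text)
        for a, b in zip(toks, toks[1:]):
            self.followers.setdefault(a, []).append(b)

    def walk(self, start, length, rng=random):
        """Random walk over the follower table from a starting token."""
        out = [start]
        for _ in range(length):
            nxt = self.followers.get(out[-1])
            if not nxt:
                break
            out.append(rng.choice(nxt))
        return out
```

Recombining the tokens (the "combiners" question) then becomes a separate, symmetric concern: the same tokenizer that split the text should dictate how the walk is joined back together.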

!!! similar projects to investigate
!!!! [=SCIgen=] - An Automatic CS Paper Generator
Changed line 123 from:
!!! Dada Engine
to:
!!!! Dada Engine
Changed line 128 from:
!!! Waffle Generator
to:
!!!! Waffle Generator
Changed line 131 from:
!!! rmutt
to:
!!!! rmutt
Changed lines 134-135 from:
!!! Infinite Monkeys
[[https://code.google.com/p/infinitemonkeys/|InfiniteMonkeys]] - ''is an open source random poetry generator written in [=FreeBASIC.=] It is largely considered the Industry Standard in SPAM generation.''
to:
!!!! Infinite Monkeys
[[https://code.google.com/p/infinitemonkeys/|InfiniteMonkeys]] - ''is an open source random poetry generator written in [=FreeBASIC.=] It is largely considered the Industry Standard in SPAM generation.'' No variables, no loops.
Changed line 140 from:
!!! GTR Language Workbench
to:
!!!! GTR Language Workbench
Changed line 145 from:
!!! reading and writing electronic text
to:
!!!! reading and writing electronic text
Changed line 149 from:
!!! [=JanusNode=]
to:
!!!! [=JanusNode=]
Changed line 155 from:
!!! Web-Based (javascript) Computer Poetry Generation Programs
to:
!!!! Web-Based (javascript) Computer Poetry Generation Programs
Changed line 158 from:
!!! various tools
to:
!!!! various tools
Changed line 162 from:
!!! broad avenues of research
to:
!!!! Other broad avenues of research
Changed line 184 from:
!!! auto-imported generated wiki pages
to:
!!!! auto-imported generated wiki pages
Changed lines 191-193 from:


to:
sounds more like a standalone app that the output is pushed into, though....

Changed lines 207-208 from:

to:
The current crop of Transformers/Rules are considered "All", "Sentence" or "Word" -level granularity. Practically speaking, only "All" and "Word" -level rule application exists, so no real sentence-chunk rules exist. I've posited some other levels, but none have been implemented.
A rethink of the granularity implementation is required -- a given rule probably hits a RANGE of granularity -- ie, Pig-Latin works on the word-level only, but Reverse and Random-Caps can work on any level (they're almost pointless on a char-level, though). Markov has no application on char-level, and can only work on word-level if a character-level tokenizer is used. And even then results are almost random for words with little to no repetition.

See Also: tokenizers, above.
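That "range of granularity" rethink, sketched as data (hypothetical Python; the level names and rules are mine, not TextMunger's): each rule declares the levels it can sensibly run at, and the dispatcher refuses mismatches like char-level Pig-Latin.

```python
CHAR, WORD, SENTENCE, ALL = "char", "word", "sentence", "all"

# each rule: (callable, set of granularity levels it supports)
RULES = {
    "reverse": (lambda s: s[::-1], {CHAR, WORD, SENTENCE, ALL}),
    "pig_latin": (lambda w: w[1:] + w[0] + "ay", {WORD}),
}

def apply_rule(name, text, level):
    """Apply a named rule, but only at a granularity it declares support for."""
    func, levels = RULES[name]
    if level not in levels:
        raise ValueError(f"rule {name!r} does not apply at {level!r} level")
    return func(text)
```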

Changed lines 221-222 from:

to:
UPDATE: (2012.03.15) it's useable, and been refactored a bit into tabs.

Added line 389:
TODO: At some point, bundle up some source-texts and provide them as a download on the google-code page.
 
 
March 15, 2012, at 09:50 AM by OtherMichael - rant incleded
Changed lines 21-28 from:
to:
!! no more line ears
I'm sick of linearity, word-follows-word, line-follows-line, left-to-right end-of-line DING zzzzzzzzzip back to the start and next-line-again until we get to the marginalized page and turn to next page which follows the previous page.

Where's the up? Where's the down? Where's the round-the-world, "I don't think we're in flatland-anymore, Toto!" Now, while XRML _looks_ like flatland, it wants to be the technicolor escape from the black-and-white of linear-land when you look at the history of edits. Which doesn't exist in any visually accessible form. Each page/the-whole is part of a text river that changes -- it's not a linear textile, it's a stacked set of frames, that the camera can pan through and focus on whichever plane makes the most sense, with the foreground and background panes adding context. Thanks, Walt Disney.

THIS is the end-goal of TextMunger. To be a planar text-editor. That's a long way down the yellow-brick-road. But I've got the ruby[^silver in the book^] slippers, and I've taken the first steps....

Changed lines 414-416 from:
WordSalad.ElectroText
to:
WordSalad.ElectroText

[^#^]
 
 
March 15, 2012, at 09:16 AM by OtherMichael - rhyme time
Added lines 245-249:

Rhyming?
->archive.org's [[http://www.archive.org/details/compactrhymingdi00benniala|Compact Rhyming Dictionary]] looks to be a dodgy OCR
->archive.org's [[http://www.archive.org/stream/rhymingdictionar00walk/rhymingdictionar00walk_djvu.txt|Walkers rhyming Dictionary]] also has dodgy OCR
TODO: find a better file
 
 
March 08, 2012, at 11:37 AM by OtherMichael - refactoring invisible literature
Changed lines 316-325 from:
Wikileaks? http://wikileaks.org/gifiles/
http://invisibleliterature.com/?55427e60
http://williamgibsonboard.com/eve/forums/a/tpc/f/8606097971/m/8946093202
http://www2.iath.virginia.edu/elab//hfl0093.html
http://rodcorp.typepad.com/rodcorp/2006/04/invisible_index.html
>>clip lrindent<<
have always been a voracious reader of what I call invisible literatures - scientific journals, technical manuals, pharmaceutical company brochures, think-tank internal documents, PR company position papers - part of that universe of published material to which most literate people have scarcely any access but which provides the most potent compost for the imagination [...]

My copy of the Los Angeles Yellow Pages I stole from the Beverly Hilton Hotel three years ago; it has been a fund of extraordinary material, as surrealist in its way as Dali's autobiography.
>><<
to:
see WordSalad.InvisibleLiterature
 
 
March 07, 2012, at 08:57 AM by OtherMichael - edde addad's javascript poetry generators
Added lines 131-133:

!!! Web-Based (javascript) Computer Poetry Generation Programs
http://sourceforge.net/projects/poetrygen/
 
 
March 06, 2012, at 09:49 PM by OtherMichael - link to google-code project
Changed lines 189-190 from:
The GUI is not yet operable, and has a lot of rabbit-holes to run-down, but I hope to get the code online and under version-control in the next couple of days (2012.03.06).
to:
The GUI is not yet operable, and has a lot of rabbit-holes to run-down, {-but I hope to get the code online and under version-control in the next couple of days (2012.03.06).-}
Initial project pages and SVN commit @ https://code.google.com/p/text-munger/
 
 
March 06, 2012, at 04:24 PM by OtherMichael - notes on GUI in-progress....
Added lines 187-192:
I've begun work on the editor.
As one of the many rabbit-holes I've chased (am chasing) down on this project, I ended up having to create a [[https://code.google.com/p/winforms-custom-select-control/|winforms custom-control]]
The GUI is not yet operable, and has a lot of rabbit-holes to run-down, but I hope to get the code online and under version-control in the next couple of days (2012.03.06).
Hopefully, the GUI will be roughly useable within a week.

Changed lines 233-234 from:
to:
[[http://www.codinghorror.com/blog/2008/10/obscenity-filters-bad-idea-or-incredibly-intercoursing-bad-idea.html|purposefully bad implementation of a bowdlerizer]] - an instant ''clbuttic''!!!
Changed lines 406-407 from:
WordSalad.AppropriationsCommittee - ''Originality, what is that?''
to:
WordSalad.AppropriationsCommittee - ''Originality, what is that?''
WordSalad.ElectroText
 
 
March 02, 2012, at 03:17 PM by OtherMichael -
Added lines 44-47:

Break apart the monolithic program, so that all transformers are standalone apps that take [[Windows/CommandLineParameter|command-line parameters]] ?
This could allow for some interesting chaining in other ways..... but would also mean a lot of weird calling inside the app?
The Unix model.....
 
 
March 01, 2012, at 12:00 PM by OtherMichael - janus node notes
Added lines 121-126:

!!! [=JanusNode=]
http://janusnode.com/
[[https://gnoetrydaily.wordpress.com/2010/06/13/other-tools-ee-wittgenstein-with-janusnode/|review of JN @ GnoetryDaily]]

NO SOURCE AVAILABLE
 
 
February 29, 2012, at 01:14 PM by OtherMichael -
Deleted lines 196-197:

Changed line 216 from:
[[https://textytext.wordpress.com/2008/02/20/bayesian-text-replacement/|Bayesian replacement]] - I had thought about this vaguely, but it looks as though [[http://www.decontextualize.com/projects/|Adam Parrish]] thought about this concretely.
to:
[[https://textytext.wordpress.com/2008/02/20/bayesian-text-replacement/|Bayesian replacement]] - I had thought about this vaguely, but it looks as though [[http://www.decontextualize.com/projects/|Adam Parrish]] thought about this concretely ([[https://textytext.wordpress.com/2008/02/21/more-bayesian-text-swapping/|python source code]]).
 
 
February 29, 2012, at 12:39 PM by OtherMichael -
Changed lines 13-16 from:
What I want is a '''hermetic encoder'''. (Not [[http://www.gnnvietnam.com/Product-MAGRES_hermetic_encoder_%E2%80%93_Unsurpassed_robustness-4438_3_DCH-12-211-268-122-2.aspx|THIS]], [[WordSalad/HermeticDetector|THIS.]])

And to build a naive, simplistic, even moderately-interesting output engine
, I must have a better understanding of my own opaque processes.
to:
What I want is a '''hermetic encoder'''. (Not [[http://www.gnnvietnam.com/Product-MAGRES_hermetic_encoder_%E2%80%93_Unsurpassed_robustness-4438_3_DCH-12-211-268-122-2.aspx|THIS]], [[WordSalad/HermeticDetector|THIS.]]) cf, Jerry Cornelius' [[Wikipedia:Airtight_Garage|Hermetic Garage.]]

And to build a naive
, simplistic, even moderately-interesting output engine, I must have a better understanding of my own opaque, oxygen-starved processes.
Added line 218:
[[https://textytext.wordpress.com/2008/02/20/bayesian-text-replacement/|Bayesian replacement]] - I had thought about this vaguely, but it looks as though [[http://www.decontextualize.com/projects/|Adam Parrish]] thought about this concretely.
 
 
February 29, 2012, at 09:59 AM by OtherMichael - markov notes added, and shuffled
Deleted lines 48-52:
Markov text generation is naive.
Naive in the sense that I thought the extant corpus of XRML was large enough to generate interesting output when used as the whole source.
Naive in that I thought with some tweaking I could get it to provide an interesting output.
And finally naive in that it has nothing to do with language at all - it's a happy accident of statistics that it produces output that looks like language. there is no "understanding" of language at all (ignore that no application "understands" language).

Added lines 59-75:
Markov text generation is naive.

Naive in the sense that I thought the extant corpus of XRML was large enough to generate interesting output when used as the whole source.

Naive in that I thought with some tweaking I could get it to provide an interesting output.

And finally naive in that it has nothing to do with language at all - it's a happy accident of statistics that it produces output that looks like language. There is no "understanding" of language at all (ignore that no application "understands" language).

>>clip lrindent<<
n-gram models are often criticized because they lack any explicit representation of long range dependency. (In fact, it was Chomsky's critique of Markov models in the late 1950s that caused their virtual disappearance from natural language processing, along with statistical methods in general, until well into the 1980s.) This is because the only explicit dependency range is (n-1) tokens for an n-gram model, and since natural languages incorporate many cases of unbounded dependencies (such as wh-movement), this means that an n-gram model cannot in principle distinguish unbounded dependencies from noise (since long range correlations drop exponentially with distance for any Markov model). For this reason, n-gram models have not made much impact on linguistic theory, where part of the explicit goal is to model such dependencies.

Another criticism that has been made is that Markov models of language, including n-gram models, do not explicitly capture the performance/competence distinction discussed by Chomsky. This is because n-gram models are not designed to model linguistic knowledge as such, and make no claims to being (even potentially) complete models of linguistic knowledge; instead, they are used in practical applications.
->([[Wikipedia:N-gram#Applications_and_considerations|source]])
>><<

TODO: the above notes will probably be moved to [[WordSalad.ChainsOfLove]]
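The statistics-not-language point is easy to see in a toy word-level Markov generator. A generic sketch (not the TextMunger code), where key_len is the n-gram prefix size -- the "key-length" knob mentioned elsewhere in these notes:

```python
import random
from collections import defaultdict

def build_chain(text, key_len=2):
    """Map each key_len-word prefix to the words observed right after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - key_len):
        chain[tuple(words[i:i + key_len])].append(words[i + key_len])
    return chain

def generate(chain, key_len=2, length=20, seed=None):
    rng = random.Random(seed)
    out = list(rng.choice(list(chain)))   # random starting key
    for _ in range(length):
        nxt = chain.get(tuple(out[-key_len:]))
        if not nxt:                       # dead end: nothing ever followed this key
            break
        out.append(rng.choice(nxt))
    return " ".join(out)

chain = build_chain("the cat sat on the mat and the cat ran off")
print(generate(chain, length=8, seed=1))
```

With a small, low-self-similarity corpus, most keys have exactly one continuation, so the "generated" text just replays the source -- which is the strikingly-boring-output problem in a nutshell.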

 
 
February 29, 2012, at 09:39 AM by OtherMichael - translator and density random-walk notes, GTR and gnoetry glosses
Deleted line 102:
Changed lines 104-105 from:
to:
It's a huge program, because it's built around Eclipse. yipes.
Changed lines 108-109 from:
to:
Notes from a course; code in Python.
Changed lines 112-113 from:
to:
I've just gotten in-touch with some of the gnoetry-daily guys.
Added lines 207-213:
TODO: Translator sub-interface of the ITransformation (which are really rule-based pseudo-translators)
Translators retain the Source, Munged methods, but add Translate and Reverse
on the assumption that the translation is somewhat bi-jective
In the case of pig-latin, that may not be strictly true, which could be interesting.
[ie. the two english words "wall" and "all" translate to the single pig-latin ''allway'', which means that it could have two possible reverse transformations, and only contextual analysis could tell which one, and that's a huge scope-creep.]
Nevertheless, most of the pseudo-translators have rules that can be easily reversed.

Added lines 245-253:

Granularity needs to be less static than randomization around a fixed point. We need random-walk points, with randomization around those.

ie, 5,5,4,7,3,5,15,17,14,15,25,24,12,13,11,12

algorithm: random walker with weighted jumps, some discontinuity, but not 0..1840 generally. Some weighted, random amount for the number of points around that walk-stop.

TODO: some sort of d--n terminology
TODO: clean up the above mess
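One possible reading of the above, as a sketch -- parameter names and weights are guesses, not settled design; the 0..1840 bound is the punct range from the density notes:

```python
import random

def walk(steps, lo=0, hi=1840, jitter=3, jump_chance=0.15, seed=None):
    """Random walker with weighted jumps: values wobble near the current
    walk-stop, occasionally jump discontinuously, but rarely span lo..hi."""
    rng = random.Random(seed)
    pos = rng.randint(lo, hi)
    out = []
    for _ in range(steps):
        if rng.random() < jump_chance:
            # weighted jump: exponentially distributed, so mostly short hops
            pos += rng.choice([-1, 1]) * int(rng.expovariate(1 / 40))
        pos += rng.randint(-jitter, jitter)   # local wobble around the stop
        pos = max(lo, min(hi, pos))
        out.append(pos)
    return out

print(walk(16, seed=2))
```

This produces the shape of the example sequence above: little clusters of nearby values, with the occasional discontinuous hop to a new neighborhood.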
 
 
February 28, 2012, at 10:48 PM by OtherMichael -
Changed lines 3-6 from:
I'm (finally) building a (c#) application to do Markov processing on a variety of inputs.

Oh, yeah, there are plenty of those about.
to:
I'm (finally) building a (c#) application to do a variety of processing on a variety of inputs.

Oh, yeah, there are [[WordSalad/Generators|plenty of those about.]]
Changed lines 11-13 from:
What I really want is an algorithm that approximates the bizarre thought patterns that lead to an XRML page.
What I want is a '''hermetic encoder'''.
to:
What I really want is an algorithm that approximates the bizarre thought patterns that lead to an [[XraysMonaLisa/Anywhere|XRML page.]]

What I want is a '''hermetic encoder'''. (Not [[http://www.gnnvietnam.com/Product-MAGRES_hermetic_encoder_%E2%80%93_Unsurpassed_robustness-4438_3_DCH-12-211-268-122-2.aspx|THIS]], [[WordSalad/HermeticDetector|THIS.]])
Changed lines 17-19 from:
This means both processing algorithms, but also sourcing algorithms = where does text come from, how does it interrelate. the landscape, textscape, textriver ideas, cut ups, concrete poetry, misunderstood maths, art, lit, pop culture. how to blender?

And
I should stop worrying about "perfection" -- work towards a first approximation -- something that provides me with a jumping-off point for real editing, curation, modification, etc. Which is the major goal, isn't it?
to:
This means both processing algorithms, but also sourcing algorithms. I.e., where does text come from, how does it interrelate. The landscape, textscape, textriver ideas, cut ups, concrete poetry, misunderstood maths, art, lit, pop culture. How do I replicate the blenderize in my headscape?

I should stop worrying about "perfection" -- work towards a first approximation -- something that provides me with a jumping-off point for real editing, curation, modification, etc. Which is the major goal, isn't it?
 
 
February 28, 2012, at 03:02 PM by OtherMichael -
Added lines 98-108:

!!! GTR Language Workbench
http://web.njit.edu/~newrev/3.0//workbench/Workbench.html

wait. is that what I'm trying to build, here?!??!

!!! reading and writing electronic text
http://www.decontextualize.com/teaching/rwet/

!!! various tools
https://gnoetrydaily.wordpress.com/2011/08/02/other-tools-gtr-language-workbench-rhyming-robot-seuss-infinite-monkeys/
 
 
February 28, 2012, at 02:50 PM by OtherMichael -
Changed lines 95-96 from:

See also
: Wikipedia:Snowclone
to:
https://gnoetrydaily.wordpress.com

See also : Wikipedia:Snowclone for some interesting script ideas
 
 
February 28, 2012, at 02:34 PM by OtherMichael -
Added lines 95-96:

See also : Wikipedia:Snowclone
 
 
February 28, 2012, at 02:23 PM by OtherMichael -
Deleted lines 171-172:
leet-speak replacement
->http://stackoverflow.com/questions/3215228/convert-leet-speak-to-plaintext
Deleted lines 172-173:
pig-latin-ification
->http://stackoverflow.com/questions/3563328/piglatin-using-arrays http://stackoverflow.com/questions/4098178/c-sharp-translator-from-pig-latin-to-english http://www.programmersheaven.com/mb/csharp/392979/392979/piglatin-program-calling-methods/
Changed line 193 from:
need some rule for injecting these at OTHER points, beyond the existing markov rules
to:
need some rule for injecting these at OTHER points, beyond the existing Markov rules
Changed lines 205-206 from:
but number of spaces can't be too large -- we don't want them on separate pages (ie, 40+ lines between each part)
to:
but number of spaces can't be too large -- we don't want them on separate pages (i.e., 40+ lines between each part)
Changed line 214 from:
I'm getting a lot of algorthmic workouts, here.
to:
I'm getting a lot of algorithmic workouts, here.
Added lines 231-233:
pig-latin-ification
leet-speak replacement

 
 
February 28, 2012, at 09:09 AM by OtherMichael -
Changed lines 215-220 from:
First coding will be naive, and probably not even include word-splits.
to:
Slightly randomized density around the indicated amount is implemented, although the algorithm could use some tweaking.
And, again, once it's in place, I see that it still is "not enough."
I think the density should "wander" -- that is, not remain fixed, but go up and down, sometimes discretely, sometimes discontinuously. Weighted randomness based on history?
I'm getting a lot of algorthmic workouts, here.
Which is half the point in doing a project. But I need to remember that the product [oh! how crude!] is the goal, not the process, right? The text, not the application. But as a programmer, it is also a goal....

 
 
February 27, 2012, at 11:21 AM by OtherMichael - plagiarisim & remixing notes added to stand-alone "Hermetic Encode" section
Deleted line 17:
Changed lines 20-33 from:
to:
!! Hermetic Encoder
Notes on a|my process
>>clip lrindent<<
    Willard: They told me that you had gone totally insane, and that your methods were unsound.
    Kurtz: Are my methods unsound?
    Willard: I don't see any method at all, sir.
>><<
->([[https://en.wikiquote.org/wiki/Apocalypse_Now#Dialogue|source]])

How much of this is "original"? See also, WordSalad.AppropriationsCommittee
This project will be processing (and (re)mixing) existing texts, perhaps mine, mostly others'.
When I worked "manually", I usually had texts in front of me, or the hyper-jumped scramble of memory... tv, radio, advertising, newspapers, receipts, collages, etc.

Changed lines 339-340 from:
[[/XraysMonaLisa.NonSequential]] - not a lot of there there, but a start.
to:
[[/XraysMonaLisa.NonSequential]] - not a lot of there there, but a start.
WordSalad.AppropriationsCommittee - ''Originality, what is that?''
 
 
February 27, 2012, at 10:50 AM by OtherMichael - added "Invisible Literature", major re-shuffling to combine repeated or similar sections
Deleted lines 10-15:

Markov text generation is naive.
Naive in the sense that I thought the extant corpus of XRML was large enough to generate interesting output when used as the whole source.
Naive in that I thought with some tweaking I could get it to provide an interesting output.
And finally naive in that it has nothing to do with language at all - it's a happy accident of statistics that it produces output that looks like language. there is no "understanding" of language at all (ignore that no application "understands" language).

Changed lines 32-49 from:
!! Markov core
to:
!! Core code
The core is no longer a Markov engine, although that was what I once thought it would be.

Markov text generation is naive.
Naive in the sense that I thought the extant corpus of XRML was large enough to generate interesting output when used as the whole source.
Naive in that I thought with some tweaking I could get it to provide an interesting output.
And finally naive in that it has nothing to do with language at all - it's a happy accident of statistics that it produces output that looks like language. there is no "understanding" of language at all (ignore that no application "understands" language).

So, core code is two parts
a) selection of texts
** local library
** online sources
b) application of processing rules
** this includes formatting, etc.



!!! Markov processor
Deleted line 91:
Added lines 94-115:
[[http://bionicspirit.com/blog/2012/02/09/howto-build-naive-bayes-classifier.html|How To Build a Naive Bayes Classifier]] - with some discussion of "stop words", a link to one catalog, and ideas for pulling out of Gutenberg texts.

http://apocryph.org/2006/06/23/weekend_project_parody_generator_using_rss_pos_tagging_markov_text_generation/
https://sites.google.com/site/texttotext2011/#data
https://en.wikipedia.org/wiki/Natural_language_generation
http://apocryph.org/tag/markov/

[[http://www.perlmonks.org/?node=Acme%3A%3ATranslator|pseudo-translation (in Perl)]] - like the Shizzolator.
[[http://speeves.erikin.com/2007/01/perl-random-string-generator.html|random characters]]
[[http://www.ruf.rice.edu/~pound/|Chris Pound's language machines]]
[[http://saizai.livejournal.com/657391.html|Non-Linear Fully Two-Dimensional Writing System Design]] - some interesting ideas, well articulated. doesn't like the grid. the small example shown is still more linear than I prefer. '''update:''' on reading a follow-up, seems like this is more of designing a constructed-language writing system. You know, writing, as concept. not writing writing. But, some interesting articulations on non-linearity in there....

!!! auto-imported generated wiki pages
http://www.pmwiki.org/wiki/Cookbook/ImportText

Hrm. That could be.... interesting....

Pushing output of the app back into the wiki (my website, and XRML home)



Changed line 131 from:
!! visual interface
to:
!!! GUI and editor
Changed lines 142-152 from:

!! Getting Source material
I'm building source inputs -- could be called [@TextGetters@], for want of a better name
Working on [@WebGetters@], to grab from XraysMonaLisa, Gutenberg, and eventually Textfiles.com
See: WordSalad/TextShopping WordSalad/Generators WordSalad/InternetMemeText WordSalad/LoremIpsum WordSalad/Spam for more ideas

I found that processing 60 pages of XRML provides a strikingly boring output. due to a high lack of self-similarity, the rendered output doesn't vary all that much. Dropping the key-length amount is one option. Adding in alternate sources is another.
However, processing 60 XRML pages and an entire novel now skews towards the novel, and I want output to "look" like XRML.
So, need to figure out some methods of modifying non-XRML source to be more similar.

to:
http://www.codeproject.com/Articles/26422/Peter-Programmers-Extensive-Text-Editor
http://www.codeproject.com/Articles/18166/Scratchpad-An-Auto-Save-Notepad
http://www.codeproject.com/Articles/9761/Visual-Studio-Editor-Clone-V0-1a
http://www.codeproject.com/Articles/19852/Text-Editor-Using-C
http://www.codeproject.com/Articles/7213/Write-Your-Customized-Editor-for-Your-Own-Programm
http://www.codeproject.com/Articles/11705/Hex-Editor-in-c

[[http://community.sharpdevelop.net/forums/p/5717/16387.aspx#16387|editors that allow vertical (block) selection]] -- and of course, Emacs.
sadly, {{#Develop}} doesn't (yet?) support block-highlighting
[[http://www.freelancer.in/projects/NET/Only-for-experts-NETC-Column.html|call developer to implement]]
[[http://www.icsharpcode.net/OpenSource/SD/InsideSharpDevelop.aspx|Dissecting a C# Application: Inside SharpDevelop]] - have a look at Chapter 11, "Writing the editor control" (I have the ebook inside of dropbox)



Changed line 214 from:
whitespace to punctuation
to:
white-space to punctuation
Changed lines 217-218 from:
techncally, Markov belongs in here, as it is just one of several rules.
to:
technically, Markov belongs in here, as it is just one of several rules.
Added lines 224-244:
!! Getting Source material
I'm building source inputs -- could be called [@TextGetters@], for want of a better name
Working on [@WebGetters@], to grab from XraysMonaLisa, Gutenberg, and eventually Textfiles.com
See: WordSalad/TextShopping WordSalad/Generators WordSalad/InternetMemeText WordSalad/LoremIpsum WordSalad/Spam for more ideas

I found that processing 60 pages of XRML provides a strikingly boring output. Due to a high lack of self-similarity, the rendered output doesn't vary all that much. Dropping the key-length amount is one option. Adding in alternate sources is another.
However, processing 60 XRML pages and an entire novel now skews towards the novel, and I want output to "look" like XRML.
So, need to figure out some methods of modifying non-XRML source to be more similar.

!!! Invisible Literature
Wikileaks? http://wikileaks.org/gifiles/
http://invisibleliterature.com/?55427e60
http://williamgibsonboard.com/eve/forums/a/tpc/f/8606097971/m/8946093202
http://www2.iath.virginia.edu/elab//hfl0093.html
http://rodcorp.typepad.com/rodcorp/2006/04/invisible_index.html
>>clip lrindent<<
have always been a voracious reader of what I call invisible literatures - scientific journals, technical manuals, pharmaceutical company brochures, think-tank internal documents, PR company position papers - part of that universe of published material to which most literate people have scarcely any access but which provides the most potent compost for the imagination [...]

My copy of the Los Angeles Yellow Pages I stole from the Beverly Hilton Hotel three years ago; it has been a fund of extraordinary material, as surrealist in its way as Dali's autobiography.
>><<

Changed lines 283-318 from:

!! auto-imported generated wiki pages
http://www.pmwiki.org/wiki/Cookbook/ImportText

Hrm. That could be.... interesting....

Pushing output of the app back into the wiki.....


!! research
http://apocryph.org/2006/06/23/weekend_project_parody_generator_using_rss_pos_tagging_markov_text_generation/
https://sites.google.com/site/texttotext2011/#data
https://en.wikipedia.org/wiki/Natural_language_generation
http://apocryph.org/tag/markov/

[[http://www.perlmonks.org/?node=Acme%3A%3ATranslator|pseudo-translation (in Perl)]] - like the Shizzolator.
[[http://speeves.erikin.com/2007/01/perl-random-string-generator.html|random characters]]
[[http://www.ruf.rice.edu/~pound/|Chris Pound's language machines]]
[[http://saizai.livejournal.com/657391.html|Non-Linear Fully Two-Dimensional Writing System Design]] - some interesting ideas, well articulated. doesn't like the grid. the small example shown is still more linear than I prefer. '''update:''' on reading a follow-up, seems like this is more of desigining a constructed-language writing system. You know, writing, as concept. not writing writing. But, some interesting articulations on non-linearity in there....


!! GUI and editor
http://www.codeproject.com/Articles/26422/Peter-Programmers-Extensive-Text-Editor
http://www.codeproject.com/Articles/18166/Scratchpad-An-Auto-Save-Notepad
http://www.codeproject.com/Articles/9761/Visual-Studio-Editor-Clone-V0-1a
http://www.codeproject.com/Articles/19852/Text-Editor-Using-C
http://www.codeproject.com/Articles/7213/Write-Your-Customized-Editor-for-Your-Own-Programm
http://www.codeproject.com/Articles/11705/Hex-Editor-in-c

[[http://community.sharpdevelop.net/forums/p/5717/16387.aspx#16387|editors that allow vertical (block) selection]] -- and of course, Emacs.
sadly, {{#Develop}} doesn't (yet?) support block-highlighting
[[http://www.freelancer.in/projects/NET/Only-for-experts-NETC-Column.html|call developer to implement]]
[[http://www.icsharpcode.net/OpenSource/SD/InsideSharpDevelop.aspx|Dissecting a C# Application: Inside SharpDevelop]] - have a look at Chapter 11, "Writing the editor control" (I have the ebook inside of dropbox)


!! Library
to:
!!! Library
Deleted lines 316-317:

Deleted lines 317-318:

 
 
February 25, 2012, at 02:58 PM by OtherMichael -
Changed lines 70-72 from:
to:
!!! Infinite Monkeys
[[https://code.google.com/p/infinitemonkeys/|InfiniteMonkeys]] - ''is an open source random poetry generator written in [=FreeBASIC.=] It is largely considered the Industry Standard in SPAM generation.''

Changed line 79 from:
to:
[[http://www.planet-source-code.com/vb/scripts/ShowCode.asp?txtCodeId=6285&lngWId=2|poetry generator applet (java)]]
 
 
February 25, 2012, at 02:43 PM by OtherMichael - rule ideas
Changed lines 127-134 from:
pseudo-localiation
to:
internet slang converter
->http://www.noslang.com/dictionary/full/
The Snoop-Dog "Shizzolator" is long gone, but some samples live on:
->http://nwtekno.org/showthread.php?t=48752 http://www.brianenos.com/forums/index.php?showtopic=3402
Dialectalizer: http://rinkworks.com/dialect/works.shtml
12-year-old AOLer, whose author says "don't copy this", so, don't
->http://ssshotaru.homestead.com/files/aolertranslator.html
pseudo-localization
Changed lines 140-141 from:

to:
Automatic Translation of English Text to Phonetics by Means of Letter-to-Sound Rules - 1976 paper
->http://www.dtic.mil/cgi-bin/GetTRDoc?AD=ADA021929

Changed lines 172-173 from:
disemvowelling
disemconsonanting
to:
disemvowel
disemconsonant (cf. disemvowel, but, well, it should be obvious)
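Both are one-line substitutions; a sketch:

```python
import re

def disemvowel(text):
    """Strip vowels (y survives, which keeps words half-guessable)."""
    return re.sub(r"[aeiouAEIOU]", "", text)

def disemconsonant(text):
    """The complement: strip every consonant, keep vowels and punctuation."""
    return re.sub(r"[b-df-hj-np-tv-zB-DF-HJ-NP-TV-Z]", "", text)

print(disemvowel("word salad"))      # -> wrd sld
print(disemconsonant("word salad"))  # -> o aa
```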
Changed lines 178-179 from:
to:
density - to a second approximation. have a sliding scale of 0..100% := 0..1840 puncts, but no randomness yet.
techncally, Markov belongs in here, as it is just one of several rules.
 
 
February 24, 2012, at 03:05 PM by OtherMichael -
Changed line 38 from:
!! Application core
to:
!! Markov core
Deleted line 43:
Added lines 51-53:
Which is all non-Markov stuff. That heavy lifting is over with. whatever.

Deleted line 121:
homophonic replacement: http://www.peak.org/~jeremy/dictionaryclassic/chapters/homophones.php
Changed lines 169-171 from:
to:
homophonic replacement: http://www.peak.org/~jeremy/dictionaryclassic/chapters/homophones.php
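Homophonic replacement is just a dictionary swap once a homophone list (like the one linked above) is loaded. A sketch with a hand-typed handful of pairs; the dictionary here is a toy, not the linked list:

```python
import random

# A hand-typed handful of pairs; a real run would load the full
# homophone list instead of this toy dictionary.
HOMOPHONES = {
    "there": ["their", "they're"], "their": ["there", "they're"],
    "to": ["two", "too"], "sea": ["see"], "see": ["sea"],
    "right": ["write", "rite"], "ate": ["eight"],
}

def homophonize(text, seed=None):
    """Swap each word for a randomly chosen homophone, if one is known."""
    rng = random.Random(seed)
    return " ".join(rng.choice(HOMOPHONES.get(w, [w])) for w in text.split())

print(homophonize("see the sea"))  # -> sea the see
```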

Changed line 233 from:
[[http://saizai.livejournal.com/657391.html|Non-Linear Fully Two-Dimensional Writing System Design]] - some interesting ideas, well articulated. doesn't like the grid. the small example shown is still more linear than I prefer.
to:
[[http://saizai.livejournal.com/657391.html|Non-Linear Fully Two-Dimensional Writing System Design]] - some interesting ideas, well articulated. doesn't like the grid. the small example shown is still more linear than I prefer. '''update:''' on reading a follow-up, seems like this is more of desigining a constructed-language writing system. You know, writing, as concept. not writing writing. But, some interesting articulations on non-linearity in there....
 
 
February 24, 2012, at 11:26 AM by OtherMichael -
Changed lines 230-231 from:

to:
[[http://saizai.livejournal.com/657391.html|Non-Linear Fully Two-Dimensional Writing System Design]] - some interesting ideas, well articulated. doesn't like the grid. the small example shown is still more linear than I prefer.

Changed lines 293-295 from:
XraysMonaLisa.ArchaeologicalNotes - some small thoughts on origins. Needs more earthquake.
to:
XraysMonaLisa.ArchaeologicalNotes - some small thoughts on origins. Needs more earthquake.
[[XraysMonaLisa.ElectronicWriting]] - more notes on influences.
[[/XraysMonaLisa.NonSequential]] - not a lot of there there, but a start
.
 
 
February 24, 2012, at 10:53 AM by OtherMichael - xref to Archeological Notes
Changed lines 291-292 from:
[[WordSalad.Generators]]
to:
[[WordSalad.Generators]]
XraysMonaLisa.ArchaeologicalNotes - some small thoughts on origins. Needs more earthquake.
 
 
February 23, 2012, at 11:39 AM by OtherMichael -
Changed lines 62-68 from:
to:
!!! Waffle Generator
http://www.simple-talk.com/dotnet/.net-tools/the-waffle-generator/

!!! rmutt
[[http://sourceforge.net/scm/?type=svn&group_id=251485|rmutt]] http://www.schneertz.com/rmutt/

Changed lines 75-78 from:
[[http://sourceforge.net/scm/?type=svn&group_id=251485|rmutt]] http://www.schneertz.com/rmutt/
http://dev.null.org/dadaengine/ http://dev.null.org/dadaengine/manual-1.0/dada_toc.html


to:


Added line 231:
Changed line 291 from:
WordSalad.Generators
to:
[[WordSalad.Generators]]
 
 
February 23, 2012, at 11:21 AM by OtherMichael -
Changed line 85 from:
!!! atomicity
to:
!!! granularity
Added line 89:
Added lines 93-95:

The editor is a downstream project. Generation of text is the top priority, as that includes analyzing my own processes. The editor is a tool to assist. Which is nice. But not required. But devoutly to be wished.

Deleted line 98:
Added line 101:
Added line 111:
Deleted line 112:
whitespace to punctuation
Added line 117:
->http://stackoverflow.com/questions/3215228/convert-leet-speak-to-plaintext
Added lines 120-122:
->http://stackoverflow.com/questions/3563328/piglatin-using-arrays http://stackoverflow.com/questions/4098178/c-sharp-translator-from-pig-latin-to-english http://www.programmersheaven.com/mb/csharp/392979/392979/piglatin-program-calling-methods/
pseudo-localiation
->http://www.codeproject.com/Articles/8496/pseudoLocalizer-a-tool-to-aid-development-and-test and https://blogs.msdn.com/b/delay/archive/2011/01/27/sudo-localize-amp-amp-make-me-a-sandwich-free-pseudolocalizer-class-makes-it-easy-for-anyone-to-identify-potential-localization-issues-in-net-applications.aspx?Redirected=true
Changed lines 126-127 from:
-> Effects will vary with atomicity
to:
-> Effects will vary with granularity

Added line 162:
whitespace to punctuation
 
 
February 23, 2012, at 10:39 AM by OtherMichael - density notes added and moved under rules, data engine notes added. many notes have been lost.
Changed line 52 from:
!!! SCIgen - An Automatic CS Paper Generator
to:
!!! [=SCIgen=] - An Automatic CS Paper Generator
Changed lines 57-62 from:
to:
!!! Dada Engine
http://dev.null.org/dadaengine/
http://dev.null.org/dadaengine/manual-1.0/dada_toc.html
[[http://herbert.the-little-red-haired-girl.org/en/dada/index.html|Dada Engine web interface]]

Changed lines 274-275 from:
WordSalad.TextShopping - perhaps some sources (which is vaguely the idea behind the creation of this page in the first place)
to:
WordSalad.TextShopping - perhaps some sources (which is vaguely the idea behind the creation of this page in the first place)
WordSalad.Generators
 
 
February 23, 2012, at 10:32 AM by OtherMichael -
Changed lines 36-43 from:
Density of text -- if input is all char-heavy, allow density filter to intrude by adding in chunks of periods?
but how probable are they?
If source text has NO PERIODS then adding 2000 pages of periods won't help, as those periods will only fire on a probability rule for the last letter before they start
need some rule for injecting these at OTHER points, beyond the existing markov rules
since I'm not strictly following a markovian generation -- I'm doing custom building. new rules.
hrm.....


to:

Deleted line 110:
reverse
Changed lines 115-142 from:
to:
!!!! Density
Density of text -- if input is all char-heavy, allow density filter to intrude by adding in chunks of periods?
but how probable are they?
If source text has NO PERIODS then adding 2000 pages of periods won't help, as those periods will only fire on a probability rule for the last letter before they start
need some rule for injecting these at OTHER points, beyond the existing markov rules
since I'm not strictly following a markovian generation -- I'm doing custom building. new rules.
hrm.....

set via percentage 0..99
100 is no added punctuation -- all words run together with no spaces (retain source punctuation?)
0 would be all punct, no source, so not available... unless we want to generate a blank slate? hrm....
Not sure how the intermediate would be. Not numerical -- at the upper bound (somewhere in the 1..10 range) is an XRML page that has but one word on it.
Fill with default punct mark -- the period
small chance of other characters intruding at random (mostly punct, some alpha -- weighted list? I prefer "x", for some reason)
chance of words splitting with a few chars in-between the splits
chance increases as density... decreases?
but the number of spaces can't be too large -- we don't want them on separate pages (i.e., 40+ lines between each part)

What is the density a measure of -- source-text density, or punctuation?
It should be source-text. Raw punct/blank slate [ie, all periods] should be 0 density.
So, need to edit the above to indicate that.

First coding will be naive, and probably not even include word-splits.
Elaborate over time, including syllable-breakage
Eventually, would like some splits to be vertical, not just linear.
That is waaaaay down the road.

Changed line 147 from:
to:
reverse
 
 
February 22, 2012, at 09:54 PM by OtherMichael -
Changed lines 61-62 from:
I forgot about this -- and the code is available?
to:
I forgot about this -- and the code is available!

Changed lines 70-74 from:
to:
[[http://sourceforge.net/scm/?type=svn&group_id=251485|rmutt]] http://www.schneertz.com/rmutt/
http://dev.null.org/dadaengine/ http://dev.null.org/dadaengine/manual-1.0/dada_toc.html


what??? http://www.nictoglobe.com/new/notities/text.list.html
 
 
February 22, 2012, at 05:11 PM by OtherMichael -
Changed line 17 from:
What I really want is an algorithm that approximates  the bizarre thought patterns that lead to an XRML page.
to:
What I really want is an algorithm that approximates the bizarre thought patterns that lead to an XRML page.
Added lines 25-27:
And I should stop worrying about "perfection" -- work towards a first approximation -- something that provides me with a jumping-off point for real editing, curation, modification, etc. Which is the major goal, isn't it?

Added lines 226-235:
emoticon lists/explanations?
list of file-formats
phone-numbers
white-pages listings (businesses, not personal)
bank statements
list of numbers
receipts [one of the early inspirations for the hyper-dense cash-register-tape manual-typewriter typing I did in 92-95]


Added lines 237-238:

 
 
February 22, 2012, at 02:40 PM by OtherMichael -
Added lines 54-66:

!!! SCIgen - An Automatic CS Paper Generator
http://pdos.csail.mit.edu/scigen/

I forgot about this -- and the code is available?

!!! broad avenues of research
http://stackoverflow.com/questions/1670867/libraries-or-tools-for-generating-random-but-realistic-text
->http://wordnet.princeton.edu/
http://stackoverflow.com/search?q=text%20generator
http://homepages.inf.ed.ac.uk/jbos/comsem/
http://www.statmt.org/

 
 
February 22, 2012, at 11:22 AM by OtherMichael -
Added lines 206-210:


On further thought, these are all pretty vanilla linear texts [not necessarily narrative]. But I'm interested in some other forms, or changing into other forms, of the invisible literature -- the headings of the Gutenberg texts [question -- what variations are there? which etext has the longest?], receipts, recipes, code, what else should be thrown into the mix?

What hermetic algorithm do I need to encode the narratives to those [[Wikipedia:Formant|formants]]?
 
 
February 22, 2012, at 09:13 AM by OtherMichael -
Changed line 1 from:
(:title Text Mungler:)
to:
(:description notes notes notes :)
 
 
February 21, 2012, at 10:22 PM by OtherMichael -
Changed line 91 from:
homophonic replacement
to:
homophonic replacement: http://www.peak.org/~jeremy/dictionaryclassic/chapters/homophones.php
 
 
February 21, 2012, at 04:57 PM by OtherMichael -
Added lines 182-205:


!! Library
Some texts that might be fun to provide as defaults:
[[http://www.gutenberg.org/ebooks/5402|1811 Dictionary of the Vulgar Tongue by Francis Grose ]]
[[http://www.gutenberg.org/ebooks/20019|Lectures on Landscape by John Ruskin ]]
[[http://www.gutenberg.org/ebooks/2383|Canterbury Tales]]
[[http://www.gutenberg.org/ebooks/3825|Pygmalion]]
[[http://www.gutenberg.org/ebooks/84|Frankenstein]]
[[http://www.gutenberg.org/ebooks/10|King James Bible]]
[[http://www.gutenberg.org/ebooks/28144|Futurist Manifesto]] - but it's in Italian, need a translation?
->http://www.unknown.nu/futurism/ and http://cscs.umich.edu/~crshalizi/T4PM/futurist-manifesto.html
-> and particularly [[http://www.unknown.nu/futurism/techpaint.html|Technical Manifesto of Futurist Painting ]] for the x-ray quote
[[http://www.gutenberg.org/ebooks/16917|Art, by Clive Bell]] -- why this one?
[[http://www.gutenberg.org/ebooks/24726|A History of Art for Beginners and Students by Clara Erskine Clement Waters ]]
[[http://www.gutenberg.org/ebooks/14400|Manual of Egyptian Archaeology and Guide to the Study of Antiquities in Egypt ]] - may be more suited to a different project
[[http://www.gutenberg.org/ebooks/search.html/?default_prefix=subject_id&sort_order=downloads&query=385|Books on Arthurian romances (sorted by popularity) ]]
[[http://www.gutenberg.org/ebooks/search.html/?format=html&default_prefix=subjects&sort_order=downloads&query=alchemy|alchemy]]
->[[http://www.gutenberg.org/ebooks/26340|Of Natural and Supernatural Things by Basilius Valentinus ]]
[[http://www.gutenberg.org/ebooks/search.html/?default_prefix=subject_id&sort_order=downloads&query=27|Books on Classical literature (sorted by popularity) ]]
[[http://www.gutenberg.org/ebooks/search.html/?default_prefix=subject_id&sort_order=downloads&query=45|Oz books by L. Frank Baum]]
[[http://www.gutenberg.org/ebooks/search.html/?default_prefix=subject_id&sort_order=downloads&query=166|books on Philosophy]]
[[http://www.gutenberg.org/ebooks/5740|Tractatus Logico-Philosophicus by Ludwig Wittgenstein]] - unfortunately, only in PDF or TEX
[[http://www.gutenberg.org/ebooks/search.html/?format=html&default_prefix=all&sort_order=&query=encyclopedia|some sort of encyclopedia]] ???
 
 
February 20, 2012, at 09:10 PM by OtherMichael -
Changed line 13 from:
Naive in  the sense that I thought the extant corpus of XRML was large enough to generate interesting output when used as the whole source.
to:
Naive in the sense that I thought the extant corpus of XRML was large enough to generate interesting output when used as the whole source.
Added line 183:
Added lines 186-187:
WordSalad.AutomaticForThePeople - in particular, the notes on Philip Parker
WordSalad.TextShopping - perhaps some sources (which is vaguely the idea behind the creation of this page in the first place)
 
 
February 20, 2012, at 04:42 PM by OtherMichael -
Added lines 10-22:


Markov text generation is naive.
Naive in  the sense that I thought the extant corpus of XRML was large enough to generate interesting output when used as the whole source.
Naive in that I thought with some tweaking I could get it to provide an interesting output.
And finally naive in that it has nothing to do with language at all - it's a happy accident of statistics that it produces output that looks like language. There is no "understanding" of language at all (ignore that no application "understands" language).

What I really want is an algorithm that approximates  the bizarre thought patterns that lead to an XRML page.
What I want is a '''hermetic encoder'''.

And to build a naive, simplistic, even moderately-interesting output engine, I must have a better understanding of my own opaque processes.

This means not only processing algorithms but also sourcing algorithms -- where does text come from, how does it interrelate? The landscape, textscape, textriver ideas, cut-ups, concrete poetry, misunderstood maths, art, lit, pop culture. How to blender?
 
 
February 17, 2012, at 12:34 PM by OtherMichael -
Deleted lines 75-76:
disemvowelling
random-caps
Changed line 77 from:
replace letters/vowels with punctuation
to:
replace letters/vowels with punctuation or other mark -- "x" or "-"
Added lines 81-98:
pig-latin-ification
translation into another language (selected at random?) [French, German, Spanish, Italian, Latin]
reverse
random re-order
splice -- i.e., split into ''n'' pieces and re-arrange.
-> Effects will vary with atomicity


!!!! deployed
disemvowelling
disemconsonanting
random-caps


!!!! grid-based transformations
rotate 90|180|270 degrees (is anything else practicable?)
shift ''n'' chars -- i.e., end of line flows into start of next line, end of block flows into start of block
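Both grid-based transforms can be sketched in a few lines of Python (illustrative names; assumes the block has already been cut into a list of lines):

```python
def rotate90(lines):
    """Rotate a rectangular block of text 90 degrees clockwise."""
    width = max(len(l) for l in lines)
    padded = [l.ljust(width) for l in lines]
    return ["".join(row) for row in zip(*reversed(padded))]

def shift(lines, n):
    """Shift the whole block n chars forward: the end of each line
    flows into the start of the next, and the end of the block
    wraps around to the start."""
    flat = "".join(lines)
    n %= len(flat)
    flat = flat[-n:] + flat[:-n] if n else flat
    out, pos = [], 0
    for l in lines:                     # re-cut into original widths
        out.append(flat[pos:pos + len(l)])
        pos += len(l)
    return out
```

180 and 270 degrees are just repeated applications of rotate90, which answers the "is anything else practicable?" question for right angles at least.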

 
 
February 16, 2012, at 04:49 PM by OtherMichael -
Changed lines 150-152 from:
to:
sadly, {{#Develop}} doesn't (yet?) support block-highlighting
[[http://www.freelancer.in/projects/NET/Only-for-experts-NETC-Column.html|call developer to implement]]
[[http://www.icsharpcode.net/OpenSource/SD/InsideSharpDevelop.aspx|Dissecting a C# Application: Inside SharpDevelop]] - have a look at Chapter 11, "Writing the editor control" (I have the ebook inside of dropbox)
 
 
February 16, 2012, at 04:11 PM by OtherMichael -
Added lines 1-2:
(:title Text Mungler:)
Added line 149:
[[http://community.sharpdevelop.net/forums/p/5717/16387.aspx#16387|editors that allow vertical (block) selection]] -- and of course, Emacs.
 
 
February 16, 2012, at 11:50 AM by OtherMichael -
Added lines 134-146:

[[http://www.perlmonks.org/?node=Acme%3A%3ATranslator|pseudo-translation (in Perl)]] - like the Shizzolator.
[[http://speeves.erikin.com/2007/01/perl-random-string-generator.html|random characters]]
[[http://www.ruf.rice.edu/~pound/|Chris Pound's language machines]]

!! GUI and editor
http://www.codeproject.com/Articles/26422/Peter-Programmers-Extensive-Text-Editor
http://www.codeproject.com/Articles/18166/Scratchpad-An-Auto-Save-Notepad
http://www.codeproject.com/Articles/9761/Visual-Studio-Editor-Clone-V0-1a
http://www.codeproject.com/Articles/19852/Text-Editor-Using-C
http://www.codeproject.com/Articles/7213/Write-Your-Customized-Editor-for-Your-Own-Programm
http://www.codeproject.com/Articles/11705/Hex-Editor-in-c

 
 
February 15, 2012, at 05:01 PM by OtherMichael -
Added lines 50-53:
!!! atomicity
process at what level?
''n'' chars, ''n'' words, ''n'' sentences, ''n'' paragraphs, pages, blocks, something else?
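The level-plus-count question above could be parameterized as a single splitter -- a Python sketch (the real app is C#; all names here are invented):

```python
import re

def atoms(text, level="word", n=1):
    """Split text into units at the chosen granularity,
    then group the units n at a time."""
    if level == "char":
        units = list(text)
    elif level == "word":
        units = text.split()
    elif level == "sentence":
        units = re.split(r"(?<=[.!?])\s+", text.strip())
    else:
        raise ValueError("unknown level: %s" % level)
    return [units[i:i + n] for i in range(0, len(units), n)]
```

Paragraphs, pages, and blocks need layout-aware splitting, but they would slot into the same shape.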

Changed lines 129-135 from:
to:
!! research
http://apocryph.org/2006/06/23/weekend_project_parody_generator_using_rss_pos_tagging_markov_text_generation/
https://sites.google.com/site/texttotext2011/#data
https://en.wikipedia.org/wiki/Natural_language_generation
http://apocryph.org/tag/markov/

Changed line 137 from:
WordSalad.ChainsOfLove
to:
WordSalad.ChainsOfLove
 
 
February 15, 2012, at 03:43 PM by OtherMichael -
Added lines 56-57:

Grid? http://msdn.microsoft.com/en-us/library/system.windows.controls.grid.aspx
 
 
February 15, 2012, at 08:55 AM by OtherMichael -
Added line 25:
Changed lines 36-37 from:
need to look at other implementations, as one of them uses a node-structure that superficially confuses me. It could just beterminology.
to:
need to look at other implementations, as one of them uses a node-structure that superficially confuses me. It could just be terminology.
Added lines 40-56:
!!! random notes
Timeline -> store a copy of the extant text with the rule that has been applied to it.
First step has an empty rule
This will allow for stepping through the process, and redefining it, deciding to go in another direction
this would also require that all transform rules have a common interface -- which means the Markov engine needs more tweaking to fit.
should be able to serialize all of this, so could be restored, re-processed?
Hrm. Once rule + text is serialized, should be trivial for the historical sequence.
This will not be a small file, though.
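The common-interface-plus-timeline idea in these notes might be sketched like so (Python for brevity -- the project is C#, and the class and method names are invented):

```python
class Rule:
    """Common interface every transform rule implements,
    including (eventually) the tweaked Markov engine."""
    name = "identity"                   # the empty rule for step zero
    def apply(self, text):
        return text

class UpperRule(Rule):
    name = "upper"
    def apply(self, text):
        return text.upper()

class Timeline:
    """Stores each intermediate text alongside the rule that
    produced it, so any step can be revisited or re-run
    in another direction."""
    def __init__(self, source):
        self.steps = [("", source)]     # first step: empty rule
    def push(self, rule):
        _, current = self.steps[-1]
        self.steps.append((rule.name, rule.apply(current)))
        return self.steps[-1][1]
```

Serializing self.steps gives the restorable, re-processable history -- and, as noted, it will not be a small file.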


!! visual interface
Select from a list of sources (online, or local cache, files, etc)
arrange "timeline" of transformations
visual editor of text at any given stage
Think about a matrix, instead of a stream of text -- edits should be in a grid, so that blocks can be picked up, moved, shifted, sliced, etc.

Changed line 71 from:
replace letters/vowells with punctuation
to:
replace letters/vowels with punctuation
 
 
February 01, 2012, at 02:13 PM by OtherMichael -
Changed lines 58-69 from:
to:
!!! Scraping Project Gutenberg
I found that looking @ http://www.gutenberg.org/browse/recent/last1 was an interesting source
And from the generic links on that page
-> eg, http://www.gutenberg.org/ebooks/38724
I could build a direct link to the plaintext by appending ".txt.utf8"

Now, there's still some boilerplate that, for my purposes, would be good to eliminate

[[http://www.stanford.edu/~mjockers/cgi-bin/drupal/node/49|Auto Converting Project Gutenberg Text to TEI]] offers some code (in python) that was used to remove boilerplate and do some reformatting
[[http://www.michielovertoom.com/python/gutenberg-ebook-scraping/|referenced in the above link]] and code is easier to read (due to formatting)
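A minimal Python sketch of both tricks -- building the direct plaintext URL and cutting the boilerplate. The exact wording of the START/END marker lines varies across etexts, so the regexes here are an assumption, not a guarantee:

```python
import re

def gutenberg_url(ebook_id):
    # the ".txt.utf8" suffix trick noted above
    return "http://www.gutenberg.org/ebooks/%d.txt.utf8" % ebook_id

def strip_boilerplate(text):
    """Keep only the body between the *** START ... *** and
    *** END ... *** marker lines, if they are present."""
    start = re.search(r"\*\*\* ?START OF.*?\*\*\*", text)
    end = re.search(r"\*\*\* ?END OF.*?\*\*\*", text)
    lo = start.end() if start else 0
    hi = end.start() if end else len(text)
    return text[lo:hi].strip()
```

The linked Python scrapers handle more edge cases (old header formats, license blocks); this is just the shape of the idea.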

Added line 76:
 
 
February 01, 2012, at 09:59 AM by OtherMichael -
Changed lines 32-39 from:
I'm using the last project, as it was the easiest to download. hah, so lazy!

need to look at other implementations.

However, the big thing for me is parameterizing/semi-automating the source input, editing the output, and programmatically editing the output -- ie, First word is capitalized, senteces, paragraphs, etc.


!! Scraping pmwiki (this site)
to:
I'm using the last project, as it was the easiest to download, and followed a pattern I was familiar with. Simple dictionary.
I've found a few bugs in it [off-by-one boundary for random selection that resulted in never selecting the LAST element], and am extending it to use different parsing rules to break apart on words or chars, or to treat whitespace as significant (since it trims it to non-existence in the original).

need to look at other implementations, as one of them uses a node-structure that superficially confuses me. It could just beterminology.

However, the big thing for me is parameterizing/semi-automating the source input, editing the output, and programmatically editing the output -- i.e., first word is capitalized, sentences, paragraphs, etc.
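For reference, a minimal word-level version of the simple-dictionary approach, with the random-selection bound handled so the LAST element can actually be chosen (Python sketch -- the app itself is C#; names are illustrative):

```python
import random

def build_chain(text, key_len=1):
    """Map each key_len-word window to the words that follow it."""
    words = text.split()
    chain = {}
    for i in range(len(words) - key_len):
        key = tuple(words[i:i + key_len])
        chain.setdefault(key, []).append(words[i + key_len])
    return chain

def generate(chain, length, rng=random):
    key = rng.choice(list(chain.keys()))
    out = list(key)
    for _ in range(length):
        followers = chain.get(tuple(out[-len(key):]))
        if not followers:
            break
        # rng.choice spans the whole list -- the off-by-one bug was an
        # exclusive upper bound that could never pick the last element
        out.append(rng.choice(followers))
    return " ".join(out)
```

Word vs. char vs. whitespace-significant parsing would vary what build_chain treats as a unit, per the tokenizer notes elsewhere on this page.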

!! Getting Source material
I'm building source inputs -- could be called [@TextGetters@], for want of a better name
Working on [@WebGetters@], to grab from XraysMonaLisa, Gutenberg, and eventually Textfiles.com
See: WordSalad/TextShopping WordSalad/Generators WordSalad/InternetMemeText WordSalad/LoremIpsum WordSalad/Spam for more ideas

I found that processing 60 pages of XRML provides a strikingly boring output. Due to a high lack of self-similarity, the rendered output doesn't vary all that much. Dropping the key-length amount is one option. Adding in alternate sources is another.
However, processing 60 XRML pages and an entire novel now skews towards the novel, and I want output to "look" like XRML.
So, need to figure out some methods of modifying non-XRML source to be more similar.

!!! potential transform rules
whitespace to punctuation
disemvowelling
random-caps
random-loss of letters
replace letters/vowells with punctuation
homophonic replacement
leet-speak replacement
translation of source-text? uh. I dunno. pseudo-translation, maybe, like the now-defunct snoop-dog translator
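Three of the cheaper rules above, sketched in Python (illustrative only; the randomness would want the same seeding/serializing treatment as the rest of the pipeline):

```python
import random

VOWELS = "aeiou"

def disemvowel(text):
    """Drop every vowel, keeping everything else."""
    return "".join(c for c in text if c.lower() not in VOWELS)

def random_caps(text, rng=random):
    """Flip each character to upper or lower case at random."""
    return "".join(c.upper() if rng.random() < 0.5 else c.lower()
                   for c in text)

def whitespace_to_punct(text, mark="."):
    """Replace every whitespace character with a punctuation mark."""
    return "".join(mark if c.isspace() else c for c in text)
```

Each fits the one-string-in, one-string-out shape, so they compose freely with whatever common rule interface the app settles on.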


!!! Scraping wikipedia

[[http://studentclub.ro/lucians_weblog/archive/2010/07/14/14798.aspx|extract markup from Wikipedia]]
[[http://studentclub.ro/lucians_weblog/archive/2008/07/13/parse-the-wikipedia-s-mediawiki-text-in-natural-language-format.aspx|Parse to natural language]]

http://www.evanjones.ca/software/wikipedia2text.html

!!! Scraping pmwiki (this site)
Added lines 78-79:
UPDATE: my first iteration is just scraping the HTML, using XPATH to get what I need (for both page-list links, and content)
Changed lines 84-85 from:
!!! auto-imported generated wiki pages
to:

!! auto-imported generated wiki pages
Changed lines 90-96 from:

!! Scraping wikipedia

[[http://studentclub.ro/lucians_weblog/archive/2010/07/14/14798.aspx|extract markup from Wikipedia]]
[[http://studentclub.ro/lucians_weblog/archive/2008/07/13/parse-the-wikipedia-s-mediawiki-text-in-natural-language-format.aspx|Parse to natural language]]

http://www.evanjones.ca/software/wikipedia2text.html
to:
Pushing output of the app back into the wiki.....

 
 
January 27, 2012, at 01:02 PM by OtherMichael -
Added lines 8-23:


!! Crazy Thoughts
Some NLP rules? Paragraphs, words, sentences, footnotes?
for XRML, need to split into the grid-size
What about word-breaking?
U&LC analysis?
word-substitution? line-references to each other? shifts?
Should this stuff be automated?
Uh, it's pie-in-the-sky. Of course it can be automated; is it worth it?
Density of text -- if input is all char-heavy, allow density filter to intrude by adding in chunks of periods?
but how probable are they?
If source text has NO PERIODS then adding 2000 pages of periods won't help, as those periods will only fire on a probability rule for the last letter before they start
need some rule for injecting these at OTHER points, beyond the existing markov rules
since I'm not strictly following a markovian generation -- I'm doing custom building. new rules.
hrm.....
 
 
January 26, 2012, at 04:38 PM by OtherMichael -
Added line 33:
Uh, since I'm writing an external application, it would probably be retrieving the HTML, anyway. so there.
 
 
January 26, 2012, at 12:28 PM by OtherMichael -
Added lines 21-41:


!! Scraping pmwiki (this site)
Specifically, I would like to process [XraysMonaLisa]

So, the following may be required:

http://www.pmichaud.com/pipermail/pmwiki-users/2008-February/049155.html

http://www.pmichaud.com/pipermail/pmwiki-users/2008-February/049148.html
http://www.pmwiki.org/wiki/Cookbook/TextExtract - was pointed to as a suggestion, but it turns out it can't really strip the markup. Will have to continue to look into this.
Possibly, render to HTML and THEN strip all tags?

Looks like there are some C# solutions:
http://www.dotnetperls.com/remove-html-tags
http://osherove.com/blog/2003/5/13/strip-html-tags-from-a-string-using-regular-expressions.html
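Regex stripping (as in the links above) works for simple cases, but a parser-based pass is sturdier against malformed markup. A Python sketch using the standard library (the project itself is C#, where a real HTML parser library would be the analogous route):

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collects only the text nodes, dropping all markup."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

def strip_tags(html):
    stripper = TagStripper()
    stripper.feed(html)
    return "".join(stripper.chunks)
```

Rendering the wiki page to HTML and then stripping tags this way sidesteps parsing pmwiki markup directly.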

!!! auto-imported generated wiki pages
http://www.pmwiki.org/wiki/Cookbook/ImportText

Hrm. That could be.... interesting....
 
 
January 25, 2012, at 05:27 PM by OtherMichael -
Added lines 1-31:
I'm (finally) building a (c#) application to do Markov processing on a variety of inputs.

Oh, yeah, there are plenty of those about.

I wanted one of my own, with the ability to set up a set of inputs, apply my own twiddling to the output, and some other things.

Instead of taking only from a defined source of text -- ie, file or textarea, I want to be able to pull from a number of online resources.

!! Application core
http://blog.figmentengine.com/2008/10/markov-chain-code.html
http://phalanx.spartansoft.org/2010/03/30/markov-chain-generator-in-c/
http://2kittymafiasoftware.blogspot.com/2011/03/pseudo-random-tex-generator-using.html
https://github.com/pjbss/Pseudo-Random-Text-Generator/blob/master/PseudoRandomTextGenerator/TextGenerator.cs


I'm using the last project, as it was the easiest to download. hah, so lazy!

need to look at other implementations.

However, the big thing for me is parameterizing/semi-automating the source input, editing the output, and programmatically editing the output -- ie, First word is capitalized, senteces, paragraphs, etc.


!! Scraping wikipedia

[[http://studentclub.ro/lucians_weblog/archive/2010/07/14/14798.aspx|extract markup from Wikipedia]]
[[http://studentclub.ro/lucians_weblog/archive/2008/07/13/parse-the-wikipedia-s-mediawiki-text-in-natural-language-format.aspx|Parse to natural language]]

http://www.evanjones.ca/software/wikipedia2text.html

!! See Also
WordSalad.ChainsOfLove