Browsed by
Category: Knowledge Discovery & Mgmnt

Novelty Detection in Text Corpora

Detecting Novelty Using Text Analytics

Detecting novel events (new words, meaning new events) is one of the most important text analytics tasks, and an important step towards predictive analytics using text mining. On July 24, 2015, The New York Times (and many other news sources) published an article identifying the potential inclusion of classified information in emails that Hillary Clinton had sent via private email and stored on her private email server. How would we use text…
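As a minimal sketch of the "new words, meaning new events" idea, the snippet below flags terms in an incoming document that never occurred in a background corpus. It is an illustration only, not the method from the post (which the excerpt does not show); the function names, the `min_count` parameter, and the toy sentences are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def novel_terms(background_docs, new_doc, min_count=1):
    """Return {term: count} for terms in new_doc that never occur in background_docs.

    min_count is a hypothetical threshold for ignoring terms seen too rarely
    in the new document to be worth flagging.
    """
    known = set()
    for doc in background_docs:
        known.update(tokenize(doc))
    counts = Counter(tokenize(new_doc))
    return {
        term: n
        for term, n in counts.items()
        if term not in known and n >= min_count
    }


# Toy data, invented for illustration only.
background = [
    "The senator discussed the budget at a press conference.",
    "The campaign released a statement about the budget.",
]
incoming = "The campaign discussed classified emails; the emails were stored on a server."

result = novel_terms(background, incoming)
print(sorted(result.items()))
```

A realistic pipeline would also need stop-word handling and stemming, since function words such as "were" and "on" are flagged here simply because the background corpus is tiny.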

Ground-Truthing – The First Step in Predictive Analytics

Ground Truthing – Your First Day in Class

You may be joining me for the Summer 2015 class in Text Analytics (PREDICT 453) that I’ll be teaching through Northwestern University’s Master of Science in Predictive Analytics (MSPA) program, starting June 22, 2015. Or you may not. Either way, you can follow this class virtually here, through this blog. I’ll be posting lots of information about the upcoming class and referencing my permanent pages that I’m setting up prior…

"Nonadditive Entropy" – An Excellent Review Article

New Advances in Entropy Formulation – “Nonadditive Entropy”

Well, chalk it up to being newly returned to the fold: after years of work in knowledge discovery, predictive analysis, neural networks, and sensor fusion, I’m finally returning to my roots and reinvigorating some previous work that involves the Cluster Variation Method. In the course of this, I’ve just learned (as a Janie-come-lately) about the major evolution in thinking about entropy, largely led by Constantino Tsallis. He has an excellent review…

Chapter 2 (Part 3), Senellart & Blondel – Automatic Discovery of Similar Words

In Section 2.3, we get to the meat of Senellart & Blondel’s work, which is a graph-based method for determining similar words, using a dictionary as the source. Their method uses a v × v matrix, with one row and one column for each word in the dictionary. They compare their method and results with those of Kleinberg, who proposes a method for determining good Web hubs and authorities, and with the ArcRank and WordNet methods. They test the four methods on four words: disappear, parallelogram,…
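To make the matrix picture concrete, here is a small sketch that builds a word-by-word adjacency matrix from a toy dictionary and runs Kleinberg's hubs-and-authorities iteration on it. Note that this is Kleinberg's method, one of the comparison baselines, and not Senellart & Blondel's own algorithm, which the excerpt does not spell out. The edge rule (word i links to word j when j appears in the definition of i), the function names, and the four toy definitions are assumptions made for illustration.

```python
import math


def dictionary_graph(definitions):
    """Build a 0/1 adjacency matrix over the dictionary's headwords.

    Assumed edge rule: adj[i][j] = 1 when headword j appears in the
    definition of headword i.
    """
    words = sorted(definitions)
    index = {w: i for i, w in enumerate(words)}
    n = len(words)
    adj = [[0] * n for _ in range(n)]
    for head, definition in definitions.items():
        for token in definition.lower().split():
            if token in index and token != head:
                adj[index[head]][index[token]] = 1
    return words, adj


def normalize(vec):
    """Scale a vector to unit Euclidean length (left unchanged if all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def hits(adj, iterations=50):
    """Kleinberg's iteration: authorities collect hub scores, hubs collect authority scores."""
    n = len(adj)
    hubs = [1.0] * n
    auths = [1.0] * n
    for _ in range(iterations):
        auths = normalize(
            [sum(adj[i][j] * hubs[i] for i in range(n)) for j in range(n)]
        )
        hubs = normalize(
            [sum(adj[i][j] * auths[j] for j in range(n)) for i in range(n)]
        )
    return hubs, auths


# Toy dictionary, invented for illustration only.
definitions = {
    "vanish": "disappear suddenly",
    "disappear": "vanish from sight",
    "fade": "disappear gradually",
    "sight": "the ability to see",
}

words, adj = dictionary_graph(definitions)
hubs, auths = hits(adj)
top_authority = words[auths.index(max(auths))]
print(top_authority)
```

In this toy graph "disappear" is used in two definitions, so it collects the highest authority score, while the words that define themselves through it ("vanish", "fade") act as its hubs.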

Chapter 2 Review, Continued, Part 2 — "Automatic Discovery of Similar Words"

(Direct continuation of yesterday’s post, w/r/t Senellart & Blondel on “Automatic Discovery of Similar Words” in Survey of Text Mining II. The references they cite, which I discuss in this post, are listed at the end of the post.) In Chapter 2’s review of previous methods and associated literature, Senellart & Blondel start with the banal and get progressively more interesting. The one thing I found interesting in the first model that Senellart and Blondel discussed was that the model was…

"Automatic Discovery of Similar Words" – Chapter 2 in Survey of Text Mining II

This post begins a review of “Automatic Discovery of Similar Words,” by Pierre Senellart and Vincent D. Blondel, published as Chapter 2 in Berry and Castellanos’ Survey of Text Mining II. This is an excellent and useful chapter, in that it:
1) Addresses the broad issue of computational methods for discovering “similar words” (including synonyms, near-synonyms, and thesauri-generating techniques) from large data corpora,
2) Illustrates the different leading mathematical methods, giving an excellent overview of the state of the art (SoA),
3) Competently discusses how different methods…

Follow-on Thoughts: Clustering Algorithm Improvements for Text-based Data Mining

A good night’s sleep is excellent for clearing away mental cobwebs, and has given me more perspective on Chapter 1, “Cluster-Preserving Dimension Reduction Methods,” by Howland and Park in Survey of Text Mining II: Clustering, Classification, and Retrieval (ed. by Berry & Castellanos). If you will, please recall with me that the Howland & Park work proposed a two-step dimensionality reduction method. They successfully reduced over 20,000 “dimensions” (of words found in the overall corpus collection) to four dimensions, and…
