Browsed by
Category: Big Data / Data Analysis

Making Sense: Extracting Meaning from Text

Making Sense: Extracting Meaning from Text by Matching Terms and Entities to Ontologies and Concepts

Text analytics is the means by which computer algorithms extract meaning and useful insights from raw text sources. This can have enormous impact in realms such as marketing, business intelligence, and political campaigns. However, text analytics is one of the toughest challenges in predictive analytics. Why is it so hard? Because, when done right, text analytics must effectively connect…

Novelty Detection in Text Corpora

Detecting Novelty Using Text Analytics

Detecting novel events (new words, meaning new events) is one of the most important text analytics tasks, and is an important step towards predictive analytics using text mining. On July 24, 2015, The New York Times (and many other news sources) published an article identifying the potential inclusion of classified information in the emails that Hillary Clinton had sent via private email and stored on her private email server. How would we use text…
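
As a rough illustration of the "new words, meaning new events" idea, here is a minimal Python sketch that flags words appearing repeatedly in newly arrived documents but never in a baseline corpus. It is not the method from the full post, which this excerpt does not show. The function names, the min_count threshold, and the toy documents are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def novel_terms(baseline_docs, new_docs, min_count=2):
    """Count words in new_docs that never occur in baseline_docs.

    min_count is a hypothetical threshold: a word must appear at least
    this many times in the new documents to be reported as novel.
    """
    known = set()
    for doc in baseline_docs:
        known.update(tokenize(doc))
    counts = Counter(
        word for doc in new_docs for word in tokenize(doc) if word not in known
    )
    return {word: n for word, n in counts.items() if n >= min_count}


# Invented toy corpora, for illustration only.
baseline = [
    "the senator gave a speech about the economy",
    "the campaign held a rally",
]
incoming = [
    "the senator used a private server",
    "the server held classified email",
    "classified email on the server",
]
novel = novel_terms(baseline, incoming)
print(novel)
```

A real system would also need to handle stemming, misspellings, and terms that are new only in combination. This sketch captures just the vocabulary-difference step.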

Ground-Truthing – The First Step in Predictive Analytics

Ground Truthing – Your First Day in Class

You may be joining me for the Summer 2015 class in Text Analytics (PREDICT 453) that I’ll be teaching through Northwestern University’s Master of Science in Predictive Analytics (MSPA) program, starting June 22, 2015. Or you may not. Either way, you can follow this class virtually, here through this blog. I’ll be posting lots of information about the upcoming class, and referencing my permanent pages that I’m setting up prior…

The 1-D Cluster Variation Method (CVM) – Simple Application

The 1-D Cluster Variation Method – Application to Text Mining and Data Mining

There are three particularly good reasons for us to look at the Cluster Variation Method (CVM) as an alternative means of understanding the information in a system:
1) The CVM captures local pattern distributions (for an equilibrium state),
2) When the system is made up of equal numbers of units in each of two states, and the enthalpy for each state is the same (the simple unit activation energy…

Why Nonadditive Entropy Is Important for Big Data Corpora Combinations

Non-Additive Entropy – A Crucial Predictive Analysis Measure for Data Mining in Multiple Large Data Corpora

Statistical mechanics has an important role to play in big data analytics. Up until now, there has been almost no understanding of how statistical mechanics provides both practical value and a theoretical framework for data analysis and even predictive intelligence (sometimes called predictive analysis). In a separate White Paper (link to be provided), I identify, for the first time, how statistical mechanics,…

Big Data, Big Graphs, and Graph Theory: Tools and Methods

Big Graphs Need Specialized Data Storage and Computational Methods {A Working Blogpost – Notes for research & study}

Processing large-scale graph data: A guide to current technology, by Sherif Sakr (ssakr@cse.unsw.edu.au), IBM developerWorks (10 June 2013). Note: Dr. Sherif Sakr is a senior research scientist in the Software Systems Group at National ICT Australia (NICTA), Sydney, Australia. He is also a conjoint senior lecturer in the School of Computer Science and Engineering at the University of New South Wales. He…

Chapter 2 (Part 3), Senellart & Blondel – Automatic Discovery of Similar Words

In Section 2.3, we get to the meat of Senellart & Blondel’s work, which is a graph-based method for determining similar words, using a dictionary as source. Their method uses a v × v matrix, where v is the number of words in the dictionary and each row and column corresponds to one word. They compare their method and results with those of Kleinberg, who proposes a method for determining good Web hubs and authorities, and with the ArcRank and WordNet methods. They test the four methods on four words: disappear, parallelogram,…
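
To make the hubs-and-authorities comparison concrete, here is a minimal Python sketch of Kleinberg-style hub and authority scoring, run by power iteration on a v × v adjacency matrix. This is the comparison method named above, not Senellart & Blondel's own algorithm. The four-word graph is invented for the example, and the edge convention (word i points to word j when j appears in the definition of i) is an assumption.

```python
def hits(adjacency, iterations=50):
    """Return (hubs, authorities) for a square 0/1 adjacency matrix.

    adjacency[i][j] == 1 means there is an edge from node i to node j.
    Scores are rescaled to unit Euclidean length after every pass.
    """
    n = len(adjacency)
    hubs = [1.0] * n
    auths = [1.0] * n
    for _ in range(iterations):
        # Authority of j: sum of the hub scores of nodes pointing to j.
        auths = [sum(adjacency[i][j] * hubs[i] for i in range(n)) for j in range(n)]
        # Hub of i: sum of the authority scores of nodes that i points to.
        hubs = [sum(adjacency[i][j] * auths[j] for j in range(n)) for i in range(n)]
        # Normalize so the scores do not grow without bound.
        a_norm = sum(a * a for a in auths) ** 0.5 or 1.0
        h_norm = sum(h * h for h in hubs) ** 0.5 or 1.0
        auths = [a / a_norm for a in auths]
        hubs = [h / h_norm for h in hubs]
    return hubs, auths


# Hypothetical toy dictionary graph (assumed edge convention: row word's
# definition uses column word).
words = ["vanish", "disappear", "fade", "gone"]
adjacency = [
    [0, 1, 0, 1],  # vanish -> disappear, gone
    [0, 0, 0, 0],  # disappear
    [0, 1, 0, 0],  # fade -> disappear
    [0, 0, 0, 0],  # gone
]
hubs, auths = hits(adjacency)
for word, h, a in zip(words, hubs, auths):
    print(f"{word:10s} hub={h:.3f} authority={a:.3f}")
```

In this toy graph, "disappear" gets the highest authority score because two words point to it, and "vanish" gets the highest hub score because it points to both authorities.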

Chapter 2 Review, Continued, Part 2 — "Automatic Discovery of Similar Words"

(Direct continuation of yesterday’s post, w/r/t Senellart & Blondel on “Automatic Discovery of Similar Words” in Survey of Text Mining II. I give the references that they cite, and which I discuss in this post, at the end of the post.) In Chapter 2’s review of previous methods and associated literature, Senellart & Blondel start with the banal and get progressively more interesting. The one thing I found interesting in the first model that Senellart and Blondel discussed was that the model was…

"Automatic Discovery of Similar Words" – Chapter 2 in Survey of Text Mining II

This post begins a review of “Automatic Discovery of Similar Words,” by Pierre Senellart and Vincent D. Blondel, published as Chapter 2 in Berry and Castellanos’ Survey of Text Mining II. This is an excellent and useful chapter, in that it:
1) Addresses the broad issue of computational methods for discovering “similar words” (including synonyms, near-synonyms, and thesauri-generating techniques) from large data corpora,
2) Illustrates the different leading mathematical methods, giving an excellent overview of the SoA,
3) Competently discusses how different methods…

Follow-on Thoughts: Clustering Algorithm Improvements for Text-based Data Mining

A good night’s sleep is excellent for clearing away mental cobwebs, and has given me more perspective on Chapter 1, “Cluster-Preserving Dimension Reduction Methods,” by Howland and Park in Survey of Text Mining II: Clustering, Classification, and Retrieval (ed. by Berry & Castellanos). If you will, please recall with me that the Howland & Park work proposed a two-step dimensionality reduction method. They successfully reduced over 20,000 “dimensions” (of words found in the overall corpus collection) to four dimensions, and…
