There is a tremendous amount of textual information on the Web. When material facts are trumped by "alternative facts" and opinions, wading through the discourse can be extremely difficult for consumers. Below, we build a very simple fact-o-meter so a reader's attention can lock onto facts quickly. Whether you are a media company, an information company, or an enterprise building collaboration portals, helping sift opinions from facts with simple iconography can be tremendously useful, and helping summarize content can mean productivity and experience gains for your customers.
# The URL of the whitehouse executive order
uri = 'https://www.whitehouse.gov/the-press-office/2017/01/27/executive-order-protecting-nation-foreign-terrorist-entry-united-states'
# Fetch the web page
import requests
import lxml.html
from IPython.display import HTML, display
# Get the document object model (DOM) for a URI
dom = lambda uri: lxml.html.fromstring(requests.get(uri).content)
# The executive order HTML content; document object model
exec_order_from_potus = dom(uri)
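The one-line dom helper assumes the request always succeeds. A slightly more defensive variant is sketched below; the timeout value and the raise_for_status check are illustrative choices, not requirements.
# A hypothetical defensive fetch; fails loudly instead of parsing an error page
def dom_safe(uri, timeout=10):
    response = requests.get(uri, timeout=timeout)
    response.raise_for_status()
    return lxml.html.fromstring(response.content)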
# Convert into TextBlob for analysis
from textblob import TextBlob
# Apply XPath to avoid all the boilerplate text from whitehouse.gov
discourse = TextBlob('\n'.join([
    para for para in exec_order_from_potus.xpath(
        '//*[@id="content-start"]/div[3]/div/div//text()')
    if para.strip()
]))
# Print preview of the executive order
HTML(discourse.string.encode('ascii', errors='ignore').decode('ascii').replace('\n', '<br/>'))
What keywords can be gleaned from the order? Let us paint a word cloud of all the noun mentions.
# Some imports
import matplotlib.pyplot as plt
import wordcloud
from PIL import Image
from io import BytesIO
import numpy as np
# Get a mask of the whitehouse image to paint our wordcloud in
whitehouse_mask = np.array(
    Image.open(BytesIO(requests.get(
        'https://static.vecteezy.com/system/resources/previews/000/057/818/non_2x/the-white-house-vector.jpg'
    ).content)))
extent = 0, 700, 0, 490
# Initialize a cloud palette
cloud_wh = wordcloud.WordCloud(width=700, height=490, background_color='#eee', mask=whitehouse_mask)
cloud_sq = wordcloud.WordCloud(width=700, height=490, background_color='white')
# Generate the word cloud in the whitehouse overlay
cloud_wh.generate_from_frequencies(dict(discourse.np_counts))
cloud_sq.generate_from_frequencies(dict(discourse.np_counts))
fig, axes = plt.subplots(1, 2, figsize=(28,9.8))
# Show a preview of the nouns from the executive order; paint it in a whitehouse silhouette
axes[1].imshow(whitehouse_mask, extent=extent)
axes[1].imshow(cloud_wh, cmap=plt.cm.gray, alpha=0.98, extent=extent)
axes[1].axis("off")
# Also show the raw word cloud
axes[0].imshow(cloud_sq)
axes[0].axis("off")
plt.show()
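Before reading the clouds, it is worth sanity-checking the counts that feed them. A quick peek at the ten most frequent noun phrases, reusing the same np_counts mapping the clouds are drawn from:
# Preview the ten most frequent noun phrases
top_nouns = sorted(discourse.np_counts.items(), key=lambda kv: kv[1], reverse=True)[:10]
for phrase, count in top_nouns:
    print('{0:>4}  {1}'.format(count, phrase))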
For each sentence (tile or paragraph chunk), we give a modality score for its factuality versus opinion. Modality is a score in the range [-1, 1] that reflects how certain, i.e. how "factual" (non-opinionated), a statement is. Higher modality scores (> 0.8) usually indicate facts; anything less signals dampened confidence in the factuality of the sentence.
The polarity score is a float within the range [-1.0, 1.0]; negative values indicate negative sentiment.
The subjectivity score is a float within the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective. For facts you want objectivity to be high, meaning you should look for values closer to 0.0.
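Putting these signals together, a crude fact-o-meter can be sketched as a simple thresholded rule. This is a minimal sketch: the thresholds (0.8 modality, 0.3 subjectivity) and the helper name looks_factual are illustrative choices, not calibrated values.
from textblob import TextBlob
from pattern.en import parse, Sentence, modality

def looks_factual(text, min_modality=0.8, max_subjectivity=0.3):
    # High certainty (modality) plus low subjectivity suggests a factual statement
    certainty = modality(Sentence(parse(text, lemmata=True)))
    return certainty > min_modality and TextBlob(text).sentiment.subjectivity < max_subjectivity

for example in ('It is hot in summer', 'I think it is cold hearted'):
    print(example, '->', looks_factual(example))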
Let us also identify all the sentences that carry a high noun quotient. Although this is not true summary generation, the quotient should quickly surface the sentences worth closer attention.
import pandas as pd
from pattern.en import parse, Sentence, mood, modality
# Get the noun-phrase occurrence scores for each sentence
np_scores = discourse.np_counts

# Score each sentence: sentiment polarity and subjectivity, mood and
# modality (how assertive a statement is), and a 'Nounity' quotient
# for the noun-phrase density, used as a crude summary indicator
def score(s):
    parsed = Sentence(parse(s.string, lemmata=True))
    nounity = sum(np_scores.get(str(w), 0) for w in s.noun_phrases) \
        / float(max(1, len(s.noun_phrases)))
    return (s.string, s.sentiment.polarity, s.sentiment.subjectivity,
            mood(parsed), modality(parsed), nounity)

# Exclude short sentences (fewer than five words) and score the rest
sentences = pd.DataFrame(
    [score(s) for s in discourse.sentences if len(s.words) >= 5],
    columns=[
        'Sentence', 'Polarity', 'Subjectivity', 'Mood', 'Modality', 'Nounity'
    ])
pd_display(sentences, "Sentences scored")
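With the scores in a DataFrame, surfacing candidate summary sentences becomes a one-liner. For instance, the five sentences with the highest noun quotient:
# The five sentences with the densest noun-phrase coverage
pd_display(sentences.nlargest(5, 'Nounity')[['Sentence', 'Nounity']],
           "Top sentences by noun quotient")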
To ground these scores, it helps to illustrate the individual sentiment and modality metrics with simple examples.
# Some tests to show if the intended functions/metrics are correct
# A subjective statement with negative sentiment
tblob_sentiment = TextBlob("I think it is cold hearted")
# A very assertive, factual statement
tblob_fact = TextBlob("It is hot in summer")
# A simple sentence with nouns
tblob_nouns = TextBlob(
"Donald J. Trump is the President of the United States of America")
# Sentiment first
display(
    HTML('The polarity and subjectivity of <u>"{0}"</u> are {1} and {2}'.format(
        tblob_sentiment.string, tblob_sentiment.sentiment.polarity,
        tblob_sentiment.sentiment.subjectivity)))
# Modality second; wrap the parse in a Sentence as pattern.en expects
parsed_fact = Sentence(parse(tblob_fact.string, lemmata=True))
display(
    HTML('The mood and modality of <u>"{0}"</u> are {1} and {2}'.format(
        tblob_fact.string, mood(parsed_fact), modality(parsed_fact))))
# Nouns third
display(
    HTML('The key entities of <u>"{0}"</u> are {1}'.format(
        tblob_nouns.string, tblob_nouns.noun_phrases)))
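For contrast, hedged phrasing should pull the modality down: epistemic markers such as "might" or "I think" lower the certainty estimate (the exact values depend on pattern's lexicon).
# Modality should drop as hedging increases
for statement in ('It is hot in summer',
                  'It might be hot in summer',
                  'I think it could be hot in summer'):
    parsed = Sentence(parse(statement, lemmata=True))
    print('{0:5.2f}  {1}'.format(modality(parsed), statement))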
The noun-phrase frequency scores from the executive order are raw, unnormalized counts, and although TextBlob normalizes some of its scores, the mix of ranges, some in [-1, 1] and others in [0, 1], makes interpretation hard. Let us normalize all the numeric scores to a [0, 1] scale.
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
# Scale only the numeric columns; carry the text columns over unchanged
numeric = sentences.select_dtypes(include=['number'])
textual = sentences[[col for col in sentences.columns if col not in numeric.columns]]
scored_sentences = pd.concat(
    [textual,
     pd.DataFrame(min_max_scaler.fit_transform(numeric), columns=numeric.columns)],
    axis=1)
pd_display(scored_sentences, "Sentence scores normalized to a unit scale")
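As a quick check that the scaling behaved, every numeric column should now span exactly [0, 1]:
# Each scaled column's minimum should be 0.0 and its maximum 1.0
pd_display(scored_sentences.select_dtypes(include=['number']).agg(['min', 'max']),
           "Column ranges after min-max scaling")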
Facts, i.e. sentences whose normalized modality exceeds 0.75 and whose mood is imperative (executive orders issue directives), are rendered in black below; everything else is grayed out.
HTML('<p>'.join([
    '<span style="color:{1}">{0}</span>'.format(
        row['Sentence'].encode('ascii', errors='ignore').decode('ascii')
        .replace('\n', '<br/>'),
        'black' if row['Modality'] > 0.75 and row['Mood'] == 'imperative'
        else 'gray')
    for (index, row) in scored_sentences.iterrows()
]))