Information is power. We have encoded so much of our worldly knowledge in plain English -- but how do we codify this knowledge so that computers can power better decision-support systems?
Personal digital assistants like Amazon Alexa, Google Home, and Apple Siri have put natural language questions (and consequently dense answer banks) within everyone's reach. Thanks to many kind souls and bright researchers, knowledge bases like Wikipedia, Wikidata, Freebase, Never-Ending Language Learning (NELL), OpenCyc, and OLLIE already hold the precursory knowledge to bootstrap smart information assistants for everyone.
Do you envision using your content for the best customer outcomes? Then start encoding it into parallel SVO (subject-verb-object) triples and ontological SKOS concepts -- for if you do, both androids and anthropoids may use your content.
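As a minimal illustration of the triple idea (the store, predicate names, and lookup helper below are all hypothetical examples, not any standard ontology), a fact such as a date of birth can be encoded and queried as a subject-verb-object triple:

```python
# A minimal sketch of encoding facts as SVO (subject-verb-object) triples.
# The predicate names below are illustrative only.
triples = [
    ("Barack Obama", "bornOn", "1961-08-04"),
    ("Barack Obama", "instanceOf", "Person"),
]

def born_on(subject, store):
    # Scan the triple store for a bornOn fact about the subject
    return next((obj for (s, p, obj) in store
                 if s == subject and p == "bornOn"), None)

print(born_on("Barack Obama", triples))  # 1961-08-04
```

Once facts live in this shape, both a chatbot and a human-facing page can answer "When was X born?" from the same store.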
Below we create a simple date-of-birth genie that parses raw text and produces a date of birth for notable people in the world, using simple structural/boilerplate rules. We will extend this with linguistic analysis later.
Let us create a macro to get some unstructured text from Wikipedia for any given entity. Hopefully, it will work.
import wikipedia
# Connect to Wikipedia and fetch the summary for a given entity;
# the summary API already returns plain text, so no HTML parsing is needed
text = lambda entity: wikipedia.page(entity).summary
Let us find all occurrences of "born/birth", returning the coordinates (sentence index, word index) of each match.
Similarly, find all occurrences of dates in the text, again returning their coordinates.
Looking purely at structural context, it is easy to imagine that the date occurring closest to a "birth" mention is probably the date of birth.
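To make that proximity idea concrete, here is a toy sketch (coordinates invented for illustration) of using `scipy.spatial.cKDTree` to find the date mention nearest a "born" mention:

```python
from scipy.spatial import cKDTree

# Toy coordinates (invented for illustration): each date mention is a point
# at (sentence_index * 10, word_index), mirroring the sentence-penalty scheme
date_coords = [[0, 5], [20, 3], [40, 8]]
tree = cKDTree(date_coords)

# A "born" mention at sentence 0, word 2 -- query its nearest date mention
dist, idx = tree.query([0, 2], k=1)
print(idx)  # 0 -> the date at [0, 5] is closest
```

The same query, applied to real coordinates extracted from the article text, is what drives the candidate selection below.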
from nltk.tokenize import sent_tokenize, word_tokenize
import datefinder, itertools, re
from scipy.spatial import cKDTree
from babel.dates import format_date

# Parse the text into chunks of sentences
sentences = lambda entity: sent_tokenize(text(entity))

# Find a pattern and return its coordinates in the text
def find_occurrences(sentences, ptrn):
    # Compile a pattern from the lexical rule
    tgt_pattern = re.compile(ptrn, re.IGNORECASE)
    # Create a cache bag
    matches = []
    # Record (sentence, word) indexes where the match is found,
    # assuming each sentence adds a penalty of 10 words
    for (row, sentence) in enumerate(sentences):
        for (col, word) in enumerate(word_tokenize(sentence)):
            if tgt_pattern.match(word):
                matches.append((row, row * 10, col, word, sentence))
    return matches
print(sentences('Brian Krzanich'))
print(find_occurrences(sentences('Brian Krzanich'), "Born"))
print(find_occurrences(sentences('Brian Krzanich'), r"\d{4}"))
Although not demonstrated here, reading the date occurrences in a temporally ascending order reveals how quickly we may compile notable timelines/provenance information from raw text. Visually rendering these timelines over knowledge articles makes a great information feature.
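As a small stdlib-only sketch of that timeline idea (the passage and month regex below are invented for illustration; the pipeline in this article uses `datefinder` instead):

```python
import re
from datetime import datetime

# An invented passage with several dated events
passage = ("He became COO in January 2012, was appointed CEO in May 2013, "
           "and stepped down in June 2018.")

# Pull out "Month Year" mentions and sort them into a chronological timeline
months = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
mentions = re.findall(r"({0}) (\d{{4}})".format(months), passage)
timeline = sorted(datetime.strptime("{0} {1}".format(m, y), "%B %Y")
                  for m, y in mentions)
print([d.strftime("%Y-%m") for d in timeline])  # ['2012-01', '2013-05', '2018-06']
```

Rendering such a sorted list over a knowledge article is all a basic timeline view needs.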
Next, find the date occurrences closest to the born/birth mentions and attribute the nearest one as a guessed date of birth.
# Download the article and find possible candidates:
# scan the text for mentions of "born" and of numeric years
def candidate_dob(entity):
    # Fetch content and parse into sentences
    chunks = sentences(entity)
    # These are the 2-dimensional centers we want to find dates around
    born_centers = find_occurrences(chunks, 'born')
    # Sentences containing date-like (4-digit) tokens
    possible_date_sentences = find_occurrences(chunks, r'\d{4}')
    # Parse every candidate date, keeping its text coordinates
    date_centers = [(x[0], x[1], x[2], y)
                    for x in possible_date_sentences
                    for y in datefinder.find_dates(chunks[x[0]])]
    if not born_centers or not date_centers:
        return None
    # Initialize the gravity centers -- index all date coordinates in a KD-tree
    coordinate_tree = cKDTree([[x[1], x[2]] for x in date_centers])
    # Now match where the closest "born" occurrences are
    dist, indexes = coordinate_tree.query(
        [[x[1], x[2]] for x in born_centers], k=1)
    # Nominate the date nearest the first "born" mention
    (sentence, row_coord, column, possible_date) = date_centers[indexes[0]]
    # Format the date in a readable human form
    return format_date(possible_date, format='full', locale='en')
# Safe wrapper around candidate_dob
def find_dob(entity):
    if not entity:
        return None
    try:
        date = candidate_dob(entity)
        return ("{0} was born on {1}".format(entity, date) if date
                else "Could not find DOB for {0}".format(entity))
    except Exception:
        return "Cannot find a DOB mentioned in the article for {0}".format(entity)
To test our birthday genie interactively -- that is, to look up notable names via a textbox in this notebook -- we will use the interactive widgets module.
# The following widgets allow for interactively entering a person's name
from ipywidgets import interact
import ipywidgets as widgets
from IPython.display import display, HTML

# Display an interactive textbox for the user to enter a notable personality;
# we try a few notable names below as test cases
display(interact(find_dob, entity=widgets.Text()))
display(HTML('<h4>{0}</h4>'.format(find_dob('Donald Trump'))))
display(HTML('<h4>{0}</h4>'.format(find_dob('Manmohan Singh'))))
display(HTML('<h4>{0}</h4>'.format(find_dob('Barack Obama'))))
display(HTML('<h4>{0}</h4>'.format(find_dob('Brad Pitt'))))
display(HTML('<h4>{0}</h4>'.format(find_dob('George Clooney'))))
display(HTML('<h4>{0}</h4>'.format(find_dob('Superman'))))