Data Bloom

Fact Extraction

Information is power. We have encoded so much of our worldly knowledge in plain English -- but how do we codify this knowledge for computers to make better decision support systems?

Amazon Alexa, Google Home, Apple Siri like personal digital assistants have put natural language questions (and consequently the dense answer bank) within everyone's reach. Thanks to the many kind souls and bright researchers, knowledge-bases like Wikipedia, Wikidata, Freebase, Never-Ending-Language-Learning, OpenCyc, OLLIE already have the precursory knowledge to bootstrap smart information assistants by commoners.

Do you envision using your content for best customer outcomes? Then start encoding your content into a parallel SVO triplets and ontological SKOs for best outcomes -- for if you do, both androids and anthropoids may use your content.

We create below a simple data-of-birth genie that parses and generates a date of birth for notable people in the world using simple structural/boilerplate rules. We will extend this more using linguistic analysis later.

Data

Let us create a macro to get some unstructured text from Wikipedia for any given entity. Hopefully, it will work.

In [1]:
import requests
import html2text, wikipedia

# Connect to Wikipedia and get relevant content for any entity
# Also parse HTML content into plain text
text = lambda entity: wikipedia.page(entity).summary

Analysis

Let us find all occurrences of "born/birth" -- return coordinates (Sentence, Word index) of all "birth" occurrences

Similarly find all occurrences of dates in the text -- again return coordinates of all dates.

Simply looking at structural context, it is easy to imagine dates that occur closest to the birth occurrence is probably a date of birth.

In [2]:
from nltk.tokenize import sent_tokenize, word_tokenize
import datefinder, itertools, re
from scipy.spatial import cKDTree
from babel.dates import format_date, format_datetime, format_time

# Parse the text into chunks of sentences
sentences = lambda entity: sent_tokenize(text(entity))

# Find a pattern and return coordinates in the text
def find_occurrences(sentences, ptrn):
    # Compile a pattern from lexical rule
    tgt_pattern = re.compile(ptrn, re.IGNORECASE)

    # Create a cache bag
    matches = []
    # Return sentence, word index where the match is found
    # Assuming each sentence adds a penalty of 10 words
    for (row, sentence) in enumerate(sentences):
        for (col, word) in enumerate(word_tokenize(sentence)):
            if tgt_pattern.match(word):
                matches.append((row, row * 10, col, word, sentence))
    return matches

Wiki Text

In [3]:
print sentences('Brian Krzanich')
[u'Brian Matthew Krzanich (born May 9, 1960) is the Chief Executive Officer of Intel.', u'He was elected CEO on May 2, 2013, concluding a six-month executive search after incumbent CEO Paul Otellini announced his resignation in November 2012.', u"Krzanich assumed the role of CEO on May 16, 2013 at the company's annual general meeting.", u"Before becoming CEO, he was Intel's Executive Vice President and Chief Operating Officer.", u"Krzanich earned a bachelor's degree in chemistry from San Jose State University and holds a patent for semiconductor processing.", u'He joined Intel in 1982 in New Mexico as an engineer.', u'He was promoted to COO in January 2012.', u'He often visits Intel-sponsored hackathons and Best Buys with his wife and two daughters.']

"Born" Occurrences

In [4]:
print find_occurrences(sentences('Brian Krzanich'), "Born")
[(0, 0, 4, u'born', u'Brian Matthew Krzanich (born May 9, 1960) is the Chief Executive Officer of Intel.')]

"Date" Occurrences

In [5]:
print find_occurrences(sentences('Brian Krzanich'), r"\d{4}")
[(0, 0, 8, u'1960', u'Brian Matthew Krzanich (born May 9, 1960) is the Chief Executive Officer of Intel.'), (1, 10, 8, u'2013', u'He was elected CEO on May 2, 2013, concluding a six-month executive search after incumbent CEO Paul Otellini announced his resignation in November 2012.'), (1, 10, 25, u'2012', u'He was elected CEO on May 2, 2013, concluding a six-month executive search after incumbent CEO Paul Otellini announced his resignation in November 2012.'), (2, 20, 10, u'2013', u"Krzanich assumed the role of CEO on May 16, 2013 at the company's annual general meeting."), (5, 50, 4, u'1982', u'He joined Intel in 1982 in New Mexico as an engineer.'), (6, 60, 7, u'2012', u'He was promoted to COO in January 2012.')]

Although not demonstrated here, reading the date occurrences in a temporally ascending order reveals how quickly we may compile notable timelines/provenance information from raw text. Visually rendering these timelines over knowledge articles makes a great information feature.

"Date" and "Birth" Intersection

Find closest occurrences of dates in the vicinity of born/birth mentions to attribute a guessed date-of-birth.

In [6]:
# Download and find possible candidates
# Scan for mention of numeric years and "born" in the text
def candidate_dob(entity):
    # fetch content and parse into sentences
    chunks = sentences(entity)

    # Let us imagine these are the 2-dimensional centers that we want to find dates around!
    born_centers = find_occurrences(chunks, 'born')

    # Date occurrences in sentences
    possible_date_sentences = find_occurrences(chunks, r'\d{4}')

    # All possible dates are found here. We want to nominate the closest birth occurrences to dates
    date_centers = [
        list((x[0], x[1], x[2], y)
             for y in datefinder.find_dates(chunks[x[0]]))
        for x in possible_date_sentences
    ]

    # Initialize the gravity centers -- all dates
    date_tree = [date for date in itertools.chain(*date_centers) if date]

    coordinate_tree = cKDTree(map(lambda x: [x[1], x[2]], date_tree))
    # Now match where closest "birth" occurrences are
    dist, indexes = coordinate_tree.query(
        map(lambda x: [x[1], x[2]], born_centers
            if len(born_centers) > 0 else [[0, 0]]),
        k=1)

    # Now all dates are sorted by their closest proximity to "birth" mentions
    cands = np.array(date_centers)[indexes]

    # Select top date candidate
    (sentence, row_coord, column, possible_date) = cands[0][
        0] if cands is not None and len(cands) > 0 and cands.any() else None

    # Format the date as a readable human form
    return format_date(
        possible_date, format='full', locale='en') if possible_date else None

# Safe wrapper
def find_dob(entity):
    if entity:
        try:
            date = candidate_dob(entity)
            return "{0} was born on {1}".format(entity, date) if date else "Could not find DOB for {0}".format(entity)
        except:
            return "Cannot find a DOB mentioned in the article for {0}".format(entity)

Interactive Dialog

In order to test and make our birthday genie interactive -- aka seek notable names via a textbox in this notebook -- we will use interactive widgets module.

In [7]:
# The following widgets allow for interactively seeking a person name
from IPython.html.widgets import interact
from IPython.html import widgets

# Display an interactive slider for user to choose enter a notable personality
# We use Martin Luther King Jr. as a test case.
display(interact(find_dob, entity=widgets.Text()));
'Martin Luther King was born on Tuesday, January 15, 1929'

Some other tests

In [8]:
display(HTML('<h4>{0}</h4>'.format(find_dob('Donald Trump'))))
display(HTML('<h4>{0}</h4>'.format(find_dob('Manmohan Singh'))))
display(HTML('<h4>{0}</h4>'.format(find_dob('Barack Obama'))))
display(HTML('<h4>{0}</h4>'.format(find_dob('Brad Pitt'))))
display(HTML('<h4>{0}</h4>'.format(find_dob('George Clooney'))))
display(HTML('<h4>{0}</h4>'.format(find_dob('Superman'))))

Donald Trump was born on Friday, June 14, 1946

Manmohan Singh was born on Monday, September 26, 1932

Barack Obama was born on Friday, August 4, 1961

Brad Pitt was born on Wednesday, December 18, 1963

George Clooney was born on Saturday, May 6, 1961

Cannot find a DOB mentioned in the article for Superman