While I’ve done a lot of work in Java, along with occasional front-end work, these days I spend most of my time in Python or R, turning to Bash, SQL and other tools when appropriate.
Representative samples of my work are shown below, with minimal redactions to protect confidential client names and/or implementation details.
While the details of the learning model are confidential, at a high level we used a sequence learning model in TensorFlow, running on Amazon SageMaker.
As is often the case, a great deal of the work consisted of data cleaning and analysis. Some of the steps along the way are shown below.
Features were designed and accumulated to represent the histories of millions of users.
Features shown here are all log-normalized, after clipping outliers.
(Feature names redacted)
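As a rough illustration of this preprocessing step, here is a minimal sketch in pandas; the real pipeline and feature names are confidential, so the table, column handling and quantile thresholds below are hypothetical.

```python
import numpy as np
import pandas as pd

def clip_and_log_normalize(features: pd.DataFrame, lower_q=0.01, upper_q=0.99) -> pd.DataFrame:
    """Clip each feature to a quantile range, then log-transform and rescale to [0, 1]."""
    out = features.copy()
    for col in out.columns:
        lo, hi = out[col].quantile([lower_q, upper_q])
        out[col] = out[col].clip(lo, hi)                  # clip outliers
        out[col] = np.log1p(out[col] - out[col].min())    # shift to non-negative, then log
        out[col] = out[col] / out[col].max()              # rescale for comparability across features
    return out

# hypothetical usage on a table with one row per user:
# normalized = clip_and_log_normalize(pd.read_parquet("user_features.parquet"))
```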
Using cluster analysis on these features, we segment the millions of users into six groups, with 34 sub-groups.
The matrix view at right shows how the groups compare to each other: each row represents a group and each column a feature, so each cell shows the group’s mean value for that feature.
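A minimal sketch of the idea, with k-means standing in for the actual (confidential) clustering method, and the group profile matrix computed as per-group feature means:

```python
import pandas as pd
from sklearn.cluster import KMeans

def segment_users(normalized: pd.DataFrame, n_groups: int = 6, n_subgroups: int = 34):
    """Assign each user to a group and sub-group, and build the group-by-feature mean matrix."""
    groups = KMeans(n_clusters=n_groups, random_state=0).fit_predict(normalized)
    subgroups = KMeans(n_clusters=n_subgroups, random_state=0).fit_predict(normalized)
    labeled = normalized.assign(group=groups, subgroup=subgroups)
    # one row per group, one column per feature: the matrix view described above
    group_profile = labeled.groupby("group").mean(numeric_only=True).drop(columns="subgroup")
    return labeled, group_profile
```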
Another way to get a feel for the groupings is to color-code the users in a projection of the 32 features onto the plane.
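A sketch of that projection, with PCA standing in for whichever dimensionality-reduction method is used:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_group_projection(labeled: pd.DataFrame):
    """Project the 32 features onto the plane and color each user by group."""
    features = labeled.drop(columns=["group", "subgroup"])
    coords = PCA(n_components=2).fit_transform(features)
    plt.scatter(coords[:, 0], coords[:, 1], c=labeled["group"], s=2, cmap="tab10")
    plt.xlabel("component 1")
    plt.ylabel("component 2")
    plt.show()
```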
Finally, within each group, we identify vulnerabilities, in the form of previously successful attacks, shown in red. Some groups are easy to attack, while others are less vulnerable.
With embeddings, we can define low-dimensional representations for all sorts of objects: very high-dimensional categorical variables like words (left), or even collections of things, such as the words in a document. More importantly, we can learn fixed-length representations of variable-sized sets like documents or baskets of purchases.
In this project, I used embeddings to characterize the space of purchases by a large industrial client. To represent these ‘baskets’ of purchases for a learning model, we require a fixed-length representation.
In the doc2vec formulation, to produce a document embedding, we start with a word2vec approach, and add in the document (paragraph) ID as part of the input vector. Just as word2vec lets you learn embeddings for words, here we learn embeddings of sets of items.
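The specifics of our implementation are confidential, but the general recipe can be sketched with gensim’s Doc2Vec, treating each basket as a ‘document’ of part IDs; the IDs and hyperparameters below are made up.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def train_basket_embeddings(baskets, dim=64):
    """Learn a fixed-length vector for each basket, regardless of how many items it contains."""
    docs = [TaggedDocument(words=basket, tags=[i]) for i, basket in enumerate(baskets)]
    model = Doc2Vec(docs, vector_size=dim, window=5, min_count=1, epochs=20, dm=1)
    return model

# hypothetical usage: each basket is a list of part IDs bought in one time-slice
# model = train_basket_embeddings([["P-104", "P-887", "P-012"], ["P-104", "P-330"]])
# model.dv[0]  # 64-dimensional embedding of the first basket
```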
In each time-slice, a varying number of items is bought. To deal with this, instead of modeling individual items we trained a sequence model on sequences of groups of parts; once trained, this yields embeddings for ‘item baskets’.
Finally, the basket embeddings are used as inputs to a sequence model, which learns to predict the next purchases for a client, given a known history of their purchases.
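Leaving out the confidential details, the shape of such a model in TensorFlow/Keras is roughly as follows; the layer sizes and training setup are purely illustrative.

```python
import tensorflow as tf

def build_next_basket_model(n_timesteps: int, dim: int = 64) -> tf.keras.Model:
    """Map a sequence of basket embeddings to a prediction of the next basket's embedding."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_timesteps, dim)),  # one basket embedding per time-slice
        tf.keras.layers.LSTM(128),
        tf.keras.layers.Dense(dim),                # predicted embedding of the next basket
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# hypothetical usage, with history of shape (n_clients, n_timesteps, dim)
# and targets of shape (n_clients, dim):
# model = build_next_basket_model(n_timesteps=12)
# model.fit(history, targets, epochs=10, batch_size=256)
```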
In addition to the scientific work, I led a team to build a client-facing simulation tool wrapping the model.
Working in close coordination with subject-matter specialist MDs, I designed a system that combined customizable decision trees with detailed data about drug effectiveness and side effects, along with information about patient and doctor populations.
At the model’s core was a population of Migraineur agents, divided into segments. Each agent was initialized with comorbidities, as well as demographic data like gender, age and income level, all of which, combined with drug cost data, influence migraineur and perhaps prescriber decisions.
Income was also used in an initial exploration of the influence of payers.
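The real decision logic is confidential, but a toy sketch of how such an agent might be initialized and choose a drug, with entirely hypothetical attributes and scoring, gives a feel for the approach.

```python
from dataclasses import dataclass, field

@dataclass
class Migraineur:
    segment: str
    age: int
    gender: str
    income: float
    comorbidities: list = field(default_factory=list)
    current_drug: str = ""

    def consider_switch(self, drug_options, cost_sensitivity=0.5):
        """Score each drug by effectiveness minus a cost burden scaled by income; pick the best."""
        def score(drug):
            return drug["effectiveness"] - cost_sensitivity * drug["annual_cost"] / max(self.income, 1.0)
        self.current_drug = max(drug_options, key=score)["name"]

# hypothetical usage:
# agent = Migraineur("chronic", 42, "F", 55_000, comorbidities=["anxiety"])
# agent.consider_switch([{"name": "drug_a", "effectiveness": 0.7, "annual_cost": 4_000},
#                        {"name": "drug_b", "effectiveness": 0.5, "annual_cost": 400}])
```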
The main simulation metrics were presented as interactive charts, as seen at right.
In order to understand properties that emerge at the population level, I find it’s often important to look under the hood and understand dynamics at the level of the individual.
Using the Migraineur Microscope (left), we went beyond aggregates by looking at individual migraineur trajectories.
The model simulated a population of ‘citizens’, impacted by the actions of two actors operating on critical resources like access to water, medical facilities, crop and food security, as well as general safety levels.
The simulation was grounded in the context of Afghan villages, with citizens becoming aware of actor actions either through direct experience or via communication channels.
We built a user-friendly UI to let analysts create and evaluate scenarios of impacts by either the allied forces (ISAF) or Taliban in specific locations and times.
The scenario evaluation was grounded in data about populations and communications options defined in coordination with SMEs, based on things like the geographic spread of radio stations, or the degree of trust between members of various tribes.
In addition to the impacts of localized actions, citizens were also impacted by communications efforts, for example via radio messaging, dropping of flyers, or in places of worship.
In order to calibrate our model, we made use of the Wikileaks data about improvised explosive device (IED) incidents during the Afghan war. We used the data, which provided incident locations and dates, as a proxy for localized sentiment. We assumed that an IED incident was often correlated with negative localized feeling towards ISAF and positive feelings towards the Taliban. More colloquially, we assumed that incidents were more likely given the tacit support of the local population.
After the model’s free parameters were optimized for best fit to the Wikileaks data, we were able to obtain predictive results from the model, shown in dashboard form at right.
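At a sketch level, that calibration amounts to minimizing the discrepancy between simulated sentiment and the incident-rate proxy; the interfaces below are hypothetical stand-ins for the real model.

```python
import numpy as np
from scipy.optimize import minimize

def calibrate(run_simulation, observed_incident_rates, initial_params):
    """Find free parameters minimizing the gap between simulated sentiment and the IED proxy."""
    def loss(params):
        predicted = run_simulation(params)  # e.g. per-district anti-ISAF sentiment
        return np.mean((np.asarray(predicted) - np.asarray(observed_incident_rates)) ** 2)
    result = minimize(loss, initial_params, method="Nelder-Mead")  # derivative-free search
    return result.x
```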
As is often the case, the first step consisted of a detailed analysis of their user population. User behavior varied not only by day and time of day (shown at left), but also by customer type. We segmented the population into several groups with well-defined behavior patterns, e.g., light watchers, midnight bingers, etc.
We were provided with a reference network architecture within which to evaluate traffic, with content cached in certain nodes, and accessed by users at others. Based on six behavioral clusters we identified in the user population, we built a system for generating synthetic network traffic, with simulated users choosing movies and shows to simulate network loads. A user-facing simulation tool was produced to design and compare different strategies for caching in response to varying assumptions about user makeup and demand.
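A toy sketch of the cluster-driven traffic generation, with made-up cluster names and parameters:

```python
import random

# hypothetical per-cluster behavior: relative activity by hour and typical session length
CLUSTER_PROFILES = {
    "light_watcher":   {"hourly_weights": [0.2] * 18 + [1.0] * 6, "titles_per_session": 1},
    "midnight_binger": {"hourly_weights": [0.1] * 22 + [2.0] * 2, "titles_per_session": 5},
    # ...remaining clusters omitted
}

def generate_requests(users, catalog, hours=24):
    """Yield (hour, user_id, title) request events for one simulated day."""
    for user_id, cluster in users:  # users: list of (id, cluster_name) pairs
        profile = CLUSTER_PROFILES[cluster]
        for hour in range(hours):
            activity = profile["hourly_weights"][hour % 24] / 10  # probability of a session
            if random.random() < activity:
                for _ in range(profile["titles_per_session"]):
                    yield hour, user_id, random.choice(catalog)
```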
Dashboards showed summaries of network traffic, problems, and network hot spots (right).
Our model was built on a rich data set describing all the details of competing insurance plans, along with years of market share data that we could use for calibration.
At its core, the model was an agent-based simulation of a population of seniors choosing their Medicare plans during certain eligibility periods. We had a wealth of demographic and behavioral information about the population, and worked with SMEs to develop seven archetypes of seniors, each with a different approach to decision making and plan selection.
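A toy sketch of archetype-driven plan choice; the archetype names, weights and plan attributes below are invented for illustration.

```python
# hypothetical archetypes: each weighs plan attributes differently when scoring options
ARCHETYPE_WEIGHTS = {
    "price_sensitive": {"premium": -1.0, "coverage": 0.2, "loyalty": 0.0},
    "brand_loyalist":  {"premium": -0.2, "coverage": 0.3, "loyalty": 1.0},
    # ...five more archetypes omitted
}

def choose_plan(senior, plans):
    """During an eligibility period, a senior scores each plan using their archetype's weights."""
    weights = ARCHETYPE_WEIGHTS[senior["archetype"]]
    def score(plan):
        loyalty = 1.0 if plan["carrier"] == senior.get("current_carrier") else 0.0
        return (weights["premium"] * plan["monthly_premium"]
                + weights["coverage"] * plan["coverage_score"]
                + weights["loyalty"] * loyalty)
    return max(plans, key=score)
```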
Given a well-calibrated model, the client was then able to evaluate the impact of changes to their plans, or game out the consequences of changes in their competitors’ plans.
After defining the plan changes and scenarios of interest, the client could simulate a year or two into the future and use the results to guide decision making.
While trying to get to the bottom of confusing model dynamics, I developed a useful visual inspection tool I call polyseries, as in ‘many time-series’. These are detailed further on the visualizations page, but in this context they show us how different segments of seniors have changed Medicare plans over a couple of simulated years.
The site would let you collect names you like, and would give you a continuously tuned stream of new names, derived from the ones you like and distinct from the ones you block.
I got to wear many hats on this project:
It was a fascinating experience having to scrape and distill our own data from disparate sources, cross-validating as well as integrating it with 100+ years of US Social Security data on name popularity.
The tool we developed was based on the concept of interactive evolution: