I’ve been involved in dozens of projects in many capacities, from lead scientist to software developer to visualization specialist. I have also managed many teams, though I find hands-on work more rewarding.

While I’ve done a lot of work in Java, along with occasional front-end work, I spend most of my time these days in Python or R, turning to Bash, SQL, and other tools when appropriate.

Representative samples of my work are shown below, with minimal redactions to protect confidential client names and implementation details.

Selected Projects

Cybersecurity Automation: I designed and built a pair of recommender systems for a leading cybersecurity training company, automating the selection of attacks and training approaches based on users’ interaction histories and other available information.

While the details of the learning model are confidential, at a high level we used a sequence learning model in TensorFlow, running on Amazon SageMaker.
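Since the production model itself is confidential, the sketch below only illustrates the general shape of such a sequence recommender in TensorFlow (Keras); the vocabulary size, history length, and layer sizes are purely illustrative.

    import tensorflow as tf

    N_ITEMS = 500   # hypothetical number of distinct attacks / training items
    EMB_DIM = 32    # illustrative embedding size

    model = tf.keras.Sequential([
        # Each user history is a padded sequence of item IDs.
        tf.keras.layers.Embedding(N_ITEMS, EMB_DIM, mask_zero=True),
        tf.keras.layers.LSTM(64),                              # summarize the history
        tf.keras.layers.Dense(N_ITEMS, activation="softmax"),  # score the next item
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    # model.fit(padded_histories, next_item_ids, epochs=..., batch_size=...)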

As is often the case, a great deal of the work consisted of data cleaning and analysis. Some of the steps along the way are shown below.

Features were designed and accumulated to represent the histories of millions of users.

Features shown here are all log-normalized, after clipping outliers.
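A minimal version of that preprocessing step, assuming the raw per-user features sit in a pandas DataFrame of non-negative values (the clipping percentiles here are illustrative):

    import numpy as np
    import pandas as pd

    def log_normalize(raw: pd.DataFrame) -> pd.DataFrame:
        # Clip each feature to its 1st/99th percentile to tame outliers,
        # then log-transform and standardize.
        clipped = raw.clip(lower=raw.quantile(0.01), upper=raw.quantile(0.99), axis=1)
        logged = np.log1p(clipped)
        return (logged - logged.mean()) / logged.std()

    # features = log_normalize(raw_features)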

(Feature names redacted)

Using cluster analysis on these features, we segmented the millions of users into six groups with 34 sub-groups.

The matrix view at right shows how the groups compare to each other: each row in the figure is a group and each column is a feature, so each cell shows the group’s mean value for that feature.
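A sketch of that segmentation and of the matrix of group means, using k-means as a stand-in for whichever clustering method was actually applied to the normalized `features` DataFrame:

    from sklearn.cluster import KMeans

    # Six top-level groups, as in the project; sub-group discovery is omitted here.
    labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(features)

    # Rows = groups, columns = features, cells = mean feature value per group,
    # i.e. the matrix shown in the figure.
    group_profile = features.groupby(labels).mean()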

(Figure: mean feature values by group)

Another way to get a feel for the groupings is to color-code the users in a projection of the 32 features onto the plane.
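For example, a quick 2-D view can be built with PCA (used here only as a stand-in for the actual projection method), reusing `features` and the cluster `labels` from above:

    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    xy = PCA(n_components=2).fit_transform(features)
    plt.scatter(xy[:, 0], xy[:, 1], c=labels, s=2, cmap="tab10")
    plt.title("Users projected onto the plane, colored by group")
    plt.show()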

Finally, within each group, we identify vulnerabilities, in the form of previously successful attacks, shown in red. Some groups are easy to attack, while others are less vulnerable.

Purchase Forecasting Using Embeddings: I used ‘basket embeddings’ to predict corporate purchase patterns. This research explored transferring the concepts of the doc2vec algorithm to a different domain.

With embeddings, we can define low-dimensional representations for very high-dimensional categorical variables like words (left), or even for collections of things, such as the words in a document. More importantly, we can learn fixed-length representations of variable-sized sets like documents or baskets of purchases.

In this project, I used embeddings to characterize the space of purchases by a large industrial client. To represent these ‘baskets’ of purchases for a learning model, we require a fixed-length representation.

In the doc2vec formulation, to produce a document embedding, we start with a word2vec approach, and add in the document (paragraph) ID as part of the input vector. Just as word2vec lets you learn embeddings for words, here we learn embeddings of sets of items.
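A minimal sketch of this idea, treating each purchase basket as a ‘document’ of part IDs and using gensim’s Doc2Vec; the data shapes and hyperparameters are illustrative rather than the project’s actual configuration.

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # baskets: list of baskets, each a list of part IDs, e.g. [["p17", "p42", "p42"], ...]
    tagged = [TaggedDocument(words=basket, tags=[i]) for i, basket in enumerate(baskets)]
    model = Doc2Vec(tagged, vector_size=64, window=8, min_count=1, dm=1, epochs=20)

    # Fixed-length embedding for each basket (gensim 4.x API).
    basket_vectors = [model.dv[i] for i in range(len(baskets))]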

In each time slice, a varying number of items is bought. To handle this, instead of modeling individual items, we trained a sequence model on sequences of groups of parts; once trained, this yields embeddings for ‘item baskets’.

Finally, the basket embeddings are used as inputs to a sequence model, which learns to predict a client’s next purchases given their known purchase history.
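One plausible formulation of that forecaster, sketched in Keras, with the next basket’s embedding as the regression target (the project’s actual output target and architecture are not shown here):

    import tensorflow as tf

    EMB_DIM = 64   # must match the basket-embedding size
    SEQ_LEN = 12   # hypothetical number of past time slices per client

    forecaster = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(SEQ_LEN, EMB_DIM)),  # sequence of basket embeddings
        tf.keras.layers.LSTM(128),
        tf.keras.layers.Dense(EMB_DIM),                   # predicted next-basket embedding
    ])
    forecaster.compile(optimizer="adam", loss="mse")
    # forecaster.fit(past_basket_sequences, next_basket_embeddings, ...)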

Simulating Patient/Doctor/Insurer Populations: In this agent-based modeling example, developed for a leading Pharma company, I simulated decision making in a population-level model of doctors, patients and insurers in relation to the consumption and prescribing of migraine drugs.

In addition to the scientific work, I led a team to build a client-facing simulation tool wrapping the model.

Working in close coordination with subject-matter specialist MDs, I designed a system that combined customizable decision trees with detailed data about drug effectiveness and side effects, along with information about patient and doctor populations.

At the model’s core was a population of Migraineur agents, divided into segments. Each agent was initialized with comorbidities as well as demographic data like gender, age, and income level, all of which, combined with drug cost data, influenced migraineur and, in some cases, prescriber decisions.
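A toy sketch of such an agent; the attribute names and the decision rule are deliberately simplified stand-ins for the SME-designed decision trees and detailed drug data used in the real model.

    import random
    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class Migraineur:
        age: int
        gender: str
        income: float
        comorbidities: list = field(default_factory=list)
        current_drug: Optional[str] = None

        def consider_switch(self, drug: str, efficacy: float, side_effects: float, cost: float):
            # Toy rule: weigh efficacy against side effects and affordability.
            affordability = min(1.0, self.income / (12 * cost + 1))
            utility = max(0.0, efficacy - 0.5 * side_effects)
            if random.random() < utility * affordability:
                self.current_drug = drug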

Income was also used in an initial exploration of the influence of payers.

The main simulation metrics were presented as interactive charts, as seen at right.

In order to understand properties that are emergent at a population level, I find it’s often important to look under the hood, and understand dynamics at the level of the individual.

Using the Migraineur Microscope (left), we went beyond aggregates by looking at individual migraineur trajectories.

Modeling of Sentiment at Population Level: This was a project funded by the U.S. Navy; I worked with experts on Afghan inter-tribal dynamics to build a tool for simulating sentiment in the Afghan theater.

The model represented a population of ‘citizens’ impacted by the actions of two actors operating on critical resources such as access to water, medical facilities, crop and food security, and general safety levels.

The simulation was grounded in the context of Afghan villages, and citizen awareness of actor actions was either through direct experience or via communication channels.

We built a user-friendly UI to let analysts create and evaluate scenarios of impacts by either the allied forces (ISAF) or Taliban in specific locations and times.

The scenario evaluation was grounded in data about populations and communications options defined in coordination with SMEs, based on things like the geographic spread of radio stations, or the degree of trust between members of various tribes.

In addition to the impacts of localized actions, citizens were also impacted by communications efforts, for example via radio messaging, dropping of flyers, or in places of worship.

In order to calibrate our model, we made use of the Wikileaks data about improvised explosive device (IED) incidents during the Afghan war. We used the data, which provided incident locations and dates, as a proxy for localized sentiment. We assumed that an IED incident was often correlated with negative localized feeling towards ISAF and positive feelings towards the Taliban. More colloquially, we assumed that incidents were more likely given the tacit support of the local population.

After the model’s free parameters were optimized for best fit to the Wikileaks data, we were able to obtain predictive results from the model, shown in dashboard form at right.
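The calibration step, in outline: tune the free parameters so that simulated local sentiment best matches the IED-incident proxy. Here `run_simulation`, `incident_proxy`, and `initial_guess` are placeholders for the project’s actual components.

    import numpy as np
    from scipy.optimize import minimize

    def calibration_error(params):
        simulated = run_simulation(params)        # per-region sentiment estimates
        return np.mean((simulated - incident_proxy) ** 2)

    result = minimize(calibration_error, x0=initial_guess, method="Nelder-Mead")
    best_params = result.x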

Optimization in Content-Delivery Networks: I worked for a leading French telecom vendor on strategies for designing the layouts of their content-delivery network (used for things like on-demand video).

As is often the case, the first step consisted of a detailed analysis of their user population. User behavior varied by day and time of day (shown at left), as well as by customer type. We segmented the population into several groups with well-defined behavior patterns, e.g., light watchers, midnight bingers, etc.

We were provided with a reference network architecture within which to evaluate traffic, with content cached at certain nodes and accessed by users at others. Based on the six behavioral clusters we identified in the user population, we built a system for generating synthetic network traffic, with simulated users choosing movies and shows to produce realistic network loads. A user-facing simulation tool was produced to design and compare different caching strategies in response to varying assumptions about user makeup and demand.
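The core of the traffic generator can be sketched as follows; the cluster names and hourly rates are illustrative, not the measured profiles.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical per-cluster request rates (requests per user per hour).
    hourly_rate = {
        "light_watcher":   np.full(24, 0.05),
        "midnight_binger": np.r_[np.full(6, 0.8), np.full(16, 0.05), np.full(2, 0.8)],
    }

    def simulate_day(cluster: str, n_users: int) -> np.ndarray:
        # Poisson draws per simulated user per hour, summed into an hourly load curve.
        return rng.poisson(hourly_rate[cluster], size=(n_users, 24)).sum(axis=0)

    load = simulate_day("midnight_binger", n_users=10_000)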

Dashboards showed summaries of network traffic, problems, and network hot spots (right).

Market Share Simulation: This was a tool for exploring the impact of the design of Medicare Part B plans on regional market share. We worked closely with our client, a top insurer, to build a tool to better understand the market dynamics.

Our model was built on a rich data set describing all the details of competing insurance plans, along with years of market share data that we could use for calibration.

At its core, the model was an agent-based simulation of a population of seniors choosing their Medicare plans during certain eligibility periods. We had a great deal of demographic and behavioral information about this population available, and we worked with SMEs to develop seven archetypes of seniors, each with a different approach to decision making and plan selection.
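A sketch of how archetype-driven plan choice can work; the attribute names and weights below are illustrative, not the calibrated values from the project.

    import numpy as np

    rng = np.random.default_rng(1)

    ARCHETYPE_WEIGHTS = {
        "price_sensitive": {"premium": -2.0, "coverage": 0.5, "brand": 0.1},
        "brand_loyal":     {"premium": -0.3, "coverage": 0.5, "brand": 2.0},
        # ...the real model had seven archetypes
    }

    def choose_plan(archetype: str, plans: list) -> int:
        weights = ARCHETYPE_WEIGHTS[archetype]
        scores = np.array([sum(w * plan[attr] for attr, w in weights.items()) for plan in plans])
        probs = np.exp(scores - scores.max())   # softmax over plan scores
        probs /= probs.sum()
        return rng.choice(len(plans), p=probs)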

Given a well-calibrated model, the client was then able to evaluate the impact of changes to their plans, or game out the consequences of changes in their competitors’ plans.

After defining the plan changes and scenarios of interest, the results of simulating a year or two into the future could then guide decision making.

While trying to get to the bottom of confusing model dynamics, I developed a useful visual inspection tool I call polyseries, as in ‘many time-series’. These are detailed further on the visualizations page, but in this context they show us how different segments of seniors have changed Medicare plans over a couple of simulated years.
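A much-simplified stand-in for the polyseries view: one small panel per senior segment, each showing plan membership over simulated time. It assumes a tidy DataFrame `df` with columns segment, month, plan, and count.

    import matplotlib.pyplot as plt

    segments = df["segment"].unique()
    fig, axes = plt.subplots(1, len(segments), figsize=(3 * len(segments), 3), sharey=True)
    for ax, seg in zip(axes, segments):
        sub = df[df["segment"] == seg]
        for plan, grp in sub.groupby("plan"):
            ax.plot(grp["month"], grp["count"], label=plan)
        ax.set_title(seg)
    axes[0].set_ylabel("plan members")
    axes[-1].legend(fontsize="small")
    plt.tight_layout()
    plt.show()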

Nymbler: In this fun project, we built a website to help prospective parents navigate the space of baby names. We used an evolutionary algorithm approach to choosing names, with crossover and mutation operators bridging the gap across name styles and spelling variations.

The site let you collect names you liked, and gave you a continuously tuned stream of new names derived from the ones you liked and distinct from the ones you blocked.
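A toy version of the evolutionary loop; the real operators worked over name styles and a curated variant set, so `variants` and the splicing rule here are only illustrative.

    import random

    def crossover(a: str, b: str) -> str:
        # Splice the front of one name onto the back of another.
        return a[: len(a) // 2] + b[len(b) // 2 :]

    def mutate(name: str, variants: dict) -> str:
        # Swap in a known spelling variant, if one exists.
        return random.choice(variants.get(name, [name]))

    def suggest(liked: list, blocked: set, variants: dict, n: int = 10) -> list:
        pool = set()
        for _ in range(1000):                      # bounded number of attempts
            child = mutate(crossover(*random.sample(liked, 2)), variants)
            if child not in blocked:
                pool.add(child)
            if len(pool) >= n:
                break
        return sorted(pool)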

I got to wear many hats on this project:

• interviewed prospective parents about the factors they consider when choosing names
• scraped many web sites for name data, origins, and meanings, then cleaned and merged the results into a new name data set
• worked with the visual design team on the design elements

It was a fascinating experience having to scrape and distill our own data from disparate sources, cross-validating it as well as integrating it with 100+ years of US Social Security data on name popularity.

(Figure: Nymbler user interface)

Human-centered Route Optimization: In contrast with most route-optimization work, we built a tool that gave the postal workers a voice in the process. This tool was adopted in post offices across the whole of France for producing the découpages, the segmentation of the routes in a district.

The tool we developed was based on the concept of interactive evolution:

1. Produce a few sets of route segmentations.
2. Postal workers rate each set according to their subjective preferences.
3. Generate new route sets that take into account both worker preferences and traditional metrics like time and fuel. Get new worker ratings and repeat (a minimal sketch of this loop is shown below).
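In code, the loop looks roughly like this; `generate_offspring`, `route_time`, and `fuel_cost` stand in for the tool’s real variation operators and metrics, and the fitness weights are illustrative.

    def interactive_evolution(initial_sets, get_worker_ratings, generations=10):
        population = initial_sets
        for _ in range(generations):
            ratings = get_worker_ratings(population)                 # step 2: subjective scores
            fitness = [
                0.5 * r - 0.3 * route_time(s) - 0.2 * fuel_cost(s)   # blend with objective metrics
                for r, s in zip(ratings, population)
            ]
            # Keep the better half and refill with crossover/mutation (step 3).
            ranked = [s for _, s in sorted(zip(fitness, population),
                                           key=lambda pair: pair[0], reverse=True)]
            parents = ranked[: len(ranked) // 2]
            population = parents + generate_offspring(parents, len(ranked) - len(parents))
        return population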