I’ve been involved in dozens of projects in many capacities, from lead scientist to software developer to visualization specialist. I have also managed many teams, though I find hands-on work more rewarding.

In terms of tools, I’ve worked a lot in Java, along with a bit of JS/HTML/CSS, though I spend most of my time now in Python or R, reaching for other tools like Bash or SQL when appropriate. I love learning new languages.

Some samples of my work are provided, with a few redactions to protect confidential client names and/or implementation details.

Machine Learning

Cybersecurity Automation

I designed and built a pair of recommender systems for a leading cybersecurity training company, automating the selection of attacks and training approaches based on interaction histories and other available information about users. 

While details of the learning model are confidential, at a high level we used a sequence learning model in TensorFlow, running on Amazon SageMaker.
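
Just to convey the general shape: everything in the sketch below (layer choices, widths, catalog size) is an illustrative placeholder, not the confidential production model.

```python
import tensorflow as tf

# Illustrative placeholder only: a generic sequence recommender over
# user interaction histories. Sizes and layers are assumptions.
NUM_ITEMS = 10_000   # hypothetical catalog of attacks / training modules
SEQ_LEN = 50         # interactions per user history (assumption)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(SEQ_LEN,)),
    tf.keras.layers.Embedding(NUM_ITEMS, 64, mask_zero=True),
    tf.keras.layers.GRU(128),
    tf.keras.layers.Dense(NUM_ITEMS, activation="softmax"),  # next-item scores
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```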

As is often the case, a great deal of the work consisted of data cleaning and analysis. Some of the steps along the way are shown below.

Features were designed and accumulated to represent the histories of millions of users.

Features shown here are all log-normalized, after clipping outliers.

(Feature names redacted)
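
In code, that preprocessing amounts to something like this minimal sketch; the clipping percentiles are an assumption for illustration.

```python
import numpy as np

def log_normalize(x, lo_pct=1, hi_pct=99):
    """Clip outliers to percentile bounds, then log-transform and
    standardize. The percentile choices are illustrative."""
    lo, hi = np.percentile(x, [lo_pct, hi_pct])
    x = np.clip(x, lo, hi)
    x = np.log1p(x - x.min())          # shift so log1p is defined
    return (x - x.mean()) / x.std()
```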

Using cluster analysis based on these features, we segmented the millions of users into six major groups and 34 sub-groups.

The matrix view at right shows how groups compare to each other: each sub-group is shown as a row in the figure, while each column represents a feature, so the cells show the mean feature value for the sub-group.

(Figure: mean feature values by sub-group)
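
A minimal sketch of the two-level segmentation itself, using scikit-learn's KMeans as a stand-in for the actual clustering method (per-level group counts here are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

features = np.random.rand(20_000, 32)  # placeholder for the real user features

# First pass: six major groups.
major = KMeans(n_clusters=6, random_state=0).fit_predict(features)

# Second pass: sub-cluster within each major group. The per-group k of 6
# is illustrative; the real analysis arrived at 34 sub-groups overall.
sub = np.empty(len(features), dtype=object)
for g in range(6):
    mask = major == g
    labels = KMeans(n_clusters=6, random_state=0).fit_predict(features[mask])
    sub[mask] = [f"{g}.{s}" for s in labels]
```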

Another way to get a feel for the groupings is to color-code the users in a projection of the 32 features onto the plane.
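
Sketched with PCA as the projection (t-SNE or UMAP would work as well), and placeholder data standing in for the real features:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

features = np.random.rand(20_000, 32)  # placeholder for the real features
groups = KMeans(n_clusters=6, random_state=0).fit_predict(features)

xy = PCA(n_components=2).fit_transform(features)  # project 32-D onto the plane
plt.scatter(xy[:, 0], xy[:, 1], c=groups, s=2, cmap="tab10")
plt.title("Users projected onto the plane, colored by group")
plt.show()
```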

Finally, within each group, we identify vulnerabilities (previously successful attacks, shown in red). Some groups (columns) are easy to attack, while others are less vulnerable.

I revisited the polyseries concept to better understand differences in the various user populations. At left we see two very different classes of users. Before this look, I had expected relatively homogeneous user populations. It’s often not until you get a visual of a data set, with interactive faceting and sorting, that you really understand what you’re dealing with.

Purchase Forecasting Using Embeddings

I used ‘basket embeddings’ to predict corporate purchase patterns. This research was a study of transferring the concepts of the doc2vec algorithm to a different domain.

With embeddings, we can define low-dimensional representations for very high-dimensional categorical variables like words (left), and even for collections of things, such as the contents of a document. More importantly, we can learn fixed-length representations of variable-sized sets like documents or baskets of purchases, which makes them easier to use with learning models.

In this project, I used embeddings to characterize the space of purchases by a large industrial client. In order to present these ‘baskets’ of purchases for a learning model, we needed a fixed-length representation.

In the doc2vec formulation, to produce a document embedding, we start with a word2vec approach, and add in the document (paragraph) ID as part of the input vector. Just as word2vec lets you learn embeddings for words, here we learn embeddings of sets of items.
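
By analogy, each basket of purchases can be treated as a ‘document’ whose ‘words’ are item IDs. A minimal sketch using gensim's Doc2Vec, with made-up item IDs and illustrative parameters:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each basket is a 'document' whose 'words' are item IDs (made-up data).
baskets = [
    ["part_17", "part_42", "part_03"],
    ["part_42", "part_99"],
    ["part_03", "part_17", "part_55", "part_42"],
]
docs = [TaggedDocument(words=b, tags=[f"basket_{i}"])
        for i, b in enumerate(baskets)]

model = Doc2Vec(docs, vector_size=32, window=4, min_count=1, epochs=50)
vec = model.dv["basket_0"]                        # fixed-length basket embedding
new = model.infer_vector(["part_42", "part_55"])  # embed an unseen basket
```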

In each time-slice, varying numbers of items are bought. To deal with this, instead of working with individual items, we trained a sequence model on sequences of groups of parts; once trained, this yields embeddings for ‘item baskets’.

Finally, the basket embeddings are used as inputs to a sequence model, which learns to predict a client’s next purchases given a known history of their purchases.
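
Sketched in Keras with placeholder data, and with the embedding size, history length, and catalog size as assumptions, the forecasting model has roughly this shape:

```python
import numpy as np
import tensorflow as tf

EMBED_DIM = 32    # basket-embedding size (assumption)
HISTORY = 12      # past time-slices per client (assumption)
NUM_ITEMS = 500   # catalog size (assumption)

# X: sequences of basket embeddings; y: multi-hot next-basket contents.
X = np.random.rand(1_000, HISTORY, EMBED_DIM).astype("float32")
y = (np.random.rand(1_000, NUM_ITEMS) > 0.98).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(HISTORY, EMBED_DIM)),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(NUM_ITEMS, activation="sigmoid"),  # per-item purchase probability
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X, y, epochs=2, batch_size=32)
```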

PhD Thesis Work

In graduate school, my thesis work focused on two applications of ARTMAP neural networks:

  • Classifying remote sensing imagery of forests

  • Fusing sonar and visual sensor data on a mobile robot

As it was grad school, I got to do some interesting data collection: I spent a week with a team driving around the Plumas National Forest, dodging logging trucks, to get visual estimates of forest cover, and I got a robot to do self-supervised collection of distance data to serve as ground-truth labels for sensor fusion.

Classifying remote sensing imagery

In a collaboration with the Geography department at Boston University, I worked on two projects centered on classifying vegetation from satellite imagery, specifically of the Plumas and Sierra National Forests in California.

In addition to the six bands of spectral data from Thematic Mapper remote sensing imagery, I worked from other geo-referenced data, such as slope and elevation.

We were looking to see how well neural network methods could improve the efficiency of the US Forest Service’s vegetation mapping scheme.

The neural network learns to classify vegetation types by fusing spectral, terrain and location variables.
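
ARTMAP implementations aren’t part of the standard toolkits, so the sketch below uses a scikit-learn classifier as a stand-in; the point is simply that the fusion amounts to concatenating the spectral, terrain, and location variables into one input vector per pixel (all data shown is placeholder):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier  # stand-in for ARTMAP

n = 5_000                             # placeholder pixel count
spectral = np.random.rand(n, 6)       # six Thematic Mapper bands
terrain = np.random.rand(n, 2)        # slope, elevation
location = np.random.rand(n, 2)       # geo-referenced coordinates
labels = np.random.randint(0, 8, n)   # vegetation classes (placeholder)

# Fusion here is simply concatenation into one input vector per pixel.
X = np.hstack([spectral, terrain, location])
clf = RandomForestClassifier().fit(X, labels)
```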

Our approach exceeded the pre-editing accuracy of maps produced by the US Forest Service, and approached the accuracy of the post-editing expert maps.

Training a neural network to do the discrimination takes minutes, whereas handcrafting an expert system to do the job can take more than six months.

Robot sensor fusion

In a separate application of the same neural network architecture, I fused sonar and visual inputs on a robot, yielding an improved distance percept.

I built the training set through self-supervision: the B14 robot randomly explored its environment, and when its infrared sensor encountered an obstacle, snapshots of the visual and sonar sensors taken in transit were recorded along with the retrospective distance to the obstacle.
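
In outline, the collection loop looked something like the sketch below; the sensor callables are hypothetical stand-ins for the robot’s API, and the bump threshold is illustrative.

```python
def collect_labels(capture_camera, capture_sonar, read_infrared, odometer,
                   bump_threshold=0.05):
    """Self-supervised labeling sketch. The four callables are hypothetical
    stand-ins for the robot's sensor API; bump_threshold is illustrative."""
    buffer, training_set = [], []
    while read_infrared() > bump_threshold:       # still approaching an obstacle
        buffer.append((capture_camera(), capture_sonar(), odometer()))
    end = odometer()                              # obstacle reached
    for cam, sonar, odo in buffer:
        # Retrospective label: distance to the obstacle at capture time.
        training_set.append(((cam, sonar), end - odo))
    return training_set
```

Once enough traversals were recorded, those (visual, sonar, distance) triples served directly as the ground-truth training set.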
