Embedding Everything for Anything2Anything Recommendations

Ethan Rosenthal · Published in Making Dia · Sep 30, 2016 · 6 min read

At Dia&Co, our goal is always to find the best clothing to match the personal style and fit of each and every one of our customers. We excel at this due to our deep expertise and understanding of our customers’ unique needs, and we are always striving to improve. One way we improve is through our use of data.

Let’s say that we wanted to build an algorithm to find the products that a customer is most likely to love given her unique style. We could go fancy, build out a deep learning solution, and maybe this would surface the very best products for each customer. Unfortunately, we would lose much of the model’s interpretability, and there are often multiple stakeholders who benefit more from knowing which features are important than from a bare prediction score.

On the other end of the model complexity spectrum are simple linear models. While these tend to perform fairly well, our needs may be more complex than the model can handle. There could be many interaction terms at play, and we might not have the time or expertise to hand-roll all of them. Linear models also make it difficult to ask recommendation-style questions of the data like “Which tank-tops go best with this pair of jeans?”

We have recently fallen in love with latent factor models for their ability to fit complex objective functions, remain reasonably interpretable, and prove flexible under recommendation-style interrogation.

Factors FTW

Before I explain how we use latent factor models at Dia, let me provide a brief primer on what I mean by latent factors. There are classes of machine learning algorithms that learn vector representations of the data’s features by embedding the features in a lower dimensional space whilst optimizing an objective function (e.g. logistic loss). Examples usually help when math words fail:

Imagine a feature of our products is a binary variable saying whether or not the item of clothing is a pair of jeans. The latent factor model would learn some vector, i.e. a series of numbers with direction: [0.2, 0.9, -1.4, …, 0.1] that represents this feature. The numbers inside the vector are free weights that are learned during the optimization process. Because these vector representations all exist in the same vector space, we can use the vectors to explore the relationship between features.

Example of feature vectors in a 2-dimensional latent space. Many jeans have frayed cuffs, so these vectors may lie close to each other as opposed to a “top” feature vector.
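
To make the picture concrete, here is a toy sketch with made-up two-dimensional vectors (the numbers and feature names are purely illustrative, not learned weights):

import numpy as np

# Hypothetical 2-d latent vectors for three product features.
jeans        = np.array([0.9, -0.3])
frayed_cuffs = np.array([0.8, -0.4])
top          = np.array([-0.7, 0.6])

def cosine(u, v):
    # Cosine similarity: close to 1 means the vectors point the same way.
    return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(jeans, frayed_cuffs))  # high: these features tend to co-occur
print(cosine(jeans, top))           # low: unrelated (here, opposite) features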

From matrix factorization to factorization machines

The usual place to start with latent factor models is matrix factorization. Typically, in this paradigm, one learns a single vector representation for each unique product ID and customer ID. With these vectors, you can then use a similarity measure (typically cosine similarity) to look at similarities between products, between customers, and between customers and products. Chris Johnson has an excellent implementation of implicit matrix factorization on his GitHub. I’ve also previously written in more detail about explicit matrix factorization on my personal blog.
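
As a rough sketch of what matrix factorization gives you (the factor matrices below are random placeholders standing in for whatever a training run would learn), both predictions and similarities reduce to dot products of the learned vectors:

import numpy as np

num_customers, num_products, num_factors = 1000, 500, 32

# Placeholders for the factor matrices a matrix factorization model would learn.
cust_factors = np.random.rand(num_customers, num_factors)
prod_factors = np.random.rand(num_products, num_factors)

# Predicted affinity of every customer for every product.
scores = cust_factors.dot(prod_factors.T)  # shape (num_customers, num_products)

# Product-to-product cosine similarities for "similar items" queries.
normed = prod_factors / np.linalg.norm(prod_factors, axis=1, keepdims=True)
prod_sims = normed.dot(normed.T)           # shape (num_products, num_products)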

At Dia, we ask our customers to fill out a survey before ordering clothing so that we get to know them better and are able to delight them with hand-picked clothing to match their style. We also have rich information about our clothing through our detailed taxonomy and tagging systems. Until recently, it was difficult to include this extra customer and product information (often called side information) into a latent factor model.

LightFM is a python library from the data scientists at Lyst, and it incorporates this extra information in a fast and scalable manner (i.e. Cython + parallelization). Vectors are learned for the side information, and this allows one to look at relationships between features, products, and customers. The LightFM algorithm approximates products and customers as the sum of all their respective feature vectors. The prediction of the algorithm is the dot product of the summed product and customer vectors.
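
As a sketch of what fitting such a model can look like (this is illustrative rather than our production code; the interaction and feature matrices below are random placeholders for data you would build yourself):

import scipy.sparse as sp
from lightfm import LightFM

# Hypothetical sparse inputs:
#   interactions  - (num_customers, num_products), nonzero = observed positive
#   cust_features - (num_customers, num_customer_features)
#   prod_features - (num_products, num_product_features)
interactions = sp.random(1000, 500, density=0.01, format='coo')
cust_features = sp.identity(1000, format='csr')
prod_features = sp.random(500, 40, density=0.1, format='csr')

model = LightFM(no_components=32, loss='warp')
model.fit(interactions,
          user_features=cust_features,
          item_features=prod_features,
          epochs=20,
          num_threads=4)

# Learned latent factors for the product-side features, analogous to the
# prod_feature_factors matrix used in the examples further down.
prod_feature_factors = model.item_embeddings  # (num_prod_features, no_components)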

One can take things a step further and employ factorization machines. This is a similar class of model, but one that makes no assumptions about how feature vectors should be combined. Factorization machines learn a vector for every feature and predict by summing the dot products of every pair of feature vectors (employing a nice algebraic trick to keep this computationally feasible). This ability to model all pairwise interactions between features embedded in the same vector space makes factorization machines extremely good at modeling complex data. The Python libraries polylearn and fastFM both provide scikit-learn compatible APIs for training and using factorization machines.
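
Concretely, a second-order factorization machine’s prediction can be written in a few lines of numpy; the reshuffling inside the interaction term is the trick that brings the pairwise sum down from O(n²) to O(nk). The weights here are random placeholders rather than fitted parameters:

import numpy as np

num_features, num_factors = 100, 8

# Placeholder parameters; a library like fastFM or polylearn would fit these.
w0 = 0.0                                        # global bias
w = np.random.randn(num_features)               # linear weights
V = np.random.randn(num_features, num_factors)  # one latent vector per feature

def fm_predict(x):
    # Pairwise term: sum over i < j of <V[i], V[j]> * x[i] * x[j], computed
    # with the identity 0.5 * sum_f ((V.T x)_f^2 - ((V^2).T x^2)_f).
    Vx = V.T.dot(x)
    pairwise = 0.5 * (Vx ** 2 - (V ** 2).T.dot(x ** 2)).sum()
    return w0 + w.dot(x) + pairwise

x = np.random.rand(num_features)  # one (dense, for simplicity) feature vector
print(fm_predict(x))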

Latent factors at Dia&Co

We’ve learned these latent factor representations for both our customer and product data by training latent factor models on our historical data. While one should always pick a metric for optimization and hyperparameter tuning of machine learning models, it’s especially fun to play with trained recommendation systems to make sure that they make sense.

Let’s walk through an example. Say we have a matrix of products and their features called prod_features with shape (num_products, num_features). Each row is a product, and each column is a feature (like is_a_jean). We also have another learned matrix of latent factors for these features called prod_feature_factors with shape (num_features, num_factors).

We have a labeled matrix prod_features on the left and a learned matrix prod_feature_factors on the right.

We could go the LightFM route and approximate a product as the sum of its features’ latent factors to create single vector representations for each product. To do this in Python and numpy, we simply use matrix multiplication.

prod_representation = prod_features.dot(prod_feature_factors)

We’ll also assume that we have paired dictionaries that map the column indices of prod_features to feature names (and vice versa), as well as the row indices of prod_features to product names (and vice versa).

print(prod_feat_dict['is_a_jean'])
# 0
print(reverse_prod_feat_dict[0])
# is_a_jean
print(prod_index_dict['Product Name 1'])
# 42
print(reverse_prod_index_dict[42])
# Product Name 1
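
For completeness, such paired lookups are easy to build from lists of feature and product names (hypothetical ones here) ordered to match the columns and rows of prod_features:

# Hypothetical name lists, ordered to match the columns/rows of prod_features.
feature_names = ['is_a_jean', 'has_frayed_cuffs', 'is_a_top']
product_names = ['Product A', 'Product B', 'Product C']

prod_feat_dict = {name: idx for idx, name in enumerate(feature_names)}
reverse_prod_feat_dict = {idx: name for name, idx in prod_feat_dict.items()}

prod_index_dict = {name: idx for idx, name in enumerate(product_names)}
reverse_prod_index_dict = {idx: name for name, idx in prod_index_dict.items()}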

We can now interrogate our model by asking questions. For example, we ask our customers which types of tops they like. Using latent factors, we can look at the top products for customers who checked off that they like sleeveless tops. Assuming we have customer-side matrices and dictionaries analogous to the product-side ones above, the code looks like this:

import numpy as np

def cosine_similarity(mat1, mat2):
    # Pairwise cosine similarities between the rows of mat1 and the rows of mat2.
    sim = mat1.dot(mat2.T)
    sim /= np.linalg.norm(mat1, axis=1)[:, np.newaxis]
    sim /= np.linalg.norm(mat2, axis=1)[np.newaxis, :]
    return sim

sleeveless_idx = cust_feat_dict['loves_sleeveless_tops']
sleeveless_vec = cust_feat_factors[sleeveless_idx][np.newaxis, :]
sims = cosine_similarity(sleeveless_vec, prod_representation).ravel()
recs = [reverse_prod_index_dict[i] for i in np.argsort(-sims)]

Unsurprisingly, the top products for customers who like sleeveless tops are tank tops.

Even though this is the result that we should expect, it is important confirmation that (1) our survey is collecting true and relevant information and (2) our algorithm is learning accordingly.

We can also ask more nuanced questions. Let’s say a customer loves the last maxi dress that we sent them (an example of which is shown on the left below) but we do not want to send them another dress. We can search for the clothing that is most similar to maxi dresses and then filter out any dress results. We find mostly wide-legged trousers which hit the lower body similarly to maxi dresses but are an entirely different category of clothing.
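
A rough sketch of that query, reusing the objects from above (the feature names is_a_maxi_dress and is_a_dress are hypothetical stand-ins for our real taxonomy tags, and prod_features is assumed to be a dense numpy array):

# Latent vector for the (hypothetical) maxi dress feature.
maxi_idx = prod_feat_dict['is_a_maxi_dress']
maxi_vec = prod_feature_factors[maxi_idx][np.newaxis, :]

# Similarity of every product to the maxi dress feature vector.
sims = cosine_similarity(maxi_vec, prod_representation).ravel()

# Drop anything tagged as a dress, then rank whatever is left.
dress_col = prod_feat_dict['is_a_dress']
not_a_dress = prod_features[:, dress_col] == 0
recs = [reverse_prod_index_dict[i] for i in np.argsort(-sims) if not_a_dress[i]]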

As you can see, by combining our detailed customer and product information with the power of well-trained latent factors, our ability to surprise and delight our customers with personalized clothing is limited only by our creativity in asking questions of our data.

If you are interested in working on problems like these or would like to learn about the myriad of other technology solutions that we are building, then visit our careers page and join us!

