RED: Rich Epinions Dataset for Recommender Systems

Recommender systems need suitable datasets to be evaluated. They do not all need the same information: depending on the approach, descriptions of users, descriptions of items or interactions between users may be necessary, and current datasets do not gather all of these together.

This site provides a dataset extracted from Epinions in June 2011. It contains user reviews of items, trust values between users, item categories, the category hierarchy and user expertise in categories. The dataset can be used to evaluate recommender systems based on collaborative filtering, content-based or trust-based approaches.

More details about the RED dataset are available in this publication.

The RED development project has ended and we no longer have the full version of the dataset. We therefore cannot provide any data beyond what is already downloadable from this site (despite what is written in the publications). We apologize for this.

Structure

The dataset is a relational database with the following tables:

  • User: name (username and profile URL), location, top rank (may be null) and profile visit count
  • Item: name, category and profile URL
  • Category: name, parent category, description URL, lineage (path in the category tree) and depth (in the category tree)
  • Review: a review associates a user with an item; it contains the rating (between 1 and 5), the review rating (the mean of all ratings given to this review) and the review date
  • Expertise: users who are experts in a category appear here with their expertise (category lead, top reviewer or advisor) for that category
  • Trust: web of trust, i.e. a trust value (either -1 or 1) from one user to another; only positive trust values appear in the dataset
  • Similarity: we have computed the similarity between all pairs of users using the Pearson correlation coefficient. Since this computation can take a long time and is used in classical collaborative filtering, we provide it to make recommendation easier; these values do not come from the Epinions website
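
As an illustration of the kind of value stored in the Similarity table, here is a minimal sketch of the Pearson correlation coefficient between two users. It assumes that the coefficient is computed over the items both users have rated, with means taken over those co-rated items; the exact variant used to build the table is not stated here. The function name and the dictionary-based input format are hypothetical and do not reflect the database layout.

```python
from math import sqrt


def pearson_similarity(ratings_a, ratings_b):
    """Pearson correlation between two users over their co-rated items.

    ratings_a, ratings_b: dicts mapping an item id to a rating (1 to 5).
    Returns None when the correlation is undefined: fewer than two
    co-rated items, or one user gave the same rating to all of them.
    Assumption: means are computed over co-rated items only.
    """
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return None
    xs = [ratings_a[item] for item in common]
    ys = [ratings_b[item] for item in common]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances (unnormalized; the factors cancel out).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)
```

For example, `pearson_similarity({"i1": 5, "i2": 3, "i3": 1}, {"i1": 5, "i2": 4, "i3": 3})` describes two users whose ratings rise and fall together, so the coefficient is at its maximum. Computing this for every pair of users is quadratic in the number of users, which is why precomputed values save time.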

database schema

Extraction

We searched for users through the "members search" facility with a dictionary-based approach, which allowed us to extract a substantial number of users: 240 000. Then, for each identified user, we parsed his/her profile, reviews and web of trust, adding any new users found along the way. This brought the total to about 307 000 users. For each user's review, we parsed the associated item, if new, and its category.

This approach ensures that every item in the dataset has been reviewed at least once. However, it does not ensure that every user has reviewed at least one item. We therefore cleaned the dataset by removing all unnecessary users, i.e. users with neither a trust relation nor a review. These users were found with the dictionary-based approach and are most likely people who only wanted to try Epinions or who used it in a read-only manner.
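
The cleaning rule can be sketched as follows. This is only an illustration, not the code used for the extraction: the function name and the tuple-based inputs are hypothetical, and it assumes that a trust relation in either direction (given or received) is enough to keep a user.

```python
def remove_inactive_users(user_ids, reviews, trust_edges):
    """Keep only users with at least one review or one trust relation.

    user_ids: iterable of user ids.
    reviews: iterable of (user_id, item_id) pairs.
    trust_edges: iterable of (source_user_id, target_user_id) pairs.
    Assumption: both ends of a trust edge count as having a relation.
    """
    active = {user for user, _item in reviews}
    for source, target in trust_edges:
        active.add(source)
        active.add(target)
    # Preserve the original order of the users that remain.
    return [user for user in user_ids if user in active]
```

With users `u1` to `u4`, a single review by `u1` and a single trust edge from `u2` to `u3`, only `u4` would be removed.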

This extraction took two full days of crawling in June 2011 on an Intel Core 2 Duo notebook with 3 GB of RAM.

We encountered several problems during the dataset extraction. First of all, the HTML structure of the Epinions website is very particular, using a lot of table tags and very few CSS classes. This made the use of XPath very difficult. In addition, there are many exceptions in the page structures: some pieces of information were sometimes missing, whereas others appeared only rarely. Moreover, some special characters in user names were problematic. Finally, category breadcrumbs are not always consistent, which made the category extraction rather chaotic: we had to correct it manually.

Download the dataset

Here it is:

How to contact us

You can ask for more details at red[at]liris.cnrs.fr.

Please do not contact us to ask for an extended version of the dataset, as we no longer have it.