ENsEN

ENsEN (Enhanced Search Engine): a software system that enhances a SERP with semantic snippets.

In more detail

The general issue: ranking on the edge of two webs (Web of Data, Web of Documents)

The advances of the Linked Open Data (LOD) initiative are giving rise to a more structured web of data in which a few datasets act as hubs (e.g., DBpedia, Freebase, …), allowing new use cases to emerge. One important use case combines DBpedia with NLP strategies in order to associate words within a webpage of the web of documents with entities from the web of data (DBpedia Spotlight is one of the current successful applications of this use case). In this context, we address the problem of ranking LOD entities obtained from the automatic annotation (e.g. through DBpedia Spotlight) of a webpage returned by a web search engine in response to a user query. We propose a new ranking algorithm (LDSVD), we compare it to an enhanced version of PageRank, and we use it to construct semantic snippets whose usability and usefulness we evaluate with a panel of users.

Challenges and Contributions

We proposed a new algorithm, LDSVD, for ranking entities in an RDF graph given knowledge of a user's information need expressed as a keyword query. LDSVD is designed for situations where only a sparse and heterogeneous graph is available. This happens in particular when the graph comes from the automatic annotation of a webpage (e.g. with DBpedia Spotlight). Indeed, LDSVD takes advantage of both the explicit structure given by the web of data and the implicit relationships that can be found by text analysis of a webpage. LDSVD outperforms state-of-the-art methods based on the PageRank algorithm. Furthermore, we applied LDSVD to semantic snippets, where its high accuracy allowed for the construction of useful and usable enhanced snippets that integrate entities obtained from the automatic annotation of a webpage. Future work could evaluate the potential of this approach for exploratory search.

ENsEN (Enhanced Search Engine)

To show the usefulness and efficiency of LDSVD, we used it at the core of ENsEN (Enhanced Search Engine): a software system that enhances a SERP with semantic snippets. We now give the workflow that produces a semantic snippet from a query so as to stress the essential role played by LDSVD.

Given the query, we obtain the SERP (we used Google for our experiments). For each result of the SERP, we use DBpedia Spotlight to obtain a set of DBpedia entities. In the same way, we find entities from the terms of the query. From this set of entities and through queries to a DBpedia SPARQL endpoint, we obtain a graph by finding all the relationships between the entities. To each entity, we associate a text obtained by merging its DBpedia abstract with windows of text from the webpage centered on the surface forms associated with the entity. Taking as input the graph, its associated text, and the entities extracted from the query, we execute LDSVD and obtain a ranking of the entities. Vignettes built from the top-ranked entities (viz. "main-entities") are displayed on the snippet.

From a DBpedia SPARQL endpoint, we perform a 1-hop extension of the main-entities in order to increase the number of triples, among which we then search for the most important ones in terms of a link analysis of the graph. To do this, we build a 3-way tensor from the extended graph: each predicate corresponds to a horizontal slice that represents the adjacency matrix for the restriction of the graph to this predicate. We compute the PARAFAC decomposition of the tensor into a sum of factors (rank-one three-way tensors): for each main-entity, we select the factors to which it contributes the most (as a subject or as an object), and for each of these factors we select the triples with the best-ranked predicates. Thus, we associate with each main-entity a set of triples that will appear within its description.

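The tensor construction and PARAFAC step described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the ENsEN implementation: it relies on NumPy, the function names (build_tensor, cp_als, top_factors) and the toy triples are hypothetical, the decomposition is computed with a plain alternating least squares loop (the text does not say which solver was used), and scoring an entity's contribution to a factor by the sum of its absolute subject and object loadings is one possible choice among others.

```python
import numpy as np

def build_tensor(triples):
    """Build X[p, s, o] = 1 for each (s, p, o) triple: one slice per predicate."""
    entities = sorted({s for s, _, _ in triples} | {o for _, _, o in triples})
    predicates = sorted({p for _, p, _ in triples})
    e_idx = {e: i for i, e in enumerate(entities)}
    p_idx = {p: i for i, p in enumerate(predicates)}
    X = np.zeros((len(predicates), len(entities), len(entities)))
    for s, p, o in triples:
        X[p_idx[p], e_idx[s], e_idx[o]] = 1.0
    return X, predicates, entities

def khatri_rao(B, C):
    """Column-wise Kronecker product; the row index of C varies fastest."""
    return np.einsum("jr,kr->jkr", B, C).reshape(-1, B.shape[1])

def unfold(X, mode):
    """Mode-n unfolding consistent with khatri_rao above (C-order reshape)."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def cp_als(X, rank, n_iter=100, seed=0):
    """PARAFAC (CP) decomposition of a 3-way tensor by alternating least squares."""
    rng = np.random.default_rng(seed)
    factors = [rng.random((dim, rank)) for dim in X.shape]
    for _ in range(n_iter):
        for mode in range(3):
            first, second = [factors[m] for m in range(3) if m != mode]
            gram = (first.T @ first) * (second.T @ second)
            factors[mode] = (
                unfold(X, mode) @ khatri_rao(first, second) @ np.linalg.pinv(gram)
            )
    return factors  # [predicate loadings, subject loadings, object loadings]

def top_factors(S, O, entity_index, k=1):
    """Factors to which an entity contributes most, as subject or as object."""
    score = np.abs(S[entity_index]) + np.abs(O[entity_index])
    return np.argsort(-score)[:k]

# Toy extended graph (hypothetical data, DBpedia-style identifiers).
triples = [
    ("dbr:Paris", "dbo:country", "dbr:France"),
    ("dbr:Lyon", "dbo:country", "dbr:France"),
    ("dbr:France", "dbo:capital", "dbr:Paris"),
]
X, predicates, entities = build_tensor(triples)
P, S, O = cp_als(X, rank=2)
for f in top_factors(S, O, entities.index("dbr:Paris")):
    # Predicates ranked by their loading in the selected factor.
    ranked = [predicates[i] for i in np.argsort(-np.abs(P[:, f]))]
    print("factor", f, "predicates:", ranked)
```

In this sketch the predicate is the first tensor mode, so that X[p] is the adjacency matrix of the graph restricted to predicate p, matching the slice-per-predicate layout described above.
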
Finally, we used a machine learning approach to select short excerpts of the webpage to be part of the description of each main-entity. We designed a number of features based on the query, the text of the webpage, and the ranked entities. For the selection process, we used an information gain metric to select a small set of features, which then served as the start set for a wrapper method with a forward selection approach. We observed that the features derived from the LDSVD ranking remained in the set of used features, which stands as a supplementary, although indirect, element in favor of the usefulness of LDSVD.
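
The two-stage feature selection described above (an information gain filter whose output seeds a forward-selection wrapper) can be sketched as follows. This is a hedged illustration only: the function names, the toy data, and the leave-one-out 1-nearest-neighbour evaluator are assumptions made for the example, since the text does not specify the features, the classifier, or the stopping rule that were actually used. Features are assumed to be discrete.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(column, labels):
    """Entropy reduction obtained by splitting the labels on a discrete feature."""
    n = len(labels)
    remainder = 0.0
    for value in set(column):
        subset = [y for x, y in zip(column, labels) if x == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def filter_by_info_gain(rows, labels, k):
    """Filter stage: indices of the k features with the highest information gain."""
    n_features = len(rows[0])
    gains = [information_gain([r[j] for r in rows], labels) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: gains[j], reverse=True)[:k]

def forward_selection(candidates, evaluate, start=()):
    """Wrapper stage: greedily add the feature that most improves evaluate()."""
    selected = list(start)
    best = evaluate(selected) if selected else float("-inf")
    remaining = [j for j in candidates if j not in selected]
    while remaining:
        score, feature = max((evaluate(selected + [j]), j) for j in remaining)
        if score <= best:  # stop as soon as no candidate improves the score
            break
        selected.append(feature)
        remaining.remove(feature)
        best = score
    return selected

def make_loo_1nn_evaluator(rows, labels):
    """Assumed evaluator: leave-one-out accuracy of 1-NN with Hamming distance."""
    def evaluate(features):
        correct = 0
        for i, row in enumerate(rows):
            nearest = min(
                (j for j in range(len(rows)) if j != i),
                key=lambda j: sum(rows[j][f] != row[f] for f in features),
            )
            correct += labels[nearest] == labels[i]
        return correct / len(rows)
    return evaluate

# Toy data (hypothetical): each row is a candidate excerpt, each column a feature.
rows = [(0, 0, 1), (0, 1, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1), (0, 0, 0)]
labels = [0, 0, 1, 1, 1, 0]
start_set = filter_by_info_gain(rows, labels, k=1)
chosen = forward_selection(range(3), make_loo_1nn_evaluator(rows, labels), start_set)
print("start set:", start_set, "selected features:", chosen)
```

Features present in the final selection are those that either passed the filter or improved the wrapper's score, which is the sense in which a feature can be said to have "remained in the set of used features".
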