
Job offer: QWANT-INRIA Research Engineer / Post-doc job position

 
EN ----------------------------------------------------------
 
Title: QWANT-INRIA Research Engineer / Post-doc job position
 
Reference: QWANT-INRIA-PL
 
Context:
The ANSWER project is led by the QWANT search engine and the INRIA Sophia Antipolis Méditerranée research center. This proposal is the winner of the "Grand Challenges du Numérique" (BPI) and aims to develop the new version of the search engine www.qwant.com with radical innovations in search criteria, indexed content and user privacy. The scientific and technological challenges of the project concern several evolutions of the Web: adapting to some of them (heterogeneity of data) and anticipating others (personalization and respect for privacy, finer granularity and a greater number of criteria for qualifying Web content). This job description covers one of the open positions within the project.
 
Description:
This position will be attached to Inria and covers both the project management for Inria over the entire duration of the project (project leader for Inria) and the task of researching and developing crawling methods for linked data on the Web.
The content of the Web has diversified enormously, and not only in terms of multimedia content. Data is now embedded inside pages (RDFa, Microdata, OGP, microformats, etc.) or even published as Linked Open Data directly on the Web using its latest standards (RDF, SPARQL). However, very few crawlers exist to collect these data, even though extracting them has become easier.
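As an illustration of data embedded inside pages, the following is a minimal sketch (not part of the project's code) of collecting OGP properties from a page using only the Python standard library. The names OpenGraphParser, extract_ogp and SAMPLE_PAGE are hypothetical and chosen for this example; a real crawler would also need to handle RDFa, Microdata and microformats.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect Open Graph Protocol (OGP) properties from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        # OGP data lives in <meta property="og:..." content="..."> tags.
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property") or ""
        content = attributes.get("content")
        if prop.startswith("og:") and content is not None:
            self.properties[prop] = content


def extract_ogp(html):
    """Return a dict mapping OGP property names to their content values."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.properties


# Hypothetical sample page, used only for illustration.
SAMPLE_PAGE = """
<html><head>
<meta property="og:title" content="Example page">
<meta property="og:type" content="article">
<meta name="description" content="not an OGP property">
</head><body></body></html>
"""

print(extract_ogp(SAMPLE_PAGE))
```

On the sample page, only the two og: properties are kept; the plain description tag is ignored.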
For the technical task, we plan the following steps:
A first step will be to study and compare crawling approaches that respect the principles of linked data on the Web. The various techniques for parsing and collecting linked open data, as well as the different possible formats, will be taken into account in terms of both indexing and storage. This step includes the study and benchmarking of existing components (e.g. LDSpider, Any23).
A second step will be to propose an integrated and robust solution that can be made available to and used by a mainstream search engine, in particular through the design and prototyping of a crawler dedicated to this data.
A third step is the study of the indexing and storage of these data, so that a search engine can provide answers beyond a list of pages and integrate these sources into its indexing and relevance calculations, or offer them as new types of responses and services.
The idea behind the last two steps is to design a processing chain directly integrated into the crawler that indexes only the open data meeting certain criteria, in order to create a specific index (a silo or vertical, in search engine terminology).
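To make the idea of criteria-based indexing concrete, here is a minimal sketch of such a filtering stage, assuming the crawler yields data as simple subject/predicate/object triples. The names Triple, filter_triples, uses_vocabulary and subject_on_host, as well as the two criteria themselves, are hypothetical and only serve the illustration; the actual criteria are to be defined within the project.

```python
from typing import Callable, Iterable, Iterator, NamedTuple
from urllib.parse import urlparse


class Triple(NamedTuple):
    """A simplified RDF statement (all terms kept as plain strings)."""
    subject: str
    predicate: str
    obj: str


Criterion = Callable[[Triple], bool]


def filter_triples(triples: Iterable[Triple],
                   criteria: Iterable[Criterion]) -> Iterator[Triple]:
    """Yield only the triples that satisfy every criterion."""
    criteria = list(criteria)
    for triple in triples:
        if all(criterion(triple) for criterion in criteria):
            yield triple


def uses_vocabulary(prefix: str) -> Criterion:
    """Hypothetical criterion: the predicate belongs to a given vocabulary."""
    return lambda triple: triple.predicate.startswith(prefix)


def subject_on_host(host: str) -> Criterion:
    """Hypothetical criterion: the subject URI is published on a given host."""
    return lambda triple: urlparse(triple.subject).netloc == host


# Hypothetical crawl output, used only for illustration.
crawled = [
    Triple("http://example.org/a", "http://schema.org/name", "Alice"),
    Triple("http://example.org/a", "http://xmlns.com/foaf/0.1/knows",
           "http://example.org/b"),
    Triple("http://other.example/c", "http://schema.org/name", "Carol"),
]

vertical_index = list(filter_triples(
    crawled,
    [uses_vocabulary("http://schema.org/"), subject_on_host("example.org")],
))
print(vertical_index)
```

Because the stage is a generator, it can sit inside the crawl loop and discard non-matching data before anything is written to the index.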
The final objective is to prototype a new crawler to collect and index linked open data in its different formats (different RDF syntaxes, etc.). This crawler will: (1) be robust to variations in publication formats; (2) scale up to online data volumes; (3) facilitate the integration of data within a single model (RDF); (4) support integration with results, including the indexing or provision of datasets and APIs on these data.
 
Terms of the position
-    Duration: 36 months.
-    Hosting team: WIMMICS (http://wimmics.inria.fr/), a joint research team (University Côte d’Azur, Inria, CNRS, I3S) working in the fields of linked data, the semantic Web, and graph-oriented knowledge representation for Web-based epistemic communities.
-    Location: Sophia Tech Campus, Sophia Antipolis, France. 
-    Salary: 2600 euros/month (gross salary).
-    Applications: send a CV and a cover letter by email to Fabien.Gandon@inria.fr with the subject “Application QWANT-INRIA-PL”.
-    Deadline for applications: November 13th, 2017.