MDCO - Masse de données Connaissances Ambiantes 2007

Really Open Simple Efficient Syndication – ROSES

Submission summary

The Internet has become an economical medium for publishing and distributing information on a
large scale. Internet publishing techniques can be distinguished by the control clients have over
the origin and quality of the published information, by their precision in reaching only the
clients interested in that information, and by their “publication lag”, the time needed to
reach these clients. For example, “spam”-based publishing is
generally uncontrolled and imprecise, and has a short publication lag. News forums improve
precision but need to be moderated in order to guarantee the origin and the quality of the
published information. The origin and quality of web page information are guaranteed by the
address (URL) of the producing web site, but web pages suffer from a significant publication lag due
to the low refresh rate of web search engines.
In order to reduce the time needed for information published on a web site to
reach interested users, more and more web sites apply web syndication techniques for
publishing their contents. These techniques consist in publishing new information in the form of
web feeds or blogs to interested users who actively subscribe to them. They reduce
the publication lag of web information and allow users to create a personal information
space that tracks the evolution of well-defined information sources.
Whereas web content syndication can be considered a new, efficient way of sharing
information on the web, it also suffers from well-known problems related to the large scale
of the web. The number of web feeds and blogs is constantly growing, which creates new
issues in feed management and feed aggregation. Specialized web syndication portals like
Blastfeed.com, Plazoo.com and Technorati.com try to solve some of these problems by
collecting and aggregating web feed data. One goal of these portals is to index feed data
(similar to search engines for standard web resources) based on efficient refresh
algorithms to reduce the publication lag mentioned before. For example, the number of
feeds indexed by Technorati.com doubles approximately every six months; it
reached 36*10^6 feeds in April 2006, with about 50*10^3 postings per
hour (http://technorati.com/weblog/2006/04/96.html).
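To give a feel for the scale these figures imply, the short sketch below converts the hourly posting rate into a per-second rate and models the quoted doubling trend. The function names are hypothetical, and the projection only illustrates what "doubling every six months" means arithmetically; it is an assumption, not a measurement or a forecast.

```python
# Figures quoted in the summary (Technorati, April 2006).
FEEDS_APRIL_2006 = 36 * 10**6
POSTINGS_PER_HOUR = 50 * 10**3
DOUBLING_MONTHS = 6  # "approximately every six months"


def postings_per_second(per_hour=POSTINGS_PER_HOUR):
    """Average posting rate implied by the hourly figure."""
    return per_hour / 3600


def projected_feeds(months_after, start=FEEDS_APRIL_2006, doubling=DOUBLING_MONTHS):
    """Feed count after `months_after` months, ASSUMING the doubling trend holds."""
    return start * 2 ** (months_after / doubling)


if __name__ == "__main__":
    print(f"{postings_per_second():.1f} postings per second on average")
    print(f"{projected_feeds(6):,.0f} feeds after six months (hypothetical)")
```

On average this is roughly 14 new postings per second that a portal would have to detect and index, which is why the refresh strategy matters.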
The goal of the RSSBD project is to apply and evaluate modern data management
technology in the context of web syndication. The proposed approach is based on the
observation that web content syndication can be considered as a large-scale distributed
XML data management problem:
1. The two main web feed formats are RSS and Atom, and both use XML as their
publishing syntax. We intend to exploit and adapt existing XML data management
technology, such as XML data warehouses and standard XML query languages
(XQuery/XPath), for defining and implementing advanced syndication services
(publish, filter, aggregate).
2. Web content syndication consists in observing and aggregating large volumes of
evolving distributed XML data. Existing web syndication portals or interfaces are
based on a centralized architecture and must be able to support high refresh and
aggregation workloads. In this project we intend to apply and extend existing query
evaluation and optimization techniques for distributed data in the context of web
syndication. In particular, we will study the case of a distributed P2P syndication
infrastructure.
3. Currently available RSS feed aggregation services are still very limited and
essentially consist in keyword-based filtering, concatenation and timestamp-based
reordering of several feeds. One goal of the project is to propose new advanced
aggregation services based on XML data integration techniques.
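
The basic aggregation described in point 3 (keyword-based filtering, concatenation and timestamp-based reordering) can be sketched in a few lines. This is a minimal illustration and not the project's implementation: the two inline RSS 2.0 feeds and the function names `parse_items` and `aggregate` are hypothetical, and the XPath expressions use only the limited subset supported by Python's standard library.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Two tiny hypothetical RSS 2.0 feeds, inlined so the sketch is self-contained.
FEED_A = """<rss version="2.0"><channel><title>Feed A</title>
<item><title>XML databases in practice</title>
<pubDate>Mon, 03 Apr 2006 10:00:00 GMT</pubDate></item>
<item><title>Gardening tips</title>
<pubDate>Tue, 04 Apr 2006 09:00:00 GMT</pubDate></item>
</channel></rss>"""

FEED_B = """<rss version="2.0"><channel><title>Feed B</title>
<item><title>Querying XML feeds with XPath</title>
<pubDate>Wed, 05 Apr 2006 08:30:00 GMT</pubDate></item>
</channel></rss>"""


def parse_items(feed_xml):
    """Extract (timestamp, feed title, item title) tuples from one RSS document."""
    root = ET.fromstring(feed_xml)
    source = root.findtext("./channel/title", default="")
    items = []
    # A simple path expression selects every item of the channel.
    for item in root.findall("./channel/item"):
        # RSS 2.0 dates follow the RFC 822 style, which email.utils can parse.
        published = parsedate_to_datetime(item.findtext("pubDate"))
        items.append((published, source, item.findtext("title", default="")))
    return items


def aggregate(feeds, keyword):
    """Concatenate feeds, keep items whose title contains keyword, newest first."""
    merged = [entry for feed in feeds for entry in parse_items(feed)]
    matching = [e for e in merged if keyword.lower() in e[2].lower()]
    return sorted(matching, key=lambda e: e[0], reverse=True)


if __name__ == "__main__":
    for published, source, title in aggregate([FEED_A, FEED_B], "xml"):
        print(published.isoformat(), source, title)
```

Everything here operates on item titles and dates only. Services based on XML data integration, as proposed by the project, would go beyond this kind of flat filter-and-merge.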

Project coordination

Université

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

Partnership

ANR funding: 294,689 euros
Beginning and duration of the scientific project: - 36 Months
