Toolkit_Web is a GATE plugin developed by the Liris DRIM team. It groups processing resources for handling data taken from the World Wide Web.

The Processing Resources (PRs) provided by this plugin are:

Resource Encoding Errors Corrector
Resource type Corrector - this PR modifies a Language Resource.
Do not run any annotating resource before a Corrector: annotations are located by their character offsets from the start of the document, and a Corrector may insert or delete characters, which invalidates those offsets.
Dependencies none
Description This PR corrects encoding errors in a document whose declared encoding does not match its actual encoding. It is primarily aimed at news feeds in French, whose accented characters make them far more likely to be affected than English feeds.
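
The sketch below illustrates both points made for this resource: what an encoding repair can look like, and why offsets computed before the repair become stale. It is not the plugin's actual algorithm. It assumes one specific, common failure (UTF-8 bytes decoded as Latin-1), and the function name fix_mojibake and the sample sentence are hypothetical.

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were wrongly decoded as Latin-1 (assumed failure mode)."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text is not mojibake of this particular kind: leave it unchanged.
        return text


# "Le cafe est ferme" with accents, as it appears after the wrong decoding:
# each accented letter shows up as two characters (\u00c3 followed by \u00a9).
broken = "Le caf\u00c3\u00a9 est ferm\u00c3\u00a9"

# An annotation made BEFORE correction, stored as character offsets.
target = "ferm\u00c3\u00a9"
start = broken.index(target)
end = start + len(target)

fixed = fix_mojibake(broken)

# Each repaired accent removes one character, so the old offsets
# no longer cover the same word in the corrected text.
stale_span = fixed[start:end]
```

Here the corrected text is two characters shorter than the broken one, and stale_span no longer holds the whole word. This is why a Corrector should run before any annotating resource.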
Resource RSS Parser
Resource type Annotator - this PR adds / modifies annotations / features in a Language Resource.
Dependencies none
Description This PR parses an XML RSS feed and annotates the items, as well as their title, description text, link and publication date. It is very permissive about the formats it accepts, but it needs the XML markup to be present in the document text: set the 'markup aware' document property to false so that the markup is kept.
Screenshot RSS Parser screenshot
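
As a rough illustration of why the RSS Parser needs the markup left in the text, the sketch below locates items and their fields directly in the raw feed and records them as (type, start, end) character spans, similar in spirit to annotations. It is not the plugin's implementation: the function annotate_rss, the regex-based matching and the sample feed are all assumptions made for this example, and CDATA sections are not handled.

```python
import re

# Fields annotated inside each item, as listed in the description above.
FIELDS = ("title", "description", "link", "pubDate")


def annotate_rss(text):
    """Return (type, start, end) character spans for each item and its fields."""
    annotations = []
    item_re = re.compile(r"<item\b[^>]*>(.*?)</item>", re.DOTALL | re.IGNORECASE)
    for item in item_re.finditer(text):
        # The item span covers the whole element, tags included.
        annotations.append(("item", item.start(), item.end()))
        base = item.start(1)  # offset of the item body within the full text
        for field in FIELDS:
            pattern = rf"<{field}\b[^>]*>(.*?)</{field}>"
            match = re.search(pattern, item.group(1), re.DOTALL | re.IGNORECASE)
            if match:
                # Field spans cover only the text between the tags.
                annotations.append((field, base + match.start(1), base + match.end(1)))
    return annotations


# Hypothetical feed, kept on one logical string so offsets are easy to follow.
feed = (
    "<rss><channel>"
    "<item><title>Hello</title><link>http://example.org/1</link>"
    "<pubDate>Mon, 19 May 2014 16:53:00 GMT</pubDate>"
    "<description>First post</description></item>"
    "</channel></rss>"
)

spans = annotate_rss(feed)
found = {kind: feed[start:end] for kind, start, end in spans}
```

If the tags had been stripped from the text before this step, there would be nothing left to match against, which is the reason for the 'markup aware' setting mentioned above.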
toolkit_web.txt · Last modified: 2014/05/19 16:53 by sgesche