Toolkit_PolytonicGreek is a GATE plugin developed by the Liris DRIM team. It groups processing resources for Ancient polytonic Greek, especially Koine Greek. Processing tasks range from language processing (normalisation, lemmatisation) to quotation retrieval.
To search for text reuse, you can apply the following resources in the following order:
The result of this chain is a set of Language Resources with annotations marking text reuse. There is no XML output for these yet (the exact output format is still under discussion).
Overall, the Processing Resources provided by this plugin fall into two sets: the Polytonic Greek set, which processes general Greek text, and the Biblindex set, which also deals with text structure. The Biblindex set is specific to the formats used in the Biblindex ANR project, although nothing restricts these formats to Biblindex (see below).
Resource | Polytonic Greek Transcoder |
---|---|
Resource type | Annotator - this PR adds / modifies annotations / features in a Language Resource. |
Dependencies | GATE Unicode Tokeniser or BibleWorks Code Tokeniser |
Parameters | input format: the format of the input document, which determines which transcoders are used; it must be one of 'unicode', 'betacode' or 'bibleworks code'. |
Description | This PR annotates each word with its transcoded forms: Unicode, beta-code and BibleWorks code. Any of these formats is accepted as input (beta-code is handled as in the Perseus lowercase implementation). When the input is beta-code or BibleWorks code, you must also use a matching tokeniser (we provide one for BibleWorks, and there is one for beta-code in the Perseus Hopper). |
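
As an illustration of what transcoding involves, here is a minimal Python sketch of one direction only: lowercase beta-code to Unicode. It is not the plugin's code. The function name `betacode_to_unicode` and the mapping tables are assumptions made for this example, uppercase (`*`) and punctuation are left out, and the input is assumed to be a single word.

```python
import unicodedata

# Base letters of lowercase beta-code (final sigma is handled below).
LETTERS = {
    "a": "α", "b": "β", "g": "γ", "d": "δ", "e": "ε", "z": "ζ",
    "h": "η", "q": "θ", "i": "ι", "k": "κ", "l": "λ", "m": "μ",
    "n": "ν", "c": "ξ", "o": "ο", "p": "π", "r": "ρ", "s": "σ",
    "t": "τ", "u": "υ", "f": "φ", "x": "χ", "y": "ψ", "w": "ω",
}

# Beta-code diacritic signs mapped to Unicode combining characters.
DIACRITICS = {
    ")": "\u0313",   # smooth breathing
    "(": "\u0314",   # rough breathing
    "/": "\u0301",   # acute
    "\\": "\u0300",  # grave
    "=": "\u0342",   # circumflex (perispomeni)
    "+": "\u0308",   # diaeresis
    "|": "\u0345",   # iota subscript
}


def betacode_to_unicode(word: str) -> str:
    """Transcode one lowercase beta-code word to precomposed Unicode (hypothetical helper)."""
    chars = []
    for ch in word:
        if ch in LETTERS:
            chars.append(LETTERS[ch])
        elif ch in DIACRITICS:
            chars.append(DIACRITICS[ch])
        else:
            chars.append(ch)  # unknown characters are passed through unchanged
    text = "".join(chars)
    # A sigma at the very end of the word takes its final form.
    if text.endswith("σ"):
        text = text[:-1] + "ς"
    # NFC merges each letter with its combining marks where a precomposed form exists.
    return unicodedata.normalize("NFC", text)


print(betacode_to_unicode("lo/gos"))
print(betacode_to_unicode("a)/nqrwpos"))
```

Working with combining characters and normalising at the end avoids a hand-written table of every letter and diacritic combination, at the cost of depending on the input listing diacritics after their letter.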
Resource | Polytonic Greek Lemmatiser |
---|---|
Resource type | Annotator - this PR adds / modifies annotations / features in a Language Resource. |
Dependencies | GATE Unicode Tokeniser or Polytonic Greek Transcoder, lemma table (see description) |
Parameters | lemma table: the URL of a lemma table (see description) |
Description | This PR uses a lemma table to annotate each word with its lemma. It was built for statistical text analysis, not for critical editions, so there is no disambiguation (as the absence of grammatical analysis suggests). It requires a lemma table whose lines have the format 'term==lemma'. The lemma table used in Biblindex is available here. Several lemmata can be given for a single term with the ';;' separator ('term==lemma1;;lemma2;;lemma3'); when more than one lemma is available, the PR does not try to guess the right one and uses the first in the list. |
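
The lemma table format and the "first lemma wins" rule can be sketched in a few lines of Python. This is an illustration under assumptions, not the plugin's implementation: the names `parse_lemma_table` and `lemmatise` are hypothetical, the sample entries are made up for the example, and returning `None` for an unknown word is a choice of this sketch.

```python
def parse_lemma_table(lines):
    """Read 'term==lemma1;;lemma2' lines into a dict: term -> list of lemmata."""
    table = {}
    for raw in lines:
        line = raw.strip()
        if "==" not in line:
            continue  # skip blank or malformed lines
        term, _, lemmata = line.partition("==")
        table[term] = [lemma for lemma in lemmata.split(";;") if lemma]
    return table


def lemmatise(word, table):
    """Return the first lemma listed for the word; no disambiguation is attempted."""
    candidates = table.get(word)
    return candidates[0] if candidates else None


# Made-up sample entries in the documented format.
SAMPLE_LINES = [
    "λόγου==λόγος",
    "ἦν==εἰμί;;ἠμί",
]

table = parse_lemma_table(SAMPLE_LINES)
print(lemmatise("λόγου", table))
print(lemmatise("ἦν", table))
```

Because the first lemma is always chosen, the order of the lemmata on each line of the table decides the result for ambiguous forms.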
Resource | Greek Lemma Export |
---|---|
Resource type | File Writer - this PR does not alter annotations and writes data into a file. |
Dependencies | Polytonic Greek Lemmatiser |
Parameters | destination file: the URL of the file that will receive the output. |
Description | This PR writes the content of the document with the lemma annotations to a file. It has no purpose other than providing some example code. |
Resource | Biblindex Parser |
---|---|
Resource type | File Writer - this PR does not alter annotations and writes data into a file. |
Dependencies | none |
Parameters | source file: the URL of the source file. destination file: the URL of the file that will receive the output XML. data type: the data type of the input file; must be either 'rtf/tlg' for a TLG-styled RTF document or 'txt/bibleworks' for a BibleWorks text export. |
Description | This PR converts a file written in one of the file formats used during the Biblindex ANR project (BibleWorks export file or TLG-styled RTF file) to an XML file. Word break handling and transcoding are taken care of when necessary. Please note that both input and output files must be given as parameters and that this PR does not make use of Language Resources. Files written by this PR can however be opened as XML Language Resources to be processed further. This PR acts as a Text Structure annotator (the GATE XML file reader keeps track of the XML annotations). Ideally it should process an already loaded LR, but the code was originally designed outside the GATE platform and has so far only been adapted enough to run as this PR. |
Resource | Biblindex Tokeniser |
---|---|
Resource type | Annotator - this PR adds / modifies annotations / features in a Language Resource. |
Dependencies | XML file compliant with one of the Biblindex DTDs (for instance the output of the Biblindex Parser), read by GATE with markup awareness |
Parameters | none |
Description | This PR uses the XML markup to split the document into tokens. It should run faster than the GATE Unicode Tokeniser in the specific case of a document compliant with a Biblindex DTD. |
The Biblindex Parser, the Biblindex Tokeniser and the Biblindex Processed Text Export PRs work with XML files that follow the DTDs below (the parser does not use the Word attributes norm and lem). You may format any (Unicode Greek) text according to these DTDs and load the resulting files as Language Resources instead of using the Biblindex Parser PR.
    <!DOCTYPE DocWorks [
      <!ELEMENT DocWorks (Work+)>
      <!ELEMENT Work (Chapter+)>
      <!ATTLIST Work name CDATA #REQUIRED>
      <!ELEMENT Chapter (Paragraph+)>
      <!ATTLIST Chapter num CDATA #REQUIRED>
      <!ELEMENT Paragraph (Line+)>
      <!ATTLIST Paragraph num CDATA #REQUIRED>
      <!ELEMENT Line (Word*)>
      <!ATTLIST Line num CDATA #REQUIRED>
      <!ELEMENT Word (#PCDATA)>
      <!ATTLIST Word num CDATA #REQUIRED>
      <!ATTLIST Word norm CDATA #IMPLIED>
      <!ATTLIST Word lem CDATA #IMPLIED>
    ]>

    <!DOCTYPE DocBible [
      <!ELEMENT DocBible (Bible+)>
      <!ELEMENT Bible (Book+)>
      <!ATTLIST Bible name CDATA #REQUIRED>
      <!ELEMENT Book (Chapter+)>
      <!ATTLIST Book name CDATA #REQUIRED>
      <!ELEMENT Chapter (Verse+)>
      <!ATTLIST Chapter num CDATA #REQUIRED>
      <!ELEMENT Verse (Word*)>
      <!ATTLIST Verse num CDATA #REQUIRED>
      <!ELEMENT Word (#PCDATA)>
      <!ATTLIST Word num CDATA #REQUIRED>
      <!ATTLIST Word norm CDATA #IMPLIED>
      <!ATTLIST Word lem CDATA #IMPLIED>
    ]>
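
To show what a document following the DocBible DTD might look like, and how its markup already delimits the tokens, here is a small Python sketch using the standard library. The sample values (the Bible name, the single `lem` attribute) and the helper `iter_words` are assumptions made for this example; `xml.etree.ElementTree` does not validate against the DTD, so the sample simply follows its element and attribute names.

```python
import xml.etree.ElementTree as ET

# Hand-written sample following the DocBible structure (John 1:1, first five words).
SAMPLE = """<DocBible>
  <Bible name="sample">
    <Book name="John">
      <Chapter num="1">
        <Verse num="1">
          <Word num="1">Ἐν</Word>
          <Word num="2">ἀρχῇ</Word>
          <Word num="3" lem="εἰμί">ἦν</Word>
          <Word num="4">ὁ</Word>
          <Word num="5">λόγος</Word>
        </Verse>
      </Chapter>
    </Book>
  </Bible>
</DocBible>"""


def iter_words(xml_text):
    """Yield (reference, text, lemma) for every Word element, in document order."""
    root = ET.fromstring(xml_text)
    for book in root.iter("Book"):
        for chapter in book.iter("Chapter"):
            for verse in chapter.iter("Verse"):
                for word in verse.iter("Word"):
                    ref = (f'{book.get("name")} {chapter.get("num")}:'
                           f'{verse.get("num")}.{word.get("num")}')
                    # lem is #IMPLIED in the DTD, so it may be absent (None here).
                    yield ref, word.text, word.get("lem")


for ref, text, lemma in iter_words(SAMPLE):
    print(ref, text, lemma)
```

Each Word element is one token, and its position in the Book, Chapter and Verse hierarchy gives a reference for it without any further text analysis.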