Luck, that's when preparation and opportunity meet. -- P. E. Trudeau
RussianPOSTagger provides a Java interface (to the C++ Lemmatizer via XML-RPC) for lemmatizing Russian, English, and German text (in Natural Language Processing, a lemma is the canonical form of a lexeme). RussianPOSTagger looks up the lemma, determines the part of speech (POS), etc.
RussianPOSTagger can work as a GATE module (http://gate.ac.uk) or inside a standalone NLP application. This project includes an example of embedding the ANNIE and RussianPOSTagger modules (as GATE modules) in a standalone NLP application.
The dataflow is:
GATE <-> RussianPOSTagger (Java) <-> LemServer (C++) <-> Lemmatizer
RussianPOSTagger is a GATE module written in Java. It acts as an XML-RPC client.
LemServer (written in C++) is an XML-RPC server. It connects GATE to the Lemmatizer.
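To make the XML-RPC link between the Java client and LemServer more concrete, here is a minimal sketch of what an XML-RPC request body looks like on the wire. It uses only the Java standard library rather than the xmlrpc client library shipped with the project. The method name "lemmatize" and the single string parameter are hypothetical and are not taken from LemServer's actual interface.

```java
// Sketch only: the method name passed in main() is hypothetical,
// not LemServer's documented API.
final class XmlRpcRequestSketch {

    /** Escapes the characters that are special in XML text content. */
    static String escapeXml(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    /** Builds an XML-RPC methodCall body with a single string parameter. */
    static String buildMethodCall(String methodName, String param) {
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
             + "<methodCall>"
             + "<methodName>" + escapeXml(methodName) + "</methodName>"
             + "<params><param><value><string>"
             + escapeXml(param)
             + "</string></value></param></params>"
             + "</methodCall>";
    }

    public static void main(String[] args) {
        // Print the request that a client would POST to the XML-RPC server.
        System.out.println(buildMethodCall("lemmatize", "books"));
    }
}
```

A real client would send this body in an HTTP POST and parse the methodResponse it gets back; the XML-RPC library bundled with aotClient presumably takes care of both steps.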
RussianPOSTagger and LemServer were written by Andrew Krizhanovsky. Lemmatizer was written by Dialing (http://www.aot.ru).
Download rupostagger-XX.XX.tar.gz from https://sourceforge.net/projects/rupostagger
Download the lemmatizer (from http://www.aot.ru, or take it from the downloaded rupostagger-XX.XX/LemServer/aot.ru/) and install it to ~/RML (Linux or Cygwin):
lemmatizer.tar.gz
rus-src-morph.tar.gz # Russian dictionary
eng-src-morph.tar.gz # Download, if you plan to work with English
ger-src-morph.tar.gz # and German dictionary
$ mkdir ~/RML
$ cd ~/RML
$ tar -xzvf file # Unpack archives to RML directory
$ export RML=~/RML
Compile
$ ./compile_morph.sh
$ ./generate_morph_bin.sh Russian
$ ./generate_morph_bin.sh English
$ ./compile_morph_server.sh
If you get the error "gmake: command not found", then:
$ cd /usr/bin && ln -s /usr/bin/make gmake # on Linux systems, "gmake" (GNU make) is just make
Run server
$ ./Bin/LemServer.exe 8000 5
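Before starting GATE, it can help to confirm that something is accepting connections. The sketch below assumes that the first argument (8000) is the TCP port LemServer listens on and that the server runs on localhost. Both are assumptions, and the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

final class LemServerPortCheck {

    /** Returns true if a TCP connection to host:port succeeds within timeoutMs. */
    static boolean isListening(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Connection refused, timed out, or host not resolvable.
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: 8000 (the first LemServer argument) is the listening port.
        boolean up = isListening("localhost", 8000, 1000);
        System.out.println(up ? "Port 8000 is open" : "Nothing is listening on port 8000");
    }
}
```

An open port only shows that some process is listening there; it does not prove that the process is LemServer or that the dictionaries loaded correctly.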
Download GATE (version 3.1 or later) from http://gate.ac.uk/download. Install it to /opt/GATE-3.1 (or "C:\Program Files\GATE-3.1" on Windows).
Install RussianPOSTagger as a plugin for GATE. In the first case (1.) you can simply copy the gate_plugins/ files to GATE's plugins directory. In the second case (2.) you should compile the RussianPOSTagger and aotClient projects in order to create RussianPOSTagger.jar and aotClient.jar.
1.
$ cp -r rupostagger-XX.XX/gate_plugins/RussianPOSTagger /opt/GATE-3.1/plugins
2.
$ cd /opt/GATE-3.1/plugins
$ mkdir -p RussianPOSTagger RussianPOSTagger/lib
$ cp rupostagger-XX.XX/aotClient/dist/aotClient.jar /opt/GATE-3.1/plugins/RussianPOSTagger/lib # compiled in aotClient project
Open rupostagger-XX.XX/RussianPOSTagger/build.xml and set the "GATEDir" property (the GATE location)
$ ant
$ cp rupostagger-XX.XX/RussianPOSTagger/RussianPOSTagger.jar /opt/GATE-3.1/plugins/RussianPOSTagger # compiled in RussianPOSTagger project
$ cp rupostagger-XX.XX/RussianPOSTagger/creole.xml /opt/GATE-3.1/plugins/RussianPOSTagger
Run GATE.
Add a GATE Document: click "File/New Language Resource/GATE Document" and browse to a Russian text for 'sourceUrl', e.g. rupostagger-XX.XX/data/ru/russian.txt
Create a pipeline from the processing resources: Document Reset PR, ANNIE English Tokeniser, ANNIE Sentence Splitter, Russian POS Tagger.
Assign 'russian.txt' to each processing resource.
Click the 'Run' button.
Open 'russian.txt' and click the 'Annotation Sets' and 'Annotations' buttons (see Fig. 3 “Annotation sets (Paradigm and Wordform) are presented for Dijkstra saying.”). Select the Wordform and Paradigm checkboxes to see the results of the Lemmatizer's work.
See rupostagger-XX.XX/embedRPOST (a NetBeans project) for an example of embedding the ANNIE and RussianPOSTagger modules in a standalone NLP application. See the file rupostagger-XX.XX/embedRPOST/src/embedrpost/StandAloneRussianPOSTagger.java
Figure 2. Pipeline consists of the selected processing resources: (1) Document Reset PR, (2) ANNIE English Tokeniser, (3) ANNIE Sentence Splitter, (4) Russian POS Tagger. Test document signatures_en.txt was assigned to each processing resource.
Copyright (c) 2005, 2006 Andrew Krizhanovsky /aka / at / mail.iias.spb.su/
Distributed under the GNU General Public License, version 2 or any later version. See gpl.txt
The following open-source libraries are used:
Add the library aotClient/lib/xmlrpc-2.0.jar to the aotClient project.
Add the library /opt/GATE-3.1/bin/gate.jar (or "C:\Program Files\GATE-3.1\bin\gate.jar" on Windows) to the RussianPOSTagger project.
The embedRPOST project depends on the RussianPOSTagger and aotClient projects.
The RussianPOSTagger project depends on the aotClient project.
The source code is UTF-8 encoded. This is important for the test packages. If you are using the NetBeans IDE, set NetBeans/Tools/Options/Java Sources/Default Encoding=UTF8
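The following sketch illustrates why the encoding setting matters. It encodes a Russian word as UTF-8 and then decodes the bytes with a different charset (ISO-8859-1 is chosen arbitrarily here), which is roughly what happens to Cyrillic string literals when UTF-8 sources are read with the wrong default encoding. The class and method names are hypothetical and are not part of the project.

```java
import java.nio.charset.StandardCharsets;

final class EncodingMismatchSketch {

    /** Encodes text as UTF-8, then decodes the bytes with the wrong charset (ISO-8859-1). */
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The Russian word "slovo" (word), written with Unicode escapes
        // so that this file compiles under any source encoding.
        String word = "\u0441\u043b\u043e\u0432\u043e";
        String garbled = misdecode(word);
        System.out.println(word.length());        // 5 characters
        System.out.println(garbled.length());     // 10: each Cyrillic letter takes 2 UTF-8 bytes
        System.out.println(word.equals(garbled)); // false
    }
}
```

A test that compares tagger output with a Cyrillic literal would fail in the same way if the compiler read the source with the wrong encoding, even though the tagger itself behaved correctly.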