RussianPOSTagger

Andrew Krizhanovsky


Table of Contents

Introduction
Download
Installation
Screenshots
A. GPL Licence
B. Third party software
C. Notes for developers

Luck, that's when preparation and opportunity meet. -- P. E. Trudeau

Introduction

RussianPOSTagger provides a Java interface to the C++ Lemmatizer (via XML-RPC) in order to lemmatize Russian, English, and German text (a lemma is the canonical form of a lexeme). RussianPOSTagger finds the lemma of a word, determines its part of speech (POS), etc.

RussianPOSTagger can work as a module of GATE (http://gate.ac.uk) or as part of a standalone NLP application. This project includes an example of embedding the ANNIE and RussianPOSTagger modules (as GATE modules) in a standalone NLP application.

The dataflow scheme is:

  GATE <-> RussianPOSTagger (Java) <-> LemServer (C++) <-> Lemmatizer

RussianPOSTagger is a GATE module written in Java. It is an XML-RPC client.

LemServer (written in C++) is an XML-RPC server. It connects GATE with Lemmatizer.

RussianPOSTagger and LemServer were written by Andrew Krizhanovsky. Lemmatizer was written by Dialing (http://www.aot.ru).
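
To make the client/server link in the dataflow concrete, here is a minimal sketch of what an XML-RPC request looks like on the wire, written with the Java standard library only. It is not the code of aotClient (which uses the xmlrpc-2.0.jar library). The method name "lemmatize", the endpoint path "/RPC2", and the class name XmlRpcCallSketch are assumptions made for illustration; the real method names are defined by LemServer.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

/** Illustrative only: builds and sends a bare XML-RPC methodCall. */
final class XmlRpcCallSketch {

    /** Escapes the characters that must not appear raw inside XML text. */
    static String escapeXml(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Builds an XML-RPC methodCall document whose parameters are all strings. */
    static String buildMethodCall(String methodName, String... stringParams) {
        StringBuilder xml = new StringBuilder("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
        xml.append("<methodCall><methodName>").append(escapeXml(methodName))
           .append("</methodName><params>");
        for (String p : stringParams) {
            xml.append("<param><value><string>").append(escapeXml(p))
               .append("</string></value></param>");
        }
        return xml.append("</params></methodCall>").toString();
    }

    /** POSTs the request body to the server and returns the HTTP status code. */
    static int post(String endpoint, String body) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(endpoint).toURL().openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        return conn.getResponseCode();
    }

    public static void main(String[] args) {
        // "lemmatize" is a hypothetical method name; the argument is a Russian word form.
        String request = buildMethodCall("lemmatize", "\u043a\u043d\u0438\u0433\u0438");
        System.out.println(request);
        // With a running server, the request could be sent like this
        // (host, port, and path are assumptions):
        // int status = post("http://localhost:8000/RPC2", request);
    }
}
```

The request is encoded as UTF-8 explicitly, since the text being lemmatized is typically Cyrillic.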

Download

Download rupostagger-XX.XX.tar.gz from https://sourceforge.net/projects/rupostagger

Installation

  • Download Lemmatizer (from http://www.aot.ru, or take it from the downloaded rupostagger-XX.XX/LemServer/aot.ru/) and install it to ~/RML (Linux or Cygwin). The archives are:

      lemmatizer.tar.gz
      rus-src-morph.tar.gz # Russian dictionary
      eng-src-morph.tar.gz # English dictionary (download if you plan to work with English)
      ger-src-morph.tar.gz # German dictionary (download if you plan to work with German)

    Unpack them:

      $ mkdir ~/RML
      $ cd ~/RML
      $ tar -xzvf file # Unpack each archive into the RML directory
      $ export RML=~/RML

  • Compile

      $ ./compile_morph.sh
      $ ./generate_morph_bin.sh Russian
      $ ./generate_morph_bin.sh English
      $ ./compile_morph_server.sh

    If you get the error "gmake: command not found", then:

      $ cd /usr/bin && ln -s /usr/bin/make gmake # "gmake" (GNU make) is just "make" on Linux systems

  • Run server

      $ ./Bin/LemServer.exe 8000 5

  • Download GATE (version 3.1 or later) from http://gate.ac.uk/download. Install it to /opt/GATE-3.1 (or "C:\Program Files\GATE-3.1" on Windows).

  • Install RussianPOSTagger as a plugin for GATE. There are two options. In the first case (1.) you simply copy the gate_plugins/ files to GATE's plugins directory. In the second case (2.) you compile the RussianPOSTagger and aotClient projects in order to create RussianPOSTagger.jar and aotClient.jar.

    1.  $ cp -r rupostagger-XX.XX/gate_plugins/RussianPOSTagger /opt/GATE-3.1/plugins

    2.  $ cd /opt/GATE-3.1/plugins
        $ mkdir -p RussianPOSTagger RussianPOSTagger/lib
        $ cp rupostagger-XX.XX/aotClient/dist/aotClient.jar /opt/GATE-3.1/plugins/RussianPOSTagger/lib # compiled in aotClient project
        Open rupostagger-XX.XX/RussianPOSTagger/build.xml and set the "GATEDir" property to the GATE location, then run ant in the directory containing build.xml:
        $ ant
        $ cp rupostagger-XX.XX/RussianPOSTagger/RussianPOSTagger.jar /opt/GATE-3.1/plugins/RussianPOSTagger # compiled in RussianPOSTagger project
        $ cp rupostagger-XX.XX/RussianPOSTagger/creole.xml /opt/GATE-3.1/plugins/RussianPOSTagger

  • Run GATE.

    • Open File/Manage CREOLE plugins in GATE and select the checkbox for RussianPOSTagger.

    • Add a GATE Document:

        Click "File/New Language Resource/GATE Document".
        Browse to a Russian text for 'sourceUrl', e.g. rupostagger-XX.XX/data/ru/russian.txt

    • Add an Application Pipeline.

    • Add Processing Resources to the Pipeline:

        Document Reset PR
        ANNIE English Tokeniser
        ANNIE Sentence Splitter
        Russian POS Tagger

    • Assign 'russian.txt' to each processing resource.

    • Click the 'Run' button.

    • Open 'russian.txt' and click the 'Annotation Sets' and 'Annotations' buttons (see Figure 3). Select the checkboxes for Wordform and Paradigm to see the results of Lemmatizer's work.

  • See rupostagger-XX.XX/embedRPOST (a NetBeans project) for an example of embedding the ANNIE and RussianPOSTagger modules in a standalone NLP application, in particular the file rupostagger-XX.XX/embedRPOST/src/embedrpost/StandAloneRussianPOSTagger.java
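
RussianPOSTagger is an XML-RPC client, so the pipeline needs a running LemServer. A minimal Java sketch for checking that the server accepts TCP connections before the pipeline is started is given here. It assumes that the first argument of the run command above (8000) is the listening port and that the server runs on localhost. The class name LemServerProbe is hypothetical and is not part of the distribution.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Illustrative only: checks whether something is listening on a TCP port. */
final class LemServerProbe {

    /** Returns true if a TCP connection to host:port succeeds within the timeout. */
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timed out, or host unreachable.
            return false;
        }
    }

    public static void main(String[] args) {
        // Host and port are assumptions; adjust them to how LemServer was started.
        boolean up = isListening("localhost", 8000, 1000);
        System.out.println(up ? "LemServer is reachable" : "LemServer is not reachable");
    }
}
```

A successful connection only shows that the port is open; it does not verify that the dictionaries were compiled and loaded correctly.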

Screenshots

Figure 1. Parameters of GATE's module RussianPOSTagger

Figure 2. The pipeline consists of the selected processing resources: (1) Document Reset PR, (2) ANNIE English Tokeniser, (3) ANNIE Sentence Splitter, (4) Russian POS Tagger. The test document signatures_en.txt was assigned to each processing resource.

Figure 3. Annotation sets (Paradigm and Wordform) shown for a Dijkstra saying.

A. GPL Licence

Copyright (c) 2005, 2006 Andrew Krizhanovsky /aka / at / mail.iias.spb.su/

Distributed under the GNU General Public License, version 2 or any later version. See gpl.txt

B. Third party software

The following open-source libraries are used:

  • Morphological analysis, www.aot.ru (GNU LGPL);
  • XmlRpc++ Library 0.7 (GNU LGPL).

C. Notes for developers

Add the library aotClient/lib/xmlrpc-2.0.jar to the aotClient project.

Add the library /opt/GATE-3.1/bin/gate.jar (or "C:\Program Files\GATE-3.1\bin\gate.jar" on Windows) to the RussianPOSTagger project.

The embedRPOST project depends on the RussianPOSTagger and aotClient projects.

The RussianPOSTagger project depends on the aotClient project.

The source code is encoded in UTF-8. This is important for the test packages. If you are using the NetBeans IDE, set NetBeans/Tools/Options/Java Sources/Default Encoding=UTF8
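
The same encoding concern can arise at run time when test data is read from disk. The following minimal sketch reads a text file with an explicit charset instead of relying on the platform default. It assumes that the data files are stored as UTF-8, which this document does not state, and the class name Utf8Read is hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Illustrative only: reads a whole text file as UTF-8. */
final class Utf8Read {

    /** Decodes the file's bytes as UTF-8, independent of the platform default charset. */
    static String readUtf8(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Pass the path of a text file, e.g. a Russian test document, as the first argument.
        if (args.length > 0) {
            System.out.println(readUtf8(Paths.get(args[0])));
        }
    }
}
```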