Phramer - An Open-Source Statistical Phrase-Based MT Decoder

Version 1.1 (July 5, 2009)
Many features introduced since the 1.0.5 release are not documented on this page. Please see the release history for more information.

Copyright (c) 2006-2009, Marian Olteanu

 
 

License

Copyright (c) 2006-2009, Marian Olteanu <marian_DOT_olteanu_AT_gmail_DOT_com>
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, 
are permitted provided that the following conditions are met:
- Redistributions of source code must retain the above copyright notice, this list 
of conditions and the following disclaimer. 
- Redistributions in binary form must reproduce the above copyright notice, this 
list of conditions and the following disclaimer in the documentation and/or 
other materials provided with the distribution. 
- Neither the name of the University of Texas at Dallas nor the names of its 
contributors may be used to endorse or promote products derived from this 
software without specific prior written permission. 

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND 
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED 
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR 
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES 
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON 
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 

 

Download

See release history for differences between versions.

Browse source code

here
 

Release notes and FAQ

1. What is Phramer?

Phramer is a Statistical Phrase-Based Machine Translation decoder, compatible with Pharaoh (http://www.isi.edu/licensed-sw/pharaoh/) in terms of command-line parameters and input/configuration/data files. Phramer is 100% written in Java.
The accuracy of Phramer is very similar to the accuracy of Pharaoh under the same configuration.

2. Why did you write Phramer?

The goal of the Phramer project was to write a decoder that is extensible, has its source code available (initially only to the authors), is written in a pleasant, non-scripting programming language (without segmentation faults), is compatible with Pharaoh, and can be used in Statistical MT research.

3. Why would I prefer Phramer instead of Pharaoh?

a) License. Phramer is an Open Source program (BSD license), thus the limitations are minimal. Pharaoh is currently available under a much more restrictive license
b) Source code availability. Phramer is Open Source; Pharaoh currently is not.
c) Cross-platform. Phramer is written in Java and it was tested on Linux and Windows 2000/XP. We think it should work on other platforms without any issues
d) Speed. Phramer is currently 1.5 - 3 times faster than Pharaoh 1.2.3 (not including the loading time of the data files). We continuously work on improving the performance
e) Extensibility. Phramer has a modular architecture; a lot of hooks are provided
f) Features. Phramer currently has more features than Pharaoh 1.2.3 (not counting undocumented features that we are not aware of). Some of these features provide good ways to extend the decoder. See the list of additional parameters.

4. What do I need to use Phramer?

a) Java 5/1.5 (a recent update; older builds of Java 1.5 tend to have encoding problems on Linux)
b) A translation table. A toolkit to generate translation tables can be found at http://www.iccs.inf.ed.ac.uk/~pkoehn/training.tgz . It is possible (but not probable) that future versions of Phramer will include translation table training functionality
c) A language model (ARPA format). We recommend using the SRILM toolkit.

6. How do I run Phramer?

In the "scripts" folder you'll find phramer.sh / phramer.bat . Run it as you would run Pharaoh.
Due to time constraints, we are releasing the first version of Phramer with minimal documentation. We recommend reading the Pharaoh manual (http://www.isi.edu/licensed-sw/pharaoh/manual-v1.2.ps) before using Phramer.

5. I see that you don't offer tools for translation table generation. Do you include something useful for training?

We include a minimum-error-rate training (MERT) toolkit. It is compatible with both Phramer and Pharaoh. When used in conjunction with Pharaoh, the MERT toolkit also requires Carmel (http://www.isi.edu/licensed-sw/carmel). Using the MERT toolkit with the Phramer decoder doesn't require additional resources (since version 1.0.1). See ./history.txt and ./data/mert.properties.txt for details.
The main class of the MERT training toolkit is org.phramer.v1.decoder.main.MERTMain. It is also called by the ./scripts/mert.sh script. The file ./data/mert.properties.txt is a demo configuration file, with documentation on how to adapt it to your needs.
The program has only one parameter: the configuration file. It writes debug information to StdErr and useful information to StdOut.

6. How do I make Phramer run as fast as promised?

Phramer 1.0.0 and 1.0.1 include specific optimizations related to the language model. The optimizations impose limitations:
Not using the described optimizations will make Phramer run at the same speed as Pharaoh or up to two times slower.

Starting with v1.0.1, there are translation-table-specific optimizations that reduce the loading time of the data files.
The ./scripts/lm2bin.sh script converts an ARPA language model into a binary language model. First parameter: the input ARPA language model. Second parameter: the output binary language model.
The ./scripts/tt2bin.sh script converts a text-based translation table into a binary translation table. First parameter: the file mask for the output binary translation table (which is stored in 4 different files). Remaining parameters: the exact parameters that will be used to call the Phramer decoder (relevant parameters: the weights, the path to the translation table, and the pruning parameters for the translation table).
Also, running the server version of the JVM improves performance (java -server <class-name>).

7. Can you tell me more about Phramer's speed?

8. I heard that Phramer includes support for other data storage layers. What can you tell me about it?

The main package includes support for in-memory LM and TT, and also for remote LM and TT.
We implemented support to store LMs and TTs into SQLite databases (http://www.sqlite.org/).
We will release the add-on very soon (we didn't include it in the main package because it requires additional licensing, also BSD).
Note: we don't recommend using it for LMs.

9. Are there any compatibility issues with Phramer?

With the default configuration, Phramer should behave (almost) the same as Pharaoh, with one exception: if you use a distortion limit greater than or equal to 1 (-dl n), add the parameter -x-max-phrase-length (n+1). Pharaoh implicitly limits the length of the phrases it uses (foreign side) to this value.
The other differences in the output should be caused by differences in the implementations of the algorithms. Phramer strictly enforces the threshold parameters (-s and -b). Thus, if the goal is to obtain the same results as Pharaoh, slightly changing these parameters at the expense of speed is recommended. These adjustments have a minor performance impact (probably under 10%), so even with the adjusted parameters, a Phramer tuned for speed (see question 6) will be considerably faster than Pharaoh (see question 7).
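
As a rough illustration of the distortion-limit rule above, the sketch below appends -x-max-phrase-length with the distortion limit plus one whenever -dl with a value of at least 1 is present. The class PharaohCompatArgsSketch and its method are hypothetical helpers written for this example; they are not part of Phramer, which expects you to add the parameter yourself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of the Phramer package.
final class PharaohCompatArgsSketch {

    // Returns a copy of args. If "-dl n" with n >= 1 is present and no explicit
    // "-x-max-phrase-length" was given, "-x-max-phrase-length (n+1)" is appended.
    static List<String> withMaxPhraseLength(List<String> args) {
        List<String> result = new ArrayList<String>(args);
        int dlIndex = args.indexOf("-dl");
        boolean hasValue = dlIndex >= 0 && dlIndex + 1 < args.size();
        if (!hasValue || args.contains("-x-max-phrase-length")) {
            return result;
        }
        int distortionLimit = Integer.parseInt(args.get(dlIndex + 1));
        if (distortionLimit >= 1) {
            result.add("-x-max-phrase-length");
            result.add(String.valueOf(distortionLimit + 1));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [-dl, 4, -x-max-phrase-length, 5]
        System.out.println(withMaxPhraseLength(Arrays.asList("-dl", "4")));
    }
}
```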

10. How does Phramer handle text files encoding?

All the tools (we hope) use the file.encoding property passed to the JVM to decide how to process files. Thus we recommend using: java -Dfile.encoding=UTF-8 or whatever is best for you.
The Phramer decoder that is run from the command line (org.phramer.v1.decoder.main.PhramerMain) processes StdIn and StdOut also based on the file.encoding property, but it processes the input file, the output file, the language models and the translation table based on -renc, -wenc, -ttenc, -lmenc parameters, with a default value of ISO-8859-1.
Thus, you will be able to use Phramer on files with different encodings (e.g., output file and LMs: ISO-8859-1, input file: UTF-16, and translation table: UTF-8).
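
To show why the per-file encoding parameters matter, here is a minimal sketch (not Phramer code; the class EncodingSketch is a hypothetical name) that decodes the same UTF-8 bytes once with the right charset and once with the ISO-8859-1 default. It uses only standard java.io classes.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

// Hypothetical illustration only; not part of the Phramer package.
final class EncodingSketch {

    // Reads the first line of a stream, decoding its bytes with the named charset.
    static String readFirstLine(InputStream in, String charsetName) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, charsetName));
        try {
            return reader.readLine();
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // "cafe" with an acute accent on the e; in UTF-8 the accented letter takes two bytes.
        byte[] utf8Bytes = "caf\u00e9".getBytes("UTF-8");

        // Decoded with the matching charset: 4 characters.
        System.out.println(readFirstLine(new ByteArrayInputStream(utf8Bytes), "UTF-8").length());

        // Decoded as ISO-8859-1: each byte becomes one character, so 5 characters.
        System.out.println(readFirstLine(new ByteArrayInputStream(utf8Bytes), "ISO-8859-1").length());
    }
}
```

A word that is decoded with the wrong charset no longer matches its entry in the translation table or the language model, so it would be treated as unknown.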

11. What are the additional parameters of Phramer?

(parameter and arity)
 -x-oov-probability / -x-oov (1) , default: -10 -- OOV score
 -x-lm-lbound / -x-lb (1) , default: -10 -- minimum value out of the language model
 -x-max-phrase-length (1) -- maximum length of a phrase.
		It is designed to accelerate decoding and reduce memory consumption.
		Default value (0) = infinite
 -x-context-length (1) -- How many words are needed to compute the probability of adding a new word into the sentence.
		Default: If an n-gram model is used, lmContextLength = n-1.
 -x-chunk (0) -- Translate chunks, not sentences. Don't add <s> and </s> in the language model
 
 -x-score / -sc (2) -- score directory, n-best size
 -x-score-simple / -simple-sc (0) -- write simple score/rescore files (space-separated)
 -x-nbest / -nb (2) -- nbest directory+path, n-best size
 -x-nbest-includeprob (0) -- to include probability in the n-best list
 -x-prefix-tt (0) -- if  is not in the translation table, don't look for 
 
 -x-future-cost-class (1) -- the future cost calculator. Default: the Pharaoh future cost calculator.
     Recommended: org.phramer.v1.decoder.cost.PharaohDFutureCostCalculator which also considers the distortion
 
 -x-unk-words-transliterator (1) -- the class that does transliteration for unknown words
	 Must implement info.olteanu.interfaces.StringFilter
	 Active only if a helper wasn't already passed to the config file
	 Only in the command line
 -profile (0) -- do execution profiling

 -renc (1) -- input file encoding. Default: ISO-8859-1
 -wenc (1) -- output file encoding. Default: ISO-8859-1
 -ttenc (1) -- translation table file encoding. Default: ISO-8859-1
 -lmenc (1) -- lm file encoding. Default: ISO-8859-1
 
 -read (1) -- reads the input from the specified file instead of StdIn
 -write (1) -- writes the output to the specified file instead of StdOut

 -server (1) -- starts a Phramer server at the specified port.
             It can be accessed using org.phramer.remote.MachineTranslatorRemoteClient.main()
 -cache (1) -- if there's a server, you can use a cache that prevents re-translation.
             This parameter specifies the cache size
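
The idea behind -cache can be sketched as a size-bounded map from source sentence to translation. The sketch below is an assumption for illustration only: the source does not say how Phramer implements its cache or which eviction policy it uses, so the least-recently-used policy and the class name TranslationCacheSketch are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical LRU cache for illustration; Phramer's actual cache may differ.
final class TranslationCacheSketch extends LinkedHashMap<String, String> {
    private static final long serialVersionUID = 1L;

    private final int maxEntries;

    TranslationCacheSketch(int maxEntries) {
        // accessOrder = true: iteration order runs from least to most recently used.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    // Called by LinkedHashMap after each insertion; returning true evicts the eldest entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        TranslationCacheSketch cache = new TranslationCacheSketch(2);
        cache.put("la maison", "the house");
        cache.put("le chat", "the cat");
        cache.get("la maison");              // marks this entry as recently used
        cache.put("le chien", "the dog");    // evicts "le chat", the least recently used
        System.out.println(cache.keySet());  // prints [la maison, le chien]
    }
}
```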

12. Why are there no makefiles or something similar?

That's because we use IDEs to develop Phramer, and they take care of compilation. We do not consider writing and maintaining build files an essential task. We also provide the compiled version of Phramer (lib/phramer.jar).
A simple build system:
  find . -name "*.java" > listoffiles.txt
  javac @listoffiles.txt

13. What tools can we find in the package?

The set of tools includes:

14. How do I use remote data structures?

Start a server (org.phramer.v1.server.StartRemoteTranslationTable , org.phramer.v1.server.StartRemoteBackoffLM) and in the configuration file of the decoder, instead of providing the path to the LM/TT file, write: remote:machine:port
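
The remote:machine:port form can be told apart from an ordinary file path by its prefix. The sketch below shows one way to split such a value. It is an assumption for illustration, not Phramer's own parser: the class RemoteAddressSketch, the host name lmhost and the port 4000 are all hypothetical.

```java
// Hypothetical helper for illustration only; not part of the Phramer package.
final class RemoteAddressSketch {
    private static final String PREFIX = "remote:";

    final String host;
    final int port;

    private RemoteAddressSketch(String host, int port) {
        this.host = host;
        this.port = port;
    }

    // Returns the parsed address, or null when the value is an ordinary file path.
    static RemoteAddressSketch parse(String value) {
        if (!value.startsWith(PREFIX)) {
            return null;
        }
        String rest = value.substring(PREFIX.length());
        int colon = rest.lastIndexOf(':');
        if (colon <= 0 || colon == rest.length() - 1) {
            throw new IllegalArgumentException("Expected remote:machine:port, got: " + value);
        }
        String host = rest.substring(0, colon);
        int port = Integer.parseInt(rest.substring(colon + 1));
        return new RemoteAddressSketch(host, port);
    }

    public static void main(String[] args) {
        RemoteAddressSketch address = parse("remote:lmhost:4000");
        System.out.println(address.host + " " + address.port); // prints lmhost 4000
    }
}
```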

15. How do I extend Phramer?

You can write classes and use the extended configuration to make use of them (e.g., -x-unk-words-transliterator), also by using org.phramer.v1.decoder.main.PhramerMain.main().
You can write classes and customize the loading process (implement org.phramer.v1.decoder.loader.PhramerHelperIf or use org.phramer.v1.decoder.loader.PhramerHelperCustom to make use of your classes). You will need to write a small program that calls PhramerMain.phramerMain() (see org.phramer.v1.extensions.pos.POSPhramerMain as an example of an extension).
You can analyze the code and change code inside org.phramer.v1.decoder package and subpackages (make your own Phramer branch).
We believe that the architecture allows a lot of customizations to be made outside the org.phramer.v1.decoder.* packages.