Phramer - An Open-Source Statistical Phrase-Based MT Decoder
Version 1.1 (July 5, 2009)
Many features introduced starting with the 1.0.5 release are not documented in this HTML page. Please see the release history for more information.
Copyright (c) 2006-2009, Marian Olteanu <marian_DOT_olteanu_AT_gmail_DOT_com>
All rights reserved.
Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:
- Redistributions of source code must retain the above copyright notice, this list
of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice, this
list of conditions and the following disclaimer in the documentation and/or
other materials provided with the distribution.
- Neither the name of the University of Texas at Dallas nor the names of its
contributors may be used to endorse or promote products derived from this
software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Phramer is a Statistical Phrase-Based Machine Translation decoder, compatible with
Pharaoh (http://www.isi.edu/licensed-sw/pharaoh/) in terms of command-line parameters and
input/configuration/data files. Phramer is 100% written in Java.
The accuracy of Phramer is very similar to that of Pharaoh under the same configuration.
2. Why did you write Phramer?
The goal of the Phramer project was to write a decoder that is extensible, has its source code
available (initially only to the authors), is written in a nice programming language (not a scripting
language, and without segmentation faults), is compatible with Pharaoh, and can be used in
Statistical MT research.
3. Why would I prefer Phramer instead of Pharaoh?
a) License. Phramer is an Open Source program (BSD license), thus the limitations are minimal.
Pharaoh is currently available under a much more restrictive license.
b) Source code availability. Phramer is Open Source; Pharaoh currently is not.
c) Cross-platform. Phramer is written in Java and was tested on Linux and Windows 2000/XP. We
think it should work on other platforms without any issues.
d) Speed. Phramer is currently 1.5 - 3 times faster than Pharaoh 1.2.3 (not including the loading
time of the data files). We continuously work on improving the performance.
e) Extensibility. Phramer has a modular architecture; a lot of hooks are provided.
f) Features. Phramer currently has more features than Pharaoh 1.2.3 (not counting undocumented
features that we are not aware of). Some of these features provide good ways to extend the decoder.
See the list of additional parameters.
4. What do I need to use Phramer?
a) Java 5/1.5 (a recent update; older updates of Java 1.5 tend to have encoding problems on Linux).
b) A translation table. A toolkit to generate translation tables can be found at
http://www.iccs.inf.ed.ac.uk/~pkoehn/training.tgz . It is possible (but not probable) that future
versions of Phramer will include translation table training functionality.
c) A language model (ARPA format). We recommend using the SRILM toolkit.
6. How do I run Phramer?
In the "scripts" folder you'll find phramer.sh / phramer.bat . Run it as you would run
Pharaoh.
Due to time constraints, we released the first version of Phramer with minimal documentation. We
recommend reading the Pharaoh manual (http://www.isi.edu/licensed-sw/pharaoh/manual-v1.2.ps)
before using Phramer.
5. I see that you don't offer tools for translation table generation.
Do you include something useful for training?
We include a minimum-error-rate training (MERT) toolkit. It is compatible with both Phramer and Pharaoh.
The MERT toolkit also requires Carmel (http://www.isi.edu/licensed-sw/carmel) when used in conjunction
with Pharaoh. Using the MERT toolkit with the Phramer decoder doesn't require additional resources
(since version 1.0.1). See ./history.txt and ./data/mert.properties.txt for details.
The main class of the MERT training toolkit is org.phramer.v1.decoder.main.MERTMain. It is also
called by the ./scripts/mert.sh script. ./data/mert.properties.txt is a demo configuration file that
documents how to adapt it to your needs.
The program has only one parameter: the configuration file. It writes debug information to StdErr and
useful information to StdOut.
6. How do I make Phramer run as fast as promised?
Phramer 1.0.0 and 1.0.1 include specific optimizations related to the language model.
The optimizations impose limitations:
- one 3-gram language model may be a numeric/vocabulary language model (vocabulary: prefix)
- if only one language model is used during decoding, and that LM is a
numeric/vocabulary one, additional optimizations can be enabled (fast: prefix)
- if one numeric/vocabulary language model is used, it can be loaded in binary format
(binary: prefix), which is a serialization of the data structures. This improves language
model loading time by more than an order of magnitude. It does not decrease decoding time.
A conversion tool is included.
Not using the described optimizations will make Phramer run at
the same speed as Pharaoh or up to two times slower.
Starting with v1.0.1, there are translation-table-specific optimizations that reduce the loading
time of the data files.
- sorted: option - sorted translation tables (or translation tables in which all entries
that have the same foreign phrase f are grouped together in the translation table file) are loaded
in an optimal fashion, with a 5-50% decrease in loading time.
- nio: option - a binary translation table, pre-pruned and sorted, that contains the minimum amount of information
(only the f and e phrases and the total probability, according to the configuration options)
in single precision (4 bytes/number), serialized for space and deserialized at runtime.
The application: low memory consumption, especially for interactive applications that
cannot benefit from translation table pruning (based on the input sentences).
Side effect: very short loading time compared with text-based translation tables, and a
slight decrease in performance (estimated: 15-30% in regular applications).
Requires the org.phramer.tools.ConvertLM2Binary tool to generate the binary translation
tables. The memory requirements for the translation table drop by about 40-50%.
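The effect of storing scores in single precision can be illustrated with a minimal Java sketch. It is not the actual layout of the nio: files, which this page does not describe; the class FloatScoreStore and its methods are hypothetical. It only shows that packing each score into 4 bytes in a java.nio.ByteBuffer takes half the space of 8-byte doubles, at the cost of a small rounding error.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only: NOT the real Phramer "nio:" file layout.
final class FloatScoreStore {
    private static final int BYTES_PER_SCORE = 4; // single precision
    private final ByteBuffer buffer;

    // Pack each score into 4 bytes.
    FloatScoreStore(double[] scores) {
        buffer = ByteBuffer.allocate(scores.length * BYTES_PER_SCORE);
        for (double s : scores) {
            buffer.putFloat((float) s);
        }
    }

    // Read the score at the given index (absolute read, position is unchanged).
    double get(int index) {
        return buffer.getFloat(index * BYTES_PER_SCORE);
    }

    int sizeInBytes() {
        return buffer.capacity();
    }

    public static void main(String[] args) {
        double[] scores = {-0.25, -1.5, -3.14159265358979};
        FloatScoreStore store = new FloatScoreStore(scores);
        System.out.println("bytes as floats:  " + store.sizeInBytes());
        System.out.println("bytes as doubles: " + scores.length * 8);
        System.out.println("rounding error:   " + Math.abs(store.get(2) - scores[2]));
    }
}
```

Values that are exactly representable as floats (such as -0.25) round-trip unchanged; others lose the low-order digits, which is usually negligible next to the differences between competing translation scores.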
The ./scripts/lm2bin.sh script converts an ARPA language model into a binary language model.
The first parameter: input ARPA language model. The second parameter: output binary language
model. The ./scripts/tt2bin.sh script converts a text-based translation table into a binary
translation table. The first parameter: the file mask for the output binary translation table
(which is stored in 4 different files). The next parameters: the exact parameters that are
intended to be used to call the Phramer decoder (relevant parameters: weights, the path to the
translation table and the pruning parameters for the translation table).
Also, running the server version of the JVM improves performance (java -server <class-name>).
7. Can you tell me more about the speed of Phramer?
Speed-up over Pharaoh for two sample configurations:
- A time-consuming configuration designed for high accuracy (-dl 6 -b 0.000001 -ttable-limit 30 -s 200): 1.5-2x
- Many translation table alternatives (-dl 4 -b 0.1 -ttable-limit 100): 10x or even more. We
believe that there is an issue with Pharaoh related to big values of the -ttable-limit parameter.
8. I heard that Phramer includes support for other data storage layers. What can you tell me
about it?
The main package includes support for in-memory LM and TT, and also for remote LM and TT.
We implemented support for storing LMs and TTs in SQLite databases (http://www.sqlite.org/).
We will release the add-on very soon (we didn't include it in the main package because it
requires additional licensing, also BSD licensing).
Note: we don't recommend using it for LMs.
9. Are there any compatibility issues with Phramer?
With the default configuration, Phramer should run (almost) the same as Pharaoh, with one exception:
if you use a distortion limit greater than or equal to 1 (-dl n), add the -x-max-phrase-length (n+1) parameter.
Pharaoh implicitly limits the length of the phrases used (foreign side) to this value.
The other differences in the output should be caused by differences in the implementations
of the algorithms. Phramer strictly enforces the threshold parameters (-s and -b). Thus, if the goal is to
obtain the same results as Pharaoh, slightly changing these parameters at the expense of speed is recommended.
These adjustments have a minor performance impact (probably under 10%), so even with adjusted parameters,
a Phramer tuned for speed (see question 6) will be considerably faster than Pharaoh (see question 7).
10. How does Phramer handle text files encoding?
All the tools (we hope) use the file.encoding property passed to the JVM to decide how to process
files. Thus we recommend using java -Dfile.encoding=UTF-8, or whatever encoding is best for you.
The Phramer decoder that is run from the command line (org.phramer.v1.decoder.main.PhramerMain) also processes
StdIn and StdOut based on the file.encoding property, but it processes the input file, the output file,
the language models and the translation table based on the -renc, -wenc, -lmenc and -ttenc parameters
respectively, each with a default value of ISO-8859-1.
Thus, you can use Phramer on files with different encodings (e.g. output file and LMs: ISO-8859-1,
input file: UTF-16 and translation table: UTF-8).
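Why the per-file encoding matters can be shown with a small Java sketch. It is not Phramer code: the class EncodingDemo and its decode method are hypothetical, and only the ISO-8859-1 default is taken from the text above. The same bytes are decoded once with the encoding they were written in and once with the default.

```java
import java.nio.charset.Charset;

// Hypothetical helper: mirrors the documented default (ISO-8859-1) but is NOT Phramer code.
final class EncodingDemo {
    static final String DEFAULT_ENCODING = "ISO-8859-1";

    // Decode raw bytes with the requested encoding, or with the default when none is given.
    static String decode(byte[] raw, String encodingOrNull) {
        String name = (encodingOrNull == null) ? DEFAULT_ENCODING : encodingOrNull;
        return new String(raw, Charset.forName(name));
    }

    public static void main(String[] args) {
        // "cafe" with an acute accent on the last letter, stored as UTF-8 (5 bytes).
        byte[] utf8Bytes = "caf\u00e9".getBytes(Charset.forName("UTF-8"));
        System.out.println(decode(utf8Bytes, "UTF-8").length()); // 4 characters: decoded correctly
        System.out.println(decode(utf8Bytes, null).length());    // 5 characters: each byte became one character
    }
}
```

A UTF-8 translation table read with the ISO-8859-1 default would therefore contain phrases that never match a correctly decoded input, which is why the encoding parameter of each file should match how that file was written.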
11. What are the additional parameters of Phramer?
(parameter and arity)
-x-oov-probability / -x-oov (1) , default: -10 -- OOV score
-x-lm-lbound / -x-lb (1) , default: -10 -- minimum value out of language model
-x-max-phrase-length (1) -- Maximum length of a phrase.
It is designed to accelerate decoding and reduce memory consumption.
Default value (0) = infinite
-x-context-length (1) -- How many words are needed to compute the probability of adding a new word into the sentence.
Default: If an n-gram model is used, lmContextLength = n-1.
-x-chunk (0) -- Translate chunks, not sentences. Don't add the sentence-start and sentence-end markers in the language model
-x-score / -sc (2) -- score directory, n-best size
-x-score-simple / -simple-sc (0) -- write simple score/rescore files (space-separated)
-x-nbest / -nb (2) -- nbest directory+path, n-best size
-x-nbest-includeprob (0) -- to include probability in the n-best list
-x-prefix-tt (0) -- if is not in the translation table, don't look for
-x-future-cost-class (1) -- the future cost calculator. Default: the Pharaoh future cost calculator.
Recommended: org.phramer.v1.decoder.cost.PharaohDFutureCostCalculator which also considers the distortion
-x-unk-words-transliterator (1) -- the class that does transliteration for unknown words
Must implement info.olteanu.interfaces.StringFilter
Active only if a helper wasn't already passed to the config file
Only in the command line
-profile (0) -- do execution profiling
-renc (1) -- input file encoding. Default: ISO-8859-1
-wenc (1) -- output file encoding. Default: ISO-8859-1
-ttenc (1) -- translation table file encoding. Default: ISO-8859-1
-lmenc (1) -- lm file encoding. Default: ISO-8859-1
-read (1) -- reads the input from the given file instead of reading it from StdIn
-write (1) -- writes the output to the given file instead of writing it to StdOut
-server (1) -- starts a Phramer server at the specified port.
It can be accessed using org.phramer.remote.MachineTranslatorRemoteClient.main()
-cache (1) -- if there's a server, you can use a cache that prevents re-translation.
This parameter specifies the cache size
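The idea behind a size-bounded translation cache, as configured by -cache, can be sketched in Java as follows. This is an illustration, not the cache shipped with Phramer: the class TranslationCache is hypothetical, and least-recently-used eviction is an assumption, since this page does not say which entries are dropped when the cache is full.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: ASSUMES least-recently-used eviction, which the source does not specify.
final class TranslationCache extends LinkedHashMap<String, String> {
    private static final long serialVersionUID = 1L;
    private final int capacity;

    TranslationCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true: entries are ordered from least to most recently used
        this.capacity = capacity;
    }

    // Called by LinkedHashMap after each insertion; returning true drops the eldest entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
        return size() > capacity;
    }

    public static void main(String[] args) {
        TranslationCache cache = new TranslationCache(2);
        cache.put("source sentence 1", "translation 1");
        cache.put("source sentence 2", "translation 2");
        cache.get("source sentence 1");                  // sentence 1 is now the most recently used
        cache.put("source sentence 3", "translation 3"); // evicts sentence 2
        System.out.println(cache.keySet());
    }
}
```

A server would look a sentence up in such a cache before decoding it and store the result afterwards, so that repeated requests for the same sentence skip decoding.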
12. Why are there no makefiles or anything similar?
That's because we use IDEs to develop Phramer, and they deal with compilation. We don't consider writing and
maintaining build files an essential task. We also provide the compiled version of Phramer (lib/phramer.jar).
A simple build system:
find . -name "*.java" > listoffiles.txt
javac @listoffiles.txt
13. What tools can we find in the package?
The set of tools includes:
- translation table filtering (based on input file, phrase length or probability value) and flipping (to be reused
for E->F translation)
- character encoding conversion
- distributed decoding (splits a decoding task, including auxiliary file generation -- lattices, etc.) on multiple machines
(it was used for accelerated MERT training) - compatible with both Phramer and Pharaoh
- sorting (we encountered problems with the GNU sort)
- profiling tools (at the level of nanoseconds), that are inserted in the exact points of measurement -- much faster than
the java profiler
- conversion tools for translation tables and language models
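The kind of in-code measurement mentioned in the list above can be sketched in a few lines of Java. This is not the profiling code included in the package: the class NanoProbe is hypothetical and only shows the general technique of wrapping the exact region of interest with System.nanoTime() calls and accumulating the elapsed time.

```java
// Hypothetical probe: NOT the profiling classes shipped with Phramer.
final class NanoProbe {
    private long totalNanos;
    private long calls;
    private long startedAt;

    // Place immediately before the code to be measured.
    void start() {
        startedAt = System.nanoTime();
    }

    // Place immediately after the code to be measured.
    void stop() {
        totalNanos += System.nanoTime() - startedAt;
        calls++;
    }

    long calls() {
        return calls;
    }

    long totalNanos() {
        return totalNanos;
    }

    public static void main(String[] args) {
        NanoProbe probe = new NanoProbe();
        long sum = 0;
        for (int i = 0; i < 1000; i++) {
            probe.start();
            sum += (long) i * i; // the measured region
            probe.stop();
        }
        System.out.println("calls=" + probe.calls() + " totalNanos=" + probe.totalNanos() + " sum=" + sum);
    }
}
```

Because only the chosen regions are timed, the overhead is limited to two clock reads per call, which is the usual reason such probes are cheaper than a general-purpose profiler.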
14. How do I use remote data structures?
Start a server (org.phramer.v1.server.StartRemoteTranslationTable , org.phramer.v1.server.StartRemoteBackoffLM) and in
the configuration file of the decoder, instead of providing the path to the LM/TT file, write:
remote:machine:port
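How a loader might tell such a value apart from an ordinary file path can be sketched in Java. This is an assumption about the general approach, not Phramer's own loading code; the class RemoteSpec and its parse method are hypothetical.

```java
// Hypothetical parser for the "remote:machine:port" form; NOT Phramer's own loader code.
final class RemoteSpec {
    private static final String PREFIX = "remote:";
    final String host;
    final int port;

    private RemoteSpec(String host, int port) {
        this.host = host;
        this.port = port;
    }

    // Returns null when the value does not start with "remote:" (treated as a file path).
    static RemoteSpec parse(String value) {
        if (!value.startsWith(PREFIX)) {
            return null;
        }
        String rest = value.substring(PREFIX.length());
        int colon = rest.lastIndexOf(':');
        if (colon <= 0 || colon == rest.length() - 1) {
            throw new IllegalArgumentException("expected remote:machine:port, got: " + value);
        }
        // Integer.parseInt throws NumberFormatException (an IllegalArgumentException) on a bad port.
        return new RemoteSpec(rest.substring(0, colon), Integer.parseInt(rest.substring(colon + 1)));
    }

    public static void main(String[] args) {
        RemoteSpec spec = parse("remote:node7:4000");
        System.out.println(spec.host + " " + spec.port);
        System.out.println(parse("/data/lm.arpa") == null);
    }
}
```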
15. How do I extend Phramer?
- You can write classes and use the extended configuration to make use of them (e.g. -x-unk-words-transliterator), also
by using org.phramer.v1.decoder.main.PhramerMain.main().
- You can write classes and customize the loading process (implement org.phramer.v1.decoder.loader.PhramerHelperIf
or use org.phramer.v1.decoder.loader.PhramerHelperCustom to make use of your classes). You will need to write a small
program that calls PhramerMain.phramerMain() (see org.phramer.v1.extensions.pos.POSPhramerMain as an example of an
extension).
- You can analyze the code and change code inside the org.phramer.v1.decoder package and its subpackages (make your own Phramer
branch).
We believe that the architecture allows a lot of customizations to be made outside the org.phramer.v1.decoder.* packages.
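Parameters that take a class name, such as -x-unk-words-transliterator, are typically implemented by loading the named class through reflection. The Java sketch below shows that general technique only. It is not Phramer's loading code, and WordFilter is a hypothetical stand-in for info.olteanu.interfaces.StringFilter, whose methods are not listed on this page.

```java
import java.util.Locale;

// Hypothetical stand-in for the real filter interface (info.olteanu.interfaces.StringFilter),
// whose actual methods are not shown on this page.
interface WordFilter {
    String filter(String word);
}

// Example plug-in: a trivial "transliterator" that upper-cases unknown words.
class UpperCaseFilter implements WordFilter {
    public String filter(String word) {
        return word.toUpperCase(Locale.ROOT);
    }
}

// Hypothetical loader: NOT Phramer code, only the general reflection technique.
final class PluginLoader {
    // Instantiate a plug-in from its class name via its no-argument constructor.
    static WordFilter load(String className) throws ReflectiveOperationException {
        Class<?> cls = Class.forName(className);
        return cls.asSubclass(WordFilter.class).getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        WordFilter filter = load(UpperCaseFilter.class.getName());
        System.out.println(filter.filter("phramer"));
    }
}
```

With this pattern the decoder needs no compile-time knowledge of the extension class; it only needs the class to be on the classpath and to implement the expected interface.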