Thursday, September 22, 2011

Creating an Urdu Translation Glossary with C#

After a long time I have come back to C#. My skills are limited to text processing, and since I am self-taught, my silly-looking little C# scripts deal with file operations, word frequencies, counts from different text files and so on. I use C# the way Python or another scripting language is meant to be used, to write small programs for tiny tasks, so I do not really consider myself a programmer. Still, I am somewhere between a layman and a programmer, and I can write a few bits of code. This time, after several months away, what brought me back to C# was the urge to create an Urdu glossary that works with OmegaT. OmegaT is an open source translation memory tool, and I use it for most of my translations.
I also need a dictionary constantly. The one I use is Urdu Web Lughat, which is written in C# 2 and fits my purpose very well once I replace its default file, which has only a few thousand words, with this Lughat file. This file has 92661 words and expressions, which usually give me insight into the correct translation; it was created by merging several other Urdu lughats into the original file of a few thousand words. It serves the purpose, but it creates problems as well. I have to type each word in its lemma form, i.e. if a verb is in its third form, e.g. got, I have to type the first form, get, to get the meaning. The other problems are with the application itself.
[Picture: Urdu Web Lughat]
As the picture shows, the ability to search for the desired word on Google, Wiktionary, and Urdu Wikipedia has been added to it. This is done with the default Internet Explorer engine provided with C# 2005, but it causes problems on newer versions of Windows, e.g. Windows 7. Each time I look up a word, it shows two or more dialogue boxes with this error.
[Picture: Urdu Web Lughat Error]
I think I have the source code and could fix the problem by simply disabling this feature. But my other problem remains: typing the word each time takes too much time. So I was looking for a solution. The one I figured out was simple: put the lughat file into OmegaT somehow. That is possible through the program’s dictionary or glossary setup. I find OmegaT’s dictionary process quite complicated (it requires the StarDict format), so I think it is more feasible to use a glossary file instead (which is in the format word-tab-meaning). This was easy for me. I wrote a regular expression to match the lughat file’s XML and extract each word-meaning node (don’t tell me it would have been easy with an XML parser; I do not know one, so regex was best for me). So here is the code I used to convert the XML nodes into a tab-separated text file with each word-meaning pair on a new line.
using System;
using System.IO;
using System.Text.RegularExpressions;

public static class GlossaryExtractor
{
    public static void test()
    {
        // Matches one <I><w>word</w><m>meaning</m></I> entry of the lughat XML.
        string match = @"<I>\s*<w>(?<Word>[^><]+)</w>\s*<m>(?<Meaning>[^><]+)</m>\s*</I>";
        string text = File.ReadAllText(@"G:\Software\Dictionary\dictionary.xml");
        Console.WriteLine(text.Length);
        MatchCollection m = Regex.Matches(text, match);
        Console.WriteLine(m.Count);
        // Overwrite instead of appending, so running the program twice does not duplicate entries.
        using (StreamWriter sw = new StreamWriter(@"G:\Software\Dictionary\Glossary.txt", false))
        {
            for (int i = 0; i < m.Count; i++)
            {
                // Replace line breaks with a space so each pair stays on one line.
                string word = Regex.Replace(m[i].Groups["Word"].Value, @"[\r\n]+", " ");
                string meaning = Regex.Replace(m[i].Groups["Meaning"].Value, @"[\r\n]+", " ");
                sw.WriteLine(word + "\t" + meaning);
                Console.WriteLine(word);
            }
        }
    }
}
I know it could be improved a lot, but it did the work and did it pretty quickly. I put the glossary file in the OmegaT project folder as required (/project/glossary/Glossary.txt), and that was it.
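For readers who do want the XML parser route, here is a minimal sketch with LINQ to XML (System.Xml.Linq). It is not the code I ran. It assumes the lughat file is well-formed XML in which every entry is an `<I>` element with a `<w>` and an `<m>` child, which is the shape the regular expression looks for. The class name, the method name, the `<dictionary>` root element and the sample entries are all made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class LughatXmlSketch
{
    // Returns one "word<TAB>meaning" line per <I> entry that has both <w> and <m>.
    public static List<string> ToGlossaryLines(string xml)
    {
        var lines = new List<string>();
        XDocument doc = XDocument.Parse(xml);
        foreach (XElement entry in doc.Descendants("I"))
        {
            XElement w = entry.Element("w");
            XElement m = entry.Element("m");
            if (w == null || m == null) continue; // skip incomplete entries

            // Collapse line breaks and tabs so that the only tab on the line
            // is the separator between word and meaning.
            string word = Regex.Replace(w.Value, @"\s+", " ").Trim();
            string meaning = Regex.Replace(m.Value, @"\s+", " ").Trim();
            lines.Add(word + "\t" + meaning);
        }
        return lines;
    }

    public static void Main()
    {
        // Hypothetical sample; the root element name is an assumption.
        string sample =
            "<dictionary>" +
            "<I><w>get</w><m>حاصل کرنا</m></I>" +
            "<I><w>work</w><m>کام</m></I>" +
            "</dictionary>";
        foreach (string line in ToGlossaryLines(sample))
            Console.WriteLine(line);
    }
}
```

For the real file, `XDocument.Load` with the path would take the place of `XDocument.Parse`. Unlike the regex, a parser refuses a file that is not well-formed, so this only works if the lughat XML is clean.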
[Picture: OmegaT with Glossary 1]
I was able to see the meanings of the words in each sentence in the glossary box at the lower right. It was quite a help and saved me time. But then came the other problem: the glossary only had entries for lemma forms, so there was no meaning available for worked, working, works, only for the first form, i.e. work. For inflected verbs and plural nouns I still had to go to the dictionary and type the lemma form. This was quite frustrating because the actual problem was still there. So I decided to add new lines to the glossary for each form of a verb, and for the plural forms of nouns as well. The idea was simple enough, but to accomplish it I had to get hold of a lemma list. Fortunately, being a corpus linguistics student, I knew of one available on the internet. E_Lemma.txt was created for WordSmith Tools, but I used it for my purpose. The code I used to look up the forms of each word and add them to a new glossary file is below.
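using System;
using System.IO;
using System.Text.RegularExpressions;

public static class GlossaryExpander
{
    public static void Main(string[] args)
    {
        string[] lines = File.ReadAllLines(@"G:\Software\Dictionary\Glossary.txt");
        string[] lemmas = File.ReadAllLines(@"F:\Corpus Related\e_lemma.txt");
        int count = 1;
        // Overwrite instead of appending, so running the program twice does not duplicate entries.
        using (StreamWriter sw = new StreamWriter(@"G:\Software\Dictionary\Glossary2.txt", false))
        {
            foreach (string line in lines)
            {
                Console.WriteLine(count + " of " + lines.Length);
                count++;
                string[] parts = line.Split('\t');
                if (parts.Length < 2) continue; // skip lines that are not word-tab-meaning
                string word = parts[0].Trim();
                string meaning = parts[1].Trim();
                string lemma = giveLemma(word, lemmas);
                if (lemma != "")
                {
                    // One glossary line per form; empty pieces from double spaces are dropped.
                    foreach (string lemma1 in lemma.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
                    {
                        sw.WriteLine(lemma1 + "\t" + meaning);
                        Console.WriteLine(lemma1 + "\t" + meaning);
                    }
                }
                else
                {
                    // Not in the lemma list: keep the original pair.
                    sw.WriteLine(word + "\t" + meaning);
                    Console.WriteLine(word + "\t" + meaning);
                }
            }
        }
    }

    // Returns the headword and its forms separated by spaces, or "" if the word is not a headword.
    public static string giveLemma(string word, string[] lemmas)
    {
        string toReturn = "";
        foreach (string lemma in lemmas)
        {
            string[] parts = Regex.Split(lemma, "->");
            if (parts.Length < 2) continue; // skip lines without "->"
            string lemma1 = parts[0].Trim();
            string lemma2 = parts[1].Trim();
            if (word == lemma1)
            {
                // The trailing space keeps two matching entries from running together.
                toReturn += lemma1 + " " + lemma2.Replace(",", " ") + " ";
            }
        }
        return toReturn.Trim();
    }
}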
As can be seen, the task is pretty simple.
  • Get each line from the previously made glossary file.
  • Split it into word and meaning.
  • Pass the word, along with the lemma list (taken from the e_lemma.txt file), to another function, which returns all the listed forms of the verb or noun.
  • Write a new “word-tab-meaning” line for each of those forms.
  • If the word is not in the lemma list, simply write the original word-meaning pair.
The task was simple, but the code was quite inefficient, so it took a long time: almost 1.5 hours. But it got done and worked like a charm. See :)
[Picture: OmegaT with Glossary 2]
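The long run time most likely comes from scanning and re-splitting the whole lemma list for every one of the glossary lines. A possible way around it, sketched below and not what I actually ran, is to read the lemma list once into a Dictionary keyed by headword, so each lookup becomes a single hash lookup. The sketch assumes lemma lines shaped like "work -> works,working,worked", which is what the "->" and "," splitting in my code expects. The class and method names and the sample data are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class LemmaIndexSketch
{
    // Builds headword -> (headword + forms) from lines like "work -> works,working,worked".
    public static Dictionary<string, List<string>> BuildIndex(IEnumerable<string> lemmaLines)
    {
        var index = new Dictionary<string, List<string>>();
        foreach (string line in lemmaLines)
        {
            string[] parts = line.Split(new[] { "->" }, StringSplitOptions.None);
            if (parts.Length < 2) continue; // skip lines without "->"

            string headword = parts[0].Trim();
            List<string> forms;
            if (!index.TryGetValue(headword, out forms))
            {
                // The headword itself is always the first form.
                forms = new List<string> { headword };
                index[headword] = forms;
            }
            foreach (string form in parts[1].Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries))
            {
                string trimmed = form.Trim();
                if (trimmed.Length > 0) forms.Add(trimmed);
            }
        }
        return index;
    }

    // Expands one "word<TAB>meaning" line into one line per known form.
    // A word that is not a headword comes back as its original pair.
    public static List<string> Expand(string glossaryLine, Dictionary<string, List<string>> index)
    {
        var result = new List<string>();
        string[] parts = glossaryLine.Split('\t');
        if (parts.Length < 2) return result; // not a word-tab-meaning line

        string word = parts[0].Trim();
        string meaning = parts[1].Trim();
        List<string> forms;
        if (!index.TryGetValue(word, out forms))
            forms = new List<string> { word };
        foreach (string form in forms)
            result.Add(form + "\t" + meaning);
        return result;
    }

    public static void Main()
    {
        // Hypothetical sample data standing in for e_lemma.txt and Glossary.txt.
        var index = BuildIndex(new[] { "work -> works,working,worked" });
        foreach (string line in Expand("work\tکام", index))
            Console.WriteLine(line);
    }
}
```

With the real files, `BuildIndex` would be called once on `File.ReadAllLines` of the lemma list, and `Expand` once per glossary line. I have not timed this against the original, so take the speed-up as an expectation rather than a measurement.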
So a glossary for Urdu translators working with OmegaT is now available. And of course it can be downloaded from here.