
Thursday, August 30, 2018

Understanding Multidimensional Analysis

Multidimensional Analysis is a quantitative framework for studying register variation, developed by Douglas Biber. These slides present a very basic introduction to this methodology and its assumptions.


Friday, September 30, 2011

Urdu Text Archive

The Urdu Text Archive is a collection of texts from different online sources of Urdu Unicode data. The aim was to collect raw text as a first step towards creating an Urdu corpus. As no viable grammatical tagger for Urdu is available yet, this data can only be used to create word lists or for computational linguistic purposes. The data is not sampled very well, because we had little choice in data collection. The biggest source is news reports from various online Urdu newspapers. The data was retrieved and stripped of HTML programmatically, using regular expressions or HTML parsers in C#. The genres are typical of the news domain: articles, news reports, columns, letters to the editor, editorials, etc.

Another major genre is books, collected from Ejaz Akhtar's copyleft archive of open-source Urdu books. The books cover various fields of life, e.g. poetry, literature, religion, Quranic translations, Ahadith, and online editions of some magazines. The data from the books section is proofread and of good quality; it can be used right away for any further development. The data from Urdu newspapers and other online resources, however, is not as good. It is raw text with spelling errors that needs review. There are spacing problems (sometimes additional spaces are introduced, other times spaces are omitted), and the spellings are non-standard in some cases, so a full sweep of such errors is required before building any further resource on this data. In its present condition, the limited range of genres is also a problem; the scope of data collection needs to be expanded from newspapers to other fields of life as well.
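The exact cleaning scripts were ad hoc, but a minimal sketch of the kind of regex-based HTML stripping described above could look like the following. The file paths and patterns here are illustrative assumptions, not the actual scripts used for the archive.

using System;
using System.IO;
using System.Text.RegularExpressions;

class HtmlCleaner
{
    static void Main()
    {
        // Illustrative paths; the real archive scripts used their own locations and patterns.
        string html = File.ReadAllText(@"G:\UrduArchive\raw\page1.html");

        // Remove script and style blocks entirely, then strip the remaining tags.
        string text = Regex.Replace(html, @"<(script|style)[^>]*>.*?</\1>", " ",
                                    RegexOptions.Singleline | RegexOptions.IgnoreCase);
        text = Regex.Replace(text, @"<[^>]+>", " ");

        // Decode a few common entities and collapse runs of whitespace.
        text = text.Replace("&nbsp;", " ").Replace("&amp;", "&")
                   .Replace("&lt;", "<").Replace("&gt;", ">");
        text = Regex.Replace(text, @"\s+", " ").Trim();

        File.WriteAllText(@"G:\UrduArchive\clean\page1.txt", text);
        Console.WriteLine("Kept " + text.Length + " characters of plain text.");
    }
}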

Availability

The data of the Urdu Text Archive can be downloaded from this link on 4share.com. The sources used for this collection are the Jang Online Edition, Nawaye Waqt Online Edition, NewsUrdu.com news reports and articles, Karachiupdates.com, the AalamiAkhbar online newspaper, and the Urdu Books Archive by Ejaz Akhtar. The archive is estimated to contain approximately 18,243,149 words, and it will hopefully keep growing.

Future Plans

  • Expanding the archive e.g. Jasarat Online is a good source.
  • Adding more genres e.g. Urdu Chat Data, Urdu Forums Discussions after anonymization, Urdu Blogs.
  • Development of Urdu resources based on this data, e.g. spell-checking word lists for Urdu (a sketch follows this list).
  • Refining the data and eliminating formatting, language and spelling problems.
  • Finding trends of Urdu writers on Internet.
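For the word-list item above, a minimal sketch of how a frequency list could be built from the cleaned archive files might look like this. The folder path is a placeholder and the whole program is only an illustration of the idea, not an existing tool.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

class WordListBuilder
{
    static void Main()
    {
        var frequencies = new Dictionary<string, int>();

        // Placeholder folder holding the cleaned, plain-text archive files.
        foreach (string file in Directory.GetFiles(@"G:\UrduArchive\clean", "*.txt"))
        {
            string text = File.ReadAllText(file);
            // Split on anything that is not a letter; \p{L} keeps Urdu (Arabic-script) letters.
            foreach (string token in Regex.Split(text, @"[^\p{L}]+"))
            {
                if (token.Length == 0) continue;
                int n;
                frequencies.TryGetValue(token, out n);
                frequencies[token] = n + 1;
            }
        }

        // Write a frequency-sorted word list, one "word<TAB>count" pair per line.
        string[] lines = frequencies.OrderByDescending(p => p.Value)
                                    .Select(p => p.Key + "\t" + p.Value)
                                    .ToArray();
        File.WriteAllLines(@"G:\UrduArchive\wordlist.txt", lines);
        Console.WriteLine("Wrote " + frequencies.Count + " distinct words.");
    }
}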

Thursday, September 22, 2011

Creating an Urdu Translation Glossary with C#

After a long time I have come back to C#. My limited skills only allow me to do some things with text processing; as I trained myself, my little, silly-looking C# scripts deal with file operations, getting word frequencies, getting counts from different text files, and so on. Since I use C# the way Python or another scripting language should be used, writing small programs to do tiny things, I consider myself less of a programmer. But I am still somewhere between a layman and a programmer and can write a few bits of code. This time, after several months away from C#, it was the urge to create an Urdu glossary that can work with OmegaT. OmegaT is an open-source translation memory tool, and I use it for most of my translations. I also need a dictionary constantly. The dictionary I use, Urdu Web Lughat, is written in C# 2 and fits my purpose very well once I add this Lughat File in place of the default file, which has only a few thousand words. This file has 92,661 words and expressions, which usually give me enough insight for the correct translation; several other Urdu lughats were merged with the original few-thousand-word file to create it. Well, it serves the purpose, but it creates problems as well. I have to type each word in its lemma form, i.e. if it is a verb in the third form, e.g. got, I have to write the first form get to get the meaning. There are other problems with the application itself.
[Image: Urdu Web Lughat]
As the picture shows, the capability to search for the desired word on Google, Wiktionary, and Urdu Wikipedia has been added to it. This is done through the default Internet Explorer engine provided with C# 2005, but it creates problems on newer versions of Windows, e.g. Windows 7. Each time I look up a word, it shows two or more dialogue boxes with this error.
[Image: Urdu Web Lughat error dialog]
I think I have the source code and can fix the problem by simply disabling this feature. But my other problem remains: typing the word each time takes too much time. So I was looking for a solution. The solution I figured out was simple: put the lughat file into OmegaT somehow. That is possible through the dictionary or glossary setup of the program. I find the dictionary process of OmegaT quite complicated (it requires the StarDict format), so I thought it more feasible to put in a glossary file instead (which is in the format word-tab-meaning). This was easy for me. I wrote a regular expression to match the lughat file's XML and extract each word-meaning node (don't tell me it would have been easier with an XML parser; I do not know one, so regex was best for me). So here is the code I used to convert the XML nodes into a tab-separated text file, with each word-meaning pair on a new line.
// Requires: using System; using System.IO;
// using System.Text.RegularExpressions;
public static void test()
{
    // Matches one <I><w>word</w><m>meaning</m></I> record in the lughat XML.
    string match = "\\<I\\>\\s*\\<w\\>(?<Word>[^><]+)\\</w\\>\\s*\\<m\\>(?<Meaning>[^><]+)\\</m\\>\\s*\\</I\\>";
    string text = File.ReadAllText(@"G:\Software\Dictionary\dictionary.xml");
    Console.WriteLine(text.Length);

    MatchCollection m = Regex.Matches(text, match);
    StreamWriter sw = new StreamWriter(@"G:\Software\Dictionary\Glossary.txt", true);
    string word = "";
    string meaning = "";
    Console.WriteLine(m.Count);

    for (int i = 0; i < m.Count; i++)
    {
        // Flatten any line breaks inside a word or meaning so each pair stays on one line.
        word = Regex.Replace(m[i].Groups["Word"].Value, "[\\r\\n]+", " ");
        meaning = Regex.Replace(m[i].Groups["Meaning"].Value, "[\\r\\n]+", " ");
        sw.WriteLine(word + "\t" + meaning);
        Console.WriteLine(word);
    }
    sw.Close();
}
I know it could be improved a lot, but it did the job and did it quickly. I put the glossary file in the OmegaT project folder as required (/project/glossary/Glossary.txt), and that was done.
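For illustration, a couple of entries in the Glossary.txt format look like this, with a tab between the English word and its Urdu meaning (these two lines are made up, not taken from the real lughat file):

work	کام
books	کتابیں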
[Image: OmegaT with Urdu glossary (1)]
I was able to see the word meanings for each sentence in the glossary box at the lower right. It was quite a help and reduced my time. But then came the other problem: I still had to go to the dictionary and type the lemma forms of verbs and plural nouns, so there was no meaning available for worked, working, works, only for the first form, i.e. work. This was quite frustrating, because the actual problem was still there. So I decided to add the other forms of words that have more than one form: new lines in the glossary for each form of a verb, and lines for plural forms of nouns as well. The idea was simple enough, but to accomplish it I had to get hold of a lemma list. Fortunately, being a corpus linguistics student, I knew of one on the Internet. E_Lemma.txt was created for WordSmith Tools, but I used it for my purpose. The code I used to extract the lemmas of each word and add them to a new glossary file is below.
// Requires: using System; using System.IO;
// using System.Text.RegularExpressions;
public static void Main(string[] args)
{
    // Glossary.txt: one "word<TAB>meaning" pair per line (created by the code above).
    string[] lines = File.ReadAllLines(@"G:\Software\Dictionary\Glossary.txt");
    // e_lemma.txt: lines of the form "lemma -> form1,form2,..."
    string[] lemmas = File.ReadAllLines(@"F:\Corpus Related\e_lemma.txt");
    string word = "";
    string meaning = "";
    string lemma = "";
    StreamWriter sw = new StreamWriter(@"G:\Software\Dictionary\Glossary2.txt", true);
    int count = 1;
    foreach (string line in lines)
    {
        Console.WriteLine(count + " of " + lines.Length);
        word = Regex.Split(line, "\t")[0].Trim();
        meaning = Regex.Split(line, "\t")[1].Trim();
        // Get all forms of the word (if it is in the lemma list) as a space-separated string.
        lemma = giveLemma(word, lemmas);
        if (lemma != "")
        {
            // Write one glossary line per form, all with the same meaning.
            foreach (string lemma1 in lemma.Split(' '))
            {
                sw.WriteLine(lemma1 + "\t" + meaning);
                Console.WriteLine(lemma1 + "\t" + meaning);
            }
        }
        else
        {
            // No lemma entry: keep the original word-meaning pair as it is.
            sw.WriteLine(word + "\t" + meaning);
            Console.WriteLine(word + "\t" + meaning);
        }
        count++;
    }
    sw.Close();
}

public static string giveLemma(string word, string[] lemmas)
{
    string toReturn = "";
    string lemma1 = "";
    string lemma2 = "";
    foreach (string lemma in lemmas)
    {
        // Each e_lemma.txt line looks like "go -> goes,going,gone,went".
        lemma1 = Regex.Split(lemma, "->")[0].Trim();
        lemma2 = Regex.Split(lemma, "->")[1].Trim();
        if (word == lemma1)
        {
            // Collect the head word plus its forms, separated by single spaces.
            toReturn += lemma1 + " " + Regex.Replace(lemma2, ",", " ") + " ";
        }
    }
    return toReturn.Trim();
}
As can be seen, the task is pretty simple.
  • Get each line from previously made glossary file.
  • Split the word and meaning.
  • Pass the word to another function, along with the lemma list (loaded from the e_lemma.txt file), which returns all possible forms of the verb or noun.
  • Write a new "word-tab-meaning" line for each lemma form of the verb or noun.
  • And if there is no lemma, simply write the original word meaning pair.
The task was simple but the code was quite inefficient, so it took a long time: almost 1.5 hours. But it was done and worked like a charm.
[Image: OmegaT with Urdu glossary (2)]
So a glossary for Urdu translators working with OmegaT is now available, and of course it can be downloaded from here.
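Looking back, the slow part is the linear scan through the whole lemma list for every single glossary entry. Loading e_lemma.txt once into a Dictionary would make each lookup effectively constant-time. The sketch below shows that idea; it reuses the file paths from above, but the rest is just how the program might be restructured, not the code I actually ran.

using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

class FastGlossaryBuilder
{
    static void Main()
    {
        // Load e_lemma.txt once: head word -> its inflected forms.
        var lemmaForms = new Dictionary<string, string[]>();
        foreach (string entry in File.ReadAllLines(@"F:\Corpus Related\e_lemma.txt"))
        {
            string[] parts = Regex.Split(entry, "->");
            if (parts.Length < 2) continue;                 // skip malformed lines
            lemmaForms[parts[0].Trim()] = parts[1].Trim().Split(',');
        }

        using (var sw = new StreamWriter(@"G:\Software\Dictionary\Glossary2.txt"))
        {
            foreach (string line in File.ReadAllLines(@"G:\Software\Dictionary\Glossary.txt"))
            {
                string[] cols = line.Split('\t');
                if (cols.Length < 2) continue;
                string word = cols[0].Trim();
                string meaning = cols[1].Trim();

                // Always write the original pair, then one line per inflected form, if any.
                sw.WriteLine(word + "\t" + meaning);
                string[] forms;
                if (lemmaForms.TryGetValue(word, out forms))
                {
                    foreach (string form in forms)
                        sw.WriteLine(form.Trim() + "\t" + meaning);
                }
            }
        }
    }
}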

Sunday, August 7, 2011

How to Transcribe a Spoken Text for ICE

It was our second semester of MSc Applied Linguistics when we were assigned to collect video/audio recordings from the Internet, or record our own, and then transcribe them. This project was a hell of a lot of work, and some of my class fellows were quite angry at the difficulty level. The details were simple: you were assigned a topic, e.g. lectures, speeches, TV news, radio news; you recorded your specific genre or took it from Youtube.com; you listened to it and transcribed it; and you tagged it with appropriate tags.
Well, the process is not difficult; it is the time it takes that annoys people. Transcription is one of the most time-consuming jobs in the world. Normally, 1 minute of spoken recording can take up to 6 or 7 minutes to write down, so for 1 hour of recording you will have to spend up to 7 hours listening and typing. And it is not just that you press play and start typing. You will have to stop the recording again and again: sometimes because your typing speed does not match the speakers' speaking speed, other times because you cannot make out what the speaker is saying and have to replay the audio and concentrate on it, and still other times because you have to stop and think how to write down the uhms, errs, overlaps, etc.

The situation gets worse with talk shows, telephonic or live conversations, lectures and question-answer sessions. Remember: the more speakers there are, the more the distractions and the more time the transcription takes. We considered the people who got speeches, TV news, radio news, etc. the lucky ones, because these genres have a single speaker and are usually scripted, i.e. the spoken material is written in front of the speaker, who only has to read it out. In spontaneous talk such as talk shows this is not the case. There are more speakers, there are overlaps (two people starting to speak at the same time), and there are the errs, hmms and other apparently unnecessary sounds that speakers utter. We cannot ignore these sounds. We can understand two people speaking at the same time in audio, but in a transcription we have to devise a method to show that particular sentences or words were uttered by both speakers at the same time, i.e. overlapping. Here we have to use tags to show this phenomenon. We can either devise our own tags or use tags devised by someone else. Since, as students, we work towards the completion of the International Corpus of English (ICE), Pakistan component, we have to use its devised method of tagging. The tagging scheme is available here.
Now what should be done here? It is simple: you will have to go through all of that document, because you are the one who is going to listen to and transcribe your spoken recordings, not me. If you do not understand it, it won't work. I can only provide an example by transcribing a few lines of a video from YouTube.
<$A> is by the federal government
<$B> Ok uhmm <}><->I've<=>I've just lit literally about half a minute Mr. Babar. Let me just ask you this question that is come uhm from Asif who is watching from Canada....
I have just covered the first 12 seconds of the above video, and it took me 5 minutes to cover everything, to WRITE DOWN what these people were performing as a routine speaking activity. The video starts with an unclear word; I had to replay it several times, and when I still couldn't catch it exactly I put "is by" as my own guess and put tags around it. You can see the tags in the first line; they show that the words were not intelligible. Of course I consulted the ICE manual (link provided above) for this tag. Before it you can see the <$A> tag, which marks the first speaker. This tag I also got after going through the manual, which says that every speaker's utterance should start on a new line marked with the speaker's identity, i.e. the first speaker is A, the second is B, and so on. Two speakers are involved in the first 12 seconds, and I have shown both separately, with their utterances on new lines marked <$A> and <$B>. You may also notice that the hostess says 'uhmm' after saying 'Ok'; we cannot ignore that while transcribing, because these hesitations and uhms can be helpful in a discourse analysis of the transcribed text, so I had to write down this apparently meaningless utterance. Then there is 'I've I've' with these weird-looking tags <}><->I've<=>I've. They actually mark repetition; again I had to search the ICE manual for the repetition tags, paste them, and add the repeated words according to the example given in the manual.

And this is how it goes on. You listen, you type. When you see overlapping, repetition, hesitations or uhms you mark them; when you do not understand a spoken word you replay it; and in the meantime you consult the manual for every new phenomenon you encounter so you can record it properly. Now you may understand why it is difficult to transcribe a text, and why it is necessary to TAG it. But as you practise, you will become faster and more accurate and will spend less time.
Hopefully this small effort will help. Ask me in the comments of this post if you are still unable to get what I wanted to say, or if you want details of some specific area.