Regular expressions, or regex, are a powerful tool for text processing. People like me who deal with text processing on a daily basis know the ease and power they provide. Regex is actually a full-fledged mini language with its own rules and a very systematic, organized structure. The syntax in common use today is largely borrowed from the early days of Perl, which is why it is mostly called Perl Compatible Regular Expressions. Hardly any high-level language can afford to miss this much-demanded feature today, and even C++, the parent of many newer languages, will have a regex library in its new C++0x standard.
I came to know about regex three years ago, when I was working with my teachers on their corpus research. Initially I was unable to grasp what regex meant and the power behind it. But after some time I got hold of books and articles on the topic and started learning. The book most helpful to me was Friedl, Mastering Regular Expressions, 3rd ed. (O'Reilly, 2006). I completed only two chapters of it, but it turned me from a lame lamb into a speedy panther in text processing. Being a linguistics student and a corpus linguist, I am always looking for ways to extract text patterns automatically in the least possible time, and regex gives me this facility. Along with regex I use C# 2005, which gives me a powerful capability to do everything I want with texts.
Regex is good, but it is like a knife in your hand that can also cut your own hand. You should be well aware of the pros and cons of using it. The very first thing you should do as a corpus linguist is look for the regularities in the text. These regularities, or patterns, will help you find the right regex for the purpose. The best strategy is to analyse the data manually first, e.g. by inspecting concordance lines in search of the required construction. Once you have inspected the data and found the ways in which the construction occurs, you can write a good regex. But remember to cross-check, double-check and recheck your regular expression: does it catch as much as possible? Is the loss minimal? And finally, is it affordable? By affordable I mean whether it is worth the effort to cover every variant of the construction, since each addition increases your work. Regex gives power and flexibility, but it should be used carefully: construct it with great care and verify it against manual analysis. And most importantly, use regex to pull out the concordance lines that you will then inspect manually; that way you reduce your work while maintaining quality.
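To make the last point concrete, here is a minimal sketch, in C# 2 style, of how a regex can pull out concordance lines for manual inspection. The class name Concordancer, the method Concordance, the sample sentence and the pattern (was/were followed by a word ending in -ing) are all made up for illustration; only Regex, Match and the other System types are real library classes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class Concordancer
{
    // Returns one keyword-in-context line per match, with up to
    // `width` characters of context on each side of the match.
    public static List<string> Concordance(string text, string pattern, int width)
    {
        List<string> lines = new List<string>();
        Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
        foreach (Match m in regex.Matches(text))
        {
            int matchEnd = m.Index + m.Length;
            int start = Math.Max(0, m.Index - width);
            int end = Math.Min(text.Length, matchEnd + width);
            string left = text.Substring(start, m.Index - start);
            string right = text.Substring(matchEnd, end - matchEnd);
            // Mark the match with brackets so it stands out in the line.
            lines.Add(left + "[" + m.Value + "]" + right);
        }
        return lines;
    }

    public static void Main()
    {
        // Hypothetical sample text and pattern, for illustration only.
        string text = "She was reading a book. They were going home. He is tired.";
        string pattern = @"\b(was|were)\s+\w+ing\b";
        foreach (string line in Concordance(text, pattern, 10))
        {
            Console.WriteLine(line);
        }
    }
}
```

The regex only proposes candidates here; the bracketed lines still have to be read by a human. A word like "thing" after "was" would also match this pattern, which is exactly the kind of loss and noise the manual check is meant to catch.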
News, Views, Analyses and Reflections on Linguistics, Corpus Linguistics, Programming, Software, Technology, Education, Society
Saturday, May 29, 2010
Friday, May 28, 2010
ELT in Pakistan
To the Headmistress
Govt. High School
Madam
I have an urgent piece of work at home. Therefore I cannot come to school. Kindly grant me leave for two days.
Yours Obediently
X.Y.Z.
This was the lesson my younger sister was cramming while I was having my breakfast this morning. She is in 5th grade and also goes to a coaching center for extra help with her studies. Both at school and at the coaching center the system of teaching English is the same: they make the students cram the lesson. Initially there are no rules of the language, just cramming of big and small chunks of language in the form of vocabulary items like the names of colours, simple sentences, and afterwards stories, applications, letters and essays. At first the stories etc. are simple, with a few sentences and simple structures. Afterwards things get complex and lengthy. And this goes on up to B.A., i.e. the 14th grade.
I still remember the good old days when I was a schoolboy and used to cram things just as my younger sister was doing. Initially it was just cramming for me, but in the later grades, from 7th onwards, I remember trying to do some 'creative' work. I used to add a few sentences of my own to essays like My Best Teacher, or to stories. It was a deliberate effort to get more marks than my class fellows, but this creativity somehow expanded, and now at this stage of my life I am able to write a blog post in English.
The English language teaching system in Pakistan just makes you cram the rules and the vocabulary. It pushes you into the sea of language; if you are a good learner you may learn to survive there and start picking things up on your own. Otherwise you won't be able to write any sentences or essays other than those you crammed during your education. We produce crammers or, if they are lucky enough, good writers. We need English mostly for writing, and so our system of teaching is writing oriented.
But things are going to change now. Our government educational system still produces writers, but there is an urge to learn spoken English. So there are countless institutes offering spoken English classes. They provide a low-quality version of communicative language teaching methodology, but they are at least trying to cope with the problem.
But we need to redefine our needs for English, and our system of education as well.
Monday, May 17, 2010
Oka: A variant form of Okay, O.K. prevailing in Punjabi Youth
When I first heard the phrase "oka fer" (okay then), I laughed for hours because it was so funny to me. I remembered it again a few minutes later and started laughing again, but after that the laughter wore off. Now I use this word quite frequently with my close friends, class fellows and colleagues.
O.K. or Okay is a word used for affirmation in colloquial English. The history of this word says that its usage started in America; perhaps some American president wrote o.k. on a file and then it "officially" came into being. It is also used as a discourse marker, along with right, etc. It is pronounced as:
The /k/ in this word is, as usual, aspirated like /kh/ by native English speakers of American and British origin.
According to Wikipedia, this word is used in colloquial English and has been adopted by several languages of the world. It has also been borrowed by local languages of Pakistan such as Punjabi and the national language, Urdu. In Urdu it is pronounced as "okay", and the same is the case in Punjabi. But there is a variation in Punjabi. As Punjabi is a more informal language, used among close friends, at home etc., an informal form of okay is spreading among Punjabi youth, especially young boys. They use it for fun: while saying "oka" they initially have a smile on their faces, but after that they accept it as a form of okay and try to use this new word in their friends' company. It can be called a slang word, because it is used only informally by youngsters in their gatherings, but the change is underway. It is also being used by young university girls with their close male friends and colleagues in informal situations, and even more frequently in female-to-female interactions. Further variants, "oki" among girls and "oku" among boys, are used on rare occasions when the motive is to create more fun and laughter.
Variation is underway, which is the destiny of language. Continuing our discussion on variation, I would like to document the variation in the romanized variety of Urdu and Punjabi that is used in text messages, especially by youngsters and university students when messaging their peers and friends.
Saturday, May 15, 2010
Urdu in Google Translate
Urdu started flourishing on the internet after 2003. Initially there was a single website, bbcurdu.com, which Urdu lovers knew as a "true" Urdu website: true in the sense that it was text based, built on the then brand new Unicode (UTF-8) standard, and very quick to load. After 2005 there was a boost in Urdu web publishing, and we saw forums as well as blogs created in Unicode Urdu. The trend then started changing from purely InPage-generated .gif images to simple but elegant text-based websites, which were quicker to load and searchable by search engines. Forums like urduweb.org/mehfil emerged which were entirely in Urdu, a strange phenomenon in those days. Nowadays websites like urdupoint.com use a blend of pictures and text-based Urdu: the default and front-page matter is in Unicode, while poetry and other material is still generated through InPage in the form of gif images.
The trend towards the Unicode standard is accelerating now, helped by various factors: lots of people know how to write Urdu even in Notepad and how to build a personal blog in Urdu (thanks to Urduweb.org/mehfil, a great Urdu forum and the mother of most Urdu blogs); the internet keeps spreading; and Urdu can now be machine translated. Paktranslations.com has been online for a year or so and provides good machine translation between English and Urdu, but it can never match the experience and resources of the search giant Google. And now Google has added Urdu support to its translate.google.com service. It is still in the alpha stage but very much usable and acceptable.
Urdu is now among the 56 or so languages supported by Google Translate. Hindi, a stepsister of Urdu, was already supported, and so was Arabic. I was just watching the progress of translate.bing.com, Microsoft's reply to Google Translate. It supports only 30 or so languages so far, lagging far behind. Bing will have to add support for Hindi and Urdu as well to prove itself. Long live Urdu; we'll see more advances in a short span of time.
Update: Bing also has an Indic Transliteration Tool, similar to Google Transliterate. But it is still behind, because it does not support Urdu and does not have an API for adding its support to other applications, which Google's transliteration service has.
Wednesday, May 12, 2010
Sorting a Dictionary in C# 2
During corpus processing, I need to create and sort dictionaries for the very simple task of generating frequency lists. For this I always have to search the web for solutions. C# is now at version 4 and has many additions and innovations in its syntax: new typing conventions, keywords, extension methods and namespaces like the powerful System.Linq. But being a linguistics student, I cannot find the time to upgrade my skills from C# 2005, a.k.a. C# 2, to the latest versions. Perhaps it is not necessary either, because I need C# for text processing, which is done very easily with C# 2 and its System.Collections.Generic namespace.
So here is a code sample that sorts a dictionary by its values. I have blended two or three methods into one, so that it takes a dictionary as input and returns a dictionary as output. The code may be inefficient because of my limited programming skills, but it works for me. Hopefully it will work for you too. :-)
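// Returns a new dictionary filled in descending order of value.
// The generic type arguments were lost from this listing; <string, int>
// (word, frequency) is assumed. Requires: using System.Collections.Generic;
public static Dictionary<string, int> DictionarySort(Dictionary<string, int> dict)
{
    // Copy the entries into a list, because a Dictionary cannot be sorted in place.
    List<KeyValuePair<string, int>> list = new List<KeyValuePair<string, int>>();
    foreach (KeyValuePair<string, int> kvp in dict)
    {
        list.Add(kvp);
    }
    // Sort by value, highest first (nextPair is compared against firstPair).
    list.Sort(
        delegate(KeyValuePair<string, int> firstPair,
                 KeyValuePair<string, int> nextPair)
        {
            return nextPair.Value.CompareTo(firstPair.Value);
        }
    );
    // Note: Dictionary does not formally guarantee enumeration order;
    // return the list itself if the order must be relied on.
    Dictionary<string, int> d = new Dictionary<string, int>();
    foreach (KeyValuePair<string, int> kvp in list)
    {
        d.Add(kvp.Key, kvp.Value);
    }
    return d;
}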
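For the frequency-list task itself, a self-contained variant might look like the sketch below. It is only one possible arrangement, still in C# 2 style: the class name FrequencyList and the methods Count and SortByValue are hypothetical, and tokenising on whitespace is a simplifying assumption. It returns a List of pairs rather than a Dictionary, because the documented enumeration order of a Dictionary is undefined while a List keeps the sorted order.

```csharp
using System;
using System.Collections.Generic;

public static class FrequencyList
{
    // Counts how often each whitespace-separated token occurs (case-insensitive).
    public static Dictionary<string, int> Count(string text)
    {
        Dictionary<string, int> counts = new Dictionary<string, int>();
        string[] tokens = text.ToLowerInvariant().Split(
            new char[] { ' ', '\t', '\r', '\n' },
            StringSplitOptions.RemoveEmptyEntries);
        foreach (string token in tokens)
        {
            int n;
            counts.TryGetValue(token, out n); // n is 0 for tokens not seen yet
            counts[token] = n + 1;
        }
        return counts;
    }

    // Returns the entries by descending count; ties are broken by key.
    public static List<KeyValuePair<string, int>> SortByValue(Dictionary<string, int> dict)
    {
        List<KeyValuePair<string, int>> list = new List<KeyValuePair<string, int>>(dict);
        list.Sort(delegate(KeyValuePair<string, int> a, KeyValuePair<string, int> b)
        {
            int byCount = b.Value.CompareTo(a.Value);
            return byCount != 0 ? byCount : string.CompareOrdinal(a.Key, b.Key);
        });
        return list;
    }

    public static void Main()
    {
        // Hypothetical sample sentence, for illustration only.
        Dictionary<string, int> counts = Count("the cat saw the dog and the bird");
        foreach (KeyValuePair<string, int> kvp in SortByValue(counts))
        {
            Console.WriteLine(kvp.Key + "\t" + kvp.Value);
        }
    }
}
```

Breaking ties by key is a design choice, not a requirement; it just makes the output reproducible, since List.Sort is not a stable sort.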
Tuesday, May 11, 2010
Qt Gets Multitouch Support
Qt is one of the very first things I encountered in the Linux and open source world. I was very fond of KDE 3.x.x, and the mother of KDE was Qt, the GUI toolkit used to develop KDE and its applications. I always liked the light white and blue interface of KDE compared to the brownish one of Gnome. Then, after the release of the adventurous KDE 4, I had to move to a stable desktop environment, Gnome. But KDE is still my love, and I try to use it whenever possible. Once I actually tried to develop for KDE as well, through Qt of course. I was looking for a way to develop C# applications using Qt, but I have not managed it to this day. There is a way (Qyoto) to build applications using Qt and C#, but it is not for beginners. A person like me would want a GUI designer like the one MonoDevelop has for GTK#, a good, easy-to-use IDE to write the applications in, and documentation to get help from. None of these is available yet in the case of Qyoto. It is a project with good potential, but it is still not usable for people like me.
So that was the history of my love for Qt and KDE. Qt was bought by Nokia a few years back and is now being developed by Nokia in their own way, to make it fit for smartphones and other such devices. Qt was once dual licensed, but it is now also available under the LGPL, so the code can be used more easily. The news that made me write this post is that Qt gets multitouch support, which means it is now easier to develop applications for handheld devices. Additionally, Qt's software model will be changed to a more modular one in the coming days.
Labels:
Gnome,
GTK#,
KDE,
Linux,
MonoDevelop,
Nokia,
Qt,
Qyoto,
Technology
Sunday, May 9, 2010
WebOS based HP's Tablet ??
We talked previously about HP's acquisition of Palm, and discussed a little the consequences of this move for the near future. Predictions in technology are not very difficult to make: you just need a wise brain and good analytical skills to tell what may come next. So the IT gurus are putting two and two together, and a rumor about a new HP tablet is making the rounds on the Internet.
Looks promising? It may be a new competitor to the Apple iPad and other such products. So let us see what happens next.
Update: Here is something more about it
Thursday, May 6, 2010
My New Old Laptop
Just got a new OLD laptop, an IBM ThinkPad. More specifications will follow, because it is not with me at the moment. I had been trying to get a laptop since 2007 and have finally got one. Thanks to Allah G. :-)
Saturday, May 1, 2010
Forensic Linguistics in Pakistan
Forensic linguistics is the subfield of linguistics that is used to identify criminals and gangsters and to help solve legal cases. In foreign countries it is considered a vital part of investigation. Linguists serve in this field as phoneticians, stylisticians, discourse analysts and so on. That is why their legal institutions are so responsible and efficient, and why the crime rate in those countries is very low.
Now, if we look at Pakistan's condition, why is crime (gangsters, the underworld etc.) taking root deeper and deeper here? Because Pakistan does not have a full-fledged system of investigation.
I personally think that forensic linguistics should exist in Pakistan in its full sense. At this time, not even a single university is offering forensic linguistics as a subject or a specialization. If Pakistani universities offered it, it could help students and professionals in the near future. It would also help our legal field, especially in finding criminals (which is very unusual in our country).
A forensic linguist can be:
a phonetician, stylistician, discourse analyst, semanticist etc. These skills are also very important for our legal way of investigation, rather than the “3rd degree”.
So, in my view, it would serve a dual purpose: the competence of students would improve, and it would help our country sustain its anti-crime cell. Maybe a little hard work in this field can go a long way.
Ahh! But our bad luck is that we do not have a single subject on it. Students like me, who love the forensic field in any discipline, would be very happy to have one… InshahAllah I will try my level best to introduce forensic linguistics in Pakistan as a subject, because I know that efforts and struggles make dreams come true. This field would serve a lot at the national level, and maybe criminals would get civilized.
Anyway, my dream is to be a forensic linguist, inshahAllah.