Package pyarabic :: Module araby

Module araby

Arabic module


Author: Taha Zerrouki

Contact: taha dot zerrouki at gmail dot com

Copyright: Arabtechies, Arabeyes, Taha Zerrouki

License: GPL

Date: 2010/03/01

Version: 0.1

Functions
    is letter functions
Boolean
is_sukun(archar)
Checks for Arabic Sukun Mark.
Boolean
is_shadda(archar)
Checks for Arabic Shadda Mark.
Boolean
is_tatweel(archar)
Checks for Arabic Tatweel letter modifier.
Boolean
is_tanwin(archar)
Checks for Arabic Tanwin Marks (FATHATAN, DAMMATAN, KASRATAN).
Boolean
is_tashkeel(archar)
Checks for Arabic Tashkeel Marks (FATHA, DAMMA, KASRA, SUKUN, SHADDA, FATHATAN, DAMMATAN, KASRATAN).
Boolean
is_haraka(archar)
Checks for Arabic Harakat Marks (FATHA, DAMMA, KASRA, SUKUN, TANWIN).
Boolean
is_shortharaka(archar)
Checks for Arabic short Harakat Marks (FATHA, DAMMA, KASRA, SUKUN).
Boolean
is_ligature(archar)
Checks for Arabic Ligatures like LamAlef.
Boolean
is_hamza(archar)
Checks for Arabic Hamza forms.
Boolean
is_alef(archar)
Checks for Arabic Alef forms.
Boolean
is_yehlike(archar)
Checks for Arabic Yeh forms.
Boolean
is_wawlike(archar)
Checks for Arabic Waw-like forms.
Boolean
is_teh(archar)
Checks for Arabic Teh forms.
Boolean
is_small(archar)
Checks for Arabic Small letters.
Boolean
is_weak(archar)
Checks for Arabic Weak letters.
Boolean
is_moon(archar)
Checks for Arabic Moon letters.
Boolean
is_sun(archar)
Checks for Arabic Sun letters.
    general letter functions
integer
order(archar)
Returns the Arabic letter order, between 1 and 29.
unicode
name(archar)
Returns the Arabic letter name, in Arabic.
unicode
arabicrange()
Returns a list of Arabic characters.
    Has letter functions
Boolean
has_shadda(word)
Checks if the Arabic word contains Shadda.
    word and text functions
Boolean
is_vocalized(word)
Checks if the Arabic word is vocalized.
Boolean
is_vocalizedtext(text)
Checks if the Arabic text is vocalized.
Boolean
is_arabicstring(text)
Checks that all characters belong to the standard Arabic Unicode block. An Arabic string can contain spaces, digits and punctuation.
Boolean
is_arabicrange(text)
Checks that all characters belong to the Arabic Unicode block.
Boolean
is_arabicword(word)
Checks for a valid Arabic word.
    Char functions
unicode char
first_char(word)
Return the first char.
unicode char
second_char(word)
Return the second char.
unicode char
last_char(word)
Return the last letter. Example: in zerrouki, 'i' is the last.
unicode char
secondlast_char(word)
Return the second-to-last letter. Example: in zerrouki, 'k' is the second-to-last.
    Strip functions
unicode.
strip_harakat(text)
Strip Harakat from an Arabic word, except Shadda.
unicode.
strip_lastharaka(text)
Strip the last Haraka from an Arabic word, except Shadda.
unicode.
strip_tashkeel(text)
Strip vowels from a text, including Shadda.
unicode.
strip_tatweel(text)
Strip Tatweel from a text and return the resulting text.
unicode.
strip_shadda(text)
Strip Shadda from a text and return the resulting text.
unicode.
normalize_ligature(text)
Normalize Lam Alef ligatures into two letters (LAM and ALEF) and return the resulting text.
unicode.
normalize_hamza(word)
Standardize the Hamzat into one form of Hamza; replace Madda by Hamza and Alef.
couple of unicode
separate(word, extract_shadda=False)
Separates the letters from the vowels in an Arabic word. If a letter has no haraka, the undefined haraka (NOT_DEF_HARAKA) is assigned to it.
unicode
joint(letters, marks)
Joins the letters with the marks. The lengths of letters and marks must be equal. Returns the word.
Boolean
vocalizedlike(word1, word2)
Returns True if the two words have the same letters and the same harakat.
Boolean
waznlike(word1, wazn)
Checks whether word1 matches a wazn (pattern). The letters must be equal; the wazn has FEH, AIN, LAM letters.
Boolean
shaddalike(partial, fully)
Returns True if the two words have the same letters and the same harakat.
unicode.
reduce_tashkeel(text)
Reduce the Tashkeel by deleting evident cases.
Boolean / int
vocalized_similarity(word1, word2)
Returns True if the two words have the same letters and the same harakat.
list.
tokenize(text=u'')
Tokenize text into words.
Variables
  COMMA = u'،'
  SEMICOLON = u'؛'
  QUESTION = u'؟'
  HAMZA = u'ء'
  ALEF_MADDA = u'آ'
  ALEF_HAMZA_ABOVE = u'أ'
  WAW_HAMZA = u'ؤ'
  ALEF_HAMZA_BELOW = u'إ'
  YEH_HAMZA = u'ئ'
  ALEF = u'ا'
  BEH = u'ب'
  TEH_MARBUTA = u'ة'
  TEH = u'ت'
  THEH = u'ث'
  JEEM = u'ج'
  HAH = u'ح'
  KHAH = u'خ'
  DAL = u'د'
  THAL = u'ذ'
  REH = u'ر'
  ZAIN = u'ز'
  SEEN = u'س'
  SHEEN = u'ش'
  SAD = u'ص'
  DAD = u'ض'
  TAH = u'ط'
  ZAH = u'ظ'
  AIN = u'ع'
  GHAIN = u'غ'
  TATWEEL = u'ـ'
  FEH = u'ف'
  QAF = u'ق'
  KAF = u'ك'
  LAM = u'ل'
  MEEM = u'م'
  NOON = u'ن'
  HEH = u'ه'
  WAW = u'و'
  ALEF_MAKSURA = u'ى'
  YEH = u'ي'
  MADDA_ABOVE = u'ٓ'
  HAMZA_ABOVE = u'ٔ'
  HAMZA_BELOW = u'ٕ'
  ZERO = u'٠'
  ONE = u'١'
  TWO = u'٢'
  THREE = u'٣'
  FOUR = u'٤'
  FIVE = u'٥'
  SIX = u'٦'
  SEVEN = u'٧'
  EIGHT = u'٨'
  NINE = u'٩'
  PERCENT = u'٪'
  DECIMAL = u'٫'
  THOUSANDS = u'٬'
  STAR = u'٭'
  MINI_ALEF = u'ٰ'
  ALEF_WASLA = u'ٱ'
  FULL_STOP = u'۔'
  BYTE_ORDER_MARK = u'\ufeff'
  FATHATAN = u'ً'
  DAMMATAN = u'ٌ'
  KASRATAN = u'ٍ'
  FATHA = u'َ'
  DAMMA = u'ُ'
  KASRA = u'ِ'
  SHADDA = u'ّ'
  SUKUN = u'ْ'
  SMALL_ALEF = u'ٰ'
  SMALL_WAW = u'ۥ'
  SMALL_YEH = u'ۦ'
  LAM_ALEF = u'\ufefb'
  LAM_ALEF_HAMZA_ABOVE = u'\ufef7'
  LAM_ALEF_HAMZA_BELOW = u'\ufef9'
  LAM_ALEF_MADDA_ABOVE = u'\ufef5'
  SIMPLE_LAM_ALEF = u'لا'
  SIMPLE_LAM_ALEF_HAMZA_ABOVE = u'لأ'
  SIMPLE_LAM_ALEF_HAMZA_BELOW = u'لإ'
  SIMPLE_LAM_ALEF_MADDA_ABOVE = u'لآ'
  LETTERS = u'ابتةثجحخدذرزسشصضطظعغفقكلمنهويءآأؤإئ'
  TASHKEEL = (u'ً', u'ٌ', u'ٍ', u'َ', u'ُ', u'ِ', u'ْ', u'ّ')
  HARAKAT = (u'ً', u'ٌ', u'ٍ', u'َ', u'ُ', u'ِ', u'ْ')
  SHORTHARAKAT = (u'َ', u'ُ', u'ِ', u'ْ')
  TANWIN = (u'ً', u'ٌ', u'ٍ')
  NOT_DEF_HARAKA = u'ـ'
  LIGUATURES = (u'\ufefb', u'\ufef7', u'\ufef9', u'\ufef5')
  HAMZAT = (u'ء', u'ؤ', u'ئ', u'ٔ', u'ٕ', u'إ', u'أ')
  ALEFAT = (u'ا', u'آ', u'أ', u'إ', u'ٱ', u'ى', u'ٰ')
  WEAK = (u'ا', u'و', u'ي', u'ى')
  YEHLIKE = (u'ي', u'ئ', u'ى', u'ۦ')
  WAWLIKE = (u'و', u'ؤ', u'ۥ')
  TEHLIKE = (u'ت', u'ة')
  SMALL = (u'ٰ', u'ۥ', u'ۦ')
  MOON = (u'ء', u'آ', u'أ', u'إ', u'ا', u'ب', u'ج', u'ح', u'خ', ...
  SUN = (u'ت', u'ث', u'د', u'ذ', u'ر', u'ز', u'س', u'ش', u'ص', u...
  ALPHABETIC_ORDER = {u'ء': 29, u'آ': 29, u'أ': 29, u'ؤ': 29, u'...
  NAMES = {u'ء': u'همزة', u'آ': u'ألف ممدودة', u'أ': u'همزة على ...
  HARAKAT_PATTERN = re.compile(r'(?u)[\u064b\u064c\u064d\u064e\u...
  LASTHARAKA_PATTERN = re.compile(r'(?u)[\u064b\u064c\u064d\u064...
  SHORTHARAKAT_PATTERN = re.compile(r'(?u)[\u064e\u064f\u0650\u0...
  TASHKEEL_PATTERN = re.compile(r'(?u)[\u064b\u064c\u064d\u064e\...
  HAMZAT_PATTERN = re.compile(r'(?u)[\u0621\u0624\u0626\u0654\u0...
  ALEFAT_PATTERN = re.compile(r'(?u)[\u0627\u0622\u0623\u0625\u0...
  LIGUATURES_PATTERN = re.compile(r'(?u)[\ufefb\ufef7\ufef9\ufef...
  TOKEN_PATTERN = re.compile(r'(?u)([\w\u064b\u064c\u064d\u064e\...
  TOKEN_REPLACE = re.compile(r'[\t\r\f\v ]')
  __package__ = 'pyarabic'
Function Details

is_sukun(archar)

Checks for Arabic Sukun Mark.

Parameters:
  • archar (unicode) - arabic unicode char
Returns: Boolean

is_shadda(archar)

Checks for Arabic Shadda Mark.

Parameters:
  • archar (unicode) - arabic unicode char
Returns: Boolean

is_tatweel(archar)

Checks for Arabic Tatweel letter modifier.

Parameters:
  • archar (unicode) - arabic unicode char
Returns: Boolean

is_tanwin(archar)

Checks for Arabic Tanwin Marks (FATHATAN, DAMMATAN, KASRATAN).

Parameters:
  • archar (unicode) - arabic unicode char
Returns: Boolean

is_tashkeel(archar)

Checks for Arabic Tashkeel Marks:

  • FATHA, DAMMA, KASRA, SUKUN,
  • SHADDA,
  • FATHATAN, DAMMATAN, KASRATAN.
Parameters:
  • archar (unicode) - arabic unicode char
Returns: Boolean
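
The mark families checked by is_tanwin, is_shortharaka, is_haraka and is_tashkeel nest inside one another, which a small membership sketch makes visible. The tuples mirror the TANWIN, SHORTHARAKAT, HARAKAT and TASHKEEL values listed under Variables; the `*_sketch` function names are hypothetical and are not part of araby, and the real functions may be implemented differently.

```python
# Code points taken from the TANWIN, SHORTHARAKAT, HARAKAT and TASHKEEL
# values in the Variables list.
TANWIN = ('\u064b', '\u064c', '\u064d')                  # FATHATAN, DAMMATAN, KASRATAN
SHORTHARAKAT = ('\u064e', '\u064f', '\u0650', '\u0652')  # FATHA, DAMMA, KASRA, SUKUN
HARAKAT = TANWIN + SHORTHARAKAT
TASHKEEL = HARAKAT + ('\u0651',)                         # HARAKAT plus SHADDA


def is_haraka_sketch(archar):
    """Hypothetical stand-in for is_haraka: plain tuple membership."""
    return archar in HARAKAT


def is_tashkeel_sketch(archar):
    """Hypothetical stand-in for is_tashkeel: HARAKAT or SHADDA."""
    return archar in TASHKEEL
```

Under this reading, SHADDA is a Tashkeel mark but not a Haraka, which is the distinction the strip functions rely on.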

is_haraka(archar)

Checks for Arabic Harakat Marks (FATHA, DAMMA, KASRA, SUKUN, TANWIN).

Parameters:
  • archar (unicode) - arabic unicode char
Returns: Boolean

is_shortharaka(archar)

Checks for Arabic short Harakat Marks (FATHA, DAMMA, KASRA, SUKUN).

Parameters:
  • archar (unicode) - arabic unicode char
Returns: Boolean

is_ligature(archar)

Checks for Arabic Ligatures like LamAlef (LAM_ALEF, LAM_ALEF_HAMZA_ABOVE, LAM_ALEF_HAMZA_BELOW, LAM_ALEF_MADDA_ABOVE).

Parameters:
  • archar (unicode) - arabic unicode char
Returns: Boolean

is_hamza(archar)

Checks for Arabic Hamza forms. HAMZAT are (HAMZA, WAW_HAMZA, YEH_HAMZA, HAMZA_ABOVE, HAMZA_BELOW, ALEF_HAMZA_BELOW, ALEF_HAMZA_ABOVE).

Parameters:
  • archar (unicode) - arabic unicode char
Returns: Boolean

is_alef(archar)

Checks for Arabic Alef forms. ALEFAT = (ALEF, ALEF_MADDA, ALEF_HAMZA_ABOVE, ALEF_HAMZA_BELOW, ALEF_WASLA, ALEF_MAKSURA, MINI_ALEF).

Parameters:
  • archar (unicode) - arabic unicode char
Returns: Boolean

is_yehlike(archar)

Checks for Arabic Yeh forms. Yeh forms: YEH, YEH_HAMZA, SMALL_YEH, ALEF_MAKSURA.

Parameters:
  • archar (unicode) - arabic unicode char
Returns: Boolean

is_wawlike(archar)

Checks for Arabic Waw-like forms. Waw forms: WAW, WAW_HAMZA, SMALL_WAW.

Parameters:
  • archar (unicode) - arabic unicode char
Returns: Boolean

is_teh(archar)

Checks for Arabic Teh forms. Teh forms: TEH, TEH_MARBUTA.

Parameters:
  • archar (unicode) - arabic unicode char
Returns: Boolean

is_small(archar)

Checks for Arabic Small letters. Small letters: SMALL_ALEF, SMALL_WAW, SMALL_YEH.

Parameters:
  • archar (unicode) - arabic unicode char
Returns: Boolean

is_weak(archar)

Checks for Arabic Weak letters. Weak letters: ALEF, WAW, YEH, ALEF_MAKSURA.

Parameters:
  • archar (unicode) - arabic unicode char
Returns: Boolean

is_moon(archar)

Checks for Arabic Moon letters, i.e. the members of MOON.

Parameters:
  • archar (unicode) - arabic unicode char
Returns: Boolean

is_sun(archar)

Checks for Arabic Sun letters, i.e. the members of SUN.

Parameters:
  • archar (unicode) - arabic unicode char
Returns: Boolean

order(archar)

Returns the Arabic letter order, between 1 and 29. Alef is 1, Yeh is 28, and Hamza is 29. Teh Marbuta has the same order as Teh, 3.

Parameters:
  • archar (unicode) - arabic unicode char
Returns: integer
arabic order.
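
The ordering rules above amount to a dictionary lookup, sketched here with a deliberately partial table. `ORDER_SUBSET` and `order_sketch` are hypothetical names; the table holds only the values stated in this description plus BEH = 2 from the ALPHABETIC_ORDER excerpt, and the fallback value for characters outside the table is an assumption, since the behaviour of order() in that case is not documented here.

```python
# Partial table: only orders stated in the text, not the full ALPHABETIC_ORDER.
ORDER_SUBSET = {
    '\u0627': 1,   # ALEF
    '\u0628': 2,   # BEH
    '\u062a': 3,   # TEH
    '\u0629': 3,   # TEH_MARBUTA shares TEH's order
    '\u064a': 28,  # YEH
    '\u0621': 29,  # HAMZA
}


def order_sketch(archar, default=0):
    """Look up the letter order; `default` is an assumed fallback value."""
    return ORDER_SUBSET.get(archar, default)
```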

name(archar)

Returns the Arabic letter name, in Arabic.

Parameters:
  • archar (unicode) - arabic unicode char
Returns: unicode
arabic name.

arabicrange()

Returns the list of Arabic characters from ، (COMMA, U+060C) to ْ (SUKUN, U+0652).

Returns: unicode
list of Arabic characters.
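
As a rough illustration, such a range can be built from code points. `arabicrange_sketch` is a hypothetical name, and the sketch assumes both endpoints are included, which the description does not state explicitly.

```python
def arabicrange_sketch():
    """Characters from COMMA (U+060C) through SUKUN (U+0652).

    Assumption: the upper bound is inclusive.
    """
    return [chr(code) for code in range(0x060C, 0x0652 + 1)]
```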

has_shadda(word)

Checks if the Arabic word contains Shadda.

Parameters:
  • word (unicode) - arabic unicode word
Returns: Boolean
if shadda exists

is_vocalized(word)

Checks if the Arabic word is vocalized. The word must not contain spaces or punctuation.

Parameters:
  • word (unicode) - arabic unicode word
Returns: Boolean
if the word is vocalized
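
One plausible reading of this check is sketched below. It reuses the character class shown for HARAKAT_PATTERN under Variables Details; `is_vocalized_sketch` is a hypothetical name, and both the whitespace rejection and the choice of HARAKAT (rather than TASHKEEL) are assumptions about how the function behaves.

```python
import re

# Same character class as HARAKAT_PATTERN in the Variables Details section.
HARAKAT_PATTERN = re.compile('[\u064b\u064c\u064d\u064e\u064f\u0650\u0652]')


def is_vocalized_sketch(word):
    """Treat a word as vocalized if it carries at least one haraka."""
    # Assumption: input containing whitespace is not a single word.
    if any(ch.isspace() for ch in word):
        return False
    return HARAKAT_PATTERN.search(word) is not None
```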

is_vocalizedtext(text)

Checks if the Arabic text is vocalized. The text can contain many words and spaces.

Parameters:
  • text (unicode) - arabic unicode text
Returns: Boolean
if the text is vocalized

is_arabicstring(text)

Checks that all characters belong to the standard Arabic Unicode block. An Arabic string can contain spaces, digits and punctuation, but only standard Arabic characters, not extended Arabic.

Parameters:
  • text (unicode) - input text
Returns: Boolean
True if all characters are in the Arabic block

is_arabicrange(text)

Checks that all characters belong to the Arabic Unicode block.

Parameters:
  • text (unicode) - input text
Returns: Boolean
True if all characters are in the Arabic block

is_arabicword(word)

Checks for a valid Arabic word. An Arabic word contains no spaces, digits or punctuation. To catch some spelling errors, TEH_MARBUTA must be at the end.

Parameters:
  • word (unicode) - input word
Returns: Boolean
True if the word is a valid Arabic word

first_char(word)

Return the first char

Parameters:
  • word (unicode) - given word
Returns: unicode char
the first char

second_char(word)

Return the second char.

Parameters:
  • word (unicode) - given word
Returns: unicode char
the second char

last_char(word)

Return the last letter. Example: in zerrouki, 'i' is the last.

Parameters:
  • word (unicode) - given word
Returns: unicode char
the last letter

secondlast_char(word)

Return the second-to-last letter. Example: in zerrouki, 'k' is the second-to-last.

Parameters:
  • word (unicode) - given word
Returns: unicode char
the second last letter
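
The zerrouki example can be reproduced with plain indexing. The function names below are hypothetical, and returning an empty string for words that are too short is an assumption; the behaviour of the char functions on such input is not documented here.

```python
def last_char_sketch(word):
    """Last character, or '' for an empty word (assumed fallback)."""
    return word[-1] if word else ''


def secondlast_char_sketch(word):
    """Second-to-last character, or '' if the word is too short (assumed)."""
    return word[-2] if len(word) >= 2 else ''
```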

strip_harakat(text)

Strip Harakat from an Arabic word, except Shadda. The stripped marks are:

  • FATHA, DAMMA, KASRA
  • SUKUN
  • FATHATAN, DAMMATAN, KASRATAN

Example:

>>> text = u"الْعَرَبِيّةُ"
>>> strip_harakat(text)
العربيّة
Parameters:
  • text (unicode.) - arabic text.
Returns: unicode.
return a stripped text.

strip_lastharaka(text)

Strip the last Haraka from an Arabic word, except Shadda. The stripped marks are:

  • FATHA, DAMMA, KASRA
  • SUKUN
  • FATHATAN, DAMMATAN, KASRATAN

Example:

>>> text = u"الْعَرَبِيّةُ"
>>> strip_lastharaka(text)
الْعَرَبِيّة
Parameters:
  • text (unicode.) - arabic text.
Returns: unicode.
return a stripped text.
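
A minimal sketch of this behaviour, assuming it is a single substitution with the expression shown for LASTHARAKA_PATTERN under Variables Details. `strip_lastharaka_sketch` is a hypothetical name. Note that the second alternative of that expression matches Tanwin anywhere in the text, not only at the end.

```python
import re

# Expression as listed for LASTHARAKA_PATTERN: a haraka at the very end,
# or a Tanwin mark at any position.
LASTHARAKA_PATTERN = re.compile(
    '[\u064b\u064c\u064d\u064e\u064f\u0650\u0652]$|[\u064b\u064c\u064d]')


def strip_lastharaka_sketch(text):
    """Remove the final haraka (and any Tanwin); Shadda is left alone."""
    return LASTHARAKA_PATTERN.sub('', text)
```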

strip_tashkeel(text)

Strip vowels from a text, including Shadda. The stripped marks are:

  • FATHA, DAMMA, KASRA
  • SUKUN
  • SHADDA
  • FATHATAN, DAMMATAN, KASRATAN

Example:

>>> text = u"الْعَرَبِيّةُ"
>>> strip_tashkeel(text)
العربية
Parameters:
  • text (unicode.) - arabic text.
Returns: unicode.
return a stripped text.
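
The difference between strip_harakat and strip_tashkeel comes down to whether SHADDA (U+0651) is in the character class. The sketch below assumes each function is a single regex substitution using the classes shown for HARAKAT_PATTERN and TASHKEEL_PATTERN under Variables Details; the `*_sketch` names are hypothetical.

```python
import re

# Character classes as listed under Variables Details.
HARAKAT_PATTERN = re.compile('[\u064b\u064c\u064d\u064e\u064f\u0650\u0652]')
TASHKEEL_PATTERN = re.compile('[\u064b\u064c\u064d\u064e\u064f\u0650\u0652\u0651]')


def strip_harakat_sketch(text):
    """Remove harakat but keep Shadda."""
    return HARAKAT_PATTERN.sub('', text)


def strip_tashkeel_sketch(text):
    """Remove harakat and Shadda."""
    return TASHKEEL_PATTERN.sub('', text)
```

Applied to the example word above, the first function keeps the Shadda on the Yeh and the second removes it as well.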

strip_tatweel(text)

Strip Tatweel from a text and return the resulting text.

Example:

>>> text = u"العـــــربية"
>>> strip_tatweel(text)
العربية
Parameters:
  • text (unicode.) - arabic text.
Returns: unicode.
return a stripped text.
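
Since Tatweel is a single code point (TATWEEL, U+0640), the operation can be sketched as a plain string replacement. `strip_tatweel_sketch` is a hypothetical name, and whether araby uses `replace` or a regex is not stated here.

```python
TATWEEL = '\u0640'  # same value as the TATWEEL variable


def strip_tatweel_sketch(text):
    """Drop every Tatweel (kashida) character from the text."""
    return text.replace(TATWEEL, '')
```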

strip_shadda(text)

Strip Shadda from a text and return the resulting text.

Example:

>>> text = u"الشّمسيّة"
>>> strip_shadda(text)
الشمسية
Parameters:
  • text (unicode.) - arabic text.
Returns: unicode.
return a stripped text.

normalize_ligature(text)

Normalize Lam Alef ligatures into two letters (LAM and ALEF) and return the resulting text. Some systems represent the LamAlef ligature as a single character; this function converts it into two letters. The ligatures converted into LAM and ALEF are:

  • LAM_ALEF, LAM_ALEF_HAMZA_ABOVE, LAM_ALEF_HAMZA_BELOW, LAM_ALEF_MADDA_ABOVE

Example:

>>> text = u"لانها لالء الاسلام"
>>> normalize_ligature(text)
لانها لالئ الاسلام
Parameters:
  • text (unicode.) - arabic text.
Returns: unicode.
return a converted text.
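
A minimal sketch of the conversion, assuming it is one substitution over the four presentation-form code points listed for LIGUATURES_PATTERN under Variables Details. `normalize_ligature_sketch` is a hypothetical name. Following the description literally, every ligature becomes plain LAM + ALEF, so the Hamza or Madda carried by three of the ligatures is dropped in this sketch.

```python
import re

# The four LamAlef presentation forms from LIGUATURES_PATTERN.
LIGATURES_PATTERN = re.compile('[\ufefb\ufef7\ufef9\ufef5]')
LAM, ALEF = '\u0644', '\u0627'


def normalize_ligature_sketch(text):
    """Replace each single-character LamAlef ligature with LAM + ALEF."""
    return LIGATURES_PATTERN.sub(LAM + ALEF, text)
```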

normalize_hamza(word)

Standardize the Hamzat into one form of Hamza; replace Madda by Hamza and Alef. Replace the LamAlefs by simplified letters.

Example:

>>> text = u"سئل أحد الأئمة"
>>> normalize_hamza(text)
سءل ءحد الءءمة
Parameters:
  • word (unicode.) - arabic text.
Returns: unicode.
return a converted text.
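
The two replacements described above can be sketched as follows, using the character class shown for HAMZAT_PATTERN under Variables Details. `normalize_hamza_sketch` is a hypothetical name; the order of the two steps is an assumption, and the LamAlef simplification mentioned in the description is left out.

```python
import re

# Character class as listed for HAMZAT_PATTERN.
HAMZAT_PATTERN = re.compile('[\u0621\u0624\u0626\u0654\u0655\u0625\u0623]')
HAMZA, ALEF, ALEF_MADDA = '\u0621', '\u0627', '\u0622'


def normalize_hamza_sketch(word):
    """Map every Hamza form to HAMZA, and ALEF_MADDA to HAMZA + ALEF."""
    # ALEF_MADDA is not in HAMZAT_PATTERN, so it is expanded separately.
    word = word.replace(ALEF_MADDA, HAMZA + ALEF)
    return HAMZAT_PATTERN.sub(HAMZA, word)
```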

separate(word, extract_shadda=False)

Separates the letters from the vowels in an Arabic word. If a letter has no haraka, the undefined haraka (NOT_DEF_HARAKA) is assigned to it. Returns (letters, vowels).

Parameters:
  • word (unicode) - the input word
  • extract_shadda (Boolean) - extract shadda as separate text
Returns: couple of unicode
(letters, vowels)

joint(letters, marks)

Joins the letters with the marks. The lengths of letters and marks must be equal. Returns the word.

Parameters:
  • letters (unicode) - the word letters
  • marks (unicode) - the word marks
Returns: unicode
word
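
separate and joint are inverse operations, which a simplified round trip illustrates. The sketch below uses hypothetical names, ignores the extract_shadda option entirely (a Shadda is treated like an ordinary character), and raises ValueError on a length mismatch as an assumption; what joint actually does in that case is not documented here.

```python
HARAKAT = ('\u064b', '\u064c', '\u064d', '\u064e', '\u064f', '\u0650', '\u0652')
NOT_DEF_HARAKA = '\u0640'  # placeholder for "no haraka", as in the Variables list


def separate_sketch(word):
    """Split a word into (letters, marks) strings of equal length."""
    letters, marks = [], []
    for char in word:
        if char in HARAKAT and letters:
            marks[-1] = char              # attach the haraka to the previous letter
        else:
            letters.append(char)
            marks.append(NOT_DEF_HARAKA)  # no haraka seen for this letter yet
    return ''.join(letters), ''.join(marks)


def joint_sketch(letters, marks):
    """Rebuild the word, skipping the placeholder mark."""
    if len(letters) != len(marks):
        # Assumed error handling for mismatched lengths.
        raise ValueError('letters and marks must have the same length')
    return ''.join(
        letter + ('' if mark == NOT_DEF_HARAKA else mark)
        for letter, mark in zip(letters, marks))
```

With these definitions, `joint_sketch(*separate_sketch(word))` returns the original word for input that contains no Tatweel of its own.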

vocalizedlike(word1, word2)

Returns True if the two words have the same letters and the same harakat. The two words can be fully or partially vocalized.

Parameters:
  • word1 (unicode) - first word
  • word2 (unicode) - second word
Returns: Boolean
if two words have similar vocalization

waznlike(word1, wazn)

Checks whether word1 matches a wazn (pattern). The letters must be equal; the wazn has FEH, AIN, LAM letters, which act as generic letters. The two words can be fully or partially vocalized.

Parameters:
  • word1 (unicode) - input word
  • wazn (unicode) - given word template وزن
Returns: Boolean
if two words have similar vocalization
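
The generic-letter idea can be sketched on unvocalized forms only. `waznlike_sketch` is a hypothetical name; the sketch strips all Tashkeel before comparing, so it ignores the vocalization checks that the real function performs, and it treats every FEH, AIN or LAM in the wazn as a wildcard.

```python
import re

# Character class as listed for TASHKEEL_PATTERN under Variables Details.
TASHKEEL_PATTERN = re.compile('[\u064b\u064c\u064d\u064e\u064f\u0650\u0652\u0651]')
GENERIC = ('\u0641', '\u0639', '\u0644')  # FEH, AIN, LAM


def waznlike_sketch(word, wazn):
    """Compare letters only: generic wazn letters match any letter."""
    word = TASHKEEL_PATTERN.sub('', word)
    wazn = TASHKEEL_PATTERN.sub('', wazn)
    if len(word) != len(wazn):
        return False
    return all(p in GENERIC or c == p for c, p in zip(word, wazn))
```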

shaddalike(partial, fully)

Returns True if the two words have the same letters and the same harakat. The first word is partially vocalized and the second is fully vocalized; if the partial word contains a Shadda, it must be at the same place in the fully vocalized word.

Returns: Boolean
if contains shadda

reduce_tashkeel(text)

Reduce the Tashkeel by deleting evident cases.

Parameters:
  • text (unicode.) - the input text, fully vocalized.
Returns: unicode.
partially vocalized text.

vocalized_similarity(word1, word2)

Returns True if the two words have the same letters and the same harakat. The two words can be fully or partially vocalized.

Parameters:
  • word1 (unicode) - first word
  • word2 (unicode) - second word
Returns: Boolean / int
True if the words are similar, else the negative number of errors

tokenize(text=u'')

Tokenize text into words

Parameters:
  • text (unicode.) - the input text.
Returns: list.
list of words.
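
A rough sketch of word tokenization built on the character class shown for TOKEN_PATTERN under Variables Details, where `\w` is extended with the Tashkeel marks so that vocalized words stay in one piece. `tokenize_sketch` is a hypothetical name, and using `findall`, which keeps only word tokens and discards punctuation, is an assumption; whether tokenize() also returns punctuation is not stated here.

```python
import re

# \w alone does not match the combining Tashkeel marks, so they are added
# explicitly, as in the TOKEN_PATTERN value.
TOKEN_PATTERN = re.compile(r'[\w\u064b\u064c\u064d\u064e\u064f\u0650\u0652\u0651]+')


def tokenize_sketch(text=''):
    """Return the runs of word characters and Tashkeel marks in text."""
    return TOKEN_PATTERN.findall(text)
```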

Variables Details

MOON

Value:
(u'ء',
 u'آ',
 u'أ',
 u'إ',
 u'ا',
 u'ب',
 u'ج',
 u'ح',
...

SUN

Value:
(u'ت',
 u'ث',
 u'د',
 u'ذ',
 u'ر',
 u'ز',
 u'س',
 u'ش',
...

ALPHABETIC_ORDER

Value:
{u'ء': 29,
 u'آ': 29,
 u'أ': 29,
 u'ؤ': 29,
 u'إ': 29,
 u'ئ': 29,
 u'ا': 1,
 u'ب': 2,
...

NAMES

Value:
{u'ء': u'همزة',
 u'آ': u'ألف ممدودة',
 u'أ': u'همزة على الألف',
 u'ؤ': u'همزة على الواو',
 u'إ': u'همزة تحت الألف',
 u'ئ': u'همزة على الياء',
 u'ا': u'ألف',
 u'ب': u'باء',
...

HARAKAT_PATTERN

Value:
re.compile(r'(?u)[\u064b\u064c\u064d\u064e\u064f\u0650\u0652]')

LASTHARAKA_PATTERN

Value:
re.compile(r'(?u)[\u064b\u064c\u064d\u064e\u064f\u0650\u0652]$|[\u064b\u064c\u064d]')

SHORTHARAKAT_PATTERN

Value:
re.compile(r'(?u)[\u064e\u064f\u0650\u0652]')

TASHKEEL_PATTERN

Value:
re.compile(r'(?u)[\u064b\u064c\u064d\u064e\u064f\u0650\u0652\u0651]')

HAMZAT_PATTERN

Value:
re.compile(r'(?u)[\u0621\u0624\u0626\u0654\u0655\u0625\u0623]')

ALEFAT_PATTERN

Value:
re.compile(r'(?u)[\u0627\u0622\u0623\u0625\u0671\u0649\u0670]')

LIGUATURES_PATTERN

Value:
re.compile(r'(?u)[\ufefb\ufef7\ufef9\ufef5]')

TOKEN_PATTERN

Value:
re.compile(r'(?u)([\w\u064b\u064c\u064d\u064e\u064f\u0650\u0652\u0651]+)')