NAME

NAT::Lexicon - Perl extension to encapsulate NATools Lexicon files


SYNOPSIS

  use NAT::Lexicon;
  $lex = new NAT::Lexicon("file.lex");
  $lex = NAT::Lexicon::open("file.lex"); # deprecated
  $word = $lex->word_from_id(2);
  $id = $lex->id_from_word("cavalo");
  $ids = $lex->sentence_to_ids("era uma vez um gato maltez"); # array reference
  $sentence = $lex->ids_to_sentence(10,2,3,2,5,4,3,2,5);
  $lex->size;
  $lex->id_count(2);
  $lex->close;


DESCRIPTION

This module wraps NATools lexicon files and makes them accessible from Perl through an object-oriented interface. First, open a lexicon file:

 $lex = new NAT::Lexicon("lexicon.file.lex");

When you are done, do not forget to close it. Closing frees memory and makes room for opening other lexicon files.

 $lex->close;

Lexicon files map words to identifiers and vice versa. Usage is simple: use

  $lex->id_from_word($word)

to get an id for a word. Use

  $lex->word_from_id($id)

to get the word back from its id. To convert a whole sentence at once, use ids_to_sentence to build a sentence from ids, or sentence_to_ids to turn a sentence into ids.

new

This is the NAT::Lexicon constructor. Pass it a lexicon file. These files usually end with a .lex extension:

   my $lexicon = new NAT::Lexicon("file.lex");

open

The open function is the DEPRECATED constructor for NAT::Lexicon objects. Use the new method instead.

save

This method saves the current lexicon object to the supplied file:

   $lexicon -> save("/there/lexicon.lex");

close

Call this method to close a lexicon. Closing is important because it frees resources: memory, and a lexicon slot, since only a limited number of lexicons can be open at a time.

   $lexicon -> close;

word_from_id

This method is used to convert one word-id to a word:

   my $word = $lexicon -> word_from_id ($word_id);

ids_to_sentence

This method receives a list of word identifiers, calls word_from_id on each of them, and returns the resulting words as a single string, separated by a space character.

   my $sentence = $lexicon -> ids_to_sentence(1,3,5,2,3,6);
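
As a rough illustration of that behaviour, the sketch below rebuilds a sentence by hand with word_from_id. The helper ids_to_sentence_sketch and the TinyLexicon stand-in class are hypothetical: they exist only so the example runs without a lexicon file, and the ids used are made up. With a real lexicon you would simply call ids_to_sentence.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of NAT::Lexicon): look up each id with
# word_from_id and join the words with a single space.
sub ids_to_sentence_sketch {
    my ($lexicon, @ids) = @_;
    return join " ", map { $lexicon->word_from_id($_) } @ids;
}

# Stand-in object so the sketch runs without NATools; here the id of a
# word is just its position in the list given to new (an assumption).
package TinyLexicon {
    sub new {
        my ($class, @words) = @_;
        my $self = { words => [@words] };
        return bless $self, $class;
    }
    sub word_from_id {
        my ($self, $id) = @_;
        return $self->{words}[$id];
    }
}

my $lexicon = TinyLexicon->new(qw(um gato era uma vez));
print ids_to_sentence_sketch($lexicon, 2, 3, 4, 0, 1), "\n";
```

The helper only relies on word_from_id, so it should accept any object that provides that method.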

id_from_word

This method is used to convert one word to its corresponding identifier (word-id).

    my $word_id = $lexicon -> id_from_word( $word );

sentence_to_ids

This method calls id_from_word for each word in a sentence. Note that it does not tokenize the text: it simply splits the sentence on the space character. If the text needs tokenization, preprocess it with an NLP tokenizer first.

The method returns a reference to the list of identifiers.

  my $wid_list = $lexicon -> sentence_to_ids("a sentence");
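
Because the method only splits on spaces, punctuation attached to a word would be looked up as part of that word. The sketch below shows one possible preprocessing step. The simple_tokenize helper is hypothetical and deliberately crude; it is not part of NAT::Lexicon and is no substitute for a real tokenizer.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of NAT::Lexicon): detach common
# punctuation marks from words and normalize whitespace, so that the
# result can safely be split on single spaces.
sub simple_tokenize {
    my ($text) = @_;
    $text =~ s/([.,;:!?()"])/ $1 /g;   # put spaces around punctuation marks
    $text =~ s/\s+/ /g;                # collapse whitespace runs to one space
    $text =~ s/^ | $//g;               # trim leading and trailing space
    return $text;
}

my $prepared = simple_tokenize("Era uma vez, um gato.");
print "$prepared\n";

# With an open lexicon, the prepared string would then be passed on:
#   my $wid_list = $lexicon->sentence_to_ids($prepared);
```

Whether punctuation marks have their own entries depends on how the lexicon was built, so the tokenization used here should match the one applied to the original corpus.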

id_count

This method returns the number of occurrences of a specific word. Note that the word must be supplied as its identifier, not as the string itself.

  my $count = $lexicon -> id_count( 45 );

occurrences

This method returns the size (number of tokens) of the corpus from which the lexicon was built: it sums the occurrence counts of all words and returns the total.

   my $total = $lexicon -> occurrences;
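
Since id_count gives the occurrences of one word and occurrences gives the corpus total, the two can be combined into a relative frequency. The sketch below assumes nothing beyond those two methods. The relative_frequency helper and the FakeLexicon stand-in class are hypothetical, and the counts are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of NAT::Lexicon): relative frequency of
# a word id, computed from the id_count and occurrences methods.
sub relative_frequency {
    my ($lexicon, $wid) = @_;
    my $total = $lexicon->occurrences;
    return 0 unless $total;            # avoid dividing by zero
    return $lexicon->id_count($wid) / $total;
}

# Stand-in object so the sketch runs without a lexicon file. It maps
# word ids to invented occurrence counts.
package FakeLexicon {
    sub new {
        my ($class, %counts) = @_;
        my $self = { %counts };
        return bless $self, $class;
    }
    sub id_count {
        my ($self, $wid) = @_;
        return $self->{$wid} // 0;
    }
    sub occurrences {
        my ($self) = @_;
        my $sum = 0;
        $sum += $_ for values %$self;
        return $sum;
    }
}

my $lexicon = FakeLexicon->new(2 => 30, 3 => 10);
printf "%.2f\n", relative_frequency($lexicon, 2);
```

If many frequencies are needed, it is probably worth calling occurrences once and reusing the total, since the method is described as summing over every word.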

size

This method returns the number of different words (types) in the corpus from which the lexicon was built.

  my $size = $lexicon -> size;

add_word

This method adds a new word to the lexicon file. The word will be created with an occurrence count of 1.

Note that lexicon files cannot be created from scratch with this module, which is intended for manipulating existing lexicon files. A standard lexicon file has no space for new words, so you need to enlarge it first. Use the size method to find the current size, and the enlarge method to add some empty space.

   $lexicon -> add_word("dog");

set_id_count

After adding a new word (or for an existing word), you may want to change its occurrence count. Call this method to do so, passing it the word identifier and the new occurrence count.

This method is permissive and lets you set a negative occurrence count. Setting an occurrence count to 0 will not delete the word entry.

   $lexicon -> set_id_count( $wid, ++$count);

enlarge

This method creates extra space for new words. You do not need to know the current size, just the number of words you need to add; pass that number as the argument to the method. The returned object should accommodate that many more words. Try to call this method as few times as possible: first calculate the number of words you need, then enlarge the lexicon once.

   $lexicon -> enlarge( 100 ); # 100 more words


SEE ALSO

See perl(1) and NATools documentation.


AUTHOR

Alberto Manuel Brandao Simoes, <albie@alfarrabio.di.uminho.pt>


COPYRIGHT AND LICENSE

Copyright 2002-2009 by NATURA Project http://natura.di.uminho.pt http://natools.sf.net

This library is free software; you can redistribute it and/or modify it under the GNU General Public License version 2, a copy of which should be in the parent directory. This module should be distributed together with the whole NATools package, including the respective copyright notice.