
NLTK word tokenizer treats the ending single quote as a separate word


Here is a code snippet from an IPython notebook:

from nltk.tokenize import word_tokenize

test = "'v'"
words = word_tokenize(test)
words

The output is:

["'v", "'"]

As you can see, the ending single quote is treated as a separate word, while the starting single quote is part of 'v'. I want either

["'v'"]

or

["'", "v", "'"]

Is there any way to achieve this?

Answers (2)

  • Answer 1
  • Try the MosesTokenizer from nltk.tokenize.moses:

    from nltk.tokenize.moses import MosesTokenizer, MosesDetokenizer
    t, d = MosesTokenizer(), MosesDetokenizer()
    tokens = t.tokenize(test)
    tokens
    [u'&apos;v&apos;']

    where &apos; = '

    You can also use the escape=False parameter to prevent escaping of XML special characters:

    >>> t.tokenize("'v'", escape=False)
    ["'v'"]

    The output that keeps 'v' as one token is consistent with the original Moses tokenizer, i.e.

    ~/mosesdecoder/scripts/tokenizer$ perl tokenizer.perl -l en < x
    Tokenizer Version 1.1
    Language: en
    Number of threads: 1
    &apos;v&apos;

    There are other tokenizers if you wish to explore and have handling of single quotes too.
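
    For example, here is a minimal sketch (the two tokenizers below are my own illustrative picks, not ones named in this answer) of alternatives applied to the same string:

    # Sketch: two alternative NLTK tokenizers applied to the same input.
    from nltk.tokenize import TweetTokenizer, wordpunct_tokenize

    text = "'v'"
    # wordpunct_tokenize splits runs of word characters from runs of other
    # characters, so both single quotes become separate tokens: ["'", 'v', "'"]
    print(wordpunct_tokenize(text))
    # TweetTokenizer has its own quote handling, which differs from word_tokenize
    print(TweetTokenizer().tokenize(text))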

  • Answer 2
  • Seems like it's not a bug but the expected output from nltk.word_tokenize().

    This is consistent with the Treebank word tokenizer from Robert McIntyre's tokenizer.sed:

    $ sed -f tokenizer.sed 
    'v'
    'v ' 
    

    As @Prateek pointed out, you can try other tokenizers that might suit your needs.


    The more interesting question is why does the starting single quote stick to the following character?

    Couldn't we hack the TreebankWordTokenizer, like what was done at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py ?

    import re
    
    from nltk.tokenize.treebank import TreebankWordTokenizer
    
    # Standard word tokenizer.
    _treebank_word_tokenizer = TreebankWordTokenizer()
    
    # See discussion on https://github.com/nltk/nltk/pull/1437
    # Adding to TreebankWordTokenizer, the splits on
    # - chevron quotes u'\xab' and u'\xbb' .
    # - unicode quotes u'\u2018', u'\u2019', u'\u201c' and u'\u201d'
    
    improved_open_quote_regex = re.compile(u'([«“‘„]|[`]+|[\']+)', re.U)
    improved_close_quote_regex = re.compile(u'([»”’])', re.U)
    improved_punct_regex = re.compile(r'([^\.])(\.)([\]\)}>"\'' u'»”’ ' r']*)\s*$', re.U)
    _treebank_word_tokenizer.STARTING_QUOTES.insert(0, (improved_open_quote_regex, r' \1 '))
    _treebank_word_tokenizer.ENDING_QUOTES.insert(0, (improved_close_quote_regex, r' \1 '))
    _treebank_word_tokenizer.PUNCTUATION.insert(0, (improved_punct_regex, r'\1 \2 \3 '))
    
    _treebank_word_tokenizer.tokenize("'v'")
    

    [out]:

    ["'", 'v', "'"]
    

    Yes, the modification would work for the string in the OP, but it'll start to break all the clitics, e.g.

    >>> print(_treebank_word_tokenizer.tokenize("'v', I've been fooled but I'll seek revenge."))
    ["'", 'v', "'", ',', 'I', "'", 've', 'been', 'fooled', 'but', 'I', "'", 'll', 'seek', 'revenge', '.']
    

    Note that the original nltk.word_tokenize() keeps the starting single quote attached to the clitics and outputs this instead:

    >>> print(nltk.word_tokenize("'v', I've been fooled but I'll seek revenge."))
    ["'v", "'", ',', 'I', "'ve", 'been', 'fooled', 'but', 'I', "'ll", 'seek', 'revenge', '.']
    

    There are strategies to handle the ending quotes, but not the starting quotes after clitics, at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L268

    But the main reason for this "problem" is that the word tokenizer doesn't have a sense of balancing quotation marks. If we look at the MosesTokenizer, there are a lot more mechanisms to handle quotes.
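
    To see this concretely, here is a small sketch (my own illustration, not part of the original answer) that prints the quote rules the Treebank tokenizer applies; each rule is a (compiled regex, replacement) pair, and none of them track whether a quote has already been opened:

    # Sketch: dump the quote-handling substitution rules of TreebankWordTokenizer.
    from nltk.tokenize.treebank import TreebankWordTokenizer

    tb = TreebankWordTokenizer()
    for regexp, substitution in tb.STARTING_QUOTES + tb.ENDING_QUOTES:
        print(regexp.pattern, '->', substitution)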


    Interestingly, Stanford CoreNLP doesn't do that.

    In terminal:

    wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
    unzip stanford-corenlp-full-2016-10-31.zip && cd stanford-corenlp-full-2016-10-31
    
    java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
    -preload tokenize,ssplit,pos,lemma,parse,depparse \
    -status_port 9000 -port 9000 -timeout 15000
    

    Python:

    >>> from nltk.parse.corenlp import CoreNLPParser
    >>> parser = CoreNLPParser()
    >>> parser.tokenize("'v'")
    <generator object GenericCoreNLPParser.tokenize at 0x1148f9af0>
    >>> list(parser.tokenize("'v'"))
    ["'", 'v', "'"]
    >>> list(parser.tokenize("I've"))
    ['I', "'", 've']
    >>> list(parser.tokenize("I've'"))
    ['I', "'ve", "'"]
    >>> list(parser.tokenize("I'lk'"))
    ['I', "'", 'lk', "'"]
    >>> list(parser.tokenize("I'lk"))
    ['I', "'", 'lk']
    >>> list(parser.tokenize("I'll"))
    ['I', "'", 'll']
    

    Looks like there's some sort of regex hack put in to recognize/correct the English clitics.

    If we do some reverse engineering:

    >>> list(parser.tokenize("'re"))
    ["'", 're']
    >>> list(parser.tokenize("you're"))
    ['you', "'", 're']
    >>> list(parser.tokenize("you're'"))
    ['you', "'re", "'"]
    >>> list(parser.tokenize("you 're'"))
    ['you', "'re", "'"]
    >>> list(parser.tokenize("you the 're'"))
    ['you', 'the', "'re", "'"]
    

    It's possible to add a regex to patch word_tokenize, e.g.

    >>> import re
    >>> pattern = re.compile(r"(?i)(\')(?!ve|ll|t)(\w)\b")
    >>> x = "I'll be going home I've the 'v ' isn't want I want to split but I want to catch tokens like 'v and 'w ' ."
    >>> pattern.sub(r'\1 \2', x)
    "I'll be going home I've the ' v ' isn't want I want to split but I want to catch tokens like ' v and ' w ' ."
    >>> x = "I 'll be going home I 've the 'v ' isn't want I want to split but I want to catch tokens like 'v and 'w ' ."
    >>> pattern.sub(r'\1 \2', x)
    "I 'll be going home I 've the ' v ' isn't want I want to split but I want to catch tokens like ' v and ' w ' ."
    

    So we can do something like:

    import re
    from nltk.tokenize import sent_tokenize
    from nltk.tokenize.treebank import TreebankWordTokenizer
    
    # Standard word tokenizer.
    _treebank_word_tokenizer = TreebankWordTokenizer()
    
    # See discussion on https://github.com/nltk/nltk/pull/1437
    # Adding to TreebankWordTokenizer, the splits on
    # - chevron quotes u'\xab' and u'\xbb' .
    # - unicode quotes u'\u2018', u'\u2019', u'\u201c' and u'\u201d'
    
    improved_open_quote_regex = re.compile(u'([«“‘„]|[`]+)', re.U)
    improved_open_single_quote_regex = re.compile(r"(?i)(\')(?!re|ve|ll|m|t|s|d)(\w)\b", re.U)
    improved_close_quote_regex = re.compile(u'([»”’])', re.U)
    improved_punct_regex = re.compile(r'([^\.])(\.)([\]\)}>"\'' u'»”’ ' r']*)\s*$', re.U)
    _treebank_word_tokenizer.STARTING_QUOTES.insert(0, (improved_open_quote_regex, r' \1 '))
    _treebank_word_tokenizer.STARTING_QUOTES.append((improved_open_single_quote_regex, r'\1 \2'))
    _treebank_word_tokenizer.ENDING_QUOTES.insert(0, (improved_close_quote_regex, r' \1 '))
    _treebank_word_tokenizer.PUNCTUATION.insert(0, (improved_punct_regex, r'\1 \2 \3 '))
    
    def word_tokenize(text, language='english', preserve_line=False):
        """
        Return a tokenized copy of *text*,
        using NLTK's recommended word tokenizer
        (currently an improved :class:`.TreebankWordTokenizer`
        along with :class:`.PunktSentenceTokenizer`
        for the specified language).

        :param text: text to split into words
        :type text: str
        :param language: the model name in the Punkt corpus
        :type language: str
        :param preserve_line: An option to keep the sentence intact and not sentence-tokenize it.
        :type preserve_line: bool
        """
        sentences = [text] if preserve_line else sent_tokenize(text, language)
        return [token for sent in sentences
                for token in _treebank_word_tokenizer.tokenize(sent)]
    

    [out]:

    >>> print(word_tokenize("The 'v', I've been fooled but I'll seek revenge."))
    ['The', "'", 'v', "'", ',', 'I', "'ve", 'been', 'fooled', 'but', 'I', "'ll", 'seek', 'revenge', '.']
    >>> word_tokenize("'v' 're'")
    ["'", 'v', "'", "'re", "'"]
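
    As a final sanity check, here is a small sketch (assuming the patched word_tokenize() defined above is in scope) showing that ordinary clitics should still come out the way the stock tokenizer produces them, since the added regex deliberately excludes 're, 've, 'll, 'm, 't, 's and 'd:

    # Sketch: clitics such as 'm, 'll and n't should keep their apostrophes
    # attached as usual, while the stray 'v is split into "'" and "v".
    for s in ["I'm here", "they'll come", "it isn't 'v'"]:
        print(word_tokenize(s))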
    
