Enter a sentence in Dutch, English, German, or French (auto-detected). The sentence will be parsed and the most probable parse tree will be shown.
The Data-Oriented Parsing (DOP) framework entails constructing analyses from fragments of past experience. Double-DOP operationalizes this with the subset of fragments that occur at least twice in the training data. This demo also incorporates discontinuous constituents as part of the model; Linear Context-Free Rewriting Systems (LCFRS) allow for parsing with discontinuous constituents. For efficiency, sentences are parsed with the following coarse-to-fine pipeline:
- Split-PCFG (prune items with posterior probability < 1e-5)
- PLCFRS (prune items not in 50-best derivations)
- Discontinuous Double-DOP (use 1000-best derivations to approximate most probable parse)
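The pruning steps in this pipeline can be sketched as follows. This is a minimal illustration and not the disco-dop implementation: the chart representation (a dict from `(label, start, end)` items to posterior probabilities) and all function names are hypothetical, and the mapping between fine and coarse labels is left out.

```python
def prune_by_posterior(posteriors, threshold=1e-5):
    """Keep chart items whose posterior probability is at least `threshold`.

    `posteriors` is a hypothetical dict: (label, start, end) -> probability.
    """
    return {item for item, prob in posteriors.items() if prob >= threshold}


def items_in_kbest(derivations):
    """Collect all chart items that occur in a list of k-best derivations.

    Each derivation is assumed to be an iterable of (label, start, end) items.
    """
    whitelist = set()
    for derivation in derivations:
        whitelist.update(derivation)
    return whitelist


# Stage 1: items from the coarse PCFG chart with their posteriors (toy values).
coarse_posteriors = {
    ("NP", 0, 2): 0.93,
    ("VP", 2, 5): 0.88,
    ("PP", 1, 3): 2e-7,  # below the threshold, so it is pruned
}
pcfg_whitelist = prune_by_posterior(coarse_posteriors)

# Stage 2: only items that occur in the k-best derivations survive.
kbest = [
    [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5)],
    [("S", 0, 5), ("NP", 0, 1), ("VP", 1, 5)],
]
plcfrs_whitelist = items_in_kbest(kbest)
```

Each finer stage then only builds chart items that appear in the whitelist produced by the stage before it, which is where the efficiency gain comes from.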
Training data:
- English: WSJ section of Penn treebank
- German: Tiger treebank
- Dutch: Lassy treebank
- French: French treebank
Objective functions:
- MPP: most probable parse
- MPD: most probable derivation
- MPSD: most probable shortest derivation
- SL-DOP: shortest derivation among n most probable parse trees (n=7)
- SL-DOP-simple: shortest derivation among derivations of n most probable parse trees (n=7; approximation)
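The difference between these options can be illustrated on an explicit list of derivations. The sketch below is an assumption-laden toy, not the parser's code: derivations are given as hypothetical `(tree, number_of_fragments, probability)` tuples, the shortest derivation of a tree is taken from that list only (which makes the last function closest in spirit to SL-DOP-simple), and breaking ties by probability is a choice made here for illustration.

```python
from collections import defaultdict


def most_probable_derivation(derivations):
    """MPD: the tree of the single derivation with the highest probability."""
    return max(derivations, key=lambda d: d[2])[0]


def most_probable_parse(derivations):
    """MPP (approximation): sum derivation probabilities per tree, take the best."""
    totals = defaultdict(float)
    for tree, _, prob in derivations:
        totals[tree] += prob
    return max(totals, key=totals.get)


def most_probable_shortest_derivation(derivations):
    """MPSD: fewest fragments first, highest probability as tie-breaker."""
    return min(derivations, key=lambda d: (d[1], -d[2]))[0]


def shortest_among_n_best_parses(derivations, n=7):
    """Among the n most probable trees, pick the one with the shortest derivation."""
    totals = defaultdict(float)
    shortest = {}
    for tree, length, prob in derivations:
        totals[tree] += prob
        shortest[tree] = min(length, shortest.get(tree, length))
    top = sorted(totals, key=totals.get, reverse=True)[:n]
    return min(top, key=lambda t: (shortest[t], -totals[t]))


# Toy input: tree A has two derivations, tree B has one.
A = "(S (NP a) (VP b))"
B = "(S (X a b))"
derivations = [(A, 3, 0.20), (A, 2, 0.15), (B, 1, 0.30)]

mpd = most_probable_derivation(derivations)
mpp = most_probable_parse(derivations)
mpsd = most_probable_shortest_derivation(derivations)
sl = shortest_among_n_best_parses(derivations, n=7)
```

In this toy input the MPD and MPP disagree: tree B has the single most probable derivation, while tree A collects more probability mass once its derivations are summed.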
Estimators:
- RFE: Relative Frequency Estimate
- EWE: Equal Weights Estimate
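As an illustration of the RFE option, the relative frequency of a fragment is its count divided by the total count of all fragments with the same root label. The sketch below assumes a hypothetical input format, a dict from `(root_label, fragment)` to count, and does not cover EWE.

```python
from collections import defaultdict


def relative_frequency_estimate(fragment_counts):
    """Return P(fragment | root) = count(fragment) / total count for that root.

    `fragment_counts` is a hypothetical dict: (root_label, fragment) -> count.
    """
    root_totals = defaultdict(int)
    for (root, _), count in fragment_counts.items():
        root_totals[root] += count
    return {
        key: count / root_totals[key[0]]
        for key, count in fragment_counts.items()
    }


# Toy fragment counts (made-up values for illustration).
counts = {
    ("NP", "(NP (DT the) (NN dog))"): 3,
    ("NP", "(NP DT NN)"): 1,
    ("VP", "(VP VBZ NP)"): 2,
}
probs = relative_frequency_estimate(counts)
```

With these counts the two NP fragments receive probabilities 0.75 and 0.25, and the only VP fragment receives 1.0, so the probabilities of fragments sharing a root sum to one.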
Marginalization:
- n-best: find the n most probable derivations
- sample: sample derivations according to their probability distribution
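The sample option can be illustrated with a small Monte Carlo sketch: derivations are drawn in proportion to their probability, and the probability of each parse tree is estimated by how often it is drawn. This is a simplification under stated assumptions; it samples from an explicit list of hypothetical `(tree, probability)` pairs rather than from a parse chart, and the function name is made up.

```python
import random
from collections import Counter


def sample_parse_distribution(derivations, num_samples=1000, seed=0):
    """Estimate parse probabilities from sampled derivations.

    `derivations` is a hypothetical list of (tree, probability) pairs;
    several derivations may yield the same tree.
    """
    rng = random.Random(seed)
    trees = [tree for tree, _ in derivations]
    weights = [prob for _, prob in derivations]
    # Draw derivations proportionally to their probability.
    sampled = rng.choices(trees, weights=weights, k=num_samples)
    counts = Counter(sampled)
    return {tree: count / num_samples for tree, count in counts.items()}


derivations = [
    ("(S (NP a) (VP b))", 0.20),
    ("(S (NP a) (VP b))", 0.15),
    ("(S (X a b))", 0.30),
]
estimate = sample_parse_distribution(derivations)
best_tree = max(estimate, key=estimate.get)
```

The n-best option replaces the random draw with the n most probable derivations and sums their probabilities per tree instead of counting samples.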
Coarse stage parser:
- CKY: Standard CKY parser
- posterior: Prune with posterior probabilities
- bitpar: Use the bitpar parser (max 1000 derivations)
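For readers unfamiliar with the CKY option, a minimal Viterbi CKY sketch for a binarized PCFG is given below. It is a textbook version written for illustration, not the parser used in this demo; the grammar format (a dict of lexical rules and a list of binary rules) is an assumption, and unary rules are not handled.

```python
from collections import defaultdict


def cky_viterbi(words, lexical, binary, start="S"):
    """Return the probability of the best parse rooted in `start` (0.0 if none).

    lexical: dict word -> list of (label, prob)
    binary:  list of (parent, left, right, prob)
    """
    n = len(words)
    chart = defaultdict(dict)  # (begin, end) -> {label: best probability}
    # Fill in the spans of width 1 from the lexical rules.
    for i, word in enumerate(words):
        cell = chart[i, i + 1]
        for label, prob in lexical.get(word, []):
            if prob > cell.get(label, 0.0):
                cell[label] = prob
    # Combine adjacent spans bottom-up, from narrow to wide.
    for width in range(2, n + 1):
        for begin in range(n - width + 1):
            end = begin + width
            cell = chart[begin, end]
            for mid in range(begin + 1, end):
                left_cell = chart[begin, mid]
                right_cell = chart[mid, end]
                for parent, left, right, prob in binary:
                    if left in left_cell and right in right_cell:
                        score = prob * left_cell[left] * right_cell[right]
                        if score > cell.get(parent, 0.0):
                            cell[parent] = score
    return chart[0, n].get(start, 0.0)


# Toy grammar (made-up rules and probabilities).
lexical = {"the": [("DT", 1.0)], "dog": [("NN", 1.0)], "barks": [("VP", 1.0)]}
binary = [("NP", "DT", "NN", 0.5), ("S", "NP", "VP", 1.0)]
best = cky_viterbi(["the", "dog", "barks"], lexical, binary)
```

A full parser would also store backpointers in each cell so that the best tree, and not only its probability, can be recovered.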
The source code is available at http://github.com/andreasvc/disco-dop/ and documented at http://andreasvc.github.io/discodop/
References:
- English, German, and Dutch parser: Andreas van Cranenburgh, Rens Bod (2013). Discontinuous Parsing with an Efficient and Accurate DOP Model. Proc. of IWPT. http://acl.cs.qc.edu/iwpt2013/proceedings/Splits/9_pdfsam_IWPTproceedings.pdf
- French parser: Federico Sangati, Andreas van Cranenburgh (2015). Multiword Expression Identification with Recurring Tree Fragments and Association Measures. Proc. of the 11th Workshop on Multiword Expressions, pp. 10-18. http://aclweb.org/anthology/W15-0902