Each line of the file consists of the symbol '(:s', followed by the weight of the rule (I think it's the number of source sentences...), some metadata about the rule (pointers to source sentences in a separate corpus -- they're corrupted, so you can basically ignore them), an EL formula, and an auto-generated English sentence that corresponds to the formula. My simplification script got lost, but here are a few Python snippets you can use to easily access the data you want.
import re

line = [a line from Lore KB]
# Groups: 1 = weight, 2 = metadata, 3 = EL formula, 4 = English sentence
KB_LINE_MATCH = r"\(:s (\d+) ([^()]+) (\(.+\)) \"(.+)\""
m = re.match(KB_LINE_MATCH, line)
weight = int(m.group(1))
# from_lispstr is a helper that parses a Lisp string (not defined in these snippets)
formula = from_lispstr(m.group(3))
natural = m.group(4)
You can use the snippets directly if you're using Python, or use them to write the extracted fields into a formatted file that's easy to read from whatever other language you're using.
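Here is a minimal, self-contained sketch of that pipeline, from a KB line to a JSON record that other languages can read. Everything beyond the regex is an assumption: `parse_sexpr` is a hypothetical stand-in for the lost `from_lispstr` (assuming it turns a Lisp string into nested lists), `parse_kb_line` and `to_json_line` are illustrative names, and the sample line is made up to fit the format described above, not taken from Lore.

```python
import json
import re

KB_LINE_MATCH = r"\(:s (\d+) ([^()]+) (\(.+\)) \"(.+)\""

# Made-up line that follows the described layout (not real Lore data).
SAMPLE_LINE = '(:s 12 meta-info (many-or-some x (x dog.n) (x bark.v)) "MANY OR SOME DOGS BARK.")'


def parse_sexpr(text):
    """Parse one Lisp-style expression into nested lists of string atoms.

    Hypothetical stand-in for from_lispstr. Assumes atoms never contain
    whitespace or parentheses (e.g. no quoted strings inside formulas).
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    stack = [[]]
    for tok in tokens:
        if tok == "(":
            stack.append([])
        elif tok == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            finished = stack.pop()
            stack[-1].append(finished)
        else:
            stack[-1].append(tok)
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    if len(stack[0]) != 1:
        raise ValueError("expected exactly one top-level expression")
    return stack[0][0]


def parse_kb_line(line):
    """Return a dict with weight, formula and natural, or None if the line does not match."""
    m = re.match(KB_LINE_MATCH, line)
    if m is None:
        return None
    return {
        "weight": int(m.group(1)),
        "formula": parse_sexpr(m.group(3)),
        "natural": m.group(4),
    }


def to_json_line(record):
    """Serialize one record as a single JSON line (nested lists stay nested)."""
    return json.dumps(record)


if __name__ == "__main__":
    print(to_json_line(parse_kb_line(SAMPLE_LINE)))
```

JSON is just one choice of "formatted file"; it keeps the nesting of the formula, so the reader on the other side does not need its own Lisp parser.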
I would recommend building your system by taking a small prefix subset to develop on -- 'head -n 100 [source filename] > [target filename]' will write the first 100 lines to [target filename]. Also, the formulas are ordered by importance, so a prefix covers the most important cases first.
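If 'head' isn't available (e.g. on Windows), the same prefix can be taken in Python. This is only a sketch: `write_prefix` is an illustrative name, and the UTF-8 encoding is an assumption about the file.

```python
from itertools import islice


def write_prefix(source_path, target_path, n=100):
    """Copy the first n lines of source_path to target_path.

    Returns the number of lines actually written (fewer than n if the
    source file is shorter). UTF-8 is assumed for both files.
    """
    count = 0
    with open(source_path, encoding="utf-8") as src, \
            open(target_path, "w", encoding="utf-8") as dst:
        # islice stops after n lines without reading the rest of the file.
        for line in islice(src, n):
            dst.write(line)
            count += 1
    return count
```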
Some of the formulas have operators that start with a colon (e.g. :f, :i, etc.). ":i", ":f", ":p", ":q", and ":l" can be thought of as meaning, respectively, "sentence in infix form" (i.e., with the first argument preceding the predicate), "function application", "predicate application", "unscoped quantifier application", and "lambda abstraction". I would start by ignoring those formulas, but ideally you'll eventually handle them as well. They're an alternative representation of ULF that's a bit more explicit about the types (easy for Lisp processing) -- note that quantifiers aren't scoped in this representation, though they are in the other formulas. Also, the formulas are in full EL, not ULF, and the conventions are a bit different. Here is a list of the most obvious differences I notice; please let me know if anything remains unclear about the logical forms:
- quantifiers have no suffix
- quantifiers are scoped and the quantified variable is explicit: format (<quantifier> <variable> <restrictor> <body>). The restrictor and body correspond to generalized quantifier theory; if you know about that, it might be helpful. Otherwise, they have natural language-like interpretations. [not for formulas with colon operators]
- predicative statements are in infix notation
- '** e' marks the episode that is being characterized. Different 'e's (e.g. e1, e2, etc.) can be used to state relationships between episodes in EL. I *think* every formula in Lore only has a single such variable asserted though.
- the 'pair' operator relates an actor with an episode to make an action (or more precisely some relation between an agent and an episode), which can be talked about separately from the episode itself.
- many operators and predicates are formed from multiple lexical items (e.g. many-or-some, day_of_rest.v, have-as-part.v, etc.)
- some operators are marked by a word sense before the suffix (e.g. person1.n, have7.v). I'm pretty sure these are WordNet word senses.
- possession is marked with the 'have-as' operator (e.g. (x (have-as mother.n) y)). This roughly corresponds to (y = (the.d ((poss-by x) (mother-of.n *s)))) in the current ULF annotation guidelines. In fully disambiguated EL it's simply [y mother-of.n x].
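A few of the conventions above can be checked mechanically once a formula has been parsed. The sketch below assumes the formula is already a nested Python list of string atoms (e.g. ['many-or-some', 'x', ['x', 'dog.n'], ['x', 'bark.v']]). All function names are illustrative, and the checks are shape-based heuristics, not a real EL parser: in particular, any 4-element list whose first two items are atoms is treated as a quantified formula, and a lemma that itself ends in a digit would be split wrongly.

```python
import re

# lemma, optional sense number, then a dot and a type suffix (e.g. person1.n)
ATOM_PATTERN = re.compile(r"^(.+?)(\d+)?\.([a-z-]+)$")


def uses_colon_ops(formula):
    """True if any atom in the nested formula starts with ':' (:i, :f, :p, ...)."""
    if isinstance(formula, str):
        return formula.startswith(":")
    return any(uses_colon_ops(part) for part in formula)


def split_quantified(formula):
    """Return (quantifier, variable, restrictor, body), or None.

    Heuristic: only the shape (<atom> <atom> <x> <y>) is checked, so this
    can misfire on other 4-element formulas.
    """
    if (isinstance(formula, list) and len(formula) == 4
            and isinstance(formula[0], str) and isinstance(formula[1], str)):
        return tuple(formula)
    return None


def split_atom(atom):
    """Split 'person1.n' into ('person', 1, 'n').

    Atoms without a '.suffix' (variables, many-or-some, **) come back
    unchanged as (atom, None, None).
    """
    m = ATOM_PATTERN.match(atom)
    if m is None:
        return (atom, None, None)
    lemma, sense, suffix = m.groups()
    return (lemma, int(sense) if sense else None, suffix)
```

For a first pass, `uses_colon_ops` can be used to skip the colon-operator formulas and `split_quantified` to pull apart the rest; for example, `split_atom('have7.v')` gives `('have', 7, 'v')` and `split_atom('day_of_rest.v')` gives `('day_of_rest', None, 'v')`.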
You can read the first 8 pages or so of [url removed, login to view]~schubert/papers/[url removed, login to view] to understand disambiguated EL (maybe I've already sent this to you). It may not be necessary if the formulas make sense to you.
Also, here is a dataset of noun hierarchy axioms generated by the same guy who made Lore [url removed, login to view]