I talked things over with Jason and Paul, and we'd like to go ahead with the project. They did give me a few additional changes they'd like to make, though. I don't think these will change the scope much, but let me know if they do.
1. Implement deduping, either during index creation/update or in the query results. I looked into this briefly and found the dedupe updateRequestProcessorChain. I tried to use it during a delta update, but it didn't work. We'd like to use the TextProfileSignature (fuzzy) option on the brandName, productName and description fields. The goal is to eliminate dupes within the same brand, meaning products whose name and description are either exactly the same or very similar.
2. We're seeing results that are actually the opposite of the problem I described earlier, where it seemed that all search terms were required in the results. Now it seems that we're getting results that don't include all of the search terms but are ranked highly because they match very well on one of them. For instance, searching on "Cubs hats" turns up several Cubs items that aren't hats, even though there are easily enough Cubs hats in the database to fill our maximum number of results. We'd like to modify the search so that items matching all of the search terms are ranked higher, while items matching only some of the terms are still returned but ranked lower.
3. Paul found an example where a specific search returned one item, but when he added "- mens" to the end of it, he got nothing, even though "mens" was already a term in the original search. So it appears that the hyphen is throwing off the search somehow. We could just strip hyphens from all of our search terms, but we're wondering whether this is a symptom of some broader Solr behavior, in which case a global fix would be better than finding and removing characters like this one by one.
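To make the goal in item 1 concrete, here is a rough Python sketch of the "query results" option: collapsing results that share a brand and have the same words in their name and description. This is only an illustration of the behavior we're after, not Solr's TextProfileSignature. The normalization rules and the function names are my own assumptions, and it assumes each field comes back as a plain string.

```python
import hashlib
import re


def signature(doc):
    """Hash of the brand plus the sorted, normalised words of name and description."""
    brand = (doc.get("brandName") or "").strip().lower()
    text = f'{doc.get("productName") or ""} {doc.get("description") or ""}'.lower()
    # Drop apostrophes so "Men's" and "Mens" agree, then keep alphanumeric runs only.
    words = re.findall(r"[a-z0-9]+", text.replace("'", ""))
    # Ignore one-letter tokens, word order and repeated words (an assumed notion of "similar").
    tokens = sorted({w for w in words if len(w) > 1})
    key = brand + "|" + " ".join(tokens)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def dedupe(docs):
    """Keep the first document for each signature, preserving result order."""
    seen = set()
    unique = []
    for doc in docs:
        sig = signature(doc)
        if sig not in seen:
            seen.add(sig)
            unique.append(doc)
    return unique
```

With this, two items from the same brand that differ only in case, punctuation or word order would collapse into one, while the same name and description under a different brand would be kept.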
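For item 2, this is roughly the kind of request I imagine, sketched in Python. It assumes the edismax query parser, with `mm` set so that only one term has to match and a `bq` boost query that adds score when every term appears in productName. The searched fields, the boost value of 10, the row count and the `build_params` helper are all guesses on my part, so please treat it as a starting point rather than the way it has to be done.

```python
from urllib.parse import urlencode


def build_params(user_query, rows=20):
    """Build Solr request parameters for a 'match any term, prefer all terms' search."""
    # Assumes the user's input is plain words with no query operators in it.
    terms = user_query.split()
    params = {
        "defType": "edismax",
        "q": user_query,
        "qf": "productName description",  # fields to search (assumed)
        "mm": "1",                         # at least one term must match
        "rows": str(rows),                 # hypothetical result limit
    }
    if len(terms) > 1:
        # Extra score for documents whose product name contains every term.
        all_terms = " AND ".join(terms)
        params["bq"] = f"productName:({all_terms})^10"  # boost of 10 is a guess
    return params


# Encoded form, ready to append to the select URL.
query_string = urlencode(build_params("Cubs hats"))
```

For "Cubs hats" this would still return Cubs items that aren't hats, but anything with both words in the product name should get pushed toward the top.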
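For item 3, one global option might be to escape every character that the Lucene query syntax reserves before the search string is sent to Solr. That syntax treats characters such as + and - as operators (a leading - excludes a term), which may be what is happening with "- mens". Here is a minimal Python sketch. The character list follows my reading of the Lucene query syntax documentation, and the function name is just borrowed from the SolrJ helper of the same purpose. I haven't checked how the analyzer treats a lone escaped hyphen afterwards.

```python
# Characters the Lucene query syntax reserves; & and | are escaped individually.
SPECIAL_CHARS = set('+-&|!(){}[]^"~*?:\\/')


def escape_query_chars(text):
    """Backslash-escape reserved query characters, leaving whitespace alone."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)
```

For example, `escape_query_chars("cubs hat - mens")` gives `cubs hat \- mens`, so the hyphen would be passed along as literal text instead of as an operator.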