On Wed, 2003-01-29 at 10:19, Brion Vibber wrote:
Our search engine desperately needs retooling. If there's no objection
from those in the know, I'd like to migrate us to MySQL 4. The fulltext
search in 4 has boolean capabilities built right in, meaning we could
remove our hackish and buggy parser, and wouldn't need to stack so many
MATCHes together in a query when some poor sap types in "chemical
composition of the earth's atmosphere oxygen nitrogen" or something.
As a tempfix, we could match against 'phrase' for any phrase that
doesn't contain OR or NOT, no?
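
A rough sketch of that temporary fix, in Python rather than our PHP. The column names (si_page, si_text) and the function name are made up for illustration, and treating "or"/"not" case-insensitively is an assumption; the real parser may differ.

```python
# Hypothetical table and column names, for illustration only.
TABLE = "searchindex"
PAGE_COLUMN = "si_page"
TEXT_COLUMN = "si_text"


def build_search_sql(user_query):
    """Return (sql, params) for a query without boolean operators.

    The whole phrase goes into a single MATCH ... AGAINST instead of one
    MATCH per word. Returns None when the query contains OR or NOT, so the
    caller can fall back to the existing boolean parser.
    """
    words = user_query.split()
    # Assumption: operators are recognized regardless of case.
    if any(word.upper() in ("OR", "NOT") for word in words):
        return None
    sql = (
        f"SELECT {PAGE_COLUMN} FROM {TABLE} "
        f"WHERE MATCH ({TEXT_COLUMN}) AGAINST (%s)"
    )
    # The phrase is passed as a bound parameter, never pasted into the SQL.
    return sql, (" ".join(words),)
```

A query such as "oxygen OR nitrogen" returns None here and would still go through the old parser.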
(Our search queries are also frequently *dog slow*. This is exacerbated
because, being a myisam table, it locks when someone tries to write it
and another read is pending. I don't _think_ this lock virulently
spreads to other tables joined with it, but it's annoying anyway.)
If Jimbo has some money to spend, he should give it to InnoDB and ask
them to implement the damn FULLTEXT index:
http://www.innodb.com/todo.html
Failing that, we might think about delaying index updates. Ugly, though.
Also, split up the join as we discussed. If we're really freaky, we
could move the searchindex to a separate PostgreSQL database, perhaps as
part of the phase IV (or was it V?) transition.
* Stopwords. Can we just get rid of the damn stopwords and search
anything?
Absolutely in favor!
* "Title results" vs "Text results" - this two-prong approach is, I
think, rather confusing. We could have a single search index field with
the title text weighted more heavily (by repetition?), and just give a
single set of results.
Not sure, I always liked the distinction. Has anyone complained about
this?
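
For reference, the "weighting by repetition" idea could look roughly like this. The function name and the default weight of 3 are made up, and whether repetition actually raises relevance enough in the fulltext index is untested here.

```python
def build_index_text(title, body, title_weight=3):
    """Combine title and body into one search field.

    The title is repeated title_weight times ahead of the body so that a
    term-frequency based ranking counts title words more often.
    """
    if title_weight < 1:
        raise ValueError("title_weight must be at least 1")
    return " ".join([title] * title_weight + [body])
```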
* Text extracts: these show the raw wikicode, and often include language
links, HTML code, etc. Yuck! If we can strip these, that might be good.
Yes!
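
A minimal sketch of such stripping, assuming a handful of regular expressions is good enough for extracts. It is not a real wikicode parser, the function name is hypothetical, and guessing interlanguage links from a two- or three-letter prefix is an assumption.

```python
import re


def strip_wikicode(text):
    """Roughly reduce wiki markup to plain text for a search extract."""
    # Drop interlanguage links such as [[de:Sauerstoff]]
    # (assumed to have a two- or three-letter lowercase prefix).
    text = re.sub(r"\[\[[a-z]{2,3}:[^\]|]*\]\]", "", text)
    # [[target|label]] becomes label.
    text = re.sub(r"\[\[[^\]|]*\|([^\]]*)\]\]", r"\1", text)
    # [[target]] becomes target.
    text = re.sub(r"\[\[([^\]]*)\]\]", r"\1", text)
    # Remove HTML tags but keep the text between them.
    text = re.sub(r"<[^>]+>", "", text)
    # Remove runs of two or more apostrophes (italic and bold markup).
    text = re.sub(r"'{2,}", "", text)
    # Collapse the whitespace left behind.
    return re.sub(r"\s+", " ", text).strip()
```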
* Character entities: should be folded to their raw equivalents in the
search index, so searching a page containing "Schr&ouml;dinger" and one
containing "Schrödinger" gives identical results.
Right.
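
As an illustration only, in Python instead of our PHP: the standard library's html.unescape does this kind of folding. The wrapper name is hypothetical, and the same step would have to be applied to both the indexed text and the user's query.

```python
import html


def fold_entities(text):
    """Replace named and numeric character entities with raw characters."""
    return html.unescape(text)


# Both spellings end up as the same indexed string.
assert fold_entities("Schr&ouml;dinger") == fold_entities("Schrödinger")
```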
* 'Power search' is perhaps a little confusing, and there's currently no
way to get to it short of doing two searches.
* 'Search' and 'go' buttons are not clearly demarcated; several people
have noted confusion. Better labelling or better arrangement is needed.
I'm afraid that in the limited space we have, we can't really do much
better. "Go" is fairly obvious when you use it, and I can't think of a
better label. With the new matching (namespace handling could be
improved), it's really darn useful.
We might want to add a small "Advanced search" link below, in another
column of the row where the interlanguage links are shown.
* Redirects. We generally want to filter out redirects that seem
duplicative of other things already listed, but *must* show them for
alternate names. Clearer labeling of redirects would help as well.
Well, I thought about a syntax like
#redirect [[foo]] (reason)
We could then show this nicely in the search results as
"Redirects to page foo. Reason: spelling error."
Also, on the actual page
"Redirected from bar. Reason: spelling error."
However, by allowing freetext here, we will get lots of different
non-standardized texts, which is bad. I'd rather have some standard
texts defined in Language.php and have these referenced with shorthands
like
"sp" - spelling error
"old" - older spelling
"tra" - naming convention: anglicization/transliteration
"acr" - naming convention: acronyms
"plu" - naming convention: pluralization
"com" - naming convention: common name
"nam" - naming convention: names and titles
"sty" - naming convention: style, general
"dis" - disambiguation
"ndis" - unique title, no disambiguation needed
These labels should always be as specific as possible, i.e. not just
"alternative title", but refer to the correct naming convention. The
texts could, in fact, link to the proper Wikipedia articles. This would
help readers understand why we are redirecting where, expose more people
to our policies, and allow better presentation of search results. We
could define for each of them whether they should be included in the
search or not (I think that "nam" and "dis" should not be included.)
These labels would not be hardcoded anywhere but in LanguageXY.php, i.e.
they would not be auto-inherited by other languages. So every Wikipedia
could set its own policies and shortcuts.
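
To make the proposal concrete, here is a rough sketch in Python (the real thing would be PHP, with the table living in LanguageXY.php). The function names, the regular expression, and the way search exclusion is stored are all assumptions for illustration.

```python
import re

# Shorthand -> standard text; per language, not inherited.
REDIRECT_REASONS = {
    "sp": "spelling error",
    "old": "older spelling",
    "tra": "naming convention: anglicization/transliteration",
    "acr": "naming convention: acronyms",
    "plu": "naming convention: pluralization",
    "com": "naming convention: common name",
    "nam": "naming convention: names and titles",
    "sty": "naming convention: style, general",
    "dis": "disambiguation",
    "ndis": "unique title, no disambiguation needed",
}

# Shorthands whose redirects are left out of search results
# (assumed to be a per-language setting).
EXCLUDED_FROM_SEARCH = {"nam", "dis"}

# Matches "#redirect [[foo]]" with an optional "(shorthand)" after it.
REDIRECT_RE = re.compile(
    r"^#redirect\s*\[\[([^\]]+)\]\]\s*(?:\((\w+)\))?\s*$", re.IGNORECASE
)


def parse_redirect(wikitext):
    """Return (target, shorthand or None), or None if not a redirect."""
    match = REDIRECT_RE.match(wikitext.strip())
    if match is None:
        return None
    return match.group(1).strip(), match.group(2)


def describe_redirect(wikitext):
    """Build the line shown in search results for a redirect page."""
    parsed = parse_redirect(wikitext)
    if parsed is None:
        return None
    target, code = parsed
    line = f"Redirects to page {target}."
    reason = REDIRECT_REASONS.get(code)
    if reason is not None:
        line += f" Reason: {reason}."
    return line


def shown_in_search(code):
    """True unless the shorthand is configured as excluded from search."""
    return code not in EXCLUDED_FROM_SEARCH
```

With this, "#redirect [[foo]] (sp)" yields "Redirects to page foo. Reason: spelling error.", and an unknown or missing shorthand simply drops the reason.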
Regards,
Erik
--
FOKUS - Fraunhofer Institute for Open Communication Systems
Project BerliOS - http://www.berlios.de