Forums | MacLife
#1 2009-06-07 7:19 am
Request for comment - extending DTD for search indexing
I'm currently working on a rewrite of the open source PHP search engine Sphider.
The main goals of the rewrite:
1) Give it an option to use a sitemap.xml, either to build the initial list of pages to index or as the complete list of pages to index. (I'll have it just check the root-level robots.txt file and only use sitemap files declared there.)
2) Port the backend indexer to parse documents with DOMDocument.
3) Port it to use either a flat-file XML or a database for customizations to the configuration. Right now the configuration file, which is parsed as PHP, has to be writable by the web server in order for the admin interface to change some settings, and that's bad. I'm not sure whether storing custom values in XML or in a database would be faster. I suspect the database would be, but the difference may be moot.
4) Port the database layer to either PEAR::MDB2 or PDO, maybe both.
5) Teach it to understand a new HTML attribute that tells it whether or not a portion of a page should be indexed.
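To make goal 1 concrete, here is a rough sketch of the sitemap discovery step. It is written in Python rather than PHP purely for illustration, and the function names `sitemaps_from_robots` and `pages_from_sitemap` are made up for this example, not part of Sphider. It assumes the robots.txt and sitemap bodies have already been fetched as strings.

```python
import xml.etree.ElementTree as ET

# Namespace used by sitemap.xml files that follow the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemaps_from_robots(robots_txt):
    """Return the sitemap URLs declared in a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        # Split on the first colon only, because the URL contains colons too.
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


def pages_from_sitemap(sitemap_xml):
    """Return the <loc> value of every <url> entry in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    path = SITEMAP_NS + "url/" + SITEMAP_NS + "loc"
    return [loc.text.strip() for loc in root.findall(path) if loc.text]


ROBOTS = """User-agent: *
Disallow: /admin/
Sitemap: http://example.com/sitemap.xml
"""

SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc></url>
  <url><loc>http://example.com/about.html</loc></url>
</urlset>"""

print(sitemaps_from_robots(ROBOTS))
print(pages_from_sitemap(SITEMAP))
```

The sketch only handles a plain urlset. A sitemap index file (a sitemap that lists other sitemaps) would need one more level of fetching.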
Right now, #5 is done via very Sphider-specific HTML comments.
Code:
<p>This content is indexed</p>
<!--sphider_noindex-->
<p>This content is not indexed</p>
<!--/sphider_noindex-->
<p>This content is again indexed</p>
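For reference, a marker pair like this can be honoured with a simple pattern match. The following is only a guess at the general technique, sketched in Python. It is not Sphider's actual code, and `strip_noindex_comments` is a made-up name.

```python
import re

# Non-greedy match from an opening marker to the nearest closing marker.
# DOTALL lets the match span line breaks.
NOINDEX = re.compile(
    r"<!--sphider_noindex-->.*?<!--/sphider_noindex-->",
    re.DOTALL,
)


def strip_noindex_comments(html):
    """Remove everything between paired sphider_noindex comment markers."""
    return NOINDEX.sub("", html)


PAGE = """<p>This content is indexed</p>
<!--sphider_noindex-->
<p>This content is not indexed</p>
<!--/sphider_noindex-->
<p>This content is again indexed</p>"""

print(strip_noindex_comments(PAGE))
```

With this particular pattern, a page that has the opening marker but no closing marker is returned unchanged, so the content that was meant to be hidden would still be indexed.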
It should be done via an attribute of the node.
Code:
<p>This content is indexed</p>
<p spider="off">This content is not indexed</p>
<p>This content is again indexed</p>
The spider attribute is custom, but I found no official attribute that does what it needs to do.
Adding custom attributes to XHTML is a piece of cake: you can just declare them in the DOCTYPE declaration and the document will validate.
Adding custom attributes to HTML requires a custom DTD for the document to validate.
Right now I'm thinking about just adding it to coreattrs as an attribute that takes a value of either on or off. The default is on unless a parent node has set it to off.
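The inheritance rule can be sketched as a recursive walk that carries the current state down the tree. This is a Python illustration, not the planned PHP code, and `indexable_text` is a made-up name. One assumption goes beyond what is stated above: an explicit spider="on" on a descendant switches indexing back on inside an "off" region.

```python
import xml.etree.ElementTree as ET


def indexable_text(element, inherited="on"):
    """Collect text from element and its descendants where spider is on."""
    # An explicit attribute wins; otherwise the parent's state is inherited.
    # (Letting a child re-enable indexing is an assumption of this sketch.)
    state = element.get("spider", inherited)
    parts = []
    if state == "on" and element.text and element.text.strip():
        parts.append(element.text.strip())
    for child in element:
        parts.extend(indexable_text(child, state))
        # child.tail is text that follows the child but belongs to this element.
        if state == "on" and child.tail and child.tail.strip():
            parts.append(child.tail.strip())
    return parts


DOC = (
    "<div>"
    "<p>This content is indexed</p>"
    '<div spider="off"><p>This content is not indexed</p><p>Neither is this</p></div>'
    "<p>This content is again indexed</p>"
    "</div>"
)

print(indexable_text(ET.fromstring(DOC)))
```

Both paragraphs inside the spider="off" div are skipped without carrying the attribute themselves, which is the "off unless overridden" behaviour described above.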
I'm not sure that spider is the best thing to use as an attribute name, and would be willing to entertain alternative suggestions.
Also, I think a simple binary on/off would be sufficient, but I would of course like to know whether there are other values that should be considered.
I'm hoping that by having a generic attribute name, other search engines can be encouraged to update their indexing code, and maybe eventually the attribute can become part of a W3C standard (probably too late for (X)HTML 5 consideration).
Any thoughts on either of those considerations?
In her right hand Jenny held the Bible of her mother
Jenny had a pistol in the other
-- Steve Taylor
#2 2009-06-08 6:25 pm
- registered_user
Re: Request for comment - extending DTD for search indexing
I wouldn't customize my DTD to use this. I'd use the comments. Indeed, I have used the comments in the past. It's no big thing. Comments are harmless; adding non-standard attributes is asking for a world of hurt in the tag soup.
I don't know why using DOMDocument is better than whatever sphider does. Why is it better?
If sphider doesn't know about sitemaps, and you're generating one, then yes, it's a good idea. Of course, it'd be way easier to transform your sitemap document into HTML and point sphider at it.
I guess the bottom line is that I don't know why you'd even bother with this project. But that hasn't kept me from doing weird stuff in the past, so there's that.
#3 2009-06-26 11:36 am
Re: Request for comment - extending DTD for search indexing
registered_user wrote:
I wouldn't customize my DTD to use this. I'd use the comments. Indeed, I have used the comments in the past. It's no big thing. Comments are harmless; adding non-standard attributes is asking for a world of hurt in the tag soup.
I don't know why using DOMDocument is better than whatever sphider does. Why is it better?
If sphider doesn't know about sitemaps, and you're generating one, then yes, it's a good idea. Of course, it'd be way easier to transform your sitemap document into HTML and point sphider at it.
I guess the bottom line is that I don't know why you'd even bother with this project. But that hasn't kept me from doing weird stuff in the past, so there's that.
The problem with comments is that nothing forces them to be closed.
With an element, leaving it unclosed will throw a validation error.
With an attribute, the element it belongs to defines the scope.
Also, it's much easier to parse the scope instructions with an XML parser if they are expressed as an element or attribute.
By using an XML tool (like DOMDocument) you just delete the nodes that are not to be indexed before indexing the page, with no need to read the file line by line looking for the magic comment.
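As a rough illustration of the delete-then-index idea, here is a sketch using Python's xml.dom.minidom as a stand-in for PHP's DOMDocument. The function name `remove_unindexed_nodes` is made up, and the sketch assumes the page is well-formed XHTML and that the root element itself is not marked off.

```python
from xml.dom import minidom


def remove_unindexed_nodes(xhtml):
    """Parse xhtml, drop every element with spider="off", return the rest."""
    doc = minidom.parseString(xhtml)
    # Take a plain list first so removals do not disturb the iteration.
    for node in list(doc.getElementsByTagName("*")):
        if node is doc.documentElement:
            continue  # assumption: the root element is never switched off
        if node.getAttribute("spider") == "off" and node.parentNode is not None:
            # Removing the element removes its whole subtree with it.
            node.parentNode.removeChild(node)
    return doc.documentElement.toxml()


PAGE = (
    "<div>"
    "<p>This content is indexed</p>"
    '<p spider="off">This content is not indexed</p>'
    "<p>This content is again indexed</p>"
    "</div>"
)

print(remove_unindexed_nodes(PAGE))
```

Because the attribute's scope is the element it sits on, there is no closing marker to forget. Note that this version treats "off" as final for the whole subtree, so a nested spider="on" would be removed along with its parent.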
