November 28th, 2011, 04:38 PM
Solr errors when indexing custom file extensions.
I am working on my company's website and need to be able to index the web pages with Solr. The site is configured to treat .ak files as .cfm files, but Solr errors out when trying to index them.
While testing, I found that if I remove the <head> tags from the documents, there are no errors. I've looked through the Solr config files for a place to tell Solr that .ak files should be parsed as .cfm files, but I have been unable to find such a setting. Does one exist? Is there maybe another way to resolve this issue?
Thanks for your help,
November 28th, 2011, 07:51 PM
Well, if I follow you correctly, I'm not sure this would do what you think it will. When you point Solr at a directory, it indexes the file content. Solr has no idea what "ColdFusion" means; it just parses the raw text of the files. That probably isn't going to do much good if your CF templates are actually showing dynamic data at runtime.
Consider a CF template named product.cfm. At runtime you might pass a URL variable like product.cfm?id=20, which would show the information for the product with the ID of 20. But when Solr indexes product.cfm, it has no idea about product IDs or anything else; it's just going to index the actual text in the product.cfm file.
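To make the product.cfm point concrete, here is a rough Python sketch. The template source, the rendered page, and the `words` helper are all made up for illustration, and the tokenizer is only a crude stand-in for what an indexer does, not Solr's actual analysis chain.

```python
import re

# Hypothetical source of product.cfm as it sits on disk (CFML is not evaluated).
TEMPLATE_SOURCE = """<cfquery name="p" datasource="shop">
  SELECT name FROM products WHERE id = <cfqueryparam value="#url.id#" cfsqltype="cf_sql_integer">
</cfquery>
<h1><cfoutput>#p.name#</cfoutput></h1>"""

# What a browser might receive for product.cfm?id=20 (made-up data).
RENDERED_PAGE = "<h1>Blue Widget</h1>"


def words(text):
    """Return lowercase word tokens after stripping anything tag-like."""
    without_tags = re.sub(r"<[^>]*>", " ", text.lower())
    return set(re.findall(r"[a-z]+", without_tags))


file_terms = words(TEMPLATE_SOURCE)   # terms found when indexing the file on disk
page_terms = words(RENDERED_PAGE)     # terms found in the rendered output

print("widget" in file_terms)  # the product name never appears in the template source
print("widget" in page_terms)
```

A search for the product name would only match the rendered page, which is why indexing the template files themselves only helps when they carry static text.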
November 29th, 2011, 02:54 PM
Thanks for your response. I understand what you are saying. The pages I am trying to index have some static content placed for the indexing.
I am indexing a directory that has duplicate files with both the .cfm and .ak extensions. If I index just the .cfm files, I have no problems. But when I index the .ak versions, the indexing throws an error and finds 0 files. (The only difference between the files is the filename extension.) This happens when indexing through the Administrator window as well as with cfindex. If I remove the header tags, the indexing returns no errors and indexes the files properly.
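Since the .cfm copies index cleanly, one possible workaround (not something confirmed in this thread) is to copy the .ak files into a separate staging directory under a .cfm name and point the collection at that directory instead. Below is a minimal Python sketch of that copy step; the function name `stage_as_cfm` and the directory names are hypothetical.

```python
import shutil
from pathlib import Path


def stage_as_cfm(source_dir, staging_dir, from_ext=".ak", to_ext=".cfm"):
    """Copy every from_ext file under source_dir into staging_dir, renamed to to_ext.

    The directory layout is preserved. Returns the list of files written.
    """
    source = Path(source_dir)
    staging = Path(staging_dir)
    copied = []
    for path in sorted(source.rglob(f"*{from_ext}")):
        if not path.is_file():
            continue
        # Same relative path, but with the extension swapped.
        target = (staging / path.relative_to(source)).with_suffix(to_ext)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)
        copied.append(target)
    return copied
```

Usage would be something like `stage_as_cfm("wwwroot", "index_staging")` before each index run. The catch is that the indexed keys will carry the .cfm names, so search results would need the extension mapped back to .ak before being shown as links.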
November 29th, 2011, 04:45 PM
Hmm, I'm not sure then, as I haven't needed to dig into the guts of Solr myself. My guess is that since the Solr instance doesn't know how to process that extension, it's treating it in some default way. Maybe as XML, or maybe it is grabbing the content and trying to force it into an XML CDATA block. That would mean anything in the file that would be interpreted as invalid XML could make it blow up.
That said, a quick look at the Solr docs doesn't help much. Once again, the CF engineers have done an amazing job of taking something really complicated and making it easy to use. So my guess is you'll need to pore over the Solr docs or grab one of the Solr books to figure out what Solr config or setting will make it handle that extension the way you want it to. :-/
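To illustrate that guess (and it is only a guess about what Solr is doing here), this Python sketch shows how ordinary HTML can fail a strict XML parse. It assumes the <head> section contains an unclosed tag such as <meta>, which is common in HTML but not confirmed for the files in this thread. The sample documents are invented.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(text):
    """Return True if text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical page with a typical HTML head; <meta> is never closed,
# which is legal HTML but not well-formed XML.
WITH_HEAD = """<html>
<head>
<meta charset="utf-8">
<title>Products</title>
</head>
<body><p>Static content for indexing</p></body>
</html>"""

# The same page with the head section removed.
WITHOUT_HEAD = """<html>
<body><p>Static content for indexing</p></body>
</html>"""

print(is_well_formed_xml(WITH_HEAD))
print(is_well_formed_xml(WITHOUT_HEAD))
```

If the .ak files really are being pushed through an XML parser, that would be consistent with the errors going away once the header tags are removed.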