September 13th, 2012, 03:10 PM
HTML to Excel Character Encoding Issues
I am parsing HTML files into Excel. I have thousands of these HTML files, and they all use the Windows-1251 encoding. The issue arises when I try to tell Perl which section of the HTML goes into which column of the Excel sheet. Perl does not recognize the Windows-1251 encoded characters and gives an error. That is a problem, because the markers that tell me which information goes into which column are themselves encoded in Windows-1251. In fact, when I paste the marker text from the HTML into my Perl script, it just shows up as question marks. I am not sure how to handle this. Any ideas? I also have a Perl script that downloads all of these HTML files, so perhaps there is a way to change the encoding of the downloaded files and work from there?
I loop through all the different URLs, and inside the loop I have something like the code below, which writes the data from each URL into an Excel spreadsheet. I have tried use Encode; together with my $html=Encode::decode('Windows-1251', get("$observation_url")); (currently commented out), but I get a wide character error, and I am still not certain how to deal with the Windows-1251 text that shows up as question marks, which I use as the marker to get the second column of data.
my $html = get($observation_url); # the HTML is in Windows-1251
my @data; # initialize an empty array to hold the record
# capture the text between the marker and the closing tags;
# m{...} is used so the slashes in </span></td> need no escaping
my ($value) = $html =~ m{stuff encoded in Windows-1251, showing up as question marks:(.*?)</span></td>};
push @data, $value;
# create a reference to the data array
my $data = \@data;
# write the data
$worksheet->write_row($row, $col, $data);
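One possible approach, offered only as a minimal sketch: decode the Windows-1251 bytes into Perl characters before matching, and save the script as UTF-8 with use utf8; so the marker literal is read as characters too. The sample page content and the marker "Город" below are made-up stand-ins, not taken from the real files. In the real loop, $bytes would be whatever get() returned. The binmode call on STDOUT is what usually avoids the "Wide character" complaint when printing decoded text.

```perl
use strict;
use warnings;
use utf8;                      # this script is saved as UTF-8; literals are characters
use Encode qw(decode encode);

# Hypothetical stand-in for a downloaded page: raw Windows-1251 bytes.
my $bytes = encode('cp1251', '<td><span>Город:Москва</span></td>');

# Decode the bytes into Perl characters before doing any matching.
my $html = decode('cp1251', $bytes);

# "Город" is an invented marker; the real marker text would go here.
my ($value) = $html =~ m{Город:(.*?)</span></td>};

# Encode on output so printing decoded text does not warn.
binmode(STDOUT, ':encoding(UTF-8)');
print "$value\n";
```

With the text decoded this way, $value holds characters rather than raw bytes, so it can be pushed onto @data as before. Whether the spreadsheet module then writes it correctly depends on the module and version in use, which the sketch does not cover.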