I need a script that takes an HTML file as input (in any of many possible encodings: UTF, ASCII, etc., in English and non-English languages) and outputs only the text in ASCII encoding (respecting characters such as üöà...). It is part of a small web crawler that collects x number of texts for linguistic analysis.
The main focus is encoding.
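To make the requirement concrete, here is a minimal Python sketch of the decode-then-strip step. It is an illustration, not part of the original posting. The names `decode_html` and `html_to_text` are hypothetical, and the fallback order (byte order mark, then `<meta>` charset, then UTF-8, then ISO-8859-1) is an assumption. Plain ASCII cannot hold characters such as ü, so the sketch returns a Unicode string and assumes the caller writes it out in an encoding like UTF-8.

```python
import codecs
import re
from html.parser import HTMLParser

# Byte order marks are checked first because they are unambiguous.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Matches both <meta charset="..."> and the http-equiv Content-Type form.
_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_\-]+)""", re.IGNORECASE
)


def decode_html(raw: bytes) -> str:
    """Decode raw HTML bytes to str (hypothetical helper, assumed fallback order)."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return raw.decode(name, errors="replace")
    candidates = []
    match = _META_CHARSET.search(raw[:4096])
    if match:
        candidates.append(match.group(1).decode("ascii"))
    candidates.append("utf-8")
    for name in candidates:
        try:
            return raw.decode(name)
        except (LookupError, UnicodeDecodeError):
            continue  # unknown codec name or wrong guess: try the next one
    # ISO-8859-1 maps every byte to a character, so this cannot fail.
    return raw.decode("iso-8859-1")


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    _SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(raw: bytes) -> str:
    """Return the visible text of an HTML document as a Unicode string."""
    parser = _TextExtractor()
    parser.feed(decode_html(raw))
    parser.close()
    # Text nodes are joined with spaces and whitespace runs are collapsed.
    return " ".join(" ".join(parser.parts).split())
```

A caller could then write the result with `open(path, "w", encoding="utf-8")`. Pages that declare no charset and are not valid UTF-8 fall through to ISO-8859-1, which is only a guess; a real crawler would probably also look at the HTTP `Content-Type` header.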
6 freelancers are bidding an average of $73 for this job
I have the code base ready. I do lots of non-English language crawling on sites that use all sorts of character encodings (from iso-8859-1 to utf-8) and have managed to solve this problem for good.
I have read your requirements carefully. I have 2 years of experience in Perl scraping. I have experience scraping well-known websites, for example Expedia, Orbitz, Travelocity etc ......