Desencoding de caracteres HTML en Objective-C / Cocoa Touch

En primer lugar, encontré esto: Objective C HTML escape / unescape , pero no funciona para mí.

Mis caracteres codificados (provienen de una fuente RSS, por cierto) se ven así: &

Busqué en toda la red y encontré discusiones relacionadas, pero no hay ninguna solución para mi encoding particular, creo que se llaman caracteres hexadecimales.

Esas se llaman Referencias de entidades de caracteres . Cuando toman la forma de &#; se llaman referencias de entidad numérica . Básicamente, es una representación de cadena del byte que debe ser sustituido. En el caso de & , representa el carácter con el valor de 38 en el esquema de encoding de caracteres ISO-8859-1, que es & .

La razón por la que se debe codificar el ampersand en RSS es que es un carácter especial reservado.

Lo que necesita hacer es analizar la cadena y reemplazar las entidades con un byte que coincida con el valor entre &# y ; . No conozco ninguna forma excelente de hacer esto en el objective C, pero esta pregunta de desbordamiento de stack podría ser de alguna ayuda.

Editar: Desde que respondí esto hace dos años, hay algunas soluciones geniales; ver la respuesta de @Michael Waterfall a continuación.

Mira mi categoría NSString para HTML . Aquí están los métodos disponibles:

 - (NSString *)stringByConvertingHTMLToPlainText; - (NSString *)stringByDecodingHTMLEntities; - (NSString *)stringByEncodingHTMLEntities; - (NSString *)stringWithNewLinesAsBRs; - (NSString *)stringByRemovingNewLinesAndWhitespace; 

El de Daniel es básicamente muy bueno, y solucioné algunas cuestiones allí:

  1. eliminó el carácter de omisión de NSSCanner (de lo contrario, se ignorarían los espacios entre dos entidades continuas)

    [scanner setCharactersToBeSkipped: nil];

  2. corrigió el análisis sintáctico cuando hay símbolos ‘&’ aislados (no estoy seguro de cuál es el resultado ‘correcto’ para esto, simplemente lo comparé contra Firefox):

p.ej

  &#ABC DF & B' & C' Items (288) 

aquí está el código modificado:

 - (NSString *)stringByDecodingXMLEntities { NSUInteger myLength = [self length]; NSUInteger ampIndex = [self rangeOfString:@"&" options:NSLiteralSearch].location; // Short-circuit if there are no ampersands. if (ampIndex == NSNotFound) { return self; } // Make result string with some extra capacity. NSMutableString *result = [NSMutableString stringWithCapacity:(myLength * 1.25)]; // First iteration doesn't need to scan to & since we did that already, but for code simplicity's sake we'll do it again with the scanner. NSScanner *scanner = [NSScanner scannerWithString:self]; [scanner setCharactersToBeSkipped:nil]; NSCharacterSet *boundaryCharacterSet = [NSCharacterSet characterSetWithCharactersInString:@" \t\n\r;"]; do { // Scan up to the next entity or the end of the string. NSString *nonEntityString; if ([scanner scanUpToString:@"&" intoString:&nonEntityString]) { [result appendString:nonEntityString]; } if ([scanner isAtEnd]) { goto finish; } // Scan either a HTML or numeric character entity reference. if ([scanner scanString:@"&" intoString:NULL]) [result appendString:@"&"]; else if ([scanner scanString:@"'" intoString:NULL]) [result appendString:@"'"]; else if ([scanner scanString:@""" intoString:NULL]) [result appendString:@"\""]; else if ([scanner scanString:@"<" intoString:NULL]) [result appendString:@"<"]; else if ([scanner scanString:@">" intoString:NULL]) [result appendString:@">"]; else if ([scanner scanString:@"&#" intoString:NULL]) { BOOL gotNumber; unsigned charCode; NSString *xForHex = @""; // Is it hex or decimal? if ([scanner scanString:@"x" intoString:&xForHex]) { gotNumber = [scanner scanHexInt:&charCode]; } else { gotNumber = [scanner scanInt:(int*)&charCode]; } if (gotNumber) { [result appendFormat:@"%C", (unichar)charCode]; [scanner scanString:@";" intoString:NULL]; } else { NSString *unknownEntity = @""; [scanner scanUpToCharactersFromSet:boundaryCharacterSet intoString:&unknownEntity]; [result appendFormat:@"&#%@%@", xForHex, unknownEntity]; //[scanner scanUpToString:@";" intoString:&unknownEntity]; //[result appendFormat:@"&#%@%@;", xForHex, unknownEntity]; NSLog(@"Expected numeric character entity but got &#%@%@;", xForHex, unknownEntity); } } else { NSString *amp; [scanner scanString:@"&" intoString:&amp]; //an isolated & symbol [result appendString:amp]; /* NSString *unknownEntity = @""; [scanner scanUpToString:@";" intoString:&unknownEntity]; NSString *semicolon = @""; [scanner scanString:@";" intoString:&semicolon]; [result appendFormat:@"%@%@", unknownEntity, semicolon]; NSLog(@"Unsupported XML character entity %@%@", unknownEntity, semicolon); */ } } while (![scanner isAtEnd]); finish: return result; } 

A partir de iOS 7, puede decodificar caracteres HTML de forma nativa utilizando un NSAttributedString con el atributo NSHTMLTextDocumentType :

 NSString *htmlString = @" & & < > ™ © ♥ ♣ ♠ ♦"; NSData *stringData = [htmlString dataUsingEncoding:NSUTF8StringEncoding]; NSDictionary *options = @{NSDocumentTypeDocumentAttribute:NSHTMLTextDocumentType}; NSAttributedString *decodedString; decodedString = [[NSAttributedString alloc] initWithData:stringData options:options documentAttributes:NULL error:NULL]; 

La cadena atribuida decodificada ahora se mostrará como:  & & <> ™ © ♥ ♣ ♠ ♦.

Nota: Esto solo funcionará si se llama al hilo principal.

Nadie parece mencionar una de las opciones más simples: Google Toolbox para Mac
(A pesar del nombre, esto también funciona en iOS).

https://github.com/google/google-toolbox-for-mac/blob/master/Foundation/GTMNSString%2BHTML.h

 /// Get a string where internal characters that are escaped for HTML are unescaped // /// For example, '&' becomes '&' /// Handles   and 2 cases as well /// // Returns: // Autoreleased NSString // - (NSString *)gtm_stringByUnescapingFromHTML; 

Y tuve que incluir solo tres archivos en el proyecto: encabezado, implementación y GTMDefines.h .

Debería publicar esto en GitHub o algo así. Esto va en una categoría de NSString, usa NSScanner para la implementación y maneja entidades de caracteres numéricos tanto hexadecimales como decimales, así como las simbólicas usuales.

Además, maneja cadenas malformadas (cuando tienes una secuencia de caracteres inválida seguida de una secuencia inválida), lo que resultó ser crucial en mi aplicación lanzada que usa este código.

 - (NSString *)stringByDecodingXMLEntities { NSUInteger myLength = [self length]; NSUInteger ampIndex = [self rangeOfString:@"&" options:NSLiteralSearch].location; // Short-circuit if there are no ampersands. if (ampIndex == NSNotFound) { return self; } // Make result string with some extra capacity. NSMutableString *result = [NSMutableString stringWithCapacity:(myLength * 1.25)]; // First iteration doesn't need to scan to & since we did that already, but for code simplicity's sake we'll do it again with the scanner. NSScanner *scanner = [NSScanner scannerWithString:self]; do { // Scan up to the next entity or the end of the string. NSString *nonEntityString; if ([scanner scanUpToString:@"&" intoString:&nonEntityString]) { [result appendString:nonEntityString]; } if ([scanner isAtEnd]) { goto finish; } // Scan either a HTML or numeric character entity reference. if ([scanner scanString:@"&" intoString:NULL]) [result appendString:@"&"]; else if ([scanner scanString:@"'" intoString:NULL]) [result appendString:@"'"]; else if ([scanner scanString:@""" intoString:NULL]) [result appendString:@"\""]; else if ([scanner scanString:@"<" intoString:NULL]) [result appendString:@"<"]; else if ([scanner scanString:@">" intoString:NULL]) [result appendString:@">"]; else if ([scanner scanString:@"&#" intoString:NULL]) { BOOL gotNumber; unsigned charCode; NSString *xForHex = @""; // Is it hex or decimal? if ([scanner scanString:@"x" intoString:&xForHex]) { gotNumber = [scanner scanHexInt:&charCode]; } else { gotNumber = [scanner scanInt:(int*)&charCode]; } if (gotNumber) { [result appendFormat:@"%C", charCode]; } else { NSString *unknownEntity = @""; [scanner scanUpToString:@";" intoString:&unknownEntity]; [result appendFormat:@"&#%@%@;", xForHex, unknownEntity]; NSLog(@"Expected numeric character entity but got &#%@%@;", xForHex, unknownEntity); } [scanner scanString:@";" intoString:NULL]; } else { NSString *unknownEntity = @""; [scanner scanUpToString:@";" intoString:&unknownEntity]; NSString *semicolon = @""; [scanner scanString:@";" intoString:&semicolon]; [result appendFormat:@"%@%@", unknownEntity, semicolon]; NSLog(@"Unsupported XML character entity %@%@", unknownEntity, semicolon); } } while (![scanner isAtEnd]); finish: return result; } 

Esta es la forma en que lo hago usando el marco RegexKitLite :

 -(NSString*) decodeHtmlUnicodeCharacters: (NSString*) html { NSString* result = [html copy]; NSArray* matches = [result arrayOfCaptureComponentsMatchedByRegex: @"\\&#([\\d]+);"]; if (![matches count]) return result; for (int i=0; i<[matches count]; i++) { NSArray* array = [matches objectAtIndex: i]; NSString* charCode = [array objectAtIndex: 1]; int code = [charCode intValue]; NSString* character = [NSString stringWithFormat:@"%C", code]; result = [result stringByReplacingOccurrencesOfString: [array objectAtIndex: 0] withString: character]; } return result; 

}

Espero que esto ayude a alguien.

puedes usar solo esta función para resolver este problema.

 + (NSString*) decodeHtmlUnicodeCharactersToString:(NSString*)str { NSMutableString* string = [[NSMutableString alloc] initWithString:str]; // #&39; replace with ' NSString* unicodeStr = nil; NSString* replaceStr = nil; int counter = -1; for(int i = 0; i < [string length]; ++i) { unichar char1 = [string characterAtIndex:i]; for (int k = i + 1; k < [string length] - 1; ++k) { unichar char2 = [string characterAtIndex:k]; if (char1 == '&' && char2 == '#' ) { ++counter; unicodeStr = [string substringWithRange:NSMakeRange(i + 2 , 2)]; // read integer value ie, 39 replaceStr = [string substringWithRange:NSMakeRange (i, 5)]; // #&39; [string replaceCharactersInRange: [string rangeOfString:replaceStr] withString:[NSString stringWithFormat:@"%c",[unicodeStr intValue]]]; break; } } } [string autorelease]; if (counter > 1) return [self decodeHtmlUnicodeCharactersToString:string]; else return string; } 

En realidad, el gran framework MWFeedParser de Michael Waterfall (referido a su respuesta) ha sido bifurcado por rmchaara, quien lo ha actualizado con el soporte de ARC.

Puedes encontrarlo en Github aquí

Realmente funciona muy bien, utilicé el método stringByDecodingHTMLEntities y funcionó a la perfección.

Aquí hay una versión Swift de la respuesta de Walty Yeung :

 extension String { static private let mappings = [""" : "\"","&" : "&", "<" : "<", ">" : ">"," " : " ","¡" : "¡","¢" : "¢","£" : " £","¤" : "¤","¥" : "¥","¦" : "¦","§" : "§","¨" : "¨","©" : "©","ª" : " ª","&laquo" : "«","&not" : "¬","&reg" : "®","&macr" : "¯","&deg" : "°","&plusmn" : "±","² " : "²","&sup3" : "³","&acute" : "´","&micro" : "µ","&para" : "¶","&middot" : "·","&cedil" : "¸","&sup1" : "¹","&ordm" : "º","&raquo" : "»&","frac14" : "¼","&frac12" : "½","&frac34" : "¾","&iquest" : "¿","&times" : "×","&divide" : "÷","&ETH" : "Ð","&eth" : "ð","&THORN" : "Þ","&thorn" : "þ","&AElig" : "Æ","&aelig" : "æ","&OElig" : "Œ","&oelig" : "œ","&Aring" : "Å","&Oslash" : "Ø","&Ccedil" : "Ç","&ccedil" : "ç","&szlig" : "ß","Ñ" : "Ñ","ñ":"ñ",] func stringByDecodingXMLEntities() -> String { guard let _ = self.rangeOfString("&", options: [.LiteralSearch]) else { return self } var result = "" let scanner = NSScanner(string: self) scanner.charactersToBeSkipped = nil let boundaryCharacterSet = NSCharacterSet(charactersInString: " \t\n\r;") repeat { var nonEntityString: NSString? = nil if scanner.scanUpToString("&", intoString: &nonEntityString) { if let s = nonEntityString as? String { result.appendContentsOf(s) } } if scanner.atEnd { break } var didBreak = false for (k,v) in String.mappings { if scanner.scanString(k, intoString: nil) { result.appendContentsOf(v) didBreak = true break } } if !didBreak { if scanner.scanString("&#", intoString: nil) { var gotNumber = false var charCodeUInt: UInt32 = 0 var charCodeInt: Int32 = -1 var xForHex: NSString? = nil if scanner.scanString("x", intoString: &xForHex) { gotNumber = scanner.scanHexInt(&charCodeUInt) } else { gotNumber = scanner.scanInt(&charCodeInt) } if gotNumber { let newChar = String(format: "%C", (charCodeInt > -1) ? charCodeInt : charCodeUInt) result.appendContentsOf(newChar) scanner.scanString(";", intoString: nil) } else { var unknownEntity: NSString? = nil scanner.scanUpToCharactersFromSet(boundaryCharacterSet, intoString: &unknownEntity) let h = xForHex ?? "" let u = unknownEntity ?? "" result.appendContentsOf("&#\(h)\(u)") } } else { scanner.scanString("&", intoString: nil) result.appendContentsOf("&") } } } while (!scanner.atEnd) return result } } 

¡Como si necesitaras otra solución! Este es bastante simple y bastante efectivo:

 @interface NSString (NSStringCategory) - (NSString *) stringByReplacingISO8859Codes; @end @implementation NSString (NSStringCategory) - (NSString *) stringByReplacingISO8859Codes { NSString *dataString = self; do { //*** See if string contains &# prefix NSRange range = [dataString rangeOfString: @"&#" options: NSRegularExpressionSearch]; if (range.location == NSNotFound) { break; } //*** Get the next three charaters after the prefix NSString *isoHex = [dataString substringWithRange: NSMakeRange(range.location + 2, 3)]; //*** Create the full code for replacement NSString *isoString = [NSString stringWithFormat: @"&#%@;", isoHex]; //*** Convert to decimal integer unsigned decimal = 0; NSScanner *scanner = [NSScanner scannerWithString: [NSString stringWithFormat: @"0%@", isoHex]]; [scanner scanHexInt: &decimal]; //*** Use decimal code to get unicode character NSString *unicode = [NSString stringWithFormat:@"%C", decimal]; //*** Replace all occurences of this code in the string dataString = [dataString stringByReplacingOccurrencesOfString: isoString withString: unicode]; } while (TRUE); //*** Loop until we hit the NSNotFound return dataString; } @end 

Si tiene la referencia de entidad de caracteres como una cadena, por ej. @"2318" , puede extraer un NSString recodificado con el carácter unicode correcto utilizando strtoul ;

 NSString *unicodePoint = @"2318" unichar iconChar = (unichar) strtoul(unicodePoint.UTF8String, NULL, 16); NSString *recoded = [NSString stringWithFormat:@"%C", iconChar]; NSLog(@"recoded: %@", recoded"); // prints out "recoded: ⌘" 

Swift 3 versión de la respuesta de Jugale

 extension String { static private let mappings = [""" : "\"","&" : "&", "<" : "<", ">" : ">"," " : " ","¡" : "¡","¢" : "¢","£" : " £","¤" : "¤","¥" : "¥","¦" : "¦","§" : "§","¨" : "¨","©" : "©","ª" : " ª","&laquo" : "«","&not" : "¬","&reg" : "®","&macr" : "¯","&deg" : "°","&plusmn" : "±","² " : "²","&sup3" : "³","&acute" : "´","&micro" : "µ","&para" : "¶","&middot" : "·","&cedil" : "¸","&sup1" : "¹","&ordm" : "º","&raquo" : "»&","frac14" : "¼","&frac12" : "½","&frac34" : "¾","&iquest" : "¿","&times" : "×","&divide" : "÷","&ETH" : "Ð","&eth" : "ð","&THORN" : "Þ","&thorn" : "þ","&AElig" : "Æ","&aelig" : "æ","&OElig" : "Œ","&oelig" : "œ","&Aring" : "Å","&Oslash" : "Ø","&Ccedil" : "Ç","&ccedil" : "ç","&szlig" : "ß","Ñ" : "Ñ","ñ":"ñ",] func stringByDecodingXMLEntities() -> String { guard let _ = self.range(of: "&", options: [.literal]) else { return self } var result = "" let scanner = Scanner(string: self) scanner.charactersToBeSkipped = nil let boundaryCharacterSet = CharacterSet(charactersIn: " \t\n\r;") repeat { var nonEntityString: NSString? = nil if scanner.scanUpTo("&", into: &nonEntityString) { if let s = nonEntityString as? String { result.append(s) } } if scanner.isAtEnd { break } var didBreak = false for (k,v) in String.mappings { if scanner.scanString(k, into: nil) { result.append(v) didBreak = true break } } if !didBreak { if scanner.scanString("&#", into: nil) { var gotNumber = false var charCodeUInt: UInt32 = 0 var charCodeInt: Int32 = -1 var xForHex: NSString? = nil if scanner.scanString("x", into: &xForHex) { gotNumber = scanner.scanHexInt32(&charCodeUInt) } else { gotNumber = scanner.scanInt32(&charCodeInt) } if gotNumber { let newChar = String(format: "%C", (charCodeInt > -1) ? charCodeInt : charCodeUInt) result.append(newChar) scanner.scanString(";", into: nil) } else { var unknownEntity: NSString? = nil scanner.scanUpToCharacters(from: boundaryCharacterSet, into: &unknownEntity) let h = xForHex ?? "" let u = unknownEntity ?? "" result.append("&#\(h)\(u)") } } else { scanner.scanString("&", into: nil) result.append("&") } } } while (!scanner.isAtEnd) return result } } 
Intereting Posts