I have an XML file with this structure (not exactly a tree, though):

<posthistory>
<row Id="1" PostHistoryTypeId="2" PostId="1" 
RevisionGUID="689cb04a-8d2a-4fcb-b125-bce8b7012b88" 
CreationDate="2015-01-27T20:09:32.720" UserId="4" Text="I just got a 
pound of microroasted, local coffee and am curious what the optimal 
way to store it is (what temperature, humidity, etc)" />

I am using Apache Pig to extract just the "Text" part with this code:

grunt> A = load 'hdfs:///parsingdemo/PostHistory.xml' using 
org.apache.pig.piggybank.storage.XMLLoader('posthistory') as(x:chararray);

grunt> result = foreach A generate XPath(x, 'posthistory/Text');

This returns "()" (null).

Upon examining the XML file, I learnt that my XML file should be in this format:

<root>
  <child>
    <subchild>.....</subchild>
  </child>
</root> 

But my XML data file (actually a Stack Overflow data dump) is not in this format. Is there a way to impose the tree structure? What is wrong with my Pig query?

sc3339

2 Answers

This XPath will look for a tag called <Text> inside a tag called <posthistory>:

XPath(x, 'posthistory/Text');

You want to find the Text attribute of the row tag in posthistory tags.

An XPath like this should do that: /posthistory/row/@Text

See example here: http://www.xpathtester.com/xpath/bac9874ec344f9d8ebcfb250633aaf65 and click "Test" to see results set.

Read up on XPath notation for more.
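
To see the element-versus-attribute distinction outside Pig, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. It is only an illustration, not a Pig solution. The sample is a shortened copy of the row from the question, and it assumes a closing `</posthistory>` tag so that it parses as a complete document. ElementTree's limited XPath support cannot return attributes directly, so the attribute is read with `.get()`.

```python
import xml.etree.ElementTree as ET

# Shortened sample modelled on the question's data. The closing tag is an
# assumption added here so that the snippet parses as a whole document.
SAMPLE = """<posthistory>
<row Id="1" PostHistoryTypeId="2" PostId="1" UserId="4" Text="I just got a pound of coffee" />
</posthistory>"""

root = ET.fromstring(SAMPLE)  # root is the <posthistory> element

# Searching for a child *element* named Text finds nothing, which mirrors
# the empty result of the 'posthistory/Text' query.
text_elements = root.findall("Text")

# Text is an *attribute* of <row>, so it is read from each row instead.
texts = [row.get("Text") for row in root.findall("row")]

print(text_elements)
print(texts)
```

The first list comes back empty because no `<Text>` element exists; the second holds the attribute value from each `<row>`.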

Spacedman

Use a regular expression. The following is a generic format:

 foreach A GENERATE FLATTEN(REGEX_EXTRACT_ALL(x,'<child>\\s*<subchild1>(.*)</subchild1>\\s*<subchild2>(.*)</subchild2>\\s*</child>'));
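
For the attribute-based rows in the question, the same idea can be sketched in Python with the standard `re` module. This is an illustration only, not Pig code. The sample row is shortened from the question, and the pattern assumes the attribute value contains no literal double quote (in well-formed XML it would appear escaped as `&quot;`).

```python
import re

# Shortened sample row modelled on the question's data.
ROW = '<row Id="1" PostHistoryTypeId="2" PostId="1" UserId="4" Text="I just got a pound of coffee" />'

# \b stops the pattern from matching inside a longer attribute name;
# [^"]* captures everything up to the closing double quote.
TEXT_ATTR = re.compile(r'\bText="([^"]*)"')

match = TEXT_ATTR.search(ROW)
text = match.group(1) if match else None
print(text)
```

Regex extraction is fragile compared with a real XML parser: entities such as `&amp;` stay undecoded in the captured value, so it suits quick one-off pulls better than general processing.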
TKHN