
I would like to use the hadoop-streaming functionality with Perl scripts as the mapper and reducer. I found an explanation that partially answers my question, but it does not cover the reducer receiving all values for each key together.

For example, the mapper might extract (product, category) pairs, and the reducer would output the list of categories for each product. This is of course possible by saving all reducer data in memory (as in the example I mentioned above), but in many cases that is not scalable. Is there a way to let the Perl script get all values for each key at once (as in normal map-reduce jobs)?
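For reference, hadoop-streaming sorts the mapper output by key before it reaches the reducer, so all values for one key arrive on contiguous lines and a plain Perl reducer can stream through them without buffering the whole data set. A minimal sketch, assuming tab-separated `product\tcategory` input lines (the helper name `stream_reduce` is my own, not from the question):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hadoop streaming delivers reducer input sorted by key, so all
# values for one key are contiguous. Only the current key's state
# is held in memory, never the whole data set.
sub stream_reduce {
    my @out;
    my ( $current_key, @categories );
    for my $line (@_) {
        chomp( my $l = $line );
        my ( $key, $value ) = split /\t/, $l, 2;
        if ( defined $current_key && $key ne $current_key ) {
            # Key changed: flush the finished key's category list.
            push @out, join( "\t", $current_key, join( ",", @categories ) );
            @categories = ();
        }
        $current_key = $key;
        push @categories, $value;
    }
    # Flush the final key, if any input was seen.
    push @out, join( "\t", $current_key, join( ",", @categories ) )
        if defined $current_key;
    return @out;
}

# Emit one "product<TAB>category1,category2,..." line per product.
print "$_\n" for stream_reduce(<STDIN>);
```

This is the same streaming pattern the CPAN module's value iterator wraps for you; the script above just does the key-change bookkeeping by hand.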

1 Answer


You can use the CPAN library Hadoop::Streaming:

sub reduce {
    my ( $self, $key, $value_iterator ) = @_;
    ...
    while ( $value_iterator->has_next() ) { ... }
    $self->emit( $key, $composite_value );
}

2 Comments

Just to make sure I understand your answer: I should add `use Hadoop::Streaming;` to my script, then put my Perl code in the `reduce` sub (handling the relevant key and values). I assume the default key-value separator is a tab. Is this correct?
Yes, you are correct. Note: the Hadoop::Streaming Perl library must be installed on all tasktracker nodes.
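For completeness, a job submission might look like the following (the streaming jar location and HDFS paths are hypothetical; `-file` ships the scripts to the tasktracker nodes, but the Hadoop::Streaming CPAN module itself must still be installed on each node):

```shell
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
    -input  /user/me/products \
    -output /user/me/categories \
    -mapper  mapper.pl \
    -reducer reducer.pl \
    -file mapper.pl \
    -file reducer.pl
```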

